
The Chesapeake Digital Preservation Group: "Link Rot" and Legal Resources on the Web, 2012
 

"Link Rot" and Legal Resources on the Web: A 2012 Analysis by the Chesapeake Digital Preservation Group




Contents



Introduction

Data Show Link Rot in 38 Percent of Online Publications within Five Years

Link Rot and Top-Level Domains, 2008-2012

Link Rot in 2012

Link Rot and Top-Level Domains in 2012

Link Rot by Year of Capture

Contact



Introduction


The Chesapeake Digital Preservation Group has completed its fifth annual investigation of link rot among the original URLs for online law- and policy-related materials archived through the group's efforts.


The Chesapeake Digital Preservation Group is a collaborative digital preservation program for legal materials, reports, and documents posted to the web. The group comprises four member libraries: two academic law libraries (the Georgetown Law and Harvard Law School Libraries) and the State Law Libraries of Maryland and Virginia. It is part of the Legal Information Archive.


Access to web-published content can be lost as websites are routinely updated, reorganized, or deleted over time. In the five years since the program began, the Chesapeake Group has built a digital archive collection comprising more than 8,600 digital items and 3,700 titles, almost all originally posted to the web but captured and preserved within the group's digital archive.


Every year, the Chesapeake Group investigates whether or not the documents in the archive can still be found at the original web addresses from which they were captured. The group analyzes two samples of web addresses, or URLs, pulled from the archive's records.
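As a rough illustration of how a first pass over such a sample might be automated, the sketch below requests a URL and flags it when the request fails or returns an HTTP error. This is an assumption made for illustration, not the group's documented procedure. The function names `fetch_status` and `needs_review` are hypothetical. A URL that responds successfully can still lack the archived content, so a status check can only shortlist candidates for manual review.

```python
import http.client
import urllib.error
import urllib.request


def fetch_status(url, timeout=10.0):
    """Return the final HTTP status code for url, or None if no response was obtained.

    Uses a HEAD request to avoid downloading the document body; some servers
    reject HEAD, so a real check might need to fall back to GET.
    """
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses are raised as HTTPError; keep the status code.
        return err.code
    except (OSError, ValueError, http.client.HTTPException):
        # DNS failures, refused connections, timeouts, and malformed URLs.
        return None


def needs_review(status):
    """Flag a URL for manual review when no response or an error status came back."""
    return status is None or status >= 400
```

A status below 400 does not show that the original document is still there, since pages are often replaced or redirected to unrelated content. It only narrows the list that has to be checked by hand.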


The first sample includes 579 original URLs for content captured from 2007-2008. This sample is revisited every year to document link rot and explore how it changes over time.


The other sample is new and represents the full content of the archive at the time the study is conducted. This second sample provides an up-to-date snapshot of link rot among the original URLs for all the content currently in the archive. In 2012, this sample included 830 original URLs for materials captured from 2007-2012.





Data Show Link Rot in 38 Percent of Online Publications within Five Years


In 2012, 218 out of 579 URLs in the sample no longer provide access to the content that was originally selected, captured, and archived by the Chesapeake Group. In other words, link rot has increased to 37.7 percent within five years.


In 2008, the sample was analyzed for the first time as part of an evaluation of the archiving program, and link rot was found to be present in 48, or 8.3 percent, of the 579 URLs comprising the sample. At the time, a total of 1,266 web-based titles had been captured and archived. A random sample of 579 titles from the archive was generated for the analysis, ensuring results at a 95 percent confidence level and confidence interval of +/- 3.
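The report does not say how the sample size was derived. One common approach, Cochran's formula with a finite population correction, reproduces the figure if the +/- 3 is read as percentage points. The sketch below assumes a z-score of 1.96 for the 95 percent confidence level and the most conservative proportion of 0.5. Both values, and the function name `sample_size`, are assumptions rather than details from the study.

```python
def sample_size(population, margin=0.03, z=1.96, proportion=0.5):
    """Estimate the sample size needed for a finite population.

    margin is the half-width of the confidence interval as a fraction
    (0.03 for +/- 3 percentage points); z is the z-score for the
    confidence level (1.96 is the usual value for 95 percent).
    """
    # Cochran's formula for an effectively infinite population.
    n0 = (z ** 2) * proportion * (1 - proportion) / margin ** 2
    # Finite population correction.
    return n0 / (1 + (n0 - 1) / population)


# 1,266 titles had been archived when the first sample was drawn.
print(round(sample_size(1266)))  # 579
```

Under the same assumptions, a population of 3,734 titles (the size of the collection in 2012) gives a sample of about 830.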


One year later, in 2009, the sample was analyzed a second time. Link rot was found to be present in 83 out of the original sample of 579 URLs. Within two years of capture, 14.3 percent of the archived titles had disappeared from their original URLs.


By the third year, in 2010, the prevalence of link rot had increased to 160 out of 579 URLs, a whopping 27.6 percent. Link rot continued to increase in 2011, but at a slower pace, reaching 30.4 percent by the fourth year. The new 2012 data show an increase of 7.3 percentage points compared to 2011, to 37.7 percent, more in line with our findings of annual increases from 2008 and 2009.


Increases in link rot from 2008 through 2012 are illustrated in Figure 1 and Table 1, below.



Figure 1


Link Rot, April 2012




Table 1

| Year | Content Missing | Working URLs | % Link Rot |
| --- | --- | --- | --- |
| 2008 | 48 | 531 | 8.3% |
| 2009 | 83 | 496 | 14.3% |
| 2010 | 160 | 419 | 27.6% |
| 2011 | 176 | 403 | 30.4% |
| 2012 | 218 | 361 | 37.7% |
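The percentages in Table 1 follow directly from the counts of missing content in the fixed sample of 579 URLs. The sketch below recomputes them and the year-over-year change in percentage points. The variable and function names are illustrative only.

```python
SAMPLE_SIZE = 579

# Counts of URLs with missing content, taken from Table 1.
missing_by_year = {2008: 48, 2009: 83, 2010: 160, 2011: 176, 2012: 218}


def link_rot_percent(missing, total=SAMPLE_SIZE):
    """Share of the sample affected by link rot, as a percentage."""
    return 100 * missing / total


rates = {year: link_rot_percent(count) for year, count in missing_by_year.items()}
years = sorted(rates)

# Change from the previous year, in percentage points.
changes = {year: rates[year] - rates[prev] for prev, year in zip(years, years[1:])}

for year in years:
    change = changes.get(year)
    suffix = "" if change is None else f" ({change:+.1f} points)"
    print(f"{year}: {rates[year]:.1f}%{suffix}")

# 2008: 8.3%
# 2009: 14.3% (+6.0 points)
# 2010: 27.6% (+13.3 points)
# 2011: 30.4% (+2.8 points)
# 2012: 37.7% (+7.3 points)
```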








Link Rot and Top-Level Domains, 2008-2012


More than 90 percent of the top-level domains in the sample are state-government (state.[state code].us), organization (.org), and government (.gov) URLs, representing approximately 41 percent, 32 percent, and 17 percent of the sample, respectively. Other top-level domains, comprising approximately 7 percent of the sample, combined, include .edu, .com, and .net, which respectively represent 2.9, 2.2, and 1.9 percent of the sample. Less than 3 percent of the sample consists of .mil, .us, .info, .uk, .au, .ca, and .int top-level domains. The sample also includes one IP address.
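To make the grouping concrete, the sketch below assigns a URL to one of the domain groups used in this analysis: the state-government pattern, a bare IP address, or the final label of the host name. The report does not describe how URLs were classified, so the function `domain_group`, the regular expression, and the example URLs are all assumptions for illustration.

```python
import ipaddress
import re
from urllib.parse import urlsplit

# Hosts ending in state.<two-letter state code>.us (assumed pattern).
STATE_PATTERN = re.compile(r"(^|\.)state\.[a-z]{2}\.us$")


def domain_group(url):
    """Return the domain group for url, using the labels from this report."""
    host = (urlsplit(url).hostname or "").lower()
    if not host:
        return "[unknown]"
    try:
        ipaddress.ip_address(host)
        return "[IP address]"
    except ValueError:
        pass  # Not an IP address; treat it as a host name.
    if STATE_PATTERN.search(host):
        return ".state.__.us"
    # Otherwise use the last label of the host name, such as .org or .gov.
    return "." + host.rsplit(".", 1)[-1]


# Hypothetical URLs, not taken from the archive.
print(domain_group("http://www.example.state.md.us/report.pdf"))  # .state.__.us
print(domain_group("https://www.example.org/policy/brief.html"))  # .org
print(domain_group("http://192.0.2.10/docs/report.pdf"))          # [IP address]
```

The state pattern is checked before the generic rule so that state-government hosts are not counted under the plain .us group.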


In 2012, the content at .org domains showed the highest increase in link rot. More than 43 percent of the materials posted to organization domains disappeared from the original documented web addresses. Link rot on government web pages also increased in 2012: up to 36 percent at .gov domains and nearly 34 percent at .state.[state code].us domains. Education domains also showed an increase to more than 41 percent in 2012 after decreasing in 2011, and network domain link rot rose to more than 36 percent.


A list of all top-level domains found in the sample, along with the link rot detected each year from 2008 through 2012, is available in Table 2.



Table 2

| Top-Level Domain | Total in Sample | Link Rot Frequency 2008 | Link Rot Frequency 2009 | Link Rot Frequency 2010 | Link Rot Frequency 2011 | Link Rot Frequency 2012 |
| --- | --- | --- | --- | --- | --- | --- |
| .state.__.us | 240 | 26 (10.8%) | 38 (15.8%) | 77 (32.1%) | 73 (30.4%) | 81 (33.8%) |
| .org | 184 | 7 (3.8%) | 21 (11.4%) | 41 (22.3%) | 57 (31%) | 80 (43.5%) |
| .gov | 100 | 10 (10%) | 13 (13%) | 25 (25%) | 31 (31%) | 36 (36%) |
| .edu | 17 | 2 (11.8%) | 6 (35.3%) | 6 (35.3%) | 3 (17.6%) | 7 (41.2%) |
| .com | 13 | 2 (15.4%) | 2 (15.4%) | 4 (30.8%) | 4 (30.8%) | 5 (38.5%) |
| .net | 11 | 0 | 1 (9.1%) | 3 (27.3%) | 3 (27.3%) | 4 (36.4%) |
| .mil | 3 | 0 | 1 (33.3%) | 1 (33.3%) | 1 (33.3%) | 1 (33.3%) |
| .us | 3 | 0 | 0 | 0 | 0 | 0 |
| .info | 2 | 1 (50%) | 1 (50%) | 1 (50%) | 2 (100%) | 2 (100%) |
| .uk | 2 | 0 | 0 | 1 (50%) | 1 (50%) | 1 (50%) |
| .au | 1 | 0 | 0 | 0 | 0 | 0 |
| .ca | 1 | 0 | 0 | 0 | 0 | 0 |
| .int | 1 | 0 | 0 | 0 | 0 | 0 |
| [IP address] | 1 | 0 | 0 | 1 (100%) | 1 (100%) | 1 (100%) |
| TOTAL | 579 | 48 (8.3%) | 83 (14.3%) | 160 (27.6%) | 176 (30.4%) | 218 (37.7%) |





Link Rot in 2012: A Snapshot


For the present analysis, a new, separate sample of URLs was generated. In 2012, the collection included 8,627 digital items and 3,734 titles. To ensure statistically relevant results at a 95 percent confidence level and confidence interval of +/- 3, a random sample of 830 titles was selected for the 2012 study. Three of the titles selected for the sample were discarded because they were directly deposited by the content creators and therefore had no original web addresses; as the Chesapeake Group has increased contact with content producers over the years, a small fraction of the content archived is now deposited by the creators for archiving, rather than posted to the web for capture.


Out of the 827 titles in the sample that were captured from the web, link rot was found to be present in 214, nearly 26 percent (25.9%), of the original URLs. The ratio of working URLs to those with link rot for 2012 is illustrated in Figure 2 below, compared to samples studied in 2008, 2009, 2010, and 2011.



Figure 2


Link Rot by Year (2008-2012)





Link Rot and Top-Level Domains in 2012


In 2012, the number of titles in the archive with URLs from organization (.org) top-level domains surpassed those from government (.gov) and state government (state.[state code].us) domains. Roughly 87 percent of the top-level domains in the sample were organization, state-government, and government URLs, which represented 38.1 percent, 26 percent, and 22.7 percent of the sample, respectively. Of these three top-level domains, link rot was present in 25.7 percent of URLs with organization top-level domains, 32.6 percent of URLs with state top-level domains, and 23.9 percent of URLs with government top-level domains.


URLs with .edu, .com, and .net top-level domains, combined, represented 10 percent of the sample, and were found to have inactivity levels of 13, 19.2, and 9.1 percent, respectively. Table 3 provides a comparison of all top-level domains found in the 2012 sample, as well as previous years' samples, along with their inactivity rates.



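Table 3

| Top-Level Domain | 2008 Sample: Total in Sample | 2008 Sample: Link Rot | 2009 Sample: Total in Sample | 2009 Sample: Link Rot | 2010 Sample: Total in Sample | 2010 Sample: Link Rot | 2011 Sample: Total in Sample | 2011 Sample: Link Rot | 2012 Sample: Total in Sample | 2012 Sample: Link Rot |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| .state.__.us | 240 | 26 (10.8%) | 235 | 37 (15.7%) | 256 | 78 (30.5%) | 224 | 57 (25.4%) | 215 | 70 (32.6%) |
| .org | 184 | 7 (3.8%) | 212 | 29 (13.7%) | 224 | 45 (20.1%) | 290 | 45 (15.5%) | 315 | 81 (25.7%) |
| .gov | 100 | 10 (10%) | 155 | 17 (11%) | 159 | 25 (15.7%) | 167 | 32 (19.2%) | 188 | 45 (23.9%) |
| .edu | 17 | 2 (11.8%) | 23 | 6 (26%) | 28 | 2 (7.1%) | 40 | 7 (17.5%) | 46 | 6 (13%) |
| .com | 13 | 2 (15.4%) | 22 | 1 (4.5%) | 28 | 5 (17.9%) | 36 | 6 (16.7%) | 26 | 5 (19.2%) |
| .net | 11 | 0 | 12 | 0 | 22 | 3 (13.6%) | 10 | 3 (30%) | 11 | 1 (9.1%) |
| .mil | 3 | 0 | 4 | 0 | 5 | 2 (40%) | 2 | 1 (50%) | 2 | 1 (50%) |
| .us | 3 | 0 | 5 | 0 | 2 | 0 | 15 | 2 (13.3%) | 5 | 1 (20%) |
| .info | 2 | 1 (50%) | 3 | 2 (66.7%) | 2 | 0 | 5 | 2 (40%) | 3 | 2 (66.7%) |
| .uk | 2 | 0 | 3 | 1 (33.3%) | 3 | 2 (66.7%) | 2 | 1 (50%) | 5 | 1 (20%) |
| .au | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | -- | -- |
| .af | -- | -- | -- | -- | -- | -- | -- | -- | 1 | 1 (100%) |
| .at | -- | -- | -- | -- | -- | -- | 2 | 0 | 1 | 0 |
| .be | -- | -- | -- | -- | -- | -- | -- | -- | 1 | 0 |
| .ca | 1 | 0 | 1 | 0 | -- | -- | 2 | 0 | 1 | 0 |
| .ch | -- | -- | -- | -- | -- | -- | 1 | 0 | -- | -- |
| .int | 1 | 0 | 2 | 0 | 2 | 0 | 3 | 0 | 4 | 0 |
| .eu | -- | -- | 1 | 0 | 2 | 1 (50%) | 3 | 1 (33.3%) | 3 | 0 |
| [IP address] | 1 | 0 | 1 | 0 | 2 | 2 (100%) | -- | -- | -- | -- |
| TOTAL | 579 | 48 (8.3%) | 680 | 93 (13.7%) | 736 | 165 (22.4%) | 803 | 157 (19.6%) | 827 | 214 (25.9%) |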






Link Rot by Year of Capture


For the first time, the Chesapeake Group documented the year of capture for all of the URLs in our 2012 sample. Not surprisingly, the data show that a URL's risk of link rot increases with time: 40 percent of URLs captured in 2007 have succumbed to link rot, while all of the materials captured in the first few months of 2012 remain at their original web addresses. In fact, link rot by year of capture in our 2012 sample was strikingly similar to the increase in link rot documented by our annual findings for our original 2007-2008 sample. See Tables 4 and 5 below for comparison.



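Table 4: Link Rot in 2012 Sample by Year of Capture

| Year of Capture | Total | Link Rot | Working URLs | % Link Rot |
| --- | --- | --- | --- | --- |
| 2007 | 251 | 101 | 150 | 40% |
| 2008 | 130 | 40 | 90 | 31% |
| 2009 | 143 | 37 | 106 | 26% |
| 2010 | 190 | 29 | 161 | 15% |
| 2011 | 93 | 7 | 86 | 8% |
| 2012 | 20 | 0 | 20 | 0% |
| TOTAL | 827 | 214 | 613 | 25.9% |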




Table 5: Annual Increase in Link Rot for 2008 Sample

| Year | Content Missing | Working URLs | % Link Rot |
| --- | --- | --- | --- |
| 2012 | 218 | 361 | 37.7% |
| 2011 | 176 | 403 | 30.4% |
| 2010 | 160 | 419 | 27.6% |
| 2009 | 83 | 496 | 14.3% |
| 2008 | 48 | 531 | 8.3% |
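One way to make the comparison between Tables 4 and 5 explicit is to line the two series up by the approximate age of the URLs. The sketch below assumes that URLs captured in a given year and checked in 2012 can be paired with the original sample as checked the same number of years after 2007. That pairing is an interpretation for illustration, not a method stated in the study, and the names used are hypothetical. URLs captured in 2012 are left out because the original sample has no matching check.

```python
SNAPSHOT_YEAR = 2012        # year the 2012 sample was checked
FIRST_CAPTURE_YEAR = 2007   # assumed starting year for the original sample
ORIGINAL_TOTAL = 579

# Table 4: capture year -> (URLs with link rot, total URLs) in the 2012 sample.
by_capture_year = {
    2007: (101, 251),
    2008: (40, 130),
    2009: (37, 143),
    2010: (29, 190),
    2011: (7, 93),
}

# Table 5: year checked -> URLs with missing content in the original sample.
original_sample = {2008: 48, 2009: 83, 2010: 160, 2011: 176, 2012: 218}


def compare_by_age():
    """Pair each capture-year cohort with the original sample at the same age."""
    rows = []
    for capture_year, (rot, total) in sorted(by_capture_year.items()):
        age = SNAPSHOT_YEAR - capture_year
        check_year = FIRST_CAPTURE_YEAR + age
        snapshot_pct = 100 * rot / total
        original_pct = 100 * original_sample[check_year] / ORIGINAL_TOTAL
        rows.append((age, capture_year, snapshot_pct, check_year, original_pct))
    return rows


for age, capture_year, snapshot_pct, check_year, original_pct in compare_by_age():
    print(
        f"age {age}: captured {capture_year} = {snapshot_pct:.1f}%, "
        f"original sample in {check_year} = {original_pct:.1f}%"
    )
```

Under this pairing, the two series differ by less than 3 percentage points at every age.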





Contact


Sarah Rhodes, Digital Collections Librarian
Georgetown University Law Library
111 G St., NW
Washington, DC 20001
Phone: 202-662-4065



