The Chesapeake Project Legal Information Archive has completed its third annual analysis of link rot among the original URLs for law- and policy-related materials published to the Web and archived through the Chesapeake Project. The Chesapeake Project was launched in 2007 by the Georgetown University Law Library and the State Law Libraries of Maryland and Virginia as a collaborative digital archive for the preservation of important Web-published legal materials, which often disappear as Web site content is rearranged or deleted over time.
In the three years since the archive was launched, the Chesapeake Project law libraries have built a collection comprising more than 5,700 digital items and 2,300 titles, all of which were originally posted to the Web.
For this study, the term "link rot" is used to describe a URL that no longer provides direct access to files matching the content originally harvested from the URL and currently preserved in the Chesapeake Project’s digital archive. In some instances, a 404 or "not found" message indicates link rot at a URL; in others, the URL may direct to a site hosted by the original publishing organization or entity, but the specific resource has been removed or relocated from the original or previous URL.
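The report does not describe how each URL was checked, so the following is only a minimal sketch of how the definition above might be automated. It assumes that a SHA-256 digest of each harvested file is on record (a hypothetical input, as is the function name `has_link_rot`) and that the caller has already requested the original URL. Any response other than 200, or any byte-level difference from the archived copy, is treated as link rot. Exact byte matching is stricter than the study's definition, since a file reissued with trivial changes would also be flagged.

```python
import hashlib
from typing import Optional


def has_link_rot(status_code: int, body: Optional[bytes], archived_sha256: str) -> bool:
    """Return True if the original URL no longer serves the archived content.

    status_code and body describe the response from the original URL;
    archived_sha256 is the hex digest of the file preserved in the archive
    (a hypothetical record, assumed for this sketch).
    """
    if status_code != 200 or body is None:
        # Covers a 404 "not found" message and any other failed request.
        return True
    # The URL still answers, but it may now serve a different resource.
    return hashlib.sha256(body).hexdigest() != archived_sha256
```

The second branch corresponds to the case in which the publisher's site is still online but the specific resource has been removed or relocated.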
All of the Web resources described in this report that have disappeared from their original locations on the Web remain accessible via permanent archive URLs here at legalinfoarchive.org, thanks to the Chesapeake Project's efforts.
The Chesapeake Project conducted its first link rot assessment at the project’s one-year mark in 2008 as part of its first-year evaluation. During the project’s first year, 1,266 born-digital online titles were harvested from the Web and preserved within the digital archive. A random sample of 579 titles was selected for the link rot study, ensuring results at a 95 percent confidence level and confidence interval of +/- 3. When this sample was first analyzed in March 2008, link rot was found to be present in 48 of 579 URLs.
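The sample size is consistent with the standard formula for estimating a proportion in a finite population. The report does not say which formula or calculator was used, so the sketch below rests on assumptions: z = 1.96 for a 95 percent confidence level, the most conservative proportion p = 0.5, a margin of error of 3 percentage points, and rounding to the nearest whole title.

```python
def sample_size(population: int, margin: float = 0.03,
                z: float = 1.96, p: float = 0.5) -> int:
    """Sample size for estimating a proportion in a finite population."""
    # Sample size for an effectively unlimited population (Cochran's formula).
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # The finite population correction reduces n0 when the population is small.
    n = n0 / (1 + (n0 - 1) / population)
    return round(n)


print(sample_size(1266))  # titles harvested in the project's first year
```

Under these assumptions the function returns 579 for a population of 1,266 titles, which matches the sample drawn for the study.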
One year later, in 2009, the sample was analyzed a second time as part of the project’s second-year evaluation. The second analysis demonstrated that link rot was present in 83 out of the original sample of 579 URLs. Within 12 to 24 months of harvest, 14.3 percent of the archived titles had disappeared from their original URLs, compared to the March 2008 analysis, which had shown link rot among the sample URLs to be 8.3 percent.
The present analysis of the sample showed that by March 2010, the prevalence of link rot had increased to 160 out of 579 URLs. Within two to three years of harvest, link rot among the sample URLs had increased to 27.6 percent, compared to 14.3 percent in 2009 and 8.3 percent in 2008.
In other words, link rot increased from about one in every 12 archived titles in 2008, to one in every seven titles in 2009, and finally to about one in every 3.5 titles in 2010.
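The arithmetic behind these figures can be reproduced from the counts reported for each analysis. In the sketch below the names `SAMPLE_SIZE`, `ROTTED`, and `prevalence` are illustrative, and the percentages are recomputed directly from the counts, so they may differ in the last digit from rounded figures quoted in the text.

```python
SAMPLE_SIZE = 579
# Number of sample URLs with link rot at each March analysis, as reported.
ROTTED = {2008: 48, 2009: 83, 2010: 160}


def prevalence(rotted: int, total: int) -> float:
    """Share of sampled URLs with link rot, as a percentage."""
    return 100 * rotted / total


for year, count in ROTTED.items():
    pct = prevalence(count, SAMPLE_SIZE)
    one_in = SAMPLE_SIZE / count
    print(f"{year}: {pct:.1f}% (about one in every {one_in:.1f} titles)")
```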
The ratio in the sample of URLs with link rot to working URLs, as of March 2008, March 2009, and March 2010, is illustrated below.
More than 90 percent of the URLs in the sample had state-government (state.[state code].us), organization (.org), or government (.gov) top-level domains, which represented approximately 41 percent, 32 percent, and 17 percent of the sample, respectively. The .edu, .com, and .net top-level domains together accounted for approximately 7 percent of the sample, representing 2.9, 2.2, and 1.9 percent, respectively. Less than 3 percent of the sample was represented by a combination of .mil, .us, .info, .uk, .au, .ca, and .int top-level domains. The sample also included one IP address.
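Grouping URLs into these categories is straightforward to script. The sketch below is one possible approach, not the procedure used in the study: the function name `domain_category` and the example URLs are hypothetical, and state-government hosts are recognized only by the `state.[state code].us` pattern named above.

```python
import ipaddress
import re
from urllib.parse import urlsplit

# Matches hosts such as "www.example.state.md.us".
STATE_PATTERN = re.compile(r"(^|\.)state\.[a-z]{2}\.us$")


def domain_category(url: str) -> str:
    """Return the domain category used to group a URL in the tables."""
    host = (urlsplit(url).hostname or "").lower()
    try:
        ipaddress.ip_address(host)
        return "[IP address]"  # the host is a bare IP address, not a name
    except ValueError:
        pass
    if STATE_PATTERN.search(host):
        return ".state.__.us"
    # Otherwise use the last label of the host name, e.g. ".org" or ".gov".
    return "." + host.rsplit(".", 1)[-1]


for url in ("http://www.example.state.md.us/report.pdf",
            "http://www.example.org/policy.pdf",
            "http://192.0.2.10/doc.pdf"):
    print(url, "->", domain_category(url))
```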
In the original 2008 analysis, link rot was present in 10.8 percent of URLs with state top-level domains, 10 percent of URLs with government top-level domains, and 3.8 percent of URLs with organization top-level domains. Although education (.edu) and commercial (.com) URLs represented a much smaller portion of the sample, both top-level domains were found to have relatively high inactivity levels of 11.8 and 15.4 percent in 2008, respectively.
In 2009, the prevalence of link rot increased among URLs with state, government, organization, education, network (.net), and military (.mil) top-level domains. Among URLs in the sample with state top-level domains, link rot increased by 5 percentage points from 2008 to 2009. The 2009 analysis demonstrated a significant increase among URLs with education top-level domains, from 11.8 percent to 35.3 percent over the one-year period, while no increase in link rot was observed among commercial URLs.
The current 2010 analysis of the sample showed link rot to be present in more than 32 percent, nearly one-third, of the URLs with a state-government top-level domain. The prevalence of link rot among these state URLs more than doubled in the year following the 2009 analysis, and it nearly tripled in the two years following the original 2008 analysis of the sample. Link rot was found in more than 22 percent of URLs with an organization top-level domain, nearly double the link rot observed in 2009, and almost six times the link rot found among URLs with an organization top-level domain in the 2008 analysis. Twenty-five percent of government URLs were found to have link rot in 2010, an increase from 13 percent in 2009 and 10 percent in 2008. Commercial and network URLs both experienced a jump in link rot: from about 15 percent in both 2008 and 2009 to nearly 31 percent among .com domains, and from zero in 2008 and approximately 9 percent in 2009 to more than 27 percent among .net domains in 2010. The single IP address and .uk top-level domain in the sample also succumbed to link rot in 2010. It is worth mentioning that these top-level domains represented a small fraction of the sample.
A list of all top-level domains found in the sample, along with link rot detected in 2008, 2009, and 2010, is available in the table below.
|Top-Level Domain|Total in Sample|Link Rot Frequency, 2008|Link Rot Frequency, 2009|Link Rot Frequency, 2010|
|---|---|---|---|---|
|.state.__.us|240|26 (10.8%)|38 (15.8%)|77 (32.1%)|
|.org|184|7 (3.8%)|21 (11.4%)|41 (22.3%)|
|.gov|100|10 (10%)|13 (13%)|25 (25%)|
|.edu|17|2 (11.8%)|6 (35.3%)|6 (35.3%)|
|.com|13|2 (15.4%)|2 (15.4%)|4 (30.8%)|
|.net|11|0|1 (9.1%)|3 (27.3%)|
|.mil|3|0|1 (33.3%)|1 (33.3%)|
|.info|2|1 (50%)|1 (50%)|1 (50%)|
|[IP address]|1|0|0|1 (100%)|
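The year-over-year comparisons for the three largest domain categories can be checked against the counts in the table above. The sketch below is illustrative only (the names `COUNTS` and `growth` are not from the study). Because each category's total is constant across years, the ratio of two counts equals the ratio of the corresponding rates. Categories with a zero count in 2008, such as .net, are left out to avoid dividing by zero.

```python
# (total in sample, link rot 2008, link rot 2009, link rot 2010)
COUNTS = {
    ".state.__.us": (240, 26, 38, 77),
    ".org": (184, 7, 21, 41),
    ".gov": (100, 10, 13, 25),
}


def growth(later: int, earlier: int) -> float:
    """How many times larger the later count is than the earlier one."""
    return later / earlier


for domain, (total, y2008, y2009, y2010) in COUNTS.items():
    rate_2010 = 100 * y2010 / total
    print(f"{domain}: {rate_2010:.1f}% in 2010, "
          f"x{growth(y2010, y2009):.2f} since 2009, "
          f"x{growth(y2010, y2008):.2f} since 2008")
```

For state-government URLs this gives factors of roughly 2.0 and 3.0, in line with "more than doubled" and "nearly tripled"; for organization URLs the factor since 2008 is just under six.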
For the present analysis, a separate sample representing the content in the archive as of the project’s three-year mark was gathered. In the three years since the project began, 2,372 born-digital online titles were harvested from the Web and preserved within the digital archive. A random sample of 736 titles was selected for the link rot study, ensuring results at a 95 percent confidence level and confidence interval of +/- 3.
Out of these 736 titles randomly selected for the 2010 sample, link rot was found to be present in 165 URLs. In other words, 22.4 percent of the original URLs of all titles harvested and archived during the first three years of the Chesapeake Project had succumbed to link rot by March 2010. The ratio of working URLs to those with link rot is illustrated below.
In 2010, 86.8 percent of the top-level domains were state-government (state.[state code].us), organization (.org), and government (.gov) URLs, which represented 34.8 percent, 30.4 percent, and 21.6 percent of the sample, respectively. Of these three top-level domains, link rot was present in 30.5 percent of URLs with state top-level domains, 20.1 percent of URLs with organization top-level domains, and 15.7 percent of URLs with government top-level domains.
URLs with .com and .net top-level domains were found to have inactivity levels of 17.9 and 13.6 percent, respectively, while .edu URLs were found to have a lower inactivity rate of 7.1 percent. A list of all top-level domains found in the 2010 sample, along with their inactivity rates, is available in the table below.
|Top-Level Domain|Total in Sample|Link Rot Frequency, 2010|
|---|---|---|
|[IP address]|2|2 (100%)|
Sarah Rhodes, Digital Collections Librarian
Georgetown University Law Library
111 G St., NW
Washington, DC 20001