More than 100 million Web pages from President Bush's second term will be preserved for historians, researchers and the public, thanks to a joint effort of government agencies and nonprofit libraries announced on Thursday.
The Library of Congress and Government Printing Office, in partnership with the California Digital Library, University of North Texas Libraries and Internet Archive, will harvest and archive all Web sites that could change under a new presidential administration. The total amount of data in the collection, which will focus on executive and legislative branch sites, is expected to reach 10 to 12 terabytes.
"These sites either change quickly - immediately following election - or change closer to the swearing in," said Kris Carpenter, director of the Web group at the nonprofit Internet Archive. "We want to preserve the most important information for future researchers."
For example, committees that are made up of presidential appointees and elected officials change with a new administration, so there's a need to preserve information about their members, areas of responsibility, policies and accomplishments. Some changes are significant and others are more subtle, Carpenter said, "but they all can be very telling for researchers looking back and asking, 'How did this influence specific actions of the current administration, and the administrations that followed?'"
Beyond content, researchers will be able to analyze how information was positioned on a Web page, what was placed alongside it, and the significance that could have had in the communication of the overall message.
The Library of Congress will focus on preservation of congressional Web sites, and the Internet Archive will conduct a comprehensive "crawl" of the .gov domain, essentially taking snapshots of all pertinent sites. The University of North Texas Libraries and the California Digital Library will each conduct more in-depth crawls of specific government agencies, and the Government Printing Office will offer advice on the preservation process. Automated tools will assist in collection, though an inventory will be taken manually to ensure no information is missed.
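At its core, the kind of crawl described above follows links from page to page while keeping the harvest scoped to government sites. A minimal sketch of that scoping step, in Python's standard library (the function and variable names here are illustrative, not part of any of the projects' actual tooling):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects href targets from anchor tags in a fetched page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_gov_links(page_url, html):
    """Return absolute links from `html` that stay inside the .gov domain,
    the scoping rule a .gov-wide snapshot crawl would apply before
    queueing the next pages to fetch."""
    parser = LinkExtractor()
    parser.feed(html)
    gov_links = []
    for href in parser.links:
        absolute = urljoin(page_url, href)          # resolve relative links
        host = urlparse(absolute).hostname or ""
        if host == "gov" or host.endswith(".gov"):  # keep only .gov hosts
            gov_links.append(absolute)
    return gov_links

html = '<a href="/budget">Budget</a> <a href="http://example.com/">Off-site</a>'
print(extract_gov_links("http://www.whitehouse.gov/", html))
# → ['http://www.whitehouse.gov/budget']
```

A production crawler adds politeness delays, robots.txt handling and deduplication on top of this filtering loop, but the domain-scoping decision is the step that keeps a ".gov snapshot" from wandering off into the wider Web.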
"We're using technologies and processes that allow us to render the materials as they're presented to the user," Carpenter said. "This is critically important - we're not modifying them in any way." Once the project is complete, researchers and the public will be able to navigate the archived pages the same way they do other Web destinations: by typing the address, viewing the page and browsing through materials. The content will be indexed to enable full text searches.
Similar projects took place in 2000 and 2004, to document the Web pages of President Clinton's first term, and the first half of the Bush administration. The 2004 end-of-term collection has about 75 million addresses for Internet resources, known as Uniform Resource Identifiers, or URIs.
This project is larger in scope, though, in part because records have grown bigger. In 2004, the average government Web record was seven times larger than the average .com record. And the records have likely grown more over the past four years, given increases in the number of data-rich files such as images, .pdf documents and videos. The Internet Archive conducts monthly harvests of several federal .gov sites and has seen a 15 percent increase in collection size in the past two years alone.