An ever-increasing share of the world's cultural and intellectual output is created in digital formats, much of it publicly available on the World Wide Web. The Library of Congress recognized the cultural heritage value of Web-based materials at an early date and has been actively exploring ways to collect, preserve and provide access to these materials.
Web archiving is the process of collecting documents from the Internet and bringing them under local control so they can be preserved in an archive. Running the Web archiving software is often referred to as "harvesting" or "crawling." Web archivists use a variety of software tools to define the scope of a particular Web archiving activity, and they use Web crawlers to pull the Web resources down to local storage. The terms Web capture, Web harvesting, Web archiving and Web collecting are used roughly interchangeably.
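The two steps described above, defining scope and following links, can be sketched in a few lines of Python. This is a simplified illustration, not the Library's actual tooling: a parser extracts the links from one captured page, and a hypothetical scope rule decides which of those links a crawler should follow.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the <a href> tags of one captured page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

def in_scope(url, seed_host):
    """A simple scope rule: follow only links on the seed site's own host."""
    return urlparse(url).netloc == seed_host

# Example: one page with an on-site link and an off-site link.
page = '<a href="/about.html">About</a> <a href="http://other.example/x">Off-site</a>'
extractor = LinkExtractor("http://example.org/index.html")
extractor.feed(page)
followable = [u for u in extractor.links if in_scope(u, "example.org")]
# followable keeps only the on-site link, http://example.org/about.html
```

Production crawlers apply far richer scope rules (path prefixes, link depth, file types), but the principle is the same: every harvested page yields candidate URLs, and the scope definition filters them.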
Since 2000, the Library of Congress has been capturing Web sites and developing thematic, event-driven Web archives on topics such as the 2000, 2002 and 2004 national elections, the Iraq War and the events of Sept. 11, 2001. These activities were originally done under the auspices of the MINERVA (Mapping the INternet Electronic Resources Virtual Archive) Web Archiving Project and are now administered by the Library's Web Capture team. The Web Capture team and cataloging and public services staff from around the Library are studying methods to evaluate, select, collect, catalog, provide access to and preserve these Web-based materials for future generations of researchers.
In addition to the internal activities of the Web Capture team, the Library works with a diverse set of partners to continue to expand its Web archiving activities beyond its initial work. The Library is a member of the International Internet Preservation Consortium, an organization that collects and preserves a rich body of Internet content from around the world. The Library has also funded two projects directly addressing Web archiving issues: the University of California Digital Library's Web at Risk project concentrates on preserving the nation's political cultural heritage, while the University of Illinois at Urbana-Champaign's ECHO DEPository project is developing a suite of Web resource evaluation tools called the Web Archives Workbench. The Library also works closely with the Internet Archive, collaborating on software tool development and testing the storage, maintenance and access of collected Web content.
In 2005, the Library moved to extend its collecting scope beyond event-based Web captures by partnering with recommending officers, curators and specialists in various Library divisions to conduct a pilot project titled Selecting and Managing Content Captured from the Web (SMCCW).
The SMCCW initiative involved 25 staff members working on a series of distinct Web archiving projects. One project involved the capture of Web resources related to the crisis in Darfur. Another focused on expanding and enhancing visual materials from the Prints and Photographs Division. A third group worked to expand the Manuscript Division collections by harvesting the Web sites of existing Library donors, including civil rights and political advocacy groups; professional and honorary organizations; and research and educational organizations. A final group worked to develop a process for determining which types of Web sites, unrelated to the existing thematic collections, were worthy of archiving.
Based on prior experience, the Web Capture team knew that curators not previously familiar with Web archiving needed a basic introduction to the field before beginning to develop a Web collection plan. For that reason, the Web Capture team provided training through a series of workshops that took place over the course of the entire project.
Although some staff initially expressed skepticism about the research potential of many Web sites, closer examination revealed a wide range of official documents, research studies, audio and video recordings, press releases, agendas and conference proceedings, blogs, electronic newsletters and other sources documenting people, events and activities likely to be of lasting research interest. It was also enlightening to document how organizations, many of which had been established early in the 20th century, were incorporating technology and using the Web to reach new audiences and carry forth their mission into the 21st century.
The Library contracted with the Internet Archive to perform weekly and monthly crawls, beginning in February 2006 and ending in November 2006, using the open source Heritrix crawler (http://crawler.archive.org). Across the four projects, a total of 294 seed URLs were ultimately harvested, with more than 114.5 million objects (HTML files, images, PDFs, etc.) collected.
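The seed-based harvesting workflow that a crawler like Heritrix automates can be illustrated with a minimal breadth-first crawl loop. This is a hypothetical sketch, not Heritrix itself; the fetch and link-extraction functions are injected so the loop can be exercised against an in-memory "site" without network access, whereas a real crawler would fetch over HTTP and write its objects to archive files.

```python
from collections import deque

def crawl(seeds, fetch, extract_links, in_scope, limit=1000):
    """Breadth-first harvest: start from seed URLs, archive each fetched
    object, and enqueue in-scope links that have not been seen yet."""
    seen = set(seeds)
    queue = deque(seeds)
    archive = {}  # url -> fetched content (a real crawler writes archive files)
    while queue and len(archive) < limit:
        url = queue.popleft()
        body = fetch(url)
        if body is None:  # fetch failed; skip this URL
            continue
        archive[url] = body
        for link in extract_links(url, body):
            if in_scope(link) and link not in seen:
                seen.add(link)  # mark before enqueueing, so each URL is fetched once
                queue.append(link)
    return archive

# Exercise the loop against a tiny in-memory "Web site" (hypothetical URLs).
site = {
    "http://s.example/": "home page",
    "http://s.example/a": "page a",
    "http://s.example/b": "page b",
}
links = {
    "http://s.example/": ["http://s.example/a", "http://s.example/b"],
    "http://s.example/a": ["http://s.example/b"],
    "http://s.example/b": [],
}
result = crawl(
    seeds=["http://s.example/"],
    fetch=site.get,
    extract_links=lambda url, body: links[url],
    in_scope=lambda url: url.startswith("http://s.example/"),
)
# result archives all three pages, each fetched exactly once
```

The `limit` parameter stands in for the budget controls real crawls need: without a cap on objects (or on link depth), a crawl of 294 seeds can easily balloon into the hundreds of millions of objects the SMCCW harvests actually collected.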
The final training workshop on the management of Web content digital collections (nicknamed "So We Got It—Now What?") provided an opportunity to evaluate lessons learned and to begin discussing the issues related to managing the archived materials over the long term.
The SMCCW project helped widen expertise and understanding throughout the Library of how to select and manage content captured from the Web. Together with the external partnership efforts, it illustrates how the Library continues to improve its collection and preservation of valuable cultural heritage Web resources for the benefit of future generations.
More information about the Library's Web capture activities can be found at www.loc.gov/webcapture/index.html.