Web Sites and Pages >> Quality and Functionality Factors
Table of Contents
• Macro archiving
• Micro archiving
• Out of scope
• Normal rendering for archived Web sites
• Functionality: documentation of harvesting context
• Functionality: efficiency at scale
• Functionality: support for stewardship
The formats discussed here are those that might hold the results of a crawl of a Web site or set of Web sites, a dynamic action resulting from the use of a software package (e.g., Heritrix) that calls up Web pages and captures them in the form disseminated to users.
The goal for a Web archiving activity is typically to collect Web pages, each with such embedded resources as images, sounds, and the like, in as complete a manner as possible, and to capture the link structure in a way that allows the researcher to identify what was linked to and, if the linked resource has also been captured, to follow the link to it. The focus of a Web archiving activity may be guided by the concept of a Web site. The terms Web page and Web site must be understood in a flexible manner. A useful definition for page is provided in Web Archive Metrics: Definitions and Framework (draft, December 2005), prepared by the Library of Congress Web Capture team for the International Internet Preservation Consortium (IIPC): "a page is a set of one or more Web resources expected to be rendered simultaneously, which can be identified by the URI of the item that embeds the other resources in the set." The same document suggests the following definition for site: "an intellectually related set of resources often (but not always) bounded by technical division, such as content from a domain, which may include several related domains, or a subset of content from a host." In practice, the boundaries for a Web site are often hard to define.
For consideration of functionality required for the digital formats used for the captured Web sites, it is useful to provide examples of scenarios and categorizations that have been used to describe Web archiving activities. In Archiving Websites: General Considerations and Strategies, Niels Brügger distinguishes between micro and macro archiving. 
Macro archiving is usually of the open "surface" Web and includes the intent of supporting study of the Web itself, including its link structure and changes or trends over time. This is not simply a matter of archiving the content of selected Web pages and preserving the ability to follow links when the linked pages have also been harvested.
The need for efficiency at scale (for both capture and subsequent processing) is likely to dominate other functionality factors. The ability to reproduce dynamic elements of a Web-based presentation may be considered less significant.
Other functionality factors that may be significant for macro archiving include: the ability to combine and de-duplicate the results of crawls at different times or by different institutions (e.g., different national libraries); the ability to extract subsets; support for very efficient indexing for access by URL and chronology for simulating the original Web experience; support for indexing the full text of pages; and the retention of the original URLs for harvested content and links in order to relate pages and other content objects and to analyze link structures.
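As an illustration of why retaining original URLs matters for link analysis, the following Python sketch classifies each harvested page's links by whether the target was also captured, and counts inbound links. The record layout, the example URLs, and the function names are hypothetical; they are not drawn from ARC, WARC, or any harvesting tool.

```python
from collections import defaultdict

# Hypothetical harvest: original URL of each captured page -> original URLs it links to.
captures = {
    "http://example.org/": ["http://example.org/about", "http://example.net/"],
    "http://example.org/about": ["http://example.org/"],
}


def classify_links(captures):
    """Split each page's links into targets that were captured and targets that were not."""
    resolved = defaultdict(list)
    missing = defaultdict(list)
    for source, targets in captures.items():
        for target in targets:
            bucket = resolved if target in captures else missing
            bucket[source].append(target)
    return dict(resolved), dict(missing)


def inlink_counts(captures):
    """Count how many captured pages link to each original URL."""
    counts = defaultdict(int)
    for targets in captures.values():
        for target in targets:
            counts[target] += 1
    return dict(counts)
```

Because the links are stored under their original URLs rather than rewritten to archive locations, the same records can serve both replay (following a link when the target was captured) and analysis of the link structure.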
At this writing, most macro archiving activities use one of two related formats designed for Web archiving at scale: ARC and WARC. The former was developed by the Internet Archive to support its work; WARC is a refined and extended format that is based on ARC and was approved in May 2009 as ISO 28500:2009, Information and documentation -- WARC file format.
Niels Brügger's Archiving Websites focuses on micro archiving. In addition to discussion of harvesting, he highlights the challenges of "archiving the dynamics of the Internet," including not only the dynamics of updating, but also the experience of dynamic elements embedded in pages, some of which may rely on human interaction. Brügger's team tested nine programs for capturing complete individual Web sites. Results are reported at http://cfi.au.dk/fileadmin/www.cfi.au.dk/publikationer/archiving_underside/archiving.pdf.
Brügger's team also tested software for recording screen shots or interactive Web browsing activity. Since the resulting content objects are still images or video, the sustainability of the capture formats used for such recording is dealt with elsewhere on this site. This type of capture is very labor-intensive.
Out of scope
Normal rendering for archived Web sites
Functionality: documentation of harvesting context
Functionality: efficiency at scale
Since simulation of the original Web experience in terms of following links found in pages is a part of normal rendering, the format must permit efficient indexing by original URL and the date and time of harvesting.
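The indexing requirement can be sketched in Python as a lookup keyed by original URL, with the captures for each URL kept sorted by harvest time so that the capture closest to a requested date can be found by binary search. This is a minimal sketch under stated assumptions: the class name, the method names, and the (file name, byte offset) location tuples are hypothetical and do not describe the index of any particular access tool.

```python
import bisect
from collections import defaultdict
from datetime import datetime


class CaptureIndex:
    """Hypothetical in-memory index: original URL -> captures sorted by harvest time."""

    def __init__(self):
        self._by_url = defaultdict(list)  # url -> sorted list of (harvested_at, location)

    def add(self, url, harvested_at, location):
        # insort keeps each URL's list ordered by harvest time.
        bisect.insort(self._by_url[url], (harvested_at, location))

    def closest(self, url, wanted):
        """Return the (harvested_at, location) entry nearest in time to `wanted`, or None."""
        entries = self._by_url.get(url)
        if not entries:
            return None
        # Binary search for the insertion point, then compare the neighbours on either side.
        i = bisect.bisect_left(entries, (wanted,))
        candidates = entries[max(i - 1, 0):i + 1]
        return min(candidates, key=lambda entry: abs(entry[0] - wanted))


index = CaptureIndex()
index.add("http://example.org/", datetime(2006, 1, 1), ("crawl-a.warc", 0))
index.add("http://example.org/", datetime(2006, 2, 1), ("crawl-b.warc", 512))
nearest = index.closest("http://example.org/", datetime(2006, 1, 20))
```

A production index would be held on disk rather than in memory, but the two keys are the same ones the text names: the original URL and the date and time of harvesting.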
Web sites may be crawled periodically, e.g., once a week or once a month. In many instances, much of the content will be unchanged from the previous crawl. At this time, there are few effective tools for the elimination of duplicate content. Nevertheless, the possibility of avoiding duplication in the future has led specialists in the field to define an action ("duplicate detection event") and to establish a related requirement, i.e., that archiving formats be capable of storing relevant metadata that can point to the single retained copy of the data in another location, e.g., the dataset from a preceding crawl.
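A duplicate detection event of the kind described above might be modeled as follows. The Python sketch compares a digest of each newly harvested payload with digests already seen for the same URL; when they match, it stores only metadata that refers back to the earlier capture. The use of SHA-1, the record fields, and the function name are assumptions made for illustration, not the record layout of ARC or WARC.

```python
import hashlib


def record_capture(seen, url, harvested_at, payload):
    """Return a record for this capture, storing the payload only if it is new.

    `seen` maps (url, digest) to the first capture that stored that payload.
    """
    digest = hashlib.sha1(payload).hexdigest()
    key = (url, digest)
    if key in seen:
        # Duplicate detection event: keep metadata that points to the earlier copy.
        return {
            "type": "duplicate",
            "url": url,
            "harvested_at": harvested_at,
            "digest": digest,
            "refers_to": seen[key],
        }
    seen[key] = {"url": url, "harvested_at": harvested_at}
    return {
        "type": "content",
        "url": url,
        "harvested_at": harvested_at,
        "digest": digest,
        "payload": payload,
    }


seen = {}
first = record_capture(seen, "http://example.org/", "2006-01-01", b"<html>same</html>")
second = record_capture(seen, "http://example.org/", "2006-01-08", b"<html>same</html>")
```

In this sketch the second record carries no payload, only the date of the earlier capture that holds it, which is the pointer the requirement calls for.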
Functionality: support for stewardship
In July 2004, the Danish Royal Library, as part of planning for the Danish national Web archiving program (Netarkivet), produced the report Archive Format and Metadata Requirements. This report recommended extending ARC to allow richer metadata, rather than using an XML-based structure or creating a new format from scratch. This report was influential in the development of WARC.
The ability to record metadata about harvested resources based on analysis of the harvested content can support preservation activities and enhance access. For example, documenting the character encoding, or recording whether a harvested file is technically valid, might support future preservation activities. Access for researchers could be enhanced by assigning topical subject terms based on textual analysis. The May 2006 report Use Cases for Access to Internet Archives identifies the need for capabilities to extract a subset of a Web archive for a researcher to use for specialized analysis. The ability to assign terms could support subset generation.
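The relationship between recorded metadata and subset generation can be shown with a short Python sketch. Each harvested resource carries the kinds of metadata mentioned above (character encoding, validity, assigned subject terms), and a subset is produced by filtering on a term. The class, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ResourceRecord:
    """Hypothetical metadata recorded after analysis of a harvested resource."""
    url: str
    charset: Optional[str] = None      # documented character encoding, if detected
    is_valid: Optional[bool] = None    # result of a format validity check, if run
    subjects: set = field(default_factory=set)  # topical terms from textual analysis


def extract_subset(records, subject):
    """Return the records that were assigned the given subject term."""
    return [record for record in records if subject in record.subjects]


records = [
    ResourceRecord("http://example.org/a", "utf-8", True, {"elections"}),
    ResourceRecord("http://example.org/b", "iso-8859-1", False, {"sports"}),
    ResourceRecord("http://example.org/c", "utf-8", True, {"elections", "media"}),
]
election_pages = extract_subset(records, "elections")
```

The same records could also be filtered on `is_valid` or `charset` to find resources that may need attention from a preservation standpoint.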
The resources harvested from the Web may be in a wide variety of digital formats, some widely used and others relatively obscure. In the future, the format used for some resource (for example an embedded image) may no longer be supported by browsers. It may be appropriate for custodial institutions to transform such images into a supported format and store the transformed images as part of the Web archive.
1 It is worth noting that the U.S. Copyright Office, part of the Library of Congress, does sometimes receive file sets for Web sites as a part of the creator's copyright registration and deposit process. Although policies have not been established, the writers of this document do not anticipate that these file sets will be selected for the Library's permanent collections. In contrast, harvested Web sites are being added to the collections today; see http://www.loc.gov/webarchiving/.
1. Brügger, Niels. Archiving Websites: General Considerations and Strategies. Aarhus, Denmark: The Centre for Internet Research, 2005. http://cfi.au.dk/fileadmin/www.cfi.au.dk/publikationer/archiving_underside/archiving.pdf
2. Boyko, Andrew and Michael Ashenfelder. Web Archive Metrics: Definitions and Framework (Working draft - IIPC internal review). Washington DC, USA: Library of Congress, October 2005.
5. Christensen, Steen Sloth. Archive Format and Metadata Requirements. Copenhagen, Denmark: Royal Library of Denmark, July 2004. http://netarchive.dk/publikationer/Archival_format_requirements-2004.pdf
6. IIPC. Use Cases for Access to Internet Archives. May 2006. http://netpreserve.org/publications/iipc-r-003.pdf