June 4, 2009 -- The WARC file format is now approved as an international standard: ISO 28500:2009.
For years, heritage organizations have tried to find the most appropriate ways to collect and monitor World Wide Web material using web-scale tools. At the same time, these organizations were concerned with the requirements for archiving large numbers of born-digital and digitized files. They needed a container format that enabled one file to carry a large and varied number of data objects for storage, management and exchange.
The WARC format meets this need and is expected to be a standard way to structure, manage and store billions of resources collected from the web and elsewhere. It is an extension of the ARC format, which has been used since 1996 to store files harvested on the web. The WARC format offers new possibilities, notably the recording of HTTP request headers, the recording of arbitrary metadata, the allocation of an identifier for every contained file, the management of duplicates and of migrated records, and the segmentation of records. WARC files are intended to store every type of digital content, whether retrieved by HTTP or by another protocol.
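As a rough illustration of the features listed above, the following Python sketch serializes a single WARC/1.0 record: a version line, named header fields (including a unique record identifier), a blank line, the content block, and two trailing CRLF pairs. The function name build_warc_record and the example URI are assumptions made for this illustration; they are not part of the standard or of any existing tool, and the sketch omits optional fields, compression and record segmentation.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(record_type, target_uri, content_type, block):
    """Serialize one WARC/1.0 record as bytes (hypothetical helper, minimal fields only)."""
    headers = [
        ("WARC-Type", record_type),
        # Every record carries its own identifier; a UUID URN is one common choice.
        ("WARC-Record-ID", f"<{uuid.uuid4().urn}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        # Content-Length counts the bytes of the content block only.
        ("Content-Length", str(len(block))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # A record ends with two CRLF pairs after the content block.
    return head.encode("utf-8") + block + b"\r\n\r\n"


# Example: a "request" record preserving the HTTP request headers sent by a crawler.
http_request = b"GET / HTTP/1.1\r\nHost: example.org\r\n\r\n"
record = build_warc_record(
    "request",
    "http://example.org/",  # assumed example URI
    "application/http; msgtype=request",
    http_request,
)
```

A WARC file is a concatenation of such records, so further record types (for example "response" or "metadata") could be appended to the same file in the same way.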
The motivation to extend the ARC format arose from the discussion and experiences of the International Internet Preservation Consortium, whose core mission is to acquire, preserve and make accessible knowledge and information from the Internet for future generations. IIPC formed a Standards Working Group to develop a document for the International Organization for Standardization to approve.
Over a period of four years, the working group, with the Bibliothèque nationale de France as convener, collaborated closely with IIPC experts to improve the original draft. The group will continue to maintain the standard and prepare its future revision.
Standardization offers a guarantee of durability and evolution for the WARC format. It will help web archiving enter the mainstream activities of heritage institutions and other sectors by fostering the development of new tools and ensuring the interoperability of collections.
Several applications are already WARC-compliant, such as the Heritrix web crawler, the ARC tools for data management and exchange, and the Wayback Machine, NutchWAX and other search tools for access. The international recognition of the WARC format and its applicability to every kind of digital object will provide strong incentives to use it within and beyond the web-archiving community.