In Michael L. Nelson’s view of digital preservation’s future, data is well behaved but promiscuous. Dead Web sites will be brought back to life, digital data will be born archivable, files will describe themselves, and data will spread freely among the masses for safekeeping.
Michael’s idealism stems from his work as an assistant professor of Computer Science at Old Dominion University (ODU) in Norfolk, Va. His ODU team participated in the Library of Congress’s landmark Archive Ingest and Handling Test (AIHT) project, which researched the challenge of what Michael calls "archival forensics": ingesting, making sense of and improving the preservability of a set of donated digital archives.
ODU was the smallest institution in the AIHT project and the only non-library. The others – Harvard, Stanford, Johns Hopkins and the Library of Congress – had larger infrastructures than ODU and their own repository systems. As such, they also had more limitations than ODU. "With your own system your tools get to be all about your own site and your own way of doing things," Michael said. "The tools that you have influence your approach to the problem." ODU may have had a smaller staff and fewer resources, but it also had fewer responsibilities.
That was to ODU's advantage. Michael said, "When you have nothing, you have complete flexibility." So Old Dominion's AIHT research centered on self-archiving objects, which travel bundled with their own metadata.
The AIHT experience had a profound effect on Michael's thinking about alternative approaches to digital archives, an effect that extended past the end of the project to newer, related research. One subsequent project led ODU to explore novel ways to exploit the power of the major Web crawlers. In the process they discovered how to resurrect extinct Web sites.
Since Google, Yahoo and the Internet Archive have more disk drives than ODU will ever have, along with more resources and superior crawl technology, ODU wondered: what if it crawled content that the others had already crawled? Crawl the crawlers. Michael and his colleagues hit on the novel idea of reconstructing extinct sites from the crawlers' caches.
All the major crawlers maintain a cache, a fast-serving area where copies of recently crawled items are temporarily stored. "We have this public Web service called Warrick that traces its origin back to the AIHT project," Michael said. Warrick helps users reconstruct or recover their Web sites by searching the caches of the Internet Archive, Google, Yahoo and Live Search, gathering the best results from each and combining them into a representation of the original Web site.
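A minimal sketch of the idea, under one simplifying assumption: the snippet below consults only the Internet Archive's public Wayback Machine availability API to ask whether it holds a copy of a lost page. Warrick itself goes further, querying several caches and merging the best copies into a reconstruction of the whole site.

```python
# A rough sketch of cache-based recovery, not Warrick's actual code. It checks
# a single source (the Internet Archive's public availability API) for the
# best archived copy of a URL.
import json
import urllib.parse
import urllib.request


def closest_archived_copy(url):
    """Return the Internet Archive's closest snapshot of `url`, or None."""
    api = "https://archive.org/wayback/available?" + urllib.parse.urlencode({"url": url})
    with urllib.request.urlopen(api) as resp:
        data = json.load(resp)
    snapshot = data.get("archived_snapshots", {}).get("closest")
    return snapshot if snapshot and snapshot.get("available") else None


if __name__ == "__main__":
    hit = closest_archived_copy("http://example.com/")
    if hit:
        print("Recoverable copy:", hit["url"], "crawled at", hit["timestamp"])
    else:
        print("No archived copy found.")
```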
Another ODU tool with AIHT roots identifies file types. When a Web server serves up a page's contents, it declares in the background (in the HTTP Content-Type header) what type of file it thinks each element on the page is. The problem is that servers can – and occasionally do – get file types wrong: how do you know for certain that a file labeled "something.jpeg" really is a JPEG?
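To make the mismatch concrete, here is a small hypothetical check (not one of ODU's tools) that compares the MIME type a server declares in its Content-Type header with what the file's own leading "magic" bytes suggest.

```python
# Compare a server's declared MIME type with the type suggested by the file's
# leading "magic" bytes. A mismatch is exactly the problem described above.
import urllib.request

MAGIC_BYTES = {
    b"\xff\xd8\xff": "image/jpeg",       # JPEG
    b"\x89PNG\r\n\x1a\n": "image/png",   # PNG
    b"%PDF-": "application/pdf",         # PDF
    b"GIF87a": "image/gif",              # GIF
    b"GIF89a": "image/gif",
}


def declared_vs_actual(url):
    """Return (declared MIME type, type guessed from the file's first bytes)."""
    with urllib.request.urlopen(url) as resp:
        declared = resp.headers.get_content_type()
        head = resp.read(16)
    guessed = next((mime for magic, mime in MAGIC_BYTES.items()
                    if head.startswith(magic)), None)
    return declared, guessed


# Example: a file served as "image/jpeg" whose bytes say PNG would return
# ("image/jpeg", "image/png").
```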
A tool such as Harvard’s JHOVE analyzes files and helps confirm the file type, which in turn helps a curator inventory the Web content. The common practice with JHOVE is to identify a file after it is downloaded; ODU focused on identifying a file before it leaves the server. The result of its attempt at a "born archival" approach is CRATE. Michael described CRATE this way: "We run every [file] through a set of tools, such as JHOVE, and record what each tool said about it. But we’re not going to verify that it was right. Our idea was that 'correctness' is a property that will be determined in the future." When each Web-site file is served up by the Web server, CRATE declares what the file is and what the various tools said about it.
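One way to picture that philosophy is the sketch below; the record layout and the choice of tools are illustrative assumptions, not CRATE's actual format. Each file is run through whatever characterization tools are on hand, their verdicts are recorded verbatim, and the question of which verdict is "right" is deliberately left open.

```python
# Illustration of the CRATE philosophy, not CRATE's actual format: run a file
# through available characterization tools, record each tool's raw output, and
# make no judgment about which of them is correct.
import hashlib
import subprocess
from datetime import datetime, timezone


def characterize(path, tools):
    """Bundle a file's identity with the raw output of each tool run on it."""
    with open(path, "rb") as f:
        data = f.read()
    record = {
        "file": path,
        "size": len(data),
        "md5": hashlib.md5(data).hexdigest(),
        "characterized_at": datetime.now(timezone.utc).isoformat(),
        "tool_output": {},
    }
    for name, cmd in tools.items():
        result = subprocess.run(cmd + [path], capture_output=True, text=True)
        # No attempt to decide which tool is "right"; just record what it said.
        record["tool_output"][name] = (result.stdout or result.stderr).strip()
    return record


# Example (tool names assumed to be installed and on the PATH):
# characterize("report.pdf", {"file": ["file", "--brief", "--mime"],
#                             "jhove": ["jhove"]})
```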
So you can get Web content in two different ways. "I can ask in a regular HTTP way [and display the Web page] and get the file ready to use right now," Michael said. "Or I can get a version of the file that's already packaged and has all this descriptive preservation metadata, which I can store and put in a repository."
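In HTTP terms, that difference might look like the hypothetical exchange below. The Accept header value and the idea of a "packaged" response are invented here for illustration; they are not ODU's actual interface.

```python
# A hypothetical illustration of the two access modes; the header value and
# the packaged format are assumptions, not ODU's real interface.
import urllib.request


def fetch(url, packaged=False):
    """Fetch a resource as-is, or (hypothetically) as a preservation package."""
    headers = {}
    if packaged:
        # Ask for the file bundled with its preservation metadata, ready to
        # drop into a repository. This media type is purely illustrative.
        headers["Accept"] = "application/x-preservation-package"
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req) as resp:
        return resp.read()


# page_now = fetch("http://example.com/")        # ready to display right now
# packaged = fetch("http://example.com/", True)  # metadata-laden copy for a repository
```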
All this is fine for the elite group of professional Web preservationists, but what do these tools mean for the rest of the world? Michael said that preservation tools must be simplified for wider public acceptance. "Formats and protocols are rarely about which one is better. It’s more about which one everyone else is using....Simpler formats are going to have greater utility."
Though Michael is very much a part of the digital preservation priesthood – writing and coding deep in the inner sanctum of academia – he believes in widespread public use of digital tools and in the wisdom of crowds. He contrasted the gap between the cataloging community and the Web community over how online information should be organized with the public response to the Library's Flickr project, particularly with how users are tagging the Library’s content.
Michael observed, "The world will archive your stuff. All you need to do is make it available....This is one of the important shifts in the thinking about repositories....If I lock it away and never use it, then I’m not likely to have copies of it in the future. But if I make it widely available I’m actually highly likely to have lots of copies later."
And users will continue to create their own finding aids, which add value to the resources. Michael contends that this redistribution of content to different locations is the essence of digital preservation. He said, " 'Use' equals 'archiving.'"