"Can I see some identification please?"
It is a question we routinely hear at the airport, the bank and just about any situation where you have to prove your identity. Identity fraud causes serious problems and, though we may not like the inconvenience, we appreciate the security.
So too in digital preservation it is crucial for files to prove their identities, to show us some ID. If you do not know what a file is, or if it is damaged and does not meet the requirements for its stated file format, you may not be able to read or hear or see its content. What you have is a pile of well-preserved but meaningless bits. To save something for the ages you have to validate up front that it is what it says it is.
Stephen Abrams knows quite a bit about digital format validation, as his work with tools and standards such as JHOVE, the Global Digital Format Registry and PDF/A demonstrates. According to Abrams, there are two fundamental levels at which you can approach the preservation of a digital object: preserving its bits or preserving its information content.
Bit-level preservation is common in the digital-preservation community. "If somebody gave you a set of bits, there are well-understood strategies for ensuring that you can return a faithful copy of those original bits in the future," he said. Solutions include storing redundant copies on different media in different geographic locations, auditing the copies to verify that they are identical and periodically migrating the bits from older to newer media.
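Auditing copies to verify they are identical is typically done by recording a checksum when content is ingested and recomputing it later. A minimal sketch of that fixity check (the function names are illustrative, not taken from any particular repository system):

```python
import hashlib


def fixity(path: str, algorithm: str = "sha256") -> str:
    """Compute a checksum for a file, reading in chunks so large
    files don't have to fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def audit(path: str, recorded_digest: str) -> bool:
    """Verify that a stored copy still matches the checksum recorded
    at ingest -- a mismatch signals bit rot or tampering."""
    return fixity(path) == recorded_digest
```

Running the same audit against each redundant copy, on each storage medium, is what lets a repository promise to "return a faithful copy of those original bits."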
However, preserving the abstract content encoded into those bits is more complicated. "The preserved bits are only useful if we are able to transform them back into some human-sensible form -- text or picture or sound," Abrams stated. "And you can't do that unless you understand the format of the bits, since a format is precisely the set of syntactic and semantic rules governing the mapping between abstract content and bits."
Abrams did much of his pioneering work as an engineer for the Harvard University Library. He recalls that, in discussions between the HUL and its counterparts at the Massachusetts Institute of Technology about digital-preservation repositories, they came to a realization that "format" was at the core of any talk about digital preservation. In response, the HUL and JSTOR designed JHOVE to process digital objects.
Abrams, who now works at the California Digital Library, is collaborating with partners at Portico and Stanford University to prototype JHOVE2, which should be available in early 2010. JHOVE2 will be able to process more sophisticated digital objects than JHOVE currently can. "In JHOVE1 there's an assumption that a single digital object is always manifest in a single file, and is always an instance of a single format," Abrams said. But that is not always true. Digital objects are becoming more complex, often comprising several elements. A TIFF file, for instance, holds raster image data but may also contain an embedded color profile or embedded XMP metadata. The JHOVE2 team wants to break from the assumption of "one object, one file, one format" and accept a model in which one object may contain any number of files and formats.
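That richer model can be sketched as a small data structure. This is not JHOVE2's actual design, just an illustration of the one-object, many-files, many-formats idea; all class and field names are hypothetical:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FormatInstance:
    """One format occurrence, possibly embedded inside another."""
    format_name: str   # e.g. "TIFF", "ICC profile", "XMP"
    byte_offset: int = 0  # where in the file the instance begins


@dataclass
class SourceFile:
    """One file may carry several formats: a TIFF plus its embedded
    color profile plus its embedded XMP metadata."""
    path: str
    formats: List[FormatInstance] = field(default_factory=list)


@dataclass
class DigitalObject:
    """One object may span several files."""
    identifier: str
    files: List[SourceFile] = field(default_factory=list)
```

Under JHOVE's original assumption, `DigitalObject` would collapse to a single file with a single format; the nested lists are what the new model adds.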
According to Abrams, there are a few steps in the format-verification process. "Once you've confirmed that something fits a format you have to know what that format is, what it means," he said. What does it really mean, for example, to say that an object is a TIFF?
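The first of those steps is usually identification: guessing a format from the "magic" signature bytes at the start of a file. A minimal sketch, with a deliberately tiny signature table (the TIFF, PDF and PNG signatures are real; the function name is illustrative). Note that a matching signature only identifies a candidate format; confirming that the file actually meets the format's rules is validation, a separate and much deeper step:

```python
from typing import Optional

# Leading byte signatures for a few well-known formats.
SIGNATURES = {
    b"II*\x00": "TIFF (little-endian)",
    b"MM\x00*": "TIFF (big-endian)",
    b"%PDF-": "PDF",
    b"\x89PNG\r\n\x1a\n": "PNG",
}


def identify(data: bytes) -> Optional[str]:
    """Return a candidate format name based on magic bytes, or None.
    This is identification only, not validation."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None
```

A file that begins `II*\x00` claims to be a little-endian TIFF; whether every tag and offset inside it obeys the TIFF specification is exactly the question tools like JHOVE exist to answer.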
To fill the need for a definitive resource with comprehensive, up-to-date descriptions of file formats, the HUL created the Global Digital Format Registry. "The GDFR is meant to be a distributed and replicated registry of format information, populated and vetted by experts and enthusiasts world-wide -- a centralized clearinghouse that could be used to manage a set of technical information about formats," Abrams said.
The GDFR is a peer-to-peer network of independent but cooperating registries that communicate with each other to synchronize their content. "Each node on the network would be a full replica of the information that's available in all the other nodes," explained Abrams. Such redundancy and distribution increase the chances that the information will survive.
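The full-replica idea can be sketched with a pairwise merge in which every record carries a version number and the newer version wins. This is a simplification of how any real registry network would reconcile records (the class names and the version-wins rule are assumptions for illustration, not the GDFR protocol):

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FormatRecord:
    name: str
    description: str
    version: int = 1  # higher version wins during synchronization


@dataclass
class RegistryNode:
    """One peer; after syncing, every node holds a full replica."""
    records: Dict[str, FormatRecord] = field(default_factory=dict)

    def publish(self, record: FormatRecord) -> None:
        self.records[record.name] = record

    def sync_with(self, other: "RegistryNode") -> None:
        # Merge in both directions, keeping the newer version of
        # each record, so both nodes end up with identical content.
        for src, dst in ((self.records, other.records),
                         (other.records, self.records)):
            for name, rec in src.items():
                if name not in dst or dst[name].version < rec.version:
                    dst[name] = rec
```

After any two nodes sync, each holds everything the other knew, which is what makes the loss of any single node survivable.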
Abrams was also instrumental in the creation of the PDF/A standard for long-term archiving of electronic documents. The defining feature of PDF/A is that it specifies a completely self-contained package: everything necessary to reproduce the static visual appearance of a file -- all fonts, color information and descriptive metadata -- is found inside the file itself. Audio and video content are not permitted, nor are JavaScript and other executable content.
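A crude way to flag documents likely to violate those constraints is to scan for the PDF name tokens associated with disallowed features. This is a byte-level heuristic for illustration only, not a conformance checker: real PDF/A validation requires parsing the document's full object structure against ISO 19005, and the token list here is partial:

```python
# PDF name tokens that suggest features PDF/A-1 disallows.
# Heuristic only -- tokens can appear in harmless contexts.
DISALLOWED_TOKENS = {
    b"/JavaScript": "embedded JavaScript",
    b"/Movie": "embedded video",
    b"/Sound": "embedded audio",
    b"/Launch": "launch action (executable)",
}


def suspicious_features(pdf_bytes: bytes) -> list:
    """Return descriptions of tokens hinting at non-PDF/A content."""
    return [desc for token, desc in DISALLOWED_TOKENS.items()
            if token in pdf_bytes]
```

An empty result does not prove conformance; it only means this quick scan found nothing obviously disallowed.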
He attended the early meetings of the PDF/A committee and saw that a lot of the initial requirements were coming from the records-management part of the archival world. Abrams knew that the needs of records management are different and often more constraining than the needs of libraries, and he spoke up. "I was, perhaps, a bit too free with my opinions on how to balance the various needs of libraries and archives, so they put me in charge," he said wryly. "That's what happens when you're too vocal, I guess." He became the ISO project leader and document editor for the initial release of the PDF/A standard, ISO 19005-1.
PDF/A is derived from Adobe's popular PDF format. Adobe was involved from the beginning and was supportive of the effort. "The current PDF specification document is quite formidable at over 1300 pages long," said Abrams. "Having access to high-level Adobe technical experts was extremely helpful." Creating an international standard is time-consuming, and though PDF/A was pushed through on what ISO calls its fast track, it took two years to gain approval.
Abrams encourages people to preserve documents in the PDF/A format but he explains, "There is nothing in the PDF/A standard that claims it is the best or the only way in which to preserve electronic documents. However, once you have made the determination that you want to use a PDF-based approach, PDF/A is the best way to make that as amenable to preservation efforts as possible."
Abrams is now the Senior Manager for Digital Preservation Technology at the California Digital Library. He has broad oversight over CDL's preservation infrastructure, including its digital repository system, persistent identification mechanism and the Web Archiving Service, newly developed as part of the NDIIPP-funded Web-at-Risk project.
One of his tasks is to help re-evaluate the CDL repository and to define high-level requirements. He emphasizes that a repository is a set of services that can be deployed flexibly to meet preservation obligations at a continuum of service levels. CDL has a growing importance both inside and outside the UC system and Abrams is considering how best to meet the resulting needs and expectations.
Like many software engineers, Abrams is practical, focused on doing preservation today rather than speculating about the state of preservation in 100 years. "You really can't do long-term preservation in the sense of a current action that will ensure usability forever," he said. "Preservation is about a series of small incremental actions, each effective for a short time horizon -- say five or ten years -- that collectively add up to forever."
But along with that day-to-day practicality, he also has a visionary grasp on the abstract content, the core stuff that is being archived. He said, "We always talk about preservation of digital objects, but really the object itself isn't the important thing. Ultimately, an object is just a set of bits, but it's the underlying information content that those bits represent that matters. Over time we need to manage the gap between what we were originally given and what we need to be useful today. If we can quantify that gap then we can figure out how to convert the one into the other."