Archivist Richard Pearce-Moses has the daunting challenge of building a system to curate the valuable digital records and data of the state of Arizona. By his example, he is leading the way for 21st-century archivists and librarians and their future work with digital content.
Richard, who is deputy director for technology and information resources at the Arizona State Library, Archives and Public Records, and his team began with a project to capture state agencies’ publications from those agencies’ Web sites. The first challenge was to build a complete list of state Web sites. Using technology and clever sleuthing, they built a list of all URLs on some 50,000 Web pages on key state sites. A script analyzed those links to identify some 1,500 unique domains. Finally, the team checked each domain to see if it was a state agency Web site that needed to be captured.
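The link-analysis step can be pictured with a short sketch. This is not the team's actual script, which the article does not show; it is a minimal Python illustration that assumes the harvested links are already available as a list of strings, and the example.gov addresses are placeholders.

```python
from urllib.parse import urlsplit


def unique_hosts(urls):
    """Return the sorted set of host names found in a list of links."""
    hosts = set()
    for url in urls:
        # hostname is None for relative links and for schemes such as mailto:
        host = urlsplit(url).hostname
        if host:
            hosts.add(host.lower())
    return sorted(hosts)


# Placeholder links standing in for URLs harvested from state Web pages.
links = [
    "http://www.example.gov/reports/2007.pdf",
    "http://WWW.example.gov/about/",
    "https://water.example.gov/gdtf/index.html",
    "/relative/path.html",
    "mailto:someone@example.gov",
]

print(unique_hosts(links))
```

The output of such a script is only a candidate list. As the article notes, someone still has to check each domain to decide whether it belongs to a state agency.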
The prototype system immediately identified some 200 state Web sites. A quarter of those sites had not been previously identified. Based on that success, programmers at the Online Computer Library Center (OCLC) developed the Domain Tool under a Library of Congress NDIIPP grant. Continued use of that tool has helped identify when sites become obsolete and has raised the total number of Arizona State Web sites identified to nearly 500.
“We realized Web sites are very much like [traditional] archival collections,” says Richard. “A Web site contains documents of common provenance. Further, the documents on those sites are typically organized into directories, which are analogous to the archival notion of a series.” Looking at Web sites as archival collections allows the contents to be managed as aggregates. “We don’t have the resources to manage half a million Web pages, but we can manage a few hundred Web sites each with an average of a few dozen directories.”
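To make the aggregation idea concrete, here is a minimal Python sketch that groups page URLs by site and top-level directory, treating each pair as a rough stand-in for a collection and one of its series. The grouping rule and the sample addresses are assumptions made for illustration, not part of the Arizona Model itself.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_series(urls):
    """Group page URLs under (host, top-level directory) keys."""
    groups = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        segments = [s for s in parts.path.split("/") if s]
        is_directory = parts.path.endswith("/")
        # Pages sitting directly under the site root are filed under "/".
        if segments and (is_directory or len(segments) > 1):
            directory = segments[0]
        else:
            directory = "/"
        groups[(parts.hostname, directory)].append(url)
    return dict(groups)


# Placeholder pages from a single hypothetical agency site.
pages = [
    "http://www.example.gov/index.html",
    "http://www.example.gov/gdtf/index.html",
    "http://www.example.gov/gdtf/minutes/january.pdf",
    "http://www.example.gov/annual_reports/",
]

for (host, directory), members in sorted(group_by_series(pages).items()):
    print(host, directory, len(members))
```

Four pages collapse into three units to describe. At the scale Richard mentions, the same idea turns half a million pages into a few hundred sites with a few dozen directories each.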
This work has resulted in an approach known as the Arizona Model, which adapts traditional archivists’ methods from the tangible world of paper assets to the intangible world of electronic assets.
Richard emphasizes that, despite a site’s digital environment and the abundance of clever software tools, Web archiving cannot be completely automated … for now. A human must make choices at key points in the process. No technology can replace a human’s judgment.
For example, software can display the structure of a Web site by analyzing the URLs on a server, but only a human can define a site’s intellectual or conceptual boundaries. A large organization may have several servers that make up the organization’s Web site. Or, a department’s Web site may consist of subdirectories within the larger organization’s Web site.
Despite the many perplexing variations in Web-site organization at the server level, as implemented by amateur and professional Web developers alike, an archivist can browse the file structure and make some judgments about how to characterize the content. Directories often have terse names that a machine cannot interpret. Richard notes, “The software can’t figure out that the directory GDTF refers to the Governor’s Drought Task Force, but it takes a person about 10 seconds.”
If storage space is an issue, an archivist or librarian can decide what is worth adding to the collection. Richard cites an example. “The Arizona Corporation Commission has a high-level site and within it a directory called ‘Annual Reports,’ which has got to be interesting, right?” But he cautions that even the contents of an interestingly named directory may not actually merit preservation. He has come across promisingly named directories filled only with ephemeral information. After such a “scouting mission,” staff can configure the Web crawl software to harvest some parts of the site and not others.
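The article does not say which crawler the team uses or how its scope is configured, so the following is only a generic Python sketch of the idea: the archivist's decisions are recorded as directory prefixes to include or exclude, and the longest matching prefix decides. The function name, the rule format, and the paths are all hypothetical.

```python
def in_scope(path, include, exclude):
    """Return True if the longest matching prefix is an include rule.

    Paths that match no rule are out of scope, and an exclude rule wins
    a tie with an include rule of the same length.
    """
    best_include = max((len(p) for p in include if path.startswith(p)), default=-1)
    best_exclude = max((len(p) for p in exclude if path.startswith(p)), default=-1)
    return best_include > best_exclude


# Hypothetical decisions recorded after a scouting mission.
include = ["/annual_reports/"]
exclude = ["/annual_reports/drafts/"]

for path in [
    "/annual_reports/2006.pdf",
    "/annual_reports/drafts/notes.doc",
    "/news/today.html",
]:
    print(path, in_scope(path, include, exclude))
```

The rules only record the judgment. Deciding which directories deserve an include rule remains the human step.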
Richard is realistic when it comes to the skill set digital librarians and archivists need. “What does it mean to be a curator in a digital era? How do you bring files in and make them available to the general public?” Richard asks. “The ‘what’ and the ‘why’ of our jobs remain the same: We must still select, acquire, organize, provide access and preserve our collections. But ‘how’ we do that changes.”
Archivists and librarians do not need to become programmers, he insists. “Minimally, they must know enough to be familiar with the opportunities and limits of technology, and the more technical skills they learn, the better.”
The current crop of archive and library students cannot be assumed to have the needed technological skills. Richard warns, “People who grew up with PCs may be computer savvy, but mostly as end users of applications rather than as developers.” He points toward practical programs, such as the University of Arizona School of Information Resources and Library Science’s Graduate Certificate in Digital Information Management, as much-needed sources of comprehensive supplemental technological education.
The Library of Congress project in which Richard is involved, the Persistent Digital Archives and Library System (PeDALS), is an end-to-end digital archive system. Richard explains, “It uses middleware to describe business rules, from ingest to access.” PeDALS is based on the Open Archival Information System reference model and its three types of information “packages”:
- the Submission Information Package for ingest
- the Archival Information Package for storage and management
- the Dissemination Information Package for making the content available to users.

“We capture the object as an object, an authoritative version,” Richard says. “But we add metadata to give it a context and description, and may create transformations so that we can administer, discover and preserve the records.”
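The flow among the three package types can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the PeDALS design or its middleware: the class names, fields, and the choice of a SHA-256 checksum are assumptions made for the example, and a real Archival Information Package carries far richer metadata.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SubmissionPackage:
    """What a producer hands over at ingest."""
    filename: str
    content: bytes
    producer: str


@dataclass(frozen=True)
class ArchivalPackage:
    """The authoritative object plus the metadata added by the archive."""
    filename: str
    content: bytes
    sha256: str
    metadata: dict


@dataclass(frozen=True)
class DisseminationPackage:
    """What a user receives: the object and its description."""
    filename: str
    content: bytes
    description: dict


def ingest(sip, series):
    """Keep the object unchanged; add a checksum and contextual metadata."""
    checksum = hashlib.sha256(sip.content).hexdigest()
    metadata = {"producer": sip.producer, "series": series}
    return ArchivalPackage(sip.filename, sip.content, checksum, metadata)


def disseminate(aip):
    """Build an access copy from the archival package."""
    return DisseminationPackage(aip.filename, aip.content, dict(aip.metadata))


# Placeholder record from a hypothetical agency.
sip = SubmissionPackage("report.pdf", b"%PDF-sample", "Example Agency")
aip = ingest(sip, series="Annual Reports")
dip = disseminate(aip)
print(aip.sha256[:12], dip.description)
```

The point of the sketch is the one Richard makes: the object itself passes through untouched, while context and description accumulate around it.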
PeDALS also integrates with LOCKSS technology. “LOCKSS functions as a secure dark archive to preserve the master copy,” Richard says. “It keeps redundant copies in geographically dispersed replications and does integrity checking on them. If it finds flaws in a copy, it will replace the bad copy with a good copy.”
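The check-and-repair behavior Richard describes can be illustrated with a toy Python example. It is not the LOCKSS protocol, which relies on polling among peer systems; it simply assumes that all copies are at hand, compares their SHA-256 digests, and overwrites any copy that disagrees with the majority. The site names are placeholders.

```python
import hashlib
from collections import Counter


def audit_and_repair(replicas):
    """Replace copies whose digest disagrees with the majority.

    replicas maps a site name to that site's copy of the file (bytes).
    Returns the names of the sites that were repaired.
    """
    if not replicas:
        return []
    digests = {name: hashlib.sha256(data).hexdigest() for name, data in replicas.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(replicas) / 2:
        # Without a clear majority, the decision is left to a person.
        raise ValueError("no majority among replicas")
    good_copy = next(data for name, data in replicas.items() if digests[name] == winner)
    damaged = [name for name in replicas if digests[name] != winner]
    for name in damaged:
        replicas[name] = good_copy
    return damaged


# Three placeholder sites; one copy has a flipped character.
copies = {
    "phoenix": b"minutes of the task force",
    "tucson": b"minutes of the task force",
    "flagstaff": b"minutes of the task forcf",
}

print(audit_and_repair(copies))
```

When no majority exists, the sketch refuses to guess and raises an error, which matches the article's wider theme that some decisions remain with a human.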
When finished, PeDALS promises to be a valuable archival tool. But Richard still stresses the human components of appraisal and craftsmanship in the process. “Appraisal is a judging of perceived value of the information, and craftsmanship is the judicious use of the tools,” Richard says. “It all comes down to craftsmanship.”