David Minor wants you to trust him. He seems honest enough, so it might be worth it to take a pass, depending on the circumstances. But he also wants you to trust his organization, the San Diego Supercomputer Center (SDSC), with managing, storing and preserving some of your most valuable digital assets. Now how willing are you to trust him?
This is the question the Library set out to answer when the NDIIPP program engaged SDSC in the Data Transfer and Storage Project, a one-year demonstration project to test the feasibility of engaging external partners as service providers to fill digital management needs. Minor and Richard Moore, SDSC division director for Production Services, came to the Library on Sept. 19, 2007, to present their findings.
Many cultural heritage institutions and government agencies are suffering from tight budgets and a shortage of resources and expertise, with many forced to look outside their walls to find the needed expertise. This is especially true in the digital preservation environment, where technical knowledge is rare, and the costs of long-term information management remain speculative. Additionally, distributing materials, capabilities and expertise geographically helps ensure that digital materials survive into the future by maximizing the benefits of cooperation and maintaining a level of redundancy that keeps the breakdown of any one node from causing total loss.
However, for an outsourcing relationship to work successfully, trust has to be established between the entities involved. The Library is interested in exploring the various mechanisms by which trust can be established between digital preservation partners, and hopes to document these "metrics of trust" so that they can be utilized by other cultural heritage organizations needing to explore outsourcing as a way to supplement their own preservation workload. The trust metrics are abstract notions that result from the answers to questions such as:
- Does new infrastructure improve process?
- Is the solution useful for other organizations?
- Has a long-term solution been found?
Minor touched on these issues during his presentation, describing how SDSC and the Library spent a year testing and researching methods to transfer and preserve the Library's data. They used the Library's Election 2004 Web archive collection and digitized images from the Prokudin-Gorskii photograph collection as test data. The Election 2004 Web archive is a Web site collection mining government, political party, media, advocacy group, blog and other categories of Web sites to paint a detailed picture of the presidential election of 2004. The Prokudin-Gorskii collection offers a vivid portrait of the people and architecture of the Russian Empire on the eve of World War I.
"When we started this project, it wasn't necessarily clear where we wanted to go or what steps we wanted to take," Minor noted during his presentation. From the beginning, the Library leveraged SDSC's physical infrastructure, procedures and expertise and took advantage of its innovations and expertise in supporting cyberinfrastructure. Minor described cyberinfrastructure as "the computers, the data storage, the network and the experts, brought together with some kind of glue, taking each of the pieces and synthesizing them together to make a better whole." SDSC has been working for two decades to support the scientific research community by providing cyberinfrastructure services such as data storage, making the institution an excellent partner to explore the Library's needs.
One of the initial technical questions that arose was whether to ship the digital material from the Library to SDSC on hard drives or transfer it over the Internet. The main difficulty was the overwhelming size of the data, as the materials totaled approximately 5 terabytes (a terabyte is 1 trillion bytes of data, or approximately the size of a thousand Encyclopaedia Britannicas).
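To give a sense of the scale involved, the rough arithmetic behind the ship-or-transfer question can be sketched as below. The link speeds and the efficiency factor are illustrative assumptions only; the article does not report the bandwidth actually available to the Library or the throughput the project achieved.

```python
TERABYTE = 10**12  # bytes, using the article's definition of 1 trillion bytes


def transfer_days(total_bytes, link_megabits_per_s, efficiency=0.5):
    """Estimate the days needed to move total_bytes over a network link.

    efficiency is an assumed fraction of the nominal bandwidth that is
    actually sustained; it is a placeholder, not a measured value.
    """
    bytes_per_second = link_megabits_per_s * 1_000_000 / 8 * efficiency
    seconds = total_bytes / bytes_per_second
    return seconds / 86_400  # seconds in a day


if __name__ == "__main__":
    # Hypothetical link speeds, chosen only to show how the estimate scales.
    for mbps in (100, 1000):
        days = transfer_days(5 * TERABYTE, mbps)
        print(f"{mbps} Mbit/s link: about {days:.1f} days for 5 TB")
```

Under these assumptions, a tenfold difference in sustained throughput is the difference between a transfer measured in days and one measured in hours, which is why the tuning of the connection mattered.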
The decision to try transferring the material over the Internet became an opportunity for the Library to test its connection to Internet2, the high-speed research computing network. "When you're working with a network," Minor said, "it's at least a two-part effort, and it has to be done with both sides working together. … In many ways, networks are like wind-up toys: If you walk away from them, they'll slow down eventually." The staff of SDSC helped the Library tune its operating environment to optimize its connection to the Internet2 network. SDSC also used its expertise in networking to constantly monitor the connection between the Library and SDSC to make sure that it was operating smoothly.
The next step was to create a robust storage environment at SDSC that would replicate, as closely as possible, the environment at the Library. SDSC worked closely with Phil Michel from the Prints and Photographs (P&P) division to understand how P&P curators cataloged, organized, edited, and manipulated the digital photographs, and SDSC used this information to explore a design that secured a similar working environment for the replicated collections at SDSC.
SDSC has developed a data preservation environment software tool called the Storage Resource Broker (SRB), which was ideal for these purposes. This tool, Minor said, "allowed us to take in the initial copy, create a master manifest or list, and then replicate the copies out to a variety of geographically distributed locations. This tool also allowed us to change an individual item in one location and easily replicate it out to the other locations." The SRB automates the process to ensure that the appropriate files are distributed to the appropriate locations, taking advantage of the benefits of geographic dispersion.
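The workflow Minor describes (take in a copy, build a master manifest, replicate out, then push later changes to the other locations) can be illustrated with a minimal sketch. This is not the SRB or its API: the function names are hypothetical, local directories stand in for geographically distributed sites, and SHA-256 checksums are an assumed way of deciding which files need to be copied.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(master_dir):
    """Map each file's path (relative to master_dir) to its checksum."""
    master = Path(master_dir)
    return {
        path.relative_to(master).as_posix(): sha256_of(path)
        for path in sorted(master.rglob("*"))
        if path.is_file()
    }


def replicate(master_dir, manifest, replica_dirs):
    """Copy files that are missing or out of date in each replica.

    Returns a list of (replica, relative_path) pairs that were copied.
    """
    copied = []
    for replica in map(Path, replica_dirs):
        for rel, checksum in manifest.items():
            target = replica / rel
            if not target.is_file() or sha256_of(target) != checksum:
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(Path(master_dir) / rel, target)
                copied.append((str(replica), rel))
    return copied
```

In this sketch, editing one item in the master copy and rebuilding the manifest causes a second call to `replicate` to copy only that item to each location, which mirrors the behavior described in the quote.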
SDSC had to make sure that the complex environment it was setting up could be clearly monitored by the Library, and that the data stored in the SRB was accessible to the curators at the Library. This transparency of technical design generated ongoing discussion about the nature of the collaboration, further bolstered the mutual trust, and emphasized even more strongly the need to ensure human input in the design process for the replication system.
A good deal of this discussion revolved around how people learn to trust machines through the ways machines report on their own behavior. The SRB is capable of numerous monitoring, logging and reporting functions, and automatically sends e-mail reports to the system administrators on its activities. These messages, depending upon their quantity, content and understandability, can either encourage or discourage trust in the system. One goal of the project was to analyze the self-reporting of the machines to evaluate their effectiveness in building trust in users. For example, after each scheduled data-integrity check, the SRB would send a notification message to the Library. Users noted that the clarity of these messages increased their trust in the system.
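A scheduled integrity check followed by a plainly worded report might look something like the sketch below. The message wording, the function names and the use of SHA-256 are assumptions made for illustration; they do not reproduce the SRB's actual checks or the notifications it sent, and e-mail delivery is left out.

```python
import hashlib
from pathlib import Path


def check_integrity(replica_dir, manifest):
    """Compare a replica against a manifest of {relative_path: sha256 hex}.

    Returns two sorted lists: files that are missing and files whose
    contents no longer match the recorded checksum.
    """
    missing, changed = [], []
    for rel, expected in sorted(manifest.items()):
        path = Path(replica_dir) / rel
        if not path.is_file():
            missing.append(rel)
        elif hashlib.sha256(path.read_bytes()).hexdigest() != expected:
            changed.append(rel)
    return missing, changed


def format_report(location, total, missing, changed):
    """Build a short report that leads with the overall status."""
    status = "OK" if not (missing or changed) else "ATTENTION NEEDED"
    lines = [
        f"Integrity check for {location}: {status}",
        f"Files checked: {total}; missing: {len(missing)}; changed: {len(changed)}",
    ]
    lines += [f"  MISSING  {rel}" for rel in missing]
    lines += [f"  CHANGED  {rel}" for rel in changed]
    return "\n".join(lines)
```

Putting the overall status on the first line and listing only the affected files is one way to keep such messages short and understandable, the qualities the users singled out.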
As Minor noted, the technical issues are vexing, but it's really the human factors that are the most challenging: technical "systems are really good at detecting change, but they are very bad at detecting intent." "There are tough questions here," he concluded with a laugh, noting that this type of trust development is "multi-institutional and multifaceted, but it's not magic. It's a product of the collaborative work we're doing."