A Partnership Born of Urgency and Civic Responsibility
The breezy name "CyberCemetery" doesn't quite convey the significance of this University of North Texas (UNT) Libraries service: archiving expired federal Web sites and making them permanently accessible to the public. The university has been a partner in the National Digital Information Infrastructure and Preservation Program since 2004. The CyberCemetery is working with the California Digital Library on a project called the Web-at-Risk, which is developing Web archiving tools that libraries will use to capture, curate and preserve collections of Web-based government and political information.
Cathy Hartman, assistant dean of the UNT Libraries for digital and information technologies, is deeply aware of its significance, though. She helped found the CyberCemetery as a result of her drive to inform people about what their government was doing and her realization that the Web sites of many federal commissions disappear forever when those commissions end.
In 1997, the U.S. Government Printing Office (GPO) was giving talks about archiving its digital publications and looking for creative ideas regarding digital preservation. The urgency of archiving digital government publications was clear to Cathy, and it was equally clear to her that UNT Libraries could and should do something about it.
Cathy got support from her dean in collaborating with the GPO. A partnership with the GPO was born and UNT Libraries began archiving expired federal Web sites, beginning with the Advisory Commission on Intergovernmental Relations (ACIR). The name "the CyberCemetery" stuck after a while.
UNT Libraries, part of the Federal Depository Library Program, learned after posting the ACIR Web site that the public then expected services for earlier published commission materials. The commission was established in 1959, as its Web site states, "to give continuing study to the relationship among local, state and national levels of government." The UNT Libraries then put all of ACIR's published documents – more than three decades of reports – online.
Under the terms of UNT's memo of understanding with the GPO, staff members of UNT's depository library must:
- Verify all files for authenticity. UNT must review the archived files to make sure they are the same as they were on the original site. Cathy says that law librarians in particular are concerned with authenticity, but of a different sort. "As they [lawyers] present material in a court of law, judges are wary about things like printouts from Web pages. Data could be easily changed." The GPO and others are reconsidering how they publish on the Web and they are investigating authentication methods, such as digital signatures.
- Maintain a consistent Web address. Consistent Web addresses have not been an issue so far and Cathy does not expect them to become one. In fact, though UNT Libraries just switched to a new content management system and has moved some of its data, the technologists at the CyberCemetery feel that the move will not affect their collections' persistent URLs. They handle such moves all the time with their other digital collections.
- Store the files on a server. They have enough space so far but, like everyone else in the digital preservation community, the CyberCemetery team is researching storage and data-transfer options for the inevitable expansion of their collections.
- Provide free public access to the information. Because the Federal Depository Library Program emphasizes permanent public access, part of the CyberCemetery's mission is to do what it can to make sure the archived Web sites work for users just as they did when the sites were live. This sometimes involves modifying Web site code, which is in direct opposition to most cultural-preservation institutions' "hands off" approach of leaving the code alone (as evidence of what the code looked like in the particular year the site was harvested).
This last point is particularly important: UNT Libraries must guarantee access and usability. Cathy gave an example of one site in which the code was so poorly written that the next-generation version of Microsoft's Internet Explorer simply couldn't display the site. CyberCemetery's technicians modified the site's code and made the site accessible.
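The first requirement in the list above, confirming that archived files match what was on the original site, is often approached by comparing checksums. The sketch below is only an illustration of that general idea; the article does not describe the method UNT actually uses. The function names, the choice of SHA-256, and the directory layout are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_manifests(
    original: dict[str, str], archived: dict[str, str]
) -> dict[str, list[str]]:
    """Report files that are missing, unexpected, or changed in the archive."""
    shared = set(original) & set(archived)
    return {
        "missing": sorted(set(original) - set(archived)),
        "unexpected": sorted(set(archived) - set(original)),
        "changed": sorted(n for n in shared if original[n] != archived[n]),
    }
```

Under these assumptions, a manifest would be built from a copy of the live site before it goes dark and another from the archived copy; any entry under "missing" or "changed" flags a file to review by hand. A check like this would have to be run before any deliberate code modifications of the kind described above, since those changes alter the checksums by design.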
Most of the federal Web sites that the CyberCemetery archives belong to commissions. Cathy points out that "in the past, federal commissions might just publish final reports. Now they publish hearings, and it is becoming common for commissions to post videos of those hearings on their Web sites." The videos complicate the archiving process. Crawling, harvesting and accessing video from Web sites is convoluted and tricky.
Cathy notes that the 9-11 Commission Web archives were donated to the CyberCemetery intact. Each of the 12 public hearings was recorded on video and posted to the commission's Web site, resulting in dozens of hours of digital video in multiple formats. The donation of the site to the CyberCemetery eliminated the technical issues regarding harvesting the videos and left only the access issues. Again, the CyberCemetery technicians modified the archived Web code a little and enabled the site to display the archived video of the hearings.
Just-in-Time Harvesting
So far the CyberCemetery has only had to digitize its first collection, the ACIR, which comprised primarily pre-Web paper publications. The CyberCemetery has harvested most of the Web sites with standard Web crawl tools. Staffers have gone through several tools, including Teleport Pro and HTTrack, and they now use the Internet Archive's Heritrix crawler.
When a federal commission forms, that formation is announced in the Federal Register. The CyberCemetery notes each announcement and follows the commission's progress. Many commissions exist for only about two years, and the CyberCemetery tracks the life span of each site as best it can.
In most cases, the CyberCemetery has to be alert to the upcoming termination of a Web site and crawl it before it disappears. There are a variety of ways to find out that a site is about to be terminated. The GPO may inform the CyberCemetery, or its professional peers might do so. Another way: Government document specialists across the country may pass on the information. All serve as extra eyes watching out for the CyberCemetery.
Cathy says that the CyberCemetery has missed some sites. Some just fall off the archivists' radar, similar to the GPO's "fugitive government publications." Just as some government documents never pass through the GPO for distribution to the depository libraries (for example, maybe they're printed by an agency itself on a small scale rather than by the GPO), there may be small, fugitive Web sites that the CyberCemetery is not aware of, and thus the opportunity to archive those sites is missed.
And unlike the Internet Archive's repeated, periodic crawls, the CyberCemetery grabs the site only once, at the end of its online life. But the nature of a federal commission site is such that the final version is usually the most complete. Most commissions do not change or delete content over time; they accumulate it. Everything stays online and more gets added to it. Cathy says, "We aren't losing anything at all; we capture it all at the closure. With the commissions, they begin their work, they have hearings, testimonies, white papers, and then they publish multiple publications. It's all added to the Web site as it's produced. Then they publish their final report and they're gone soon after the final report."
So, as soon as the CyberCemetery staffers discover that a site is disappearing, they wait until it is just about to permanently blink off and then they harvest it. And if the site is complex and packed with content, or if it presents technological challenges to Web harvesting, the CyberCemetery tries to work with the site owners to transfer all of the content intact.
As in the case of the 9-11 Commission Web site, some site owners will donate a copy of their site, which not only makes the CyberCemetery's job easier but also helps to assure the completeness of the site. Those donors clearly share the CyberCemetery's commitment to the cause of properly archiving federal Web sites and providing continuing access for all users.
For example, Al Gore's national performance review Web site, the National Partnership for Reinventing Government (NPRG), was donated to the CyberCemetery. The site is extensive and contains a lot of information related to benchmarking for federal services. Cathy says appreciatively that when it came time to donate the terminated site to the CyberCemetery, "They [the site owners] were excited that the site would be preserved, and they packaged it up and added language indicating that the NPRG was closing and information about how to reach NPRG alumni. They did a nice job of packaging."
Technological Growth
Though the content at CyberCemetery is not especially large, it will continue to grow along with UNT Libraries' other digital content. One of the issues that the CyberCemetery will have to face is large-scale data movement.
The organization is accustomed to shipping data on drives, as most institutions are, but that cannot last. Drives can get corrupted, they can get damaged in shipment, and shipping is laborious and time consuming. And the actual travel time of the drive seems "old school" compared with near-instantaneous transfer over high-bandwidth networks. However, shipping data on drives is a familiar and somewhat dependable data-transfer method, while the transition to a reliable high-speed network transfer infrastructure can be complex and daunting.
Cathy is hopeful that the newly formed Texas Digital Library, in which UNT Libraries participates, may be able to help UNT Libraries and other Texas universities leverage their combined advanced technology to the benefit of all their members. Together they are exploring issues of large-scale data storage, dispersed storage, metadata standards, data transfer, curation and preservation of digital scholarly information.
Pooling resources makes sense. It is an inescapable fact that no one institution can do it all. In fact, in 2006 the National Archives and Records Administration (NARA) recognized that the CyberCemetery had archived sites that NARA didn't have copies of. After some negotiation they formed a partnership and the CyberCemetery is now affiliated with NARA.
One of the unexpected outcomes of being the final resting place for defunct federal Web sites, Cathy said, is that, "once you put the Web site up, people find you." For some researchers, the CyberCemetery becomes the default expert resource for the federal Web sites it has archived. CyberCemetery gets a lot of calls and its staff help users track down information related to the archived sites. Often the CyberCemetery has to track down former members of the commissions for help.
But that is not a downside or a negative issue; it has become part of the responsibility that the CyberCemetery took on. Cathy's thinking was always that the UNT Libraries needed to do their part in this area of digital preservation. They looked on it as their role and contribution. She says that digital preservation and stewardship is the next step for all libraries and it is undoubtedly where they need to go as they evolve. Libraries need to think about building and curating digital collections for the next hundred years and beyond.
Libraries have always assumed that responsibility, especially university libraries. "That's our role," Cathy says. "Who else is going to do it?"