A Meeting of Visionaries
It has been almost 10 years since Vicky Reich and David Rosenthal co-founded LOCKSS (which stands for "Lots of Copies Keep Stuff Safe") with the purpose of enabling libraries to preserve their own digital collections. Since then LOCKSS has continually evolved to meet the needs of librarians and has been widely adopted as an economical, easy-to-use digital archiving solution. And it has evolved in a few directions that Vicky and David did not foresee.
In the late 1990s, Vicky Reich – who by then had years of library administrative experience – was assistant director of Stanford's HighWire Press. She noticed that an increasing number of important publications were showing up solely in digital format, not in print, and she became concerned that librarians might lose custody of those copies, especially since many librarians tended to know very little about preserving digital assets. David Rosenthal, a distinguished Silicon Valley engineer with a long history at Sun Microsystems, became involved in LOCKSS from the technical side, focusing on preservation of the bits and bytes.
Vicky and David were also concerned that, unless libraries were proactive in creating their own digital collections, they might defer the digital storage responsibility to large nonlibrary entities – outside institutions – and could end up losing custodianship of those collections and lose their role in a democratic society as memory organizations. Libraries needed a solution that would enable them to store their collections themselves, but it was important that such a system be both affordable and easy to use. And so LOCKSS was born.
How LOCKSS Works
LOCKSS is an open-source system (meaning that, if necessary, users can modify the software's source code) of networked data replicas – shared copies of e-journals – that allows the participants, through a peer-to-peer connection, to access reliably preserved data. In this case, "peer to peer" is not the same as freewheeling, Napster-like file sharing; users of one library's LOCKSS system can only access content from that library's collection.
To get started, a user needs only to download files from the LOCKSS Web site, save them to a CD to create a "boot CD" and boot the LOCKSS computer from that CD. It is best if the computer on which LOCKSS resides is devoted to the sole purpose of running LOCKSS and storing the digital collections. Creating such a "LOCKSS box" is easy. It just requires a personal computer with about 1 GB of memory, a CD drive, either a floppy disk drive or a USB flash memory drive and at least 250 GB of storage.
In this way, an average, off-the-shelf computer becomes a digital archiving appliance. As for maintenance, the LOCKSS box runs and stays connected to the Internet around the clock, and it has a built-in security system to monitor its own health and defenses.
David says that the box is constantly "collecting new content, auditing its content against other LOCKSS boxes and repairing any damage, [and] monitoring reader's accesses to preserved content and transparently stepping in to supply it if the publisher can't or won't supply it."
Though it is possible to form an alliance independently with other LOCKSS users for free, it is more economical to join the LOCKSS Alliance. The Alliance is governed by a board of directors and staffed by project team members. Fees support continuing software development and grant the member access to software perks and Alliance activities and workshops.
Any library can join. Virtual private networks allow groups of institutions to partner with one another to securely preserve their digital collections. Each subscribing library preserves its copies of publications as it sees fit, taking into account the publishers' restrictions. The publishers' agreements, of course, include restrictions on what librarians can and cannot do with the content, much like a paper subscription agreement. The objective is that libraries can't redistribute those publications to other institutions.
With permission from the publisher, LOCKSS enables a library to collect, preserve and distribute to its readers copies of the material to which the library has subscribed. Each participating library has a Web crawler that:
- checks a publisher's site
- confirms that the publisher has granted permission for the crawler to download the publication
- downloads new releases of the publication
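The three crawler steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the LOCKSS implementation: the article does not say how a publisher expresses permission, so the `PERMISSION_MARKER` text, the in-memory `PublisherSite` stand-in and the `Crawler` class are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field

# Hypothetical marker text; the article does not say how permission is expressed.
PERMISSION_MARKER = "archiving permitted"


@dataclass
class PublisherSite:
    """In-memory stand-in for a publisher's Web site (illustrative only)."""
    permission_page: str
    issues: dict[str, bytes]  # issue id -> published content


@dataclass
class Crawler:
    """Keeps the library's collected copies, keyed by issue id."""
    collected: dict[str, bytes] = field(default_factory=dict)

    def crawl(self, site: PublisherSite) -> list[str]:
        """Run the three steps once and return the ids of newly collected issues."""
        # Steps 1 and 2: check the site and confirm that permission was granted.
        if PERMISSION_MARKER not in site.permission_page:
            return []
        # Step 3: download only the releases that have not been collected yet.
        new_ids = [i for i in sorted(site.issues) if i not in self.collected]
        for issue_id in new_ids:
            self.collected[issue_id] = site.issues[issue_id]
        return new_ids
```

Calling `crawl` repeatedly on the same site returns an empty list until the publisher adds a new issue, which mirrors the idea of a box that keeps checking for new releases.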
Each LOCKSS box knows where all the other copies are, even though each participating library may subscribe to different journals.
LOCKSS is a self-auditing system: It looks for other replicas of those specified publications among a user's peers on the network. It continually compares the content it has collected with the same content collected by other LOCKSS boxes and repairs any differences, replacing the publication with another copy either from the publisher or from another peer. So there is no need to back up copies. This makes publishers happy as well, as it reduces the number of copies of their publications.
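The compare-and-repair idea can be illustrated with a deliberately simplified sketch. It assumes that each box can fetch its peers' copies directly and that a strict majority of matching SHA-256 digests identifies the good copy; the article does not describe the actual comparison protocol, and the function names `digest` and `audit_and_repair` are hypothetical.

```python
import hashlib
from collections import Counter


def digest(content: bytes) -> str:
    """Fingerprint a copy so that copies can be compared cheaply."""
    return hashlib.sha256(content).hexdigest()


def audit_and_repair(local: bytes, peer_copies: list[bytes]) -> tuple[bytes, bool]:
    """Return (content to keep, whether the local copy was replaced)."""
    if not peer_copies:
        return local, False
    votes = Counter(digest(copy) for copy in peer_copies)
    winner, count = votes.most_common(1)[0]
    # Assumption: repair only when a strict majority of peers agree with
    # one another and that majority differs from the local copy.
    if count * 2 <= len(peer_copies) or winner == digest(local):
        return local, False
    for copy in peer_copies:
        if digest(copy) == winner:
            return copy, True
    return local, False
```

For example, `audit_and_repair(b"damaged", [b"good", b"good", b"bad"])` returns `(b"good", True)`, while a local copy that already matches the majority is left alone.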
These limits on the use of preserved copyright material have been effective in persuading copyright owners to grant the necessary permissions. And LOCKSS also benefits publishers in that the system:
- offers a way to archive materials for the long term, even past the life span of the publisher's company
- grants perpetual access for qualified users
- maintains the business relationship between library and publishers
- maintains the look and feel of the material
- does not compete with their business model for standard library acquisitions.
Evolution of LOCKSS and Outgrowth Projects
As the library community contributed to LOCKSS' early development, one of the issues to address early on was "bloating" of the digital object, such as an e-journal, with too much descriptive metadata. Fulfilling the added metadata requirements was staff-intensive and it ran up costs. The solution was for LOCKSS to simply preserve the journal itself. Librarians, independent of LOCKSS, are free to add as much or as little metadata as they want.
Another concern that LOCKSS addressed and solved was format migration. Can a journal preserved today be viewed years from now with future software? What if a user's browser in, say, 2020 doesn't recognize an outdated format from 2007? LOCKSS solves that problem in the background with an automated format migration process that will convert a journal to the current format.
So David and Vicky's vision worked quite well, and librarians are becoming custodians of their own digital data. But what has grown out of LOCKSS that they didn't foresee? For one thing they didn't anticipate the interest in LOCKSS archiving for government documents or for electronic theses and dissertations.
And they were surprised by the interest expressed by those who publish journals (usually humanities) for fun and for free and want someone to take custodianship and archive their publications for safekeeping. These are publications such as the "Bryn Mawr Classical Review," "Journal of Religion and Society," and "MIT e-Journal of Middle East Studies." This important content has a high rate of disappearance from the Web. As a bonus to LOCKSS Alliance members, they automatically get custody of those publications, and their community will have access to this material for future use; the list today is 56 publishers and growing.
Nor did they anticipate the enormity and sophistication of the MetaArchive Project, a collaboration of Emory University, Georgia Tech, Virginia Tech, Florida State University, Auburn University, the University of Louisville and the Library of Congress. The focus of this project is no less than the preservation of digital content regarding the culture and history of the American South. LOCKSS has been adapted for use with the MetaArchive's variety of archival formats and security requirements.
And then there is CLOCKSS (the C stands for "control"), a community partnership whose board is composed of 12 publishers and seven institutions, in which the libraries host CLOCKSS boxes and preserve materials to which they do and do not subscribe. There is no sense of ownership. After a trigger event, such as a natural disaster, if the board determines that the content is no longer available from a publisher, it will release the material to all.
Given the LOCKSS team's expertise and accomplishments, the Library of Congress has presented them with other tasks, among which are:
- investigating how the Library of Congress might use LOCKSS technology internally
- exploring private LOCKSS networks in service of the states
- developing a checklist of threat models, a form of risk analysis for preserving digital data. This is especially urgent, given that the Library of Congress's digital holdings will grow from terabytes to petabytes, exabytes and beyond in no time, and the threats to such a system need to be predicted, audited and measured.
It all comes back to cultural institutions' safely maintaining their own digital content for the long term. Though the big problems are daunting, the common challenge for all digital curators – large and small – is the same: taking responsibility for their digital content. LOCKSS helps. Vicky Reich sums up her vision this way: "I wanted to enable librarians to say, 'Here is my box with my stuff in it.'"