August 11, 2009 -- When David Rosenthal talks, people listen. They may not always agree with the Chief Scientist of the LOCKSS program based at Stanford University, but they engage with what he has to say.
This was the case on July 27, 2009, when a large crowd gathered at the Library of Congress to hear Rosenthal’s presentation "How Are We Ensuring the Longevity of Digital Documents?" (PDF, 321KB). Rosenthal’s talk was a reprise of his widely discussed plenary at the Spring 2009 Coalition for Networked Information Task Force meeting. In his introduction at that meeting, CNI Executive Director Clifford Lynch told the audience that Rosenthal’s work had changed his thinking about digital preservation.
Rosenthal's presentation at the Library was filmed and is available as a webcast.
Rosenthal began with a provocative question: has the digital preservation infrastructure constructed over the past 20 years actually saved anything from oblivion? He said he found little evidence that it had, and framed his argument as a subtle rebuttal to the work of Jeff Rothenberg, specifically Rothenberg’s 1995 Scientific American paper (revised in 1999), "Ensuring the Longevity of Digital Information" (PDF, 156KB).
That paper greatly influenced current digital preservation efforts, but Rosenthal argued persuasively that Rothenberg’s vision was rooted in the technical environment of 1995, and that the technological changes since then have negated some of his ideas.
Rosenthal said that Rothenberg based his arguments on assumptions that are now outdated: that digital documents survive largely on off-line media; that these media will have short lifetimes; that digital documents are stored in application-specific (mostly proprietary) formats; and that the hardware and software environments supporting this information will change rapidly.
"One big thing that changed," said Rosenthal, "is that in Jeff’s vision items survived offline, and came online only occasionally. Nowadays, it’s flipped. If items exist, they are found online. The permanent copy is the online version. Copy-ability is intrinsic to the online copy."
Rosenthal noted that the large software firms "lost the battle with their users," an outcome unforeseen by Rothenberg. A decade ago, the large software companies "needed to get their customers to buy the product that they’d already bought all over again," said Rosenthal, and "introduced format incompatibility by default…new versions of the software made it impossible to read older versions of the data."
This business model has become unacceptable to users, Rosenthal noted, because users bore the cost of incompatibility. "The costs of incompatibility are just too high to put up with," he said. "Even if there was a time where it was easy for Microsoft or other dominant players to remove support for old formats," he continued, materials published online are quite different: the dominant players can’t control the programs that read them. "Microsoft never came up with 'Microsoft HTML'," he said.
By contrast, the contemporary information landscape tends toward greater openness, especially compared with Rothenberg’s era.
"The goal of publishers is to reach as many readers as possible," he continued, and a "gratuitous incompatibility in the world where you’re publishing content is self-defeating."
The conversation on openness led to a discussion of emulation, Rothenberg’s preferred preservation method. Rosenthal noted that "Jeff was right about emulation, but wrong about why you would do emulation."
"Virtualization was not a part of the mainstream computing environment in 1995," Rosenthal said, whereas "nowadays, virtual hardware is a mainstream thing." "There are now industrial-strength emulations of the entire PC stack," said Rosenthal, "and much mainstream software is now written for virtual machines because it’s safer and more testable. Java, C#: these are virtual machines."
Rosenthal noted that current users regularly interact with emulated content through their web browsers, so the emulation is largely invisible to the end user. He also pointed to the rise of open source software as another development since 1995, noting that "there are open source renderers for all major formats, even for those that have Digital Rights Management protections."
"Open source as a development project is extremely hostile to backwards incompatibility," he said.
However, he did observe that open source software, stored and developed in environments such as SourceForge, wasn’t getting much attention from traditional preserving institutions. The software was being actively preserved and maintained by its users, not "because people are interested in preservation itself, but because keeping the software alive is an essential part of the open source software development process."
Rosenthal went on to discuss some of the issues regarding the sustainability of digital preservation efforts, and expressed his belief that preserving institutions need to start thinking in industrial terms when dealing with huge quantities of digital information.
He noted that the digital preservation community has "a serious cost problem, because the costs we need to achieve are outside even the cheapest system we have available." But he suggested that wise investments could be made in supporting those who are developing the emulators and renderers needed to read digital content, and in funding research into more reliable ways of storing large amounts of bits.
"Just storing the bits needs industrial strength infrastructure," he said.