Dr. Francine Berman's curriculum vitae is a whopping 28 pages, packed with scientific and engineering achievements, awards and honors. Visionary, university professor, author, scientist and more, on paper Fran might seem to be an imposing figure. But in person she is far from it.
She has a kid's wide-eyed fascination with just about everything, an easy laugh and a disarming warmth. And her interest in digital preservation is as personal as it is professional; in the same breath she might muse about curating exabytes of NASA data and preserving photos from cell phones.
Fran is the director of the San Diego Supercomputer Center at the University of California, San Diego, which hosts both powerful computers and America's largest academic data center. For decades SDSC has played a major international role in scientific and engineering research, and in recent years it has become an ally of the world of archives and libraries. Information technology, part of the heart of SDSC, is blurring institutional boundaries.
Until recently, scientific bodies such as NASA have been the undisputed data archiving heavyweights. But many archives and libraries now work routinely with terabytes of digital content, and data volumes are rapidly increasing. The cultural heritage community needs partners like Fran to help tackle digital data stewardship on a massive scale. And she is eager to help.
In interviews in the late 1990s, Fran referred to the post-2000 decade as the "data decade." Now, in 2008, she reflects on the events of the past eight years and stresses the need for sustainable digital preservation. "Digital preservation is critical for research and education in the information age," she said. "The research community is cognizant that we need to preserve our most important digital collections – from nightly surveys of the skies in the National Virtual Observatory collection to the critical longitudinal information on families in the Panel Study of Income Dynamics. The conversation is getting beyond 'why is it important to preserve data?' to sustainability issues: 'how will we do it?' and even more challenging, 'how will we pay for it?'"
From 2006 to 2007, the Library of Congress sought answers to these questions by collaborating with SDSC to test data transfer and storage. Since then other NDIIPP partners have conducted similar test projects, and each project has been a step forward in creating the infrastructure (or, more accurately, the cyberinfrastructure) required for large-scale data transfer, storage, retrieval and interoperability among stewardship institutions.
This work is similar to laying railroad track across the frontier or stringing telephone lines into homes. "It is an interesting time," Fran said. "People often think that the infrastructure is the boring stuff. But all of this 'boring stuff' has to happen in order for you to do what you may consider the exciting stuff. For you and I to work in our offices, somebody had to make sure that we were wired for reliable lighting and the light bills would be paid. But the fact that the lights are working is non-memorable. They are just part of a functioning environment.
"In the information age, our digital data needs to be accessible where and when we want it, which means that data preservation infrastructure has to become a basic component of our cyberlandscape. To be part of this cyberlandscape, someone has to make sure that the 'data bill' is paid. Funding for data cyberinfrastructure in the cyber-world is still not assumed or incorporated in the same way we incorporate funding for physical infrastructure in the natural world."
SDSC and its partners the UCSD Libraries, the National Center for Atmospheric Research and the University of Maryland Institute for Advanced Computer Studies have developed a digital-preservation approach called Chronopolis, based on a data grid (a system of sharing and managing data distributed among participating computers).
Chronopolis hosts each digital collection as three or more geographically distributed copies, which users can monitor and audit. Chronopolis partners are developing formalized agreements to ensure mutual trust in hosting the data and to allow them to monitor, audit, synchronize, and store multiple copies of digital collections over geographical space and for long periods of time.
Current Chronopolis collections include copies of the Inter-university Consortium for Political and Social Research and California Digital Library collections. "Multiple copies of collections are critical," Fran said, "to mitigate the risk of data loss due to error, power outage, natural disaster or other unforeseen circumstances at one of the sites."
She is concerned about the inevitable data infrastructure crisis that both the research community and the general public are headed for, and she wants to spread awareness that digital preservation and its enabling infrastructure are critical issues. "In addition to digital research collections, many of us also care about saving digital family photos and the contents of our hard drives, although we may not think of this as digital preservation," she said. "Virtually everyone cares about preservation of some digital entity but we have not achieved the broad societal awareness of the issues of digital preservation in the same way that the general public is aware of other issues like global warming or stem cell research."
With Brian Lavoie from OCLC, Fran is the co-chair of the Blue Ribbon Task Force on Sustainable Digital Preservation and Access, which aims to "develop a set of economically viable recommendations to catalyze the development of reliable strategies for the preservation of digital information." Fran is encouraging the task force to take an active role in spreading awareness of personal digital preservation to a wider audience.
But there is something else about working with archives and libraries that tantalizes her. As a computer scientist with a background in grid, parallel and high-performance computing, she ponders the possibilities of what can be done with all of that juicy data after we preserve it.
Fran understands that, though cultural-heritage institutions are still working out storage and preservation issues, it will not be long before researchers want to work with data in imaginative ways. For example, researchers might analyze archived Web sites to glean social patterns and cultural contexts, or cross-reference digital newspapers and TV Web sites to model how media influences voting and public opinion. And in a networked environment researchers can collaborate more efficiently and publish their findings directly into a digital library.
She gives an example of how scientists can use brain-image data to improve surgery. "SDSC has worked with surgeons at Brigham and Women's Hospital on image-guided neuroscience," she said. "During a long operation to remove brain tumors, the brain deforms. Updated images of the brain during surgery can help neurosurgeons continue to differentiate accurately between diseased and healthy parts of the brain. The ability to incorporate digital imaging as a tool for improving surgical outcomes can make a tremendous difference."
Part of Fran's job is to provide the means, using SDSC resources, to enable the effective use of data for research and education. She loves problem solving and working with researchers to use technologies in new, interesting, effective and empowering ways. "Leading SDSC has given me the opportunity to come to work every day and make a difference for the research community," she said. "SDSC's terrific staff and powerful cyberinfrastructure have helped scientists do everything from creating new approaches to the design of cancer drugs to understanding how the Universe formed after the Big Bang. It's also introduced me to the world of digital data and helped create an abiding intellectual passion for its issues."
As for digital preservation in general, Fran's vision is both grand and rooted in reality. "Data is the natural resource of the information age," Fran said. "Data is fragile and needs to be stewarded in the 'cyberworld' just like we need to take care of rain forests and the environment in the physical world. Preserving valued data in the information age is fundamental to ensure that it will continue to inform and enrich our world for the foreseeable future."