Library of Congress

Digital Preservation

Setting Standards (Office Open XML and PDF/A)

Chaos would rule nearly every aspect of life, were it not for standards. The conversations and writings people hear and see every day are based on standard forms of expression, grammar and spelling for a particular language. Classification standards control how books and other library materials are cataloged. The Library of Congress has more than 20 million books. How would someone pinpoint the location of a specific volume if it had never been classified and shelved accordingly? How would deteriorating books remain available in the future if libraries did not take steps to preserve them?

The same issues arise for digital materials.

Among the many challenges in preserving digital information is the establishment of standards. Standard preservation formats will help ensure accessibility over the long term because these formats are more likely to be maintained and made available to the community of archival institutions with a need to save digital assets.

As an important part of its digital preservation initiative, the Library of Congress has been actively engaged in creating and supporting key open standards for digital content. The Library has recently played an active role in the development of two such standards: Office Open XML and PDF/A. (An overview of the JPEG 2000 image format will be found in a future Challenge article.)

Office Open XML

During the past two years, Library staff have participated in a technical committee working toward the standardization of the Office Open XML specifications, a goal that was reached on April 2, 2008. This new international standard (to be known as ISO/IEC 29500) will make it easier for libraries and archives to preserve a large body of digital material by ensuring that the content is generated in formats for which the specifications are published and will be maintained under the auspices of a standards organization.

Specifically, this standard is based on the formats used by the latest version of Microsoft Office and supports all features in the various versions of Microsoft Office since 1997. It is estimated that Microsoft Office has more than 400 million users generating billions of documents a year. The new XML-based standard will enable implementation by multiple applications on multiple platforms. The specific document types involved are word processing, presentations and spreadsheets.

The need for long-term preservation and interoperability was a major issue driving the development of this new open standard. Organizations that are required to retain documents for future use were concerned that older documents would become unusable as formats changed. Customers also asked that valuable data within documents, such as accounting figures in spreadsheets, be efficiently accessible by other applications and not hidden in proprietary binary formats. The participation of the Library of Congress and the British Library brought the interests and experience of archives and libraries in digital preservation into the standardization discussions.
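To illustrate what "accessible by other applications" can mean in practice, the following minimal Python sketch reads numeric cell values from a worksheet using only the standard library. It relies on facts not stated in this article: an Office Open XML file is a ZIP package of XML parts, and worksheets use the SpreadsheetML namespace shown in the code. The sample package, the part path and the function names are hypothetical and heavily simplified. The package is not a complete .xlsx file, and the reader handles only plain numeric cells, not shared strings, dates or formulas.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace used by Office Open XML worksheets.
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

# Hypothetical, heavily simplified worksheet part (illustration only).
SHEET_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
    "<sheetData>"
    '<row r="1"><c r="A1"><v>1250.75</v></c><c r="B1"><v>310.25</v></c></row>'
    "</sheetData>"
    "</worksheet>"
)


def build_sample_package() -> bytes:
    """Create an in-memory ZIP holding one worksheet part (not a complete .xlsx)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("xl/worksheets/sheet1.xml", SHEET_XML)
    return buffer.getvalue()


def read_numeric_cells(package_bytes: bytes,
                       part: str = "xl/worksheets/sheet1.xml") -> dict:
    """Return {cell reference: float} for cells that carry a <v> element.

    Assumes every <v> holds a plain number; shared-string cells would need
    a lookup in a separate part and are out of scope for this sketch.
    """
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        root = ET.fromstring(package.read(part))
    cells = {}
    for cell in root.iterfind(".//m:c", NS):
        value = cell.find("m:v", NS)
        if value is not None and value.text is not None:
            cells[cell.get("r")] = float(value.text)
    return cells


if __name__ == "__main__":
    print(read_numeric_cells(build_sample_package()))
```

Because the figures sit in published, text-based XML rather than an undocumented binary layout, a generic ZIP reader and XML parser are enough to reach them, without the application that created the file.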

As an additional benefit to the digital preservation community, Microsoft has released the specifications of its earlier binary formats and asked the Library of Congress to hold copies, now available on the Microsoft Office Binary (doc, xls, ppt) File Formats page.

PDF/A

The Library also continues to participate in the working group developing the PDF/A standard, which was initially approved as an international standard (ISO 19005-1) in 2005. This standard grew out of an increasing need for the widely used PDF document format to be readily accessible and consistent in appearance over time. The standard states that PDF/A "provides a mechanism for representing electronic documents in a manner that preserves their visual appearance over time, independent of the tools and systems used for creating, storing or rendering the files."

PDF/A is a subset of the PDF format suitable for the long-term preservation and archiving of page-oriented text documents. In general, everything comprising a PDF/A document, including text, raster images, vector graphics, fonts and color information, should be permanently embedded within the file, limiting reliance on external software or hardware. The format’s suitability for long-term retention, along with the ability to search the full text of a PDF document, makes it an increasingly popular choice within the library and archival community.
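As a rough illustration of how an archive might triage incoming files, the Python sketch below looks for the PDF/A identification properties that a conforming file declares in its XMP metadata. This detail is not from the article: it assumes the `pdfaid:part` and `pdfaid:conformance` properties in the `http://www.aiim.org/pdfa/ns/id/` namespace, stored uncompressed in the file. The function name is hypothetical. The check reports only what a file claims about itself. It does not verify that fonts, images or color information are actually embedded, which requires a full PDF/A validator.

```python
import re
from typing import Optional

# XMP namespace used for PDF/A identification (assumption: stored uncompressed).
PDFA_NS = b"http://www.aiim.org/pdfa/ns/id/"

# XMP may express a property as an attribute (pdfaid:part="1")
# or as an element (<pdfaid:part>1</pdfaid:part>); accept both.
_PART = re.compile(rb"pdfaid:part\s*(?:=\s*[\"']|>)\s*(\d+)")
_CONF = re.compile(rb"pdfaid:conformance\s*(?:=\s*[\"']|>)\s*([A-Za-z])")


def declared_pdfa_level(data: bytes) -> Optional[str]:
    """Return the declared PDF/A level (e.g. '1b'), or None if none is declared.

    This reads the file's own claim only; it is not a conformance check.
    """
    if PDFA_NS not in data:
        return None
    part = _PART.search(data)
    if part is None:
        return None
    level = part.group(1).decode("ascii")
    conformance = _CONF.search(data)
    if conformance is not None:
        level += conformance.group(1).decode("ascii").lower()
    return level


if __name__ == "__main__":
    # Hypothetical fragment standing in for the bytes of a real PDF file.
    sample = (
        b'<rdf:Description xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">'
        b"<pdfaid:part>1</pdfaid:part>"
        b"<pdfaid:conformance>B</pdfaid:conformance>"
        b"</rdf:Description>"
    )
    print(declared_pdfa_level(sample))
```

For a real file, the bytes could be obtained with `pathlib.Path(path).read_bytes()`. A `None` result means only that no declaration was found, not that the file is unsuitable for preservation.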

For more specific information on PDF/A, visit the Library’s Formats Sustainability page at: http://www.digitalpreservation.gov/formats/fdd/fdd000125.shtml.
