Search Google Maps for the Library of Congress's address – 101 Independence Avenue S.E., Washington, DC 20540 – and, if you select the satellite photo view and zoom in, you can see the construction of the new Capitol Visitor Center, located between the Capitol building and the Library. Three years ago this same satellite view would have shown a tree-dotted park. A year or two from now it will show an elegantly landscaped block.
Web-based maps, now an integral online resource for directions and more, are getting cooler, more sophisticated, and more fun to use. While you're on that Google Maps page, search Businesses for restaurants in the neighborhood. Or try the Google Real Estate Search to find homes for sale on Capitol Hill. But once you've generated a personalized search of the map features on Capitol Hill, what if you want to save all that information and repeat the same search in a few years to compare how everything has changed? It would be a challenge.
That challenge is not limited to Google Maps. There is a wealth of Geographic Information System (GIS) data available online now – such as NASA Landsat photos, U.S. statistics bureau data, real estate data, zoning data, census tracts and map mashups – and many clever tools to integrate and cross-reference that data. Is that data being preserved? Can all GIS data be shared among all GIS tools? Steve Morris is working hard to sort it out.
Engaging the community
"People are becoming accustomed to making decisions based on Web service environments like Google Earth or the others," Steve says. "The question is how one captures the data at that moment in time. How do you support it and document it as a basis for decisions?"
Steve heads the Digital Library Initiative at North Carolina State University (NCSU) and has been working for over a decade to help enable online access to geospatial data. More recently NCSU Libraries entered into a partnership with the Library of Congress to address the issue of long-term preservation of geospatial data. But the scope of the preservation challenge is so vast that the NCSU project's role is primarily as a facilitator. "Our focus isn't on developing technical architecture," Steve says, "but more in engaging the industry and being a catalyst for discussion, getting the community to think about [preservation and access]."
That community is broad and diverse. It includes government agencies at the local, state and federal levels. It also includes universities, real estate agents, land developers and just about anyone with a stake in geospatial data.
"The best approach [for the NCSU project] is to build a repository and learn firsthand what the challenges are," Steve says. "But a repository is not the end goal and the solution is not a repository at NCSU. The solution is to define state, federal and local government roles and to engage producers to think about what the role of the state GIS clearinghouse is and what the roles are at the national level."
GIS challenges, of course, are international and are best solved through collaboration. Steve co-chairs a new Data Preservation Working Group within the Open Geospatial Consortium (OGC), an international consensus standards organization in which commercial, governmental, nonprofit and research organizations collaborate on standards for geospatial content and services as well as GIS data processing and exchange.
NCSU is a partner in the National Digital Information Infrastructure and Preservation Program. Its project, the North Carolina Geospatial Data Archiving Project (NCGDAP), is a collaboration between the North Carolina State University Libraries and the North Carolina Center for Geographic Information and Analysis. The project focuses on collection and preservation of digital geospatial data resources from state and local government agencies in North Carolina. Its stated objectives are:
- Identification of resources
- Acquisition of at-risk geospatial data
- Development of a digital repository architecture for geospatial data, using open source software tools
- Enhancement of existing geospatial metadata with additional preservation metadata, using Metadata Encoding and Transmission Standard (METS) records as wrappers
- Investigation of automated identification and capture of data resources using emerging OGC specifications for client interaction with data on remote servers
- Development of a model for data archiving and time series development.
NCGDAP work is being carried out in conjunction with North Carolina's NC OneMap initiative, a comprehensive statewide geographic data resource that provides seamless access to data, metadata, data sharing agreements and inventory processes. The NC OneMap vision statement includes the point that "historic and temporal data will be maintained and available." NCGDAP is focused on addressing that single item in the vision statement while also working within the context of the technical and organizational infrastructure that NC OneMap already offers.
The NC OneMap online map viewer is available to the public for free. It contains approximately 20 topic areas for users to select from, such as Elevation, Hydrography, Environment, Land Cover, and Boundaries. Each topic expands to display several subtopics. Users can select and display several different types of graphic data simultaneously in transparent graphic layers over the map of North Carolina and save their customized maps as PDF files. Users can also download raw vector data on topics such as Airports, Census Boundaries, Public Libraries, and Railroads.
"Local GIS data is more detailed, more current and more at risk than the data from national sources," Steve says. The local North Carolina data is collected methodically from relevant agencies in each county. Seven years ago NCSU Libraries began to go out and acquire local data in order to meet demand from university data users. At that time few other agencies or users were seeking that data. Now a wide range of state and federal agencies as well as commercial firms and individual users seek this data.
For small local agencies, the constant barrage of data requests from the increasing number of interested parties can be overwhelming. The volume of requests can create what Steve calls "contact fatigue." But local, state, and federal data exchange partnerships have emerged, and more efficient methods of data sharing are evolving.
The local geospatial data is primarily based on digital orthophotography (digital images with the distortion from the camera angle and topography removed) that used to be generated about every two to seven years, but is now created every year or two. The orthoimagery becomes the raw material for creation of other GIS data layers. "It usually starts out of the county tax assessor's office and eventually other agencies use it," says Steve. "It builds out from the original base."
Generating interest and support in preservation
Steve has found that archiving and preservation of geospatial data is not a high priority for many in the GIS community. Most GIS project work is focused on use of the best available data, which typically means using the most current data. Given the traditionally lower demand for older data, there is not as much incentive to focus organizational resources on data archiving. This makes it necessary for the preservation effort to try to piggyback on business problems that are more compelling, such as business continuity and disaster preparedness.
To this end, some threads of the NCGDAP project are subordinated within other projects driven by other requirements that have stronger backing within the geospatial community. "The initiatives are driven by things more compelling than preservation, but the preservation project has a seat at the table in the process of developing infrastructure that results from these initiatives," Steve says.
Because preservation can be a tough sell in terms of committing resources, he often doesn't talk to members of the GIS community about "preservation" but rather promotes the importance of "temporal analysis" or data mapped over time. Increased interest in temporal analysis will create increased demand for older data that, it is hoped, will lead to increased commitment of resources for managing older data.
To get local agencies to place more value on the agencies' past data, NCGDAP is gathering business use cases to present to the agencies. These cases contain site location analysis on topics such as:
- Land use change, that is, how a county has changed in the last five years
- How zoning has changed
- Changes in the amount of impervious surfaces.
Examples of North Carolina coastal use cases include shoreline change and buildings that are being undercut by shoreline change.
Data producers (the agencies), software vendors and consulting firms are slowly beginning to appreciate the value of data mapped over time, and they're starting to understand that, in order to have temporal data, older data must be preserved. Successful business models help drive home the point of business opportunity. Steve tells of two commercial vendors that sell older data. "Part of their market is companies that want to know past uses about the land, whether there's any environmental liability," he says. These people also want to see trends and watch them change over time.
Steve anticipates many positive results from NCGDAP's outreach to the GIS community, and there are encouraging signs in the industry. For example, software vendors are increasingly interested in temporal analysis and temporal data management as customer problems, and consulting firms are increasingly seeing data preservation as a business opportunity and as a component to be considered in the development of project plans for clients. In North Carolina, state archives and GIS representatives are now attending each other's meetings, beginning the process of cross-fertilizing between the two communities.
Steve thinks that part of the challenge in engaging the preservation problem is that the geospatial industry has been "temporally impaired." "When I was getting started with GIS almost 20 years ago there was no older digital data," says Steve. "Students were encouraged to work with data that already existed, which most of the time meant working with current data." But there is evidence of more interest in doing temporal analysis and more of an expectation that the older digital data exists and is available.
One interesting surprise in the course of the project was the number of data producers that are digging out their older analog content and digitizing it. This increased interest in older map content is a sign of hope for increased commitment to long-term management of the digital data that is being created now.
Technical challenges
Steve feels that the proliferation of Web services and the emergence of digital globes and mapping APIs have complicated the issue of data preservation. Part of NCSU's challenge is exploring how to reduce these complex maps to their simplest, most stripped-down state for preservation. What if you can't preserve the complex state of a given online map? How can you capture the essence of an online map without being hampered by its technology?
Formats present a challenge to preservation. Much GIS data is vector based, built on points, lines and polygons rather than pixels, and there is no widely adopted open format for vector data. The Geography Markup Language (GML), an OGC specification, is not so much a format as a framework for defining format-like, industry-specific application schemas that adhere to particular profiles of GML.
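To make that distinction concrete, here is a minimal sketch of what a GML-based application schema looks like in practice. The geometry elements come from the real GML namespace, but the "Parcel" feature type, its attribute and the coordinates are hypothetical, invented purely for illustration:

```python
# A GML application schema in miniature: an industry defines its own
# feature types (here, a hypothetical county parcel), while the geometry
# vocabulary of points, lines and polygons comes from the GML namespace.
import xml.etree.ElementTree as ET

GML = "http://www.opengis.net/gml"
ET.register_namespace("gml", GML)

# Hypothetical parcel feature wrapping a GML point geometry.
parcel = ET.Element("Parcel", {"parcelID": "37183-0042"})
point = ET.SubElement(parcel, f"{{{GML}}}Point")
ET.SubElement(point, f"{{{GML}}}pos").text = "35.7796 -78.6382"

print(ET.tostring(parcel, encoding="unicode"))
```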
The complex proprietary formats employed in most vector-data resources pose significant technical challenges to long-term preservation. Added to this is the problem that, in many cases, vector data is continually overwritten as it is updated, wiping out the historical record unless someone makes the effort to create temporal snapshots of the data.
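What a temporal snapshot amounts to in practice can be quite simple. The following is a minimal sketch, not NCGDAP's actual design; the shapefile sidecar-file convention and the directory layout are assumptions made for illustration:

```python
# Sketch: copy each component file of a multifile dataset (for example, a
# shapefile's .shp/.shx/.dbf/.prj siblings) into a dated snapshot folder
# before an update overwrites it. Paths and layout are illustrative.
import shutil
from datetime import date
from pathlib import Path

def snapshot_dataset(dataset: Path, archive_root: Path) -> Path:
    snapshot_dir = archive_root / dataset.stem / date.today().isoformat()
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    for part in dataset.parent.glob(dataset.stem + ".*"):
        shutil.copy2(part, snapshot_dir / part.name)
    return snapshot_dir

# e.g. snapshot_dataset(Path("incoming/zoning.shp"), Path("archive"))
```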
Another challenge lies in preserving finished GIS projects and maps, which combine multiple data layers with classification schemes, symbolization, annotation, and data model outputs. "The true counterpart to the old map is not the GIS dataset," Steve says, "but rather a finished geographic product such as a map or a chart." One non-geospatial file format that has started to gain wide use in the GIS community is PDF. In NC OneMap, for example, users can output their personalized maps as PDF documents. The use of PDF to produce consumer-friendly maps has taken off over the past two years in the geospatial community.
It is worth noting that Adobe and TerraGo Technologies recently joined the OGC, a sign of an increasingly prominent role for PDF in the geospatial industry; Adobe, of course, produces PDF, and TerraGo creates software that utilizes PDF to produce interactive maps with layers of geospatial data.
As attractive and efficient as PDF is for capturing and delivering GIS project outputs, there are still technological challenges to be explored from a preservation perspective. PDFs may be more manageable and contained than complex GIS project files, but the underlying data intelligence is lost, and specific concerns about the use of special fonts or complex vector graphics need to be explored. Steve admits that PDF was not really on the radar when the project was started three years ago and that there is a need now to explore what the longer-term challenges might be for managing this content.
Metadata is also a challenge. Steve has found that metadata often doesn't exist for a given pile of GIS data or, where it does exist, it is either not normalized to any widely understood structure or not kept synchronized with the data. "A local agency might develop metadata but then continue to evolve their collection and not update its metadata; then the data becomes asynchronous," he says.
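A crude check for that kind of drift is easy to sketch. The heuristic below simply flags datasets whose metadata file is missing or older than the data itself; the naming convention it assumes (a .shp file with an .xml metadata sidecar) is an illustration, and real synchronization auditing would need to look inside the records:

```python
# Sketch: flag datasets whose metadata may have fallen out of sync,
# using file modification times as a rough proxy.
from pathlib import Path

def stale_metadata(data_dir: Path) -> list[str]:
    suspects = []
    for shp in data_dir.glob("*.shp"):
        meta = shp.with_suffix(".xml")
        if not meta.exists():
            suspects.append(f"{shp.name}: no metadata at all")
        elif meta.stat().st_mtime < shp.stat().st_mtime:
            suspects.append(f"{shp.name}: metadata older than the data")
    return suspects

# e.g. for problem in stale_metadata(Path("counties/wake")): print(problem)
```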
Given the complex multifile, multiformat nature of many geospatial data objects, and given the need to supplement existing geospatial metadata with additional technical and administrative metadata, NCGDAP has been interested in using METS as a form of metadata wrapper or content packaging. Steve says that, until now, the geospatial industry has not addressed the wrapper or content packaging issue in the way that some other information sectors have.
METS would be used for demonstration purposes, in hopes that in the longer term the geospatial industry would produce or adopt a solution. Implementation of a content packaging or wrapper solution by the data producer community would help to lower the cost of data acquisition and intake through the automation of transfer processes. "The more the machines read the data, and not a human," Steve says, "the lower the costs are."
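In outline, a METS wrapper of the kind NCGDAP has been experimenting with might group a dataset's component files like the sketch below. The element and attribute names are genuine METS; the label, the file names and the bare-bones structure are illustrative only, since a real record would carry much richer descriptive and preservation metadata:

```python
# Sketch: a skeletal METS wrapper that groups the component files of one
# geospatial dataset into a single machine-readable package.
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS)
ET.register_namespace("xlink", XLINK)

def wrap_dataset(label: str, files: list[str]) -> ET.Element:
    mets = ET.Element(f"{{{METS}}}mets", {"LABEL": label})
    file_sec = ET.SubElement(mets, f"{{{METS}}}fileSec")
    group = ET.SubElement(file_sec, f"{{{METS}}}fileGrp", {"USE": "dataset"})
    for i, name in enumerate(files):
        f = ET.SubElement(group, f"{{{METS}}}file", {"ID": f"F{i}"})
        ET.SubElement(f, f"{{{METS}}}FLocat",
                      {"LOCTYPE": "URL", f"{{{XLINK}}}href": name})
    return mets

record = wrap_dataset("Hypothetical county zoning, 2007",
                      ["zoning.shp", "zoning.shx", "zoning.dbf"])
print(ET.tostring(record, encoding="unicode"))
```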
The biggest costs for repository development in the NCGDAP project are in data acquisition and data ingest from the counties. There are 100 counties in North Carolina, creating 100 points of contact to negotiate data transfers and 100 different ways in which the data might arrive. Emerging data-distribution networks within the NC OneMap initiative, such as a street centerline (the midpoint of the street right-of-way) data distribution system that is in development, are expected to help with this problem.
If data from all counties is available through one system, then most of the data discovery and acquisition costs fall by the wayside, with the added benefit that the data can be transferred more routinely, with well-established provenance and rights. Steve highlights this as one example of how organizational and technical infrastructure developed for other, more compelling business reasons can also help the archive development process.
Still, network data transfer is not easy when it comes to larger data resources such as digital orthophotography. "The data is huge; it can get into hundreds of gigabytes for a single county's orthophoto flight," Steve says.
So another technical detail that NCSU must work out is network data transfer. Steve admits that there is much they need to learn about large-scale data transfer. The Internet2 network is not necessarily an option when dealing with local and regional agencies, so there has to be a reliable method for moving large amounts of data over the commodity Internet, no matter how long it takes. "Some sort of data I.V. drip to move stuff across the system," Steve says.
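One way to picture such an I.V. drip is a resumable transfer that survives interruptions and picks up where it left off. Here is a minimal sketch using standard HTTP Range requests; the URL is a placeholder, and a production version would also verify checksums and handle servers that ignore ranges:

```python
# Sketch: resume (or start) a large download, appending to any partial
# file already on disk, so an interrupted transfer loses nothing.
import os
import urllib.request

def drip_download(url: str, dest: str, chunk: int = 1 << 20) -> None:
    start = os.path.getsize(dest) if os.path.exists(dest) else 0
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-"})
    with urllib.request.urlopen(req) as resp, open(dest, "ab") as out:
        while True:
            block = resp.read(chunk)
            if not block:
                break
            out.write(block)

# e.g. drip_download("http://example.org/county_ortho.zip", "county_ortho.zip")
```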
Another possible aspect of capturing Web services might lie in harvesting static map tiles, the little squares that add up to a complete picture in online applications such as Google Maps. NASA also uses tiles to re-render dynamic Web services as static imagery in its WorldWind project. Tiling works better for high-volume mapping services because the previously created static renditions of the data can be retrieved more quickly.
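The arithmetic behind those little squares is straightforward. The sketch below shows the widely used Web Mercator tiling math, as an illustration of the general idea rather than the scheme any particular service or archive would necessarily adopt:

```python
# Sketch: map a latitude/longitude to the indices of the square tile
# containing it at a given zoom level, in the common Web Mercator scheme.
import math

def tile_for(lat: float, lon: float, zoom: int) -> tuple[int, int]:
    n = 2 ** zoom                        # the world is n x n tiles
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

# e.g. the tile covering the Library of Congress at zoom level 15:
x, y = tile_for(38.8887, -77.0047, 15)
```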
Traditional geospatial Web services, including those based on the OGC Web Map Service (WMS) specification, draw maps anew for each user request. This is the case with the current NC OneMap viewer system. Maps are generated on the fly and then they're gone. "We had in the [NDIIPP] proposal to think about capturing [maps] by tiling them and then have a static version," Steve says. "But the problem is there is no market for those tiles unless someone knows how to retrieve them and can use them."
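The per-request drawing Steve describes happens through calls like a WMS GetMap request, sketched below; the endpoint and layer name are placeholders rather than NC OneMap's actual service:

```python
# Sketch: build a WMS 1.1.1 GetMap URL, the kind of request a viewer
# issues every time a user pans or zooms, with the map discarded after.
from urllib.parse import urlencode

def getmap_url(endpoint: str, layers: str, bbox: tuple, size=(800, 600)) -> str:
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layers,
        "STYLES": "",
        "SRS": "EPSG:4326",                 # plain latitude/longitude
        "BBOX": ",".join(map(str, bbox)),   # minx, miny, maxx, maxy
        "WIDTH": size[0],
        "HEIGHT": size[1],
        "FORMAT": "image/png",
    }
    return endpoint + "?" + urlencode(params)

# Placeholder endpoint and layer; the box roughly covers North Carolina.
print(getmap_url("http://example.org/wms", "hydrography",
                 (-84.4, 33.7, -75.3, 36.6)))
```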
NCGDAP has been watching with interest efforts within the open source community, and more recently within the OGC, to define standard methods for describing tiling schemes and requesting map tiles. Steve thinks that in the longer term this activity may lead to an opportunity to start thinking about capturing a temporal component in Web services-based decision support environments.
Catching up with the present
For many, online maps are the first choice when there is a need to find directions. And online mapping software is only going to get better and more dazzling as more sources of geospatial data become accessible. Still, Steve says, maps aren't the only use for the rich supply of GIS data. "It's more than data," says Steve. "It's also classification, layering, symbolization, annotation, modeling and so on."
One of the goals of the NCSU Libraries data services program is to provide GIS access for everyday folks, people who do not have expensive GIS software, GIS expertise or even high network bandwidth. Another goal is to enable users to generate time-spanning data and to provide the means to preserve their results. "GIS is not just about mapping; it's about analysis, extraction, charting and statistical processing," says Steve. "Maps are just one of many possible outputs."