EPA builds a better search

Proposed metadata standards promise better government search results

A keyword search in the Environmental Protection Agency's Web pages used to yield a mishmash of results. Typing, say, "water quality" in the search engine might have returned links to high-level overviews of water quality issues or to documents that merely mentioned water quality.

"The relevancy ranking of our search engine couldn't really say, 'Here's a general thing about water quality that could get you started,' " said Richard Huffine, program manager for the EPA's National Library Network. So EPA officials modified the search engine.

Now, the engine returns documents based on a ranking of data stored in metadata fields, giving priority — in descending order — to information that has the search query term embedded in a document's subject, title, description and text.
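The ranking described above can be illustrated with a minimal sketch. This is not the EPA's implementation: the four field names come from the description above, while the function names, the sample documents and the simple substring match are assumptions made for illustration.

```python
# Fields in descending priority, as described in the article.
FIELD_PRIORITY = ["subject", "title", "description", "text"]


def best_field_rank(doc, query):
    """Return the index of the highest-priority field containing the query.

    Returns None when no field contains the query. Matching is a
    case-insensitive substring test (an assumption, not EPA's method).
    """
    q = query.lower()
    for rank, field in enumerate(FIELD_PRIORITY):
        if q in doc.get(field, "").lower():
            return rank
    return None


def search(docs, query):
    """Return matching documents, best field match first."""
    ranked = []
    for doc in docs:
        rank = best_field_rank(doc, query)
        if rank is not None:
            ranked.append((rank, doc))
    # sort() is stable, so ties keep their original order.
    ranked.sort(key=lambda pair: pair[0])
    return [doc for _, doc in ranked]


# Hypothetical sample documents.
docs = [
    {"id": "report", "title": "Annual report",
     "text": "A section mentions water quality."},
    {"id": "overview", "subject": "Water quality",
     "title": "Getting started"},
    {"id": "guide", "title": "Water quality monitoring guide"},
]

results = [d["id"] for d in search(docs, "water quality")]
print(results)
```

In this sketch, the overview tagged with a matching subject ranks ahead of a document that matches only in its title, which in turn ranks ahead of one that merely mentions the phrase in its text.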

Draft recommendations, written in part by Huffine and issued by members of the Categorization of Government Information Working Group, call for adoption of similar metadata standards governmentwide. The working group is a subcommittee of the Interagency Committee on Government Information, a creation of the E-Government Act of 2002.

The metadata recommendations are part of group members' larger effort to preserve government information in digital formats and make it permanently available. The problem is that, although the federal government is permanent, individual agencies may not be. Documents stored digitally on one server can be moved to another. Such moves result in the all too common message "404 error — file not found."

Although it is technically possible to continually update databases to reflect changes as documents are moved, it is impractical, according to working group members. Instead of relying on URLs to locate digital information, members recommended that federal officials develop search schemes based on uniform resource names (URNs).

Federal officials would assign unique identifiers to each piece of government information — policy documents, Web sites, photos, maps and other digital materials. A searchable index would link users to a citation containing a minimum set of standardized metadata fields, such as subject, agency creator, title and publication date.

"If, for example, the identifier resolves to a book, then you get a citation for the book," said Eliot Christian, manager of data and information systems at the U.S. Geological Survey and chairman of the working group.
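A minimal sketch of that idea follows. It does not reflect the working group's actual design or the Handle System's interface: the class names, the identifier string and the server addresses are hypothetical, and only the four metadata fields come from the draft recommendations as described above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Citation:
    # Minimum metadata fields named in the article.
    subject: str
    agency_creator: str
    title: str
    publication_date: str
    # Current location; this may change while the identifier does not.
    location: str


class Resolver:
    """Toy in-memory index mapping persistent identifiers to citations."""

    def __init__(self):
        self._records = {}

    def register(self, identifier, citation):
        self._records[identifier] = citation

    def relocate(self, identifier, new_location):
        """Record that a document moved; the identifier stays the same."""
        old = self._records[identifier]
        self._records[identifier] = replace(old, location=new_location)

    def resolve(self, identifier):
        """Return the citation for an identifier, or None if unknown."""
        return self._records.get(identifier)


resolver = Resolver()
urn = "urn:example:water-quality-overview"  # hypothetical identifier
resolver.register(urn, Citation(
    subject="Water quality",
    agency_creator="Environmental Protection Agency",
    title="Water quality overview",      # hypothetical title
    publication_date="2004-01-01",       # hypothetical date
    location="https://old-server.example/docs/wq.html",
))

# The document moves to another server; links built on the URN still work.
resolver.relocate(urn, "https://new-server.example/library/wq.html")
citation = resolver.resolve(urn)
print(citation.title, citation.location)
```

The point of the sketch is that only the index entry changes when a document moves, so anyone who stored the identifier rather than the URL still reaches the citation.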

Combining URNs and a standardized metadata scheme would open the door to new possibilities for analysis, said James Erwin, primary author of the group's URN recommendations and director of information science and technology at the Defense Technical Information Center. "People can take that metadata and our identifier and put it into their database, their index, and they can use that for discovery," he said.

Information collected by officials at one agency can remain relevant long afterward. Government surveys conducted in the Northwest Territory in the 1780s, for example, are being used by Interior Department officials today to assess changes in vegetation patterns in Michigan and Ohio.

Deciding which types of information merit universal identifiers, however, is still a matter of debate. The group's members define government information as "any information product, regardless of form or format, that an agency discloses, publishes, disseminates or makes available to the public, as well as information produced for administrative or operational purposes, that is of public interest or public value."

All data in its place

This month, members of the Categorization of Government Information Working Group issued draft recommendations for defining, categorizing, indexing and searching government information on the Web.

After a period of public comment ending Dec. 5, the group's members will send final recommendations to Office of Management and Budget officials, who will have a year to fashion a policy for making government information more accessible.

The draft recommendations call for federal officials to assign unique identifiers to each piece of government information online so that users can find information independent of URLs.

The working group's members recommend that government officials adopt an interim identification scheme, published by the Internet Engineering Task Force, by the end of fiscal 2006.

The members estimate that the management and operation of that scheme, called a Global Handle Registry, would cost between $300,000 and $1 million a year.

They recommend that Defense Information Systems Agency and General Services Administration officials assign and maintain unique identifiers for information online.

— David Perera