E-government specialists say agencies will be hard-pressed to comply with a White House call to release existing government data in machine-readable formats as part of an open government directive.
Federal Chief Technology Officer Aneesh Chopra on Wednesday said the imminent directive will include a schedule for the distribution of data in formats that citizens, companies and nonprofits can download, search and manipulate to gain greater insight into government operations.
Providing information in machine-accessible formats would be one step toward increasing transparency in government, which is the overarching purpose of the long-awaited directive first announced in January.
But in today's federal information technology environment, most agency data exists as PDF files, from which information cannot be easily extracted for analysis.
"There's no data associated with [a PDF]. It's not machine-readable; the only intelligible way to retrieve data from the system is you need that system," said Kevin Novak, co-chairman of the World Wide Web Consortium (W3C) eGovernment Interest Group and a former director of Web services at the Library of Congress. W3C, a Web standards development organization, was founded by World Wide Web inventor Tim Berners-Lee.
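The contrast Novak draws can be illustrated with a short sketch: a table published as an image-based PDF is opaque to software, while the same table published in a plain structured format such as CSV can be consumed by any program, without the system that produced it. The figures below are invented placeholders, not real agency data.

```python
import csv
import io

# Hypothetical agency spending data published as CSV rather than PDF.
# Any program with a CSV parser can read it -- no access to the
# originating system is required.
raw = """agency,fiscal_year,obligations_usd
Library of Congress,2008,613000000
NASA,2008,17300000000
"""

rows = list(csv.DictReader(io.StringIO(raw)))
total = sum(int(r["obligations_usd"]) for r in rows)
print(f"{len(rows)} records, total obligations: ${total:,}")
```

Had the same two rows been scanned into a PDF, recovering them would require optical character recognition or manual re-keying, which is the gap the directive's machine-readable requirement is meant to close.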
Comments posted on an internal governmentwide discussion Web page set up by the White House asked how older content would fit into Obama's open government principles. The comments, which administration officials subsequently published without names, were part of a March online discussion to solicit ideas from federal employees on how to make government more transparent.
"One question I would throw out there is to what extent should our transparency efforts support legacy data?" an employee wrote in March. "Is it enough to just have 'search everything from 2007 onward,' or do we need to build systems with backwards compatibility (or even reprocess the data to fit it into the structures)? Obviously, in a perfect world it would be everything, but given limited resources how should we be prioritizing?"
President Obama, on his first day in office, told agency heads to compile by May 21 recommendations for a directive that would incorporate new technologies to create a more transparent, collaborative and participatory government. Wednesday's announcement marked the first reference from the White House on the timing and content of the final directive.
There are no executive branch standards for machine-readable data yet, Novak said. While scientific agencies, such as the U.S. Geological Survey and NASA, warehouse their information in machine-readable formats, they are the exceptions, he added.
White House officials said they are familiar with concerns regarding legacy data and are confident they can work collaboratively with agencies to achieve the president's transparency goals.
It would be best to direct agencies to develop standards for releasing data in machine-readable formats, said Novak, now vice president of integrated Web strategy and technology at the American Institute of Architects.
W3C officials, including Novak, recently shared with the White House the organization's notes on putting government data online.
The group also released on Sept. 8 formal steps and standards for publishing government data, which include posting data in raw form in structures that allow computers to manipulate the information and creating an online catalog of the data so people can discover what is available. Agencies also should make sure data contains attributions that humans and machines can understand.
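The W3C steps above can be sketched as a simple catalog record: the raw data lives at a stable address in a machine-manipulable format, and the catalog entry carries attribution readable by both people and software. This is an illustrative example only; the field names are loosely modeled on common metadata vocabularies such as Dublin Core, not on any schema the W3C or the White House has mandated, and the dataset and URL are hypothetical.

```python
import json

# Hypothetical catalog entry for one published dataset. The raw data
# is referenced by URL in a structured format (CSV), and the record
# itself is plain JSON that machines can index and humans can read.
entry = {
    "title": "Agency Grant Awards, FY2008",   # human-readable label
    "publisher": "Example Agency",            # attribution
    "issued": "2009-09-08",
    "format": "text/csv",                     # raw, machine-manipulable form
    "distribution": "https://example.gov/data/grants-fy2008.csv",
    "license": "Public domain (U.S. government work)",
}

catalog = {"entries": [entry]}
print(json.dumps(catalog, indent=2))
```

Publishing such records alongside the raw files is what lets an online catalog answer the discovery question the W3C raises: letting people find out what data is available before they download it.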
W3C's recommendations could offer a glimpse into part of the directive, but the White House emphasized it is considering a range of ideas.
"Many groups have weighed in with the open government initiative on data standards issues, W3C among them but not exclusively," said Rick Weiss, senior science and technology policy analyst at the Office of Science and Technology Policy. "The Office of the [Chief Information Officer] and the CIO Council will work closely with the data standards community to continue to develop best practices around the release of open data."
Novak noted there is a newer, archival flavor of PDF -- PDF/A, standardized as ISO 19005 -- that is better suited to long-term preservation than the traditional portable document format. A PDF/A file contains the coding necessary to reproduce, over time, the visual appearance of the document, including its text, images, fonts and color. The standard prohibits links to outside content and fonts that are not embedded in the file, rendering the document independent of other tools and systems.
"It is all about access, not just for today, but for the future," he said. "Once government begins to place data into the public space, the expectation will be that it becomes a resource center and research center for historical, current and future data. The challenge is what and how government deals with the legacy data, particularly image-based PDFs . . . and what level of effort is put to ensure those items are discoverable and accessible via the Web."