Why the Modern-Day Government Should Focus More on Big Data Curation

Hari Donthi is the vice president of data development systems at NCI, Inc. and leads numerous big data and agile efforts in civilian agencies.

Data management is not a new concept in government IT, nor is the discussion about how to improve IT and business engagement through better use of data. Government agencies have always recognized the importance of leveraging their data. However, today’s government data users (usually referred to in IT circles as “the business”) believe their internal IT shop cannot give them the data they know is available within their agency or exists in other open data platforms.

How Did We Get Here?

To understand this mismatch of expectations, it helps to look at how we got here. In the '90s, the primary goal of the data warehousing movement was to meet the organization’s needs by solving the single-version-of-the-truth problem.

This required careful reconciliation of data interpretations between different users and departments so everyone could be on the same page. Additionally, stringent data quality checks existed so decision-makers would have confidence in the data.

Because massively parallel processing solutions like Hadoop, column-oriented data stores and the cloud were not commonplace in the '90s, data models had to be designed, tuned and maintained by experts to achieve good performance.

These factors created a barrier to getting new types of data into the data warehouse, and often led to expensive, multiyear programs that — in the end — had very limited utility.

Today, maintaining a single version of enterprise-level data is no longer the primary objective of storing historical data. Users want full access to all data and the ability to interact with it so they can extract insights and rapidly unlock its power.

To achieve this, the focus of government’s data management efforts needs to shift from warehousing to data curation.

Moving Beyond the Warehouse

In our current age of big data, a single enterprise interpretation of the data is passé. The old data warehouse days focused on an enterprise data model created with fixed meanings for data attributes. The users of the data warehouse simply filtered the data based on their department’s needs.

Today, with the proven usefulness of predictive analytics in the private sector and its growing adoption in government, we must revisit the tradition of an enterprise data model.

Specifically, we should accept that the usage patterns, predictive power and meaning of the data attributes can evolve as an organization gets more mature in mining its data — deploying predictive models into the field and feeding back performance results to refine the models — and as events outside the organization affect its priorities. It is important to separate the data from how it is used.

The Data Curation Difference

Data curation differs from traditional data warehousing. A curated data store is a platform for data users — it does not tell the users how to consume or interpret the data. The data users make the data actionable and meaningful using statistical learning techniques, for example, to predict emerging trends like fraud, noncompliance and virus outbreaks.
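To make this concrete, here is a minimal sketch of a data user applying one such statistical learning technique to curated data: an isolation forest that flags outlying transaction amounts as potential fraud. The data are synthetic and the scenario is invented for illustration; scikit-learn is assumed to be available.

```python
# Illustrative only: synthetic payment amounts, with a few extreme
# values standing in for potentially fraudulent transactions.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=100.0, scale=15.0, size=(200, 1))  # typical payments
outliers = np.array([[950.0], [1200.0], [5.0]])            # suspicious amounts
amounts = np.vstack([normal, outliers])

# The curated store supplies the raw data; the user chooses the model.
model = IsolationForest(contamination=0.02, random_state=0)
labels = model.fit_predict(amounts)  # -1 = anomaly, 1 = normal

flagged = amounts[labels == -1].ravel()
print(sorted(flagged))
```

The point is the division of labor: the platform did not decide in advance that "amount" means "fraud signal" — the user imposed that interpretation through the model.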

The significance and meaning of data attributes are determined by the predictive power of the multiple models that use this data, and these “meanings” can be fed back into the curated data store so it can be a shared enterprise asset.
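One way to picture this feedback loop, sketched under invented names and synthetic data: a model's learned feature importances are written back into the curated store's metadata, so other teams can see which attributes carried predictive power. The field names and metadata shape below are hypothetical, not a real agency schema.

```python
# Hypothetical sketch: feed model-derived "meanings" (feature
# importances) back into the curated store as shared metadata.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=4, random_state=1)
# Invented attribute names for illustration.
feature_names = ["claim_amount", "filing_delay_days", "prior_claims", "region_code"]

clf = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)

# Record each attribute's learned significance as curated metadata.
curated_metadata = {
    name: {"predictive_importance": round(float(imp), 3)}
    for name, imp in zip(feature_names, clf.feature_importances_)
}
for name, meta in sorted(curated_metadata.items()):
    print(name, meta)
```

Because the metadata comes from measured model performance rather than a single steward's judgment, multiple models can contribute competing "meanings" for the same attributes over time.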

This process relieves a central authority (aka data steward) from having to be the sole arbiter or the bottleneck of curated data, which is very different from the traditional data warehousing lifecycle of the '90s.

Government can learn from these data warehousing experiences and issues from the '90s, including the role technology played. Back then, it was difficult introducing new data into the data warehouse and getting large databases to perform well for ad hoc analytics.

While today's technologies reduce the need for finely tuned data models, we cannot simply throw away data modeling and create a data lake. As Michael Stonebraker eloquently put it, a data lake can quickly turn into a data swamp. This is why data curation is necessary and important.

Transitioning from data warehousing to curation also involves a change in user behavior. When curated data is presented to the users, a lot more is expected of them than simply filtering canned reports.

Data curation boils down to serving up the data on a platter. That is, the users know what the data elements mean, where they come from, how to explore and mine them, and how to make the insights actionable. Giving users this power and freedom of ad hoc exploration requires a different engagement model between the users and the maintainers of the curated data platform.
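A minimal sketch of what "serving data on a platter" can look like in practice: each curated data set carries machine-readable metadata describing what its elements mean and which system they came from. All names, systems and fields below are invented for illustration.

```python
# Sketch of a tiny curated-data catalog: data sets published with
# meaning (column descriptions) and provenance (source system).
from dataclasses import dataclass, field

@dataclass
class CuratedDataset:
    name: str
    source: str        # provenance: the originating system (hypothetical)
    description: str
    columns: dict = field(default_factory=dict)  # column -> meaning

catalog: dict[str, CuratedDataset] = {}

def register(ds: CuratedDataset) -> None:
    """Publish a data set so users can discover and interpret it."""
    catalog[ds.name] = ds

register(CuratedDataset(
    name="permit_applications",
    source="agency_case_mgmt_system",  # invented system name
    description="Daily extract of permit applications",
    columns={"status": "Current review state of the application"},
))

print(catalog["permit_applications"].source)
```

Even this toy version shows the engagement-model shift: IT maintains the catalog and its quality, while users browse it, interpret the columns and take the analysis from there.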

Both parties will need new skills. IT needs to build expertise in making data available in a user-friendly way — expertise that is significantly different from delivering user-friendly applications and websites. Users need to acquire skills in interacting with data in a more modern way, and that takes a lot more than standard “tool training.” IT and the users need to experience the power of modern data mining and data exploration tools together, in the setting of their agency’s data.

Doing this will give IT the confidence to step back from creating fully spec’d silo applications to creating data platforms, and users, in turn, will reduce their appetite for expensive use-case specific applications.

This change in the framing of the conversation between business and IT is the only way predictive analytics will become democratized and help empower government to meet its challenges more rapidly and efficiently.
