The public sector's data management efforts need to shift away from warehousing.
Hari Donthi is the vice president of data development systems at NCI, Inc. and leads numerous big data and agile efforts in civilian agencies.
Data management is not a new concept in government IT, nor is the discussion about how to improve IT and business engagement through better use of data. Government agencies always have recognized the importance of leveraging their data. However, today’s government data users (usually referred to in IT circles as “the business”) believe their internal IT shop cannot give them the data they know is available within their agency or that exists in other open data platforms.
How Did We Get Here?
To understand this mismatch of expectations, it helps to look at how we got here. In the '90s, the primary goal of the data warehousing movement was to meet the organization’s needs by solving the single-version-of-the-truth problem.
This required careful reconciliation of data interpretations between different users and departments so everyone could be on the same page. Additionally, stringent data quality checks existed so decision-makers would have confidence in the data.
Because massively parallel processing solutions like Hadoop, column-oriented data stores and the cloud were not commonplace in the '90s, data models had to be designed, tuned and maintained by experts for good performance.
These factors created a barrier to getting new types of data into the data warehouse, and often led to expensive, multiyear programs that — in the end — had very limited utility.
Today, maintaining a single version of enterprise-level data is no longer the primary objective of storing historical data. Users want full access to all data and the ability to interact with it so they can extract insights and rapidly unlock its power.
To achieve this, the focus of government’s data management efforts needs to shift from warehousing to data curation.
Moving Beyond the Warehouse
In our current age of big data, a single enterprise interpretation of the data is passé. The old data warehouse days focused on an enterprise data model created with fixed meanings for data attributes. The users of the data warehouse simply filtered the data based on their department’s needs.
Today, with predictive analytics proven useful in the private sector and increasingly so in government, we must revisit the tradition of an enterprise data model.
Specifically, we should accept that the usage patterns, predictive power and meaning of the data attributes can evolve as an organization gets more mature in mining its data — deploying predictive models into the field and feeding back performance results to refine the models — and as events outside the organization affect its priorities. It is important to separate the data from how it is used.
The Data Curation Difference
Data curation differs from traditional data warehousing. A curated data store is a platform for data users — it does not tell the users how to consume or interpret the data. The data users make the data actionable and meaningful using statistical learning techniques, for example, to predict emerging trends like fraud, noncompliance and virus outbreaks.
The significance and meaning of data attributes are determined by the predictive power of the multiple models that use this data, and these “meanings” can be fed back into the curated data store so it can be a shared enterprise asset.
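For readers who think in code, the feedback loop described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the names AttributeEntry, CuratedCatalog, record_feedback and ranked are hypothetical, the importance scores are whatever each model's owners choose to report, and averaging them is just one simple way to combine feedback from several models.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class AttributeEntry:
    """One data attribute in a hypothetical curated-store catalog."""
    name: str
    source: str        # where the attribute comes from
    description: str   # what the attribute is currently understood to mean
    # model name -> importance score reported back by that model's owners
    feedback: dict[str, float] = field(default_factory=dict)

    def significance(self) -> float:
        """Average importance across all models that reported on this attribute."""
        return mean(self.feedback.values()) if self.feedback else 0.0


class CuratedCatalog:
    """Shared catalog: users register attributes and feed model results back."""

    def __init__(self) -> None:
        self._entries: dict[str, AttributeEntry] = {}

    def register(self, name: str, source: str, description: str) -> None:
        self._entries[name] = AttributeEntry(name, source, description)

    def record_feedback(self, attribute: str, model: str, importance: float) -> None:
        # Any modeling team can contribute; no central steward has to approve it.
        self._entries[attribute].feedback[model] = importance

    def ranked(self) -> list[tuple[str, float]]:
        """Attributes ordered by how useful the models have found them so far."""
        scored = [(e.name, e.significance()) for e in self._entries.values()]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up attributes and scores, purely for illustration.
    catalog = CuratedCatalog()
    catalog.register("claim_amount", "claims system", "Dollar value of a claim")
    catalog.register("filing_zip", "intake forms", "ZIP code where the claim was filed")
    catalog.record_feedback("claim_amount", "fraud_model", 0.5)
    catalog.record_feedback("claim_amount", "noncompliance_model", 0.25)
    catalog.record_feedback("filing_zip", "fraud_model", 0.125)
    for name, score in catalog.ranked():
        print(name, score)
```

The point of the sketch is the direction of flow: the catalog does not dictate what an attribute means, and each new model that reports back can shift an attribute's standing over time.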
This process relieves a central authority (aka data steward) from having to be the sole arbiter or the bottleneck of curated data, which is very different from the traditional data warehousing lifecycle of the '90s.
Government can learn from these data warehousing experiences and issues from the '90s, including the role technology played. Back then, it was difficult to introduce new data into the data warehouse and to get large databases to perform well for ad hoc analytics.
While technologies of today reduce the need for finely tuned data models, we cannot simply throw away data modeling and create a data lake. As Michael Stonebraker put it eloquently, a data lake can quickly turn into a data swamp. And this is why data curation is necessary and important.
Transitioning from data warehousing to curation also involves a change in user behavior. When curated data is presented to the users, a lot more is expected of them than simply filtering canned reports.
Data curation boils down to serving up the data on a platter. That is, the users know what the data elements mean, where they come from, how to explore and mine them, and how to make the insights actionable. Giving users this power and freedom of ad hoc exploration requires a different engagement model between the users and the maintainers of the curated data platform.
Both parties will need new skills. IT needs to build expertise making data available in a user-friendly way — expertise that is significantly different from delivering user-friendly applications and websites. Users need to acquire skills in interacting with data in a more modern way. Users need a lot more than standard “tool training.” IT and the users need to experience the power of the modern data mining and data exploration tools together, in the setting of their agency’s data.
Doing this will give IT the confidence to step back from creating fully spec’d silo applications to creating data platforms, and users, in turn, will reduce their appetite for expensive use-case specific applications.
This change in how the conversation between business and IT is framed is the only way predictive analytics will become democratized and help empower government to meet its challenges more rapidly and more efficiently.