Sensitive Data Must Be Protected. That Doesn’t Mean It Can’t Also Be Used.


Two promising tools may allow agencies to share data while maintaining privacy and security requirements.

Earlier this year, the Health and Human Services Department announced a multimillion-dollar contract for artificial intelligence tools, including machine learning, natural language processing, and more. It was just the latest example of a government agency making a significant investment in AI and leveraging big data to drive actionable insights that can guide future decision making.

That agreement was also another sign that data has become one of the hottest commodities in federal IT. It’s a centerpiece of the President’s Management Agenda, which calls for “leveraging data as a strategic asset.” It’s the focus of the Office of Management and Budget’s Federal Data Strategy. And it’s the nutrient that nourishes AI models and algorithms, which help IT professionals and data scientists extract information from massive troves of data they have at their disposal.

The data set challenge: Trust no one?

But while data is the foundation of any successful AI initiative, identifying trusted and verifiable data sets can be an enormous challenge. Agencies must ensure that the data they’re using is clean and reliable. If it isn’t, the results derived from the AI will be inaccurate—the old “garbage in, garbage out” principle. 

Data that has been cultivated within other government organizations is, ideally, trustworthy, but it may also be considered highly sensitive or proprietary. Many agencies may be understandably hesitant to share this data (if not outright prohibited from doing so), even among counterparts who might use it to further research that benefits the U.S. This is particularly true if the data in question includes personally identifiable or classified information.

How can agencies manage the delicate balance between using data to its fullest extent without compromising privacy and security requirements? Two exciting and innovative data analysis tools—homomorphic encryption, or HE, and federated learning—may hold the answers to this question. Let’s take a look at what these tools are, how they work, and what agencies need to know as they consider implementing these models.

What is HE?

HE allows an AI algorithm to analyze encrypted data, letting data scientists glean valuable insights without having to decrypt sensitive data sets. Essentially, HE enables fundamental algebraic operations on encrypted data that are equivalent to running the same operations on unencrypted data. 

Put another way, HE could be considered a form of x-ray vision that allows machines to “see” the underlying statistics within the encrypted data while still keeping that data private. This can be enormously beneficial for government agencies. They can gain valuable insights hidden within encrypted data sets without compromising the security or privacy of the information contained within.
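To make the idea concrete, here is a minimal sketch of an additively homomorphic scheme: a toy, Paillier-style construction with tiny, insecure demo parameters, not one of the production-grade schemes or libraries agencies would actually deploy. Multiplying two ciphertexts produces a ciphertext of the sum of the underlying plaintexts, so a simple aggregate can be computed without ever decrypting the inputs.

```python
# Toy additively homomorphic (Paillier-style) demo. NOT production crypto:
# the primes are far too small, and real HE systems work very differently in
# scale. It only illustrates the core idea that arithmetic on ciphertexts
# corresponds to arithmetic on the hidden plaintexts.
import math
import random

def keygen(p=2357, q=2551):                     # tiny demo primes, insecure
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    g = n + 1                                   # standard simplification
    mu = pow(lam, -1, n)                        # modular inverse of lambda
    return (n, g), (lam, mu, n)

def encrypt(pub, m):
    n, g = pub
    while True:
        r = random.randrange(1, n)              # random blinding factor
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(priv, c):
    lam, mu, n = priv
    L = (pow(c, lam, n * n) - 1) // n
    return (L * mu) % n

def he_add(pub, c1, c2):
    n, _ = pub
    return (c1 * c2) % (n * n)                  # ciphertext product = plaintext sum

pub, priv = keygen()
c_a, c_b = encrypt(pub, 17), encrypt(pub, 25)
c_sum = he_add(pub, c_a, c_b)                   # computed without decrypting anything
print(decrypt(priv, c_sum))                     # -> 42
```

Full HE schemes support richer operations than addition, but the principle is the same: the math happens on ciphertexts, and only the holder of the private key can read the result.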

Once HE opens the door to working with sensitive data sets without exposing them, agencies may feel more comfortable sharing encrypted data and collaborating with one another.

What is federated learning?

Even so, agencies' ability to share sensitive data may still be regulated or restricted. This is where federated learning comes into play.

Google introduced the concept of federated learning in 2017 as a means to allow mobile phones to collaboratively learn a shared prediction model while keeping a person's private data on their device. The result was the popular prediction feature in Google's keyboard, which suggests what a user is going to type as they enter a phrase.

This approach offers significant potential benefits to public sector agencies. Organizations can gain insights from different data sources without having to move data from one siloed location to a centralized server. Essentially, the algorithm goes to the data, while the data itself stays put.
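As a rough illustration, the sketch below uses hypothetical, synthetic data and a toy linear model to show a federated-averaging-style training loop (this is not Google's production system, just the basic pattern): each site fits the shared model on its own data, and only the resulting weights travel back to be averaged.

```python
# Minimal federated-averaging sketch with made-up data. Each "site" trains
# locally on data that never leaves it; only model weights are exchanged.
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights, X, y, lr=0.1, epochs=20):
    """One site's local training: plain gradient descent on a linear model."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# Two sites hold separate, synthetic data sets drawn from the same model.
true_w = np.array([3.0, -2.0])
X_site_a = rng.normal(size=(100, 2)); y_site_a = X_site_a @ true_w
X_site_b = rng.normal(size=(100, 2)); y_site_b = X_site_b @ true_w

global_w = np.zeros(2)
for _ in range(10):                                     # federated rounds
    w_a = local_update(global_w, X_site_a, y_site_a)    # trained at site A
    w_b = local_update(global_w, X_site_b, y_site_b)    # trained at site B
    global_w = (w_a + w_b) / 2                          # server averages the updates

print(global_w)   # approaches [3.0, -2.0] without ever pooling the raw data
```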

Teams working with data housed in different locations can still collaborate on deep learning projects without having to share their data. For instance, a group working at the Defense Logistics Agency can collaborate with colleagues working at the Defense Information Systems Agency on a joint DoD project—without their data having to leave their respective organizations. 

Keeping the data at rest and in the possession of the various teams protects its integrity, reduces risk, and improves security. Meanwhile, teams can continue to iterate and collaborate on their projects without fear of their sensitive data being compromised.

What do agencies need to do?

HE and federated learning are very compelling and powerful options for working with sensitive data, but there are some requirements that agencies will need to consider before implementation. 

First, it’s important to factor in the storage costs associated with HE. A good rule of thumb is to estimate the amount of storage that would be needed to store the data unencrypted—and then double it. Agencies that use HE may end up dealing with large-scale and highly complex data sets. They’ll want to make sure they’ve got the space to store these sets.

Agencies must also ensure that their compute resources are up to the task. Running analytics on homomorphically encrypted data is computationally demanding and can introduce significant latency. Organizations that have already invested in high-performance computing will be well positioned to handle these workloads.

It's also worth considering a flexible, open-source software stack that supports multiple machine learning and deep learning frameworks. Such a stack can provide an abstraction layer that lets data scientists in different agencies build models in their preferred framework and still share them seamlessly, which is important for federated learning.

But the most important thing to know is that there are now ways to share and gain insights from data without compromising that data's integrity. HE and federated learning give agencies the ability to protect data while also using it to its fullest extent.

Sean McPherson is a deep learning data scientist at Intel.