Government AI hits a data roadblock, but synthetic data could be the fix

When federal researchers tried to build an AI model to detect fraud in disability claims, they ran into a roadblock: they couldn’t use real claimant data. The workaround wasn’t a weaker model — it was synthetic data, an approach designed to mirror real-world information without exposing sensitive details.

That same challenge is playing out across government. From reducing mundane tasks to speeding insights and decision making, artificial intelligence is rapidly changing how agencies meet their missions. But to make AI tools functional and accurate, agencies need robust, meaningful data. And lots of it.

“AI models like large language models work by predicting the next thing based on what they’ve already seen,” said Dave Vennergrund, vice president of AI for General Dynamics Information Technology (GDIT). “To do that, they need to see a lot so they can understand language usage. They have to absorb as much information as they can; they’re data hogs.”

The problem is not that real-world data is incomplete, but that it’s often distributed unevenly. In many cases, only a tiny fraction of records represent the scenario an agency is trying to model, such as 0.01% fraud cases among 99.99% legitimate transactions. For government, those limitations are compounded by the sensitivity of the data, especially for models that need access to private medical or personally identifiable information.
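The imbalance problem can be made concrete with a small sketch. The code below is purely illustrative (the amounts, rates, and jittering method are invented for the example, and the naive record-jittering stands in for real synthesis techniques such as SMOTE): it starts with a transaction set where fraud is a tiny minority and boosts that class with synthetic variants so a model has enough positive examples to learn from.

```python
import random

random.seed(1)

# Illustrative data: fraud is a tiny minority of all transactions,
# echoing the kind of 0.01%-vs-99.99% skew described above.
legit = [{"amount": random.uniform(10, 500), "fraud": False} for _ in range(9990)]
fraud = [{"amount": random.uniform(5000, 9000), "fraud": True} for _ in range(10)]

def boost_minority(minority, target_n):
    """Naive synthetic boosting: jitter existing minority records until
    the class reaches target_n. A stand-in for real synthesis methods
    (e.g., SMOTE or generative models), used here only for illustration."""
    boosted = list(minority)
    while len(boosted) < target_n:
        base = random.choice(minority)
        boosted.append({"amount": base["amount"] * random.uniform(0.9, 1.1),
                        "fraud": True})
    return boosted

# Rebalance the training set so the fraud class is no longer vanishingly rare.
balanced = legit + boost_minority(fraud, 2000)
```

The point is not the jittering itself but the rebalancing: the boosted set gives a classifier thousands of fraud examples instead of ten, without touching any real claimant records.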

“For smaller predictive models, the challenges are often tied up around data access and privacy,” Vennergrund explained, pointing to policies like HIPAA and others designed to keep protected health information secure. “The data owner has collected it for a purpose, now you want to use it for another purpose, so there needs to be stakeholder buy-in for that new purpose. That often takes a lot of time, or it’s not possible.”

While these safeguards exist for good reason, they leave agencies at a disadvantage when trying to train models in areas such as health care, fraud detection, or other domains that rely on highly sensitive information.

One valuable approach is the use of synthetic data, which can help agencies simulate realistic scenarios and generate the volume of data they need while simultaneously protecting privacy.

Speed, volume, privacy: The promise of synthetic data

At its core, synthetic data is information built to mirror real-world scenarios but entirely fabricated. That doesn’t mean it’s jumbled nonsense, however.

“It’s data that’s created to mirror another data set, and that mirroring can be done in a very arbitrary way, generating random numbers. Or it can be done in a more sophisticated way – where we generate data that matches the semantic relationships in the data, for example, pregnancy services only occur in females – and allows us to boost or suppress distributions to fine-tune models,” Vennergrund said.
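A minimal Python sketch of the "more sophisticated" approach Vennergrund describes might look like the following. The distributions and field names are invented for illustration; the key idea is that generated records are drawn from realistic marginal distributions while a semantic rule (here, the pregnancy-services example from the quote) is enforced on every record.

```python
import random

random.seed(0)

# Illustrative marginal distributions (value, probability) -- not real agency data.
SEX_DIST = [("F", 0.52), ("M", 0.48)]
SERVICE_DIST = [("checkup", 0.6), ("imaging", 0.3), ("pregnancy", 0.1)]

def sample(dist):
    """Draw one value from a list of (value, probability) pairs."""
    r, cum = random.random(), 0.0
    for value, p in dist:
        cum += p
        if r < cum:
            return value
    return dist[-1][0]

def synth_record():
    """Generate one synthetic record, enforcing a semantic relationship:
    pregnancy services occur only for female patients."""
    sex = sample(SEX_DIST)
    service = sample(SERVICE_DIST)
    while service == "pregnancy" and sex != "F":
        service = sample(SERVICE_DIST)  # resample until the rule holds
    return {"sex": sex, "service": service}

records = [synth_record() for _ in range(1000)]
```

The same pattern extends to boosting or suppressing distributions: to oversample a rare scenario, you simply raise its probability in the distribution table before generating.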

This becomes especially valuable in areas where data is scarce. For example, researchers trying to create an AI model to analyze a rare disease may find the patient population too small to provide sufficient training data. 

By synthesizing information that reflects the characteristics of that population, agencies can create sufficiently robust data sets to train the model without waiting years for enough real-world data to accumulate.

“As agencies build and train AI models, the biggest question often isn’t which model to use but rather what data to feed it. Remember, your data is the differentiator,” said Asim Qureshi, part of the AIML specialty organization at Amazon Web Services. “Real-world data is often sensitive, incomplete, or too limited in volume to train AI effectively. Synthetic data offers an alternative, helping agencies simulate realistic scenarios, protect privacy, speed development, and mitigate the cost and effort of data gathering. Knowing when synthetic data adds value, when real-world data is irreplaceable, and how to strike the right balance is key to building more trusted, mission-ready AI solutions.”

Synthetic data also proves useful when enough data exists but can’t be used because of confidentiality rules. In these cases, synthetic records can be generated to replicate sensitive datasets without exposing the underlying private information.

Vennergrund pointed to the disability claims case as one example. “We wanted to show a potential customer how AI could review disability claims, but we couldn’t use real data, so we recreated it,” he said. The team pulled a sample claim form from the internet, used publicly available distributions such as gender, top disease types, age, and location to build a database, and generated synthetic records that matched those patterns. They then planted a few fraudulent cases and trained a predictive model to detect them.

The effect was so convincing that during the demo, a leader at the agency stopped Vennergrund to ask how his team had been able to access the agency’s real data.

“When we explained it was synthesized, they were amazed,” he said.
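The workflow Vennergrund describes, generating records from public-style distributions, planting a few fraudulent cases, and training a detector, can be sketched in miniature. Everything below is a hypothetical reconstruction for illustration: the field names, distributions, and the one-feature decision stump are assumptions, not GDIT's actual data or model.

```python
import random

random.seed(2)

# Placeholder public-style distributions (illustrative, not the figures the team used).
AGES = list(range(25, 70))
CONDITIONS = ["musculoskeletal", "mental", "circulatory", "nervous"]

def synth_claim(fraudulent=False):
    """Generate one synthetic disability claim; fraudulent claims are
    planted with an implausibly high benefit amount."""
    claim = {
        "age": random.choice(AGES),
        "condition": random.choice(CONDITIONS),
        "monthly_benefit": random.gauss(1400, 250),
        "fraud": fraudulent,
    }
    if fraudulent:
        claim["monthly_benefit"] = random.gauss(4000, 300)  # planted anomaly
    return claim

claims = [synth_claim() for _ in range(995)] + [synth_claim(True) for _ in range(5)]
random.shuffle(claims)

def best_threshold(data):
    """A one-feature decision stump standing in for the predictive model:
    exhaustively pick the benefit cutoff that best separates the labels."""
    best_t, best_correct = None, -1
    for c in data:
        t = c["monthly_benefit"]
        correct = sum((x["monthly_benefit"] > t) == x["fraud"] for x in data)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t

threshold = best_threshold(claims)
flagged = [c for c in claims if c["monthly_benefit"] > threshold]
```

Because the planted anomalies are cleanly separable here, even this trivial stump recovers them; the real value of the approach is that the whole pipeline, data, labels, and model, runs without a single real claimant record.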

Filling the gap for an AI-powered future

The reality is that AI models cannot function without data, yet the data government needs most is often the hardest to use. Synthetic data bridges that gap, offering a way to innovate responsibly, move faster, protect privacy, and still generate the insights agencies require.

As agencies continue to explore and expand AI adoption, synthetic data won’t just be a convenient workaround. It may prove to be the key ingredient that enables government to build models that are accurate, resilient, and mission-ready.

Learn more about how GDIT can help agencies make the most of AI.

This content is made possible by our sponsor GDIT; it is not written by and does not necessarily reflect the views of Nextgov's editorial staff.
