Synthetic Data Engine to Support NIH’s COVID-19 Research-Driving Effort


It’s all part of a new partnership the agency is embarking on with Syntegra and the Bill and Melinda Gates Foundation.

An artificial intelligence-enabled synthetic data generator that converts clinical data of any kind into equivalent, mock versions that don’t expose sensitive patient-identifying details is being put to use as a component of the National COVID Cohort Collaborative, or N3C, an effort steered by the National Institutes of Health.

“The NIH’s N3C initiative is a result of the urgent need for understanding of COVID both to develop better patient care and understand the impacts on individuals and the health system as a whole,” Dr. Michael D. Lesh told Nextgov this week. Lesh—the co-founder and CEO of Syntegra, the company behind the synthetic data engine—shed light on how the tool works, and a new partnership between the business, NIH and the Bill and Melinda Gates Foundation that underpins this fresh endeavor.

In June 2020, not long after the novel coronavirus pandemic disrupted nearly every aspect of American life, NIH launched N3C to accelerate COVID-19 research and new medical breakthroughs. According to a June press release, the collaborative intends to systematically capture relevant data from participating health care providers across the country, aggregate that data into accessible formats, and in turn help approved users draw research insights from that harmonized information via the NCATS N3C Data Enclave.

Lesh noted that “a broad and clinically deep database for this type of research did not exist” back then, creating the need for a massive data collection that pulls from many contributors. However, the life-saving insights such data access might offer are limited if large numbers of researchers can’t dig into it, which Lesh deemed “a proposition made difficult based on the need to maintain the privacy of the patients within this broad dataset.”

His company homes in on one possible solution to that challenge.

“Synthetic data solves this issue, thus becoming a key pillar of the overall N3C initiative,” Lesh said. “By creating a synthetic version of the dataset with validated privacy and accuracy to the underlying data, Syntegra allows this groundbreaking dataset to get into the hands of more potential innovators, thus increasing the potential for society to benefit from its use.”

Through the initiative, N3C is responsible for aggregating data collected by more than 70 contributing sites, a collection that at this point amounts to almost 3 billion rows. That number will continue to grow as new health systems join and new patient data flows in from those already on board.

“Syntegra is in the process of creating a synthetic version of the entire N3C COVID database, including potentially all values for the entire patient population, currently at more than 2.6 million patients,” Lesh said. “It is our understanding that the N3C Enclave contains all relevant data about COVID including the care trajectories of all treatments, vaccinations, etc.”

The company’s ultimate role here is to produce synthetic versions of any data in the Enclave, and “provide rapid, widespread access without violating privacy,” Lesh added. Syntegra has created synthetic versions of test sets as it prepares to roll out large-scale COVID synthetic data. Down the line, that could speed access to data-driven findings and help physicians and researchers uncover new insights into racial and ethnic disparities in spread and risk, predictors of hospitalization, long-term adverse effects and the impact of COVID-19 on hospitals, among other topics.

With help from AI, Syntegra’s synthetic data engine essentially extracts the relationships among all variables within any medical dataset. That process produces about a billion parameters, which “accurately reflect the underlying medical patterns in the data, and are subsequently used to generate brand new synthetic medical records,” Lesh noted. The resulting synthetic dataset maintains all of the statistical properties and patterns of the original data, without any of the original patient identities leaking into the newly created dataset.

“In other words, no one could work backward from the synthetic data to discover the original patients,” Lesh explained. “Two key elements in particular of this process are that it is done over an entire dataset learning from the data itself, rather than being limited to specific question-based cohorts, and that the output is accompanied by full validation metrics for both the accuracy and privacy against the original dataset.”
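
For readers who want a concrete picture of the learn, generate and validate loop Lesh describes, the toy Python sketch below may help. It is not Syntegra’s engine: it swaps the company’s billion-parameter AI model for a simple two-variable Gaussian fit, and the data, column meanings (age and an oxygen saturation reading), function names and metrics are all hypothetical, chosen only for illustration.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric columns."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit(rows):
    """Learn a handful of parameters describing how the two columns relate."""
    xs, ys = zip(*rows)
    return {
        "mean": (statistics.fmean(xs), statistics.fmean(ys)),
        "std": (statistics.stdev(xs), statistics.stdev(ys)),
        "corr": pearson(xs, ys),
    }


def sample(params, n, rng):
    """Draw brand new records from the learned parameters, not from real rows."""
    (mx, my), (sx, sy), rho = params["mean"], params["std"], params["corr"]
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mix z1 into the second column so it keeps the learned correlation.
        y = my + sy * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        out.append((x, y))
    return out


rng = random.Random(0)

# Made-up stand-in for "real" records: (age, oxygen saturation). Not patient data.
real = []
for _ in range(500):
    age = rng.gauss(50, 15)
    spo2 = 99 - 0.05 * age + rng.gauss(0, 1)
    real.append((age, spo2))

params = fit(real)
synthetic = sample(params, 500, rng)

# Toy accuracy check: how far apart are the real and synthetic correlations?
corr_gap = abs(pearson(*zip(*real)) - pearson(*zip(*synthetic)))

# Toy privacy checks: exact copies of real rows, and the closest real/synthetic pair.
real_set = set(real)
exact_copies = sum(1 for s in synthetic if s in real_set)
closest = min(math.dist(s, r) for s in synthetic for r in real)

print(f"correlation gap: {corr_gap:.3f}")
print(f"exact copies of real rows: {exact_copies}")
print(f"closest synthetic-to-real distance: {closest:.4f}")
```

The two checks at the end are crude stand-ins for the accuracy and privacy validation metrics Lesh mentions. A small correlation gap and zero copied rows in a toy like this do not amount to a privacy guarantee for real clinical data.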

Before this new engagement with NIH, Syntegra had signed a research contract with the Bill and Melinda Gates Foundation, centered on a similar goal of driving forward large-scale COVID-19 research.

“The Gates Foundation, however, found a similar issue to the NIH that a single sufficient dataset did not currently exist, leading to Syntegra and the Gates Foundation choosing to bring our existing partnership together into the NIH’s N3C initiative,” Lesh said. “With its unique focus on global health, the Gates Foundation expects Syntegra’s technology to provide a mechanism for cross-border COVID datasets to become widely available for research.”

Outside of this pursuit, the company is also engaging with the Food and Drug Administration regarding the role of synthetic data in regulatory decisions.

“The FDA is exploring with us several aspects of drug approval, including synthetic control arms, improved trial design and ‘what if’ analysis, ongoing drug safety monitoring, and approval of new indications for small sub-populations and rare diseases,” Lesh said.