How NIH’s Research-Driving, Centralized Hub for COVID-19 Patient Data is Evolving


Scientists are drawing insights via an enclave of patient records from 7.6 million individuals.

The pandemic continues to generate vast volumes of digital health data that could improve medical professionals’ understanding of COVID-19—but those potentially helpful datasets are often too large to share, and data management networks are so dissimilar they can’t be combined in a simple manner.

Near the middle of 2020, the National Institutes of Health’s National Center for Advancing Translational Sciences, or NCATS, moved to help alleviate that issue by developing a centralized resource that integrates coronavirus-related electronic health record data from separate organizations in disparate formats into one seamless structure that can also be used to advance research to combat the global health crisis.

Multiple technology-based elements have resulted from this work, which is known as NIH’s National COVID Cohort Collaborative, or N3C.

“The N3C Data Enclave is the largest collection of patient electronic health records and associated clinical information available for COVID-19 research. Its data, which resides in a secure environment that has strict access requirements, provides nearly complete U.S. geographic coverage and demographics fully representative of the U.S. population,” NCATS Acting Director Dr. Joni Rutter told Nextgov this week. “N3C is very unusual in that it is largely community- and volunteer-driven with over 2,800 registered users, 1,600 investigators, 89 institutions agreeing to share data, 225 institutions signed to use the data and 245 research projects.” 

At its core, N3C can be thought of as an enterprise-level virtual research organization that enables scientists nationwide to engage in collaborative analytics via a secure cloud environment. Clinical, laboratory and diagnostic data is rapidly collected from the electronic health records of a growing number of institutions. The involved research community then taps that data to study important questions about COVID-19, such as risk and protective factors in particular populations, medications that may mitigate or promote severe infection, and long-term effects of infection, even as the pandemic progresses.

This national data enclave built explicitly for researchers was an outcome of NCATS adopting a cloud-first strategy over a period of years prior to the pandemic. This paved the way for secure, scientific, collaborative environments, according to Rutter.

“When the COVID-19 pandemic first emerged, NCATS was well positioned to quickly stand up an environment to enable research,” she said.

The emergence of the cloud as an enterprise option is a big part of what made this work possible. In Rutter’s view, such capabilities democratize information technology by enabling broad access to top-notch tools and services, and allow for a sustainable, scalable environment that can be used for many different types of projects.

“In the past, investigators needed to not only do research, but were responsible for IT infrastructure,” she explained. “The cloud enables researchers to focus on the scientific questions instead of the IT resources needed.”

Deliberate moves have been made to preserve patients’ privacy through a variety of means. The N3C Cohort Exploration dashboard provides an overview of key metrics and distributions by age, sex, race and ethnicity, and comorbidity.

“As of Sept 7, 2021, the N3C Data Enclave included patient records from 7.6 million individuals, including 2,565,158 patients with COVID-19,” Rutter confirmed. “This data was contributed by 64 organizations, with additional sites preparing to transfer data.” 

Hundreds of N3C research-based projects submitted through the Data Use Request process have been approved. They span topics including using machine learning for identifying drugs that can affect COVID-19 patient outcomes, exploring the influence of race on medical resource allocation associated with the novel coronavirus, estimating risks around re-infection, investigating new neurocognitive complications that patients have encountered and much, much more. 

Rutter noted that electronic health records are primarily documentation systems, so data entered is often adequate for billing and communications between providers but lacks the specificity and validity needed for research. Currently, more than 60 institutions are sending data on a weekly or monthly basis. They each use their own format. 

“Because data is the natural resource of research, having quality data is a priority for N3C, and we spend much time and effort cleaning, validating and harmonizing data to ensure the EHRs from the different institutions are comparable in an apples-to-apples way,” Rutter said. 
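To make the harmonization step concrete, here is a minimal sketch of how records arriving in different site formats might be mapped onto one common layout and validated. It is an illustration only, not N3C's actual pipeline: the site names, field names, date formats and the `harmonize` function are all hypothetical.

```python
from datetime import datetime

# Hypothetical per-site mappings: source field name -> common field name.
SITE_FIELD_MAPS = {
    "site_a": {"pt_id": "patient_id", "dx_date": "diagnosis_date",
               "covid_result": "covid_positive"},
    "site_b": {"PatientKey": "patient_id", "DiagnosisDt": "diagnosis_date",
               "CovidPos": "covid_positive"},
}

# Hypothetical date format each site uses in its exports.
SITE_DATE_FORMATS = {"site_a": "%m/%d/%Y", "site_b": "%Y-%m-%d"}

# Values treated as a positive result (an assumption for this sketch).
TRUTHY = {"y", "yes", "true", "1", "positive", "pos"}

REQUIRED_FIELDS = ("patient_id", "diagnosis_date", "covid_positive")


def harmonize(site, record):
    """Map one site-specific record onto the common layout and validate it."""
    mapping = SITE_FIELD_MAPS[site]
    out = {"source_site": site}
    for source_field, common_field in mapping.items():
        out[common_field] = record.get(source_field)

    # Validation: reject records that lack a required value.
    missing = [f for f in REQUIRED_FIELDS if out.get(f) in (None, "")]
    if missing:
        raise ValueError(f"{site} record is missing fields: {missing}")

    # Cleaning: one representation per field, regardless of the source.
    out["patient_id"] = str(out["patient_id"]).strip()
    parsed = datetime.strptime(str(out["diagnosis_date"]).strip(),
                               SITE_DATE_FORMATS[site])
    out["diagnosis_date"] = parsed.date().isoformat()
    out["covid_positive"] = str(out["covid_positive"]).strip().lower() in TRUTHY
    return out


if __name__ == "__main__":
    print(harmonize("site_a", {"pt_id": " 17 ", "dx_date": "09/07/2021",
                               "covid_result": "Positive"}))
    print(harmonize("site_b", {"PatientKey": 42, "DiagnosisDt": "2021-09-07",
                               "CovidPos": 0}))
```

Once every record shares the same field names, date format and result encoding, records from different sites can be compared directly, which is the "apples-to-apples" property Rutter describes.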

Still, the pandemic heightened awareness around data quality and harmonization challenges.

With that in mind, the N3C data harmonization team is now working directly with N3C data-contributing sites and providing site-specific feedback to improve their local data quality. Because EHR data is frequently incomplete or missing terms, N3C is also exploring ways to make it more useful by bringing in data that can supplement the EHR. One approach officials are pursuing is “Privacy Preserving Patient Record Linkage through an honest data broker that can allow disparate data sets to be evaluated for data overlap that would signify that the same person’s records are in the disparate datasets,” the acting director noted.
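The general idea behind privacy-preserving record linkage can be sketched in a few lines. In this simplified illustration, each data holder turns identifying fields into a keyed hash token, and the broker compares only tokens, never names or birth dates. This is an assumption-laden sketch, not the approach N3C uses: the function names, the choice of fields and the shared-key arrangement are hypothetical, and it finds exact matches only, whereas production linkage systems would typically also handle typos, key management and other safeguards.

```python
import hashlib
import hmac
import unicodedata


def normalize(value):
    """Lowercase, strip accents and drop punctuation and spaces."""
    text = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in text if ch.isalnum()).lower()


def make_token(secret_key, first_name, last_name, birth_date):
    """Run at each data holder: derive a keyed hash from identifying fields.

    secret_key is a bytes value shared by the data holders but (in this
    sketch) withheld from the broker, so the broker cannot recompute
    tokens from guessed identities.
    """
    parts = (first_name, last_name, birth_date)
    message = "|".join(normalize(p) for p in parts).encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


def broker_overlap(tokens_a, tokens_b):
    """Run at the broker: report tokens that appear in both datasets."""
    return set(tokens_a) & set(tokens_b)


if __name__ == "__main__":
    key = b"example-shared-key"  # hypothetical; never hard-code a real key
    dataset_a = [make_token(key, "Ann", "O'Neil", "1980-01-02"),
                 make_token(key, "Raj", "Patel", "1975-06-30")]
    dataset_b = [make_token(key, " ann ", "ONEIL", "1980/01/02"),
                 make_token(key, "Mia", "Chen", "1990-11-12")]
    shared = broker_overlap(dataset_a, dataset_b)
    print(f"{len(shared)} record(s) appear in both datasets")
```

Because the normalization step removes differences in case, spacing and punctuation, the same person yields the same token at both sites, and the broker learns only that an overlap exists.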

Through that method, experts can potentially determine whether records are duplicated across data sets, discover individuals with characteristics important for a research question, or identify records that could be linked together to augment the data. NCATS is also actively exploring the use of synthetic datasets created from complex data like EHRs.

“The promise of synthetic data is extremely appealing allowing for broad access to algorithmically derived clinical data that both preserves scientific validity while eliminating privacy concerns,” Rutter said.  

Looking ahead, she noted that N3C provides a single enclave with circumscribed types of COVID-19 patient data—so to truly maximize research potential in this pursuit, NCATS is testing the ability to integrate multiple enclaves of different types of data together. 

“For example, the ability to combine the N3C EHR data with a large imaging repository would give investigators new insights not available at the present time,” Rutter explained. “The combined repository of multiple enclaves could leverage high-performance computing environments where calculations-intensive resources are required.”  

Inside NCATS, responding to the COVID-19 pandemic has also demonstrated that guidance and policies must be updated to reflect the necessity of scientific collaboration and data sharing, especially with regard to using multiple data types to help answer complex questions.

“We are working together with NIH policy leaders on approaches that will enable sustained and durable paths for collaborations that need state-of-the-art privacy and security practices,” Rutter said. “These efforts should maximize flexibility, while maintaining a premium on protecting patient information.”