How NIH Is Organizing Its Enormous Troves of Data


The agency has a new strategic plan for how to manage, store and analyze its scientific data.

The National Institutes of Health on Monday announced a sweeping initiative to revamp the way it manages data in an effort to foster the adoption of artificial intelligence, supercomputing and other technologies poised to transform medical research.

The agency’s Strategic Plan for Data Science sheds light on a handful of challenges facing NIH as record amounts of data pour into the agency and outlines five broad areas where leaders plan to focus their efforts over the next five years:

  • Improving data infrastructure and security.
  • Breaking down information silos.
  • Increasing access to analytics tools.
  • Expanding the data science workforce.
  • Creating findable, accessible, interoperable and reusable data sets.

“Accessible, well-organized, secure and efficiently operated data resources are critical enablers of modern scientific inquiry,” the report said. Under the plan, the agency “aims to maximize the value of data generated through NIH-funded efforts to enable biomedical discovery and innovation.”

Every day, more than 3,000 groups submit data to NIH on epidemiological studies, genome sequencing, clinical trials and a slew of other medical research topics. By 2025, the agency estimates its stores of human genome data could significantly outgrow the total amount of information generated by astronomical research, YouTube and Twitter combined.

As the trove of information continues to expand, it’s become increasingly difficult to organize, secure and distribute the data across the entire NIH enterprise. According to NIH, the rising costs of data management may inhibit researchers from generating new data, most existing data stores are not interconnected, and unformatted data is often difficult to locate.

“Given the major opportunities, but also significant challenges, posed for biomedical research by advances in data science, this [is] the right time to map out a strategy for helping researchers achieve the promise of big data,” said Jon Lorsch, the director of NIH’s National Institute of General Medical Sciences. NIH has already started rolling out parts of the plan, and implementation will ramp up quickly over the next year, Lorsch said in an email to Nextgov.

Under the strategic plan, the agency would move to a software-as-a-service model for data storage, analysis and sharing, and create the position of agencywide chief data strategist to oversee the rollout of new data science initiatives. The agency is also considering running bug bounty programs similar to those at the Pentagon to further bolster the security of its data infrastructure.

As part of the effort to expand its data science workforce, the agency also plans to develop training programs for current employees looking to improve their tech skills and create an NIH Data Fellows program that will offer one- to three-year posts to data scientists and technologists from academia and the private sector.

Salary constraints and regimented career paths have historically made it difficult for the government to recruit tech talent. Today, federal IT employees age 60 and over outnumber those under 30 almost 5 to 1.

NIH noted that its strategy is subject to change over the coming years and invited industry and government experts to weigh in on the plan.