The National Cancer Institute has taken the first concrete step to make its wealth of genomic cancer information available to a broad base of researchers worldwide -- potentially speeding up cancer research significantly.
The ultimate goal for the project is to build one or more computer clouds filled with data from the institute’s Cancer Genome Atlas that outside researchers can tap into with new data mining and analysis tools. Using that information, scientists say, they’ll be able to learn vastly more about how cancers develop and spread, spot hidden similarities between tumors on different parts of the body and improve treatments.
A presolicitation document posted Monday aims to prepare universities and research labs to bid for the chance to create one of three pilot clouds. According to the posting, information gleaned from the pilots might be used to create a new cancer cloud managed by the government, a university consortium, or the private sector, or one of the pilot clouds might develop into a full-scale model.
Because the types of cancer data and the tools used to mine it differ so greatly, it’s likely there will have to be at least two cancer clouds after the pilot phase is complete, George Komatsoulis, director and chief information officer of the National Cancer Institute’s Center for Biomedical Informatics and Information Technology, told Nextgov in August.
The Cancer Genome Atlas contains half a petabyte of information now, the equivalent of about 5 billion pages of text. By 2014, officials expect that figure to grow to 2.5 petabytes of genomic data drawn from 11,000 patients.
Just storing and securing that information would cost an institution $2 million per year, Komatsoulis said, a price tag that’s prohibitive for many small colleges, universities and other research institutions. By putting the data in the cloud and allowing researchers to access it remotely, perhaps on a pay-as-you-go model, the cancer institute could massively expand the number of researchers working on tough genomic problems, he said.
The federal government is working with Amazon on a separate initiative to put the 1000 Genomes Project in the company’s Elastic Compute Cloud, where researchers could access the data set and pay only for the computing they use.
The cancer institute plans to hold an online conference in December to help institutions prepare their proposals to build one of the three clouds.
The institute is using the crowdsourcing website Ideascale to gather ideas for how the clouds should be organized.