Like seemingly everything having to do with computers these days, bioinformatics is pushing into the cloud, allowing for cheaper and more efficient computation to identify potential targets for precision medicine.
The National Cancer Institute (NCI) has awarded three contracts to the Broad Institute, the Institute for Systems Biology and Seven Bridges Genomics Inc. to host The Cancer Genome Atlas, a dataset from an NCI-funded project that includes data from more than 11,000 patients with 33 different types of cancer.
"The NCI is pretty visionary. It saw this bottleneck in analysis before a lot of other institutions did," James Sietstra, president and co-founder of Seven Bridges, told BioWorld Insight.
Seven Bridges, of Cambridge, Mass., announced last week that its version of the data – Seven Bridges Cancer Genomics Cloud – is now available for use. Cloud versions of the data being made available by the Broad Institute and the Institute for Systems Biology are scheduled to come online this year as well.
Each of the three cloud systems may offer unique data-visualization tools developed by its host institution. The NCI has funded $1 million in computational time for researchers to try out all three platforms.
The Cancer Genome Atlas holds about 2.5 petabytes of data, including clinical information for each participant, genomic exome sequence for both tumor and normal samples and whole genome sequence for some patients, mRNA sequences for tumor samples, single nucleotide polymorphism array data, and somatic and germline mutation calls for each patient.
The dataset has been available for download so researchers could analyze the data on their own servers, as long as they were authorized by the NIH. But downloading and storing the data was prohibitively expensive for many researchers.
"It's a non-trivial cost to store The Cancer Genome Atlas," said Sietstra, noting that it can result in $2 million per year in storage costs and a "meaningful bump" on top of that for computational costs.
The cloud-based system just requires researchers to have access to the internet to view and analyze the data, although they'll still need to be authorized by the NIH. Seven Bridges is an NIH Trusted Partner, allowing it to authenticate and authorize access to controlled data in The Cancer Genome Atlas.
In addition to using NCI's data, researchers will be able to upload their own data to Seven Bridges' cloud system and use its tools to analyze that data alongside, or separately from, The Cancer Genome Atlas.
Sietstra sees the potential for collaboration as a huge benefit of keeping data in the cloud. The current alternative requires shipping physical hard drives between collaborators, and both institutions have to use the same architecture so they can read each other's data – not to mention having to get the lawyers involved when shipping patient data.
By combining their data in the cloud, researchers may be able to find genomic associations that wouldn't be statistically obvious with their individual datasets.
A MATH PROBLEM
Seven Bridges is named after the math problem – Seven Bridges of Königsberg – that asked for a walk through the city of Königsberg, which had seven bridges over a river with two islands in the center. The walk had to cross each bridge once and only once. Leonhard Euler's proof that no such walk exists laid the foundations of graph theory.
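The no-solution result can be checked with the degree-parity rule from graph theory: a connected graph has a walk that crosses every edge exactly once only if zero or two of its nodes touch an odd number of edges. The Python sketch below assumes the classic layout of two river banks and two islands; the land-mass labels and function names are illustrative, not taken from the company.

```python
from collections import Counter

# The seven bridges as edges between four land masses.
# Labels are illustrative: two river banks and the two islands.
BRIDGES = [
    ("north_bank", "island_1"),
    ("north_bank", "island_1"),
    ("south_bank", "island_1"),
    ("south_bank", "island_1"),
    ("north_bank", "island_2"),
    ("south_bank", "island_2"),
    ("island_1", "island_2"),
]


def odd_degree_nodes(edges):
    """Return the nodes that touch an odd number of edges."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return sorted(node for node, d in degree.items() if d % 2 == 1)


def has_euler_walk(edges):
    """True if a walk using every edge exactly once can exist.

    Assumes the graph is connected, which holds for the bridge layout above.
    """
    return len(odd_degree_nodes(edges)) in (0, 2)


if __name__ == "__main__":
    print(odd_degree_nodes(BRIDGES))
    print(has_euler_walk(BRIDGES))
```

All four land masses touch an odd number of bridges, so the check fails and no such walk exists.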
The company used graph theory to improve the alignment of short-read sequences produced by sequencing machines made by Illumina Inc. and Thermo Fisher Scientific Inc.'s Ion Torrent. Rather than aligning the short-read sequence to a linear reference sequence, Seven Bridges devised a way to create a Graph Genome that takes into account the variation – and frequency of those variations – at each nucleotide in the genome. By incorporating the potential variation into the alignment process, alignment with the Graph Genome can be more accurate.
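A toy example may help show why a reference that encodes known variation can align reads more accurately. The sketch below is not Seven Bridges' algorithm: it assumes a made-up reference string, handles only single-nucleotide substitutions, and uses hypothetical variant frequencies. Each position holds a dictionary of allowed bases and their frequencies, and a read base counts as a mismatch only if no known allele at that position matches it.

```python
# Toy reference and variant frequencies; all values are made up for illustration.
REFERENCE = "ACGTACGTTAGC"
VARIANTS = {
    4: {"A": 0.7, "G": 0.3},  # position 4: reference A, known alternative G
    9: {"A": 0.9, "C": 0.1},  # position 9: reference A, known alternative C
}


def build_graph(reference, variants):
    """Return one {base: frequency} dict per position of the reference."""
    graph = []
    for position, base in enumerate(reference):
        graph.append(dict(variants.get(position, {base: 1.0})))
    return graph


def best_alignment(read, graph):
    """Slide the read along the graph; return (start, mismatches) for the best fit.

    A base is a mismatch only if it is not among the alleles at that position.
    Ties are resolved in favor of the leftmost start.
    """
    best = None
    for start in range(len(graph) - len(read) + 1):
        mismatches = sum(
            1
            for offset, base in enumerate(read)
            if base not in graph[start + offset]
        )
        if best is None or mismatches < best[1]:
            best = (start, mismatches)
    return best


if __name__ == "__main__":
    read = "GTGCGT"  # carries the known G variant at position 4
    linear = build_graph(REFERENCE, {})       # linear reference only
    graph = build_graph(REFERENCE, VARIANTS)  # reference plus known variants
    print(best_alignment(read, linear))
    print(best_alignment(read, graph))
```

Against the linear reference the read aligns at position 2 with one mismatch; against the variant-aware structure the same placement has none, because the G is a recognized allele. A fuller version could also weight matches by allele frequency.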
Seven Bridges is working with Genomics England, which plans to sequence 100,000 genomes from around 75,000 people covered by England's National Health Service. Half the project, which started in late 2012 and is scheduled for completion in 2017, entails sequencing patients with rare diseases and two of their blood relatives. The other 50,000 genomes will cover cancer patients who will have DNA from both normal and tumor tissue sequenced.
"Genomics England has, in many ways, led the way in population genetics. It's a model for a lot of other nations for national scale precision medicine projects," Sietstra said.
The Graph Genome not only makes it easier to align the short-read sequences, but also allows the variation data to be presented in a form that's substantially compressed relative to the thousands of genomes that were used to create the Graph Genome.
"They need modern data structures to work at that scale," Sietstra said of Genomics England.
In addition to making the alignment easier, the population-genetics-based approach has the added advantage of anonymizing the individuals' genomic data. When it's incorporated into the Graph Genome, the data from an individual adds to the frequency of variation data, but the information identifying who that variation came from is removed.
Last week, Seven Bridges announced a $45 million series A financing, led by Kryssen Capital. The funding will be used to take on more large-scale genomics projects.
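The aggregation step described here can be illustrated with a short sketch. It assumes a hypothetical input format, a mapping from sample ID to that person's allele at each position, and is not drawn from Seven Bridges' software. Only per-position allele counts are kept; the sample IDs never reach the output.

```python
from collections import Counter, defaultdict


def allele_frequencies(genotypes):
    """Collapse per-person allele calls into per-position frequencies.

    `genotypes` maps a sample ID to {position: allele} (hypothetical format).
    The sample IDs are never read, so they cannot appear in the result.
    """
    counts = defaultdict(Counter)
    for calls in genotypes.values():
        for position, allele in calls.items():
            counts[position][allele] += 1
    return {
        position: {
            allele: n / sum(tally.values()) for allele, n in tally.items()
        }
        for position, tally in counts.items()
    }


if __name__ == "__main__":
    # Made-up calls for four made-up participants.
    cohort = {
        "patient_1": {4: "A"},
        "patient_2": {4: "G"},
        "patient_3": {4: "A"},
        "patient_4": {4: "A"},
    }
    print(allele_frequencies(cohort))
```

Each new participant shifts the frequencies slightly, while the output records only how common each variant is.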
In conjunction with the financing, Seven Bridges added Tom Daschle to the board. Sietstra said the national health policy experience of the former U.S. Senate Majority Leader and founder and CEO of The Daschle Group should help guide Seven Bridges' plan to secure more government projects, hinting that the company was already in discussions with "several countries" interested in large-scale genomics projects of their citizens.
Seven Bridges also added Kai-Fu Lee, founding president of Google China. "He has experience taking research concepts and turning them into marketable products," Sietstra said.