BioWorld

Keeping up with the exponential growth of DNA sequencing information is considered a challenge now. But in a few years, experts warn, one of the science's main objectives, treating disease, could be hampered for lack of data coordination and evaluation.

Key to managing the flood of data and to deciphering gene functions is bioinformatics, an emerging field being pushed into fast-track development by the race to sequence the human genome and by the enormous expectations of drug discovery.

"In the 1980s," said Robert Robbins, of the U.S. Department of Energy (DOE), "the crisis was data acquisition. In the 1990s, the crisis has become data integration."

Robbins, a computational biologist on loan to DOE from Johns Hopkins University in Baltimore, compares the problem to a library in which books are stacked chronologically, according to publishing date.

"If they're put on the shelves as they come in the door, rather than being catalogued, it's going to be difficult to find the information you need," Robbins said. "The drug discovery process is not being slowed now, but it will be in the next couple of years."

"You can't stress that point strongly enough," said Carlos Zamudio, director of bioinformatics for privately held Sequana Therapeutics Inc., of La Jolla, Calif. "All the disparate groups have to come up with a mechanism to communicate."

The groups Zamudio is referring to comprise a worldwide jumble of data bases that were founded to meet the needs of individual research communities, but whose value for drug discovery lies in working together.

The gene sequence data bases are GenBank, run by the National Center for Biotechnology Information within the National Institutes of Health in Bethesda, Md., the DNA Data Base of Japan (DDBJ), the European Bioinformatics Institute (EBI) in Cambridge, England, and the Genome Sequence Data Base (GSDB) in Santa Fe, New Mexico.

GenBank, DDBJ and EBI are partners, sharing information daily, but GSDB, a more recent creation, is not included.

For gene mapping information, there's the Genome Data Base (GDB) at Johns Hopkins University, as well as other data bases such as Genethon in Evry, France, and the Massachusetts Institute of Technology Genome Center.

Then there are the protein sequence data bases: Swiss-Prot in Switzerland, Protein Information Resource International in Washington, and the Protein Data Bank at Brookhaven National Laboratory in New York.

Added to all of the national and international data bases are hundreds of local data bases at colleges and research institutions worldwide. Private genome companies, such as Human Genome Sciences Inc., of Gaithersburg, Md., and Incyte Pharmaceuticals, of Palo Alto, Calif., also have developed significant proprietary data bases.

And in addition to sequencing the human genome, which is only about 10 percent complete, researchers for years have been sequencing the DNA of other organisms, such as yeast, Escherichia coli bacteria and mice, which are essential to understanding how human genes work.

"All the [major] public data bases are very loosely coordinated," said Kenneth Fasman, director of the Johns Hopkins Genome Data Base. "The field is so new, there's a lot of independent motion going on, but we're trying not to duplicate efforts. There's a good amount of sharing and collaboration."

Fasman said a main impediment to drug discovery in the near future may be a lack of scientists skilled in bioinformatics.

"A number of recent isolations of new genes," Fasman said, "was dependent on mining the data in sequence data bases. Without bioinformatics, all the discoveries in gene sequencing will get lost in the stacks of individual libraries."

A Data Explosion

Describing the dramatic growth in genomics and bioinformatics, Fasman said in the 1970s all the genome researchers in the U.S. could hold a meeting in one room and pass around index cards to share their findings. Not until the late 1970s and early 1980s were data bases founded to handle the explosion of data.

Mark Boguski, of GenBank, said the amount of information processed by his data base doubles every 20 months.
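To put that doubling rate in concrete terms, the short sketch below works out how much a data base would grow over several horizons, assuming the 20-month doubling time Boguski cites holds steady. The function name and the chosen horizons are illustrative assumptions, not figures from GenBank.

```python
def growth_factor(months, doubling_months=20.0):
    """Multiple by which a quantity grows after `months`,
    assuming it doubles every `doubling_months` (steady exponential growth)."""
    return 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    # Hypothetical horizons chosen only for illustration.
    for years in (1, 5, 10):
        factor = growth_factor(12 * years)
        print(f"after {years} year(s): about {factor:.1f} times the data")
```

Under that assumption, five years is three doublings (an eightfold increase) and ten years is six doublings (a 64-fold increase).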

In its broadest sense, bioinformatics encompasses almost everything related to computer analysis in medical research. However, a major focus of bioinformatics for drug discovery involves comparisons of newly generated gene and protein data to what is known in an effort to identify genes, their functions and mutations.

"The intrinsic value of gene sequencing data is low," Boguski said. "You have to link your sequences up with what's known. You have to be able to see if your little piece of DNA has been seen before."

Boguski said GenBank gets more than 10,000 requests each day for comparisons from researchers in the field.

"Three different colon cancer genes were discovered last year," he said. "Their function was inferred from bacterial and yeast gene sequences that had been known for years. Gene discoveries are made all the time, but about once a month one makes it into [the popular press]."

To illustrate the importance of data base integration, Zamudio referred to Sequana's work in trying to identify genetic variations responsible for Type II diabetes.

Using DNA from families afflicted with the disease, the researchers first get help from genome mapping data bases to pinpoint the location in the DNA where the genetic problems might exist. Then they compare DNA sequenced from their patient population with what is on file in the sequence data bases.

"When you're sequencing," Zamudio said, "you don't know if it's a gene or if it's something that codes for the gene. If it has no relation to anything in the data base you have one of two things. Either it isn't a gene or it's a new gene."

If the gene sequence matches something known, there may be information available in data bases containing other organisms' gene sequences or in data bases containing protein structures.

If the sequence is totally novel, the identification process is much more difficult, requiring the researcher to go back to the laboratory to first conduct experiments to determine if the material is even a gene.
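The match-or-novel decision described above can be sketched in a few lines. This is a toy illustration only: the sequences and entry names are invented, and a shared exact substring stands in for the similarity searches that real sequence data bases run, which tolerate mismatches and gaps.

```python
# Hypothetical "data base" of known sequences; names and bases are invented.
KNOWN_SEQUENCES = {
    "toy_gene_A": "ATGGCGTACGTTAGC",
    "toy_gene_B": "ATGCCCGGGTTTAAA",
}


def find_matches(query, database, min_len=8):
    """Return names of entries that share an exact run of at least
    `min_len` bases with `query` (a crude stand-in for similarity search)."""
    windows = [query[i:i + min_len] for i in range(len(query) - min_len + 1)]
    return [name for name, seq in database.items()
            if any(window in seq for window in windows)]


def triage(query, database):
    """Mirror the article's two outcomes: a known match, or a novel sequence."""
    hits = find_matches(query, database)
    if hits:
        return "known: follow up in related data bases for " + ", ".join(hits)
    return "novel: laboratory work needed to determine whether it is a gene"


if __name__ == "__main__":
    print(triage("GCGTACGTTA", KNOWN_SEQUENCES))
    print(triage("TTTTTTTTTT", KNOWN_SEQUENCES))
```

The first query shares a run of bases with one toy entry and is routed to the "known" branch; the second matches nothing and falls through to the laboratory.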

Comparisons derived from searching data bases assist in accelerating subsequent laboratory experiments that eventually may lead to therapeutic treatments.

"Right now, we're spending a lot of our energies just trying to manage different data bases," Zamudio said. "There isn't a single stop and shop."

One-Stop Shopping At GenBank

The closest thing to it may be at GenBank. Boguski said if a researcher is lucky enough to get a match with a known sequence, related information on protein structures and published scientific literature is only a few clicks away with a computer mouse through GenBank's integrated data retrieval system, called Entrez. It is accessible at GenBank through the Internet or available on CD-ROM.

Boguski tells researchers if they don't get a match on their DNA sequence the first time they should try again later. "We're getting data so fast, their match might be there the next week," he said.

Robbins described the challenges facing bioinformatics in three steps: achieving technical, semantic and social interoperability among the various data bases.

The first involves developing access while the second, semantic interoperability, requires solving differences in language. For an example of the latter, he said, consider that the concept of what a protein is and what a gene is can differ from one data base to another.

Social interoperability, Robbins said, is the step that leads to data integration. "Connecting all the data bases," he said, "doesn't mean anything until scientists say this is related to that in each data base. The social interoperability requires the scientific community to take responsibility for the data."

Robbins said the goal of the Human Genome Project was to provide raw material to produce a molecular anatomy. The real rewards in biology, he added, will come from constructing a molecular physiology.

However, moving from gene sequence data to comprehensive functional analyses, Boguski said, won't be easy. "Even when the complete human genome sequence is done," he said, "we'll be spending much of the next century figuring out how to interpret it."

-- Charles Craig

BIOINFORMATICS KEY TO TRANSLATING DNA SEQUENCE DATA INTO DRUG DISCOVERY
