A molecular biologist with a new gene discovery submitted the DNA sequence to GenBank of the National Center for Biotechnology Information (NCBI) only to find out it corresponded to a different class of proteins than he thought, changing the course of his research.

Another scientist, a parasitologist visiting NCBI, entered her as-yet-unidentified human nucleotide sequence into GenBank's integrated data bases and found a possible human gene related to it, a human protein expressed by that gene and a corresponding yeast gene whose biology had been studied extensively.

Not only did the parasitologist leave GenBank armed with a battery of information that saved her months of research, but comparisons of the yeast and human gene data uncovered mistakes in conclusions drawn by the yeast researchers.

As scientists generate huge amounts of DNA sequence information daily (human, animal, plant and bacterial), integration of computer data bases for comparative analyses of what is known with what is being discovered is considered essential for understanding the function of genes and how they relate to each other.

GenBank in Bethesda, Md., and its associates (the DNA Data Base in Japan, the European Bioinformatics Institute (EBI) in England and the Genome Sequence Data Base in Santa Fe, New Mexico) constitute the world's largest repository of public genomics information.

GenBank and NCBI operate within the National Library of Medicine at the National Institutes of Health.

To assist researchers in taking advantage of the exponential growth in genomics discovery, GenBank's scientists have integrated gene and protein sequence data bases with taxonomy and scientific journal data bases to facilitate comparative analyses. GenBank will add 3-D protein structure and complete genome data bases this summer and eventually will integrate transcription maps into the network.

Data on DNA material from 11,000 species is represented.

`Neighboring' Helps Researchers

Among the most recent features is incorporation of a technique called "neighboring" to enable scientists to discover related information about unidentified pieces of DNA, even if no corresponding sequence is known.

James Ostell, chief of GenBank's information engineering branch, oversees development of the software that makes the data base integration possible.

He explained that if a scientist has a protein sequence and doesn't find a match in the sequence data bases, the researcher can access the 3-D structure data base, determine if it corresponds to a particular shape and then do a "structural neighboring" search to see if another sequence, with the same shape, has been discovered.

If a "neighboring" DNA sequence exists, even though it's totally different from the researcher's, a comparison between the two can yield insight into how the unknown sequence works and the gene that expresses it.
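The fallback Ostell describes can be sketched in a few lines of Python. This is an illustration only, not GenBank's software: the two dictionaries, the fold labels and the function names are all hypothetical, and an exact string comparison stands in for a real similarity search.

```python
from typing import Dict, List, Optional

# Hypothetical toy records; none of this data comes from GenBank.
SEQUENCE_DB: Dict[str, str] = {
    "prot_A": "MKTAYIAKQR",
    "prot_B": "GSHMLEDPVA",
}
# Each known protein is tagged with a made-up fold label that stands in
# for its 3-D shape.
STRUCTURE_DB: Dict[str, str] = {
    "prot_A": "fold_1",
    "prot_B": "fold_2",
}


def find_sequence_match(query: str) -> Optional[str]:
    """Return the ID of an identical sequence, or None if there is none."""
    for seq_id, seq in SEQUENCE_DB.items():
        if seq == query:
            return seq_id
    return None


def structural_neighbors(fold: str) -> List[str]:
    """Return IDs of all proteins recorded with the given fold label."""
    return sorted(i for i, f in STRUCTURE_DB.items() if f == fold)


def lookup(query: str, query_fold: str) -> List[str]:
    """Try the sequence data base first; fall back to shape neighbors."""
    match = find_sequence_match(query)
    if match is not None:
        return [match]
    return structural_neighbors(query_fold)
```

In this sketch, calling lookup with a sequence that is not in the toy sequence table and the label "fold_2" returns prot_B, a record whose sequence is unrelated to the query but whose shape is the same.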

Discussing GenBank's move from archivist to research partner, Ostell said GenBank used to contain piles of DNA pieces whose relationships were not catalogued. That was GenBank's past.

Now, through development of bioinformatics and data base integration, researchers can quickly determine if their pieces of DNA are similar to what already has been identified.

Up To 20,000 Queries A Day

GenBank's Senior Medical Researcher Mark Boguski said the data base gets 15,000 to 20,000 queries a day and more than half come from scientists conducting homology searches for their gene and protein sequences.

Researchers using the data bases via the World Wide Web (http://www.ncbi.nlm.nih.gov) get answers within 20 seconds to two minutes. The data bases also are accessible by electronic mail and are available on CD-ROMs, which are updated every two months.

Francis Ouellette, who manages GenBank's submissions, said the World Wide Web is fast becoming the method of choice for access to the system.

As of April 13, GenBank contained 125,544 human gene sequences and another 51,315 from other organisms. After humans, the next most sequenced organisms are plants (16,355), followed by worms (12,102), rice (11,000) and yeast (3,000).

No eukaryotic organism has been sequenced fully, but the first may be finished soon. Ouellette said a European group expects to complete sequencing the entire yeast genome by early 1996.

Friday: Why two national data bases for the U.S.?

-- Charles Craig

(c) 1997 American Health Consultants. All rights reserved.