By Charles Craig

In the past six years, GenBank at the National Center for Biotechnology Information (NCBI) has shifted from public repository of DNA sequence data to partner in facilitating gene discoveries for scientists throughout the world.

That progression not only reflects advances in bioinformatics and comparative analysis, but also is indicative of the enormous explosion of DNA information.

The most recent example of GenBank's shifting mission is the launch of a new project to develop transcription maps of the human genome and integrate that information into the data base. The project stems directly from the exponential growth of human gene sequencing data, particularly recent contributions from Merck & Co.'s unique collaboration with gene sequencers at Washington University in St. Louis.

"This is an exciting time for biotechnology," said David Lipman, NCBI director. "Even before Darwin, comparative analysis was used in the study of biology. Now we're working at the gene level and we're seeing similarities that go back 3 billion years."

NCBI operates within the National Library of Medicine, of the National Institutes of Health, in Bethesda, Md.

GenBank and its associates - the DNA Data Base in Japan, the European Bioinformatics Institute (EBI) in England and the Genome Sequence Data Base in Santa Fe, N.M. - form the world's largest public data base of DNA sequence information.

In discussing the transcription map project, GenBank's Senior Medical Researcher Mark Boguski said, "Originally, when the Human Genome Project was conceived, the idea was to completely sequence the genome. Identifying genes was more a by-product. Ninety-five percent of the genome is junk. Mappers used anonymous markers, or sign posts, which in most cases don't correspond to genes.

Marking The Way Through Genome Junk

"Now people are making markers based on gene sequences [expressed sequence tag or EST] and putting those on the map to determine where the genes are in relation to the anonymous markers."

For its part, GenBank is providing the mappers with clean, non-redundant gene sequences to create the transcription maps, which will provide researchers with dense, sequence-ready clones of 50 kilobases, as opposed to more traditional markers spread across the genome in bites of 10 to 20 megabases.

For disease gene hunters and positional cloners, Boguski said, the transcription maps, which locate genes on chromosomes, are a shortcut to targeting smaller chromosomal regions in which to begin their searches. In some cases, the more detailed maps could save years of work.

Transcription maps, Boguski said, also move the science another step closer to understanding large-scale aspects of gene expression.

"There is no global view of how genes are organized and how that affects what they do," Boguski said. "Transcription maps are an immediate resource for positional cloners, but they also are for people who want to understand the biology."

GenBank is working on the transcription maps with three institutions: Massachusetts Institute of Technology (MIT) Genome Center in Cambridge, Mass.; Stanford University Genome Center in Palo Alto, Calif.; and Cooperative Human Linkage Center (CHLC) of the Fox Chase Cancer Center in Philadelphia.

Boguski and his fellow government researchers have proposed first locating the 3,000 to 4,000 unique genes in GenBank's data base on the transcription maps. Those several thousand genes were part of the GenBank data base before Merck's contributions started arriving via cyberspace to GenBank's computers in February.

The Merck-Washington University collaboration is transferring to GenBank about 1,000 gene sequences each day. Merck, of Whitehouse Station, N.J., estimated it will provide as many as 400,000 sequences over the next 18 months, making it by far the world's largest contributor of public gene sequence information.

Lipman estimated Merck's contributions as of mid-March have added as many as 9,000 new, as yet unidentified, genes to GenBank's data base.

"We held back from mapping gene sequences because there weren't enough ESTs," Boguski said. "This transcript mapping project catalyzed out of the Merck initiative."

Merck's importance to the public data bases is illustrated by the fact that in its first transfer of 15,000 gene sequences in February, it equaled what each of the other top contributors - Genethon, of Evry, France, and The Institute for Genomic Research (TIGR), of Gaithersburg, Md. - deposited over the last several years.

Both Genethon and TIGR are non-profit institutes. GenBank and its associate data bases also receive DNA sequences from other institutional and private company researchers worldwide.

TIGR has launched a transcript mapping project of its own in association with Norwalk, Conn.-based Perkin-Elmer Corp., California Institute of Technology (CalTech) and France-based Centre d'Etude du Polymorphisme Humain (CEPH).

Integration: The First Step

GenBank's gene sequence data base is integrated with other data bases: Medline, containing literature from scientific journals; Protein, containing protein sequences; and a taxonomy data base, classifying gene and protein sequences by species. Those soon will be integrated with a 3-D protein structure data base and a complete genome data base, which also will include large chunks of DNA sequences. Access is available through the Internet and on CD-ROMs.

The data base integration, essential for comparative analyses, was among GenBank's first moves from archive to research participant.

"The central concept of GenBank and NCBI," said Lipman, "is that by contributing sequences to a public data base you get a huge product multiple."

What scientists at one institution are sequencing, Lipman said, soars in value when comparisons are made with work submitted by researchers elsewhere in the world.

Thursday: A closer look at data base integration.

(c) 1997 American Health Consultants. All rights reserved.