Researchers from the Encyclopedia of DNA Elements (ENCODE) consortium reported data from the third phase of the project.
The ENCODE project started in 2003, shortly after the completion of the first draft of the human genome, to better understand the function of the 98% of the genome that does not directly code for proteins.
The project reported phase I data in 2007 and phase II data in 2012.
Phase III data, which were published in more than a dozen papers in Nature and its sister journals on July 29, 2020, consisted of 6,000 experiments performed on around 1,300 samples. Data papers were accompanied by both a Perspective and a News & Views commentary to put them into context.
And as big as that effort is, it is dwarfed by the number of publications that have come from the data overall.
“We’ve got over 2,000 publications by researchers outside of the ENCODE community using the project’s data,” Elise Feingold told BioWorld. “The uptake by the community, the value of the resource to the community [is]… much greater than I would have expected, and very rewarding.”
Many of those outside papers have focused on specific diseases, illustrating the translational value of the resource.
Feingold is the scientific advisor for strategic implementation in the division of genome sciences at the National Human Genome Research Institute (NHGRI), and a co-author on multiple papers from all phases of the project.
Like the Human Genome Project itself, ENCODE has both furthered and benefited from the development of new technologies. Data from the first phase of the project were generated mainly through the use of microarrays, which limited researchers to detailed looks at no more than a few hundred base pairs at a time from selected regions of the genome.
The advent of next-generation sequencing (NGS), Feingold said, was the equivalent of “going from analog to digital – [it] totally changed the throughput and the resolution” of experiments.
Technology developments have not occurred in lockstep with ENCODE phases. But broadly speaking, in phase III, researchers have moved from cell lines to primary cells from multiple tissues, which has enabled insights into tissue-specific aspects of RNA binding proteins, transcription factors and DNase I hypersensitive sites, which mark open chromatin.
The phase III data also include papers on the early fetal development of the mouse methylome and transcriptome, as well as the most comprehensive comparative analysis of different lab mouse strains published to date.
The project is by now deep in phase IV as far as data generation is concerned. For this phase, prominent technology developments have been the advent of single-cell analysis and long-read transcripts, which have matured a great deal since phase III’s 2012 beginnings.
The ENCODE consortium is also stepping up biological validation of the regulatory regions it has identified, through functional characterization centers. Currently, “we have candidate regulatory regions identified via biochemical assays,” Feingold said. The functional centers are both using now-standard methods like CRISPR and contributing to the development of new ones, such as massively parallel reporter assays (MPRAs).
Feingold said she anticipates that phase IV will be the final phase of the ENCODE program. The NHGRI is currently in the process of developing a strategic plan that is scheduled to be released this fall, most likely in October, to decide on the Institute’s priorities going forward.
The success and translational value of ENCODE, she said, have been very gratifying. But “as the science evolves over time, it’s good to take a fresh look in new directions.”