LONDON – The extent to which existing DNA databases fail to reflect human genetic diversity is laid bare in the most geographically comprehensive sequencing initiative to date.
The study applied the latest sequencing techniques to 929 genomes from 54 diverse populations around the world. It uncovered a large number of previously undescribed genetic variants, including some that are common in certain southern African, central African, Oceanian, and North and South American populations.
“We managed to sequence genomes from several populations that are not covered by existing resources, including sub-Saharan Africa and Papua New Guinea,” said Anders Bergstrom, of the Francis Crick Institute in London.
“These groups are under-represented in current research, and it was mainly in those populations that we found common variants that are not in any existing sequencing databases,” he told BioWorld.
Bergstrom, who carried out the research in a former posting at the Wellcome Sanger Institute in Cambridge, is lead author of the study published in the March 20, 2020, issue of Science.
The newfound variants may influence the susceptibility of different populations to disease. However, medical genetics studies have so far been conducted predominantly in populations of white European ancestry, meaning any health implications the variants may have are unknown.
“We are showing medical geneticists here that there are large gaps in understanding and these need to be filled to achieve equality in the application of genomic and personalized medicines,” Bergstrom said.
In the previously unstudied populations, the researchers found new variants in genes that are known to be associated with disease. But, said Bergstrom, “We can’t say they cause disease, because we didn’t have access to any health data.”
Identifying the previously unknown variants can be seen as a first step toward expanding genomics studies to under-represented populations. The sequences will be made freely available to researchers carrying out studies of genetic susceptibility to disease in different parts of the world.
While the study uncovered unknown diversity, no single DNA variant was found to be present in 100% of genomes from any major geographical region while absent from other regions, indicating the majority of common genetic variation is found across the globe.
The genomes in the study come from the Human Genome Diversity Project (HGDP), and are held at the Center for the Study of Human Polymorphism in Paris.
In the panel of 929 genomes, the researchers identified 67.3 million single nucleotide polymorphisms, 8.8 million small insertions or deletions (indels) and 40,736 copy number variants. That is nearly as many as the 84.7 million variants discovered in 2,504 individuals by the 1000 Genomes Project, which is viewed as the reference database of human genetic makeup.
In part, the amount of variation captured in the new study is a result of the use of the latest high-coverage sequencing techniques. But it also reflects the greater population diversity in the panel that was studied.
Although the number of human genomes sequenced as part of medically motivated genetic studies has grown into the hundreds of thousands, the number based on anthropologically informed samples to characterize human diversity remains in the hundreds to low thousands.
Equity apart, the gap in coverage is all the more questionable because genomic variation is known to be greatest in Africa, meaning risk variants are easier to identify in African populations.
As one case in point, a paper published in the Jan. 31, 2020, issue of Science identified previously unknown risk variants for schizophrenia in a genome-wide association study (GWAS) involving 900 people from the South African Xhosa population. The researchers said they gained those new insights from what was, by GWAS standards, a very small number of subjects, because of the depth of genetic variation in Africa.
At the same time, any biomedical implications of variations common in unsequenced populations, but rare or absent elsewhere, will remain unknown until genetic association studies are extended to include those geographies.
Given the scale of ongoing medical and national genome projects, such as NIH’s All of Us program, which aims to collect genomic data from 1 million individuals, the researchers say producing high-coverage sequences for at least 10 people in each geographically distinct population “would arguably not be an overly ambitious goal.” Doing so, they say, would represent a scientifically important step “towards diversity and inclusion” in human genome research.
“Ideally, we would like to have more sequences. There are many parts of the world which are still not represented, even now,” Bergstrom said.