By using an unsupervised machine learning approach to look at genetic variation across the protein-coding genomes of 140,000 species, researchers at Harvard Medical School and Oxford University have developed a new variant classifying system that performed on a par with wet lab approaches.

The ability to make predictions about pathogenicity from evolutionary data is one of the eye-catching aspects of the study, opening the possibility that "in the future, we'll go to clinician and these scores will be used to make a diagnosis... informed by a sequence obtained from an obscure organism at the bottom of ocean," Pascal Notin told BioWorld Science.

Notin, Mafalda Dias and Jonathan Frazer are postdoctoral scholars and joint first authors of the paper describing their method, which they have named evolutionary model of variant effect (EVE). The paper appeared in the October 27, 2021, issue of Nature.

Genome sequencing is now cheap enough that, in principle, it could be part of routine medical care.

But while obtaining sequences has become trivial, interpreting them remains anything but.

Even in the best-studied disease genes, such as BRCA, there are variants of unknown significance (VUS). And in genes that are less studied than BRCA, or less strongly linked to disease risk -- which is to say, most of them -- such VUS are more the rule than the exception.

Individuals differ from each other, on average, in about 0.1% of their genome, which works out to roughly 3 million base pairs. But "only 2% of variants have any sort of clinical annotation to date," Frazer told BioWorld Science.

Even before the advent of massive computational power, sequence alignment was used to gain insights into the functional importance of single amino acids.

When DNA sequences are highly conserved across species, this is usually a clue that they are functionally important. If an amino acid stays unchanged over evolutionary timescales, "changing that amino acid is quite likely to be damaging," Frazer said. "We've known that for a long time."
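
As a rough illustration of that idea, the sketch below scores each column of a small alignment by its Shannon entropy, so that a position that never changes scores 0. The five sequences are invented for illustration, and the function names are hypothetical; this is a textbook conservation measure, not the method used in the study.

```python
import math
from collections import Counter

# Toy alignment: the same short protein region in five hypothetical species.
# These sequences are made up for illustration only.
ALIGNMENT = [
    "MKTAW",
    "MKSAW",
    "MRTAW",
    "MKTGW",
    "MKTAW",
]

def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return sum((n / total) * math.log2(total / n) for n in counts.values())

def conservation_profile(alignment):
    """Entropy per column; 0.0 means the residue is identical in every sequence."""
    return [column_entropy(column) for column in zip(*alignment)]

profile = conservation_profile(ALIGNMENT)
for position, entropy in enumerate(profile, start=1):
    print(position, round(entropy, 3))
```

In this toy case the first and last positions are perfectly conserved, which under the reasoning above would make a change there more suspect than a change at the three variable positions in between.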

More recently, computational advances have enabled scientists to look at variation on a much larger scale.

Supervised vs. unsupervised

The most advanced current models, however, use a supervised learning approach. That is, scientists train an algorithm to recognize disease-causing variants by using known variants.

In their work, Frazer, Dias, Notin and their colleagues used an unsupervised approach, meaning the model never saw variants labeled as benign or pathogenic. In the first step of a two-step procedure, they trained their model on approximately 250 million protein-coding sequences from 140,000 species to recognize how immutable any given position was.

In the second step, the model assessed the probability that a given mutation was pathogenic, giving a score between 0 and 1 to each mutation.
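
The article does not spell out how the second step arrives at a number between 0 and 1. One generic, label-free way to do it is to fit a two-cluster Gaussian mixture to the raw scores and report each variant's probability of belonging to the higher-scoring cluster. The sketch below does that with a small hand-written EM loop; the input scores are invented, the function names are hypothetical, and it should be read as an illustration under those assumptions rather than as the authors' code.

```python
import math

# Invented "how unlikely is this mutation in evolution" scores, one per variant.
RAW_SCORES = [0.8, 1.1, 1.3, 0.9, 1.6, 1.2, 1.0, 1.4,
              6.5, 7.2, 7.8, 6.9, 7.5, 8.1]

def log_weighted_normal(x, weight, mean, std):
    """log(weight * NormalPDF(x)); working in logs avoids underflow."""
    z = (x - mean) / std
    return math.log(weight) - math.log(std) - 0.5 * z * z

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def high_cluster_probability(x, weights, means, stds):
    """Posterior probability that x belongs to the higher-scoring cluster."""
    low = log_weighted_normal(x, weights[0], means[0], stds[0])
    high = log_weighted_normal(x, weights[1], means[1], stds[1])
    return sigmoid(high - low)

def fit_two_gaussians(scores, iterations=100):
    """Fit a two-component 1-D Gaussian mixture with EM; no labels are used."""
    lo, hi = min(scores), max(scores)
    means = [lo, hi]                      # start the clusters at the extremes
    stds = [(hi - lo) / 4.0, (hi - lo) / 4.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: how strongly each score belongs to the high cluster.
        resp = [high_cluster_probability(x, weights, means, stds) for x in scores]
        # M-step: re-estimate each cluster from its weighted members.
        for k in (0, 1):
            r = [ri if k == 1 else 1.0 - ri for ri in resp]
            total = sum(r)
            weights[k] = total / len(scores)
            means[k] = sum(ri * x for ri, x in zip(r, scores)) / total
            var = sum(ri * (x - means[k]) ** 2 for ri, x in zip(r, scores)) / total
            stds[k] = max(math.sqrt(var), 1e-3)   # floor avoids a collapsed cluster
    return weights, means, stds

weights, means, stds = fit_two_gaussians(RAW_SCORES)
for x in (1.0, 7.5):
    print(x, round(high_cluster_probability(x, weights, means, stds), 3))
```

Because the mixture is fitted to the scores alone, nothing in this step depends on clinical labels, which is the sense in which such an approach is unsupervised.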

The team used the approach to predict whether mutations in roughly 3,200 disease-associated human proteins were benign or pathogenic, and compared the model's output with ClinVar, which describes itself as a "public archive of reports of the relationships among human variations and phenotypes, with supporting evidence."

EVE came to the same conclusions as ClinVar on the pathogenicity of variants in those genes, including a set of genes that ClinVar has labeled as "clinically actionable."
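
The article does not say which metric the team used for this comparison. One standard way to quantify agreement between a 0-to-1 score and a set of benign/pathogenic labels is the area under the ROC curve (AUC): the chance that a randomly chosen pathogenic variant scores higher than a randomly chosen benign one. The sketch below computes it directly from that definition; the scores and labels are invented and do not come from EVE or ClinVar.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (pathogenic, benign) pairs ranked correctly.

    labels use 1 for pathogenic and 0 for benign; ties count as half a win.
    """
    pathogenic = [s for s, y in zip(scores, labels) if y == 1]
    benign = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pathogenic:
        for b in benign:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(pathogenic) * len(benign))

# Invented example: model scores and reference labels for eight variants.
model_scores = [0.05, 0.10, 0.40, 0.35, 0.80, 0.95, 0.70, 0.20]
reference_labels = [0, 0, 0, 1, 1, 1, 1, 0]

auc = roc_auc(model_scores, reference_labels)
print(auc)
```

An AUC of 1.0 would mean every pathogenic variant outscored every benign one, while 0.5 is what random scores would produce.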

The team then compared EVE's predictions to 40,000 experimentally measured variants across 10 proteins, and found that EVE performed on a par with wet lab experiments in predicting pathogenicity.

EVE cannot determine why a given mutation is pathogenic. "What comes out of the model is a probabilistic statement, we are not making a causal statement," Notin said.

For now, the model looks at single genes. "Combination between variants" -- both within single genes and across different genes -- "is something that we would like to explore," Dias told BioWorld Science.

Another avenue for further work is to adapt the model to make predictions from multiple human sequences, rather than from sequences of multiple species.

"What we are modeling is the distribution of sequences in evolution," Dias said. However, some proteins are human-specific, and some variants, such as certain splicing variants, have detrimental effects in humans but not in other organisms.

For the great majority of proteins, though, an evolutionary lens can contribute valuable insights on their variants. In their paper, the team noted that their study is "one small but unusually direct demonstration of how the diversity of life on Earth benefits human health." More than 10% of the organisms the team used in its comparisons are on the International Union for Conservation of Nature's Red List of Threatened Species, including 21 that are outright extinct, and another 10 that are extinct in the wild.

"The progressive disappearance of species," they wrote, "is a threat to the diversity on which this work is built."