Researchers at Yale University have described what they call a “data sanitization tool” that enables them to strip personal identifiers out of functional genomics data while preserving the data’s usefulness for research.

The work provides one solution to what Mark Gerstein defined as a central issue in genomics research: “How do we do research and share data in a world of privacy?”

Gerstein is the Albert L. Williams Professor of Biomedical Informatics and professor of molecular biophysics and biochemistry, of computer science, and of statistics and data science at Yale University. He is the senior author of the paper reporting the “Privaseq” methodology, which was published in the Nov. 12, 2020, issue of Cell.

Open access at all levels, from publications to raw data, has been transforming the scientific enterprise. And COVID-19 is further accelerating that transformation.

It’s easy to see why.

“The value of all these large genetic and genomic studies really comes from the aggregation of data,” which means sharing datasets, Gerstein told BioWorld. “That’s where you get the statistical power.”

However, when raw genomic data are shared, unique privacy concerns ensue.

Genomic privacy is different from other forms of privacy for two main reasons, Gerstein said. For one, a privacy breach can’t be solved by changing the compromised information.

“If someone hacks into your email or your bank statement, you can get a new one,” Gerstein said. “If someone steals your genome, you can’t get a new genome.”

Another issue is that genomes are both unique and shared.

“You share your genome with your parents, your ancestors – whatever you do has bearing on them,” Gerstein said.

Because of Nobelist James Watson’s decision to share his genomic data, it is public knowledge that he has two copies of the ApoE4 allele, giving him a high risk of developing Alzheimer’s disease.

It is also public knowledge by inference that his sons have at least one copy of the ApoE4 allele each.

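Notably, Watson tried to prevent his ApoE status from becoming public knowledge by not releasing the part of his genome that contained the ApoE gene itself. That attempt was foiled by the covariance of SNPs: variants near a gene tend to be inherited together with it, so the withheld status could be inferred from neighboring SNPs that were released.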

Such issues will only become worse in the future.

Presently, “we don’t know that much about the genome,” Gerstein said. “Thirty years from now, who knows what we’ll know… In hindsight, this is probably something people are going to realize they should have thought more about.”

For now, when people do think about it, the issue is approached in a fairly black-and-white fashion. The primary solution to privacy concerns is keeping data private.

In their paper, Gerstein and his colleagues demonstrated that there are methods to “get around those risks and still share data,” he said.

Though one catchy part of the paper describes a cloak-and-dagger scenario of genome theft via residues on a coffee cup, the paper’s main focus was on trade-offs and solutions, not worst-case scenarios.

“I’ve worked in this privacy world a lot,” Gerstein said. “In some of the earlier work I’ve done, I’ve identified these privacy risks.”

Those risks have garnered a lot of attention, he said, but also some frustration, along the lines of, “You just have bad news for us – can you say something useful?”

Gerstein and his colleagues conducted experiments both to determine how much private information is in genomics data and to demonstrate that “you can do fairly straightforward things to scrub [private information] off,” he said.

The reason data sanitization works in functional genomics is that “the point of the data isn’t really the variants, but other aspects,” Gerstein explained. For example, “I contract a particular disease – say COVID-19 – and you do an RNA seq to see what goes up and what goes down.”

Under those circumstances, “the private data is sort of carried along as an afterthought.”
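
The idea that variants ride along in expression data, and can be scrubbed without changing what the experiment measures, can be illustrated with a toy sketch. This is not the published method: the reference sequences, the reads, and the function names below are all hypothetical, and the sketch assumes the simplest possible form of sanitization, in which each aligned read is replaced by the reference sequence at the same position.

```python
from collections import Counter

# Hypothetical toy reference: gene name -> reference sequence.
REFERENCE = {
    "GENE_A": "ACGTACGTAC",
    "GENE_B": "TTGACCGATA",
}

# Hypothetical aligned reads: (gene, 0-based start, read sequence).
READS = [
    ("GENE_A", 0, "ACGTTCGT"),  # differs from the reference at offset 4
    ("GENE_A", 2, "GTACGTAC"),
    ("GENE_B", 1, "TGACCGAT"),
]


def sanitize(read, reference):
    """Replace the read's bases with the reference bases it aligns to.

    Assumption for illustration only: masking with the reference is one
    simple way to drop individual variants from a read.
    """
    gene, start, seq = read
    return (gene, start, reference[gene][start:start + len(seq)])


def expression_counts(reads):
    """Count reads per gene, a stand-in for an expression measurement."""
    return Counter(gene for gene, _, _ in reads)


def count_mismatches(reads, reference):
    """Count bases that differ from the reference (candidate variants)."""
    total = 0
    for gene, start, seq in reads:
        ref = reference[gene][start:start + len(seq)]
        total += sum(a != b for a, b in zip(seq, ref))
    return total


if __name__ == "__main__":
    sanitized = [sanitize(r, REFERENCE) for r in READS]
    print("mismatches before:", count_mismatches(READS, REFERENCE))
    print("mismatches after: ", count_mismatches(sanitized, REFERENCE))
    print("counts unchanged: ",
          expression_counts(READS) == expression_counts(sanitized))
```

In this toy case the per-gene read counts, which stand in for “what goes up and what goes down,” are identical before and after sanitization, while the base that distinguished the individual from the reference is gone. Real data involve insertions, deletions, and other signals that such a simple substitution does not address.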
