COVID-19 has disrupted science in the way it has disrupted everything else. In the short term, universities have largely closed shop as a way to maximize social distancing, and lots of science – or at least, lots of bench work – is not getting done.
And much as the current societal shutdown is accelerating changes that are permanently altering both the workplace and social interactions, “COVID-19 is going to change science and the way we do science in some profound ways,” said Mark Musen.
Musen is professor of medicine and of biomedical data science at Stanford University. At a virtual conference on COVID-19 and Artificial Intelligence (AI), hosted on April 1 by Stanford’s Center for Human-Centered AI, he gave an overview of what he considered one of the most important ways that COVID-19 is accelerating change in science: the push for open data.
Science in the past 20 years or so already “has changed radically,” Musen said in his talk on “Knowledge Technology to Accelerate Open Science,” and a major change has been the increased importance of making data available online.
Some journals, such as the Public Library of Science (PLoS) family of journals, require authors to make a minimal dataset, defined as “data required to replicate all study findings reported in the article, as well as related metadata and methods,” available upon publication of articles.
Funding agencies require data sharing – though not always the raw data – and, in general, scientific output is increasingly seen “not just in terms of publications, but in terms of basic data,” Musen said.
An entire session at the conference was devoted to how data ranging from temperature measurements to cell phone location data has been used in other countries to track the pandemic and deploy countermeasures. Other presentations explored how such data could be used in the U.S., a country that, for better or for worse, has a stronger commitment to individual rights and privacy than China, Singapore, Hong Kong and South Korea – all places that have been successful at slowing the infection rate.
In the COVID-19 pandemic, Musen said, there is “lots of data around, [and] people want data quickly. That’s the good news.”
If you can’t find it, is it really there?
The bad news is that most of those data are not FAIR.
FAIR, in this case, stands for “findable, accessible, interoperable, reusable.”
Just because a dataset is in principle available somewhere on the internet does not mean that someone looking for data to address a certain question can find it.
Musen likened the current situation to “having a library where you don’t have a good catalog.”
The situation is no better with regard to metadata, tags that are meant to standardize data in order to make it reusable by other researchers.
Currently, Musen said, “if you are an investigator trying to find data, you are often really stuck because the metadata are terrible.”
For example, “if you are looking for a patient’s age, data can be presented in dozens of ways,” Musen said, illustrating his point with a slide listing some of those ways: Age. age. AGE. Age (y). Age (years). Age (Years). Age, years. Age, yrs. Age_years. Age.years. Age (weeks). Age (days). Age of patient. Age of patients. Age_patient. Age of subject.
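To make the problem concrete, here is a minimal sketch of how a researcher might collapse such label variants into one field before merging datasets. The function name, the canonical field name "age" and the unit keywords are illustrative assumptions, not part of CEDAR or any repository's tooling.

```python
import re

# Hypothetical unit keywords for illustration; real repositories may use others.
UNIT_PATTERNS = {
    "years": re.compile(r"\b(y|yrs?|years?)\b"),
    "weeks": re.compile(r"\b(w|wks?|weeks?)\b"),
    "days": re.compile(r"\b(d|days?)\b"),
}


def normalize_age_label(label):
    """Map a free-text column label to ("age", unit), or None if it is not an age field."""
    # Lowercase and replace punctuation/underscores with spaces so variants compare equal.
    cleaned = re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()
    tokens = cleaned.split()
    if not tokens or tokens[0] != "age":
        return None
    rest = " ".join(tokens[1:])
    for unit, pattern in UNIT_PATTERNS.items():
        if pattern.search(rest):
            return ("age", unit)
    # The label names an age but does not state a unit.
    return ("age", None)


if __name__ == "__main__":
    for raw in ["Age", "AGE", "Age (y)", "Age, yrs", "Age_years", "Age (weeks)", "Age of patient"]:
        print(repr(raw), "->", normalize_age_label(raw))
```

Even this small sketch shows the limits of after-the-fact cleanup: a label such as “Age of patient” identifies the field but not the unit, so the information has to come from whoever recorded the data.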
In an analysis published by Musen and his co-author Rafael Goncalves in Scientific Data in 2019, the scientists reported that metadata in BioSample, a data repository managed by the National Center for Biotechnology Information (NCBI), and BioSamples, which is managed by the European Bioinformatics Institute (EBI), were of “variable quality.”
In his talk at the HAI conference, Musen was more direct, describing the metadata as “awful.”
Boolean tags, which are supposed to be “true” or “false,” managed that seemingly simple feat only about a quarter of the time.
Supposedly Boolean tags for whether a trial participant was a smoker were almost as plentiful as tags for age: “Non-smoker, nonsmoker, non smoker, ex-smoker, Ex smoker, smoker, Yes, No, former-smoker, Former, current smoker, Y, N, 0, – never, never smoker, among others,” Goncalves and Musen reported in their paper.
And a quarter of the tags that were specified to be integer numbers could not be represented by integers. In their analysis, the authors found “integers,” including JM52, UVpgt59.4 and pig.
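The kind of check that exposes such problems can be sketched in a few lines. This is an illustration only, not the method Goncalves and Musen used; the accepted spellings of true and false and the function names are assumptions made for the example.

```python
# Assumed spellings that a lenient checker might accept as Boolean values.
TRUE_VALUES = {"true", "yes", "y", "1"}
FALSE_VALUES = {"false", "no", "n", "0"}


def parse_boolean(value):
    """Return True or False for a recognized Boolean spelling, otherwise None."""
    text = value.strip().lower()
    if text in TRUE_VALUES:
        return True
    if text in FALSE_VALUES:
        return False
    return None


def is_integer(value):
    """Report whether a metadata value can be read as a whole number."""
    try:
        int(value.strip())
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    for raw in ["Yes", "N", "non-smoker", "Former"]:
        print(repr(raw), "->", parse_boolean(raw))
    for raw in ["52", "JM52", "UVpgt59.4", "pig"]:
        print(repr(raw), "->", is_integer(raw))
```

A value such as “ex-smoker” fails the Boolean check not because it is meaningless but because the field was never a yes-or-no question in the first place, which is a design problem that templates are meant to catch at entry time.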
Musen is working to improve the situation as principal investigator of the Center for Expanded Data Annotation and Retrieval (CEDAR). CEDAR is a multi-institutional center of excellence supported by the NIH Big Data to Knowledge Initiative that aims to make it easier to provide good metadata through several avenues, including offering templates. CEDAR also works with GO FAIR, an initiative that works to enable researchers and institutions to implement FAIR data principles when they make their work publicly available.
GO FAIR’s newest project, the Virus Outbreak Data Network (VODAN) GO FAIR Implementation Network, was launched in March to make sure that COVID-19 data are managed in a way that allows them to be used to their full potential.
“During this epidemic and in earlier occasions, we have seen severely suboptimal data management and data reuse. … For instance, the data from the past Ebola epidemics are very difficult to find, to access, and if accessible, they are not interoperable, let alone reusable,” according to the GO FAIR website. “Under the urgent need to harness machine-learning and future AI approaches to discover meaningful patterns in epidemic outbreaks, we need to do better.”
As of April 2, the World Health Organization reported 896,450 confirmed cases of COVID-19 globally, and 45,526 deaths.