In this series, researchers answer three questions about their latest results in the fields of quantum, AI, and data science. Editor’s note: Responses may be edited for length or clarity.
Data sharing practices are top of mind for many scientific organizations in 2023. The White House Office of Science and Technology Policy, along with a host of federal agencies and other organizations, launched the Year of Open Science in January 2023. As part of the initiative, the National Institutes of Health (NIH) released its Final Policy on Data Management and Sharing. Data sharing best practices follow the FAIR principles: data should be findable, accessible, interoperable, and reusable. However, datasets fall short from the start if they cannot be found through their metadata. In a recent Scientific Data paper, a collaboration of infectious disease systems biology researchers from 15 centers created a metadata schema to improve biomedical data reuse. The schema can also be applied to other research areas.
Tsueng, G., Cano, M.A.A., Bento, J. et al. (2023) “Developing a standardized but extendable framework to increase the findability of infectious disease datasets.” Scientific Data 10, 99. DOI: 10.1038/s41597-023-01968-9
Q&A with co-author Laura D. Hughes of The Scripps Research Institute
(1) What was the problem you set out to solve?
One of the most striking things about the COVID-19 pandemic has been how the research community has embraced a new paradigm of open research, with groups sharing their data openly with the public, often in near real time. These publicly available data have helped the community rapidly develop diagnostics, therapeutics, and vaccines to combat the new virus.
However, sharing data in a way that makes it easy to find and reuse is still really challenging. There are many data repositories which help researchers share data for a variety of subjects, but each of them uses different standards to describe their data, making it difficult to find what you’re looking for.
As infectious disease researchers, we wanted to figure out how we could develop methods to promote access to data we generate in a way that makes it easier to find, regardless of where the data are stored.
(2) How will your results impact the field going forward?
The NIH, one of the largest funders of biomedical research, recently updated its data sharing policy to further support open access to biomedical data. Our work helps researchers figure out how to implement this new policy by offering a practical guide to help make sure that shared data can more easily be validated and reused by others. There’s still a lot to be done (see our related Commentary), but it’s an exciting time to think about how we can make biomedical data more findable, accessible, interoperable, and reusable (FAIR).
(3) What was the solution, and how did you reach it?
We first sought to understand how biomedical data repositories use metadata standards (or schemas) right now. Metadata is contextual data that describes a dataset: what it is, who created it, when and how they created it, and more. This information becomes critical anytime you’re trying to search through the millions of datasets available to understand what might be useful to your own research. Based on this analysis, we developed a framework for how to create standards which are tailored to infectious disease research but still interoperable with other types of research. Lastly, we helped increase the findability of nearly 400 datasets and computational tools we created.
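To make the idea of a metadata record concrete, here is a minimal Python sketch of how a dataset might be described and then checked against a small list of required fields. It is an illustration only: the field names, the required-field list, and the example record are all hypothetical and are not the schema published in the paper.

```python
# Hypothetical required fields, chosen for illustration only.
# They are NOT the fields defined by the schema in the paper.
REQUIRED_FIELDS = ("name", "description", "author", "date_created", "measurement_technique")


def missing_fields(record):
    """Return the required fields that are absent or empty in a metadata record."""
    return [f for f in REQUIRED_FIELDS if not record.get(f)]


def matches(record, term):
    """Case-insensitive keyword search over the record's descriptive text."""
    parts = [record.get("name", ""), record.get("description", "")]
    parts.extend(record.get("keywords", []))
    return term.lower() in " ".join(parts).lower()


# A made-up record describing what the dataset is, who created it, and when and how.
example_record = {
    "name": "Example host response RNA-seq dataset",
    "description": "Hypothetical transcriptomic profiling of influenza-infected cells.",
    "author": "Example Lab",
    "date_created": "2023-01-15",
    "measurement_technique": "RNA-seq",
    "keywords": ["influenza", "transcriptomics"],
}

if __name__ == "__main__":
    print("Missing fields:", missing_fields(example_record))
    print("Matches 'influenza':", matches(example_record, "influenza"))
```

A record that leaves out a required field, such as the author or the measurement technique, is reported by `missing_fields`, and a simple search like `matches` can only find a dataset through the text its metadata actually contains. That is why describing datasets with the same fields in every repository makes them easier to find.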