Thanks to collaboration between academic medical institutions like Washington University School of Medicine in St. Louis and data science startups such as MDClone, a health care informatics company, synthetic data is advancing to the point that it may soon be a viable tool for sharing patient information among institutions in ways that have never before been possible.
Researchers from Washington University School of Medicine in St. Louis recently published two studies, one in the Journal of the American Medical Informatics Association (JAMIA) and one in the Journal of Medical Internet Research (JMIR). The studies demonstrated that analyses of synthetic data generated from real COVID-19 patients accurately replicate the results of the same analyses conducted on the real patient data.
Philip R.O. Payne, PhD, a co-author of one of the papers, describes the work as a kind of simulation. Payne is a biomedical informatician and data scientist, the Janet & Bernard Becker Professor and Director at Washington University in St. Louis, and Associate Dean and Chief Data Scientist at its School of Medicine. “We’re trying to build the hurricane-track equivalent for pandemics, using large amounts of data,” said Payne.
With conventional methods of sharing patient records, institutions must de-identify the data, be certain that it cannot be re-identified, and still ensure timely access to enough data to make it useful for large-scale studies. With synthetic data, those problems no longer apply, because the data is manufactured and contains no identifying elements that could be linked back to a person. Further, because it is not associated with individual health records, it can be shared more easily across institutions.
The research suggests that data synthesis platforms can help translate clinical data into COVID-19 insights faster and lower the barriers that multiple stakeholders face in accessing data.
It will be fascinating to witness progress in a field that promises to address one of the largest bottlenecks in the development of diagnostic, therapeutic or preventative solutions: access to sufficient data that can be easily shared.
What is synthetic data?
Synthetic data is generated from actual data but does not tie back to electronic health records (EHRs) or any other source of individual patient records. As a result, it does not carry the risk of re-identifying patients that comes with using actual patient records.
How is the data artificially generated?
Synthetic data is created by reproducing the statistical characteristics of the real patients, such as measures of blood pressure, body mass index and kidney function. The result is a set of new records with realistic human characteristics that do not tie back to any one individual’s protected health information (PHI), such as name, address or birthday, but only to a set of associated symptoms and other factors such as behaviors.
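To make the idea concrete, here is a minimal sketch in Python. It is not the method used by MDClone or by the studies described here; it assumes, purely for illustration, that three numeric measurements can be summarized by their means and covariances and resampled from a multivariate normal distribution. The function name `synthesize`, the cohort values and the column order are all hypothetical.

```python
import numpy as np


def synthesize(real: np.ndarray, n_records: int, seed: int = 0) -> np.ndarray:
    """Draw new rows that share the mean and covariance of `real`.

    Illustrative assumption: the measurements are roughly jointly normal.
    No synthetic row is copied from, or linked to, a real row.
    """
    rng = np.random.default_rng(seed)
    mean = real.mean(axis=0)              # per-column averages
    cov = np.cov(real, rowvar=False)      # how the columns vary together
    return rng.multivariate_normal(mean, cov, size=n_records)


# Hypothetical stand-in for a real cohort (made-up numbers, not patient data).
# Columns: systolic blood pressure (mmHg), BMI (kg/m^2), eGFR (mL/min/1.73 m^2).
rng = np.random.default_rng(42)
real = rng.multivariate_normal(
    mean=[128.0, 27.5, 85.0],
    cov=[[225.0, 18.0, -30.0],
         [18.0, 16.0, -8.0],
         [-30.0, -8.0, 400.0]],
    size=5000,
)

synthetic = synthesize(real, n_records=5000, seed=1)

# The two cohorts should have similar summary statistics.
print("real means:     ", np.round(real.mean(axis=0), 1))
print("synthetic means:", np.round(synthetic.mean(axis=0), 1))
```

A simple Gaussian model like this can produce implausible values and ignores categorical fields, dates and locations, so a production platform would need considerably more than this sketch shows. The point is only that the synthetic rows preserve population-level patterns without corresponding to any individual.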
What are the applications of synthetic data?
With leading-edge informatics techniques and tools, including pattern recognition and machine learning, the data could be used to predict, for example, which patients are at highest risk of needing intensive care or ventilators. It could also help identify patterns in treatment strategies, showing whether drugs that a patient is already taking for a different condition help or hinder their progress.
“Synthetic data mimics real patient data, accurately models COVID-19 pandemic – National synthetic dataset boosts coronavirus research, helps prepare for future pandemics,” Washington University School of Medicine in St. Louis News Release, April 27, 2022
“Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: Results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C),” Journal of the American Medical Informatics Association, March 31, 2022
“The National COVID Cohort Collaborative: Analyses of Original and Computationally Derived Electronic Health Record Data,” Journal of Medical Internet Research, October 4, 2021