Accessing high-quality patient data without compromising confidentiality or security is a growing problem for clinical trial teams. Detailed datasets, like those contained in electronic health records, raise privacy concerns. Anonymizing those datasets to address the concerns can degrade data quality without fully resolving the privacy issues.
Synthetic data seeks to resolve these issues. “Synthetic data look like the original data source, without containing any information on real individuals,” write researchers Theodora Kokosi and Katie Harron in BMJ Medicine.
Early attempts to create synthetic datasets identified several challenges. These challenges have become the focus of continued attempts to improve the usability of synthetic data. As efforts continue, synthetic data may become useful to clinical researchers, and in some cases even preferable to real-world patient data.
How Is Synthetic Data Becoming More Useful?
In a 2023 study in PLOS Digital Health, researchers Aldren Gonzales, Guruprabha Guruswamy, and Scott R. Smith surveyed existing literature on synthetic data research. They list seven use cases for synthetic data that appear in current life sciences research:
- Creating simulations and making predictions.
- Developing health IT tools.
- Education and training.
- Epidemiology and public health research.
- Linking data sets.
- Releasing data sets for public use.
- Testing hypotheses, methods, and algorithms.
Synthetic datasets are already paying off for epidemiologists focused on the largest public health question of the decade: the COVID-19 pandemic. A synthetic dataset based on data from COVID-19 patients is currently helping researchers make predictions about the behavior and spread of the coronavirus that closely track trends based on real-world patient datasets.
“We’ve shown that we can build sophisticated predictions of what is going to happen in a population with a disease like COVID-19,” says informatician and data scientist Philip Payne, who is director of the Institute for Informatics at Washington University and the principal investigator on the project.
Synthetic data is gaining new ground in research on privacy and technology in healthcare as well. When layered with other privacy-enhancing technologies (PETs) like a trusted research environment, “…synthetic data may help researchers to refine their queries and build provisional models, therefore enabling experimentation while keeping safe any sensitive data,” writes Alison Noble, Technikos professor of biomedical engineering at the University of Oxford.
Currently, synthetic data may offer the most value as a modeling tool. When used alongside early phase clinical trial platforms like AnjuEPS, synthetic data may allow early phase clinics to better refine their research questions, identify potential challenges, and make swift, effective safety decisions.
What Ethical Questions Surround the Use of Synthetic Data?
To date, the primary ethical questions synthetic data addresses are those of privacy and confidentiality.
Early applications of artificial intelligence and machine learning to patient data sets sought to isolate data points in a way that didn’t reveal enough information about any individual patient to identify the person. While this task is relatively simple when limited to a single data point like “How many people smoke?”, it becomes more difficult as additional data points are added.
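To see why each added data point makes the task harder, consider a minimal sketch in Python. The records, field names, and the helper `smallest_group` are all hypothetical and made up for illustration. The helper reports the size of the smallest group of people who share the same combination of values: the smaller that group, the easier it is to single out one person.

```python
from collections import Counter

# Hypothetical toy records; every value here is invented for illustration.
patients = [
    {"smoker": True,  "age_band": "40-49", "zip3": "631"},
    {"smoker": True,  "age_band": "40-49", "zip3": "958"},
    {"smoker": True,  "age_band": "50-59", "zip3": "631"},
    {"smoker": True,  "age_band": "50-59", "zip3": "631"},
    {"smoker": False, "age_band": "40-49", "zip3": "958"},
    {"smoker": False, "age_band": "40-49", "zip3": "958"},
    {"smoker": False, "age_band": "50-59", "zip3": "631"},
    {"smoker": False, "age_band": "50-59", "zip3": "958"},
]

def smallest_group(records, fields):
    """Size of the smallest group of records sharing the same values for `fields`."""
    groups = Counter(tuple(r[f] for f in fields) for r in records)
    return min(groups.values())

# One attribute: every person hides in a group of four.
k1 = smallest_group(patients, ["smoker"])
# Two attributes: the smallest group shrinks to two.
k2 = smallest_group(patients, ["smoker", "age_band"])
# Three attributes: at least one person is now unique.
k3 = smallest_group(patients, ["smoker", "age_band", "zip3"])

print(k1, k2, k3)
```

In this toy dataset the smallest group shrinks from four people to two to one as attributes are combined. Real datasets have far more columns, so unique combinations tend to appear quickly.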
“The goal, therefore, is to define privacy in a rigorous, mathematical way — known as differential privacy in the literature — and design privacy-preserving machine learning techniques that will not break even when additional information becomes available,” says Thomas Strohmer, director of the UC Davis Center for Data Science and Artificial Intelligence Research (CeDAR). Synthetic data researchers look for ways to compute connections among data points that don’t lead to a complete, identifiable dossier on any one human being.
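One common building block in the differential privacy literature is the Laplace mechanism, which adds calibrated random noise to a query result before releasing it. The sketch below applies it to the smoking question from earlier. It is an illustration under stated assumptions, not the method used by any researcher quoted here: the records, the function names `laplace_noise` and `private_count`, and the choice of `epsilon` are all hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of the records that satisfy `predicate`.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical toy records, invented for illustration.
records = [
    {"id": 1, "smoker": True},
    {"id": 2, "smoker": False},
    {"id": 3, "smoker": True},
    {"id": 4, "smoker": True},
    {"id": 5, "smoker": False},
]

rng = random.Random(0)  # seeded only so the sketch is repeatable
noisy = private_count(records, lambda r: r["smoker"], epsilon=0.5, rng=rng)
print(f"Noisy smoker count: {noisy:.2f}")
```

A smaller `epsilon` means more noise and stronger privacy, while a larger one means a more accurate but less protected answer. Each additional query released from the same data uses up more of the overall privacy budget, which a production system would need to track.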
Synthetic data shows promise. Yet it also faces limitations, write Gonzales, Guruswamy, and Smith. Synthetic data’s promise in addressing privacy and confidentiality questions may be limited by “…recognized risk for data leakage, dependency on imputation model, and [that] not all synthetic data replicate precisely the content and properties of the original dataset.”
To date, legislation and regulation have not caught up with advances in synthetic data generation. Lack of clear rules may put patients and consumers at risk, write Anmol Arora and Ananya Arora in The Lancet. For example, insurers or other companies may use synthetic data sets to make decisions that adversely affect real-world customers. Security and privacy rules that currently address real-life customer or patient datasets may not apply to synthetic data sets in the same way — even though there are currently no guarantees that a synthetic data set is completely unidentifiable.
Similarly, institutional review board (IRB) requirements may not apply to proposed research using synthetic datasets in the same way these rules apply to proposed research involving human beings, says Jon D. Morrow, senior vice president of medical affairs and informatics at MDClone. Yet some form of IRB oversight may be necessary given synthetic data’s shortcomings, which include the inability to guarantee that no real-world data point will be randomly generated within the synthetic data set.
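One simple safeguard against that last shortcoming is to check a synthetic dataset for rows that exactly reproduce a real record. The sketch below shows the idea in Python. The data and the helper `exact_matches` are hypothetical, and the check only catches verbatim copies: a synthetic row that is merely very close to a real one would pass unnoticed.

```python
def exact_matches(real_rows, synthetic_rows):
    """Return the synthetic rows that are identical to some real row."""
    # Convert each row to a sorted tuple of (field, value) pairs so that
    # rows can be stored in a set and compared regardless of key order.
    real_set = {tuple(sorted(row.items())) for row in real_rows}
    return [row for row in synthetic_rows
            if tuple(sorted(row.items())) in real_set]

# Hypothetical toy data, invented for illustration.
real = [
    {"age": 47, "sex": "F", "diagnosis": "T2D"},
    {"age": 63, "sex": "M", "diagnosis": "COPD"},
]
synthetic = [
    {"age": 52, "sex": "F", "diagnosis": "T2D"},
    {"age": 63, "sex": "M", "diagnosis": "COPD"},  # collides with a real row
    {"age": 38, "sex": "M", "diagnosis": "asthma"},
]

collisions = exact_matches(real, synthetic)
print(f"{len(collisions)} synthetic row(s) duplicate a real record")
```

A team could drop or regenerate any rows this check flags before releasing the dataset, though passing the check is not evidence that the data is unidentifiable.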
“The potential benefits and risks of synthetic data might be unclear, but that does not mean they are unlikely to occur,” write Arora and Arora.
Where Is Synthetic Data Headed?
As synthetic data becomes more sophisticated, it may offer a better foundation for modeling, predictions, and hypothesis refinement. More complex uses of synthetic data may even allow researchers to create artificial patients who can stand in for real clinical trial participants — at least in some ways.
Using synthetic data to construct an artificial patient offers one way to advance life sciences research without facing many of the hurdles involved in working with real-world patients and their health information.
Bertalan Mesko, director of The Medical Futurist Institute, defines an artificial patient as “…a set of data representing the desired human characteristics the best possible way that is based on large amounts of real patient data, without actually including any backtracable real-patient data.”
Soon, artificial patients may allow clinical trial teams to predict the effects of various drug molecules, model success rates of medical devices or treatment protocols, or act as the control group in a clinical trial, writes Mesko.
Synthetic control groups for clinical trials appear especially promising in the study of targeted therapies for rare diseases, including oncogene-driven cancers, writes Vivek Subbiah in Nature Medicine. Studying treatments targeted for a specific gene or gene expression further restricts the pool of available clinical trial participants — in some cases, to the individual patient. Synthetic data could allow for the construction of a synthetic control twin for a real-world patient, allowing for better understanding of how gene therapies work.
While artificial patients haven’t yet reached the level of complexity required to accurately mimic a human patient, current work on synthetic data points toward a future in which clinical research teams may include artificial patients alongside human ones.
The use of synthetic data raises many ethical challenges, including the need to protect patients and customers when rules written for real-world individuals don’t clearly apply to AI-generated datasets. Resolving these issues can help researchers build more robust synthetic data approaches that improve clinical research.