Researchers receive $1.2 million NIH grant to study synthetic data use in health care
Three researchers from UC Davis have been awarded a four-year, $1.2 million grant from the National Institutes of Health (NIH) to generate high-quality synthetic data, that is, data generated by a computer program, using artificial intelligence and machine learning (AI/ML). The research may help physicians predict, diagnose and treat diseases.
The interdisciplinary research team is led by principal investigator Thomas Strohmer, director of the Center for Data Science and Artificial Intelligence Research (CeDAR). It also includes two UC Davis Health investigators: Rachael Callcut, professor of surgery and chief research informatics officer, and Jason Adams, associate professor of pulmonary, critical care and sleep medicine and director of data and analytics strategy.
Preserving privacy while making data accessible for research
Sharing health care data is crucial for understanding patterns and trajectories in diseases and for developing personalized medicines and treatments. However, patient privacy regulations can make it tricky to share detailed data for analytical purposes.
The challenge is to balance privacy concerns with data access, and to answer the overarching question: How can privacy-preserving machine learning techniques be developed to make the data accessible for analytics? Enter synthetic data.
Synthetic data is generated by a computer program using real-world data as a model. It is produced in a way that preserves the statistical properties of the original data without the risk of exposing sensitive information or violating privacy rules. The original data can come in various forms, such as images, videos, text and speech. The machine learning techniques should be able to analyze these different modalities and combine them in a privacy-preserving way to generate the synthetic data.
The researchers were inspired to investigate synthetic data when Nick Anderson, director of informatics research at UC Davis Health, gave a talk at a CeDAR event on the possibilities of synthetic data creation. CeDAR, one of four IMPACT Centers from the Office of Research, provided a platform for collaboration and visibility, and helped him connect with people interested in machine learning technologies.
“It truly started with the coming together of different faculty interested in machine learning and data science from different angles,” Strohmer said.
Strohmer explains that for medical records, one may first want to preserve the one-dimensional marginals. For example, that could mean preserving the number of people who smoke or the number of people who have diabetes. Then researchers may want to extend that preservation to combinations of conditions, such as how many people who smoke also have diabetes (a two-dimensional marginal) or how many people who smoke also have diabetes and COVID-19 (a three-dimensional marginal).
He warned that this approach has its own pitfalls: when the questions become too detailed, the answers may break privacy rules.
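The marginals Strohmer describes can be illustrated with a minimal Python sketch. It is not the team's code: the five patient records, the attribute names and the helper functions `marginal` and `all_marginals` are all hypothetical, invented only to show what one-, two- and three-dimensional marginals count.

```python
from itertools import combinations

# Hypothetical toy records; every value here is invented for illustration.
RECORDS = [
    {"smoker": True,  "diabetes": True,  "covid19": False},
    {"smoker": True,  "diabetes": False, "covid19": False},
    {"smoker": False, "diabetes": True,  "covid19": True},
    {"smoker": True,  "diabetes": True,  "covid19": True},
    {"smoker": False, "diabetes": False, "covid19": False},
]


def marginal(records, attributes):
    """Count the records in which every listed attribute is True."""
    return sum(all(record[name] for name in attributes) for record in records)


def all_marginals(records, order):
    """Return {attribute tuple: count} for each combination of `order` attributes."""
    names = sorted(records[0])  # alphabetical, so the dictionary keys are predictable
    return {
        combo: marginal(records, combo)
        for combo in combinations(names, order)
    }


one_way = all_marginals(RECORDS, 1)    # e.g. how many people smoke
two_way = all_marginals(RECORDS, 2)    # e.g. how many smokers also have diabetes
three_way = all_marginals(RECORDS, 3)  # smokers with diabetes and COVID-19

print(one_way)
print(two_way)
print(three_way)
```

In this toy data set the three-dimensional marginal counts a single person. A count that small can point to one individual, which is why very detailed queries become a privacy concern.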
“The goal, therefore, is to define privacy in a rigorous, mathematical way — known as differential privacy in the literature — and design privacy-preserving machine learning techniques that will not break even when additional information becomes available,” Strohmer said.
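To give a sense of what differential privacy looks like in practice, the sketch below applies the Laplace mechanism, a standard textbook building block, to a single count. It is an illustration under stated assumptions, not the method the UC Davis team is developing: the function names, the true count of 3 and the privacy parameter `epsilon=0.5` are all hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential samples with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Return a noisy count using the Laplace mechanism.

    Adding or removing one person changes a count by at most 1, so the
    sensitivity defaults to 1. A smaller epsilon means more noise and
    stronger privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical example: the true number of smokers with diabetes is 3.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [private_count(3, epsilon=0.5, rng=rng) for _ in range(20000)]

mean_release = sum(samples) / len(samples)
mean_abs_error = sum(abs(s - 3) for s in samples) / len(samples)
print(f"average released value: {mean_release:.2f}")
print(f"average absolute error: {mean_abs_error:.2f}")
```

Each released value is noisy, so a single answer reveals little about any one patient, while the average over many hypothetical releases stays close to the true count. The trade-off between noise and usefulness is governed by `epsilon`.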
Extending the generation of multimodal synthetic data into a clinical domain
The team is using Acute Respiratory Distress Syndrome (ARDS) — a high-risk condition — as a model to test their methods. “About one out of every 10 intensive care unit (ICU) patients, and one out of every four mechanically ventilated patients in the ICU has ARDS,” Adams said.
The physicians chose ARDS because it has evidence-based, life-saving treatments that can be beneficial when the condition is diagnosed in time. “The other advantage to using ARDS as a model is that the data that classifies ARDS is multimodal in nature, and so it can be used to test the robustness of the machine learning algorithms,” Adams said.
Both Callcut and Adams have research expertise in clinical outcomes of ICU patients. Callcut has worked on all aspects of ARDS detection, treatment and management for over a decade. One of her goals as the chief of the research division is to unify teams to work on advanced analytics and machine learning.
Adams explained that since ARDS patients in the ICUs are extremely sick, they tend to be routinely monitored through numerous channels. “As a result, a huge amount of multimodal health data is collected from ICU patients, much more than from typical hospitalized or outpatient clinic patients. Therefore, the ICU presents an ideal opportunity to precisely describe the clinical state of a patient, and then use the data to develop predictive algorithms that can do the same,” Adams said.
Callcut’s role has been to create the clinical use cases that help develop the data sources the team will use. Her training as a data scientist helps her understand computational approaches.
“At our lab, we are looking at a panel of almost 40 different markers on patients to try to understand how those pathways are interacting with one another. Our real goal is to try to identify those patients early,” she said. “We can then create novel therapies and interventions that can potentially abate the development or severity of ARDS, and that’s why AI/ML algorithms will be so important in this field.”
In addition to the data that Adams and his group have collected, Callcut has a diverse set of data from patients, ventilators and monitors. One of the compelling aspects of this type of research is that the team will also analyze how well the synthetic data fare compared with real data, to understand their efficacy in clinical environments.
“Part of the benefit and the excitement about this particular partnership between the clinical and analytical sides of our university is the opportunity to develop synthetic datasets that reflect the complexity, but also provide a high fidelity, which is what’s required to get useful machine learning algorithms when they go into the clinical environment,” said Callcut.