Sharing health care data is crucial for understanding disease patterns and trajectories and for developing personalized medicines and treatments. However, patient privacy regulations can make it difficult to share detailed data for analysis.

The challenge is to balance privacy concerns with data access, and to answer the overarching question: How can privacy-preserving machine learning techniques make detailed data accessible for analytics?

Enter synthetic data — which is generated by a computer program, but uses real-world data as a model.

This spring, UC Davis researchers were awarded a four-year, $1.2 million National Institutes of Health grant to generate high-quality synthetic data. The team will use artificial intelligence and machine learning (AI/ML), and hopes the research may help predict, diagnose and treat diseases.

The team includes principal investigator Thomas Strohmer, Ph.D., director of the UC Davis Center for Data Science and Artificial Intelligence Research (CeDAR); Rachael Callcut, M.D., M.S.P.H., F.A.C.S., professor of surgery and chief research informatics officer; and Jason Adams, M.D., M.S., associate professor of pulmonary, critical care and sleep medicine and director of data and analytics strategy.

‘Differential privacy’

Synthetic data can be generated from real-world sources such as images, videos, text or speech in a way that preserves statistical properties without the risk of exposing sensitive information. Machine learning techniques should be able to analyze the different modalities and combine them in a privacy-preserving way to generate the synthetic data.

Strohmer explains that for medical records, one may first want to preserve “one-dimensional marginals,” such as the number of people who smoke or who have diabetes. Then researchers may want to expand preservation to other conditions, such as how many people who smoke also have diabetes, or how many people who smoke have diabetes and COVID-19.
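To make the idea concrete, here is a minimal sketch of what it means for a synthetic dataset to preserve one- and two-dimensional marginals. The records, column names and counts are all hypothetical, and the comparison shown is far simpler than any method the team would actually use:

```python
import pandas as pd

# Hypothetical patient records; columns and values are illustrative only.
real = pd.DataFrame({
    "smoker":   [1, 0, 1, 1, 0, 0, 1, 0],
    "diabetes": [0, 0, 1, 1, 0, 1, 1, 0],
})

# A synthetic stand-in that should match the real data's statistics.
synthetic = pd.DataFrame({
    "smoker":   [1, 1, 0, 1, 0, 0, 1, 0],
    "diabetes": [1, 0, 0, 1, 1, 0, 1, 0],
})

# One-dimensional marginal: a count over a single attribute.
print("real smokers:     ", real["smoker"].sum())
print("synthetic smokers:", synthetic["smoker"].sum())

# Two-dimensional marginal: a count over a pair of attributes,
# e.g. how many people both smoke and have diabetes.
print("real smoker+diabetes:     ",
      ((real["smoker"] == 1) & (real["diabetes"] == 1)).sum())
print("synthetic smoker+diabetes:",
      ((synthetic["smoker"] == 1) & (synthetic["diabetes"] == 1)).sum())
```

If the synthetic data reproduces these counts, analyses that depend only on them give the same answers as they would on the real data, without any individual record being exposed.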

He warned that this marginal-preserving approach has its own pitfalls: when the questions become too detailed, answering them accurately can itself break privacy rules.

“The goal, therefore, is to define privacy in a rigorous, mathematical way — known as differential privacy in the literature — and design privacy-preserving machine learning techniques that will not break even when additional information becomes available,” Strohmer said.
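The intuition behind differential privacy is that the output of an analysis should look nearly the same whether or not any single patient's record is included. One standard way to achieve this for simple counting queries is the Laplace mechanism. The sketch below is only an illustration of that textbook mechanism, not the team's method, and the counts are made up:

```python
import numpy as np

def private_count(true_count: int, epsilon: float, rng=None) -> float:
    """Release a count under epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one patient's
    record changes the count by at most 1), so adding Laplace noise with
    scale 1/epsilon satisfies epsilon-differential privacy.
    """
    rng = rng or np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical query: "How many patients in the cohort smoke?"
true_smokers = 412  # made-up number for illustration
for epsilon in (1.0, 0.1):
    # Smaller epsilon means stronger privacy and a noisier answer.
    print(epsilon, round(private_count(true_smokers, epsilon), 1))
```

Each answered query consumes part of a fixed privacy budget, which is why an ever more detailed series of questions eventually forces either noisier answers or no answer at all.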

The team is using acute respiratory distress syndrome (ARDS), a condition that frequently affects ICU patients, as a model for testing its methods. ARDS has evidence-based, life-saving treatments that are most effective with timely diagnosis. The data used to classify ARDS is also multimodal, making it well suited for testing the robustness of machine learning algorithms.

“A huge amount of multimodal health data is collected from ICU patients, much more than from typical hospitalized or outpatient clinic patients,” Adams said. “(It) presents an ideal opportunity to precisely describe the clinical state of a patient, and then use the data to develop predictive algorithms that can do the same.”

Callcut’s training as a data scientist helps her understand the computational approaches, and she is creating clinical use cases to help develop the data sources.

“At our lab, we are looking at a panel of almost 40 different markers on patients to try to understand how those pathways are interacting with one another. Our real goal is to try to identify those patients early,” she said. “We can then create novel therapies and interventions that can potentially abate the development or severity of ARDS, and that’s why AI/ML algorithms will be so important in this field.”