Synthetic cancer registry
Challenge
A cancer registry is one of the most comprehensive disease-specific data sources and is fundamental to oncological research. While it is possible to request data from a cancer registry, requestors often need certain qualifications, as well as in-depth knowledge of the available data, before they can submit a data request. This significantly limits the number of people who can access cancer registry data for research purposes.
Solution
Synthetic data mimics the structure and statistical patterns of an original dataset. In contrast to traditional anonymisation techniques, none of the records in a synthetic dataset are linkable to real individuals. This makes it safer to share this data with others, especially when the data has been generated using strict mathematical privacy definitions such as differential privacy.
We generated and released one of the first high-dimensional synthetic medical datasets for public access. This gives more people the opportunity to experiment with record-level cancer data and contribute to oncological research.
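To make the idea of differentially private synthetic data more concrete, here is a minimal sketch for a single categorical column. It is not the algorithm used for the released dataset, which is high-dimensional and needs far more elaborate methods. It only illustrates the general principle: add calibrated noise to counts from the original data (the Laplace mechanism), then sample synthetic records from the noisy distribution. The function names, the toy "stage" column, and the value of epsilon are all hypothetical.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with rate 1/scale
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_synthetic_column(values, categories, epsilon, n_synthetic, rng):
    """Sample synthetic values from a Laplace-noised histogram.

    Adding or removing one record changes exactly one count by 1,
    so the histogram has L1 sensitivity 1 and the noise scale is 1/epsilon.
    """
    counts = Counter(values)
    noisy = [
        max(0.0, counts[c] + laplace_noise(1.0 / epsilon, rng))
        for c in categories
    ]
    # Clipping and sampling are post-processing and do not weaken the guarantee.
    if sum(noisy) == 0:
        noisy = [1.0] * len(categories)  # fall back to a uniform distribution
    return rng.choices(categories, weights=noisy, k=n_synthetic)


rng = random.Random(0)
categories = ["I", "II", "III", "IV"]  # hypothetical tumour stages
# Toy stand-in for an original column; not real registry data.
original = rng.choices(categories, weights=[4, 3, 2, 1], k=1000)
synthetic = dp_synthetic_column(original, categories, epsilon=1.0,
                                n_synthetic=1000, rng=rng)
```

Smaller values of epsilon add more noise, which strengthens the privacy guarantee at the cost of statistical fidelity.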
Results
Synthetic breast cancer registry:
Developed differentially private algorithms for synthetic data generation.
Performed a comprehensive quality evaluation to compare synthetic datasets and determine which one provided the required utility under strict privacy guarantees.
Wrote a Data Protection Impact Assessment (DPIA) documenting the safeguards against potential privacy risks.
Released the data publicly, communicated its purpose, and assisted users.
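As an illustration of what one building block of such a quality evaluation could look like, the sketch below computes the total variation distance between the distribution of a categorical column in an original dataset and in a synthetic one. The source does not state which metrics were used, so the choice of metric and the function name are assumptions made for this example.

```python
from collections import Counter


def total_variation_distance(original, synthetic):
    """Return the total variation distance between two categorical samples.

    0.0 means the category frequencies are identical;
    1.0 means the samples share no categories at all.
    """
    p, q = Counter(original), Counter(synthetic)
    n_p, n_q = len(original), len(synthetic)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in categories)


# Hypothetical toy columns, not real registry data.
original_col = ["a", "a", "b", "b"]
synthetic_col = ["a", "b", "b", "b"]
distance = total_variation_distance(original_col, synthetic_col)  # 0.25
```

Computing such a distance for each candidate synthetic dataset gives one simple way to rank them by how well they preserve marginal distributions. A full evaluation would also need to look at relationships between columns and at performance on downstream analyses.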
Further research:
Generated synthetic data based on cancer registry data to act as an external control arm in scarce single-arm clinical trials.