
What synthetic data generators mean for data protection laws

Author: Dr Helena CANEVER (from the Talan Research Centre)

 

The global data market is estimated to be worth $271.83 billion, and the volume of data created is estimated to grow by an average of 40% per year. Such a dynamic environment presents both opportunities and challenges for innovation in the use and storage of data. While forward-looking businesses have begun to use so-called “synthetic data” to meet those opportunities and challenges, many businesses are still unfamiliar with the intricacies of integrating and using it.

This article sets out what synthetic data is, why it is useful for data storage, how it can increase the privacy of otherwise individually identifiable data, and why its use by businesses needs to be reviewed and improved in order to ensure regulatory compliance.

 

Synthetic data and its application

Synthetic data is defined as artificial data that is generated by an AI from existing data while maintaining the statistical qualities of the original data. This definition applies to all types of data, structured (relational and tabular databases) and unstructured (images, video, text).
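To make the definition concrete, here is a minimal sketch of the idea, under strong simplifying assumptions: the "AI" is reduced to a two-parameter Gaussian model, and the ages and the function names `fit_generator` and `generate` are hypothetical. Real generators are far richer, but the principle is the same: learn the statistical qualities of the data, then sample new records from what was learned.

```python
import random
import statistics

def fit_generator(real_values):
    """Learn a tiny 'model' of the data: only its mean and standard deviation."""
    return {
        "mean": statistics.mean(real_values),
        "stdev": statistics.stdev(real_values),
    }

def generate(model, n, seed=0):
    """Sample n synthetic values from the fitted Gaussian model."""
    rng = random.Random(seed)
    return [rng.gauss(model["mean"], model["stdev"]) for _ in range(n)]

# Hypothetical ages, used for illustration only.
real_ages = [23, 31, 35, 38, 42, 47, 51, 58]

model = fit_generator(real_ages)           # two numbers, whatever the dataset size
synthetic_ages = generate(model, 10_000)   # any number of records can be drawn

print("real mean:     ", round(statistics.mean(real_ages), 2))
print("synthetic mean:", round(statistics.mean(synthetic_ages), 2))
```

In this toy case the fitted model holds two numbers however many records it was trained on, and the synthetic records can still be drawn after the original list has been deleted.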

Synthetic data has already been shown to increase the accuracy of models and to reduce the cost and time of AI development, and current estimates suggest that by 2024 it will constitute 60% of all training data. A promising application is long-term data storage with limited resources: a trained AI can generate any amount of synthetic data while occupying less storage than the actual data, even after the original source of information has been removed. So why isn’t synthetic data currently exploited as an alternative to data storage and retention?

 

The burden of identifiability

In the healthcare and financial sectors, synthetic data is often hailed as a fully anonymous solution that can allow the use of sensitive data while remaining within the bounds of data protection regulations like the European Union General Data Protection Regulation (GDPR).

Supporters of the use of synthetic data argue that, since the generated data no longer identifies any specific real individual, it is fully anonymous. In reality, personal data breaches can still occur, even from synthetic data.

The current text of the GDPR defines personal data as “any information relating to an identified or identifiable natural person”.

Synthetic data generated from personal data will always concern a natural person, as personal data is the substrate from which it is generated. What triggers current European regulations is the identifiability of individuals. An individual is identified if the data contains personal information such as names, addresses or age, and identifiable if further information is present that has the potential to relate to one individual specifically.

In this sense, a synthetic dataset that perfectly captures all the statistical properties of the original dataset can still expose the identity or personal information of outliers due to their uniqueness. For synthetic data to be safe, a certain level of noise or distortion relative to the original usually has to be added. In other words, the utility of the dataset, that is, how similar it is to the original, is decreased. This balance between the similarity of synthetic and real data and the preservation of privacy is often referred to as the utility-privacy compromise.
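The compromise can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the salary figures are hypothetical, adding Gaussian noise to each record stands in for a real generator, the distance to the closest real record serves as a crude privacy proxy, and the error on the mean serves as a crude utility proxy.

```python
import random
import statistics

def perturb(records, noise_scale, seed=0):
    """Toy stand-in for a generator: copy each record and add Gaussian noise."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, noise_scale) for x in records]

def distance_to_closest_record(synthetic, real):
    """For each synthetic value, the distance to the nearest real value.
    Small distances suggest a real record is being (nearly) reproduced."""
    return [min(abs(s - r) for r in real) for s in synthetic]

def mean_error(synthetic, real):
    """Absolute difference between the two means (lower means higher utility)."""
    return abs(statistics.mean(synthetic) - statistics.mean(real))

# Hypothetical salaries in thousands; the last value is an outlier.
real = [31.0, 33.0, 35.0, 36.0, 38.0, 40.0, 250.0]

for scale in (0.0, 1.0, 10.0):
    synthetic = perturb(real, scale)
    dcr = distance_to_closest_record(synthetic, real)
    print(
        f"noise={scale:>4}: synthetic outlier={synthetic[-1]:.1f}, "
        f"its distance to closest real record={dcr[-1]:.2f}, "
        f"mean error={mean_error(synthetic, real):.2f}"
    )
```

With no noise the synthetic data reproduces the real records exactly, outlier included. As the noise grows, the records move away from the originals and the statistics drift with them, yet the outlier remains easy to single out because no other record is anywhere near it.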

When it comes to adopting synthetic data for storage, from a company’s perspective, data is an important asset which might be used to develop analyses, insights and AI models. The uncertainty surrounding the utility-privacy compromise of generated data, which can be hard to evaluate and quantify a priori, might deter companies from fully adopting synthetic data generators as a storage alternative. Moreover, the identifiability of synthetic data is highly context-dependent and can change over time as technology evolves, leading to more uncertainty. As such, the high standards for individual privacy and latent identifiability risks can become a reason for a company not to adopt synthetic data.

 

A legal justification to process data

The use of personal data is subject to strict regulation by the GDPR, even within the bounds of a company. According to Article 5 of the GDPR, personal data shall be “kept in a form which permits identification of data subjects for no longer than is necessary for the purposes for which the personal data are processed”.

In other words, a company can only process personal data where it has a lawful basis to do so, for a specific and limited purpose and for a limited time. In many cases, the internal use of personal data to develop AI solutions may not be justified, as a machine learning model cannot be trained on personal data if the company lacks lawful grounds to do so, such as consent, contract fulfillment or a legitimate interest.

We point out that Article 5 of the GDPR also limits the long-term retention of most personal data. For instance, beyond what is strictly necessary, a company cannot retain data about former employees once their contracts are terminated, and a service provider cannot retain the personal data of end users once their subscription to the service has ended.

Using a synthetic data generator to retain some of the useful information that such personal data may offer, for instance to study attrition, may require the explicit consent of the individuals concerned and an update of the Terms and Conditions. The use of AI solutions to bypass the limits on data retention may constitute a legal grey area that businesses, sensitive to the risk of GDPR violations, may hesitate to step into.

 

A cybersecurity issue

Machine learning models can be vulnerable to cyberattacks that cause data breaches and certain attacks constitute a particular risk to personal data. In model inversion attacks, knowledge about the model can lead to knowledge about the training data with a certain degree of accuracy. On the other hand, in membership inference attacks, knowledge about the training data is not retrieved but it is possible to infer whether or not a particular individual was in the training set.

Both types of attack can be carried out based solely on query access, for instance through an API (black-box attacks), or with knowledge of the model’s architecture (white-box attacks). Because of these vulnerabilities, experts in the field of data protection and AI argue that machine learning models could be considered not only intellectual property but also personal data in themselves.
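As a rough illustration of a black-box membership inference attack, the sketch below assumes a deliberately overfit toy model (a nearest-neighbour predictor that memorises its training records) and an attacker who can only call its `predict` function. The records, the function names and the threshold are hypothetical. Attacks on real models are statistical rather than exact, but they rely on the same signal: models tend to behave differently on records they were trained on.

```python
def train_memorising_model(training_set):
    """Toy overfit model: predicts the income of the nearest training age (1-NN)."""
    def predict(age):
        nearest = min(training_set, key=lambda record: abs(record[0] - age))
        return nearest[1]
    return predict

def infer_membership(predict, record, threshold=1e-9):
    """Black-box attack: guess 'member' when the model's error on the
    candidate record is (almost) zero. Only query access is used."""
    age, income = record
    return abs(predict(age) - income) <= threshold

# Hypothetical (age, income in thousands) training records.
training_set = [(25, 30.0), (40, 52.0), (58, 61.0)]
predict = train_memorising_model(training_set)

print(infer_membership(predict, (40, 52.0)))  # candidate that was in the training set
print(infer_membership(predict, (41, 70.0)))  # candidate that was not
```

The attacker never sees the training set and learns nothing new about the record itself, only whether it was used for training, which can already be sensitive (for example, membership of a patient cohort).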

Whether or not the Court of Justice of the European Union will in the future adapt its interpretation of the GDPR to address these vulnerabilities, companies that train synthetic data generators on sensitive data might have to scrutinize their AI heavily to identify potential breaches.

 

A missed opportunity

In this article we have highlighted reasons that might deter data-driven businesses from training AIs to reproduce data, especially personal data. We argue that a scarcity of knowledge on how synthetic data is generated and on its potential applications is the main cause of lack of trust in this technology.

We believe synthetic data generators constitute a valid alternative that can be compliant with personal data regulations. For instance, a great deal of progress has been made in developing utility metrics that allow real and synthetic data to be compared more reliably. Moreover, formal privacy guarantees can now be integrated into generative models to decrease the risk of identification in synthetic data.
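The article does not name a specific guarantee. As one well-known example, the sketch below assumes differential privacy and applies its Laplace mechanism to a single statistic, the mean, rather than to a full generative model. The function name `dp_mean`, the clipping bounds and the salary figures are hypothetical, and the number of records is treated as public.

```python
import random

def dp_mean(values, lower, upper, epsilon, rng):
    """Release a mean with epsilon-differential privacy (Laplace mechanism).

    Values are clipped to [lower, upper] so that replacing any one record
    can change the mean by at most (upper - lower) / n, the sensitivity.
    """
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    sensitivity = (upper - lower) / len(clipped)
    scale = sensitivity / epsilon
    # The difference of two exponential draws with mean `scale`
    # follows a Laplace distribution with scale `scale`.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_mean + noise

# Hypothetical salaries in thousands; the last value is an outlier.
salaries = [31.0, 33.0, 35.0, 36.0, 38.0, 40.0, 250.0]

rng = random.Random(42)
print(dp_mean(salaries, lower=0.0, upper=100.0, epsilon=1.0, rng=rng))
```

A smaller `epsilon` means more noise and a stronger guarantee; a larger one means a more accurate but less protected result. The clipping step is what limits the influence of any single individual, outliers included.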

We also argue that synthetic data more often constitutes a means to increase privacy than a risk to the personal data of individuals. Businesses should review and improve their data governance practices to ensure regulatory compliance, and bring the challenge of synthetic data to the attention of the regulatory institutions responsible for enforcing the GDPR.

 


Sources

General Data Protection Regulation (GDPR) (https://gdpr-info.eu/)

López, C. A. F. (2022). On the legal nature of synthetic data. In NeurIPS 2022 Workshop on Synthetic Data for Empowering ML Research.

Veale, M., Binns, R., & Edwards, L. (2018). Algorithms that remember: model inversion attacks and data protection law. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 376(2133), 20180083.