Why Synthetic Data Is Not Private

May 4, 2022 :: 3 min read

And what's more, it's snake oil... unless done right.

“We anonymised it”, “we removed sensitive identifiers”, “these records are not real”. You might have heard some of these phrases when talking about synthetic data. While they aren’t just voodoo spells, these techniques might not be enough to protect anyone’s privacy from a determined attacker.

What even is synthetic data?

In case you don’t know what synthetic data is, here’s a working example that we’re going to use throughout this post.

Let’s say we are a bank. We record a lot of information about our customers: date of birth, social security number, address, marital status, all their transaction info, device identifiers, and more. We mix-and-match information between these records and create new, fake ones. We shuffle some names around, move people from one street to another nearby, add a five grand espresso machine to their purchase history.

We do it in a clever way: there’s no overlap with our actual customers, yet the resulting records are just as representative and useful. We can use them to compute statistics and train models, and the results will be just as good. Now we have a synthetic dataset.

What is it for?

Let’s take a step back for a moment. We have the real data and we can use it ourselves. But it’s quite a waste if we are the only ones using it. Can we profit from it in other ways?

How do we share it with researchers so that they can develop a clever algorithm for us? Synthetic data. How about we sell it to a data vendor? Synthetic data. Maybe we can create a public dataset or a machine learning competition? You’ve guessed it, synthetic data.

While not real, synthetic data is a good proxy for many applications.

So what’s the problem?

First things first. Proper anonymisation is challenging. Removing personally identifiable information (PII) is often not sufficient. Sometimes it is, but in many cases it’s possible to link one released dataset with another and completely deanonymise many records (e.g. the Netflix prize dataset). Maybe we could take our banking data example, combine it with some census data and maybe some regional hospital statistics, and boom, we have figured out that John Doe has early-onset dementia.
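To make the linkage idea concrete, here is a minimal sketch of such an attack. Everything in it is made up for illustration: the two tiny datasets, the field names, and the choice of ZIP code, date of birth and sex as the linking attributes are assumptions, not something taken from a real release.

```python
# Hypothetical "anonymised" bank release: names and SSNs are removed,
# but quasi-identifiers (ZIP code, date of birth, sex) are left intact.
bank_release = [
    {"zip": "02139", "dob": "1961-07-14", "sex": "M", "pharmacy_spend": 412.50},
    {"zip": "02139", "dob": "1984-02-03", "sex": "F", "pharmacy_spend": 18.20},
    {"zip": "02141", "dob": "1961-07-14", "sex": "M", "pharmacy_spend": 0.0},
]

# Hypothetical second dataset that still carries names (think census-like data).
public_register = [
    {"name": "John Doe", "zip": "02139", "dob": "1961-07-14", "sex": "M"},
    {"name": "Jane Roe", "zip": "02139", "dob": "1984-02-03", "sex": "F"},
    {"name": "Alex Poe", "zip": "02144", "dob": "1990-11-30", "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "dob", "sex")


def link(release, register, keys=QUASI_IDENTIFIERS):
    """Return (name, released_row) pairs where the quasi-identifiers
    of a released row match exactly one named person."""
    names_by_key = {}
    for person in register:
        key = tuple(person[k] for k in keys)
        names_by_key.setdefault(key, []).append(person["name"])

    matches = []
    for row in release:
        names = names_by_key.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # unique match: the row is re-identified
            matches.append((names[0], row))
    return matches


reidentified = link(bank_release, public_register)
for name, row in reidentified:
    print(name, "->", row["pharmacy_spend"])
```

No name was ever released by the bank in this toy setup, yet two of the three rows end up with one attached. Real attacks are messier (fuzzy matches, missing values), but the principle is the same.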

At best, deanonymisation helps ad vendors track you better so that they can sell you stuff. Somewhere in the middle, an insurance company will profile John Doe and raise his premiums. At worst, someone is going to assault you for being gay (Grindr got fined for selling their data to ad networks).

Now, some people advertise synthetic data as a viable way of generating anonymous data. Heck, even the International Association of Privacy Professionals still lists it as a usable technique in their docs.

What is more, a lot of synthetic data generation methods are based on machine learning, which helps to get a foot in the door during a meeting, I guess. Bonus points for being trendy.

Unfortunately, synthetic data also suffers from the reconstruction/linkage problems that I’ve described above (you can skim this paper to learn more). It is easy to sell, though. Combining my record with my colleague’s and creating two new records doesn’t give us any anonymity, yet many people believe it does. Any synthetic dataset that accurately models the original dataset must leak information about the real records.
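The “my record plus my colleague’s” case can be sketched in a few lines. This is a deliberately naive generator that shuffles each column independently; it is an assumption made for illustration, not how ML-based generators work, and the records and the `mix_and_match` helper are hypothetical.

```python
import random

# Two hypothetical real records.
real = [
    {"name": "Alice", "street": "Elm St", "purchase": "espresso machine"},
    {"name": "Bob", "street": "Oak Ave", "purchase": "insulin pen"},
]


def mix_and_match(records, seed=0):
    """Naive 'synthesis': shuffle every column independently across records."""
    rng = random.Random(seed)
    columns = {key: [r[key] for r in records] for key in records[0]}
    for column_values in columns.values():
        rng.shuffle(column_values)
    return [{key: columns[key][i] for key in columns} for i in range(len(records))]


def values(records, key):
    """Set of distinct values that appear in one column."""
    return {r[key] for r in records}


synthetic = mix_and_match(real)

# Whatever the shuffle did, every real value is still in the "fake" data.
for key in real[0]:
    print(key, "fully preserved:", values(real, key) == values(synthetic, key))
```

The rows may be new, but every value in them is real. Anyone who knows that Alice and Bob are the only two people behind this dataset now knows that one of them bought an insulin pen.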

If you’d like to learn more about it, here’s a good post about reconstruction attacks.

Accepting the trade-off

There are ways to address this. We can use differential privacy, a modern framework for measuring and limiting information leakage from a dataset. It gives us a provable bound on how much an attacker could learn about any single record.
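For a flavour of what that looks like in practice, here is a minimal sketch of the Laplace mechanism, one of the textbook ways to answer a counting query with differential privacy. The customer records, the `dp_count` helper and the value of epsilon are assumptions chosen for illustration; a production system should rely on a vetted library rather than this snippet.

```python
import random


def laplace_noise(scale, rng):
    """Sample from Laplace(0, scale) as the difference of two
    independent exponential variables with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(records, predicate, epsilon, rng):
    """Noisy count of records matching `predicate`.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon is enough
    for epsilon-differential privacy of this single query."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical customer records.
customers = [
    {"id": 1, "monthly_spend": 5200},
    {"id": 2, "monthly_spend": 310},
    {"id": 3, "monthly_spend": 7800},
    {"id": 4, "monthly_spend": 95},
]


def is_big_spender(record):
    return record["monthly_spend"] > 5000


rng = random.Random(42)
noisy = dp_count(customers, is_big_spender, epsilon=1.0, rng=rng)
print("true count: 2, noisy count:", round(noisy, 2))
```

A smaller epsilon means more noise and a stronger guarantee. Note that every additional query spends more of the privacy budget, which this sketch does not track.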

Differential privacy has a major caveat. In exchange for privacy, we reduce utility, and for any complex, high-dimensional dataset we reduce it by a lot. Simple machine learning models and aggregate statistics? Sure. Accurate deep learning? Not so much.
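A back-of-the-envelope calculation shows why dimensionality hurts. The sketch below assumes one particular, simple mechanism: a histogram over binary attributes with Laplace noise of scale 1/epsilon added to each cell. The function name and the numbers are illustrative assumptions; smarter mechanisms do better, but the trend is similar.

```python
def relative_noise(n_records, n_binary_attributes, epsilon):
    """Expected absolute noise per histogram cell divided by the
    average true count per cell, for a Laplace-noised histogram."""
    cells = 2 ** n_binary_attributes       # one cell per attribute combination
    avg_cell_count = n_records / cells     # records spread evenly, on average
    expected_abs_noise = 1.0 / epsilon     # mean of |Laplace(0, 1/epsilon)|
    return expected_abs_noise / avg_cell_count


# One million customers, epsilon fixed at 1.0, growing number of attributes.
for d in (5, 10, 20, 30):
    ratio = relative_noise(1_000_000, d, epsilon=1.0)
    print(f"{d:>2} attributes: noise / signal = {ratio:.6g}")
```

With a handful of attributes the noise is negligible next to the counts. Once the number of cells passes the number of records, the noise is as large as the signal or larger, and the released histogram tells us very little.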

I don’t say any of this to bash synthetic data, differential privacy or machine learning. Far from it: privacy-preserving ML is my shtick after all, and I think it’s amazing. However, I see a lot of misinformation and false advertising about it.

So the takeaway from this is that you can’t have your cake and eat it too. When you hear the term synthetic data in a stakeholder meeting, you should be sceptical. Doubly so if the same conversation doesn’t include terms like differential privacy or privacy analysis.

Disclaimer: the more security-savvy among you might have noticed that in this post I use privacy and anonymity almost interchangeably. They are different, though related, topics. The bigger, and more digestible, point that I’m trying to make in this post is that breaching one can lead to breaching the other.
