One often hears that some massive collection of data will not have privacy implications because it has been “anonymized”. Any time you hear that, treat the statement with great skepticism. It turns out that effectively anonymizing data, that is, making it impossible to identify the individuals in the data set, is much harder than you might think. The reason comes down to combinatorics and structured information.

This article on Medium by Vijay Pandurangan discusses a massive data set of NYC taxi trips, complete with medallion number, license number, time and location of every pickup and drop-off, and more. The medallion and license numbers had been replaced with hashes, but the key to unraveling them is that there are just not that many taxi medallions, and the numbering structure allows only a manageable number of possible combinations (under 24 million). While that would be a lot to work through by hand, Vijay was able to hash every possible value and identify every single one in the database in under 2 minutes.
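
The enumeration attack described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from the article: it assumes the identifiers were hashed with unsalted MD5, and it uses a single toy identifier format (digit, letter, digit, digit) rather than the full set of real medallion and license formats. The names `candidate_ids` and `build_reverse_table` are made up for this example.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def candidate_ids():
    """Yield every identifier in a hypothetical format: digit, letter, digit, digit."""
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        yield f"{d1}{letter}{d2}{d3}"


def build_reverse_table():
    """Hash every candidate once and map each hash back to its plaintext."""
    return {
        hashlib.md5(candidate.encode("ascii")).hexdigest(): candidate
        for candidate in candidate_ids()
    }


# Assumption: the published data contains unsalted MD5 hashes of the identifiers.
table = build_reverse_table()
anonymized = hashlib.md5(b"5X55").hexdigest()  # what the released data would contain
recovered = table[anonymized]                  # a dictionary lookup undoes the "anonymization"
```

The toy format has only 10 × 26 × 10 × 10 = 26,000 possibilities, so the table builds almost instantly. The same idea scales to tens of millions of candidates because each hash is cheap to compute and only needs to be computed once.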

Another approach would have been to make a set of known trips, note the location, time, etc., then use that to map the hash to the true identity. More work, but very straightforward.
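
As a rough sketch of that known-trip approach, the snippet below joins a handful of observed trips against the released records on time and location. The record layout (hashed identifier, pickup time, pickup location) and the function name `link_known_trips` are assumptions made for illustration; real data would need fuzzy matching on timestamps and coordinates rather than exact equality.

```python
def link_known_trips(records, observations):
    """Map hashed identifiers to real ones using trips the attacker observed.

    records:      iterable of (hashed_id, pickup_time, pickup_location)
    observations: iterable of (real_id, pickup_time, pickup_location)
    """
    # Index the released data by (time, location).
    index = {}
    for hashed_id, when, where in records:
        index.setdefault((when, where), set()).add(hashed_id)

    mapping = {}
    for real_id, when, where in observations:
        matches = index.get((when, where), set())
        # Only trust an observation that matches exactly one hashed identifier.
        if len(matches) == 1:
            mapping[next(iter(matches))] = real_id
    return mapping


# Toy data, invented for this example.
released = [
    ("a1f3", "2013-01-01T08:00", "5th Ave & 42nd St"),
    ("a1f3", "2013-01-01T09:30", "Broadway & Canal St"),
    ("77c9", "2013-01-01T08:00", "Park Ave & 34th St"),
]
seen = [("5X55", "2013-01-01T08:00", "5th Ave & 42nd St")]
mapping = link_known_trips(released, seen)
```

One matched trip is enough to label every other trip recorded under the same hash.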

Even harder is the problem of combinatorics when applied to “non-identifying” data. One will often see birth date (or partial birth date), ZIP code, gender, age, and the like treated as non-identifying. Yet just five-digit ZIP code, date of birth, and gender together will uniquely identify people 63% of the time.
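
To see how combining fields shrinks the crowd a person can hide in, here is a minimal sketch that measures what fraction of records is unique for a given set of fields. The four records are invented toy data, and `unique_fraction` is a hypothetical helper, not something from the studies mentioned here.

```python
from collections import Counter


def unique_fraction(records, keys):
    """Return the fraction of records whose combination of `keys` appears exactly once."""
    combos = [tuple(record[k] for k in keys) for record in records]
    counts = Counter(combos)
    unique = sum(1 for combo in combos if counts[combo] == 1)
    return unique / len(combos)


# Toy data, invented for this example.
people = [
    {"zip": "10001", "dob": "1980-01-01", "gender": "F"},
    {"zip": "10001", "dob": "1980-01-01", "gender": "M"},
    {"zip": "10001", "dob": "1975-06-15", "gender": "M"},
    {"zip": "10002", "dob": "1975-06-15", "gender": "M"},
]

by_zip = unique_fraction(people, ["zip"])                    # 0.25
by_zip_gender = unique_fraction(people, ["zip", "gender"])   # 0.5
by_all = unique_fraction(people, ["zip", "dob", "gender"])   # 1.0
```

No single field identifies most of these people, but each added field splits the groups further until everyone stands alone.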

A study of cell phone location data showed that just 4 location references were enough to uniquely identify individuals.

This is a great resource on all kinds of de-anonymization.

The reality is that, once enough data is collected, it is almost certainly identifiable. Aggregation provides the best anonymization, where individual records represent large groups of people rather than individuals.
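
One way to picture aggregation is to publish only group counts and to drop any group that is too small to hide in. The sketch below does that; the threshold `min_group_size` and the function name `aggregate_counts` are assumptions for illustration (the idea resembles a k-anonymity threshold), and the right threshold depends on the data.

```python
from collections import Counter


def aggregate_counts(records, keys, min_group_size=5):
    """Count records per combination of `keys`, suppressing groups below the threshold."""
    counts = Counter(tuple(record[k] for k in keys) for record in records)
    return {combo: n for combo, n in counts.items() if n >= min_group_size}


# Toy data, invented for this example: six riders in one ZIP code, one in another.
riders = [{"zip": "10001", "gender": "F"}] * 6 + [{"zip": "10002", "gender": "M"}]
published = aggregate_counts(riders, ["zip", "gender"])
```

Here the lone rider in the second ZIP code is left out of the published table, since a group of one is just an individual record under another name.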

Lance Cottrell is the Founder and Chief Scientist of Anonymizer. Follow me on Facebook, Twitter, and Google+.

Update: small edit for clarification of my statement about aggregation.