June 26, 2014

Slightly too Open Data

  1. The Atlantic published some visualisations of taxi rides in New York.
  2. Chris Whong asked for the data under Freedom-of-Information laws, and got it. Of course, the taxi and driver ids were anonymised.
  3. Vijay Pandurangan noticed that the driver id and taxi id were really, really weakly anonymised.
  4. You can find out a lot once you know the taxi id.


The NY Taxi & Limousine Commission had run the ids through a cryptographic hash function, MD5. Hash functions are designed so that if you don’t know anything about the input you can’t reconstruct it from the output, but if you know the input exactly, you can easily verify that it gives the same output. The problem comes when you know a lot about the input, but not everything. In this case, there are only about two million possible id numbers, so you can just hash every one of them and see which outputs turn up in the data. Once you have the ids, you can look them up.
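To make the "just try them all" step concrete, here is a minimal Python sketch of that kind of brute-force lookup. The id format is an assumption for illustration only (zero-padded seven-digit numbers, which gives a space of two million candidates); the real taxi and driver ids follow their own patterns, and the function names are hypothetical.

```python
import hashlib


def md5_hex(text):
    """Return the MD5 digest of a string as lowercase hex."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


def candidate_ids(n=2_000_000):
    """Yield every id in a HYPOTHETICAL id space: 0000000 .. 1999999.

    The real id formats are different; this stands in for
    'a small, fully enumerable set of possible ids'.
    """
    for i in range(n):
        yield f"{i:07d}"


def build_lookup(candidates):
    """Hash every candidate once and map digest -> original id."""
    return {md5_hex(c): c for c in candidates}


def deanonymise(hashed_ids, lookup):
    """Map each hashed id back to its original, or None if not found."""
    return [lookup.get(h) for h in hashed_ids]


if __name__ == "__main__":
    # Building the full table means two million MD5 calls.
    lookup = build_lookup(candidate_ids())
    published = [md5_hex("0012345"), md5_hex("1999999")]
    print(deanonymise(published, lookup))
```

The point of the sketch is that nothing clever is needed: the hash is never "broken", it is simply computed for every possible input and the results are stored in a dictionary.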

Even if the taxi authorities had done the anonymisation correctly — replacing each id with a random number — it would inevitably have been possible to extract some of the ids with a bit of work.  That’s not the same as being able to extract all of them with a few hours’ computer time.
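For comparison, replacing each id with a random number could look something like the sketch below. This is an illustration under my own assumptions, not a description of what any agency does: the function name, the 16-hex-character token length, and the choice of Python's `secrets` module are all hypothetical.

```python
import secrets


def pseudonymise(ids):
    """Replace each distinct id with an unrelated random token.

    Returns the pseudonymised list plus the id -> token mapping.
    The mapping is the only link back to the real ids, so it would
    have to be kept private or destroyed.
    """
    mapping = {}
    used = set()
    result = []
    for real_id in ids:
        if real_id not in mapping:
            token = secrets.token_hex(8)  # 16 hex characters, assumed length
            while token in used:  # regenerate on the (very unlikely) collision
                token = secrets.token_hex(8)
            used.add(token)
            mapping[real_id] = token
        result.append(mapping[real_id])
    return result, mapping


if __name__ == "__main__":
    rides = ["cab-A", "cab-B", "cab-A"]  # made-up ids
    anonymised, _ = pseudonymise(rides)
    print(anonymised)
```

Because the tokens are not computed from the ids, hashing every candidate id gets you nowhere. The same taxi still gets the same token on every ride, though, which is why some ids could still be recovered by matching rides against outside information.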


Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.