March 27, 2013

Why PDFs are not open data

The US non-profit journalism group ProPublica has written a number of good stories recently about drug company payments to doctors. They used the data that some US states force the drug companies to release. They have a new story explaining why this isn’t as easy as it sounds, since there was no requirement that the data be released in any useful form. For example, a lot of it was in PDF files. As they write:

Here’s how a PDF works, deep down: It positions text by placing each character at minutely precise coordinates in relation to the bottom-left corner of the page. It does something similar for other elements like images. A PDF knows about shapes, characters and their precise positions on the page. Even if a PDF looks like a spreadsheet — in fact, even when it’s made using Microsoft Excel — the PDF format doesn’t retain any sense of the “cells” that once contained the data.

They used a wide range of techniques: in some cases they could use the grid cells on the tables to work out which digits belonged to the same number, but in other cases they basically had to treat the PDF file as an image and use optical text recognition software on it, just as you would for a scanned bitmap. Most people wouldn’t go to these heroic lengths, and would rapidly decide to investigate other exciting stories.
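To make the problem concrete, here is a minimal sketch in Python of rebuilding numbers from positioned characters. It is not ProPublica's method: they used the table grid lines, whereas this sketch swaps in a simpler heuristic based on the horizontal gaps between characters. The function name `group_characters`, the `(character, x, y)` input format, and both tolerance values are assumptions made for illustration, and the coordinates in the example are invented.

```python
def group_characters(chars, row_tolerance=2.0, max_gap=6.0):
    """Rebuild rows of text tokens from (character, x, y) triples.

    x and y are measured from the bottom-left corner of the page.
    The tolerances are illustrative guesses, not values from any real PDF.
    """
    rows = []
    # Larger y is higher on the page, so sort descending to read top to bottom.
    for ch, x, y in sorted(chars, key=lambda c: -c[2]):
        if rows and abs(rows[-1]["y"] - y) <= row_tolerance:
            rows[-1]["items"].append((x, ch))
        else:
            rows.append({"y": y, "items": [(x, ch)]})

    result = []
    for row in rows:
        tokens = []
        last_x = None
        for x, ch in sorted(row["items"]):
            if last_x is None or x - last_x > max_gap:
                tokens.append(ch)    # a wide gap starts a new number
            else:
                tokens[-1] += ch     # a narrow gap continues the current one
            last_x = x
        result.append(tokens)
    return result


# Invented coordinates: two rows, each holding two numbers.
example = [
    ("1", 100, 700), ("2", 105, 700), ("5", 110, 700),
    ("4", 150, 700), ("0", 155, 700),
    ("7", 100, 688), ("9", 150, 688), ("9", 155, 688),
]
print(group_characters(example))  # [['125', '40'], ['7', '99']]
```

A heuristic like this breaks as soon as column spacing or font size varies, which is part of why extracting real tables takes so much more effort.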

Even Excel spreadsheets are only useful open data formats if they are structured so that it’s easy for a computer to find and extract the actual numbers from the worksheet. Stats NZ, who realise this, try to have data available both as Excel spreadsheets designed for visual display and in some useful downloadable form. Some other sources of NZ official data are not as helpful.
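As a rough illustration of the difference, the Python sketch below reads the same two figures from a display-style layout and from a plain machine-readable one. Both tables, their column names, and the functions `read_tidy` and `read_display` are invented for this example and do not come from any real release.

```python
import csv
import io

# Invented display-style export: a title row, a blank line, thousands
# separators, and a total row mixed in with the data.
DISPLAY_STYLE = """Table 1: Payments by region (hypothetical)

Region,Amount ($)
North,"1,200"
South,"950"
Total,"2,150"
"""

# The same invented figures as a plain table with one header row.
TIDY = """region,amount_dollars
North,1200
South,950
"""


def read_tidy(text):
    """Parse a one-header-row CSV into a dict of region -> integer amount."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["region"]: int(row["amount_dollars"]) for row in reader}


def read_display(text):
    """Parse the display layout; every step encodes knowledge of this one file."""
    lines = text.splitlines()[2:]        # skip the title and the blank line
    amounts = {}
    for row in csv.DictReader(lines):
        if row["Region"] == "Total":     # a derived row, not an observation
            continue
        amounts[row["Region"]] = int(row["Amount ($)"].replace(",", ""))
    return amounts


print(read_tidy(TIDY) == read_display(DISPLAY_STYLE))  # True
```

The tidy version needs no knowledge of the file beyond its column names, while the display version needs three separate special cases that would have to be rediscovered for every new table.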

(via @adzebill on Twitter)


Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.