April 11, 2016

Missing data

Sometimes…often…practically always… when you get a data set there are missing values. You need to decide what to do with them. There’s a mathematical result that basically says there’s no reliable strategy, but different approaches may still be less completely useless in different settings.

One tempting but usually bad approach is to replace missing values with the average; it’s especially bad with geographical data. We’ve seen fivethirtyeight.com get this badly wrong with kidnappings in Nigeria, we’ve seen maps of vaccine-preventable illness at epidemic proportions in the west Australian desert, we’ve seen Kansas misidentified as the porn centre of the United States.

The data problem that attributed porn to Kansas has more serious consequences. There’s a farm not far from Wichita that, according to the major database providing this information, has 600 million IP addresses.  Now think of the reasons why someone might need to look up the physical location of an internet address. Kashmir Hill, at Fusion, looks at the consequences, and at how a better “don’t know” address is being chosen.
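To see how filling in an average location manufactures a hotspot, here is a minimal sketch in Python. The records and coordinates are made up for illustration, and this is not the procedure any particular database uses: it just replaces every unknown location with the mean of the known ones and counts where the records end up.

```python
from collections import Counter
from statistics import mean

# Hypothetical records: (latitude, longitude), with None marking an
# unknown location. The values are invented for this illustration.
records = [
    (40.7, -74.0), (40.7, -74.0),     # two records near the east coast
    (34.1, -118.2), (34.1, -118.2),   # two records near the west coast
    None, None, None, None, None,     # five records with no location
]

known = [r for r in records if r is not None]

# "Impute" the missing locations with the average of the known ones.
centre = (
    round(mean(lat for lat, _ in known), 1),
    round(mean(lon for _, lon in known), 1),
)
imputed = [r if r is not None else centre for r in records]

# Count records per location: the busiest spot is the invented one.
hotspot, n = Counter(imputed).most_common(1)[0]
print("average location:", centre)
print("busiest location:", hotspot, "with", n, "records")
print("any real record there?", centre in known)
```

Every record with an unknown location piles up at a single point where no real record came from, so the map shows its biggest cluster in a place the data say nothing about. Keeping the unknowns as an explicit "don't know" category avoids that.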

Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.

Comments

  • steve curtis

    That farm/town in Kansas has the ‘default’ location for 600 million IP addresses that exist elsewhere. Other people live in areas with a number of major server farms, and the default location just happens to be their house or street.

    1 year ago