December 26, 2014

Safety and effectiveness in data mining

New medications have to demonstrate safety and effectiveness before they are marketed. Showing effectiveness is usually fairly straightforward, if slow and expensive. Safety is more difficult, because it’s mostly about uncommon events, edge cases, interactions.

Automated decisions based on data mining and algorithms have a similar problem.  It’s fairly easy to make sure they do what you intended them to do. It’s much harder to make sure they don’t also do things you didn’t think of.

Sometimes this is just human error, like the problems with RepricerExpress rules that led UK small businesses to post prices as low as 1p for goods on Amazon before Christmas, leading to massive losses. Sometimes it’s an algorithm that optimises the wrong thing.

Eric Meyer has written a post about Facebook’s “Year in Review”, which (repeatedly) pops up a picture in his feed saying “Eric, here’s what your year looked like!”. The algorithm is right. Horribly right.

“But for those of us who lived through the death of loved ones, or spent extended time in the hospital, or were hit by divorce or losing a job or any one of a hundred crises, we might not want another look at this past year.

“If I could fix one thing about our industry, just one thing, it would be that: to increase awareness of and consideration for the failure modes, the edge cases, the worst-case scenarios. And so I will try.”



Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.