Posts written by Thomas Lumley (1224)


Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.

July 31, 2014


‘This is statistics’ website

The American Statistical Association is launching a public relations campaign to make people think statistics is less boring and pointless, which is good:

We want students and parents to have a better understanding of a field that is often unknown or misunderstood. Statistics is not just a collection of numbers or formulas. It’s not just lines, bars or points on a graph. It’s not just computing. Statistics is so much more. It’s an exciting—even fun—way of looking at the world and gaining insights through a scientific approach that rewards creative thinking.

That’s a quote from the shiny new website, ThisIsStatistics. It has stories about what statisticians do, and information about salary and job trends and stuff. There are videos of statisticians talking about their work: currently Roger Peng (Johns Hopkins, SimplyStatistics blog) and Genevera Allen (Rice University).

It’s slightly disappointing that more of the people on the site aren’t real, just stock photos, but I suppose that’s unavoidable. What’s a bit more annoying is one of the photos in particular:


This looks as if it was constructed specially (the cup/mat/tablet/glasses are stock, eg).  It’s a rose chart, which is an ok way to display circular data (eg wind directions), but is not so good for comparison because of the way the wedges change shape as they get larger. The numeric labels are also a slightly strange choice for a circle measured in degrees (90 isn’t a multiple of 20).

Much more importantly, given the emphasis of the site on statistics as solving real problems, this is labelled as not being real: “data A” and “data B”.  Not helpful when we’re trying to tell people “It’s not just lines, bars or points on a graph”.


July 30, 2014

If you can explain anything, it proves nothing

An excellent piece from sports site Grantland (via Brendan Nyhan), on finding explanations for random noise and regression to the mean.

As a demonstration, they took ten baseball batters and ten pitchers who had apparently improved over the season so far, and searched the internet for news that would allow them to find an explanation. They got pretty good explanations for all twenty. Looking at past seasons, this sort of short-term improvement almost always turns out to be random noise, despite the convincing stories.
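To see how pure noise can produce convincing "improvers", here is a minimal simulation sketch. Everything in it is an assumption made for illustration, not Grantland's method or data: the hypothetical players have a fixed true ability that never changes, the batting-average-like numbers and noise levels are made up, and the function names are invented for this example.

```python
import random
from statistics import mean

def one_season(rng, n_players=300, n_pick=10, skill_sd=0.02, noise_sd=0.03):
    """Simulate one league where nobody's true ability changes.

    Returns (apparent improvement, later change) averaged over the
    n_pick players who look most improved early in the season.
    All numeric defaults are made-up illustrative values.
    """
    players = []
    for _ in range(n_players):
        skill = rng.gauss(0.260, skill_sd)      # true ability, constant
        last = skill + rng.gauss(0, noise_sd)   # last season's observed figure
        early = skill + rng.gauss(0, noise_sd)  # this season so far
        rest = skill + rng.gauss(0, noise_sd)   # rest of this season
        players.append((last, early, rest))
    # Pick the players with the biggest apparent improvement so far.
    improvers = sorted(players, key=lambda p: p[1] - p[0], reverse=True)[:n_pick]
    apparent = mean(early - last for last, early, rest in improvers)
    later = mean(rest - early for last, early, rest in improvers)
    return apparent, later

def simulate(n_reps=200, seed=2014):
    """Average the two quantities over many simulated leagues."""
    rng = random.Random(seed)
    results = [one_season(rng) for _ in range(n_reps)]
    return mean(a for a, _ in results), mean(b for _, b in results)

if __name__ == "__main__":
    apparent, later = simulate()
    print(f"apparent improvement of the top ten: {apparent:+.3f}")
    print(f"their change over the rest of the season: {later:+.3f}")
```

Because the selected players were chosen for an unusually good start, their early-season noise is positive on average, so their figures drift back down afterwards even though nothing about them has changed.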

Having a good explanation for a trend feels like convincing evidence the trend is real. It feels that way to statisticians as well, but it isn’t true.

It’s traditional at this point to come up with evolutionary psychology explanations for why people are so good at over-interpreting trends, but I hope the circularity of that approach is obvious.

July 29, 2014

H.G. Wells and statistical thinking

A treatment for unsubstantiated claims

A couple of months ago, I wrote about a One News story on ‘drinkable sunscreen’.

In New Zealand, it’s very easy to make complaints about ads that violate advertising standards, for example by making unsubstantiated therapeutic claims. Mark Hanna submitted a complaint about the NZ website of the company  selling the stuff.

The decision has been released: the complaint was upheld. Mark gives more description on his blog.

In many countries there is no feasible way for individuals to have this sort of impact. In the USA, for example, it’s almost impossible to do anything about misleading or unsubstantiated health claims, to the extent that summoning a celebrity to be humiliated publicly by a Senate panel may be the best option.

It can at least produce great television: John Oliver’s summary of the Dr Oz event is viciously hilarious.

July 28, 2014

Rise of the machines



The Automatic Statistician project (somewhat flaky website) is working to automate various types of statistical modelling. They have interesting research papers. They also have a demo that’s fairly limited but produces linear regression models, model checks, and descriptions that are reasonable from a predictive point of view.

Automating some bits of data analysis is an important problem, because there aren’t enough statisticians to go around. However (as Cathy O’Neil points out about competition sites like Kaggle), they aren’t tackling the hard bits of data analysis: getting the data ready, and more importantly, getting the question into a precisely-specified form that can be answered by fitting a model.

Misleading maps

This map, from Reddit, shows the most common name in each county of England and Wales in 1881, based on the 1881 census.


Matthew Yglesias says “what’s remarkable is how nearly perfectly the Smith/Jones divide lines up with the political boundary between England and Wales”. I think it’s remarkable that he thinks it’s remarkable (I think of ‘Jones’ as the stereotypical Welsh name), but obviously associations are different in the US. It is worth pointing out that the line-up isn’t as good as you might think if you weren’t careful: three of the light-green counties are actually in England, not in Wales.

Yglesias also says that the names “seem to show pretty distinctively what part of the British Isles your male line hails from.” That’s an example of how maps are systematically misleading: the conclusion may be true, but the map doesn’t support it as strongly as it seems to. The map shows the most common name in each county, and most of the counties where Jones is the most common name are Welsh. However, that doesn’t mean most people called Jones were in Wales. In fact, based on search counts, Lancashire had more Joneses than any Welsh county, and London had more than all but two Welsh counties. Overall, only 51% of Joneses were in Wales, going up to 60% if you include the three English counties coloured light green on the map.

In this particular case, many non-Welsh Joneses probably did have Welsh ancestors who had left Wales well before 1881, but not all of them – according to Wikipedia, the name came from Norman French and the first recorded use was in England.

July 27, 2014

More rugby stats

From Offsetting Behaviour (specifically, Seamus Hogan): How unfair is the Super 15 schedule?

This was prompted by one of the posts on the (apparently new) blog Sport Loves Data, by Kirdan Lees.

Air flight crash risk

David Spiegelhalter, Professor of the Public Understanding of Risk at Cambridge University, has looked at the chance of getting three fatal plane crashes in the same 8-day period, based on the average rate of fatal crashes over the past ten years.  He finds that if you look at all 8-day periods in ten years, three crashes is actually the most likely way for the worst week to turn out.

He does this with maths. It’s easier to do it by computer simulation: arrange the 91 crashes randomly among the 3650 days and count up the worst week. When I do this 10,000 times (which takes seconds), I get



The recent crashes were separate tragedies with independent causes (two different types of accident and one deliberate shooting); they aren’t related like, say, the fires in the first Boeing Dreamliners were. There’s no reason the recent events should make you more worried about flying.
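The simulation described above can be sketched in a few lines of Python. This is one possible reading of it, not the code actually used for the post: it assumes each of the 91 crashes falls independently on a uniformly random day (so two crashes can share a day), takes "worst week" to mean the largest count in any window of 8 consecutive days, and uses function names invented for this example.

```python
import random
from collections import Counter

def worst_window(n_crashes=91, n_days=3650, window=8, rng=random):
    """Scatter crashes over the days at random and return the largest
    number falling inside any `window` consecutive days."""
    days = sorted(rng.randrange(n_days) for _ in range(n_crashes))
    worst = 0
    left = 0
    for right, day in enumerate(days):
        # Drop crashes too old to share an 8-day window with this one.
        while day - days[left] >= window:
            left += 1
        worst = max(worst, right - left + 1)
    return worst

def simulate(n_sims=10_000, seed=2014):
    """Tabulate the worst-window count over many simulated decades."""
    rng = random.Random(seed)
    return Counter(worst_window(rng=rng) for _ in range(n_sims))

if __name__ == "__main__":
    counts = simulate()
    total = sum(counts.values())
    for k in sorted(counts):
        print(f"worst 8-day count = {k}: {counts[k] / total:.1%} of simulations")
```

Sorting the crash days and sliding a pointer over them avoids scanning all 3650 days in every replicate, so 10,000 replicates stay fast even in plain Python.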

July 25, 2014

Storytelling with data: genre and shared language

A talk from this year’s Tapestry conference, taking the idea of storytelling with data seriously by looking at genre.

Genres create a shared language, but they can also become formulaic. 

Here’s one example to get you going: what do love stories have to do with taxi maps?

Watch the video

(via Alberto Cairo)