Posts written by Thomas Lumley (1962)


Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.

March 12, 2017


  • False positives: many people who think they are allergic to penicillin actually aren’t, and so don’t need to be given broader-spectrum antibiotics (which have more impact on resistance). Ars Technica, the research paper.
  • Cancer genomics researcher accused of data falsification. Long NY Times story, including very clever animation of Western blot duplication.
  • A bill in the US House of Representatives wouldn’t quite let employers demand genetic data from employees, but it would let employers make employees pay not to give it. (STAT news)
  • President Trump described good employment numbers under the previous government as ‘phony’.  After the first month of his government, the White House press secretary said “They may have been phony in the past, but it’s very real now”.  (via Vox)
  • “Cause of death” is complicated: the BBC has a story “The biggest killer you may not know” about sepsis. The story says it “kills more people in the UK each year than bowel, breast and prostate cancer combined.” But it’s not either/or. A substantial number of sepsis deaths are due to cancer or cancer treatment.
  • Cathy O’Neil on how looking harder for crimes by any group (such as immigrants) is bound to increase the crime rate — if a spurious increase wasn’t the aim, you’d need to be careful about interpreting the data.
March 9, 2017

Causation, correlation, and gaps

It’s often hard to establish whether a correlation between two variables is cause and effect, or whether it’s due to other factors.  One technique that’s helpful for structuring one’s thinking about the problem is a causal graph: bubbles for variables, and arrows for effects.

I’ve written about the correlation between chocolate consumption and number of Nobel prizes for countries.  The ‘chocolate leads to Nobel Prizes’ hypothesis would be drawn like this:


One of several more-reasonable alternatives is that variations in wealth explain the correlation, which looks like


As another example, there’s a negative correlation between the number of pirates operating in the world’s oceans and atmospheric CO2 concentration.  It could be that pirates directly reduce atmospheric CO2 concentration:


but it’s perhaps more likely that both technology and wealth have changed over time, leading to greater CO2 emissions and also to nations with the ability and motivation to suppress piracy:


The pictures are oversimplified, but they still show enough of the key relationships to help with reasoning.  In particular, in these alternative explanations, there are arrows pointing into both the putative cause and the effect. There are arrows from the same origin into both ‘chocolate’ and ‘Nobel Prizes’; there are arrows from the same origins into both ‘pirates’ and ‘CO2’.  Confounding, the confusion of relationships that leads to causes not matching correlations, requires arrows into both variables (or selection based on arrows out of both variables).
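As a rough illustration of the ‘arrows into both variables’ picture, here is a small simulation sketch. Everything in it is made up for the purpose: the variable names, the effect sizes, and the number of simulated units are assumptions, not estimates from real chocolate or Nobel data. ‘Wealth’ has an arrow into both ‘chocolate’ and ‘nobel’, and there is no arrow between those two.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def residuals(ys, xs):
    """Residuals from a least-squares straight-line fit of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


rng = random.Random(2017)   # fixed seed so the sketch is repeatable
n = 5000                    # hypothetical number of simulated units

# Wealth causes both variables; chocolate has no effect on nobel.
wealth = [rng.gauss(0, 1) for _ in range(n)]
chocolate = [w + rng.gauss(0, 1) for w in wealth]
nobel = [w + rng.gauss(0, 1) for w in wealth]

raw = pearson(chocolate, nobel)
adjusted = pearson(residuals(chocolate, wealth), residuals(nobel, wealth))

print(f"correlation ignoring wealth:      {raw:.2f}")
print(f"correlation adjusting for wealth: {adjusted:.2f}")
```

In this made-up setup the raw correlation is substantial, and it more or less disappears once the shared cause is adjusted for. That only works because the simulation has a variable with arrows into both ends.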

So, when we see a causal hypothesis like this one:


and ask if there’s “really” a gender pay gap, the answer “No” requires finding a variable with arrows into both gender and pay.  Which in your case you have not got. The pay gap really is caused by gender.

There are still interesting and important questions to be asked about mechanisms. For example, consider this graph


We’d like to know how much of the pay gap is direct underpayment, how much goes through the mechanism of women doing more childcare, and how much goes through the mechanism of occupations with more women being  paid less.  Information about mechanisms helps us think about how to reduce the gap, and what the other costs of reducing it might be.  The studies I’ve seen suggest that all three of these mechanisms do contribute, so even if you think only the direct effects matter there’s still a problem.

You can also think of all sorts of things and stuff I’ve left out of that graph, and you could put some of them back in:


But you’re still going to end up with a graph where there are only arrows out of gender.  Women earn less, on average, and this is causation, not mere correlation.

March 8, 2017


  • “Exploding boxplots”: although a boxplot is a lot better than just showing a mean, it’s usually worse than showing the data
  • The US state of Michigan used an automated system to detect unemployment benefit fraud. Late last year, an audit of 22427 cases of fraud overturned 93% of them! Now, a class-action lawsuit has been filed (PDF), giving (a one-sided view of) more of the details.
  • StatsChat has been saying for quite some time that people shouldn’t be making generalisations about road crash rates without evaluating the statistical evidence for increases or decreases.  It’s good to see someone doing the analysis: the Ministry of Transport has a big long report (PDF, from here) including (p37)[updated link]

    110. However, since 2013 the fatality rate [and] injury rate has begun to increase. We conducted statistical tests (Poisson) to see whether this increase was more than natural variation, and found strong evidence that the fatality and injury rates are actually rising.

  • Fascinating blog by John Grimwade, an infographics (as opposed to data visualisation) expert (via Kieran Healy)
  • “Not only does Google, the world’s preeminent index of information, tell its users that caramelizing onions takes “about 5 minutes”—it pulls that information from an article whose entire point was to tell people exactly the opposite.”  Another problem with Google’s new answer box, less serious than the claims about a communist coup in the US, but likely to be believed by more people.

Yes, November 19


The graph is from a Google Trends search for “International Men’s Day”.

There are two peaks. In the majority of years, the larger peak is on International Women’s Day, and the smaller peak is on the day itself.

March 7, 2017

The amazing pizzachart

From YouGov (who seem to already be regretting it).


This obviously isn’t a pie chart, because the pieces are the same size but the numbers are different. It’s not really a graph at all; it’s an idiosyncratically organised, illustrated table.  It gets worse, though. The pizza picture itself isn’t doing any productive work in this graphic: the only information it conveys is misleading. There’s a clear impression given that particular ingredients go together, when that’s not how the questions were asked. And as the footnote says, there are a lot of popular ingredients that didn’t even make it on to the graphic.



March 6, 2017

Cause of death

In medical research we distinguish ‘hard’ outcomes that can be reliably and objectively measured (such as death, blood pressure, activated protein C concentrations) from ‘soft’ outcomes that depend on subjective reporting.  We also distinguish ‘patient-centered’ or ‘clinical’ or ‘real’ outcomes that matter directly to patients (such as death, pain, or dependency) from ‘surrogate’  or ‘intermediate’ outcomes that are biologically meaningful but don’t directly matter to patients.  ‘Death’ is one of the few things we can measure that’s on both lists.

Cause of death, however, is a much less ideal thing to measure.  If some form of cancer screening makes it less likely that you die of that particular type of cancer but doesn’t increase how long you live, it’s going to be less popular than if it genuinely postpones death.  What’s more surprising is that cause of death is hard to measure objectively and reliably. But it is.

Suppose someone smokes heavily for many years and as a result develops chronic lung disease, and as a result develops pneumonia, and as a result is in hospital, has a cardiac arrest due to a medical error, and dies. Is the cause of death ‘cardiac arrest’ or ‘medical error’ or ‘pneumonia’ or ‘COPD’ or ‘smoking’?  The best choice is a subjective matter of convention: what’s the most useful way to record the primary cause of death? But even with a convention in place, there’s a lot of work to make sure it is followed.  For example, a series of research papers in Thailand estimated that more than half the deaths from the main causes (eg stroke, HIV/AIDS, road traffic accidents, types of heart disease) were misclassified as less-specific causes, and came up with a way to at least correct the national totals.

In Western countries, things are better on average. However, as Radio NZ described today, in Australia (and probably NZ) deaths of people with intellectual disability tend to be reported as due to their intellectual disability rather than to whatever specific illness or injury they had.  You can see why this happens, but you should also be able to see why it’s not ideal in improving healthcare for these people.  Listen to the Radio NZ story; it’s good. If you want a reference to the open-access research paper, though, you won’t find it at Radio NZ. It’s here.



  • Newshub had a story about the Accommodation Survey not specifically excluding people in hotels who were there as emergency housing.  Nerds across the NZ political spectrum (eg, me, Keith Ng, and Eric Crampton) were unimpressed with this story. Eric actually wrote a blog post, so I’ll refer you there for more details.
  • Russell Brown wrote about the overuse of workplace drug tests that aren’t tests for impairment.
  • A research paper in PLoS One shows that newspapers write about news.  That is, they write ‘breakthrough’ stories about new treatments but give a lot less prominence to later studies that are less favorable.  Interestingly, this didn’t apply to ‘lifestyle’ stories, where ‘coffee is Good/Bad this week’ can always find a place.
  • The Herald had a story last week about “The $2m+ price tag for a top decile Auckland education.” In contrast to their story two years ago, this doesn’t make any attempt to estimate the premium for the top school zones. That is, if a family with school-age kids doesn’t live in the ‘Double Grammar Zone’ and pay $2 million for a house, they’ll still have to live somewhere and pay something for a house.  In the 2015 story, the cost of a house just outside a top school zone was about 20% lower than just inside. Even that probably overestimates the school premium, but the total cost of a house obviously does.
February 26, 2017

This bit is even nerdier

Nick Smith, the Environment Minister, on Stuff:

This bit is very nerdy. We are saying at 540 E.coli the risk is one in 20 (of getting sick).  But that one in 20 is at the 95 per cent confidence level. So there is an extra level of cautiousness. Even if you put 20 people in water and it has a 540 E.coli level it’s not saying on average one person gets sick out of 20. It’s saying one in 20 of 20 groups will have one in 20 get sick.

No, it’s not saying that.

Let’s step back a bit.  First, why is such a baroque description of the risk, less than 1/20 95% of the time, even being used?

As Dr Smith does convey in the interview, the problem is that risk varies. There are two sources of uncertainty if you go swimming in the Hutt River. First, the bacteria count varies over time (with rain, temperature, and whatever else), so you don’t know what it will be at precisely the time you stick your head under.  Second, if you end up swallowing some Campylobacter you still only have a chance of getting infected.

Summarising these two types of uncertainty in a single number is hard. One sensible approach is to pick a risk, such as 1/20.  If we want to say that the chance of getting infected is less than 1/20, we need to handle both the variation in shittiness of the water, and the basically random risk of infection for a given level of contamination.

Suppose we imagine a slightly implausible extreme sports facility that sends 100 backpackers on one-day swimming parties each day.  On 95% of days (347 days per year), they’d expect fewer than 5 to get infected. On 5% of days (18 days per year) they’d expect more than 5 to get infected, but it couldn’t possibly be more than 100.  So the total number of infections across the year is less than 5*347+100*18, or 10% of swimmers. That sounds bad, but it’s an extremely conservative upper bound.  In fact, when the risk is less than 5% it’s often much less, and when it’s greater than 5% it’s usually nowhere near 100%.  To say more, though, you’d need to know more about how the risk varies over time.
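The arithmetic behind that upper bound can be written out as a short sketch. The only inputs are the numbers in the paragraph above (100 swimmers a day, a 5% risk threshold, 95% of days under it); the variable names are just for illustration.

```python
SWIMMERS_PER_DAY = 100
DAYS_PER_YEAR = 365

# Split the year into days under the 1/20 risk threshold and days over it.
good_days = round(0.95 * DAYS_PER_YEAR)   # risk below 5% on these days
bad_days = DAYS_PER_YEAR - good_days      # risk above 5% on these days

# Worst case: exactly 5 infections on every good day,
# and all 100 swimmers infected on every bad day.
worst_case_infections = 5 * good_days + SWIMMERS_PER_DAY * bad_days
total_swimmers = SWIMMERS_PER_DAY * DAYS_PER_YEAR
upper_bound = worst_case_infections / total_swimmers

print(good_days, bad_days)                 # 347 18
print(worst_case_infections)               # 3535
print(f"{upper_bound:.1%} of swimmers")    # 9.7% of swimmers
```

Nearly all of the bound comes from assuming that everyone gets infected on the bad days, which is why it is so conservative.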

There are statistical models for all of this, and since everyone seems to be using the same models we can just stipulate that they’re reasonable.  The detailed report is here (PDF), and Jonathan Marshall, who’s a statistician who knows about this sort of thing, has scripts to reproduce some calculations here.

Using those models, a ‘yellow’ river, with risk less than 1/20 95% of the time, actually has risk less than 1/1000 about half the time, but occasionally has risks well over 10%.  Our imaginary extreme sports facility will have about 3 infections per 100 customers, averaged over the year. About half these infections will happen on the worst 5% of days.

So, the 1/20 of 1/20 level doesn’t by itself guarantee anything better than 10% infection risk for people swimming on randomly chosen days, but combined with knowledge of the actual bacteria distribution in NZ rivers, seems to work out at about a 3%  risk averaged over all days.  Also, if you can detect and avoid the worst few days each year, your risk will be reduced quite a lot.
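To see what ‘about 3%, with half of it on the worst 5% of days’ implies, here is a back-of-envelope sketch. It uses only the rounded figures quoted above (3 infections per 100 swimmers, half of them on the worst 18 days) and the imaginary 100-swimmers-a-day facility, so the outputs are rough averages implied by those figures, not results from the models in the report.

```python
SWIMMERS_PER_DAY = 100
DAYS_PER_YEAR = 365
WORST_DAYS = 18                      # the worst 5% of days
OTHER_DAYS = DAYS_PER_YEAR - WORST_DAYS

AVERAGE_RISK = 0.03                  # about 3 infections per 100 swimmers
SHARE_ON_WORST_DAYS = 0.5            # about half the infections

infections_per_year = AVERAGE_RISK * SWIMMERS_PER_DAY * DAYS_PER_YEAR
on_worst_days = SHARE_ON_WORST_DAYS * infections_per_year
on_other_days = infections_per_year - on_worst_days

# Average risk per swimmer on each kind of day.
risk_worst = on_worst_days / (WORST_DAYS * SWIMMERS_PER_DAY)
risk_other = on_other_days / (OTHER_DAYS * SWIMMERS_PER_DAY)

print(f"infections per year:       {infections_per_year:.0f}")
print(f"average risk, worst days:  {risk_worst:.0%}")
print(f"average risk, other days:  {risk_other:.1%}")
```

On those figures, skipping the worst 18 days would roughly halve a swimmer’s average risk.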

February 20, 2017

Meat for men?

Q: It’s nice to see a balanced nutrition story in the Herald today, isn’t it?

A: Um.

Q: They talk about benefits and drawbacks of a vegan diet.

A: Um.

Q: It’s impressive that just one serving of butter a day can double your risk of diabetes, isn’t it?

A: <sigh>

Q: Isn’t that what the research paper says?

A: It’s a bit hard to find, since they don’t link and don’t give any researcher names.

Q: Did you find it in the end?

A: Yes. And that’s not really what it found.

Q: Is this the weird yoghurt thing?

A: Yes, that’s part of it.  They found a higher risk in people who ate more butter or more cheese, a lower risk in people who ate more whole-fat yoghurt, and “No significant associations between red meat, processed meat, eggs, or whole-fat milk and diabetes were observed.”

Q: That doesn’t sound like a systematic effect of meat. Or animal products.

A: And there wasn’t any association at the start of the study, only later on.

Q: So it’s eating butter in a research study that’s dangerous?

A: Could be.

Q: Ok, what about the bit where men need meat for their sons to have children?

A: No men in the study.

Q: Mice?

A: No, smaller.

Q: Zebrafish?

A: Smaller.

Q: Um. Fruit flies?

A: Yes.

Q: Do fruit flies even eat meat?

A: No, there wasn’t any meat in the study either. The flies got higher or lower amounts of yeast in their diet.

Q: But don’t vegans eat yeast?

A: I’m not sure that’s the biggest problem with extrapolating this to Men Need Meat.

February 15, 2017

Another way not to improve your Lotto chances

I was on Radio LIVE Drive earlier this evening, talking about lotto (way to be stereotyped). The underlying story is on Stuff:

A Nelson Lotto player who won more than $100,000 playing the same numbers 12 times on the same ticket says he often picks the same numbers multiple times.

“So that when my numbers do come up, I can win a greater share of the prize.”

The player won 12 second division prizes on a single ticket bought from Nelson’s Whitcoulls on Saturday, winning $9481 on each line, totalling $113,772.

There’s nothing wrong with this as an inexpensive entertainment strategy. As a strategy for getting better returns from Lotto it can’t possibly work, so the question is whether it doesn’t have any effect or whether it makes the expected return worse.

In this case, it’s fairly easy to see the expected return is worse. If you play 12 lines of Lotto every week, with 12 different sets of numbers, you’ll average one week with a Division 2 win every thousand years.  If you use the same set of numbers 12 times each week, you’ll average one week with 12 Division 2 wins every twelve thousand years. You might think this factor of 12 in the odds is cancelled out by the higher winnings, but that’s only partly  true.
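The ‘thousand years’ figure can be checked with a short sketch. It assumes the usual Lotto format (6 numbers chosen from 40, with Division 2 meaning five of the six winning numbers plus the bonus ball) and one draw per week; those details are assumptions on my part rather than something stated in the story.

```python
from math import comb

BALLS = 40            # assumed: numbers 1 to 40
LINE_SIZE = 6         # assumed: six numbers per line
LINES_PER_WEEK = 12
WEEKS_PER_YEAR = 52   # assumed: one draw per week

total_lines = comb(BALLS, LINE_SIZE)

# Division 2 (assumed rule): 5 of the 6 winning numbers, plus the bonus ball.
div2_lines = comb(LINE_SIZE, 5) * 1
p_div2 = div2_lines / total_lines

# 12 different lines: about 12 chances per week (12 * p is the expected
# number of winning lines, which is very nearly the chance of a winning week).
years_different = 1 / (LINES_PER_WEEK * p_div2 * WEEKS_PER_YEAR)

# The same line 12 times: one chance per week, but 12 wins when it comes up.
years_same = 1 / (p_div2 * WEEKS_PER_YEAR)

print(f"Division 2 odds: 1 in {1 / p_div2:,.0f}")
print(f"12 different lines: a winning week about every {years_different:,.0f} years")
print(f"same line 12 times: a winning week about every {years_same:,.0f} years")
```

The expected number of winning lines per year is the same under both strategies. The difference is only in how the wins are bunched.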

This week there were 25 winning Division 2 tickets, which each got an equal share of the $237,000 Division 2 prize pool. The gentleman in question held 12 of those 25 winning tickets, and so got about half the pool.  If he’d bought that set of numbers and 11 others he would have held 1 of 14 winning tickets and won, not 1/12 as much, but about 1/7th as much.   By increasing the number of winning tickets, he reduced the prize for each of his tickets, and so his strategy has slightly lower expected return than picking 12 different sets of numbers.
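Here is the same prize-sharing arithmetic as a sketch, using the rounded $237,000 pool and the 25 winning tickets quoted above (so the per-line figure comes out a dollar or so different from the $9481 in the story). The ‘what if’ branch assumes the other 13 winning tickets would have been the same either way.

```python
PRIZE_POOL = 237_000       # Division 2 pool, rounded as quoted
WINNING_TICKETS = 25       # total winning lines this week
HIS_LINES = 12             # identical lines held by the Nelson player

other_winners = WINNING_TICKETS - HIS_LINES

# What happened: 12 of the 25 equal shares.
actual_winnings = PRIZE_POOL * HIS_LINES / WINNING_TICKETS

# What if: one winning line (plus 11 losing ones), sharing with the others.
one_line_winnings = PRIZE_POOL * 1 / (other_winners + 1)

ratio = one_line_winnings / actual_winnings

print(f"actual winnings:       ${actual_winnings:,.0f}")
print(f"with one winning line: ${one_line_winnings:,.0f}")
print(f"ratio: about 1/{1 / ratio:.1f}")
```

Buying 12 copies of the line multiplied his winnings by about 7 rather than 12, because his own extra tickets diluted each share.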

On the other hand, these calculations are a bit beside the point. If you play Lotto for the expected return you’re doing it wrong.