Posts written by Thomas Lumley (1805)


Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient

July 1, 2016

Too good to check

Twitter is a good way to get news and rumours of news from across the world, but it also exposes you to a lot of items that are either deliberate fraud or just ‘too good to check’.  Here’s one: it claims to compare maps of the ‘Leave’ vote with BSE prevalence.


It’s clear what idea the author was going for, and it’s also clear that it has to be unfounded as well as malicious. The BSE prions weren’t preferentially consumed in farming areas — people in cities eat hamburgers, too — and nvCJD is not only very rare, but primarily affects movement rather than political beliefs.

However, it’s not inconceivable that farming areas which experienced losses from BSE and then later from foot-and-mouth would be anti-government and possibly anti-European. Some correlation, even a strong correlation, would be possible for that reason.

If you cared about the truth, there’s a simple two-word Google search you could do before passing on the maps: BSE Scotland. Yes, there was mad cow disease north of the border. You could also note the implausibility of having exactly the same map layout, and a colour scheme that was just a greyscale version of the modern one.

Thinking about the numbers

More students cheat in exams, and most are in Auckland, says the Herald.

This story combines two frequent StatsChat themes: denominators, and being careful about what was actually measured.

Auckland, as we have noted before, has a higher population than other regions.  As you will recall, it’s about a third of the NZ population, so it looks like making up about 50% of those caught cheating is excessive. That’s the sort of work that the paper might do for you — as well as checking if 1/3 is still about right as the proportion of students sitting NCEA exams (it seems to be).

On similar lines, if you look just at the totals without denominators, you’ll miss some notable values.  Northland had 25 students caught cheating, which is more than the much-larger Waikato and Canterbury regions. You’d expect about 10 at the national average rate and about 15 at the Auckland rate.

Much more important is the question of what proportion of those cheating were caught — to say things like

Again Central Plateau and the Cook Islands had no cheaters, and Wairarapa and Southland students were also honest

or to draw conclusions about trends over time assumes that you’re not missing many.

The story says

NZQA received 1,314,207 entries in NCEA and New Zealand Scholarship examinations from 145,464 students last year.

The 290 attempts at cheating that were caught come to just under 0.2% of students and just over 0.02% of exams.  Maybe I’m just cynical, but I’d be surprised if the real rate was as low as one exam in a thousand, let alone five times lower.
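The rates quoted above follow directly from the story's own figures; a quick back-of-envelope check:

```python
# Back-of-envelope check of the cheating rates, using the figures
# quoted in the story: 290 caught, 145,464 students, 1,314,207 exam entries.
caught = 290
students = 145_464
entries = 1_314_207

student_rate = caught / students  # ~0.00199: just under 0.2% of students
entry_rate = caught / entries     # ~0.00022: just over 0.02% of entries

print(f"{student_rate:.3%} of students, {entry_rate:.4%} of exam entries")
```

Either way you slice it, the detected rate is far below one attempt per thousand exams.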

June 27, 2016

Graph of the week

From the BBC coverage of the Brexit referendum. With great power comes great responsibility.


  • In Brexit, the YouGov estimate that I mentioned last week was pretty accurate, but the result genuinely was too close to call.  The real-time forecasts from Chris Hanretty at the University of East Anglia seemed to work well.
  • If you combine turnout estimates with voting estimates by age group, the proportion of 18-24 year olds voting Remain (75% of the 36% turnout) was less than the proportion of 65+ year olds (39% of the 83% turnout).  Turnout matters.
  • You’re much more likely to survive a cardiac arrest on TV than in real life.
  • What theoretical physicists (statisticians, etc.) look like when working (Dr Katie Mack, astrophysics/cosmology). No lab coats; no brightly coloured Erlenmeyer flasks.
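The turnout arithmetic in the second bullet is worth making explicit: what matters is the share of each *whole* age group that voted Remain, which is the Remain share among voters times the turnout.

```python
# Share of each whole age group that voted Remain = Remain share x turnout,
# using the figures quoted in the bullet above.
young = 0.75 * 0.36  # 18-24s: 75% Remain of a 36% turnout -> 27% of the age group
old = 0.39 * 0.83    # 65+:   39% Remain of an 83% turnout -> about 32%

print(f"18-24 Remain share of whole group: {young:.1%}, 65+: {old:.1%}")
```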
June 23, 2016

Or the other way around

It’s a useful habit, when you see a causal claim based on observational data, to turn the direction around: the story says A causes B, but could B cause A instead? People get annoyed when you do this, because they think it’s silly. Sometimes, though, that is what is happening.

As a pedestrian and public transport user, I’m in favour of walkable neighbourhoods, so I like seeing research that says they are good for health. Today, Stuff has a story that casts a bit of doubt on those analyses.

The researchers used Utah driver’s-licence data, which again included height and weight, to divide all the neighbourhoods in Salt Lake County into four groups by average body mass index. They used Utah birth certificates, which report mother’s height and weight, and looked at 40,000 women who had at least two children while living in Salt Lake County during the 20-year study period.  Then they looked at women who moved from one neighbourhood to another between the two births. Women with higher BMI were more likely to  move to a higher-BMI neighbourhood.

If this is true in other cities and for people other than mothers with new babies, it’s going to exaggerate the health benefits of walkable neighbourhoods: there will be a feedback loop where these neighbourhoods provide more exercise opportunity, leading to lower BMI, leading to other people with lower BMI moving there.   It’s like with schools: suppose a school starts getting consistently good results because of good teaching. Wealthy families who value education will send their kids there, and the school will get even better results, but only partly because of good teaching.

June 22, 2016

Eat up your doormats

Q: Did you see food allergies are caused by diet?

A: That makes sense, I suppose.

Q: Does it make sense that low-fibre diets are why people get peanut allergy more now?

A: Ah. No.

Q: Why not?

A: Because fibre in the typical diet hasn’t changed much in recent years and peanut allergies have become much more common.

Q: It could still be true that adding more fibre would stop people getting peanut allergies, though?

A: Could be.

Q: And that’s what the research found?

A: Up to a point.

Q:  Mice?

A: Mice.

Q: But peanut allergy and dietary fibre?

A: Yes, pretty much. And a plausible biological reason for how it might work.

Q: So it’s worth trying in humans?

A: Probably, though getting little kids to eat that much fibre would be hard.

Q: But the story just says “a simple bowl of bran and some dried apricots in the morning”.

A: Sadly, yes.

Q: So how much fibre did they give the mice?

A: They compared a zero-fibre diet to 35% fibre.

Q: Is 35% a lot?

A: Well, it’s more than All-Bran, and that was their whole diet.

Q: A more reasonable dose might still work, though?

A: Sure. But you wouldn’t want to assume it did before the trials happened.



  • The Data Journalism Awards from the Global Editors Network (scroll down past the ceremony stuff to get the links to the award-winners). And a commentary from Simon Rogers.
  • Colorado legalised cannabis recently. They have data on teenage cannabis use over time. It hasn’t gone up, probably because teens who wanted pot could already get it.
    I still haven’t seen anything on the two more important potential health impacts: whether car crashes go up or down, and whether alcohol consumption goes up or down.
  • A nice detailed explanation of how YouGov is doing its predictions for the Brexit referendum.  At the moment they are robustly unsure, estimating 48%-53% vote for “Leave”.  They do discuss how badly the last election polling went, and conclude  “if we have a problem, it will probably not be the same problem as last time.” (via Andrew Gelman)

Making hospital data accessible

From the Guardian

The NHS is increasingly publishing statistics about the surgery it undertakes, following on from a movement kickstarted by the Bristol Inquiry in the late 1990s into deaths of children after heart surgery. Ever more health data is being collected, and more transparent and open sharing of hospital summary data and outcomes has the power to transform the quality of NHS services further, even beyond the great improvements that have already been made.

The problem is that most people don’t have the expertise to analyse the hospital outcome data, and that there are some easy mistakes to make (just as with school outcome data).

A group of statisticians and psychologists developed a website that tries to help, for the data on childhood heart surgery.  Comparisons between hospitals in survival rate are very tempting (and newsworthy) here, but misleading: there are many reasons children might need heart surgery, and the risk is not the same for all of them.

There are two, equally important, components to the new site. Underneath, invisible to the user, is a statistical model that predicts the surgery result for an average hospital, and the uncertainty around the prediction. On top is the display and explanation, helping the user to understand what the data are saying: is the survival rate at this hospital higher (or lower) than would be expected based on how difficult their operations are?
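The article doesn't publish the site's model, but the general idea behind this kind of risk adjustment can be sketched: give each operation a predicted risk of death, sum those risks over a hospital's own case mix to get an expected count, and compare with what was observed. The per-operation risks below are invented purely for illustration.

```python
# Generic sketch of case-mix adjustment (illustrative, not the site's model):
# each operation carries a predicted risk of death; summing those risks over a
# hospital's own case mix gives the expected number of deaths to compare against.
predicted_risks = [0.01, 0.02, 0.15, 0.05, 0.01]  # invented per-operation risks
observed_deaths = 1

expected_deaths = sum(predicted_risks)     # 0.24 expected for this mix of operations
ratio = observed_deaths / expected_deaths  # >1: worse than expected; <1: better

print(f"observed {observed_deaths}, expected {expected_deaths:.2f}, O/E {ratio:.2f}")
```

The point of the comparison is that a hospital taking on riskier operations gets a higher expected count, so a raw survival-rate league table would unfairly penalise it.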

June 19, 2016

Cheese addiction hoax again

Two more news sites have fallen for the cheese addiction story.

A recap for those who missed the earlier episodes:

  • There was a paper using the Yale Food Addiction Scale that evaluated a lot of foods for (alleged) addictiveness.
  • Pizza came top.
  • Someone (we don’t know who) pushed a story to various media sites saying the research had found cheese-based foods were the most addictive (false), and that this was because of milk protein fragments called casomorphins (which aren’t even mentioned in the research paper, as you can check for yourself, and which haven’t been shown to be addictive even in mice).
  • The people behind the research have disclaimed these weird interpretations of what they found. Here’s a detailed story



Are stay-at-home dads the end of civilisation?

In a Herald (Daily Telegraph) story this week

When confronted, he confessed he’d been having an affair with a single mother he met at the school gates.

“She was vulnerable,” says Janet. “I guess he liked that. It made him feel like a hero.”

Her experience sadly chimes with the findings of a new study of more than 2,750 young married people by the University of Connecticut, which showed that men who are financially dependent on their spouses are the most likely to be unfaithful. In fact, the bigger the earning gap, the more likely they are to have an affair, with those who rely solely on their wives for their income the biggest cheats.

It turns out there are things wrong with this story, even restricting to statistical issues. To find them, you’d need to look at the research paper, which you probably can’t, because it’s paywalled.  I should note that these aren’t necessarily all criticisms of the paper for the question it was answering (which was about motivations for fidelity/infidelity), but they are for how it has been used — to turn a couple of anecdotes into a Deeply Worrying Social Trend.

First, the income data. The researcher writes

I calculated the measure from respondents’ and their spouse’s total income earned during the previous year from wages, salaries, commissions, and tips. I excluded self-employment income, because the division between labor income and business income is often measured with substantial error.

That means the group described as ‘rely solely on their wives for income’ includes all the self-employed men, no matter how much they earn. There may well be more of them than voluntarily unemployed house-husbands.

Second, a somewhat technical point, which I think I last covered in 2012, with two posts on a story about mid-life crisis in chimpanzees.

Here’s a summary of the model given in the research paper


Notice how the curved line for men bends away from the straight line for women on both sides? And that the deviation from a straight line looks pretty symmetric? That’s forced by the statistical model.

Conceptually, the left and right sides of this graph show quite different phenomena. The right side says that the decrease in infidelity with higher relative income flattens out in men, but not in women. The left side says that the increase with lower relative income accelerates in men. The model forces these to be the same, and uses the same data to estimate both.

Since there are more men with positive than negative relative income, most of the actual evidence for men is on the right-hand side, but the newspaper story is looking at the left-hand side.
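The symmetry being “forced” is just algebra: if the sex difference enters through a quadratic term, the model’s deviation from the linear trend at relative income +x and -x is c·x² in both cases. A toy check, with made-up coefficients:

```python
# With a quadratic term, the bend away from the linear part is c*x**2,
# which is identical at +x and -x. Coefficients are made up for illustration.
a, b, c = 0.10, -0.05, 0.02

def bend(x):
    """Deviation of the quadratic fit from its own linear part at x."""
    quadratic = a + b * x + c * x ** 2
    linear = a + b * x
    return quadratic - linear

print(bend(0.8), bend(-0.8))  # the same bend on both sides, by construction
```

So data on one side of zero pins down the curvature on the other side, whether or not the two phenomena really behave the same way.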