Posts written by Thomas Lumley (1302)


Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.

October 18, 2014


1. There’s a conference coming up in Canada on “Fairness, Accountability, and Transparency in Machine Learning”, a topic I wrote a little about for the Listener.

Questions to the machine learning community include:

  • How can we achieve high classification accuracy while eliminating discriminatory biases? What are meaningful formal fairness properties?
  • How can we design expressive yet easily interpretable classifiers?
  • Can we ensure that a classifier remains accurate even if the statistical signal it relies on is exposed to public scrutiny?
  • Are there practical methods to test existing classifiers for compliance with a policy?


2. From Nate Silver at

Democrats may not be wrong. The polls could very well be biased against their candidates. The problem is that the polls are just about as likely to be biased against Republicans, in which case the GOP could win more seats than expected.

This sort of slowly varying bias is probably one of the reasons the NZ election polls weren’t very good: not only did they have more variability than you’d expect given the sample sizes, but averaging didn’t cancel out much of the error.

3. Yesterday was Spreadsheet Day. Flee in terror! (via @kara_woo)

4. An informative visualisation of what the world eats, over time. (via Harkanwal Singh)
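To see why averaging doesn’t help much with the kind of bias described in item 2, here is a rough sketch. It assumes a deliberately simple model that is not in the original post: every poll shares one bias, drawn once with standard deviation `shared_bias_sd`, plus its own independent sampling error. The function name and all the numbers are illustrative.

```python
import math

def se_of_poll_average(n_polls, sample_size, shared_bias_sd=0.0, p=0.5):
    """Standard error of the average of n_polls polls (simplified model).

    Assumption: each poll = truth + one shared bias + its own sampling error.
    """
    sampling_var = p * (1 - p) / sample_size  # variance of one poll's sampling error
    # Independent sampling error shrinks as polls are added; the shared bias does not.
    return math.sqrt(shared_bias_sd ** 2 + sampling_var / n_polls)

one_poll = se_of_poll_average(1, 1000)
ten_polls_no_bias = se_of_poll_average(10, 1000)
ten_polls_with_bias = se_of_poll_average(10, 1000, shared_bias_sd=0.02)  # made-up bias

print(f"one poll, no shared bias:        {one_poll:.4f}")
print(f"ten polls, no shared bias:       {ten_polls_no_bias:.4f}")
print(f"ten polls, shared bias sd 0.02:  {ten_polls_with_bias:.4f}")
```

In this toy model the error of the average can never drop below the shared bias, however many polls go into it.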


When barcharts shouldn’t start at zero

Barcharts should almost always start at zero. Almost always.

Randal Olson has a very popular post on predictors of divorce, based on research by two economists at Emory University. The post has a lot of barcharts like this one


The estimates in the research report are hazard ratios for dissolution of marriage. A hazard ratio of zero means a factor appears completely protective — it’s not a natural reference point. The natural reference point for hazard ratios is 1: no difference between two groups, so that would be a more natural place to put the axis than at zero.

A bar chart is also not good for showing uncertainty. The green bar has no uncertainty, because the others are defined as comparisons to it, but the other bars do. The more usual way to show estimates like these from regression models is with a forest plot:


The area of each coloured box is proportional to the number of people in that group in the sample, and the line is a 95% confidence interval.  The horizontal scale is logarithmic, so that 0.5 and 2 are the same distance from 1 — otherwise the shape of the graph would depend on which box was taken as the comparison group.
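A small sketch of the point about the logarithmic scale. The function name is made up for illustration; the hazard ratio 1.46 is the one quoted in this post.

```python
import math

def log_scale_position(hazard_ratio):
    """Position on a log axis, with the reference value 1 sitting at zero."""
    return math.log(hazard_ratio)

half = log_scale_position(0.5)
double = log_scale_position(2.0)
print(abs(half), abs(double))  # equal distances from 1 on the log scale

# Swapping the comparison group turns a hazard ratio hr into 1/hr,
# which only flips the sign of its position on the log axis.
hr = 1.46
original = log_scale_position(hr)
flipped = log_scale_position(1 / hr)
print(original, flipped)
```

On a linear axis, by contrast, 0.5 is half a unit from 1 and 2 is a whole unit away, so the picture would change with the choice of comparison group.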

Two more minor notes: first, the hazard ratio measures the relative rate of divorces over time, not the relative probability of divorce, so a hazard ratio of 1.46 doesn’t actually mean 1.46 times more likely to get divorced. Second, the category of people with total wedding expenses over $20,000 was only 11% of the sample — the sample is differently non-representative than the samples that lead to bogus estimates of $30,000 as the average cost of a wedding.
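To illustrate the first note: under proportional hazards, the probability of still being married at a given time in the higher-risk group is the baseline probability raised to the power of the hazard ratio. The sketch below uses that relationship with made-up baseline divorce probabilities, which are not from the research report; only the 1.46 is.

```python
def divorce_probability(baseline_probability, hazard_ratio):
    """Probability of divorce by a fixed time under proportional hazards.

    Uses S1(t) = S0(t) ** hazard_ratio, where S is the probability of
    still being married at time t.
    """
    baseline_survival = 1 - baseline_probability
    return 1 - baseline_survival ** hazard_ratio

ratios = {}
for baseline in (0.05, 0.30, 0.60):  # hypothetical baseline probabilities
    p = divorce_probability(baseline, 1.46)
    ratios[baseline] = p / baseline
    print(f"baseline {baseline:.2f}: probability {p:.3f}, ratio {ratios[baseline]:.2f}")
```

The ratio of probabilities is always below 1.46 and shrinks as divorce becomes more common in the comparison group; it only approaches 1.46 when divorce is rare.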

October 16, 2014

Do you feel lucky?

I’m glad to say it’s been quite a while since we’ve had this sort of rubbish from the NZ papers, but it’s still going across the Tasman (the Sydney Morning Herald)

If you’re considering buying a lottery ticket, you’d better make sure it’s from either Gladesville or Cabramatta, which are now officially Sydney’s luckiest suburbs when it comes to winning big. 

NSW Lotteries has released statistics that show the luckiest suburbs across all lotto games in NSW and the ACT, as well as other tips for amateurs hoping to ring their bosses tomorrow morning to say they wouldn’t be coming in to work. 

Of course, the ‘luckiest’ suburbs are nothing of the sort: just the ones where the most money is lost on the lotteries. Cabramatta has improved a lot in recent years, but it’s still not the sort of place you’d expect to see called ‘lucky’.

October 14, 2014

Does it make any more sense this time?

From the Herald today

“The average annual weekly wage increase of $28.06 was not enough to offset a $30,000 increase in the national median house price and an increase in the average mortgage interest rate from 5.52% to 5.86%,” the survey found.

We did this one last time, in June. Today’s story is better in that it links to the Massey report. It could still do with a bit of interpretation.

Quick, without a calculator, roughly what would be a large enough weekly wage increase to offset a $30,000 increase in the national median house price?  Would we need to up the $28.06 by ten percent, or  ten dollars, or a factor of ten?
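For anyone who does want the calculator, here is one rough way to do the sum. The assumptions are mine, not the survey’s: the whole extra $30,000 is borrowed at the quoted 5.86%, the loan term is a hypothetical 25 years, and tax, deposits, and the effect of the rate rise on the rest of the mortgage are all ignored.

```python
def weekly_interest_only(extra_loan, annual_rate, weeks_per_year=52):
    """Weekly interest on the extra borrowing, with no principal repaid."""
    return extra_loan * annual_rate / weeks_per_year

def weekly_repayment(extra_loan, annual_rate, years, weeks_per_year=52):
    """Standard level-payment (amortising) loan formula, paid weekly."""
    r = annual_rate / weeks_per_year   # interest rate per week
    n = years * weeks_per_year         # number of weekly payments
    return extra_loan * r / (1 - (1 + r) ** -n)

interest_only = weekly_interest_only(30_000, 0.0586)
repayment = weekly_repayment(30_000, 0.0586, years=25)  # 25-year term is an assumption

print(f"interest only:   ${interest_only:.2f} per week")
print(f"with repayment:  ${repayment:.2f} per week")
```

On these assumptions both figures land within ten or twenty dollars of the $28.06, not a factor of ten away.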

[Update: I should also note that the word "weekly" wasn't in the description of wage increases last time, so this is a definite improvement]

Ada Lovelace Day

October 14 is Ada Lovelace Day, an international celebration of the achievements of women in science, technology, engineering and maths.

New Zealand has (only) three female Professors of Statistics, the top position in our UK-style academic ranking. They work in very different areas of statistics, but with related applications to ecological and environmental monitoring, an area of particular interest in New Zealand.

Going north to south:

  • Marti Anderson is at Massey University in Albany (and was previously at the University of Auckland). Her research is in multivariate analysis — techniques for analysing ecological data on multiple species together, rather than one at a time — mostly applied to marine species
  • Shirley Pledger retired this year from Victoria University. Her research is on capture-recapture methods for counting animals. It’s often impossible to get a complete census of a species even in a limited area, but you can mark the individuals you catch, release them, and observe how often you catch them again. The simplest approaches to estimation are easy but unrealistic; she has worked on more sophisticated and sensible models.
  • Jennifer Brown is head of the Maths & Stats department at the University of Canterbury. Her main statistical research is on sampling techniques for monitoring sparse or patchy populations: either rare animals and plants, or invasive weeds. Sampling systematically or purely at random are both very wasteful; ‘adaptive’ sampling designs allow you to take advantage of finding a clump of your target species without biasing the overall results.
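As an illustration of the ‘easy but unrealistic’ end of capture-recapture, here is the textbook Lincoln-Petersen estimator. It is not one of Shirley Pledger’s models, and the counts are invented. It assumes a closed population in which every animal is equally likely to be caught each time, which is exactly the sort of assumption the more sophisticated models relax.

```python
def lincoln_petersen(marked_first, caught_second, recaptured):
    """Estimate population size from two capture occasions.

    marked_first:  animals caught, marked and released on the first occasion
    caught_second: animals caught on the second occasion
    recaptured:    animals in the second catch that were already marked
    """
    if recaptured == 0:
        raise ValueError("no recaptures: the estimate is undefined")
    # The marked fraction of the second catch estimates the marked
    # fraction of the whole population: recaptured/caught_second = marked_first/N.
    return marked_first * caught_second / recaptured

estimate = lincoln_petersen(marked_first=100, caught_second=80, recaptured=20)
print(estimate)  # 400.0
```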


October 13, 2014

Context from everyday units

From @JohnDonoghue64 on Twitter


From the Guardian, a few years ago

Perhaps, as with metric and imperial measurements, such comparisons should be given convenient abbreviations: SoWs (size of Wales), SoBs (size of Belgium), OSPs (Olympic swimming pools), DDBs (buses) and so on. Thus the Kruger national park in South Africa measures 1 SoW (Daily Telegraph), as do Lesotho (London Evening Standard) and Israel (Times), whereas Lake Nzerakera in Tanzania is 2 SoBs (Observer).

At times the most carefully calibrated calculations can go awry. So we learn that Helmand province in Afghanistan is “four times the size of Wales” (Daily Telegraph, 2 December 2009) only to find a few weeks later that it has apparently shrunk to “the size of Wales” (Daily Telegraph, 29 January 2010).

For the benefit of NZ readers, a badger appears to weigh about the same as three female North Island brown kiwi, two typical merino fleeces, or half a case of Marlborough sav blanc. That should help you get a grasp on the size of the Lindisfarne Gospels.

Herald data blog starts

The Herald’s Data Editor, Harkanwal Singh, announces the online site’s new ‘Data Blog’, with the first new post being a map of NZ internet affordability created by Jonathan Brewer.

This has got to be a Good Thing for data literacy in the local media.

October 12, 2014

Unofficially over arithmetic

From the Herald (from the Washington Post), under the headline “Teens are officially over Facebook” (yes, officially)

Now, a pretty dramatic new report out from Piper Jaffray – an investment bank with a sizable research arm – rules that the kids are over Facebook once and for all, having fled Mark Zuckerberg’s parent-flooded shores for the more forgiving embraces of Twitter and Instagram.

This is based on a Piper Jaffray survey of 7200 people aged 13-19 (in the US, though the Herald doesn’t say that).

It looks as though US teens are leaving Facebook, but they sure aren’t flocking to Twitter, or, really, to Instagram. If you go to a story that gives the numbers, you see that reported Facebook use has fallen 27 percentage points. Instagram has risen only 7 percentage points, and Twitter has fallen by 4.
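The arithmetic the headline skips, using only the three changes quoted above (in percentage points):

```python
# Changes in reported use, in percentage points, as quoted in the post.
changes = {"Facebook": -27, "Instagram": +7, "Twitter": -4}

gains = sum(c for c in changes.values() if c > 0)
losses = -sum(c for c in changes.values() if c < 0)
net_change = sum(changes.values())

print(f"gained: {gains} points, lost: {losses} points, net: {net_change} points")
# gained: 7 points, lost: 31 points, net: -24 points
```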


So, where are they going? They aren’t giving up on social media entirely — although the “None” category wasn’t asked the first time around, it’s only 8 percent in the second survey. It’s possible that teens are cutting down on the number of social media networks they use, but it seems more likely that the question was badly designed. Even I can think of at least one major site that isn’t on the list, Snapchat, which globalwebindex thinks is used by 42% of US internet-connected 16-19 year olds.

Incidentally: those little blue letters that look like they should be a link? They aren’t on the Herald site either, and on the Washington Post site they link to a message that basically says “no, not for you.”

October 10, 2014


  • Something strange happened to this month’s unemployment data in Australia: Guardian, ABC News, interview with Rob Hyndman (who knows from time series)
  • “Ferguson’s 3,287 new registrants (in two months) is more than recorded by any township in St. Louis County in any midterm election since 2002.” Or not. A number that seems really extreme may just be wrong.
  • When there’s a lot of variation, it can be a mistake to make statements about “typical” attitudes: Andrew Gelman

October 9, 2014

…and to divide the light from the darkness

Q: There’s a story that charging your phone in your bedroom makes you fat.

A: Yes, there is.

Q: Why?

A: Because it looked like a good headline.

Q: No, why does it make you fat?

A: Melatonin. The theory is that any light at night time makes your body not produce enough melatonin and that this is bad.

Q: How much more did people who charged their phones in their bedroom end up weighing?

A: There weren’t any people involved.

Q: Ok, so they had mice with cellphones in their bedrooms?

A: Rats. And not cellphones.

Q: Some other light source of a similar brightness?

A: No.

Q: What, then?

A: They put melatonin in the rats’ drinking water.

Q: So that should make them lose weight. Did it?

A: Not that they reported.

Q: Can you work with me here?

A: They measured the conversion of fat under the rats’ skin from ‘white’ to ‘brown’, which is theoretically relevant to energy use and perhaps to diabetes and heart disease. It’s interesting research. (abstract)

Q: So it could be relevant, but doesn’t the generalisation seem a bit indirect?

A: Yes, “a bit.”

Q: Do international patterns of cellphone use match patterns of obesity?

A: Not really, but maybe in East Asia they use different chargers or something.

Q: Is the LED on a charger really enough to make a difference?

A: That’s what the story lead implies, but the second paragraph talks about research involving phone screens, laptops, artificial lighting, and street lights, so I’m guessing there’s a bit of a bait and switch going on.

Q: Couldn’t it be enough? I mean, in nature, it would be completely dark at night, like they say.

A: Only up to a point. There was another relevant story today, too.