Posts written by Thomas Lumley (1263)


Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient

August 29, 2014

Getting good information to government

On the positive side: there’s a conference of science advisers and people who know about the field here in Auckland at the moment. There’s a blog, and there will soon be videos of the presentations.

On the negative side: Statistics Canada continues to provide an example of how a world-class official statistics agency can go downhill with budget cuts and government neglect.  The latest story is the report on how the Labour Force Survey (which is how unemployment is estimated) was off by 42,000 in July. There’s a shorter writeup in Maclean’s magazine, and their archive of stories on StatsCan is depressing reading.

August 28, 2014

Age, period, um, cohort

A recurring issue with trends over time is whether they are ‘age’ trends, ‘period’ trends, or ‘cohort’ trends.  That is, when we complain about ‘kids these days’, is it ‘kids’ or ‘these days’ that’s the problem? Mark Liberman at Language Log has a nice example using analyses by Joe Fruehwald.


If you look at the frequency of “um” in speech (in this case in Philadelphia), it decreases with age at any given year



On the other hand, it increases over time for people in a given age cohort (for example, the line that stretches right across the graph is for people born in the 1950s)



It’s not that people say “um” less as they get older, it’s that people born a long time ago say “um” less than people born recently.


‘Dodgy use of data’ edition [Background: the Washington Post is the serious DC paper. The Washington Times, not so much]

  • From the Washington Post  “But really, is it possible that more than 1 in 6 people in France could “back” Islamic State? When you look at the numbers closely, something doesn’t add up.”
  • From journalism/editing blog HeadsUp: “Too bad a clear conscience and a pure heart can’t turn correlation into cause, no matter what your first named source says.”
  • From economics blog TVHE: “For example, [Tourism Industry Association New Zealand] claims that 15% of Upper Hutt residents’ jobs depend on the tourism industry, while only 9% of residents’ jobs in Queenstown-Lakes District depend on tourism.”



August 26, 2014


Infographic edition

1. Thomson Reuters illustrated the importance of fine detail in graphics in one of their ads. It looks like a Venn diagram. Oops.




Removing the transparent overlap and changing the colours makes it less Venn-ish




2. Kevin Schaul in the Washington Post came up with this neat graphical summary of state data



Because the basic outline of the US is so familiar (especially to people who live there), the huge spatial distortions aren’t actually all that disturbing.  Mark Monmonier, a geographer, seems to have been the first person to move in this direction (eg). I suggested to Kevin, on Twitter, that this technique would also allow Alaska to be moved from the tropical Pacific to its proper home in the north, and he agreed.


3. That’ll wake you up


Jawbone, who make products that tell you if you are awake and walking around, looked at the impact of this week’s Napa earthquake. The data resolution isn’t quite fine enough to see the time taken for the ground waves to propagate — compare XKCD on the Twitter event horizon


August 22, 2014

Margin of error for minor parties

The 3% ‘margin of error’ usually quoted for polls is actually the ‘maximum margin of error’, and is an overestimate for minor parties. On the other hand, it also assumes simple random sampling and so tends to be an underestimate for major parties.

In case anyone is interested, I have done the calculations for a range of percentages (code here), both under simple random sampling and under one assumption about real sampling.


Lower and upper ‘margin of error’ limits for a sample of size 1000 and the observed percentage, under the usual assumptions of independent sampling

Percentage   lower   upper
     1        0.5     1.8
     2        1.2     3.1
     3        2.0     4.3
     4        2.9     5.4
     5        3.7     6.5
     6        4.6     7.7
     7        5.5     8.8
     8        6.4     9.9
     9        7.3    10.9
    10        8.2    12.0
    15       12.8    17.4
    20       17.6    22.6
    30       27.2    32.9
    50       46.9    53.1


Lower and upper ‘margin of error’ limits for a sample of size 1000 and the observed percentage, assuming that complications in sampling inflate the variance by a factor of 2, which empirically is about right for National.

Percentage   lower   upper
     1        0.3     2.3
     2        1.0     3.6
     3        1.7     4.9
     4        2.5     6.1
     5        3.3     7.3
     6        4.1     8.5
     7        4.9     9.6
     8        5.8    10.7
     9        6.6    11.9
    10        7.5    13.0
    15       12.0    18.4
    20       16.6    23.8
    30       26.0    34.2
    50       45.5    54.5
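For readers who want to check the tables without the linked code, a Wilson score interval reproduces them to within rounding (the linked calculations may use a slightly different interval, so the last digit can differ); the design effect of 2 is applied by shrinking the effective sample size. This is a sketch, not the original code:

```python
import math

def wilson_interval(p, n, deff=1.0, z=1.96):
    """95% Wilson score interval (in percentage points) for a sample
    proportion p with sample size n, optionally inflating the variance
    by a design effect deff (deff=2 roughly matches the complications
    of real poll sampling mentioned above)."""
    n_eff = n / deff                      # effective sample size
    centre = (p + z**2 / (2 * n_eff)) / (1 + z**2 / n_eff)
    half = (z * math.sqrt(p * (1 - p) / n_eff + z**2 / (4 * n_eff**2))
            / (1 + z**2 / n_eff))
    return 100 * (centre - half), 100 * (centre + half)

# Simple random sampling, n = 1000, observed 5%:
print(wilson_interval(0.05, 1000))          # roughly (3.8, 6.5)
# Variance inflated by a factor of 2:
print(wilson_interval(0.05, 1000, deff=2))  # roughly (3.4, 7.3)
```

The deff argument is just effective-sample-size shorthand for "complications in sampling inflate the variance by a factor of 2".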

California drought visualisation


From XKCD. Both the data and the display technique are worth looking at



Presumably you could do something similar with New Zealand, which is roughly the same shape.

August 21, 2014

Auckland rates arithmetic

In today’s Herald story about increases in rates and impact on renters it’s not that the numbers are wrong, it’s that they haven’t been subjected to the right sorts of basic arithmetic.

The lead is

Auckland landlords are hiking rents amid fears of big rates increases next year on the back of spiralling property values.

and later on

Increases in landlords’ expenses, including rates, mortgage interest rates and insurance premiums, could push up rent on a three-bedroom Auckland house by between $20 and $40 a week, he said.

‘Including’ is doing a lot of work in that sentence. The implications are particularly unfortunate in a story targeted at renters, who don’t get sent rates information directly and are less likely to know the details of the system.

The first place to start is with a rough estimate of how much money we’re looking at. One of the few useful things the Taxpayers’ Union has done is to collate data on rates, hosted now at Stuff. The average Auckland rates bill was $2636.  That’s all residences, not three-bedroom houses, but the order of magnitude should be right. An annual bill of $2636 is $50/week. If the average total weekly rates payment is around $50, the average increase can’t reasonably be a big fraction of $20-$40/week or there’d be a lot more rioting in the streets.

Anyone who owns a house in Auckland or checks the Council website should know there is a cap on rates increases to cover the neighbourhoods where prices are increasing fastest. The cap is 10%/year; no rates increase faster than that, and most increase slower.  To get more detailed information you’d need to look at the website describing 2014/2015 rates changes, and find that the average increase for residential properties is 3.7%, then calculate that 3.7% of $50/week is about $2/week.

According to the Reserve Bank, both floating and two-year-fixed mortgage interest rates have gone up 0.5% since last year.  That’s $9.60/week per $100,000 of mortgage, so it’s likely to be a much bigger component of the rental cost increase than the rates are.
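The back-of-envelope arithmetic above is easy to check in a few lines, using only the figures quoted in the post (a $2636 average annual rates bill, a 3.7% average residential rates increase, and a 0.5 percentage-point rise in mortgage rates):

```python
# Back-of-envelope check of the rates-vs-mortgage arithmetic above.
annual_rates = 2636                     # average Auckland rates bill, $/year
weekly_rates = annual_rates / 52
print(round(weekly_rates, 2))           # about $50/week

rates_increase = 0.037 * weekly_rates   # 3.7% average residential increase
print(round(rates_increase, 2))         # about $1.88/week, i.e. roughly $2

# 0.5 percentage-point mortgage rate rise, per $100,000 borrowed:
mortgage_increase = 0.005 * 100_000 / 52
print(round(mortgage_increase, 2))      # about $9.62/week
```

So the mortgage-rate rise alone is several times the average rates increase, which is the point of the comparison.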

The average increase in rates is a lot slower than the increase in property prices (10% in the year to July), but you’d expect it to be. The council doesn’t set a fixed percentage of value from year to year and live with real-estate price fluctuations. It sets a budget for total rates income, and then distributes the cost using a combination of a fixed charge and a proportion of value. In other words, the increase in average real-estate prices in Auckland has no direct impact on average increase in rates — it’s just that if your house value has gone up more than average, your rates will tend to go up more than average.   Increases in average real-estate price obviously do lead to increases in rental price, but rates are not the mechanism.

The Council is currently working on a ten-year plan, including the total rates income over that period of time. It will be open for public comment in January.


August 20, 2014

Good neighbours make good fences

Two examples of neighbourly correlations, at least one of which is not causation

1. A (good) Herald story today, about research in Michigan that found people who got on well with their neighbours were less likely to have heart attacks

2. An old Ministry of Justice report showing people who told their neighbours whenever they went away were much less likely to get burgled.

The burglary story is the one we know is mostly not causal.  People who tell their neighbours whenever they go on holiday were about half as likely to have experienced a burglary, but only about one burglary in seven happened while the residents were on holiday. There must be something else about types of neighbourhoods or relationships with neighbours that explains most of the correlation.
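The reasoning above can be made explicit with a crude bound (the one-in-seven figure is the one quoted in the report): even if telling the neighbours prevented every single holiday-time burglary, it could remove at most the holiday-time share of burglaries, which falls well short of the observed halving of risk.

```python
# Crude upper bound on the causal effect of telling neighbours you're away.
holiday_share = 1 / 7        # fraction of burglaries during residents' holidays
observed_reduction = 0.5     # tellers were about half as likely to be burgled

# Best case: telling the neighbours prevents *all* holiday burglaries.
max_causal_reduction = holiday_share
print(round(max_causal_reduction, 2))   # about 0.14, far less than 0.5
```

Since 14% is much less than 50%, most of the correlation has to come from something else about the neighbourhoods or the relationships.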

I’m pretty confident the heart-disease story works the same way.  The researchers had some possible explanations

The mechanism behind the association was not known, but the team said neighbourly cohesion could encourage physical activities such as walking, which counter artery clogging and disease.

That could be true, but is it really more likely that talking to your neighbours makes you walk around the neighbourhood or work in the garden, or that walking around the neighbourhood and working in the garden leads to talking to your neighbours? On top of that, the correlation with neighbourly cohesion was rather stronger than the correlation previously observed with walking.

August 19, 2014

Fortune cookie endings

Or, often in NZ papers, “… in the UK”.

There’s a Herald story with the lead

More than 12,000 new cases of cancer every year can be attributed to the patient being overweight or obese, the biggest ever study of the links between body mass index and cancer has revealed.

Since there are about 20,000 new cases of cancer a year in NZ, that would be quite a lot.  The story never actually comes out and says the 12,000 is for the UK, but it is, and if you read the whole thing it becomes fairly clear.  It still seems the sort of context that a reader might find helpful.

“More maps that won’t change your mind about racism in America”



Ultimately, despite the centrality of social media to the protests and our ability to come together and reflect on the social problems at the root of Michael Brown’s shooting, these maps, and the kind of data used to create them, can’t tell us much about the deep-seated issues that have led to the killing of yet another unarmed young black man in our country. And they almost certainly won’t change anyone’s mind about racism in America. They can, instead, help us to better understand how these events have been reflected on social media, and how even purportedly global news stories are always connected to particular places in specific ways.