Posts written by Thomas Lumley (2034)


Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.

August 21, 2017

Effective treatment is effective

There’s a story in New Scientist, and in the NY Daily News, based on this research paper, saying that choosing alternative treatment instead of conventional treatment for cancer is bad for you.

The research is well done: they looked at the most common cancers in the US and found a small set of people who turned down all conventional treatment in favour of ‘alternative’ medicine.  They matched these people on cancer type, age, clinical group stage, what other disease they had, insurance type, race, and year of diagnosis, to a set who did get conventional treatment.   Even after all that matching, there was a big difference in survival.

There are two caveats to the story. First, this is people who turned down all conventional treatment, even surgery. That’s rare. In the database they used, 99.98% of patients received some conventional treatment. It’s much more common for people to receive some or all of the recommended conventional treatment, plus other things — not ‘alternative’ but ‘complementary’ or ‘integrative’ medicine.

Second, the numbers are being misinterpreted.  For example, New Scientist says

Among those with breast cancer, people taking alternative remedies were 5.68 times more likely to die within five years.

The actual figures were 42% and 13%, so about 3.1 times more likely. Here’s the graph

Similarly, the New Scientist story says

They found that people who took alternative medicine were two and half times more likely to die within five years of diagnosis.

The actual figures were 45% and 26%; 1.75 times more likely.

What’s happening is a confusion of rate ratios and actual risks of death; these aren’t the same.  The rate (or hazard) is measured in % per year; the risk is measured in %.  The risk is capped at 100%; the rate doesn’t have an upper limit.   Because of the cap at 100%, risk ratios are mathematically less convenient to model than rate ratios. As a tradeoff, it’s harder to explain your results using rate ratios. The Yale publicity punted on the issue, not mentioning the numbers and leaving reporters to get it wrong.  When this happens, it’s the scientists’ fault, not the reporters’.
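A rough sketch of why the two ratios differ. It assumes a constant hazard (exponential survival), which is a simplification for illustration and not the model the paper fitted, and the rates are hypothetical numbers, not the study's.

```python
import math

def risk_from_rate(rate_per_year, years):
    """Cumulative risk of death by `years`, assuming a constant hazard."""
    return 1.0 - math.exp(-rate_per_year * years)

# Hypothetical hazards, chosen only for illustration (not from the paper).
baseline_rate = 0.05   # deaths per person-year
rate_ratio = 5.0       # the other group dies at five times the rate
years = 5

risk_low = risk_from_rate(baseline_rate, years)
risk_high = risk_from_rate(baseline_rate * rate_ratio, years)
risk_ratio = risk_high / risk_low

print(f"rate ratio:   {rate_ratio:.2f}")
print(f"5-year risks: {risk_low:.1%} vs {risk_high:.1%}")
print(f"risk ratio:   {risk_ratio:.2f}")
```

Because the risk can't go past 100%, the risk ratio comes out smaller than the rate ratio, and the gap widens as the risks get larger.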


August 19, 2017

Sampling bias

Via GeoNet, a magnitude 4.5 quake south of Dannevirke (blue box)


The squares are reports of shaking. The big cluster is Palmerston North, with secondary clusters in Feilding and Ashhurst: there are more people who felt the quake there because there are more people there.  See also XKCD

August 18, 2017

Green and full of terrors

Q: Did you see avocado gives you breast cancer?

A: Me?

Q: Well, women with mutations in the BRCA genes, such as Angelina Jolie

A: 🙄

Q: “Women with the faulty ‘Angelina Jolie’ gene should cut back on trendy avocado-based breakfasts to slash their chances of cancer.

A: No.

Q: So the study wasn’t in women?

A: No. Or avocados.

Q: Mice

A: Not even. Cells in a lab. (press release)

Q: And the avocados?

A: The cells were given extra folate.

Q: And they got cancer?

A: No, they died.

Q: Then why is there a cancer story?

A: The researchers speculated that folate could be part of a future treatment for BRCA-damaged tumours.

Q: That’s kind of not what the Herald says

A: No, but they did get the story from the Daily Mail.

Q: So what do the researchers say about avocados?

A: They don’t mention avocados

Q: Ok, what do they say about folate, then?

A: “The authors caution that no conclusions should be drawn about whether there is any overall effect in a living animal consuming folate.”

Q: So it wasn’t the press release this time

A: No, this looks like it’s down to the Daily Mail.


August 16, 2017


  • “Is it legal for me to violate Terms of Service in order to collect data for a research project?” (in the US). Casey Fiesler on law and ethics of scraping
  • My first boss as a statistician, John Simes, has won the University of Sydney Vice-Chancellor’s Award for Excellence. Among other things, he was one of the early proponents of universal clinical trial registration. In 1986 he wrote about the impact of publication bias on treatment choice in cancer.
  • A teaching example based on a baseball/brain cancer ‘cluster’ that didn’t hold up.  Much smaller numbers than the brain injury problems in US football or even rugby, and less prior plausibility.
  • It’s not just New Zealanders who have order of magnitude-and-units problems. From The New Yorker, via Felix Salmon

Seatbelts save (some) lives

It’s pretty standard that headlines (and often politicians) overstate the likely effect of road safety precautions — eg, the claim that lowering the blood alcohol limit would prevent all deaths in which drivers were over the limit, which it obviously won’t.

This is from the Herald’s front page.


On the left, the number 94 is the number of people who died in crashes while not wearing seatbelts. On the right (and in the story), we find that this is about a third of all the deaths. It’s quite possible to wear a seatbelt and still die in a crash.

Looking for research, I found this summary from a UK organisation that does independent reviews on road safety issues. They say seatbelts in front seats prevent about 45% of fatal injuries in front seat passengers. For rear-seat passengers the data are less clear.

So, last year probably about 45 people died on our roads because they weren’t wearing seatbelts. That’s a big enough number to worry about: we don’t need to double it.
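The arithmetic, as a minimal sketch. It assumes the 45% front-seat figure applies to all 94 deaths, which is a simplification, since the rear-seat data are less clear.

```python
unbelted_deaths = 94        # deaths while not wearing a seatbelt (Herald figure)
belt_effectiveness = 0.45   # share of fatal injuries prevented (front seats)

# Deaths that a seatbelt would plausibly have prevented.
preventable = unbelted_deaths * belt_effectiveness
print(f"about {preventable:.0f} of the {unbelted_deaths} deaths")
```

That lands in the low 40s, the same ballpark as the rounded figure above and well short of 94.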

August 15, 2017

Emoji backlash?

Q: Did you see that using emoji in work-related emails could hurt your career?


Q: Yes, it’s apparently a common email mistake

A: 😯

Q: The 549 study participants from 29 countries “were asked to read a work-related email from an unknown person, and were asked to evaluate the competence and warmth of the sender”

A:🤔 💻 🇹🇷

Q: Yes, they were from Amazon’s Mechanical Turk (paper)

A: 😕

Q: Ok, so they weren’t really work-related emails from someone they’d never met, in another country. But the participants were told to pretend they were.

A: 🙁

Q: And it undermined information sharing

A: 😕 👥 💻 🤐 ?

Q: The email replies to messages with emoji had fewer words in them on average

A: 🤔 🙂

Q: Ok, yes, that’s not necessarily a bad thing.

A: 👩‍💼👨🏽‍💼🗣 😀 ?

Q: Is it really common to use emoji in business email? Yes, they say nearly 20% of emails in one previous sample included emoji.

A: 🙄 🇬🇸🇸🇱🇸🇬🇸🇳🇸🇦 👩‍💼👨🏽‍💼🗣 😀 ?

Q: No, I suppose that wasn’t international emails between people who had never met or corresponded before.

A: 🙄

Q: So using emoji in formal emails to a complete stranger could be a bad idea?

A: 😴


August 14, 2017

Meters and litres

There have been a surprisingly large number of order-of-magnitude errors by people you’d expect to know better when commenting on Labour’s proposed water policy.  The Greens, last month, proposed a 10c/litre charge on water for bottling.  Labour are proposing a variable charge from one or two cents per cubic metre on irrigation up to “cents per litre, not ten cents” for bottled water not taken from a mains supply.

The conversion is fairly simple: a cubic metre is 1000 litres, so 10c per litre is $100 per cubic metre, 1c per litre is $10 per cubic metre, and one-thousandth of a cent per litre is 1c per cubic metre.
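For anyone who wants to check the conversion, a small sketch (the function names are just for illustration):

```python
LITRES_PER_CUBIC_METRE = 1000
CENTS_PER_DOLLAR = 100

def cents_per_litre_to_dollars_per_m3(cents_per_litre):
    """Convert a charge in cents per litre to dollars per cubic metre."""
    return cents_per_litre * LITRES_PER_CUBIC_METRE / CENTS_PER_DOLLAR

def cents_per_m3_to_cents_per_litre(cents_per_m3):
    """Convert a charge in cents per cubic metre to cents per litre."""
    return cents_per_m3 / LITRES_PER_CUBIC_METRE

for c in (10, 1):
    print(f"{c}c per litre = ${cents_per_litre_to_dollars_per_m3(c):.0f} per cubic metre")
print(f"1c per cubic metre = {cents_per_m3_to_cents_per_litre(1)}c per litre")
```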

How much does that come to for a cabbage or a carton of milk? According to Daniel Collins, the water taken from rivers or aquifers to produce a litre of milk varies from about 1L in the Waikato to about 250L in the Canterbury plains (you’ll see figures of 1000L, but these include needs met from local rainfall). So a 1c or 2c per cubic metre water charge would come out to less than a cent per litre of milk.

On the other hand, most of our milk isn’t produced for local consumption but for export as milk solids, at a bit more than 11 L of milk per kilogram.  At Canterbury water consumption, a 2c charge works out as about 6 cents per kg of milk solids. One Canterbury dairy farmer on Twitter estimated about twice that based on his production and consumption, so we’re at the right order of magnitude. Fonterra is currently paying $6.75 per kilogram of milk solids.

Horticulture is the other use that’s been in the news.  I found an estimate that it takes 237L of water to produce 1kg of cabbage, ie, less than a quarter of a cubic metre, so less than 1 cent. Maybe NZ horticulture is less water-efficient than the average for the world, but that estimate, again, counts rainfall.
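The milk and cabbage figures can be reproduced with a few lines. This is a sketch using the numbers quoted above, with milk solids rounded to 11 L of milk per kilogram and the charge taken at the 2c end of the range.

```python
LITRES_PER_CUBIC_METRE = 1000

def water_charge_cents(litres_of_water, cents_per_m3):
    """Charge in cents for a given volume of water."""
    return litres_of_water / LITRES_PER_CUBIC_METRE * cents_per_m3

charge = 2                     # cents per cubic metre
water_per_litre_milk = 250     # litres of water, Canterbury figure
milk_per_kg_solids = 11        # litres of milk per kg of milk solids (rounded)
water_per_kg_cabbage = 237     # litres of water, includes rainfall

per_litre_milk = water_charge_cents(water_per_litre_milk, charge)
per_kg_solids = per_litre_milk * milk_per_kg_solids
per_kg_cabbage = water_charge_cents(water_per_kg_cabbage, charge)

print(f"per litre of milk:     {per_litre_milk:.2f}c")
print(f"per kg of milk solids: {per_kg_solids:.1f}c")
print(f"per kg of cabbage:     {per_kg_cabbage:.2f}c")
```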

It’s hard to get up-to-date data, but in 2010 the total water use consented for horticulture, orchards, and viticulture was about 800 million cubic metres (PDF, table C-21), which would cost $8 million at 1c/cubic metre or $16 million at 2c/cubic metre; the amount actually used was lower.  In 2010, Horticulture NZ said the total production of the sector was worth $6 billion.

According to StatsNZ, total water for irrigation, other farming uses, and industrial uses was consented at a maximum of about 8.5 billion cubic metres last year.  At 2c per cubic metre that would be 17 billion cents, or $170 million, if all the consented volume was taken and if there was no reduction in use as a result of charging.  Some fraction of the water would be priced a lot higher, and Labour is saying “less than $500 million”, which looks plausible.  That’s a fair sum of money, but it’s about two-thirds of one percent of Crown Revenue.
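The sector-wide totals are the same sum at a larger scale. A sketch, assuming (as above) that the full consented volume is taken and charged at a flat rate:

```python
def revenue_dollars(cubic_metres, cents_per_m3):
    """Total charge in dollars for a volume at a flat rate in cents per cubic metre."""
    return cubic_metres * cents_per_m3 / 100

horticulture_2010 = 800e6   # cubic metres consented, 2010
all_uses = 8.5e9            # cubic metres consented, irrigation/farming/industrial

for rate in (1, 2):
    total = revenue_dollars(horticulture_2010, rate)
    print(f"horticulture at {rate}c: ${total / 1e6:.0f} million")
print(f"all uses at 2c: ${revenue_dollars(all_uses, 2) / 1e6:.0f} million")
```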

The cost to the water-using industries of a water charge isn’t trivial;  they’re going to notice the increase in their costs; this isn’t free money for the government or the taxpayer. I’m not going to comment on whether this is a good policy; that’s outside my expertise. But, some of the claims about costs have been off by huge factors, and people should be able to do basic maths better than that.


August 11, 2017

Different sorts of graphs

This bar chart from Figure.NZ was in Stuff today, with the lead

Working-age people receiving benefits are mostly in the prime of our working life – the ages of 25 to 54.


The numbers are correct, but the extent to which the graph fits the story is a bit misleading.  The main reason the two bars in the middle are higher is that they are 15-year age groups, while the first bar is a 7-year group and the last is a 10-year group.

Another way to show the data is to scale the bar widths proportional to the number of years and then scale the height so that the bar area matches the count of people. The bar height is now counts of people per year of age.


This is harder to read for people who aren’t used to it, but arguably more informative. It suggests the 25-54 year groups may be the largest just because the groups are wider.

We really need population size data, since the number of people in NZ also varies by age group.  Showing the percentage receiving benefits in each age group gives a different picture again.


It looks as though

  • “working age” people 25-39 and 40-54 make up a larger fraction of those receiving benefits than people 18-24 or 55-64
  • a person receiving benefits is more likely to be, say, 20 or 60 than 35 or 45.
  • the proportion of people receiving benefits increases with age

These can all be true; they’re subtly different questions. Part of the job of a statistician is to help you think about which one you wanted to ask.
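Here is a sketch of how all three can hold at once. The counts and populations are made up for illustration (they are not the Figure.NZ numbers); only the age-group widths come from the graph.

```python
# (age group, width in years, HYPOTHETICAL recipients, HYPOTHETICAL population)
groups = [
    ("18-24", 7, 50_000, 500_000),
    ("25-39", 15, 90_000, 850_000),
    ("40-54", 15, 90_000, 800_000),
    ("55-64", 10, 70_000, 550_000),
]

total_recipients = sum(count for _, _, count, _ in groups)

rows = []
for label, width, count, population in groups:
    rows.append({
        "group": label,
        "share": count / total_recipients,   # fraction of all recipients
        "per_year": count / width,           # bar height when area = count
        "rate": count / population,          # proportion of the age group
    })

for r in rows:
    print(f"{r['group']}: share {r['share']:.0%}, "
          f"{r['per_year']:,.0f} per year of age, rate {r['rate']:.1%}")
```

With these numbers the middle groups have the biggest share, the end groups have more people per year of age, and the rate rises steadily with age.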

August 9, 2017


  • From econ blog “Worthwhile Canadian Initiative”:  “The fraction of children earning more than their parents fell from approximately 90% for children born in 1940 to around 50% for children entering the labor market today. Not children. Boys, perhaps, but not children.”
  • From North and South, a story on what direct-to-consumer genetic testing might be good for.
  • There are lots of websites with useful and interesting data out there, but you need to worry about what the data mean. Kaiser Fung has an example from a Kaggle challenge involving Hollywood movies “Huge alarm bells should be going off in the analyst’s head right around now. There were only eleven movies about vampires? Only eleven martial arts movies? Only twelve movies involving superheroes?” (via Andrew Gelman)
  • Wired magazine reprints a Harper’s story about that 1984 revolution in numerical computing, the spreadsheet. “It is not far-fetched to imagine that the introduction of the electronic spreadsheet will have an effect like that brought about by the development during the Renaissance of double-entry bookkeeping.” If anything, an underestimate.
August 8, 2017

Breast cancer alcohol twitter

Twitter is not an ideal format for science communication, because of the 140-character limitations: it’s easy to inadvertently leave something out.  Here’s one I was referred to this morning (link, so you can see if it is retracted)


Usually I’d think it was a bit unfair to go after this sort of thing on StatsChat.  The reason I’m making an exception here is the hashtag: this is a political statement by a person of mana.

There’s one gross inaccuracy (which I missed on first reading) and one sub-optimal presentation of risk.  To start off, though, there’s nothing wrong with the underlying number: unlike many of its ilk it isn’t an extrapolation from high levels of drinking and it isn’t obviously confounded, because moderate drinkers are otherwise in better health than non-drinkers on average.  The underlying number is that for each standard drink per day, the rate of breast cancer increases by a factor of about 1.1.

The gross inaccuracy is the lack of a per day qualifier, making the statement inaccurate by a factor of several thousand.  An average of one standard drink per day is not a huge amount, but it’s probably more than the average for women in NZ (given the  2007/08 New Zealand Alcohol and Drug Use Survey finding that about half of women drank alcohol less than weekly).

Relative rates are what the research produces, but people tend to think in absolute risks, despite the explicit “relative risk” in the tweet.  The rate of breast cancer in middle age (what the data are about) is fairly low. The lifetime risk for a 45 year old woman (if you don’t die of anything else before age 90) is about 12%.  A 10% increase in that is 13.2%, not 22%. It would take about 7 drinks per day to roughly double your risk (1.1^7 ≈ 1.95), and you’d have other problems as well as breast cancer risk.
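The sums, as a quick sketch. It assumes the factor of 1.1 multiplies up for each extra daily drink and that, because the risk is fairly low, the rate ratio can be applied to the lifetime risk directly.

```python
import math

lifetime_risk = 0.12          # about 12% for a 45-year-old woman
ratio_per_daily_drink = 1.1   # rate ratio per standard drink per day

# Relative increase: 10% OF the baseline risk, not 10 percentage points.
risk_one_drink = lifetime_risk * ratio_per_daily_drink
print(f"one drink per day: {risk_one_drink:.1%}")

# Seven drinks per day, multiplying the factor up.
seven_drink_ratio = ratio_per_daily_drink ** 7
print(f"seven drinks per day: ratio {seven_drink_ratio:.2f}")

# Drinks per day needed to exactly double the rate.
drinks_to_double = math.log(2) / math.log(ratio_per_daily_drink)
print(f"drinks per day to double: {drinks_to_double:.1f}")
```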