Posts written by Thomas Lumley (1553)


Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.

August 26, 2015

The death of the novel?

So, the Guardian has a list of the top-100 best ever novels written in English.  There’s the usual problem that lit people have with genre (can it really be true that The Moonstone is the best detective novel in English?), but I’m not an expert in novels.

There have been complaints about the diversity of the list, and that’s where there’s a definite statistical anomaly. The books are old. For example, 33 of them come from before the start of the twentieth century and none of them from after the end of the twentieth century, even though many more novels were published in the latter period.

An article in Seed magazine, and its technical notes, tried to estimate the number of new book authors each year over time. That’s not quite the same thing, since we aren’t considering only first books and are considering only novels. However, it’s a reasonable surrogate. Using their estimates, as many books were published before 1930 (55 places on the list) as after 2000 (no places on the list). Here’s a graph, with the dots indicating books on the list and the lines indicating total published.


As I said, this isn’t my field. Maybe nineteenth century novels really were hundreds of times more likely to be ‘great’ than modern novels. But it’s not the only possible explanation.

August 25, 2015

Computation and art


Normally I wouldn’t be linking favourably to this scatterplot, which has an ill-defined sampling scheme, and where at least the y-axis data are objectively wrong.  On the other hand, normally the scatterplot would be there to convey information.  In this case it’s just an index to some beautiful animated triangular art.


The point, and the relevance to this blog, is the way Matt Daniels has written software to make these pictures (relatively) easy to create.


Incidentally, before anyone starts complaining that sharks and fish are separate, that bit is exactly correct.  Fish (typical fish with bones, such as the swordfish in the animation) have a more recent common ancestor with sheep than with sharks.

August 23, 2015

Barcharts with delusions of grandeur

The cricket graphics system now allows 3-d barcharts projected over the playing field, and casting actual virtual shadows.


Yeah, nah.


  • I can’t resist mentioning a pointless number puzzle originating in New York. How does this sequence continue: 50, 42, 34, 23, 14,…
  • Big Data: from Jennifer Gardy, a Canadian genomic epidemiologist, on Twitter. This is 5000TB, ie, about 5 million gigabytes, of genetic sequence data.
  • From Upshot at the New York Times again, How to Know Whether to Believe a Health Study. StatsChat readers will be familiar with most of this.
  • Another problem with errors in automatic classifications, for apps that purport to recognise bird songs:
    If it tells you a nuthatch’s call is in fact a great tit (a bird which, by the way, is thought to have over 40 different types of call), you’ll take that mistake away with you and carry on misdiagnosing nuthatches.
    (On the other hand, I completely disagree with the argument that recognising bird songs should take work, dammit)
  • Nathan Yau, at Flowing Data, is recreating the Statistical Atlas of the United States, using modern data and the 1870-style graphics.
August 22, 2015

Changing who you count

The New York Times has a well-deserved reputation for data journalism, but anyone can have a bad day.  There’s a piece by Steven Johnson on the non-extinction of the music industry, which I think makes some good points, but which the Future of Music Coalition doesn’t like at all. And they also have some good points.

In particular, Johnson says

“According to the OES, in 1999 there were nearly 53,000 Americans who considered their primary occupation to be that of a musician, a music director or a composer; in 2014 more than 60,000 people were employed writing, singing, or playing music. That’s a rise of 15 percent.”


He’s right. This is a graph (not that you really need one)


The Future of Music Coalition give the numbers for each year, and they’re interesting. Here’s a graph of the totals:


There isn’t a simple increase; there’s a weird two-humped pattern. Why?

Well, if you look at the two categories, “Music Directors and Composers” and “Musicians and Singers”, making up the total, it’s quite revealing


The larger category, “Musicians and Singers”, has been declining.  The smaller category, “Music Directors and Composers”, was going up slowly, then had a dramatic three-year, straight-line increase, then decreased a bit.

Going  into the Technical Notes for the estimates (eg, 2009), we see

May 2009 estimates are based on responses from six semiannual panels collected over a 3-year period

That means the three-year increase of 5000 jobs/year is probably a one-off increase of 15,000 jobs. Either the number of “Music Directors and Composers” more than doubled in 2009, or more likely there was a change in definitions or sampling approach.  The Future of Music Coalition point out that Bureau of Labor Statistics FAQs say this is a problem (though they’ve got the wrong link: it’s here, question F.1)

Challenges in using OES data as a time series include changes in the occupational, industrial, and geographical classification systems

In particular, the 2008 statistics estimate only 390 of these people as being employed in primary and secondary schools; the 2009 estimate is 6000, and the 2011 estimate is 16880. A lot of primary and secondary school teachers got reclassified into this group; it wasn’t a real increase.
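To see how a one-off reclassification could show up as a three-year straight-line rise, here is a minimal sketch. It assumes a much simpler scheme than the real OES methodology: each published estimate is taken to be the equal-weighted average of the current year and the two before it, and the "true" school-based count is assumed to jump from 390 to 16880 in a single step in 2009. The function name and the timing of the jump are illustrative assumptions, not anything from the Technical Notes.

```python
def rolling_estimate(true_levels, window=3):
    """Equal-weighted average of each year's level and the preceding window-1 years."""
    estimates = []
    for i in range(window - 1, len(true_levels)):
        recent = true_levels[i - window + 1 : i + 1]
        estimates.append(sum(recent) / window)
    return estimates


years = list(range(2006, 2013))

# Hypothetical "true" counts: a one-off jump from 390 to 16880 in 2009.
true_levels = [390 if year < 2009 else 16880 for year in years]

# The first estimate needs three years of data, so it is for 2008.
estimates = rolling_estimate(true_levels)
for year, estimate in zip(years[2:], estimates):
    print(year, round(estimate))
```

Under these assumptions the estimates climb in three equal steps from 390 to 16880, and the 2009 value comes out a little under 6000, which is at least in the same ballpark as the published figure.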

When the school teachers are kept out of  “Music Directors and Composers”, to get better comparability across years, the change is from 53000 in 1999 to 47000 in 2014. That’s not a 15% increase; it’s an 11% decrease.
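The adjusted comparison is simple arithmetic, sketched here with the rounded figures from this post (the helper name is just for illustration):

```python
def percent_change(old, new):
    """Change from old to new, as a percentage of old."""
    return 100 * (new - old) / old


# Rounded totals with the school teachers excluded, as given in the post.
adjusted_change = percent_change(53000, 47000)
print(round(adjusted_change, 1))
```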

Official statistics agencies try not to change their definitions, precisely because of this problem, but they do have to keep up with a changing world. In the other direction, I wrote about a failure to change definitions that led the US Census Bureau to report four times as many pre-schoolers were cared for by fathers vs mothers.

August 20, 2015

The second-best way to prevent hangovers?

From Stuff: “Korean pears are the best way to prevent hangovers, say scientists.”

This is precisely not what scientists say; in fact, the scientist in question is even quoted (in the last line of the story) as not saying that.

Meanwhile, as a responsible scientist, she reminded that abstaining from excess alcohol consumption is the only certain way to avoid a hangover.

At least Stuff got ‘prevention’ in the headline. Many other sources, such as the Daily Mail, led with claims of a “hangover cure.”  The Mail also illustrated the story with a photo of the wrong species: the research was on the Asian species Pyrus pyrifolia,  rather than the European pear Pyrus communis. CSIRO hopes that European pears are effective, since that’s what Australia has vast quantities of, but they weren’t tested.

What Stuff doesn’t seem to have noticed is that this isn’t a new CSIRO discovery. The blog post certainly doesn’t go out of its way to make that obvious, but right at the bottom, after the cat picture, the puns, and the Q&A with the researcher, you can read

Manny also warns this is only a preliminary scoping study, with the results yet to be finalised. Ultimately, her team hope to deliver a comprehensive review of the scientific literature on pears, pear components and relevant health measures.

That is, the experimental study on Korean pears isn’t new research done at CSIRO. It’s research done in Korea, and published a couple of years ago. There’s nothing wrong with this, though it would have been nice to give credit, and it would have made the choice of Korean pears less mysterious.

The Korean researchers recruited a group of young Korean men, and gave alcohol (in the form of soju), preceded by either Korean pear juice or placebo pear juice (pear-flavoured sweetened water).  Blood chemistry studies, as well as research in mice by the same group, suggest that the pear juice speeds up the metabolism of alcohol and acetaldehyde. This didn’t prevent hangovers, but it did seem to lead to a small reduction in hangover severity.

The study was really too small to be very convincing. Perhaps more importantly, the alcohol dose was nearly eleven standard drinks (540ml of 20% alcohol) over a short period of time, so you’d hope it was relevant to a fairly small group of people.  Even in Australia.
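The size of the dose can be checked with a little arithmetic. This is a rough sketch, and the constants are my assumptions rather than anything stated in the study: an ethanol density of about 0.789 g/ml, and two possible definitions of a "drink" (10 ml or 10 g of pure alcohol). Counting 10 ml per drink reproduces the "nearly eleven" figure; counting 10 g per drink, which is how I understand Australian and NZ standard drinks to be defined, gives about eight and a half. Either way, it is a lot.

```python
ETHANOL_DENSITY_G_PER_ML = 0.789  # assumed approximate density of ethanol

volume_ml = 540   # dose reported in the post
abv = 0.20        # 20% alcohol by volume

pure_alcohol_ml = volume_ml * abv
pure_alcohol_g = pure_alcohol_ml * ETHANOL_DENSITY_G_PER_ML

# Two assumed conventions for what counts as one drink.
drinks_at_10ml_each = pure_alcohol_ml / 10
drinks_at_10g_each = pure_alcohol_g / 10

print(f"{pure_alcohol_ml:.0f} ml of pure alcohol, about {pure_alcohol_g:.0f} g")
print(f"{drinks_at_10ml_each:.1f} drinks at 10 ml each")
print(f"{drinks_at_10g_each:.1f} drinks at 10 g each")
```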


August 19, 2015

Stereotype and caricature

I’ve posted a few times about the maps, word clouds, and so on that show the most distinctive words by gender or state — sometimes they are even mislabelled as the “most common” words.  As I explained, these are often very rare words; it’s just that they are slightly less rare in one group than in the others.
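The difference between "most common" and "most distinctive" is easy to see with made-up numbers. In this sketch the counts are entirely hypothetical, and "distinctive" is taken to mean the largest ratio of relative frequencies between two groups, with a crude add-one adjustment so that a word missing from one group doesn't cause division by zero. Real analyses may well use a different measure.

```python
from collections import Counter


def most_distinctive(counts_a, counts_b, top=3, min_count=2):
    """Words in group A ranked by relative frequency in A over (smoothed) frequency in B."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    ratios = {}
    for word, n in counts_a.items():
        if n < min_count:
            continue  # ignore one-off responses
        rate_a = n / total_a
        rate_b = (counts_b.get(word, 0) + 1) / (total_b + 1)  # add-one smoothing
        ratios[word] = rate_a / rate_b
    return sorted(ratios, key=ratios.get, reverse=True)[:top]


# Hypothetical response counts for two groups of respondents.
group_a = {"green": 5000, "blue": 4000, "purple": 3000, "dusty teal": 12}
group_b = {"green": 5100, "blue": 4100, "purple": 2800, "dusty teal": 1}

most_common_a = Counter(group_a).most_common(1)[0][0]
most_distinctive_a = most_distinctive(group_a, group_b, top=1)[0]

print("most common in A:", most_common_a)
print("most distinctive of A:", most_distinctive_a)
```

With these counts the most common response in group A is "green", which is just as common in group B, while the most distinctive is "dusty teal", which about one respondent in a thousand used.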

An old post from the XKCD blog gives a really good example. Randall Munroe set up a survey to show people colours and ask for the colour name. He got five million responses, from over 200,000 sessions, and came up with nearly 1000 reasonably well-characterised colours.  You can download the complete data, if you care.

The survey asked participants about their chromosomal sex, because two of the colour receptor genes are on the X-chromosome and this is linked to colour blindness (and possibly to tetrachromatic vision). It turned out that the basic colour names were very similar between male and female respondents, though women were slightly more likely to use modifiers (“lime green” vs “green”).

However, Munroe also looked at the responses that differed most in frequency between men and women. These were all uncommon responses, but all from multiple people, and after extensive spam filtering.

You can probably guess which group is which:

  1. Dusty Teal
  2. Blush Pink
  3. Dusty Lavender
  4. Butter Yellow
  5. Dusky Rose


  1. Penis
  2. Gay
  3. WTF
  4. Dunno
  5. Baige

(Presumably this is a gender effect, not an X-linked language defect.)


August 17, 2015

How would you even study that?



“How would you even study that?” is an excellent question to ask when you see a surprising statistic in the media. Often the answer is “they didn’t,” but sometimes you get to find out about some really clever research technique.

More diversity pie-charts

These ones are from the Seattle Times, since that’s where I was last week.

Amazon, like many other tech companies, had been persuaded to release figures on gender and ethnicity for its employees. On the original figures, Amazon looked different from the other companies, but Amazon is unusual in being a shipping-things-around company as well as a tech company. Recently, they released separate figures for the ‘labourers and helpers’ vs the technical and managerial staff.  The pie chart shows how the breakdown makes a difference.

In contrast to Kirsty Johnson’s pie charts last week, where subtlety would have been wasted  given the data and the point she was making, here I think it’s more useful to have the context of the other companies and something that’s better numerically than a pie chart.

This is what the original figures looked like:


Here’s the same thing with the breakdown of Amazon employees into two groups:


When you compare the tech-company half of Amazon to other large tech companies, it blends in smoothly.

As a final point, “diversity” is really the wrong word here. The racial/ethnic diversity of the tech companies is pretty close to that of the US labour force, if you measure in any of the standard ways used in ecology or data mining, such as entropy or Simpson’s index.   The issue isn’t diversity but equal opportunity; the campaigners, led by Jesse Jackson, are clear on this point, but the tech companies and often the media prefer to talk about diversity.
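As a small illustration of why diversity indices can miss the point, here is a sketch of the two measures mentioned above. The proportions are hypothetical, not the companies' figures, and I am using the "one minus the sum of squared proportions" form of Simpson’s index (there are other conventions). The two made-up workforces have the same group sizes assigned to different groups, so both indices come out identical even though the composition is quite different.

```python
import math


def shannon_entropy(proportions):
    """Shannon entropy in nats; zero proportions contribute nothing."""
    return -sum(p * math.log(p) for p in proportions if p > 0)


def simpson_diversity(proportions):
    """Probability that two randomly chosen people are from different groups."""
    return 1 - sum(p * p for p in proportions)


# Hypothetical proportions for the same four groups, in the same order.
workforce_a = [0.60, 0.30, 0.05, 0.05]
workforce_b = [0.60, 0.05, 0.30, 0.05]

for name, workforce in [("A", workforce_a), ("B", workforce_b)]:
    print(name, round(shannon_entropy(workforce), 3), round(simpson_diversity(workforce), 3))
```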


August 14, 2015


  • “As the polar ice caps melt and the earth churns through the Sixth Extinction, another unprecedented phenomenon is taking place, in the realm of sex,” says Vanity Fair. “Yeah, nah,” says New York magazine. If you only talk to top Tinder users (especially in New York), you’re going to get strange ideas about sex.
  • “How Statistics guided me through life, death, and ‘The Price is Right’” by Elisa Long, in Washington Post. Dr Long writes about her breast cancer and her appearance on the famous US game show.
  • At Vox EU, an analysis of the environmental benefits or otherwise of electric cars. The cars don’t emit any pollution as they run, but the power has to come from somewhere. In about half of the US, enough of the electricity comes from coal to make electric cars worse than efficient petrol or diesel cars. In NZ my impression is that a predictable night-time load would largely come from hydro, so electric cars would be green. In Australia, probably not.
  • You probably saw the Herald story on speeding by NZTA staff. A nice example of using data (obtained under the Official Information Act) to show the extent of an issue.