November 25, 2015

What do statisticians do all day?

The New Zealand Statistical Association is having its annual meeting at the moment in Christchurch. It’s hard for a lot of people to imagine how there could be new research in statistics, so here are some examples from the awards.

Maxine Pfannkuch won the Association’s lifetime achievement award, for her work on statistics education. She studies how people (mostly schoolkids) draw informal statistical conclusions from data and from graphics, and looks for ways to teach them to do it better. A lot of the improvements in the high-school stats curriculum are her fault.

Mark Holmes won the research award. His research is harder to explain in simple terms, but he studies random processes that accumulate over time — like the shape of the trail left by a randomly-moving point.

Blair Robertson won the junior research award. He used to be an applied mathematician, working on optimisation — finding the best value of a complicated function. He now uses similar techniques to come up with improved ways to choose sets of locations in space and time for environmental sampling.

Maarten Kruijver won a ‘young statistician’ talk award. He works in forensic statistics, looking at ways to estimate the chance that a DNA sample from a crime scene will coincidentally look as if it is from a close relative of someone in the police database.

Anjali Gupta won the other ‘young statistician’ talk award. She is studying a laser-based technique for measuring chemical composition of things, with forensics being one application. She was studying the variation in measurements for the same object over time, to understand more about the accuracy of the technique.

Why we can’t trust crime analyses in New Zealand

Jarrod Gilbert has spent a lot of time hanging out with people in biker gangs.

That’s how he wrote his book, Patched, a history of gangs in New Zealand.  According to the Herald, it’s also the police’s rationale for not letting him have access to crime data. I don’t know whether it would be more charitable to the Police to accept that this is their real reason or not.

Conceivably, you might be concerned about access to these data for people with certain sorts of criminal connections. There might be ways to misuse the data, perhaps for some sort of scam on crime victims. No-one suggests that is the sort of association with criminals that Dr Gilbert has.

It gets worse. According to Dr Gilbert, also writing in the Herald, the standard data access agreement for the data says police “retain the sole right to veto any findings from release.” Even drug companies don’t get away with those sorts of clauses nowadays.

To the extent these reports are true, we can’t entirely trust any analysis of New Zealand crime data that goes beyond what’s publicly available. There might be a lot of research that hasn’t been affected by censorship and threats to block future work, but we have no way of picking it out.

November 24, 2015

Book recommendations

It’s the time of year when people are asking “What can I buy for my favourite nerd?”. Here are some books s/he might like, a mixture of older and new.

  • Thing Explainer by Randall Munroe (of XKCD fame). A coffee-table book of annotated drawings, along the lines of his 2012 Up Goer Five. I reviewed this for the Listener.  It’s really good.
  • Eureka: Discovering Your Inner Scientist by Chad Orzel.  I’ve mentioned this book before on StatsChat. It’s a great look at how science works. In the process, it attacks a lot of the myths about scientists.
  • Thinking, Fast and Slow by Daniel Kahneman. With Amos Tversky, he pioneered the study of why people are bad at probability and risk assessments. He won a Nobel-like Prize for Economics in 2002, well before the book came out. Unlike many books of its kind, it doesn’t need the subtitle “Why I am Right About Everything”.
  • The Immortal Life of Henrietta Lacks, by Rebecca Skloot. The HeLa cell line is a mainstay of laboratory research, but until fairly recently even most scientists didn’t know where it came from. Skloot’s book tells the story of an African-American woman treated for cervical cancer at Johns Hopkins, and how her cells lived on without her or her family’s knowledge.
  • How Not To Be Wrong by Jordan Ellenberg. Ellenberg is a highly-respected pure mathematician, but his book is about statistical thinking in everyday life.
  • The Canon, by Natalie Angier. A survey of the most important things we’ve learned about the universe. Includes a chapter on probability and statistics featuring the wonderful Deb Nolan.
  • The Secret Life of Money, by Daniel Davies and Tess Read. One of the many books trying to explain the world in terms of microeconomics or vice versa. Less everything-you-know-is-wrong and more entertaining writing than Freakonomics.  I’m not sure if reading Davies’s account of a visit to New Zealand will make you more or less likely to want the book.
  • The Philadelphia Chromosome, by Jessica Wapner. This tells the story of the selective tyrosine kinase inhibitor imatinib (Gleevec) and its (largely unfulfilled) promise of cancer treatment targeting the cause of disease without toxic side-effects.
  • The Disappearing Spoon, by Sam Kean. The eponymous spoon is made of gallium, which melts at about 30°C; the book is an entertaining and informative survey of the periodic table.

November 23, 2015

Stat of the Week Competition: November 21 – 27 2015

Each week, we would like to invite readers of Stats Chat to submit nominations for our Stat of the Week competition and be in with the chance to win an iTunes voucher.

Here’s how it works:

  • Anyone may add a comment on this post to nominate their Stat of the Week candidate before midday Friday November 27 2015.
  • Statistics can be bad, exemplary or fascinating.
  • The statistic must be in the NZ media during the period of November 21 – 27 2015 inclusive.
  • Quote the statistic, when and where it was published and tell us why it should be our Stat of the Week.

Next Monday at midday we’ll announce the winner of this week’s Stat of the Week competition, and start a new one.


November 22, 2015

Helpful context

From the Herald

A study by sleep experts at Sealy UK found that those who kip on the right-hand side of the mattress are far more pessimistic than those who doze on the left.


Neil Robinson, Sealy’s top snooze analyst, said: “The research certainly highlights an interesting trend – could it be possible that the left side of bed is the ‘right’ side?”



  • There’s a survey put out by the World Bank with what it calls basic financial literacy questions. Lots of people didn’t give the intended answers.  As Felix Salmon explains, that’s because they were silly questions:

Nothing useful can be learned by going up to poor workers in, say, Afghanistan (to take the very first country on the list), and asking them this question. They don’t have banks, and if they do have banks they don’t have savings accounts, and if they do have savings accounts they don’t hold on to them for five years, and if they do hold on to them for five years they’ll probably end up with nothing at all…

  • From Andrew Gelman’s blog, a couple of posts on accidental and deliberate wrong answers in surveys.
  • Graeme Edgeler explains again why you don’t need deliberate wrong answers in the flag referendum
  • Some things shouldn’t be maps. One example is homes of 187 victims of child homicide over the past 23 years, mapped with the 2013 deprivation index in Stuff.  On top of the inappropriateness of the map, and the time misalignment, there does actually exist serious research on risk factors for child abuse, both here and abroad: it’s not a matter of Stuff ‘discovering’ things.
  • David Spiegelhalter on an example of misreporting of criticism of misreporting of stats
  • US artist Chad Hagen has a lovely set of prints titled “Nonsensical Infographics”, with the form of data visualisation but no content


November 20, 2015

Headline inflation

The breakthrough of the decade doesn’t happen most years, and the breakthrough of the year doesn’t happen most weeks, but you still need to put out a health news section.  If you do it by hyping whatever turns up, your headlines end up not having a lot of information value.

So, today, “Blood test for ovarian cancer ‘100% accurate’” in the Herald is grade inflation. The researchers at Georgia Tech have some impressive findings, but their test still hasn’t been evaluated on anyone other than the 95 women whose cancer status was known in advance and whose blood was used to develop the test. As the research paper says:

…because the disease is in low prevalence in the general population (~0.1% in USA), a screening test must attain a positive predictive value (PPV) of >10%, with a specificity ≥99.6% and a sensitivity ≥75% to be of clinical relevance in the general population

That is, they want the test to give no more than 4 false positives per 1000 healthy women. So far, they’ve only looked at 49 healthy women.
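To see where those thresholds come from, here is a rough sketch in Python; it is not anything from the paper. The function names are made up for illustration, and the second calculation assumes that none of the 49 healthy women tested positive, which is what a ‘100% accurate’ headline implies.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Chance that a woman with a positive test actually has the disease."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)


def upper_bound_with_zero_events(n, confidence=0.95):
    """One-sided exact upper confidence bound for an event rate
    when 0 events were seen in n independent trials."""
    return 1 - (1 - confidence) ** (1 / n)


# Thresholds quoted in the paper: prevalence ~0.1%, sensitivity 75%,
# specificity 99.6% (that is, 4 false positives per 1000 healthy women).
ppv = positive_predictive_value(0.001, 0.75, 0.996)

# Assumption: 0 false positives among the 49 healthy women in the study.
max_false_positive_rate = upper_bound_with_zero_events(49)

print(f"PPV at the paper's thresholds: {ppv:.1%}")
print(f"False positive rate could still be as high as: {max_false_positive_rate:.1%}")
```

At the paper’s thresholds the PPV works out to about 16%, which clears the 10% target. Under the zero-false-positive assumption, though, 49 healthy women are still compatible with a false positive rate of nearly 6%, roughly fifteen times the 0.4% the paper asks for.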

The story is better than the headline about how significant this is, and it quotes an independent expert.

Dr Simon Newman, of Target Ovarian Cancer, said: “It is exciting preliminary research. It’s crucial to diagnose ovarian cancer promptly, as up to 90 per cent of women would live for five or more years if diagnosed at the earliest stage.

“However, this highly promising discovery needs significant further development and validation in large clinical trials before we know if it is suitable for screening the general population and works as well as predicted.

Even that’s exaggerated. We just don’t know what the survival would be with early diagnosis. At the moment, you have to be very fortunate to have your ovarian cancer detected at the earliest stage, and these tumours might be very non-representative.  We’ve seen real but smaller-than-expected benefits from screening in other cancers.

There are worse problems with the story than a bit of exaggeration, though. It gets the scientific idea completely wrong, saying:

But when Georgia Institute of Technology researchers looked at the blood of 46 women in the early stages of the disease and that of 49 healthy women, the cancerous samples contained different levels of 16 proteins compared with the healthy ones.

The innovative step in this research was to not use proteins. As the press release says

“People have been looking at proteins for diagnosis of ovarian cancer for a couple of decades, and the results have not been very impressive,”

Instead, the researchers looked at ‘metabolites’, smaller molecules produced by cell processes. Their hypothesis was that tumours might have varying genetic changes and varying proteins, but if they ended up as cancer they would have some cellular processes in common.


November 19, 2015

False positives

I searched for “Joe Hill” on Google a few months ago, and the “aren’t we clever” box popped up with:


The statistics and computation behind these searches are impressive: in addition to all the usual Google stuff, the system realises that the – fairly common – words “joe” and “hill” occur together sufficiently often that they are probably a thing. Then it takes advantage of Wikipedia to realise that “joe hill” is the name of a person, not a geographical feature or a coffee shop (or, I suppose, profanity), and finds pictures and information. And it almost works – even with people who aren’t especially well known.

The gentleman on the left really is Joe Hill (author), aka Joseph Hillstrom King. One of his books has been made into a movie starring Daniel Radcliffe, so he’s definitely successful but not in any sense a mainstream celebrity. The gentleman on the right is someone else. People with an interest in labour history or folk music will recognise Joe Hill (activist), aka Joseph Hillstrom, aka Joel Emmanuel Hägglund: “I dreamed I saw Joe Hill last night, alive as you and me”. It’s an understandable mistake for the Google: the modern Joe Hill was named after the historical one, and there will be a lot of cross-referencing of the two. And it doesn’t really matter.

Joe Hill (activist) was involved in a rather more important false positive. The song says “they framed you on a murder charge”, and it’s only exaggerating a bit. There was strong circumstantial evidence and Hill refused to give any explanations, but it also appears the eyewitness testimony was manufactured. He was executed 100 years ago today.

November 18, 2015

Old-time graphics advice

  1. “We must keep symbols to a minimum, so as not to overload the reader’s memory. Some ancient authors, by covering their cartograms with hieroglyphics, made them indecipherable.”
  2. “One of us recommends adopting scales for ordinate and abscissa so the average slope of the phenomenon corresponds to the tangent of the curve at an angle of 45°”.
  3. “Areas are often used in graphic representations. However, they have the disadvantage of often misleading the reader even though they were designed according to indisputable geometric principles. Indeed, the eye has a hard time appreciating areas.”
  4. “We should not, as it is sometimes done, cut the bottom of the diagram under the pretext that it is useless. This arbitrary suppression distorts the chart by making us think that the variations of the function are more important than they really are.”
  5. “In order to increase the means of expression without straining the reader’s memory, we often build cartograms with two colors. And, indeed, the reader can easily remember this simple formula: ‘The more the shade is red, the more the phenomenon studied surpasses average; the more the shade is blue, the more the phenomenon studied is below average.’ ”

These are from a failed attempt to get the International Institute of Statistics to set up some standards for statistical graphics. In 1901.

(from Hadley Wickham)
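
Item 2 is the idea now usually called banking to 45°. Here is a minimal sketch in Python of one simple reading of it, taking the median absolute slope of the line segments as the ‘average slope’. That choice and the function name are assumptions for illustration, not anything in the 1901 text.

```python
from statistics import median


def banked_aspect_ratio(xs, ys):
    """Return the plot height/width ratio at which the median absolute
    slope of the line segments is drawn at 45 degrees."""
    x_range = max(xs) - min(xs)
    y_range = max(ys) - min(ys)
    # Absolute slope of each segment, in data units (vertical segments skipped).
    slopes = [
        abs((y1 - y0) / (x1 - x0))
        for x0, y0, x1, y1 in zip(xs, ys, xs[1:], ys[1:])
        if x1 != x0
    ]
    if not slopes or x_range == 0 or y_range == 0 or median(slopes) == 0:
        raise ValueError("data are too flat or too short to choose an aspect ratio")
    # On the page, a data slope s is drawn with slope
    # s * (height / y_range) / (width / x_range); setting that to 1 gives:
    return y_range / (x_range * median(slopes))


# A zigzag with unit slopes over a 4-by-1 data range.
zigzag_ratio = banked_aspect_ratio([0, 1, 2, 3, 4], [0, 1, 0, 1, 0])
print(zigzag_ratio)
```

For the zigzag the ratio is 0.25: the plot should be four times as wide as it is tall, which is just equal scaling on both axes for these data.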

November 16, 2015

Measuring gender

So, since we’re having a Transgender Week of Awareness at the moment, it seems like a good time to look at how statisticians ask people about gender, and why it’s harder than it looks.

By ‘harder than it looks’ I don’t just mean that it isn’t a binary question; we’re past that stage, I hope.  Also, this isn’t about biological sex — in genetics I do sometimes care how many X chromosomes someone has, but most questionnaires don’t need to know. It’s harder than it looks because there isn’t just one question.

The basic Male/Female binary question can be extended in (at least) two directions.  The first is to add categories to represent other ways people identify their gender beyond just male/female, which can be fluid over time, or can have more than two categories. Here a write-in option is useful since you almost certainly don’t know all the distinctions people care about across different cultures. In a specialised questionnaire you might even want to separate out questions about fluid/constant identity from non-binary/diversity, but for routine use that might be more than you need.

A second direction is to ask about transgender status, which is relevant for discrimination and (or thus) for some physical and mental health risks. (Here you might also want to find out about people who, say, identify as female but present as male.) We have very little idea how many people are transgender (it makes data on sexual orientation look really precise) and that’s a problem for service provision and in many other areas.

Life would get simpler for survey collectors if you combined these into a single question, or if you had a Male/Female/It’s Complicated question with follow-up questions for the third group. On the other hand, it’s pretty clear why trans people don’t like that approach. These really are different questions. For people whose answer to the first question is something like “it depends” or a culturally specific third option, the combination may not be too bad. The problem comes when the answer to the second type of question might be “Trans (and yes I sometimes get comments behind my back at work but most people are fine)”, but the answer to the first is “Female (and just as female as people with ovaries and a birth certificate, ok)”.

Earlier this year Stats New Zealand ran a discussion and had a go at a better gender question, and it is definitely better than the old one, especially when it allows for multiple answers and for a write-in answer. They also have a ‘synonym list’ to help people work with free-text answers, although that’s going to be limited if all it does is map back to binary or three-way groups. What they didn’t do was to ask for different types of information separately. [edit: i.e., they won’t let you unambiguously say ‘female’ in an identity question then ‘trans’ in a different question]

It’s true that for a lot of purposes you don’t need all this information. But then, for a lot of purposes you don’t actually need to know anything about gender.

(via Writehanded and Jennifer Katherine Shields)