October 25, 2016


From the election on the other side of the Pacific.

Wikileaks also shows how John Podesta rigged the polls by oversampling democrats, a voter suppression technique.

Now, as Josh Marshall at Talking Points Memo goes on to point out, the email in question is not to or from John Podesta, is eight years old, and refers to the Democrats' internal polls, not to public polls. So it's kind of uninteresting. Except to me.  I'm a professional sampling nerd. I do research on oversampling; I publish papers on it; I write software about it: ways to do it and ways to correct for it. And just like a sailing nerd who has heard Bermuda rigging described as a threat to democracy, I'm going to explain more than you ever needed to know about oversampling.

The most basic form of oversampling in medical research has been widely used for over sixty years. If you want to study whether, say, smoking causes lung cancer, it's very inefficient to take a representative sample of the population because most people, fortunately, don't have lung cancer. You might need to sample 1000 people to get two people with lung cancer. If you have access to hospital records you could instead find maybe 200 people with lung cancer and 800 healthy control people.   Your case-control sample would have about the same cost as a representative sample of 1000 people, but nearly 100 times more information, since the information comes mostly from the rare cases and you now have 200 of them rather than two.  And there are more complex versions of the same idea.

Your case-control sample isn't representative, but you can still learn things from it.  At a simple level, if the lung cancer cases are more likely to smoke than the controls in the sample, that will also be true in the population. The relationship won't be the same as in the population, but it will be in the same direction.  For more detailed analysis we can undo the oversampling. Suppose we want to estimate the proportion of smokers in the population. The proportion of smokers in the sample is going to be too high, because we've oversampled lung-cancer patients, who are more likely to smoke. To be precise, we've got one hundred times too many lung cancer patients in the sample. We can fix that by giving each of them one hundred times less weight in estimating the population total. If 180 of the 200 lung cancer patients smoked, and 100 of the 800 controls did, you'd have a weighted numerator of 180×(1/100)+100×1, and a weighted denominator of 200×(1/100)+800×1, for an unbiased estimate of 12.7%, compared to the unweighted, biased (180+100)/1000 = 28%.
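
The arithmetic is easy to check directly. Here's a minimal sketch in Python, using the hypothetical numbers above; the 1/100 weight just undoes the hundredfold oversampling of cases:

```python
# Hypothetical case-control sample from the text: cases were oversampled
# 100-fold, so each case gets weight 1/100 to restore population proportions.
case_smokers, case_total = 180, 200   # lung-cancer patients
ctrl_smokers, ctrl_total = 100, 800   # healthy controls
w_case, w_ctrl = 1 / 100, 1.0

weighted_smokers = case_smokers * w_case + ctrl_smokers * w_ctrl
weighted_total = case_total * w_case + ctrl_total * w_ctrl

print(weighted_smokers / weighted_total)      # about 0.127: unbiased estimate
print((case_smokers + ctrl_smokers) / 1000)   # 0.28: unweighted, biased
```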

In polling, your question might be what issues are important to swing voters. You'd try to oversample swing voters to ask them, and not waste time and money annoying people whose minds were made up.  Obviously that would make your sample un-representative of the whole population. That's the point; you want to talk to swing voters, not to a representative sample.  Or you might want to compare the thinking of (generally pro-Trump) evangelical Christians and (often anti-Trump) Mormons. Again, if you oversampled conservative religious groups you'd end up with an unrepresentative sample; again, that would be the point. Oversampling isn't the best strategy when your primary purpose is finding out what a representative sample thinks; it often is the best strategy when you want to know more about some smaller group of people.

However, if you also wanted an estimate of the overall popular vote you could easily undo the oversampling and downweight the swing voters in your sample to get an unbiased estimate, as we did with the smoking rates.  You have to do that anyway;  even if you try to get a representative sample it probably won't work, because some groups of people are less likely to answer their phones and agree to talk to you.  The weighting you use to fix up accidental over- and under-sampling is exactly the same as the weighting you use when it's deliberate.
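
As a sketch of the same correction in a poll (the group sizes and support levels here are made up for illustration, not from any real survey): each respondent gets weight equal to their group's population share divided by its sample share, and the formula doesn't care whether the imbalance was designed or accidental.

```python
# Hypothetical poll: swing voters are 40% of the sample (deliberately
# oversampled, or just more willing to talk) but only 15% of the population.
sample = ([("swing", 1)] * 200 + [("swing", 0)] * 200 +
          [("decided", 1)] * 264 + [("decided", 0)] * 336)

pop_share = {"swing": 0.15, "decided": 0.85}
sample_share = {"swing": 0.40, "decided": 0.60}

# Weight = population share / sample share, the same whether the
# over-representation was deliberate or accidental.
weight = {g: pop_share[g] / sample_share[g] for g in pop_share}

num = sum(weight[g] * y for g, y in sample)
den = sum(weight[g] for g, _ in sample)
print(num / den)   # 0.449: weighted estimate, vs 0.464 unweighted
```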


Mitre 10 Cup Predictions for the Mitre 10 Cup Finals

Team Ratings for the Mitre 10 Cup Finals

The basic method is described on my Department home page.

Here are the team ratings prior to this week’s games, along with the ratings at the start of the season.

Team  Current Rating  Rating at Season Start  Difference
Canterbury 14.53 12.85 1.70
Tasman 10.59 8.71 1.90
Taranaki 7.38 8.25 -0.90
Auckland 6.55 11.34 -4.80
Counties Manukau 5.89 2.45 3.40
Otago 0.44 0.54 -0.10
Waikato -0.37 -4.31 3.90
Wellington -1.72 4.32 -6.00
North Harbour -2.53 -8.15 5.60
Manawatu -3.94 -6.71 2.80
Bay of Plenty -4.25 -5.54 1.30
Hawke’s Bay -5.76 1.85 -7.60
Northland -13.35 -19.37 6.00
Southland -16.96 -9.71 -7.30


Performance So Far

So far there have been 74 matches played, 52 of which were correctly predicted, a success rate of 70.3%.
Here are the predictions for last week’s games.

Game Date Score Prediction Correct
1 Otago vs. Bay of Plenty Oct 21 27 – 20 9.10 TRUE
2 Wellington vs. North Harbour Oct 22 37 – 40 6.50 FALSE
3 Canterbury vs. Counties Manukau Oct 23 22 – 7 12.10 TRUE
4 Taranaki vs. Tasman Oct 23 29 – 41 3.60 FALSE


Predictions for the Mitre 10 Cup Finals

Here are the predictions for the Mitre 10 Cup Finals. The prediction is my estimated expected points difference, with a positive margin being a win to the home team and a negative margin a win to the away team.

Game Date Winner Prediction
1 Otago vs. North Harbour Oct 28 Otago 7.00
2 Canterbury vs. Tasman Oct 29 Canterbury 7.90
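
For what it's worth, these predictions are consistent with a simple rule, though this is my reconstruction from the table above rather than the published method: home rating minus away rating, plus a home advantage of roughly four points.

```python
# A reconstruction from the tables above, not the published method:
# prediction ≈ home rating - away rating + home advantage.
ratings = {"Otago": 0.44, "North Harbour": -2.53,
           "Canterbury": 14.53, "Tasman": 10.59}
HOME_ADVANTAGE = 4.0  # assumed; chosen by eye to match the two predictions

def predict(home, away):
    return ratings[home] - ratings[away] + HOME_ADVANTAGE

print(round(predict("Otago", "North Harbour"), 1))  # ~7.0
print(round(predict("Canterbury", "Tasman"), 1))    # ~7.9
```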


October 24, 2016

Why so negative?

My StatsChat posts, and especially the ‘Briefly’ links, tend to be pretty negative about big data and algorithmic decision-making. I’m a statistician, and I work with large-scale personal genomic data, so you’d expect me to be more positive. This post is about why.

The phrase “devil’s advocate” has come to mean a guy on the internet arguing insincerely, or pretending to argue insincerely, just for the sake of being a dick. That’s not what it once meant. In the sixteenth century, Pope Sixtus V created the position of “Promoter of the Faith” to provide a skeptical examination of cases for sainthood. By the time a case for sainthood got to the Vatican, there would be a lot of support behind it, and one wouldn’t have to be too cynical to suspect there had been a bit of polishing of the evidence. The idea was to have someone whose actual job it was to ask the awkward questions — “devil’s advocate” was the nickname.  Most non-Catholics and many Catholics would argue that the position obviously didn’t achieve what it aimed to do, but the idea was important.

In the research world, statisticians are often regarded this way. We’re seen as killjoys: people who look at your study and find ways to undermine your conclusions. And we do. In principle you could imagine statisticians looking at a study and explaining why the results were much stronger than the investigators thought, but since people are really good at finding favourable interpretations without help, that doesn’t happen so much.

Machine learning includes some spectacular achievements, and has huge potential for improving our lives. It also has a lot of built-in support both because it scales well to making a few people very rich, and because it fits in with the human desire to know things about the world and about other people.

It’s important to consider the risks and harms of algorithmic decision making as well as the very real benefits. And it’s important that this isn’t left to people who can be dismissed as not understanding the technical issues.  That’s why Cathy O’Neil’s book Weapons of Math Destruction is important, and on a much smaller scale it’s why you’ll keep seeing stories about privacy or algorithmic prejudice here on StatsChat. As Section 162 (4) (a) (v) of the Education Act indicates, it’s my actual job.



  • I would never have guessed this was a problem, but “Data from three national surveys indicated that people are unaware that age is a risk factor for cancer. Moreover, those who were least aware perceived the highest risk of cancer regardless of age.” (free abstract but paywalled paper, via @RolfDegen)
  • Useful graph of uncertainty in vote margin and winner from Nate Silver on Twitter.
  • There’s a computer-personalised education system supported by Facebook that seems to be getting good results. On the other hand, the evidence for its effectiveness isn’t of very good quality, and the handling of data privacy is weak. There’s going to be a lot of this sort of issue coming up in the data-based policy world. (Washington Post)

Not the Nobel Prize for Statistics

Q: There isn’t a Nobel Prize for Statistics, is there?

A: No. We already talked about that.

Q: But there is a new big prize?

A: Yes, a group of five statistics organisations collaborated to create the “International Prize in Statistics”.

Q: And did someone win it?

A: Yes. To the vast surprise of no-one, it was won by Sir David Cox. (PDF)

Q: So what did he do?

A: He invented the Cox model. (And the other Cox model, but it was the Cox model he got the prize for.)

Q: And what is the Cox model?

A: It’s a regression model for censored time-to-event data. That is, you’re interested in modelling the time until something happens (death, unemployment, graduation) and you don’t get to observe the actual time for some people — they were still alive, employed, or studying when you stopped collecting data.

Q: That sounds useful. But why hadn’t someone already done it?

A: It was 1972.

Q: Oh.

A: And they had; it’s just that Cox’s model was better in some ways. In particular, it didn’t make assumptions about the rate of events over time, just about how different groups of people compared.

Q: Um..

A: Consider smokers and non-smokers. The model might say smokers get cancer at ten times the rate of non-smokers, but not have to assume anything about how those rates change with age.  Earlier models would have assumed the rates were constant over time, or that they had simple mathematical forms.

Q: And they don’t?

A: Exactly.
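
[In symbols, for readers who want them (this is the standard way the model is written, not anything from the prize citation): the event rate at time t for someone with covariates x is h(t | x) = h₀(t) exp(xβ), where the baseline rate h₀(t) is left completely unspecified and only the rate ratio exp(xβ) between groups is assumed constant over time.]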

Q: Ok, that sounds like a step forward. The model was popular, I suppose.

A: Yes, the paper presenting it has over 30,000 citations. It has more citations with a typo in the page number than my most-cited first-author statistics paper has in total.

Q: That many people have read it?

A: I didn’t say they’d read it. Nowadays, they mostly haven’t; they have read other papers or textbooks that mention it.

Q: So why hasn’t someone come up with a better model since 1972?

A: They have, but the Cox model is good enough to stay popular. And it was helped to popularity by being computationally well-behaved and mathematically interesting.

Q: Mathematically interesting?

A: The model is “semiparametric”: it has both rigid constraints (the ratios of rates are constant over time) and completely flexible parts (the pattern of events over time).  The estimator that Cox proposed is very simple, and in particular doesn’t involve estimating the flexible part of the model. It’s very unusual for that to work well, so mathematical statisticians wanted to study it and work out how to duplicate its success.
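
For the curious, here is a toy sketch of Cox's estimator in Python (one covariate, no tied event times; real software handles ties, multiple covariates, and much more). The log partial likelihood compares each person who has an event with everyone still "at risk" at that time, and the baseline rate h₀(t) cancels out, so it never has to be estimated:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_partial_likelihood(beta, x, time, event):
    """Cox's negated log partial likelihood: one covariate, no tied times.

    Each event is compared with its risk set (everyone whose time is at
    least as large); the baseline hazard cancels out of the ratio.
    """
    order = np.argsort(time)
    x, event = x[order], event[order]
    eta = beta * x
    # Reverse cumulative sum gives the risk-set total of exp(eta)
    # at each event time.
    risk = np.cumsum(np.exp(eta)[::-1])[::-1]
    return -np.sum(event * (eta - np.log(risk)))

# Toy data: exposed subjects (x = 1) tend to fail earlier.
time = np.array([2.0, 3.0, 5.0, 7.0, 11.0, 13.0])
event = np.array([1, 1, 1, 0, 1, 0])          # 0 = censored
x = np.array([1.0, 1.0, 0.0, 1.0, 0.0, 0.0])

fit = minimize_scalar(neg_log_partial_likelihood, args=(x, time, event))
print(np.exp(fit.x))   # estimated rate ratio between x=1 and x=0
```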

Q: And did they?

A: Not really. They understand how it works, but it’s not something you can make work in general. Cox was lucky and/or brilliant.

Q: Did Cox do anything else important?

A: Lots. He wrote or co-wrote 17 books on different areas of statistics, several of which became classics. He’s written a few hundred other research papers. He’s had 63 PhD students (he was my advisors’ advisor’s advisor’s advisor). And ..

Q: Ok, enough already. Where did he study statistics?

A: He didn’t really. He got a degree in maths (in two years, because there was a war on), then went to work for the Wool Industry Research Association before doing a PhD. Later, he visited the US and thought he’d have to move there because there weren’t jobs in Britain (though that didn’t happen).

Q: Well, that part of his experience is still easy to duplicate in many countries.

A: Sadly, yes.


[updated because I have problems with reading comprehension]

Stat of the Week Competition: October 22 – 28 2016

Each week, we would like to invite readers of Stats Chat to submit nominations for our Stat of the Week competition and be in with the chance to win an iTunes voucher.

Here’s how it works:

  • Anyone may add a comment on this post to nominate their Stat of the Week candidate before midday Friday October 28 2016.
  • Statistics can be bad, exemplary or fascinating.
  • The statistic must be in the NZ media during the period of October 22 – 28 2016 inclusive.
  • Quote the statistic, when and where it was published and tell us why it should be our Stat of the Week.

Next Monday at midday we’ll announce the winner of this week’s Stat of the Week competition, and start a new one.


October 23, 2016

Psychic meerkats and Halloween masks

Prediction is hard — especially,  as the Danish proverb says, when it comes to the future. In the Rugby World Cup we had psychic meerkats. For the US elections the new bogus prediction trend is Halloween masks: allegedly, more masks are sold with the face of the candidate who goes on to win.

The first question with a claim like this one, especially given some of the people making it, is whether the historical claim is true.  In this case it’s true-ish.  The claim was made before the 2012 election, and while the data aren’t comprehensive, they are from the same big chain of stores each year. From 1980 to 2012, the mask rule has predicted the eventual winner of the presidency.  That’s actually an argument against it.

If there’s more to the mask sales than there is to psychic meerkats, it would have to be as a prediction of the popular vote — you’d need data from individual states to predict the weird US Electoral College. But if the mask rule got the 2000 election right, it must have got the popular vote wrong that year — George W. Bush won the electoral college, but lost the popular vote to Al Gore. From that point of view, we’re looking at 8 out of 9.

More importantly, 9 out of 9 isn’t all that impressive. Suppose you got your predictions by flipping a coin.  Your chance of matching all nine results, counting either heads or tails as a Republican win, is 1 in 256, increasing to 1 in 128 if you’re also allowed to choose which way to treat the 2000 election.  The chance of getting 8 of 9 agreement is much better: about 1 in 13.  If only one in a million people in the US had tried coming up with just one prediction rule each, you’d expect someone to get it perfect and dozens to get it nearly right.
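
For anyone who wants to check the arithmetic, here is one way to reproduce it (a sketch; I'm treating the extra freedom over the 2000 election as roughly doubling each probability):

```python
from math import comb

n = 9  # presidential elections, 1980 through 2012

# A coin matches all n outcomes; the rule's direction (which face
# counts as a Republican win) can be chosen after the fact, hence the 2.
p_perfect = 2 * comb(n, n) / 2 ** n
print(p_perfect)        # 1/256

# Allowing either reading of the disputed 2000 result roughly doubles it.
print(2 * p_perfect)    # about 1/128

# At least 8 of 9 right, with both the direction and the 2000 reading free.
p_8 = 2 * 2 * (comb(n, 9) + comb(n, 8)) / 2 ** n
print(p_8)              # about 0.078, i.e. roughly 1 in 13
```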

Given these odds, it wouldn’t be surprising if, say, a US professional sports team had results agreeing with the Presidential results — and in fact, there was a rule based on the results for the Washington Redskins football team that worked from 1940 to 2000, was fudged to work in 2004, and then failed completely in 2012.    That’s 17/19 correct, but since the rule was first publicised in the run-up to the 2000 election, it’s 2/4 correct in actual use.

If you’re allowed to combine multiple variables it gets even easier to find rules. With anything from basic linear regression to a neural network you’d expect to get perfect prediction from five unrelated variables. Even restricting the models to be simple doesn’t help much.  I downloaded some OECD data on national GDP for various countries, and found that since 1980 the Republicans have won the popular vote precisely in years when the GDP of Sweden increased more than the GDP of Norway.
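
You can see the multiple-variables problem for yourself with a quick simulation (a sketch with made-up random "predictors"; no election data involved): fit nine binary outcomes on five pure-noise variables and count how often the in-sample "predictions" come out perfect.

```python
import numpy as np

rng = np.random.default_rng(2016)
outcomes = rng.integers(0, 2, size=9)   # nine elections' worth of results

trials, perfect = 1000, 0
for _ in range(trials):
    noise = rng.normal(size=(9, 5))            # five unrelated variables
    X = np.column_stack([np.ones(9), noise])   # add an intercept
    beta, *_ = np.linalg.lstsq(X, outcomes, rcond=None)
    predicted = (X @ beta) > 0.5
    perfect += np.array_equal(predicted, outcomes)

print(perfect / trials)   # fraction of pure-noise fits that are perfect in-sample
```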

My advice is to stick with the psychic meerkats for entertainment and the opinion poll aggregators or the betting markets for prediction.

October 22, 2016

Stat of the Week fixed

Because of changes at WordPress, the Stat of the Week competition has been eating the URLs you submitted.

We’ve fixed it now.

Cheese addiction hoax again

Three more sites have fallen for the cheese addiction hoax.

As you may remember, this story is very very loosely based on real research from the University of Michigan. However, the hoax version misrepresents which foods were most addictive and makes up an explanation based on the milk protein casein that isn’t mentioned in the real research at all.

The reason I’m calling this a hoax is that it wasn’t the fault of the researchers, their institution, or the journal, and it’s obvious to anyone who makes any attempt to scan the research paper that it doesn’t support the story. It isn’t an innocent mistake, and it isn’t a simple exaggeration like most misleading health science stories.

There’s a good post at Science News describing what was actually found.

October 20, 2016

Brute force and ignorance

At a conference earlier this week, a research team from Microsoft described a computer system for speech transcription. For the first time ever, this system did better than humans on a standard set of recordings.

What’s more impressive — and StatsChat relevant — is that this computer system does not understand anything about the conversations it writes down. The system does not know English, or any other human language, even in the sense that Siri does.

It has some preconceived notions about what tends to follow a particular word, pair of words, or triple of words, and about what sequences of sounds tend to follow each other, but nothing about nouns or verbs or how colorless green ideas sleep. As with modern image recognition, the system is just based on heaps and heaps of data and powerful computers.  It’s computing and statistics, not linguistics.
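
To make the "what tends to follow a pair of words" idea concrete, here's a toy sketch: a counting model on a made-up corpus, nothing like the scale or sophistication of the real system, but the same spirit of statistics without grammar.

```python
from collections import Counter, defaultdict

def train_trigrams(tokens):
    """Count, for each pair of adjacent words, what word follows it."""
    following = defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        following[(a, b)][c] += 1
    return following

corpus = "the cat sat on the mat and the cat sat on the hat".split()
model = train_trigrams(corpus)

# The model "knows" nothing about nouns or verbs, only counts.
print(model[("cat", "sat")].most_common(1))   # [('on', 2)]
print(model[("sat", "on")].most_common(1))    # [('the', 2)]
```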

In a comment on a post at Language Log, the linguist Geoffrey Pullum says

I must confess that I never thought I would see this day. In the 1980s, I judged fully automated recognition of connected speech (listening to connected conversational speech and writing down accurately what was said) to be too difficult for machines, far more difficult than syntactic and semantic processing (taking an error-free written sentence as input, recognizing which sentence it was, analysing it into its structural parts, and using them to figure out its literal meaning). I thought the former would never be accomplished without reliance on the latter.

There are many problems where there isn’t enough data available to construct a model with no understanding of the problem. There won’t be a shortage of work for human statisticians or linguists any time soon. But there are problems where brute force and ignorance works, and they aren’t always the ones we expect.