Posts filed under Random variation (132)

September 24, 2017

The polls

So, how did the polls do this time? First, the main result was predicted correctly: either side needs a coalition with NZ First.

In more detail, here are the results from Peter Ellis’s forecasts from the page that lets you pick coalitions.

Each graph has three arrows. The red arrow shows the 2014 results. The blue/black arrow pointing down shows the current provisional count and the implied number of seats, and the horizontal arrow points to Graeme Edgeler’s estimate of what the special votes will do (not because he claims any higher knowledge, but because his estimates are on a web page and explain how he did it).

First, for National+ACT+UnitedFuture


Second, for Labour+Greens


The result is well within  the uncertainty range of the predictions for Labour+Greens, and not bad for  National. This isn’t just because NZ politics is easy to predict: the previous election’s results are much further away. In particular, Labour really did gain a lot more votes than could reasonably have been expected a few months ago.


Update: Yes, there’s a lot of uncertainty. And, yes, that does  mean quoting opinion poll results to the nearest 0.1% is silly.

July 30, 2017

What are election polls trying to estimate? And is Stuff different?

Stuff has a new election ‘poll of polls’.

The Stuff poll of polls is an average of the most recent of each of the public political polls in New Zealand. Currently, there are only three: Roy Morgan, Colmar Brunton and Reid Research. 

When these companies release a new poll it replaces their previous one in the average.

The Stuff poll of polls differs from others by giving weight to each poll based on how recent it is.

All polls less than 36 days old get equal weight. Any poll 36-70 days old carries a weight of 0.67, 70-105 days old a weight 0.33 and polls greater than 105 days old carry no weight in the average.

In thinking about whether this is a good idea, we’d need to first think about what the poll is trying to estimate and about the reasons it doesn’t get that target quantity exactly right.

Officially, polls are trying to estimate what would happen “if an election were held tomorrow”, and there’s no interest in prediction for dates further forward in time than that. If that were strictly true, no-one would care about polls, since the results would refer only to the past two weeks when the surveys were done.

A poll taken over a two-week period is potentially relevant because there’s an underlying truth that, most of the time, changes more slowly than this.  It will occasionally change faster — eg, Donald Trump’s support in the US polls seems to have increased after James Comey’s claims about Clinton’s emails in the US, and Labour’s support in the UK polls increased after the election was called — but it will mostly change slower. In my view, that’s the thing people are trying to estimate, and they’re trying to estimate it because it has some medium-term predictive value.

In addition to changes in the underlying truth, there is the idealised sampling variability that pollsters quote as the ‘margin of error’. There’s also larger sampling variability that comes because polling isn’t mathematically perfect. And there are ‘house effects’, where polls from different companies have consistent differences in the medium to long term, and none of them perfectly match voting intentions as expressed at actual elections.

Most of the time, in New Zealand — when we’re not about to have an election — the only recent poll is a Roy Morgan poll, because  Roy Morgan polls more much often than anyone else.  That means the Stuff poll of polls will be dominated by the most recent Roy Morgan poll.  This would be a good idea if you thought that changes in underlying voting intention were large compared to sampling variability and house effects. If you thought sampling variability was larger, you’d want multiple polls from a single company (perhaps downweighted by time).  If you thought house effects were non-negligible, you wouldn’t want to downweight other companies’ older polls as aggressively.

Near an election, there are lots more polls, so the most recent poll from each company is likely to be recent enough to get reasonably high weight. The Stuff poll is then distinctive in that it complete drops all but the most recent poll from each company.

Recency weighting, however, isn’t at all unique to the Stuff poll of polls. For example, the poll of polls downweights older polls, but doesn’t drop the weight to zero once another poll comes out. Peter Ellis’s two summaries both downweight older polls in a more complicated and less arbitrary way; the same was true of Peter Green’s poll aggregation when he was doing it.  Curia’s average downweights even more aggressively than Stuff’s, but does not otherwise discard older polls by the same company. RadioNZ averages the only the four most recent available results (regardless of company) — they don’t do any other weighting for recency, but that’s plenty.

However, another thing recent elections have shown us is that uncertainty estimates are important: that’s what Nate Silver and almost no-one else got right in the US. The big limitation of simple, transparent poll of poll aggregators is that they say nothing useful about uncertainty.

April 14, 2017

Cyclone uncertainty

Cyclone Cook ended up a bit east of where it was expected, and so Auckland had very little damage.  That’s obviously a good thing for Auckland, but it would be even better if we’d had no actual cyclone and no forecast cyclone.  Whether the precautions Auckland took were necessary (at the time) or a waste  depends on how much uncertainty there was at the time, which is something we didn’t get a good idea of.

In the southeastern USA, where they get a lot of tropical storms, there’s more need for forecasters to communicate uncertainty and also more opportunity for the public to get to understand what the forecasters mean.  There’s scientific research into getting better forecasts, but also into explaining them better. Here’s a good article at Scientific American

Here’s an example (research page):


On the left is the ‘cone’ graphic currently used by the National Hurricane Center. The idea is that the current forecast puts the eye of the hurricane on the black line, but it could reasonably be anywhere in the cone. It’s like the little blue GPS uncertainty circles for maps on your phone — except that it also could give the impression of the storm growing in size.  On the right is a new proposal, where the blue lines show a random sample of possible hurricane tracks taking the uncertainty into account — but not giving any idea of the area of damage around each track.

There’s also uncertainty in the predicted rainfall.  NIWA gave us maps of the current best-guess predictions, but no idea of uncertainty.  The US National Weather Service has a new experimental idea: instead of giving maps of the best-guess amount, give maps of the lower and upper estimates, titled: “Expect at least this much” and “Potential for this much”.

In New Zealand, uncertainty in rainfall amount would be a good place to start, since it’s relevant a lot more often than cyclone tracks.

Update: I’m told that the Met Service do produce cyclone track forecasts with uncertainty, so we need to get better at using them.  It’s still likely more useful to experiment with rainfall uncertainty displays, since we get heavy rain a lot more often than cyclones. 

April 3, 2017

The recently ex-kids are ok

The New York Times had a story last week with the headline “Do Millennial Men Want Stay-at-Home Wives?”, and this depressing graphnyt

But, the graph doesn’t have any uncertainty indications, and while the General Social Survey is well-designed, that’s a pretty small age group (and also, an idiosyncratic definition of ‘millennial’)

So, I looked up the data and drew a graph with confidence intervals (full code here)


See the last point? The 2016 data have recently been released. Adding a year of data and uncertainty indications makes it clear there’s less support for the conclusion that it looked.

Other people did similar things: Emily Beam has a long post  including some context

The Pepin and Cotter piece, in fact, presents two additional figures in direct contrast with the garbage millennial theory – in Monitoring the Future, millennial men’s support for women in the public sphere has plateaued, not fallen; and attitudes about women working have continued to improve, not worsen. Their conclusion is, therefore, that they find some evidence of a move away from gender equality – a nuance that’s since been lost in the discussion of their work.

and Kieran Healy tweeted


As a rule if you see survey data (especially on a small subset of the population) without any uncertainty displayed, be suspicious.

Also, it’s impressive how easy these sorts of analysis are with modern technology. They used to require serious computing, expensive software, and potentially some work to access the data.  I did mine in an airport: commodity laptop, free WiFi, free software, user-friendly open-data archive.   One reason that basic statistics training has become much more useful in the past few decades is that so many of the other barriers to DIY analysis have been removed.

November 2, 2016

Lotto demographics

The headlines at both the Herald and Stuff say they’re about Lotto winners, but the vastly more numerous losers have to have basically the same demographics. That means any statistics drawn from a group of 12 winners are going to be very unreliable.

There some more reliable sources.  There’s (limited) information released by NZ Lotteries under the Official Information Act.  There’s also more detailed survey data from the 2012 Health and Lifestyles Survey (PDF)

Of the 12 people in today’s stories, 11 were men, even though men and women play Lotto at about the same rate. There’s a lot less variation by household income than I would have guessed. There is some variation by ethnicity, with Asians being less likely to play Lotto. People under 25 are a bit less likely to play. It’s all pretty boring.

I’ve complained a few times that clicky bogus polls have an error rate as bad as a random sample of about ten people, and are useless.  Here we have a random sample of about ten people, and it’s pretty useless.

Except as advertising.


October 18, 2016

The lack of change is the real story

The Chief Coroner has released provisional suicide statistics for the year to June 2016.  As I wrote last year, the rate of suicide in New Zealand is basically not changing.  The Herald’s story, by Martin Johnston, quotes the Chief Coroner on this point

“Judge Marshall interpreted the suicide death rate as having remained consistent and said it showed New Zealand still had a long way to go in turning around the unacceptably high toll of suicide.”

The headline and graphs don’t make this clear

Here’s the graph from the Herald


If you want a bar graph, it should go down to zero, and it would then show how little is changing


I’d prefer a line graph showing expected variation if there wasn’t any underlying change: the shading is one and two standard deviations around the average of the nine years’ rates


As Judge Marshall says, the suicide death rate has remained consistent. That’s our problem.  Focusing on the year to year variation misses the key point.

June 22, 2016

Making hospital data accessible

From the Guardian

The NHS is increasingly publishing statistics about the surgery it undertakes, following on from a movement kickstarted by the Bristol Inquiry in the late 1990s into deaths of children after heart surgery. Ever more health data is being collected, and more transparent and open sharing of hospital summary data and outcomes has the power to transform the quality of NHS services further, even beyond the great improvements that have already been made.

The problem is that most people don’t have the expertise to analyse the hospital outcome data, and that there are some easy mistakes to make (just as with school outcome data).

A group of statisticians and psychologists developed a website that tries to help, for the data on childhood heart surgery.  Comparisons between hospitals in survival rate are very tempting (and newsworthy) here, but misleading: there are many reasons children might need heart surgery, and the risk is not the same for all of them.

There are two, equally important, components to the new site. Underneath, invisible to the user, is a statistical model that predicts the surgery result for an average hospital, and the uncertainty around the prediction. On top is the display and explanation, helping the user to understand what the data are saying: is the survival rate at this hospital higher (or lower) than would be expected based on how difficult their operations are?

May 29, 2016

I’ma let you finish

Adam Feldman runs the blog Empirical SCOTUS, with analyses of data on the Supreme Court of the United States. He has a recent post (via Mother Jones) showing how often each judge was interrupted by other judges last year:


For those of you who don’t follow this in detail, Elena Kagan and Sonia Sotomayor are women.

Looking at the other end of the graph, though, shows something that hasn’t been taken into account. Clarence Thomas wasn’t interrupted at all. That’s not primarily because he’s a man; it’s primarily because he almost never says anything.

Interpreting the interruptions really needs some denominator. Fortunately, we have denominators. Adam Feldman wrote another post about them.

Here’s the number interruptions per 1000 words, with the judges sorted in order of  how much they speak


And here’s the same thing with interruption per 100 ‘utterances’


It’s still pretty clear that the female judges are interrupted more often (yes, this is statistically significant (though not very)). Taking the amount of speech into account makes the differences smaller, but, interestingly, also shows that Ruth Bader Ginsburg is interrupted relatively often.

Denominators do matter.

April 27, 2016

Not just an illusion

There’s a headline in the IndependentIf you think more celebrities are dying young this year, you’re wrong – it’s just a trick of the mind“. And, in a sense, Ben Chu is right. In a much more important sense, he’s wrong.

He argues that there are more celebrities at risk now, which there are. He says a lot of these celebrities are older than we realise, which they are. He says that the number of celebrity deaths this year is within the scope of random variation looking at recent times, which may well be the case. But I don’t think that’s the question.

Usually, I’m taking the other side of this point. When there’s an especially good or especially bad weekend for road crashes, I say that it’s likely just random variation, and not evidence for speeding tolerances or unsafe tourists or breath alcohol levels. That’s because usually the question is whether the underlying process is changing: are the roads getting safer or more dangerous.

This time there isn’t really a serious question of whether karma, global warming, or spiders from Mars are killing off celebrities.  We know it must be a combination of understandable trends and bad luck that’s responsible.  But there really have been more celebrities dying this year.   Prince is really dead. Bowie is really dead. Victoria Wood, Patty Duke, Ronnie Corbett, Alan Rickman, Harper Lee — 2016 has actually happened this way,  it hasn’t been (to steal a line from Daniel Davies) just a particularly inaccurate observation of the underlying population and mortality patterns.

April 17, 2016

Evil within?

The headlineSex and violence ‘normal’ for boys who kill women in video games: study. That’s a pretty strong statement, and the claim quotes imply we’re going to find out who made it. We don’t.

The (much-weaker) take-home message:

The researchers’ conclusion: Sexist games may shrink boys’ empathy for female victims.

The detail:

The researchers then showed each student a photo of a bruised girl who, they said, had been beaten by a boy. They asked: On a scale of one to seven, how much sympathy do you have for her?

The male students who had just played Grand Theft Auto – and also related to the protagonist – felt least bad for her. with an empathy mean score of 3. Those who had played the other games, however, exhibited more compassion. And female students who played the same rounds of Grand Theft Auto had a mean empathy score of 5.3.

The important part is between the dashes: male students who related more to the protagonist in Grand Theft Auto had less empathy for a female victim.  There’s no evidence given that this was a result of playing Grand Theft Auto, since the researchers (obviously) didn’t ask about how people who didn’t play that game related to its protagonist.

What I wanted to know was how the empathy scores compared by which game the students played, separately by gender. The research paper didn’t report the analysis I wanted, but thanks to the wonders of Open Science, their data are available.

If you just compare which game the students were assigned to (and their gender), here are the means; the intervals are set up so there’s a statistically significant difference between two groups when their intervals don’t overlap.


The difference between different games is too small to pick out reliably at this sample size, but is less than half a point on the scale — and while the ‘violent/sexist’ games might reduce empathy, there’s just as much evidence (ie, not very much) that the ‘violent’ ones increase it.

Here’s the complete data, because means can be misleading


The data are consistent with a small overall impact of the game, or no real impact. They’re consistent with a moderately large impact on a subset of susceptible men, but equally consistent with some men just being horrible people.

If this is an issue you’ve considered in the past, this study shouldn’t be enough to alter your views much, and if it isn’t an issue you’ve considered in the past, it wouldn’t be the place to start.