Posts filed under Random variation (93)

July 2, 2014

What’s the actual margin of error?

The official maximum margin of error for an election poll with a simple random sample of 1000 people is 3.099%. Real life is more complicated.

In reality, not everyone is willing to talk to the nice researchers, so they either have to keep going until they get a representative-looking number of people in each group they are interested in, or take what they can get and reweight the data — if young people are under-represented, give each one more weight. Also, they can only get a simple random sample of telephones, so there are more complications in handling varying household sizes. And even once they have 1000 people, some of them will say “Dunno” or “The Conservatives? That’s the one with that nice Mr Key, isn’t it?”

After all this has shaken out it’s amazing the polls do as well as they do, and it would be unrealistic to hope that the pure mathematical elegance of the maximum margin of error held up exactly.  Survey statisticians use the term “design effect” to describe how inefficient a sampling method is compared to ideal simple random sampling. If you have a design effect of 2, your sample of 1000 people is as good as an ideal simple random sample of 500 people.

We’d like to know the design effect for individual election polls, but it’s hard. There isn’t any mathematical formula for design effects under quota sampling, and while there is a mathematical estimate for design effects after reweighting it isn’t actually all that accurate.  What we can do, thanks to Peter Green’s averaging code, is estimate the average design effect across multiple polls, by seeing how much the poll results really vary around the smooth trend. [Update: this is Wikipedia's graph, but I used Peter's code]


I did this for National because it’s easiest, and because their margin of error should be close to the maximum margin of error (since their vote is fairly close to 50%). The standard deviation of the residuals from the smooth trend curve is 2.1%, compared to 1.6% for a simple random sample of 1000 people. That would be a design effect of (2.1/1.6)2, or 1.8.  Based on the Fairfax/Ipsos numbers, about half of that could be due to dropping the undecided voters.

In principle, I could have overestimated the design effect this way because sharp changes in party preference would look like unusually large random errors. That’s not a big issue here: if you re-estimate using a standard deviation estimator that’s resistant to big errors (the median absolute deviation) you get a slightly larger design effect estimate.  There may be sharp changes, but there aren’t all that many of them, so they don’t have a big impact.

If the perfect mathematical maximum-margin-of-error is about 3.1%, the added real-world variability turns that into about 4.2%, which isn’t that bad. This doesn’t take bias into account — if something strange is happening with undecided voters, the impact could be a lot bigger than sampling error.


June 4, 2014

How much disagreement should there be?

The Herald

Thousands of school students are being awarded the wrong NCEA grades, a review of last year’s results has revealed.

Nearly one in four grades given by teachers for internally marked work were deemed incorrect after checking by New Zealand Qualifications Authority moderators.

That’s not actually true, because moderators don’t deem grades to be incorrect. That’s not what moderators are for.  What the report says (pp105-107 in case you want to scroll through it) is that in 24% of cases the moderator and the internal assessor disagreed on grade, and in 12% they disagreed on whether the standard had been achieved.

What we don’t know is how much disagreement is appropriate. The only way the moderator’s assessment could be considered error-free is if you define the ‘right answer’ to be ‘whatever the moderator says’, which is obviously not appropriate. There always will be some variation between moderators, and some variation between schools, and what we want to know is whether there is too much.

The report is a bit disappointing from that point of view.  At the very least, there should have been some duplicate moderation. That is, some pieces of work should have been sent to two different moderators, so we could have an idea of the between-moderator agreement rate. Then, if we were willing to assume that moderators collectively were infallible (though not individually), we could estimate how much less reliable the internal assessments were.

Even better would be to get some information on how much variation there is between schools in the disagreement: if there is very little variation, the schools may be doing about as well as is possible, but if there is a lot of variation between schools it would suggest some schools aren’t assessing very reliably.


May 28, 2014

‘Balanced’ Lotto reporting

From ChCh Press

Are you feeling lucky?

The number drawn most often in Saturday night’s Lotto is one.

The second is seven, the third is lucky 13, followed by 21, 38 and 12.

And if you are selecting a Powerball for Saturday’s draw, the record suggests two is a much better pick than seven.

The numbers are from Lotto Draw Frequency data provided by Lotto NZ for the 1406 Lottery family draws held to last Wednesday.

The Big Wednesday data shows the luckiest numbers are 30, 12, 20, 31, 28 and 16. And heads is drawn more often (232) than tails (216), based on 448 draws to last week.

In theory, selecting the numbers drawn most often would result in more prizes and avoiding the numbers drawn least would result in fewer losses. The record speaks for itself.

Of course this is utter bollocks. The record is entirely consistent with the draw being completely unpredictable, as you would also expect it to be if you’ve ever watched a Lotto draw on television and seen how they work.

This story is better than the ones we used to see, because it does go on and quote people who know what they are talking about, who point out that predicting this way isn’t going to work, and then goes on to say that many people must understand this because they do just take random picks.  On the other hand, that’s the sort of journalistic balance that gets caricatured as “Opinions differ on shape of Earth.”

In world historical terms it doesn’t really matter how these lottery stories are written, but they are missing a relatively a simple opportunity to demonstrate that a paper understands the difference between fact and fancy and thinks it matters.

May 23, 2014

Is Roy Morgan weird?

There seems to be a view that the Roy Morgan political opinion poll is more variable than the others, even to the extent that newspapers are willing to say so, eg, Stuff on May 7

The National Party has taken a big hit in the latest Roy Morgan poll, shedding 6 points to 42.5 per cent in the volatile survey.

I was asked about this on Twitter this morning, so I went to get Peter Green’s data and aggregation model to see what it showed. In fact, there’s not much difference between the major polling companies in the variability of their estimates. Here, for example, are poll-to-poll changes in the support for National in successive polls for four companies



And here are their departures from the aggregated smooth trend



There really is not much to see here. So why do people feel that Roy Morgan comes out with strange results more often? Probably because Roy Morgan comes out with results more often.

For example, the proportion of poll-to-poll changes over 3 percentage points is 0.22 for One News/Colmar Brunton, 0.18 for Roy Morgan, and 0.23 for 3 News/Reid Research, all about the same, but the number of changes over 3 percentage points in this time frame is 5 for One News/Colmar Brunton, 14 for Roy Morgan, and 5 for 3 News/Reid Research.

There are more strange results from Roy Morgan than for the others, but it’s mostly for the same reason that there are more burglaries in Auckland than in the other New Zealand cities.

May 9, 2014

Terrible, horrible, no good, very bad month

From Stuff

The road toll has moved into triple figures for 2014 following the deadliest April in four years.

Police are alarmed by the rising number of deaths, that are a setback after the progress in 2013 when 254 people died in crashes in the whole year – the lowest annual total since 1950.

So far this year 102 people have died on the roads, 15 more than at the same point in 2013, Assistant Commissioner Road Policing Dave Cliff said today.

The problem with this sort of story is how it omits the role of random variation — bad luck.  The Police are well aware that driving mistakes usually do not lead to crashes, and that the ones which do are substantially a matter of luck, because that’s key to their distracted driver campaign. As I wrote recently, their figures on the risks from distracted driving are taken from a large US study which grouped together a small number of actual crashes with a lot of incidents of risky driving that had no real consequence.

The importance of bad luck in turning bad driving into disaster means that the road toll will vary a lot. The margin of error around a count of 102 is about +/- 20, so it’s not clear we’re seeing more than misfortune in the change.  This is especially true because last year was the best on record, ever. We almost certainly had good luck last year, so the fact that it’s wearing off a bit doesn’t mean there has been a real change in driver behaviour.

It was a terrible, horrible, no good, very bad month on the roads, but some months are like that. Even in New Zealand.

May 8, 2014

Think I’ll go eat worms

This table is from a University of California alumni magazine



Jeff Leek argues at Simply Statistics that the big problem with Big Data is they, too, forgot statistics.

May 5, 2014

Verging on a borderline trend

From Matthew Hankins, via a Cochrane Collaboration blog post, the first few items on an alphabetical list of ways to describe failure to meet a statistical significance threshold

a barely detectable statistically significant difference (p=0.073)
a borderline significant trend (p=0.09)
a certain trend toward significance (p=0.08)
a clear tendency to significance (p=0.052)
a clear trend (p<0.09)
a clear, strong trend (p=0.09)
a considerable trend toward significance (p=0.069)
a decreasing trend (p=0.09)
a definite trend (p=0.08)
a distinct trend toward significance (p=0.07)
a favorable trend (p=0.09)
a favourable statistical trend (p=0.09)
a little significant (p<0.1)
a margin at the edge of significance (p=0.0608)
a marginal trend (p=0.09)
a marginal trend toward significance (p=0.052)
a marked trend (p=0.07)
a mild trend (p<0.09)

Often there’s no need to have a threshold and people would be better off giving an interval estimate including the statistical uncertainty.

The defining characteristic of the (relatively rare) situations where a threshold is needed is that you either pass the threshold or you don’t. A marked trend towards a suggestion of positive evidence is not meeting the threshold.

Weight gain lie factor

From  Malaysian newspaper The Star, via Twitter, an infographic that gets the wrong details right



The designer went to substantial effort to make the area of each figure proportional to the number displayed (it says something about modern statistical computing that the my quickest way to check this was read the image file in R, use cluster analysis to find the figures, then tabulate).

However, it’s not remotely true that typical Malaysians weigh nearly four times as much as typical Cambodians. The number is the proportion above a certain BMI threshold, and that changes quite fast as mean weight increases.  Using 1971 US figures for the variability of BMI, you’d get this sort of range of proportion overweight with a 23% range in mean weight between the highest and lowest countries.

April 11, 2014

The favourite never wins?

From Deadspin, an analysis of accuracy in 11 million tournament predictions (‘brackets’) for the US college basketball competition, and 53 predictions by experts



Stephen Pettigrew’s analysis shows the experts average more points than the general public (651681 vs 604.4). What he doesn’t point out explicitly is that picking the favourites, which corresponds to the big spike at 680 points, does rather better than the average expert.


March 31, 2014

Election poll averaging

The DimPost posted a new poll average and trend, which gives an opportunity to talk about some of the issues in interpretation (you should also listen to Sunday’s Mediawatch episode)

The basic chart looks like this


The scatter of points around the trend line shows the sampling uncertainty.  The fact that the blue dots are above the line and the black dots are below the line is important, and is one of the limitations of NZ polls.  At the last election, NZ First did better, and National did worse, than in the polling just before the election. The trend estimates basically assume that this discrepancy will keep going in the future.  The alternative, since we’ve basically got just one election to work with, is to assume it was just a one-off fluke and tells us nothing.

We can’t distinguish these options empirically just from the poll results, but we can think about various possible explanations, some of which could be disproved by additional evidence.  One possibility is that there was a spike in NZ First popularity at the expense of National right at the election, because of Winston Peters’s reaction to the teapot affair.  Another possibility is that landline telephone polls systematically undersample NZ First voters. Another is that people are less likely to tell the truth about being NZ First voters (perhaps because of media bias against Winston or something).  In the US there are so many elections and so many polls that it’s possible to estimate differences between elections and polls, separately for different polling companies, and see how fast they change over time. It’s harder here. (update: Danyl Mclauchlan points me to this useful post by Gavin White)

You can see some things about different polling companies. For example, in the graph below, the large red circles are the Herald-Digipoll results. These seem a bit more variable than the others (they do have a slightly smaller sample size) but they don’t seem biased relative to the other polls.  If you click on the image you’ll get the interactive version. This is the trend without bias correction, so the points scatter symmetrically around the trend lines but the trend misses the election result for National and NZ First.