Posts filed under Random variation (95)

July 27, 2014

Air flight crash risk

David Spiegelhalter, Professor of the Public Understanding of Risk at Cambridge University, has looked at the chance of getting three fatal plane crashes in the same 8-day period, based on the average rate of fatal crashes over the past ten years.  He finds that if you look at all 8-day periods in ten years, three crashes is actually the most likely way for the worst week to turn out.

He does this with maths. It’s easier to do it by computer simulation: arrange the 91 crashes randomly among the 3650 days and count up the worst week. When I do this 10,000 times (which takes seconds), I get
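A minimal version of that simulation (sketched here in Python; it assumes, as the post does, that crashes fall independently and uniformly over the 3650 days):

```python
import random
from collections import Counter

def worst_window(n_crashes=91, n_days=3650, window=8):
    """Scatter crashes uniformly over the days and return the largest
    number falling in any stretch of `window` consecutive days."""
    days = sorted(random.randrange(n_days) for _ in range(n_crashes))
    best, lo = 1, 0
    for hi in range(len(days)):
        # shrink from the left until all crashes fit in one 8-day span
        while days[hi] - days[lo] >= window:
            lo += 1
        best = max(best, hi - lo + 1)
    return best

random.seed(1)
tally = Counter(worst_window() for _ in range(10000))
# three crashes comes out as the most common worst 8-day period
```

The `tally` is the simulated distribution of the worst week: its mode is 3, matching Spiegelhalter's calculation.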



The recent crashes were separate tragedies with independent causes — two different types of accident and one deliberate shooting — they aren’t related in the way that, say, the fires in the first Boeing Dreamliners were. There’s no reason the recent events should make you more worried about flying.

July 24, 2014

Weak evidence but a good story

An example from Stuff, this time

Sah and her colleagues found that this internal clock also affects our ability to behave ethically at different times of day. To make a long research paper short, when we’re tired we tend to fudge things and cut corners.

Sah measured this by finding out the chronotypes of 140 people via a standard self-assessment questionnaire, and then asking them to complete a task in which they rolled dice to win raffle tickets – higher rolls, more tickets.

Participants were randomly assigned to either early morning or late evening sessions. Crucially, the participants self-reported their dice rolls.

You’d expect the dice rolls to average out to around 3.5. So the extent to which a group’s average exceeds this number is a measure of their collective result-fudging.

“Morning people tended to report higher die-roll numbers in the evening than the morning, but evening people tended to report higher numbers in the morning than the evening,” Sah and her co-authors wrote.

The research paper is here.  The Washington Post, where the story was taken from, has a graph of the results, and they match the story. Note that this is one of the very few cases where starting a bar chart at zero is a bad idea. It’s hard to roll zero on a standard die.



The research paper also has a graph of the results, which makes the effect look bigger, but in this case that is defensible, as 3.5 really is “zero” for the purposes of the effect they are studying.



Unfortunately, neither graph has any indication of uncertainty. The evidence of an effect is not negligible, but it is fairly weak (a p-value of 0.04 from 142 people). It’s easy to imagine someone might do an experiment like this and not publish it if they didn’t see the effect they expected, and it’s pretty certain that you wouldn’t be reading about the results if they hadn’t, so it makes sense to be a bit skeptical.
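To see how much an honest group’s average can wander from 3.5 just by chance, here is a quick sketch (the session size of 70 is an assumption for illustration, splitting the 140 participants into two groups of one roll each):

```python
import random
import statistics

random.seed(1)

# averages of honestly reported single rolls from many groups of 70
group_means = [statistics.mean(random.randint(1, 6) for _ in range(70))
               for _ in range(5000)]

centre = statistics.mean(group_means)   # close to 3.5
spread = statistics.stdev(group_means)  # roughly 0.2
```

An honest group of this size can easily average 3.3 or 3.7, which is why graphs of this kind of result need some indication of uncertainty.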

The story goes on to say

These findings have pretty big implications for the workplace. For one, they suggest that the one-size-fits-all 9-to-5 schedule is practically an invitation to ethical lapses.

Even assuming that the effect is real and that lying about a die roll in a psychological experiment translates into unethical behaviour in real life, the findings don’t say much about the ‘9-to-5’ schedule. For a start, none of the testing was conducted between 9am and 5pm.


July 2, 2014

What’s the actual margin of error?

The official maximum margin of error for an election poll with a simple random sample of 1000 people is 3.099%. Real life is more complicated.

In reality, not everyone is willing to talk to the nice researchers, so they either have to keep going until they get a representative-looking number of people in each group they are interested in, or take what they can get and reweight the data — if young people are under-represented, give each one more weight. Also, they can only get a simple random sample of telephones, so there are more complications in handling varying household sizes. And even once they have 1000 people, some of them will say “Dunno” or “The Conservatives? That’s the one with that nice Mr Key, isn’t it?”
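The reweighting step can be sketched like this (the age groups and shares are made up for illustration):

```python
# hypothetical population and sample age-group shares
population_share = {"18-29": 0.20, "30-59": 0.50, "60+": 0.30}
sample_share = {"18-29": 0.10, "30-59": 0.55, "60+": 0.35}

# each respondent's weight rescales their group to its population share
weights = {group: population_share[group] / sample_share[group]
           for group in population_share}
# under-represented young respondents each count double here
```

Applying these weights makes the weighted sample shares match the population exactly, at the cost of extra variability from the respondents who count more than once.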

After all this has shaken out it’s amazing the polls do as well as they do, and it would be unrealistic to hope that the pure mathematical elegance of the maximum margin of error held up exactly.  Survey statisticians use the term “design effect” to describe how inefficient a sampling method is compared to ideal simple random sampling. If you have a design effect of 2, your sample of 1000 people is as good as an ideal simple random sample of 500 people.
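The two quantities in play can be written down directly; a small sketch:

```python
from math import sqrt

def max_margin_of_error(n, z=1.96):
    """Maximum margin of error (percentage points) for a simple random
    sample of n; the worst case is at 50% support."""
    return 100 * z * sqrt(0.25 / n)

def effective_sample_size(n, design_effect):
    """A design effect of 2 halves the information in the sample."""
    return n / design_effect

official = max_margin_of_error(1000)          # the official 3.099
as_good_as = effective_sample_size(1000, 2)   # 500.0
```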

We’d like to know the design effect for individual election polls, but it’s hard. There isn’t any mathematical formula for design effects under quota sampling, and while there is a mathematical estimate for design effects after reweighting it isn’t actually all that accurate.  What we can do, thanks to Peter Green’s averaging code, is estimate the average design effect across multiple polls, by seeing how much the poll results really vary around the smooth trend. [Update: this is Wikipedia's graph, but I used Peter's code]


I did this for National because it’s easiest, and because their margin of error should be close to the maximum margin of error (since their vote is fairly close to 50%). The standard deviation of the residuals from the smooth trend curve is 2.1%, compared to 1.6% for a simple random sample of 1000 people. That would be a design effect of (2.1/1.6)², or 1.8.  Based on the Fairfax/Ipsos numbers, about half of that could be due to dropping the undecided voters.

In principle, I could have overestimated the design effect this way because sharp changes in party preference would look like unusually large random errors. That’s not a big issue here: if you re-estimate using a standard deviation estimator that’s resistant to big errors (the median absolute deviation) you get a slightly larger design effect estimate.  There may be sharp changes, but there aren’t all that many of them, so they don’t have a big impact.

If the perfect mathematical maximum-margin-of-error is about 3.1%, the added real-world variability turns that into about 4.2%, which isn’t that bad. This doesn’t take bias into account — if something strange is happening with undecided voters, the impact could be a lot bigger than sampling error.
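Checking that arithmetic: the ideal margin of error scales up by the square root of the design effect.

```python
from math import sqrt

ideal_moe = 3.099      # percentage points, simple random sample of 1000
design_effect = 1.8    # estimated from the residuals above

real_world_moe = ideal_moe * sqrt(design_effect)   # about 4.2
```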


June 4, 2014

How much disagreement should there be?

The Herald

Thousands of school students are being awarded the wrong NCEA grades, a review of last year’s results has revealed.

Nearly one in four grades given by teachers for internally marked work were deemed incorrect after checking by New Zealand Qualifications Authority moderators.

That’s not actually true, because moderators don’t deem grades to be incorrect. That’s not what moderators are for.  What the report says (pp105-107 in case you want to scroll through it) is that in 24% of cases the moderator and the internal assessor disagreed on grade, and in 12% they disagreed on whether the standard had been achieved.

What we don’t know is how much disagreement is appropriate. The only way the moderator’s assessment could be considered error-free is if you define the ‘right answer’ to be ‘whatever the moderator says’, which is obviously not appropriate. There always will be some variation between moderators, and some variation between schools, and what we want to know is whether there is too much.

The report is a bit disappointing from that point of view.  At the very least, there should have been some duplicate moderation. That is, some pieces of work should have been sent to two different moderators, so we could have an idea of the between-moderator agreement rate. Then, if we were willing to assume that moderators collectively were infallible (though not individually), we could estimate how much less reliable the internal assessments were.
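Under the (strong) assumption of independent errors, duplicate moderation would let you back out both error rates. A sketch: the 12% assessor–moderator disagreement is from the report, but the 8% moderator–moderator disagreement rate is invented for illustration, since that figure wasn't collected.

```python
from math import sqrt

def error_rates(d_mm, d_am):
    """Given the disagreement rate between two moderators (d_mm) and
    between assessor and moderator (d_am), assuming independent errors:
      d_mm = 2*e_m*(1 - e_m)
      d_am = e_a + e_m - 2*e_a*e_m
    solve for the moderator (e_m) and assessor (e_a) error rates."""
    e_m = (1 - sqrt(1 - 2 * d_mm)) / 2
    e_a = (d_am - e_m) / (1 - 2 * e_m)
    return e_m, e_a

e_m, e_a = error_rates(0.08, 0.12)
# with these (partly made-up) inputs the assessors would be roughly
# twice as error-prone as the moderators
```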

Even better would be to get some information on how much variation there is between schools in the disagreement: if there is very little variation, the schools may be doing about as well as is possible, but if there is a lot of variation between schools it would suggest some schools aren’t assessing very reliably.


May 28, 2014

‘Balanced’ Lotto reporting

From ChCh Press

Are you feeling lucky?

The number drawn most often in Saturday night’s Lotto is one.

The second is seven, the third is lucky 13, followed by 21, 38 and 12.

And if you are selecting a Powerball for Saturday’s draw, the record suggests two is a much better pick than seven.

The numbers are from Lotto Draw Frequency data provided by Lotto NZ for the 1406 Lottery family draws held to last Wednesday.

The Big Wednesday data shows the luckiest numbers are 30, 12, 20, 31, 28 and 16. And heads is drawn more often (232) than tails (216), based on 448 draws to last week.

In theory, selecting the numbers drawn most often would result in more prizes and avoiding the numbers drawn least would result in fewer losses. The record speaks for itself.

Of course this is utter bollocks. The record is entirely consistent with the draw being completely unpredictable, as you would also expect it to be if you’ve ever watched a Lotto draw on television and seen how they work.
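A quick simulation shows how uneven the counts look even when the draw is perfectly fair (NZ Lotto draws six numbers from 40; 1406 draws as in the story):

```python
import random
from collections import Counter

random.seed(1)
counts = Counter()
for _ in range(1406):
    counts.update(random.sample(range(1, 41), 6))

ranked = counts.most_common()
most, least = ranked[0][1], ranked[-1][1]
# every number is expected about 211 times, yet the "luckiest" and
# "unluckiest" numbers still differ by dozens of draws, purely by chance
```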

This story is better than the ones we used to see, because it does go on and quote people who know what they are talking about, who point out that predicting this way isn’t going to work, and then goes on to say that many people must understand this because they do just take random picks.  On the other hand, that’s the sort of journalistic balance that gets caricatured as “Opinions differ on shape of Earth.”

In world historical terms it doesn’t really matter how these lottery stories are written, but they are missing a relatively simple opportunity to demonstrate that a paper understands the difference between fact and fancy and thinks it matters.

May 23, 2014

Is Roy Morgan weird?

There seems to be a view that the Roy Morgan political opinion poll is more variable than the others, even to the extent that newspapers are willing to say so, eg, Stuff on May 7

The National Party has taken a big hit in the latest Roy Morgan poll, shedding 6 points to 42.5 per cent in the volatile survey.

I was asked about this on Twitter this morning, so I went to get Peter Green’s data and aggregation model to see what it showed. In fact, there’s not much difference between the major polling companies in the variability of their estimates. Here, for example, are poll-to-poll changes in the support for National in successive polls for four companies



And here are their departures from the aggregated smooth trend



There really is not much to see here. So why do people feel that Roy Morgan comes out with strange results more often? Probably because Roy Morgan comes out with results more often.

For example, the proportion of poll-to-poll changes over 3 percentage points is 0.22 for One News/Colmar Brunton, 0.18 for Roy Morgan, and 0.23 for 3 News/Reid Research, all about the same, but the number of changes over 3 percentage points in this time frame is 5 for One News/Colmar Brunton, 14 for Roy Morgan, and 5 for 3 News/Reid Research.
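The rates-versus-counts point in numbers (figures from the paragraph above):

```python
# similar rates of 3-point jumps...
share_over_3pt = {"Colmar Brunton": 0.22, "Roy Morgan": 0.18,
                  "Reid Research": 0.23}
count_over_3pt = {"Colmar Brunton": 5, "Roy Morgan": 14,
                  "Reid Research": 5}

# ...imply very different numbers of poll-to-poll changes examined
implied_total = {name: round(count_over_3pt[name] / share_over_3pt[name])
                 for name in share_over_3pt}
# Roy Morgan's larger count of big jumps comes from polling far more often
```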

There are more strange results from Roy Morgan than for the others, but it’s mostly for the same reason that there are more burglaries in Auckland than in the other New Zealand cities.

May 9, 2014

Terrible, horrible, no good, very bad month

From Stuff

The road toll has moved into triple figures for 2014 following the deadliest April in four years.

Police are alarmed by the rising number of deaths, that are a setback after the progress in 2013 when 254 people died in crashes in the whole year – the lowest annual total since 1950.

So far this year 102 people have died on the roads, 15 more than at the same point in 2013, Assistant Commissioner Road Policing Dave Cliff said today.

The problem with this sort of story is that it omits the role of random variation — bad luck.  The Police are well aware that driving mistakes usually do not lead to crashes, and that the ones which do are substantially a matter of luck, because that’s key to their distracted-driver campaign. As I wrote recently, their figures on the risks from distracted driving are taken from a large US study which grouped together a small number of actual crashes with a lot of incidents of risky driving that had no real consequence.

The importance of bad luck in turning bad driving into disaster means that the road toll will vary a lot. The margin of error around a count of 102 is about +/- 20, so it’s not clear we’re seeing more than misfortune in the change.  This is especially true because last year was the best on record, ever. We almost certainly had good luck last year, so the fact that it’s wearing off a bit doesn’t mean there has been a real change in driver behaviour.
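That margin comes from treating the toll as a count of rare, roughly independent events (a Poisson count), whose standard deviation is the square root of the count:

```python
from math import sqrt

toll = 102
margin = 2 * sqrt(toll)   # about +/- 20

# "15 more than at the same point in 2013", from the story
last_year = 87
difference = toll - last_year   # 15: well inside the margin
```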

It was a terrible, horrible, no good, very bad month on the roads, but some months are like that. Even in New Zealand.

May 8, 2014

Think I’ll go eat worms

This table is from a University of California alumni magazine



Jeff Leek argues at Simply Statistics that the big problem with Big Data is they, too, forgot statistics.

May 5, 2014

Verging on a borderline trend

From Matthew Hankins, via a Cochrane Collaboration blog post, the first few items on an alphabetical list of ways to describe failure to meet a statistical significance threshold

a barely detectable statistically significant difference (p=0.073)
a borderline significant trend (p=0.09)
a certain trend toward significance (p=0.08)
a clear tendency to significance (p=0.052)
a clear trend (p<0.09)
a clear, strong trend (p=0.09)
a considerable trend toward significance (p=0.069)
a decreasing trend (p=0.09)
a definite trend (p=0.08)
a distinct trend toward significance (p=0.07)
a favorable trend (p=0.09)
a favourable statistical trend (p=0.09)
a little significant (p<0.1)
a margin at the edge of significance (p=0.0608)
a marginal trend (p=0.09)
a marginal trend toward significance (p=0.052)
a marked trend (p=0.07)
a mild trend (p<0.09)

Often there’s no need to have a threshold and people would be better off giving an interval estimate including the statistical uncertainty.

The defining characteristic of the (relatively rare) situations where a threshold is needed is that you either pass the threshold or you don’t. A marked trend towards a suggestion of positive evidence is not meeting the threshold.
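Reporting an interval instead of a threshold verdict is straightforward; a minimal sketch for a normal-approximation confidence interval:

```python
from statistics import NormalDist

def confidence_interval(estimate, std_error, level=0.95):
    """Normal-approximation interval: estimate +/- z * SE."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * std_error, estimate + z * std_error

lo, hi = confidence_interval(1.0, 0.5)
# roughly (0.02, 1.98): an estimate this uncertain is better conveyed
# by the interval than by any trend-toward-significance wording
```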

Weight gain lie factor

From  Malaysian newspaper The Star, via Twitter, an infographic that gets the wrong details right



The designer went to substantial effort to make the area of each figure proportional to the number displayed (it says something about modern statistical computing that my quickest way to check this was to read the image file in R, use cluster analysis to find the figures, then tabulate).

However, it’s not remotely true that typical Malaysians weigh nearly four times as much as typical Cambodians. The number is the proportion above a certain BMI threshold, and that changes quite fast as mean weight increases.  Using 1971 US figures for the variability of BMI, you’d get this sort of range of proportion overweight with a 23% range in mean weight between the highest and lowest countries.
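The point can be illustrated with a normal approximation (the BMI standard deviation of 4.6 and the two mean values here are assumptions for illustration, not the actual 1971 US figures):

```python
from statistics import NormalDist

def pct_over_threshold(mean_bmi, sd=4.6, threshold=25):
    """Percentage of a normal BMI distribution above the threshold."""
    return 100 * (1 - NormalDist(mean_bmi, sd).cdf(threshold))

lighter = pct_over_threshold(21)   # mean BMI of 21
heavier = pct_over_threshold(26)   # mean about 24% higher
# the proportion over the threshold roughly triples: a modest shift
# in the mean moves the tail proportion a lot
```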