Posts filed under Random variation (124)

April 27, 2016

Not just an illusion

There’s a headline in the Independent: “If you think more celebrities are dying young this year, you’re wrong – it’s just a trick of the mind”. And, in a sense, Ben Chu is right. In a much more important sense, he’s wrong.

He argues that there are more celebrities at risk now, which there are. He says a lot of these celebrities are older than we realise, which they are. He says that the number of celebrity deaths this year is within the scope of random variation looking at recent times, which may well be the case. But I don’t think that’s the question.

Usually, I’m taking the other side of this point. When there’s an especially good or especially bad weekend for road crashes, I say that it’s likely just random variation, and not evidence for speeding tolerances or unsafe tourists or breath alcohol levels. That’s because usually the question is whether the underlying process is changing: are the roads getting safer or more dangerous.

This time there isn’t really a serious question of whether karma, global warming, or spiders from Mars are killing off celebrities.  We know it must be a combination of understandable trends and bad luck that’s responsible.  But there really have been more celebrities dying this year.   Prince is really dead. Bowie is really dead. Victoria Wood, Patty Duke, Ronnie Corbett, Alan Rickman, Harper Lee — 2016 has actually happened this way,  it hasn’t been (to steal a line from Daniel Davies) just a particularly inaccurate observation of the underlying population and mortality patterns.

April 17, 2016

Evil within?

The headline: Sex and violence ‘normal’ for boys who kill women in video games: study. That’s a pretty strong statement, and the claim quotes imply we’re going to find out who made it. We don’t.

The (much-weaker) take-home message:

The researchers’ conclusion: Sexist games may shrink boys’ empathy for female victims.

The detail:

The researchers then showed each student a photo of a bruised girl who, they said, had been beaten by a boy. They asked: On a scale of one to seven, how much sympathy do you have for her?

The male students who had just played Grand Theft Auto – and also related to the protagonist – felt least bad for her, with an empathy mean score of 3. Those who had played the other games, however, exhibited more compassion. And female students who played the same rounds of Grand Theft Auto had a mean empathy score of 5.3.

The important part is between the dashes: male students who related more to the protagonist in Grand Theft Auto had less empathy for a female victim.  There’s no evidence given that this was a result of playing Grand Theft Auto, since the researchers (obviously) didn’t ask about how people who didn’t play that game related to its protagonist.

What I wanted to know was how the empathy scores compared by which game the students played, separately by gender. The research paper didn’t report the analysis I wanted, but thanks to the wonders of Open Science, their data are available.

If you just compare which game the students were assigned to (and their gender), here are the means; the intervals are set up so there’s a statistically significant difference between two groups when their intervals don’t overlap.

gtamean

The difference between different games is too small to pick out reliably at this sample size, but is less than half a point on the scale — and while the ‘violent/sexist’ games might reduce empathy, there’s just as much evidence (ie, not very much) that the ‘violent’ ones increase it.
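One common way to build intervals with the "no overlap means significant" property is to shrink the usual 95% half-width by a factor of √2. The sketch below assumes two independent groups with the same standard error and a normal approximation; the means and standard error are made-up illustrative values, not the study's data, and this is not necessarily how the intervals in the plot were computed.

```python
from math import sqrt

Z = 1.96  # two-sided 5% critical value, normal approximation

def comparison_interval(mean, se):
    """Interval with half-width Z*se/sqrt(2).

    With equal standard errors in both groups, two such intervals
    fail to overlap exactly when the difference in means is
    significant at the 5% level.
    """
    half = Z * se / sqrt(2)
    return (mean - half, mean + half)

def overlap(a, b):
    """True if intervals a and b share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

def significant(m1, m2, se):
    """Ordinary z-test for a difference of two independent means."""
    return abs(m1 - m2) > Z * sqrt(2) * se

# Hypothetical group means on the 1-7 empathy scale, hypothetical se
se = 0.3
far_apart = (5.0, 4.0)
close = (5.0, 4.5)

far_overlap = overlap(comparison_interval(far_apart[0], se),
                      comparison_interval(far_apart[1], se))
close_overlap = overlap(comparison_interval(close[0], se),
                        comparison_interval(close[1], se))
print(far_overlap, significant(*far_apart, se))
print(close_overlap, significant(*close, se))
```

With unequal standard errors the equivalence is only approximate, which is one reason to treat such intervals as a visual guide rather than a formal test.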

Here’s the complete data, because means can be misleading

gtaswarm

The data are consistent with a small overall impact of the game, or no real impact. They’re consistent with a moderately large impact on a subset of susceptible men, but equally consistent with some men just being horrible people.

If this is an issue you’ve considered in the past, this study shouldn’t be enough to alter your views much, and if it isn’t an issue you’ve considered in the past, it wouldn’t be the place to start.

March 11, 2016

Getting to see opinion poll uncertainty

Rock’n Poll has a lovely guide to sampling uncertainty in election polls, guiding you step by step to see how approximate the results would be in the best of all possible worlds. Highly recommended.

Of course, we’re not in the best of all possible worlds, and in addition to pure sampling uncertainty we have ‘house effects’ due to different methodology between polling firms and ‘design effects’ due to the way the surveys compensate for non-response.  And on top of that there are problems with the hypothetical question ‘if an election were held tomorrow’, and probably issues with people not wanting to be honest.

Even so, the basic sampling uncertainty gives a good guide to the error in opinion polls, and anything that makes it easier to understand is worth having.
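For a rough feel for the numbers, here is a minimal sketch of the textbook margin of error for a simple random sample, with an optional design-effect multiplier. The sample size of 1000 and the design effect of 1.5 are illustrative assumptions, not figures from any particular poll.

```python
from math import sqrt

def margin_of_error(p, n, design_effect=1.0, z=1.96):
    """Approximate 95% margin of error for a proportion p from n respondents.

    design_effect inflates the variance to allow for weighting;
    1.0 means a simple random sample.
    """
    return z * sqrt(design_effect * p * (1 - p) / n)

moe_max = margin_of_error(0.5, 1000)     # party near 50%
moe_small = margin_of_error(0.05, 1000)  # party near 5%
moe_deff = margin_of_error(0.5, 1000, design_effect=1.5)

print(f"near 50%: +/- {100 * moe_max:.1f} points")
print(f"near 5%:  +/- {100 * moe_small:.1f} points")
print(f"near 50%, design effect 1.5: +/- {100 * moe_deff:.1f} points")
```

The familiar "plus or minus 3%" is the first of these; it is a maximum, and it does not include house effects or the other problems mentioned above.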

poll-land

(via Harkanwal Singh)

March 7, 2016

Crime reports in NZ

The Herald Insights section has a multi-day exploration of police burglary reports, starting with a map at the Census meshblock level.

burglary

When you have counts of things on a map there’s always an issue of denominators and areas.  There’s the “one cow, one vote” phenomenon where rural areas dominate the map, and also the question of whether to show the raw count, the fraction of the population, or something else.  Burglaries are especially tricky in this context, because the crime location need not be a household, and the perpetrator need not live nearby, so the meshblock population really isn’t the right denominator.  The Herald hasn’t standardised, which I think is a reasonable default.

It’s also an opportunity to link again to Graeme Edgeler’s discussions of  why ‘burglary’ is a wider category than most people realise.

September 29, 2015

When variation is the story

A familiar trope of alternative cancer therapy is the patients who were given just months to live and are still alive years later.  The implication is that their survival is surprising and the cancer therapy was responsible.  Falling foul of the post hoc ergo propter hoc fallacy isn’t the big problem here. The big problem is that it’s not surprising that some people live a lot longer than the median.

Our intuition for variation is developed on measurements that aren’t like cancer survival. Most adults are pretty close to the average height — very few are more than a foot taller or shorter. Most people in Western countries die at close to the average age: for example, by the NZ life tables, the median life expectancy for NZ men born now is 81 years, and the tables predict half will die between 73 and 88 years.

For many types of cancer, survival isn’t like that. Here’s a graph from a big Canadian study of breast cancer

F1.large

The median survival for women with stage IV cancer in this group is about a year; half of them are still alive after a year. About half of those are still alive after two years; about half of those are still alive after three years, and some live much longer.
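The "half, then half of those" pattern is what you get from a roughly constant risk of death over time. As a sketch, assuming an exactly constant hazard and a one-year median (a simplification of the curve in the graph, not a fit to it):

```python
def fraction_alive(years, median_years=1.0):
    """Fraction surviving under a constant hazard with the given median."""
    return 0.5 ** (years / median_years)

survival = {t: fraction_alive(t) for t in range(1, 6)}
for t, s in survival.items():
    # Expected survivors out of a hypothetical 1000 people
    print(f"{t} years: {s:.1%} (about {round(1000 * s)} per 1000)")
```

Under this simplification about one person in eight is alive at three years and about one in thirty at five years, without any treatment effect being involved.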

Variation in cancer survival — the long tail — should be welcome, but not surprising. Some people will be alive three, four, five or more years after ‘being given a year to live’.  We should be just as cautious about crediting them with finding a cure as we should be about blaming those who died sooner.

 

August 28, 2015

Trying again

CNbxnQDWgAAXlKL

This graph is from the Open Science Framework attempt to replicate 100 interesting results in experimental psychology, led by Brian Nosek and published in Science today.

About a third of the experiments got statistically significant results in the same direction as the originals.  Averaging all the experiments together,  the effect size was only half that seen originally, but the graph suggests another way to look at it.  It seems that about half the replications got basically the same result as the original, up to random variation, and about half the replications found nothing.

Ed Yong has a very good article about the project in The Atlantic. He says it’s worse than psychologists expected (but at least now they know).  It’s actually better than I would have expected — I would have guessed that the replicated effects would average quite a bit smaller than the originals.

The same thing is going to be true for a lot of small-scale experiments in other fields.

July 24, 2015

Are beneficiaries increasingly failing drug tests?

Stuff’s headline is “Beneficiaries increasingly failing drug tests, numbers show”.

The numbers are rates per week of people failing or refusing drug tests. The number was 1.8/week for the first 12 weeks of the policy and 2.6/week for the whole year 2014, and, yes, 2.6 is bigger than 1.8.  However, we don’t know how many tests were performed or demanded, so we don’t know how much of this might be an increase in testing.

In addition, if we don’t worry about the rate of testing and take the numbers at face value, the difference is well within what you’d expect from random variation, so while the numbers are higher it would be unwise to draw any policy conclusions from the difference.
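Here is a sketch of one way to check that, treating failures as Poisson counts. It assumes the first 12 weeks saw about 22 failures (1.8 per week times 12, rounded), uses the 134 failures over twelve months quoted by Stuff, and assumes the two periods don't overlap. If the weekly rate were constant, then given the total, the early-period count would be binomial with probability 12/(12+52).

```python
from math import comb

def binom_two_sided_p(k, n, p):
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more
    likely than the observed count k.
    """
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    return sum(q for q in pmf if q <= pmf[k] * (1 + 1e-9))

early_weeks, early_count = 12, 22   # assumed: 1.8/week * 12 weeks, rounded
late_weeks, late_count = 52, 134    # failures over twelve months, as quoted

total = early_count + late_count
share = early_weeks / (early_weeks + late_weeks)
expected_early = total * share
p_value = binom_two_sided_p(early_count, total, share)

print(f"expected early count under a constant rate: {expected_early:.1f}")
print(f"two-sided p-value: {p_value:.2f}")
```

A p-value of this size is nowhere near conventional thresholds, so the counts alone don't distinguish an increase from a constant rate.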

On the other hand, the absolute numbers of failures are very low when compared to the estimates in the Treasury’s Regulatory Impact Statement.

MSD and MoH have estimated that once this policy is fully implemented, it may result in:

• 2,900 – 5,800 beneficiaries being sanctioned for a first failure over a 12 month period

• 1,000 – 1,900 beneficiaries being sanctioned for a second failure over a 12 month period

• 500 – 1,100 beneficiaries being sanctioned for a third failure over a 12 month period.

The numbers quoted by Stuff are 60 sanctions in total over eighteen months, and 134 test failures over twelve months.  The Minister is quoted as saying the low numbers show the program is working, but as she could have said the same thing about numbers that looked like the predictions, or numbers that were higher than the predictions, it’s also possible that being off by an order of magnitude or two is a sign of a problem.

 

June 11, 2015

Comparing all the treatments

This story didn’t get into the local media, but I’m writing about it because it illustrates the benefit of new statistical methods, something that’s often not visible to outsiders.

From a University of Otago press release about the work of A/Prof Suetonia Palmer

The University of Otago, Christchurch researcher together with a global team used innovative statistical analysis to compare hundreds of research studies on the effectiveness of blood-pressure-lowering drugs for patients with kidney disease and diabetes. The result: a one-stop-shop, evidence-based guide on which drugs are safe and effective.

They link to the research paper, which has interesting looking graphics like this:

netmeta

The red circles represent blood-pressure-lowering treatments that have been tested in patients with kidney disease and diabetes, with the lines indicating which comparisons have been done in randomised trials. The circle size shows how many trials have used a drug; the line width shows how many trials have compared a given pair of drugs.

If you want to compare, say, endothelin inhibitors with ACE inhibitors, there aren’t any direct trials. However, there are two trials comparing endothelin inhibitors to placebo, and ten trials comparing placebo to ACE inhibitors. If we estimate the advantage of endothelin inhibitors over placebo and subtract off the advantage of ACE inhibitors over placebo we will get an estimate of the advantage of endothelin inhibitors over ACE inhibitors.

More generally, if you want to compare any two treatments A and B, you look at all the paths in the network between A and B, add up differences along the path to get an estimate of the difference between A and B, then take a suitable weighted average of the estimates along different paths. This statistical technique is called ‘network meta-analysis’.
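A minimal sketch of the arithmetic, on a log odds ratio scale. All the numbers are invented for illustration, and the weighting shown is plain inverse-variance weighting, which is only an assumption here: it treats the path estimates as independent, and real analyses have to deal with paths that share trials.

```python
from math import sqrt

def indirect(d_a_vs_p, se_a, d_b_vs_p, se_b):
    """Estimate A vs B from A vs placebo and B vs placebo.

    The estimates subtract; the variances add.
    """
    return d_a_vs_p - d_b_vs_p, sqrt(se_a**2 + se_b**2)

def pool(estimates):
    """Inverse-variance weighted average of (estimate, se) pairs."""
    weights = [1 / se**2 for _, se in estimates]
    est = sum(w * d for w, (d, _) in zip(weights, estimates)) / sum(weights)
    return est, sqrt(1 / sum(weights))

# Hypothetical log odds ratios versus placebo (negative = fewer deaths)
endothelin_vs_placebo = (-0.30, 0.40)  # few trials, wide uncertainty
ace_vs_placebo = (-0.20, 0.10)         # many trials, narrow uncertainty

d, se = indirect(*endothelin_vs_placebo, *ace_vs_placebo)
print(f"endothelin vs ACE, via placebo: {d:.2f} (se {se:.2f})")

# Combining two hypothetical path estimates of the same comparison
pooled, pooled_se = pool([(-0.10, 0.20), (0.20, 0.40)])
print(f"pooled over two paths: {pooled:.2f} (se {pooled_se:.2f})")
```

Note how the indirect estimate is less precise than either of its ingredients: the uncertainty of the poorly-studied comparison dominates.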

Two important technical questions remain: what is a suitable weighted average, and how can you tell if these different estimates are consistent with each other? The first question is relatively straightforward (though quite technical). The second question was initially the hard one.  It could be for example, that the trials involving placebo had very different participants from the others, or that old trials had very different participants from recent trials, and their conclusions just could not be usefully combined.

The basic insight for examining consistency is that the same follow-the-path approach could be used to compare a treatment to itself. If you compare placebo to ACE inhibitors, ACE inhibitors to ARB, and ARB to placebo, there’s a path (a loop) that gives an estimate of how much better placebo is than placebo. We know the true difference is zero; we can see how large the estimated difference is.
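As a sketch of the loop check, with made-up log odds ratios for the three comparisons and assuming the three sets of trials are independent: add the differences around the loop and compare the total to its standard error.

```python
from math import sqrt

def loop_inconsistency(edges):
    """Sum (difference, se) pairs around a closed loop.

    Each difference must be oriented along the loop, so a perfectly
    consistent network sums to zero. Returns (total, se, z).
    """
    total = sum(d for d, _ in edges)
    se = sqrt(sum(s**2 for _, s in edges))
    return total, se, total / se

# Hypothetical estimates, oriented placebo -> ACE -> ARB -> placebo
loop = [
    (-0.20, 0.10),  # ACE inhibitors minus placebo
    (0.05, 0.15),   # ARB minus ACE inhibitors
    (0.10, 0.12),   # placebo minus ARB
]

total, se, z = loop_inconsistency(loop)
print(f"placebo vs placebo: {total:.2f} (se {se:.2f}, z = {z:.2f})")
```

With these invented numbers the loop total is well within its uncertainty, which is what "not much evidence of inconsistency" looks like for a single loop.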

In this analysis, there wasn’t much evidence of inconsistency, and the researchers combined all the trials to get results like this:

netmeta-ci

The ‘forest plot’ shows how each treatment compares to placebo (vertical line) in terms of preventing death. We can’t be absolutely sure that any of them are better, but it definitely looks as though ACE inhibitors plus calcium-channel blockers or ARBs, and ARBs alone, are better. It could be that aldosterone inhibitors are much better, but also could be that they are worse. This sort of summary is useful as an input to clinical decisions, and also in deciding what research should be prioritised in the future.

I said the analysis illustrated progress in statistical methods. Network meta-analysis isn’t completely new, and its first use was also in studying blood pressure drugs, but in healthy people rather than people with kidney disease. Here are those results

netmeta-me

There are different patterns for which drug is best across the different events being studied (heart attack, stroke, death), and the overall patterns are different from those in kidney disease/diabetes. The basic analysis is similar; the improvements since this 2003 paper are more systematic and flexible ways of examining inconsistency, and new displays of the network of treatments.

‘Innovative statistical techniques’ are important, but the key to getting good results here is a mind-boggling amount of actual work. As Dr Palmer put it in a blog interview

Our techniques are still very labour intensive. A new medical question we’re working on involves 20-30 people on an international team, scanning 5000-6000 individual reports of medical trials, finding all the relevant papers, and entering data for about 100-600 reports by hand. We need to build an international partnership to make these kind of studies easier, cheaper, more efficient, and more relevant.

At this point, I should confess the self-promotion aspect of the post.  I invented the term “network meta-analysis” and the idea of using loops in the network to assess inconsistency.  Since then, there have been developments in statistical theory, especially by Guobing Lu and A E Ades in Bristol, who had already been working on other aspects of multiple-treatment analysis. There have also been improvements in usability and standardisation, thanks to Georgia Salanti and others in the Cochrane Collaboration ‘Comparing Multiple Interventions Methods Group’.  In fact, network meta-analysis has grown up and left home to the extent that the original papers often don’t get referenced. And I’m fine with that. It’s how progress works.

 

June 8, 2015

Meddling kids confirm mānuka honey isn’t panacea

The Sunday Star-Times has a story about a small, short-term, unpublished randomised trial of mānuka honey for preventing minor illness. There are two reasons this is potentially worth writing about: it was done by primary school kids, and it appears to be the largest controlled trial in humans for prevention of illness.

Here are the results (which I found from the Twitter account of the school’s lab, run by Carole Kenrick, who is named in the story):

CGuGbSiWoAACzbe

The kids didn’t find any benefit of mānuka honey over either ordinary honey or no honey. Realistically, that just means they managed to design and carry out the study well enough to avoid major biases. The reason there aren’t any controlled prevention trials in humans is that there’s no plausible mechanism for mānuka honey to help with anything except wound healing. To its credit, the SST story quotes a mānuka producer saying exactly this:

But Bray advises consumers to “follow the science”.

“The only science that’s viable for mānuka honey is for topical applications – yet it’s all sold and promoted for ingestion.”

You might, at a stretch, say mānuka honey could affect bacteria in the gut, but that’s actually been tested, and any effects are pretty small. Even in wound healing, it’s quite likely that any benefit is due to the honey content rather than the magic of mānuka — and the trials don’t typically have a normal-honey control.

As a primary-school science project, this is very well done. The most obvious procedural weakness is that mānuka honey’s distinctive flavour might well break their attempts to blind the treatment groups. It’s also a bit small, but we need to look more closely to see how that matters.

When you don’t find a difference between groups, it’s crucial to have some idea of what effect sizes have been ruled out. We don’t have the data, but measuring off the graphs and multiplying by 10 weeks and 10 kids per group, the number of person-days of unwellness looks to be in the high 80s. If the reported unwellness is similar for different kids, so that the 700 days for each treatment behave like 700 independent observations, a 95% confidence interval would be 0±2%. At the other extreme, if one kid had 70 days unwell, a second kid had 19, and the other eight had none, the confidence interval would be 0±4.5%.

In other words, the study data are still consistent with mānuka honey preventing about one day a month of feeling “slightly or very unwell”, in a population of Islington primary-school science nerds. At three 5g servings per day that would be about 500g honey for each extra day of slightly improved health, at a cost of $70-$100, so the study basically rules out mānuka honey being cost-effective for preventing minor unwellness in this population. The study is too small to look at benefits or risks for moderate to serious illness, which remain as plausible as they were before. That is, not very.

Fortunately for the mānuka honey export industry, their primary market isn’t people who care about empirical evidence.

June 7, 2015

What does 80% accurate mean?

From Stuff (from the Telegraph)

And the scientists claim they do not even need to carry out a physical examination to predict the risk accurately. Instead, people are questioned about their walking speed, financial situation, previous illnesses, marital status and whether they have had previous illnesses.

Participants can calculate their five-year mortality risk as well as their “Ubble age” – the age at which the average mortality risk in the population is most similar to the estimated risk. Ubble stands for “UK Longevity Explorer” and researchers say the test is 80 per cent accurate.

There are two obvious questions based on this quote: what does it mean for the test to be 80 per cent accurate, and how does “Ubble” stand for “UK Longevity Explorer”? The second question is easier: the data underlying the predictions are from the UK Biobank, so presumably “Ubble” comes from “UK Biobank Longevity Explorer.”

An obvious first guess at the accuracy question would be that the test is 80% right in predicting whether or not you will survive 5 years. That doesn’t fly. First, the test gives a percentage, not a yes/no answer. Second, you can do a lot better than 80% in predicting whether someone will survive 5 years or not just by guessing “yes” for everyone.

The 80% figure doesn’t refer to accuracy in predicting death, it refers to discrimination: the ability to get higher predicted risks for people at higher actual risk. Specifically, it claims that if you pick pairs of  UK residents aged 40-70, one of whom dies in the next five years and the other doesn’t, the one who dies will have a higher predicted risk in 80% of pairs.
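This pairwise definition is easy to compute directly. The sketch below uses a tiny set of made-up predicted risks (not Ubble output) and counts tied pairs as half, which is a common convention but an assumption here.

```python
from itertools import product

def concordance(risks_died, risks_survived):
    """Fraction of (died, survived) pairs where the person who died
    had the higher predicted risk; ties count as half."""
    pairs = list(product(risks_died, risks_survived))
    score = sum(1.0 if d > s else 0.5 if d == s else 0.0 for d, s in pairs)
    return score / len(pairs)

# Hypothetical predicted five-year risks
died = [0.20, 0.08, 0.03]
survived = [0.01, 0.02, 0.05, 0.10, 0.02]

c = concordance(died, survived)
print(f"concordance: {c:.0%}")
```

Here 12 of the 15 pairs are ordered correctly, giving 80%. A model that assigned risks at random would score about 50%, so that, not 0%, is the baseline to compare against.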

So, how does it manage this level of accuracy, and why do simple questions like self-rated health, self-reported walking speed, and car ownership show up instead of weight or cholesterol or blood pressure? Part of the answer is that Ubble is looking only at five-year risk, and only in people under 70. If you’re under 70 and going to die within five years, you’re probably sick already. Asking you about your health or your walking speed turns out to be a good way of finding if you’re sick.

This table from the research paper behind the Ubble shows how well different sorts of information predict.

si2

Age on its own gets you 67% accuracy, and age plus asking about diagnosed serious health conditions (the Charlson score) gets you to 75%. The prediction model does a bit better, presumably because it’s better at picking up a chance of undiagnosed disease. The usual things doctors nag you about, apart from smoking, aren’t in there because they usually take longer than five years to kill you.

As an illustration of the importance of age and basic health in the prediction, if you put in data for a 60-year old man living with a partner/wife/husband, who smokes but is healthy apart from high blood pressure, the predicted percentage for dying is 4.1%.

The result comes with this well-designed graphic using counts out of 100 rather than fractions, and illustrating the randomness inherent in the prediction by scattering the four little red people across the panel.

ubble

Back to newspaper issues: the Herald also ran a Telegraph story (a rather worse one), but followed it up with a good repost from The Conversation by two of the researchers. None of these stories mentioned that the predictions will be less accurate for New Zealand users. That’s partly because the predictive model is calibrated to life expectancy, general health positivity/negativity, walking speeds, car ownership, and diagnostic patterns in Brits. It’s also because there are three questions on UK government disability support, which in our case we have not got.