Posts filed under Significance (16)

May 28, 2015

Junk food science

In an interesting sting on the world of science journalism, John Bohannon and two colleagues, plus a German medical doctor, ran a small randomised experiment on the effects of chocolate consumption, and found better weight loss in those given chocolate. The experiment was real and the measurements were real, but the medical journal was the sort that published their paper two weeks after submission, with no changes.

Here’s a dirty little science secret: If you measure a large number of things about a small number of people, you are almost guaranteed to get a “statistically significant” result. Our study included 18 different measurements—weight, cholesterol, sodium, blood protein levels, sleep quality, well-being, etc.—from 15 people. (One subject was dropped.) That study design is a recipe for false positives.

Think of the measurements as lottery tickets. Each one has a small chance of paying off in the form of a “significant” result that we can spin a story around and sell to the media. The more tickets you buy, the more likely you are to win. We didn’t know exactly what would pan out—the headline could have been that chocolate improves sleep or lowers blood pressure—but we knew our chances of getting at least one “statistically significant” result were pretty good.

Bohannon and his conspirators were doing this deliberately, but lots of people do it accidentally. Their study was (deliberately) crappier than average, but since the journalists didn’t ask, that didn’t matter. You should go read the whole thing.
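The lottery-ticket arithmetic is easy to check. A minimal sketch (assuming 18 independent tests at the usual 0.05 threshold; real measurements on the same people are correlated, which changes the exact number but not the conclusion):

```python
# Chance of at least one "statistically significant" result among m
# independent tests at level alpha, when no real effects exist.
def false_positive_chance(m, alpha=0.05):
    return 1 - (1 - alpha) ** m

print(round(false_positive_chance(18), 2))  # → 0.6
```

So even with nothing going on at all, an 18-measurement study has better-than-even odds of producing a headline.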

Finally, two answers for obvious concerns: first, the participants were told the research was for a documentary on dieting, not that it was in any sense real scientific research. Second: no, neither Stuff nor the Herald fell for it.

[Update: Although there was participant consent, there wasn’t ethics committee review. An ethics committee probably wouldn’t have allowed it. Hilda Bastian on Twitter]

July 24, 2014

Weak evidence but a good story

An example from Stuff, this time

Sah and her colleagues found that this internal clock also affects our ability to behave ethically at different times of day. To make a long research paper short, when we’re tired we tend to fudge things and cut corners.

Sah measured this by finding out the chronotypes of 140 people via a standard self-assessment questionnaire, and then asking them to complete a task in which they rolled dice to win raffle tickets – higher rolls, more tickets.

Participants were randomly assigned to either early morning or late evening sessions. Crucially, the participants self-reported their dice rolls.

You’d expect the dice rolls to average out to around 3.5. So the extent to which a group’s average exceeds this number is a measure of their collective result-fudging.

“Morning people tended to report higher die-roll numbers in the evening than the morning, but evening people tended to report higher numbers in the morning than the evening,” Sah and her co-authors wrote.

The research paper is here.  The Washington Post, where the story was taken from, has a graph of the results, and they match the story. Note that this is one of the very few cases where starting a bar chart at zero is a bad idea. It’s hard to roll zero on a standard die.



The research paper also has a graph of the results, which makes the effect look bigger, but in this case that’s defensible, as 3.5 really is “zero” for the purposes of the effect they are studying.



Unfortunately, neither graph has any indication of uncertainty. The evidence of an effect is not negligible, but it is fairly weak (p-value of 0.04 from 142 people). It’s easy to imagine someone might do an experiment like this and not publish it if they didn’t see the effect they expected, and it’s pretty certain that you wouldn’t be reading about the results if they didn’t see the effect they expected, so it makes sense to be a bit skeptical.
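To get a feel for how far chance alone can push a group’s average die roll, here’s a simulation sketch (the arm size of 35 is a guess at roughly 140-odd people split across the four chronotype-by-session groups, and one roll per person is also an assumption):

```python
import random

def honest_group_mean(n_people, rng):
    """Average reported roll if n_people each roll one fair die honestly."""
    return sum(rng.randint(1, 6) for _ in range(n_people)) / n_people

rng = random.Random(1)
# How far from the true mean of 3.5 do honest group averages wander?
means = [honest_group_mean(35, rng) for _ in range(10000)]
spread = (sum((m - 3.5) ** 2 for m in means) / len(means)) ** 0.5
print(round(spread, 2))  # theory: sd of one roll is ~1.71, so ~1.71/sqrt(35) ≈ 0.29
```

Group averages of 3.7 or 3.8 are well within reach of completely honest dice, which is why the missing uncertainty bands matter.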

The story goes on to say

These findings have pretty big implications for the workplace. For one, they suggest that the one-size-fits-all 9-to-5 schedule is practically an invitation to ethical lapses.

Even assuming that the effect is real and that lying about a die roll in a psychological experiment translates into unethical behaviour in real life, the findings don’t say much about the ‘9-to-5’ schedule. For a start, none of the testing was conducted between 9am and 5pm.


April 4, 2014

Thomas Lumley’s latest Listener column

…”One of the problems in developing drugs is detecting serious side effects. People who need medication tend to be unwell, so it’s hard to find a reliable comparison. That’s why the roughly threefold increase in heart-attack risk among Vioxx users took so long to be detected …”

Read his column, Faulty Powers, here.

November 27, 2013

Interpretive tips for understanding science

From David Spiegelhalter, William Sutherland, and Mark Burgman, twenty (mostly statistical) tips for interpreting scientific findings

To this end, we suggest 20 concepts that should be part of the education of civil servants, politicians, policy advisers and journalists — and anyone else who may have to interact with science or scientists. Politicians with a healthy scepticism of scientific advocates might simply prefer to arm themselves with this critical set of knowledge.

A few of the tips, without their detailed explication:

  • Differences and chance cause variation
  • No measurement is exact
  • Bigger is usually better for sample size
  • Controls are important
  • Beware the base-rate fallacy
  • Feelings influence risk perception
November 19, 2013

Tune in, turn on, drop out?

From online site Games and Learning

A massive study of some 11,000 youngsters in Britain has found that playing video games, even as early as five years old, does not lead to later behavior problems.

This is real research, looking at changes over time in a large number of children and it does find that the associations between ‘screen time’ and later behaviour problems are weak. On the other hand, the research paper concludes

 Watching TV for 3 h or more at 5 years predicted a 0.13 point increase (95% CI 0.03 to 0.24) in conduct problems by 7 years, compared with watching for under an hour, but playing electronic games was not associated with conduct problems.

When you see “was not associated”, you need to look carefully: are they claiming evidence of absence, or just weakness of evidence? Here are the estimates in graphical form, comparing changes in a 10-point questionnaire about conduct.



The data largely rule out average differences as big as half a point, so this study does provide evidence there isn’t a big impact (in the UK). However, it’s pretty clear from the graph that the data don’t provide any real support for a difference between TV and videogames.  The estimates for TV are more precise, and for that reason the TV estimate is ‘statistically significant’ and the videogames one isn’t, but that’s not evidence of difference.
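The point that the difference between “significant” and “not significant” is not itself significant can be made concrete. A sketch using the reported TV estimate of 0.13 (0.03 to 0.24) and an invented games estimate of similar size but with a wider interval (the games numbers are hypothetical, purely for illustration):

```python
import math

def se_from_ci(lo, hi, z=1.96):
    """Back out a standard error from a 95% confidence interval."""
    return (hi - lo) / (2 * z)

# TV: 0.13 (0.03 to 0.24), as reported in the paper.
tv_est, tv_se = 0.13, se_from_ci(0.03, 0.24)
# Games: hypothetical 0.10 with a wider interval, say (-0.07 to 0.27).
games_est, games_se = 0.10, se_from_ci(-0.07, 0.27)

# z-statistics for each estimate, and for the difference between them.
z_tv = tv_est / tv_se              # > 1.96: "statistically significant"
z_games = games_est / games_se     # < 1.96: "not significant"
z_diff = (tv_est - games_est) / math.sqrt(tv_se**2 + games_se**2)
print(round(z_tv, 1), round(z_games, 1), round(z_diff, 1))  # → 2.4 1.2 0.3
```

One estimate clears the threshold and the other doesn’t, yet the test of the difference between them is nowhere near significant: the two estimates are entirely compatible with each other.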

It’s also interesting that there’s mild support in the data for ‘None’ being worse than a small amount. Here the precision is higher for the videogame estimate, because there are very few children who watch no TV (<2%).

June 27, 2013

Making sense of uncertainty

Sense about Science (a British charity whose name, unusually, is actually accurate) have just launched a publication “Making Sense of Uncertainty”, following their previous guides for the public and journalists that cover screening, medical tests, chemical stories, statistics, and radiation.

Researchers in climate science, disease modelling, epidemiology, weather forecasting and natural hazard prediction say that we should be relieved when scientists describe the uncertainties in their work. It doesn’t necessarily mean that we cannot make decisions – we might well have ‘operational knowledge’ – but it does mean that there is greater confidence about what is known and unknown.
Launching a guide to Making Sense of Uncertainty at the World Conference of Science Journalists today, researchers working in some of the most significant, cutting edge fields say that if policy makers and the public are discouraged by the existence of uncertainty, we miss out on important discussions about the development of new drugs, taking action to mitigate the impact of natural hazards, how to respond to the changing climate and to pandemic threats.
Interrogated with the question ‘But are you certain?’, they say, they have ended up sounding defensive or as though their results are not meaningful. Instead we need to embrace uncertainty, especially when trying to understand more about complex systems, and ask about operational knowledge: ‘What do we need to know to make a decision? And do we know it?’ 

Guide to reporting clinical trials

From the World Conference of Science Journalists, via @roobina (Ruth Francis), ten tweets on reporting clinical trials

  1. Was this #trial registered before it began? If not then check for rigged design, or hidden negative results on similar trials.
  2. Is primary outcome reported in paper the same as primary outcome spec in protocol? If no report maybe deeply flawed.
  3. Look for other trials by co or group, or on treatment, on registries to see if it represents cherry picked finding
  4. ALWAYS mention who funded the trial. Do any of ethics committee people have some interest with the funding company
  5. Will country where work is done benefit? Will drug be available at lower cost? Is disorder or disease a problem there
  6. How many patients were on the trial, and how many were in each arm?
  7. What was being compared (drug vs placebo? Drug vs standard care? Drug with no control arm?)
  8. Be precise about people/patient who benefited – advanced disease, a particular form of a disease?
  9. Report natural frequencies: “13 people per 1000 experienced x”, rather than “1.3% of people experienced x”
  10. NO relative risks. Paint findings clearly: improved survival by 3%: BAD. Ppl lived 2 months longer on average: GOOD

Who says you can’t say anything useful in 140 characters?
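Tweet 9 is simple arithmetic, but worth sketching (the relative-risk numbers at the end are invented for illustration):

```python
def natural_frequency(percent, per=1000):
    """Convert a percentage to 'x people per 1,000' (or another base)."""
    return percent / 100 * per

# "1.3% of people experienced x" becomes:
print(f"{natural_frequency(1.3):.0f} people per 1000 experienced x")
# → 13 people per 1000 experienced x

# Tweet 10's point, stated absolutely: a hypothetical relative risk of 1.5
# on a hypothetical 1.3% baseline risk is
print(f"{natural_frequency(1.3 * 1.5):.0f} vs {natural_frequency(1.3):.0f} per 1000")
# → 20 vs 13 per 1000
```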

June 20, 2013

Does success in education rely on having certain genes?

If you have read media stories recently that say ‘yes’, you’d better read this article from the Genetic Literacy Project …

May 7, 2013

Modestly significant

From a comment piece in Stuff, by Bruce Robertson (of Hospitality NZ)

In the past five years, the level of hazardous drinking has significantly decreased for men (from 30 per cent to 26 per cent) and marginally decreased for women (13 per cent to 12 per cent).

There was a modest but important drop in the rates of hazardous drinking among Maori adults, with the rate falling from 33 per cent to 29 per cent in the latest survey.

As @tui_talk pointed out on Twitter, that’s a four percentage point decrease described as “significant” for men and “modest” for Maori.

At first I thought this might be a confusion of “statistically significant” with “significant”, with the decrease in men being statistically significant but the difference in Maori not, but in fact the MoH report being referenced says (p4)

As a percentage of all Māori adults, hazardous drinking patterns significantly decreased from 2006/07 (33%) to 2011/12 (29%). 
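Whether a four-percentage-point drop is statistically significant depends mostly on the sample sizes, which the column doesn’t give. A sketch of the standard two-proportion z-test, with invented sample sizes purely to show that the same drop can go either way:

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """z-statistic for the difference between two independent proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical sample sizes, for illustration only:
print(round(two_proportion_z(0.30, 2000, 0.26, 2000), 1))  # → 2.8 (significant)
print(round(two_proportion_z(0.33, 500, 0.29, 500), 1))    # → 1.4 (not significant)
```

The same four-point drop clears the 1.96 threshold with 2,000 respondents per wave but not with 500, which is why “significant” vs “modest” is a statement about sample size as much as about the change itself.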



April 11, 2013

Power failure threatens neuroscience

A new research paper with the cheeky title “Power failure: why small sample size undermines the reliability of neuroscience” has come out in a neuroscience journal. The basic idea isn’t novel, but it’s one of these statistical points that makes your life more difficult (if more productive) when you understand it.  Small research studies, as everyone knows, are less likely to detect differences between groups.  What is less widely appreciated is that even if a small study sees a difference between groups, it’s more likely not to be real.

The ‘power’ of a statistical test is the probability that you will detect a difference if there really is a difference of the size you are looking for.  If the power is 90%, say, then you are pretty sure to see a difference if there is one, and based on standard statistical techniques, pretty sure not to see a difference if there isn’t one. Either way, the results are informative.

Often you can’t afford to do a study with 90% power given the current funding system. If you do a study with low power, and the difference you are looking for really is there, you still have to be pretty lucky to see it — the data have to, by chance, be more favorable to your hypothesis than they should be. But if you’re relying on the data being more favorable to your hypothesis than they should be, you can see a difference even if there isn’t one there.

Combine this with publication bias: if you find what you are looking for, you get enthusiastic and send it off to high-impact research journals.  If you don’t see anything, you won’t be as enthusiastic, and the results might well not be published.  After all, who is going to want to look at a study that couldn’t have found anything, and didn’t?  The result is that we get lots of exciting neuroscience news, often with very pretty pictures, that isn’t true.
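The combined effect of low power and publication bias is easy to demonstrate by simulation. A sketch (two groups of 10, a true difference of 0.5 standard deviations, and a crude |z| > 1.96 cutoff standing in for a proper t-test):

```python
import random, math, statistics

def simulate(n_per_group, true_diff, n_studies, rng):
    """Run many small two-group studies; return the effect estimates
    from those that reached 'significance'."""
    significant = []
    for _ in range(n_studies):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(true_diff, 1) for _ in range(n_per_group)]
        diff = statistics.mean(b) - statistics.mean(a)
        se = math.sqrt(statistics.variance(a) / n_per_group +
                       statistics.variance(b) / n_per_group)
        if abs(diff / se) > 1.96:
            significant.append(diff)
    return significant

rng = random.Random(42)
sig = simulate(n_per_group=10, true_diff=0.5, n_studies=5000, rng=rng)
print(f"power ~ {len(sig) / 5000:.0%}")  # only a minority of studies 'find' the effect
print(f"mean significant estimate ~ {statistics.mean(sig):.2f}")  # well above the true 0.5
```

Power comes out around 20%, and the studies that do cross the threshold overestimate the true effect of 0.5 by roughly a factor of two: exactly the exaggeration the paper warns about.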

The same is true for nutrition: I have a student doing an Honours project looking at replicability (in a large survey database) of the sort of nutrition and health stories that make it to the local papers. So far, as you’d expect, the associations are a lot weaker when you look in a separate data set.

Clinical trials went through this problem a while ago, and while they often have lower power than one would ideally like, there’s at least no way you’re going to run a clinical trial in the modern world without explicitly working out the power.

Other people’s reactions