February 14, 2016

Not 100% accurate

Q: Did you see there’s a new, 100% accurate cancer test?

A: No.

Q: It only uses a bit of saliva, and it can be done at home?

A: No.

Q: No?

A: Remember what I’ve said about ‘too good to be true’?

Q: So how accurate is it?

A: ‘It’ doesn’t really exist?

Q: But it “will enter full clinical trials with lung cancer patients later this year.”

A: That’s not a test for cancer. The phrase “lung cancer patients” is a hint.

Q: So what is it a test for?

A: It’s a test for whether a particular drug will work in a patient’s lung cancer.

Q: Oh. That’s useful, isn’t it?

A: Definitely.

Q: And that’s 100% accurate?

A: <tilts head, raises eyebrows>

Q: Too good to be true?

A: The test is very good at getting the same results that you would get from analysing a surgical specimen. Genetically it’s about 95% accurate in a small set of data reported in January. In clinical trials, 50% of people with the right tumour genetics responded to the drug. So you could say the test is 95% accurate or 50% accurate.

Q: That still sounds pretty good, doesn’t it?

A: Yes, if the trial this year gets results like the preliminary data, it would be very impressive.

Q: And he does this with just a saliva sample?

A: Yes, it turns out that a little bit of tumour DNA ends up pretty much anywhere you look, and modern genetic technology only needs a few molecules.

Q: Could this technology be used for detecting cancer, too?

A: In principle, but we’d need to know it was accurate. At the moment, according to the abstract for the talk that prompted the story, they might be able to detect 80% of oral cancers. And they don’t seem to know how often a cell with one of the mutations might turn up in someone who wouldn’t go on to get cancer. Since oral cancer is rare, the test would need to be extremely accurate and inexpensive to be worth using in healthy people.

Q: What about other more common cancers?

A: In principle, maybe, but most cancers are rare when you get down to the level of specific genetic mutations.  It’s conceivable, but it’s not happening in the two-year time frame that the story gives.


February 13, 2016

Neanderthal DNA: how could they tell?

As I said in August

“How would you even study that?” is an excellent question to ask when you see a surprising statistic in the media. Often the answer is “they didn’t,” but sometimes you get to find out about some really clever research technique.

There are stories around, such as the one in Stuff, about modern disease due to Neanderthal genes (press release).

The first-ever study directly comparing Neanderthal DNA to the human genome confirmed a wide range of health-related associations — from the psychiatric to the podiatric — that link modern humans to our broad-browed relatives.

It’s basically true, although as with most genetic studies the genetic effects are really, really small. There’s a genetic variant that doubles your risk of nicotine dependence, but only 1% of Europeans have it. The researchers estimate that Neanderthal genetic variants explain about 1% of depression and less than half a percent of cardiovascular disease. But that’s not zero, and it wasn’t so long ago that the idea of interbreeding was thought very unlikely.

Since hardly any Neanderthals have had their genome sequenced, how was this done? There are two parts to it: a big data part and a clever genetics part.

The clever genetics part (paper) uses the fact that Neanderthals and modern humans, since their ancestors had been separated for a long time (350,000 years), had lots of little, irrelevant differences in DNA accumulated as mutations, like a barcode sequence. Given a long enough snippet of genome, we can match it up either to the modern human barcode or the Neanderthal barcode. Neanderthals are recent enough (50,000 years is maybe 2500 generations) that many of the snippets of Neanderthal genome we inherit are long enough to match up the barcodes reliably. The researchers looked at genome sequences from the 1000 Genomes Project, and found genetic variants existing today that are part of genome snippets which appear Neanderthal. These genetic variants are what they looked at.

The Big Data is a collection of medical records at nine major hospitals in the US, together with DNA samples. This is nothing like a random sample, and the disease data are from ICD9 diagnostic codes rather than detailed medical record review, but quantity helps.

Using the DNA samples, they can see which people have each of the Neanderthal-looking genetic variants, and what diseases these people have — and find the very small differences.

This isn’t really medical research. The lead researcher quoted in the news is an evolutionary geneticist, and the real story is genetics: even though the Neanderthals vanished 50,000 years ago, we can still see enough of their genome to learn new things about how they were different from us.


Detecting gravitational waves

The LIGO gravitational wave detector is an immensely complex endeavour, a system capable of detecting minute gravitational waves, and of not detecting everything else.

To this end, the researchers relied on every science from astronomy to, well, perhaps not zymurgy, but at least statistics. If you want to know “did we just hear two black holes collide?” it helps to know what it will sound like when two black holes collide right at the very limit of audibility, and how likely you are to hear noises like that just from motorbikes, earthquakes, and Superbowl crowds.  That is, you want a probability model for the background noise and a probability model for the sound of colliding black holes, so you can compute the likelihood ratio between them — how much evidence is in this signal.
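As an illustration of the idea (only an illustration; LIGO’s actual matched-filter pipeline is vastly more sophisticated), here is what a likelihood ratio looks like in a toy model with independent Gaussian noise and a known signal template. All the names and numbers here are made up for the sketch:

```python
import math

def log_likelihood_ratio(data, template, sigma=1.0):
    """Log of P(data | signal + noise) / P(data | noise alone),
    assuming independent Gaussian noise with known standard deviation.
    Positive values are evidence that the signal is present."""
    ll_noise = sum(-0.5 * (y / sigma) ** 2 for y in data)
    ll_signal = sum(-0.5 * ((y - s) / sigma) ** 2
                    for y, s in zip(data, template))
    return ll_signal - ll_noise

template = [0.2, -0.5, 0.9, -0.4]  # hypothetical "what a collision sounds like" snippet
print(log_likelihood_ratio(template, template))   # data matches the template: positive
print(log_likelihood_ratio([0.0] * 4, template))  # pure silence: negative
```

The actual detection problem adds the hard parts: the noise is neither independent nor Gaussian, the template depends on unknown black-hole parameters, and the background model has to be estimated from the data itself.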

One of the originators of some of the methods used by LIGO is Renate Meyer, an Associate Professor in the Stats department. Here are her comments to the Science Media Centre, and a post on the department website.

Just one more…

NPR’s Planet Money ran an interesting podcast in mid-January of this year. I recommend you take the time to listen to it.

The show discussed the idea that there are problems in the way that we do science — in this case that our continual reliance on hypothesis testing (or statistical significance) is leading to many scientifically spurious results. As a Bayesian, I find that no surprise. One section of the show, however, piqued my pedagogical curiosity:

STEVE LINDSAY: OK. Let’s start now. We test 20 people and say, well, it’s not quite significant, but it’s looking promising. Let’s test another 12 people. And the notion was, of course, you’re just moving towards truth. You test more people. You’re moving towards truth. But in fact – and I just didn’t really understand this properly – if you do that, you increase the likelihood that you will get a, quote, “significant effect” by chance alone.

KESTENBAUM: There are lots of ways you can trick yourself like this, just subtle ways you change the rules in the middle of an experiment.

You can think about situations like this in terms of coin tossing. If we conduct a single experiment where there are only two possible outcomes, let us say “success” and “failure”, and if there is genuinely nothing affecting the outcomes, then any “success” we observe will be due to random chance alone. If we have a hypothetical fair coin — I say hypothetical because physical processes can make coin tossing anything but fair — we say the probability of a head coming up on a coin toss is equal to the probability of a tail coming up and therefore must be 1/2 = 0.5. The podcast describes the following experiment:

KESTENBAUM: In one experiment, he says, people were told to stare at this computer screen, and they were told that an image was going to appear on either the right side or the left side. And they were asked to guess which side. Like, look into the future. Which side do you think the image is going to appear on?

If we do not believe in the ability of people to predict the future, then we think the experimental subjects should have an equal chance of getting the right answer or the wrong answer.

The binomial distribution allows us to answer questions about multiple trials. For example, “If I toss the coin 10 times, then what is the probability I get heads more than seven times?”, or, “If the subject does the prognostication experiment described 50 times (and has no prognostic ability), what is the chance she gets the right answer more than 30 times?”
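Both of those questions are binomial tail probabilities, and are easy to check numerically. A minimal sketch in Python (the function name is mine):

```python
from math import comb

def binom_tail(n, k, p=0.5):
    """P(at least k successes in n independent trials, each with success prob p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# More than seven heads in 10 tosses of a fair coin:
print(round(binom_tail(10, 8), 4))   # 0.0547

# More than 30 right answers in 50 guesses, with no prognostic ability:
print(round(binom_tail(50, 31), 4))  # a bit under 6%
```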

When we teach students about the binomial distribution we tell them that the number of trials (coin tosses) must be fixed before the experiment is conducted, otherwise the theory does not apply. However, if you take the example from Steve Lindsay — “I did 20 experiments; how about I add 12 more?” — then it can be hard to see what is wrong with doing so. I think the counterintuitive nature of this relates to a general misunderstanding of conditional probability. When we encounter a problem like this, our response is “Well, I can’t see the difference between 10 out of 20 and 16 out of 32.” What we are missing here is that the results of the first 20 experiments are already known. That is, there is no longer any probability attached to the outcomes of these experiments. What we need to calculate is the probability of a certain number of successes, say x, given that we have already observed y successes.

Let us take the numbers given by Professor Lindsay of 20 experiments followed by a further 12. Further to this we are going to describe “almost significant” in 20 experiments as 12, 13, or 14 successes, and “significant” as 23 or more successes out of 32. I have chosen these numbers because (if we believe in hypothesis testing) we would observe 15 or more “heads” out of 20 tosses of a fair coin fewer than 21 times in 1,000 (on average). That is, observing 15 or more heads in 20 coin tosses is fairly unlikely if the coin is fair. Similarly, we would observe 23 or more heads out of 32 coin tosses about 10 times in 1,000 (on average).

So if we have 12 successes in the first 20 experiments, we need another 11 or 12 successes in the second set of experiments to reach or exceed our threshold of 23. This is fairly unlikely. If successes happen by random chance alone, then we will get 11 or 12 with probability 0.0032 (about 3 times in 1,000). If we have 13 successes in the first 20 experiments, then we need 10 or more successes in our second set to reach or exceed our threshold. This will happen by random chance alone with probability 0.019 (about 19 times in 1,000). Although it is an additively small difference, 0.01 vs 0.019, the probability of exceeding our threshold has almost doubled. And it gets worse. If we had 14 successes, then the probability “jumps” to 0.073 — over seven times higher. It is tempting to think that this occurs because the second set of trials is smaller than the first, but the phenomenon exists even when the second set is larger.
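The conditional probabilities above are just binomial tail probabilities for the 12 remaining experiments, since the first-stage results are already fixed. A quick check in Python, using the same cut-offs as the text:

```python
from math import comb

def binom_tail(n, k, p=0.5):
    """P(at least k successes in n trials, each with success probability p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

threshold = 23  # "significant": 23 or more successes out of 32
for first_stage in (12, 13, 14):      # the "almost significant" starting points
    needed = threshold - first_stage  # successes still required from 12 more trials
    print(first_stage, round(binom_tail(12, needed), 4))
# prints: 12 0.0032 / 13 0.0193 / 14 0.073
```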

The issue exists because the probability distribution for all of the results of the experiments considered together is not the same as the probability distribution for the results of the second set of experiments given that we know the results of the first set. You might think about this as being like a horse race where you are allowed to place your bet after the horses have reached the halfway mark — you already have some information (which might be totally spurious), but most people will bet differently, using the information they have, than they would at the start of the race.
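Putting the two stages together shows the inflation directly. A sketch in Python, using the cut-offs I assumed above (declare significance at 15+ of 20; extend only on an "almost significant" 12–14; then declare significance at 23+ of 32):

```python
from math import comb

def binom_pmf(n, k, p=0.5):
    """P(exactly k successes in n trials)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_tail(n, k, p=0.5):
    """P(at least k successes in n trials)."""
    return sum(binom_pmf(n, i, p) for i in range(k, n + 1))

# Declare success immediately at 15+ of 20...
single_stage = binom_tail(20, 15)
# ...or extend by 12 trials on an "almost significant" 12-14,
# then declare success at 23+ of 32 overall.
two_stage = single_stage + sum(
    binom_pmf(20, k) * binom_tail(12, 23 - k) for k in (12, 13, 14)
)
print(round(single_stage, 4), round(two_stage, 4))  # 0.0207 0.0252
```

Under this rule the chance of a spurious “significant” result when nothing is going on rises from about 2.1% to about 2.5% — the rules were changed in the middle of the experiment, and the nominal error rate no longer holds.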

February 12, 2016

Meet Statistics summer scholar Rickaan Muirhead

Every summer, the Department of Statistics offers scholarships to a number of students so they can work with staff on real-world projects. Rickaan, right, is working on pōhutukawa regeneration at Tiritiri Mātangi with Professor Chris Triggs. Rickaan explains:

“Tiritiri Mātangi is an offshore island in the Hauraki Gulf which, since 1984, has been undergoing ecological restoration, led by Supporters of Tiritiri Mātangi. Due to the capacity for pōhutukawa trees to support the early growth of native ecosystems, they were planted extensively across the island at the outset of the project.

“However, the pōhutukawa survival rate was much better than expected, resulting in dense pōhutukawa-dominated forests with almost no regeneration of other plant species. So pōhutukawa stands were thinned to encourage the natural diversification of the plant and animal communities beneath.

“To gauge the success of this endeavour, monitoring of plant regeneration and changes in bird and insect populations has been underway since 2010. A significant amount of data has now been collected, which I will analyse during my research to explore the regeneration of plant, animal and insect communities in these transformed pōhutukawa forests.

“The science surrounding ecological restoration is a hot topic worldwide in the face of exceptional rates of deforestation and extinction. The Tiritiri Mātangi project has captured the interest of the international conservation movement due to its innovative scientific and public-inclusive practices. This project will thus inform both local and international science surrounding restoration ecology, as well as support this valuable eco-sanctuary.

“I graduated in early 2015 with a Bachelor of Science, specialising in Quantitative Ecology and Modelling. I have just completed a Postgraduate Diploma in Science in Biosecurity and Conservation, and will be undertaking Masters study this year exploring Quantitative Ecology.

“I was initially drawn to statistics as it is very useful, and ubiquitous in life sciences. However, during my studies I’ve gained a much greater interest in its inner workings, and have found applying my knowledge exceptionally rewarding.

“In my spare time this summer, I’m hoping to get involved with some conservation projects in the community and read some novels.”


February 11, 2016

Anti-smacking law

Family First has published an analysis that they say shows the anti-smacking law has been ineffective and harmful.  I think the arguments that it has worsened child abuse are completely unconvincing, but as far as I can tell there isn’t any good evidence that it has helped.  Part of the problem is that the main data we have are reports of (suspected) abuse, and changes in the proportion of cases reported are likely to be larger than changes in the underlying problem.

We can look at two graphs from the full report. The first is notifications to Child, Youth and Family


The second is ‘substantiated abuse’ based on these notifications


For the first graph, the report says “There is no evidence that this can be attributed simply to increased reporting or public awareness.” For the second, it says “Is this welcome decrease because of an improving trend, or has CYF reached ‘saturation point’ i.e. they simply can’t cope with the increased level of notifications and the amount of work these notifications entail?”

Notifications have increased almost eight-fold since 2001. I find it hard to believe that this is completely real: that child abuse was rare before the turn of the century and became common in such a straight-line trend. Surely such a rapid breakdown in society would be affected to some extent by the unemployment of the Global Financial Crisis? Surely it would leak across into better-measured types of violent crime? Is it no longer true that a lot of abusing parents were abused themselves?

Unfortunately, it works both ways. The report is quite right to say that we can’t trust the decrease in notifications;  without supporting evidence it’s not possible to disentangle real changes in child abuse from changes in reporting.

Child homicide rates are also mentioned in the report. These have remained constant, apart from the sort of year to year variation you’d expect from numbers so small. To some extent that argues against a huge societal increase in child abuse, but it also shows the law hasn’t had an impact on the most severe cases.

Family First should be commended on the inclusion of long-range trend data in the report. Graphs like the ones I’ve copied here are the right way to present these data honestly, to allow discussion. It’s a pity that the infographics on the report site don’t follow the same pattern, but infographics tend to be like that.

The law could easily have had quite a worthwhile effect on the number and severity of cases of child abuse, or not. Conceivably, it could even have made things worse. We can’t tell from this sort of data.

Even if the law hasn’t “worked” in that sense, some of the supporters would see no reason to change their minds — in a form of argument that should be familiar to Family First, they would say that some things are just wrong and the law should say so.  On the other hand, people who supported the law because they expected a big reduction in child abuse might want to think about how we could find out whether this reduction has occurred, and what to do if it hasn’t.

February 10, 2016

Cheese addiction yet again

So, for people just joining us, there is a story making the rounds of the world media that cheese is literally addictive because the protein casein stimulates the same brain receptors as opiates like heroin.

The story references research at the University of Michigan, which doesn’t show anything remotely related to the claims (according not just to me but to the lead researcher on the study). This isn’t anything subtle; there is not one word related to the casein story in the paper. The story is made up out of nothing; it’s not an exaggeration or misunderstanding.

This time the story is in GQ magazine. It references the December version (from the Standard), but adds some of the distinctively wrong details of earlier versions (“published in the US National Library of Medicine”)

If I were a science journalist, I think I’d be interested in who was pushing this story and how they’d fooled so many people.

Summer daze

From the Herald:

“Our brains are sharper in summer than in winter, a study has found.

“The tests were done during various weeks over the course of a year. The subjects performed better during the summer.”

It takes some effort to find which study, because neither the journal nor any of the researchers are named.  It turns out to be this research in Proceedings of the National Academy of Sciences. The press release says

Performance on both tasks remained constant, but the brain resources used to complete each task changed with the seasons.

If you look in more detail (behind a paywall, sadly), the researchers found the same performance on the tests but with higher brain activity in summer than in winter for one task, and higher activity in autumn than in spring for the other.  I’d tend to interpret that as our brains being in lower gear in summer — higher revs for the same actual speed — but the research paper didn’t come down one way or the other on this issue.

February 8, 2016

Nice cup of tea

Q: So, did you see that tea prevents hip fractures now?

A: The Herald story? Yes.

Q: Is this mice?

A: No, people.

Q: But it’s not experimental, it’s just correlation?

A: Yes, but better than usual.

Q: Why?

A: The two big problems with diet studies are that measurements are horribly inaccurate and that people who have healthy diets are also weird in lots of other ways which it’s hard to disentangle.

Q: So tea is different because it’s easy to measure?

A: They divided women into “3+ cups per day”, “Basically never”, and a middle group.  It’s not hard to distinguish people who drink tea as their main liquid intake from people who hardly ever drink it.

Q: And the healthy weirdo effect?

A: Tea drinking isn’t really thought of as a health thing, and (among 80 year old Australian women) it’s probably not a social class marker either.

Q: You usually complain about extrapolation as well. Did they really measure bone density?

A: Better than that, they really measured hip fractures.

Q: So we can believe it?

A: Perhaps?

Q: What’s your issue with this one, then?

A: The evidence isn’t overwhelmingly strong, and we wouldn’t have heard about this study through the media if it wasn’t positive. And there have been a lot of other studies with mixed results.

Q: As good as this one?

A: No, but still.

Q: Would it have helped if they’d measured bone density?

A: They did, and they found it didn’t explain all the correlation with tea drinking, which is a bit surprising.

Q: Don’t you always tell people those mediation analyses are unreliable?

A: I suppose. But bone mineral density is measured pretty accurately, so it should work better than usual.

Q: Why don’t you want tea to be beneficial?

A: I do want tea to be beneficial; I drink a lot of it. But almost nothing prevents hip fractures, and this is a big difference.

Q: How big?

A: More than 40% reduction for hip fractures, which is in the ballpark of what the potent bisphosphonate drugs manage.

Q: So should we drink tea?

A: It’s unlikely to be harmful, and it might help with fractures. If you hate the taste, though, this probably isn’t strong enough evidence to force yourself to drink it.

Q: How about eating green tea KitKat?

A: I’m fairly sure consumption of that was low in this cohort of elderly Australian women.

February 7, 2016

Why do we care?

From the history of the Manchester Statistical Society

Manchester Statistical Society was a pioneering organisation: It was the first organisation in Britain to study social problems systematically and to collect statistics for social purposes. In 1834 it was the first organisation to carry out a house-to-house social survey.

The Society was formed in September 1833 at a time of severe social problems. Few of the founders were statisticians in the modern, technical sense. But, they were interested in improving the state of the people and believed that establishing the facts regarding social problems was a necessary first step. 

From an earlier organisation in London (via)

‘The privation and misery endured by the productive classes of society in Great Britain in 1816 and 1817, led to the formation of an Association in London, for the purpose of investigating the nature and extent of that misery; and of ascertaining, if possible, how far it resulted from avoidable or from unavoidable causes; and how far repetitions of similar ills were likely or not to occur’.

Groups of wealthy men saying they want to improve society isn’t new.  Nor is it new that they don’t know enough to do much good.  What was different in the early nineteenth century was that they recognised they didn’t know enough. The Statistical Societies were founded to provide information about social problems that went beyond any individual’s range of anecdotes, because the truth mattered.

The range of statistics has broadened immensely since then, especially with the help of computers. At the foundation is still the principle that one person’s reckons aren’t enough: the world is more complicated than that and the truth matters.

I’m not arguing that statistics has to be Important and Serious. If you want to know whether The Rock plays the same music at the same time each day or who is likely to win the rugby, statistics can help there. If you care enough about how other people eat their cereal, that’s a valid topic for investigation. The bottom line is that you do actually care about the answer; that the truth matters.

For a depressingly large fraction of surveys in the news today, no-one really cares whether the answer is accurate or even what the question means. Maybe it’s ok to have sections of the newspaper where facts aren’t really relevant — you need to ask a journalist, not me. But when the truth doesn’t matter, stop pretending to use statistics.