Posts filed under Correlation vs Causation (67)

August 20, 2014

Good neighbours make good fences

Two examples of neighbourly correlations, at least one of which is not causation:

1. A (good) Herald story today, about research in Michigan that found people who got on well with their neighbours were less likely to have heart attacks.

2. An old Ministry of Justice report showing people who told their neighbours whenever they went away were much less likely to get burgled.

The burglary story is the one we know is mostly not causal.  People who told their neighbours whenever they went on holiday were about half as likely to have experienced a burglary, but only about one burglary in seven happened while the residents were on holiday. There must be something else about types of neighbourhoods or relationships with neighbours that explains most of the correlation.
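The arithmetic behind that claim can be sketched in a few lines of Python. This is only a rough bound under a deliberately generous assumption of my own, labelled in the comments: that telling the neighbours prevents every holiday burglary and changes nothing else. The names are mine, not the Ministry of Justice report's.

```python
# Figures quoted in the post (both approximate).
HOLIDAY_SHARE = 1 / 7   # about one burglary in seven happens during a holiday
OBSERVED_RATIO = 0.5    # neighbour-tellers had about half the burglary risk


def best_case_risk_ratio(holiday_share: float) -> float:
    """Risk ratio if telling neighbours stopped ALL holiday burglaries.

    Assumption (not from the report): the habit has no effect on
    burglaries that happen while the residents are at home.
    """
    return 1.0 - holiday_share


best_case = best_case_risk_ratio(HOLIDAY_SHARE)
unexplained_gap = best_case - OBSERVED_RATIO

print(f"best-case causal risk ratio: {best_case:.2f}")
print(f"observed risk ratio:         {OBSERVED_RATIO:.2f}")
print(f"gap left to explain:         {unexplained_gap:.2f}")
```

Even on the most generous reading, the habit could only cut the risk by about a seventh, so most of the observed halving has to come from somewhere else.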

I’m pretty confident the heart-disease story works the same way.  The researchers had some possible explanations:

The mechanism behind the association was not known, but the team said neighbourly cohesion could encourage physical activities such as walking, which counter artery clogging and disease.

That could be true, but is it really more likely that talking to your neighbours makes you walk around the neighbourhood or work in the garden, or that walking around the neighbourhood and working in the garden leads to talking to your neighbours? On top of that, the correlation with neighbourly cohesion was rather stronger than the correlation previously observed with walking.

July 22, 2014

Lack of correlation does not imply causation

From the Herald

Labour’s support among men has fallen to just 23.9 per cent in the latest Herald-DigiPoll survey and leader David Cunliffe concedes it may have something to do with his “sorry for being a man” speech to a domestic violence symposium.

Presumably Mr Cunliffe did indeed concede it might have something to do with his statement; and there’s no way to actually rule that out as a contributing factor. However:

Broken down into gender support, women’s support for Labour fell from 33.4 per cent last month to 29.1 per cent; and men’s support fell from 27.6 per cent last month to 23.9 per cent.

That is, women’s support for Labour fell by 4.3 percentage points (give or take about 4.2) and men’s by 3.7 percentage points (give or take about 4.2). This can’t really be considered evidence for a gender-specific Labour backlash. Correlations need not be causal, but here there isn’t even a correlation.
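A rough check of that claim, as a sketch. It takes the quoted poll figures and the "give or take about 4.2" margin at face value, and assumes (my assumption, not the Herald's) that the errors in the two changes are independent, so their margins combine in quadrature.

```python
import math

# Labour support by gender, from the Herald quote (per cent).
WOMEN_BEFORE, WOMEN_AFTER = 33.4, 29.1
MEN_BEFORE, MEN_AFTER = 27.6, 23.9
MARGIN_PER_CHANGE = 4.2  # rough margin of error for each change, per the post


def combined_margin(margin_a: float, margin_b: float) -> float:
    """Margin for a difference of two estimates, ASSUMING independent errors."""
    return math.sqrt(margin_a ** 2 + margin_b ** 2)


women_drop = WOMEN_BEFORE - WOMEN_AFTER
men_drop = MEN_BEFORE - MEN_AFTER
gender_gap = men_drop - women_drop  # positive would mean men's support fell further
gap_margin = combined_margin(MARGIN_PER_CHANGE, MARGIN_PER_CHANGE)

print(f"women's drop: {women_drop:.1f} points")
print(f"men's drop:   {men_drop:.1f} points")
print(f"gap (men minus women): {gender_gap:.1f}, give or take about {gap_margin:.1f}")
```

The gap between the two drops is a fraction of a percentage point, well inside a margin of nearly six points, and if anything men's support fell slightly less than women's.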

June 23, 2014


My attention was drawn on Twitter to this post at The Political Scientist arguing that the election poll reporting is misleading because they don’t report the results for the relatively popular “Undecided” party.  The post is making a good point, but there are two things I want to comment on. Actually, three things. The zeroth thing is that the post contains the numbers, but only as screenshots, not as anything useful.

The first point is that the post uses correlation coefficients to do everything, and these really aren’t fit for purpose. The value of correlation coefficients is that they summarise the (linear part of the) relationship between two variables in a way that doesn’t involve the units of measurement or the direction of effect (if any). Those are bugs, not features, in this analysis. The question is how the other party preferences have changed with changes in the ‘Undecided’ preference — how many extra respondents picked Labour, say, for each extra respondent who gave a preference. That sort of question is answered  (to a straight-line approximation) by regression coefficients, not correlation coefficients.

When I do a set of linear regressions, I estimate that changes in the Undecided vote over the past couple of years have split approximately  70:20:3.5:6.5 between Labour:National:Greens:NZFirst.  That confirms the general conclusion in the post: most of the change in Undecided seems to have come from  Labour. You can do the regressions the other way around and ask where (net) voters leaving Labour have gone, and find that they overwhelmingly seem to have gone to Undecided.
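To see why regression coefficients answer this question and correlation coefficients do not, here is a minimal sketch. The numbers are entirely made up for illustration (they are not the poll data, and "party A" and "party B" are hypothetical): one party loses 0.7 points for each extra Undecided point and the other loses 0.2, plus a little noise.

```python
from statistics import mean

# MADE-UP illustrative series (per cent of respondents), NOT the poll data.
undecided = [8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]
noise = [0.1, -0.2, 0.15, 0.0, -0.1, 0.2, -0.15, 0.05, -0.05]
# Hypothetical party A loses 0.7 points per extra Undecided point, party B 0.2.
party_a = [35.0 - 0.7 * (u - 8.0) + e for u, e in zip(undecided, noise)]
party_b = [45.0 - 0.2 * (u - 8.0) + e for u, e in zip(undecided, reversed(noise))]


def slope(x, y):
    """Least-squares slope of y on x: change in y per unit change in x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def correlation(x, y):
    """Pearson correlation: unit-free and symmetric in x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


for name, series in [("party A", party_a), ("party B", party_b)]:
    print(f"{name}: slope {slope(undecided, series):.2f}, "
          f"correlation {correlation(undecided, series):.2f}")
```

Both correlations come out strongly negative and close to each other, which says only that both relationships are nearly straight lines. The slopes are what recover the very different shares of the change going to each party.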

What can we conclude from this? The conclusion is pretty limited because of the small number of polls (9) and the fact that we don’t actually have data on switching for any individuals. You could fit the data just as well by saying that Labour voters have switched to National and National voters have switched to Undecided by the same amount — this produces the same counts, but has different political implications. Since the trends have basically been a straight line over this period it’s fairly easy to get alternative explanations — if there had been more polls and more up-and-down variation the alternative explanations would be more strained.

The other limitation in conclusions is illustrated by the conclusion of the post

There’s a very clear story in these two correlations: Put simply, as the decided vote goes up so does the reported percentage vote for the Labour Party.

Conversely, as the decided vote goes up, the reported percentage vote for the National party tends to go down.

The closer the election draws the more likely it is that people will make a decision.

But then there’s one more step – getting people to put that decision into action and actually vote.

We simply don’t have data on what happens when the decided vote goes up — it has been going down over this period — so that can’t be the story. Even if we did have data on the decided vote going up, and even if we stipulated that people are more likely to come to a decision near the election, we still wouldn’t have a clear story. If it’s true that people tend to come to a decision near the election, this means the reason for changes in the undecided vote will be different near an election than far from an election. If the reasons for the changes are different, we can’t have much faith that the relationships between the changes will stay the same.

The data provide weak evidence that Labour has lost support to ‘Undecided’ rather than to National over the past couple of years, which should be encouraging to them. In the current form, the data don’t really provide any evidence for extrapolation to the election.


[here’s the re-typed count of preferences data, rounded to the nearest integer]

June 3, 2014

Are girl hurricanes less scary?

There’s a new paper out in the journal PNAS claiming that hurricanes with female names cause three times as many deaths as those with male names (because people don’t give girl hurricanes the proper respect). Ed Yong does a good job of explaining why this is probably bogus, but no-one seems to have drawn any graphs, which I think make the situation a lot clearer.

January 14, 2014

Causation, counterfactuals, and Lotto

A story in the Herald illustrates a subtle technical and philosophical point about causation. One of Saturday’s Lotto winners says

“I realised I was starving, so stopped to grab a bacon and egg sandwich.

“When I saw they had a Lotto kiosk, I decided to buy our Lotto tickets while I was there.

“We usually buy our tickets at the supermarket, so I’m glad I followed my gut on this one,” said one of the couple, who wish to remain anonymous.

Assuming it was a random pick, it’s almost certainly true that if they had not bought the ticket at that Lotto kiosk at that time, they would not have won.  On the other hand, if Lotto is honest, buying at that kiosk wasn’t a good strategy — it had no impact on the chance of winning.

There is a sense in which buying the bacon-and-egg sandwich was a cause of the win, but it’s not a very useful sense of the word ‘cause’ for most statistical purposes.

November 7, 2013

Why you should eat in crowded food halls

There are a couple of posts being promoted on the internet about an important and relatively subtle form of selection bias.  Epidemiologists know it as Berkson’s Paradox; in modern causal inference terminology it’s ‘conditioning on colliders’; and for an economist it’s a consequence of the production-possibility frontier.

The basic issue is very simple. As Gabriel Rossman puts it at The Atlantic

There is no ontological reason why we can’t have shoes that are both hideous and uncomfortable but rather there is a practical reason in that nobody wears shoes that are terrible in every way and so such shoes don’t make it onto the market.

In the same way, there’s no necessary reason why cricketers who are good at bowling have to be bad at batting.  Being able to deliver the ball so it misleads or outpaces the batsman doesn’t make it any harder to spot bowling trickery or to react fast. And in fact, if you look at 12-year-olds, often the same kids are good at batting and bowling.  In international-level cricket, though, all-rounders are pretty rare, and someone who can take 5 wickets in a Test innings is very unlikely to be able to score a Test century.  The slight positive correlation you see in kids turns into a strong negative correlation in adults. The reason is that getting into an international cricket team requires you to be very, very good at batting or very, very good at bowling. Since it’s more likely that you’re very, very good at one thing than two, most international cricketers are either batsmen or bowlers, but not both. Among those who are selected, there’s a negative correlation.
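The cricket example is easy to simulate. The sketch below is a toy model, not real cricket data: every number in it (the shared "athleticism" factor, the 0.5 weights, the selection cutoff of 2.5) is an assumption chosen for illustration. Skills start out slightly positively correlated, and players are then selected if they are very good at batting or very good at bowling.

```python
import random
from statistics import mean


def correlation(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(n_players=20000, cutoff=2.5, seed=1):
    """Return (correlation among everyone, correlation among those selected)."""
    rng = random.Random(seed)
    batting, bowling = [], []
    for _ in range(n_players):
        # Shared talent gives the two skills a slight positive correlation.
        athleticism = rng.gauss(0, 1)
        batting.append(0.5 * athleticism + rng.gauss(0, 1))
        bowling.append(0.5 * athleticism + rng.gauss(0, 1))
    # Hypothetical selection rule: very good at batting OR very good at bowling.
    selected = [(b, w) for b, w in zip(batting, bowling) if b > cutoff or w > cutoff]
    sel_batting = [b for b, _ in selected]
    sel_bowling = [w for _, w in selected]
    return correlation(batting, bowling), correlation(sel_batting, sel_bowling)


everyone_r, selected_r = simulate()
print(f"correlation among all players:      {everyone_r:.2f}")
print(f"correlation among selected players: {selected_r:.2f}")
```

Nothing about any individual player changes between the two lines of output; the sign of the correlation flips purely because of who gets looked at.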

There are examples in the social sciences: opposition to marijuana legalisation is positively correlated with opposition to government wealth redistribution in the US as a whole, but uncorrelated among Republican voters.

There are examples in medicine: the genetic variant Factor V Leiden is strongly associated with deep-vein thrombosis in the population in general, but not at all predictive of recurrence in people who have already had one.

And there are examples in dining: for a given price, a successful restaurant has to do well enough on some combination of food quality, pleasant ambience, trendiness, etc. So these will end up negatively correlated, and if you want good inexpensive food in downtown Auckland, try one of the Asian food courts.

(via @gnat, who points to one of the posts and notes: Anyone who thinks it’s possible to draw truthful conclusions from data analysis without really learning statistics needs to read this.)

October 27, 2013

Fast-food outlets and obesity

Everyone knows that areas with more fast-food stores have more overweight people, and it certainly makes sense that fast food is bad for you. Like almost everything else, though, it gets more complicated when you start looking carefully.

Firstly, earlier this year Eric Crampton wrote in NBR about some research by an economics PhD student, Rachel Webb, who was trying to take advantage of this well-known relationship to unpick some aspects of correlation vs causation in the relationship between mother’s weight and infant’s birthweight. She found that, actually, areas in New Zealand with more fast-food outlets didn’t have more obesity to any useful and consistent extent.

Secondly, there’s new research on diet and fast food using data from the big NHANES surveys in the USA.  It confirms, as you might expect, that people who eat more fast food also eat less healthily at other times.



October 10, 2013

Innovation and indexes

The 2013 Global Innovation Index is out, with writeups in Scientific American and the NZ internets, but not this year in the NZ press. Stuff, instead, tells us “Low worker engagement holds NZ back”, quoting Gallup’s ‘employee engagement’ figure of 23% for NZ, without much attempt to compare to other countries.

The two international rankings are very different: of the 16 countries above us in the Global Innovation Index, 13 have significantly lower employee engagement ratings, one (Denmark) is about the same, and one (USA) is higher (one, Hong Kong, is missing because Gallup lumps it in with the rest of the PRC).  It’s also important to consider what is behind these ratings. If you search on  “Gallup employee engagement”, you get results mostly focused on Gallup’s consulting services — getting you to worry about employee engagement is one of the ways they make money.  The Global Innovation Index, on the other hand, came from a business school and was initially sponsored by the Confederation of Indian Industry  and has now expanded with wider sponsorship and academic involvement: it’s not biased in any way that’s obviously relevant to New Zealand.

With any complicated scoring system, different countries will do well on different components of the score.  If you believe, with the authors of Why Nations Fail, that quality of institutions is the most important factor, you might focus on the “Institutions” component of the innovation index, where New Zealand is in third place. If you’re AMP economist Bevan Graham you might think the ‘business sophistication’ component is more important and note that NZ falls to 28th.

If you want NZ innovation to improve, the reverse approach might be more helpful: look at where NZ ranks poorly, and see if these are things we want to change (innovation isn’t everything) and how we might change them.



October 2, 2013

Cough, choke, history

If the PubMed research database is still surviving the US government shutdown, you can read a paper published 63 years ago today on lung cancer:

In England and Wales the phenomenal increase in the number of deaths attributed to cancer of the lung provides one of the most striking changes in the pattern of mortality recorded by the Registrar-General. For example, in the quarter of a century between 1922 and 1947 the annual number of deaths recorded increased from 612 to 9,287, or roughly fifteenfold. This remarkable increase is, of course, out of all proportion to the increase of population

Some people were arguing that the increase was just due to better diagnosis of lung cancer, and even those who believed in a real increase weren’t sure of the reason:

Two main causes have from time to time been put forward: (1) a general atmospheric pollution from the exhaust fumes of cars, from the surface dust of tarred roads, and from gas-works, industrial plants, and coal fires; and (2) the smoking of tobacco.

Richard Doll and Austin Bradford Hill decided to compare histories of smoking in lung cancer patients and those in hospital for other reasons. As you know, they found that the lung cancer patients were much more likely to be heavy smokers. It’s also interesting to read what other possibilities they considered, and how they tried to rule them out.

This sort of study isn’t completely definitive, and, famously, the eminent statistician and geneticist (and heavy smoker) R. A. Fisher was never convinced. He thought that genetic factors might well be responsible. Further evidence was provided by experiments in animals (such as the ‘smoking beagles’ of Duke University), which showed that smoking really could cause cancer. Also, much more recently, studies of twins and studies that actually measured genotypes showed that genetic differences weren’t a big enough contributor to lung cancer to explain the correlation.

In contrast to, say, alcohol or opium, tobacco has been a public health problem only for about a century: tobacco smoking became very widespread in men during the first world war. With a bit of effort and some luck, future generations might see it as an inexplicable historical anomaly, like a deadly version of canasta.

August 18, 2013

Correlation, genetics, and causation

There’s an interesting piece on cannabis risks at Project Syndicate. One of the things they look at is the correlation between frequent cannabis use and psychosis.  Many people are, quite rightly, unimpressed with this sort of correlation, since it isn’t hard to come up with explanations for psychosis causing cannabis use or for other factors causing both.

However, there is also some genetic data.  The added risk of psychosis seems to be confined to people with two copies of a particular genetic variant in a gene called AKT1. This is harder to explain as confounding (assuming the genetics has been done right), and is one of the things genetics is useful for. This isn’t just a one-off finding; it was found in one study and replicated in another.

On the other hand, the gene AKT1 doesn’t seem to be very active in brain cells, making it more likely that the finding is just a coincidence.  This is one of the things bioinformatics is good for.

In times like these it’s good to remember Ben Goldacre’s slogan “I think you’ll find it’s a bit more complicated than that.”