Posts written by Thomas Lumley (1539)


Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient

July 28, 2015

Recreational genotyping: potentially creepy?

Two stories from this morning’s Twitter (via @kristinhenry)

  • 23andMe has made available a programming interface (API) so that you can access and integrate your genetic information using apps written by other people.  Someone wrote and published code that could be used to screen users based on sex and ancestry. (Buzzfeed, FastCompany). It’s not a real threat, since apps with more than 20 users need to be reviewed by 23andMe, and since users have to agree to let the code use their data, and since Facebook knows far more about you than 23andMe, but it’s not a good look.
  • Google’s Calico project also does cheap public genotyping and is combining their DNA data (more than a million people) with family trees from Ancestry.com. This is how genetic research used to be done: since we know how DNA is inherited, connecting people with family trees deep into the past provides a lot of extra information. On the other hand, it means that if a few distantly-related people sign up for genotyping, Google will learn a lot about the genomes of all their relatives.

It’s too early to tell whether the people who worry about this sort of thing will end up looking prophetic or just paranoid.

July 27, 2015

Cheat sheet on polling margin of error

The “margin of error” in a poll is the number you add and subtract to get a 95% confidence interval for the underlying proportion (under the simplest possible mathematical model for polling).  Pollsters typically quote the “maximum margin of error”, which is the margin of error when the reported value is 50%. When the reported value is 0.7%, reporting the maximum margin of error (3.1%) is not helpful.  The Conservative Party is unpopular, but it’s not possible for them to have negative support, and not likely that they have nearly 4%.
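That 3.1% figure comes straight from the binomial standard error at 50%, where it is largest. A quick sketch, assuming the usual simple-random-sampling model and a poll of 1000:

```python
from math import sqrt

# Maximum margin of error: half-width of a 95% interval at p = 0.5,
# where the binomial standard error sqrt(p(1-p)/n) is at its largest.
n, p = 1000, 0.5
max_moe = 1.96 * sqrt(p * (1 - p) / n) * 100  # in percentage points, ≈ 3.1
```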

Here is a cheat sheet, an expanded version of one I posted last year. The first column is the reported proportion and the remaining columns are the lower and upper ends of the 95% confidence interval for a sample of size 1000 (Here’s the code).   The Conservative Party interval is  (0.3%,1.4%), not (-2.4%, 3.8%).

       l    u
0.1  0.0  0.6
0.2  0.0  0.7
0.3  0.1  0.9
0.4  0.1  1.0
0.5  0.2  1.2
0.6  0.2  1.3
0.7  0.3  1.4
0.8  0.3  1.6
0.9  0.4  1.7
1.0  0.5  1.8
1.5  0.8  2.5
2.0  1.2  3.1
2.5  1.6  3.7
3.0  2.0  4.3
3.5  2.4  4.8
4.0  2.9  5.4
4.5  3.3  6.0
5.0  3.7  6.5
10   8.2 12.0
15  12.8 17.4
20  17.6 22.6
25  22.3 27.8
30  27.2 32.9
35  32.0 38.0
50  46.9 53.1

As you can see, the margin downwards is smaller than the margin upwards for small numbers (because you can’t have fewer than no supporters). By the time you get to 30% or so, the interval is pretty close to what you’d get with the maximum margin of error, but below 10% the maximum margin of error is seriously misleading.
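The table is consistent with exact (Clopper–Pearson) binomial intervals. Here is a sketch of how those can be computed from scratch by bisection on the binomial tail probabilities — my reconstruction, not necessarily the linked code:

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) interval for a proportion k/n."""
    def solve(f):  # root of an increasing function of p on (0, 1)
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if f(mid) < 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    # lower limit: smallest p with P(X >= k | p) = alpha/2
    lower = 0.0 if k == 0 else solve(lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2)
    # upper limit: largest p with P(X <= k | p) = alpha/2
    upper = 1.0 if k == n else solve(lambda p: alpha / 2 - binom_cdf(k, n, p))
    return lower, upper
```

For example, `clopper_pearson(7, 1000)` gives roughly (0.0028, 0.0144), i.e. the (0.3%, 1.4%) Conservative Party row.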

You can get a reasonable approximation to these numbers by taking the number (not the percentage) of supporters (eg, 0.7% is 7 out of 1000), taking the square root, adding and subtracting 1, then squaring again, and finally converting back into a percentage (ie, dividing by 10 for a poll of 1000).

    approx l approx u
0.1     0.00     0.40
0.2     0.02     0.58
0.3     0.05     0.75
0.4     0.10     0.90
0.5     0.15     1.05
0.6     0.21     1.19
0.7     0.27     1.33
0.8     0.33     1.47
0.9     0.40     1.60
1       0.47     1.73
1.5     0.83     2.37
2       1.21     2.99
2.5     1.60     3.60
3       2.00     4.20
3.5     2.42     4.78
4       2.84     5.36
4.5     3.26     5.94
5       3.69     6.51
10      8.10    12.10
15     12.65    17.55
20     17.27    22.93
25     21.94    28.26
30     26.64    33.56
35     31.36    38.84
50     45.63    54.57

which is pretty easy on a calculator, or with an Excel formula. For example, for 1000-person polls, if you put the reported percentage in cell A1, use =(SQRT(A1*10)-1)^2/10 and =(SQRT(A1*10)+1)^2/10.
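The same square-root rule as a small Python function — a sketch of the rule of thumb described above, for a poll of 1000 by default:

```python
from math import sqrt

def approx_interval(percent, n=1000):
    """Rule-of-thumb 95% interval: convert the reported percentage to a
    count, take the square root, add and subtract 1, square, and convert
    back to a percentage."""
    k = percent * n / 100                 # e.g. 0.7% of 1000 is 7 supporters
    lower = (sqrt(k) - 1) ** 2 * 100 / n
    upper = (sqrt(k) + 1) ** 2 * 100 / n
    return lower, upper
```

`approx_interval(0.7)` returns roughly (0.27, 1.33), matching the table above.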


    • Profile of Auckland Stats alumnus Hadley Wickham at Priceonomics
    • The kiwi (Apteryx, not Actinidia) genome was recently sequenced by a non-NZ research group. There’s a push for NZ-led sequencing of nationally-significant genomes: a taonga genomes project
    • Linguist Jack Grieve (@JWGrieve) has been tweeting maps of various swearwords on (US) Twitter. These are relative to the total number of tweets, so they don’t have the usual problem of mostly showing where people live.

  • From Jonathan Marshall, the age distribution of NZ electorates, and their political hue: there’s a clear trend, and Ilam seems a bit of an outlier.




July 25, 2015

Some evidence-based medicine stories

  • Ben Goldacre has a piece at Buzzfeed, which is nonetheless pretty calm and reasonable, talking about the need for data transparency in clinical trials
  • The AllTrials campaign, which is trying to get regulatory reform to ensure all clinical trials are published, was joined this week by a group of pharmaceutical company investors.  This is only surprising until you think carefully: it’s like reinsurance companies and their interest in global warming — they’d rather the problems would go away, but there’s no profit in just ignoring them.
  • The big potential success story of scanning the genome blindly is a gene called PCSK9: people with a broken version have low cholesterol. Drugs that disable PCSK9 lower cholesterol a lot, but have not (yet) been shown to prevent or postpone heart disease. They’re also roughly 100 times more expensive than the current drugs, and have to be injected. None the less, they will probably go on sale soon.
    A survey of a convenience sample of US cardiologists found that they were hoping to use the drugs in 40% of their patients who have already had a heart attack, and 25% of those who have not yet had one.
July 24, 2015

Are beneficiaries increasingly failing drug tests?

Stuff’s headline is “Beneficiaries increasingly failing drug tests, numbers show”.

The numbers are rates per week of people failing or refusing drug tests. The number was 1.8/week for the first 12 weeks of the policy and 2.6/week for the whole year 2014, and, yes, 2.6 is bigger than 1.8.  However, we don’t know how many tests were performed or demanded, so we don’t know how much of this might be an increase in testing.

In addition, if we don’t worry about the rate of testing and take the numbers at face value, the difference is well within what you’d expect from random variation, so while the numbers are higher it would be unwise to draw any policy conclusions from the difference.
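One way to see this uses the same (√count ± 1)² rule of thumb as the polling cheat sheet: treat the first-twelve-weeks figure as a count of about 22 failures (1.8/week × 12, rounded — a back-calculated assumption, since the raw counts weren’t reported) and ask which weekly rates are compatible with it:

```python
from math import sqrt

# ~22 failures in 12 weeks, back-calculated from the quoted 1.8/week
failures, weeks = 22, 12

# Quick 95% interval for a count: (sqrt(k) - 1)^2 to (sqrt(k) + 1)^2,
# then divide by the number of weeks to get a weekly rate.
low = (sqrt(failures) - 1) ** 2 / weeks
high = (sqrt(failures) + 1) ** 2 / weeks
```

The interval works out to roughly (1.1, 2.7) failures per week, which comfortably contains the later 2.6/week.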

On the other hand, the absolute numbers of failures are very low when compared to the estimates in the Treasury’s Regulatory Impact Statement.

MSD and MoH have estimated that once this policy is fully implemented, it may result in:

• 2,900 – 5,800 beneficiaries being sanctioned for a first failure over a 12 month period

• 1,000 – 1,900 beneficiaries being sanctioned for a second failure over a 12 month period

• 500 – 1,100 beneficiaries being sanctioned for a third failure over a 12 month period.

The numbers quoted by Stuff are 60 sanctions in total over eighteen months, and 134 test failures over twelve months.  The Minister is quoted as saying the low numbers show the program is working, but as she could have said the same thing about numbers that looked like the predictions, or numbers that were higher than the predictions, it’s also possible that being off by an order of magnitude or two is a sign of a problem.


July 23, 2015

Diversity is (very slightly) good for you

This isn’t in the local news, but there are stories about it in the world media: a new paper in Nature on associations between genetic diversity and various desirable characteristics.  I’m one of the authors — and so is pretty much everyone else, since this research combines analyses from over 100 cohort studies.  The Nature paper is actually the second publication in this area that I’ve worked on.  My first Auckland MSc student in Statistics, Anish Scaria, did some analysis for a different definition of genetic diversity, and that plus data from a smaller group of cohort studies was published last year.

What did we do? Humans, like most animals and many plants[1], have two copies of our complete genome[2]. We looked at how similar these two copies were, essentially measuring small amounts of inbreeding from distant ancestors.

Each cohort study had measured a large number of binary genetic variants, ranging from 300,000 to 1,000,000. In the first paper we looked at just the proportion of variants where the two copies were the same[3]. In the new paper we looked at contiguous chunks of genome where all the variants were the same in the two copies, which gives a more sensitive indication of the chunks of genome being inherited from the same distant ancestor. We compared people based on the proportion of genome that was in these contiguous chunks.
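As a toy illustration of the two measures (the 0/1/2 genotype coding and the run-length threshold here are hypothetical; the real analyses used per-cohort genetics software and much longer runs):

```python
def diversity_measures(genotypes, min_run=25):
    """genotypes: 0/1/2 copies of the alternative allele at consecutive
    variants. Returns (proportion of homozygous variants, proportion of
    variants lying in long homozygous runs) -- toy versions of the first
    and second papers' measures, respectively."""
    homozygous = [g in (0, 2) for g in genotypes]
    n = len(homozygous)
    prop_hom = sum(homozygous) / n

    in_runs, run = 0, 0
    for h in homozygous + [False]:      # sentinel flushes the final run
        if h:
            run += 1
        else:
            if run >= min_run:          # only count sufficiently long runs
                in_runs += run
            run = 0
    return prop_hom, in_runs / n
```

A single heterozygous variant breaks a run, which is why the run-based measure is the more sensitive indicator of a chunk inherited from one distant ancestor.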

The comparisons were done separately within each cohort and the associations were then averaged: obviously you would get different genetic diversity in a cohort from Iceland versus a cohort of African-Americans, and we need to make sure that sort of difference didn’t get incorporated in the analysis. Similarly, for cohorts that recruited people of different ancestries, the comparisons were done between people of the same basic ancestry and averaged.

Our first paper found that people with more difference between their two genomic copies lived (very slightly) longer on average; the new paper found that (to a very small extent) they were taller, had higher average scores on IQ tests, and had lower cholesterol. The basic direction of the results wasn’t surprising, but the lack of association for specific diseases and risk factors was — there was no sign of a difference in diabetes, for example.

Scientifically, the data provide a little bit of extra support for height and whatever IQ tests measure having been under evolutionary selection, and a bit of negative evidence on diabetes and heart disease having been under evolutionary selection in human history. And also a bit of support for the idea that you can actually get more than a hundred groups of independent and fiercely territorial academics to work together sometimes.



1. Some important crop plants, such as wheat, cabbage, and sugarcane, are insanely more complicated
2. Yes, I’m ignoring the sex chromosomes here.
3. “Homozygous” is the technical term.

July 22, 2015

Are reusable shopping bags deadly?

There’s a research report by two economists arguing that San Francisco’s ban on plastic shopping bags has led to a nearly 50% increase in deaths from foodborne disease, an increase of about 5.5 deaths per year.  I was asked my opinion on Twitter. I don’t believe it.

What the analysis does show is some evidence that emergency room visits for foodborne disease have increased: the researchers analysed admissions for E. coli, Salmonella, and Campylobacter infection, and found an increase in San Francisco but not in neighbouring counties. There’s a statistical issue in that the number of counties is small and the standard error estimates tend to be a bit unreliable in that setting, but that’s not prohibitive. There’s also a statistical issue in that we don’t know which (if any) infections were related to contamination of raw food, but again that’s not prohibitive.

The problem with the analysis of deaths is the definition: the deaths in the analysis were actually all of the ICD10 codes A00-A09. Most of this isn’t foodborne bacterial disease, and a lot of the deaths from foodborne bacterial disease will be in settings where shopping bags are irrelevant. In particular, two important contributors are

  • Clostridium difficile infections after antibiotic use, which has a fairly high mortality rate
  • Diarrhoea in very frail elderly people, in residential aged care or nursing homes.

In the first case, this has nothing to do with food. In the second case, it’s often person-to-person transmission (with norovirus a leading cause), but even if it is from food, the food isn’t carried in reusable shopping bags.

Tomás Aragón, of the San Francisco Department of Public Health, has a more detailed breakdown of the death data than was available to the researchers. His memo is, I think, too negative on the statistical issues, but the data underlying the A00-A09 categories are pretty convincing:


Category A021 is Salmonella (other than typhoid); A048 and A049 are other miscellaneous bacterial infections; A081 and A084 are viral. A090 and A099 are left-over categories that are supposed to exclude foodborne disease but will capture some cases where the mechanism of infection wasn’t known.  A047 is Clostridium difficile.   The apparent signal is in the wrong place. It’s not obvious why the statistical analysis thinks it has found evidence of an effect of the plastic-bag ban, but it is obvious that it hasn’t.

Here, for comparison, are New Zealand mortality data for specific foodborne infections, from the most recent years available:


Over the three years, there were only ten deaths where the underlying cause was one of these food-borne illnesses — a lot of people get sick, but very few die.


The mortality data don’t invalidate the analysis of hospital admissions, where there’s a lot more information and it is actually about (potentially) foodborne diseases.  More data from other cities — especially ones that are less atypical than San Francisco — would be helpful here, and it’s possible that this is a real effect of reusing bags. The economic analysis, however, relies heavily on the social costs of deaths.

July 20, 2015

Pie chart of the day

From the Herald (squashed-trees version, via @economissive)


For comparison, a pie of those aged 65+ in NZ regardless of where they live, based on national population estimates:


Almost all the information in the pie is about population size; almost none is about where people live.

A pie chart isn’t a wonderful way to display any data, but it’s especially bad as a way to show relationships between variables. In this case, if you divide by the size of the population group, you find that the proportion in private dwellings is almost identical for 65-74 and 75-84, but about 20% lower for 85+.  That’s the real story in the data.

July 19, 2015


  • In the interests of balance, a post at Public Address by Rob Salmond, who did the analysis in the ‘Chinese names’ real-estate leak.  And a robust twitter discussion with him, Keith Ng, and Tze Ming Mok.
  • Stats New Zealand has a new standard question about gender identity (as distinguished from sex), acknowledging that it isn’t as simple as some people would like it to be.
  • The most important aspects of health seem to vary by age: “older raters gave significantly more weight to functional limitations and social functioning and less to morbidities and pain experience, compared to younger raters.” (via @hildabast)
  • Priceonomics has a post on the most common and most distinctive ingredients in recipes from around the world. The list illustrates the problem with the ‘distinctiveness’ metric (as Kieran Healy pointed out: whiskey is really not the distinctive signature of Irish food).  It also shows up other problems: for example, “African” and “Asian” are both listed as cuisines. Fundamentally, the limitation is in the recipe lists and the approximations made: galangal shows up as a reasonable candidate for most-distinctive Thai ingredient partly because there aren’t any substitutes; cayenne is the most widely used ingredient in the Mexican recipes because it’s being substituted for other chillis.
July 16, 2015

Don’t just sit there, do something

The Herald’s story on sitting and cancer is actually not as good as the Daily Mail story it’s edited from. Neither one gives the journal or researchers (the paper is here). Both mention a previous study, but the Mail goes into more useful detail.

The basic finding is

Longer leisure-time spent sitting, after adjustment for physical activity, BMI and other factors, was associated with risk of total cancer in women (RR=1.10, 95% CI 1.04-1.17 for >6 hours vs. <3 hours per day), but not men (RR=1.00, 95% CI 0.96-1.05)

The lack of association in men was a surprise, and strongly suggests that the result for women shouldn’t be believed. It’s also notable that while the estimated associations with a few types of cancer look strong, the lower limits on the confidence intervals don’t look strong:

risk of multiple myeloma (RR=1.65, 95% CI 1.07-2.54), invasive breast cancer (RR=1.10, 95% CI 1.00-1.21), and ovarian cancer (RR=1.43, 95% CI 1.10-1.87).

Since the researchers looked at eighteen subsets of cancer in addition to all types combined, and these are the top three, the real lower limits are even lower.
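A rough way to see how much difference that makes is to widen the interval to a Bonferroni-adjusted confidence level. The sketch below uses the multiple-myeloma figures from the quote and assumes 19 comparisons (18 subsets plus the overall analysis — my count, not the paper’s); the standard error is recovered from the reported interval on the log scale:

```python
from math import exp, log
from statistics import NormalDist

rr, lo, hi = 1.65, 1.07, 2.54   # multiple myeloma figures from the quote
m = 19                          # assumed number of comparisons

se = (log(hi) - log(lo)) / (2 * 1.96)          # SE of log(RR), from the 95% CI
z = NormalDist().inv_cdf(1 - 0.05 / (2 * m))   # Bonferroni critical value
adj_lower = exp(log(rr) - z * se)
adj_upper = exp(log(rr) + z * se)
```

The adjusted lower limit drops to about 0.85 — below the ‘no difference’ line of 1, even though the unadjusted limit of 1.07 was above it.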

The stories referred to previous research, published last year, which summarised many previous studies of sitting and cancer risk.  That’s good, but the summary wasn’t entirely accurate. From the Herald:

Previous research by the University of Regensburg in Germany found that spending too much time sitting raised the risk of bowel and lung cancer in both men and women.

In fact, the previous research didn’t look separately at men and women (or, at least, didn’t report doing so). While you would expect similar results in men and women, that study doesn’t address the question.

The Mail does have one apparently good contextual point

However, this previous study – which reviewed 43 other studies – did not find a link between sitting and a higher risk of breast and ovarian cancer. 

But when you look at the actual figures, there’s no real inconsistency between the two studies: they both report weak evidence of higher risk; it’s just a question of whether the lower end of the confidence interval happens to cross the ‘no difference’ line for a particular subset of cancers.

Overall, this is a pretty small risk difference to detect from observational data. If you didn’t already think that long periods of sitting could be bad for you, this wouldn’t be a reason to start.