Posts filed under Social Media (70)

May 1, 2015

Have your say on the 2018 census


StatsNZ has a discussion forum on the 2018 Census


They say

The discussion on Loomio will be open from 30 Apr to 10 Jun 2015.

Your discussions will be considered as an input to final decision making.

Your best opportunity to influence census content is to make a submission. Statistics NZ will use this 2018 Census content determination framework to make final decisions on content. The formal submission period will be open from 18 May until 30 Jun 2015 via

So, if you have views on what should be asked and how it should be asked, join in the discussion and/or make a submission


March 23, 2015

Cricket visualisations


March 20, 2015

Ideas that didn’t pan out

One way medical statisticians are trained into skepticism over their careers is seeing all the exciting ideas from excited scientists and clinicians that don’t turn out to work. Looking at old hypotheses is a good way to start. This graph is from a 1986 paper in the journal Medical Hypotheses, and the authors are suggesting pork consumption is important in multiple sclerosis, because there’s a strong correlation between rates of multiple sclerosis and pork consumption across countries:


This wasn’t a completely silly idea, but it was never anything but suggestive, for two main reasons. First, it’s just a correlation. Second, it’s not even a correlation at the level of individual people — the graph is just as strong support for the idea that having neighbours who eat pork causes multiple sclerosis. Still, dietary correlations across countries have been useful in research.

If you wanted to push this idea today, as a Twitter account claiming to be from a US medical practice did, you’d want to look carefully at the graph rather than just repeating the correlation. There are some countries missing, and other countries that might have changed over the past three decades.

In particular, the graph does not have data for Korea, Taiwan, or China. These have high per-capita pork consumption, and very low rates of multiple sclerosis — and that’s even more true of Hong Kong, and specifically of Chinese people in Hong Kong.  In the other direction, the hypothesis would imply very low levels of multiple sclerosis among US and European Jews. I don’t have data there, but in people born in Israel the rate of multiple sclerosis is moderate among those of Ashkenazi heritage and low in others, which would also mess up the correlations.

You might also notice that the journal is (or was) a little non-standard, or as it said  “intended as a forum for unconventional ideas without the traditional filter of scientific peer review”.

Most of this information doesn’t even need a university’s access to scientific journals — it’s just out on the web.  It’s a nice example of how an interesting and apparently strong correlation can break down completely with a bit more data.

March 19, 2015

Model organisms

The flame retardant chemicals in your phone made zebra fish “chubby”, says the caption on this photo. Zebra fish, as the story explains, are a common model organism for medical research, so this could be relevant to people.


On the other hand, as @LewSOS points out on Twitter, it doesn’t seem to be having the same effect on the model organisms in the photo.

What’s notable about the story is how much better it is than the press release, which starts out

Could your electronics be making you fat? According to University of Houston researchers, a common flame retardant used to keep electronics from overheating may be to blame.

The story carefully avoids repeating this unsupported claim.  Also, the press release doesn’t link to the research paper, or even say where it was published (or even that it was published). That’s irritating in the media but unforgivable in a university press release.   When you read the paper it turns out the main research finding was that looking at fat accumulation in embryonic zebrafish (which is easy because they are transparent, one of their other advantages over mice) was a good indication of weight gain later in life, and might be a useful first step in deciding which chemicals were worth testing in mice.

So, given all that, does your phone or computer actually expose you to any meaningful amount of this stuff?

The compounds in question, Tetrabromobisphoneol A (TBBPA) and tetrachlorobisphenol A (TCBPA) can leach out of the devices and often end up settling on dust particles in the air we breathe, the study found.

That’s one of the few mistakes in the story: this isn’t what the study found, it’s part of the background information. In any case, the question is how much leaches out. Is it enough to matter?

The European Union doesn’t think so

The highest inhalation exposures to TBBP-A were found in the production (loading and mixing) of plastics, with 8-hour time-weighted-averages (TWAs) up to 12,216 μg/m3 . At the other end of the range, offices containing computers showed TBBP-A air concentrations of less than 0.001 μg/m3 . TBBP-A exposures at sites where computers were shredded, or where laminates were manufactured ranged from 0.1 to 75 μg/m3 .

You might worry about the exposures from plastics production, and about long-term environmental accumulations, but it looks like TBBP-A from being around a phone isn’t going to be a big contributor to obesity. That’s also what the international comparisons would suggest — South Korea and Singapore have quite a lot more smartphone ownership than Australia, and Norway and Sweden are comparable, all with much less obesity.

March 18, 2015

Awful graphs about interesting data


Today in “awful graphs about interesting data” we have this effort that I saw on Twitter, from a paper in one of the Nature Reviews journals.


As with some other recent social media examples, the first problem is that the caption isn’t part of the image and so doesn’t get tweeted. The numbers are the average number of drug candidates at each stage of research to end up with one actual drug at the end. The percentage at the bottom is the reciprocal of the number at the top, multiplied by 60%.

A lot of news coverage of research is at the ‘preclinical’ stage, or is even earlier, at the stage of identifying a promising place to look.  Most of these never get anywhere. Sometimes you see coverage of a successful new cancer drug candidate in Phase I (first human studies). Most of these never get anywhere.  There’s also a lot of variation in how successful the ‘successes’ are: the new drugs for Hepatitis C (the first column) are a cure for many people; the new Alzheimer’s drugs just give a modest improvement in symptoms.  It looks as though drugs for MRSA (antibiotic-resistant Staph. aureus) are easier, but that’s because there aren’t many really novel preclinical candidates.

It’s an interesting table of numbers, but as a graph it’s pretty dreadful. The 3-d effect is purely decorative: it has nothing to do with the representation of the numbers. Effectively, it’s a bar chart, except that the bars are aligned at the centre and have differently-shaped weird decorative bits at the ends, so they are harder to read.

At the top of the chart,  the width of the pale blue region where it crosses the dashed line is the actual data value. Towards the bottom of the chart even that fails, because the visual metaphor of a deformed funnel requires the ‘Launch’ bar to be noticeably narrower than the ‘Registration’ bar. If they’d gone with the more usual metaphor of a pipeline, the graph could have been less inaccurate.

In the end, it’s yet another illustration of two graphical principles. The first: no 3-d graphics. The second: if you have to write all the numbers on the graph, it’s a sign the graph isn’t doing its job.

March 16, 2015

Maps, colours, and locations

This is part of a social media map, of photographs taken in public places in the San Francisco Bay Area


The colours are trying to indicate three social media sites: Instagram is yellow, Flickr is magenta, Twitter is cyan.

Encoding three variables with colour this way doesn’t allow you to easily read off differences, but you can see clusters and then think about how to decode them into data. The dark green areas are saturated with photos.  Light green urban areas have Instagram and Twitter, but not much Flickr.  Pink and orange areas lack Twitter — mostly these track cellphone coverage and population density, but not entirely.  The pink area in the center of the map is spectacular landscape without many people; the orange blob on the right is the popular Angel Island park.

Zooming in on Angel Island shows something interesting: there are a few blobs with high density across all three social media systems. The two at the top are easily explained: the visitor centre and the only place on the island that sells food. The very dense blob in the middle of the island, and the slightly less dense one below it are a bit strange. They don’t seem to correspond to any plausible features.


My guess is that these are a phenomenon we’ve seen before, of locations being mapped to the center of some region if they can’t be specified precisely.

Automated data tends to be messy, and making serious use of it means finding out the ways it lies to you. Wayne Dobson doesn’t have your cellphone, and there isn’t a uniquely Twitter-worthy bush in the middle of Angel Island.
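One rough way to look for this kind of artefact, assuming you have the raw coordinates behind the map, is to count how often exactly the same point turns up. The sketch below is only an illustration: the function name, the thresholds, and the coordinates are all invented, and a genuinely popular spot can repeat too, so a flagged point is a prompt to investigate rather than proof of a default location.

```python
from collections import Counter

def repeated_locations(points, min_count=3, digits=5):
    """Return (location, count) pairs for coordinates that repeat suspiciously often.

    points: iterable of (latitude, longitude) pairs.
    min_count and digits are arbitrary illustrative thresholds.
    """
    counts = Counter((round(lat, digits), round(lon, digits)) for lat, lon in points)
    return [(loc, n) for loc, n in counts.most_common() if n >= min_count]

# Invented coordinates, for illustration only: four records share one point.
points = [
    (37.8609, -122.4326), (37.8609, -122.4326),
    (37.8609, -122.4326), (37.8609, -122.4326),
    (37.8652, -122.4301), (37.8587, -122.4270), (37.8631, -122.4355),
]

flagged = repeated_locations(points)
print(flagged)
```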


March 14, 2015

Ok, but it matters in theory

Some discussion on Twitter about political polling and whether political journalists understood the numbers led to the question:

If you poll 500 people, and candidate 1 is on 35% and candidate 2 is on 30%, what is the chance candidate 2 is really ahead?

That’s the wrong question. Well, no, actually it’s the right question, but it is underdetermined.

The difficulty is related to the ‘base-rate’ problem in testing for rare diseases: it’s easy to work out the probability of the data given the way the world is, but you want the probability the world is a certain way given the data. These aren’t the same.

If you want to know how much variability there is in a poll, the usual ‘maximum margin of error’ is helpful.  In theory, over a fairly wide range of true support, one poll in 20 will be off by more than this, half being too high and half being too low. In theory it’s 3% for 1000 people, 4.5% for 500. For minor parties, I’ve got a table here. In practice, the variability in NZ polls is larger than in theoretically perfect polls, but we’ll ignore that here.
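As a rough check of those figures, here is a minimal Python sketch of the maximum margin of error. It assumes the common 1/sqrt(N) rule of thumb (about two standard errors when true support is 50%) and a theoretically perfect simple random sample; the function name is just for illustration.

```python
import math

def max_margin_of_error(n):
    """Approximate maximum margin of error for a simple random sample of n people.

    Assumes the 1/sqrt(n) rule of thumb, i.e. roughly two standard errors
    when true support is 50%.
    """
    return 1 / math.sqrt(n)

for n in (1000, 500):
    print(n, f"{100 * max_margin_of_error(n):.1f}%")
```

This gives a little over 3% for 1000 people and about 4.5% for 500.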

If you want to know about change between two polls, the margin of error is about 1.4 times higher. If you want to know about difference between two candidates, the computations are trickier. When you can ignore other candidates and undecided voters, the margin of error is about twice the standard value, because a vote added to one side must be taken away from the other side, and so counts twice.

When you can’t ignore other candidates, the question isn’t exactly answerable without more information, but Jonathan Marshall has a nice app with results for one set of assumptions. Approximately, instead of the margin of error for the difference being 2*square root(1/N) as in the simple case, you replace the 1 by the sum of the two candidate estimates, giving 2*square root((0.35+0.30)/N).  With N=500, the margin of error is about 7%.  If the support for the two candidates were equal, there would be about a 9% chance of seeing candidate 1 ahead of candidate 2 by at least 5%.
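The arithmetic can be sketched in a few lines of Python. This is an approximation under stated assumptions, not the method used in the app: it treats the margin of error as two standard errors and uses a Normal approximation for the observed difference. The function names are made up for the example.

```python
import math

def normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def moe_difference(p1, p2, n):
    """Approximate margin of error for the difference between two candidates
    in the same poll: 2 * sqrt((p1 + p2) / n)."""
    return 2 * math.sqrt((p1 + p2) / n)

n = 500
p1, p2 = 0.35, 0.30

moe = moe_difference(p1, p2, n)
se = moe / 2  # assumption: margin of error is about two standard errors

# If true support were equal, chance of candidate 1 polling at least 5 points ahead
prob_lead = 1 - normal_cdf((p1 - p2) / se)

print(f"margin of error for the difference: {100 * moe:.1f}%")
print(f"chance of a 5-point lead under equal support: {100 * prob_lead:.1f}%")
```

With these assumptions the margin of error comes out at about 7% and the tail probability at a little over 8%; using 1.96 rather than 2 standard errors nudges it closer to 9%.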

All this, though, doesn’t get you an answer to the question as originally posed.

If you poll 500 people, and candidate 1 is on 35% and candidate 2 is on 30%, what is the chance candidate 2 is really ahead?

This depends on what you knew in advance. If you had been reasonably confident that candidate 1 was behind candidate 2 in support you would be justified in believing that candidate 1 had been lucky, and assigning a relatively high probability that candidate 2 is really ahead. If you’d thought it was basically impossible for candidate 2 to even be close to candidate 1, you probably need to sit down quietly and re-evaluate your beliefs and the evidence they were based on.

The question is obviously looking for an answer in the setting where you don’t know anything else. In the general case this turns out to be, depending on your philosophy, either difficult to agree on or intrinsically meaningless.  In special cases, we may be able to agree.


If

  1. for values within the margin of error, you had no strong belief that any value was more likely than any other
  2. there aren’t values outside the margin of error that you thought were much more likely than those inside

we can roughly approximate your prior beliefs by a flat distribution, and your posterior beliefs by a Normal distribution with mean at the observed data value and with standard deviation equal to half the margin of error (the margin of error being about two standard errors).

In that case, the probability of candidate 2 being ahead is 9%, the same answer as the reverse question.  You could make a case that this was a reasonable way to report the result, at least if there weren’t any other polls and if the model was explicitly or implicitly agreed. When there are other polls, though, this becomes a less convincing argument.
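To see how much the answer depends on what you believed beforehand, here is a small numerical sketch. It assumes a Normal likelihood for the observed 5-point lead, with the standard error taken from the approximate formula for the difference, and compares a flat prior with an invented ‘sceptical’ prior centred on candidate 2 being 5 points ahead. The priors, the grid, and the function names are illustrative assumptions only.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a Normal distribution."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def prob_candidate2_ahead(observed_diff, se, prior):
    """Posterior probability that the true difference
    (candidate 1 minus candidate 2) is negative, computed on a grid."""
    step = 0.0005
    # Midpoints of a grid of possible true differences from -0.5 to 0.5
    grid = [-0.5 + (i + 0.5) * step for i in range(2000)]
    weights = [prior(d) * normal_pdf(observed_diff, d, se) for d in grid]
    total = sum(weights)
    return sum(w for d, w in zip(grid, weights) if d < 0) / total

# Standard error of the difference for a poll of 500 with 35% and 30%
se = math.sqrt((0.35 + 0.30) / 500)

def flat(d):
    """Flat prior: every true difference equally plausible."""
    return 1.0

def sceptical(d):
    """Invented prior: you were fairly sure candidate 2 was about 5 points ahead."""
    return normal_pdf(d, -0.05, 0.03)

p_flat = prob_candidate2_ahead(0.05, se, flat)
p_sceptical = prob_candidate2_ahead(0.05, se, sceptical)

print(f"flat prior:      {100 * p_flat:.0f}%")
print(f"sceptical prior: {100 * p_sceptical:.0f}%")
```

With the flat prior the result is the same tail probability as in the reverse question (roughly 8 to 9%). With the sceptical prior the same poll leaves candidate 2 more likely than not to be ahead.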

TL;DR: The probability Winston is behind given that he polls 5% higher isn’t conceptually the same as the probability that he polls 5% higher given that he is behind.  But, if we pretend to be in exactly the right state of quasi-ignorance, they come out to be the same number, and it’s roughly 1 in 10.

March 12, 2015

Variation and mean

A lot of statistical reporting focuses on means, or other summaries of where a distribution lies. Often, though, variation is important. Vox has a story about variation in costs of lab tests at California hospitals, based on a paper in BMJ Open. Vox says

The charge for a lipid panel ranged from $10 to $10,169. Hospital prices for a basic metabolic panel (which doctors use to measure the body’s metabolism) were $35 at one facility — and $7,303 at another

These are basically standard lab tests, so there’s no sane reason for this sort of huge variation. You’d expect some variation with volume of tests and with location, but nothing like what is seen.

What’s not clear is how much this is really just variation in how costs are attributed. A hospital needs a blood lab, which has a lot of fixed costs. Somehow these costs have to be spread over individual tests, but there’s no unique way to do this.  It would be interesting to know if the labs with high charges for one test tend to have high charges for others, but the research paper doesn’t look at relationships between costs.

The Vox story also illustrates a point about reporting, with this graph


If you look carefully, there’s something strange about the graph. The brown box second from the right is ‘lipid panel’, and it goes up to a bit short of $600, not to $10169. Similarly, the ‘metabolic panel’, the right-most box, goes up to $1000 on the graph and $7303 in the story.

The graph is taken from the research paper. In the research paper it had a caption explaining that the ‘whiskers’ in the box plot go to the 5th and 95th percentiles (a non-standard but reasonable choice). This caption fell off on the way to Vox, and no-one seems to have noticed.
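A small sketch shows how far apart the top of a 95th-percentile whisker and the maximum can be. The charges below are synthetic, invented purely for illustration (they are not the data from the paper), and the choice of the ‘inclusive’ quantile method is an assumption.

```python
import statistics

# Synthetic charges in dollars, invented for illustration only:
# 99 evenly spaced values from $10 to $500, plus one extreme value.
charges = list(range(10, 505, 5)) + [10169]

# Cut points at 5%, 10%, ..., 95%; the first and last are the whisker ends
cuts = statistics.quantiles(charges, n=20, method="inclusive")
whisker_low, whisker_high = cuts[0], cuts[-1]

print(f"5th percentile:  ${whisker_low:,.2f}")
print(f"95th percentile: ${whisker_high:,.2f}")
print(f"maximum:         ${max(charges):,}")
```

The upper whisker stops below $500 even though the largest charge is over $10,000, which is the same kind of mismatch as between the graph and the numbers quoted in the story.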

March 5, 2015

Showing us the money

The Herald is running a project to crowdsource data entry and annotation for NZ political donations and expenses: it’s something that’s hard to automate and where local knowledge is useful. Today, they have an interactive graph for 2014 election donations and have made the data available


February 27, 2015

Siberian hamsters or Asian gerbils

Every year or so there is a news story along the lines of “Everything you know about the Black Death is Wrong”. I’ve just been reading a couple of excellent posts by Alison Atkin on this year’s one.

The Herald’s version of the story (which they got from the Independent) is typical (but she has captured a large set of headlines)

The Black Death has always been bad publicity for rats, with the rodent widely blamed for killing millions of people across Europe by spreading the bubonic plague.

But it seems that the creature, in this case at least, has been unfairly maligned, as new research points the finger of blame at gerbils.


The scientists switched the blame from rat to gerbil after comparing tree-ring records from Europe with 7711 historical plague outbreaks.

That isn’t what the research paper (in PNAS) says. And it would be surprising if it did: could it really be true that Asian gerbils were spreading across Europe for centuries without anyone noticing?

The abstract of the paper says

The second plague pandemic in medieval Europe started with the Black Death epidemic of 1347–1353 and killed millions of people over a time span of four centuries. It is commonly thought that after its initial introduction from Asia, the disease persisted in Europe in rodent reservoirs until it eventually disappeared. Here, we show that climate-driven outbreaks of Yersinia pestis in Asian rodent plague reservoirs are significantly associated with new waves of plague arriving into Europe through its maritime trade network with Asia. This association strongly suggests that the bacterium was continuously reimported into Europe during the second plague pandemic, and offers an alternative explanation to putative European rodent reservoirs for how the disease could have persisted in Europe for so long.

If the researchers had found repeated, previously unsuspected, invasions of Europe by hordes of gerbils, they would have said so in the abstract. They don’t. Not a gerbil to be seen.

The hypothesis is that plague was repeatedly re-imported from Asia (where it affected lots of species, including, yes, gerbils) to European rats, rather than persisting at low levels in European rats between the epidemics. Either way, once the epidemic got to Europe, it’s all about the rats [update: and other non-novel forms of transmission].

In this example, for a change, it doesn’t seem that the press release is responsible. Instead, it looks like progressive mutations in the story as it’s transmitted, with the great gerbil gradually going from an illustrative example of a plague host in Asia to the rodent version of Attila the Hun.

Two final remarks. First, the erroneous story is now in the Wikipedia entry for the great gerbil (with a citation to the PNAS paper, so it looks as if it’s real). Second, when the story is allegedly about the confusion between two species of rodent, it’s a pity the Herald stock photo isn’t the right species.


[Update: Wikipedia has been fixed.]