Posts filed under Graphics (370)

April 26, 2017

Simplifying to make a picture

1. A consumer DNA-testing company has maps of the ancestry structure of North America, based on people who sent in samples for its genotyping service (click to embiggen).

To make these maps, they looked for pairs of people whose DNA showed they were distant relatives, then simplified the resulting network into relatively stable clusters. They then drew the clusters on a map and coloured them according to what part of the world those people’s distant ancestors probably came from.  In theory, this should give something like a map of immigration into the US (and to a lesser extent, of remaining Native populations).  The map is a massive oversimplification, but that’s more or less the point: it simplifies the data to highlight particular patterns (and, necessarily, to hide others).  There’s a research paper, too.


2. In a satire on predictive policing, The New Inquiry has an app showing high-risk neighbourhoods for financial crime. There’s also a story at Buzzfeed.


The app uses data from the US Financial Industry Regulatory Authority (FINRA), and models the risk of financial crime using the usual sort of neighbourhood characteristics (eg number of liquor licenses, number of investment advisers).


3. The Sydney Morning Herald had a social/political quiz “What Kind of Aussie Are You?”.


They also have a discussion of how they designed the 7 groups.  Again, the groups aren’t entirely real, but are a set of stories told about complicated, multi-dimensional data.


The challenge in any display of this type is to remove enough information that the stories are visible, but not so much that they aren’t true – and not everyone will agree on whether you’ve succeeded.

March 8, 2017

Yes, November 19


The graph is from a Google Trends search for “International Men’s Day”.

There are two peaks. In the majority of years, the larger peak is on International Women’s Day, and the smaller peak is on the day itself.

March 7, 2017

The amazing pizzachart

From YouGov (who seem to already be regretting it).


This obviously isn’t a pie chart, because the pieces are the same size but the numbers are different. It’s not really a graph at all; it’s an idiosyncratically organised, illustrated table.  It gets worse, though. The pizza picture itself isn’t doing any productive work in this graphic: the only information it conveys is misleading. There’s a clear impression given that particular ingredients go together, when that’s not how the questions were asked. And as the footnote says, there are a lot of popular ingredients that didn’t even make it on to the graphic.



October 30, 2016

Suboptimal ways to present risk

Graeme Edgeler nominated this, from PBS Frontline, to @statschat as a bad graph.


It’s actually almost a good graph, but I think it’s trying to do too many things at once. There are two basic numerical facts: the number of people trying to cross the Mediterranean to escape the Syrian crisis has gone down substantially; the number of deaths has stayed about the same.

If you want to show the increase in risk, it’s much more effective to use a fixed, round denominator —  the main reason to use this sort of graph is that people pick up risk information better as frequencies than as fractions.

Here’s the comparison using the same denominator, 269, for the two years. It’s visually obvious that there has been a three-fold increase in death rate.
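The arithmetic behind the fixed-denominator framing is easy to sketch. The counts below are hypothetical placeholders, not the Frontline figures: the point is just that rescaling both years to the same denominator turns a comparison of fractions into a comparison of frequencies.

```python
def deaths_per(denominator, deaths, crossings):
    """Rescale a death rate to a fixed denominator of crossings."""
    return deaths / crossings * denominator

# Hypothetical counts: a similar number of deaths in each year,
# but far fewer crossings in year two.
year1 = deaths_per(269, deaths=3_800, crossings=1_000_000)
year2 = deaths_per(269, deaths=3_800, crossings=330_000)

print(f"year 1: about {year1:.0f} death per 269 crossings")
print(f"year 2: about {year2:.0f} deaths per 269 crossings")
```

With both years on the same denominator, the increase in risk is visible in the numerators alone.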


It’s harder to convey all the comparisons clearly in one graph. A mosaic plot would work for higher proportions, which we can all hope doesn’t become a relevant fact.


October 18, 2016

The lack of change is the real story

The Chief Coroner has released provisional suicide statistics for the year to June 2016.  As I wrote last year, the rate of suicide in New Zealand is basically not changing.  The Herald’s story, by Martin Johnston, quotes the Chief Coroner on this point

“Judge Marshall interpreted the suicide death rate as having remained consistent and said it showed New Zealand still had a long way to go in turning around the unacceptably high toll of suicide.”

The headline and graphs don’t make this clear

Here’s the graph from the Herald


If you want a bar graph, it should go down to zero, and it would then show how little is changing


I’d prefer a line graph showing expected variation if there wasn’t any underlying change: the shading is one and two standard deviations around the average of the nine years’ rates


As Judge Marshall says, the suicide death rate has remained consistent. That’s our problem.  Focusing on the year to year variation misses the key point.
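For anyone wanting to reproduce the shaded-band idea, here is a minimal sketch. The rates are made-up placeholders (the real figures are in the Chief Coroner's release); the bands are one and two standard deviations around the nine-year average, which is roughly the variation you'd expect if nothing underlying were changing.

```python
import statistics

# Made-up annual rates per 100,000 for nine years -- placeholders only.
rates = [11.8, 12.2, 11.5, 12.0, 11.9, 12.4, 11.7, 12.1, 11.9]

mean = statistics.mean(rates)
sd = statistics.stdev(rates)

# One- and two-SD bands around the nine-year average. If there is no
# underlying change, about 2/3 of years should land in the inner band
# and nearly all of them in the outer one.
bands = {k: (mean - k * sd, mean + k * sd) for k in (1, 2)}
inside = sum(bands[1][0] <= r <= bands[1][1] for r in rates)

print(f"mean {mean:.2f}, sd {sd:.2f}; {inside} of 9 years within 1 SD")
```

Plotted as shaded ribbons behind the line, bands like these make it obvious when year-to-year wiggles are just noise.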

September 1, 2016

Transport numbers

Auckland Transport released new patronage data, and FigureNZ tidied it up to make it easily computer-readable, so I thought I’d look at some of it.  What I’m going to show is a decomposition of the data into overall trends, seasonal variation, and random stuff just happening. As usual, click to embiggen the pictures.
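The decomposition itself is nothing exotic. Here's a minimal classical-decomposition sketch in Python (my own illustration, not FigureNZ's or Auckland Transport's code), working on the log scale so the seasonal term is a relative effect: a centred 12-month moving average for the trend, monthly averages of the detrended series for the seasonal term, and whatever is left over as the residual.

```python
import math

def decompose(series, period=12):
    """Split monthly counts into trend, seasonal and residual, on a log scale."""
    logs = [math.log(x) for x in series]
    n, half = len(logs), period // 2
    # Trend: centred moving average whose window covers one full cycle
    # (the two endpoint months get half weight each).
    trend = [None] * n
    for i in range(half, n - half):
        w = logs[i - half:i + half + 1]
        trend[i] = (w[0] / 2 + sum(w[1:-1]) + w[-1] / 2) / period
    detr = [x - t if t is not None else None for x, t in zip(logs, trend)]
    # Seasonal effect: the average detrended value for each month.
    seasonal = []
    for m in range(period):
        vals = [d for j, d in enumerate(detr) if j % period == m and d is not None]
        seasonal.append(sum(vals) / len(vals))
    # Residual: the "random stuff just happening".
    resid = [d - seasonal[j % period] if d is not None else None
             for j, d in enumerate(detr)]
    return trend, seasonal, resid
```

On the log scale a seasonal value of, say, 0.2 means roughly a 22% bump above trend for that month, which is why the ferry peak and the Christmas train shutdown read as relative effects.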

First, the trends: rides are up.


It’s hard to see the trend in ferry use, so here’s a version on a log scale — meaning that the same proportional trend would look the same for all three modes of transport


Train use is increasing (relatively) faster than bus or ferry use.  There’s also an interesting bump in the middle that we’ll get back to.

Now, the seasonal patterns. Again, these are on a logarithmic scale, so they show relative variation


The clearest signal is that ferry use peaks in summer, when the other modes are at their minimum. Also, the Christmas minimum is a bit lower for trains: to see this, we can combine the two graphs:


It’s not surprising that train use falls by more: they turn the trains off for a lot of the holiday period.

Finally, what’s left when you subtract the seasonal and trend components:


The highest extra variation in both train and ferry rides was in September and October 2011: the Rugby World Cup.


August 20, 2016

The statistical significance filter

Attention conservation notice: long and nerdy, but does have pictures.

You may have noticed that I often say about newsy research studies that they are barely statistically significant or that they found only weak evidence, but that I don’t say that about large-scale clinical trials. This isn’t (just) personal prejudice. There are two good reasons why any given evidence threshold is more likely to be met in lower-quality research — and while I’ll be talking in terms of p-values here, getting rid of them doesn’t solve this problem (it might solve other problems).  I’ll also be talking in terms of an effect being “real” or not, which is again an oversimplification, but one that I don’t think affects the point I’m making.  Think of a “real” effect as one big enough to write a news story about.


This graph shows possible results in statistical tests, for research where the effect of the thing you’re studying is real (orange) or not real (blue).  The solid circles are results that pass your statistical evidence threshold, in the direction you wanted to see — they’re press-releasable as well as publishable.

Only about half the ‘statistically significant’ results are real; the rest are false positives.

I’ve assumed the proportion of “real” effects is about 10%. That makes sense in a lot of medical and psychological research — arguably, it’s too optimistic.  I’ve also assumed the sample size is too small to reliably pick up plausible differences between the orange and blue groups — sadly, this is also realistic.


In the second graph, we’re looking at a setting where half the effects are real and half aren’t. Now, of the effects that pass the threshold, most are real.  On the other hand, there are a lot of real effects that get missed.  This was the setting for a lot of clinical trials in the old days, when they were done in single hospitals or small groups.


The third case is relatively implausible hypotheses — 10% true — but well-designed studies.  There are still the same number of false positives, but many more true positives.  A better-designed study means that positive results are more likely to be correct.


Finally, the setting of well-conducted clinical trials intended to be definitive, the sort of studies done to get new drugs approved. About half the candidate treatments work as intended, and when they do, the results are likely to be positive.   For a well-designed test such as this, statistical significance is a reasonable guide to whether the effect is real.
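The four scenarios come down to one piece of arithmetic: of the results that pass the threshold, what fraction are real? The power and alpha values below are my own illustrative guesses, roughly matching the pictures, not figures from the post.

```python
def share_real(prior, power, alpha=0.025):
    """Fraction of threshold-passing, right-direction results that are real.

    alpha is the chance a null effect passes in the hoped-for direction
    (half of a two-sided 0.05 threshold). Illustrative values only.
    """
    true_pos = prior * power
    false_pos = (1 - prior) * alpha
    return true_pos / (true_pos + false_pos)

scenarios = [
    ("10% real, under-sized", 0.10, 0.30),
    ("50% real, under-sized", 0.50, 0.30),
    ("10% real, well-designed", 0.10, 0.85),
    ("50% real, well-designed", 0.50, 0.85),
]
for label, prior, power in scenarios:
    print(f"{label}: {share_real(prior, power):.0%} of significant results real")
```

With implausible hypotheses and low power, the 'significant' results are close to a coin flip; in a definitive trial they are nearly all real, which is the whole point of the filter.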

The problem is that the media only show a subset of the (exciting) solid circles, and typically don’t show the (boring) empty circles. So, what you see is


where the columns are 10% and 50% proportions of studies having a true effect, and the top and bottom rows are under-sized and well-designed studies.


Knowing the threshold for evidence isn’t enough: the prior plausibility matters, and the ability of the study to demonstrate effects matters. Apparent effects seen in small or poorly-designed studies are less likely to be true.

August 18, 2016

Rigorously deidentified pie


Via Dale Warburton on Twitter, this graph comes from page 7 of the 2016 A-League Injury Report (PDF) produced by Professional Footballers Australia — the players’ association for the round-ball game.  It seems to be a sensible and worthwhile document, except for this pie chart. They’ve replaced the club names with letters, presumably for confidentiality reasons. Which is fine. But the numbers written on the graph bear no obvious relationship to the sizes of the pie wedges.

It’s been a bad week for this sort of thing: a TV barchart that went viral this week had the same sort of problem.

August 15, 2016

Graph of the week

From a real estate agent who will remain nameless


Another example of the rule “if you have to write out all the numbers, the graph isn’t doing its work.”

August 4, 2016

Garbage numbers

This appeared on Twitter


Now, I could just about believe NZ was near the bottom of the OECD, but to accept zero recycling and composting is a big ask.  Even if some of the recycling ends up in landfill, surely not all of it does.  And the garden waste people don’t charge enough to be putting all my wisteria clippings into landfill.

So, I looked up the source (updated link). It says to see the Annex Notes. Here’s the note for New Zealand

New Zealand: Data refer to amount going to landfill

The data point for New Zealand is zero by definition — they aren’t counting any of the recycling and composting.

When the most you can hope for is that the lies in the graph will be explained in the footnotes, you need to read the footnotes.