Posts filed under Graphics (310)

March 30, 2015

Aspect ratios and not starting at zero

The vertical axis on a bar chart must start at zero. The very rare exceptions are ones that prove the rule: where ‘zero’ isn’t zero. Otherwise, the axis starts at zero or it isn’t a bar chart. The whole point of bar charts is that the length of the bar is proportional to the data value.
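The distortion from a truncated bar axis is easy to quantify. Here is a minimal sketch (the function name and example values are mine, not from any of the charts discussed):

```python
def visual_ratio(a, b, baseline=0.0):
    """Ratio of the drawn lengths of two bars when the axis starts at `baseline`.
    With baseline=0 this equals the true ratio of the data values."""
    return (a - baseline) / (b - baseline)

# Two values that really differ by 4%:
print(visual_ratio(104, 100))               # axis at zero: bars differ by 4%
print(visual_ratio(104, 100, baseline=95))  # axis at 95: bars differ by 80%
```

The same 4% difference in the data is drawn as an 80% difference in bar length once the axis starts at 95, which is exactly why the length-proportional-to-value rule matters.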

Line charts and scatterplots are different.  They don’t need to be tied down to zero, and the axis scales can be chosen to make the information as clear as possible. With great power comes great responsibility, as we can see from the following pair of line graphs of oil drilling in the US.



It’s pretty obvious that these come from people with different communications agendas. Or, it would be, except they are from the same story at Bloomberg.

Neither graph has an ideal aspect ratio. The flat one is too flat: you can’t see the wobbles over time in number of rigs. The tall one is too tall: the number of rigs has halved, but it looks as though it has crashed much more than that.

Bill Cleveland has a useful default rule for scaling line graphs: the median slope of the line segments should be about 45 degrees. The orange line on the tall graph isn’t far off that, but the blue line is steeper.  The 45-degree rule would give a graph like this:


In fact, there is plenty of room to start the blue axis at zero, but that’s not always the right choice.
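Cleveland's rule can be turned into code. This is a simplified sketch (the helper name and the rig-count numbers are invented for illustration): choose the height-to-width ratio of the data rectangle so that the median absolute slope of the line segments comes out at 45 degrees.

```python
from statistics import median

def bank_to_45(x, y):
    """Height-to-width ratio for the data rectangle so that the median
    absolute segment slope is 1 (45 degrees): a simplified version of
    Cleveland's 'banking to 45' rule."""
    slopes = [(y1 - y0) / (x1 - x0)
              for (x0, y0), (x1, y1) in zip(zip(x, y), zip(x[1:], y[1:]))]
    med = median(abs(s) for s in slopes)
    return (max(y) - min(y)) / (med * (max(x) - min(x)))

# Hypothetical rig counts halving steadily over 60 months
x = list(range(60))
y = [1600 - 800 * i / 59 for i in range(60)]
print(bank_to_45(x, y))  # 1.0: a square data rectangle puts this line at 45 degrees
```

For a perfectly straight decline the answer is always 1 (a square plot region); real wobbly series like the rig counts give other ratios, which is the point of the rule.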

Here, in a sadly-appropriate pairing, is the Keeling Curve, the graph of atmospheric CO2 concentrations at Mauna Loa observatory, in a visualisation paper from Berkeley.


There’s no sense at all in having the vertical axis start at zero. Zero is just not a relevant value of atmospheric CO2. What’s more interesting, though, is how the two scalings show different information. The upper graph is scaled so the year-to-year changes have slope centred at 45 degrees. This makes it easier to see that the CO2 increase is accelerating. The lower graph is scaled so the month to month changes have slope centred at 45 degrees, making it easier to see the shape of the seasonal pattern.

Different vertical scaling can be used just to mislead the reader, but it can also be used to make data more readable and to communicate more effectively.

March 23, 2015

Cricket visualisations


Population genetic history mapped

Most stories about population genetic ancestry tend to be based on pure male-line or pure female-line ancestry, which can be unrepresentative.  That’s especially true when you’re looking at invasions — invaders probably leave more Y-chromosomes behind than the rest of the genome.  There’s a new UK study that used data on the whole genome from a few thousand British people, chosen because all four of their grandparents lived close together.  The idea is that this will measure population structure at the start of the twentieth century, before people started moving around so much.

Here’s the map of ancestry clusters. As the story in the Guardian explains, one thing it shows is that the Romans and Normans weren’t big contributors to population ancestry, despite their impact on culture.


March 18, 2015

Awful graphs about interesting data


Today in “awful graphs about interesting data” we have this effort that I saw on Twitter, from a paper in one of the Nature Reviews journals.


As with some other recent social media examples, the first problem is that the caption isn’t part of the image and so doesn’t get tweeted. The numbers are the average number of drug candidates needed at each stage of research to end up with one approved drug at the end. The percentage at the bottom is the reciprocal of the number at the top, multiplied by 60%.
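That caption arithmetic can be written down directly. A sketch with a made-up candidate count (the 60% factor is the one described above; the 24 is hypothetical, since the actual figures are in the image):

```python
def stage_success_pct(candidates_per_drug, final_factor=0.60):
    """The percentage printed at the bottom of the chart: the reciprocal
    of the candidate count at the top of the column, multiplied by 60%."""
    return 100 * final_factor / candidates_per_drug

# e.g. if a disease area averaged 24 preclinical candidates per approved drug:
print(stage_success_pct(24))  # 2.5 (percent)
```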

A lot of news coverage of research is at the ‘preclinical’ stage, or is even earlier, at the stage of identifying a promising place to look. Most of these never get anywhere. Sometimes you see coverage of a successful new cancer drug candidate in Phase I — first human studies. Most of these never get anywhere. There’s also a lot of variation in how successful the ‘successes’ are: the new drugs for Hepatitis C (the first column) are a cure for many people; the new Alzheimer’s drugs just give a modest improvement in symptoms. It looks as though drugs for MRSA (antibiotic-resistant Staph. aureus) are easier, but that’s because there aren’t many really novel preclinical candidates.

It’s an interesting table of numbers, but as a graph it’s pretty dreadful. The 3-d effect is purely decorative — it has nothing to do with the representation of the numbers. Effectively, it’s a bar chart, except that the bars are aligned at the centre and have differently-shaped weird decorative bits at the ends, so they are harder to read.

At the top of the chart,  the width of the pale blue region where it crosses the dashed line is the actual data value. Towards the bottom of the chart even that fails, because the visual metaphor of a deformed funnel requires the ‘Launch’ bar to be noticeably narrower than the ‘Registration’ bar. If they’d gone with the more usual metaphor of a pipeline, the graph could have been less inaccurate.

In the end, it’s yet another illustration of two graphical principles. The first: no 3-d graphics. The second: if you have to write all the numbers on the graph, it’s a sign the graph isn’t doing its job.

March 17, 2015

Bonus problems

If you hadn’t seen this graph yet, you probably would have soon.


The claim “Wall Street bonuses were double the earnings of all full-time minimum wage workers in 2014” was made by the Institute for Policy Studies (which is where I got the graph) and fact-checked by the Upshot blog at the New York Times, so you’d expect it to be true, or at least true-ish. It probably isn’t, because the claim being checked was missing an important word and is using an unfortunate definition of another word. One of the first hints of a problem is the number of minimum wage workers: about a million, or about 2/3 of one percent of the labour force. Given the usual narrative about the US and minimum-wage jobs, you’d expect this fraction to be higher.

The missing word is “federal”. The Bureau of Labor Statistics reports data on people paid at or below the federal minimum wage of $7.25/hour, but 29 states have higher minimum wages so their minimum-wage workers aren’t counted in this analysis. In most of these states the minimum is still under $8/hr. As a result, the proportion of hourly workers earning no more than federal minimum wage ranges from 1.2% in Oregon to 7.2% in Tennessee (PDF). The full report — and even the report infographic — say “federal minimum wage”, but the graph above doesn’t, and neither does the graph from Mother Jones magazine (it even omits the numbers of people).

On top of those getting state minimum wage we’re still short quite a lot of people, because “full-time” is defined as 35 or more hours per week at your principal job. If you have multiple part-time jobs, even if you work 60 or 80 hours a week, you are counted as part-time and not included in the graph.

Matt Levine writes:

There are about 167,800 people getting the bonuses, and about 1.03 million getting full-time minimum wage, which means that ballpark Wall Street bonuses are 12 times minimum wage. If the average bonus is half of total comp, a ratio I just made up, then that means that “Wall Street” pays, on average, 24 times minimum wage, or like $174 an hour, pre-tax. This is obviously not very scientific but that number seems plausible.

That’s slightly less scientific than the graph, but as he says, is plausible. In fact, it’s not as bad as I would have guessed.
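Levine’s arithmetic is easy to reproduce. Here’s a sketch assuming a 40-hour, 52-week year at the $7.25 federal minimum, and taking the bonus pool from the ‘double the earnings’ claim itself (these are labelled assumptions, not reported figures):

```python
min_wage = 7.25
annual_min_wage = min_wage * 40 * 52   # about $15,080 for a full-time year
n_min_wage_workers = 1.03e6            # full-time federal-minimum-wage workers
n_bonus_recipients = 167_800           # people sharing the Wall Street bonus pool

# The claim: the bonus pool is double total full-time minimum-wage earnings
bonus_pool = 2 * n_min_wage_workers * annual_min_wage

avg_bonus = bonus_pool / n_bonus_recipients
print(avg_bonus / annual_min_wage)     # roughly 12 times a minimum-wage year

# Levine's made-up ratio: bonuses are half of total compensation
avg_hourly_comp = 2 * avg_bonus / (40 * 52)
print(round(avg_hourly_comp))          # roughly $178/hour, near Levine's $174
```

The small gap from Levine’s $174 comes from his rounding the bonus-to-wage ratio to 12 before doubling it.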

What’s particularly upsetting is that you don’t need to exaggerate or use sloppy figures on this topic. It’s not even that controversial. Lots of people, even technocratic pro-growth economists, will tell you the US minimum wage is too low.  Lots of people will argue that Wall St extracts more money from the economy than it provides in actual value, with much better arguments than this.

By now you might think to check carefully that the original bar chart is at least drawn correctly.  It’s not. The blue bar is more than half the height of the red bar, not less than half.

March 16, 2015

Maps, colours, and locations

This is part of a social media map of photographs taken in public places in the San Francisco Bay Area.


The colours are trying to indicate three social media sites: Instagram is yellow, Flickr is magenta, Twitter is cyan.

Encoding three variables with colour this way doesn’t allow you to easily read off differences, but you can see clusters and then think about how to decode them into data. The dark green areas are saturated with photos.  Light green urban areas have Instagram and Twitter, but not much Flickr.  Pink and orange areas lack Twitter — mostly these track cellphone coverage and population density, but not entirely.  The pink area in the center of the map is spectacular landscape without many people; the orange blob on the right is the popular Angel Island park.
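One plausible way to get exactly those mixtures is subtractive colour blending: treat each service’s photo density as a layer of ink. This is a guess at the scheme, not the map’s actual code, and the densities below are invented.

```python
import math

def blend(instagram, flickr, twitter):
    """Blend photo densities (0 = no photos) into an RGB triple, treating
    Instagram as yellow ink (absorbs blue), Flickr as magenta ink (absorbs
    green) and Twitter as cyan ink (absorbs red)."""
    r = math.exp(-twitter)    # cyan removes red
    g = math.exp(-flickr)     # magenta removes green
    b = math.exp(-instagram)  # yellow removes blue
    return (r, g, b)

print(blend(3, 0.2, 3))   # heavy Instagram+Twitter, light Flickr: green dominates
print(blend(2, 2, 0.1))   # almost no Twitter: red survives, giving pink/orange
```

Under this scheme dense Instagram+Twitter areas come out green, and areas without Twitter coverage drift toward pink and orange, matching the clusters described above.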

Zooming in on Angel Island shows something interesting: there are a few blobs with high density across all three social media systems. The two at the top are easily explained: the visitor centre and the only place on the island that sells food. The very dense blob in the middle of the island, and the slightly less dense one below it are a bit strange. They don’t seem to correspond to any plausible features.


My guess is that these are a phenomenon we’ve seen before, of locations being mapped to the centre of some region if they can’t be specified precisely.

Automated data tends to be messy, and making serious use of it means finding out the ways it lies to you. Wayne Dobson doesn’t have your cellphone, and there isn’t a uniquely Twitter-worthy bush in the middle of Angel Island.
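A simple way to spot that artifact in your own data is to look for coordinates shared by implausibly many records. A sketch (the coordinates, counts and threshold are all invented for illustration):

```python
from collections import Counter

def suspicious_spikes(points, threshold=50):
    """Return coordinates shared by at least `threshold` records: a common
    sign that vague locations were snapped to the centre of a region."""
    counts = Counter(points)
    return [(pt, n) for pt, n in counts.items() if n >= threshold]

# Hypothetical: 200 photos all geocoded to one 'centre of the island' point,
# plus 30 genuinely scattered photos nearby
pts = [(37.861, -122.433)] * 200 + [(37.861 + i * 1e-4, -122.43) for i in range(30)]
print(suspicious_spikes(pts))  # [((37.861, -122.433), 200)]
```

Any coordinate that fires this check is worth comparing against known region centroids before you treat it as a real hotspot.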


March 12, 2015

Election donation maps

There are probably some StatChat readers who don’t read the NZ Herald, so I’ll point out that I have a post on the data blog about election donations.

Variation and mean

A lot of statistical reporting focuses on means, or other summaries of where a distribution lies. Often, though, variation is important. Vox has a story about variation in costs of lab tests at California hospitals, based on a paper in BMJ Open. Vox says:

The charge for a lipid panel ranged from $10 to $10,169. Hospital prices for a basic metabolic panel (which doctors use to measure the body’s metabolism) were $35 at one facility — and $7,303 at another

These are basically standard lab tests, so there’s no sane reason for this sort of huge variation. You’d expect some variation with volume of tests and with location, but nothing like what is seen.

What’s not clear is how much this is really just variation in how costs are attributed. A hospital needs a blood lab, which has a lot of fixed costs. Somehow these costs have to be spread over individual tests, but there’s no unique way to do this.  It would be interesting to know if the labs with high charges for one test tend to have high charges for others, but the research paper doesn’t look at relationships between costs.
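That relationship would be a quick check if the hospital-level data were available. Here’s a sketch with invented charges, using a rank (Spearman) correlation since charges this skewed would swamp an ordinary correlation:

```python
def spearman(x, y):
    """Spearman rank correlation (simple version, assumes no tied values):
    do hospitals that charge a lot for one test also charge a lot for another?"""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    m = (n - 1) / 2  # mean rank, the same for both variables
    cov = sum((a - m) * (b - m) for a, b in zip(rx, ry))
    var = sum((a - m) ** 2 for a in rx)
    return cov / var

# Invented charges at five hospitals for two tests
lipid = [10, 120, 340, 900, 10169]
metabolic = [35, 200, 150, 2500, 7303]
print(spearman(lipid, metabolic))  # 0.9
```

A rank correlation near 1 would suggest the variation is mostly hospital-level cost attribution; values near 0 would point to test-by-test noise.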

The Vox story also illustrates a point about reporting, with this graph


If you look carefully, there’s something strange about the graph. The brown box second from the right is ‘lipid panel’, and it goes up to a bit short of $600, not to $10,169. Similarly, the ‘metabolic panel’, the right-most box, goes up to $1000 on the graph and $7,303 in the story.

The graph is taken from the research paper. In the research paper it had a caption explaining that the ‘whiskers’ in the box plot go to the 5th and 95th percentiles (a non-standard but reasonable choice). This caption fell off on the way to Vox, and no-one seems to have noticed.

March 5, 2015

Showing us the money

The Herald is running a project to crowdsource data entry and annotation for NZ political donations and expenses: it’s something that’s hard to automate and where local knowledge is useful. Today, they have an interactive graph for 2014 election donations and have made the data available.


February 25, 2015

Wiki New Zealand site revamped

We’ve written before about Wiki New Zealand, which aims to ‘democratise data’. WNZ has revamped its website to make things clearer and cleaner, and you can browse here.

As I’m a postgraduate scarfie this year, the table on domestic students in tertiary education interested me – it shows that women (grey) are enrolled in greater numbers than men at every single level. Click the graph to embiggen.

Founder Lillian Grace talks about the genesis of Wiki New Zealand here, and for those who love the techy side, here’s a video about the backend.