Stats Chat

March 12, 2017

Briefly

By Thomas Lumley

False positives: many people who think they are allergic to penicillin actually aren’t, and so don’t need to be given broader-spectrum antibiotics (which have more impact on resistance). Ars Technica, the research paper.

Cancer genomics researcher accused of data falsification. Long NY Times story, including very clever animation of Western blot duplication.

A bill in the US House of Representatives wouldn’t quite let employers demand genetic data from employees, but it would let employers make employees pay not to give it. (STAT news)

President Trump described good employment numbers under the previous government as ‘phony’. After the first month of his government, the White House press secretary said “They may have been phony in the past, but it’s very real now”. (via Vox)

“Cause of death” is complicated: the BBC has a story “The biggest killer you may not know” about sepsis. The story says it “kills more people in the UK each year than bowel, breast and prostate cancer combined.” But it’s not either/or. A substantial number of sepsis deaths are due to cancer or cancer treatment.

Cathy O’Neil on how looking harder for crimes by any group (such as immigrants) is bound to increase the crime rate — if a spurious increase wasn’t the aim, you’d need to be careful about interpreting the data.

“There are no news stories about Obama not planning a coup, just as web pages about the Holocaust tend to take as a given that it happened.” NYmag on Google’s featured snippets

An animated tour of data science at online clothes retailer Stitch Fix

Some lovely 1920s maps of Australia

March 9, 2017

Causation, correlation, and gaps

By Thomas Lumley

It’s often hard to establish whether a correlation between two variables is cause and effect, or whether it’s due to other factors. One technique that’s helpful for structuring one’s thinking about the problem is a causal graph: bubbles for variables, and arrows for effects.

I’ve written about the correlation between chocolate consumption and number of Nobel prizes for countries. The ‘chocolate leads to Nobel Prizes’ hypothesis would be drawn like this:

One of several more-reasonable alternatives is that variations in wealth explain the correlation, which looks like

As another example, there’s a negative correlation between the number of pirates operating in the world’s oceans and atmospheric CO₂ concentration. It could be that pirates directly reduce atmospheric CO₂ concentration:

but it’s perhaps more likely that both technology and wealth have changed over time, leading to greater CO₂ emissions and also to nations with the ability and motivation to suppress piracy:

The pictures are oversimplified, but they still show enough of the key relationships to help with reasoning. In particular, in these alternative explanations, there are arrows pointing into both the putative cause and the effect. There are arrows from the same origin into both ‘chocolate’ and ‘Nobel Prizes’; there are arrows from the same origins into both ‘pirates’ and ‘CO₂’. Confounding, the confusion of relationships that leads to causes not matching correlations, requires arrows into both variables (or selection based on arrows out of both variables).
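As a rough illustration of how a common cause produces a correlation with no direct effect, here is a minimal simulation sketch. Everything in it is an assumption made for the example: the variable names, the standard normal noise, and the effect sizes of 1 are invented, not estimated from real chocolate or Nobel Prize data.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def residuals(ys, xs):
    """Residuals from a least-squares straight-line fit of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

# Hypothetical world: wealth drives both variables, and there is
# no arrow at all between chocolate and Nobel Prizes.
rng = random.Random(2017)
n = 5000
wealth = [rng.gauss(0, 1) for _ in range(n)]
chocolate = [w + rng.gauss(0, 1) for w in wealth]
nobel = [w + rng.gauss(0, 1) for w in wealth]

# Correlation ignoring wealth, then after removing what wealth explains.
raw = pearson(chocolate, nobel)
adjusted = pearson(residuals(chocolate, wealth), residuals(nobel, wealth))
print(f"raw correlation:      {raw:.2f}")
print(f"adjusted for wealth:  {adjusted:.2f}")
```

In this made-up setup the raw correlation should come out near 0.5, and it should largely vanish once wealth is accounted for, which is the pattern the 'wealth' graph predicts.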

So, when we see a causal hypothesis like this one:

and ask if there’s “really” a gender pay gap, the answer “No” requires finding a variable with arrows into both gender and pay. Which in your case you have not got. The pay gap really is caused by gender.

There are still interesting and important questions to be asked about mechanisms. For example, consider this graph

We’d like to know how much of the pay gap is direct underpayment, how much goes through the mechanism of women doing more childcare, and how much goes through the mechanism of occupations with more women being paid less. Information about mechanisms helps us think about how to reduce the gap, and what the other costs of reducing it might be. The studies I’ve seen suggest that all three of these mechanisms do contribute, so even if you think only the direct effects matter there’s still a problem.

You can also think of all sorts of things and stuff I’ve left out of that graph, and you could put some of them back in

But you’re still going to end up with a graph where there are only arrows out of gender. Women earn less, on average, and this is causation, not mere correlation.
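The "arrows into both variables" check can be written down mechanically. The sketch below is a simplified version under stated assumptions: graphs are plain lists of (from, to) arrows, the node names are shorthand for the variables discussed above, and the function names `ancestors` and `common_causes` are made up for this example. It looks only for common causes and ignores the selection case.

```python
def ancestors(edges, node):
    """All variables with a directed path into `node`."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)
    found, stack = set(), [node]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in found:
                found.add(p)
                stack.append(p)
    return found

def common_causes(edges, cause, effect):
    """Variables with arrows leading into both `cause` and `effect`."""
    into_cause = ancestors(edges, cause)
    # Drop arrows out of the cause, so a path into the effect that
    # only runs through the cause itself is not counted.
    other_edges = [(s, d) for s, d in edges if s != cause]
    into_effect = ancestors(other_edges, effect)
    return into_cause & into_effect

chocolate_graph = [("wealth", "chocolate"), ("wealth", "Nobel Prizes")]
pirate_graph = [("technology", "pirates"), ("technology", "CO2"),
                ("wealth", "pirates"), ("wealth", "CO2")]
pay_graph = [("gender", "pay"),
             ("gender", "childcare"), ("childcare", "pay"),
             ("gender", "occupation"), ("occupation", "pay")]

print(common_causes(chocolate_graph, "chocolate", "Nobel Prizes"))
print(common_causes(pirate_graph, "pirates", "CO2"))
print(common_causes(pay_graph, "gender", "pay"))
```

The first two graphs return their common causes; the pay graph returns an empty set, because nothing points into gender.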


March 8, 2017

Briefly

By Thomas Lumley

“Exploding boxplots”: although a boxplot is a lot better than just showing a mean, it’s usually worse than showing the data

The US state of Michigan used an automated system to detect unemployment benefit fraud. Late last year, an audit of 22427 cases of fraud overturned 93% of them! Now, a class-action lawsuit has been filed (PDF), giving (a one-sided view of) more of the details.

StatsChat has been saying for quite some time that people shouldn’t be making generalisations about road crash rates without evaluating the statistical evidence for increases or decreases. It’s good to see someone doing the analysis: the Ministry of Transport has a big long report (PDF, from here) including (p37)[updated link]

110. However, since 2013 the fatality rate has injury rate has begun to increase. We conducted statistical tests (Poisson) to see whether this increase was more than natural variation, and found strong evidence that the fatality and injury rates are actually rising.

Fascinating blog by John Grimwade, an infographics (as opposed to data visualisation) expert (via Kieran Healy)

“Not only does Google, the world’s preeminent index of information, tell its users that caramelizing onions takes “about 5 minutes”—it pulls that information from an article whose entire point was to tell people exactly the opposite.” Another problem with Google’s new answer box, less serious than the claims about a communist coup in the US, but likely to be believed by more people.

Yes, November 19

By Thomas Lumley

The graph is from a Google Trends search for “International Men’s Day”.

There are two peaks. In the majority of years, the larger peak is on International Women’s Day, and the smaller peak is on the day itself.

March 7, 2017

Super 18 Predictions for Round 3

By David Scott

Team Ratings for Round 3

The basic method is described on my Department home page.

Here are the team ratings prior to this week’s games, along with the ratings at the start of the season.

Team	Current Rating	Rating at Season Start	Difference
Hurricanes	17.67	13.22	4.40
Chiefs	10.59	9.75	0.80
Crusaders	8.84	8.75	0.10
Highlanders	8.06	9.17	-1.10
Lions	7.90	7.64	0.30
Waratahs	4.27	5.81	-1.50
Brumbies	3.33	3.83	-0.50
Stormers	1.81	1.51	0.30
Blues	0.90	-1.07	2.00
Sharks	0.69	0.42	0.30
Bulls	-0.76	0.29	-1.00
Jaguares	-4.02	-4.36	0.30
Cheetahs	-6.31	-7.36	1.10
Force	-8.54	-9.45	0.90
Reds	-9.91	-10.28	0.40
Rebels	-12.37	-8.17	-4.20
Kings	-18.14	-19.02	0.90
Sunwolves	-21.13	-17.76	-3.40

Performance So Far

So far there have been 18 matches played, 12 of which were correctly predicted, a success rate of 66.7%.
Here are the predictions for last week’s games.

	Game	Date	Score	Prediction	Correct
1	Force vs. Reds	Mar 02	26 – 19	4.60	TRUE
2	Chiefs vs. Blues	Mar 03	41 – 26	12.90	TRUE
3	Hurricanes vs. Rebels	Mar 04	71 – 6	29.80	TRUE
4	Highlanders vs. Crusaders	Mar 04	27 – 30	3.50	FALSE
5	Brumbies vs. Sharks	Mar 04	22 – 27	8.20	FALSE
6	Sunwolves vs. Kings	Mar 04	23 – 37	3.10	FALSE
7	Lions vs. Waratahs	Mar 04	55 – 36	6.10	TRUE
8	Stormers vs. Jaguares	Mar 04	32 – 25	10.20	TRUE
9	Cheetahs vs. Bulls	Mar 04	34 – 28	-3.10	FALSE
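The 'Correct' column can be reproduced from the scores and predictions in the table. This is a small sketch, assuming a tip counts as correct when the predicted margin and the actual margin (home minus away) have the same sign; treating a draw as incorrect is my assumption, since no draw appears in the table.

```python
# (game, home score, away score, predicted home margin), copied from the table
games = [
    ("Force vs. Reds", 26, 19, 4.60),
    ("Chiefs vs. Blues", 41, 26, 12.90),
    ("Hurricanes vs. Rebels", 71, 6, 29.80),
    ("Highlanders vs. Crusaders", 27, 30, 3.50),
    ("Brumbies vs. Sharks", 22, 27, 8.20),
    ("Sunwolves vs. Kings", 23, 37, 3.10),
    ("Lions vs. Waratahs", 55, 36, 6.10),
    ("Stormers vs. Jaguares", 32, 25, 10.20),
    ("Cheetahs vs. Bulls", 34, 28, -3.10),
]

def correct(home, away, prediction):
    """True when the predicted and actual margins point the same way."""
    return (home - away) * prediction > 0

results = [correct(h, a, p) for _, h, a, p in games]
n_correct = sum(results)

# Season-to-date rate quoted in the post: 12 correct out of 18 matches.
season_rate = round(100 * 12 / 18, 1)

for (name, *_), ok in zip(games, results):
    print(f"{name}: {ok}")
print(f"{n_correct} of {len(games)} correct last week; season rate {season_rate}%")
```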

Predictions for Round 3

Here are the predictions for Round 3. The prediction is my estimated expected points difference with a positive margin being a win to the home team, and a negative margin a win to the away team.

	Game	Date	Winner	Prediction
1	Chiefs vs. Hurricanes	Mar 10	Hurricanes	-3.60
2	Brumbies vs. Force	Mar 10	Brumbies	15.40
3	Sharks vs. Waratahs	Mar 10	Sharks	0.40
4	Blues vs. Highlanders	Mar 11	Highlanders	-3.70
5	Reds vs. Crusaders	Mar 11	Crusaders	-14.70
6	Cheetahs vs. Sunwolves	Mar 11	Cheetahs	18.80
7	Kings vs. Stormers	Mar 11	Stormers	-16.50
8	Jaguares vs. Lions	Mar 11	Lions	-7.90
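The rating method is only described off this page, so the following is not the author's model. Assuming, purely for illustration, that each prediction is the home rating minus the away rating plus a home-ground allowance, the allowance implied by the published numbers can be backed out. The name `implied_home_advantage` is made up for this sketch.

```python
# Current ratings and Round 3 predictions, copied from the tables above
rating = {
    "Hurricanes": 17.67, "Chiefs": 10.59, "Crusaders": 8.84,
    "Highlanders": 8.06, "Lions": 7.90, "Waratahs": 4.27,
    "Brumbies": 3.33, "Stormers": 1.81, "Blues": 0.90, "Sharks": 0.69,
    "Jaguares": -4.02, "Cheetahs": -6.31, "Force": -8.54,
    "Reds": -9.91, "Kings": -18.14, "Sunwolves": -21.13,
}
fixtures = [  # (home, away, predicted home margin)
    ("Chiefs", "Hurricanes", -3.60),
    ("Brumbies", "Force", 15.40),
    ("Sharks", "Waratahs", 0.40),
    ("Blues", "Highlanders", -3.70),
    ("Reds", "Crusaders", -14.70),
    ("Cheetahs", "Sunwolves", 18.80),
    ("Kings", "Stormers", -16.50),
    ("Jaguares", "Lions", -7.90),
]

def implied_home_advantage(home, away, prediction):
    """Prediction minus the raw rating gap (assumed additive model)."""
    return prediction - (rating[home] - rating[away])

implied = [implied_home_advantage(h, a, p) for h, a, p in fixtures]
for (h, a, _), adv in zip(fixtures, implied):
    print(f"{h} vs. {a}: implied home advantage {adv:.2f}")
```

Under that assumption every fixture implies an allowance of roughly 3.5 to 4 points, which is at least consistent with a fixed home-ground term being added to the rating difference.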


NRL Predictions for Round 2

By David Scott

Team Ratings for Round 2

The basic method is described on my Department home page.

Here are the team ratings prior to this week’s games, along with the ratings at the start of the season.

Team	Current Rating	Rating at Season Start	Difference
Raiders	9.62	9.94	-0.30
Storm	8.46	8.49	-0.00
Cowboys	7.22	6.90	0.30
Broncos	5.39	4.36	1.00
Sharks	4.82	5.84	-1.00
Panthers	2.92	6.08	-3.20
Roosters	0.21	-1.17	1.40
Eels	-0.06	-0.81	0.70
Bulldogs	-1.31	-1.34	0.00
Wests Tigers	-2.23	-3.89	1.70
Titans	-2.36	-0.98	-1.40
Rabbitohs	-3.48	-1.82	-1.70
Sea Eagles	-3.73	-2.98	-0.80
Dragons	-4.58	-7.74	3.20
Warriors	-6.89	-6.02	-0.90
Knights	-16.07	-16.94	0.90

Performance So Far

So far there have been 8 matches played, 3 of which were correctly predicted, a success rate of 37.5%.
Here are the predictions for last week’s games.

	Game	Date	Score	Prediction	Correct
1	Sharks vs. Broncos	Mar 02	18 – 26	5.00	FALSE
2	Bulldogs vs. Storm	Mar 03	6 – 12	-6.30	TRUE
3	Rabbitohs vs. Wests Tigers	Mar 03	18 – 34	5.60	FALSE
4	Dragons vs. Panthers	Mar 04	42 – 10	-10.30	FALSE
5	Cowboys vs. Raiders	Mar 04	20 – 16	0.50	TRUE
6	Titans vs. Roosters	Mar 04	18 – 32	3.70	FALSE
7	Warriors vs. Knights	Mar 05	26 – 22	14.90	TRUE
8	Sea Eagles vs. Eels	Mar 05	12 – 20	1.30	FALSE

Predictions for Round 2

Here are the predictions for Round 2. The prediction is my estimated expected points difference with a positive margin being a win to the home team, and a negative margin a win to the away team.

	Game	Date	Winner	Prediction
1	Roosters vs. Bulldogs	Mar 09	Roosters	5.00
2	Warriors vs. Storm	Mar 10	Storm	-11.30
3	Broncos vs. Cowboys	Mar 10	Broncos	1.70
4	Knights vs. Titans	Mar 11	Titans	-10.20
5	Sea Eagles vs. Rabbitohs	Mar 11	Sea Eagles	3.30
6	Raiders vs. Sharks	Mar 11	Raiders	8.30
7	Wests Tigers vs. Panthers	Mar 12	Panthers	-1.70
8	Dragons vs. Eels	Mar 12	Eels	-1.00

The amazing pizzachart

By Thomas Lumley

From YouGov (who seem to already be regretting it).

This obviously isn’t a pie chart, because the pieces are the same size but the numbers are different. It’s not really a graph at all; it’s an idiosyncratically organised, illustrated table. It gets worse, though. The pizza picture itself isn’t doing any productive work in this graphic: the only information it conveys is misleading. There’s a clear impression given that particular ingredients go together, when that’s not how the questions were asked. And as the footnote says, there are a lot of popular ingredients that didn’t even make it on to the graphic.


March 6, 2017

Cause of death

By Thomas Lumley

In medical research we distinguish ‘hard’ outcomes that can be reliably and objectively measured (such as death, blood pressure, activated protein C concentrations) from ‘soft’ outcomes that depend on subjective reporting. We also distinguish ‘patient-centered’ or ‘clinical’ or ‘real’ outcomes that matter directly to patients (such as death, pain, or dependency) from ‘surrogate’ or ‘intermediate’ outcomes that are biologically meaningful but don’t directly matter to patients. ‘Death’ is one of the few things we can measure that’s on both lists.

Cause of death, however, is a much less ideal thing to measure. If some form of cancer screening makes it less likely that you die of that particular type of cancer but doesn’t increase how long you live, it’s going to be less popular than if it genuinely postpones death. What’s more surprising is that cause of death is hard to measure objectively and reliably. But it is.

Suppose someone smokes heavily for many years and as a result develops chronic lung disease, and as a result develops pneumonia, and as a result is in hospital, has a cardiac arrest due to a medical error, and dies. Is the cause of death ‘cardiac arrest’ or ‘medical error’ or ‘pneumonia’ or ‘COPD’ or ‘smoking’? The best choice is a subjective matter of convention: what’s the most useful way to record the primary cause of death? But even with a convention in place, there’s a lot of work to make sure it is followed. For example, a series of research papers in Thailand estimated that more than half the deaths from the main causes (eg stroke, HIV/AIDS, road traffic accidents, types of heart disease) were misclassified as less-specific causes, and came up with a way to at least correct the national totals.

In Western countries, things are better on average. However, as Radio NZ described today, in Australia (and probably NZ) deaths of people with intellectual disability tend to be reported as due to their intellectual disability rather than to whatever specific illness or injury they had. You can see why this happens, but you should also be able to see why it’s not ideal in improving healthcare for these people. Listen to the Radio NZ story; it’s good. If you want a reference to the open-access research paper, though, you won’t find it at Radio NZ. It’s here


Briefly

By Thomas Lumley

“Does air pollution kill 40,000 people each year in the UK?”, from David Spiegelhalter

Phil Cook on “99% sugar free” beer (me, last year, on a similar topic)

Newshub had a story about the Accommodation Survey not specifically excluding people in hotels who were there as emergency housing. Nerds across the NZ political spectrum (eg, me, Keith Ng, and Eric Crampton) were unimpressed with this story. Eric actually wrote a blog post, so I’ll refer you there for more details.

Russell Brown wrote about the overuse of workplace drug tests that aren’t tests for impairment.

Ars Technica shows you can write about lab-science discoveries relevant to health without leaving out all the cautions and caveats.

A research paper in PLoS One shows that newspapers write about news. That is, they write ‘breakthrough’ stories about new treatments but give a lot less prominence to later studies that are less favorable. Interestingly, this didn’t apply to ‘lifestyle’ stories, where ‘coffee is Good/Bad this week’ can always find a place.

The Herald had a story last week about “The $2m+ price tag for a top decile Auckland education.” In contrast to their story two years ago, this doesn’t make any attempt to estimate the premium for the top school zones. That is, if a family with school-age kids doesn’t live in the ‘Double Grammar Zone’ and pay $2 million for a house, they’ll still have to live somewhere and pay something for a house. In the 2015 story, the cost of a house just outside a top school zone was about 20% lower than just inside. Even that probably overestimates the school premium, but the total cost of a house obviously does.
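As a back-of-envelope sketch of the difference between the total price and the zone premium: the code below takes the $2 million headline figure as the price just inside the zone and applies the roughly 20% gap from the 2015 story. Combining those two numbers is my assumption for illustration, not a calculation from either story.

```python
# Assumed inputs: headline price inside the zone, and the ~20% gap
# between houses just inside and just outside (from the 2015 story).
inside_price = 2_000_000
outside_discount = 0.20

outside_price = inside_price * (1 - outside_discount)
premium = inside_price - outside_price

print(f"House just outside the zone: ${outside_price:,.0f}")
print(f"Upper-end school-zone premium: ${premium:,.0f}")
print(f"Premium as a share of the total price: {premium / inside_price:.0%}")
```

On these assumed figures the premium is about $400,000, a fifth of the headline price, and the post argues even that is probably an overestimate.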


Stat of the Week Competition: March 4 – 10 2017

By Rachel Cunliffe

Each week, we would like to invite readers of Stats Chat to submit nominations for our Stat of the Week competition and be in with the chance to win an iTunes voucher.

Here’s how it works:

Anyone may add a comment on this post to nominate their Stat of the Week candidate before midday Friday March 10 2017.
Statistics can be bad, exemplary or fascinating.
The statistic must be in the NZ media during the period of March 4 – 10 2017 inclusive.
Quote the statistic, when and where it was published and tell us why it should be our Stat of the Week.

Next Monday at midday we’ll announce the winner of this week’s Stat of the Week competition, and start a new one.

