March 12, 2017

Briefly

  • False positives: many people who think they are allergic to penicillin actually aren’t, and so don’t need to be given broader-spectrum antibiotics (which have more impact on resistance). Ars Technica, the research paper.
  • Cancer genomics researcher accused of data falsification. Long NY Times story, including very clever animation of Western blot duplication.
  • A bill in the US House of Representatives wouldn’t quite let employers demand genetic data from employees, but it would let employers make employees pay not to give it. (STAT news)
  • President Trump described good employment numbers under the previous government as ‘phony’.  After the first month of his government, the White House press secretary said “They may have been phony in the past, but it’s very real now”.  (via Vox)
  • “Cause of death” is complicated: the BBC has a story “The biggest killer you may not know” about sepsis. The story says it “kills more people in the UK each year than bowel, breast and prostate cancer combined.” But it’s not either/or. A substantial number of sepsis deaths are due to cancer or cancer treatment.
  • Cathy O’Neil on how looking harder for crimes by any group (such as immigrants) is bound to increase the crime rate — if a spurious increase wasn’t the aim, you’d need to be careful about interpreting the data.
March 9, 2017

Causation, correlation, and gaps

It’s often hard to establish whether a correlation between two variables is cause and effect, or whether it’s due to other factors.  One technique that’s helpful for structuring one’s thinking about the problem is a causal graph: bubbles for variables, and arrows for effects.

I’ve written about the correlation between chocolate consumption and number of Nobel prizes for countries.  The ‘chocolate leads to Nobel Prizes’ hypothesis would be drawn like this:

chocolate

One of several more-reasonable alternatives is that variations in wealth explain the correlation, which looks like

chocolate1

As another example, there’s a negative correlation between the number of pirates operating in the world’s oceans and atmospheric CO2 concentration.  It could be that pirates directly reduce atmospheric CO2 concentration:

pirates

but it’s perhaps more likely that both technology and wealth have changed over time, leading to greater CO2 emissions and also to nations with the ability and motivation to suppress piracy:

pirates1

The pictures are oversimplified, but they still show enough of the key relationships to help with reasoning.  In particular, in these alternative explanations, there are arrows pointing into both the putative cause and the effect. There are arrows from the same origin into both ‘chocolate’ and ‘Nobel Prizes’; there are arrows from the same origins into both ‘pirates’ and ‘CO2’.  Confounding, the confusion of relationships that leads to causes not matching correlations, requires arrows into both variables (or selection based on arrows out of both variables).
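To make the chocolate example concrete, here is a minimal simulation sketch. Everything in it is invented for illustration: the variable names, the sample size, and the assumption that ‘wealth’ adds linearly to both ‘chocolate’ and ‘nobels’ with independent noise. There is deliberately no arrow from chocolate to Nobel Prizes in the simulated world.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def residuals(ys, xs):
    """What is left of ys after removing a least-squares line on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return [y - my - slope * (x - mx) for x, y in zip(xs, ys)]


rng = random.Random(2017)  # fixed seed so the run is repeatable
n = 5000                   # number of made-up "countries" (an assumption)

# Arrows: wealth -> chocolate and wealth -> nobels.
# There is no chocolate -> nobels arrow in this simulated world.
wealth = [rng.gauss(0, 1) for _ in range(n)]
chocolate = [w + rng.gauss(0, 1) for w in wealth]
nobels = [w + rng.gauss(0, 1) for w in wealth]

raw = pearson(chocolate, nobels)
adjusted = pearson(residuals(chocolate, wealth), residuals(nobels, wealth))

print(f"raw correlation: {raw:.2f}")
print(f"after adjusting for wealth: {adjusted:.2f}")
```

Under these assumptions the raw correlation should come out near 0.5, and the correlation after removing the part explained by wealth should be near zero. That is the signature of a variable with arrows into both the putative cause and the effect.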

So, when we see a causal hypothesis like this one:

paygap

and ask if there’s “really” a gender pay gap, the answer “No” requires finding a variable with arrows into both gender and pay.  Which in your case you have not got. The pay gap really is caused by gender.

There are still interesting and important questions to be asked about mechanisms. For example, consider this graph

paygap1

We’d like to know how much of the pay gap is direct underpayment, how much goes through the mechanism of women doing more childcare, and how much goes through the mechanism of occupations with more women being paid less.  Information about mechanisms helps us think about how to reduce the gap, and what the other costs of reducing it might be.  The studies I’ve seen suggest that all three of these mechanisms do contribute, so even if you think only the direct effects matter there’s still a problem.

You can also think of all sorts of things I’ve left out of that graph, and you could put some of them back in:

paygap2

But you’re still going to end up with a graph where there are only arrows out of gender.  Women earn less, on average, and this is causation, not mere correlation.

March 8, 2017

Briefly

  • “Exploding boxplots”: although a boxplot is a lot better than just showing a mean, it’s usually worse than showing the data
  • The US state of Michigan used an automated system to detect unemployment benefit fraud. Late last year, an audit of 22427 cases of fraud overturned 93% of them! Now, a class-action lawsuit has been filed (PDF), giving (a one-sided view of) more of the details.
  • StatsChat has been saying for quite some time that people shouldn’t be making generalisations about road crash rates without evaluating the statistical evidence for increases or decreases.  It’s good to see someone doing the analysis: the Ministry of Transport has a big long report (PDF, from here) including (p37)[updated link]

    110. However, since 2013 the fatality rate and injury rate have begun to increase. We conducted statistical tests (Poisson) to see whether this increase was more than natural variation, and found strong evidence that the fatality and injury rates are actually rising.

  • Fascinating blog by John Grimwade, an infographics (as opposed to data visualisation) expert (via Kieran Healy)
  • “Not only does Google, the world’s preeminent index of information, tell its users that caramelizing onions takes “about 5 minutes”—it pulls that information from an article whose entire point was to tell people exactly the opposite.”  Another problem with Google’s new answer box, less serious than the claims about a communist coup in the US, but likely to be believed by more people.
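The Ministry of Transport quote above mentions Poisson tests for whether a rise in rates is more than natural variation. The report's exact procedure isn't given here, so the following is only a sketch of one standard approach: an exact conditional test comparing two Poisson counts. The function name, the counts, and the ‘exposure’ values (for example, distance travelled) are all hypothetical and are not taken from the report.

```python
from math import exp, lgamma, log


def rate_increase_pvalue(count_before, exposure_before, count_after, exposure_after):
    """One-sided p-value for 'the rate went up' between two periods.

    If both counts are Poisson with the same underlying rate, then given
    their total, the later count is Binomial(total, p0), where p0 is the
    later period's share of the exposure. A later count far above that
    share is evidence of a real increase.
    """
    total = count_before + count_after
    p0 = exposure_after / (exposure_before + exposure_after)

    def log_pmf(k):
        # log of the Binomial(total, p0) probability of k, via lgamma
        # so that large counts do not overflow
        return (lgamma(total + 1) - lgamma(k + 1) - lgamma(total - k + 1)
                + k * log(p0) + (total - k) * log(1 - p0))

    return sum(exp(log_pmf(k)) for k in range(count_after, total + 1))


# Made-up numbers, equal exposure in both periods.
p_same = rate_increase_pvalue(10, 1.0, 10, 1.0)    # no change at all
p_rise = rate_increase_pvalue(250, 1.0, 320, 1.0)  # a sizeable rise

print(f"identical counts: p = {p_same:.3f}")
print(f"250 -> 320:       p = {p_rise:.4f}")
```

With identical counts the p-value is large, as it should be. With the made-up rise from 250 to 320 it is small, meaning an increase of that size would be unusual if the underlying rate had stayed the same.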

Yes, November 19

trends

The graph is from a Google Trends search for “International Men’s Day”.

There are two peaks. In the majority of years, the larger peak is on International Women’s Day, and the smaller peak is on the day itself.

March 7, 2017

Super 18 Predictions for Round 3

Team Ratings for Round 3

The basic method is described on my Department home page.

Here are the team ratings prior to this week’s games, along with the ratings at the start of the season.

Team          Current Rating  Rating at Season Start  Difference
Hurricanes     17.67           13.22                    4.40
Chiefs         10.59            9.75                    0.80
Crusaders       8.84            8.75                    0.10
Highlanders     8.06            9.17                   -1.10
Lions           7.90            7.64                    0.30
Waratahs        4.27            5.81                   -1.50
Brumbies        3.33            3.83                   -0.50
Stormers        1.81            1.51                    0.30
Blues           0.90           -1.07                    2.00
Sharks          0.69            0.42                    0.30
Bulls          -0.76            0.29                   -1.00
Jaguares       -4.02           -4.36                    0.30
Cheetahs       -6.31           -7.36                    1.10
Force          -8.54           -9.45                    0.90
Reds           -9.91          -10.28                    0.40
Rebels        -12.37           -8.17                   -4.20
Kings         -18.14          -19.02                    0.90
Sunwolves     -21.13          -17.76                   -3.40

 

Performance So Far

So far there have been 18 matches played, 12 of which were correctly predicted, a success rate of 66.7%.
Here are the predictions for last week’s games.

#  Game                        Date    Score    Prediction  Correct
1  Force vs. Reds              Mar 02  26 – 19    4.60      TRUE
2  Chiefs vs. Blues            Mar 03  41 – 26   12.90      TRUE
3  Hurricanes vs. Rebels       Mar 04  71 – 6    29.80      TRUE
4  Highlanders vs. Crusaders   Mar 04  27 – 30    3.50      FALSE
5  Brumbies vs. Sharks         Mar 04  22 – 27    8.20      FALSE
6  Sunwolves vs. Kings         Mar 04  23 – 37    3.10      FALSE
7  Lions vs. Waratahs          Mar 04  55 – 36    6.10      TRUE
8  Stormers vs. Jaguares       Mar 04  32 – 25   10.20      TRUE
9  Cheetahs vs. Bulls          Mar 04  34 – 28   -3.10      FALSE
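The ‘Correct’ column can be reproduced from the score and the prediction. The sketch below assumes the first-named team is the home team and the first score is theirs, and it counts a prediction as correct when the predicted margin and the actual margin have the same sign. The function name is made up for illustration, and treating a draw as incorrect is my assumption, not something stated in the method.

```python
# (home, away, home_score, away_score, predicted_margin) from last week's table
games = [
    ("Force", "Reds", 26, 19, 4.60),
    ("Chiefs", "Blues", 41, 26, 12.90),
    ("Hurricanes", "Rebels", 71, 6, 29.80),
    ("Highlanders", "Crusaders", 27, 30, 3.50),
    ("Brumbies", "Sharks", 22, 27, 8.20),
    ("Sunwolves", "Kings", 23, 37, 3.10),
    ("Lions", "Waratahs", 55, 36, 6.10),
    ("Stormers", "Jaguares", 32, 25, 10.20),
    ("Cheetahs", "Bulls", 34, 28, -3.10),
]


def is_correct(home_score, away_score, predicted_margin):
    """True when the predicted and actual margins favour the same team.

    A positive margin means a home win. A drawn game gives a product of
    zero and so counts as incorrect (an assumption of this sketch).
    """
    actual_margin = home_score - away_score
    return actual_margin * predicted_margin > 0


results = [is_correct(h, a, p) for _, _, h, a, p in games]
round_correct = sum(results)

print(f"last week: {round_correct} of {len(games)} correct")
print(f"season so far: {12 / 18:.1%}")  # 12 of 18, as stated above
```

Run on last week's nine games this gives five correct, matching the TRUE entries in the table; the 12 of 18 for the season includes the previous round as well.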

 

Predictions for Round 3

Here are the predictions for Round 3. The prediction is my estimated expected points difference with a positive margin being a win to the home team, and a negative margin a win to the away team.

#  Game                        Date    Winner       Prediction
1  Chiefs vs. Hurricanes       Mar 10  Hurricanes    -3.60
2  Brumbies vs. Force          Mar 10  Brumbies      15.40
3  Sharks vs. Waratahs         Mar 10  Sharks         0.40
4  Blues vs. Highlanders       Mar 11  Highlanders   -3.70
5  Reds vs. Crusaders          Mar 11  Crusaders    -14.70
6  Cheetahs vs. Sunwolves      Mar 11  Cheetahs      18.80
7  Kings vs. Stormers          Mar 11  Stormers     -16.50
8  Jaguares vs. Lions          Mar 11  Lions         -7.90

 

NRL Predictions for Round 2

Team Ratings for Round 2

The basic method is described on my Department home page.

Here are the team ratings prior to this week’s games, along with the ratings at the start of the season.

Team          Current Rating  Rating at Season Start  Difference
Raiders         9.62            9.94                   -0.30
Storm           8.46            8.49                   -0.00
Cowboys         7.22            6.90                    0.30
Broncos         5.39            4.36                    1.00
Sharks          4.82            5.84                   -1.00
Panthers        2.92            6.08                   -3.20
Roosters        0.21           -1.17                    1.40
Eels           -0.06           -0.81                    0.70
Bulldogs       -1.31           -1.34                    0.00
Wests Tigers   -2.23           -3.89                    1.70
Titans         -2.36           -0.98                   -1.40
Rabbitohs      -3.48           -1.82                   -1.70
Sea Eagles     -3.73           -2.98                   -0.80
Dragons        -4.58           -7.74                    3.20
Warriors       -6.89           -6.02                   -0.90
Knights       -16.07          -16.94                    0.90

 

Performance So Far

So far there have been 8 matches played, 3 of which were correctly predicted, a success rate of 37.5%.
Here are the predictions for last week’s games.

#  Game                        Date    Score    Prediction  Correct
1  Sharks vs. Broncos          Mar 02  18 – 26    5.00      FALSE
2  Bulldogs vs. Storm          Mar 03  6 – 12    -6.30      TRUE
3  Rabbitohs vs. Wests Tigers  Mar 03  18 – 34    5.60      FALSE
4  Dragons vs. Panthers        Mar 04  42 – 10  -10.30      FALSE
5  Cowboys vs. Raiders         Mar 04  20 – 16    0.50      TRUE
6  Titans vs. Roosters         Mar 04  18 – 32    3.70      FALSE
7  Warriors vs. Knights        Mar 05  26 – 22   14.90      TRUE
8  Sea Eagles vs. Eels         Mar 05  12 – 20    1.30      FALSE

 

Predictions for Round 2

Here are the predictions for Round 2. The prediction is my estimated expected points difference with a positive margin being a win to the home team, and a negative margin a win to the away team.

#  Game                        Date    Winner       Prediction
1  Roosters vs. Bulldogs       Mar 09  Roosters       5.00
2  Warriors vs. Storm          Mar 10  Storm        -11.30
3  Broncos vs. Cowboys         Mar 10  Broncos        1.70
4  Knights vs. Titans          Mar 11  Titans       -10.20
5  Sea Eagles vs. Rabbitohs    Mar 11  Sea Eagles     3.30
6  Raiders vs. Sharks          Mar 11  Raiders        8.30
7  Wests Tigers vs. Panthers   Mar 12  Panthers      -1.70
8  Dragons vs. Eels            Mar 12  Eels          -1.00

 

The amazing pizzachart

From YouGov (who seem to already be regretting it).

Pizza-01

This obviously isn’t a pie chart, because the pieces are the same size but the numbers are different. It’s not really a graph at all; it’s an idiosyncratically organised, illustrated table.  It gets worse, though. The pizza picture itself isn’t doing any productive work in this graphic: the only information it conveys is misleading. There’s a clear impression given that particular ingredients go together, when that’s not how the questions were asked. And as the footnote says, there are a lot of popular ingredients that didn’t even make it on to the graphic.

 

 

March 6, 2017

Cause of death

In medical research we distinguish ‘hard’ outcomes that can be reliably and objectively measured (such as death, blood pressure, activated protein C concentrations) from ‘soft’ outcomes that depend on subjective reporting.  We also distinguish ‘patient-centered’ or ‘clinical’ or ‘real’ outcomes that matter directly to patients (such as death, pain, or dependency) from ‘surrogate’  or ‘intermediate’ outcomes that are biologically meaningful but don’t directly matter to patients.  ‘Death’ is one of the few things we can measure that’s on both lists.

Cause of death, however, is a much less ideal thing to measure.  If some form of cancer screening makes it less likely that you die of that particular type of cancer but doesn’t increase how long you live, it’s going to be less popular than if it genuinely postpones death.  What’s more surprising is that cause of death is hard to measure objectively and reliably. But it is.

Suppose someone smokes heavily for many years and as a result develops chronic lung disease, and as a result develops pneumonia, and as a result is in hospital, has a cardiac arrest due to a medical error, and dies. Is the cause of death ‘cardiac arrest’ or ‘medical error’ or ‘pneumonia’ or ‘COPD’ or ‘smoking’?  The best choice is a subjective matter of convention: what’s the most useful way to record the primary cause of death? But even with a convention in place, there’s a lot of work to make sure it is followed.  For example, a series of research papers in Thailand estimated that more than half the deaths from the main causes (eg stroke, HIV/AIDS, road traffic accidents, types of heart disease) were misclassified as less-specific causes, and came up with a way to at least correct the national totals.

In Western countries, things are better on average. However, as Radio NZ described today, in Australia (and probably NZ) deaths of people with intellectual disability tend to be reported as due to their intellectual disability rather than to whatever specific illness or injury they had.  You can see why this happens, but you should also be able to see why it’s not ideal in improving healthcare for these people.  Listen to the Radio NZ story; it’s good. If you want a reference to the open-access research paper, though, you won’t find it at Radio NZ. It’s here

 

Briefly

  • Newshub had a story about the Accommodation Survey not specifically excluding people in hotels who were there as emergency housing.  Nerds across the NZ political spectrum (eg, me, Keith Ng, and Eric Crampton) were unimpressed with this story. Eric actually wrote a blog post, so I’ll refer you there for more details.
  • Russell Brown wrote about the overuse of workplace drug tests that aren’t tests for impairment.
  • A research paper in PLoS One shows that newspapers write about news.  That is, they write ‘breakthrough’ stories about new treatments but give a lot less prominence to later studies that are less favorable.  Interestingly, this didn’t apply to ‘lifestyle’ stories, where ‘coffee is Good/Bad this week’ can always find a place.
  • The Herald had a story last week about “The $2m+ price tag for a top decile Auckland education.” In contrast to their story two years ago, this doesn’t make any attempt to estimate the premium for the top school zones. That is, if a family with school-age kids doesn’t live in the ‘Double Grammar Zone’ and pay $2 million for a house, they’ll still have to live somewhere and pay something for a house.  In the 2015 story, the cost of a house just outside a top school zone was about 20% lower than just inside. Even that probably overestimates the school premium, but the total cost of a house obviously does.
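As a rough worked version of the school-zone point above: the sketch below applies the ‘about 20% lower’ figure from the 2015 story to this year's $2 million headline price. Combining those two numbers is my illustration, not a calculation from either Herald story, and the variable names are made up.

```python
price_inside_zone = 2_000_000   # headline price of a house inside the zone
discount_percent = 20           # 'about 20% lower' just outside, per the 2015 story

# Integer arithmetic keeps the dollar figures exact.
price_outside_zone = price_inside_zone * (100 - discount_percent) // 100
premium_upper_bound = price_inside_zone - price_outside_zone

print(f"just outside the zone: ${price_outside_zone:,}")
print(f"school-zone premium, at most: ${premium_upper_bound:,}")
```

On these assumptions the premium is at most about $400,000 rather than $2 million, and as noted above even that probably overestimates it.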

Stat of the Week Competition: March 4 – 10 2017

Each week, we would like to invite readers of Stats Chat to submit nominations for our Stat of the Week competition and be in with the chance to win an iTunes voucher.

Here’s how it works:

  • Anyone may add a comment on this post to nominate their Stat of the Week candidate before midday Friday March 10 2017.
  • Statistics can be bad, exemplary or fascinating.
  • The statistic must be in the NZ media during the period of March 4 – 10 2017 inclusive.
  • Quote the statistic, when and where it was published and tell us why it should be our Stat of the Week.

Next Monday at midday we’ll announce the winner of this week’s Stat of the Week competition, and start a new one.
