June 26, 2014

Slightly too Open Data

  1. The Atlantic published some visualisations of taxi rides in New York.
  2. Chris Whong asked for the data under Freedom-of-Information laws, and got it. Of course, the taxi and driver ids were anonymised.
  3. Vijay Pandurangan noticed that the driver id and taxi id were really, really weakly anonymised.
  4. You can find out a lot once you know the taxi id.

 

The NY Taxi & Limousine Commission had run the ids through a cryptographic hash function, MD5. Hash functions are designed so that if you don’t know anything about the input you can’t reconstruct it from the output, but if you know the input exactly, you can verify easily that it gives the same output.  The problem comes when you know a lot about the input, but not everything.  In this case, there are only about two million possible id numbers, and you can just try them all. Once you have the ids, you can look up.
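As a rough sketch of the "try them all" attack in Python: the id format below (five-digit, zero-padded numbers) is a made-up stand-in rather than the real taxi or driver id format, and the variable names are purely illustrative. The point is only that a small input space can be hashed exhaustively.

```python
import hashlib


def md5_hex(text):
    """Return the hex MD5 digest of an ASCII string."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


# ASSUMPTION: ids are five-digit zero-padded numbers (100,000 candidates).
# The real id formats differ; this only stands in for "a small input space".
candidates = [f"{n:05d}" for n in range(100_000)]

# Hash every possible id once and index the candidates by their digest.
lookup = {md5_hex(c): c for c in candidates}

# A hashed value as it might appear in a released data set.
released = md5_hex("04213")

# Reversing the "anonymisation" is now just a dictionary lookup.
recovered = lookup[released]
print(recovered)
```

A couple of million candidates is only about twenty times more work than this toy example.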

Even if the taxi authorities had done the anonymisation correctly — replacing each id with a random number — it would inevitably have been possible to extract some of the ids with a bit of work.  That’s not the same as being able to extract all of them with a few hours’ computer time.

Roundup return

The Séralini et al. paper on Roundup and Roundup-resistant GM corn is back. The NZ Science Media Centre has comments, as does Retraction Watch.

June 25, 2014

NRL Predictions for Round 16

Team Ratings for Round 16

The basic method is described on my Department home page. I have made some changes to the methodology this year, including shrinking the ratings between seasons.

Here are the team ratings prior to this week’s games, along with the ratings at the start of the season.

Team          Current Rating  Rating at Season Start  Difference
Roosters                9.74                   12.35       -2.60
Rabbitohs               7.89                    5.82        2.10
Sea Eagles              6.70                    9.10       -2.40
Broncos                 4.31                   -4.69        9.00
Cowboys                 3.41                    6.01       -2.60
Warriors                2.59                   -0.72        3.30
Panthers                2.25                   -2.48        4.70
Storm                   1.81                    7.64       -5.80
Bulldogs                1.71                    2.46       -0.80
Knights                -1.95                    5.23       -7.20
Wests Tigers           -4.77                  -11.26        6.50
Titans                 -5.41                    1.45       -6.90
Eels                   -5.78                  -18.45       12.70
Dragons                -6.09                   -7.57        1.50
Raiders                -7.68                   -8.99        1.30
Sharks                -10.52                    2.32      -12.80

Performance So Far

So far there have been 110 matches played, 63 of which were correctly predicted, a success rate of 57.3%.

Here are the predictions for last week’s games.

   Game                      Date    Score    Prediction  Correct
1  Raiders vs. Bulldogs      Jun 20  14 – 22       -4.10  TRUE
2  Warriors vs. Broncos      Jun 21  19 – 10        1.30  TRUE
3  Sharks vs. Sea Eagles     Jun 21  0 – 26        -9.80  TRUE
4  Storm vs. Eels            Jun 22  46 – 20        9.00  TRUE
5  Titans vs. Dragons        Jun 22  18 – 19        6.70  FALSE
6  Knights vs. Cowboys       Jun 23  36 – 28       -2.90  FALSE
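
The "Correct" column can be reproduced from the scores and predictions in the table above: a prediction counts as correct when the predicted margin (home minus away) has the same sign as the actual margin. This is a small sketch, not the code used to produce the table; treating a draw or a prediction of exactly zero as incorrect is my assumption.

```python
# (home, away, home score, away score, predicted margin), from the table above
games = [
    ("Raiders", "Bulldogs", 14, 22, -4.10),
    ("Warriors", "Broncos", 19, 10, 1.30),
    ("Sharks", "Sea Eagles", 0, 26, -9.80),
    ("Storm", "Eels", 46, 20, 9.00),
    ("Titans", "Dragons", 18, 19, 6.70),
    ("Knights", "Cowboys", 36, 28, -2.90),
]


def prediction_correct(home_score, away_score, predicted_margin):
    """True if the predicted and actual margins have the same sign.

    ASSUMPTION: a draw or a predicted margin of zero counts as incorrect.
    """
    actual_margin = home_score - away_score
    return actual_margin * predicted_margin > 0


results = [prediction_correct(hs, aws, pred) for _, _, hs, aws, pred in games]
print(results)
print(f"{sum(results)} of {len(results)} correct")
```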

 

Predictions for Round 16

Here are the predictions for Round 16. The prediction is my estimated expected points difference with a positive margin being a win to the home team, and a negative margin a win to the away team.

   Game                      Date    Winner        Prediction
1  Sea Eagles vs. Roosters   Jun 27  Sea Eagles          1.50
2  Broncos vs. Sharks        Jun 27  Broncos            19.30
3  Wests Tigers vs. Raiders  Jun 28  Wests Tigers        7.40
4  Cowboys vs. Rabbitohs     Jun 28  Cowboys             0.00
5  Warriors vs. Panthers     Jun 29  Warriors            4.80
6  Eels vs. Knights          Jun 29  Eels                0.70
7  Dragons vs. Storm         Jun 30  Storm              -3.40

Super 15 Predictions for Round 17

Team Ratings for Round 17

The basic method is described on my Department home page. I have made some changes to the methodology this year, including shrinking the ratings between seasons.

Here are the team ratings prior to this week’s games, along with the ratings at the start of the season.

Team          Current Rating  Rating at Season Start  Difference
Crusaders               8.68                    8.80       -0.10
Waratahs                5.97                    1.67        4.30
Sharks                  5.65                    4.57        1.10
Brumbies                3.64                    4.12       -0.50
Hurricanes              2.74                   -1.44        4.20
Bulls                   2.62                    4.87       -2.30
Stormers                1.98                    4.38       -2.40
Chiefs                  1.44                    4.38       -2.90
Blues                   0.21                   -1.92        2.10
Highlanders            -1.69                   -4.48        2.80
Force                  -2.85                   -5.37        2.50
Reds                   -4.13                    0.58       -4.70
Cheetahs               -4.52                    0.12       -4.60
Lions                  -6.02                   -6.93        0.90
Rebels                 -6.73                   -6.36       -0.40

Performance So Far

So far there have been 101 matches played, 65 of which were correctly predicted, a success rate of 64.4%.

Here are the predictions for last week’s games.

   Game                      Date    Score    Prediction  Correct
1  Crusaders vs. Force       May 30  30 – 7        14.40  TRUE
2  Reds vs. Highlanders      May 30  38 – 31        0.70  TRUE
3  Chiefs vs. Waratahs       May 31  17 – 33        1.60  FALSE
4  Blues vs. Hurricanes      May 31  37 – 24       -1.80  FALSE
5  Brumbies vs. Rebels       May 31  37 – 10       10.90  TRUE
6  Lions vs. Bulls           May 31  32 – 21       -8.50  FALSE
7  Sharks vs. Stormers       May 31  19 – 21        7.40  FALSE

Predictions for Round 17

Here are the predictions for Round 17. The prediction is my estimated expected points difference with a positive margin being a win to the home team, and a negative margin a win to the away team.

   Game                      Date    Winner        Prediction
1  Highlanders vs. Chiefs    Jun 27  Chiefs             -0.60
2  Rebels vs. Reds           Jun 27  Reds               -0.10
3  Hurricanes vs. Crusaders  Jun 28  Crusaders          -3.40
4  Waratahs vs. Brumbies     Jun 28  Waratahs            4.80
5  Force vs. Blues           Jun 28  Force               0.90

Something to listen to

Two people we have linked to a lot, Felix Salmon and Cathy O’Neil, now have a podcast on money and finance, at Slate.

Not even wrong

The Reader’s Digest “Most Trusted” lists are out again. Sigh.

Before we get to the actual complaint in the Stat-of-the-Week recommendation, we should acknowledge that there’s no way the “most trusted” list could make sense.

Firstly, ‘trusted’ requires more detail. What is it that we’re trusting these people with? Of course, making the question more specific wouldn’t help, since people will still answer on some vague ‘niceness’ scale anyway: we saw this problem with a Herald poll at the beginning of the year, which asked opinions about five notable people and found that the only one notable for his commitment to animal safety had the lowest rating for “who would you trust to feed your cat?”. Secondly, there’s no useful way to get an accurate rating of dozens of people (or other items) in an opinion poll. People’s brains overload. Thirdly, even if you could get a rating from each respondent, the overall ranking will be sensitive to how you combine the individual ratings.
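
To illustrate the third point, here is a toy example in Python. The three people and their one-to-ten ratings are invented for illustration and have nothing to do with the actual survey; the only claim is that two reasonable summaries (mean and median) can order the same ratings differently.

```python
from statistics import mean, median

# HYPOTHETICAL ratings on a 1-10 scale from five respondents.
ratings = {
    "A": [10, 10, 10, 1, 1],  # polarising: loved by some, distrusted by others
    "B": [7, 7, 7, 7, 7],     # uniformly lukewarm
    "C": [8, 8, 6, 6, 6],
}


def rank(ratings, summary):
    """Order names from highest to lowest by a summary of their ratings."""
    return sorted(ratings, key=lambda name: summary(ratings[name]), reverse=True)


rank_by_mean = rank(ratings, mean)
rank_by_median = rank(ratings, median)
print("by mean:  ", rank_by_mean)
print("by median:", rank_by_median)
```

With these made-up numbers, A is last when ranked by mean rating and first when ranked by median rating.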

So how does Reader’s Digest do it? They say (shouting in the original):

READER’S DIGEST COMMISSIONED CATALYST CONSULTANCY & RESEARCH TO POLL A REPRESENTATIVE SAMPLE OF NEW ZEALANDERS ABOUT TRUSTED PEOPLE AND PROFESSIONS. A TOTAL OF 603 ADULTS RANKED 100 WELL-KNOWN PEOPLE AND 50 JOB TYPES ON A SCALE OF ONE TO TEN IN MARCH 2014.

That is, the list is determined in advance, and the polling just addresses the ordering on the list. There is some vague sense in which Willie Apiata is the most trusted person, or at least the most highly-regarded person, or at least the most highly-regarded famous person, in New Zealand, but there really isn’t any useful sense in which Hone Harawira is the least trusted person in New Zealand. There are many people in NZ who you’d expect to be less trusted than Mr Harawira; they didn’t get put on the list, and the survey respondents weren’t asked about them.

It’s not surprising that stories keep coming out about this list, and I suppose it’s not surprising that people try to interpret being on the bottom of the list. Perhaps more surprising, no-one has yet complained that there are actually 101 well-known people, not 100, on the list.

June 24, 2014

Beyond clinical trials?

From The Atlantic

And with reliable simulations for what’s happening at the cellular level, this approach could be used to treat patients and also to test new drugs and devices. Dassault Systèmes is focusing on that level of granularity now, trying to simulate propagation of cholesterol in human cells and building oncological cell models. “It’s data science and modeling,” Charlès told me. “Coupling the two creates a new environment in medicine.”

Charlès and his colleagues believe that a shift to virtual clinical trials—that is, testing new medicines and devices using computer models before or instead of trials in human patients—could make new treatments available more quickly and cheaply. 

From pharmaceutical chemist Derek Lowe, in response

Speed the day. The cost of clinical trials, coupled with their low success rate, is eating us alive in this business (and it’s getting worse every year). This is just the sort of thing that could rescue us from the walls that are closing in more tightly all the time. But this talk of shifts and revolutions makes it sound as if this sort of thing is happening right now, which it isn’t. No such simulated clinical trial, one that could serve as the basis for a drug approval, is anywhere near even being proposed. How long before one is, then? If things go really swimmingly, I’d say 20 to 25 years from now, personally, but I’d be glad to hear other estimates.

We do, potentially, have the tools to use current treatments more effectively, and data science can help. Even there, the biggest opportunities have nothing to do with subtle individual differences: for example, both here and in the US, only about half of people with hypertension are being treated.

June 23, 2014

Possibly underreported

From Stuff, the headline “Cheating on the rise at Massey.” The basis for the story is that there were 56 incidents from 56 separate students reported in 2012, and 72 incidents from 51 separate students reported last year.

We aren’t told if that’s out of the 35000 total students, the 18000 on-campus students, or the 9000 at the Manawatu campus. Even with the smallest denominator, the cheating rate is only about half a percent. Taking this at face value requires a touching faith in the honesty of Massey students, since the rate is a couple of orders of magnitude lower than self-report surveys often find for ever having plagiarised in college, and five times lower than a careful experiment found for a single assignment in US colleges (PDF).
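
The arithmetic behind "about half a percent" is simple enough to check. This short Python sketch uses only the counts quoted above (51 separate students reported last year, and the three possible denominators); the variable names are mine.

```python
# Counts quoted from the story and the three candidate denominators.
students_last_year = 51
denominators = {
    "all students": 35000,
    "on-campus students": 18000,
    "Manawatu campus": 9000,
}

# Reported cheating rate, as a percentage, under each denominator.
rates = {
    label: 100 * students_last_year / n for label, n in denominators.items()
}

for label, rate in rates.items():
    print(f"{label}: {rate:.2f}%")
```

Even the most generous choice of denominator gives a reported rate of well under one percent.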

Since reported incidents of cheating are a small minority of actual incidents, it’s hard to say anything sensible about trends from two years at a single university, especially as the story says Massey is taking new steps to combat cheating. There’s no way to disentangle changes in reporting from changes in cheating.

Briefly

  • From The Functional Art, ethics in infographics
  • From Scott Aaronson, is it possible to define morality or trust the way Google defines reliability
  • “Ethics in Graphic Design” is a forum for the exploration of ethical issues in graphic design. It is intended to be used as a resource and to create an open dialogue among graphic designers about these critical issues. 

Undecided?

My attention was drawn on Twitter to this post at The Political Scientist arguing that the election poll reporting is misleading because they don’t report the results for the relatively popular “Undecided” party.  The post is making a good point, but there are two things I want to comment on. Actually, three things. The zeroth thing is that the post contains the numbers, but only as screenshots, not as anything useful.

The first point is that the post uses correlation coefficients to do everything, and these really aren’t fit for purpose. The value of correlation coefficients is that they summarise the (linear part of the) relationship between two variables in a way that doesn’t involve the units of measurement or the direction of effect (if any). Those are bugs, not features, in this analysis. The question is how the other party preferences have changed with changes in the ‘Undecided’ preference — how many extra respondents picked Labour, say, for each extra respondent who gave a preference. That sort of question is answered  (to a straight-line approximation) by regression coefficients, not correlation coefficients.
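
A small Python sketch of the difference, using invented counts (not the poll data): here Labour loses 0.7 respondents and National 0.2 for each extra Undecided respondent. Both correlations come out as -1, so they say nothing about the split, while the regression slopes recover it.

```python
from math import sqrt


def _centred_sums(x, y):
    """Return (Sxy, Sxx, Syy), the centred sums of products and squares."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy, sxx, syy


def slope(x, y):
    """Least-squares slope of y on x: change in y per unit change in x."""
    sxy, sxx, _ = _centred_sums(x, y)
    return sxy / sxx


def correlation(x, y):
    """Pearson correlation: unit-free strength of the linear relationship."""
    sxy, sxx, syy = _centred_sums(x, y)
    return sxy / sqrt(sxx * syy)


# HYPOTHETICAL respondent counts over five polls, not the real data.
undecided = [100, 110, 120, 130, 140]
labour = [350, 343, 336, 329, 322]
national = [450, 448, 446, 444, 442]

for name, counts in [("Labour", labour), ("National", national)]:
    print(name, "slope:", round(slope(undecided, counts), 3),
          "correlation:", round(correlation(undecided, counts), 3))
```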

When I do a set of linear regressions, I estimate that changes in the Undecided vote over the past couple of years have split approximately  70:20:3.5:6.5 between Labour:National:Greens:NZFirst.  That confirms the general conclusion in the post: most of the change in Undecided seems to have come from  Labour. You can do the regressions the other way around and ask where (net) voters leaving Labour have gone, and find that they overwhelmingly seem to have gone to Undecided.

What can we conclude from this? The conclusion is pretty limited because of the small number of polls (9) and the fact that we don’t actually have data on switching for any individuals. You could fit the data just as well by saying that Labour voters have switched to National and National voters have switched to Undecided by the same amount — this produces the same counts, but has different political implications. Since the trends have basically been a straight line over this period it’s fairly easy to get alternative explanations — if there had been more polls and more up-and-down variation the alternative explanations would be more strained.
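
The ambiguity is easy to see with made-up numbers. In this Python sketch (hypothetical counts, illustrative names), two different stories about who switched where produce exactly the same totals, which is all the polls report.

```python
# HYPOTHETICAL starting counts of respondents, not the real poll data.
start = {"Labour": 350, "National": 450, "Undecided": 100}


def apply_flows(counts, flows):
    """Move n respondents from source to destination for each flow."""
    result = dict(counts)
    for source, destination, n in flows:
        result[source] -= n
        result[destination] += n
    return result


# Story A: Labour voters become undecided.
story_a = [("Labour", "Undecided", 30)]

# Story B: Labour voters switch to National, and the same number of
# National voters become undecided.
story_b = [("Labour", "National", 30), ("National", "Undecided", 30)]

end_a = apply_flows(start, story_a)
end_b = apply_flows(start, story_b)
print(end_a)
print(end_b)
print("same totals:", end_a == end_b)
```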

The other limitation in conclusions is illustrated by the conclusion of the post

There’s a very clear story in these two correlations: Put simply, as the decided vote goes up so does the reported percentage vote for the Labour Party.

Conversely, as the decided vote goes up, the reported percentage vote for the National party tends to go down.

The closer the election draws the more likely it is that people will make a decision.

But then there’s one more step – getting people to put that decision into action and actually vote.

We simply don’t have data on what happens when the decided vote goes up — it has been going down over this period — so that can’t be the story. Even if we did have data on the decided vote going up, and even if we stipulated that people are more likely to come to a decision near the election, we still wouldn’t have a clear story. If it’s true that people tend to come to a decision near the election, this means the reason for changes in the undecided vote will be different near an election than far from an election. If the reasons for the changes are different, we can’t have much faith that the relationships between the changes will stay the same.

The data provide weak evidence that Labour has lost support to ‘Undecided’ rather than to National over the past couple of years, which should be encouraging to them. In the current form, the data don’t really provide any evidence for extrapolation to the election.

 

[here’s the re-typed count of preferences data, rounded to the nearest integer]