May 22, 2016

Knowing what you’re predicting

From a Sydney Morning Herald story about brain wave reading.

The faux insurgents were asked to hatch a mock terrorist plot by selecting one of four dates in July, one of four locations in Houston and one of four types of bomb, then jot it all down in a letter to their terrorist boss.

EEG caps on, they were later shown a slew of months of the year, US cities and varieties of terror attack on a computer; and when “July”, “Houston” and “bomb” appeared among them, the P300 spikes were big enough to nab all 12 “culprits”.

The brain fingerprinting technique relies on picking up a signal that the brain recognises some piece of information. The people who make the gadgetry claim this can be done with 100% accuracy (not everyone agrees). However, even if the brain waves can be picked up with 100% accuracy, that’s not 100% accuracy for the real question.

Consider DNA evidence. In the ideal case of a high-quality DNA sample from the scene of a crime, and a high-quality sample from a suspect, and the right combination of ancestries, it is possible to be almost 100% sure that the suspect’s DNA (or that of an identical twin) is present in the crime sample. The scene-of-crime sample could be billions of times more likely if the suspect contributed to it than if a random person from the population did. The DNA expert won’t (or shouldn’t) testify that the suspect is almost certainly guilty, because that’s not a DNA question. Even ruling out police fraud or incompetence, the suspect’s DNA could have been present in the sample for some innocent reason. Guilt is not a question that capillary electrophoresis can answer.

The situation is worse for the brain fingerprinting technique, because it’s intended to be used before a terrorist attack has been committed, and potentially before the suspects have even committed a crime such as conspiracy.  Maybe they recognised an attack plan because they’d been thinking about it, or because they’d read a Tom Clancy novel about it. Maybe they recognised “July” and “Houston” from baseball and the bomb from somewhere else entirely.  None of these would be counted as an error by the brain wave enthusiasts — they are entirely genuine indications of recognition — but they aren’t specific evidence of past or future crime.
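To put rough numbers on the gap between "recognition" and "crime", here is a minimal sketch in Python. Every input is a hypothetical value chosen for illustration, not an estimate from the story or from any study, and the function name `prob_plotter_given_recognition` is made up for this example. It assumes the brain-wave test detects recognition perfectly, and applies Bayes' theorem to ask how likely a plot is given a recognition signal.

```python
def prob_plotter_given_recognition(prior, p_recog_if_plotter, p_recog_if_innocent):
    """Bayes' theorem: P(plotter | recognition signal)."""
    numerator = prior * p_recog_if_plotter
    denominator = numerator + (1 - prior) * p_recog_if_innocent
    return numerator / denominator


# Hypothetical inputs (assumptions, not data): 1 in 10,000 people screened is
# a plotter, every plotter recognises the plan, and 1% of innocent people
# recognise the same items (from the news, a novel, baseball, ...).
posterior = prob_plotter_given_recognition(
    prior=1 / 10_000,
    p_recog_if_plotter=1.0,
    p_recog_if_innocent=0.01,
)
print(f"P(plotter | recognition) = {posterior:.4f}")
```

With these made-up inputs the answer is about 1%, even though the recognition signal itself is assumed to be read without error. Different inputs give different answers, which is the point: the accuracy of the signal doesn't settle the real question.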


May 21, 2016

Advertising, health promotion, and lots of latex

The biennial Olympic condom story is out.  The Rio Olympics are planning to give away 450,000 condoms in the Olympic Village, compared to a mere 150,000 in London, and 90,000 in Sydney (initially 70,000, but they ran out).

This graph shows (with black dots) the publicised numbers for the past Olympics that I could find easily (Torino seems to be keeping quiet, for some reason).


So, why so many? Condoms are cheap to produce and hard to advertise.  Even buying retail from Amazon you can get 1000 for less than US$150, so 450,000 would cost about US$65k.  In a setting like this, I’m sure the health promotion folks are paying a lot less than that, and the international news coverage implying that Olympic athletes have safe sex is worth far more than the cost of materials.

The red dot? Oh yes. That’s the number handed out by the Health Ministry campaigners at street parties for Carnival this year in Brazil.
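For anyone checking the cost arithmetic, a quick sketch using only the figures quoted above. The US$150 is treated as an upper bound on the price per 1000, since the post says "less than"; the variable names are just for illustration.

```python
# Figures from the post: 450,000 condoms, and a retail price of
# "less than US$150" per 1000, used here as an upper bound.
condoms = 450_000
max_price_per_1000_usd = 150

upper_bound_usd = condoms / 1000 * max_price_per_1000_usd

# Price per 1000 that would give the post's "about US$65k".
implied_price_per_1000_usd = 65_000 / (condoms / 1000)

print(f"Upper bound: US${upper_bound_usd:,.0f}")
print(f"Implied price per 1000: US${implied_price_per_1000_usd:.2f}")
```

So "about US$65k" corresponds to paying a little under US$145 per 1000, consistent with "less than US$150".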

May 20, 2016


  • The Princeton Web Census: “Today I’m pleased to release initial analysis results from our monthly, 1-million-site measurement. This is the largest and most detailed measurement of online tracking to date, including measurements for stateful (cookie-based) and stateless (fingerprinting-based) tracking, the effect of browser privacy tools, and “cookie syncing”.  These results represent a snapshot of web tracking, but the analysis is part of an effort to collect data on a monthly basis and analyze the evolution of web tracking and privacy over time.”
  • Nate Silver on Twitter: “An irony is that our early Trump forecasts weren’t based on a statistical model. Just a guesstimate that I got stubborn anchoring myself to. So one lesson is “when in doubt, build a model”. Doesn’t have to be your final answer. But it’s a great starting point. Provides discipline.”
  • From Flowing Data, a visualisation of the changing US diet
  • A visualisation of 24 hours of data flow in a health insurance company: pretty, but not necessarily useful
  • “Mukherjee gives us a Whig history of the gene, told with verve and color, if not scrupulous accuracy.” A book review/essay at the Atlantic, by Nathaniel Comfort
  • There’s a new White House report on Big Data and Civil Rights: “Using case studies on credit lending, employment, higher education, and criminal justice, the report we are releasing today illustrates how big data techniques can be used to detect bias and prevent discrimination. It also demonstrates the risks involved, particularly how technologies can deliberately or inadvertently perpetuate, exacerbate, or mask discrimination.” (via

Depends who you ask

There’s a Herald story about sleep:

A University of Michigan study using data from Entrain, a smartphone app aimed at reducing jetlag, found Kiwis on average go to sleep at 10.48pm and wake at 6.54am – an average of 8 hours and 6 minutes sleep.

It quotes me as saying the results might not be all that representative, but it just occurred to me that there are some comparison data sets for the US at least.

  • The Entrain study finds people in the US go to sleep on average just before 11pm and wake up on average between 6:45 and 7am.
  • SleepCycle, another app, reports a bedtime of 11:40 for women and midnight for men, with both men and women waking at about 7:20.
  • The American Time Use Survey is nationally representative, but not that easy to get stuff out of. However, Nathan Yau at Flowing Data has an animation saying that 50% of the population are asleep at 10:30pm and awake at 6:30am.
  • And Jawbone, who don’t have to take anyone’s word for whether they’re asleep, have a fascinating map of mean bedtime by county of the US. It looks like the national average is after 11pm, but there’s huge variation, both urban-rural and position within your time zone.

These differences partly come from who is deliberately included and excluded (kids, shift workers, the very old), partly from measurement details, and partly from oversampling of the sort of people who use shiny gadgets.
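To compare the sources on a common scale, here is a small Python sketch that turns a reported bedtime and wake time into a duration, allowing for the night crossing midnight. It uses only the clock times quoted above that are given precisely (the SleepCycle wake time is "about 7:20"); the function name `time_in_bed` is made up for this example, and the result is simply the gap between two averages, not a measurement of sleep.

```python
from datetime import datetime, timedelta


def time_in_bed(bedtime, wake_time):
    """Duration from bedtime to wake time, both 'HH:MM' on a 24-hour clock."""
    fmt = "%H:%M"
    start = datetime.strptime(bedtime, fmt)
    end = datetime.strptime(wake_time, fmt)
    if end <= start:  # the wake-up falls on the following day
        end += timedelta(days=1)
    return end - start


# Average bedtimes and wake times as quoted in the post.
reported = {
    "Entrain, NZ": ("22:48", "06:54"),
    "SleepCycle, US women": ("23:40", "07:20"),
    "SleepCycle, US men": ("00:00", "07:20"),
}

for source, (bed, wake) in reported.items():
    hours, remainder = divmod(time_in_bed(bed, wake).seconds, 3600)
    print(f"{source}: {hours} h {remainder // 60:02d} min")
```

The Entrain figure reproduces the 8 hours and 6 minutes in the Herald story; the SleepCycle averages work out to 7 h 40 min for women and 7 h 20 min for men.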

May 18, 2016

Super 18 Predictions for Round 13

Team Ratings for Round 13

The basic method is described on my Department home page.

Here are the team ratings prior to this week’s games, along with the ratings at the start of the season.

Team         Current Rating  Rating at Season Start  Difference
Crusaders              9.63                    9.84       -0.20
Highlanders            7.33                    6.80        0.50
Hurricanes             6.75                    7.26       -0.50
Chiefs                 5.22                    2.68        2.50
Waratahs               4.69                    4.88       -0.20
Brumbies               2.95                    3.15       -0.20
Lions                  1.82                   -1.80        3.60
Sharks                 1.19                   -1.64        2.80
Stormers               0.72                   -0.62        1.30
Bulls                 -1.01                   -0.74       -0.30
Rebels                -5.49                   -6.33        0.80
Blues                 -5.50                   -5.51        0.00
Jaguares              -7.15                  -10.00        2.80
Cheetahs              -7.27                   -9.27        2.00
Reds                  -9.27                   -9.81        0.50
Force                -10.87                   -8.43       -2.40
Sunwolves            -16.42                  -10.00       -6.40
Kings                -20.56                  -13.66       -6.90


Performance So Far

So far there have been 93 matches played, 65 of which were correctly predicted, a success rate of 69.9%.
Here are the predictions for last week’s games.

Game                         Date    Score    Prediction  Correct
1 Highlanders vs. Crusaders  May 13  34 – 26        0.30  TRUE
2 Rebels vs. Brumbies        May 13  22 – 30       -4.50  TRUE
3 Hurricanes vs. Reds        May 14  29 – 14       20.70  TRUE
4 Waratahs vs. Bulls         May 14  31 – 8         7.90  TRUE
5 Sunwolves vs. Stormers     May 14  17 – 17      -14.90  FALSE
6 Cheetahs vs. Kings         May 14  34 – 20       17.20  TRUE
7 Lions vs. Blues            May 14  43 – 5         7.70  TRUE
8 Jaguares vs. Sharks        May 14  22 – 25       -4.50  TRUE
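The "Correct" column can be reproduced with a short Python sketch. The scoring rule here is inferred from the table rather than stated anywhere in the post, so treat it as an assumption: the home team is listed first, a prediction counts as correct when the predicted margin has the same sign as the actual margin, and a draw counts as a miss. The function name `prediction_correct` is made up for this example.

```python
def prediction_correct(home_score, away_score, predicted_margin):
    """True when the predicted margin has the same sign as the actual margin."""
    actual_margin = home_score - away_score
    # A drawn game gives a product of zero, so it never counts as correct.
    return actual_margin * predicted_margin > 0


# (home score, away score, predicted margin) for last week's eight games.
last_week = [
    (34, 26, 0.30),
    (22, 30, -4.50),
    (29, 14, 20.70),
    (31, 8, 7.90),
    (17, 17, -14.90),
    (34, 20, 17.20),
    (43, 5, 7.70),
    (22, 25, -4.50),
]

results = [prediction_correct(*game) for game in last_week]
print(results)

# Season-to-date figures quoted in the post: 65 correct out of 93.
print(f"Season so far: {65 / 93:.1%}")
```

Under that rule the only miss last week is the 17 – 17 draw, matching the table, and 65 out of 93 gives the quoted 69.9%.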


Predictions for Round 13

Here are the predictions for Round 13. The prediction is my estimated expected points difference with a positive margin being a win to the home team, and a negative margin a win to the away team.

Game                      Date    Winner     Prediction
1 Crusaders vs. Waratahs  May 20  Crusaders        8.90
2 Reds vs. Sunwolves      May 21  Reds            11.20
3 Chiefs vs. Rebels       May 21  Chiefs          14.70
4 Force vs. Blues         May 21  Blues           -1.40
5 Lions vs. Jaguares      May 21  Lions           13.00
6 Sharks vs. Kings        May 21  Sharks          25.30
7 Bulls vs. Stormers      May 21  Bulls            1.80
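The method itself isn't described in this post, but the tables allow a rough consistency check. The Python sketch below assumes, and it is only an assumption, that each prediction is the home team's rating minus the away team's rating plus some allowance for playing at home. It then backs out the allowance each prediction implies. Ratings and predictions are copied from the tables above.

```python
# Current ratings for the teams playing in Round 13 (from the ratings table).
ratings = {
    "Crusaders": 9.63, "Waratahs": 4.69, "Reds": -9.27, "Sunwolves": -16.42,
    "Chiefs": 5.22, "Rebels": -5.49, "Force": -10.87, "Blues": -5.50,
    "Lions": 1.82, "Jaguares": -7.15, "Sharks": 1.19, "Kings": -20.56,
    "Bulls": -1.01, "Stormers": 0.72,
}

# (home team, away team, predicted margin) from the Round 13 table.
fixtures = [
    ("Crusaders", "Waratahs", 8.90),
    ("Reds", "Sunwolves", 11.20),
    ("Chiefs", "Rebels", 14.70),
    ("Force", "Blues", -1.40),
    ("Lions", "Jaguares", 13.00),
    ("Sharks", "Kings", 25.30),
    ("Bulls", "Stormers", 1.80),
]

implied_home_allowance = {}
for home, away, predicted in fixtures:
    rating_gap = ratings[home] - ratings[away]
    # Assumed model: predicted = rating_gap + home allowance.
    implied_home_allowance[home] = predicted - rating_gap
    print(f"{home} vs. {away}: gap {rating_gap:6.2f}, "
          f"implied home allowance {implied_home_allowance[home]:.2f}")
```

With these numbers the implied allowance is between about 3.5 and 4.1 points for all seven games. The predictions are rounded, so the last digit shouldn't be taken seriously.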


NRL Predictions for Round 11

Team Ratings for Round 11

The basic method is described on my Department home page.

Here are the team ratings prior to this week’s games, along with the ratings at the start of the season.

Team          Current Rating  Rating at Season Start  Difference
Broncos                12.11                    9.81        2.30
Cowboys                12.06                   10.29        1.80
Storm                   6.67                    4.41        2.30
Sharks                  6.35                   -1.06        7.40
Bulldogs                2.36                    1.50        0.90
Roosters                1.90                   11.20       -9.30
Eels                    0.88                   -4.62        5.50
Panthers                0.45                   -3.06        3.50
Raiders                -0.55                   -0.55        0.00
Sea Eagles             -0.64                    0.36       -1.00
Rabbitohs              -0.89                   -1.20        0.30
Dragons                -3.24                   -0.10       -3.10
Titans                 -5.04                   -8.39        3.30
Warriors               -6.04                   -7.47        1.40
Wests Tigers           -8.78                   -4.06       -4.70
Knights               -15.92                   -5.41      -10.50


Performance So Far

So far there have been 80 matches played, 44 of which were correctly predicted, a success rate of 55%.
Here are the predictions for last week’s games.

Game                         Date    Score    Prediction  Correct
1 Dragons vs. Raiders        May 12  16 – 12       -0.40  FALSE
2 Eels vs. Rabbitohs         May 13  20 – 22        5.90  FALSE
3 Panthers vs. Warriors      May 14  30 – 18        5.60  TRUE
4 Storm vs. Cowboys          May 14  15 – 14       -6.50  FALSE
5 Broncos vs. Sea Eagles     May 14  30 – 6        14.40  TRUE
6 Knights vs. Sharks         May 15  0 – 62       -12.80  TRUE
7 Wests Tigers vs. Bulldogs  May 15  4 – 36        -7.80  TRUE
8 Titans vs. Roosters        May 16  26 – 6        -7.70  FALSE


Predictions for Round 11

Here are the predictions for Round 11. The prediction is my estimated expected points difference with a positive margin being a win to the home team, and a negative margin a win to the away team.

Game                        Date    Winner        Prediction
1 Rabbitohs vs. Dragons     May 19  Rabbitohs           2.40
2 Cowboys vs. Broncos       May 20  Cowboys             2.90
3 Wests Tigers vs. Knights  May 21  Wests Tigers       10.10
4 Warriors vs. Raiders      May 21  Raiders            -1.50
5 Sharks vs. Sea Eagles     May 21  Sharks             10.00
6 Panthers vs. Titans       May 22  Panthers            8.50
7 Bulldogs vs. Roosters     May 22  Bulldogs            3.50
8 Eels vs. Storm            May 23  Storm              -2.80


May 17, 2016

Housing prices, SF edition

Eric Fischer set out to look at rental price trends in San Francisco. The standard dataset goes back only to 1979, which was also the start of rent control. Most people would have stopped there. But no:

I set out to replicate the DataBook’s methodology over a wider range of years, … Mostly I used the San Francisco Public Library’s page scans of the newspaper but resorted to microfilm for the few later years where no page scans are available.

That is, he copied down and entered the prices from the ads by hand.

There has been a remarkably constant trend in SF rental prices since the mid-1950s, with median real prices increasing steadily by 2.5%/year, decade after decade.
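A steady 2.5% a year sounds modest, so it's worth working out what it compounds to. This Python sketch uses only the 2.5% figure from the analysis; the 60-year span is my round-number assumption for "mid-1950s to now", not a figure from Fischer's post.

```python
import math

annual_growth = 0.025   # real rent growth per year, from the analysis
years = 60              # assumed span, roughly mid-1950s to mid-2010s

# Years for real rents to double at a constant growth rate.
doubling_time_years = math.log(2) / math.log(1 + annual_growth)

# Total multiple over the assumed span.
growth_multiple = (1 + annual_growth) ** years

print(f"Doubling time: {doubling_time_years:.1f} years")
print(f"Multiple over {years} years: {growth_multiple:.1f}x")
```

That is, real rents double roughly every 28 years, and over six decades they end up at something like four and a half times where they started, after adjusting for inflation.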

For the years since 1975, when employment data are available, most of the deviations from this trend can be explained by increases or decreases in numbers of homes in the city, increases or decreases in number of jobs, and increases or decreases in total real salaries and wages paid (specifically salaries and wages, not all income).

Rent control didn’t have a big impact. Speculation didn’t have a big impact — prices were higher during the boom of the 1990s, but only as much as would be expected from more people in the city and the higher salaries and wages they were paid.

San Francisco County already has a population density of over 7000 people per square km — lower than the Auckland CBD, but higher than anywhere else in Auckland. It’s hard for them to increase supply enough to reduce prices, but they might manage to increase supply enough to stabilise prices.

(via Michael Andersen and @BarbsNZgarden)


  • You’ve probably seen this, but Facebook’s news feed editing wasn’t as algorithmic as they were suggesting. Of course, that tells you nothing one way or the other about bias, as people including Cathy O’Neil point out.
  • The difficulties of turning data science into gobs and gobs of money, as illustrated by Palantir. From Roger Peng at Simply Statistics.
  • Finally for stats/literature dual nerds, an excerpt from the new book by historian of statistics Stephen Stigler

May 16, 2016

Stat of the Week Competition: May 14 – 20 2016

Each week, we would like to invite readers of Stats Chat to submit nominations for our Stat of the Week competition and be in with the chance to win an iTunes voucher.

Here’s how it works:

  • Anyone may add a comment on this post to nominate their Stat of the Week candidate before midday Friday May 20 2016.
  • Statistics can be bad, exemplary or fascinating.
  • The statistic must be in the NZ media during the period of May 14 – 20 2016 inclusive.
  • Quote the statistic, when and where it was published and tell us why it should be our Stat of the Week.

Next Monday at midday we’ll announce the winner of this week’s Stat of the Week competition, and start a new one.


May 13, 2016

Aggregation, not ok?

You’ve probably heard of OkCupid, a dating site. People give sites like that a lot of personal information. And, in a sense, the information is obviously not going to be kept secret — after all, the point of using a dating site is to be found by people you don’t already know.  When someone writes a script to collect the data from large numbers of users, and then publishes it in a convenient and easy to process format, you can just about see how they’d think that was ok. It’s harder to see how they’d be surprised not everyone feels that way.

Aggregation makes a difference because we can search, match, and analyse the data by computer. That’s important for two reasons.

First, it’s quicker and easier: you can get a set of records grouped by sexual preference or other interests almost as quickly as you can think of the question, and you can link usernames or other information to other datasets. The database includes potential matching variables such as income, education level, age, job, country, and city. You could still use these if you were taking down data one person at a time by hand, but it would be slow and boring.

Second, the database is impersonal. If you stood outside a gay bar watching who went in and out, you couldn’t really pretend you were innocently using publicly visible information.  If you signed up and went through dating profiles one at a time, it would be easier to pretend, but you’d still tend to see the people behind the data. When it’s a big spreadsheet, it’s easier to ignore how the people would feel about it.

Sometimes people aggregate and publish data knowing it may do harm, because they think there’s a higher interest involved in getting the data out — even if the data release is obviously illegal. This release isn’t obviously illegal (though there are possibilities), but the higher interest is pretty obscure too. The accompanying research paper says

As an example of the analyses one can do with the dataset, a cognitive ability test is constructed from 14 suitable items. To validate the dataset and the test, the relationship of cognitive ability to religious beliefs and political interest/participation is examined.

Those variables are so not what’s going to attract people to these data. But even if you think it’s important for anyone on the internet to be able to do that sort of correlation for variables such as sexual orientation and drug use, it’s hard to think of a reason to include the OkCupid username.