Posts filed under General (634)

February 27, 2015

CensusAtSchool 2015 launches soon!

It’s nearly CensusAtSchool time again! CAS is a biennial educational project in te reo Māori and English that turns school students into data detectives, using real-world, anonymised data about them, their peers, and their world.

This is how it works: In the classroom, using any sort of internet-enabled digital device, and under the supervision of teachers, students fill in a confidential questionnaire in English or te reo Māori.

Some questions involve practical skills, such as weighing their schoolbags and measuring their arm span. Some questions ask about their day-to-day lives: How do they get to school? Where did they eat their dinners the night before? Do they think bullying is a problem in their school? And, given that this is a major sporting year: Which two teams will contest the Rugby World Cup final?

The database is then made available for students and their teachers to undertake statistical investigations, which is an important part of the statistics strand of the curriculum.

Teachers, this year’s Census starts on March 16 and can be completed at any time this year. Registration is free. For everyone else, CAS always attracts great mainstream media interest – we’ll post the best stories here as they crop up.

CensusAtSchool is an international educational project that began in the UK in 2000, based on a 1990 trial project by Dr Sharleen Forbes, then of Statistics New Zealand. It is now run in the UK, Ireland, Australia, Canada, South Africa, Japan and the US, as well as New Zealand.

February 25, 2015


  • NZ papers have sensible coverage of the new peanuts/kids research (Herald, Stuff). NHS Behind The Headlines has a summary and takes some UK papers to task.
  • “Rich Data, Poor Data”: Nate Silver writes about why sports statistics works. Unfair summary: it’s an artificial problem in a controlled environment that people care about more than they should.
  • “the vast majority of health sites, from the for-profit to the government-run, are loaded with tracking elements that are sending records of your health inquiries to the likes of web giants like Google, Facebook, and Pinterest, and data brokers like Experian and Acxiom.” Story at, video summary from the researcher:
  • “A memo to the American people from US Chief Data Scientist Dr DJ Patil”.  More informative than you might expect given the source.
  • “the biggest problem facing the world of public opinion research isn’t that online opt-in polls, but rather the temptation to troll twitter to ‘see what people are thinking’”, and other thoughts from Cathy O’Neil, based on the new report on Big Data from the American Association for Public Opinion Research.
  • another part of the increasing supply of open data

Wiki New Zealand site revamped

We’ve written before about Wiki New Zealand, which aims to ‘democratise data’. WNZ has revamped its website to make things clearer and cleaner, and you can browse here.

As I’m a postgraduate scarfie this year, the table on domestic students in tertiary education interested me – it shows that women (grey) are enrolled in greater numbers than men at every single level. Click the graph to embiggen.

Founder Lillian Grace talks about the genesis of Wiki New Zealand here, and for those who love the techy side, here’s a video about the backend.












February 18, 2015

Super 15 Predictions for Round 2

Team Ratings for Round 2

The basic method is described on my Department home page.

Here are the team ratings prior to this week’s games, along with the ratings at the start of the season.

Team          Current Rating  Rating at Season Start  Difference
Crusaders               8.60                   10.42       -1.80
Waratahs                8.35                   10.00       -1.60
Brumbies                3.95                    2.20        1.70
Hurricanes              3.65                    2.89        0.80
Sharks                  2.79                    3.91       -1.10
Chiefs                  2.77                    2.23        0.50
Stormers                2.69                    1.68        1.00
Bulls                   1.87                    2.88       -1.00
Blues                   0.90                    1.44       -0.50
Highlanders            -2.54                   -2.54       -0.00
Force                  -3.02                   -4.67        1.70
Lions                  -4.14                   -3.39       -0.80
Cheetahs               -4.42                   -5.55        1.10
Reds                   -6.73                   -4.98       -1.70
Rebels                 -7.71                   -9.53        1.80


Performance So Far

So far there have been 7 matches played, 2 of which were correctly predicted, a success rate of 28.6%.

Here are the predictions for last week’s games.

#  Game                    Date    Score    Prediction  Correct
1  Crusaders vs. Rebels    Feb 13  10 – 20       24.40  FALSE
2  Brumbies vs. Reds       Feb 13  47 – 3        11.20  TRUE
3  Lions vs. Hurricanes    Feb 13  8 – 22        -1.80  TRUE
4  Blues vs. Chiefs        Feb 14  18 – 23        3.20  FALSE
5  Sharks vs. Cheetahs     Feb 14  29 – 35       13.50  FALSE
6  Bulls vs. Stormers      Feb 14  17 – 29        5.20  FALSE
7  Waratahs vs. Force      Feb 15  13 – 25       18.70  FALSE
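
For anyone who wants to check the success rate, here is a minimal Python sketch that rebuilds the "Correct" column from last week's scores and predictions. It assumes the score is listed home team first (which matches the table) and that a prediction counts as correct when the predicted margin and the actual margin have the same sign. Treating a draw as incorrect is my assumption; the post does not say how draws are scored.

```python
# Last week's games: (home, away, home_score, away_score, predicted_margin),
# copied from the table of Round 1 results.
results = [
    ("Crusaders", "Rebels", 10, 20, 24.40),
    ("Brumbies", "Reds", 47, 3, 11.20),
    ("Lions", "Hurricanes", 8, 22, -1.80),
    ("Blues", "Chiefs", 18, 23, 3.20),
    ("Sharks", "Cheetahs", 29, 35, 13.50),
    ("Bulls", "Stormers", 17, 29, 5.20),
    ("Waratahs", "Force", 13, 25, 18.70),
]


def is_correct(home_score, away_score, predicted_margin):
    """A prediction is correct if it picks the same side as the actual result."""
    actual_margin = home_score - away_score
    # Same sign means same winner; a draw (margin 0) is counted as incorrect
    # here, which is an assumption rather than something the post specifies.
    return actual_margin * predicted_margin > 0


correct = [is_correct(h, a, p) for _, _, h, a, p in results]
success_rate = 100 * sum(correct) / len(correct)
print(f"{sum(correct)} of {len(correct)} correct ({success_rate:.1f}%)")
```

This reproduces the 2 out of 7 (28.6%) quoted above.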


Predictions for Round 2

Here are the predictions for Round 2. The prediction is my estimated expected points difference with a positive margin being a win to the home team, and a negative margin a win to the away team.

#  Game                       Date    Winner     Prediction
1  Chiefs vs. Brumbies        Feb 20  Chiefs           3.30
2  Rebels vs. Waratahs        Feb 20  Waratahs       -12.10
3  Bulls vs. Hurricanes       Feb 20  Bulls            2.70
4  Highlanders vs. Crusaders  Feb 21  Crusaders       -7.10
5  Reds vs. Force             Feb 21  Reds             0.30
6  Stormers vs. Blues         Feb 21  Stormers         6.30
7  Sharks vs. Lions           Feb 21  Sharks          10.90
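
The sign convention can be written down in a few lines of Python. This is only a sketch of how the "Winner" column follows from the predicted margin, not the rating method itself; the function name `predicted_winner` is made up for illustration, and returning `None` for a margin of exactly zero is my assumption.

```python
def predicted_winner(home, away, margin):
    """Positive margin: home team predicted to win; negative: away team."""
    if margin > 0:
        return home
    if margin < 0:
        return away
    return None  # assumption: a margin of exactly zero gives no pick


# Round 2 fixtures as (home, away, predicted_margin), copied from the table.
round2 = [
    ("Chiefs", "Brumbies", 3.30),
    ("Rebels", "Waratahs", -12.10),
    ("Bulls", "Hurricanes", 2.70),
    ("Highlanders", "Crusaders", -7.10),
    ("Reds", "Force", 0.30),
    ("Stormers", "Blues", 6.30),
    ("Sharks", "Lions", 10.90),
]

winners = [predicted_winner(home, away, margin) for home, away, margin in round2]
for (home, away, margin), winner in zip(round2, winners):
    print(f"{home} vs. {away}: {winner} by {abs(margin):.2f}")
```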


February 16, 2015


February 14, 2015

Run and find out, but guess first has a post on calendar patterns. Yuri Victor noticed that, this year, February is a nice rectangular shape on a calendar (as happens whenever it starts on Sunday in a non-leap year), and wondered how often this happened. This is the sort of question where you can easily find out the answer, so he did:

I decided to see if this occurs often so I wrote some code and found out it happens more than I thought. In the past 100 years, there have been 11 Februaries that make a rectangle.

He also noticed that February 13th would be Friday when this happened and wondered how often we got a Friday 13th:

Friday the 13ths also happen more than I thought. In the past 100 years there have been 171 Friday the 13ths, which means there is one to two a year.

This is a Good Thing. We want journalists wondering about patterns and looking up data to check them. We don’t want them being required to call an expert in calendars to give a quote. It’s also a Good Thing that he tells us his expectations were wrong.

It would be even better, though, if he’d tried to work out a quantitative guess and told us. The simplest guess would be that, in the long run, February 1 is a Sunday as often as any other day, and that the 13th of a month is a Friday as often as any other day. These are natural guesses because there’s no special reason the year or a particular month should start on a particular day of the week.

In 100 years there are 1200 months, and 1200/7 is 171.4, so it looks as though Friday 13th happens in almost exactly 1/7 of months. In the past 100 years there are 75 Februaries with 28 days, and 75/7 is 10.7, so 28-day Februaries begin on Sunday almost exactly 1/7 of the time.
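
If you want to run and find out yourself, something like the following Python sketch would do it. This is my own illustration, not Yuri Victor's code, and it assumes "the past 100 years" means 1916 to 2015 inclusive; a different window could shift the counts by one or two. It puts the actual counts next to the one-in-seven guess.

```python
import calendar
from datetime import date


def is_rectangular_february(year):
    """True if February has 28 days and starts on a Sunday.

    On a Sunday-first calendar layout this fills exactly four rows.
    """
    # date.weekday(): Monday is 0, ..., Friday is 4, ..., Sunday is 6
    return not calendar.isleap(year) and date(year, 2, 1).weekday() == 6


def friday_13ths(year):
    """Months (1-12) of the given year in which the 13th is a Friday."""
    return [m for m in range(1, 13) if date(year, m, 13).weekday() == 4]


# Assumption: "the past 100 years" is taken to be 1916-2015 inclusive.
years = range(1916, 2016)

rectangles = [y for y in years if is_rectangular_february(y)]
fridays = sum(len(friday_13ths(y)) for y in years)
short_februaries = sum(not calendar.isleap(y) for y in years)

print(f"Rectangular Februaries: {len(rectangles)} "
      f"(one-in-seven guess: {short_februaries / 7:.1f})")
print(f"Friday the 13ths: {fridays} "
      f"(one-in-seven guess: {12 * len(years) / 7:.1f})")
```

Because the Gregorian calendar repeats exactly every 400 years, running the same functions over any 400 consecutive years gives the true long-run frequencies rather than a 100-year sample.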

You wouldn’t always expect the simplest possible explanation to hold. For example, the date of Passover is set based on the solar and lunar calendars, in a 19-year cycle. Since 7 doesn’t divide 19, you’d expect either that the days of the week didn’t divide up equally or that they took a long time (requiring lots of leap years) to do so.


February 13, 2015

Super 15 Predictions for Round 1

Team Ratings for Round 1

Sorry about the delay in posting these predictions. I have never been really happy with my selection of parameters, so decided to do a massive grid search yesterday which ended up taking about 16 hours. That means this year I have shiny new parameter values for my predictions. As ever, remember the adage, “past performance is not necessarily indicative of future behaviour”.

The basic method is described on my Department home page.

Here are the team ratings prior to this week’s games, along with the ratings at the start of the season.

Team          Current Rating  Rating at Season Start  Difference
Crusaders              10.42                   10.42        0.00
Waratahs               10.00                   10.00        0.00
Sharks                  3.91                    3.91       -0.00
Hurricanes              2.89                    2.89       -0.00
Bulls                   2.88                    2.88        0.00
Chiefs                  2.23                    2.23        0.00
Brumbies                2.20                    2.20       -0.00
Stormers                1.68                    1.68       -0.00
Blues                   1.44                    1.44        0.00
Highlanders            -2.54                   -2.54       -0.00
Lions                  -3.39                   -3.39       -0.00
Force                  -4.67                   -4.67        0.00
Reds                   -4.98                   -4.98        0.00
Cheetahs               -5.55                   -5.55       -0.00
Rebels                 -9.53                   -9.53       -0.00


Predictions for Round 1

Here are the predictions for Round 1. The prediction is my estimated expected points difference with a positive margin being a win to the home team, and a negative margin a win to the away team.

#  Game                  Date    Winner      Prediction
1  Crusaders vs. Rebels  Feb 13  Crusaders        24.40
2  Brumbies vs. Reds     Feb 13  Brumbies         11.20
3  Lions vs. Hurricanes  Feb 13  Hurricanes       -1.80
4  Blues vs. Chiefs      Feb 14  Blues             3.20
5  Sharks vs. Cheetahs   Feb 14  Sharks           13.50
6  Bulls vs. Stormers    Feb 14  Bulls             5.20
7  Waratahs vs. Force    Feb 15  Waratahs         18.70


Misunderstanding genetic heritability

From the Herald, under the headline “Is this why we’re all getting fat?”

According to the UN’s World Health Organisation, obesity nearly doubled worldwide from 1980 to 2008.

More than 2.8 million adults die each year as a result of being overweight or obese, it says. A full 42 million children under the age of five are considered to be obese.

Diet and a sedentary lifestyle have long been fingered as causes of obesity, but in recent years, advances in gene sequencing have turned attention to inheritance.

Previous studies have variously estimated genes as being to blame for between 40 and 70 per cent of the problem.

Every sentence here is true, but the impression is completely wrong.

The 40-70% genetic contribution to weight is comparing different individuals in basically the same environment.  The ‘obesity epidemic’ is comparing whole populations over time.  One thing we know can’t possibly explain the recent increases in obesity is genetics: there hasn’t been time for the genes of these populations to change.
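
A toy simulation makes the distinction concrete. Everything in this Python sketch is made up for illustration: weight is an arbitrary score equal to a genetic part plus an environmental part, the same simulated people are placed in an "earlier" and a "later" environment, and the later environment simply adds a constant to everyone. The genetic share of the variation between individuals is about half in both periods, yet the whole increase in the average comes from the environment.

```python
import random
from statistics import mean, pvariance

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000

# Hypothetical genetic and individual-environment contributions (arbitrary units).
genes = [rng.gauss(0, 1) for _ in range(n)]
noise = [rng.gauss(0, 1) for _ in range(n)]


def population(env_shift):
    """Weights for the same people in an environment shifted by env_shift."""
    return [g + e + env_shift for g, e in zip(genes, noise)]


def genetic_share(weights):
    """Fraction of between-person variance attributable to the genetic part.

    A deliberately crude stand-in for heritability in this toy model.
    """
    return pvariance(genes) / pvariance(weights)


earlier = population(env_shift=0.0)
later = population(env_shift=2.0)  # same genes, different environment

print(f"Genetic share, earlier: {genetic_share(earlier):.2f}")
print(f"Genetic share, later:   {genetic_share(later):.2f}")
print(f"Change in average:      {mean(later) - mean(earlier):.2f}")
```

The genetic share describes differences between people within one environment, so it is unchanged by a shift that affects everyone, and it says nothing about why the average moved.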

February 12, 2015

Two types of brain image study

If a brain imaging study finds greater activation in the asymmetric diplodocus region or increased thinning in the posterior homiletic, what does that mean?

There are two main possibilities. Some studies look at groups who are different and try to understand why. Other studies try to use brain imaging as an alternative to measuring actual behaviour. The story in the Herald (from the Washington Post), “Benefit of kids’ music lessons revealed – study” is the second type.

The researchers looked at 334 MRI brain images from 232 young people (so mostly one each, some with two or three), and compared the age differences in young people who did or didn’t play a musical instrument.  A set of changes that happens as you grow up happened faster for those who played a musical instrument.

“What we found was the more a child trained on an instrument,” said James Hudziak, a professor of psychiatry at the University of Vermont and director of the Vermont Center for Children, Youth and Families, “it accelerated cortical organisation in attention skill, anxiety management and emotional control.”

An obvious possibility is that kids who play a musical instrument have different environments in other ways, too.  The researchers point this out in the research paper, if not in the story.  There’s a more subtle issue, though. If you want to measure attention skill, anxiety management, or emotional control, why wouldn’t you measure them directly instead of measuring brain changes that are thought to correlate with them?

Finally, the effect (if it is an effect) on emotional and behavioural maturation (if it is on emotional and behavioural maturation) is very small. Here’s a graph from the paper


The green dots are the people who played a musical instrument; the blue dots are those who didn’t.  There isn’t any dramatic separation or anything — and to the extent that the summary lines show a difference it looks more as if the musicians started off behind and caught up.


  • Ways of visualising uncertainty in statistics, from Visualising Data
  • Football competes with internet porn for audience: analysis from Pornhub
    The zero line is ‘average day and time’: a better comparison would have been a typical winter Sunday.
  • The New Yorker, on the problems with so-called precision medicine: The pace of genetics research, the variability of test methods and results, and the aura of infallibility with which the tests are marketed, she told me, make this advance a more complicated one than the EKG.  But, as the demand for DNA testing increases, she says, “it will probably be a bit worse before it gets better.”
  • A panel of the Institute of Medicine has come out with a definition, diagnostic criteria, and a new name for ‘chronic fatigue syndrome’.  The question wasn’t whether people were sick — that’s pretty obvious. The question was which set of people have the same thing wrong with them, and how to tell.  It’s a statistical issue because a definition leads to counting people who satisfy it.
  • It sees you when you’re sleeping; it knows when you’re awake: smart power meters on the front page of the Dominion Post.  (It also sends you lots of email whenever you alter your habits, eg, by travelling).
  • “There’s no plague on the New York subway. No platypuses either”.  Ed Yong on false positives in DNA testing. His team swabbed tomato plants in a field in Virginia, analysed the DNA in those samples, and found matches to the duck-billed platypus—an Australian animal, not known to live in Virginia. They then analysed over 19,000 publicly available microbiome samples from around the world; around a third threw up matches for platypus DNA. Either the platypus secretly rules the world or, more likely, this was a hilarious case of false positives gone mad.
  • NHS Choices makes StatsChat look tactful and friendly: they are going after the newspapers on Twitter
  • “But these headlines are without serious foundation, and through no fault of the journalists.”  David Spiegelhalter on UK coverage of a study of health associations with low-level alcohol consumption.
  • How to release data in a spreadsheet: Send this to everyone you know who releases data, or just put it on your blog in a passive-aggressive way. The key point is that data release is different from data presentation.