January 18, 2018

Predicting the future

As you’ve heard if you’re in NZ, the Treasury got the wrong numbers for the predicted impact of Labour’s policies on child poverty (and, as you might not have heard, similarly wrong numbers for the previous government’s policies).

Their ‘technical note’ is useful:

In late November and early December 2017, a module was developed to further improve the Accommodation Supplement analysis. This was applied to both the previous Government’s package and the current Government’s Families Package. The coding error occurred in this “add-on module” – in a single line of about 1000 lines of code.

The quality-assurance (QA) process for the add-on module included an independent review of the methodology by a senior statistician outside the Treasury’s microsimulation modelling team, multiple layers of code review, and an independent replication of each stage by two modellers. No issues were identified during this process.

I haven’t seen their code, but I have seen other microsimulation models and as a statistics researcher I’m familiar with the problem of writing and testing code that does a calculation you don’t have any other way to do. In fact, when I got called by Newstalk ZB about the Treasury’s error I was in the middle of talking to a PhD student about how to check code for a new theoretical computation.

It’s relatively straightforward to test code when you know what the output should be for each input: you put in a set of income measurements and see if the right tax comes out, or you click on a link and see if you get taken to the right website, or you shoot the Nazi and see if his head explodes. The most difficult part is thinking of all the things that need to be checked.  It’s much harder when you don’t know what the output should even be because the whole point of writing the code is to find out.
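To make that concrete, here’s a minimal sketch of a known-answer test in Python. The tax function, brackets and rates are all invented for illustration, not anyone’s actual tax scale.

```python
# Hypothetical progressive tax: 10% up to $50,000, then 30% on the rest.
def income_tax(income):
    if income <= 50_000:
        return 0.10 * income
    return 0.10 * 50_000 + 0.30 * (income - 50_000)

def test_income_tax():
    # Cases where the right answer can be worked out by hand
    assert round(income_tax(0), 2) == 0
    assert round(income_tax(50_000), 2) == 5_000
    assert round(income_tax(60_000), 2) == 8_000  # 5,000 plus 30% of the 10,000 over the threshold

test_income_tax()
```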

You can test chunks of code that are small enough to be simple. You can review the code and try to see if it matches the process that you’re working from. You might be able to work out special cases in some independent way. You can see if the outputs change in sensible ways when you change the inputs. You can get other people to help. And you do all that. And sometimes it isn’t enough.
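When there’s no known answer, the checks look more like this sketch: a special case you can reason about independently, and a direction-of-change check when you vary the inputs. The model function here is a stand-in I’ve made up, not anything from the Treasury’s code.

```python
import random

def modelled_poverty_rate(benefit_increase, incomes, threshold=25_000):
    """Stand-in for a model whose output we can't verify directly:
    the share of households still under the threshold after the increase."""
    return sum((inc + benefit_increase) < threshold for inc in incomes) / len(incomes)

random.seed(1)
incomes = [random.lognormvariate(10.3, 0.6) for _ in range(10_000)]

# Special case we can work out another way: a huge increase should leave nobody under the threshold
assert modelled_poverty_rate(10_000_000, incomes) == 0

# Direction check: a bigger increase should never make the modelled rate worse
assert modelled_poverty_rate(5_000, incomes) <= modelled_poverty_rate(1_000, incomes)
```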

The Treasury say that they typically try to do more

This QA process, however, is not as rigorous as independent co-production, which is used for modifications of the core microsimulation model.  Independent co-production involves two people developing the analysis independently, and cross-referencing their results until they agree. This significantly reduces the risk of errors, but takes longer and was not possible in the time available.

That’s a much stronger verification approach.  Personally, I’ve never gone as far as complete independent co-production, but I have done partial versions and it does make you much more confident about the results.
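As a toy sketch of the cross-checking idea (not the Treasury’s actual calculation), imagine two modellers coding the same made-up benefit abatement rule independently and comparing results over a grid of inputs until they agree:

```python
def abated_benefit_a(base, income, threshold=5_200, rate=0.25):
    """Modeller A: reduce the benefit by 25c per dollar of income over the threshold."""
    excess = max(income - threshold, 0)
    return max(base - rate * excess, 0)

def abated_benefit_b(base, income, threshold=5_200, rate=0.25):
    """Modeller B: the same rule, coded independently."""
    if income <= threshold:
        return base
    return max(base - (income - threshold) * rate, 0)

# Cross-reference the two implementations over a grid of inputs
for income in range(0, 100_001, 500):
    a = abated_benefit_a(10_000, income)
    b = abated_benefit_b(10_000, income)
    assert abs(a - b) < 1e-9, (income, a, b)
```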

The problem with more rigorous testing approaches is that they take time and money and often end up just telling you that you were right. Being less extreme about it is often fine, but maybe isn’t good enough for government work.

Measuring what you care about

There’s a story in the Guardian saying

The credibility of a computer program used for bail and sentencing decisions has been called into question after it was found to be no more accurate at predicting the risk of reoffending than people with no criminal justice experience provided with only the defendant’s age, sex and criminal history.

They even link to the research paper.

That’s all well and good, or rather, not good. But there’s another issue that doesn’t even get raised.  The algorithms aren’t trained and evaluated on data about re-offending. They’re trained and evaluated on data about re-conviction: they have to be, because that’s all we’ve got.

Suppose two groups of people have the same rate of re-offending, but one group are more likely to get arrested, tried, and convicted than the other. The group with a higher re-conviction rate will look to the algorithm as if they have a higher chance of re-offending.   They’ll get a higher predicted probability of re-offending. Evaluation will confirm they’re more likely to have the “re-offending” box ticked in their subsequent data.  The model can look like it’s good at discriminating between re-offenders and those who go straight, when it’s actually just good at discriminating against the same people as the justice system.
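Here’s a tiny simulation of that point, with invented numbers; only the qualitative pattern matters.

```python
import random

random.seed(42)
N = 100_000
reoffend_rate = 0.30                    # identical true re-offending rate in both groups
p_convicted = {"A": 0.6, "B": 0.3}      # but group A is more likely to be convicted if they re-offend

reconvicted = {"A": 0, "B": 0}
for group in ("A", "B"):
    for _ in range(N):
        reoffends = random.random() < reoffend_rate
        if reoffends and random.random() < p_convicted[group]:
            reconvicted[group] += 1

for group in ("A", "B"):
    print(group, reconvicted[group] / N)
# Prints roughly 0.18 for A and 0.09 for B: the recorded "re-offending" rates
# differ by a factor of two even though the true rates are the same, and that
# is all the training data (and the evaluation data) can see.
```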

This isn’t an easy problem to fix: re-conviction data are what you’ve got. But when you don’t have the measurement you want, it’s important to be honest about it. You’re predicting what you measured, not what you wanted to measure.

Maps and models

This spectacular map from the National Geospatial-Intelligence Agency was circulating yesterday on Twitter. I got it from Christopher Jackson (@seis_matters). It shows antineutrino emissions from around the earth

Our local (sub)continent of Zealandia shows up nicely at the bottom right. The black dots are nuclear reactors, and the dark smudge is just the immense rock mass of the Himalayas.

This next map is a style you’ve seen before. It shows New Zealand’s winds at the moment: the storm is passing over.

What these maps have in common is a very high ratio of model to actual data. The ‘live’ wind map isn’t based on detailed live reports from a fine grid of weather stations. There aren’t any — especially out in the Pacific. It’s a map of the NOAA Global Forecast System, but forecasting the very near future rather than the long range. It isn’t going to give you more up-to-date information than the Met Service.

The antineutrino map is even more model-based. In the scientific paper I was struck by the sentence

Recently, the blossoming field of neutrino geoscience, first proposed by Eder [15], has become a reality with 130 observed geoneutrino interactions [12,13] confirming Kobayashi’s view of the Earth being a “neutrino star” [16].

It looks like the map has well over a million pixels per observed geophysical neutrino. When it comes to nuclear reactors, the paper says “These exciting geophysical capabilities have significant overlap with the non-proliferation community where remote monitoring of antineutrinos emanating from nuclear reactors is being seriously considered”. That is, the reactors are black dots on the map because they know where the reactors are and how many neutrinos they’d make, not because they measured them. The observations do go into the model, and they probably provide actual information about the deeper bits of the earth’s crust, but the map is of the model, not the observations.

Better or worse?

There was some controversy about the difficulty of the NCEA level 1 maths and stats exam last year.  As Stuff reports

It prompted the NZQA to release the exam to the public, and now the authority is taking the extra step to share the exam outcome before the consolidated results are released in April.

“NZQA has taken the unusual step of announcing these provisional results early so we can respond to the concerns teachers raised with us in the open letter,” said NZQA deputy chief executive Kristine Kilkelly.

“Provisional results for the NCEA Level 1 Mathematics and Statistics examinations in November show the majority of students who sat the examinations gained an Achieved or better grade for each standard.”

There’s a graph with the story, which is always nice:

I’m not convinced this graph is a great way of showing how the 2017 results differed from previous years: it’s better for showing that, yes, the majority of people passed.

Here’s my attempt at showing the 2017 differences: the arrows show the change from last year and the bars show the five-year range. I think it would have been better to just plot the four six-year time series, but that data wasn’t in the NCEA press release. It would also have been better to look at the ‘Merit’ and ‘Excellence’ percentages, but again that’s not given.
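For anyone who wants to reproduce the design, here’s a rough matplotlib sketch of that layout. The numbers are placeholders standing in for the real pass rates, and only 91028 is a genuine standard code from the story.

```python
import matplotlib.pyplot as plt

standards = ["91028", "Standard B", "Standard C", "Standard D"]   # 91028 from the story; rest placeholders
five_year_range = [(66, 76), (70, 78), (62, 72), (60, 74)]        # placeholder (min, max) pass rates, %
last_year = [72, 74, 68, 73]                                      # placeholder 2016 pass rates, %
this_year = [58, 73, 67, 62]                                      # placeholder 2017 pass rates, %

fig, ax = plt.subplots()
for i in range(len(standards)):
    lo, hi = five_year_range[i]
    ax.plot([i, i], [lo, hi], linewidth=8, alpha=0.3, color="grey")   # bar: five-year range
    ax.annotate("", xy=(i, this_year[i]), xytext=(i, last_year[i]),
                arrowprops=dict(arrowstyle="->"))                     # arrow: change from last year

ax.set_xticks(range(len(standards)))
ax.set_xticklabels(standards)
ax.set_ylabel("Percent achieving the standard")
plt.show()
```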

I think it’s clearer from this graph that the pass rate for “91028 Investigate Relationships Between Tables, Equations and Graphs” was lower last year, and lower by quite a large amount relative to the previous year-to-year variation. Two of the units have no sign of that sort of drop, and the fourth has a similar drop but from a high point to a value still within the recent range.

So, maybe there was an issue with the ‘tables, equations and graphs’ test.


Update: another redesign by Andrew P. Wheeler

January 14, 2018

Briefly

  • Metropolitan Museum of Art President: “For various reasons, over the past 10 or 12 years, the pay-as-you-wish policy has failed. It has declined by 71% in the amount people pay.” Felix Salmon: “It’s worth fact-checking this, because it turns out that it’s not really true.”
  • Cloudflare, a company that distributes websites across the world, has a wall of lava lamps that it uses for random number generation (presumably to seed computational pseudorandom generators)
  • “Do algorithms reveal sexual orientation or just expose our stereotypes?”— on last year’s ‘gaydar’ paper.
  • 538 looks at how they got an analysis of broadband internet availability wrong, due to bad data.
  • “The projects tried to show hidden patterns of our daily shopping….Unfortunately, it shows only the internal categorization and sorting of the supermarket.” Another example of data not meaning what you think it means. Christian Laesser (via FlowingData)
  • Child protective agencies are haunted when they fail to save kids. Pittsburgh officials believe a new data analysis program is helping them make better judgment calls. From the New York Times.
  • The NZ government has released a review of the handling of weather data (PDF)
  • From the LSE Impact blog: “Academics looking to communicate the findings and value of their research to wider audiences are increasingly going through the media to do so. But poor or incomplete reporting can undermine respect for experts by misrepresenting research, especially by trivialising or sensationalising it, or publishing under inappropriate headlines and with cherry-picked statistics.” As StatsChat readers will know, a lot of this is public-relations people, but some of it is definitely the researchers.
  • The scientific reporting of some pre-clinical research is disturbingly crap: a report in the BMJ; Siouxsie Wiles commenting at The Spinoff
  • Constructing optical illusions for AI visual systems: (gory technical details)
  • You may have seen reports of research saying that Australian hawks spread bushfires…

January 10, 2018

Complete balls

The UK’s Metro magazine has a dramatic story under the headline “Popping ibuprofen could make your balls shrivel up”.

Got a pounding headache?

You might just want to give a big glass of water and a nap a go before reaching for the painkillers. Scientists warn that ibuprofen could be wrecking men’s fertility by making their balls shrivel up.

Sounds pleasant.

Fortunately, that’s not what the study showed.

The story goes on

Researchers looked at 31 male participants and found that taking ibuprofen reduced production of testosterone by nearly a quarter in the space of around six weeks.

That’s also almost completely untrue. In fact, the research paper says (emphasis added)

We investigated the levels of total testosterone and its direct downstream metabolic product, 17β-estradiol. Administration of ibuprofen did not result in any significant changes in the levels of these two steroid hormones after 14 d or at the last day of administration at 44 d. The levels of free testosterone were subsequently analyzed by using the SHBG levels. Neither free testosterone nor SHBG levels were affected by ibuprofen.

Stuff has a much better take on this one:

Men who take ibuprofen for longer than the bottle advises could be risking their fertility, according to a new study.

Researchers found that men who took ibuprofen for extended periods had developed a condition normally seen in elderly men and smokers that, over time, can lead to fertility problems.

Ars Technica has the more accurately boring headline “Small study suggests ibuprofen alters testosterone metabolism”.

The study involved 14 men taking the equivalent of six tablets a day of ibuprofen for six weeks (plus a control group). Their testosterone levels didn’t change, but the interesting research finding is that this was due to compensation for what would otherwise have been a decrease. That is, a hormone signalling to increase testosterone production was elevated.  There’s a potential risk that if the men kept taking ibuprofen at this level for long enough, the compensation process might give up. And that would potentially lead to fertility problems — though not (I don’t think) to the problems Metro was worried about.

So, taking ibuprofen for months on end without a good reason? Probably inadvisable. Like it says on the pack.


January 9, 2018

Election maps: what’s the question?

XKCD has come out with a new map of the 2016 US election

In about 2008 I made a less-artistic one of the 2004 elections on similar principles

These maps show some useful things about the US vote:

  1. the proportions for the two parties are pretty close, but
  2. most of the land area has very few voters, and
  3. most areas are relatively polarised
  4. but not as polarised as you think, eg, look at the cities in Texas

What these maps are terrible at is showing changes from one election to the next. The maps for 2004 (Republicans ahead by about 2.5%) and 2016 (Republicans behind by about 3%) look very similar. And even 2008 (Republicans behind by 7%) wouldn’t look that different.

Like a well-written thousand words, a well-drawn picture needs to be about something. Questions matter. The data don’t speak for themselves.

January 8, 2018

Long tail of baby names

The Dept of Internal Affairs has released the most common baby names of 2017 (NZ is, I think, the first country to do this each year), and Radio NZ has a story. A lot of names popular last year were also popular in the past; a few (eg Arlo) are changing fast.

If you look at the sixty-odd years of data available, there’s a dramatic trend. In 1954, ‘John’ was the top boy’s name, with 1389 uses. In 2017 the top was ‘Oliver’, but with only 314 uses — not enough to make 1954’s top twenty. According to the government, there were nearly 13,000 different names given last year, so the mean number of babies per name is under 5; the most popular names are still much more popular than average. But less so than in the past.

Here’s the trend in the number of babies given the top name

and the top ten names

and the top hundred names

That decrease is despite an increase in the total population: here’s the top 10 names as a percentage of all babies (assuming 53% of babies are boys)

and the top 100 names

The proportion with any of the top 100 names has been going down consistently, and also becoming less different between boys and girls.
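The calculation behind those percentage plots is just this. The counts below are placeholders (apart from Oliver’s 314), and the 60,000 total is a round stand-in for the annual number of births, with the 53% boys assumption from above.

```python
top10_boy_counts = [314, 300, 290, 280, 270, 260, 255, 250, 245, 240]  # placeholders apart from 314
total_births = 60_000                                                  # placeholder for annual births
boy_births = 0.53 * total_births                                       # the 53% assumption from the text

pct_top10 = 100 * sum(top10_boy_counts) / boy_births
print(f"Top-10 names cover about {pct_top10:.1f}% of boys")
```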


Not dropping every year

Stuff has a story on road deaths, where Julie Ann Genter claims the Roads of National Significance are partly responsible for the increase in death rates. Unsurprisingly, Judith Collins disagrees.  The story goes on to say (it’s not clear if this is supposed to be indirect quotation from Judith Collins)

From a purely statistical viewpoint the road toll is lowering – for every 10,000 cars on the road, the number of deaths is dropping every year.

From a purely statistical viewpoint, this doesn’t seem to be true. The Ministry of Transport provides tables that show a rate of fatalities per 10,000 registered vehicles of 0.077 in 2013, 0.086 in 2014,  0.091 in 2015, and  0.090 in 2016. Here’s a graph, first raw

and now with a fitted trend (on a log scale, since the trend is straighter that way)
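Here’s a sketch of that log-scale trend fit, using the four rates quoted above (the actual fitted line in the graph may differ in detail):

```python
import math

years = [2013, 2014, 2015, 2016]
rates = [0.077, 0.086, 0.091, 0.090]   # fatalities per 10,000 registered vehicles, from the Ministry of Transport tables
log_rates = [math.log(r) for r in rates]

# Least-squares line through log(rate), i.e. a constant percentage change per year
xbar = sum(years) / len(years)
ybar = sum(log_rates) / len(log_rates)
sxx = sum((x - xbar) ** 2 for x in years)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(years, log_rates))
slope = sxy / sxx

print(f"Fitted trend: {100 * (math.exp(slope) - 1):+.1f}% per year")
# With these four points the fitted rate rises by roughly 5% a year,
# which is hard to square with "dropping every year".
```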

Now, it’s possible there’s some other way of defining the rate that doesn’t show it going up each year. And there’s a question of random variation as always. But if you scale by how much vehicles are actually on the road, using total distance travelled, then, as we saw last year, there’s pretty convincing evidence of an increase in the underlying rate, over and above random variation.

The story goes on to say “But Genter is not buying into the statistics.” If she’s planning to make the roads safer, I hope that isn’t true.

Briefly

  • “Every now and then a story appears in the media about how boffins (and it is always “boffins”) have worked out an equation for something: the perfect cup of tea, the most depressing day of the year, the best way to make pancakes, the perfect handshake, or in the most recent case, the perfect cheese on toast.” The equation for the perfect bullshit equation.
  • The BBC’s statistics-in-the-media radio program More or Less has a special ‘statistics of the year’ episode
  • Some interesting student projects from a data visualisation class
  • How Spotify picks your music.
  • “Average London”: averages of tourist photos of the same London attraction.
  • Displaying uncertainty in the UK unemployment rate
  • One of the problems in training modern neural network classifiers is that they will pick up on anything, sensible or not. Luke Oakden-Rayner writes about a popular set of data from chest x-rays and why it won’t teach the computers the right things.
  • The American Academy of Family Physicians is not endorsing new blood pressure standards that would increase the proportion of US adults defined as having hypertension from about 1/3 to about 1/2.