January 18, 2018

Predicting the future

As you’ve heard if you’re in NZ, the Treasury got the wrong numbers for predicted impact on child poverty of Labour’s policies (and as you might not have heard, similarly wrong numbers for the previous government’s policies).

Their ‘technical note’ is useful:

In late November and early December 2017, a module was developed to further improve the Accommodation Supplement analysis. This was applied to both the previous Government’s package and the current Government’s Families Package. The coding error occurred in this “add-on module” – in a single line of about 1000 lines of code.

The quality-assurance (QA) process for the add-on module included an independent review of the methodology by a senior statistician outside the Treasury’s microsimulation modelling team, multiple layers of code review, and an independent replication of each stage by two modellers. No issues were identified during this process.

I haven’t seen their code, but I have seen other microsimulation models and as a statistics researcher I’m familiar with the problem of writing and testing code that does a calculation you don’t have any other way to do. In fact, when I got called by Newstalk ZB about the Treasury’s error I was in the middle of talking to a PhD student about how to check code for a new theoretical computation.

It’s relatively straightforward to test code when you know what the output should be for each input: you put in a set of income measurements and see if the right tax comes out, or you click on a link and see if you get taken to the right website, or you shoot the Nazi and see if his head explodes. The most difficult part is thinking of all the things that need to be checked.  It’s much harder when you don’t know what the output should even be because the whole point of writing the code is to find out.
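As a minimal sketch of the easy case, here is what a known-answer test might look like in Python. The two-bracket tax schedule and the name `income_tax` are made up for illustration and are not any real tax rule; the point is only that each expected value can be worked out by hand before the code is run.

```python
# Hypothetical two-bracket tax schedule, invented for illustration only.
def income_tax(income: float) -> float:
    """Tax of 10% on the first 10,000 and 30% on anything above."""
    if income < 0:
        raise ValueError("income must be non-negative")
    lower = min(income, 10_000)
    upper = max(income - 10_000, 0)
    return 0.10 * lower + 0.30 * upper

# Known-answer checks: each expected value was worked out by hand.
cases = [(0, 0.0), (5_000, 500.0), (10_000, 1_000.0), (20_000, 4_000.0)]
for income, expected in cases:
    assert abs(income_tax(income) - expected) < 1e-9
```

The hard part here is choosing the cases (zero, the bracket boundary, both sides of it), not checking the answers.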

You can test chunks of code that are small enough to be simple. You can review the code and try to see if it matches the process that you’re working from. You might be able to work out special cases in some independent way. You can see if the outputs change in sensible ways when you change the inputs. You can get other people to help. And you do all that. And sometimes it isn’t enough.
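A rough sketch of what some of those checks can look like when the true answer is unknown. The toy model below (a flat payment to every household and a fixed poverty line, in a function called `poverty_rate`) is entirely hypothetical and far simpler than any real microsimulation; it only illustrates checking a special case, the direction of change, and an extreme input.

```python
import random

# Hypothetical toy model: every household receives the same flat payment,
# and a household counts as poor if its income is below a fixed line.
def poverty_rate(incomes, payment, poverty_line=20_000):
    poor = sum(1 for income in incomes if income + payment < poverty_line)
    return poor / len(incomes)

random.seed(1)
incomes = [random.uniform(0, 60_000) for _ in range(1_000)]

# Special case worked out independently: with no payment, the rate is
# just the share of raw incomes below the line.
baseline = len([x for x in incomes if x < 20_000]) / len(incomes)
assert poverty_rate(incomes, 0) == baseline

# Sensible direction: a larger payment should never raise the rate.
rates = [poverty_rate(incomes, p) for p in range(0, 25_000, 1_000)]
assert all(later <= earlier for earlier, later in zip(rates, rates[1:]))

# Extreme case: a payment at least as large as the line leaves nobody poor.
assert poverty_rate(incomes, 20_000) == 0.0
```

Passing checks like these rules out some kinds of error, but it does not show the number in the middle is right.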

The Treasury say that they typically try to do more:

This QA process, however, is not as rigorous as independent co-production, which is used for modifications of the core microsimulation model.  Independent co-production involves two people developing the analysis independently, and cross-referencing their results until they agree. This significantly reduces the risk of errors, but takes longer and was not possible in the time available.

That’s a much stronger verification approach.  Personally, I’ve never gone as far as complete independent co-production, but I have done partial versions and it does make you much more confident about the results.
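To give a flavour of the idea, here is a small sketch of comparing two independently written versions of the same calculation. The benefit rule, its constants, and the function names `payment_a`, `payment_b`, and `disagreements` are all invented for illustration; this is not the Treasury’s process or code, only an assumption about what a simple cross-check could look like.

```python
import random

# Hypothetical benefit rule, invented for illustration: a maximum payment
# that abates by 25 cents for each dollar of income over a threshold.
MAX_PAYMENT, THRESHOLD, ABATEMENT = 5_000.0, 30_000.0, 0.25

def payment_a(income):
    """First modeller: explicit branches."""
    if income <= THRESHOLD:
        return MAX_PAYMENT
    reduced = MAX_PAYMENT - ABATEMENT * (income - THRESHOLD)
    return reduced if reduced > 0 else 0.0

def payment_b(income):
    """Second modeller: a single clamped formula."""
    excess = max(income - THRESHOLD, 0.0)
    return max(MAX_PAYMENT - ABATEMENT * excess, 0.0)

def disagreements(f, g, inputs, tol=1e-9):
    """Return the inputs on which the two implementations differ."""
    return [x for x in inputs if abs(f(x) - g(x)) > tol]

random.seed(2)
test_incomes = [random.uniform(0, 80_000) for _ in range(10_000)]
# Include the boundaries, where independent versions most often diverge.
test_incomes += [0.0, THRESHOLD, THRESHOLD + MAX_PAYMENT / ABATEMENT]

mismatches = disagreements(payment_a, payment_b, test_incomes)
assert mismatches == [], f"implementations disagree at {mismatches[:5]}"
```

Agreement only helps to the extent the two versions really are independent: if both people misread the rule the same way, the comparison will pass anyway.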

The problem with more rigorous testing approaches is they take time and money and often end up just telling you that you were right.  Being less extreme about it is often fine, but maybe isn’t good enough for government work.

Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.

Comments

  • Stephen McNeill

    I’m sympathetic to the poor person who had to write the code, and mindful of the time pressure they were probably (possibly) under at the time. But, Treasury is an operational agency of Government, and resources ought to be available to ensure adequate quality, even if 100% error-free results cannot be guaranteed. Sympathetic peer review is one way to do this of course, but there are many methods around.

    I was less impressed by a comment from a Treasury official on Morning Report who said the error was (excuse my memory) only on one line, one of several thousand lines. Hardly anyone, he suggested, would notice.

    It really doesn’t matter whether the task involved 10, 10000, or 10 million lines of code, the issue is with the results. Of course, as the complexity of the task increases it becomes more difficult to detect and track down the source of an error if there is one, but it is a manageable task, even if it’s difficult.

    1 month ago

  • If they’re sure it’s only one error, why would it take a month to produce new results?

    1 month ago

  • Richard Penny

    Interesting that it’s a debate about a point estimate. My first comment is that any statistical model has uncertainty, so it would be interesting to see how the change compares with the variance.

    My second point is that any economic model is very much ceteris paribus, and very rarely is it like that even in the short term (i.e. 12 months), let alone over the time period the projections cover. So the model projections are useful but shouldn’t be regarded as the last word.

    My third point is that I am more concerned that they collect policy-based evidence to check the outcomes of the policy. That’s sure to find stuff that should have been in the model (plus stuff they could leave out).

    4 weeks ago