Posts filed under Estimation (18)

May 16, 2013

Back to my favourite topic – beer

BeerVis Graph

Here is a site to show with a flourish when your friends tell you at the pub that studying Statistics is no use. LifeHacker reports that BeerViz attempts to use historical data collected by BeerAdvocate, and presumably a statistical model, to suggest new beers to you based on what you already like. If they’re not using a statistical model then there is a great challenge for you loyal readers!

April 8, 2013

Briefly

  • Interesting post on how extreme income inequality is. The distribution is compared to a specific probability model, a ‘power law’, with the distribution of earthquake sizes given as another example. Unfortunately, although the ‘long tail’ point is valid, the ‘power law’ explanation is more dubious.   Earthquake sizes and wealth are two of the large number of empirical examples studied by Aaron Clauset, Cosma Shalizi, and Mark Newman, who find the power law completely fails to fit the distribution of wealth, and is not all that persuasive for earthquake sizes. As Cosma writes

If you use sensible, heavy-tailed alternative distributions, like the log-normal or the Weibull (stretched exponential), you will find that it is often very, very hard to rule them out. In the two dozen data sets we looked at, all chosen because people had claimed they followed power laws, the log-normal’s fit was almost always competitive with the power law, usually insignificantly better and sometimes substantially better. (To repeat a joke: Gauss is not mocked.)

April 1, 2013

Briefly

Despite the date, this is not in any way an April Fools post

  • “Data is not killing creativity, it’s just changing how we tell stories”, from Techcrunch
  • Turning free-form text into journalism: Jacob Harris writes about an investigation into food recalls (nested HTML tables are not an open data format either)
  • Green labels look healthier than red labels, from the Washington Post. When I see this sort of research I imagine the marketing experts thinking “how cute, they figured that one out after only four years”
  • Frances Woolley debunks the recent stories about how Facebook likes reveal your sexual orientation (with comments from me).  It’s amazing how little you get from the quoted 88% accuracy, even if you pretend the input data are meaningful.  There are some measures of accuracy that you shouldn’t be allowed to use in press releases.

March 29, 2013

Unclear on the concept: average time to event

One of our current Stat of the Week nominations is a story on Stuff claiming that criminals sentenced to preventive detention are being freed after an average of ‘only’ 11 years.

There’s a widely-linked story in the Guardian claiming that the average time until Google kills new services is 1459 days, based on services that have been cancelled in the past.  The story even goes on to say that more recent services have been cancelled more quickly.

As far as I know, no-one has yet produced a headline saying that the average life expectancy  for people born in the 21st century is only about 5 years, but the error in reasoning would be the same.

In all three cases, we’re interested in the average time until some event happens, but our data are incomplete, because the event hasn’t happened for everyone.  Some Google services are still running; some preventive-detention cases are still in prison; some people born this century are still alive.  A little thought reveals that the events which have occurred are a biased sample: they are likely to be the earliest events.   The 21st century kids who will live to 90 are still alive; those who have already died are not representative.
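A small simulation makes the bias concrete. This is only a sketch under assumptions that are not in the stories above: lifetimes are drawn from an exponential distribution with a true mean of 10 years, start dates are spread evenly over an 8-year observation window, and the function name `naive_mean_of_observed_events` is invented for illustration.

```python
import random

def naive_mean_of_observed_events(true_mean=10.0, window=8.0, n=20_000, seed=0):
    """Average lifetime among only those items whose event we have already seen."""
    rng = random.Random(seed)
    seen = []
    for _ in range(n):
        start = rng.uniform(0.0, window)             # when this item started
        lifetime = rng.expovariate(1.0 / true_mean)  # its true time to event
        follow_up = window - start                   # how long we could watch it
        if lifetime <= follow_up:                    # event happened before "today"
            seen.append(lifetime)
    return sum(seen) / len(seen)

naive = naive_mean_of_observed_events()
print(f"true mean: 10.0 years, mean of observed events: {naive:.1f} years")
```

No event observed inside an 8-year window can be longer than 8 years, so the average of the observed events has to fall below the true mean of 10, however many items are simulated.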

In medical statistics, the proper handling of times to death, to recurrence, or to recovery is a routine problem.  It’s still not possible to learn as much as you’d like without assumptions that are often unreasonable. The most powerful assumption you can make is that the rate of events is constant over time, in which case the life expectancy is the total observed time divided by the total number of events — you need to count all the observed time, even for the events that haven’t happened yet.  That is, to estimate the survival time for Google services, you add up all the time that all the Google services have operated, and divide by the number that have been cancelled.  People in the cricket-playing world will recognise this as the computation used for batting averages: total number of runs scored, divided by total number of times out.
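As a minimal sketch of that calculation: the durations below are made up for illustration (they are not real Google data), and `constant_rate_life_expectancy` is a hypothetical helper name.

```python
def constant_rate_life_expectancy(durations, event_occurred):
    """Total observed time divided by number of events (constant-rate assumption)."""
    n_events = sum(1 for happened in event_occurred if happened)
    if n_events == 0:
        raise ValueError("no events observed, so the estimate is undefined")
    return sum(durations) / n_events

# Hypothetical data: years each service has been observed, and whether it was cancelled.
years_observed = [2.0, 4.0, 3.0, 15.0, 8.0, 12.0]
cancelled = [True, True, True, False, False, False]

# The headline-style number: average lifetime of the cancelled services only.
naive = sum(y for y, c in zip(years_observed, cancelled) if c) / sum(cancelled)

# The constant-rate estimate: all observed time, including services still running.
estimate = constant_rate_life_expectancy(years_observed, cancelled)

print(naive, estimate)
```

With these invented numbers the cancelled-only average is 3 years, while counting the time contributed by the survivors gives about 14.7 years.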

The simple estimator is often biased, since the risk of an event may increase or decrease with time.  A new Google service might be more at risk than an established one; a prisoner detained for many years might be less likely to be released than a more recent convict.  Even so, using it distinguishes people who have paid some attention to the survivors from those who haven’t.

I can’t be bothered chasing down the history of all the Google services, but if we add in search (from 1997),  Adwords (from 2000), image search (2001), news (2002),  Maps, Analytics, Scholar, Talk, and Transit (2005), and count Gmail only from when it became open to all in 2007, we increase the estimated life expectancy for a Google service from the 4 years quoted in the Guardian to about 6.5 years.  Adding in other still-live services can only increase this number.

For a serious question such as the distribution of time in preventive detention you would need to consider trends over time, and differences between criminals, and the simple constant-rate model would not be appropriate.  You’d need a bit more data, unless what you wanted was just a headline.

February 25, 2013

Economic data mining needs theory

Via the new Observational Epidemiology blog, it is possible to talk about stochastic complexity in reasonably plain English.

January 21, 2013

Seasonal units of measurements

Stuff says (complete with cute photo)

The birth of a rare Nepalese red panda baby, weighing not much more than a tomato, has thrilled Auckland Zoo keepers.

Hmm.

pandasize

Especially given all the fuss last year about New Zealanders’ ignorance of vegetables, perhaps “weighing a bit less than an iPhone” would be more informative.

November 22, 2012

Fly away home

With the summer holiday season approaching we’ve had requests for a post on the relative safety of driving and flying.

To a large extent this depends on where you are going: if you’re heading from Auckland to the Coromandel then I’d recommend driving, but if you want to spend some time on a beach in the Cook Islands your chances of getting there safely by car are distressingly low.

Clearly we need to rephrase the question.  Two possibilities are:

  • for a destination where either flying or driving makes sense, which one is safer?
  • if you compare a typical holiday road-trip to a typical holiday flight, which is safer?

We should also think about what risks to include: for a long plane flight the chance of a pulmonary embolism is higher than a crash, possibly much higher depending on your other risk factors.

The risk of a ‘fatal incident’ on a flight is largely independent of the length of the flight, and based on US data is about eight deaths per hundred million flights.  The risk is probably lower in NZ, since the figure includes the September 11 terrorist attacks.

The risk of death from car crash when driving in the US is about 4 per billion kilometers.  I don’t have good figures for NZ, but it’s a bit higher here. On the other hand, there’s a lot of variation depending on how you drive.

So, for a trip of 500km (eg, Auckland-Wellington), we’re looking at an average figure of about eight deaths in crashes per hundred million flights and about 200 deaths in crashes per hundred million car trips. Flying wins by a huge margin.
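The arithmetic behind that comparison, written out as a short sketch using only the figures quoted above (the variable names are mine):

```python
# Figures quoted in the post (US data).
driving_deaths_per_billion_km = 4
flight_deaths_per_100m_flights = 8
trip_km = 500  # roughly Auckland-Wellington

# Deaths per billion car trips of this length, then rescaled to per hundred million.
driving_deaths_per_billion_trips = driving_deaths_per_billion_km * trip_km
driving_deaths_per_100m_trips = driving_deaths_per_billion_trips / 10

ratio = driving_deaths_per_100m_trips / flight_deaths_per_100m_flights
print(driving_deaths_per_100m_trips)  # 200.0
print(ratio)                          # 25.0
```

On these averages the 500km drive carries about 25 times the crash risk of the flight.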

University of Otago research estimates the risk of pulmonary embolism at about 0.5 per million short flights and about 1.3 per million long flights.  Estimates of the risk of death with pulmonary embolism in modern times seem to be around 10-20%, giving death rates of about 5-10 per hundred million short flights or 12-25 per hundred million long flights.  Flying still wins for the Auckland-Wellington route, even if driving doesn’t increase pulmonary embolism risk at all (it probably does increase it, but by less than flying).
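Here is the same calculation as a sketch, again using only the quoted figures; the names are mine. Without rounding, the long-flight range comes out at about 13-26, close to the 12-25 quoted.

```python
# Quoted figures: pulmonary embolism (PE) cases per million flights,
# and the rough range for the chance of dying given a PE.
pe_per_million_flights = {"short": 0.5, "long": 1.3}
fatality_range = (0.10, 0.20)

crash_deaths_per_100m_flights = 8    # from the crash comparison
driving_deaths_per_100m_trips = 200  # 500km car trip, from the crash comparison

pe_deaths_per_100m = {}
for kind, rate in pe_per_million_flights.items():
    cases_per_100m = rate * 100  # per million -> per hundred million
    low, high = (cases_per_100m * f for f in fatality_range)
    pe_deaths_per_100m[kind] = (low, high)
    print(f"{kind}: {low:.0f}-{high:.0f} PE deaths per hundred million flights")

# Worst case for a short flight: crash deaths plus the top of the PE range.
short_flight_worst = crash_deaths_per_100m_flights + pe_deaths_per_100m["short"][1]
print(short_flight_worst < driving_deaths_per_100m_trips)
```

Even at the top of the range, a short flight comes to about 18 deaths per hundred million, against about 200 for the drive.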

If you compare a 500km drive with a long-haul intercontinental flight the numbers get less clear.  Flying to London could possibly be more dangerous than driving to Wellington, especially if you are a safe driver but at relatively high risk of blood clots.

After all these calculations it’s important to keep a sense of perspective. Driving is pretty safe. Flying is even safer.

November 20, 2012

Avoiding midlife uncertainty

Stuff and the Herald have the identical AP story, so you can read either one

Chimpanzees in a midlife crisis? It sounds like a setup for a joke. But there it is, in the title of a report published in a scientific journal: ‘Evidence for a midlife crisis in great apes.’

The researchers asked handlers to estimate ‘well-being’ for 508 great apes: 172 orang-utans, the rest, chimpanzees.  They fitted a statistical model to look for a decrease in mid-life followed by an increase, and got dramatic graphs.

The x-axis is in years, showing the trough of despondency in the mid-thirties.  The y-axis isn’t in anything — the curves were rescaled to look similar and the numbers are arbitrary.

The reason the curves look so dramatic is partly the higher-than-wide shape of the graph, but mostly the lack of any indication of uncertainty. The data are actually consistent with a wide range of flatter or steeper U-shapes and with the ‘mid-life’ crisis happening anywhere over quite a range of years.  I can’t be more precise than that, because the researchers don’t even provide the necessary information to compute the uncertainty in the curve [they give uncertainties in regression coefficients, but not correlations between them].

However, they do have an appendix that looks at chopping up age into five-year bands and estimating the midlife crisis that way.  They don’t give a graph, but they do give enough information to draw one. It’s not as impressive.

The U-shaped pattern probably is real (though the extent to which the so-called mid-life crisis is really the apes’ problem rather than the handlers’ problem isn’t clear), but the graphs in the research paper are overselling it. Badly.

[Update: the intervals in the plot are +/- 1.4 standard errors for the coefficient. This should be in the ballpark for a 95% interval for the mean for that age group]

November 12, 2012

The rise of the machines

The Herald has an interesting story on the improvements in prediction displayed last week: predicting the US election results, but more importantly, predicting the path of Hurricane Sandy.  They say

In just two weeks, computer models have displayed an impressive prediction prowess.

 The math experts came out on top thanks to better and more accessible data and rapidly increasing computer power.

It’s true that increasing computer power has been important in both these examples, but it’s been important in two quite different ways.  Weather prediction uses the most powerful computers that the meteorologists can afford, and they are still nowhere near the point of diminishing returns.  There aren’t many problems like this.

Election forecasting, on the other hand, uses simple models that could even be run on hand calculators, if you were sufficiently obsessive and knowledgeable about computational statistics and numerical approximations.  The importance of increases in computer power is that anyone in the developed world has access to computing resources that make the actual calculations trivial.  Ubiquitous computing, rather than supercomputers, is what has revolutionised statistics.  If you combine the cheapest second-hand computer you can find with free software downloaded from the Internet, you have the sort of modelling resources that the top academic and industry research groups were just starting to get twenty years ago.

Cheap computing means that we can tackle problems that wouldn’t have been worthwhile before.  For example, in a post about the lottery, I wanted to calculate the probability that distributing 104 wins over 12 months would give 15 or more wins in one of the months.  I could probably work this out analytically, at least to a reasonable approximation, but it would be too slow and too boring to do for a blog post.  In less than a minute I could write code to estimate the probabilities by simulation, and run 10,000 samples.  If more accuracy was needed I could easily extend this to millions of samples.  That particular calculation doesn’t really matter to anyone, but the same principles apply to real research problems.
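For the curious, here is roughly what such a simulation could look like. It is a sketch rather than the code used for the original post, and it assumes each win is equally likely to land in any of the 12 months (ignoring differences in month length); the function name and the fixed seed are my own choices.

```python
import random

def prob_some_month_at_least(threshold=15, n_wins=104, n_months=12,
                             n_sims=10_000, seed=1):
    """Estimate P(at least one month gets >= threshold wins) by simulation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        counts = [0] * n_months
        for _ in range(n_wins):
            counts[rng.randrange(n_months)] += 1  # assign each win to a random month
        if max(counts) >= threshold:
            hits += 1
    return hits / n_sims

estimate = prob_some_month_at_least()
print(estimate)
```

Raising `n_sims` trades running time for a smaller simulation error in the estimate.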

November 11, 2012

Psychic water bills

I’ve just received a water bill which, among other information, estimates my average daily water use for November.  That’s a pretty good trick for something that must have been mailed in the first few days of the month. They mean October, I assume.

Apart from their off-by-one labelling of months, Watercare are interesting because of their ‘estimated’ water usage.  They recently changed to sending monthly bills, but they still only try to read the meter every second month, and in my case have failed to find it twice.  My first bill on moving in was for three months, and it was relatively high. I fixed the leaky seal in the toilet and expected the bills to go down.  The following month, the meter wasn’t read but my estimated daily water use went up about 7%.  The next month, again there was no meter reading, and the estimated daily use was another 10% higher. The next month the estimated daily use was down about 8%, again with no reading.

I can see why the estimated total usage would fluctuate based on the varying time between estimates, but it’s hard to see what basis Watercare had for estimating I was using more water in September (actually August) than in August (actually July) without any actual data.  I wouldn’t have expected the average Aucklander to use more water in winter, and a research report from Branz confirms my expectation.

This month, now they have found the meter, the estimated use has fallen about 85%, catching up on three months of overbilling.