February 19, 2018

Ihaka Lecture Series – live and live-streamed in March

The theme of this year’s Ihaka Lecture Series is “A thousand words: Visualising statistical data”. The distillation of data into an honest and compelling graphic is an essential component of modern (data) science, and this year, we have three experts exploring different facets of data visualisation.

Each event begins at 6pm in the Large Chemistry Lecture Theatre, Building 301, 23 Symonds Street, Central Auckland, with drinks, nibbles and chat – just turn up – and the talks get underway at 6.30pm. Each one will be live-streamed – details will be on the info pages, the links to which are given below.

On March 7, Professor Dianne Cook from Monash University (right) looks at simple tools for helping to decide if the patterns you think you see in the data are really there. Details. Statschat interviewed Di last year about the woman behind the data work, and it was a very popular read. It’s here. Di’s website is here.

On March 14, Associate Professor Paul Murrell from the Department of Statistics, The University of Auckland (left) will embark on a daring statistical graphics journey featuring the BrailleR package for visually-impaired users, high-performance computing, te reo, and XKCD. Details. Paul was a student when R was being developed by Ross Ihaka and Robert Gentleman, and has been part of the R Core Development team since 1999.

On March 21, Alberto Cairo, the Knight Chair in Visual Journalism at the University of Miami (below right) teaches principles so we all become more critical and better informed readers of charts. This lecture is non-technical – if you have any journalist friends, let them know. Details. His website is here.

The series is named after Ross Ihaka, Associate Professor in the Department of Statistics at the University of Auckland. Ross, along with Robert Gentleman, co-created R – a statistical programming language now used by the majority of the world’s practicing statisticians. It is hard to over-emphasise the importance of Ross’s contribution to our field, so we named this lecture series in his honour to recognise his work in perpetuity.



February 17, 2018

Read me first?

There’s a viral story that viral stories are shared by people who don’t actually read them. I saw it again today in a tweet from Newseum Institute.

If you search for the study it doesn’t take long to start suspecting that the majority of news sources sharing this study didn’t read it first.  One that at least links is from the Independent, in June 2016.

The research paper is here. The money quote looks like this, from section 3.3

First, 59% of the shared URLs are never clicked or, as we call them, silent.

We can expand this quotation slightly

First, 59% of the shared URLs are never clicked or, as we call them, silent. Note that we merged URLs pointing to the same article, so out of 10 articles mentioned on Twitter, 6 typically on niche topics are never clicked

That’s starting to sound a bit different. And more complicated.

What the researchers did was to look at bit.ly URLs to news stories from five major sources, and see if they had ever been clicked. They divided the links into two groups: primary URLs tweeted by the media source itself (eg @NYTimes), and secondary URLs tweeted by anyone else. The primary URLs were always clicked at least once — you’d expect that just for checking purposes.  The secondary URLs, as you’d expect, averaged fewer clicks per tweet; 59% were not clicked at all.

That’s being interpreted as if it were 59% of retweets didn’t involve any clicks. But it isn’t. It’s quite likely that most of these links were never retweeted.  And there’s nothing in the data about whether the person who first tweeted the link read the story: there certainly isn’t any suggestion that person didn’t read the story.

So, if I read some annoying story about near-Earth asteroids on the Herald and tweeted a bit.ly URL, there’s a chance no-one would click on it. And, looking at my Twitter analytics, I can see that does sometimes happen. When it happens, people usually don’t retweet the link either, and it definitely doesn’t go viral.

If I retweeted the official @NZHerald link about the story, then it would almost certainly have been clicked by someone. The research would say nothing whatsoever about the chance that I (or any of the other retweeters) had read it.


February 16, 2018

Best places to retire?

There’s a fun visualisation in the Herald of best places in NZ to retire. Chris Knox’s design lets you adjust the relative importance of a set of factors, and also see which factors are responsible for a good or bad ranking for your favorite region. For nerds, he’s even put up the code and data.

If you play around with the sliders enough, you can get Dunedin or Christchurch to the top, but you can’t get Auckland or Wellington there. Since about 30% of people over 65 actually do live in those two cities, there are presumably some important decision factors that are left out and that would make cities look better if they were put in.

There are at least two sorts of factors. First, that many people live in cities. You might well want to retire somewhere close to your friends and whānau. Second, that you want the amenities of a city: public transport, taxis, libraries, cinemas, museums, stadiums, fair-quality cheap restaurants.

The interactive is just for fun, but similar principles apply to serious decision-making tools.  The ‘best’ decision depends a lot on your personal criteria for ‘best’, and oversimplifying these criteria will give you something that looks like an objective, data-based policy choice, but really isn’t.
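To see how much the weights matter, here is a minimal sketch of a weighted-score ranking. It is not taken from the Herald code: the two regions, the two factors, and all the scores are hypothetical, made up only to show that the same data can give opposite rankings under different weights.

```python
# Hypothetical scores on a 0-1 scale (higher is better); not real data.
scores = {
    "Region A": {"housing_cost": 0.9, "amenities": 0.2},
    "Region B": {"housing_cost": 0.3, "amenities": 0.8},
}


def rank(weights):
    """Rank regions by the weighted sum of their factor scores, best first."""
    totals = {
        region: sum(weights[factor] * factor_scores[factor] for factor in weights)
        for region, factor_scores in scores.items()
    }
    return sorted(totals, key=totals.get, reverse=True)


# Someone who mostly cares about cheap housing...
cheap_first = rank({"housing_cost": 0.8, "amenities": 0.2})
# ...and someone who mostly cares about city amenities.
city_first = rank({"housing_cost": 0.2, "amenities": 0.8})

print(cheap_first)
print(city_first)
```

Neither ranking is wrong. They answer different questions, and a factor that is left out entirely behaves like a factor with weight zero.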

February 14, 2018

Most inaccurate media number ever?

In 2012, the Telegraph and other UK papers were off by five orders of magnitude when they said there were only 100 adult cod in the North Sea. The Washington Post beats that easily.

Quantum computers are straight out of science fiction. Take the “traveling salesman problem,” where a salesperson has to visit a specific set of cities, each only once, and return to the first city by the most efficient route possible. As the number of cities increases, the problem becomes exponentially complex. It would take a laptop computer 1,000 years to compute the most efficient route between 22 cities, for example. A quantum computer could do this within minutes, possibly seconds.

As mathematician Bill Cook pointed out on Twitter, his iMac can solve a 22-city problem in 0.005 seconds, roughly six trillion times faster than the Washington Post claims. It looks as though the writer has assumed there is no more efficient algorithm than just trying all the possible routes, one at a time. At one billion attempts per second, that would take on the order of a thousand years. In fact, there are more efficient algorithms; it’s just that these algorithms also get very slow as the number of points increases. Prof Cook estimates in the Twitter thread that 1000 CPU-years might allow a 100,000-point problem to be solved: six trillion times more computing gets you only a 5000-fold increase in the number of points.
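The arithmetic behind those figures is easy to check. This sketch assumes the brute-force count treats a route and its reverse as the same tour and fixes the starting city, giving 21!/2 routes. Counting both directions doubles the time, which is still on the order of a thousand years.

```python
import math

cities = 22

# Fix the starting city and count each tour once, not once per direction.
routes = math.factorial(cities - 1) // 2

attempts_per_second = 1e9
seconds_per_year = 365.25 * 24 * 3600

# Time to try every route, one at a time.
brute_force_years = routes / attempts_per_second / seconds_per_year

# The Post's 1,000 years compared with 0.005 seconds on an iMac.
speedup = 1000 * seconds_per_year / 0.005

print(f"{routes:.3g} routes, about {brute_force_years:.0f} years by brute force")
print(f"claimed time is about {speedup:.2g} times the actual time")
```

The brute-force time comes out at roughly 800 years, and the ratio at a little over six trillion.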

For the cryptographic problems that are the actual point of the story, fast quantum algorithms are known. If large enough quantum computers can be built, these security protections are toast. And as the story says, there’s research going on now to find replacements if/when they’re needed (though what the story says about those algorithms is almost completely unhelpful).  But the travelling salesman problem is not one of the ones for which fast quantum algorithms are known. On the contrary, there are reasons to suspect fast quantum algorithms don’t even exist for “NP-complete” problems such as the travelling salesman.


Update: Scott Aaronson, who knows from quantum, is Not Impressed.

February 13, 2018

Super 15 Predictions for Round 1

Team Ratings for Round 1

The basic method is described on my Department home page.

Here are the team ratings prior to this week’s games, along with the ratings at the start of the season.

Team         Current Rating  Rating at Season Start  Difference
Hurricanes            16.18                   16.18        0.00
Crusaders             15.23                   15.23        0.00
Lions                 13.81                   13.81        0.00
Highlanders           10.29                   10.29       -0.00
Chiefs                 9.29                    9.29        0.00
Brumbies               1.75                    1.75        0.00
Stormers               1.48                    1.48       -0.00
Sharks                 1.02                    1.02        0.00
Blues                 -0.24                   -0.24       -0.00
Waratahs              -3.92                   -3.92       -0.00
Jaguares              -4.64                   -4.64        0.00
Bulls                 -4.79                   -4.79        0.00
Reds                  -9.47                   -9.47        0.00
Rebels               -14.96                  -14.96        0.00
Sunwolves            -18.42                  -18.42        0.00


Predictions for Round 1

Here are the predictions for Round 1. The prediction is my estimated expected points difference with a positive margin being a win to the home team, and a negative margin a win to the away team.

   Game                   Date    Winner    Prediction
1  Stormers vs. Jaguares  Feb 17  Stormers       10.10
2  Lions vs. Sharks       Feb 17  Lions          16.30
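The method itself is not spelled out in this post, but the numbers can be unpacked a little. As an assumption only, suppose each prediction is the home team's rating minus the away team's rating, plus some home-ground adjustment. The sketch below backs out what that adjustment would have to be for the two Round 1 games, using the ratings and predictions in the tables above.

```python
# Ratings and predictions copied from the tables above.
ratings = {"Stormers": 1.48, "Jaguares": -4.64, "Lions": 13.81, "Sharks": 1.02}
games = [("Stormers", "Jaguares", 10.10), ("Lions", "Sharks", 16.30)]

# Assumed form (not confirmed by the post):
#   prediction = home rating - away rating + home adjustment
implied_adjustment = {}
for home, away, prediction in games:
    rating_gap = ratings[home] - ratings[away]
    implied_adjustment[(home, away)] = round(prediction - rating_gap, 2)

for (home, away), adjustment in implied_adjustment.items():
    print(f"{home} vs. {away}: implied home adjustment {adjustment:.2f}")
```

Under that assumption the adjustment is about 4 points for the Stormers game and about 3.5 for the Lions game, so it would not be a single constant.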


Opinions about immigrants

Ipsos MORI do a nice set of surveys about public misperceptions: ask a sample of people for their estimate of a number and compare it to the actual value.

The newest set includes a question about the proportion of the prison population that are immigrants. Here’s (a redrawing of) their graph, with NZ in all black.

People think more than a quarter of NZ prisoners are immigrants; it’s actually less than 2%. I actually prefer this as a ratio

The ratio would be better on a logarithmic scale, but I don’t feel like doing that today since it doesn’t affect the main point of this post.

A couple of years ago, though, the question was about what proportion of the overall population were immigrants. That time people also overestimated a lot.  We can ask how much of the overestimation for the prison question can be explained by people just thinking there are more immigrants than there really are.

Here’s the ratio of the estimated proportion of immigrants among the prison population and the total population

The bar for New Zealand is to the left; New Zealand recognises that immigrants are less likely to be in prison than people born here. Well, the surveys taken two years apart are consistent with us recognising that, at least.

That’s just a ratio of two estimates. We can also compare to the reality. If we divide this ratio by the true ratio we find out how much more likely people think an individual immigrant is to end up in prison compared to how likely they really are.

It seems strange that NZ is suddenly at the top. What’s going on?

New Zealand has a lot of immigrants, and we only overestimate the actual number by about a half (we said 37%; it was 25% in 2017). But we overestimate the proportion among prisoners by a lot. That is, we get this year’s survey question badly wrong, but without even the excuse of being seriously deluded about how many immigrants there are.
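The three ratios can be reproduced roughly from the numbers quoted in this post. The prison figures here are only the bounds given above ("more than a quarter" guessed, "less than 2%" actual), so the results involving them are lower bounds rather than the exact values in the graphs. The two survey questions were also asked a couple of years apart.

```python
# Figures quoted in the post; the prison ones are bounds, not exact values.
guessed_population_share = 0.37   # guessed share of immigrants in the population
actual_population_share = 0.25    # actual share (2017)
guessed_prison_share = 0.25       # "more than a quarter": a lower bound
actual_prison_share = 0.02        # "less than 2%": an upper bound

# How far off each guess is.
population_overestimate = guessed_population_share / actual_population_share
prison_overestimate = guessed_prison_share / actual_prison_share

# Perceived and true over-representation of immigrants among prisoners.
perceived_relative_rate = guessed_prison_share / guessed_population_share
true_relative_rate = actual_prison_share / actual_population_share

# How much more prison-prone people think an individual immigrant is than reality.
misperception = perceived_relative_rate / true_relative_rate

print(f"population guess is {population_overestimate:.2f} times too high")
print(f"prison guess is at least {prison_overestimate:.1f} times too high")
print(f"perceived relative rate {perceived_relative_rate:.2f}, true at most {true_relative_rate:.2f}")
print(f"misperception factor at least {misperception:.1f}")
```

The perceived relative rate is below 1, which is why the NZ bar is to the left, but the true one is far smaller, so the final ratio is at least about 8.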

February 6, 2018


February 2, 2018

Map whining

I’ve seen this map several times on Twitter. It’s originally from a British company, ThirtyFifty. It gives a really nice idea of where the world’s wine regions are, and why, in terms of temperature.

There’s one problem, which is probably more obvious to New Zealanders than northern-hemisphere people. The latitude lines are wrong.

The 50N line and the equator are right, but the 30N line is a little too far north; the 30S line is a little too far north; and the 50S line is way too far north — it’s at about 45S, measured by the southern tip of Tasmania and the middle of NZ’s South Island.

Diagnostic accuracy: twitter followers

The New York Times and Stuff both have recent stories about fake Twitter followers. There’s an important difference. The Times focuses on a particular company that they claim sells fake followers; Stuff talks about two apps that claim to be able to detect fakes by looking at their Twitter accounts.

The difference matters. If you bought fake followers from a company such as the one the Times describes, then you (or a ‘rogue employee’) knew about it with pretty much 100% accuracy.  If you’re relying on algorithmic identification, you’d need some idea of the accuracy for it to be any use — and an algorithm that performs fairly well on average for celebrity accounts could still be wrong quite often for ordinary accounts. If you know that 80% of accounts with a given set of properties are fake, and someone has 100,000 followers with those properties, it might well be reasonable to conclude they have 80,000 fake followers.  It’s a lot less safe to conclude that a particular follower, Eve Rybody, say, is a fake.
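A quick sketch of why the total is safe and the individual call is not, using the 80% and 100,000 from the example above. It assumes, purely for illustration, that each follower is fake independently with the same probability.

```python
import math

followers = 100_000
p_fake = 0.80  # share of accounts with these properties that are fake

# Aggregate: the expected number of fakes, and how much it varies by chance
# (binomial standard deviation, assuming independence).
expected_fakes = followers * p_fake
sd_fakes = math.sqrt(followers * p_fake * (1 - p_fake))

# Individual: the chance that labelling one particular follower "fake" is wrong.
p_wrong_about_one = 1 - p_fake

print(f"expected fakes: {expected_fakes:.0f}, give or take about {sd_fakes:.0f}")
print(f"chance a given 'fake' label is wrong: {p_wrong_about_one:.0%}")
```

The total is pinned down to within a fraction of a percent, but one in five of the individual labels is wrong.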

Stuff says

Twitter Audit analyses the number of tweets, date of the last tweet, and ratio of followers to friends to determine whether a user is real or “fake”.

SocialBakers’ Maie Crumpton says it’s possible for celebrities to have 50 per cent “fake” or empty follower accounts through no fault of their own. SocialBakers’ labels an account fake or empty if it follows fewer than 50 accounts and has no followers.

Twitter Audit thinks I’ve got 50 fake followers. It won’t tell me who they are unless I pay, but I think it’s probably wrong. I have quite a few followers who are inactive or who are read-only tweeters, and some that aren’t real people but are real organisations.

Twitter users can’t guard against followers being bought for them by someone else but Brislen and Rundle agree it is up to tweeters to protect their reputation by actively managing their account and blocking fakes.

I don’t think I’d agree even if you could reliably detect individual fake accounts; I certainly don’t agree if you can’t.

February 1, 2018

Another test for Alzheimers

As StatsChat readers will know, there are lots of candidate tests for Alzheimer’s Disease, all of which are much better than just flipping a coin. These tests may be useful in selecting people for clinical trials, and if we ever get effective disease-modifying treatments, for deciding who to treat. At the moment, though, these tests aren’t much use.

BBC News has another one

Scientists in Japan and Australia have developed a blood test that can detect the build-up of toxic proteins linked to Alzheimer’s disease.

The work, published in the journal Nature, is an important step towards a blood test for dementia.

The test was 90% accurate when trialled on healthy people, those with memory loss and Alzheimer’s patients.

If you’re not paying careful attention, 90% accuracy sounds pretty good. But that’s the accuracy in a group of people where about half of them have Alzheimer’s.  In the population, where most people are ok, the false positive rate will still be scarily high.
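Here is a rough illustration of how the same test looks at different prevalences. It borrows the 96.7% sensitivity and 81.0% specificity reported in the research paper, which are for predicting Aβ status rather than diagnosing Alzheimer's, and the 5% prevalence is a made-up figure for a general population, not an estimate from the paper.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that someone who tests positive really has the condition."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


sensitivity = 0.967  # from the research paper
specificity = 0.810  # from the research paper

# A study-like group where about half have the condition.
ppv_study = positive_predictive_value(sensitivity, specificity, prevalence=0.5)
# A hypothetical general population where 5% do (an assumption for illustration).
ppv_population = positive_predictive_value(sensitivity, specificity, prevalence=0.05)

print(f"PPV at 50% prevalence: {ppv_study:.0%}")
print(f"PPV at 5% prevalence:  {ppv_population:.0%}")
```

With these inputs, a positive result is right more than 80% of the time in the study-like group, but only about one time in five at 5% prevalence.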

Also, in contrast to some tests I’ve written about, the main focus of this paper is differentiating Alzheimer’s Disease from other sorts of dementia

The research paper says

The plasma composite biomarker showed 96.7% sensitivity, 81.0% specificity, and 90.2% accuracy in the overall data (n = 51) when predicting individual Aβ status (Aβ+ or Aβ−) using the common cut-off value (0.376) (Extended Data Fig. 8e–g). The results suggest that the plasma biomarker could be helpful for the differential diagnosis of AD and aid in determining therapeutic strategies, by providing additional information on the brain Aβ deposition status of individuals.

That’s going to be useful when we get treatments, since the treatments for Alzheimer’s probably won’t work on unrelated kinds of dementia, but it doesn’t really fit with the framing in the story

Alzheimer’s disease starts years before patients have any symptoms of memory loss.

The key to treating the dementia will be getting in early before the permanent loss of brain cells.