Twitter is not a random sample
“If you’ve ever viewed Twitter as a gauge of public opinion, a weathervane marking the mood of the masses, you are very much mistaken.
“That is the rather surprising finding of a new US study, which suggests the microblog zeitgeist differs markedly from mainstream public opinion.”
Apart from being completely unsurprising, this is a useful thing to have data on. The Pew Charitable Trusts, who do a lot of surveys, compared actual opinion polls to tweet summaries for some major political and social issues in the US, and found they didn’t agree.
Along the same lines, it was reported last month that Google’s Flu Trends, which estimates flu activity from search queries, overestimated the number of flu cases this year (after having initially underestimated the H1N1 pandemic), probably because the high level of publicity for the flu vaccine made people more aware of flu and so more likely to search for it.
These data summaries can be very useful, because they are much less expensive and give much more detail in space and time than traditional data collection, but they are also sensitive to changes in online behaviour. Getting anything accurate out of them requires calibration to ‘ground truth’, as a previous generation of Big Data systems called it.
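As a rough illustration of what calibration to ground truth can mean, here is a minimal sketch that fits a straight line relating an online proxy to a trusted measurement, then uses the line to rescale new proxy values. The numbers and the names `proxy`, `truth`, `fit_calibration` and `calibrated` are all invented for this example; they do not come from the Pew study or from Google Flu Trends, and a real system would need something more careful than a straight line.

```python
# Hypothetical weekly values, invented purely for illustration.
proxy = [10.0, 20.0, 30.0, 40.0]  # e.g. an index built from online activity
truth = [25.0, 45.0, 65.0, 85.0]  # e.g. survey-based counts for the same weeks


def fit_calibration(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Fit on the weeks where both the proxy and the ground truth are available.
intercept, slope = fit_calibration(proxy, truth)


def calibrated(value):
    """Rescale a new proxy value using the fitted line."""
    return intercept + slope * value
```

The fitted line is only as good as the assumption that people behave online the way they did in the calibration period. If publicity changes that behaviour, the line has to be refitted against fresh ground truth.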
Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.