April 2, 2012

Big data and Downton Abbey

The hit British TV series Downton Abbey has drawn some fire for alleged anachronisms: phrases that just don’t fit Edwardian-era Britain.

Ben Schmidt has unleashed gigabytes of data on this problem, using the Google Books n-grams. When Google digitized lots of books, it also tabulated the frequencies of words, pairs of words, triples of words, and so on, by year of publication. In two posts, Ben compares word pairs from the TV script with the Google frequencies for books published in the 1910s and the 1990s. The comparison turns up several two-word phrases that appear in the script even though they were much less common in Downton Abbey’s historical period than they are now. In some cases these phrases were not observed at all in written English until much later; in other cases they existed but were rare.
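For readers who want to see the mechanics, here is a minimal sketch of that kind of comparison. It is not Ben's actual code: the function names, the ratio threshold of 10, and the tiny frequency tables are all invented for illustration, and a real analysis would load the Google Books 2-gram counts rather than hard-coded numbers.

```python
import re
from collections import Counter


def bigrams(text):
    """Return the adjacent lowercase word pairs in a piece of text."""
    words = re.findall(r"[a-z']+", text.lower())
    return list(zip(words, words[1:]))


def flag_anachronisms(script, freq_then, freq_now, min_ratio=10.0):
    """Flag script bigrams that are far more common now than in the period.

    freq_then and freq_now map a word pair to a relative frequency.
    A pair unseen in the period but seen now gets a ratio of infinity.
    The min_ratio threshold is an arbitrary assumption.
    """
    flagged = {}
    for pair in Counter(bigrams(script)):
        now = freq_now.get(pair, 0.0)
        then = freq_then.get(pair, 0.0)
        if now == 0.0:
            continue  # no modern evidence, so nothing to compare
        if then == 0.0:
            flagged[pair] = float("inf")
        elif now / then >= min_ratio:
            flagged[pair] = now / then
    return flagged


# Made-up frequencies, purely illustrative (not real n-gram data).
FREQ_THEN = {("old", "phrase"): 8.0, ("new", "phrase"): 0.5}
FREQ_NOW = {("old", "phrase"): 9.0, ("new", "phrase"): 6.0,
            ("late", "phrase"): 3.0}

script = "An old phrase, a new phrase, and a late phrase."
flagged = flag_anachronisms(script, FREQ_THEN, FREQ_NOW)
for pair, ratio in sorted(flagged.items(), key=lambda kv: kv[1]):
    print(" ".join(pair), ratio)
```

In this toy example "new phrase" is flagged as rare in the period and "late phrase" as absent from it, mirroring the two cases described above, while "old phrase" passes.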

As a check on the process, he also looks at a genuine play from the period, George Bernard Shaw’s Heartbreak House, which passes the phrase test with flying colors.

Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.
