October 20, 2016

Brute force and ignorance

At a conference earlier this week, a research team from Microsoft described a computer system for speech transcription. For the first time ever, this system did better than humans on a standard set of recordings.

What’s more impressive, and more relevant to StatsChat, is that this computer system does not understand anything about the conversations it writes down. The system does not know English, or any other human language, even in the sense that Siri does.

It has some preconceived notions about what tends to follow a particular word, pair of words, or triple of words, and about what sequences of sounds tend to follow each other, but nothing about nouns or verbs or how colorless green ideas sleep. As with modern image recognition, the system is just based on heaps and heaps of data and powerful computers. It’s computing and statistics, not linguistics.
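To give a flavour of what "notions about what tends to follow a particular word or pair of words" can mean, here is a toy sketch in Python. It is not the Microsoft system, which is far larger and also models sounds; the three-sentence corpus and the function names `train_ngrams` and `most_likely_next` are made up for illustration. All it does is count which word follows each word or pair of words, with no idea what a noun or verb is.

```python
from collections import Counter, defaultdict

def train_ngrams(sentences, n=2):
    """Count which word follows each (n-1)-word context (hypothetical helper)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - n + 1):
            context = tuple(words[i:i + n - 1])
            counts[context][words[i + n - 1]] += 1
    return counts

def most_likely_next(counts, context):
    """Return the word seen most often after `context`, or None if unseen."""
    followers = counts.get(tuple(context))
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# A made-up toy corpus; a real system would use vastly more text.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]

bigrams = train_ngrams(corpus, n=2)   # what follows a single word
trigrams = train_ngrams(corpus, n=3)  # what follows a pair of words

print(most_likely_next(bigrams, ["sat"]))         # on
print(most_likely_next(trigrams, ["sat", "on"]))  # the
```

Nothing in the counts knows that "cat" is an animal or that "sat" is a verb; the guesses come entirely from how often words appeared next to each other.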

In a comment on a post at Language Log, the linguist Geoffrey Pullum says:

I must confess that I never thought I would see this day. In the 1980s, I judged fully automated recognition of connected speech (listening to connected conversational speech and writing down accurately what was said) to be too difficult for machines, far more difficult than syntactic and semantic processing (taking an error-free written sentence as input, recognizing which sentence it was, analysing it into its structural parts, and using them to figure out its literal meaning). I thought the former would never be accomplished without reliance on the latter.

For many problems, there isn’t enough data available to build a model without some understanding of the problem. There won’t be a shortage of work for human statisticians or linguists any time soon. But there are problems where brute force and ignorance works, and they aren’t always the ones we expect.

Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.