July 23, 2014

Human statisticians not obsolete

There’s a website, OnlyBoth.com, that, as it says:

Discovers New Insights from Data.
Writes Them Up in Perfect English.
All Automated.

You can test this by asking it for ‘insights’ in some example areas. One area is baseball, so naturally I selected the Seattle Mariners, and 2009, when I still lived in Seattle. OnlyBoth returns several names where it found insights, and I chose ‘Matt Tuiasosopo’ — the most obvious thing about him is that he comes from a famous local football family, but I was interested in what new insight the data revealed.

Matt Tuiasosopo in 2009 was the 2nd-youngest (23 yrs) of the 25 hitters who were born in Washington and played for the Seattle Mariners.

outdone by Matt Tuiasosopo in 2008 (22 yrs).

I don’t think our students need to be too worried yet.

Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.

Comments

  • David Lockhart

    Here is one of my favorites, which I discovered a few months ago:

    http://www.wolframalpha.com/input/?i=how+many+calories+is+a+marathon%3F

    It appears that they’ve since added a field called “input interpretation”, which is actually helpful in addressing this particular problem.

    10 years ago

  •

    They’re not the only ones trying to automate it – this one looks interesting, but there’s no public demo; you have to request it:

    http://www.businessinsider.com.au/narrative-science-2014-7

    10 years ago

  • Raul Valdes-Perez

    Well, Matt Tuiasosopo in 2009 only had 22 at-bats, so there’s not a whole lot to say about what makes him special, beyond the quote, or the second item about him:

    Matt Tuiasosopo of the 2009 Seattle Mariners was the heaviest (225 lbs) of the 21 second basemen who played in Safeco Field.

    Would you prefer something snide like “Maybe he should have played football”?

    10 years ago

    • Thomas Lumley

      No, I’d prefer it not to claim it had found insights about him.

      10 years ago

      • Raul Valdes-Perez

        What is an example of an insight into a baseball player/season that you would want to see?

        10 years ago

        • Thomas Lumley

          I don’t know. But I wouldn’t use it as a demonstration of my statistical abilities unless I had a better idea.

          I’d also point out that the ‘insight’ is not, actually, grammatical English. The first sentence is, but the add-on isn’t.

          Most importantly, though, that particular insight reveals the thinness of the data model. The system that produced that text pretty clearly does not understand that “Matt Tuiasosopo in 2009” and “Matt Tuiasosopo in 2008” are the same individual. This sort of thing — obvious to a person, but not to a computer — is one of the difficulties in getting actual insights automatically.

          10 years ago

  • Raul Valdes-Perez

    It’s rather common not to require full grammaticality for elements such as side comments, headings, etc. For evidence, see the title of your blog posting, which lacks a verb.

    The data model does acknowledge that Matt T in 2008 is the same person as Matt T in 2009. This is seen in the language of other outputs that only make use of static attributes like height, weight, birthplace, throwing arm, etc. Age is not static though.

    In your cited example, it might be better to say “himself” rather than repeat his name, which is a design choice. The objects of analysis are player/seasons (actually player/season/teams, since a player can be traded and thus play for multiple teams during a single season).

    But I get it … you don’t like the technology!

    10 years ago
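
The disagreement in this thread turns on what the unit of comparison is: a player/season row or a person. The following is a minimal Python sketch of that distinction, not OnlyBoth's actual method. The `PlayerSeason` type and both helper functions are hypothetical, and only the two Matt Tuiasosopo ages come from the post; the other rows are made-up placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlayerSeason:
    """One object of analysis: a player in a given season (hypothetical type)."""
    player: str
    season: int
    age: int


# Only the two Tuiasosopo rows use ages quoted in the post.
# The placeholder rows are invented purely to give the comparison some context.
rows = [
    PlayerSeason("Matt Tuiasosopo", 2008, 22),
    PlayerSeason("Matt Tuiasosopo", 2009, 23),
    PlayerSeason("Placeholder A", 2005, 25),
    PlayerSeason("Placeholder B", 2007, 27),
]


def younger_rows(target, rows):
    """Rows younger than the target, treating every player/season as distinct."""
    return [r for r in rows if r.age < target.age]


def younger_other_players(target, rows):
    """Same comparison, but skipping rows that belong to the target's own player."""
    return [r for r in rows if r.age < target.age and r.player != target.player]


target = rows[1]  # Matt Tuiasosopo in 2009
naive = younger_rows(target, rows)
by_person = younger_other_players(target, rows)

print([(r.player, r.season) for r in naive])
print([(r.player, r.season) for r in by_person])
```

With rows as the unit, the 2009 season is "outdone" only by the same player's 2008 season; with people as the unit, nobody in this toy data outdoes him. Which behaviour is preferable is the design choice being argued over above.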