May 13, 2016

Aggregation, not ok?

You’ve probably heard of OkCupid, a dating site. People give sites like that a lot of personal information. And, in a sense, the information is obviously not going to be kept secret — after all, the point of using a dating site is to be found by people you don’t already know.  When someone writes a script to collect the data from large numbers of users, and then publishes it in a convenient and easy to process format, you can just about see how they’d think that was ok. It’s harder to see how they’d be surprised not everyone feels that way.

Aggregation makes a difference because we can search, match, and analyse the data by computer. That’s important for two reasons.

First, it’s quicker and easier — you can get a set of records grouped by sexual preference or other interests almost as quickly as you can think of the question, and you can link usernames or other information to other datasets. The database includes potential matching variables like income, education level, age, job, country, city, which you could still use just taking down data one person at a time by hand, but it would be slow and boring.

Second, the database is impersonal. If you stood outside a gay bar watching who went in and out, you couldn’t really pretend you were innocently using publicly visible information.  If you signed up and went through dating profiles one at a time, it would be easier to pretend, but you’d still tend to see the people behind the data. When it’s a big spreadsheet, it’s easier to ignore how the people would feel about it.

Sometimes people aggregate and publish data knowing it may do harm, because they think there’s a higher interest involved in getting the data out — even if the data release is obviously illegal. This release isn’t obviously illegal (though there are possibilities), but the higher interest is pretty obscure too. The accompanying research paper says

As an example of the analyses one can do with the dataset, a cognitive ability test is constructed from 14 suitable items. To validate the dataset and the test, the relationship of cognitive ability to religious beliefs and political interest/participation is examined.

Those variables are so not what’s going to attract people to these data. But even if you think it’s important for anyone on the internet to be able to do that sort of correlation for variables such as sexual orientation and drug use, it’s hard to think of a reason to include the OkCupid username.

avatar

Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient See all posts by Thomas Lumley »

Comments