October 25, 2016

Oversampling

From the election on the other side of the Pacific.

Wikileaks also shows how John Podesta rigged the polls by oversampling democrats, a voter suppression technique.

Now, as Josh Marshall at Talking Points Memo goes on to point out, the email in question is not to or from John Podesta, is eight years old, and refers to the Democrats' internal polls, not to public polls. So it's kind of uninteresting. Except to me. I'm a professional sampling nerd. I do research on oversampling; I publish papers on it; I write software about it: ways to do it and ways to correct for it. And just like a sailing nerd who has heard Bermuda rigging described as a threat to democracy, I'm going to explain more than you ever needed to know about oversampling.

The most basic form of oversampling in medical research has been widely used for over sixty years. If you want to study whether, say, smoking causes lung cancer, it’s very inefficient to take a representative sample of the population because most people, fortunately, don’t have lung cancer. You need to sample maybe 1000 people to get two people with lung cancer. If you have access to hospital records you could find maybe 200 people with lung cancer and 800 healthy control people.   Your case-control sample would have about the same cost as a representative sample of 1000 people, but nearly 100 times more information.  And there are more complex versions of the same idea.

Your case-control sample isn’t representative, but you can still learn things from it.  At a simple level, if the lung cancer cases are more likely to smoke than the controls in the sample, that will also be true in the population. The relationship won’t be the same as in the population, but it will be in the same direction.  For more detailed analysis we can undo the oversampling. Suppose we want to estimate the proportion of smokers in the population. The proportion of smokers in the sample is going to be too high, because we’ve oversampled lung-cancer patients, who are more likely to smoke. To be precise, we’ve got one hundred times too many lung cancer patients in the sample. We can fix that by giving each of them one hundred times less weight in estimating the population total. If 180 of the 200 lung cancer patients smoked, and 100 of the 800 controls did, you’d have a weighted numerator of 180×(1/100)+100×1, and a weighted denominator of 200×(1/100)+800×1, for an unbiased estimate of 12.7%, compared to the unweighted, biased (180+100)/1000 = 28%.
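The arithmetic above can be written out as a few lines of Python: each lung-cancer case is 100 times over-represented, so each one gets weight 1/100, and the weighted ratio recovers the population smoking rate.

```python
# Undoing case-control oversampling: lung-cancer cases were sampled at
# 100x their population rate, so each case counts with weight 1/100.

cases_smokers, cases_total = 180, 200    # smokers among 200 lung-cancer cases
ctrl_smokers, ctrl_total = 100, 800      # smokers among 800 healthy controls
case_weight, ctrl_weight = 1 / 100, 1.0  # weights undo the 100x oversampling

# Naive estimate: ignores the oversampling, so it is biased upwards
unweighted = (cases_smokers + ctrl_smokers) / (cases_total + ctrl_total)

# Weighted estimate: each person counts in inverse proportion to how
# over-represented their group is in the sample
weighted = (cases_smokers * case_weight + ctrl_smokers * ctrl_weight) / (
    cases_total * case_weight + ctrl_total * ctrl_weight
)

print(f"unweighted: {unweighted:.1%}")  # 28.0%
print(f"weighted:   {weighted:.1%}")    # 12.7%
```

The weighted numerator and denominator are exactly the 180×(1/100)+100×1 and 200×(1/100)+800×1 from the text.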

In polling, your question might be what issues are important to swing voters. You’d try to oversample swing voters to ask them, and not waste time and money annoying people whose minds were made up.  Obviously that would make your sample un-representative of the whole population. That’s the point; you want to talk to swing voters, not to a representative sample.  Or you might want to compare the thinking of (generally pro-Trump) evangelical Christians and (often anti-Trump) Mormons. Again, if you oversampled conservative religious groups you’d end up with an unrepresentative sample; again, that would be the point. Oversampling isn’t the best strategy when your primary purpose is finding out what a representative sample thinks; it often is the best strategy when you want to know more about some smaller group of people.

However, if you also wanted an estimate of the overall popular vote you could easily undo the oversampling and downweight the swing voters in your sample to get an unbiased estimate, as we did with the smoking rates.  You have to do that anyway; even if you try to get a representative sample it probably won’t work, because some groups of people are less likely to answer their phones and agree to talk to you.  The weighting you use to fix up accidental over- and under-sampling is exactly the same as the weighting you use when it’s deliberate.

 


Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.

Comments

  • Richard Penny

    As you point out, it’s not the oversampling that’s the problem, it’s what one does with the data collected. And to me it’s another example of an innocuous technical word that appears to many to have sinister overtones.

    In NZ one often wants to compare Maori and non-Maori populations. This often results in Maori being oversampled compared to non-Maori, as they are less than 20% of the population.

    Oversampling Maori is good for comparison, and if you need NZ estimates you work with the weights to account for the relative differences in the probability of members of the group being selected.

    I leave it to the reader to work out how one oversamples the Maori.
