Two principles to avoid common data mistakes


If David Brooks is correct, the “rising philosophy of the day” is “data-ism.” But you don’t have to believe David Brooks. Just look at the big data (e.g. Google Trends) on “big data.”

For the political junkies, data became sexy in 2012. First, the New York Times’ Nate Silver’s meta-analyses of polling data triumphed over the pundits’ “gut feelings.” Second, the Obama campaign successfully used data analytics to increase voter turnout. This caused people to pay attention (witness, for example, David Brooks’ new devotion to the subject as prime column-fodder).

Of course, for those of us in the transparency and accountability advocacy community, data has long been a prized commodity. And as governments around the world increasingly commit to open data promises, more and more data is becoming available.

At its best, data allows us to transcend our personal anecdotal experiences, giving us the big picture. It allows us to detect relationships and patterns that we wouldn’t otherwise see. Using data smartly can help us to make better decisions about both our own lives and our society.

But it’s important to understand that data and data analysis are merely tools. They can be used well, or they can be used poorly. It is remarkably easy both to mislead and to be misled by data. Hence the old adage: “There are three kinds of lies: lies, damned lies, and statistics.”

For many people, data can quickly overwhelm and confuse. It’s easy to misinterpret data, or to use it irresponsibly. We as humans are not particularly good at intuitively grasping large numbers, and our educational system generally does a poor job of helping us to counter this problem.

For that reason, I want to offer two basic principles that I think could prevent a majority of the data mistakes that I observe:

  1. Cherry-picking works better with fruit than data
  2. Correlation provokes questions better than it answers them

Let’s take these one at a time.

Cherry-picking works better with fruit than data

It’s actually really easy to prove your point if you limit the cases to just those that prove your point. Problem is, it’s not really proving your point. It’s just selecting the cases that prove your point. Data scientists call this selection bias.

In this post, I’ll cover two common problems in selection bias: 1) Non-representativeness; and 2) Selecting on your outcome variable. Non-representativeness is the broader problem. Selecting on your outcome variable is a more specific type of non-representativeness. So let’s start with the general problem of non-representativeness.

To discuss representativeness, I’m going to use an extended example that will be familiar to many people: polling in the 2012 U.S. presidential election.

Say we wanted to know how likely Barack Obama was to defeat Mitt Romney before the election. We could either ask a bunch of pundits what they thought, or we could take a nationally representative survey of likely voters. I’ll take the nationally representative sample any time.

A typical poll will sample about 1,000 adults. These 1,000 adults are supposed to stand in for an entire country of voters, and the law of large numbers makes it a pretty good bet that if the sample is representative, 1,000 observations is good enough for the whole country. But being representative is the key. And pollsters try very hard to make sure that their samples are representative – that is, that the sample looks like the country at large on the key variables that might be relevant, such as age, gender, ideology, income, location, ethnicity, etc. Still, different polling agencies have had different ideas about what a representative sample should look like, which sometimes leads to different results.
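To make the point about representativeness concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the electorate is made up (half "urban" voters leaning toward candidate A, half "rural" voters leaning toward candidate B), and `simulate_poll` is a hypothetical helper, not any pollster's method. It compares a random sample of 1,000 drawn from the whole electorate with a sample of 1,000 drawn only from urban voters.

```python
import random

def simulate_poll(population, sample_size, rng):
    """Return the share of sampled voters who back candidate A."""
    sample = rng.sample(population, sample_size)
    return sum(sample) / sample_size

rng = random.Random(42)

# Hypothetical electorate: 1 = backs candidate A, 0 = backs candidate B.
# Assumed split: urban voters back A at 60%, rural voters at 40%.
urban = [1 if rng.random() < 0.60 else 0 for _ in range(50_000)]
rural = [1 if rng.random() < 0.40 else 0 for _ in range(50_000)]
population = urban + rural

true_share = sum(population) / len(population)
representative = simulate_poll(population, 1_000, rng)  # drawn from everyone
urban_only = simulate_poll(urban, 1_000, rng)           # drawn from one group

print(f"true share for A:        {true_share:.3f}")
print(f"representative sample:   {representative:.3f}")
print(f"urban-only sample:       {urban_only:.3f}")
```

Both samples have the same size. Only the first one should land near the true share, because only the first one looks like the electorate it is standing in for.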

The now-famous Nate Silver did the pollsters one better. He aggregated all the polling data into one super-poll, getting the biggest sample possible, and thus taking even more advantage of the law of large numbers. He also looked at how well different polling agencies had performed in the past, and gave extra points for those whose predictions more closely matched election-day results, while devaluing those polling agencies that were consistently off. The assumption here was that the polling agencies that did better probably used more representative samples.
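The aggregation idea can be sketched as a weighted average. This is not Silver's actual model, which is not described here in enough detail to reproduce. It is a simplified illustration in which each poll counts in proportion to its sample size times a pollster weight. The three polls and their weights are invented for the example.

```python
def weighted_average(polls):
    """Combine polls into one estimate.

    polls: list of (share_for_candidate, sample_size, pollster_weight).
    Each poll counts in proportion to sample_size * pollster_weight.
    """
    total_weight = sum(n * w for _, n, w in polls)
    return sum(share * n * w for share, n, w in polls) / total_weight

# Hypothetical polls: (share for the candidate, sample size, track-record weight)
polls = [
    (0.50, 1000, 1.0),  # pollster with an average track record
    (0.48, 800, 0.5),   # pollster that has been consistently off: counted less
    (0.52, 1200, 1.2),  # pollster with a good track record: counted more
]

estimate = weighted_average(polls)
print(f"combined estimate: {estimate:.3f}")
```

The weights here are arbitrary. The design point is only that a poll from a historically accurate pollster moves the combined estimate more than a poll of the same size from a historically inaccurate one.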

It’s key here to understand that the default assumption of most statistics is that things are basically random, like a flip of a coin. It’s only when the results would be very unlikely from a fair coin (by convention, less than a 1-in-20 chance, which for 20 flips means 15 or more heads) that modern statistical analysis will allow you to say that this doesn’t look like a random coin flip anymore: Something else is probably going on. And the more you flip the coin and it turns up heads, the more certain you can be that something other than randomness is at work.
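To put numbers on the coin example, the short sketch below computes how likely a fair coin is to produce a lopsided result by chance alone. The function name `prob_at_least` is my own. The calculation itself is the standard binomial formula.

```python
from math import comb

def prob_at_least(heads, flips, p=0.5):
    """Probability of getting `heads` or more heads in `flips` tosses
    of a coin that lands heads with probability p."""
    return sum(
        comb(flips, k) * p**k * (1 - p) ** (flips - k)
        for k in range(heads, flips + 1)
    )

for heads in (12, 15, 19):
    print(f"{heads}+ heads in 20 flips: {prob_at_least(heads, 20):.6f}")
```

For a fair coin, 15 or more heads in 20 flips happens by chance about 2% of the time, and 19 or more heads about 0.002% of the time. The more lopsided the result, the harder it is to explain as plain randomness.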

That’s why it’s good to have many observations: the more you can observe something happening over and over again, the more likely it is that you are observing something that is really happening, and not just the result of chance. It is much more likely to get 10 heads in a row in a coin toss than it is to get 1,000 heads in a row. That’s why it’s better to poll 1,000 people than 10 people, and even better to combine 10 polls to get 10,000 people.
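A quick simulation shows why bigger samples help. The sketch below assumes a made-up electorate split exactly 50/50, runs many imaginary polls at each sample size, and measures how much the results bounce around. The helper name `spread_of_polls` and the choice of 200 repeats are assumptions for illustration.

```python
import random
import statistics

def spread_of_polls(sample_size, true_share=0.5, repeats=200, seed=0):
    """Run `repeats` simulated polls and return the standard deviation
    of the candidate's share across them."""
    rng = random.Random(seed)
    results = []
    for _ in range(repeats):
        hits = sum(rng.random() < true_share for _ in range(sample_size))
        results.append(hits / sample_size)
    return statistics.pstdev(results)

# Chance of an unbroken streak of heads from a fair coin.
print("10 heads in a row:   ", 0.5**10)    # 1 in 1,024
print("1,000 heads in a row:", 0.5**1000)  # vanishingly small

for n in (10, 1_000, 10_000):
    print(f"sample of {n:>6}: spread of results = {spread_of_polls(n):.4f}")
```

The spread shrinks as the sample grows: polls of 10 people scatter widely around the true 50%, polls of 1,000 cluster much more tightly, and polls of 10,000 tighter still.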

This goes for more than just polling. If you observe anything happen in just a few cases, you have no idea whether it was just a random occurrence. But the more you can document it happening, the surer you can be it’s not just a random occurrence.

Recall that Nate Silver’s outputs were all in terms of probability. In the final days, liberals were enthused as the chance of Obama’s victory rose to 90.9% on Election Day. How did Silver calculate this?

Silver knew that each poll taken was not perfect. Most polls reported a range of error. Look closely, for example, at the fine print in the final Gallup 2012 national tracking poll, showing Romney up 49% to 48%: “For results based on the total sample of national adults, one can say with 95% confidence that the maximum margin of error is ±2 percentage points.” What Gallup is admitting is that even with 3,117 adults surveyed in this poll, the results might be off a little. Probably (95% chance) they are off by less than 2 percentage points. But there’s also a 5% chance they’re off by more than this.
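The margin of error in that fine print can be roughly reproduced with the textbook formula for a simple random sample. This is a sketch under that assumption. Gallup's own calculation may include adjustments for weighting and survey design that are not modeled here. The value z = 1.96 corresponds to 95% confidence, and a share of 0.5 gives the largest ("maximum") margin.

```python
from math import sqrt

def margin_of_error(sample_size, share=0.5, z=1.96):
    """Textbook margin of error for a proportion from a simple random sample.

    share=0.5 is the worst case, which is why it yields the 'maximum' margin.
    z=1.96 corresponds to a 95% confidence level.
    """
    return z * sqrt(share * (1 - share) / sample_size)

moe = margin_of_error(3117)
print(f"margin of error for 3,117 respondents: ±{moe * 100:.1f} points")
```

For 3,117 respondents this comes out to roughly ±1.8 percentage points, which is consistent with the "maximum ±2" that Gallup reports.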

You can think of it this way: If Gallup ran this poll 100 times, it would get a range of results. Most common would be Romney up 49-48, but you’d also get a fair number of Obama 49-48 scores, and occasionally an even wider split (maybe a Romney 52-46 here and there, or an Obama 51-47).
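Here is that thought experiment as a small simulation. It assumes, purely for illustration, that the true split among all adults is exactly the reported 49% Romney, 48% Obama, 3% other or undecided, and that each rerun draws 3,117 people at random. The helper `rerun_poll` is hypothetical.

```python
import random
from collections import Counter

def rerun_poll(rng, sample_size=3117, romney=0.49, obama=0.48):
    """Simulate one poll; return (Romney share, Obama share)."""
    r = o = 0
    for _ in range(sample_size):
        x = rng.random()
        if x < romney:
            r += 1
        elif x < romney + obama:
            o += 1
        # otherwise: other / undecided
    return r / sample_size, o / sample_size

rng = random.Random(2012)
leaders = Counter()
for _ in range(100):
    r, o = rerun_poll(rng)
    if r > o:
        leaders["Romney"] += 1
    elif o > r:
        leaders["Obama"] += 1
    else:
        leaders["tie"] += 1

print(dict(leaders))
```

Even with the "true" numbers fixed at Romney 49-48, a sizable minority of the reruns show Obama ahead. That is sampling error at work, and it is why a one-point lead in a single poll says very little.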

What Silver did was to pay attention to these reported error ranges and then run a bunch of simulations to generate the likelihood of these different possible outcomes. What he asked was this: Given all the polls in all the states and their range of potential outcomes, what was the likelihood of Obama winning enough states to win the Electoral College? And even though most of the polls in the key swing states showed Obama ahead, on Election Day there was still a 9.1% chance that Romney would win based on polling data.
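The simulation approach can be sketched in a few lines. To be clear, this is a toy, not Silver's model: the five swing states, their electoral votes, the polling leads, the "safe" vote totals, and the 3-point polling error are all invented, and the sketch treats each state's polling error as independent, which a more careful model would not do.

```python
import random

# Hypothetical inputs (none of these are real 2012 figures).
SAFE_VOTES_A = 230      # electoral votes assumed safe for candidate A
VOTES_TO_WIN = 270      # majority of 538 electoral votes
SWING_STATES = [        # (name, electoral votes, A's polling lead)
    ("State A", 29, 0.01),
    ("State B", 18, 0.03),
    ("State C", 13, 0.02),
    ("State D", 10, 0.04),
    ("State E", 38, -0.02),
]
POLL_ERROR_SD = 0.03    # assumed standard deviation of the polling error

def win_probability(simulations=10_000, seed=1):
    """Share of simulated elections in which candidate A reaches 270."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(simulations):
        votes = SAFE_VOTES_A
        for _name, electoral_votes, lead in SWING_STATES:
            # Draw one plausible "actual" margin around the polled lead.
            actual_margin = rng.gauss(lead, POLL_ERROR_SD)
            if actual_margin > 0:
                votes += electoral_votes
        if votes >= VOTES_TO_WIN:
            wins += 1
    return wins / simulations

p = win_probability()
print(f"candidate A wins in {p:.1%} of simulations")
```

Each simulated election is one possible world in which the polls were off by some random amount. Counting the share of worlds in which a candidate reaches 270 gives a probability, not a certainty, which is why a candidate who leads in most swing-state polls can still have a real chance of losing.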
