P-Hacking is a big problem. It can lead to bad decisions, wasted effort, and misplaced confidence in how your business works.
P-Hacking sounds like something you do to pass a drug test. Actually, it’s something you do to pass a statistical test. “P” refers to the “P” value: the probability of seeing a result at least as strong as the one you observed purely by random chance, if there were actually nothing real going on. “Hacking” in this case means manipulating, so P-Hacking is manipulating an experiment in order to make the P value look more significant than it really is, so that it looks like you discovered a real effect when in fact there may be nothing there.
It’s the equivalent of smoke and mirrors for statistics nerds. And it’s really, really common. So common that some of the foundational research in the social sciences has turned out to not be true. It’s led to a “Replication Crisis” in some fields, forcing a fresh look at many important experiments.
And as scientific techniques like A/B testing have become more common in the business world, P-Hacking has followed. A recent analysis of thousands of A/B tests run through a commercial platform found convincing evidence of P-Hacking in over half the tests where a little P-Hacking might make the difference between a result that’s declared “significant” and one that’s just noise.
The problem is that P-Hacking is subtle: it’s easy to do without realizing it, hard to detect, and extremely tempting when there’s an incentive to produce results.
One common form of P-Hacking, and the one observed in the recent analysis, is stopping an A/B test early when it shows a positive result. This may seem innocuous, but in reality it distorts the P value and gives you a better chance of hitting your threshold for statistical significance.
Think of it this way: If you consider a P value of less than 0.05 to be “significant” (a common threshold), that means that there’s supposed to be a 5% chance that you would have gotten a result at least this strong by random chance if there were actually no difference between your A and B test cases. It’s the equivalent of rolling one of those 20-sided Dungeons and Dragons dice and declaring that “20” means you found something real.
But if you peek at the results of your A/B test early, that’s a little like giving yourself extra rolls of the dice. So Monday you roll 8 and keep the experiment running. Tuesday you roll 12 and keep running. Wednesday you roll 20 and declare that you found something significant and stop. Maybe if you had continued the experiment you would have kept rolling 20 on Thursday and Friday, but maybe not. You don’t know because you stopped the experiment early.
The point is that by taking an early look at the results and deciding to end the test as soon as the results crossed your significance threshold, you’re getting to roll the dice a few more times and increase the odds of showing a “significant” result when in fact there was no effect.
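To make the extra-rolls idea concrete, here is a minimal simulation sketch. It is an illustration under stated assumptions, not the method used in the analysis mentioned above: both arms share the same true conversion rate (so every "significant" result is bogus), the test is a simple two-proportion z-test, and the function names, visitor counts, and 10% conversion rate are all made up for the example. It compares checking the P value every day and stopping at the first crossing against looking only once at the end.

```python
import math
import random


def two_sided_p_value(conv_a, conv_b, n_per_arm):
    """Two-proportion z-test P value for two arms of equal size."""
    pooled = (conv_a + conv_b) / (2 * n_per_arm)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n_per_arm)
    if se == 0:
        return 1.0  # no conversions (or all conversions): nothing to compare
    z = (conv_b - conv_a) / n_per_arm / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_false_positive_rates(n_experiments=1000, days=5,
                                  visitors_per_day=200, rate=0.10,
                                  alpha=0.05, seed=0):
    """Run A/A tests (no real effect) and count bogus 'significant' results.

    Returns (peeking_rate, fixed_rate):
      peeking_rate - share of tests where p < alpha on ANY daily peek
      fixed_rate   - share of tests where p < alpha on the final day only
    """
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_experiments):
        conv_a = conv_b = 0
        crossed_at_any_peek = False
        p = 1.0
        for day in range(1, days + 1):
            # Hypothetical traffic: same true conversion rate in both arms.
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_day))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_day))
            p = two_sided_p_value(conv_a, conv_b, day * visitors_per_day)
            if p < alpha:
                crossed_at_any_peek = True  # a peeker would stop and celebrate here
        peeking_hits += crossed_at_any_peek
        fixed_hits += p < alpha  # p now holds the final-day value
    return peeking_hits / n_experiments, fixed_hits / n_experiments


if __name__ == "__main__":
    peeking, fixed = simulate_false_positive_rates()
    print(f"bogus 'wins' when peeking daily:    {peeking:.1%}")
    print(f"bogus 'wins' with one look at end: {fixed:.1%}")
```

The single-look rate should land near the promised 5%, while the peeking rate comes out noticeably higher; standard results for five equally spaced looks put it at roughly 14%. The exact numbers depend on the assumed traffic and the random seed.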
If there is a real effect, we expect the P value to keep dropping (showing more and more significance) as we collect more data. But the P value can bounce around, and even when the experiment is run perfectly with no P-Hacking, if there is no real effect there’s still a one-in-20 chance that you’ll see a “significant” result that’s completely bogus. If you’re P-Hacking, the odds of a bogus result can increase a lot.
What makes this so insidious is that we are all wired to want to find something. Null results (finding the things that don’t have any effect) are boring. Positive results are much more interesting. We all want to go to our boss or client and talk about what we discovered, not what we didn’t discover.
How can you avoid P-Hacking? It’s hard. You need to be very aware of what your statistical tests mean and how they relate to the way you designed your study. Here are some tips:
- Be aware that every decision you make while an A/B test is underway could be another roll of the dice. Don’t change anything about your study design once data collection has started.
- Every relationship you analyze is also another roll of the dice. If you look at 20 different metrics that are just random noise, you should expect about one of them to show a statistically significant trend with p < 0.05 purely by chance.
- When in doubt, collect more data. When there’s a real effect or trend, the statistical significance should improve as you collect more data. Bogus effects tend to go away.
- Don’t think of statistical significance as some hard threshold. In reality, this is just a tool for estimating whether the results of an analysis are real or bogus, and there’s nothing magical about crossing p < 0.05, p < 0.01, or any other threshold.
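The “20 metrics” tip above can be checked with a little arithmetic. This is a rough sketch that assumes the metrics are independent of one another (real business metrics often are not), and the function names are just for illustration:

```python
def expected_false_positives(n_metrics, alpha=0.05):
    """Average number of pure-noise metrics that cross the threshold."""
    return n_metrics * alpha


def chance_of_any_false_positive(n_metrics, alpha=0.05):
    """Probability that at least one pure-noise metric crosses the threshold.

    Assumes the metrics are independent, which is a simplification.
    """
    return 1 - (1 - alpha) ** n_metrics


if __name__ == "__main__":
    for n in (1, 5, 20):
        print(n, expected_false_positives(n),
              round(chance_of_any_false_positive(n), 3))
```

With 20 noise metrics, you expect one bogus “significant” result on average, and under the independence assumption the chance of seeing at least one is roughly 64%.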
Another useful tip is to change the way you think and speak about statistical significance. When I discuss data with clients, I prefer to avoid the phrase “statistically significant” entirely: I’ll use descriptive phrases like, “there’s probably something real” when the P value is close to the significance threshold, and “there’s almost certainly a real effect” when the P value is well below the significance threshold.
I find this gives my clients a much better understanding of what the data really means. All statistics are inherently fuzzy, and anointing some results as “statistically significant” tends to give a false impression of Scientific Truth.