It’s Time for Science to Abandon the Term ‘Statistically Significant’
The aim of science is to establish facts, as accurately as possible. It is therefore crucially important to determine whether an observed phenomenon is real, or whether it’s the result of pure chance. If you declare that you’ve discovered something when in fact it’s just random, that’s called a false discovery or a false positive. And false positives are alarmingly common in some areas of medical science. In 2005, the epidemiologist John Ioannidis at Stanford caused a storm when he wrote the paper ‘Why Most Published Research Findings Are False’, focusing on results in certain areas of biomedicine. He’s been vindicated by subsequent investigations. For example, a recent article found that repeating 100 different results in experimental psychology confirmed the original conclusions in only 38 per cent of cases. It’s probably at least as bad for brain-imaging studies and cognitive neuroscience. How can this happen?
The problem of how to distinguish a genuine observation from random chance is a very old one. It’s been debated for centuries by philosophers and, more fruitfully, by statisticians. It turns on the distinction between induction and deduction. Science is an exercise in inductive reasoning: we are making observations and trying to infer general rules from them. Induction can never be certain. In contrast, deductive reasoning is easier: you deduce what you would expect to observe if some general rule were true and then compare it with what you actually see. The problem is that, for a scientist, deductive arguments don’t directly answer the question that you want to ask.
What matters to a scientific observer is how often you’ll be wrong if you claim that an effect is real, rather than being merely random. That’s a question of induction, so it’s hard. In the early 20th century, it became the custom to avoid induction, by changing the question into one that used only deductive reasoning. In the 1920s, the statistician Ronald Fisher did this by advocating tests of statistical significance. These are wholly deductive and so sidestep the philosophical problems of induction.
Tests of statistical significance proceed by calculating the probability of making our observations (or the more extreme ones) if there were no real effect. This isn’t an assertion that there is no real effect, but rather a calculation of what would be expected if there were no real effect. The postulate that there is no real effect is called the null hypothesis, and the probability is called the p-value. Clearly the smaller the p-value, the less plausible the null hypothesis, so the more likely it is that there is, in fact, a real effect. All you have to do is to decide how small the p-value must be before you declare that you’ve made a discovery. But that turns out to be very difficult.
The problem is that the p-value gives the right answer to the wrong question. What we really want to know is not the probability of the observations given a hypothesis about the existence of a real effect, but rather the probability that there is a real effect – that the hypothesis is true – given the observations. And that is a problem of induction.