One chance in 3.5 million

Yes, more on statistics.

In a recent NY Times article, science reporter Dennis Overbye discusses recent talks from Fermilab and CERN scientists which may hint at the discovery of the much-anticipated Higgs Boson. The executive summary is: it hasn’t been found yet.

But in the course of the article, Overbye points out that

To qualify as a discovery, some bump in the data has to have less than one chance in 3.5 million of being an unlucky fluctuation in the background noise.

That particular number is the so-called “five sigma” level from the Gaussian distribution. Normally, I would use this as an opportunity to discuss exactly what probability means in this context — is it a Bayesian “degree of belief” or a frequentist “p-value”, but for this discussion that distinction doesn’t matter: the important point is that one in 3.5 million is a very small chance indeed. [For the aficionados, the number is the probability that x > μ + 5 σ when x is described by a Gaussian distribution of mean μ and variance σ2.]

Why are we physicists so conservative? Are we just being very careful not to get it wrong, especially when making such a potentially important — Nobel-worthy! — discovery? Even for less ground-breaking results, the limit is often taken to be three sigma, which is about one chance in 750. This is a lot less conservative, but still pretty improbable: I’d happily bet a reasonable amount on a sporting event if I really thought I had 749 chances out of 750 of winning. However, there’s a maxim among scientists: half of all three sigma results are wrong. This may be an exaggeration, but certainly nobody believes “one in 750” is a good description of the probability (nor one in 3.5 million for five sigma results). How could this be? Fifty percent — one in two — is several hundred times more likely than 1/750.

There are several explanations, and any or all of them may be true for a particular result. First, people often underestimate their errors. More specifically, scientists often only include errors for which they can construct a distribution function — so-called statistical or random errors. But the systematic errors, which are, roughly speaking, every other way that the experimental results could be wrong, are usually not accounted for, and of course any “unknown systematics” are ignored by definition, and usually not discovered until well after the fact.

The controversy surrounding the purported measurements of the variation of the fine-structure constant that I discussed last week lies almost entirely in the different groups’ ability to incorporate a good model for the systematic errors in their very precise spectral measurements.

And then of course there are the even-less quantifiable biases that alter what results get published and how we interpret them. Chief among these may be publication or reporting bias: scientists and journals are more likely to publish, or even discuss, exciting new results than supposedly boring confirmations of the old paradigm. If there were a few hundred unpublished three-sigma unexciting confirmations for every published groundbreaking result, we would expect many of those to be statistical flukes. Some of these may be related to the so-called “decline effect” that Jonah Lehrer wrote about in the New Yorker recently: new results seem to get less statistically significant over time as more measurements are made. Finally, as my recent interlocutor, Andrew Gelman, points outclassical statistical methods that work reasonably well when studying moderate or large effects… fall apart in the presence of small effects.”

(In fact, Overbye discussed the large number of “false detections” in astronomy and physics in another Times article almost exactly a year ago.)

Unfortunately all of this can make it very difficult to interpret — and trust — statistical statements in the scientific literature, although we in the supposedly hard sciences have it a little easier as we can often at least enumerate the possible problems even if we can’t always come up with a good statistical model to describe our ignorance in detail.