What We Talk About When We Talk About Probability

In his most recent post, Cosmic Variance’s Mark Trodden talks about one of the presentations we both saw at last week’s meeting in Ischia, where he explains one of the hot new techniques for analyzing cosmological data, the (so-called) Bayesian Evidence.

Let’s unpack this term. First, “Bayesian”, named after the Reverend Thomas Bayes. The question is: what is probability? If you’ve never taken a statistics class, then you (probably!) think that probability is defined so that the more probable something is, the more certain it is to happen. So we can define probability such that if P=1, it is certain to happen, and if P=0 it is certain never to happen. It turns out that you can show mathematically that there’s only one self-consistent way to define “degrees of belief” between 0 and 1: if P=1/2, then, given some information about an event, it is just as likely to happen as not happen. In the Bayesian interpretation of statistics, the crucial part of the last sentence is “given some information about an event”. In technical terms, all probability is conditional. For example, think about the statistician’s favorite example, tossing a coin. Usually, we say that a fair coin has P=1/2 to be heads, and P=1/2 to be tails. But that’s only because we usually don’t know enough about the way the coin is tossed (whether it starts with heads or tails up, the exact details of the way it is tossed, etc.) to make a better prediction: probability is not a statement about the physics of coin-tossing so much as it is about the information we have about coin-tossing in a particular circumstance.

But if you suffered through a “probability and statistics” class, you learned to equate probability not with any sort of information or belief, but with frequency: the fraction of times that some event (for example, heads) would happen in some (imaginary!) infinitely long set of somehow similar experiments. When we’re trying to measure something, we report something like: the age of the Universe is 13.6±1 Gyr (a Gyr is a billion years). With this “frequentist” interpretation of probability, this means something like the following: I make up some algorithm, which transforms whatever data I have into a single measurement of the age (very often, this is just an average of individual measurements), and which gives me the 13.6 Gyr. I get the error bars by calculating the distribution of results from that imaginary infinite set of experiments where I start with a true value of 13.6 Gyr: 67% of them will give a result between 12.6 and 14.6 Gyr.
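
To make that imaginary set of experiments concrete, here is a minimal sketch that simply simulates a large (but finite) number of them. Everything in it is an assumption for illustration: the scatter is taken to be Gaussian, each experiment averages four measurements with a noise of 2 Gyr apiece (so the average has an uncertainty of 1 Gyr), and the names are made up. Under those assumptions roughly two thirds of the experiments land inside the quoted range (about 68% for an exactly Gaussian distribution, close to the 67% quoted above).

```python
import random

TRUE_AGE = 13.6        # Gyr, the assumed true value
NOISE = 2.0            # Gyr, assumed scatter of one individual measurement
N_MEAS = 4             # measurements per experiment, so the mean scatters by 1 Gyr
N_EXPERIMENTS = 50000  # a finite stand-in for the "infinite" set of experiments


def run_experiment(rng):
    """One imaginary experiment: average a few noisy measurements of the age."""
    measurements = [rng.gauss(TRUE_AGE, NOISE) for _ in range(N_MEAS)]
    return sum(measurements) / len(measurements)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
estimates = [run_experiment(rng) for _ in range(N_EXPERIMENTS)]

# Fraction of experiments whose result falls inside the quoted error bar.
inside = sum(1 for e in estimates if 12.6 <= e <= 14.6)
fraction = inside / N_EXPERIMENTS
print(f"fraction between 12.6 and 14.6 Gyr: {fraction:.3f}")
```

Note that the statement is about the spread of the estimates around a fixed true value, not about where the true value is likely to be given one estimate.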

(If this sounds confusing, don’t worry. There are strong arguments in the statistics community that these frequentist methods are actually philosophically or mathematically incoherent!)

Actually, even this description isn’t quite right: really, you need to consider a distribution of other underlying true values. For the aficionados, the full construction of these error bars was originally done by Neyman, and was discussed more recently (and compared to the Bayesian case) by Feldman and Cousins.

The alternative, Bayesian, view of probability is simply that when I state the age of the Universe is 13.6±1 billion years, it means that I am 67% certain that the value is between 12.6 and 14.6 billion years. In the early 20th century, Bruno de Finetti showed that you can further refine exactly what “67% certain” means in terms of odds and wagering: if something is 50% certain, I would give even odds on a bet; if it’s 67% certain I would give 2:1 odds, etc. Probability is thus intimately tied to information: what I knew before I performed the experiment (more precisely, what I know in the absence of that experimental data), which is called the prior probability, and what new information the experiment gives, which is called the likelihood.
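
The arithmetic behind those odds is short enough to sketch: a degree of belief p corresponds to fair odds of p to (1 - p). The function name below is made up for illustration, and exact fractions are used only to avoid rounding noise.

```python
from fractions import Fraction


def fair_odds(p):
    """Return the fair odds in favor, p : (1 - p), as an exact ratio."""
    p = Fraction(p)
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    return p / (1 - p)


even = fair_odds(Fraction(1, 2))        # 50% certain
two_to_one = fair_odds(Fraction(2, 3))  # two-thirds certain
print(even, two_to_one)                 # prints "1 2"
```

A belief of exactly 67% gives odds of 67:33, which is why "67% certain" and "2:1" are used interchangeably here as round numbers.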

(As an aside, I think that Mark has actually given a frequentist interpretation of error bars in his discussion!)

But sometimes we don’t just want to measure some parameter like the age of the Universe. Rather we might want to discriminate between two different models describing the same data. For example, we might want to consider the Big Bang model, in which the Universe started out hot and dense at some time in the past, compared to the Steady State model, in which the Universe has always been expanding, but with matter created to ‘fill in the gaps’ (until the 1960s, this model was a serious contender for explaining cosmological data). At this level, we don’t care about the value of the age, only that there is some finite maximum age on the one hand or not on the other. But there is nothing particularly difficult about answering this sort of question in the Bayesian formalism. The “evidence” is simply the likelihood of a model, rather than of a parameter within the model (in fact, I advocate just calling this quantity the “model likelihood”, rather than the more mysterious and easily abused “evidence” favored by those educated in Cambridge). Formally, this happens to be calculated by integrating over the parameter probabilities, and it turns out that it can be seen as two competing factors. One factor gives a higher probability for models that simply fit the data better for some values of the parameters. Another, competing, factor is higher for models that are simpler (defined in this case as models having fewer parameters, or with less available parameter space). This latter piece is often called the “Ockham factor”, after the famous razor, “entities should not be multiplied unnecessarily,” colloquially rendered as “keep it simple, stupid” — models with too many knobs to twiddle are penalized, unless they fit the data much better.
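
Here is a toy sketch of those two competing factors, not a cosmological calculation. The data values are made up, the noise is assumed Gaussian with known width, and the two models are deliberately trivial: one fixes a mean at zero (no knobs), the other lets the mean float with a flat prior of some chosen width. The model likelihood of the second is the ordinary likelihood integrated over that prior, and dividing it by the best-fit likelihood gives a number that plays the role of the Ockham factor.

```python
import math


def log_likelihood(mu, data, sigma=1.0):
    """Gaussian log-likelihood of the data for a given mean mu."""
    chi2 = sum((x - mu) ** 2 for x in data) / sigma ** 2
    return -0.5 * chi2 - len(data) * math.log(sigma * math.sqrt(2.0 * math.pi))


def evidence_fixed(data, mu0=0.0):
    """Model likelihood for a model with no free parameters (mu fixed at mu0)."""
    return math.exp(log_likelihood(mu0, data))


def evidence_free(data, prior_width, steps=20001):
    """Model likelihood for a free mu with a flat prior centered on zero.

    Integrates likelihood * prior over mu with the trapezoid rule.
    """
    lo = -prior_width / 2.0
    h = prior_width / (steps - 1)
    total = 0.0
    for i in range(steps):
        weight = 0.5 if i in (0, steps - 1) else 1.0
        total += weight * math.exp(log_likelihood(lo + i * h, data))
    return total * h / prior_width  # flat prior density is 1 / prior_width


data = [0.3, -0.5, 0.1, 0.4, -0.2]  # made-up numbers consistent with mu = 0

z_fixed = evidence_fixed(data)
z_narrow = evidence_free(data, prior_width=4.0)
z_wide = evidence_free(data, prior_width=40.0)

# Best-fit likelihood of the free model (the sample mean maximizes it).
best_fit = math.exp(log_likelihood(sum(data) / len(data), data))
occam_narrow = z_narrow / best_fit
occam_wide = z_wide / best_fit

print("fixed model:      ", z_fixed)
print("free, narrow prior:", z_narrow, "Ockham factor", occam_narrow)
print("free, wide prior:  ", z_wide, "Ockham factor", occam_wide)
```

With these numbers the free model fits very slightly better at its best-fit point, but its model likelihood is lower than the fixed model's, and it drops by another factor of about ten when the prior is made ten times wider: more available parameter space, more penalty.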

In a cosmological context (as I discussed a while ago, as Hobson & Lasenby talked about last week, and as Andrew Liddle has written about extensively recently), it turns out that a relatively simple model with about 6 parameters fits the data very well: none of the possible bells and whistles seems to improve the fit enough to be worth the added complexity. This is both a remarkable triumph of our theorizing and an indication that we need to get better data since, probably, these models are too simple to be enough to describe the universe in all its detailed glory.