[Apologies — this is long, technical, and there are too few examples. I am putting it out for commentary more than anything else…]

In some recent articles and blog posts (including one in response to astronomer David Hogg), Columbia University statistician Andrew Gelman has outlined the philosophical position that he and some of his colleagues and co-authors hold. While starting from a resolutely Bayesian perspective on using statistical methods to measure the parameters of a model, they depart from the usual story when evaluating models and comparing them to one another: rather than using the techniques of Bayesian model comparison, they prefer a set of techniques they describe as ‘model checking’. Let me apologize in advance if I misconstrue or caricature their views in any way in the following.

In the formalism of model comparison, the statistician or scientist needs to fully specify her model: what numbers are needed to describe it, how the data depend upon them (the likelihood), and a reasonable guess for what those numbers might be in the absence of data (the prior). Given these ingredients, one can first combine them to form the posterior distribution to estimate the parameters, and then go beyond this to actually determine the probability of the fully-specified model itself.

The first part of the method, estimating the parameters, is usually robust to the choice of a prior distribution for the parameters. In many cases, one can throw the possibilities wide open (an approximation to some sort of ‘complete ignorance’) and get a meaningful measurement of the parameters. In mathematical language, we take the limit of the posterior distribution as we make the prior distribution arbitrarily wide, and this limit often exists.

The problem, noticed by most statisticians and scientists who try to apply these methods, is that the next step, comparing models, is almost always sensitive to the details of the choice of prior: as the prior distribution gets wider and wider, the probability for the model gets lower and lower without limit; a model with an infinitely wide prior has zero probability compared to one with a finite width.
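A toy conjugate-Gaussian calculation (entirely my own illustration, with invented data and prior widths, not an example from the articles) shows both behaviours at once: as the prior on the mean widens, the posterior settles down to a stable answer while the log evidence keeps falling without limit.

```python
import numpy as np

def log_normal_pdf(x, mean, sd):
    """Log density of a Gaussian; used for the analytic evidence below."""
    return -0.5 * np.log(2 * np.pi * sd**2) - (x - mean) ** 2 / (2 * sd**2)

rng = np.random.default_rng(0)
sigma = 1.0                               # known noise standard deviation
y = rng.normal(0.5, sigma, size=20)       # invented data with true mean 0.5
n, ybar = len(y), y.mean()

post_means, log_evs = [], []
for tau in [1.0, 10.0, 1e3, 1e6]:         # ever-wider prior  mu ~ N(0, tau^2)
    # Conjugate Gaussian-Gaussian posterior for the mean
    post_var = 1.0 / (n / sigma**2 + 1.0 / tau**2)
    post_mean = post_var * n * ybar / sigma**2
    # Marginal likelihood ('evidence'): the integral of p(y|mu) p(mu) over mu,
    # which is analytic for this conjugate pair
    log_ev = (-0.5 * (n - 1) * np.log(2 * np.pi * sigma**2)
              - 0.5 * np.sum((y - ybar) ** 2) / sigma**2
              - 0.5 * np.log(n)
              + log_normal_pdf(ybar, 0.0, np.sqrt(sigma**2 / n + tau**2)))
    post_means.append(post_mean)
    log_evs.append(log_ev)
    print(f"tau = {tau:>9g}  posterior mean = {post_mean:.3f}  "
          f"log evidence = {log_ev:.2f}")
```

The posterior mean converges to the sample mean, but each factor of ten in the prior width costs the evidence roughly a factor of ten, so any rival model can win simply by quoting a narrower prior.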

In some situations, where we are not trying to model some sort of ignorance, this is fine. But in others, even if we know it is unreasonable to accept an arbitrarily large value for some parameter, we really cannot reasonably choose between, say, an upper limit of 10^{100} and one of 10^{50}, choices which may have vastly different consequences.

The other problem with model comparison is that, as the name says, it involves comparing models: it is impossible to merely reject a model *tout court*. But there are certainly cases when we would be wise to do so: the data have a noticeable, significant curve, but our model is a straight line. Or, more realistically (but also much more unusually in the history of science): we know about the advance of the perihelion of Mercury, but Einstein hasn’t yet come along to invent General Relativity; or Planck has written down the black body law but quantum mechanics hasn’t yet been formulated.

These observations lead Gelman and company to reject Bayesian model comparison entirely in favor of what they call ‘model checking’. Having made inferences about the parameters of a model, you next create simulated data from the posterior distribution and compare those simulations to the actual data. This latter step is done using some of the techniques of orthodox ‘frequentist’ methods: choosing a statistic, calculating p-values, and worrying about whether your observation is unusual because it lies in the tail of a distribution.
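As a minimal sketch of such a check (my own toy example, not one from their papers): fit a symmetric Gaussian model to data that are secretly skewed, simulate replicated datasets from the fitted model, and ask how often the replications show as much skewness as the data. The approximate posterior draw for the mean, with the noise scale fixed at its sample estimate, is a deliberate simplification of the full procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.exponential(1.0, size=100)   # pretend these are the real (skewed) data
n, ybar, s = len(y), y.mean(), y.std(ddof=1)

def skewness(x):
    """Sample skewness: a statistic chosen to probe asymmetry."""
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

T_obs = skewness(y)

# Simulate replicated datasets from the (wrong) symmetric Gaussian model,
# drawing the mean from its approximate posterior each time.
T_rep = np.empty(2000)
for i in range(len(T_rep)):
    mu = rng.normal(ybar, s / np.sqrt(n))    # approximate posterior draw
    y_rep = rng.normal(mu, s, size=n)        # replicated data under the model
    T_rep[i] = skewness(y_rep)

p_value = np.mean(T_rep >= T_obs)            # posterior predictive p-value
print(f"observed skewness = {T_obs:.2f}, p-value = {p_value:.4f}")
```

Essentially none of the replications are as skewed as the data, so the check flags the Gaussian model as inadequate; a histogram of `T_rep` against `T_obs` would be the graphical version of the same verdict.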

Having suggested these techniques, they go on to advocate a broader philosophical position on the use of probability in science: it is ‘hypothetico-deductive’, rather than ‘inductive’; Popperian rather than Kuhnian. (For another, even more critical, view of Kuhn’s philosophy of science, I recommend filmmaker Errol Morris’ excellent series of blog posts in the New York Times recounting his time as a graduate student in philosophy with Kuhn.)

At this point, I am sympathetic with their position, but worried about the details. A p-value is well-determined, but remains a kind of meaningless number: the probability of finding a value of your statistic as extreme as the one measured, or worse. But you didn’t get a worse value, so it’s not clear why this number is meaningful. On the other hand, it is clearly an indication of something: if it would have been unlikely to get a worse value, then your data must, in some perhaps ill-determined sense, be themselves unlikely. Indeed, I think it is worries like this that lead them very often to prefer purely graphical methods: the simulations ‘don’t look like’ the data.

The fact is, however, these methods work. They draw attention to data that do not fit the model and, with well-chosen statistics or graphs, lead the scientist to understand what might be wrong with the model. So perhaps we can get away without mathematically meaningful probabilities as long as we are “just” using them to guide our intuition rather than make precise statements about truth or falsehood.

They then go on to make a rather strange leap: deciding amongst any discrete set of parameters falls into the category of model comparison, against their rules. I’m not sure this restriction is necessary: if the posterior distribution for the discrete parameters makes sense, I don’t see why we should reject the inferences made from it.

In these articles they also discuss what it means for a model to be true or false, and what implications that has for the meaning of probability. As they argue, all models are in fact known to be *false*, certainly in the social sciences that most concern Gelman, and for the most part in the physical sciences as well, in the sense that they are not completely true in every detail. Newton was wrong, because Einstein was more right, and Einstein is most likely wrong because there is likely to be an even better theory of quantum gravity. Hence, they say, the subjective view of probability is wrong, since no scientist really believes in the truth of the model she is checking. I agree, but I think this is a caricature of the subjective view of probability: it misconstrues the meaning of ‘subjectivity’. If I had to use probabilities only to reflect what I truly believe, I wouldn’t be able to do science, since the only thing I am sure about my belief system is that it is incoherent:

> Do I contradict myself?
> Very well then I contradict myself,
> (I am large, I contain multitudes.)
>
> —Walt Whitman, *Song of Myself*

Subjective probability, at least the way it is actually used by practicing scientists, is a sort of “as-if” subjectivity: how would an agent reason *if* her beliefs were reflected in a certain set of probability distributions? This is why, when I discuss probability, I try to make the pedantic point that all probabilities are conditional, at least on some background prior information or context. So we shouldn’t really ever write the probability that a statement “A” is true as P(A), but rather as P(A|I) for some background information, “I”. If I change the background information to “J”, it shouldn’t surprise me that P(A|I)≠P(A|J). The whole point of doing science is to reason from assumptions and data; it is perfectly plausible for an actual scientist to restrict the context to a choice between two alternatives that she knows to be false. This view of probability owes a lot to Ed Jaynes (as also elucidated by Kevin Van Horn and others) and would probably be held by most working scientists if you made them articulate their views in a consistent way.

Still, these philosophical points do not take away from Gelman’s more practical ones, which to me seem distinct from those loftier questions and from each other: first, that the formalism of model comparison is often too sensitive to prior information; second, that we should be able to do some sort of alternative-free model checking in order to falsify a model even if we don’t have any well-motivated substitute. Indeed, I suspect that most scientists, even hardcore Bayesians, work this way even if they (we) don’t always admit it.

## Roberto Trotta

It seems to me that the strictly Popperian view of science (all models start off as being infinitely improbable, hence no amount of supporting evidence will ever make them probable) is untenable in practice. Every working scientist knows that models whose predictions are verified by observations do gain in credibility. This, of course, is automatically reflected in the Bayesian model comparison framework.

Most people would probably agree that the Bayesian approach answers the right question (as opposed to p-values!), although it does seem desirable to have an absolute scale for model quality in a Bayesian sense (rather than the relative scale of model comparison). This is what Bayesian doubt tries to achieve (in this article and this), although the notion remains very much a work in progress…

## Anton Garrett

First, there is some confusion here about what probability is. In any real problem you want the strength of implication of one binary proposition by another; and this quantity was proved in 1946 by RT Cox, from the Boolean calculus of the propositions, to obey the sum and product rules (and their immediate corollary Bayes' theorem). Since degree of implication is what we want and it obeys the sum and product rules, it seems sensible to take it as 'the probability' - but if anybody objects to this definition of the p-word, it is better not to engage them but simply sidestep and get on with solving the problem by calculating the degree of implication.

That said, there is a mystery. Suppose you plot a theoretical curve, a straight line with a particular gradient, and then plot the experimental datapoints, and find that they fit beautifully on a straight line but of a very different gradient. Why do we reject the theory even if we don’t have a competitor? This rejection is a problem for objective Bayesians, for whom theories can only be tested against each other on the basis of their predictions, and for whom the probabilities of the theories thrown into the game must (conditioned identically) add up to unity. In contrast, frequentist hypothesis testing can happily take place in a vacuum, with no predictive alternative.

I think the answer is as follows. When our minds see the datapoints lying on a straight line of a very different gradient, they immediately invent a new hypothesis - that the correct curve is a straight line, but of a different gradient than that predicted by the formal theory. Then our minds intuitively perform an Ockham-type Bayesian analysis, in which "straight line with data-estimated gradient" wins out over "straight line with the gradient predicted by the formal theory". The capacity of our minds to invent a new hypothesis in the light of the data is something that the Bayesian calculation does not replicate, and this explains the discrepancy.
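That intuitive Ockham-type analysis can be made explicit. In this sketch (all numbers invented by me for illustration), the ‘formal theory’ fixes the gradient of a line through the origin at 1, the data actually follow a gradient near 1.7, and the newly-invented hypothesis lets the gradient float under a broad Gaussian prior centred on the theoretical value.

```python
import numpy as np

def log_normal_pdf(x, mean, sd):
    """Log density of a Gaussian; used for the analytic evidence below."""
    return -0.5 * np.log(2 * np.pi * sd**2) - (x - mean) ** 2 / (2 * sd**2)

rng = np.random.default_rng(2)
sigma, a_theory, w = 0.2, 1.0, 5.0   # noise s.d., predicted gradient, prior width
x = np.linspace(0.0, 1.0, 30)
y = 1.7 * x + rng.normal(0.0, sigma, size=x.size)   # data: gradient near 1.7

Sxx = np.sum(x ** 2)
a_hat = np.sum(x * y) / Sxx           # least-squares (data-estimated) gradient
rss = np.sum((y - a_hat * x) ** 2)    # residuals about the best-fit line
base = -0.5 * len(y) * np.log(2 * np.pi * sigma**2)

# M1: straight line through the origin, gradient fixed by the formal theory
log_ev_fixed = base - 0.5 * (rss + Sxx * (a_theory - a_hat) ** 2) / sigma**2

# M2: gradient free, with prior a ~ N(a_theory, w^2); the marginal likelihood
# of this linear-Gaussian model is analytic
log_ev_free = (base - 0.5 * rss / sigma**2
               + 0.5 * np.log(2 * np.pi * sigma**2 / Sxx)
               + log_normal_pdf(a_hat, a_theory, np.sqrt(sigma**2 / Sxx + w**2)))

log_bayes_factor = log_ev_free - log_ev_fixed
print(f"fitted gradient = {a_hat:.2f}")
print(f"log Bayes factor (free gradient vs theory) = {log_bayes_factor:.1f}")
```

The broad prior (the width `w`) costs the free-gradient model a few units of log evidence, the Ockham penalty, but the fixed-gradient model’s misfit costs it far more, so the invented hypothesis wins decisively, matching what our eyes do instantly.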

Popper's concern that there are an infinite number of theories, all a priori equally improbable, is as confused as much else that this over-rated philosopher wrote. It is only theories with testable predictions that get thrown into the Bayesian survival-of-the-fittest contest. A deeper problem with Popper's view of physical science is that he rejected inductive logic but accepted probability - yet inductive logic, when done correctly, IS (Bayesian) probabilistic reasoning. Probabilistic testing is the only practical way to compare theories; without it, you are left only with the black-and-white of True or False, whence Popper's 'falsifiability' criterion. Whoever thought that it was the height of scientific ambition to see your theory proved WRONG?