Continuing my recent, seemingly interminable, series of too-technical posts on probability theory… To understand this one you’ll need to remember Bayes’ Theorem, and the resulting need for a Bayesian statistician to come up with an appropriate prior distribution to describe her state of knowledge in the absence of the experimental data she is considering, updated to the posterior distribution after considering that data. I should perhaps follow the guide of blogging-hero Paul Krugman and explicitly label posts like this as “wonkish”.
(If instead you’d prefer something a little more tutorial, I can recommend the excellent recent post from my colleague Ted Bunn, discussing hypothesis testing, stopping rules, and cheating at coin flips.)
Deborah Mayo has begun her own series of posts discussing some of the articles in a recent special volume of the excellently-named journal, “Rationality, Markets and Morals” on the topic Statistical Science and Philosophy of Science.
She has started with a discussion of Stephen Senn’s “You May Believe You are a Bayesian But You Are Probably Wrong”: she excerpts the article here and then gives her own deconstruction in the sequel.
Senn’s article begins with a survey of the different philosophical schools of statistics: not just frequentist versus Bayesian (for which he also uses the somewhat old-fashioned names of “direct” versus “inverse” probability), but also how the practitioners choose to apply the probabilities that they calculate: either directly, as inferences about the world, or as inputs to decisions, which in turn give a further meaning to the probability.
Having cleaved the statistical world in four, Senn makes a clever rhetorical move. In a wonderfully multilevelled backhanded compliment, he writes
If any one of the four systems had a claim to our attention then I find de Finetti’s subjective Bayes theory extremely beautiful and seductive (even though I must confess to also having some perhaps irrational dislike of it). The only problem with it is that it seems impossible to apply.
He discusses why it is essentially impossible to perform completely coherent ground-up analyses within the Bayesian formalism:
This difficulty is usually described as being the difficulty of assigning subjective probabilities but, in fact, it is not just difficult because it is subjective: it is difficult because it is very hard to be sufficiently imaginative and because life is short.
And, later on:
The … test is that whereas the arrival of new data will, of course, require you to update your prior distribution to being a posterior distribution, no conceivable possible constellation of results can cause you to wish to change your prior distribution. If it does, you had the wrong prior distribution and this prior distribution would therefore have been wrong even for cases that did not leave you wishing to change it. This means, for example, that model checking is not allowed.
I think that these criticisms mis-state the practice of Bayesian statistics, at least by the scientists I know (mostly cosmologists and astronomers). We do not treat statistics as a grand system of inference (or decision) starting from a single, primitive state of knowledge which we use to reason all the way through to new theoretical paradigms. The caricature of Bayesianism starts with a wide open space of possible theories, and we add data, narrowing our beliefs to accord with our data, using the resulting posterior as the prior for the next set of data to come across our desk.
Rather, most of us take a vaguely Jaynesian view, after the cranky Edwin Jaynes, as espoused in his forty years of papers and his polemical book Probability Theory: The Logic of Science — all probabilities are conditional upon information (although he would likely have been much more hard-core). Contra Senn’s suggestions, the individual doesn’t need to continually adjust her subjective probabilities until she achieves an overall coherence in her views. She just needs to present (or summarise in a talk or paper) a coherent set of probabilities based on given background information (perhaps even more than one set). As long as she carefully states the background information (and the resulting prior), the posterior is a completely coherent inference from it.
In this view, probability doesn’t tell us how to do science, just analyse data in the presence of known hypotheses. We are under no obligation to pursue a grand plan, listing all possible hypotheses from the outset. Indeed we are free to do ‘exploratory data analysis’ using (even) not-at-all-Bayesian techniques to help suggest new hypotheses. This is a point of view espoused most forcefully by Andrew Gelman (author of another paper in the special volume of RMM).
Of course this does not solve all formal or philosophical problems with the Bayesian paradigm. In particular, as I’ve discussed a few times recently, it doesn’t solve what seems to me the most knotty problem of hypothesis testing in the presence of what one would like to be ‘wide open’ prior information.
[Apologies — this is long, technical, and there are too few examples. I am putting it out for commentary more than anything else…]
In some recent articles and blog posts (including one in response to astronomer David Hogg), Columbia University statistician Andrew Gelman has outlined the philosophical position that he and some of his colleagues and co-authors hold. While starting from a resolutely Bayesian perspective on using statistical methods to measure the parameters of a model, he and they depart from the usual story when evaluating models and comparing them to one another. Rather than using the techniques of Bayesian model comparison, they eschew them in preference to a set of techniques they describe as ‘model checking’. Let me apologize in advance if I misconstrue or caricature their views in any way in the following.
In the formalism of model comparison, the statistician or scientist needs to fully specify her model: what are the numbers needed to describe the model, how does the data depend upon them (the likelihood), as well as a reasonable guess for what those numbers might be in the absence of data (the prior). Given these ingredients, one can first combine them to form the posterior distribution to estimate the parameters but then go beyond this to actually determine the probability of the fully-specified model itself.
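For concreteness, the two steps just described can be written out as follows. The symbols are notational assumptions introduced here, not taken from the articles under discussion: θ stands for the model parameters, D for the data, M for the model and I for the background information.

```latex
% Step 1: parameter estimation within a model M
P(\theta \mid D, M, I) =
  \frac{P(D \mid \theta, M, I)\, P(\theta \mid M, I)}{P(D \mid M, I)}

% The normalising constant is the model likelihood (the "evidence")
P(D \mid M, I) = \int P(D \mid \theta, M, I)\, P(\theta \mid M, I)\, d\theta

% Step 2: the probability of the model itself
P(M \mid D, I) \propto P(D \mid M, I)\, P(M \mid I)
```

The prior P(θ | M, I) appears inside the integral in the second line, which is why the probability of the model can depend on the prior even when the parameter estimates barely do.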
The first part of the method, estimating the parameters, is usually robust to the choice of a prior distribution for the parameters. In many cases, one can throw the possibilities wide open (an approximation to some sort of ‘complete ignorance’) and get a meaningful measurement of the parameters. In mathematical language, we take the limit of the posterior distribution as we make the prior distribution arbitrarily wide, and this limit often exists.
The problem, noticed by most statisticians and scientists who try to apply these methods, is that the next step, comparing models, is almost always sensitive to the details of the choice of prior: as the prior distribution gets wider and wider, the probability for the model gets lower and lower without limit; a model with an infinitely wide prior has zero probability compared to one with a finite width.
In some situations, where we do not wish to model some sort of ignorance, this is fine. But in others, even if we know it is unreasonable to accept an arbitrarily large value for some parameter, we really cannot reasonably choose between, say, an upper limit of 10^100 and 10^50, which may have vastly different consequences.
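A toy calculation may make the contrast concrete. The sketch below assumes a deliberately simple setup that is not taken from any of the papers discussed: n measurements with unit Gaussian noise and sample mean xbar, a uniform prior on the mean over a range of total width `width` centred on zero, and a rival model that fixes the mean at exactly zero. The function name `summarize` and all the numbers are illustrative only.

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def summarize(xbar, n, width):
    """Posterior mean of mu, and Bayes factor against the mu = 0 model.

    Assumes unit-variance Gaussian data and a uniform prior on mu over
    [-width/2, width/2]; the posterior is then a truncated Gaussian.
    """
    s = 1.0 / math.sqrt(n)              # posterior width set by the data
    a = (-width / 2.0 - xbar) / s       # standardised lower prior edge
    b = (width / 2.0 - xbar) / s        # standardised upper prior edge
    mass = normal_cdf(b) - normal_cdf(a)
    post_mean = xbar + s * (normal_pdf(a) - normal_pdf(b)) / mass
    # Evidence ratio: (integral of likelihood over prior) / likelihood at mu = 0
    bayes_factor = (s * math.sqrt(2.0 * math.pi) / width) * mass \
        * math.exp(0.5 * n * xbar ** 2)
    return post_mean, bayes_factor

for width in (10.0, 1e50, 1e100):
    mean, bf = summarize(xbar=0.5, n=25, width=width)
    print(f"width={width:g}  posterior mean={mean:.3f}  Bayes factor={bf:.3g}")
```

In this setup the posterior mean stays put as the prior widens, while the Bayes factor falls in proportion to 1/width, so the choice between 10^50 and 10^100 changes the model comparison by fifty orders of magnitude.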
The other problem with model comparison is that, as the name says, it involves comparing models: it is impossible to merely reject a model tout court. But there are certainly cases when we would be wise to do so: the data have a noticeable, significant curve, but our model is a straight line. Or, more realistically (but also much more unusually in the history of science): we know about the advance of the perihelion of Mercury, but Einstein hasn’t yet come along to invent General Relativity; or Planck has written down the black body law but quantum mechanics hasn’t yet been formulated.
These observations lead Gelman and company to reject Bayesian model comparison entirely in favor of what they call ‘model checking’. Having made inferences about the parameters of a model, you next create simulated data from the posterior distribution and compare those simulations to the actual data. This latter step is done using some of the techniques of orthodox ‘frequentist’ methods: choosing a statistic, calculating p-values, and worrying about whether your observation is unusual because it lies in the tail of a distribution.
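A minimal sketch of such a check, under assumptions chosen purely for illustration: the model says the data are Gaussian with known unit variance and unknown mean, the prior on the mean is flat (so the posterior is Gaussian around the sample mean), and the test statistic is the range of the data. The function names and the data values are hypothetical, and this is not presented as the procedure used by Gelman and his co-authors.

```python
import random
import statistics

def posterior_predictive_pvalue(data, statistic, n_sims=2000, seed=0):
    """Fraction of simulated datasets whose statistic is at least as
    extreme as the observed one."""
    rng = random.Random(seed)
    n = len(data)
    xbar = statistics.fmean(data)
    observed = statistic(data)
    exceed = 0
    for _ in range(n_sims):
        # Draw the mean from its posterior: Normal(xbar, 1/sqrt(n))
        mu = rng.gauss(xbar, 1.0 / n ** 0.5)
        # Simulate a replicated dataset of the same size from the model
        replicated = [rng.gauss(mu, 1.0) for _ in range(n)]
        if statistic(replicated) >= observed:
            exceed += 1
    return exceed / n_sims

def spread(xs):
    """Test statistic: the range of the data."""
    return max(xs) - min(xs)

well_behaved = [-0.8, 0.3, 1.1, -0.2, 0.6, -1.4, 0.9, 0.1, -0.5, 0.4]
with_outlier = well_behaved[:-1] + [9.0]

print("p-value, well-behaved data:", posterior_predictive_pvalue(well_behaved, spread))
print("p-value, data with outlier:", posterior_predictive_pvalue(with_outlier, spread))
```

A p-value near zero for the second dataset flags that the unit-variance Gaussian model cannot reproduce its spread, without any alternative model having been named.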
Having suggested these techniques, they go on to advocate a broader philosophical position on the use of probability in science: it is ‘hypothetico-deductive’, rather than ‘inductive’; Popperian rather than Kuhnian. (For another, even more critical, view of Kuhn’s philosophy of science, I recommend filmmaker Errol Morris’ excellent series of blog posts in the New York Times recounting his time as a graduate student in philosophy with Kuhn.)
At this point, I am sympathetic with their position, but worried about the details. A p-value is well-determined, but remains a kind of meaningless number: the probability of finding the value of your statistic as measured or worse. But you didn’t get a worse value, so it’s not clear why this number is meaningful. On the other hand, it is clearly an indication of something: if it is unlikely to have got a worse value then your data must, in some perhaps ill-determined sense, be itself unlikely. Indeed I think it is worries like this that lead them very often to prefer purely graphical methods — the simulations ‘don’t look like’ the data.
The fact is, however, these methods work. They draw attention to data that do not fit the model and, with well-chosen statistics or graphs, lead the scientist to understand what might be wrong with the model. So perhaps we can get away without mathematically meaningful probabilities as long as we are “just” using them to guide our intuition rather than make precise statements about truth or falsehood.
Having suggested these techniques, they go on to make a rather strange leap: deciding amongst any discrete set of parameters falls into the category of model comparison, against their rules. I’m not sure this restriction is necessary: if the posterior distribution for the discrete parameters makes sense, I don’t see why we should reject the inferences made from it.
In these articles they also discuss what it means for a model to be true or false, and what implications that has for the meaning of probability. As they argue, all models are in fact known to be false, certainly in the social sciences that most concern Gelman, and for the most part in the physical sciences as well, in the sense that they are not completely true in every detail. Newton was wrong, because Einstein was more right, and Einstein is most likely wrong because there is likely to be an even better theory of quantum gravity. Hence, they say, the subjective view of probability is wrong, since no scientist really believes in the truth of the model she is checking. I agree, but I think this is a caricature of the subjective view of probability: it misconstrues the meaning of ‘subjectivity’. If I had to use probabilities only to reflect what I truly believe, I wouldn’t be able to do science, since the only thing that I am sure about my belief system is that it is incoherent:
Do I contradict myself?
Very well then I contradict myself,
(I am large, I contain multitudes.)
— Walt Whitman, Song of Myself
Subjective probability, at least the way it is actually used by practicing scientists, is a sort of “as-if” subjectivity — how would an agent reason if her beliefs were reflected in a certain set of probability distributions? This is why when I discuss probability I try to make the pedantic point that all probabilities are conditional, at least on some background prior information or context. So we shouldn’t really ever write a probability that statement “A” is true as P(A), but rather as P(A|I) for some background information, “I”. If I change the background information to “J”, it shouldn’t surprise me that P(A|I)≠P(A|J). The whole point of doing science is to reason from assumptions and data; it is perfectly plausible for an actual scientist to restrict the context to a choice between two alternatives that she knows to be false. This view of probability owes a lot to Ed Jaynes (as also elucidated by Keith van Horn and others) and would probably be held by most working scientists if you made them elucidate their views in a consistent way.
Still, these philosophical points do not take away from Gelman’s more practical ones, which to me seem distinct from those loftier questions and from each other: first, that the formalism of model comparison is often too sensitive to prior information; second, that we should be able to do some sort of alternative-free model checking in order to falsify a model even if we don’t have any well-motivated substitute. Indeed, I suspect that most scientists, even hardcore Bayesians, work this way even if they (we) don’t always admit it.
The perfect stocking-stuffer for that would-be Bayesian cosmologist you’ve been shopping for:
As readers here will know, the Bayesian view of probability is just that probabilities are statements about our knowledge of the world, and thus eminently suited to use in scientific inquiry (indeed, this is really the only consistent way to make probabilistic statements of any sort!). Over the last couple of decades, cosmologists have turned to Bayesian ideas and methods as tools to understand our data. This book is a collection of specially-commissioned articles, intended as both a primer for astrophysicists new to this sort of data analysis and as a resource for advanced topics throughout the field.
Our back-cover blurb:
In recent years cosmologists have advanced from largely qualitative models of the Universe to precision modelling using Bayesian methods, in order to determine the properties of the Universe to high accuracy. This timely book is the only comprehensive introduction to the use of Bayesian methods in cosmological studies, and is an essential reference for graduate students and researchers in cosmology, astrophysics and applied statistics.
The first part of the book focuses on methodology, setting the basic foundations and giving a detailed description of techniques. It covers topics including the estimation of parameters, Bayesian model comparison, and separation of signals. The second part explores a diverse range of applications, from the detection of astronomical sources (including through gravitational waves), to cosmic microwave background analysis and the quantification and classification of galaxy properties. Contributions from 24 highly regarded cosmologists and statisticians make this an authoritative guide to the subject.
Before it happened, I would have said slim. But since it happened, 100%.—Laurence Fishburne, CSI, on the chances of being hit in the head by a tortoise dropped by a bird of prey.
I know this is a tired topic, but I am unable to resist using this as an opportunity to slag off Stanley Fish’s idiotic attempt to equate “faith” in religion with “faith” in science. In both cases, we are talking about conditional probability, P(hypothesis | information ), which is read as “the probability of the hypothesis given the information”. I suppose that when the religious discuss “faith” in science, they are referring to the fact that something needs to go on the right side of the bar — all probabilities are conditional on something. But a crucial difference between religion and science is that the religious only put a couple of things on the right side: the words of a holy book (and don’t ask me why one should choose one book over another), or just the effects of some vaporous conversion experience which leaves all such probabilities as tautologies — god exists since I know god exists. For science, however, we get to condition our probabilities on, well, pretty much anything and everything. And the more we learn, the better it gets.
OK, this is going to be a very long post. About something I don’t pretend to be expert in. But it is science, at least.
A couple of weeks ago, Radio 4’s highbrow “In Our Time” tackled the so-called “Measurement Problem”. That is: quantum mechanics predicts probabilities, not definite outcomes. And yet we see a definite world. Whenever we look, a particle is in a particular place. A cat is either alive or dead, in Schrödinger’s infamous example. So, lots to explain in just setting up the problem, and even more in the various attempts so far to solve it (none quite satisfactory). This is especially difficult because the measurement problem is, I think, unique in physics: quantum mechanics appears to be completely true and experimentally verified, without contradiction so far. And yet it seems incomplete: the “problem” arises because the equations of quantum mechanics only provide a recipe for the calculation of probabilities, but don’t seem to explain what’s going on underneath. For that, we need to add a layer of interpretation on top. Melvyn Bragg had three physicists down to the BBC studios, each with his own idea of what that layer might look like.
Unfortunately, the broadcast seemed to me a bit of a shambles: the first long explanation by Basil Hiley of Birkbeck of quantum mechanics used the terms “wavefunction” and “linear superposition” without even an attempt at a definition. Things got a bit better as Bragg tried to tease things out, but I can’t imagine the non-physicists that were left listening got much out of it. Hiley himself worked with David Bohm on one possible solution to the measurement problem, the so-called “Pilot Wave Theory” (another term which was used a few times without definition), in which quantum mechanics is actually a deterministic theory: the probabilities come about because there is information about the locations and trajectories of particles to which we do not, and in principle cannot, have access.
Roger Penrose proved to be remarkably positivist in his outlook: he didn’t like the other interpretations on offer simply because they make no predictions beyond standard quantum mechanics and are therefore untestable. (Others see this as a selling point for these interpretations, however — there is no contradiction with experiment!) To the extent I understand his position, Penrose himself prefers the idea that quantum mechanics is actually incomplete, and that when it is finally reconciled with General Relativity (in a Theory of Everything or otherwise), we will find that it actually does make specific, testable predictions.
There was a long discussion by Simon Saunders of that sexiest of interpretations of quantum mechanics, the Many Worlds Interpretation. The latest incarnation of Many-Worlds theory is centered around workers in or near Oxford: Saunders himself, David Wallace and most famously David Deutsch. The Many-Worlds interpretation (also known as the Everett Interpretation after its initial proponent) attempts to solve the problem by saying that there is nothing special about measurement at all: the simple equations of quantum mechanics always obtain. In order for this to occur, all possible outcomes of any experiment must be actualized: that is, there must be a world for each outcome. But we’re not just talking about outcomes of science experiments here, but rather any time that quantum mechanics could have predicted something other than what (seemingly) actually happened. Which is all the time, to all of the particles in the Universe, everywhere. This is, to say the least, “ontologically extravagant”. Moreover, it has always been plagued by at least one fundamental problem: what, exactly, is the status of probability in the many-worlds view? When more than one quantum-mechanical possibility presents itself, each splits into its own world, with a probability related to the aforementioned wavefunction. But what beyond this does it mean for one branch to have a higher probability? The Oxonian many-worlders have tried to use decision theory to reconcile this with the prescriptions of quantum mechanics: from very minimal requirements of rationality alone, can we derive the probability rule? They claim to have done so, and they further claim that their proof only makes sense in the Many-Worlds picture. This is, roughly, because only in the Everett picture is there no “fact of the matter” at all about what actually happens in a quantum outcome; in all other interpretations the very existence of a single actual outcome is enough to scupper the proof.
(I’m not so sure I buy this — surely we are allowed to base rational decisions on only the information at hand, as opposed to all of the information potentially available?)
At bottom, these interpretations of quantum mechanics (aka solutions to the measurement problem) are trying to come to grips with the fact that quantum mechanics seems to be fundamentally about probability, rather than the way things actually are. And, as I’ve discussed elsewhere, time and time again, probability is about our states of knowledge, not the world. But we are justly uncomfortable with 70s-style “Tao-of-Physics” ideas that make silly links between consciousness and the world at large.
But there is an interpretation that takes subjective probability seriously without resorting to the extravagance of many (very, very many) worlds. Chris Fuchs, along with his collaborators Carlton Caves and Ruediger Schack, has pursued this idea with some success. Whereas the many-worlds interpretation requires a universe that seems far too full for me, the Bayesian interpretation is somewhat underdetermined: there is a level of being that is, literally, unspeakable: there is no information to be had about the quantum realm beyond our experimental results. This is, as Fuchs points out, a very strong restriction on how we can assign probabilities to events in the world. But I admit some dissatisfaction at the explanatory power of the underlying physics at this point (discussed in some technical detail in a review by yet another Oxford philosopher of science, Christopher Timpson).
In both the Bayesian and Many Worlds interpretations (at least in the modern versions of the latter), probability is supposed to be completely subjective, as it should be. But something still seems to be missing: probability assignments are, in fact, testable, using techniques such as Bayesian model selection. What does it mean, in the purely subjective interpretation, to be correct, or at least more correct? Sometimes, this is couched as David Lewis’ “principal principle” (it’s very hard to find a good distillation of this on the web, but here’s a try): there is something out there called “objective chance” and our subjective probabilities are meant to track it. (I am not sure this is coherent, and even Lewis himself usually gave the example of a coin toss, in which there is nothing objective at all about the chance involved: if you know the initial conditions of the coin and the way it is flipped and caught, you can predict the outcome with certainty.) But something at least vaguely objective seems to be going on in quantum mechanics: more probable outcomes happen more often, at least for the probability assignments that physicists make given what we know about our experiments. This isn’t quite “objective chance” perhaps, but it’s not clear that there isn’t another layer of physics still to be understood.
In today’s Sunday NY Times Magazine, there’s a long article by psychologist Steven Pinker, on “Personal Genomics”, the growing ability for individuals to get information about their genetic inheritance. He discusses the evolution of psychological traits versus intelligence, and highlights the complicated interaction amongst genes, and between genes and society.
But what caught my eye was this paragraph:
What should I make of the nonsensical news that I… have a “twofold risk of baldness”? … 40 percent of men with the C version of the rs2180439 SNP are bald, compared with 80 percent of men with the T version, and I have the T. But something strange happens when you take a number representing the proportion of people in a sample and apply it to a single individual…. Anyone who knows me can confirm that I’m not 80 percent bald, or even 80 percent likely to be bald; I’m 100 percent likely not to be bald. The most charitable interpretation of the number when applied to me is, “If you knew nothing else about me, your subjective confidence that I am bald, on a scale of 0 to 10, should be 8.” But that is a statement about your mental state, not my physical one. If you learned more clues about me (like seeing photographs of my father and grandfathers), that number would change, while not a hair on my head would be different. [Emphasis mine].
That “charitable interpretation” of the 80% likelihood to be bald is exactly Bayesian statistics (which I’ve talked about, possibly ad nauseam, before): it’s the translation from some objective data about the world (the frequency of baldness in carriers of this gene) into a subjective statement about the top of Pinker’s head, in the absence of any other information. And that’s the point of probability: given enough of that objective data, scientists will come to agreement. But even in the state of uncertainty that most scientists find themselves, Bayesian probability forces us to enumerate the assumptions (usually called “prior probabilities”) that enter into our reasoning along with the data. Hence, if you knew Pinker, your prior probability is that he’s fully hirsute (perhaps not 100% if you allow for the possibility of hair extensions and toupees); but if you didn’t then you’d probably be willing to take 4:1 odds on a bet about his baldness, and you would lose to someone with more information.
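The arithmetic behind the 4:1 figure, and behind losing to a better-informed opponent, can be sketched as follows. The 80 percent comes from the quoted passage; the betting setup (risking 4 units to win 1 on "bald") and the function names are illustrative assumptions.

```python
from fractions import Fraction

def fair_odds(p):
    """Odds in favour of an event with probability p, as the ratio p / (1 - p)."""
    return p / (1 - p)

def expected_gain(p_true, stake, payout):
    """Expected gain from risking `stake` to win `payout` on an event
    whose probability, given the bettor's information, is p_true."""
    return p_true * payout - (1 - p_true) * stake

p_gene_only = Fraction(8, 10)    # knowing only the T version of the SNP
p_knows_pinker = Fraction(0)     # having seen his hair

print("fair odds on 'bald':", fair_odds(p_gene_only))
print("expected gain, gene information only:",
      expected_gain(p_gene_only, stake=4, payout=1))
print("expected gain as judged by someone who knows him:",
      expected_gain(p_knows_pinker, stake=4, payout=1))
```

The same bet is fair under one state of information and a sure loss under another, which is the sense in which the number describes the bettor's knowledge rather than Pinker's scalp.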
In science, of course, it usually isn’t about wagering, but just about coming to agreement about the state of the world: do the predictions of a theory fit the data, given the inevitable noise in our measurements, and the difficulty of working out the predictions of interesting theoretical ideas? In cosmology, this is particularly difficult: we can’t go out and do the equivalent of surveying a cross section of the population for their genes: we’ve got only one universe, and can only observe a small patch of it. So probabilities become even more subjective and difficult to tie uniquely to the data. Hence the information available to us on the very largest observable scales is scarce, and unlikely to improve much, despite tantalizing hints of data discrepant with our theories, such as the possibly mysterious alignment of patterns in the Cosmic Microwave Background on very large angles of the sky (discussed recently by Peter Coles here). Indeed, much of the data pointing to a possible problem was actually available from the COBE Satellite; results from the more recent and much more sensitive WMAP Satellite have only reinforced the original problems — we hope that the Planck Surveyor — to be launched in April! — will actually be able to shed light on the problem by providing genuinely new information about the polarization of the CMB on large scales to complement the temperature maps from COBE and WMAP.
With yesterday’s article on “Faith” (vs Science) in the Guardian, and today’s London debate between biologist Lewis Wolpert and the pseudorational William Lane Craig (previewed on the BBC’s Today show this morning), the UK seems to be the hotbed of tension between science and religion. I’ll leave it to the experts for a fuller exposition, but I was particularly intrigued (read: disgusted) by Craig’s claims that so little of science has been “proved”, and hence it was OK to believe in other unproven things like, say, a Christian God (although I prefer the Flying Spaghetti Monster).
The question is: what constitutes “proof”?
Craig claimed that seemingly self-evident facts such as the existence of the past, or even the existence of other minds, were essentially unproven and unprovable. Here, Craig is referring to proofs of logic and mathematics, those truths which follow necessarily from the very structure of geometry and math. The problem with this standard of proof is that it applies to not a single interesting statement about the external world. All you can establish with this sort of proof are statements like 1+1=2, or that Fermat’s Theorem and the Poincaré Conjecture are true, or that the sum of the angles of a triangle on a plane is 180 degrees. But you can’t prove in this way that Newton’s laws hold, or that we descended from the ancestors of today’s apes.
For these latter sorts of statements, we have to resort to scientific proof, which is a different but still rigorous standard. Scientific proofs are unavoidably contingent, based upon the data we have and the setting in which we interpret that data. What we can do over time is get better and better data, and minimize the restrictions of that theoretical setting. Hence, we can reduce Darwinian evolution to a simple algorithm: if there is a mechanism for passing along inherited characteristics, and if there are random mutations in those characteristics, and if there is some competition among offspring, then evolution will occur. Furthermore, if evolution does occur, then the fossil record makes it exceedingly likely that present-day species have evolved in the accepted way.
Similarly, given our observations of the movement of bodies on relatively small scales, it is exceedingly likely that a theory like Einstein’s General Relativity holds to describe gravity. Given observations on large scales, it is exceedingly likely that the Universe started out in a hot and dense state about 14 billion years ago, and has been expanding ever since.
The crucial words in the last couple of paragraphs are “exceedingly likely” — scientific proofs aren’t about absolute truth, but probability. Moreover, they are about what is known as “conditional probability” — how likely something is to be true given other pieces of knowledge. As we accumulate more and more knowledge, plausible scientific theories become more and more probable. (Regular readers will note that almost everything eventually comes back to Bayesian Statistics.)
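As a toy illustration of probability accumulating with evidence, the sketch below updates the probability of one hypothesis against a single rival as data arrive one at a time. Everything in it is an assumption made for illustration: the two hypotheses are "this coin lands heads 70% of the time" and "this coin is fair", the prior is 50:50, and the function names are made up.

```python
def update(prior_h1, likelihood_h1, likelihood_h0):
    """Bayes' theorem for two exhaustive hypotheses: returns P(H1 | datum)."""
    numerator = prior_h1 * likelihood_h1
    return numerator / (numerator + (1.0 - prior_h1) * likelihood_h0)

def accumulate(prior_h1, flips, p_h1=0.7, p_h0=0.5):
    """Probability of H1 after each flip; yesterday's posterior is today's prior."""
    history = [prior_h1]
    for heads in flips:
        l1 = p_h1 if heads else 1.0 - p_h1   # P(this flip | H1)
        l0 = p_h0 if heads else 1.0 - p_h0   # P(this flip | H0)
        history.append(update(history[-1], l1, l0))
    return history

flips = [True, True, False, True, True, True, False, True, True, True]
for step, p in enumerate(accumulate(0.5, flips)):
    print(f"after {step:2d} flips: P(H1 | data) = {p:.3f}")
```

Each head nudges the probability of the biased-coin hypothesis up and each tail nudges it down, and every number printed is conditional on the assumption that one of the two hypotheses holds.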
Hence, we can be pretty sure that the Big Bang happened, that Evolution is responsible for the species present on the earth today, and that, indeed, other minds exist and that the cosmos wasn’t created in media res sometime yesterday.
This pretty high standard of proof must be contrasted with religious statements about the world which, if anything, get less likely as more and more contradictory data comes in. Of course, since the probabilities are conditional, believers are allowed to make everything contingent not upon observed data, but on their favorite religious story: the probability of evolution given the truth of the New Testament may be pretty small, but that’s a lot to, uh, take on faith, especially given all of its internal contradictions. (The smarter and/or more creative theologians just keep making the religious texts more and more metaphorical but I assume they want to draw the line somewhere before they just become wonderfully-written books).
In a bid to combine numeracy with sports coverage, The Observer presents the Poisson distribution as a headline on its front page today. Supposedly, it has something to do with predicting the number of goals a team will score in the World Cup. The Poisson distribution is
P(n) = λ^n e^{-λ} / n!
This gives the probability, P(n), that a team will score n goals in a given game, if the average number of goals they are expected to score is λ. Unfortunately, that’s the easy part; the hard part is figuring out that expected number, λ. We don’t just want to know the average number that, say, England scores in all the games it has ever played, but the average in circumstances like the World Cup: against Brazil and not, for example, Belarus (OK, that’s a bad example…).
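The formula is straightforward to evaluate directly. In the sketch below the expected number of goals, 1.5, is a made-up value chosen for illustration, not an estimate for any real team.

```python
import math

def poisson_pmf(n, lam):
    """Probability of exactly n goals when lam goals are expected on average."""
    return lam ** n * math.exp(-lam) / math.factorial(n)

lam = 1.5  # hypothetical expected goals per game
for n in range(6):
    print(f"P({n} goals) = {poisson_pmf(n, lam):.3f}")
```

The probabilities over all n sum to one, and the most likely scores cluster around λ.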
I was pleased to note that the article said that the company making these predictions, “Decision Technology,” uses “maximum likelihood estimation”, which is strongly related to the Bayesian Probability theory I wrote about a few weeks ago.
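For the Poisson distribution, the maximum likelihood estimate of λ is simply the average of the observed counts, and it coincides with the peak of the Bayesian posterior when the prior on λ is flat. The sketch below checks the first statement by brute force. The goal counts are invented for illustration, and nothing here is claimed to be the method Decision Technology actually uses.

```python
import math

def poisson_log_likelihood(lam, counts):
    """Log-probability of the observed counts for a given rate lam."""
    return sum(n * math.log(lam) - lam - math.lgamma(n + 1) for n in counts)

def poisson_mle(counts):
    """Closed-form maximum likelihood estimate: the sample mean."""
    return sum(counts) / len(counts)

goals = [1, 0, 2, 3, 1, 0, 2, 1]   # hypothetical goals in eight past games

lam_hat = poisson_mle(goals)

# Brute-force check: scan a grid of rates and keep the most likely one
grid = [k / 100 for k in range(1, 501)]
lam_grid = max(grid, key=lambda lam: poisson_log_likelihood(lam, goals))

print("closed-form estimate:", lam_hat)
print("grid-search estimate:", lam_grid)
```

The hard part flagged above remains: the estimate is only as relevant as the games it is averaged over.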