[Update: I have fixed some broken links, and modified the discussion of QBism and the recent paper by Chris Fuchs. Thanks to Chris himself for taking the time to read and find my mistakes!]
For some reason, I’ve come across an idea called “Knightian Uncertainty” quite a bit lately. Frank Knight was an economist of the free-market conservative “Chicago School”, who considered various concepts related to probability in a book called Risk, Uncertainty, and Profit. He distinguished between “risk”, which he defined as applying to events to which we can assign a numerical probability, and “uncertainty”, which applies to those events about which we know so little that we don’t even have a probability to assign, or indeed those events whose possibility we didn’t even contemplate until they occurred. In Rumsfeldian language, “risk” applies to “known unknowns”, and “uncertainty” to “unknown unknowns”. Or, as Nassim Nicholas Taleb put it, “risk” is about “white swans”, while “uncertainty” is about those unexpected “black swans”.
(As a linguistic aside, to me, “uncertainty” seems a milder term than “risk”, and so the naming of the concepts is backwards.)
Actually, there are a couple of slightly different concepts at play here. The black swans or unknown-unknowns are events that one wouldn’t have known enough about to even include in the probabilities being assigned. This is much more severe than those events that one knows about, but for which one doesn’t have a good probability to assign.
And the important word here is “assign”. Probabilities are not something out there in nature, but in our heads. So what should a Bayesian make of these sorts of uncertainty? By definition, they can’t be used in Bayes’ theorem, which requires specifying a probability distribution. Bayesian theory is all about making models of the world: we posit a mechanism and possible outcomes, and assign probabilities to the parts of the model that we don’t know about.
So I think the two different types of Knightian uncertainty have quite a different role here. In the case where we know that some event is possible, but we don’t really know what probabilities to assign to it, we at least have a starting point. If our model is broad enough, then enough data will allow us to measure the parameters that describe it. For example, in recent years people have started to realise that the frequencies of rare, catastrophic events (financial crashes, earthquakes, etc.) are very often well described by so-called power-law distributions. These assign much greater probabilities to such events than more typical Gaussian (bell-shaped curve) distributions; the shorthand for this is that power-law distributions have much heavier tails than Gaussians. As long as our model includes the possibility of these heavy tails, we should be able to make predictions based on data, although very often those predictions won’t be very precise.
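To put rough numbers on "heavier tails", here is a minimal sketch comparing the probability of exceeding a threshold under a standard Gaussian and under a Pareto (power-law) distribution. The Pareto index `alpha = 2` and lower cutoff `x_min = 1` are arbitrary illustrative choices, not values fitted to crashes or earthquakes, and the two distributions are not matched in scale, so only the qualitative difference in how fast the tails fall is meaningful.

```python
import math


def gaussian_tail(k):
    """P(X > k) for a standard normal variable, k in standard deviations."""
    return 0.5 * math.erfc(k / math.sqrt(2))


def pareto_tail(x, alpha=2.0, x_min=1.0):
    """P(X > x) for a Pareto (power-law) variable; valid for x >= x_min.

    alpha and x_min are illustrative assumptions, not fitted values.
    """
    return (x / x_min) ** (-alpha)


if __name__ == "__main__":
    for k in (2, 5, 10):
        print(f"threshold {k:>2}: Gaussian {gaussian_tail(k):.3e}, "
              f"power law {pareto_tail(k):.3e}")
```

The Gaussian tail collapses faster than exponentially, while the power-law tail falls only as a fixed power of the threshold, which is why a model restricted to Gaussians can badly underestimate rare, large events.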
But the “black swan” problem is much worse: these are possibilities that we don’t even know enough about to consider in our model. Almost by definition, one can’t say anything at all about this sort of uncertainty. But what one must do is be open-minded enough to adjust our models in the face of new data: we can’t predict the black swan, but we should expand the model after we’ve seen the first one (and perhaps revise our model for other waterfowl to allow more varieties!). In more traditional scientific settings, involving measurements with errors, this is even more difficult: a seemingly anomalous result, not allowed in the model, may be due to some mistake in the experimental setup or in our characterisation of the probabilities of those inevitable errors (perhaps they should be described by heavy-tailed power laws, rather than Gaussian distributions as above).
I first came across the concept as an oblique reference in a recent paper by Chris Fuchs, writing about his idea of QBism (or see here for a more philosophically-oriented discussion), an interpretation of quantum mechanics that takes seriously the Bayesian principle that all probabilities are about our knowledge of the world, rather than the world itself (which is a discussion for another day). He tentatively opined that the probabilities in quantum mechanics are themselves “Knightian”, referring not to a reading of Knight himself but to some recent, and to me frankly bizarre, ideas from Scott Aaronson, discussed in his paper, The Ghost in the Quantum Turing Machine, and an accompanying blog post, trying to base something like “free will” (a term he explicitly does not apply to this idea, however) on the possibility of our brains having so-called “freebits”, quantum states whose probabilities are essentially uncorrelated with anything else in the Universe. This arises from what is to me a mistaken desire to equate “freedom” with complete unpredictability. My take on free will is instead aligned with that of Daniel Dennett, at least the version from his Consciousness Explained from the early 1990s, as I haven’t yet had the chance to read his recent From Bacteria to Bach and Back: a perfectly deterministic (or quantum mechanically random, even allowing for the statistical correlations that Aaronson wants to be rid of) version of free will is completely sensible, and indeed may be the only kind of free will worth having.
Fuchs himself tentatively uses Aaronson’s “Knightian Freedom” to refer to his own idea
that nature does what it wants, without a mechanism underneath, and without any “hidden hand” of the likes of Richard von Mises’s Kollektiv or Karl Popper’s propensities or David Lewis’s objective chances, or indeed any conception that would diminish the autonomy of nature’s events,
which I think is an attempt (one which I admit I don’t completely understand) to remove the probabilities of quantum mechanics entirely from any mechanistic account of physical systems, despite the incredible success of those probabilities in predicting the outcomes of experiments and other observations of quantum mechanical systems. I’m not quite sure this is what either Knight or Aaronson had in mind with their use of “uncertainty” (or “freedom”), since at least in quantum mechanics, we do know what probabilities to assign, given certain other personal (as Fuchs would have it) information about the system. My Bayesian predilections make me sympathetic with this idea, but then I struggle to understand what, exactly, quantum mechanics has taught us about the world: why do the predictions of quantum mechanics work?
When I’m not thinking about physics, for the last year or so my mind has been occupied with politics, so I was amused to see Knightian Uncertainty crop up in a New Yorker article about Trump’s effect on the stock market:
Still, in economics there’s a famous distinction, developed by the great Chicago economist Frank Knight, between risk and uncertainty. Risk is when you don’t know exactly what will happen but nonetheless have a sense of the possibilities and their relative likelihood. Uncertainty is when you’re so unsure about the future that you have no way of calculating how likely various outcomes are. Business is betting that Trump is risky but not uncertain—he may shake things up, but he isn’t going to blow them up. What they’re not taking seriously is the possibility that Trump may be willing to do things—like start a trade war with China or a real war with Iran—whose outcomes would be truly uncertain.
It’s a pretty low bar, but we can only hope.
I recently finished my last term lecturing our second-year Quantum Mechanics course, which I taught for five years. It’s a required class, a mathematical introduction to one of the most important sets of ideas in all of physics, and really the basis for much of what we do, whether that’s astrophysics or particle physics or almost anything else. It’s a slightly “old-fashioned” course, although it covers the important basic ideas: the Schrödinger Equation, the postulates of quantum mechanics, angular momentum, and spin, leading almost up to what is needed to understand the crowning achievement of early quantum theory: the structure of the hydrogen atom (and other atoms).
A more modern approach might start with qubits: the simplest systems that show quantum mechanical behaviour, and the study of which has led to the revolution in quantum information and quantum computing.
Moreover, the lectures rely on the so-called Copenhagen interpretation, which is the confusing and sometimes contradictory way that most physicists are taught to think about the basic ontology of quantum mechanics: what it says about what the world is “made of” and what happens when you make a quantum-mechanical measurement of that world. Indeed, it’s so confusing and contradictory that you really need another rule so that you don’t complain when you start to think too deeply about it: “shut up and calculate”. A more modern approach might also discuss the many-worlds approach, and — my current favorite — the (of course) Bayesian ideas of QBism.
The students seemed pleased with the course as it is — at the end of the term, they have the chance to give us some feedback through our “Student On-Line Evaluation” system, and my marks have been pretty consistent. Of the 200 or so students in the class, only about 90 bother to give their evaluations, which is disappointingly few. But it’s enough (I hope) to get a feeling for what they thought.
So, most students Definitely/Mostly Agree with the good things, although it’s clear that our students are most disappointed in the feedback that they receive from us (this is a wider issue for us in Physics at Imperial and beyond, and may partially explain why most of them are unwilling to feed back to us through this form).
But much more fun and occasionally revealing are the “free-text comments”. Given the numerical scores, it’s not too surprising that there were plenty of positive ones:
Excellent lecturer - was enthusiastic and made you want to listen and learn well. Explained theory very well and clearly and showed he responded to suggestions on how to improve.
Possibly the best lecturer of this term.
Thanks for providing me with the knowledge and top level banter.
One of my favourite lecturers so far, Jaffe was entertaining and cleary very knowledgeable. He was always open to answering questions, no matter how simple they may be, and gave plenty of opportunity for students to ask them during lectures. I found this highly beneficial. His lecturing style incorporates well the blackboards, projectors and speach and he finds a nice balance between them. He can be a little erratic sometimes, which can cause confusion (e.g. suddenly remembering that he forgot to write something on the board while talking about something else completely and not really explaining what he wrote to correct it), but this is only a minor fix. Overall VERY HAPPY with this lecturer!
But some were more mixed:
One of the best, and funniest, lecturers I’ve had. However, there are some important conclusions which are non-intuitively derived from the mathematics, which would be made clearer if they were stated explicitly, e.g. by writing them on the board.
I felt this was the first time I really got a strong qualitative grasp of quantum mechanics, which I certainly owe to Prof Jaffe’s awesome lectures. Sadly I can’t quite say the same about my theoretical grasp; I felt the final third of the course less accessible, particularly when tackling angular momentum. At times, I struggled to contextualise the maths on the board, especially when using new techniques or notation. I mostly managed to follow Prof Jaffe’s derivations and explanations, but struggled to understand the greater meaning. This could be improved on next year. Apart from that, I really enjoyed going to the lectures and thought Prof Jaffe did a great job!
The course was inevitably very difficult to follow.
And several students explicitly commented on my attempts to get students to ask questions in as public a way as possible, so that everyone can benefit from the answers and — this really is true! — because there really are no embarrassing questions!
Really good at explaining and very engaging. Can seem a little abrasive at times. People don’t like asking questions in lectures, and not really liking people to ask questions in private afterwards, it ultimately means that no questions really get answered. Also, not answering questions by email makes sense, but no one really uses the blackboard form, so again no one really gets any questions answered. Though the rationale behind not answering email questions makes sense, it does seem a little unnecessarily difficult.
We are told not to ask questions privately so that everyone can learn from our doubts/misunderstandings, but I, amongst many people, don’t have the confidence to ask a question in front of 250 people during a lecture.
Forcing people to ask questions in lectures or publically on a message board is inappropriate. I understand it makes less work for you, but many students do not have the confidence to ask so openly, you are discouraging them from clarifying their understanding.
Inevitably, some of the comments were contradictory:
Would have been helpful to go through examples in lectures rather than going over the long-winded maths to derive equations/relationships that are already in the notes.
Professor Jaffe is very good at explaining the material. I really enjoyed his lectures. It was good that the important mathematics was covered in the lectures, with the bulk of the algebra that did not contribute to understanding being left to the handouts. This ensured we did not get bogged down in unnecessary mathematics and that there was more emphasis on the physics. I liked how Professor Jaffe would sometimes guide us through the important physics behind the mathematics. That made sure I did not get lost in the maths. A great lecture course!
And also inevitably, some students wanted to know more about the exam:
- It is a difficult module, however well covered. The large amount of content (between lecture notes and handouts) is useful. Could you please identify what is examinable though as it is currently unclear and I would like to focus my time appropriately?
And one comment was particularly worrying (along with my seeming “a little abrasive at times”, above):
- The lecturer was really good in lectures. however, during office hours he was a bit arrogant and did not approach the student nicely, in contrast to the behaviour of all the other professors I have spoken to
If any of the students are reading this, and are willing to comment further on this, I’d love to know more — I definitely don’t want to seem (or be!) arrogant or abrasive.
But I’m happy to see that most students don’t seem to think so, and even happier to have learned that I’ve been nominated “multiple times” for Imperial’s Student Academic Choice Awards!
Finally, best of luck to my colleague Jonathan Pritchard, who will be taking over teaching the course next year.
Continuing my recent, seemingly interminable, series of too-technical posts on probability theory… To understand this one you’ll need to remember Bayes’ Theorem, and the resulting need for a Bayesian statistician to come up with an appropriate prior distribution to describe her state of knowledge in the absence of the experimental data she is considering, updated to the posterior distribution after considering that data. I should perhaps follow the guide of blogging-hero Paul Krugman and explicitly label posts like this as “wonkish”.
(If instead you’d prefer something a little more tutorial, I can recommend the excellent recent post from my colleague Ted Bunn, discussing hypothesis testing, stopping rules, and cheating at coin flips.)
Deborah Mayo has begun her own series of posts discussing some of the articles in a recent special volume of the excellently-named journal, “Rationality, Markets and Morals” on the topic Statistical Science and Philosophy of Science.
She has started with a discussion of Stephen Senn’s “You May Believe You are a Bayesian But You Are Probably Wrong”: she excerpts the article here and then gives her own deconstruction in the sequel.
Senn’s article begins with a survey of the different philosophical schools of statistics: not just frequentist versus Bayesian (for which he also uses the somewhat old-fashioned names of “direct” versus “inverse” probability), but also how the practitioners choose to apply the probabilities that they calculate: either directly in terms of inferences about the world, or by using those probabilities to make decisions in order to give a further meaning to the probability.
Having cleaved the statistical world in four, Senn makes a clever rhetorical move. In a wonderfully multilevelled backhanded compliment, he writes
If any one of the four systems had a claim to our attention then I find de Finetti’s subjective Bayes theory extremely beautiful and seductive (even though I must confess to also having some perhaps irrational dislike of it). The only problem with it is that it seems impossible to apply.
He discusses why it is essentially impossible to perform completely coherent ground-up analyses within the Bayesian formalism:
This difficulty is usually described as being the difficulty of assigning subjective probabilities but, in fact, it is not just difficult because it is subjective: it is difficult because it is very hard to be sufficiently imaginative and because life is short.
And, later on:
The … test is that whereas the arrival of new data will, of course, require you to update your prior distribution to being a posterior distribution, no conceivable possible constellation of results can cause you to wish to change your prior distribution. If it does, you had the wrong prior distribution and this prior distribution would therefore have been wrong even for cases that did not leave you wishing to change it. This means, for example, that model checking is not allowed.
I think that these criticisms mis-state the practice of Bayesian statistics, at least by the scientists I know (mostly cosmologists and astronomers). We do not treat statistics as a grand system of inference (or decision) starting from a single, primitive state of knowledge which we use to reason all the way through to new theoretical paradigms. The caricature of Bayesianism starts with a wide open space of possible theories, and we add data, narrowing our beliefs to accord with our data, using the resulting posterior as the prior for the next set of data to come across our desk.
Rather, most of us take a vaguely Jaynesian view, after the cranky Edwin Jaynes, as espoused in his forty years of papers and his polemical book Probability Theory: The Logic of Science — all probabilities are conditional upon information (although he would likely have been much more hard-core). Contra Senn’s suggestions, the individual doesn’t need to continually adjust her subjective probabilities until she achieves an overall coherence in her views. She just needs to present (or summarise in a talk or paper) a coherent set of probabilities based on given background information (perhaps even more than one set). As long as she carefully states the background information (and the resulting prior), the posterior is a completely coherent inference from it.
In this view, probability doesn’t tell us how to do science, just how to analyse data in the presence of known hypotheses. We are under no obligation to pursue a grand plan, listing all possible hypotheses from the outset. Indeed we are free to do ‘exploratory data analysis’ using (even) not-at-all-Bayesian techniques to help suggest new hypotheses. This is a point of view espoused most forcefully by Andrew Gelman (author of another paper in the special volume of RMM).
Of course this does not solve all formal or philosophical problems with the Bayesian paradigm. In particular, as I’ve discussed a few times recently, it doesn’t solve what seems to me the most knotty problem of hypothesis testing in the presence of what one would like to be ‘wide open’ prior information.
I spent a quick couple of days last week at The Controversy about Hypothesis Testing meeting in Madrid.
The topic of the meeting was indeed the question of “hypothesis testing”, which I addressed in a post a few months ago: how do you choose between conflicting interpretations of data? The canonical version of this question was the test of Einstein’s theory of relativity in the early 20th Century — did the observations of the advance of the perihelion of Mercury (and eventually of the gravitational lensing of starlight by the sun) match the predictions of Einstein’s theory better than Newton’s? And of course there are cases in which even more than a scientific theory is riding on the outcome: is a given treatment effective? I won’t rehash here my opinions on the subject, except to say that I think there really is a controversy: the purported Bayesian solution runs into problems in realistic cases of hypotheses about which we would like to claim some sort of “ignorance” (always a dangerous word in Bayesian circles), while the orthodox frequentist way of looking at the problem is certainly ad hoc and possibly incoherent, but nonetheless seems to work in many cases.
Sometimes, the technical worries don’t apply, and the Bayesian formalism provides the ideal solution. For example, my colleague Daniel Mortlock has applied the model-comparison formalism to deciding whether objects in his UKIDSS survey data are more likely to be distant quasars or nearby and less interesting objects. (He discussed his method here a few months ago.)
In between thoughts about hypothesis testing, I experienced the cultural differences between the statistics community and us astrophysicists and cosmologists, of which I was the only example at the meeting: a typical statistics talk just presents pages of text and equations with the occasional poorly-labeled graph thrown in. My talks tend to be a bit heavier on the presentation aspects, perhaps inevitably so given the sometimes beautiful pictures that package our data.
On the other hand, it was clear that the statisticians take their Q&A sessions very seriously, prodded in this case by the word “controversy” in the conference’s title. In his opening keynote, Jose Bernardo, up from Valencia for the meeting, discussed his work as a so-called “Objective Bayesian”, prompting a question from the mathematically-oriented philosopher Deborah Mayo. Mayo is an arch-frequentist (and blogger) who prefers to describe her particular version as “Error Statistics”, concerned (if I understand correctly after our wine-fuelled discussion at the conference dinner) with the use of probability and statistics to criticise the errors we make in our methods, in contrast with the Bayesian view of probability as a description of our possible knowledge of the world. These two points of view are sufficiently far apart that Bernardo countered one of the questions with the almost-rude but definitely entertaining riposte “You are bloody inconsistent; you are not mathematicians.” That was probably the most explicit almost-personal attack of the meeting, but there were similar exchanges. Not mine, though: my talk was a little more didactic than most, as I knew that I had to justify the science as well as the statistics that lurks behind any analysis of data.
So I spent much of my talk discussing the basics of modern cosmology, and applying my preferred Bayesian techniques in at least one big-picture case where the method works: choosing amongst the simple set of models that seem to describe the Universe, at least from those that obey General Relativity and the Cosmological Principle, in which we do not occupy a privileged position and which, given our observations, are therefore homogeneous and isotropic on the largest scales. Given those constraints, all we need to specify (or measure) are the amounts of the various constituents in the universe: the total amount of matter and of dark energy. The sum of these, in turn, determines the overall geometry of the universe. In the appropriate units, if the total is one, the universe is flat; if it’s larger, the universe is closed, shaped like a three-dimensional sphere; if smaller, it’s a three-dimensional hyperboloid or saddle. What we find when we make the measurement is that the amount of matter is about 0.282±0.02, and of dark energy about 0.723±0.02. Of course, these add up to just greater than one; model-selection (or hypothesis testing in other forms) allows us to say that the data nonetheless give us reason to prefer the flat Universe despite the small discrepancy.
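As a rough illustration of how model selection can prefer the flat Universe even though the measured total sits slightly above one, here is a minimal sketch using the two numbers quoted above. It rests on several simplifying assumptions that are not part of the real analysis: the two errors are treated as independent and Gaussian, the "curved" model is given a uniform prior on the total density over the hypothetical range 0 to 2, and the two models are given equal prior odds.

```python
import math

# Measurements quoted in the text.
OMEGA_MATTER, SIGMA_MATTER = 0.282, 0.02
OMEGA_DARK_ENERGY, SIGMA_DARK_ENERGY = 0.723, 0.02

# Assumption: independent Gaussian errors, so the variances add.
total = OMEGA_MATTER + OMEGA_DARK_ENERGY
sigma_total = math.hypot(SIGMA_MATTER, SIGMA_DARK_ENERGY)


def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2))))


def evidence_flat():
    """Likelihood of the measured total if the true total is exactly one."""
    return normal_pdf(total, 1.0, sigma_total)


def evidence_curved(lo=0.0, hi=2.0):
    """Likelihood averaged over a uniform prior on the true total in [lo, hi].

    The prior range is a hypothetical choice made only for illustration.
    """
    mass = normal_cdf(hi, total, sigma_total) - normal_cdf(lo, total, sigma_total)
    return mass / (hi - lo)


bayes_factor = evidence_flat() / evidence_curved()

if __name__ == "__main__":
    print(f"total = {total:.3f} +/- {sigma_total:.3f}")
    print(f"Bayes factor (flat : curved) = {bayes_factor:.1f}")
```

Under these assumptions the flat model comes out favoured by a few tens to one: the curved model is penalised for spreading its predictions over a wide range of totals, most of which the data rule out. The exact number depends directly on the assumed prior range, which is the sensitivity that makes model comparison delicate.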
After the meeting, I had a couple of hours free, so I went across Madrid to the Reina Sofia, to stand amongst the Picassos and Serras. And I was lucky enough to have my hotel room above a different museum:
[Apologies — this is long, technical, and there are too few examples. I am putting it out for commentary more than anything else…]
In some recent articles and blog posts (including one in response to astronomer David Hogg), Columbia University statistician Andrew Gelman has outlined the philosophical position that he and some of his colleagues and co-authors hold. While starting from a resolutely Bayesian perspective on using statistical methods to measure the parameters of a model, he and they depart from the usual story when evaluating models and comparing them to one another. Rather than using the techniques of Bayesian model comparison, they eschew them in preference to a set of techniques they describe as ‘model checking’. Let me apologize in advance if I misconstrue or caricature their views in any way in the following.
In the formalism of model comparison, the statistician or scientist needs to fully specify her model: what are the numbers needed to describe the model, how do the data depend upon them (the likelihood), as well as a reasonable guess for what those numbers might be in the absence of data (the prior). Given these ingredients, one can first combine them to form the posterior distribution to estimate the parameters but then go beyond this to actually determine the probability of the fully-specified model itself.
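In symbols, the two steps just described can be written as follows. The notation is a standard choice introduced here for illustration rather than taken from the articles under discussion: θ stands for the parameters, d for the data and M for the model.

```latex
% Step 1: parameter estimation within a model M
P(\theta \mid d, M) = \frac{P(d \mid \theta, M)\, P(\theta \mid M)}{P(d \mid M)},
\qquad
P(d \mid M) = \int P(d \mid \theta, M)\, P(\theta \mid M)\, \mathrm{d}\theta .

% Step 2: comparing two fully specified models M_1 and M_2
\frac{P(M_1 \mid d)}{P(M_2 \mid d)}
  = \frac{P(d \mid M_1)}{P(d \mid M_2)} \cdot \frac{P(M_1)}{P(M_2)} .
```

The normalising factor P(d | M) in the first step, often called the evidence or marginal likelihood, is irrelevant for estimating the parameters but is exactly the quantity that the second step compares between models.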
The first part of the method, estimating the parameters, is usually robust to the choice of a prior distribution for the parameters. In many cases, one can throw the possibilities wide open (an approximation to some sort of ‘complete ignorance’) and get a meaningful measurement of the parameters. In mathematical language, we take the limit of the posterior distribution as we make the prior distribution arbitrarily wide, and this limit often exists.
The problem, noticed by most statisticians and scientists who try to apply these methods, is that the next step, comparing models, is almost always sensitive to the details of the choice of prior: as the prior distribution gets wider and wider, the probability for the model gets lower and lower without limit; a model with an infinitely wide prior has zero probability compared to one with a finite width.
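A toy calculation makes the contrast concrete. This is a minimal sketch under assumed conditions: a single hypothetical measurement `d = 3` with unit Gaussian error, a model whose one parameter `mu` has a uniform prior of a given width centred on zero, and a rival model that fixes `mu = 0`. None of these numbers comes from a real analysis.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))


def posterior_mean(d, sigma, width):
    """Posterior mean of mu for a uniform prior on [-width/2, width/2].

    The posterior is a Gaussian centred on d, truncated to the prior range.
    """
    a = (-width / 2 - d) / sigma
    b = (width / 2 - d) / sigma
    mass = normal_cdf(b) - normal_cdf(a)
    return d + sigma * (normal_pdf(a) - normal_pdf(b)) / mass


def evidence_free(d, sigma, width):
    """Marginal likelihood of d when mu is uniform on [-width/2, width/2]."""
    a = (-width / 2 - d) / sigma
    b = (width / 2 - d) / sigma
    return (normal_cdf(b) - normal_cdf(a)) / width


def evidence_fixed(d, sigma):
    """Likelihood of d for the rival model with mu fixed at zero."""
    return normal_pdf(d / sigma) / sigma


if __name__ == "__main__":
    d, sigma = 3.0, 1.0  # hypothetical datum and error
    for width in (20.0, 200.0, 2000.0):
        ratio = evidence_free(d, sigma, width) / evidence_fixed(d, sigma)
        print(f"width {width:>6}: posterior mean {posterior_mean(d, sigma, width):.3f}, "
              f"Bayes factor (free : fixed) {ratio:.3f}")
```

The posterior mean barely moves as the prior widens, but the evidence for the free-parameter model falls in proportion to one over the width, so the same datum can appear to favour either model depending on a prior range that nobody can really defend.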
In some situations, where we do not wish to model some sort of ignorance, this is fine. But in others, even if we know it is unreasonable to accept an arbitrarily large value for some parameter, we really cannot reasonably choose between, say, an upper limit of 10^100 and 10^50, which may have vastly different consequences.
The other problem with model comparison is that, as the name says, it involves comparing models: it is impossible to merely reject a model tout court. But there are certainly cases when we would be wise to do so: the data have a noticeable, significant curve, but our model is a straight line. Or, more realistically (but also much more unusually in the history of science): we know about the advance of the perihelion of Mercury, but Einstein hasn’t yet come along to invent General Relativity; or Planck has written down the black body law but quantum mechanics hasn’t yet been formulated.
These observations lead Gelman and company to reject Bayesian model comparison entirely in favor of what they call ‘model checking’. Having made inferences about the parameters of a model, you next create simulated data from the posterior distribution and compare those simulations to the actual data. This latter step is done using some of the techniques of orthodox ‘frequentist’ methods: choosing a statistic, calculating p-values, and worrying about whether your observation is unusual because it lies in the tail of a distribution.
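The recipe can be sketched in a few lines. This is a minimal illustration under assumptions of my own choosing, not the procedure from any particular paper: the model says the data are Gaussian with known unit scatter and unknown mean, the prior on the mean is flat (so the posterior is Gaussian around the sample mean), and the test statistic is the largest deviation from the sample mean. The data sets are made up.

```python
import random


def discrepancy(ys):
    """Test statistic: largest absolute deviation from the sample mean."""
    mean = sum(ys) / len(ys)
    return max(abs(y - mean) for y in ys)


def posterior_predictive_pvalue(data, sigma=1.0, n_sims=2000, seed=0):
    """Fraction of simulated data sets at least as extreme as the real one."""
    rng = random.Random(seed)
    n = len(data)
    sample_mean = sum(data) / n
    observed = discrepancy(data)
    exceed = 0
    for _ in range(n_sims):
        # Draw the mean from its posterior (flat prior, known sigma) ...
        mu = rng.gauss(sample_mean, sigma / n ** 0.5)
        # ... then simulate a replicated data set from the model.
        replicated = [rng.gauss(mu, sigma) for _ in range(n)]
        if discrepancy(replicated) >= observed:
            exceed += 1
    return exceed / n_sims


# Hypothetical data: well-behaved points, then the same points plus one outlier.
CLEAN = [0.3, -0.5, 1.1, -1.2, 0.4, 0.0, -0.7, 0.9, 1.4, -0.3,
         0.6, -1.0, 0.2, -0.1, 0.8, -0.6, 1.0, -0.9, 0.5]
WITH_OUTLIER = CLEAN + [8.0]

if __name__ == "__main__":
    print("p-value, clean data:  ", posterior_predictive_pvalue(CLEAN))
    print("p-value, with outlier:", posterior_predictive_pvalue(WITH_OUTLIER))
```

A very small p-value for the second data set flags that the simulations "don't look like" the data, without any alternative model having been written down. The choice of statistic matters: one sensitive to the mean alone would not notice the outlier at all.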
Having suggested these techniques, they go on to advocate a broader philosophical position on the use of probability in science: it is ‘hypothetico-deductive’, rather than ‘inductive’; Popperian rather than Kuhnian. (For another, even more critical, view of Kuhn’s philosophy of science, I recommend filmmaker Errol Morris’ excellent series of blog posts in the New York Times recounting his time as a graduate student in philosophy with Kuhn.)
At this point, I am sympathetic with their position, but worried about the details. A p-value is well-determined, but remains a kind of meaningless number: the probability of finding the value of your statistic as measured or worse. But you didn’t get a worse value, so it’s not clear why this number is meaningful. On the other hand, it is clearly an indication of something: if it is unlikely to have got a worse value then your data must, in some perhaps ill-determined sense, be itself unlikely. Indeed I think it is worries like this that lead them very often to prefer purely graphical methods — the simulations ‘don’t look like’ the data.
The fact is, however, these methods work. They draw attention to data that do not fit the model and, with well-chosen statistics or graphs, lead the scientist to understand what might be wrong with the model. So perhaps we can get away without mathematically meaningful probabilities as long as we are “just” using them to guide our intuition rather than make precise statements about truth or falsehood.
They also make a rather strange leap: deciding amongst any discrete set of parameters falls into the category of model comparison, against their rules. I’m not sure this restriction is necessary: if the posterior distribution for the discrete parameters makes sense, I don’t see why we should reject the inferences made from it.
In these articles they also discuss what it means for a model to be true or false, and what implications that has for the meaning of probability. As they argue, all models are in fact known to be false, certainly in the social sciences that most concern Gelman, and for the most part in the physical sciences as well, in the sense that they are not completely true in every detail. Newton was wrong, because Einstein was more right, and Einstein is most likely wrong because there is likely to be an even better theory of quantum gravity. Hence, they say, the subjective view of probability is wrong, since no scientist really believes in the truth of the model she is checking. I agree, but I think this is a caricature of the subjective view of probability: it misconstrues the meaning of ‘subjectivity’. If I had to use probabilities only to reflect what I truly believe, I wouldn’t be able to do science, since the only thing that I am sure about my belief system is that it is incoherent:
Do I contradict myself?
Very well then I contradict myself,
(I am large, I contain multitudes.)
— Walt Whitman, Song of Myself
Subjective probability, at least the way it is actually used by practicing scientists, is a sort of “as-if” subjectivity: how would an agent reason if her beliefs were reflected in a certain set of probability distributions? This is why when I discuss probability I try to make the pedantic point that all probabilities are conditional, at least on some background prior information or context. So we shouldn’t really ever write a probability that statement “A” is true as P(A), but rather as P(A|I) for some background information, “I”. If I change the background information to “J”, it shouldn’t surprise me that P(A|I)≠P(A|J). The whole point of doing science is to reason from assumptions and data; it is perfectly plausible for an actual scientist to restrict the context to a choice between two alternatives that she knows to be false. This view of probability owes a lot to Ed Jaynes (as also elucidated by Kevin Van Horn and others) and would probably be held by most working scientists if you made them articulate their views in a consistent way.
Still, these philosophical points do not take away from Gelman’s more practical ones, which to me seem distinct from those loftier questions and from each other: first, that the formalism of model comparison is often too sensitive to prior information; second, that we should be able to do some sort of alternative-free model checking in order to falsify a model even if we don’t have any well-motivated substitute. Indeed, I suspect that most scientists, even hardcore Bayesians, work this way even if they (we) don’t always admit it.
Embarrassing update: as pointed out by Vladimir Nesov in the comments, all of my quantitative points below are incorrect. To maximize expected winnings, you should bet on whichever alternative you judge to be most likely. If you have a so-called logarithmic utility function — which already has the property of growing faster for small amounts than large — you should bet proportional to your odds on each answer. In fact, it’s exactly arguments like these that lead many to conclude that the logarithmic utility function is in some sense “correct”. So, in order to be led to betting more on the low-probability choices, one needs a utility that changes even faster for small amounts and slower for large amounts. But I disagree that this is “implausible” — if I think that is the best strategy to use, I should adjust my utility function, not change my strategy to match one that has been externally imposed. Just like probabilities, utility functions encode our preferences. Of course, I should endeavor to be consistent, to always use the same utility function, at least in the same circumstances, taking into account what economists call “externalities”. Anyway, all of this goes to show that I shouldn’t write long, technical posts after the office Christmas party….
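The corrected claims can be checked numerically. What follows is only a minimal sketch, assuming a single question with two live answers and the 99:1 judgement from the decathlon example in the original post; the function names and the grid search are illustrative choices, not anything from the show or the comments.

```python
import math

def expected_winnings(f, p):
    """Expected fraction of the stake kept when a fraction f is bet on the
    favourite (judged correct with probability p) and 1 - f on the other."""
    return p * f + (1 - p) * (1 - f)

def expected_log_winnings(f, p):
    """Same bet, scored with a logarithmic utility instead of a linear one."""
    return p * math.log(f) + (1 - p) * math.log(1 - f)

P_FAVOURITE = 0.99  # the 99:1 judgement, written as a probability

# Candidate splits; 0 and 1 are left out of the log case, where log(0) diverges.
grid = [i / 1000 for i in range(1, 1000)]

best_linear = max(grid + [1.0], key=lambda f: expected_winnings(f, P_FAVOURITE))
best_log = max(grid, key=lambda f: expected_log_winnings(f, P_FAVOURITE))

print("linear utility: bet", best_linear, "on the favourite")
print("log utility:    bet", best_log, "on the favourite")
```

Under these assumptions the linear case puts everything on the favourite, while the logarithmic case splits the stake in proportion to the probabilities. Betting more than that on the long shot would need a utility that bends more sharply than the logarithm.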
The original post follows, mistakes included.
An even more unlikely place to find Bayesian inspiration was Channel 4’s otherwise insipid game show, “The Million Pound Drop”. In the version I saw, B-list celebs start out with a million pounds (sterling), and are asked a series of multiple-choice questions. For each one, they can bet any fraction of their remaining money on any set of answers; any money bet on wrong answers is lost (we’ll ignore the one caveat, that the contestants must wager no money on at least one answer, which means there’s always the chance that they will lose the entire stake).
Is there a best strategy for this game? Obviously, the overall goal is to maximize the actual winnings at the end of the series of questions. In the simplest example, let’s say a question is “What year did England last win the football world cup?” with possible answers “1912”, “1949”, “1966”, and “never”. In this case (assuming you know the answer), the only sensible course is to bet everything on “1966”.
Now, let’s say that the question is “When did the Chicago Bulls last win an NBA title?” with possible answers, “1953”, “1997”, “1998”, “2009”. The contestants, being fans of Michael Jordan, know that it’s either 1997 or 1998, but aren’t sure which — it’s a complete toss-up between the two. Again in this case, the strategy is clear: bet the same amount on each of the two — the expected winning is half of your stake no matter what. (The answer is 1998.)
But now let’s make it a bit more complicated: the question is “Who was the last American to win a gold medal in Olympic Decathlon?” with answers “Bruce Jenner”, “Bryan Clay”, “Jim Thorpe”, and “Jesse Owens”. Well, I remember that Jenner won in the 70s, and that Thorpe and Owens predate that by decades, so the only possibilities are Jenner and Clay, whom I’ve never heard of. So I’m pretty sure the answer is Jenner, but I’m by no means certain: let’s say that I’m 99:1 in favor of Jenner over Clay.
In order to maximize my expected winnings, I should bet 99 times as much on Jenner as Clay. But there’s a problem here: if it’s Clay, I end up with only one percent of my initial stake, and that one percent — which I have to go on and play more rounds with — is almost too small to be useful. This means that I don’t really want to maximize my expected winnings, but rather something that economists and statisticians call the “utility function”, or conversely, to minimize the loss function, functions which describe how useful some amount of winnings is to me: a thousand dollars is more than a thousand times as useful as one dollar, but a million dollars is less than twice as useful as half a million dollars, at least in this context.
So in this case, a small amount of winnings is less useful than one might naively expect, and the utility function should reflect that by growing faster for small amounts and slower for larger amounts — I should perhaps bet ten percent on Clay. If it’s Jenner, I still get 90% of my stake, but if it’s Clay, I end up with a more-useful 10%. (The answer is Clay, by the way.)
This is the branch of statistics and mathematics called decision theory: how we go from probabilities to actions. It comes into play when we don’t want to just report probabilities, but actually act on them: whether to actually prescribe a drug, perform a surgical procedure, or build a sea-wall against a possible flood. In each of these cases, in addition to knowing the efficacy of the action, we need to understand its utility: if a flood is 1% likely over the next century, and a sea-wall would cost one million pounds but would save one billion in property damage and 100 lives if the flood occurred, we need to compare spending a million now versus saving a billion later (taking the “nonlinear” effects above into account) and complicate that with the loss from even more tragic possibilities. One hundred fewer deaths has the same utility as some amount of money saved, but I am glad I’m not on the panel that has to make that assignment. It is important to point out, however, that whatever decision is made, by whatever means, it is equivalent to some particular set of utilities, so we may as well be explicit about it.
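The monetary part of the sea-wall comparison can be written out as a tiny expected-loss calculation. This is a sketch under loud assumptions: the one million pounds is read as the cost of the wall, the wall is taken to prevent all of the damage, utility is treated as linear in money, and the 100 lives are deliberately left out, since the text declines to put a number on them.

```python
def expected_loss(build_wall, p_flood=0.01, wall_cost=1e6, flood_damage=1e9):
    """Expected monetary loss in pounds over the century.

    Assumes a linear utility in money and a wall that stops all damage;
    both are simplifications made only for this illustration.
    """
    if build_wall:
        return wall_cost             # paid for certain, flood or no flood
    return p_flood * flood_damage    # damage weighted by its probability

loss_with_wall = expected_loss(build_wall=True)
loss_without_wall = expected_loss(build_wall=False)
decision = "build" if loss_with_wall < loss_without_wall else "do not build"

print(f"with wall:    {loss_with_wall:,.0f}")
print(f"without wall: {loss_without_wall:,.0f}")
print("decision:", decision)
```

With these inputs the certain cost of the wall is smaller than the probability-weighted damage, so the linear calculation favours building. A nonlinear utility or an explicit value for the lives saved would change the numbers, but not the structure of the comparison.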
Happily, these sorts of questions tend to arise less in the physical sciences where probabilistic results are allowed, although the same considerations arise at a higher level: when making funding decisions…
I've come across a couple of bits of popular/political culture that give me the opportunity to discuss one of my favorite topics: the uses and abuses of probability theory.
The first is a piece by Nate Silver of the New York Times' FiveThirtyEight blog, dedicated to trying to crunch the political numbers of polls and other data in as transparent a manner as possible. Usually, Silver relies on a relentlessly frequentist take on probability: he runs lots of simulations letting the inputs vary according to the poll results (correctly taking into account the "margin of error") and more than occasionally using other information to re-weight the results of different polls. Nonetheless, these techniques give a good summary of the results at any given time -- and have been far and away the best discussion of the numerical minutiae of electioneering for both the 2008 and 2010 US elections.
But yesterday, Silver wrote a column: A Bayesian Take on Julian Assange which tackles the question of Assange's guilt in the sexual-assault offense with which he has been charged. Bayes' theorem, you will probably recall if you've been reading this blog, states that the probability of some statement ("Assange is innocent of sexual assault, despite the charges against him") is proportional to the product of the probability that he would be charged if he were innocent (the "likelihood") times the probability of his innocence in the absence of knowledge about the charge (the "prior"):
P(innocent | charged, context) ∝ P(innocent | context) × P(charged | innocent, context)
where P(A|B) means the probability of A given B, and the "∝" means that I've left off an overall number that you can multiply by. The most important thing I've left in here is the "context": all of these probabilities depend upon the entire context in which you consider the problem.
To figure out these probabilities, there are no simulations we can perform -- we can't run a big social-science model of Swedish law-enforcement, possibly in contact with, say, American diplomats, and make small changes and see what happens. We just need to assign probabilities to these statements.
But even to do that requires considerable thought, and important decisions about the context in which we want to make these assignments. For Silver, the important context is that there is evidence that other governments, particularly the US, may have an ulterior motive for wanting to not just prosecute, but persecute Assange. Hence, the probability of his being unjustly accused [P(charged|innocent, context)] is larger than it would be for, say, an arbitrary Australian citizen traveling in Britain. Usually, Bayesian probability is accused of needing a subjective prior, but in this case the context affects and adds a subjective aspect to the likelihood.
Some of the commenters on the site make a different point: given that Assange is, at least in some sense, a known criminal (he has leaked secret documents, which is likely against the law), he is more likely to commit other criminal acts. This time, the likelihood is not affected, but the prior: the commenter believes that Assange is less likely to be innocent irrespective of the information about the charge.
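The difference between the two arguments, one acting through the likelihood and the other through the prior, can be made concrete with a two-hypothesis version of Bayes' theorem. Every number below is invented purely for illustration and is not an estimate of anything about the actual case; the function name is likewise hypothetical.

```python
def posterior_innocent(prior_innocent, p_charged_if_innocent, p_charged_if_guilty):
    """P(innocent | charged, context) for two exclusive hypotheses.

    The normalisation is the "overall number" left off in the proportionality.
    """
    innocent_term = prior_innocent * p_charged_if_innocent
    guilty_term = (1 - prior_innocent) * p_charged_if_guilty
    return innocent_term / (innocent_term + guilty_term)

# Purely illustrative inputs for some baseline context.
baseline = posterior_innocent(0.99, 0.001, 0.1)

# Silver's point: a context with possible ulterior motives raises
# P(charged | innocent), i.e. it changes the likelihood.
motive_context = posterior_innocent(0.99, 0.01, 0.1)

# The commenters' point: a different context lowers P(innocent) before
# the charge is considered, i.e. it changes the prior.
lower_prior = posterior_innocent(0.95, 0.001, 0.1)

print(baseline, motive_context, lower_prior)
```

The same observed fact (a charge) moves the posterior in opposite directions depending on which part of the context is changed, which is the sense in which all of these probabilities are conditional on that context.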
Next: game shows.
One of my holiday treks this year was across town to visit Bunhill Fields, final resting place of two of my favorite Londoners: William Blake and Thomas Bayes.
Blake is of course one of the most famous poets in the English language, but most people know him only from short poems like The Tiger [sic] (“Tyger, Tyger burning bright/ In the forests of the night/ What immortal hand or eye/ Could frame thy fearful symmetry”) and Jerusalem, sung in Anglican churches each week. But most of Blake’s work is much too weird to make it into church. It is peopled by gods and monsters, illuminated by Blake’s own wonderful over-the-top illustrations. (For example, America: A Prophecy, his poetic interpretation of the American Revolutionary War, begins “The shadowy Daughter of Urthona stood before red Orc/When fourteen suns had faintly journey’d o’er his dark abode” — George Washington and Thomas Jefferson don’t make Blake’s version.)
Blake’s gravestone sits right on the pavement in the middle of Bunhill Fields, and as such unfortunately has been slightly damaged.
I don’t read Blake every day or even every week, but I probably do use Bayes’s famous theorem at least that often. As I and other bloggers have gone on and on about, Bayes’s theorem is the mathematical statement of how we ought to rigorously and consistently incorporate new information into our model of the world. Bayes himself wrote down only a version appropriate for a restricted version of this problem, and used words, rather than mathematical symbols. Nowadays, we usually write it mathematically, and in a completely general form, as
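One standard way of writing that general form is given here, using H for a hypothesis, D for the data and I for the background information. These symbols are an assumption, chosen to follow the conditional notation P(A|I) used in other posts.

```latex
P(H \mid D, I) = \frac{P(D \mid H, I)\, P(H \mid I)}{P(D \mid I)}
```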
Inscription: “Rev. Thomas Bayes, son of the said Joshua and Ann Bayes, 7 April 1761. In recognition of Thomas Bayes’s important work in probability this vault was restored in 1960 with contributions received from statisticians throughout the world.” (With restoration and upkeep since by Bayesian Efficient Strategic Trading of Hoboken, NJ, USA —across the Hudson River from New York City— and ISBA, the International Society for Bayesian Analysis.)
Luckily, not all the astrophysics news this week was so bad.
First, and most important, two of our Imperial College Astrophysics postgraduate students, Stuart Sale and Paniez Paykari, passed their PhD viva exams, and so are on their ways to officially being Doctors of Philosophy. Congratulations to both, especially (if I may say so) to Dr Paykari, who I had the pleasure and fortune to supervise and collaborate with. Both are on their way to continue their careers as postdocs in far-flung lands.
Second, the first major results from the Herschel Space Telescope, Planck’s sister satellite, were released. There are impressive pictures of dwarf planets in the outer regions of our solar system, of star-forming regions in the Milky Way galaxy, of the very massive Virgo Cluster of galaxies, and of the so-called “GOODS” (Great Observatories Origins Deep Survey) field, one of the most well-studied areas of sky. All of these open new windows into these areas of astrophysics, with Herschel’s amazing sensitivity.
Finally, tantalisingly, the Cryogenic Dark Matter Search (CDMS) released the results of its latest (and final) effort to search for the Dark Matter that seems to make up most of the matter in the Universe, but doesn’t seem to be the same stuff as the normal atoms that we’re made of. Under some theories, the dark matter would interact weakly with normal matter, and in such a way that it could possibly be distinguished from all the possible sources of background. These experiments are therefore done deep underground — to shield from cosmic rays which stream through us all the time — and with the cleanest and purest possible materials — to avoid contamination with both naturally-occurring radioactivity and the man-made kind which has plagued us since the late 1940s.
With all of these precautions, CDMS expected to see a background rate of about 0.8 events during the time they were observing. And they saw (wait for it) two events! This is on the one hand more than a factor of two greater than the expected number, but on the other is only one extra count. To put this in perspective, I’ve made a couple of graphs where I try to approximate their results (for aficionados, these are just simple plots of the Poisson distribution). The first shows the expected number of counts from the background alone:
(I should point out a few caveats in my micro-analysis of their data. First, I don’t take into account the uncertainty in their background rate, which they say is really 0.8±0.1±0.2, where the first uncertainty, ±0.1 is “statistical”, because they only had a limited number of background measurements, and the second, ±0.2, is “systematic”, due to the way they collect and analyse their data. Eventually, one could take this into account via Bayesian marginalization, although ideally we’d need some more information about their experimental setup. Second, I’ve only plotted the likelihood above, but true Bayesians will want to apply a prior probability and plot the posterior distribution. The most sensible choice (the so-called Jeffreys prior) for this case would in fact make the probability peak at zero signal. Finally, one would really like to formally compare the no-signal model with a signal-greater-than-zero model, and the best way to do this would be using the tool of Bayesian model comparison.)
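For readers who want to reproduce the gist of those plots, here is a minimal sketch of the same Poisson arithmetic. It takes the background as exactly 0.8 events (ignoring the quoted uncertainties), applies no prior, and scans the signal on a simple grid; the variable names and the grid range are arbitrary choices made for this illustration.

```python
import math

def poisson_pmf(n, mean):
    """Probability of seeing exactly n counts when `mean` are expected."""
    return math.exp(-mean) * mean**n / math.factorial(n)

BACKGROUND = 0.8   # expected background events, uncertainty ignored here
OBSERVED = 2       # events actually seen

# Chance that background alone gives two or more events.
p_two_or_more = 1.0 - poisson_pmf(0, BACKGROUND) - poisson_pmf(1, BACKGROUND)

# Likelihood of the observation as a function of an extra signal rate.
signals = [i / 100 for i in range(0, 501)]   # 0.00 to 5.00 expected events
likelihood = [poisson_pmf(OBSERVED, s + BACKGROUND) for s in signals]
best_signal = signals[likelihood.index(max(likelihood))]

print(f"P(n >= 2 | background only) = {p_two_or_more:.3f}")
print(f"likelihood peaks at a signal of {best_signal:.2f} events")
```

Under these assumptions, background alone produces two or more events roughly one time in five, and the likelihood peaks where signal plus background equals the observed count. Neither number includes the prior or the model comparison mentioned in the caveats.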
Nonetheless, in their paper they go on to interpret these results in the context of particle physics, which can eventually be used to put limits on the parameters of supersymmetric theories which may be tested further at the LHC accelerator over the next couple of years.
I should bring this back to the aforementioned bad news. The UK has its own dark matter direct detection experiments as well. In particular, Imperial leads the ZEPLIN-III experiment which has, at times, had the world’s best limits on dark matter, and is poised to possibly confirm this possible detection — this will be funded for the next couple of years. Unfortunately, STFC has decided that the next generation of dark matter experiments, EURECA and LUX-ZEPLIN, needed to make convincing statements about these results, weren’t possible to fund.
The perfect stocking-stuffer for that would-be Bayesian cosmologist you’ve been shopping for:
As readers here will know, the Bayesian view of probability is just that probabilities are statements about our knowledge of the world, and thus eminently suited to use in scientific inquiry (indeed, this is really the only consistent way to make probabilistic statements of any sort!). Over the last couple of decades, cosmologists have turned to Bayesian ideas and methods as tools to understand our data. This book is a collection of specially-commissioned articles, intended as both a primer for astrophysicists new to this sort of data analysis and as a resource for advanced topics throughout the field.
Our back-cover blurb:
In recent years cosmologists have advanced from largely qualitative models of the Universe to precision modelling using Bayesian methods, in order to determine the properties of the Universe to high accuracy. This timely book is the only comprehensive introduction to the use of Bayesian methods in cosmological studies, and is an essential reference for graduate students and researchers in cosmology, astrophysics and applied statistics.
The first part of the book focuses on methodology, setting the basic foundations and giving a detailed description of techniques. It covers topics including the estimation of parameters, Bayesian model comparison, and separation of signals. The second part explores a diverse range of applications, from the detection of astronomical sources (including through gravitational waves), to cosmic microwave background analysis and the quantification and classification of galaxy properties. Contributions from 24 highly regarded cosmologists and statisticians make this an authoritative guide to the subject.
You can order it now from Amazon UK or Amazon USA.
OK, this is going to be a very long post. About something I don’t pretend to be expert in. But it is science, at least.
A couple of weeks ago, Radio 4’s highbrow “In Our Time” tackled the so-called “Measurement Problem”. That is: quantum mechanics predicts probabilities, not definite outcomes. And yet we see a definite world. Whenever we look, a particle is in a particular place. A cat is either alive or dead, in Schrödinger’s infamous example. So, lots to explain in just setting up the problem, and even more in the various attempts so far to solve it (none quite satisfactory). This is especially difficult because the measurement problem is, I think, unique in physics: quantum mechanics appears to be completely true and experimentally verified, without contradiction so far. And yet it seems incomplete: the “problem” arises because the equations of quantum mechanics only provide a recipe for the calculation of probabilities, but don’t seem to explain what’s going on underneath. For that, we need to add a layer of interpretation on top. Melvyn Bragg had three physicists down to the BBC studios, each with his own idea of what that layer might look like.
Unfortunately, the broadcast seemed to me a bit of a shambles: the first long explanation of quantum mechanics, by Basil Hiley of Birkbeck, used the terms “wavefunction” and “linear superposition” without even an attempt at a definition. Things got a bit better as Bragg tried to tease things out, but I can’t imagine the non-physicists that were left listening got much out of it. Hiley himself worked with David Bohm on one possible solution to the measurement problem, the so-called “Pilot Wave Theory” (another term which was used a few times without definition) in which quantum mechanics is actually a deterministic theory — the probabilities come about because there is information about the locations and trajectories of particles to which we do not — and in principle cannot — have access.
Roger Penrose proved to be remarkably positivist in his outlook: he didn’t like the other interpretations on offer simply because they make no predictions beyond standard quantum mechanics and are therefore untestable. (Others see this as a selling point for these interpretations, however — there is no contradiction with experiment!) To the extent I understand his position, Penrose himself prefers the idea that quantum mechanics is actually incomplete, and that when it is finally reconciled with General Relativity (in a Theory of Everything or otherwise), we will find that it actually does make specific, testable predictions.
There was a long discussion by Simon Saunders of that sexiest of interpretations of quantum mechanics, the Many Worlds Interpretation. The latest incarnation of Many-Worlds theory is centered around workers in or near Oxford: Saunders himself, David Wallace and most famously David Deutsch. The Many-Worlds interpretation (also known as the Everett Interpretation after its initial proponent) attempts to solve the problem by saying that there is nothing special about measurement at all — the simple equations of quantum mechanics always obtain. In order for this to occur, all possible outcomes of any experiment must be actualized: that is, there must be a world for each outcome. But we’re not just talking about outcomes of science experiments here, but rather any time that quantum mechanics could have predicted something other than what (seemingly) actually happened. Which is all the time, to all of the particles in the Universe, everywhere. This is, to say the least, “ontologically extravagant”. Moreover, it has always been plagued by at least one fundamental problem: what, exactly, is the status of probability in the many-worlds view? When more than one quantum-mechanical possibility presents itself, each splits into its own world, with a probability related to the aforementioned wavefunction. But what beyond this does it mean for one branch to have a higher probability? The Oxonian many-worlders have tried to use decision theory to reconcile this with the prescriptions of quantum mechanics: from very minimal requirements of rationality alone, can we derive the probability rule? They claim to have done so, and they further claim that their proof only makes sense in the Many-Worlds picture. This is, roughly, because only in the Everett picture is there no “fact of the matter” at all about what actually happens in a quantum outcome — in all other interpretations the very existence of a single actual outcome is enough to scupper the proof.
(I’m not so sure I buy this — surely we are allowed to base rational decisions on only the information at hand, as opposed to all of the information potentially available?)
At bottom, these interpretations of quantum mechanics (aka solutions to the measurement problem) are trying to come to grips with the fact that quantum mechanics seems to be fundamentally about probability, rather than the way things actually are. And, as I’ve discussed elsewhere, time and time again, probability is about our states of knowledge, not the world. But we are justly uncomfortable with 70s-style “Tao-of-Physics” ideas that make silly links between consciousness and the world at large.
But there is an interpretation that takes subjective probability seriously without resorting to the extravagance of many (very, very many) worlds. Chris Fuchs, along with his collaborators Carlton Caves and Ruediger Schack, has pursued this idea with some success. Whereas the many-worlds interpretation requires a universe that seems far too full for me, the Bayesian interpretation is somewhat underdetermined: there is a level of being that is, literally, unspeakable: there is no information to be had about the quantum realm beyond our experimental results. This is, as Fuchs points out, a very strong restriction on how we can assign probabilities to events in the world. But I admit some dissatisfaction at the explanatory power of the underlying physics at this point (discussed in some technical detail in a review by yet another Oxford philosopher of science, Christopher Timpson).
In both the Bayesian and Many Worlds interpretations (at least in the modern versions of the latter), probability is supposed to be completely subjective, as it should be. But something still seems to be missing: probability assignments are, in fact, testable, using techniques such as Bayesian model selection. What does it mean, in the purely subjective interpretation, to be correct, or at least more correct? Sometimes, this is couched as David Lewis’ “principal principle” (it’s very hard to find a good distillation of this on the web, but here’s a try): there is something out there called “objective chance” and our subjective probabilities are meant to track it (I am not sure this is coherent, and even Lewis himself usually gave the example of a coin toss, in which there is nothing objective at all about the chance involved: if you know the initial conditions of the coin and the way it is flipped and caught, you can predict the outcome with certainty.) But something at least vaguely objective seems to be going on in quantum mechanics: more probable outcomes happen more often, at least for the probability assignments that physicists make given what we know about our experiments. This isn’t quite “objective chance” perhaps, but it’s not clear that there isn’t another layer of physics still to be understood.
In today’s Sunday NY Times Magazine, there’s a long article by psychologist Steven Pinker, on “Personal Genomics”, the growing ability for individuals to get information about their genetic inheritance. He discusses the evolution of psychological traits versus intelligence, and highlights the complicated interaction amongst genes, and between genes and society.
But what caught my eye was this paragraph:
What should I make of the nonsensical news that I… have a “twofold risk of baldness”? … 40 percent of men with the C version of the rs2180439 SNP are bald, compared with 80 percent of men with the T version, and I have the T. But something strange happens when you take a number representing the proportion of people in a sample and apply it to a single individual…. Anyone who knows me can confirm that I’m not 80 percent bald, or even 80 percent likely to be bald; I’m 100 percent likely not to be bald. The most charitable interpretation of the number when applied to me is, “If you knew nothing else about me, your subjective confidence that I am bald, on a scale of 0 to 10, should be 8.” But that is a statement about your mental state, not my physical one. If you learned more clues about me (like seeing photographs of my father and grandfathers), that number would change, while not a hair on my head would be different. [Emphasis mine].
That “charitable interpretation” of the 80% likelihood to be bald is exactly Bayesian statistics (which I’ve talked about, possibly ad nauseam, before): it’s the translation from some objective data about the world — the frequency of baldness in carriers of this gene — into a subjective statement about the top of Pinker’s head, in the absence of any other information. And that’s the point of probability: given enough of that objective data, scientists will come to agreement. But even in the state of uncertainty that most scientists find themselves, Bayesian probability forces us to enumerate the assumptions (usually called “prior probabilities”) that enter into our reasoning along with the data. Hence, if you knew Pinker, your prior probability is that he’s fully hirsute (perhaps not 100% if you allow for the possibility of hair extensions and toupees); but if you didn’t then you’d probably be willing to take 4:1 odds on a bet about his baldness — and you would lose to someone with more information.
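The step from Pinker's percentages to the "twofold risk" and the 4:1 odds is simple arithmetic, sketched here with exact fractions. The helper name is hypothetical and the only inputs are the two frequencies quoted in the article.

```python
from fractions import Fraction

def odds_in_favour(probability):
    """Convert a probability p into odds p : (1 - p), as a single ratio."""
    return probability / (1 - probability)

p_bald_given_T = Fraction("0.8")   # 80 percent of men with the T version
p_bald_given_C = Fraction("0.4")   # 40 percent of men with the C version

relative_risk = p_bald_given_T / p_bald_given_C   # the "twofold risk"
odds = odds_in_favour(p_bald_given_T)             # odds with no other information

print("relative risk:", relative_risk)
print("odds in favour of baldness:", odds, "to 1")
```

Both numbers describe the state of knowledge of someone who has only the gene frequencies to go on; any further information about the individual changes the probability that goes into the conversion.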
In science, of course, it usually isn’t about wagering, but just about coming to agreement about the state of the world: do the predictions of a theory fit the data, given the inevitable noise in our measurements, and the difficulty of working out the predictions of interesting theoretical ideas? In cosmology, this is particularly difficult: we can’t go out and do the equivalent of surveying a cross section of the population for their genes: we’ve got only one universe, and can only observe a small patch of it. So probabilities become even more subjective and difficult to tie uniquely to the data. Hence the information available to us on the very largest observable scales is scarce, and unlikely to improve much, despite tantalizing hints of data discrepant with our theories, such as the possibly mysterious alignment of patterns in the Cosmic Microwave Background on very large angles of the sky (discussed recently by Peter Coles here). Indeed, much of the data pointing to a possible problem was actually available from the COBE Satellite; results from the more recent and much more sensitive WMAP Satellite have only reinforced the original problems — we hope that the Planck Surveyor — to be launched in April! — will actually be able to shed light on the problem by providing genuinely new information about the polarization of the CMB on large scales to complement the temperature maps from COBE and WMAP.
This post is a work in progress, but I’ve decided to post it in its unfinished state. Comments and questions welcome!
This week I went to a seminar on the new results from the MiniBooNE experiment given here at Imperial by Morgan Wascko.
The MiniBooNE results have been discussed in depth elsewhere. Like MINOS last year, MiniBooNE was looking at the masses of neutrinos. Specifically, it was looking for the oscillation between electron neutrinos and mu neutrinos. A decade ago, the LSND experiment saw events indicating that mu antineutrinos could oscillate into electron antineutrinos, which gave evidence of a mass difference between the two “flavors”. Unfortunately, this evidence was at odds with the results of two other neutrino oscillation experiments, at least in the standard model with three (but only three) different flavors. MiniBooNE set out to test these results. The results so far seem to contradict those from LSND (at about “98% confidence”).
But here I want to talk about a specific aspect of the statistical methods that MiniBooNE (and many other modern particle physics experiments) use. How did they come up with that 98% number? Over the last couple of decades, the particle physics community has arrived at what it considers a pretty rigorous set of methods. It relies on two chief tenets. First, make sure you can simulate every single aspect of your experiment, varying all of the things that you don’t know for sure (aspects of nuclear physics, the properties of the detectors, and, of course, the unknown physics). Compare these simulations to your data in order to “tune” those numbers to their (assumedly) actual values. Finally, delay looking at the part of your data that contains the actual signal until the very end of the process. Putting these all together means that you can do a “blind analysis” only “opening the box” at that final stage.
Why do they go through all of this trouble? Basically, to avoid what is technically known as “bias” — the scary truth that we scientists can’t be trusted to be completely rational. If you look at your data while you’re trying to understand everything about your experiment, you’re likely to stop adjusting all the parameters when you get an answer that seems right, that matches your underlying prejudices. (Something like this is a well known problem in the medical world: “publication bias” in which only successful studies for a given treatment ever see the light of day.)
Even with the rigorous controls of a blind analysis, the MiniBooNE experimenters have still had to intervene in the process more than they would have liked: they adjusted the lower-limit of the particle energies that they analyzed in order to remove an anomalous discrepancy with expectations in their simulations. To be fair, the analysis was still blind, but it had the effect of removing an excess of events at the now-discarded low energies. This excess doesn’t look anything like the signal for which they were searching — and it occurs in a regime where you might have less confidence in the experimental results, but it does need to be understood. (Indeed, unlike the official talks which attempt to play down this anomaly, the front-page NY Times article on the experiment highlights it.)
Particle physicists can do this because they are in the lucky (and expensive) position of building absolutely everything about their experimental setup: the accelerators that create their particles, the targets that they aim them at, and the detectors that track the detritus of the collisions, all lovingly and carefully crafted out of the finest of materials. We astrophysicists don’t have the same luxury: we may build the telescope, but everything else is out there in the heavens, out of our control. Moreover, particle experiments enjoy a surfeit of data — billions of events that don’t give information about the desired physical signal, but do let us calibrate the experiment itself. In astrophysics, we often have to use the very same data to calibrate as we use to measure the physics we’re interested in. (Cosmic Microwave Background experiments are a great example of this: it’s very difficult to get lots of good data on our detectors’ performance except in situ.)
It also happens that the dominant statistical ideology in the particle physics community is “frequentist”, in contrast to the Bayesian methods that I never shut up about. Part of the reason for the difference is purely practical: frequentist methods make sense when you can perform the needed “Monte Carlo simulations” of your entire experiment, varying all of the unknowns, and tune your methods against the experimental results. In astrophysics, and especially in cosmology, this is more difficult: there is only one Universe (at least only one that we can measure). But there would be nothing to stop us from doing a blind analysis, simultaneously measuring — and, in the parlance of the trade, marginalizing over — the parameters that describe our experiment that are “tuned” in the Monte Carlo analysis. Indeed, the particle physics community, were it to see the Bayesian light and truth, could in principle do this, too. The problem is simply that this would be a much more computationally difficult task.
With yesterday’s article on “Faith” (vs Science) in the Guardian, and today’s London debate between biologist Lewis Wolpert and the pseudorational William Lane Craig (previewed on the BBC’s Today show this morning), the UK seems to be the hotbed of tension between science and religion. I’ll leave it to the experts for a fuller exposition, but I was particularly intrigued (read: disgusted) by Craig’s claims that so little of science has been “proved”, and hence it was OK to believe in other unproven things like, say, a Christian God (although I prefer the Flying Spaghetti Monster).
The question is: what constitutes “proof”?
Craig claimed that seemingly self-evident facts such as the existence of the past, or even the existence of other minds, were essentially unproven and unprovable. Here, Craig is referring to proofs of logic and mathematics, those truths which follow necessarily from the very structure of geometry and math. The problem with this standard of proof is that it applies to not a single interesting statement about the external world. All you can establish with this sort of proof are statements like 1+1=2, or that Fermat’s Last Theorem and the Poincaré Conjecture are true, or that the sum of the angles of a triangle on a plane is 180 degrees. But you can’t prove in this way that Newton’s laws hold, or that we descended from the ancestors of today’s apes.
For these latter sorts of statements, we have to resort to scientific proof, which is a different but still rigorous standard. Scientific proofs are unavoidably contingent, based upon the data we have and the setting in which we interpret that data. What we can do over time is get better and better data, and minimize the restrictions of that theoretical setting. Hence, we can reduce Darwinian evolution to a simple algorithm: if there is a mechanism for passing along inherited characteristics, and if there are random mutations in those characteristics, and if there is some competition among offspring, then evolution will occur. Furthermore, if evolution does occur, then the fossil record makes it exceedingly likely that present-day species have evolved in the accepted manner.
Similarly, given our observations of the movement of bodies on relatively small scales, it is exceedingly likely that a theory like Einstein’s General Relativity holds to describe gravity. Given observations on large scales, it is exceedingly likely that the Universe started out in a hot and dense state about 14 billion years ago, and has been expanding ever since.
The crucial words in the last couple of paragraphs are “exceedingly likely” — scientific proofs aren’t about absolute truth, but probability. Moreover, they are about what is known as “conditional probability” — how likely something is to be true given other pieces of knowledge. As we accumulate more and more knowledge, plausible scientific theories become more and more probable. (Regular readers will note that almost everything eventually comes back to Bayesian Statistics.)
Hence, we can be pretty sure that the Big Bang happened, that Evolution is responsible for the species present on the earth today, and that, indeed, other minds exist and that the cosmos wasn’t created in medias res sometime yesterday.
This pretty high standard of proof must be contrasted with religious statements about the world which, if anything, get less likely as more and more contradictory data comes in. Of course, since the probabilities are conditional, believers are allowed to make everything contingent not upon observed data, but on their favorite religious story: the probability of evolution given the truth of the New Testament may be pretty small, but that’s a lot to, uh, take on faith, especially given all of its internal contradictions. (The smarter and/or more creative theologians just keep making the religious texts more and more metaphorical but I assume they want to draw the line somewhere before they just become wonderfully-written books).
The work that I’ve been doing with my student is featured on the cover of this week’s New Scientist. Unfortunately, a subscription is necessary to read the full article online, but if you do manage to find it on the web or the newsstand, you’ll find a much better explanation of the physics than I can manage here, as well as my koan-like utterances such as “if you look over here, you’re also looking over there”. There are more illuminating quotes from my friends and colleagues Glenn Starkman, Janna Levin and Dick Bond (all of whom I worked with at CITA in the 90s, coincidentally).
We’re exploring the overall topology of space, separate from its geometry. Geometry is described by the local curvature of space: what happens to straight lines like rays of light — do parallel rays intersect, do triangles have 180 degrees? But topology describes the way different parts of that geometry are connected to one another. Could I keep going in one direction and end up back where I started — even if space is flat, or much sooner than I would have thought by calculating the circumference of a sphere? The only way this can happen is if space has four-or-more-dimensional “handles” or “holes” (like a coffee mug or a donut). We can only picture this sort of topology by actually curving those surfaces, but mathematically we can describe topology and geometry completely independently, and there’s no reason to assume that the Universe shouldn’t allow both of them to be complicated and interesting. My student, Anastasia Niarchou, and I have made predictions about the patterns that might show up in the Cosmic Microwave Background in these weird “multi-connected” universes. This figure shows the kinds of patterns that you might see in the sky:
The first four are examples of these multi-connected universes; the final one is the standard, simply-connected case. We’ve then carefully compared these predictions with data from the WMAP satellite, using the Bayesian methodology that I never shut up about. Unfortunately, we have determined that the Universe doesn’t have one of a small set of particularly interesting topologies, but there are still plenty more to explore.
Update: From the comment below, it seems I wasn’t clear about what I meant by asking if I could “keep going in one direction and end up back where I started”. In a so-called “closed” universe (with k=+1, as noted in the comment) shaped like a sphere sitting in four dimensions, you can indeed go straight on and end up back where you started. This sort of Universe is, however, still simply-connected, and wasn’t what I was talking about. Even in a Universe that is locally curved like a sphere, it’s possible to have multiply-connected topology, so that you end up back again much sooner, or from a different direction, than you would have thought (from measuring the apparent circumference of the sphere). You can picture this in a three-dimensional cartoon by picturing a globe and trying to “tile” it with identical curved pieces. Except for making them all long and thin (like peeling an orange along lines of longitude), this is actually a hard problem, and indeed it can only be done in a small number of ways. Each of those ways corresponds to the whole universe: when you leave one edge of the tile, you re-enter another one. In our three-dimensional space, this corresponds to leaving one face of a polyhedron and re-entering somewhere else. Very hard to picture, even for those of us who play with it every day. I fear this discussion may have confused the issue even further. If so, go read the article in New Scientist!
In his most recent post, Cosmic Variance’s Mark Trodden talks about one of the presentations we both saw at last week’s meeting in Ischia, and explains one of the hot new techniques for analyzing cosmological data, the (so-called) Bayesian Evidence.
Let’s unpack this term. First, “Bayesian”, named after the Reverend Thomas Bayes. The question is: what is probability? If you’ve never taken a statistics class, then you (probably!) think that probability is defined so that the more probable something is, the more certain it is to happen. So we can define probability such that if P=1, it is certain to happen, and if P=0 it is certain never to happen. It turns out that you can show mathematically that there’s only one self-consistent way to define “degrees of belief” between 0 and 1: if P=1/2, then, given some information about an event, it is just as likely to happen as not happen. In the Bayesian interpretation of statistics, the crucial part of the last sentence is “given some information about an event”. In technical terms, all probability is conditional. For example, think about the statistician’s favorite example, tossing a coin. Usually, we say that a fair coin has P=1/2 to be heads, and P=1/2 to be tails. But that’s only because we usually don’t know enough about the way the coin is tossed (whether it starts with heads or tails up, the exact details of the way it is tossed, etc.) to make a better prediction: probability is not a statement about the physics of coin-tossing so much as it is about the information we have about coin-tossing in a particular circumstance.
But if you suffered through a “probability and statistics” class, you learned to equate probability not with any sort of information or belief, but with frequency: the fraction of times that some event (for example, heads) would happen in some (imaginary!) infinitely long set of somehow similar experiments. When we’re trying to measure something, we report something like: the age of the Universe is 13.6±1 Gyr (a Gyr is a billion years). With this “frequentist” interpretation of probability, this means something like the following: I make up some algorithm, which transforms whatever data I have into a single measurement of the age (very often, this is just an average of individual measurements), which gives me the 13.6 Gyr. I get the error bars by calculating the distribution of results from that imaginary infinite set of experiments where I start with a true value of 13.6 Gyr: 67% of them will give a result between 12.6 and 14.6 Gyr.
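The imaginary set of repeated experiments can be imitated on a computer. What follows is only a minimal sketch, assuming that each experiment returns a single Gaussian-distributed estimate centred on a true age of 13.6 Gyr with a spread of 1 Gyr; the function name `simulate_experiments` and its parameters are made up for illustration.

```python
import random
import statistics

def simulate_experiments(true_age=13.6, sigma=1.0, n_experiments=100_000, seed=0):
    """Repeat an imaginary experiment many times and return the estimates.

    Assumption: each experiment reports one Gaussian-distributed estimate
    of the age, centred on true_age with standard deviation sigma.
    """
    rng = random.Random(seed)
    return [rng.gauss(true_age, sigma) for _ in range(n_experiments)]

estimates = simulate_experiments()

# Fraction of the simulated experiments landing inside the quoted error bar.
inside = sum(12.6 <= x <= 14.6 for x in estimates) / len(estimates)

print(f"mean estimate: {statistics.mean(estimates):.2f} Gyr")
print(f"fraction within 12.6-14.6 Gyr: {inside:.3f}")
```

With exactly Gaussian scatter the fraction comes out at roughly two-thirds (nearer 68% than 67%), which is all that the one-sigma error bar is claiming in this picture.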
(If this sounds confusing, don’t worry. There are strong arguments in the statistics community that these frequentist methods are actually philosophically or mathematically incoherent!)
Actually, even this description isn’t quite right: really, you need to consider a distribution of other underlying true values. For the aficionados, the full construction of these error bars was originally done by Neyman, and was discussed more recently, and compared to the Bayesian case, by Feldman and Cousins.
The alternative, Bayesian, view of probability is simply that when I state the age of the Universe is 13.6±1 billion years, it means that I am 67% certain that the value is between 12.6 and 14.6 billion years. In the early 20th Century, Bruno de Finetti showed that you can further refine exactly what “67% certain” means in terms of odds and wagering: if something is 50% certain, I would give even odds on a bet; if it’s 67% certain I would give 2:1 odds, etc. Probability is thus intimately tied to information: what I knew before I performed the experiment (more precisely, what I know in the absence of that experimental data), which is called the prior probability, and what new information the experiment gives, which is called the likelihood.
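The link between a degree of belief and betting odds is just the ratio p : (1 − p). A tiny sketch of that arithmetic follows; the helper name `fair_odds` is hypothetical, and exact fractions are used so that 2/3 stands in for "67%".

```python
from fractions import Fraction

def fair_odds(probability):
    """Return the fair betting odds p : (1 - p) as a reduced Fraction."""
    p = Fraction(probability)
    if not 0 < p < 1:
        raise ValueError("probability must lie strictly between 0 and 1")
    return p / (1 - p)

print(fair_odds(Fraction(1, 2)))  # the 50% case: even odds
print(fair_odds(Fraction(2, 3)))  # the "67%" case: two to one
```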
(As an aside, I think that Mark has actually given a frequentist interpretation of error bars in his discussion!)
But sometimes we don’t just want to measure some parameter like the age of the Universe. Rather we might want to discriminate between two different models describing the same data. For example, we might want to consider the Big Bang model, in which the Universe started out hot and dense at some time in the past, compared to the Steady State model, in which the Universe has always been expanding, but with matter created to ‘fill in the gaps’ (until the 1960s, this model was a serious contender for explaining cosmological data). At this level, we don’t care about the value of the age, only that there is some finite maximum age on the one hand or not on the other. But there is nothing particularly difficult about answering this sort of question in the Bayesian formalism. The “evidence” is simply the likelihood of a model, rather than of a parameter within the model (in fact, I advocate just calling this quantity the “model likelihood”, rather than the more mysterious and easily abused “evidence” favored by those educated in Cambridge). Formally, this happens to be calculated by integrating over the parameter probabilities, and it turns out that it can be seen as two competing factors. One factor gives a higher probability for models that simply fit the data better for some values of the parameters. Another, competing, factor is higher for models that are simpler (defined in this case as models having fewer parameters, or with less available parameter space). This latter piece is often called the “Ockham factor”, after the famous razor, “entities should not be multiplied unnecessarily,” colloquially rendered as “keep it simple, stupid” — models with too many knobs to twiddle are penalized, unless they fit the data much better.
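To make the two competing factors concrete, here is a toy calculation, not anything taken from a real cosmological analysis. It assumes made-up data with Gaussian noise of known size, and compares a model A with no free parameters (the mean is fixed at zero) against a model B in which the mean is free, with a flat prior of some chosen width. All of the function names are invented for this sketch.

```python
import math

def gaussian_likelihood(data, mu, sigma=1.0):
    """Likelihood of the data given a mean mu and known noise sigma."""
    norm = (2.0 * math.pi * sigma**2) ** (-0.5 * len(data))
    chi2 = sum((x - mu) ** 2 for x in data) / sigma**2
    return norm * math.exp(-0.5 * chi2)

def evidence_fixed(data):
    """Model A: no free parameters, the mean is fixed at zero."""
    return gaussian_likelihood(data, 0.0)

def evidence_free(data, prior_width, n_steps=20_000):
    """Model B: the mean is free, with a flat prior of the given width.

    The model likelihood is the integral of likelihood times prior over
    the mean, approximated here with a simple midpoint rule.
    """
    lo = -0.5 * prior_width
    step = prior_width / n_steps
    total = sum(gaussian_likelihood(data, lo + (i + 0.5) * step)
                for i in range(n_steps))
    return total * step / prior_width

data = [0.3, -0.5, 0.1, 0.4, -0.2]  # made-up measurements, for illustration only

for width in (2.0, 20.0, 200.0):
    ratio = evidence_free(data, width) / evidence_fixed(data)
    print(f"prior width {width:6.1f}: B/A model-likelihood ratio = {ratio:.4f}")
```

Because these particular data are perfectly happy with a mean of zero, the extra knob in model B buys almost nothing in goodness of fit, and the ratio falls as the prior is widened: that falling ratio is the Ockham factor at work. Had the data sat far from zero, the better fit of model B could easily have outweighed the penalty.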
In a cosmological context, (as I discussed a while ago, and Hobson & Lasenby talked about last week, and which Andrew Liddle has written about extensively recently) it turns out that a relatively simple model with about 6 parameters fits the data very well — none of the possible bells and whistles seem to fit the data better enough to be worth the added complexity. This is both a remarkable triumph of our theorizing and an indication that we need to get better data since, probably, these models are too simple to be enough to describe the universe in all its detailed glory.