[Update: I have fixed some broken links, and modified the discussion of QBism and the recent paper by Chris Fuchs— thanks to Chris himself for taking the time to read and find my mistakes!]
For some reason, I’ve come across an idea called “Knightian Uncertainty” quite a bit lately. Frank Knight was an economist of the free-market conservative “Chicago School”, who considered various concepts related to probability in a book called Risk, Uncertainty, and Profit. He distinguished between “risk”, which he defined as applying to events to which we can assign a numerical probability, and “uncertainty”, which applies to those events about which we know so little that we don’t even have a probability to assign, or indeed those events whose possibility we didn’t even contemplate until they occurred. In Rumsfeldian language, “risk” applies to “known unknowns”, and “uncertainty” to “unknown unknowns”. Or, as Nassim Nicholas Taleb put it, “risk” is about “white swans”, while “uncertainty” is about those unexpected “black swans”.
(As a linguistic aside, to me, “uncertainty” seems a milder term than “risk”, and so the naming of the concepts is backwards.)
Actually, there are a couple of slightly different concepts at play here. The black swans or unknown-unknowns are events that one wouldn’t have known enough about to even include in the probabilities being assigned. This is much more severe than those events that one knows about, but for which one doesn’t have a good probability to assign.
And the important word here is “assign”. Probabilities are not something out there in nature, but in our heads. So what should a Bayesian make of these sorts of uncertainty? By definition, they can’t be used in Bayes’ theorem, which requires specifying a probability distribution. Bayesian theory is all about making models of the world: we posit a mechanism and possible outcomes, and assign probabilities to the parts of the model that we don’t know about.
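For reference, the theorem can be written out as below. The symbols are assumptions chosen for illustration and do not appear in the text: θ stands for the unknown parts of a model M, and D for the data. The prior P(θ | M) is the probability distribution that has to be specified before the theorem can be used at all, which is why an event left out of M entirely cannot enter the calculation.

```latex
P(\theta \mid D, M) = \frac{P(D \mid \theta, M)\, P(\theta \mid M)}{P(D \mid M)}
```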
So I think the two different types of Knightian uncertainty have quite a different role here. In the case where we know that some event is possible, but we don’t really know what probabilities to assign to it, we at least have a starting point. If our model is broad enough, then enough data will allow us to measure the parameters that describe it. For example, in recent years people have started to realise that the frequencies of rare, catastrophic events (financial crashes, earthquakes, etc.) are very often well described by so-called power-law distributions. These assign much greater probabilities to such events than more typical Gaussian (bell-shaped curve) distributions; the shorthand for this is that power-law distributions have much heavier tails than Gaussians. As long as our model includes the possibility of these heavy tails, we should be able to make predictions based on data, although very often those predictions won’t be very precise.
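To give a rough feel for what “heavier tails” means, here is a minimal Python sketch comparing the probability of exceeding a threshold under a unit Gaussian and under a Pareto (power-law) distribution. The function names, the tail index of 2 and the scale of 1 are illustrative assumptions, and the two distributions are not matched to each other or fitted to any data.

```python
import math


def gaussian_tail(x, sigma=1.0):
    """P(X > x) for a zero-mean Gaussian with standard deviation sigma."""
    return 0.5 * math.erfc(x / (sigma * math.sqrt(2.0)))


def pareto_tail(x, x_min=1.0, alpha=2.0):
    """P(X > x) for a Pareto (power-law) distribution.

    x_min is the smallest possible value and alpha the tail index;
    both defaults are arbitrary choices for illustration.
    """
    if x <= x_min:
        return 1.0
    return (x_min / x) ** alpha


if __name__ == "__main__":
    for x in (2.0, 5.0, 10.0):
        print(f"x = {x:5.1f}  Gaussian: {gaussian_tail(x):.3e}  "
              f"power law: {pareto_tail(x):.3e}")
```

With these choices the Gaussian tail falls off faster than exponentially, while the power-law tail only drops by a factor of 100 for each factor of 10 in x, so by ten units out the two differ by many orders of magnitude.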
But the “black swan” problem is much worse: these are possibilities that we don’t even know enough about to consider in our model. Almost by definition, one can’t say anything at all about this sort of uncertainty. But what one must do is be open-minded enough to adjust our models in the face of new data: we can’t predict the black swan, but we should expand the model after we’ve seen the first one (and perhaps revise our model for other waterfowl to allow more varieties!). In more traditional scientific settings, involving measurements with errors, this is even more difficult: a seemingly anomalous result, not allowed in the model, may be due to some mistake in the experimental setup or in our characterisation of the probabilities of those inevitable errors (perhaps they should be described by heavy-tailed power laws, rather than Gaussian distributions as above).
I first came across the concept as an oblique reference in a recent paper by Chris Fuchs, writing about his idea of QBism (or see here for a more philosophically-oriented discussion), an interpretation of quantum mechanics that takes seriously the Bayesian principle that all probabilities are about our knowledge of the world, rather than the world itself (which is a discussion for another day). He tentatively opined that the probabilities in quantum mechanics are themselves “Knightian”, referring not to a reading of Knight himself but to some recent, and to me frankly bizarre, ideas from Scott Aaronson, discussed in his paper, The Ghost in the Quantum Turing Machine, and an accompanying blog post, trying to base something like “free will” (a term he explicitly does not apply to this idea, however) on the possibility of our brains having so-called “freebits”, quantum states whose probabilities are essentially uncorrelated with anything else in the Universe. This arises from what is to me a mistaken desire to equate “freedom” with complete unpredictability. My take on free will is instead aligned with that of Daniel Dennett, at least the version from his Consciousness Explained from the early 1990s, as I haven’t yet had the chance to read his recent From Bacteria to Bach and Back: a perfectly deterministic (or quantum mechanically random, even allowing for the statistical correlations that Aaronson wants to be rid of) version of free will is completely sensible, and indeed may be the only kind of free will worth having.
Fuchs himself tentatively uses Aaronson’s “Knightian Freedom” to refer to his own idea
that nature does what it wants, without a mechanism underneath, and without any “hidden hand” of the likes of Richard von Mises’s Kollektiv or Karl Popper’s propensities or David Lewis’s objective chances, or indeed any conception that would diminish the autonomy of nature’s events,
which I think is an attempt (one which I admit I don’t completely understand) to remove the probabilities of quantum mechanics entirely from any mechanistic account of physical systems, despite the incredible success of those probabilities in predicting the outcomes of experiments and other observations of quantum mechanical systems. I’m not quite sure this is what either Knight or Aaronson had in mind with their use of “uncertainty” (or “freedom”), since at least in quantum mechanics, we do know what probabilities to assign, given certain other personal (as Fuchs would have it) information about the system. My Bayesian predilections make me sympathetic with this idea, but then I struggle to understand what, exactly, quantum mechanics has taught us about the world: why do the predictions of quantum mechanics work?
When I’m not thinking about physics, for the last year or so my mind has been occupied with politics, so I was amused to see Knightian Uncertainty crop up in a New Yorker article about Trump’s effect on the stock market:
Still, in economics there’s a famous distinction, developed by the great Chicago economist Frank Knight, between risk and uncertainty. Risk is when you don’t know exactly what will happen but nonetheless have a sense of the possibilities and their relative likelihood. Uncertainty is when you’re so unsure about the future that you have no way of calculating how likely various outcomes are. Business is betting that Trump is risky but not uncertain—he may shake things up, but he isn’t going to blow them up. What they’re not taking seriously is the possibility that Trump may be willing to do things—like start a trade war with China or a real war with Iran—whose outcomes would be truly uncertain.
It’s a pretty low bar, but we can only hope.
I recently finished my last term lecturing our second-year Quantum Mechanics course, which I taught for five years. It’s a required class, a mathematical introduction to one of the most important sets of ideas in all of physics, and really the basis for much of what we do, whether that’s astrophysics or particle physics or almost anything else. It’s a slightly “old-fashioned” course, although it covers the important basic ideas: the Schrödinger Equation, the postulates of quantum mechanics, angular momentum, and spin, leading almost up to what is needed to understand the crowning achievement of early quantum theory: the structure of the hydrogen atom (and other atoms).
A more modern approach might start with qubits: the simplest systems that show quantum mechanical behaviour, and the study of which has led to the revolution in quantum information and quantum computing.
Moreover, the lectures rely on the so-called Copenhagen interpretation, which is the confusing and sometimes contradictory way that most physicists are taught to think about the basic ontology of quantum mechanics: what it says about what the world is “made of” and what happens when you make a quantum-mechanical measurement of that world. Indeed, it’s so confusing and contradictory that you really need another rule so that you don’t complain when you start to think too deeply about it: “shut up and calculate”. A more modern approach might also discuss the many-worlds approach, and — my current favorite — the (of course) Bayesian ideas of QBism.
The students seemed pleased with the course as it is — at the end of the term, they have the chance to give us some feedback through our “Student On-Line Evaluation” system, and my marks have been pretty consistent. Of the 200 or so students in the class, only about 90 bother to give their evaluations, which is disappointingly few. But it’s enough (I hope) to get a feeling for what they thought.
So, most students Definitely/Mostly Agree with the good things, although it’s clear that our students are most disappointed in the feedback that they receive from us (this is a wider issue for us in Physics at Imperial and beyond, and it may partially explain why most of them are unwilling to feed back to us through this form).
But much more fun and occasionally revealing are the “free-text comments”. Given the numerical scores, it’s not too surprising that there were plenty of positive ones:
Excellent lecturer - was enthusiastic and made you want to listen and learn well. Explained theory very well and clearly and showed he responded to suggestions on how to improve.
Possibly the best lecturer of this term.
Thanks for providing me with the knowledge and top level banter.
One of my favourite lecturers so far, Jaffe was entertaining and cleary very knowledgeable. He was always open to answering questions, no matter how simple they may be, and gave plenty of opportunity for students to ask them during lectures. I found this highly beneficial. His lecturing style incorporates well the blackboards, projectors and speach and he finds a nice balance between them. He can be a little erratic sometimes, which can cause confusion (e.g. suddenly remembering that he forgot to write something on the board while talking about something else completely and not really explaining what he wrote to correct it), but this is only a minor fix. Overall VERY HAPPY with this lecturer!
But some were more mixed:
One of the best, and funniest, lecturers I’ve had. However, there are some important conclusions which are non-intuitively derived from the mathematics, which would be made clearer if they were stated explicitly, e.g. by writing them on the board.
I felt this was the first time I really got a strong qualitative grasp of quantum mechanics, which I certainly owe to Prof Jaffe’s awesome lectures. Sadly I can’t quite say the same about my theoretical grasp; I felt the final third of the course less accessible, particularly when tackling angular momentum. At times, I struggled to contextualise the maths on the board, especially when using new techniques or notation. I mostly managed to follow Prof Jaffe’s derivations and explanations, but struggled to understand the greater meaning. This could be improved on next year. Apart from that, I really enjoyed going to the lectures and thought Prof Jaffe did a great job!
The course was inevitably very difficult to follow.
And several students explicitly commented on my attempts to get students to ask questions in as public a way as possible, so that everyone can benefit from the answers and — this really is true! — because there really are no embarrassing questions!
Really good at explaining and very engaging. Can seem a little abrasive at times. People don’t like asking questions in lectures, and not really liking people to ask questions in private afterwards, it ultimately means that no questions really get answered. Also, not answering questions by email makes sense, but no one really uses the blackboard form, so again no one really gets any questions answered. Though the rationale behind not answering email questions makes sense, it does seem a little unnecessarily difficult.
We are told not to ask questions privately so that everyone can learn from our doubts/misunderstandings, but I, amongst many people, don’t have the confidence to ask a question in front of 250 people during a lecture.
Forcing people to ask questions in lectures or publically on a message board is inappropriate. I understand it makes less work for you, but many students do not have the confidence to ask so openly, you are discouraging them from clarifying their understanding.
Inevitably, some of the comments were contradictory:
Would have been helpful to go through examples in lectures rather than going over the long-winded maths to derive equations/relationships that are already in the notes.
Professor Jaffe is very good at explaining the material. I really enjoyed his lectures. It was good that the important mathematics was covered in the lectures, with the bulk of the algebra that did not contribute to understanding being left to the handouts. This ensured we did not get bogged down in unnecessary mathematics and that there was more emphasis on the physics. I liked how Professor Jaffe would sometimes guide us through the important physics behind the mathematics. That made sure I did not get lost in the maths. A great lecture course!
And also inevitably, some students wanted to know more about the exam:
- It is a difficult module, however well covered. The large amount of content (between lecture notes and handouts) is useful. Could you please identify what is examinable though as it is currently unclear and I would like to focus my time appropriately?
And one comment was particularly worrying (along with my seeming “a little abrasive at times”, above):
- The lecturer was really good in lectures. however, during office hours he was a bit arrogant and did not approach the student nicely, in contrast to the behaviour of all the other professors I have spoken to
If any of the students are reading this, and are willing to comment further on this, I’d love to know more — I definitely don’t want to seem (or be!) arrogant or abrasive.
But I’m happy to see that most students don’t seem to think so, and even happier to have learned that I’ve been nominated “multiple times” for Imperial’s Student Academic Choice Awards!
Finally, best of luck to my colleague Jonathan Pritchard, who will be taking over teaching the course next year.
Nearly a decade ago, blogging was young, and its place in the academic world wasn’t clear. Back in 2005, I wrote about an anonymous article in the Chronicle of Higher Education, a so-called “advice” column admonishing academic job seekers to avoid blogging, mostly because it let the hiring committee find out things that had nothing whatever to do with their academic job, and reject them on those (inappropriate) grounds.
I thought things had changed. Many academics have blogs, and indeed many institutions encourage it (here at Imperial, there’s a College-wide list of blogs written by people at all levels, and I’ve helped teach a course on blogging for young academics). More generally, outreach has become an important component of academic life (that is, it’s at least necessary to pay it lip service when applying for funding or promotions) and blogging is usually seen as a useful way to reach a wide audience outside of one’s field.
So I was distressed to see the lament — from an academic blogger — “Want an academic job? Hold your tongue”. Things haven’t changed as much as I thought:
… [A senior academic said that] the blog, while it was to be commended for its forthright tone, was so informal and laced with profanity that the professor could not help but hold the blog against the potential faculty member…. It was the consensus that aspiring young scientists should steer clear of such activities.
Depending on the content of the blog in question, this seems somewhere between a disregard for academic freedom and a judgment of the candidate on completely irrelevant grounds. Of course, it is natural to want the personalities of our colleagues to mesh well with our own, and almost impossible to completely ignore supposedly extraneous information. But we are hiring for academic jobs, and what should matter are research and teaching ability.
Of course, I’ve been lucky: I already had a permanent job when I started blogging, and I work in the UK system which doesn’t have a tenure review process. And I admit this blog has steered clear of truly controversial topics (depending on what you think of Bayesian probability, at least).
If you’re the kind of person who reads this blog, then you won’t have missed yesterday’s announcement of the first Planck cosmology results.
The most important is our picture of the cosmic microwave background itself:
But it takes a lot of work to go from the data coming off the Planck satellite to this picture. First, we have to make nine different maps, one at each of the frequencies in which Planck observes, from 30 GHz (with a wavelength of 1 cm) up to 857 GHz (0.350 mm) — note that the colour scales here are the same:
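The quoted wavelengths follow from λ = c/ν, and a few lines of Python make the conversion explicit. The list of nine band centres below is the commonly quoted set of nominal Planck channel frequencies; it is an assumption brought in for illustration, since the text only gives the two ends of the range, and the function name is hypothetical.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0

# Nominal Planck channel frequencies in GHz (assumed here, not listed in the post).
PLANCK_BANDS_GHZ = [30, 44, 70, 100, 143, 217, 353, 545, 857]


def wavelength_mm(freq_ghz):
    """Wavelength in millimetres for a frequency given in GHz."""
    wavelength_m = SPEED_OF_LIGHT_M_PER_S / (freq_ghz * 1e9)
    return wavelength_m * 1e3


if __name__ == "__main__":
    for band in PLANCK_BANDS_GHZ:
        print(f"{band:4d} GHz  ->  {wavelength_mm(band):6.3f} mm")
```

The lowest band comes out at very nearly 10 mm (1 cm) and the highest at about 0.35 mm, matching the figures above.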
At low and high frequencies, these are dominated by the emission of our own galaxy, and there is at least some contamination over the whole range, so it takes hard work to separate the primordial CMB signal from the dirty (but interesting) astrophysics along the way. In fact, it’s sufficiently challenging that the team uses four different methods, each with different assumptions, to do so, and the results agree remarkably well.
In fact, we don’t use the above CMB image directly to do the main cosmological science. Instead, we build a Bayesian model of the data, combining our understanding of the foreground astrophysics and the cosmology, and marginalise over the astrophysical parameters in order to extract as much cosmological information as we can. (The formalism is described in the Planck likelihood paper, and the main results of the analysis are in the Planck cosmological parameters paper.)
The main tool for this is the power spectrum, a plot which shows us how the different hot and cold spots on our CMB map are distributed: In this plot, the left-hand side (low ℓ) corresponds to large angles on the sky and high ℓ to small angles. Planck’s results are remarkable for covering this whole range from ℓ=2 to ℓ=2500: the previous CMB satellite, WMAP, had a high-quality spectrum out to ℓ=750 or so; ground- and balloon-based experiments like SPT and ACT filled in some of the high-ℓ regime.
It’s worth marvelling at this for a moment, a triumph of modern cosmological theory and observation: our theoretical models fit our data from scales of 180° down to 0.1°, each of those bumps and wiggles a further sign of how well we understand the contents, history and evolution of the Universe. Our high-quality data has refined our knowledge of the cosmological parameters that describe the universe, decreasing the error bars by a factor of several on the six parameters that describe the simplest ΛCDM universe. Moreover, and maybe remarkably, the data don’t seem to require any additional parameters beyond those six: for example, despite previous evidence to the contrary, the Universe doesn’t need any additional neutrinos.
The quantity most well-measured by Planck is related to the typical size of spots in the CMB map; it’s about a degree, with an error of less than one part in 1,000. This quantity has changed a bit (by about the width of the error bar) since the previous WMAP results. This, in turn, causes us to revise our estimates of quantities like the expansion rate of the Universe (the Hubble constant), which has gone down, in fact by enough that it’s interestingly different from its best measurements using local (non-CMB) data, from more or less direct observations of galaxies moving away from us. Both methods have disadvantages: for the CMB, it’s a very indirect measurement, requiring imposing a model upon the directly measured spot size (known more technically as the “acoustic scale” since it comes from sound waves in the early Universe). For observations of local galaxies, it requires building up the famous cosmic distance ladder, calibrating our understanding of the distances to further and further objects, few of which we truly understand from first principles. So perhaps this discrepancy is due to messy and difficult astrophysics, or perhaps to interesting cosmological evolution.
This change in the expansion rate is also indirectly responsible for the results that have made the most headlines: it changes our best estimate of the age of the Universe (slower expansion means an older Universe) and of the relative amounts of its constituents (since the expansion rate is related to the geometry of the Universe, which, because of Einstein’s General Relativity, tells us the amount of matter).
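The link between expansion rate and age can be illustrated with the “Hubble time”, 1/H0, which sets the overall timescale. This is only a sketch: the true age also depends on the matter and dark-energy content, and the two values of H0 used below (67 and 73 km/s/Mpc) are round numbers assumed for illustration rather than results quoted in this post.

```python
KM_PER_MPC = 3.0857e19        # kilometres in one megaparsec
SECONDS_PER_GYR = 3.15576e16  # seconds in 10^9 Julian years


def hubble_time_gyr(h0_km_s_mpc):
    """Return 1/H0 in gigayears for H0 given in km/s/Mpc."""
    seconds = KM_PER_MPC / h0_km_s_mpc
    return seconds / SECONDS_PER_GYR


if __name__ == "__main__":
    for h0 in (67.0, 73.0):  # illustrative values only
        print(f"H0 = {h0:4.1f} km/s/Mpc  ->  1/H0 = {hubble_time_gyr(h0):.2f} Gyr")
```

A lower H0 gives a longer Hubble time, which is the sense in which slower expansion means an older Universe.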
But the cosmological parameters measured in this way are just Planck’s headlines: there is plenty more science. We’ve gone beyond the power spectrum above to put limits upon so-called non-Gaussianities which are signatures of the detailed way in which the seeds of large-scale structure in the Universe were initially laid down. We’ve observed clusters of galaxies which give us yet more insight into cosmology (and which seem to show an intriguing tension with some of the cosmological parameters). We’ve measured the deflection of light by gravitational lensing. And in work that I helped lead, we’ve used the CMB maps to put limits on some of the ways in which our simplest models of the Universe could be wrong, possibly having an interesting topology or rotation on the largest scales.
But because we’ve scrutinised our data so carefully, we have found some peculiarities which don’t quite fit the models. From the days of COBE and WMAP, there has been evidence that the largest angular scales in the map, a few degrees and larger, have some “anomalies” — some of the patterns show strange alignments, some show unexpected variation between two different hemispheres of the sky, and there are some areas of the sky that are larger and colder than is expected to occur in our theories. Individually, any of these might be a statistical fluke (and collectively they may still be) but perhaps they are giving us evidence of something exciting going on in the early Universe. Or perhaps, to use a bad analogy, the CMB map is like the Zapruder film: if you scrutinise anything carefully enough, you’ll find things that look like a conspiracy, but turn out to have an innocent explanation.
I’ve mentioned eight different Planck papers so far, but in fact we’ve released 28 (and there will be a few more to come over the coming months, and many in the future). There’s an overall introduction to the Planck Mission, and papers on the data processing, observations of relatively nearby galaxies, and plenty more cosmology. The papers have been submitted to the journal A&A, they’re available on the arXiv, and you can find a list of them at the ESA site.
Even more important for my cosmology colleagues, we’ve released the Planck data as well, along with the code and other information necessary to understand it: you can get it from the Planck Legacy Archive. I’m sure we’ve only just begun to get exciting and fun science out of the data from Planck. And this is only the beginning of Planck’s data: just the first 15 months of observations, and just the intensity of the CMB: in the coming years we’ll be analysing (and releasing) more than one more year of data, and starting to dig into Planck’s observations of the polarized sky.
A week ago, I finished my first time teaching our second-year course in quantum mechanics. After a bit of a taster in the first year, the class concentrates on the famous Schrödinger equation, which describes the properties of a particle under the influence of an external force. The simplest version of the equation is just [equation]. This relates the so-called wave function, ψ, to what we know about the external forces governing its motion, encoded in the Hamiltonian operator, Ĥ. The wave function gives the probability (technically, the probability amplitude) for getting a particular result for any measurement: its position, its velocity, its energy, etc. (See also this excellent public work by our department’s artist-in-residence.)
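For reference, the two standard textbook forms of the equation in the notation used above are the time-dependent version and its time-independent (energy eigenvalue) counterpart. Which of the two was meant as “the simplest version” is an assumption here, and E (the energy) and ħ (the reduced Planck constant) are symbols not introduced in the text.

```latex
i\hbar \frac{\partial \psi}{\partial t} = \hat{H}\psi
\qquad\text{and}\qquad
\hat{H}\psi = E\psi
```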
Over the course of the term, the class builds up the machinery to predict the properties of the hydrogen atom, which is the canonical real-world system for which we need quantum mechanics to make predictions. This is certainly a sensible endpoint for the 30 lectures.
But it did somehow seem like a very old-fashioned way to teach the course. Even back in the 1980s when I first took a university quantum mechanics class, we learned things in a way more closely related to the way quantum mechanics is used by practicing physicists: the mathematical details of Hilbert spaces, path integrals, and Dirac Notation.
Today, an up-to-date quantum course would likely start from the perspective of quantum information, distilling quantum mechanics down to its simplest constituents: qubits, systems with just two possible states (instead of the infinite possibilities usually described by the wave function). The interactions become less important, superseded by the information carried by those states.
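To make the two-state idea concrete, here is a minimal Python sketch of a single qubit described by two complex amplitudes, with measurement probabilities given by the usual Born rule. The function name and the example amplitudes are assumptions for illustration, not material from the course.

```python
def measurement_probabilities(amp0, amp1):
    """Probabilities of finding a qubit in state 0 or state 1.

    amp0 and amp1 are the (possibly complex, possibly unnormalised)
    amplitudes of the two states; the Born rule gives each probability
    as the squared magnitude of the amplitude, divided by the total.
    """
    weight0 = abs(amp0) ** 2
    weight1 = abs(amp1) ** 2
    total = weight0 + weight1
    if total == 0:
        raise ValueError("at least one amplitude must be non-zero")
    return weight0 / total, weight1 / total


if __name__ == "__main__":
    # An equal superposition with a relative phase of i between the two states.
    print(measurement_probabilities(1, 1j))
    # A state weighted 3:4 in amplitude, so 9:16 in probability.
    print(measurement_probabilities(3, 4))
```

The relative phase in the first example does not change the probabilities for this particular measurement, which is part of what makes even a two-state system richer than a classical coin.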
Really, it should be thought of as a full year-long course, and indeed much of the good stuff comes in the second term when the students take “Applications of Quantum Mechanics” in which they study those atoms in greater depth, learn about fermions and bosons and ultimately understand the structure of the periodic table of elements. Later on, they can take courses in the mathematical foundations of quantum mechanics, and, yes, on quantum information, quantum field theory and on the application of quantum physics to much bigger objects in “solid-state physics”.
Despite these structural questions, I was pretty pleased with the course overall: all two-hundred-plus students take it at the beginning of their second year, with thirty lectures, ten ungraded problem sheets and seven in-class problems called “classworks”. Still to come: a short test right after New Year’s and the final exam in June. Because it was my first time giving these lectures, and because it’s such an integral part of our teaching, I stuck to the same notes and problems as my recent predecessors (so many, many thanks to my colleagues Paul Dauncey and Danny Segal).
Once the students got over my funny foreign accent, bad board handwriting, and worse jokes, I think I was able to get across the mathematics, the physical principles and, eventually, the underlying weirdness of quantum physics. I kept to the standard Copenhagen Interpretation of quantum physics, in which we think of the aforementioned wavefunction as a real, physical thing, which evolves under that Schrödinger equation — except when we decide to make a measurement, at which point it undergoes what we call collapse, randomly and seemingly against causality: this was Einstein’s “spooky action at a distance” which seemed to indicate nature playing dice with our Universe, in contrast to the purely deterministic physics of Newton and Einstein’s own relativity. No one is satisfied with Copenhagen, although a more coherent replacement has yet to be found (I won’t enumerate the possibilities here, except to say that I find the proliferating multiverse of Everett’s Many-Worlds interpretation ontologically extravagant, and Chris Fuchs’ Quantum Bayesianism compelling but incomplete).
I am looking forward to getting this year’s SOLE results to find out for sure, but I think the students learned something, or at least enjoyed trying to, although the applause at the end of each lecture seemed somewhat tinged with British irony.
Among the many other things I haven’t had time to blog about, this term we opened the new Imperial Centre for Inference and Cosmology, the culmination of several years of expansion in the Imperial Astrophysics group. In mid-March we had our in-house grand opening, with a ribbon-cutting by the group’s most famous alumnus.
Statistics and astronomy have a long history together, largely growing from the desire to predict the locations of planets and other heavenly bodies based on inexact measurements. In relatively modern times, that goes back at least to Legendre and Gauss who more or less independently came up with the least-squares method of combining observations, which can be thought of as based on the latter’s eponymous Gaussian distribution.
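As a small illustration of that connection, the least-squares combination of several measurements of the same quantity, each with its own Gaussian error bar, reduces to an inverse-variance weighted mean. The sketch below assumes independent errors; the function name and the example numbers are hypothetical.

```python
def combine_measurements(values, sigmas):
    """Least-squares estimate of one quantity from independent measurements.

    Minimising sum((value - mu)**2 / sigma**2) over mu, which is also the
    maximum-likelihood estimate for Gaussian errors, gives the
    inverse-variance weighted mean. Returns (estimate, uncertainty).
    """
    if len(values) != len(sigmas) or not values:
        raise ValueError("need equal, non-zero numbers of values and sigmas")
    weights = [1.0 / sigma ** 2 for sigma in sigmas]
    total_weight = sum(weights)
    estimate = sum(w * v for w, v in zip(weights, values)) / total_weight
    uncertainty = total_weight ** -0.5
    return estimate, uncertainty


if __name__ == "__main__":
    # Two made-up position measurements, the second twice as uncertain.
    print(combine_measurements([10.0, 12.0], [1.0, 2.0]))
```

The more precise measurement pulls the estimate towards itself, and the combined uncertainty is smaller than either individual error bar.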
Our group had already had a much shorter but still significant history in what has come to be called “astrostatistics”, having been involved with large astronomical surveys such as UKIDSS and IPHAS and the many allowed by the infrared satellite telescope Herschel (and its predecessors ISO, IRAS and Spitzer). Along with my own work on the CMB and other applications of statistics to cosmology, the other “founding members” of ICIC include: my colleague Roberto Trotta who has made important forays into the rigorous application of principled Bayesian statistics to problems in cosmology and particle physics; Jonathan Pritchard who studies the distribution of matter in the evolving Universe and what that can teach us about its constituents and that evolution; and Daniel Mortlock, who has written about some of his work looking for rare and unusual objects elsewhere on this blog. We are lucky to have the initial membership of the group supplemented by Alan Heavens, who will be joining us over the summer and has a long history of working to understand the distribution of matter in the Universe throughout its history. This group will be joined by several members of the Statistics section of the Mathematics Department, in particular David van Dyk, David Hand and Axel Gandy.
One of the fun parts of starting up the new centre has been the opportunity to design our new suite of glass-walled offices. Once we made sure that there would be room for a couple of sofas and a coffee machine for the Astrophysics group to share, we needed something to allow a little privacy. For the main corridor, we settled on this:
The left side is from the Hubble Ultra-Deep Field (in negative), a picture about 3 arc minutes on a side (about the size of a grain of sand held at arm’s length), the deepest — most distant — optical image of the Universe yet taken. The right side is our Milky Way galaxy as reconstructed by the 2MASS survey.
The final wall is a bit different:
The middle panels show part of papers by each of those founding members of the group, flanked on the left and right side with the posthumously published paper by the Rev. Thomas Bayes who gave his name to the field of Bayesian Probability.
Of course, there has been some controversy about how we should actually refer to the place. Reading out the letters gives the amusing “I see, I see”, and IC² (“I-C-squared”) has a nice feel and a bit of built-in mathematics, although it does sound a bit like the outcome of a late-90s corporate branding exercise (and the pedants in the group noted that technically it would then be the incorrect I×C×C unless we cluttered it with parentheses).
We’re hoping that the group will keep growing, and we look forward to applying our tools and ideas to more and more astronomical data over the coming years. One of the most important ways to do that, of course, will be through collaboration: if you’re an astronomer with lots of data, or a statistician with lots of ideas, or, like many of us, somewhere in between, please get in touch and come for a visit.
Continuing my recent, seemingly interminable, series of too-technical posts on probability theory… To understand this one you’ll need to remember Bayes’ Theorem, and the resulting need for a Bayesian statistician to come up with an appropriate prior distribution to describe her state of knowledge in the absence of the experimental data she is considering, updated to the posterior distribution after considering that data. I should perhaps follow the guide of blogging-hero Paul Krugman and explicitly label posts like this as “wonkish”.
(If instead you’d prefer something a little more tutorial, I can recommend the excellent recent post from my colleague Ted Bunn, discussing hypothesis testing, stopping rules, and cheating at coin flips.)
Deborah Mayo has begun her own series of posts discussing some of the articles in a recent special volume of the excellently-named journal, “Rationality, Markets and Morals” on the topic Statistical Science and Philosophy of Science.
She has started with a discussion of Stephen Senn’s “You May Believe You are a Bayesian But You Are Probably Wrong”: she excerpts the article here and then gives her own deconstruction in the sequel.
Senn’s article begins with a survey of the different philosophical schools of statistics: not just frequentist versus Bayesian (for which he also uses the somewhat old-fashioned names of “direct” versus “inverse” probability), but also how the practitioners choose to apply the probabilities that they calculate: either directly, as inferences about the world, or by using those probabilities to make decisions in order to give a further meaning to the probability.
Having cleaved the statistical world in four, Senn makes a clever rhetorical move. In a wonderfully multilevelled backhanded compliment, he writes
If any one of the four systems had a claim to our attention then I find de Finetti’s subjective Bayes theory extremely beautiful and seductive (even though I must confess to also having some perhaps irrational dislike of it). The only problem with it is that it seems impossible to apply.
He discusses why it is essentially impossible to perform completely coherent ground-up analyses within the Bayesian formalism:
This difficulty is usually described as being the difficulty of assigning subjective probabilities but, in fact, it is not just difficult because it is subjective: it is difficult because it is very hard to be sufficiently imaginative and because life is short.
And, later on:
The … test is that whereas the arrival of new data will, of course, require you to update your prior distribution to being a posterior distribution, no conceivable possible constellation of results can cause you to wish to change your prior distribution. If it does, you had the wrong prior distribution and this prior distribution would therefore have been wrong even for cases that did not leave you wishing to change it. This means, for example, that model checking is not allowed.
I think that these criticisms mis-state the practice of Bayesian statistics, at least by the scientists I know (mostly cosmologists and astronomers). We do not treat statistics as a grand system of inference (or decision) starting from a single, primitive state of knowledge which we use to reason all the way through to new theoretical paradigms. The caricature of Bayesianism starts with a wide open space of possible theories, and we add data, narrowing our beliefs to accord with our data, using the resulting posterior as the prior for the next set of data to come across our desk.
Rather, most of us take a vaguely Jaynesian view, after the cranky Edwin Jaynes, as espoused in his forty years of papers and his polemical book Probability Theory: The Logic of Science — all probabilities are conditional upon information (although he would likely have been much more hard-core). Contra Senn’s suggestions, the individual doesn’t need to continually adjust her subjective probabilities until she achieves an overall coherence in her views. She just needs to present (or summarise in a talk or paper) a coherent set of probabilities based on given background information (perhaps even more than one set). As long as she carefully states the background information (and the resulting prior), the posterior is a completely coherent inference from it.
In this view, probability doesn’t tell us how to do science, just analyse data in the presence of known hypotheses. We are under no obligation to pursue a grand plan, listing all possible hypotheses from the outset. Indeed we are free to do ‘exploratory data analysis’ using (even) not-at-all-Bayesian techniques to help suggest new hypotheses. This is a point of view espoused most forcefully by Andrew Gelman (author of another paper in the special volume of RMM).
Of course this does not solve all formal or philosophical problems with the Bayesian paradigm. In particular, as I’ve discussed a few times recently, it doesn’t solve what seems to me the most knotty problem of hypothesis testing in the presence of what one would like to be ‘wide open’ prior information.
I spent a quick couple of days last week at The Controversy about Hypothesis Testing meeting in Madrid.
The topic of the meeting was indeed the question of “hypothesis testing”, which I addressed in a post a few months ago: how do you choose between conflicting interpretations of data? The canonical version of this question was the test of Einstein’s theory of relativity in the early 20th Century — did the observations of the advance of the perihelion of Mercury (and eventually of the gravitational lensing of starlight by the sun) match the predictions of Einstein’s theory better than Newton’s? And of course there are cases in which even more than a scientific theory is riding on the outcome: is a given treatment effective? I won’t rehash here my opinions on the subject, except to say that I think there really is a controversy: the purported Bayesian solution runs into problems in realistic cases of hypotheses about which we would like to claim some sort of “ignorance” (always a dangerous word in Bayesian circles), while the orthodox frequentist way of looking at the problem is certainly ad hoc and possibly incoherent, but nonetheless seems to work in many cases.
Sometimes, the technical worries don’t apply, and the Bayesian formalism provides the ideal solution. For example, my colleague Daniel Mortlock has applied the model-comparison formalism to deciding whether objects in his UKIDSS survey data are more likely to be distant quasars or nearby and less interesting objects. (He discussed his method here a few months ago.)
In between thoughts about hypothesis testing, I experienced the cultural differences between the statistics community and us astrophysicists and cosmologists, of which I was the only example at the meeting: a typical statistics talk just presents pages of text and equations with the occasional poorly-labeled graph thrown in. My talks tend to be a bit heavier on the presentation aspects, perhaps inevitably so given the sometimes beautiful pictures that package our data.
On the other hand, it was clear that the statisticians take their Q&A sessions very seriously, prodded in this case by the word “controversy” in the conference’s title. In his opening keynote, Jose Bernardo, up from Valencia for the meeting, discussed his work as a so-called “Objective Bayesian”, prompting a question from the mathematically-oriented philosopher Deborah Mayo. Mayo is an arch-frequentist (and blogger) who prefers to describe her particular version as “Error Statistics”, concerned (if I understand correctly after our wine-fuelled discussion at the conference dinner) with the use of probability and statistics to criticise the errors we make in our methods, in contrast with the Bayesian view of probability as a description of our possible knowledge of the world. These two points of view are sufficiently far apart that Bernardo countered one of the questions with the almost-rude but definitely entertaining riposte “You are bloody inconsistent — you are not mathematicians.” That was probably the most explicit almost-personal attack of the meeting, but there were similar exchanges. Not mine, though: my talk was a little more didactic than most, as I knew that I had to justify the science as well as the statistics that lurks behind any analysis of data.
So I spent much of my talk discussing the basics of modern cosmology, and applying my preferred Bayesian techniques in at least one big-picture case where the method works: choosing amongst the simple set of models that seem to describe the Universe, at least from those that obey General Relativity and the Cosmological Principle, in which we do not occupy a privileged position and which, given our observations, are therefore homogeneous and isotropic on the largest scales. Given those constraints, all we need to specify (or measure) are the amounts of the various constituents in the universe: the total amount of matter and of dark energy. The sum of these, in turn, determines the overall geometry of the universe. In the appropriate units, if the total is one, the universe is flat; if it’s larger, the universe is closed, shaped like a three-dimensional sphere; if smaller, it’s a three-dimensional hyperboloid or saddle. What we find when we make the measurement is that the amount of matter is about 0.282±0.02, and of dark energy about 0.723±0.02. Of course, these add up to just greater than one; model-selection (or hypothesis testing in other forms) allows us to say that the data nonetheless give us reason to prefer the flat Universe despite the small discrepancy.
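To make that last step concrete, here is a minimal sketch of the model comparison, not the analysis from the talk. It assumes the two quoted errors are independent and Gaussian (they are not, really), and it gives the curved model a uniform prior on the total density between 0 and 2; that range, and all of the variable names, are purely illustrative choices.

```python
import math

def gaussian_pdf(x, mean, sigma):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def gaussian_cdf(x, mean, sigma):
    """Cumulative distribution of the same Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))

# Values quoted in the text; independence of the two errors is an assumption.
omega_matter, omega_lambda, err = 0.282, 0.723, 0.02
omega_total = omega_matter + omega_lambda
sigma_total = math.hypot(err, err)  # errors added in quadrature

# Flat model: the total is exactly 1, so its evidence is the likelihood at 1.
evidence_flat = gaussian_pdf(omega_total, 1.0, sigma_total)

# Curved model: total uniform on [lo, hi] (a hypothetical prior range).
lo, hi = 0.0, 2.0
evidence_curved = (
    gaussian_cdf(hi, omega_total, sigma_total)
    - gaussian_cdf(lo, omega_total, sigma_total)
) / (hi - lo)

bayes_factor = evidence_flat / evidence_curved
print(f"total = {omega_total:.3f} +/- {sigma_total:.3f}")
print(f"Bayes factor (flat : curved) = {bayes_factor:.1f}")
```

With these toy inputs the flat model is favoured by a factor of a few tens, even though the measured total is not exactly one. Widening the prior range of the curved model would raise that factor further, so the number depends on a choice the data cannot make for us.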
After the meeting, I had a couple of hours free, so I went across Madrid to the Reina Sofia, to stand amongst the Picassos and Serras. And I was lucky enough to have my hotel room above a different museum:
Yes, more on statistics.
In a recent NY Times article, science reporter Dennis Overbye discusses recent talks from Fermilab and CERN scientists which may hint at the discovery of the much-anticipated Higgs Boson. The executive summary is: it hasn’t been found yet.
But in the course of the article, Overbye points out that
To qualify as a discovery, some bump in the data has to have less than one chance in 3.5 million of being an unlucky fluctuation in the background noise.
That particular number is the so-called “five sigma” level from the Gaussian distribution. Normally, I would use this as an opportunity to discuss exactly what probability means in this context (is it a Bayesian “degree of belief” or a frequentist “p-value”?), but for this discussion that distinction doesn’t matter: the important point is that one in 3.5 million is a very small chance indeed. [For the aficionados, the number is the probability that x > μ + 5σ when x is described by a Gaussian distribution of mean μ and variance σ².]
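For anyone who wants to check the arithmetic, the one-sided Gaussian tail can be computed with the complementary error function from the Python standard library. This is only a sketch of the bracketed formula above; the function name is my own.

```python
import math

def upper_tail(n_sigma):
    """P(x > mu + n_sigma * sigma) for a Gaussian of mean mu and variance sigma**2."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))

p5 = upper_tail(5.0)
print(f"five sigma: p = {p5:.3e}, or about one chance in {1.0 / p5:,.0f}")
```

The same function works for any other threshold, such as `upper_tail(3.0)`.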
Why are we physicists so conservative? Are we just being very careful not to get it wrong, especially when making such a potentially important — Nobel-worthy! — discovery? Even for less ground-breaking results, the limit is often taken to be three sigma, which is about one chance in 750. This is a lot less conservative, but still pretty improbable: I’d happily bet a reasonable amount on a sporting event if I really thought I had 749 chances out of 750 of winning. However, there’s a maxim among scientists: half of all three sigma results are wrong. This may be an exaggeration, but certainly nobody believes “one in 750” is a good description of the probability (nor one in 3.5 million for five sigma results). How could this be? Fifty percent — one in two — is several hundred times more likely than 1/750.
There are several explanations, and any or all of them may be true for a particular result. First, people often underestimate their errors. More specifically, scientists often only include errors for which they can construct a distribution function — so-called statistical or random errors. But the systematic errors, which are, roughly speaking, every other way that the experimental results could be wrong, are usually not accounted for, and of course any “unknown systematics” are ignored by definition, and usually not discovered until well after the fact.
The controversy surrounding the purported measurements of the variation of the fine-structure constant that I discussed last week lies almost entirely in the different groups’ ability to incorporate a good model for the systematic errors in their very precise spectral measurements.
And then of course there are the even-less quantifiable biases that alter what results get published and how we interpret them. Chief among these may be publication or reporting bias: scientists and journals are more likely to publish, or even discuss, exciting new results than supposedly boring confirmations of the old paradigm. If there were a few hundred unpublished three-sigma unexciting confirmations for every published groundbreaking result, we would expect many of those to be statistical flukes. Some of these may be related to the so-called “decline effect” that Jonah Lehrer wrote about in the New Yorker recently: new results seem to get less statistically significant over time as more measurements are made. Finally, as my recent interlocutor, Andrew Gelman, points out “classical statistical methods that work reasonably well when studying moderate or large effects… fall apart in the presence of small effects.”
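A toy piece of bookkeeping shows how reporting bias alone could produce the “half of all three sigma results are wrong” maxim. This is a sketch under loud assumptions: every real effect is always detected (`power=1.0`), only results crossing the threshold are ever reported, and `nulls_per_real_effect` is a hypothetical knob for how many no-effect measurements are made for each real one.

```python
import math

def fraction_of_flukes(nulls_per_real_effect, n_sigma=3.0, power=1.0):
    """Fraction of reported n-sigma 'detections' that are statistical flukes.

    nulls_per_real_effect: measurements with no true effect per real effect
    (hypothetical). power: chance that a real effect crosses the threshold
    (assumed, not measured).
    """
    p_fluke = 0.5 * math.erfc(n_sigma / math.sqrt(2.0))  # one-sided tail
    false_hits = nulls_per_real_effect * p_fluke
    true_hits = power
    return false_hits / (false_hits + true_hits)

for n in (10, 100, 740):
    print(f"{n:4d} null measurements per real effect -> "
          f"{fraction_of_flukes(n):.2f} of reported results are flukes")
```

In this toy picture the maxim holds once there are several hundred no-effect measurements behind each real discovery, without anyone having underestimated a single error bar.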
(In fact, Overbye discussed the large number of “false detections” in astronomy and physics in another Times article almost exactly a year ago.)
Unfortunately all of this can make it very difficult to interpret — and trust — statistical statements in the scientific literature, although we in the supposedly hard sciences have it a little easier as we can often at least enumerate the possible problems even if we can’t always come up with a good statistical model to describe our ignorance in detail.
[Apologies — this is long, technical, and there are too few examples. I am putting it out for commentary more than anything else…]
In some recent articles and blog posts (including one in response to astronomer David Hogg), Columbia University statistician Andrew Gelman has outlined the philosophical position that he and some of his colleagues and co-authors hold. While starting from a resolutely Bayesian perspective on using statistical methods to measure the parameters of a model, he and they depart from the usual story when evaluating models and comparing them to one another. Rather than using the techniques of Bayesian model comparison, they eschew them in preference to a set of techniques they describe as ‘model checking’. Let me apologize in advance if I misconstrue or caricature their views in any way in the following.
In the formalism of model comparison, the statistician or scientist needs to fully specify her model: what are the numbers needed to describe the model, how does the data depend upon them (the likelihood), as well as a reasonable guess for what those numbers might be in the absence of data (the prior). Given these ingredients, one can first combine them to form the posterior distribution to estimate the parameters but then go beyond this to actually determine the probability of the fully-specified model itself.
The first part of the method, estimating the parameters, is usually robust to the choice of a prior distribution for the parameters. In many cases, one can throw the possibilities wide open (an approximation to some sort of ‘complete ignorance’) and get a meaningful measurement of the parameters. In mathematical language, we take the limit of the posterior distribution as we make the prior distribution arbitrarily wide, and this limit often exists.
The problem, noticed by most statisticians and scientists who try to apply these methods, is that the next step, comparing models, is almost always sensitive to the details of the choice of prior: as the prior distribution gets wider and wider, the probability for the model gets lower and lower without limit; a model with an infinitely wide prior has zero probability compared to one with a finite width.
In some situations, where we do not wish to model some sort of ignorance, this is fine. But in others, even if we know it is unreasonable to accept an arbitrarily large value for some parameter, we really cannot reasonably choose between, say, an upper limit of 10¹⁰⁰ and 10⁵⁰, which may have vastly different consequences.
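The scaling is easy to see in the simplest case. As a sketch, assume a model M with one parameter θ, a uniform prior of width Δ, and a Gaussian likelihood that peaks at L_max with width σ much smaller than Δ (these symbols are my own shorthand, not anything from Gelman's papers):

```latex
\begin{align}
P(d \mid M) &= \int P(d \mid \theta, M)\, P(\theta \mid M)\, d\theta
  \approx \frac{\sqrt{2\pi}\,\sigma\, L_{\max}}{\Delta}, \\
P(\theta \mid d, M) &= \frac{P(d \mid \theta, M)\, P(\theta \mid M)}{P(d \mid M)}
  \approx \frac{1}{\sqrt{2\pi}\,\sigma}
  \exp\!\left[-\frac{(\theta - \hat\theta)^2}{2\sigma^2}\right], \\
\frac{P(d \mid M_{\Delta = 10^{50}})}{P(d \mid M_{\Delta = 10^{100}})} &= 10^{50}.
\end{align}
```

The width Δ cancels in the posterior for θ, which is why parameter estimation has a sensible wide-prior limit, but it survives in the model probability, which falls as 1/Δ. The two upper limits mentioned above give identical parameter estimates and model probabilities that differ by fifty orders of magnitude.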
The other problem with model comparison is that, as the name says, it involves comparing models: it is impossible to merely reject a model tout court. But there are certainly cases when we would be wise to do so: the data have a noticeable, significant curve, but our model is a straight line. Or, more realistically (but also much more unusually in the history of science): we know about the advance of the perihelion of Mercury, but Einstein hasn’t yet come along to invent General Relativity; or Planck has written down the black body law but quantum mechanics hasn’t yet been formulated.
These observations lead Gelman and company to reject Bayesian model comparison entirely in favor of what they call ‘model checking’. Having made inferences about the parameters of a model, you next create simulated data from the posterior distribution and compare those simulations to the actual data. This latter step is done using some of the techniques of orthodox ‘frequentist’ methods: choosing a statistic, calculating p-values, and worrying about whether your observation is unusual because it lies in the tail of a distribution.
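Here is a minimal sketch of what such a check might look like, using only the standard library. It is not Gelman's code: the model (Gaussian data with known scatter `sigma` and a flat prior on the mean), the choice of statistic (the largest deviation from the sample mean) and the example numbers are all assumptions made for illustration.

```python
import random
import statistics

def largest_deviation(values):
    """Test statistic: largest absolute deviation from the sample mean."""
    centre = statistics.fmean(values)
    return max(abs(v - centre) for v in values)

def posterior_predictive_pvalue(data, sigma=1.0, n_sims=2000, seed=0):
    """Fraction of simulated data sets whose statistic is at least as extreme
    as the observed one, for a Gaussian model with known sigma."""
    rng = random.Random(seed)
    n = len(data)
    ybar = statistics.fmean(data)
    t_obs = largest_deviation(data)
    exceed = 0
    for _ in range(n_sims):
        # With a flat prior, the posterior for the mean is N(ybar, sigma^2 / n).
        mu = rng.gauss(ybar, sigma / n ** 0.5)
        replicated = [rng.gauss(mu, sigma) for _ in range(n)]
        if largest_deviation(replicated) >= t_obs:
            exceed += 1
    return exceed / n_sims

well_behaved = [-1.2, 0.3, 0.8, -0.5, 1.1, 0.0, -0.7, 0.4, 1.5, -0.9]
with_outlier = well_behaved[:-1] + [8.0]
print("p-value, well-behaved data:", posterior_predictive_pvalue(well_behaved))
print("p-value, data with outlier:", posterior_predictive_pvalue(with_outlier))
```

The second data set gets a tiny p-value because simulated data from the fitted model essentially never contain a point that far from the mean. That flags a problem with the model without naming any alternative to it.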
Having suggested these techniques, they go on to advocate a broader philosophical position on the use of probability in science: it is ‘hypothetico-deductive’, rather than ‘inductive’; Popperian rather than Kuhnian. (For another, even more critical, view of Kuhn’s philosophy of science, I recommend filmmaker Errol Morris’ excellent series of blog posts in the New York Times recounting his time as a graduate student in philosophy with Kuhn.)
At this point, I am sympathetic with their position, but worried about the details. A p-value is well-determined, but remains a kind of meaningless number: the probability of finding the value of your statistic as measured or worse. But you didn’t get a worse value, so it’s not clear why this number is meaningful. On the other hand, it is clearly an indication of something: if it is unlikely to have got a worse value then your data must, in some perhaps ill-determined sense, be itself unlikely. Indeed I think it is worries like this that lead them very often to prefer purely graphical methods — the simulations ‘don’t look like’ the data.
The fact is, however, these methods work. They draw attention to data that do not fit the model and, with well-chosen statistics or graphs, lead the scientist to understand what might be wrong with the model. So perhaps we can get away without mathematically meaningful probabilities as long as we are “just” using them to guide our intuition rather than make precise statements about truth or falsehood.
Having suggested these techniques, they go on to make a rather strange leap: deciding amongst any discrete set of parameters falls into the category of model comparison, against their rules. I’m not sure this restriction is necessary: if the posterior distribution for the discrete parameters makes sense, I don’t see why we should reject the inferences made from it.
In these articles they also discuss what it means for a model to be true or false, and what implications that has for the meaning of probability. As they argue, all models are in fact known to be false, certainly in the social sciences that most concern Gelman, and for the most part in the physical sciences as well, in the sense that they are not completely true in every detail. Newton was wrong, because Einstein was more right, and Einstein is most likely wrong because there is likely to be an even better theory of quantum gravity. Hence, they say, the subjective view of probability is wrong, since no scientist really believes in the truth of the model she is checking. I agree, but I think this is a caricature of the subjective view of probability: it misconstrues the meaning of ‘subjectivity’. If I had to use probabilities only to reflect what I truly believe, I wouldn’t be able to do science, since the only thing that I am sure about my belief system is that it is incoherent:
Do I contradict myself?
Very well then I contradict myself,
(I am large, I contain multitudes.)
— Walt Whitman, Song of Myself
Subjective probability, at least the way it is actually used by practicing scientists, is a sort of “as-if” subjectivity — how would an agent reason if her beliefs were reflected in a certain set of probability distributions? This is why when I discuss probability I try to make the pedantic point that all probabilities are conditional, at least on some background prior information or context. So we shouldn’t really ever write a probability that statement “A” is true as P(A), but rather as P(A|I) for some background information, “I”. If I change the background information to “J”, it shouldn’t surprise me that P(A|I)≠P(A|J). The whole point of doing science is to reason from assumptions and data; it is perfectly plausible for an actual scientist to restrict the context to a choice between two alternatives that she knows to be false. This view of probability owes a lot to Ed Jaynes (as also elucidated by Keith van Horn and others) and would probably be held by most working scientists if you made them elucidate their views in a consistent way.
Still, these philosophical points do not take away from Gelman’s more practical ones, which to me seem distinct from those loftier questions and from each other: first, that the formalism of model comparison is often too sensitive to prior information; second, that we should be able to do some sort of alternative-free model checking in order to falsify a model even if we don’t have any well-motivated substitute. Indeed, I suspect that most scientists, even hardcore Bayesians, work this way even if they (we) don’t always admit it.
A couple of weeks ago, a few of my astrophysics colleagues here at Imperial found the most distant quasar yet discovered, the innocuous red spot in the centre of this image:
One of them, Daniel Mortlock, has offered to explain a bit more:
Surely there’s just no way that something which happened 13 billion years ago — and tens of billions of light years away — could ever be reported as “news”? And yet that’s just what happened last week when world-renowned media outlets like the BBC, Time and, er, Irish Weather Online reported the discovery of the highly luminous quasar ULAS J1120+0641 in the early Universe. (Here is a longer list of links to discussions of the quasar in the media. The discovery was generally included under the science heading, although the Hawai’i Herald Tribune somehow reported it as “local news”, which shows the sort of broad outlook not normally associated with the most insular of the United States.) The incongruity of the timescales involved became particularly clear to me when I, as one of the team of astronomers who made this discovery, fielded phonecalls from journalists who, on the one hand, seemed quite at home with the notion of the light we’ve seen from this quasar having made its leisurely way to us for most of the history of the Universe, and then on the other hand were quite relaxed about a 6pm deadline to file a story on something they hadn’t even heard of a few hours earlier. The idea that this story might go from nothing to being in print in less than a day also made a striking contrast with the rather protracted process by which we made this discovery.
The story of the discovery of ULAS J1120+0641 starts with the United Kingdom InfraRed Telescope (UKIRT), and a meeting of British astronomers a decade ago to decide how best to use it. The consensus was to perform the UKIRT Infrared Deep Sky Survey (UKIDSS), the largest ever survey of the night sky at infrared wavelengths (i.e., between 1 micron and 3 microns), in part to provide a companion to the highly successful Sloan Digital Sky Survey (SDSS) that had recently been made at the optical wavelengths visible to the human eye. Of particular interest was the fact that the SDSS had discovered quasars — the bright cores of galaxies in which gas falling onto a super-massive black hole heats up so much it outshines all the stars in the host galaxy — so distant that they are seen as they were when the Universe was just a billion years old. Even though quasars are much rarer than ordinary galaxies, they are so much brighter that detailed measurements can be made of them, and so they are very effective probes of the early Universe. However there was a limit to how far back SDSS could search as no light emitted earlier than 900 million years after the Big Bang reaches us at optical wavelengths due to a combination of absorption by hydrogen atoms present at those early times and the expansion of the Universe stretching the unabsorbed light to infrared wavelengths. This is where UKIRT comes in — whereas distant sources like ULAS J1120+0641 are invisible to optical surveys, they can be detected using infrared surveys like UKIDSS. So, starting in 2005, UKIDSS got underway, with the eventual aim of looking at about 10% of the sky that had already been mapped at shorter wavelengths by SDSS. 
Given the number of slightly less distant quasars SDSS had found, we expected UKIDSS to include two or three record-breaking quasars; however it would also catalogue tens of millions of other astronomical objects (stars in our Galaxy, along with other galaxies), so actually finding the target quasars was not going to be easy.
Our basic search methodology was to identify any source that was clearly detected by UKIDSS but completely absent in the SDSS catalogues. In an ideal world this would have immediately given us our record-breaking quasars, but instead we still had a list of ten thousand candidates, all of which had the desired properties. Sadly it wasn’t a bumper crop of quasars — rather it was a result of observational noise, and most of these objects were cool stars which are faint enough at optical wavelengths that, in some cases, the imperfect measurement process meant they weren’t detected by SDSS, and hence entered our candidate list. (A comparable number of such stars would also be measured as brighter in SDSS than they actually are; however it is only the objects which are scattered faintward that caused trouble for us.) A second observation on an optical telescope would suffice to reject any of these candidates, but taking ten thousand such measurements is completely impractical. Instead, we used Bayesian statistics to extract as much information from the original SDSS and UKIDSS measurements as possible. By making models of the cool star and quasar populations, and knowing the precision of the SDSS and UKIDSS observations we could work out the probability that any candidate was in fact a target quasar. Taking this approach turned out to be far more effective than we’d hoped — almost all the apparently quasar-like candidates had probabilities of less than 0.01 (i.e., they had less than a 1% chance of being a quasar) and so could be discarded from consideration without going near a telescope.
For the remaining 200-odd candidates we did make follow-up observations, on UKIRT, the Liverpool Telescope (LT) or the New Technology Telescope (NTT), and in fewer than ten cases were the initial SDSS and UKIDSS measurements verified. By this stage we were almost certain that we had a distant quasar, although in most cases it was sufficiently bright at optical wavelengths that we knew it wasn’t a record-breaker. However ULAS J1120+0641, identified in late 2010, remained defiantly black when looked at by the LT, and so for the first time in five years we really thought we might have struck gold. To be completely sure we used the Gemini North telescope to obtain a spectrum — essentially splitting up the light into different wavelengths, just as happens to sunlight when it passes through water droplets to form a rainbow. The observation was made on Saturday, November 2010 and we got the spectrum e-mailed to us the next day and confirmed that we’d finally got what we were looking for: the most distant quasar ever found.
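The flavour of that calculation can be captured in a toy version of Bayes’ theorem. This is emphatically a sketch and not the actual selection code: it assumes a single kind of cool star with a known true optical flux, a quasar with no optical flux at all, Gaussian measurement noise, and a made-up ratio of stars to quasars. Every name and number below is hypothetical.

```python
import math

def gaussian_pdf(x, mean, sigma):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def quasar_probability(observed_flux, flux_error, star_flux, stars_per_quasar):
    """Posterior probability that a candidate is a quasar rather than a cool star.

    observed_flux, flux_error: optical measurement and its Gaussian error.
    star_flux: true optical flux of the (single, hypothetical) star type.
    stars_per_quasar: assumed prior ratio of the two populations.
    """
    like_quasar = gaussian_pdf(observed_flux, 0.0, flux_error)  # no optical flux
    like_star = gaussian_pdf(observed_flux, star_flux, flux_error)
    prior_quasar = 1.0 / (1.0 + stars_per_quasar)
    prior_star = 1.0 - prior_quasar
    weighted_quasar = prior_quasar * like_quasar
    return weighted_quasar / (weighted_quasar + prior_star * like_star)

# An optical non-detection when a star would be only 3 sigma above zero...
print(quasar_probability(0.0, 1.0, star_flux=3.0, stars_per_quasar=1e4))
# ...versus the same non-detection when a star would be 6 sigma above zero.
print(quasar_probability(0.0, 1.0, star_flux=6.0, stars_per_quasar=1e4))
```

The point of the toy is that a non-detection is weak evidence when stars vastly outnumber quasars and noise can plausibly hide a star: the first candidate comes out below the 0.01 threshold, the second close to certainty.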
We obtained more precise spectra covering a wider range of wavelengths using Gemini (again) and the Very Large Telescope, the results of which are shown here:
The black curve shows the spectrum of ULAS J1120+0641; the green curve shows the average spectrum of a number of more nearby quasars (but redshifted to longer wavelengths to account for the cosmological expansion). The two are almost identical, with the obvious exception of the cut-off at 1 micron of ULAS J1120+0641 which comes about due to the absorption by hydrogen in front of the quasar. We had all the data needed to make this plot by the end of January, but it still took another five months for the results to be published in the June 30 edition of Nature — rather longer than the 24-hour turn-around of the journalists who eventually reported on this work. But if we’d given up on the search after four years — or if the Science and Technology Facilities Council had withdrawn funding for UKIRT, as seemed likely at one point — then we never would have made this discovery. It was a long time coming but for me — and hopefully for astronomy — it was worth the wait.
Normally, I would be writing about the discovery of the most distant quasar by Imperial Astronomers using the UKIDSS survey (using excellent Bayesian methods), but Andy and Peter have beaten me to it. To make up for it, I’ll try to get one of the authors of the paper to discuss it here themselves, soon. (In the meantime, some other links, from STFC, ESO, Gemini, …)
But I’ve got a good excuse: I was out (with one of those authors, as it happens) seeing Paul Simon play at the Hammersmith Apollo:
Like my parents, Paul Simon grew up in the outer boroughs of New York City, a young teenager at the birth of rock’n’roll, and his music, no matter how many worldwide influences he brings in, always reminds me of home.
He played “The Sound Of Silence” (solo), most of his 70s hits from “Kodachrome” and “Mother and Child Reunion” to the soft rock of “Slip Slidin’ Away”, and covers of “Mystery Train” and “Here Comes the Sun”. But much of the evening was devoted to what is still his masterpiece, Graceland. (We were a little disappointed that the space-oriented backing video for “The Boy in the Bubble” included images neither of the Cosmic Microwave Background nor the new most distant quasar….)
One of the perks (perqs?) of academia is that occasionally I get an excuse to escape the damp grey of London Winters. The Planck Satellite is an international collaboration and, although largely backed by the European Space Agency, it has a large contribution from US scientists, who built the CMB detectors for Planck’s HFI instrument, as well as being significantly involved in the analysis of Planck data. Much of this work is centred at NASA’s famous Jet Propulsion Lab in Pasadena, and I was happy to rearrange my schedule to allow a February trip to sunny Southern California (I hope my undergraduate students enjoyed the two guest lectures during my absence).
Visiting California, I was compelled to take advantage of the local culture, which mostly seemed to involve meals. I ate as much Mexican food as I could manage, from fantastic $1.25 tacos from the El Taquito Mexicano Truck to somewhat higher-end fare at Tinga in LA proper. And I finally got to taste bánh mì, French-influenced Vietnamese sandwiches (which have arrived in London but I somehow haven’t tried them here yet). And I got to take in the view from the heights of Griffith Park:
as well as down at street level:
And even better, I got to share these meals and views with old and new friends.
Of course I was mainly in LA to do science, but even at JPL we managed to escape our windowless meeting room and check out the clean-room where NASA is assembling the Mars Science Lab:
The white pod-like structure is the spacecraft itself, which will parachute into Mars’ atmosphere in a few years, and from it will descend the circular “sky crane” currently parked behind it which will itself deploy the car-sized Curiosity Rover to do the real work of Martian geology, chemistry, climatology and (who knows?) biology.
But my own work was for the semi-annual meeting of the Planck CTP working group (I’ve never been sure if it was intentional, but the name always seemed to me a sort of science pun, obliquely referring to the famous “CPT” symmetry of fundamental physics). In Planck, “CTP” refers to Cℓ from Temperature and Polarization: the calculation of the famous CMB power spectrum which contains much of the cosmological information in the maps that Planck will produce. The spectrum allows us to compress the millions of pixels in a map of the CMB sky, such as this one from the WMAP experiment (the colors give the temperature or intensity of the radiation, the lines its polarization), into just a few thousand numbers we can plot on a graph.
OK, this is not a publishable figure. Instead, it marks the tenth anniversary of the first CTP working group telecon in February 2001 (somewhat before I was involved in the group, actually). But given that we won’t be publishing Planck cosmology data for another couple of years, sugary spectra will have to do instead of the real ones in the meantime.
The work of the CTP group is exactly concerned with finding the best algorithms for translating CMB maps into these power spectra. They must take into account the complicated noise in the map, coming from our imperfect instruments, which observe the sky with finite resolution (that is, a telescope which smooths the sky at a scale from about half down to one-tenth of a degree) and with limited sensitivity (every measurement has a little bit of unavoidable noise added to it). Moreover, in between the CMB, produced 400,000 years after the Big Bang, and Planck’s instruments, observing today, is the entire rest of the Universe, which contains matter that both absorbs and emits (glows) in the microwaves which Planck observes. So in practice we need to deal with all of these effects simultaneously when reducing our maps down to power spectra. This is a surprisingly difficult problem: the naive, brute-force (Bayesian) solution requires a number of computer operations which scales like the cube of the number of pixels in the CMB map; at Planck’s resolution this is as many as 100 million pixels, and there are still no supercomputers capable of doing the septillion (10^24) operations in a reasonable time. If we smooth the map, we can still solve the full problem, but on small scales we need to come up with useful approximations which exploit what we know about the data, usually relying on the very large number of points that contribute and on the so-called asymptotic theorems which say, roughly, that we can learn about the right answer by doing lots of simulations, which are much less computationally expensive.
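To get a feel for that cubic scaling, here is a back-of-the-envelope sketch in Python. The pixel count comes from the paragraph above; the machine speed (ASSUMED_OPS_PER_SECOND) and the size of the smoothed map (SMOOTHED_PIXELS) are purely my own illustrative assumptions, not Planck or supercomputer specifications.

```python
# Back-of-the-envelope cost of the brute-force power-spectrum calculation.
# The cubic scaling and the 100 million pixels are from the text; the machine
# speed and the smoothed-map size are hypothetical, for illustration only.

SECONDS_PER_YEAR = 365.25 * 24 * 3600

N_PIXELS = 100_000_000          # full-resolution map, as quoted in the text
SMOOTHED_PIXELS = 100_000       # assumed size of a heavily smoothed map
ASSUMED_OPS_PER_SECOND = 1e15   # assumed sustained rate of a large machine


def brute_force_ops(n_pixels: int) -> int:
    """Operation count scaling like the cube of the number of pixels."""
    return n_pixels ** 3


def years_needed(n_pixels: int, ops_per_second: float) -> float:
    """Wall-clock years to do the brute-force calculation once."""
    return brute_force_ops(n_pixels) / ops_per_second / SECONDS_PER_YEAR


full_ops = brute_force_ops(N_PIXELS)
full_years = years_needed(N_PIXELS, ASSUMED_OPS_PER_SECOND)
smoothed_seconds = brute_force_ops(SMOOTHED_PIXELS) / ASSUMED_OPS_PER_SECOND

print(f"full map: {full_ops:.0e} operations, about {full_years:.0f} years")
print(f"smoothed map: about {smoothed_seconds:.0f} seconds")
```

Under these assumed numbers a single full-resolution evaluation takes decades, while the smoothed map is essentially free, which is why the exact method survives only on large scales.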
At the required levels of both accuracy and precision, the results depend on all of the details of the data processing and the algorithm: How do you account for the telescope’s optics and the pixelization of the sky? How do you model the noise in the map? How do you remove those pixels contaminated by astrophysical emission or absorption? All of this is compounded by the necessary (friendly) scientific competition: it is the responsibility of the CTP group to make recommendations for how Planck will actually produce its power spectra for the community and, naturally, each of us wants our own algorithm or computer program to be used, that is, to win. So these meetings are as much about politics as science, but we can hope that the outcome is that all the codes are raised to an appropriate level and we can make the decisions on non-scientific grounds (ease of use, flexibility, speed, etc.) that will produce the high-quality scientific results for which we designed and built Planck, and on which we have worked for the last decade or more.
Embarrassing update: as pointed out by Vladimir Nesov in the comments, all of my quantitative points below are incorrect. To maximize expected winnings, you should bet on whichever alternative you judge to be most likely. If you have a so-called logarithmic utility function (which already has the property of growing faster for small amounts than large), you should bet in proportion to the probability you assign to each answer. In fact, it’s exactly arguments like these that lead many to conclude that the logarithmic utility function is in some sense “correct”. So, in order to be led to betting more on the low-probability choices, one needs a utility that changes even faster for small amounts and slower for large amounts. But I disagree that this is “implausible”: if I think that is the best strategy to use, I should adjust my utility function, not change my strategy to match one that has been externally imposed. Just like probabilities, utility functions encode our preferences. Of course, I should endeavor to be consistent, to always use the same utility function, at least in the same circumstances, taking into account what economists call “externalities”. Anyway, all of this goes to show that I shouldn’t write long, technical posts after the office Christmas party….
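The corrected claims are easy to check numerically. Here is a minimal Python sketch for the 99:1 question discussed in the original post below, comparing three ways of splitting the stake between two answers. The function names and the three strategies are my own illustrative choices, and the utility is taken to be exactly logarithmic in the money left after the question.

```python
import math

# Probabilities assigned to the two plausible answers (99:1, as in the post).
PROBS = [0.99, 0.01]

# Three illustrative ways to split the stake (fractions of current money).
STRATEGIES = {
    "all on favourite": [1.0, 0.0],
    "proportional": [0.99, 0.01],
    "hedged 90/10": [0.90, 0.10],
}


def expected_winnings(probs, bets):
    """Mean fraction of the stake kept: only the bet on the right answer survives."""
    return sum(p * b for p, b in zip(probs, bets))


def expected_log_utility(probs, bets):
    """Mean of log(money kept); minus infinity if a possible answer has no bet."""
    total = 0.0
    for p, b in zip(probs, bets):
        if p > 0 and b == 0:
            return -math.inf
        if p > 0:
            total += p * math.log(b)
    return total


for name, bets in STRATEGIES.items():
    ew = expected_winnings(PROBS, bets)
    eu = expected_log_utility(PROBS, bets)
    print(f"{name:18s} expected winnings {ew:.4f}, expected log utility {eu:.4f}")
```

Betting everything on the favourite gives the largest expected winnings, the proportional split gives the largest expected log utility, and the 90/10 hedge wins on neither measure, which is the point of the correction: preferring it means adopting a utility that changes even faster than the logarithm for small amounts.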
The original post follows, mistakes included.
An even more unlikely place to find Bayesian inspiration was Channel 4’s otherwise insipid game show, “The Million Pound Drop”. In the version I saw, B-list celebs start out with a million pounds (sterling), and are asked a series of multiple-choice questions. For each one, they can bet any fraction of their remaining money on any set of answers; any money bet on wrong answers is lost (we’ll ignore the one caveat, that the contestants must wager no money on at least one answer, which means there’s always the chance that they will lose the entire stake).
Is there a best strategy for this game? Obviously, the overall goal is to maximize the actual winnings at the end of the series of questions. In the simplest example, let’s say a question is “What year did England last win the football world cup?” with possible answers “1912”, “1949”, “1966”, and “never”. In this case (assuming you know the answer), the only sensible course is to bet everything on “1966”.
Now, let’s say that the question is “When did the Chicago Bulls last win an NBA title?” with possible answers, “1953”, “1997”, “1998”, “2009”. The contestants, being fans of Michael Jordan, know that it’s either 1997 or 1998, but aren’t sure which — it’s a complete toss-up between the two. Again in this case, the strategy is clear: bet the same amount on each of the two — the expected winning is half of your stake no matter what. (The answer is 1998.)
But now let’s make it a bit more complicated: the question is “Who was the last American to win a gold medal in Olympic Decathlon?” with answers “Bruce Jenner”, “Bryan Clay”, “Jim Thorpe”, and “Jesse Owens”. Well, I remember that Jenner won in the 70s, and that Thorpe and Owens predate that by decades, so the only possibilities are Jenner and Clay, whom I’ve never heard of. So I’m pretty sure the answer is Jenner, but I’m by no means certain: let’s say that I’m 99:1 in favor of Jenner over Clay.
In order to maximize my expected winnings, I should bet 99 times as much on Jenner as Clay. But there’s a problem here: if it’s Clay, I end up with only one percent of my initial stake, and that one percent (which I have to go on and play more rounds with) is almost too small to be useful. This means that I don’t really want to maximize my expected winnings, but rather something that economists and statisticians call the “utility function”, or conversely, to minimize the loss function: functions which describe how useful some amount of winnings is to me. A thousand dollars is more than a thousand times as useful as one dollar, but a million dollars is less than twice as useful as half a million dollars, at least in this context.
So in this case, a small amount of winnings is less useful than one might naively expect, and the utility function should reflect that by growing faster for small amounts and slower for larger amounts — I should perhaps bet ten percent on Clay. If it’s Jenner, I still get 90% of my stake, but if it’s Clay, I end up with a more-useful 10%. (The answer is Clay, by the way.)
This is the branch of statistics and mathematics called decision theory: how we go from probabilities to actions. It comes into play when we don’t want to just report probabilities, but actually act on them: whether to actually prescribe a drug, perform a surgical procedure, or build a sea-wall against a possible flood. In each of these cases, in addition to knowing the efficacy of the action, we need to understand its utility: if a flood is 1% likely over the next century, and a sea-wall would cost one million pounds but would save one billion in property damage and 100 lives if the flood occurred, we need to compare spending a million now versus saving a billion later (taking the “nonlinear” effects above into account) and complicate that with the loss from even more tragic possibilities. One hundred fewer deaths has the same utility as some amount of money saved, but I am glad I’m not on the panel that has to make that assignment. It is important to point out, however, that whatever decision is made, by whatever means, it is equivalent to some particular set of utilities, so we may as well be explicit about it.
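As a rough illustration of the sea-wall comparison, here is a short Python sketch. It makes two simplifying assumptions that the paragraph above warns against taking too seriously: utility is linear in money, and the wall prevents all of the damage. The monetary value placed on a life (value_per_life) is deliberately left as a free input, since that is exactly the assignment I would rather not make.

```python
# Expected-loss comparison for the sea-wall example, using the numbers in the text.
# Assumptions (mine, for illustration): linear utility in money, and a wall that
# prevents all property damage and loss of life if the flood occurs.

P_FLOOD = 0.01                   # probability of a flood over the next century
WALL_COST = 1_000_000            # pounds, paid now whether or not it floods
PROPERTY_DAMAGE = 1_000_000_000  # pounds, lost if it floods with no wall
LIVES_AT_RISK = 100


def expected_loss(build_wall: bool, value_per_life: float) -> float:
    """Expected loss in pounds for each decision, under the assumptions above."""
    if build_wall:
        return float(WALL_COST)
    return P_FLOOD * (PROPERTY_DAMAGE + LIVES_AT_RISK * value_per_life)


def break_even_probability(value_per_life: float) -> float:
    """Flood probability at which building and not building cost the same."""
    return WALL_COST / (PROPERTY_DAMAGE + LIVES_AT_RISK * value_per_life)


# Counting property damage only (value_per_life = 0) already favours the wall.
print(expected_loss(build_wall=False, value_per_life=0.0))
print(expected_loss(build_wall=True, value_per_life=0.0))
print(break_even_probability(value_per_life=0.0))
```

Even with lives left out entirely, the expected property loss is ten times the cost of the wall, and any positive value per life only strengthens the case; with linear utility the decision would flip only if the flood probability were below about one in a thousand.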
Happily, these sorts of questions tend to arise less in the physical sciences where probabilistic results are allowed, although the same considerations arise at a higher level: when making funding decisions…
I've come across a couple bits of popular/political culture that give me the opportunity to discuss one of my favorite topics: the uses and abuses of probability theory.
The first is a piece by Nate Silver of the New York Times' FiveThirtyEight blog, dedicated to trying to crunch the political numbers of polls and other data in as transparent a manner as possible. Usually, Silver relies on a relentlessly frequentist take on probability: he runs lots of simulations letting the inputs vary according to the poll results (correctly taking into account the "margin of error") and more than occasionally uses other information to re-weight the results of different polls. Nonetheless, these techniques give a good summary of the results at any given time -- and have been far and away the best discussion of the numerical minutiae of electioneering for both the 2008 and 2010 US elections.
But yesterday, Silver wrote a column: A Bayesian Take on Julian Assange which tackles the question of Assange's guilt in the sexual-assault offense with which he has been charged. Bayes' theorem, you will probably recall if you've been reading this blog, states that the probability of some statement ("Assange is innocent of sexual assault, despite the charges against him") is proportional to the product of the probability that he would be charged if he were innocent (the "likelihood") and the probability of his innocence in the absence of knowledge about the charge (the "prior"):
P(innocent|charged, context) ∝ P(innocent|context) × P(charged|innocent, context)
where P(A|B) means the probability of A given B, and the "∝" means that I've left off an overall number that you can multiply by. The most important thing I've left in here is the "context": all of these probabilities depend upon the entire context in which you consider the problem.
To figure out these probabilities, there are no simulations we can perform -- we can't run a big social-science model of Swedish law-enforcement, possibly in contact with, say, American diplomats, and make small changes and see what happens. We just need to assign probabilities to these statements.
But even to do that requires considerable thought, and important decisions about the context in which we want to make these assignments. For Silver, the important context is that there is evidence that other governments, particularly the US, may have an ulterior motive for wanting to not just prosecute, but persecute Assange. Hence, the probability of his being unjustly accused [P(charged|innocent, context)] is larger than it would be for, say, an arbitrary Australian citizen traveling in Britain. Usually, Bayesian probability is accused of needing a subjective prior, but in this case the context affects and adds a subjective aspect to the likelihood.
Some of the commenters on the site make a different point: given that Assange is, at least in some sense, a known criminal (he has leaked secret documents, which is likely against the law), he is more likely to commit other criminal acts. This time, the likelihood is not affected, but the prior: the commenter believes that Assange is less likely to be innocent irrespective of the information about the charge.
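To see how differently the two arguments above act on the same formula, here is a small Python sketch. Every number in it is invented purely for illustration and is not an estimate of anything about the actual case; the function name and the explicit P(charged|guilty) term, which supplies the overall number hidden in the "∝", are my own additions.

```python
# Bayes' theorem for a two-hypothesis problem (innocent vs. guilty).
# All probabilities below are made-up illustrative values, NOT estimates.


def posterior_innocent(prior_innocent, p_charged_if_innocent, p_charged_if_guilty):
    """P(innocent | charged), normalised over the two hypotheses."""
    innocent_term = prior_innocent * p_charged_if_innocent
    guilty_term = (1.0 - prior_innocent) * p_charged_if_guilty
    return innocent_term / (innocent_term + guilty_term)


# A hypothetical baseline context.
baseline = posterior_innocent(
    prior_innocent=0.9, p_charged_if_innocent=0.01, p_charged_if_guilty=0.5
)

# The first argument: the context makes an unjust charge more likely, so the
# LIKELIHOOD P(charged | innocent) goes up and the prior is left alone.
larger_likelihood = posterior_innocent(
    prior_innocent=0.9, p_charged_if_innocent=0.1, p_charged_if_guilty=0.5
)

# The second argument: the PRIOR probability of innocence goes down and the
# likelihood is left alone.
smaller_prior = posterior_innocent(
    prior_innocent=0.7, p_charged_if_innocent=0.01, p_charged_if_guilty=0.5
)

print(f"baseline:          {baseline:.3f}")
print(f"larger likelihood: {larger_likelihood:.3f}")
print(f"smaller prior:     {smaller_prior:.3f}")
```

The two changes push the posterior in opposite directions, and neither can be read off from data: both are choices about the context that have to be made before the formula can be used at all.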
Next: game shows.