Alex Alemi's Blog

The Method of Imaginary Results

Alexander A. Alemi.

Performing Bayesian inference requires a full joint distribution over both our data and parameters $p(D,\theta)$. In the usual way of doing things, we specify that joint distribution by providing two pieces: a likelihood $p(D|\theta)$ that specifies how we believe the data would be generated if we happened to know the exact parameter values and some prior $p(\theta)$ over parameters that represents our state of belief about what the parameters are before we look at any data.

Most people don't have any deep philosophical issues with specifying a likelihood $p(D|\theta)$. We're aware that our likelihoods might not be perfect, that they are some approximation of what is happening in the real world. Still, we have opinions about them, and we feel as though we can reason about whether a given likelihood is good or bad for some situation.

I believe I can model a series of $D$ heads in $N$ coin flips with a Binomial likelihood for instance, and I don't really have any qualms about that. I might decide to model the heights of my pea plants with a Normal Distribution or perform a linear fit to some data, or do image classification with some convolutional neural network or transformer. In any case, I often have a good idea of what I should use as a likelihood $p(D|\theta)$.

Choosing the prior $p(\theta)$ is what all the fuss is about. This is the part that raises various philosophical issues. This is the part that, if we are being honest, is much harder. What do I believe the bias of a coin is before I ever flip the coin? I'm not really sure to be honest. In many contexts I might have previously done some experiments, in which case I could use yesterday's posterior as today's prior.1

However, lacking previous experiments, I often feel at a loss. There are many frameworks for designing priors that people have proposed. Laplace originally motivated a flat prior for the Bernoulli likelihood by appealing to the principle of indifference.2 Jeffreys taught us how to build priors that were reparameterization-independent. Jaynes would argue for choosing priors by appealing to symmetries.3 Bernardo suggested choosing priors to maximize the information you extract from data, the so-called reference priors.4 Gelman and friends tout weakly informative priors. There are even whole lists of common recommendations.

What if we didn't have to choose a prior directly?

The Method of Imaginary Results

Enter the method of imaginary results. It turns out5 that we can uniquely characterize a joint distribution in a different way. Specifying a likelihood $L(D|\theta)$ and a prior $\pi(\theta)$ uniquely characterizes the joint $p(D,\theta) = L(D|\theta)\pi(\theta)$. You know what else uniquely characterizes the joint? Specifying a likelihood $L(D|\theta)$ and some hypothetical posterior $q(\theta|D_0)$. The corresponding unique joint $p(\theta,D)$ is given by:

$$p(\theta, D) = \frac{ L(D|\theta) \frac{q(\theta|D_0)}{L(D_0|\theta)} }{\int d\theta'\, \frac{q(\theta'|D_0)}{L(D_0|\theta')}} \propto L(D|\theta) \frac{q(\theta|D_0)}{L(D_0|\theta)}.$$

Which naturally satisfies the two inputs we provided:

$$p(D|\theta) = L(D|\theta) \qquad p(\theta|D_0) = q(\theta|D_0).$$
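To see why the ratio $q(\theta|D_0)/L(D_0|\theta)$ shows up, here is a short sketch of the reasoning. It assumes that $L(D_0|\theta) > 0$ wherever $q(\theta|D_0)$ has mass and that the ratio is integrable; those conditions are my own additions for the sketch. Start from Bayes' rule applied to the imaginary dataset and solve for the prior:

$$q(\theta|D_0) = \frac{L(D_0|\theta)\, \pi(\theta)}{p(D_0)} \quad \Longrightarrow \quad \pi(\theta) = p(D_0)\, \frac{q(\theta|D_0)}{L(D_0|\theta)} \propto \frac{q(\theta|D_0)}{L(D_0|\theta)}.$$

The constant $p(D_0)$ does not depend on $\theta$, so it is fixed by requiring that $\pi(\theta)$ integrate to one:

$$\frac{1}{p(D_0)} = \int d\theta\, \frac{q(\theta|D_0)}{L(D_0|\theta)}.$$

Multiplying the implied prior by $L(D|\theta)$ then gives the joint.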

This flips the problem on its head. We no longer have to specify a prior. Instead we can specify a hypothetical posterior. We can say what we would believe if, hypothetically, we had observed some dataset $D_0$.

I think that this is an easier task to do. It is easier for me to reason about what beliefs I should hypothetically hold in light of some data than it is for me to reason about what I believe independent of any data.

Coin Example

Let's work the simple example of some coin flips. I believe I can model a coin as being a simple Bernoulli process. There is some probability $\theta$ that the coin will land heads and each flip is independent and identically distributed. Therefore, I can model observing $H$ heads out of a sequence of $N$ flips with a Binomial likelihood:

$$L(\{H,N\}|\theta) = { N \choose H} \theta^H (1- \theta)^{N-H}$$

Now, imagine I actually observe some sequence of coin flips, let's say 6 out of 10 flips were heads. What should I believe about the bias of my coin? To answer this I need to specify a prior belief I have about the bias of the coin. In most textbook examples, that prior is taken to be uniform, $p(\theta) = 1$, saying that our prior belief is that the bias is equally likely to fall in an interval $[\theta, \theta + \delta \theta]$ for any $\theta$, i.e. this prior says it's just as likely the bias of the coin is between 0.1 and 0.2 as it is that it is between 0.5 and 0.6.

Alternatively, I could take Jeffreys' advice and adopt a non-informative prior that is reparameterization-independent, or I could try to adopt Gelman's advice and start with a weakly informative prior concentrated near fairness. Below is a representation of these three standard choices where the prior is shown in blue and the posterior from 6 heads out of 10 flips is shown in orange.

A visualization of some standard prior and posterior pairs: a uniform prior, a Jeffreys prior and a weakly informative prior.
Figure 1. Some standard textbook priors and the resulting posterior for 6 heads out of 10 coin flips.
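For Beta priors, these textbook posteriors follow from the standard conjugate update: a $\operatorname{Beta}(a, b)$ prior with $H$ heads in $N$ flips gives a $\operatorname{Beta}(a + H, b + N - H)$ posterior. Below is a minimal sketch in Python. The uniform prior is $\operatorname{Beta}(1,1)$ and the Jeffreys prior is $\operatorname{Beta}(1/2, 1/2)$; the $\operatorname{Beta}(5,5)$ standing in for the weakly informative prior is purely my assumption, since the exact parameters used in the figure are not stated.

```python
from fractions import Fraction

def beta_posterior(a, b, heads, flips):
    """Conjugate update: Beta(a, b) prior + Binomial data -> Beta(a + heads, b + tails)."""
    return a + heads, b + (flips - heads)

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

priors = {
    "uniform": (Fraction(1), Fraction(1)),
    "Jeffreys": (Fraction(1, 2), Fraction(1, 2)),
    # Assumed for illustration only; the figure's actual parameters are unknown.
    "weakly informative": (Fraction(5), Fraction(5)),
}

# Posterior after observing 6 heads out of 10 flips.
posteriors = {name: beta_posterior(a, b, 6, 10) for name, (a, b) in priors.items()}

for name, (a, b) in posteriors.items():
    print(f"{name}: Beta({float(a):g}, {float(b):g}), mean {float(beta_mean(a, b)):.3f}")
```

All three posteriors stay smooth and spread out over a wide range of biases, which is the behavior the next paragraph takes issue with.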

These are convenient mathematically and make for easy problems to solve for a homework exercise, but they aren't realistic. If we are being honest, we tend to expect that coins we encounter in the real world are very nearly fair.6 We could therefore start with a prior that is concentrated near fair, but how do we assign a meaningful width to that distribution? And if we're being honest, I've encountered trick coins in my days, double-headed and double-tailed coins, and if some weirdo walks up to me and asks me to start predicting a whole sequence of coin flips I shouldn't discount the possibility they are trying to play me for a fool.

At this stage, trying to adjust the parameters of our prior without any evidence or data is difficult. I have a hard time talking to my gut to decide what I should set my prior beliefs to apropos of nothing. Instead, let's try to invoke the method of imaginary results and imagine some hypothetical dataset and probe our beliefs. Imagine we've just observed 10 coin flips, and all 10 of them were heads! What do you believe now? Now that I've hypothesized a dataset I have an easier time talking to my gut.

In this scenario, I feel as though I would place a reasonable probability on the coin being double-headed ($\theta = 1$), let's say 50%. At the same time, I think I would still place a reasonable probability on the coin being exactly fair, let's say 25%. The remaining 25% probability I would want to spread around but biased towards heads; for that let's use a $\operatorname{Beta}(11,1)$ distribution, or $11\, \theta^{10}$. I've attempted to visualize this distribution below.7

A mixture of 25% mass on 0.5, 50% mass on 1.0 and 25% mass on a Beta(11, 1).
Figure 2. My attempt at eliciting an imaginary result of a posterior I'm comfortable with if I were to observe 10 heads in a row from a coin.

Or in equation form:

$$q(\theta|D_0) = \frac 12 \delta(\theta -1 ) + \frac 14 \delta\left(\theta - \frac 12 \right) + \frac {11} 4 \theta^{10}$$

Once we've specified this imaginary result, we have everything we need to form a posterior for our original problem with 6 heads out of 10 flips.

\begin{align} p(\theta|D) &\propto L(D|\theta) \frac{q(\theta|D_0)}{L(D_0|\theta)} \\ &\propto 210 \theta^6 (1-\theta)^4 \frac{\frac 14 \delta\left(\theta - \frac 12 \right) + \frac 12 \delta(\theta - 1) + \frac{11}{4} \theta^{10}}{\theta^{10}} \\ &= \frac{210}{211} \delta\left(\theta -\frac 12 \right) + \frac{1}{211} \left( 2310 \theta^{6} (1-\theta)^4 \right) \end{align}
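The mixture weights in the last line can be checked with exact arithmetic. The sketch below is my own verification, not the author's code: each point mass is reweighted by $L(D|\theta)/L(D_0|\theta)$ evaluated at its location, and the continuous component is integrated using the Beta function $B(7,5) = 1/2310$. The helper names (`likelihood`, `beta_fn`, `weights`) are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial

def likelihood(theta, heads, flips):
    """Binomial likelihood of `heads` heads in `flips` flips with bias theta."""
    return comb(flips, heads) * theta**heads * (1 - theta) ** (flips - heads)

def beta_fn(a, b):
    """Beta function B(a, b) for positive integers, as an exact fraction."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))

fair, double_headed = Fraction(1, 2), Fraction(1)

# Point masses: imaginary-posterior weight times L(D|theta) / L(D_0|theta).
w_fair = Fraction(1, 4) * likelihood(fair, 6, 10) / likelihood(fair, 10, 10)
w_heads = Fraction(1, 2) * likelihood(double_headed, 6, 10) / likelihood(double_headed, 10, 10)

# Continuous part: (11/4) theta^10 * 210 theta^6 (1 - theta)^4 / theta^10,
# integrated over [0, 1], is (11/4) * 210 * B(7, 5).
w_cont = Fraction(11, 4) * comb(10, 6) * beta_fn(7, 5)

total = w_fair + w_heads + w_cont
weights = {
    "fair": w_fair / total,
    "double_headed": w_heads / total,
    "beta_7_5": w_cont / total,
}
print(weights["fair"], weights["beta_7_5"])
```

The double-headed component drops out entirely because a single observed tail is impossible under $\theta = 1$, leaving weights of $210/211$ on exactly fair and $1/211$ on the $\operatorname{Beta}(7,5)$ component.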

A mixture of 99.5% mass on 0.5 and 0.5% mass on a Beta(7, 5).
Figure 3. The posterior I get from my elicited imaginary posterior if I actually observe 6 heads and 4 tails. The blue curve is the true posterior, the dashed orange is a blown up version of the small residual component.

The posterior we find is 99.5% probability on the coin being exactly fair, and 0.5% probability assigned to a $\operatorname{Beta}(7,5)$ type posterior, which is buried in the true form above, but I've blown up in the dashed line so you can see its shape. This posterior has a very heavy weight on the coin being exactly fair, which I think is reflective of my actual beliefs but I would have had difficulty specifying in terms of a prior. Instead, if I imagine the coin coming up heads 10 times in a row, the fact that I wanted to still give the coin a 25% chance of being fair is obviously mathematically equivalent to me having a 98.7% prior belief the coin is fair, but I feel as though I have a much higher sensitivity to the right number when I express this as a hypothetical posterior.
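The 98.7% figure can be recovered by dividing the imaginary posterior by the imaginary likelihood $L(D_0|\theta) = \theta^{10}$ and normalizing. The following is a small sketch of that calculation in exact arithmetic; the variable names are my own.

```python
from fractions import Fraction

# Unnormalized prior mass of each component: q(theta|D_0) / theta^10.
fair_mass = Fraction(1, 4) / Fraction(1, 2) ** 10   # point mass at theta = 1/2
heads_mass = Fraction(1, 2) / Fraction(1) ** 10     # point mass at theta = 1
# (11/4) theta^10 / theta^10 = 11/4 is a flat density; its integral over [0, 1] is 11/4.
flat_mass = Fraction(11, 4)

total = fair_mass + heads_mass + flat_mass
implied_prior = {
    "fair": fair_mass / total,
    "double_headed": heads_mass / total,
    "uniform": flat_mass / total,
}

for name, mass in implied_prior.items():
    print(f"{name}: {float(mass):.4f}")
```

Under these numbers the implied prior is itself a mixture: $1024/1037 \approx 98.7\%$ on exactly fair, about 0.2% on double-headed, and about 1.1% spread uniformly over all biases.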

The method of imaginary results lets us ask ourselves what we would believe in light of some data, rather than asking us to express what we believe apropos of nothing. I think this helps resolve some of the philosophical issues people have with prior selection in Bayesian inference.