Blog.AlexAlemi.com

Leap Day

Fri, 29 Mar 2024 00:00:00 -0400

Going overboard to prove the local newspaper wrong.

Why KL?

Fri, 07 Aug 2020 00:00:00 -0400

The Kullback-Leibler divergence, or KL divergence, or relative entropy, or relative information, or information gain, or expected weight of evidence, or information divergence (it goes by a lot of different names) is unique among the ways to measure the difference between two probability distributions. It holds a special and privileged place, being used to define all of the core concepts in information theory, such as mutual information.

Why is the relative information so special and where does it come from? How should you interpret it? What is a nat anyway? In this note, I'll try to give a better understanding and set of intuitions about what KL is, why it's interesting, where it comes from and what it's good for.

Information Gain

Let's see if we can motivate the form of the KL axiomatically.

Imagine we have some prior set of beliefs summarized as a probability distribution $q$. In light of some kind of evidence, we update our beliefs to a new distribution $p$. How much did we update our beliefs? How do we quantify the magnitude of that update? What are some properties we might want this hypothetical function to have? Let $I[p; q]$ denote the function that measures how much we moved beliefs when we switch from beliefs $q$ to beliefs $p$. We'll call this amount of update the information gain when we move from $q$ to $p$. ¹

We want our information function to satisfy the following properties:

It's continuous. A small change in the distributions makes a small change in the amount of information in the move.
It's permutation or reparameterization independent. It doesn't matter if we change the units we've specified our distributions in or if we relabel the sides of our dice, the answer shouldn't change.
We want it to be non-negative and have the value $I = 0$ if and only if $p = q$. If $p=q$ we haven't updated our beliefs and so have no information gain.
We want it to be monotonic in a natural sense. If we, for instance, start with some uniform distribution over the 24 people in a game of Guess Who? and then update to only 5 remaining suspects, $I$ should be larger than if there were still 12 remaining suspects.
Finally, we want our information function to decompose in a natural and linear way.² In particular, we want to be able to relate the information between two joint distributions in terms of the information between their marginal and conditional distributions.

These are all very natural properties for our information function to have. That last point about composition needs to be elaborated. The point is that we have alternative ways we might express a probability distribution. Apropos of nothing, imagine we are concerned that we might have been exposed to a disease and are thinking about getting a test done. There are two random variables under consideration, we will label them $\mathcal{D}$ for whether we actually had the disease or not, and $\mathcal{T}$ for whether the test result is positive. Each of these random variables can take on two possible states, we'll denote them as $\mathcal{D} \in \{ D, \overline D \}, \mathcal{T} \in \{ T, \overline T \}$. $D$ represents the state of our having-had-the-disease random variable $\mathcal{D}$ being positive, meaning we actually did have the disease. $\overline D$ denotes we actually didn't. With two binary random variables, there are 4 possible outcomes $(\{ DT, D\overline T, \overline D T, \overline D \overline T\})$ and fully specifying our set of beliefs requires 3 independent probabilities.

What are our prior beliefs? Let's imagine that while we are concerned we might have had the disease, if we are being honest, we almost certainly didn't,³ so we'll put our prior belief in having had the disease at 7% $(q(D) = 0.07)$. How do we expect the antibody test to go if we have it done? We do a bit of research and discover that if we had had the disease, the sensitivity or true positive rate of the test we're about to take is 93.8% $(q(T|D) = 0.938)$. The specificity or true negative rate of that same test is 95.6% $(q(\overline T | \overline D) = 0.956)$. ⁴

Figure 1. Two equivalent ways to express the joint distribution $q(\mathcal{D}\mathcal{T})$.

We've just specified our prior beliefs with 3 numbers, imagining our process as having two steps, first, we either had the disease or not $(q(\mathcal{D}))$ and then, conditioned on that we get the result of our test $(q(\mathcal{T}|\mathcal{D}))$. Equivalently, we could have just given the joint probability distribution, as shown in Figure 1.

The point now is that if we were to update our beliefs, in the diagram on the right there is just a single distribution $q(\mathcal{D},\mathcal{T})$, in the one on the left there are essentially three different distributions $(q(\mathcal{D}), q(\mathcal{T}|D), q(\mathcal{T}| \overline D))$ and we want some sort of structural consistency between the two sides: $$ I[p(\mathcal{D},\mathcal{T}); q(\mathcal{D},\mathcal{T})] \quad \textrm{versus} \quad I[p(\mathcal{D}); q(\mathcal{D})], I[p(\mathcal{T}|D); q(\mathcal{T}|D)], I[p(\mathcal{T}|\overline D); q(\mathcal{T}|\overline D)] . $$

The consistency we will require is that our information measure decomposes linearly between these two different descriptions. The information between the joints should be a weighted linear combination of the informations of three constituent distributions. In this particular case we will require: $$ I[p(\mathcal{D},\mathcal{T}); q(\mathcal{D},\mathcal{T})] = I[p(\mathcal{D}); q(\mathcal{D})] + p(D) I[p(\mathcal{T}|D); q(\mathcal{T}|D)] + p(\overline D) I[p(\mathcal{T}|\overline D); q(\mathcal{T}|\overline D)] . $$ In words: The information in the full joint update is the information update for your belief in whether or not you had the disease $(q(\mathcal D))$ plus the informations in the two conditional distributions, but weighted by how often we find ourselves in each of those branches, as measured by our updated beliefs $(p(\mathcal{D}))$.

More generally we are requiring that our information function satisfies a natural chain rule: $$ I[ p(X,Y); q(X,Y) ] = I[ p(X); q(X) ] + \mathbb{E}_{p(X)} \left[ I[ p(Y|X); q(Y|X) ] \right] $$

Notice that it is here, in this sort of structural independence, that we make our information function manifestly asymmetric. Here our $p$ distribution becomes distinguished over our $q$ as it is the one we use to weight the child contributions. This makes sense if we imagine that $p$ is the actual distribution that events are drawn from, for it means that this will correspond to the information we would observe in expectation.

The interesting thing is that if you want your information function to satisfy all of these seemingly reasonable properties, that is enough to determine it uniquely. The only function satisfying all of these properties is the relative entropy, or KL divergence we all know and love: $$ I[p;q] = \int \mathrm dx\, p(x) \log \frac{p(x)}{q(x)} $$
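As a quick numerical sanity check, here is a minimal Python sketch that verifies the chain rule for the relative entropy on the disease-testing example. The prior $q$ uses the numbers from above; the updated beliefs `p_D` and `p_T_given_D` are made-up values chosen purely for illustration, and the helper names `kl` and `joint` are my own.

```python
import math

def kl(p, q):
    """Discrete relative entropy I[p; q] in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def joint(marginal, conditional):
    """Flatten marginal(d) * conditional(t | d) into a list over the 4 outcomes."""
    return [m * c for m, row in zip(marginal, conditional) for c in row]

# Prior beliefs q, using the numbers from the text.
q_D = [0.07, 0.93]                  # q(D), q(not D)
q_T_given_D = [[0.938, 0.062],      # q(T|D), q(not T|D)
               [0.044, 0.956]]      # q(T|not D), q(not T|not D)

# Hypothetical updated beliefs p (illustrative values, not from the text).
p_D = [0.30, 0.70]
p_T_given_D = [[0.90, 0.10],
               [0.10, 0.90]]

# Left side: information between the two joint distributions.
lhs = kl(joint(p_D, p_T_given_D), joint(q_D, q_T_given_D))

# Right side: marginal term plus conditional terms weighted by p(D).
rhs = kl(p_D, q_D) + sum(
    p_d * kl(p_row, q_row)
    for p_d, p_row, q_row in zip(p_D, p_T_given_D, q_T_given_D)
)

print(lhs, rhs)  # the two sides should agree up to floating point error
```

Swapping the weights `p_D` for `q_D` on the right side breaks the equality, which is the asymmetry discussed above.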

See A New Theorem of Information Theory by Arthur Hobson for a complete proof, but here I'll offer a more colloquial argument like the one given by Ariel Caticha.⁵

We will start with and focus on the continuous setting, where we have two probability distributions $p$ and $q$. We seek a functional that takes our two distributions and gives back our information gain and we seek one that is local in the physics sense, meaning that our functional can be written as the integral of a function depending only on the values the probability densities take at each point: $$ I[p;q] = \int \mathrm dx\, \mathcal{A}(x, p(x), q(x)). $$

Our requirement that our information gain be reparameterization independent means it has to be invariant to any remapping of our coordinates, or in other words, it has to be dimensionless. Imagine $x$ has units of a length, here our integral measure $\mathrm dx$ has units of a length, and the densities $p(x), q(x)$ would have units of an inverse length. In order to be dimensionally consistent our functional must take the form:⁶ $$ I[p;q] = \int \mathrm dx\, p(x) f\left( \frac{p(x)}{q(x)} \right). $$

Finally, our decomposability requirement above when written out in terms of continuous densities takes the form: $$ I[ p(x,y); q(x,y) ] = I[ p(x); q(x) ] + \int \mathrm dx\, p(x) I[p(y|x) ; q(y|x)] $$

Combining this linear decomposition requirement with the functional form required above and pushing some equations around gives us: $$ \begin{align} I[ p(x,y); q(x,y) ] &= I[p(x); q(x)] + \int \mathrm dx\, p(x) I[p(y|x); q(y|x)] \\ \int \mathrm dx\, \mathrm dy\, p(x,y) f\left(\frac{p(x,y)}{q(x,y)} \right)&= \int \mathrm dx\, p(x) f\left(\frac{p(x)}{q(x)} \right) + \int \mathrm dx\, p(x) \int \mathrm dy\, p(y|x) f\left(\frac{p(y|x)}{q(y|x)} \right) \\ \int \mathrm dx\, \mathrm dy\, p(x) p(y|x) f\left(\frac{p(x)p(y|x)}{q(x)q(y|x)} \right)&= \int \mathrm dx\, \mathrm dy\, p(x) p(y|x) \left[ f\left(\frac{p(x)}{q(x)} \right) + f\left(\frac{p(y|x)}{q(y|x)} \right)\right] . \end{align} $$ Notice that this demonstrates that our function $f$ must satisfy the property: $$ f(ab) = f(a) + f(b). $$ This well known functional equation has a unique (up to a multiplicative constant) continuous solution: $$ f(x) = c \log x. $$ We can roll the choice of multiplicative constant into our choice of base for the logarithm and arrive at our final form for our information gain: $$ I[p;q] = \int \mathrm dx\, p(x) \log \frac{p(x)}{q(x)}. $$

As for the non-negativity, our final form satisfies that property. Because we have that $\log x \leq x -1$: $$ I[p;q] = \int \mathrm dx\, p(x) \log \frac{p(x)}{q(x)} = -\int \mathrm dx \, p(x) \log \frac{q(x)}{p(x)} \geq -\int \mathrm dx\, p(x) \left( \frac{q(x)}{p(x)} - 1 \right) = 0. $$

Bayes Rule

Having identified the right way to measure how much information is gained when we update a distribution from $q$ to $p$, why don't we put this to practical use and try to figure out how we ought to update our beliefs in light of evidence or observations.⁷

Returning to our disease testing example, let's say you get the test done and receive a positive result $(\mathcal T = T)$. What should your new distribution of beliefs be? Well, first off if we've observed the results of the test we should probably have our updated beliefs reflect the observation we made, making it consistent with our observation, setting $p(T) = 1$, but this doesn't fully specify $p$; we need two more numbers. How should we set those?

Why don't we aim to be conservative and try to find a new set of beliefs that are as close as possible to our prior beliefs while still being consistent with the observation that we've made?
Namely, let's look now for a joint distribution $p(\mathcal T, \mathcal D)$ that is as close as possible to $q(\mathcal T, \mathcal D)$ but for which we have that $p(T)=1$. $$ \DeclareMathOperator{\argmin}{arg\,min} $$ $$ \argmin_{p(\mathcal D, \mathcal T)} I[p(\mathcal D, \mathcal T); q(\mathcal D, \mathcal T)] \quad \text{ s.t. }\quad p(T) = 1 $$ Now that we know how to measure how much information is gained in updating our beliefs, we will find the $p$ that minimizes this update while still being true to the observation we made. Writing $p(\mathcal D,\mathcal T) = p(\mathcal T)p(\mathcal D|\mathcal T)$ and using our linear decomposition rule from above (the other way around), we have: $$ I[p(\mathcal D,\mathcal T); q(\mathcal D,\mathcal T)] = I[p(\mathcal T);q(\mathcal T)] + I[p(\mathcal D|T);q(\mathcal D|T)]. $$ Because we've decided to fix $p(T)=1$ in order to be consistent with our observation, the way to minimize the information between the joints is to set $p(\mathcal D|T)=q(\mathcal D|T)$ so that our second term vanishes. In this particular case this means: $$ p(T)=1 $$ $$ p(D|T) = q(D|T) = \frac{q(T|D)q(D)}{q(T|D)q(D) + q(T|\overline D)q(\overline D)} = 0.616 $$

Furthermore, the marginal distribution of our updated beliefs about our disease status is: $$ p(D) = p(D|T)p(T) = q(D|T) = 0.616$$ In this particular case our updated belief is only 3 to 2 on that we actually had the disease, despite our positive test result. In Figure 2 we show both our prior in this factorization as well as our new beliefs.
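To make the argument concrete, here is a small Python sketch that computes the Bayes' rule posterior from the prior numbers above and, separately, brute-forces the distribution with $p(T)=1$ that minimizes the information gain. The grid search and the variable names are my own illustrative choices; it is only a sketch of the idea, not an efficient way to do inference.

```python
import math

def kl(p, q):
    """Discrete relative entropy I[p; q] in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

q_D = 0.07      # prior q(D)
sens = 0.938    # sensitivity q(T|D)
spec = 0.956    # specificity q(not T|not D)

# Prior joint over the outcomes (D,T), (D,not T), (not D,T), (not D,not T).
q_joint = [q_D * sens, q_D * (1 - sens), (1 - q_D) * (1 - spec), (1 - q_D) * spec]

# Bayes' rule applied to the prior: q(D|T) = q(T|D) q(D) / q(T).
q_T = q_joint[0] + q_joint[2]
posterior = q_joint[0] / q_T

# Every candidate with p(T) = 1 has the form [a, 0, 1 - a, 0], where a = p(D|T).
# Search a grid of candidates for the one closest to the prior joint.
grid = [i / 1000 for i in range(1, 1000)]
best = min(grid, key=lambda a: kl([a, 0.0, 1 - a, 0.0], q_joint))

print(round(posterior, 3), best)
```

The grid minimizer lands on the grid point nearest the Bayes posterior, which is the claim made above: the least informative update consistent with the observation is the Bayesian one.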

Figure 2. Our prior (left, blue, notice that we've swapped the order of the conditioning) and updated (right, orange) beliefs after observing that the test was positive.

Notice what just happened. If we look for a new distribution that is as close as possible to our previous distribution of beliefs (as measured by $I[p;q]$) which is also consistent with our observations, we end up with an updated, or posterior set of beliefs given by Bayes' Rule. Imagine we had some observable $x$ and some parameters $\theta$. Our prior set of beliefs are described by the joint distribution $q(\theta,x) = q(x|\theta)q(\theta)$: a likelihood $q(x|\theta)$ of how we expect the data to be distributed given the parameter values and some prior $q(\theta)$ set of beliefs about what values those parameters can take. If we make an observation and see some value for our observable $x=X$, what ought our new beliefs be? If we search for the joint distribution $p(x,\theta)$ that is as close as possible to our previous beliefs $q(x,\theta)$ but that no longer has any uncertainty about the value the observable will take $(p(x) = \delta(x-X))$ we see that minimizing the information gain: $$ I[p;q] = I[p(x);q(x)] + \int \mathrm dx\, p(x) \, I[p(\theta|x); q(\theta|x)], $$ is accomplished if we set $p(\theta|x) = q(\theta|x)$, yielding the updated joint: $$ p(x,\theta) = p(x)p(\theta|x) = \delta(x-X) q(\theta|x) $$ and the marginal beliefs about the parameters to be: $$ p(\theta) = \int \mathrm dx\, p(x,\theta) = \int \mathrm dx\, \delta(x-X) q(\theta|x) = q(\theta|X), $$ or precisely what you probably thought it should have been anyway if you've heard of Bayesian inference.

Although, if you stop to think about it, even though many of us know of and have used Bayes Theorem for a long time, the way it's normally presented, it is just a trivial statement about how joint distributions factor. $$ q(\theta, D) = q(\theta) q(D|\theta) = q(D) q(\theta|D) \implies q(\theta|D) = \frac{q(D|\theta) q(\theta)}{q(D)}. $$ But, this is just a statement about distribution $q$, our prior beliefs. It tells us nothing about how we should update those beliefs in light of observations. However, the previous argument demonstrates that if you want to set your updated beliefs such that they are as close as possible to your prior beliefs while being consistent with your observations, you should set your updated beliefs according to Bayes' rule run on the prior beliefs.

Expected Weight of Evidence

Traditionally, KL is interpreted from a coding perspective, a view I've included in an appendix below, but here I offer a different perspective from the viewpoint of model selection.⁸

Above we saw that we can motivate Bayesian inference as choosing a posterior belief distribution that has the minimal information gain over our prior distribution of beliefs while being consistent with our observations. This guides us towards forming better belief distributions, but what if we just have two different belief distributions and wish to decide between them?

Really, what we want to know is: what is the probability that our beliefs are correct in light of evidence? Symbolically you might write this as $p(P|E)$ where $P$ is some belief distribution and $E$ is some evidence, data, or observations. If we run Bayes Theorem we can see that: $$ p(P|E) = \frac{p(E|P) p(P)}{p(E)}. $$ We can update our belief in our beliefs being correct by setting our updated weight in the belief $p(P|E)$ to be proportional to our initial weight $p(P)$ times the likelihood that the evidence we observed would have been generated if our belief was true $(p(E|P))$. The probability of the evidence given the belief $P$ is just the likelihood $P(E)$. The update is only proportional because, to normalize it, we would need to know how likely the evidence would be, $p(E)$, amongst all possible beliefs. This last part, the marginal likelihood, is notoriously difficult to compute. In principle, it is asking us to evaluate how likely the evidence would be from all possible models.

However, we can make further progress if we content ourselves to not necessarily knowing the absolute probability our model or beliefs are correct, but instead just its probability relative to some other model. If we consider the ratio of two different models $P$ and $Q$ we have: $$ \frac{p(P|E)}{p(Q|E)} = \frac{p(E|P)}{p(E|Q)} \frac{p(P)}{p(Q)}. $$ Notice that the marginal likelihoods cancel out. This is saying that whatever prior relative odds for the two models being correct, if we compute the Bayes factor $\left( \frac{p(E|P)}{p(E|Q)} \right)$, it tells us how the relative probabilities of the two beliefs should update in light of the evidence. Taking a log on both sides: $$ \log \frac{p(P|E)}{p(Q|E)} = \log \frac{p(E|P)}{p(E|Q)} + \log \frac{p(P)}{p(Q)},$$ turns this multiplicative factor into an additive one.

If what we are deciding between is two different probability distributions, you may recognize that this additive weight of evidence for $p$ over $q$ when we observe $x$ is precisely the integrand in our information gain: $$ w[x; p,q] = \log \frac{p(x)}{q(x)}. $$ The log ratio of two probability distributions measures by how much you should update your prior log odds between the two distributions being correct. The KL divergence is just then the expected weight of evidence if we draw samples from $p(x)$ itself: $$ I[p;q] = \mathbb{E}_p\left[ \log \frac{p(x)}{q(x)} \right] = \mathbb{E}_p \left[ w[x; p,q] \right]$$

So, one way to interpret the relative entropy is that if our data was actually coming from the distribution $p$ and we had some other hypothesis $q$, the $I[p;q]$ measures on average how much we should believe $p$ over $q$ on each observation. In order to make that statement more precise, we need a better language to talk about the magnitudes of these quantities.

How loud is the Evidence?

Our measurement of the amount of information was only unique up to a choice of multiplicative constant. This is equivalent to our choice of base for the logarithm. We can think of this as the units we use to measure our information. The traditional choices would be to use the base-2 logarithm and measure the information in bits,⁹ or to use the more mathematically convenient natural logarithm and measure the information in nats. Another option is to measure the information in decibans (tenths of a ban, or hartley), also called decibels, wherein we use ten times the base-10 logarithm.

$$ I[p;q] = 10 \int \mathrm dx\, p(x) \log_{10} \frac{p(x)}{q(x)}\, \textrm{dB} $$

The nice thing about measuring information in decibans or decibels is that people already have some familiarity with the unit, such as for measuring the loudness of sounds. It's always a comparative measurement, for sound taking $10 \log_{10} \frac{P}{P_0}$ of the power to some reference or baseline power. In the same way we could, besides just measuring the KL between two distributions, measure the comparative difference between any two probabilities on the log scale: $$ 10 \log_{10} \frac{p(x)}{q(x)} \textrm{ dB}. $$

In particular, we could get some feeling for these quantities by comparing the probability something happens to the probability it doesn't. Consider a simple binary outcome and taking $q=1-p$, in this case, the weight of evidence that the thing happens versus it doesn't upon observing it happen once is: $$ 10 \log_{10} \frac{p}{1-p} \text{ dB}. $$ This essentially gives us a new scale to measure probabilities on. Instead of expressing probabilities as a number between 0 and 1, here we are computing the log odds of an event happening on the decibel scale.

Below in Table 1 is a summary of the correspondence between decibans and odds or probabilities, and in Figure 3 is a large visual representation you can play with.

dB	odds	~odds	probability
0	1.00	1:1	50%
1	1.26	5:4	56%
2	1.58	π:2	61%
3	2.00	2:1	67%
4	2.51	5:2	71.5%
5	3.16	π:1	76%
6	3.98	4:1	80%
7	5.01	5:1	83%
8	6.31	2π:1	86%
9	7.94	8:1	89%
10	10	10:1	91%
11	12.6	4π:1	92.6%
12	15.8	16:1	94%
13	20	20:1	95%

Table 1: A table of the correspondence between decibans/decibels and odds or probabilities.
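The correspondence in Table 1 follows from inverting the log odds: the odds are $10^{\mathrm{dB}/10}$ and the probability is odds divided by one plus odds. A minimal Python sketch that regenerates the numeric columns (the function names are my own):

```python
def db_to_odds(db):
    """Odds in favor, given a weight of evidence in decibels."""
    return 10 ** (db / 10)

def db_to_probability(db):
    """Probability corresponding to the given log odds in decibels."""
    odds = db_to_odds(db)
    return odds / (1 + odds)

# Print the dB, odds and probability columns for 0 through 13 dB.
for db in range(14):
    print(f"{db:2d} dB  odds {db_to_odds(db):5.2f}  probability {db_to_probability(db):.1%}")
```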

Figure 3: A larger visual representation of decibels as a probability that you can play with. Here the set value of decibels measure the weight of evidence between the spinner giving a blue versus a white outcome.

Another nice property of measuring evidence and probabilities in decibels is that it seems like 1 dB roughly corresponds to the smallest detectable value that people notice in terms of a change in underlying distribution, being the difference between even chance and 5 to 4 odds, moderate probability or better than even chance.

Additionally, $10 \textrm{ dB}$ corresponds to 10 to 1 odds, or 91% probability, which people associate with events being almost certain or happening almost always.¹⁰

The traditional statistical threshold for reported results is a p-value of 0.05, which is often misinterpreted to mean that the probability the null hypothesis is true is less than 5%. While this isn't what the p-value measures, if we obtain more than 13 dB of evidence against some null hypothesis, this does mean that the relative odds that it is correct have decreased by a factor of 20, taking us to worse than 20 to 1 against if we started with even odds.

We have the conversions: $$ 1 \textrm{ nat} = \frac{10}{\log 10} \textrm{ dB} = 4.34 \textrm{ dB} $$ $$ 1 \textrm{ bit} = \frac{10}{\log_2 10} \textrm{ dB} = 3.01 \textrm{ dB} $$
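These conversion factors are easy to check numerically. The following Python sketch uses constant and function names of my own choosing, and also converts the 13 dB threshold used above back into nats and bits:

```python
import math

DB_PER_NAT = 10 / math.log(10)    # 10 * log10(e)
DB_PER_BIT = 10 * math.log10(2)   # equivalently 10 / log2(10)

def nats_to_db(nats):
    return nats * DB_PER_NAT

def bits_to_db(bits):
    return bits * DB_PER_BIT

def db_to_nats(db):
    return db / DB_PER_NAT

def db_to_bits(db):
    return db / DB_PER_BIT

print(f"1 nat = {nats_to_db(1):.2f} dB, 1 bit = {bits_to_db(1):.2f} dB")
print(f"13 dB = {db_to_nats(13):.2f} nats = {db_to_bits(13):.2f} bits")
```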

Examples and Magnitudes

Double-headed Coin

Let's say I have two coins in my pocket, the first is an ordinary unbiased coin, and the second is double-headed. I give you one of them and you start flipping the coin. You get a heads, then another heads, then another. How many heads would you need to see in a row until you're sure you've been given the double-headed coin? Let's work out the relative entropy between these two distributions. On the one hand we have $p(H)=1, p(\overline H) =0$, and the other $q(H) = q(\overline H)= 0.5$.

$$ I[p;q] = 10 \sum_i p_i \log_{10} \frac{p_i}{q_i} = 10 \log_{10} 2 = 3.01 \text{ dB} $$

The relative entropy of a sure thing and a coin flip is 3 decibels. This means that if we want to be more sure than 20 to 1 that we have the double-headed coin we'd need to observe 5 heads in a row, giving us 15 dB of evidence.
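The arithmetic for the coin can be sketched in a few lines of Python. The 20 to 1 threshold is the one used in the text; the helper `kl_db` and the variable names are my own.

```python
import math

def kl_db(p, q):
    """Discrete relative entropy I[p; q] in decibels; terms with p_i = 0 contribute 0."""
    return 10 * sum(pi * math.log10(pi / qi) for pi, qi in zip(p, q) if pi > 0)

double_headed = [1.0, 0.0]   # p(H), p(not H)
fair = [0.5, 0.5]            # q(H), q(not H)

per_flip = kl_db(double_headed, fair)     # evidence gained per observed head
threshold_db = 10 * math.log10(20)        # 20 to 1 odds, about 13 dB

# Smallest number of heads whose accumulated evidence exceeds the threshold.
heads_needed = math.floor(threshold_db / per_flip) + 1

print(f"{per_flip:.2f} dB per flip, {heads_needed} heads, "
      f"{heads_needed * per_flip:.1f} dB total")
```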

Births

Perhaps the first hypothesis test to be resolved with modern statistics was the question of whether more male or female babies are born. Using data from 1745 to 1770, Laplace found that in those 26 years, 251,527 boys and 241,945 girls were born. This gives a fraction of male births of $\sim 51\%$. Is this just a statistical fluke, or are boys more common than girls at birth? What Laplace did was to analytically work out the Bayesian posterior distribution for the probability that a male baby was born using a uniform prior, obtaining a $\operatorname{Beta}(251528, 241946)$ distribution, for which the probability that the probability a male is born is less than or equal to $1/2$ is $$ \int_0^{1/2} \mathrm dx \, \operatorname{Beta}(x; 251528, 241946) \sim 10^{-42}$$ enough for Laplace to declare that he was morally certain that males are born more frequently than females.

Let's work out the weight of evidence in this case, let's say we were comparing two hypotheses, the first that males are born 51% of the time, and the second that they are born 50% of the time. With Laplace's data, the total weight of evidence in this case is:

$$ 2515270 \log_{10} \frac{0.51}{0.50} + 2419450 \log_{10} \frac{0.49}{0.50} = 404 \text{ dB} $$ a whopping 400 decibels of evidence for males being born 51% of the time rather than 50%.
At the same time, I'm not sure most people are aware that males are born in a higher proportion and it doesn't seem to affect most people's lives. Why is that? Well, let's evaluate the relative entropy between a 51% Bernoulli and a 50% Bernoulli: $$ I = 5.1 \log_{10}\frac{0.51}{0.50} + 4.9 \log_{10} \frac{0.49}{0.50} = 8.7 \times 10^{-4} \text{ dB}. $$ Notice that the relative entropy is quite small. On average, if the true distribution was 51%, the evidence we accumulate on each observed birth is less than a thousandth of a decibel (about 87 microbels). This means that on average in order to be reasonably sure that the 51% hypothesis is true, we'd have to observe $\sim \frac{13}{8.7 \times 10^{-4}} \sim 15,000$ births. This makes clear how with enough data we could both be very sure that males are born with a higher frequency than females, but at the same time, this could have very little impact on our individual lives.
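For anyone who wants to reproduce these numbers, here is a minimal Python sketch using Laplace's counts from above. The 13 dB threshold is the 20 to 1 level discussed earlier; the function and variable names are mine.

```python
import math

boys, girls = 251_527, 241_945   # Laplace's birth counts, from the text

def log_ratio_db(p, q):
    """Weight of evidence for probability p over q from one observation, in dB."""
    return 10 * math.log10(p / q)

# Total weight of evidence for the 51% hypothesis over the 50% hypothesis.
total_db = boys * log_ratio_db(0.51, 0.50) + girls * log_ratio_db(0.49, 0.50)

# Expected evidence per birth if births really are 51% male (the relative entropy).
per_birth_db = 0.51 * log_ratio_db(0.51, 0.50) + 0.49 * log_ratio_db(0.49, 0.50)

# Births needed, on average, to accumulate 13 dB.
births_needed = 13 / per_birth_db

print(f"total: {total_db:.0f} dB")
print(f"per birth: {per_birth_db:.2e} dB")
print(f"births needed for 13 dB: {births_needed:,.0f}")
```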

Likelihoods and Learning

What we would really like to do is learn a model of some real life distribution. If the true distribution of data is $p(x)$, and we have some kind of parametric model $q(x;\theta)$, we would like to set our model parameters $\theta$ so that we get as close as possible to the true distribution. In other words, we want to minimize the relative entropy from the real world to our model: $$\min_\theta I[p;q] = \min_\theta \int \mathrm dx\, p(x) \log \frac{p(x)}{q(x;\theta)}. $$ The biggest complication is that we don't actually know what the true distribution of the data is. We can, however, sample data. Luckily for us, as far as using this as an objective for $\theta$ goes, we can treat the entropy of $p(x)$ as a constant. This motivates the traditional maximum likelihood objective: $$ \max_\theta \int \mathrm dx \, p(x) \log q(x;\theta), $$ where in practice the expectation over $p(x)$ is estimated by an average over samples.

If we had an infinite dataset, maximum likelihood is the same as minimizing the relative entropy between the real world and our model. Unfortunately, we don't often have infinite datasets.¹¹ On finite datasets, maximum likelihood can still be interpreted as minimizing a KL divergence, but now the KL divergence between the *empirical distribution* $\hat p(x) = \frac{1}{N} \sum_{i=1}^N \delta(x - x_i) $ and our model $q(x;\theta)$.
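Here is a small discrete illustration in Python, assuming a Bernoulli model and a made-up dataset of ten coin flips (both are my own choices for the sake of the example). It checks that the parameter maximizing the average log likelihood is the same one that minimizes the relative entropy from the empirical distribution to the model, and that the two objectives differ only by a constant.

```python
import math

data = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # hypothetical coin flips, for illustration only
n = len(data)
p_hat = sum(data) / n                    # empirical frequency of 1s

def avg_log_likelihood(theta):
    """Average log likelihood of the data under a Bernoulli(theta) model, in nats."""
    return sum(math.log(theta if x == 1 else 1 - theta) for x in data) / n

def kl_from_empirical(theta):
    """I[p_hat; q_theta] between the empirical distribution and the model, in nats."""
    empirical = [p_hat, 1 - p_hat]
    model = [theta, 1 - theta]
    return sum(e * math.log(e / m) for e, m in zip(empirical, model) if e > 0)

grid = [i / 100 for i in range(1, 100)]
theta_ml = max(grid, key=avg_log_likelihood)
theta_kl = min(grid, key=kl_from_empirical)

print(theta_ml, theta_kl, p_hat)
```

For every `theta`, `avg_log_likelihood(theta) + kl_from_empirical(theta)` equals minus the entropy of the empirical distribution, which does not depend on `theta`. That constant offset is why the two objectives share an optimum.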

Unfortunately, the cross entropy is no longer reparameterization invariant, a point I elaborate in an appendix below, and so is difficult to interpret directly, but if we take the difference of any two cross entropies, we can still interpret that as the weight of evidence for one model with regards to the other. Because of the lack of reparameterization independence, care must be taken to ensure that the likelihoods of the two models are evaluated using the same measure, but provided they are:

$$ L_1 - L_2 = \mathbb{E}\left[ \log q_1(x) \right] - \mathbb{E}\left[ \log q_2(x) \right] = \mathbb{E}\left[ \log \frac{q_1(x)}{q_2(x)} \right] $$

Given the size of test sets we have for modern image datasets, this means that very small changes in likelihood can be interpreted as large confidences in the superiorities of models. Take for instance something as simple as binary static MNIST.¹² Here, with 10,000 test set images, a difference in average log likelihood of 0.0013 dB or 0.0003 nats per image corresponds to 13 dB of evidence for the one model over the second.

Appendix A: Whither Continuous Entropy

The relative entropy really is the proper way to define entropy. For all of the things that Shannon got right, he flubbed a bit when he defined the entropy of a distribution as: $$ H(P) = -\sum_i p_i \log p_i $$

Why do I say he flubbed? Because this notion of entropy doesn't generalize to continuous distributions. The continuous analog: $$ H(P) = -\int \mathrm dx\, p(x) \log p(x) $$ isn't reparameterization independent. Consider for instance the distribution of adult human heights: ¹³

Figure 1. Distribution of adult heights. ¹⁴

If you compute the continuous entropy of this distribution with heights measured in centimeters you get 5.4 bits. If you instead measure the same distribution in feet you get 0.43 bits. If you instead were to measure heights in meters it would be -1.3 bits! ¹⁵
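The unit dependence is easy to see in closed form for a Gaussian. The Python sketch below assumes, purely for illustration, that heights are Gaussian with a 10 cm standard deviation, and compares against a second made-up Gaussian. The actual height distribution in the figure is not Gaussian, so the numbers will not match the ones quoted above exactly. The continuous entropy shifts by the log of the unit conversion factor, while the relative entropy between the two distributions stays put.

```python
import math

def gaussian_entropy_bits(sigma):
    """Continuous entropy of a Gaussian with standard deviation sigma, in bits."""
    return 0.5 * math.log2(2 * math.pi * math.e * sigma ** 2)

def gaussian_kl_bits(mu_p, sigma_p, mu_q, sigma_q):
    """Relative entropy I[p; q] between two 1-D Gaussians, in bits."""
    nats = (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sigma_q ** 2)
            - 0.5)
    return nats / math.log(2)

# Hypothetical parameters in centimeters (assumptions, not taken from the figure).
MU_P, SIGMA_P = 170.0, 10.0
MU_Q, SIGMA_Q = 165.0, 8.0

CM_PER_UNIT = {"cm": 1.0, "feet": 30.48, "m": 100.0}

entropies, kls = {}, {}
for unit, cm in CM_PER_UNIT.items():
    entropies[unit] = gaussian_entropy_bits(SIGMA_P / cm)
    kls[unit] = gaussian_kl_bits(MU_P / cm, SIGMA_P / cm, MU_Q / cm, SIGMA_Q / cm)
    print(f"{unit:>4}: entropy {entropies[unit]:+.2f} bits, "
          f"relative entropy {kls[unit]:.4f} bits")
```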

Appendix B: Coding Interpretation

The traditional interpretation offered for the KL is from the coding perspective. Imagine we have a simple 4-letter alphabet that we want to communicate over the wire. If the four letters occurred with different probabilities: $p(A)=1/2, p(B)=1/4, p(C)=p(D)=1/8$, with an optimally designed Huffman Code we could encode our letters with a variable length code: $A:0, B:10, C:110, D:111$, and on average we'd only be spending $1/2 + 2/4 + 3/8 + 3/8 = 7/4$ bits per letter.

	A	B	C	D
$p$	1/2	1/4	1/8	1/8
p-code	0	10	110	111
$q$	1/4	1/4	1/4	1/4
q-code	00	01	10	11

Table 2: A simple example of two different distributions over a 4 letter alphabet.

Imagine however we didn't know what the true distribution of letters was and instead designed an optimal code using a different distribution $q$. If we believed each of the 4 letters were equally likely $(q(A)=q(B)=q(C)=q(D)=1/4)$, the optimal way to encode messages would just assign a two bit code to each letter $(A : 00, B:01, C:10, D:11)$. If we used this suboptimal code to send messages that were actually distributed as $p$ it would cost $2/2 + 2/4 + 2/8 + 2/8 = 2$ bits per letter. Our incorrect belief leads to a $2 - 7/4 = 1/4$ of a bit inefficiency. For these two distributions, it shouldn't come as a surprise that the information gain is precisely 1/4 bits: $$ I[p;q] = \sum_i p_i \log_2 \frac{p_i}{q_i} = 1/4 \textrm{ bits}. $$

For an optimally designed code, the code lengths go as $-\log p(x)$ for any symbol $x$. Our information gain can be interpreted as a difference in expected code lengths under $p$: $$ I[p;q] = \mathbb{E}_p[ -\log q ] - \mathbb{E}_p[-\log p ]. $$ The information gain $I[p;q]$ measures the excess encoding cost for trying to encode messages from $p$ using a code designed for $q$.
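The numbers in Table 2 can be checked directly. This Python sketch computes the expected code length of each code under $p$ and compares the excess cost with the relative entropy; the dictionary and function names are my own.

```python
import math
from fractions import Fraction

# Distributions and codes from Table 2.
p = {"A": Fraction(1, 2), "B": Fraction(1, 4), "C": Fraction(1, 8), "D": Fraction(1, 8)}
q = {symbol: Fraction(1, 4) for symbol in p}
p_code = {"A": "0", "B": "10", "C": "110", "D": "111"}
q_code = {"A": "00", "B": "01", "C": "10", "D": "11"}

def expected_length(dist, code):
    """Average number of bits per letter when letters are drawn from dist."""
    return sum(dist[s] * len(code[s]) for s in dist)

cost_p = expected_length(p, p_code)   # code matched to the true distribution
cost_q = expected_length(p, q_code)   # code designed for the wrong distribution

# Relative entropy I[p; q] in bits.
kl_bits = sum(float(p[s]) * math.log2(p[s] / q[s]) for s in p)

print(cost_p, cost_q, cost_q - cost_p, kl_bits)
```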

Blogging

Mon, 15 Nov 2021 00:00:00 -0500

We all stand on the shoulders of giants.

The thing that separates humans from other animals is the degree to which we can share knowledge with one another. As such, I feel as though we have some duty to our fellow humans to share the knowledge we've accumulated. More than ever before, everyone has at their disposal the tools required to share their knowledge with anyone essentially anywhere else on earth.

I feel guilty for not participating myself.

Sure I write scientific papers, but if we're being honest, most scientific papers have rather small audiences. If we're being even more honest, unfortunately, it seems as though most scientific papers these days are not really meant to be read. It doesn't feel as though they go out of their way to make themselves understood. I'm sure there is some selection bias but I feel as though when I read older papers they feel a lot more like a conversation with the author. I feel as though the best papers feel as though they are a chat with the author(s) and they are essentially helping you stand on their shoulders. They've spent a good amount of time and energy thinking about some particular thing and are attempting to share that thought with you so that you don't have to go through the same arduous process.

I feel as though this is a perfect opportunity to bring back blogging and in particular academic blogs. For some reason, it becomes clear when you're writing a blog post that it's meant to be read. Many academic blogs take on a much more conversational tone and focus a lot more on explaining themselves and the ideas they are covering in simple to understand terms. Granted, a lot of the best information these days comes in the form of videos, but not everyone has the skills, time, or energy to produce such amazing video content. The barrier to entry is a lot lower for blogging.

Finally, it's cliche but we all know that the best way to learn is through teaching. Taking the time to try to explain a new concept you're struggling with can really help solidify it in your own mind. I've found that many of the things I've learned best were things I worked through for previous blog posts of mine.

I want to start blogging again. I've set up this small site to do that. It's remarkable how easy it is these days. This site is hosted on GitHub Pages, which gives free space for anyone to start hosting their own static content. With systems like Jekyll, the actual practice of making the website can be quite simple. In my case I wrote some custom Python scripts to convert my Markdown posts to HTML. The blog itself lives in its own GitHub repo. MathJax allows for beautiful $\LaTeX$ math rendering. For discussions or comments, I'll try out giscus, which is a nifty system enabling comments through GitHub Discussions. People can also reach out on Twitter. I also think it's important to support RSS, which I've done here.

A Simple Demonstration of Benford's Law

Tue, 03 May 2022 00:00:00 -0400

Benford's Law is the observation that in a list of "naturally occurring numbers", the plurality of numbers will begin with a 1. This often catches people by surprise, but if you go and pull random numbers from a book or newspaper you can expect the leading digits to follow this distribution:

Figure 1. The distribution of leading digits in naturally occurring numbers.

Nearly a third of all naturally occurring numbers begin with a 1. Why? Perhaps the simplest way to see this is to realize that if there is going to be some kind of distribution for naturally occurring numbers, that distribution ought to be reparameterization independent. It shouldn't matter whether we use English or metric units. If there is going to be a universal leading digit distribution, it's gotta be invariant to a change of units. Changing units is accomplished via multiplication, so whatever the universal distribution of leading digits is, it needs to be invariant to multiplication.

See it

Perhaps the easiest way to see Benford's law is to look at a Circular slide rule:¹

Figure 2. An old soviet circular slide rule. The inner dial is the main dial. Notice that the digits follow Benford's law.

If you look at the inner dial, you'll notice that the digits are spaced just like our Benford's law distribution was in Figure 1. This is not an accident. Slide rules work by physically manifesting multiplication as a sort of addition. On this circular slide rule, if you add the angle a number appears at on the inner dial to the angle some other number appears at, the resulting angle will point to their product. Since slide rules only track the significand,² circular slide rules cleverly wrap around.³

Whatever the universal leading digit distribution is, or more specifically whatever the universal distribution of significands is, provided one exists, it would have to be invariant to any multiplication. It would have to be invariant to the addition of any random angle on the circular slide rule, i.e. it would have to be circularly symmetric, i.e. it would have to be uniform on the slide rule dial.

This thought process is enough to give us the distribution in Benford's law. The digits on the circular slide rule are located so that, $$ \theta(x) = 2 \pi \log_{10} x, $$ for the numbers from 1 to 10. This ensures that 1 is at $\theta=0$ and $10$ is at $\theta = 2\pi$. It also ensures that if we try to locate the angle of a product of numbers $x$ and $y$, we can do so by simply adding their angles:⁴ $$ \theta(x y) = 2\pi \log_{10}( x y ) = 2 \pi \log_{10} x + 2 \pi \log_{10} y = \theta(x) + \theta(y). $$
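To make the angle-addition trick concrete, here is a minimal sketch of a circular slide rule in Python. The function names (`theta`, `significand_at`, `slide_rule_product`) are my own for illustration; the only thing taken from the text is the dial layout $\theta(x) = 2\pi \log_{10} x$.

```python
import math

def theta(x):
    """Angle (in radians) of the number x on the dial: 2*pi*log10(x)."""
    return 2 * math.pi * math.log10(x)

def significand_at(angle):
    """Read the dial at an angle, wrapping around the circle first."""
    wrapped = angle % (2 * math.pi)
    return 10 ** (wrapped / (2 * math.pi))

def slide_rule_product(x, y):
    """Multiply by adding dial angles; only the significand survives."""
    return significand_at(theta(x) + theta(y))

print(slide_rule_product(2, 3))  # close to 6
print(slide_rule_product(4, 5))  # wraps past 10: close to 2, the significand of 20
```

The second call is the wrap-around in action: the dial has no way to tell 20 from 2, which is exactly why it only tracks significands.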

Knowing how the digits are arranged, we can easily determine the fraction of the circle allotted to each one: $$ f(d) = \log_{10}(d+1) - \log_{10}(d) = \log_{10}\left( 1 + \frac{1}{d} \right). $$ This is the formula you'll see elsewhere. Most of the discussion surrounding Benford's law focuses on the first digit alone, but our visual argument also suggests that we can easily determine the distribution for significands themselves, not just the first digit. For instance, looking at the slide rule, we can see that it's nearly as likely that a number should have "10" as its first two digits⁵ as it is that we'd find a naturally occurring number beginning with a 9.
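As a quick sanity check on those fractions, here is a small Python sketch that tabulates the share of the dial between each digit and the next, and compares the arc for "10" against the arc for 9. The variable names are just illustrative.

```python
import math

def benford_fraction(d):
    """Fraction of the dial lying between d and d + 1."""
    return math.log10(d + 1) - math.log10(d)

first_digit = {d: benford_fraction(d) for d in range(1, 10)}
for d, f in first_digit.items():
    print(d, round(f, 3))

# Numbers whose first two digits are "10" occupy the arc from 1.0 to 1.1.
starts_with_10 = math.log10(1.1) - math.log10(1.0)
starts_with_9 = first_digit[9]
print(round(starts_with_10, 4), round(starts_with_9, 4))
```

The nine fractions sum to one, as they must since together they cover the whole circle, and the "10" arc comes out only slightly shorter than the arc for 9.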

We've already said that with a random multiplication being like a random spin of the circular slide rule pointer, the universal distribution of significands should just be uniform on the slide rule. Performing a change of variables, if we know the distribution $p(\theta)$ of angles along the circle is uniform, we can work out the distribution $p(x)$ of significands by requiring we conserve all of the probability mass: $$ p(\theta) \, d\theta = p(x) \, dx $$ combined with what we already know as the relationship between our significands and their angles: $\theta = 2\pi \log_{10} x$. This allows us to transform the uniform distribution of angles $p(\theta) = \frac{1}{2\pi}$ into: $$ p(x) = \frac{1}{x \ln 10 }. $$ We've recovered a nice power law or "scale-free" distribution for the significands, something we could have guessed or worked out from our requirement that the distribution be invariant to scale.
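If you'd rather see this numerically, the following sketch draws uniform angles, converts them to significands, and then checks that the leading digit frequencies survive a change of units. The conversion factor 2.54, the sample size, and the seed are arbitrary choices of mine, and the helper names are hypothetical.

```python
import math
import random

def significand(x):
    """Significand of a positive number, in [1, 10)."""
    return 10 ** (math.log10(x) % 1.0)

def leading_digit_frequencies(values):
    """Empirical frequency of each leading digit; index 0 is unused."""
    counts = [0] * 10
    for v in values:
        # min(...) guards against a float landing exactly on 10.0
        counts[min(int(significand(v)), 9)] += 1
    return [c / len(values) for c in counts]

rng = random.Random(0)
# A uniform angle theta = 2*pi*u maps to the significand 10**u.
samples = [10 ** rng.random() for _ in range(100_000)]
rescaled = [2.54 * x for x in samples]  # e.g. inches to centimeters

benford = [0.0] + [math.log10(d + 1) - math.log10(d) for d in range(1, 10)]
freq = leading_digit_frequencies(samples)
freq_rescaled = leading_digit_frequencies(rescaled)
for d in range(1, 10):
    print(d, round(benford[d], 3), round(freq[d], 3), round(freq_rescaled[d], 3))
```

All three columns should agree up to sampling noise: multiplying by a constant just rotates the dial, and a uniform distribution on a circle doesn't notice rotations.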

We may have just gone around in a circle,⁶ but I hope you agree that there is something very visceral about seeing Benford's law play out on the face of the circular slide rule.

Knots

Mon, 13 Jun 2022 00:00:00 -0400

Years ago I decided that I didn't know how to tie any knots and needed to remedy the situation. Unfortunately, if you start to look into knots you discover there is a whole slew of them. It's easy to dive deep and try out dozens of different knots and then be left not remembering how to tie any of them. So, I decided I was going to try to pick the best knot for each of a few different scenarios and then focus on learning and retaining those knots well.

I can say that having done this nearly a decade ago, it's paid off. It's not the kind of thing that comes up very often, but when it does, knowing what to do with a rope has proven a very useful skill. So, which knots do I recommend?

Most sources of this type tend to focus on specific jobs or activities, the best boating knots, scouting knots, climbing knots, etc. I don't do any of those things. Instead, I want to know how to not embarrass myself if I find myself needing to accomplish some kind of task with a rope. So, for each of the common use cases for a rope, I've tried to identify the "best" knot to use. Best, in this case, means some combination of ease of tying, ease of remembering, and strength and ability of the knot. A lot of knots will have "false friends" of a sort, very closely related knots that are often much worse in their characteristics. I tried to pay special attention here, as I don't very often use knots so I need to ensure the ones I focus on learning are robust in the sense that even if I haven't tied it for a few years I'm unlikely to make a grave mistake.

Tying off

If I'm being honest, the most common knot I tie, nearly every day, is the Square knot, also known as the reef knot. This is because it's the knot I use to tie my shoes, or tie off garbage bags, that sort of thing.

beautiful [but deadly] square knot flickr photo by woodleywonderworks shared under a Creative Commons (BY) license

The knot is very common and nearly everyone knows it, but one thing I've discovered is that a good fraction of people tie the knot incorrectly! By far, the most useful thing I learned all those years ago when I decided to dive into knots was that I had been tying my shoes wrong my entire life.

To tie a proper reef knot, you have to alternate which rope goes on top. The mnemonic is right over left, then left over right. If you instead do right over left and then right over left again, as I did most of my life, you end up with the far worse knot, the Granny Knot.

Throughout my childhood and young adult life, it always seemed as though my shoelaces would come untied, I often had to revert to double knotting them to try to get them to last the whole day. The problem the entire time was that I was tying the wrong knot. I've since caught many of my friends and family having the exact same problem. If you spend a few minutes tying both a proper square knot and a granny knot you can immediately learn to tell the difference. The telltale sign when looking at someone's shoes is to check to see if the loops are crooked. Proper shoelace loops should lie down the sides of the shoe. If a granny knot is tied, they tend to rotate and lie parallel to the shoe but perpendicular to the laces. For more info, see Ian Fieggen's site.

Now, as I said, the square knot is by far the most common knot I tie, which is a bit unfortunate because the knot itself is pretty poor. It should not be used in any sort of critical situation. Aside from tying off garbage bags and shoelaces, it shouldn't be used at all. It should never be used to join two ropes. Ashley¹ claimed that reef knots have caused more deaths than all other knots combined. Given their familiarity, I imagine many people use them to join ropes not realizing their tendency to slip until it's too late. Learn how to tie a square knot properly but also learn alternatives to use for basically all use cases.

Shoelaces

Speaking of Ian Fieggen, I actually use his quick method to tie my shoes these days. It only takes a few minutes to learn and then you can use it for the rest of your life. I recommend checking it out.

A very fast method of tying your shoes.

Loop at end of the rope

The next most common use case I find myself in is wanting to tie a loop at the end of a rope. For this I think the Bowline is the go-to answer.

"Bowline knot", by Markus Bärlocher, in the public domain.

The Bowline is often described as the "king of knots". It's a nice knot to know. The loop formed doesn't collapse, so can be used to make a handle, to tie a rope around a hole, pole, or other objects. If you make a Bowline in the end and then pass the rest of the rope through you can make a type of lasso or slipknot. Even after being used the knot is usually easy to untie.

Tying the knot is simple and easy to remember, especially with the common mnemonic: "The rabbit comes up the hole, runs around the tree then back down the hole." One thing to pay special attention to is to ensure that the free end of the rope is left on the inside of the loop.

Loop in the middle of the rope

Next, if you need a loop in the middle of a rope and don't have access to the ends, use the Alpine Butterfly. This is a great midline loop knot: it doesn't collapse and can be loaded from any direction or from the loop itself. It is easy to untie after being loaded.

"Alpine butterfly knot", by WikipedianYknOK, licensed under CC BY-SA 3.0

It's also very easy to tie if you know the hand wrap method on Grog's site. Using this method, it's very difficult to mess up the knot.

Bend, joining two ropes

To bend or join two ropes together, we don't actually need a new knot, we can reuse the Alpine Butterfly.

"AlpineButterflyBend", by Cobanyastigi, in the public domain.

To tie this, use the same hand wrapping method but start with the first rope which you take up to the top of your hand and pinch between your middle fingers, then add the second rope to the same spot and continue the wrap. Pull both ends down and under the two wraps and you've got an alpine butterfly but with the two ends of the ropes playing the role of the loop.

We get another use case without having to remember a new knot.

Hitch, attaching a rope to an object

I have to admit that while I managed to learn all the rest of the knots listed here the last time I looked into knots, I didn't manage to remember the hitch I had picked. This is unfortunate because I've found myself needing to tie a rope to a pole or tree at times. One option is to simply tie a Bowline loop around the object in question, which I've used in the past, but I've felt bad for not having a proper hitch in my arsenal. Since I forgot the hitch I picked last time around, I clearly made a poor choice, so when preparing this blog post, I started over.

After looking around again, I've decided that going forward, my chosen hitch will be the Backhand Hitch.

The backhand hitch is a Munter hitch with Two Half Hitches to finish it off. It appears to be gaining in popularity, with many extolling its benefits. It's a secure knot that is not only easy to untie but can hold a load while being untied due to the Munter hitch at the base.

As I said I don't have a whole lot of personal experience with this knot, but I'll try to retain it going forward.

Trucker's Hitch

Another reason I like the Backhand hitch is that it sets us up well to make use of another popular kind of hitch, the Trucker's Hitch². I think of the Trucker's hitch less as a particular knot and more as a sort of principle for how to tie down loads. A trucker's hitch involves tying a mid-line loop in a rope, which you then use as a pulley to get three-to-one leverage as you pull the line tight. After that, you simply tie it off when it's appropriately tight. The three-to-one leverage you get really helps tighten the rope.

For our midline loop, we can use the Alpine butterfly from above. After pulling tight, we need to secure the rope, and for that, we can finish the Trucker's hitch with the same Two Half Hitch structure we use to finish the Munter Hitch in the Backhand Hitch, here around both of the ropes at the same time.

Even though it's composed of the other pieces we've discussed so far, it's worth practicing the Trucker's hitch altogether, and with some regularity. It's the kind of thing that if you can recall immediately can be a big help, but if you have to look it up, it's not really worth it.

Necktie

I don't often wear a tie, but when I do I use the Pratt knot. A modern, relatively slim, nicely symmetrical tie knot. It has served me well, and I like that it starts with the tie reversed.

"Blue Pratt Knot", by Kris Alekseych Karlov, in the public domain.

Binding

If you want to very securely close off a bag or sack, so well that you'll have to cut the rope to get it open again, look no further than the Constrictor Knot. It can also be tied in the end of a rope.

Stopper Knot

Finally, though I don't think it's the most important use case, a surprisingly common one is needing to put a stopper knot in a rope, just something to keep the rope from slipping through a hole. Nearly everyone knows the Overhand knot; however, I think fewer people know that there are superior alternatives. The overhand knot tends to bind, being difficult to untie if pulled taut, and isn't a particularly big stopper knot to begin with.

If the desire is to have a large stopper knot, the Double overhand is much better than the ordinary overhand knot and very easy to tie, simply add an additional turn while tying the overhand knot. When pulled tight it forms a neatly symmetrical stopper, but tends to bind tight, perhaps tighter than the overhand knot.

If you want a stopper knot that can be untied, the Figure 8 knot works well, and could also be described as adding an additional turn to the overhand knot, but this time on the outside of the knot.

Try them out

Get yourself a piece of rope, and a free half-hour or so, and try out some of these knots. If you focus on a short list of a handful of knots, one for each situation you might need to use a rope for, I think you'll find it time well spent.

A Path to the Variational Diffusion Loss

Thu, 15 Sep 2022 00:00:00 -0400

Diffusion models have made quite a splash, especially after the open-source release of Stable Diffusion. What are diffusion models, where does the loss come from and what does a simple example look like? I've recently helped open-source a simple, pedagogical, self-contained example colab of a diffusion model trained on EMNIST, which you can find as part of the Variational Diffusion Models (VDM) github page. In this post, I wanted to give some more background and a simple way to motivate where the loss function comes from.

Non-negativity of KL

Let's say we want to build a latent-variable model $q(x, z)$ under which data drawn from the true distribution $p(x)$ has high marginal likelihood, $\log q(x)$. Unfortunately, computing $\log q(x)$ involves an intractable integral over the latent variable, $z$.¹

³ I use angle brackets to show expectations and, unless noted, always with respect to the full $p$ distribution. $$ \left\langle \cdot \right\rangle_p = \mathbb{E}_p \left[ \cdot \right] = \int dx\, p(x) [\cdot] $$

If I don't denote on the brackets which distribution the expectation is with respect to, it's always the full joint $p(x,\cdots)$. Notice that this works even if there are fewer variables or conditioning variables left inside the terms in the brackets, as any excess variables will just marginalize out without issue in the expectation and any variables being conditioned on will be evaluated in expectation as desired.

We can derive the tractable objective used to train these models using the observation that the KL² divergence is non-negative and monotonic. The Kullback-Leibler (KL) divergence between any two distributions is non-negative:³ $$ \left\langle \log \frac{p(x)}{q(x)} \right\rangle_p \geq 0. $$

If we marginalize out some subset of random variables, the KL divergence between the marginal distributions can be no larger than the KL divergence between the joints. For any two random variables: $$ \begin{align} \left\langle \log \frac{p(x,z)}{q(x,z)} \right\rangle &= \left\langle \log \frac{p(x)p(z|x)}{q(x)q(z|x)} \right\rangle \\ &= \left\langle \log \frac{p(x)}{q(x)} \right\rangle + \left\langle \log \frac{p(z|x)}{q(z|x)} \right\rangle \\ &\geq \left\langle \log \frac{p(x)}{q(x)}\right\rangle \geq 0 \end{align} $$ Intuitively, if we think about KL divergence as a "distance" between probability distributions, two joint distributions always have to be at least as far apart as their marginals. As we just saw, the KL of the joint is the sum of the KL between the two marginals and the expected KL between the conditional distributions (which has to be non-negative, as all KLs are).
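Here is a tiny numerical check of that decomposition on made-up discrete distributions. The two joint tables are arbitrary numbers I picked so the rows sum to one overall; nothing about them is special.

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Joint tables indexed as [x][z]; arbitrary illustrative values.
p_joint = [[0.10, 0.25, 0.15], [0.20, 0.05, 0.25]]
q_joint = [[0.20, 0.10, 0.10], [0.15, 0.25, 0.20]]

def flatten(table):
    return [v for row in table for v in row]

p_x = [sum(row) for row in p_joint]  # marginals over x
q_x = [sum(row) for row in q_joint]

joint_kl = kl(flatten(p_joint), flatten(q_joint))
marginal_kl = kl(p_x, q_x)

# Expected KL between the conditionals p(z|x) and q(z|x), averaged over p(x).
conditional_kl = sum(
    p_x[x] * kl([v / p_x[x] for v in p_joint[x]],
                [v / q_x[x] for v in q_joint[x]])
    for x in range(len(p_x))
)

print(joint_kl, marginal_kl + conditional_kl)  # the two should agree
```

The joint KL splits exactly into the marginal piece plus the conditional piece, so dropping the conditional piece can only make the number smaller.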

VAEs

Imagine designing these joint distributions to have different flavors. Think of $p(x,z)$ as a forward process $p(x) p(z|x)$ that takes an image from some natural image distribution $p(x)$ and then encodes it into some representation $z$ with an encoder $p(z|x)$. This is a joint distribution over the two variables. Running the forward process would give us $(x,z)$ pairs, pairs of natural images and their encodings. Next, imagine a different joint distribution, a reverse process $q(x,z)$ that takes some sample from a prior $q(z)$ and then runs it through a decoder $q(x|z)$ to generate a synthetic image. This is a generative model of the kind we might be used to building. This is also a fully-fledged joint distribution that we could sample from, in order to generate $(x,z)$ pairs. At initialization, these two distributions are very different. The goal of generative modeling is to bring these two joint distributions into alignment.

Based on the properties of the KL divergence, these two joint distributions must have a non-negative KL divergence that is monotonic to marginalizing out one of the variables: $$ \left\langle \log \frac{p(x,z)}{q(x,z)} \right\rangle = \left\langle \log \frac{p(x) p(z|x)}{q(z) q(x|z)} \right\rangle \geq \left\langle \log \frac{p(x)}{q(x)} \right\rangle \geq 0 $$ Notice what this is saying. The KL divergence between the joint distributions here is the expected log density ratio of the forward to the reverse model's likelihood, where the expectation -- the samples -- are taken with respect to the forward process $p(x,z)$. This joint KL is itself an upper bound for the KL divergence between the marginal distributions $p(x)$ and $q(x)$. $p(x)$ was our original image distribution, while $q(x)$ is the distribution of synthetic images drawn from the generative model that is our reverse process: $$ q(x) = \int dz\, q(x|z) q(z) $$

So, by minimizing the KL between our forward and reverse process -- by aligning the two joint distributions -- we can ensure that we make progress towards learning a good generative model of our images $q(x)$. We can ensure that we are aligning the marginals $q(x)$ and $p(x)$.

The tightness of this bound is controlled by how close together the remaining conditional distributions are:

$$ \left\langle \log \frac{p(x,z)}{q(x,z)} \right\rangle = \left\langle \log \frac{p(x)}{q(x)} \right\rangle + \left\langle \log \frac{p(z|x)}{q(z|x)} \right\rangle $$ In other words: the degree to which our encoding distribution ($p(z|x)$) matches the Bayesian posterior of our generative model ($q(z|x)$) determines the tightness of our bound.

So, again, all we started with is the idea of two different processes, the forward process that takes images and encodes them and a reverse process that samples some latents from a known distribution and decodes them. If we try to minimize the KL divergence between these two processes, forward to reverse, we can ensure that this is a valid bound on the marginal KL between the true image distribution $p(x)$ and the marginal of our generative model $q(x)$. That is, by learning to make the two joint processes look alike we are also as a consequence learning a good generative model of images.

We've just derived the ordinary ELBO:⁴ $$ \left\langle \log \frac{p(x,z)}{q(x,z)} \right\rangle = \left\langle \log p(x) -\log q(x|z) + \log \frac{p(z|x)}{q(z)} \right\rangle, $$ up to a constant outside our control, the entropy of the true image distribution $p(x)$. Notice that this term cancels out on both sides if we wish to target the cross-entropy from our true $p(x)$ to our model's $q(x)$ rather than the KL.

$$\begin{align} \left\langle \log \frac{p(x,z)}{q(x,z)} \right\rangle = \left\langle \log p(x) - \log q(x|z) + \log \frac{p(z|x)}{q(z)} \right\rangle &\geq \left\langle \log \frac{p(x)}{q(x)} \right\rangle \\ \left\langle -\log q(x|z) + \log \frac{p(z|x)}{q(z)} \right\rangle &\geq \left\langle -\log q(x) \right\rangle \\ \left\langle \log q(x) \right\rangle &\geq \left\langle \log q(x|z) - \log \frac{p(z|x)}{q(z)} \right\rangle \end{align}$$
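To see the final inequality hold on something concrete, here is a sketch with a two-state image $x$ and a two-state latent $z$. Every probability table below is invented for illustration. It evaluates the expected log-likelihood, the bound, and the gap between them, which should equal the expected KL from the encoder $p(z|x)$ to the model's posterior $q(z|x)$.

```python
import math

p_x = [0.3, 0.7]                           # data distribution p(x)
p_z_given_x = [[0.8, 0.2], [0.1, 0.9]]     # encoder p(z|x), rows indexed by x
q_z = [0.5, 0.5]                           # prior q(z)
q_x_given_z = [[0.6, 0.4], [0.25, 0.75]]   # decoder q(x|z), rows indexed by z

pairs = [(x, z) for x in range(2) for z in range(2)]

# Model marginal q(x) = sum_z q(x|z) q(z); tractable here, not in general.
q_x = [sum(q_z[z] * q_x_given_z[z][x] for z in range(2)) for x in range(2)]

# Left side of the last line: <log q(x)> under p.
log_lik = sum(p_x[x] * math.log(q_x[x]) for x in range(2))

# Right side: <log q(x|z) - log p(z|x)/q(z)> under the forward process.
elbo = sum(
    p_x[x] * p_z_given_x[x][z]
    * (math.log(q_x_given_z[z][x]) - math.log(p_z_given_x[x][z] / q_z[z]))
    for x, z in pairs
)

# Gap: expected KL from p(z|x) to the posterior q(z|x) = q(z) q(x|z) / q(x).
gap = sum(
    p_x[x] * p_z_given_x[x][z]
    * math.log(p_z_given_x[x][z] / (q_z[z] * q_x_given_z[z][x] / q_x[x]))
    for x, z in pairs
)

print(log_lik, elbo, gap)
```

The bound never needs `q_x`; I only compute it here because the toy problem is small enough to check the answer.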

At the end of the day, the hope and the dream we seem to have in doing latent variable modeling is that maybe we will somehow be more successful in learning a reverse $q(z)q(x|z)$ process to match some forward $p(x)p(z|x)$ than we would have been able to just model the density $q(x)$ directly. We are hoping that by expanding the problem, and making it a harder or larger modeling task, it'll become easier for us to optimize or learn.

Diffusion

For diffusion models, honestly, there isn't much to add except they add many more steps. The only difference is that instead of a two-step forward process, in diffusion we imagine a many-stepped (or potentially continuous) forward and reverse process.

In particular, in most diffusion models we fix the forward process to be a Markov chain: $$ p(x, z_0, z_1, z_2, \cdots, z_{T-1}, z_T) = p(x) p(z_0|x) p(z_1|z_0) \cdots p(z_T|z_{T-1}), $$ which starts with a sample from a natural image distribution $p(x)$ and then adds $T$ steps of additive Gaussian noise $p(z_t| z_{t-1}) \sim \mathcal N(\alpha_{t} z_{t-1}, \sigma_{t}^2) $.

Figure 1. The graphical model for the forward process in diffusion.

This takes an ordinary image and then adds more and more noise to it until it looks more or less indistinguishable from just isotropic Gaussian noise.⁵

Figure 2. A demonstration of the typical forward process in diffusion models.

One particularly nice thing about using Gaussians for every step of the forward process here is that the composition of a bunch of conditional Gaussians is itself Gaussian so we will have a closed form for the marginal distribution at any intermediate time: $$ p(z_t|x) = \mathcal N(\tilde \alpha_t x, \tilde \sigma_t^2 I ).$$
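A one-dimensional sketch of that closed form follows. I'm assuming each step has the form $z_t = \alpha_t z_{t-1} + \sigma_t \epsilon$ (and that the first step from $x$ has the same form), in which case the mean coefficient is the running product of the $\alpha$s and the variance obeys $\tilde\sigma_t^2 = \alpha_t^2 \tilde\sigma_{t-1}^2 + \sigma_t^2$. The particular schedule, seed, and sample count are made up for the demonstration.

```python
import math
import random

def marginal_params(alphas, sigmas):
    """Compose the steps: return (alpha_tilde, sigma_tilde) for p(z_t | x)."""
    a_tilde, var_tilde = 1.0, 0.0
    for a, s in zip(alphas, sigmas):
        a_tilde *= a
        var_tilde = a * a * var_tilde + s * s
    return a_tilde, math.sqrt(var_tilde)

def run_chain(x, alphas, sigmas, rng):
    """Apply every noising step one at a time."""
    z = x
    for a, s in zip(alphas, sigmas):
        z = a * z + s * rng.gauss(0.0, 1.0)
    return z

def mean_and_var(values):
    m = sum(values) / len(values)
    return m, sum((v - m) ** 2 for v in values) / len(values)

# Hypothetical schedule: 50 identical variance-preserving steps.
alphas = [math.sqrt(0.98)] * 50
sigmas = [math.sqrt(0.02)] * 50

rng = random.Random(0)
x, n = 2.0, 10_000
a_tilde, s_tilde = marginal_params(alphas, sigmas)

stepwise = [run_chain(x, alphas, sigmas, rng) for _ in range(n)]
one_shot = [a_tilde * x + s_tilde * rng.gauss(0.0, 1.0) for _ in range(n)]

print(mean_and_var(stepwise))
print(mean_and_var(one_shot))
print(a_tilde * x, s_tilde ** 2)  # the closed-form mean and variance
```

Both sampling routes should land on the same mean and variance up to noise, but the one-shot version costs a single Gaussian draw instead of fifty.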

With a forward process defined, we parameterize or learn the reverse process, a Markov chain that operates in the opposite direction: $$ q(x,z_0,z_1,\cdots,z_T) = q(z_T) q(z_{T-1}|z_T) \cdots q(z_1|z_2)q(z_0|z_1)q(x|z_0) $$

Figure 3. The graphical model for the reverse process in diffusion.

The VDM loss is⁶ simply the KL between these two joints, which serves as an upper bound on the KL of the image marginals: $$ \left\langle \log \frac{p(x,z_0,z_1,\cdots,z_T)}{q(x,z_0,z_1,\cdots,z_T)} \right\rangle \geq \left\langle \log \frac{p(x)}{q(x)}\right\rangle $$

Just as in the case of a VAE, here, the hope is that it might actually be easier to model the larger joint distribution than it was to try to model the density directly. In the case of simple diffusion models, the forward process is fixed additive Gaussian noise. If we make enough steps in the forward process we believe we ought to be able to learn the reverse process exactly.⁷

Various Sundry Tricks

The joint KL is equivalent to the VDM loss. However, in practice, to make this loss efficient to train, diffusion models leverage a lot of the known structure of the forward process to power a very clever parameterization of the reverse process. This requires some tricky rearranging of terms and some stochastic approximation to make the whole thing efficient. To see the code, please check out the example colab as well as its accompanying text, which walks through some of these steps in more detail.

To utilize our knowledge of the forward process, we're actually going to rewrite the forward process not as a sequence of conditional Gaussian steps (a bottom-up forward process): $$ p(x,z_0,z_1,z_2,\cdots,z_T) = p(x) p(z_0|x) p(z_1|z_0) p(z_2|z_1) \cdots p(z_T|z_{T-1}) $$ but instead we'll rearrange this to be a product of a bunch of conditional reverse steps (as a top-down forward process): $$ \begin{align} p(x, z_0, z_1, z_2,\cdots, z_T) &= p(z_0,z_1,z_2,\cdots, z_T|x) p(x) \\ &= p(z_0|z_1,\cdots,z_T,x)p(z_1|z_2,\cdots,z_T,x)\cdots p(z_T|x)p(x) \\ &= p(z_0|z_1,x)p(z_1|z_2,x)\cdots p(z_{T-1}|z_{T},x)p(z_T|x)p(x) \end{align}$$ For the Gaussian diffusion, we can analytically figure out what these conditional reverse steps should be for the forward process $p(z_{t-1}|z_t,x)$. These distributions give the probability of seeing a particular noisy image at the previous step, $z_{t-1}$, if we get to observe both the current noisy image $z_t$ and the original image $x$.

Figure 4. The graphical model for the top-down forward process in diffusion.

We'll then parameterize our reverse process $q(z_{t-1}|z_t)$ to have this same functional form: $$ q(z_{t-1}|z_t) \leftarrow p(z_{t-1}|z_t, \hat x(z_t, t)). $$ We'll model the reverse process as if it were the exact reversed conditional forward process, but of course, for the true reverse process we don't get to observe the true original image. Still, we'll use the same functional form, it's just we'll spend our modeling budget on trying to impute the original clean image $\hat x$ after observing the noisy image $z_t$ and which step we are on $t$.

The actual parametric model in a diffusion model is this bit, $\hat x(z_t, t)$. It is a neural network that takes as input the noisy image $z_t$ and the step we are on in the diffusion process $t$ and has the job of trying to predict what the corresponding clean image was that generated the noisy image. In most diffusion models this is implemented as a U-Net style architecture. In practice, it's been found that if instead of predicting the clean image $\hat x$, you predict the noise $\hat \epsilon$ from the noisy image, you get better-looking samples.⁸ The full reverse generative model then consists of many steps of looking at a noisy image and trying to infer the clean one; rinse and repeat.
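Predicting the noise and predicting the clean image are two views of the same thing. If we write the closed-form marginal as $z_t = \tilde\alpha_t x + \tilde\sigma_t \epsilon$, then a guess for one immediately gives a guess for the other. The sketch below uses scalars and hypothetical function names just to show the bookkeeping; a real model works on image tensors and gets its guess from a network.

```python
def x_from_eps(z_t, eps_hat, alpha_tilde, sigma_tilde):
    """Clean-image estimate implied by a noise estimate."""
    return (z_t - sigma_tilde * eps_hat) / alpha_tilde

def eps_from_x(z_t, x_hat, alpha_tilde, sigma_tilde):
    """Noise estimate implied by a clean-image estimate."""
    return (z_t - alpha_tilde * x_hat) / sigma_tilde

# Made-up numbers: a clean value, a noise draw, and marginal parameters.
x, eps = 0.7, -1.3
alpha_tilde, sigma_tilde = 0.6, 0.8
z_t = alpha_tilde * x + sigma_tilde * eps

print(x_from_eps(z_t, eps, alpha_tilde, sigma_tilde))  # recovers x up to rounding
print(eps_from_x(z_t, x, alpha_tilde, sigma_tilde))    # recovers eps up to rounding
```

Since the two are related by a fixed linear map at each step, the choice between them mostly changes how errors get weighted across noise levels rather than what the network could represent in principle.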

With these choices in place, we can now look at the full joint KL and organize terms.

$$ \left\langle \log p(x) - \log q(x|z_0) + \log \frac{p(z_T|x)}{q(z_T)} + \sum_{i=0}^{T-1} \log \frac{p(z_i|z_{i+1},x)}{q(z_i|z_{i+1})} \right\rangle_p $$

The last trick we're going to use is that we're going to avoid computing all of the terms in our sum by simply not computing all of the terms in our sum. We'll approximate the sum with Monte Carlo: we'll simply randomly choose one of the terms and upweight it appropriately. At that point, we have the loss function used to train VDM models. A very nice thing about the VDM loss is that it is clear that we are optimizing a bound on the marginal likelihood of our generative model. As you can learn in the VDM Paper, many of the diffusion models you've heard about correspond to a weighted form of this same objective, where different terms in the sum get different weights.
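The "pick one term and upweight it" trick is easy to check in isolation. In this sketch the list `terms` is a stand-in for the per-step losses, filled with arbitrary numbers; in a real diffusion model each term would cost a network evaluation, which is the whole reason to avoid computing them all.

```python
import random

def one_term_estimate(terms, rng):
    """Unbiased estimate of sum(terms): pick one index uniformly, scale by the count."""
    i = rng.randrange(len(terms))
    return len(terms) * terms[i]

rng = random.Random(0)
terms = [1.0 / (t + 1) for t in range(100)]  # stand-in per-step loss values
exact = sum(terms)

n = 20_000
average = sum(one_term_estimate(terms, rng) for _ in range(n)) / n

print(exact, average)  # the average of many single-term estimates approaches the sum
```

Any single estimate can be far off, but it is right on average, and that is all stochastic gradient descent asks for. The more alike the terms are, the lower the variance of the estimate.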

After going through all of the fancy math, the analytic KL divergences involved in the diffusion loss simplify quite nicely: $$ \left\langle \log p(x) - \log q(x|z_0) + \log \frac{p(z_T|x)}{q(z_T)} + \frac 1 2 \sum_{t=0}^{T-1} \beta_t \left\lVert \epsilon - \hat \epsilon(z_t,t) \right\rVert^2 \right\rangle $$ For variational diffusion the weight terms $\beta_t$ depend on your choice of noise schedule. For most other diffusion models in the wild, these $\beta_t$ weights are conventionally set to 1.

Closing Thoughts

So, why are diffusion models so interesting? Well, first and foremost, the reason they are drawing so much attention is that they have shown tremendous performance. It feels like for the first time we have models that are able to generate very high resolution, very high fidelity natural images. Projects like DALL-E2, Imagen, and Stable Diffusion show really impressive results. What is the magic driving these models?

At a high level, I think we can say that diffusion models start to realize the dream of latent variable models. Sometimes, when you are faced with a problem that is too difficult, you can crack it if you consider an even harder, related problem. As I tried to demonstrate here, even for simple latent variable models like VAEs and especially for diffusion models, one reason we can point to for their success is that instead of directly modeling the distribution over images, they model a much larger joint distribution. That larger joint distribution is strictly speaking a bigger thing to attempt to model, but here we get to design the forward process in such a way that even if there are many pieces to the forward process, those pieces individually are easier to tackle.

However, if that were the case, shouldn't we have expected deep hierarchical models to perform similarly awesomely? Probably, though here I think there is another real trick that diffusion has up its sleeve. For a general deep hierarchical generative model, even if by splitting the problem up into smaller pieces you might have split it up into easier-to-model tasks, to evaluate the joint KL you still need to evaluate all of those terms. That is, as your model becomes richer and more computationally expressive because of its depth, so does the cost of training your model, as you have to evaluate all of the layers at each step in the training process.

Diffusion models avoid this by structuring their forward process in such a way that all of the steps share a great deal of structural similarity. This allows diffusion to approximate a sum of a potentially large number of steps by a single randomly chosen step. If each step looks more or less the same, you can get a good estimate for the whole sum by looking at an individual, random, term.

The last trick up its sleeve is, even if you managed to design a deep hierarchical generative model with this structural homogeneity property, if you wanted to get to some intermediate position in the hierarchy you'd still have to run roughly half of the full forward process. That would still be expensive in general. Here, diffusion avoids that entirely.
As boring as a sequence of conditional Gaussians is as a forward process, it is also beautiful: it enables exact analytic marginalization to intermediate steps. You can very quickly mimic the result of adding hundreds of steps of additive Gaussian noise by simply adding a moderate amount of Gaussian noise in a single shot.

So, ultimately, what do I think is one of the main reasons diffusion models do so well? I think it's because they can do so well! I think it's because they are very powerful, expressive, generative models. Sampling from them is generally rather expensive. Drawing a sample means running the full reverse process, which might mean calling the central score net a thousand or so times. That is a very powerful and very expressive generative model, but magically, we can train that generative model's likelihood without ever having to actually instantiate the full generative process at training time due to our set of sundry tricks.

I'm excited to see where this all goes and hope this post and the colab help to introduce these magical models to a wider audience.

Special thanks to Ben Poole, Pavel Izmailov, Christopher Suter, Sergey Ioffe, and Ian Fischer for helpful feedback on this post.

Non-equilibrium Thermodynamics Results Seemingly from Nothing

Fri, 16 Sep 2022 00:00:00 -0400

Let's see if we can very quickly prove the Jarzynski Equality and related non-equilibrium statistical mechanics results. Much like the underpinnings of thermodynamics are mathematically pretty simple, e.g. the existence of a convex surface on which mixed partial derivatives commute, I believe most of the results in non-equilibrium statistical mechanics are similarly due to a rhetorical reinterpretation of a simple mathematical manipulation.

This post will assume some familiarity with physics.

Basic Facts

The underlying math in our case consists of two facts. First, that probability distributions are normalized: $$ \int dx\, p(x) = 1, $$

and second, that KL divergence is non-negative:¹ $$ \int dx\, p(x) \log \frac{p(x)}{q(x)} \geq 0. $$

Density Ratios

To generate the classic non-equilibrium statistical mechanics results we start by considering a simple ratio of two joint probability distributions: $$ \frac{q(x_0, x_1)}{p(x_0, x_1)} $$ Clearly we have a tremendous freedom here in our choices for the distributions $p$ and $q$. Mathematically it's uninteresting but we can start to build some rhetorical weight by factoring our two distributions in two distinct ways: $$ \frac{q(x_1) q(x_0|x_1)}{p(x_0)p(x_1|x_0)} $$ Despite still not having done anything, we can start to build an interpretation here. Imagine $x_0$ and $x_1$ as being two configurations of a system, with $x_1$ happening after $x_0$. Now, though we're allowed by the chain rule to factor distributions any way we wish, here we've chosen to factor $p$ to be suggestive of some kind of forward process wherein we first sample some $x_0$ from a distribution $p(x_0)$ and then evolve it according to some potentially stochastic process to generate our next state $x_1$ conditioned on the first: $p(x_1|x_0)$. At the same time, we've factored $q$ the other way, evocative of a reverse process that starts at $x_1$ and then evolves backward to $x_0$.

To make further progress, let's specialize a bit. Let's imagine that $x_0$ and $x_1$ are configurations of a physical system evolving according to Hamiltonian dynamics, with a Hamiltonian governed by some kind of control parameter $\lambda$. Let's further imagine that at the beginning of either our forward or reverse process our system is in thermodynamic equilibrium at the same temperature, and in particular in a canonical ensemble:²

$$ \begin{align} p(x_0) &= \frac{1}{Z(\beta,\lambda_0)} e^{-\beta H(x_0, \lambda_0)} \\ q(x_1) &= \frac{1}{Z(\beta, \lambda_1)} e^{-\beta H(x_1, \lambda_1)}. \end{align} $$

Simply substituting these expressions into our density ratio we find:

$$ \frac{q(x_0,x_1)}{p(x_0,x_1)} = \frac{Z(\beta,\lambda_0)}{Z(\beta, \lambda_1)} e^{-\beta \left( H(x_1,\lambda_1) - H(x_0, \lambda_0) \right)} \frac{q(x_0|x_1)}{p(x_1|x_0)}. $$

We can clean this up a bit and give it a cleaner physical interpretation. Let's identify the decrease in the Hamiltonian with the work: $$ W \equiv H(x_0, \lambda_0) - H(x_1,\lambda_1). $$ And let's use the standard definition of the free energy: $$ \beta F = -\log Z, $$ to rewrite the ratio of partition functions as a difference in free energies, $\Delta F \equiv F(\lambda_0) - F(\lambda_1)$: $$ e^{-\beta \Delta F} = e^{\log Z(\beta,\lambda_0) -\log Z(\beta,\lambda_1)} = \frac{Z(\beta,\lambda_0)}{Z(\beta,\lambda_1)}. $$ With these sign conventions $W$ is the energy the system gives up, and $\Delta F$ the drop in free energy, between the start and the end of the forward process. Combining these results gives: $$ \frac{q(x_0,x_1)}{p(x_0,x_1)} = e^{\beta (W - \Delta F)} \frac{q(x_0|x_1)}{p(x_1|x_0)}. $$ I'm going to anticipate some of the things we're going to talk about below and define the log of the forward over the reverse transition probabilities as ($\beta$ times) the heat: $$ \beta Q \equiv \log \frac{p(x_1|x_0)}{q(x_0|x_1)}. $$ With this final identification we end up with the general statement: $$ \frac{q_R}{p_F} = e^{\beta (W - Q - \Delta F)}. $$ The density ratio of the reverse process (shortened here as $q_R$) to the forward process $p_F$ is given by the exponential of $\beta$ times the work minus the heat minus the change in free energy.

Hamiltonian Dynamics

First, if we assume that our dynamics is Hamiltonian, and thus deterministic and reversible, we know that the probability that we start at $x_0$ and end up at $x_1$ if we evolve forward in time is the same as the probability that we start at $x_1$ and end up at $x_0$ if we reverse our time evolution, ($q(x_0|x_1) = p(x_1|x_0)$)³

so the ratio of conditional probabilities actually cancels and we generate the Crooks Fluctuation Theorem: $$ \frac{q_R}{p_F} = e^{\beta (W - \Delta F)}. $$ The ratio of the reverse process probability to the forward probability for a given initial and final point is given by the exponential $e^{\beta (W - \Delta F)}$. If we now take the expectation of this ratio with respect to the forward process, we generate the Jarzynski equality:⁴ $$ \int dx_0\, dx_1\, p(x_0,x_1) \frac{q(x_0,x_1)}{p(x_0,x_1)} = 1 = \left\langle e^{\beta (W - \Delta F)} \right\rangle_p, $$ which simplifies to⁵: $$ \left\langle e^{\beta W}\right\rangle_p = e^{\beta \Delta F}. $$ Flipping the sign convention of both $W$ and $\Delta F$ gives the more commonly quoted form $\left\langle e^{-\beta W}\right\rangle_p = e^{-\beta \Delta F}$. So, recapping, what have we just done? Since we can take density ratios of arbitrary probability distributions, we could choose those two densities to mean something we care about. Consider $p$ the forward, Hamiltonian evolution of a system from $x_0$ to $x_1$ and $q$ the reverse process. If we imagine that both the forward and reverse processes start in a state of canonical equilibrium, we can generate both the Crooks Fluctuation Theorem and the Jarzynski equality.

The power of this result is that it allows us to relate an expectation computed with respect to non-equilibrium processes (the exponential of the beta weighted stochastic work needed for a bunch of non-equilibrium realizations of our trajectory) to a pure equilibrium quantity (a difference of equilibrium free energies). In the context of the physical sciences, this lets us perform non-equilibrium simulations or experiments, and provided we measure the work performed over many such runs, even with the system driven far from equilibrium, we can estimate equilibrium free energy differences.
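To make this concrete, here is a small numerical sketch for a toy system that is my own assumption, not something from the post: a one-dimensional harmonic well whose stiffness is switched suddenly from `k0` to `k1`, so the configuration does not move during the switch. Stated without reference to any sign convention, the equality says that the forward average of $e^{-\beta (H_1 - H_0)}$ equals the ratio of partition functions $Z(\lambda_1)/Z(\lambda_0)$.

```python
import math

beta = 1.0           # inverse temperature (arbitrary units)
k0, k1 = 1.0, 4.0    # stiffness before and after the sudden switch

def energy(x, k):
    """Potential energy of a harmonic well with stiffness k."""
    return 0.5 * k * x * x

def integrate(f, lo=-12.0, hi=12.0, n=20_000):
    """Plain midpoint-rule quadrature; adequate for smooth, decaying integrands."""
    dx = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * dx) for i in range(n)) * dx

# Partition functions of the initial and final equilibrium ensembles.
Z0 = integrate(lambda x: math.exp(-beta * energy(x, k0)))
Z1 = integrate(lambda x: math.exp(-beta * energy(x, k1)))

def p0(x):
    """Canonical density the forward process starts from."""
    return math.exp(-beta * energy(x, k0)) / Z0

# In a sudden switch x_1 = x_0, so the energy change along a trajectory
# is H(x, k1) - H(x, k0).
avg_exp = integrate(lambda x: p0(x) * math.exp(-beta * (energy(x, k1) - energy(x, k0))))
avg_change = integrate(lambda x: p0(x) * (energy(x, k1) - energy(x, k0)))
free_energy_change = -math.log(Z1 / Z0) / beta   # F(k1) - F(k0)

print(avg_exp, Z1 / Z0)                 # these two should agree
print(avg_change, free_energy_change)   # average energy put in vs. free energy gained
```

The first line of output checks the equality itself. The second shows the accompanying inequality: the average energy put in during the far-from-equilibrium switch is at least as large as the equilibrium free energy difference.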

Stochastic Dynamics

But let's say you don't like the assumption that the dynamics are Hamiltonian. We can imagine something else: stochastic dynamics, discretized in time. We still need to make some kind of assumption; in this case, we'll imagine that our process consists of $N$ steps, each of which is governed by a Markov transition kernel. Finally, we'll assume that each transition kernel has a stationary distribution and satisfies detailed balance.

What this means is that we'll imagine that our forward process now takes the form: $$ \begin{align} p_F &= p(x_0) p(x_1|x_0) p(x_2|x_1) \cdots p(x_N|x_{N-1}) \\ &= p(x_0) T_1(x_1|x_0) T_2(x_2|x_1) \cdots T_N(x_N|x_{N-1}) \end{align} $$ Here we've denoted the intermediate conditional distributions as being governed by our transition kernels, labeled with the corresponding stationary distribution. Saying that our kernels have a stationary distribution that they respect according to detailed balance means that: $$ T_k(x'|x) \sigma_k(x) = T_{k}(x|x') \sigma_k(x'), $$ for the stationary distribution $\sigma_k$.

We've defined our forward process, now we need to define our reverse process. We'll imagine that the reverse process is governed by the same transition kernels but running in reverse:⁶

$$ \begin{align} q_R &= q(x_N) q(x_{N-1}|x_N) \cdots q(x_1|x_2) q(x_0|x_1) \\ &= q(x_N) T_{N}(x_{N-1}|x_N) \cdots T_2(x_1|x_2) T_1(x_0|x_1). \end{align} $$

Now if we look at the ratio of our reverse to our forward process, things simplify a bit: $$ \begin{align} \frac{q_R}{p_F} &= \frac{q(x_N)T_N(x_{N-1}|x_N)\cdots T_2(x_1|x_2)T_1(x_0|x_1)}{p(x_0)T_1(x_1|x_0)T_2(x_2|x_1)\cdots T_N(x_N|x_{N-1})} \\ &= \frac{q(x_N)}{p(x_0)} \frac{T_1(x_0|x_1)}{T_1(x_1|x_0)} \frac{T_2(x_1|x_2)}{T_2(x_2|x_1)} \cdots \frac{T_N(x_{N-1}|x_N)}{T_N(x_N|x_{N-1})} \\ &= \frac{q(x_N)}{p(x_0)} \frac{\sigma_1(x_0)}{\sigma_1(x_1)} \frac{\sigma_2(x_1)}{\sigma_2(x_2)} \cdots \frac{\sigma_N(x_{N-1})}{\sigma_N(x_N)} . \end{align} $$

Finally, as we did above, let's imagine that all of these marginal distributions take the form of a canonical distribution.⁷

$$ \begin{align} q(x_N) &\equiv \frac{1}{Z_N} e^{-\beta H_N} \\ p(x_0) &\equiv \frac{1}{Z_0} e^{-\beta H_0} \\ \sigma_k(x_j) &\equiv \frac{1}{Z_k} e^{-\beta E_k(x_j)}. \end{align} $$ Notice that the nice simplification that happens here is that since we imagined our reverse process as being the reverse of the forward process, in all but one of these fractions, the partition function of the intermediate stationary processes will cancel out. Putting this all together we obtain the general result: $$ \frac{q_R}{p_F} = e^{\beta(W - Q - \Delta F)}, $$ if we identify $W$ with the total energy given up by the system ($H_0-H_N$), $\Delta F$ with the ratio of the partition functions (as above, $-\beta \Delta F = \log Z_0/Z_N$) and now identify the heat as the energy given up in each of the intermediate processes:⁸ $$ Q \equiv \sum_{k=1}^{N} Q_k \qquad Q_k = -\Delta E_k = E_k(x_{k-1}) - E_k(x_k) . $$ And I believe we've done it. Taking the expectation of this quantity with respect to the forward process will give us the Jarzynski equality again⁹: $$ \left\langle e^{\beta(W - Q)} \right\rangle = e^{\beta \Delta F}. $$

Taking the logarithm of the ratio and then the expectation gives us the KL divergence between the forward and reverse processes, which we know must be non-negative: $$ D(p_F; q_R) = \left\langle \log \frac{p_F}{q_R} \right\rangle_F = -\beta \left\langle W - Q \right\rangle + \beta \Delta F \geq 0 $$ which naturally generates the inequality (a version of the second law): $$ \Delta F \geq \left\langle W - Q \right\rangle. $$ As a reminder, in this case our initial distributions were canonical, but we generalized the dynamics to any sequence of Markovian transition kernels, provided only that each kernel satisfies detailed balance with respect to its stationary distribution.
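The stochastic construction can be checked by brute force on a tiny system. The sketch below is an illustration under assumptions of my own: three states, two steps, made-up energy levels, and Metropolis kernels as one convenient way to satisfy detailed balance. It enumerates every path and checks that the transition kernels telescope into ratios of stationary distributions, $\sigma_k(x_{k-1})/\sigma_k(x_k)$, that the ratio averages to one under the forward process, and that the KL divergence is non-negative.

```python
import itertools
import math

beta = 1.0
n_states = 3
# Hypothetical energy levels (assumptions for illustration).
H0 = [0.0, 1.0, 2.0]                       # energies defining p(x_0)
HN = [1.5, 0.0, 0.5]                       # energies defining q(x_N)
E = [[0.0, 0.7, 1.4], [0.9, 0.2, 0.6]]     # energies of sigma_k for k = 1, 2

def boltzmann(energies):
    weights = [math.exp(-beta * e) for e in energies]
    Z = sum(weights)
    return [w / Z for w in weights]

def metropolis_kernel(sigma):
    """T[x][y] is the probability of x -> y; detailed balance holds for sigma."""
    n = len(sigma)
    T = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if y != x:
                T[x][y] = min(1.0, sigma[y] / sigma[x]) / n
        T[x][x] = 1.0 - sum(T[x])
    return T

p0, qN = boltzmann(H0), boltzmann(HN)
sigmas = [boltzmann(e) for e in E]
kernels = [metropolis_kernel(s) for s in sigmas]

total_ratio = 0.0   # accumulates <q_R / p_F> under the forward process
kl = 0.0            # accumulates <log p_F / q_R> under the forward process
max_error = 0.0     # worst mismatch between the path ratio and the telescoped form
for path in itertools.product(range(n_states), repeat=len(kernels) + 1):
    p_F, q_R = p0[path[0]], qN[path[-1]]
    telescoped = qN[path[-1]] / p0[path[0]]
    for k, T in enumerate(kernels):
        a, b = path[k], path[k + 1]
        p_F *= T[a][b]            # forward step x_{k-1} -> x_k
        q_R *= T[b][a]            # reverse step x_k -> x_{k-1}
        telescoped *= sigmas[k][a] / sigmas[k][b]
    max_error = max(max_error, abs(q_R / p_F - telescoped))
    total_ratio += p_F * (q_R / p_F)
    kl += p_F * math.log(p_F / q_R)

print(max_error, total_ratio, kl)
```

Exhaustive enumeration only works for toy sizes, but it avoids sampling noise, so the identities hold to floating point precision.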

Generalized Landauer Bound

Wolpert says that, from stochastic thermodynamics we know:

\begin{equation} -\Delta Q = \Delta \Sigma + S(p_0) - S(p_1) \end{equation}

Which, with $\Delta \Sigma \geq 0$ gives us the generalized Landauer bound

\begin{equation} -\Delta Q \geq S(p_0) - S(p_1) \end{equation}

For the classic case of bit erasure the change in entropy is $\log 2$ and we get Landauer's bound:

\begin{equation} -\Delta Q \geq kT \log 2 \end{equation}

So, where does this come from? Honestly, it doesn't seem like there is much to it. Imagine two joint distributions $p(x_0, x_1)$ and $q(x_0, x_1)$ describing a forward and reverse process that moves between two states. The KL divergence between these two is non-negative and monotonic, in that marginalizing can only decrease it:

\begin{equation} \left\langle \log \frac{p(x_0,x_1)}{q(x_0,x_1)} \right\rangle_p \geq \left\langle \log \frac{p(x_1)}{q(x_1)} \right\rangle \geq 0 \end{equation}

Subtracting $\langle \log p(x_1)/q(x_1) \rangle$ from both sides, we first find the entropy production: \begin{equation} \Delta\Sigma \equiv \left\langle \log \frac{p(x_1|x_0)p(x_0)}{q(x_0|x_1)p(x_1)} \right\rangle \geq 0 \end{equation}

and we can establish the identity: \begin{equation} \left\langle \log \frac{p(x_1|x_0)p(x_0)}{q(x_0|x_1)p(x_1)} \right\rangle_p = \left\langle \log \frac{p(x_1|x_0)}{q(x_0|x_1)} \right\rangle_p + \left\langle \log \frac{p(x_0)}{p(x_1)} \right\rangle_p \end{equation}

If we simply identify terms, we recover the Wolpert form:

\begin{equation} \Delta \Sigma = -\Delta Q + S(p_1)-S(p_0) \end{equation}

To make these identifications, we can see that: \begin{equation} S(p_0) = -\left\langle \log p(x_0) \right\rangle \qquad S(p_1) = -\left\langle \log p(x_1) \right\rangle \end{equation}

And for the entropy rate: \begin{equation} -\Delta Q \equiv \left\langle \log \frac{p(x_1|x_0)}{q(x_0|x_1)} \right\rangle \end{equation} which appears to be the likelihood ratio of our forward and reverse conditional processes, i.e. some characterization of the irreversibility of our system.
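Since this whole derivation is just bookkeeping on two joint distributions, it can be checked numerically. The following sketch draws two random joints on a 3×3 grid (the seed and sizes are arbitrary choices of mine) and confirms that the entropy production is non-negative and decomposes as $-\Delta Q + S(p_1) - S(p_0)$.

```python
import math
import random

random.seed(0)
n = 3

def random_joint(n):
    """A random strictly positive joint distribution over (x0, x1)."""
    w = [[random.random() + 0.05 for _ in range(n)] for _ in range(n)]
    total = sum(map(sum, w))
    return [[v / total for v in row] for row in w]

p = random_joint(n)   # forward joint p(x0, x1), indexed p[x0][x1]
q = random_joint(n)   # reverse joint q(x0, x1), same indexing

p0 = [sum(p[i]) for i in range(n)]
p1 = [sum(p[i][j] for i in range(n)) for j in range(n)]
q1 = [sum(q[i][j] for i in range(n)) for j in range(n)]

pairs = [(i, j) for i in range(n) for j in range(n)]
kl_joint = sum(p[i][j] * math.log(p[i][j] / q[i][j]) for i, j in pairs)
kl_marginal = sum(p1[j] * math.log(p1[j] / q1[j]) for j in range(n))
entropy_production = kl_joint - kl_marginal

# <log p(x1|x0) / q(x0|x1)>_p, the term the text calls -Delta Q.
minus_dQ = sum(
    p[i][j] * math.log((p[i][j] / p0[i]) / (q[i][j] / q1[j])) for i, j in pairs
)
S0 = -sum(v * math.log(v) for v in p0)
S1 = -sum(v * math.log(v) for v in p1)

print(entropy_production, minus_dQ + S1 - S0)   # the two should agree
```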

If we happen to be in a system that satisfies local detailed balance, we know that there should be some kind of steady state distribution for which: \begin{equation} p(x_1|x_0) \pi(x_0) = q(x_0|x_1) \pi(x_1) \end{equation} so that: \begin{equation} \log \frac{p(x_1|x_0)}{q(x_0|x_1)} = \log \frac{\pi(x_1)}{\pi(x_0)} \end{equation} and if we further imagine that the steady state distribution is Boltzmann-like and the system is in contact with some kind of heat bath, we see that: \begin{equation} \log \frac{\pi(x_1)}{\pi(x_0)} = \log \frac{\frac{1}{Z_1}e^{-\beta H_1}}{\frac{1}{Z_0} e^{-\beta H_0}} = \log \frac{Z_0}{Z_1} - \beta (H_1 - H_0) = \beta \Delta F - \beta \Delta U = \Delta Q \end{equation} so we can identify the log ratio of the forward to the reverse transition probabilities as the heat flow from the bath.

Variational Autoencoder

To show some of the generality of what we're doing here, let's do it again but for a completely different kind of system, this time a Variational Autoencoder. In a variational autoencoder there are two joint distributions at play, one a representational model $p(x,z) = p(x) p(z|x)$ which starts with a draw from some true data distribution $p(x)$ and then uses an encoder to map that datum to some kind of representative code, or summary, or representation $z$: $p(z|x)$. The other joint distribution consists of a generative model $q(x,z) = q(z)q(x|z)$ that imagines a joint distribution over the same space but works in reverse. First, we generate a latent variable $z$ from some prior distribution $q(z)$ and then we use a decoder to stochastically turn that latent variable into a generated datum $x$: $q(x|z)$.

We can easily imagine the ratio of these two densities: $$ \frac{q(x,z)}{p(x,z)} = \frac{q(z)q(x|z)}{p(x)p(z|x)}. $$

As we saw above, the way to generate an inequality here is to turn this into a KL divergence: $$ \begin{align} D( p(x,z) ; q(x,z) ) &= \left\langle \log \frac{p(x) p(z|x)}{q(z) q(x|z)} \right\rangle_p \\ &= -\left\langle -\log p(x) \right\rangle_p + \left\langle -\log q(x|z) \right\rangle_p + \left\langle \log \frac{p(z|x)}{q(z)} \right\rangle_p \\ &\equiv -H + D + R \geq 0 \end{align} $$ Here, just as above, we've only rearranged terms, but this time organized them into three contributions. The first is the entropy of the true data generating process: $$ H \equiv \left\langle -\log p(x) \right\rangle_p, $$ the second is the distortion, a measure of how well we recover the image we started with when we encode and then decode it: $$ D \equiv \left\langle - \log q(x|z) \right\rangle_p = -\int dx\, p(x) \int dz\, p(z|x) \log q(x|z), $$ and the third is the rate, a measure of the excess cost required to communicate the message $z$ over a wire designed to be optimal for the prior $q(z)$: $$ R \equiv \left\langle \log \frac{p(z|x)}{q(z)} \right\rangle_p = \left\langle D(p(z|x); q(z)) \right\rangle_{p(x)}. $$ We've just rederived the ELBO¹⁰

rendered in the form presented in Fixing a Broken ELBO¹¹ $$ \textsf{ELBO} \equiv D + R \geq H. $$
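For a discrete toy problem the three terms can be computed exactly. The sketch below uses small made-up tables for the data distribution, encoder, prior and decoder (all hypothetical) and checks that $D + R - H$ equals the joint KL divergence, and hence that $D + R \geq H$.

```python
import math

# Hypothetical toy distributions (assumptions for illustration only).
p_x = [0.5, 0.3, 0.2]                              # data distribution p(x)
enc = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]         # encoder p(z|x), rows indexed by x
q_z = [0.5, 0.5]                                   # prior q(z)
dec = [[0.7, 0.2, 0.1], [0.1, 0.4, 0.5]]           # decoder q(x|z), rows indexed by z

pairs = [(x, z) for x in range(3) for z in range(2)]
joint_p = {(x, z): p_x[x] * enc[x][z] for x, z in pairs}   # p(x, z) = p(x) p(z|x)
joint_q = {(x, z): q_z[z] * dec[z][x] for x, z in pairs}   # q(x, z) = q(z) q(x|z)

H = -sum(px * math.log(px) for px in p_x)                                  # entropy
D = -sum(joint_p[x, z] * math.log(dec[z][x]) for x, z in pairs)            # distortion
R = sum(joint_p[x, z] * math.log(enc[x][z] / q_z[z]) for x, z in pairs)    # rate
kl = sum(joint_p[k] * math.log(joint_p[k] / joint_q[k]) for k in pairs)

print(H, D, R, kl)   # D + R - H should equal kl
```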

Conclusion

We've managed to derive several non-equilibrium statistical mechanical equalities and inequalities seemingly from nothing. All of these results were powered by the facts we opened with, that probability distributions integrate to one and that KL divergences are positive. The only challenge here was one of semantics. To get power out of such trivial mathematical manipulations required us to make judicious choices in how we interpreted them.

Special thanks to Sam Schoenholz, Srinivas Vasudevan, Yasaman Bahri and Jim Sethna for helpful feedback on this post.

Ventilation

Wed, 29 Nov 2023 00:00:00 -0500

$ \newcommand{\coo}{\mathrm{CO_2}} $

We recently bought an Airthings so I have been geeking out looking at plots of our air quality.

In general, we don't seem to have any issues with our indoor air, except that when we use the stovetop without the exhaust on, it will spike our PM2.5, and in general we have higher $\coo$ concentrations indoors.

There are four of us in the house, and since we homeschool and I work from home, there is a lot of $\coo$ generation happening and I've tried to seal up our house as well as I can to make it energy efficient.

This got me thinking about how to model the $\coo$ concentration in our house.

The Model

Let's start with a basic model, we know that we have sources in the form of the four humans in the house, and that we exchange air with the outside air which is at something like $420 \textrm{ ppm}$.

This suggests a simple model of the form:

$$ \dot x = b - \kappa ( x - x_0 ) $$

Where we use $x$ to denote the concentration of $\coo$, $x_0$ for the outdoor concentration of $420\textrm{ ppm}$, $b$ represents the constant source from the humans in the house and $\kappa$ represents an exchange with the outside.

Here we assume that the rate of change is proportional to the difference, similar in spirit to Fourier's law. In terms of our model, this is equivalent to assuming that within some time window, we replace some fraction of the air in the house with fresh air from outside. To break it down a bit more, imagine some amount of time $\Delta t$. The change in the concentration of $\coo$ inside can be modelled as taking some fraction $\kappa \Delta t$ of the total air inside and replacing it with fresh air. This means we decrease the concentration by $\kappa x \Delta t$ and then increase it by $\kappa x_0 \Delta t$, giving us the form we see above.

I find it's easier to formulate a differential equation or physical model in terms of unitful quantities, but then easier to solve them if we take the time to non-dimensionalize (which we'll do below). Here our $x$ is in units of $\textrm{ppm}$ by volume, a dimensionless measure of concentration. We'll imagine our time variable taking on the units of $\textrm{days}$ for convenience. Then our source term $b$ has units of $\text{ppm/day}$ and measures the increase in $\coo$ concentration the four of us cause each day within the volume of our house. We could look up this number and find $900 \textrm{ gCO$_2$/day}$ produced per person.¹

Estimating Sources

Instead of looking up the number, let's see if we can estimate it. We release carbon dioxide because we respire: our body burns the carbohydrates that we eat to generate energy. Carbohydrates and sugars have an energy density of $4 \textrm{ kcal/g}$ which you can verify on the back of your favorite candy bar. The basic chemistry of respiration (and, run in reverse, photosynthesis) is the burning of these carbohydrates:

$$ \mathrm{C}_n\mathrm{H}_{2n}\mathrm{O}_n + n \mathrm{O}_2 \to n\mathrm{CO}_2 + n \mathrm{H}_2\mathrm{O} + \textrm{energy} $$

If we total up the atomic masses in this formula, we find that for every $30 \textrm{ grams}$ of carbohydrates we burn, we release $44 \textrm{ grams}$ of carbon dioxide. If we typically eat $2000 \textrm{ kcal/day}$, this works out to $730 \textrm{ g/day}$ of $\coo$ per person, a pretty good match to the numbers you'll find online.
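The arithmetic behind that per-person figure is short enough to spell out. This is only a sketch using rounded atomic masses and the diet and energy density quoted above.

```python
# Atomic masses in g/mol (rounded).
C, H, O = 12.0, 1.0, 16.0
carb_unit = C + 2 * H + O      # one CH2O unit of carbohydrate: 30 g/mol
co2 = C + 2 * O                # carbon dioxide: 44 g/mol

kcal_per_day = 2000.0          # typical diet, from the text
kcal_per_gram = 4.0            # energy density of carbohydrates, from the text

carbs_burned = kcal_per_day / kcal_per_gram          # grams of carbohydrate per day
co2_per_person = carbs_burned * co2 / carb_unit      # grams of CO2 per day
print(round(co2_per_person))   # 733
```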

What does this mean for the atmosphere in my house? Well, we have a $2100 \textrm{ft}^2$ home with $8 \textrm{ ft}$ ceilings. This gives us:

$$ \frac{730 \textrm{ g/day}}{2100 \textrm{ ft}^2 \cdot 8 \textrm{ ft} \cdot 1.225 \textrm{ kg/m}^3} = 1300 \textrm{ ppm/day}$$

In terms of the mass fraction. However, $\coo$ concentrations we read about in papers or measure on sensors are volumetric fractions. Carbon dioxide has a molar mass of $44 \textrm{ g/mol}$ while natural air is $29 \textrm{ g/mol}$², so a given mass of $\coo$ contains fewer molecules than the same mass of air, and to convert the mass fraction to a volume fraction we need to multiply by $29/44$.

In the end, we estimate that each person in my house contributes about $830 \textrm{ ppm/day}$ of volumetric $\coo$ concentration. Outdoor concentrations are $420 \textrm{ ppm}$ and my sensor turns yellow when the indoor concentration exceeds $800 \textrm{ ppm}$ and red above $1000 \textrm{ ppm}$.

Estimating $\coo$ conductivity

We've estimated $b$, but what about $\kappa$? This is a measure of how quickly we exchange air in the house. If we close the windows and have the AC occasionally run the fan, it seems like the indoor $\coo$ concentration will level out at about $1800 \textrm{ ppm}$. Meanwhile, in our model, if we solve for the steady state:

$$ \dot x = 0 = b - \kappa ( x - x_0 ) \implies x = x_0 + \frac{b}{\kappa} $$

and put in our estimates of a steady state value of $1800 \textrm{ ppm}$ and our estimate that for four humans we have $b = 4 \cdot 830 \textrm{ ppm/day} \approx 3300 \textrm{ ppm/day}$ we get an estimate of $\kappa = 3300 / (1800 - 420) \approx 2.4 \textrm{ /day}$. Honestly, this feels low. I know that most houses are supposed to have about 1 air change per hour, though here we are discussing specifically $\coo$, but I would expect the rates to be the same. I might have to have a blower door test done to see how sealed up our home is. We live in an older home which I generally expect to be fairly leaky, but we did some renovations recently and took care to try to seal up potential leaks. We may have sealed up the house too much and might have to look into installing something like an Energy Recovery Ventilation system to ensure we have fresh enough air.

Then again, we seem to be able to shed our PM2.5, VOCs and other pollutants rather quickly, so perhaps there is just an issue with our $\coo$ sensor itself. Regardless, now that we have a model we can go on to solve it.

Non-dimensionalizing

I always find it useful to nondimensionalize differential equations when I'm solving them. This means reparameterizing the equation to be in terms of only nondimensional quantities. In this case we'll form a dimensionless measure of the excess concentration, and use our $\kappa$ constant to reparameterize in terms of some relative time:

$$ \chi \equiv \frac{\kappa}{b} (x - x_0) \implies \dot \chi = \frac{\kappa}{b} \dot x $$ $$ \tau = \kappa t \implies d\tau = \kappa\, dt $$

After transforming we obtain:

$$ \frac{d\chi}{d\tau} = 1 - \chi $$

Which we can solve in typical physicist fashion:

$$ \begin{align} \frac{d\chi}{d\tau} &= 1 - \chi \\ \frac{d\chi}{1 - \chi} &= d\tau \\ \int_{\chi_0}^{\chi} \frac{d\chi}{1 - \chi} &= \int_0^\tau d\tau \\ \log \frac{1 - \chi_0}{1 - \chi} &= \tau\\ \frac{1 - \chi}{1 - \chi_0} &= e^{-\tau}\\ \chi &= 1 + (\chi_0 - 1) e^{-\tau} \end{align} $$

We see that the behavior is a simple exponential relaxation to the steady state.
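As a sanity check on that solution, here is a minimal sketch that integrates the dimensionless equation with small forward-Euler steps and compares it with the closed form. The helper `concentration`, which maps back to ppm, is my own addition for convenience.

```python
import math

def chi_exact(tau, chi0):
    """Closed-form solution chi(tau) = 1 + (chi0 - 1) exp(-tau)."""
    return 1.0 + (chi0 - 1.0) * math.exp(-tau)

def chi_euler(tau, chi0, steps=100_000):
    """Integrate d(chi)/d(tau) = 1 - chi with small forward-Euler steps."""
    chi, dtau = chi0, tau / steps
    for _ in range(steps):
        chi += (1.0 - chi) * dtau
    return chi

def concentration(t, x_start, x0, b, kappa):
    """Map the dimensionless solution back to ppm: x = x0 + (b / kappa) * chi."""
    chi0 = kappa * (x_start - x0) / b
    return x0 + (b / kappa) * chi_exact(kappa * t, chi0)

for chi0 in (0.0, 2.0):   # starting below and above the steady state
    print(chi0, chi_exact(3.0, chi0), chi_euler(3.0, chi0))
```

Whatever the starting point, the solution decays toward $\chi = 1$, which in dimensional terms is the steady state $x = x_0 + b/\kappa$.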

Figure 1. Two example evolutions of the $\coo$ given by the model.

In this nondimensional form, it becomes clear that everything is dominated by $\kappa$. If we want to either change the equilibrium value or get there sooner, we need to adjust $\kappa$, the air flow rate. If we open a couple windows and turn on the fans in the house, even with all four of us in here, the $\coo$ concentration then settles down at something like $600 \textrm{ ppm}$, suggesting that $\kappa$ is now something like $3300/(600 - 420) \approx 18 \text{ /day}$, and that the $\coo$ takes about $2/(18 \textrm{ /day}) \approx 2.7 \textrm{ hours}$ to fall.

Would Plants help?

Could we better control our indoor $\coo$ concentration by having some houseplants? While plants also respire like we do, they also photosynthesize, using the sun's energy to run the chemical equation above backward, fixing $\coo$ in the air into carbohydrates and sugars.

Unfortunately, as we saw above, what's really important for the chemistry is essentially the weight of the products. For every $30 \textrm{ grams}$ of carbohydrates we burn we release $44 \textrm{ grams}$ of $\coo$ into the air, and plants go in reverse: for every $30 \textrm{ grams}$ of carbohydrates they synthesize they consume $44 \textrm{ grams}$ of $\coo$ from the air. This means that if every person in our house is releasing $730 \textrm{ g/day}$ of $\coo$, we would need $500 \textrm{ g/day}$ of sugars being synthesized to offset each of us. Unfortunately plants do not grow nearly that fast. It seems most plants grow a couple of kilograms a year, nowhere near half a kilogram a day. We are unlikely to make a dent in our resting $\coo$ concentration indoors unless we turn our house into a veritable jungle.

Impact on Earth

To put something like climate change into perspective, we just worked out that a typical human releases something like $730 \textrm{ g/day}$ of $\coo$ just by breathing. Granted, this $\coo$ doesn't tend to increase $\coo$ atmospheric concentrations because it came from carbon that very recently was in the atmosphere itself (before being fixed by our food). Our breathing is essentially carbon neutral, but let's work out how humanity's collective breathing compares with atmospheric $\coo$ concentrations.

As before, we just need to scale up this production by the 8 billion humans on the planet and then divide by the total weight of the atmosphere, then correct for the volumetric concentration rather than the mass-based one.

To estimate the weight of the atmosphere, we know that the $1 \textrm{ atm}$ of air pressure at the surface is caused by the weight of air above us, so the total mass of the atmosphere is roughly:

$$ \frac{1 \textrm{ atm} \cdot 4 \pi R^2}{g} \approx 5.2\times 10^{21} \textrm{ g} $$
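That estimate is easy to reproduce. The sketch below plugs in standard reference values for the pressure, the Earth's radius and the surface gravity; those values are my assumptions, since the text only quotes the result.

```python
import math

# Standard reference values (assumptions; the text only quotes the result).
pressure = 101_325.0      # 1 atm in pascals
radius = 6.371e6          # mean radius of the Earth in meters
gravity = 9.81            # surface gravitational acceleration in m/s^2

surface_area = 4.0 * math.pi * radius ** 2
atmosphere_kg = pressure * surface_area / gravity    # weight of the air column / g
atmosphere_g = atmosphere_kg * 1000.0
print(f"{atmosphere_g:.2e} g")
```

This comes out a little above $5 \times 10^{21}$ grams, in line with the figure above.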

So we can work out that the relative $\coo$ concentration from human breathing is:

$$ \frac{730 \textrm{ g/day} \cdot 365 \textrm{ days/year}}{5.2 \times 10^{21} \textrm{ g}} \cdot 8\times 10^9 \cdot \frac{29}{44} \approx 0.27 \textrm{ ppm/year}, $$

roughly a quarter of a ppm per year. Again, human breathing is actually net neutral, but given the magnitude here, it becomes a bit easier to imagine that human activities and burning of fossil fuels might be contributing $2.47 \pm 0.25 \textrm{ ppm/year}$ to the atmosphere.³ If we could somehow sequester all of the $\coo$ that all of the humans on the planet breathe out, that would only reduce the atmospheric growth rate of $\coo$ by about 11%. Humanity operates on a truly global scale and we now have very direct influences on the chemistry of the planet.

The Method of Imaginary Results

Thu, 30 Nov 2023 00:00:00 -0500

Performing Bayesian inference requires a full joint distribution over both our data and parameters $p(D,\theta)$. In the usual way of doing things, we specify that joint distribution by providing two pieces: a likelihood $p(D|\theta)$ that specifies how we believe the data would be generated if we happened to know the exact parameter values and some prior $p(\theta)$ over parameters that represents our state of belief about what the parameters are before we look at any data.

Most people don't have any deep philosophical issues with specifying a likelihood $p(D|\theta)$. We're aware that our likelihoods might not be perfect, that they are some approximation of what is happening in the real world. Still, we have opinions about them, we feel as though we can reason about whether a given likelihood is good or bad for some situation.

I believe I can model a series of $D$ heads in $N$ coin flips with a Binomial likelihood for instance, and I don't really have any qualms about that. I might decide to model the heights of my pea plants with a Normal Distribution or perform a linear fit to some data, or do image classification with some convolutional neural network or transformer. In any case, I often have a good idea of what I should use as a likelihood $p(D|\theta)$.

Choosing the prior $p(\theta)$ is what all the fuss is about. This is the part that raises various philosophical issues. This is the part that, if we are being honest, is much harder. What do I believe the bias of a coin is before I ever flip the coin? I'm not really sure to be honest. In many contexts I might have previously done some experiments, in which case I could use yesterday's posterior as today's prior.¹

However, lacking previous experiments, I often feel at a loss. There are many frameworks for designing priors that people have proposed. Laplace originally motivated a flat prior for the Bernoulli likelihood by appealing to the principle of indifference.² Jeffreys taught us how to build priors that were reparameterization-independent. Jaynes would argue for choosing priors by appealing to symmetries.³ Bernardo suggested choosing priors to maximize the information you extract from data, so called reference priors.⁴ Gelman and friends tout weakly informative priors. There are even whole lists of common recommendations.

What if we didn't have to choose a prior directly?

The Method of Imaginary Results

Enter the method of imaginary results. It turns out⁵ that we can uniquely characterize a joint distribution in a different way. Specifying a likelihood $L(D|\theta)$ and a prior $\pi(\theta)$ uniquely characterizes the joint $p(D,\theta) = L(D|\theta)\pi(\theta)$. You know what else uniquely characterizes the joint? Specifying a likelihood $L(D|\theta)$ and some hypothetical posterior $q(\theta|D_0)$. The corresponding unique joint $p(\theta,D)$ is given by:

$$ p(\theta, D) \propto L(D|\theta) \frac{q(\theta|D_0)}{L(D_0|\theta)}, \qquad \textrm{i.e.} \qquad p(\theta, D) = \frac{ L(D|\theta) \frac{q(\theta|D_0)}{L(D_0|\theta)} }{\int d\theta'\, \frac{q(\theta'|D_0)}{L(D_0|\theta')}}. $$

Which naturally satisfies the two inputs we provided: $$ p(D|\theta) = L(D|\theta) \qquad p(\theta|D_0) = q(\theta|D_0). $$
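Here is a minimal grid-based sketch of that construction for a coin. The imagined dataset `D0` and the hypothetical posterior `q_given_D0` are made-up choices for illustration. It builds the prior implied by the imaginary result, then checks that an ordinary Bayesian update on `D0` hands back exactly the posterior we asked for.

```python
import math

thetas = [(i + 0.5) / 200 for i in range(200)]   # grid over the coin bias in (0, 1)

def likelihood(heads, flips, theta):
    """Binomial likelihood L(D | theta) for `heads` out of `flips`."""
    return math.comb(flips, heads) * theta ** heads * (1 - theta) ** (flips - heads)

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical imagined data D0 and the posterior we say we would hold after seeing it.
D0 = (7, 10)
q_given_D0 = normalize([t ** 8 * (1 - t) ** 4 for t in thetas])

# Prior implied by the imaginary result: pi(theta) proportional to q(theta|D0) / L(D0|theta).
implied_prior = normalize([q / likelihood(*D0, t) for q, t in zip(q_given_D0, thetas)])

def posterior(data):
    """Ordinary Bayesian update of the implied prior on the grid."""
    return normalize([likelihood(*data, t) * pi for t, pi in zip(thetas, implied_prior)])

recovered = posterior(D0)                 # should reproduce q_given_D0
actual = posterior((6, 10))               # belief after the data we really saw
print(max(abs(a - b) for a, b in zip(recovered, q_given_D0)))
```

The construction only makes sense when the implied prior is normalizable, which is true for this example but is worth checking for any other choice of hypothetical posterior.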

This flips the problem on its head. We no longer have to specify a prior. Instead we can specify a hypothetical posterior. We can say what we would believe, if, hypothetically we had observed some dataset $D_0$.

I think that this is an easier task to do. It is easier for me to reason about what beliefs I should hypothetically hold in light of some data than it is for me to reason about what I believe independent of any data.

Coin Example

Let's work the simple example of some coin flips. I believe I can model a coin as being a simple Bernoulli process. There is some probability $\theta$ that the coin will land heads and each flip is independent and identically distributed. Therefore, I can model observing $H$ heads out of sequence of $N$ flips with a Binomial Likelihood:

$$ L({H,N}|\theta) = { N \choose H} \theta^H (1- \theta)^{N-H} $$

Now, imagine I actually observe some sequence of coin flips, let's say 6 out of 10 flips were heads. Now what should I believe about the bias of my coin? To answer this I need to specify a prior belief I have about the bias of the coin. In most textbook examples, that prior is taken to be uniform $p(\theta) = 1$, saying that our prior belief is that it is equally likely that the coin should have a bias in an interval $\theta + \delta \theta$ for any $\theta$, i.e. this prior says it's just as likely the bias of the coin is between 0.1 and 0.2 as it is that it is between 0.5 and 0.6.

Alternatively, I could take Jeffreys' advice and adopt a non-informative prior that is reparameterization independent, or I could try to adopt Gelman's advice and start with an informative prior concentrated near fairness. Below is a representation of these three standard choices where the prior is shown in blue and the posterior from 6 heads out of 10 flips is shown in orange.

Figure 1. Some standard textbook priors and the resulting posterior for 6 heads out of 10 coin flips.

These are convenient mathematically and make for easy problems to solve for a homework exercise, but they aren't realistic. If we are being honest, we tend to expect that coins we encounter in the real world are very nearly fair.⁶ We could therefore start with a prior that is concentrated near fair, but how do we assign a meaningful width to that distribution? And if we're being honest, I've encountered trick coins in my days, double-headed and double-tailed coins, and if some weirdo walks up to me and asks me to start predicting a whole sequence of coin flips I shouldn't discount the possibility they are trying to play me for a fool.

At this stage, trying to adjust the parameters of our prior without any evidence or data is difficult. I have a hard time talking to my gut to decide what I should set my prior beliefs to apropos of nothing. Instead, let's try to invoke the method of imaginary results and imagine some hypothetical dataset and probe our beliefs. Imagine we've just observed 10 coin flips, and all 10 of them were heads! What do you believe now? Now that I've hypothesized a dataset I have an easier time talking to my gut.

In this scenario, I feel as though I would place a reasonable probability on the coin being rigged to always land heads ($\theta = 1$), let's say 50%. At the same time, I think I would still place a reasonable probability on the coin being exactly fair, let's say 25%. The remaining 25% probability I would want to spread around but biased towards heads; for that let's use a $\operatorname{Beta}(11,1)$ distribution or $11\, \theta^{10}$. I've attempted to visualize this distribution below.⁷

Figure 2. My attempt at eliciting an imaginary result of a posterior I'm comfortable with if I were to observe 10 heads in a row from a coin.

Or in equation form:

$$ q(\theta|D_0) = \frac 12 \delta(\theta -1 ) + \frac 14 \delta\left(\theta - \frac 12 \right) + \frac {11} 4 \theta^{10} $$

Once we've specified this imaginary result, we have everything we need to form a posterior for our original problem with 6 heads out of 10 flips.

$$\begin{align} p(\theta|D) &\propto L(D|\theta) \frac{q(\theta|D_0)}{L(D_0|\theta)} \\ &\propto 210 \theta^6 (1-\theta)^4 \frac{\frac 14 \delta\left(\theta - \frac 12 \right) + \frac 12 \delta(\theta - 1) + \frac{11}{4} \theta^{10}}{\theta^{10}} \\ &= \frac{210}{211} \delta\left(\theta -\frac 12 \right) + \frac{1}{211} \left( 2310 \theta^{6} (1-\theta)^4 \right) \end{align} $$

Figure 3. The posterior I get from my elicited imaginary posterior if I actually observe 6 heads and 4 tails. The blue curve is the true posterior, the dashed orange is a blown up version of the small residual component.

The posterior we find is 99.5% probability on the coin being exactly fair, and 0.5% probability assigned to a $\operatorname{Beta}(7,5)$ type posterior, which is buried in the true form above, but I've blown up in the dashed line so you can see its shape. This posterior has a very heavy weight on the coin being exactly fair, which I think is reflective of my actual beliefs but I would have had difficulty specifying in terms of a prior. Instead, if I imagine the coin coming up heads 10 times in a row, the fact that I wanted to still give the coin a 25% chance of being fair is obviously mathematically equivalent to me having a 98.7% prior belief the coin is fair, but I feel as though I have a much higher sensitivity to the right number when I express this as a hypothetical posterior.
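The arithmetic behind the 99.5% and 98.7% figures can be checked with exact fractions. This is only a sketch of the bookkeeping: it tracks the unnormalized mass on each of the three mixture components (the variable names are mine), dividing out the imagined likelihood $\theta^{10}$ to get the implied prior and then multiplying by the real likelihood.

```python
from fractions import Fraction
from math import comb, factorial

half = Fraction(1, 2)

# Imagined posterior after 10 heads in a row (D_0): mixture weights.
w_fair, w_two_headed, w_beta = Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)

# Divide out L(D_0|theta) = theta^10 to get the implied, unnormalized prior.
prior_fair = w_fair / half**10        # point mass at theta = 1/2
prior_two_headed = w_two_headed / 1   # point mass at theta = 1
prior_flat = w_beta * 11              # (11/4) theta^10 / theta^10: flat density on [0, 1]
prior_total = prior_fair + prior_two_headed + prior_flat
prior_p_fair = prior_fair / prior_total

# Multiply by the real likelihood L(D|theta) = C(10, 6) theta^6 (1 - theta)^4.
n, h = 10, 6
post_fair = prior_fair * comb(n, h) * half**h * half**(n - h)
post_two_headed = prior_two_headed * 0   # (1 - theta)^4 vanishes at theta = 1
beta_integral = Fraction(factorial(h) * factorial(n - h), factorial(n + 1))  # B(7, 5)
post_flat = prior_flat * comb(n, h) * beta_integral
post_p_fair = post_fair / (post_fair + post_two_headed + post_flat)

print("implied prior on fair:", prior_p_fair, float(prior_p_fair))
print("posterior on fair:    ", post_p_fair, float(post_p_fair))
```

The posterior weight on the fair coin comes out to exactly $210/211$, matching the equation above, and the implied prior weight is $1024/1037 \approx 98.7\%$.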

The method of imaginary results lets us ask ourselves what we would believe in light of some data, rather than ask us to express what we believe apropos of nothing. I think this helps resolve some of the philosophical issues people have with prior selection in Bayesian inference.

KL is All You Need

Mon, 08 Jan 2024 00:00:00 -0500

Modern machine learning is a sea of initialisms: VAE, VIB, VDM, BBB, VB, etc. But, the more time I spend working in this field the more I come to appreciate that the core of essentially all modern machine learning methods is a single universal objective: Kullback-Leibler (KL) divergence minimization. Even better, there is a very simple universal recipe you can follow to rederive most of the named objectives out there. Understand KL, understand the recipe, and you'll understand all of these methods and be well on your way to deriving your own.

In the past I've discussed some of the special properties of KL divergence, and how you can derive VAEs or Diffusion Models by means of a simple KL objective. What follows is an extension of those ideas, essentially a written version of a recent talk [slides] I gave at the InfoCog Workshop at NeurIPS 2023. ¹

Figure 1. The elephant in the room is KL divergence or the relative entropy.²

KL Divergence as Expected Weight of Evidence

Before we get into it, we need to make sure we're all starting on the same page. Because KL divergence is so fundamental and special (as I've written about before) it has many different interpretations. For our purposes, the most useful interpretation is as an expected weight of evidence.³ I'll briefly build that up here.

Imagine we have two hypotheses $P$ and $Q$ and we're trying to decide which of these two is a better model of the world. We go out and collect some data $D$ and would like to use that data to help us discriminate between the two models. Being good probabilistic thinkers with a penchant for gambling, what we're interested in is:

$$ \frac{\Pr(P|D)}{\Pr(Q|D)}, $$

the odds of $P$ versus $Q$, given the data $D$. Using Bayes rule we can express this as:

$$ \frac{\Pr(P|D)}{\Pr(Q|D)} = \frac{\Pr(D|P)}{\Pr(D|Q)} \frac{\Pr(P)}{\Pr(Q)}, $$

the product of the likelihood ratio of the observed data under models $P$ and $Q$ and the prior odds of the two models. Taking a logarithm of both sides turns the product into an easier to work with sum:

$$ \log \frac{\Pr(P|D)}{\Pr(Q|D)} = \log \frac{\Pr(D|P)}{\Pr(D|Q)} + \log \frac{\Pr(P)}{\Pr(Q)}. $$

Now, the posterior log odds is expressed as the sum of the weight of evidence plus the prior log odds of the two hypotheses.

Figure 2. Belief-O-Meter.

This weight of evidence tells us how much to update our beliefs in light of evidence. If you picture a sort of Belief-O-Meter™ for your own beliefs, each bit of independent evidence gives you an additive update for the meter, pushing your beliefs either toward $P$ or toward $Q$. For simple hypotheses taking the form of probability distributions, this weight of evidence is just the log density ratio of the data under the models:

$$ \log \frac{\Pr(D|P)}{\Pr(D|Q)} \text{ becomes } \log \frac{p(D)}{q(D)}. $$
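To make the additive Belief-O-Meter update concrete, here is a small sketch. The two hypotheses (a coin biased 75/25 toward heads versus a fair coin) and the data string are made up for illustration; the point is only that each independent observation adds its own log density ratio to the running log odds.

```python
import math

def weight_of_evidence_bits(x, p, q):
    """log2 p(x)/q(x): evidence one observation x gives for P over Q, in bits."""
    return math.log2(p[x] / q[x])

p = {"H": 0.75, "T": 0.25}   # hypothetical hypothesis P: a biased coin
q = {"H": 0.5, "T": 0.5}     # hypothetical hypothesis Q: a fair coin
data = "HHTH"                # made-up observations

prior_log_odds = 0.0         # even prior odds, log2(1) = 0
meter = prior_log_odds
for x in data:
    meter += weight_of_evidence_bits(x, p, q)   # additive update per observation

posterior_odds = 2.0 ** meter
print(f"posterior log odds: {meter:.3f} bits, odds {posterior_odds:.4f} : 1")
```

Each heads nudges the meter toward $P$ by about 0.58 bits, while the single tails pushes it back toward $Q$ by a full bit.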

OK, so what does this have to do with the KL divergence? Imagine if one of our two hypotheses is actually true. If $P$ was the probability distribution governing the actual world, the expected weight of evidence we would accumulate from observing some data would be the KL divergence:⁴

$$ I[p;q] \equiv \int dx\, p(x) \log \frac{p(x)}{q(x)} \equiv \left\langle \log \frac{p(x)}{q(x)} \right\rangle_{p(x)} . $$

Therefore, we can interpret the KL divergence as a measure of how quickly we would be able to discern between hypotheses $P$ and $Q$ if $P$ were true. Similarly, the reverse KL is:

$$ I[q;p] \equiv \int dx\, q(x) \log \frac{q(x)}{p(x)} \equiv \left\langle \log \frac{q(x)}{p(x)} \right\rangle_{q(x)}, $$

a measure of how quickly we'd be able to discern between $P$ and $Q$ if $Q$ were true. Suddenly, the asymmetry of the KL divergence, an issue that often causes consternation, is no longer a mystery. We should expect the expected weight of evidence to be asymmetric. As an extreme example, imagine we were trying to decide between two hypotheses regarding some coin flips we are about to observe. $P$ is the hypothesis that the coin is fair while $Q$ is the hypothesis the coin is a cheating, double-headed coin. In this case, if we actually had a fair coin, we expect to be able to perfectly discern the two hypotheses (infinite KL) because we will eventually observe a tails, an impossible situation under the alternative ($Q$) hypothesis. Meanwhile, if the coin is actually a cheat, we'll be able to collect, on average, 1 bit of evidence per flip in favor of the hypothesis that the coin is a cheat, but we will only ever observe heads and so never be able to perfectly rule out the possibility that the coin is fair and we've simply observed some miracle.⁵
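The coin example is easy to check numerically. The helper below is a minimal sketch of a discrete KL divergence in bits (the function name and the list representation are my own choices); it treats $0 \log 0$ as zero and returns infinity whenever the true distribution allows an outcome the alternative rules out.

```python
import math

def kl_bits(p, q):
    """KL divergence I[p;q] in bits for discrete distributions given as lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue            # 0 log 0 = 0 by convention
        if qi == 0.0:
            return math.inf     # p allows an outcome that q rules out
        total += pi * math.log2(pi / qi)
    return total

fair = [0.5, 0.5]          # [P(heads), P(tails)]
two_headed = [1.0, 0.0]

kl_fair_true = kl_bits(fair, two_headed)     # the world is actually fair
kl_cheat_true = kl_bits(two_headed, fair)    # the world is actually double-headed
print(kl_fair_true, kl_cheat_true)
```

The two directions give an infinite divergence and exactly one bit per flip respectively, which is the asymmetry described above.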

Mathematical Properties

In what follows, we'll need to use two mathematical properties of the KL divergence. The first is that the KL divergence is non-negative, i.e. the lowest it can be is zero:

$$ I[p;q] \equiv \int dx\, p(x) \log \frac{p(x)}{q(x)} \geq 0, $$

which I'll leave as an exercise to the reader, or you can see a proof in the previous post. In the context of our interpretation of KL divergence as an expected weight of evidence, the non-negativity of KL divergence means, essentially, that the world can't lie to us. If we are trying to decide between two hypotheses, and one of them happens to be correct, we have to, we must, on average, be pushed in the direction of the correct hypothesis. Even the Devil can't construct a $q \neq p$ that we would be led to believe after seeing enough samples from $p$.

The other property we'll use is the monotonicity of the KL divergence. This is a generalized version of the data processing inequality. If we perform some kind of processing on our random variables, it should only make it harder to discern between two hypotheses, not easier. In particular, the version we'll need today concerns marginalization: if I have two joint distributions defined on two random variables, the KL divergence between their marginals must be less than or equal to the joint KL: $$ \int dx\, dy\, p(x,y) \log \frac{p(x,y)}{q(x,y)} \geq \int dx\, p(x) \log \frac{p(x)}{q(x)}, $$ which is easy to show if you decompose $p(x,y) = p(x) p(y|x)$ and use the fact that all KL divergences (including the conditional $I[p(y|x);q(y|x)] \geq 0$) are non-negative.

Again, in terms of our current interpretation, this makes sense. If I have some beliefs defined over several variables, if I only get to observe some subset of them, it should be harder for me to discern the beliefs. The less I look at, the less I see.
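For completeness, the marginalization bound can be spelled out in a couple of lines. This is just the standard chain-rule decomposition, assuming both joints factor as $p(x,y) = p(x)p(y|x)$ and $q(x,y) = q(x)q(y|x)$:

$$\begin{align} \int dx\, dy\, p(x,y) \log \frac{p(x,y)}{q(x,y)} &= \int dx\, dy\, p(x) p(y|x) \log \frac{p(x) p(y|x)}{q(x) q(y|x)} \\ &= \int dx\, p(x) \log \frac{p(x)}{q(x)} + \int dx\, p(x) \int dy\, p(y|x) \log \frac{p(y|x)}{q(y|x)} \\ &\geq \int dx\, p(x) \log \frac{p(x)}{q(x)}. \end{align}$$

The inner integral in the second line is a KL divergence for each fixed $x$, so it is non-negative, and averaging it over $p(x)$ keeps it non-negative.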

Universal Recipe

With the prerequisites out of the way we're ready to see the "universal recipe" for generating objectives.

In machine learning, broadly, we build neural networks and need some guidance on how to set their parameters. An objective acts like a score that ranks each possible setting and guides our search in the space of parameters for a good one. How ought we value, or judge each possible solution?

Fundamentally, there are two things in conflict. There is the real world with all of its causal dependencies and structure, a great deal of which we have no influence on. Data comes from some data generating process wholly outside of our control. On top of this data we are often interested in building machines to process the data, which may exist in the real world but have a billion or more knobs we need guidance on how to set. In contrast to the real world, there is the dream world, the world of our desires, the world as we wish it were. There's a simple story we wish were true that we could tell about the data and its causal structure. When doing Bayesian inference this is the generative model you use to describe the data. If we're being honest with ourselves, it isn't that the data we observe actually comes from our generative model, we only wish that were the case. So, we have two different stories we could try to tell about the world, the accurate real world description and the wishful dream world one.

The goal is to make the real world look more like our dreams. Given that KL divergence is the proper way to measure how similar two distributions are, we need only minimize the KL divergence between the real world -- the world we can sample from -- and the world as we wish it were. The smaller that KL can become, the harder it becomes for us or anyone else to distinguish between our dreams and reality. In steps:

Draw a causal graphical model corresponding to the world as it is, the true world $P$.
Augment the real world with any components you wish to add.
Draw the world of your desires, what success would look like, what you are targeting, the dream world $Q$.
Minimize $I[P;Q]$.
...
Profit!

As simple as it sounds, in retrospect a lot of machine learning is simply following this recipe. Let's repeat this ad nauseam.

Density Estimation

We'll start with the problem of density estimation. Let's say we have some black box that generates samples, say images. This is the real world $P$, outside of our control. Despite not knowing how $p(x)$ is structured, we can push the button on the black box to generate samples. What do we wish for? We wish we instead had a nice description of those same images. We wish that those images instead came from a box of our own design, some parametric model or probability distribution with knobs that we can adjust to bring it into alignment with the real world, our dream world $q_\theta(x)$ with parameters $\theta$.

Figure 3. Density Estimation. ⁶

Following the recipe, our objective is then to minimize the KL divergence between the real world and our ideal one:

$$ I[p; q] = \left\langle \log \frac{p(x)}{q_\theta(x)} \right\rangle_p $$

To belabor the point, in terms of our interpretation of KL divergence, this makes sense. $I[p;q]$ measures how easy it is for us to distinguish between $p$ and $q$ using samples from $p$. We have samples from $p$, while $q_\theta(x)$ is a whole set of worlds we can index with our parameters $\theta$. We seek a setting of those parameters which make it as difficult as possible for us or anyone else to tell the difference between the real world $P$ and our imaginary one $Q$. Minimizing the KL divergence does exactly that.

Unfortunately, naively, this objective requires that we be able to evaluate $\log p(x)$, the density the real world assigns to the samples it generates. This is out of reach, we don't know what the real world is doing, but here is where the KL divergence helps us out yet again. It decomposes into two terms:

$$ \underbrace{\left\langle \log \frac{p(x)}{q_\theta(x)} \right\rangle}_{I[p;q]} = \underbrace{\left\langle \log p(x) \right\rangle \vphantom{\left\langle \frac p q \right\rangle} }_{-H[p]} + \underbrace{\left\langle -\log q_\theta(x) \right\rangle \vphantom{\left\langle \frac p q \right\rangle}}_{H[p;q]}, $$

the (negative) entropy of the true data generating process ($H[p]$), and the cross-entropy between $p$ and $q$ ($H[p;q]$), aka the negative log-likelihood of the data samples from $p$ under $q$. The entropy of the true data generating process isn't something that we control; as far as we're concerned it's a constant and we don't need to worry about it. Just like that, we see that minimizing the KL divergence between the real world and the world of our desires, in this simple single random variable setup, recovers ordinary minimum cross-entropy learning, aka maximum likelihood learning, but with a different and hopefully well-motivated origin. We adjust the parameters of our model $q_\theta(x)$ so as to maximize the log-likelihood of the data $\log q_\theta(x)$. Why? So that we and anyone else would struggle as much as possible to distinguish between the real world and our model. With this same motivation, lots of other machine learning objectives will fall into place.
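Here is a toy illustration of why dropping the entropy term is harmless. The "real world" is a made-up three-outcome distribution and the model family is an assumed $\operatorname{Binomial}(2, \theta)$; neither comes from the discussion above. A grid search over $\theta$ lands on the same parameter whether we minimize the full KL or only the cross-entropy, and the two objectives differ by the constant $H[p]$.

```python
import math

p = [0.2, 0.5, 0.3]   # "real world" over outcomes {0, 1, 2}; unknown in practice

def q_theta(theta):
    """Hypothetical model family: Binomial(2, theta) over {0, 1, 2}."""
    return [(1 - theta) ** 2, 2 * theta * (1 - theta), theta ** 2]

def cross_entropy(p_dist, q_dist):
    """H[p;q] = <-log q>_p in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p_dist, q_dist))

def kl(p_dist, q_dist):
    """I[p;q] = <log p/q>_p in nats (all entries assumed positive)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p_dist, q_dist))

entropy = cross_entropy(p, p)                     # H[p], constant in theta
grid = [i / 100 for i in range(1, 100)]
best_by_ce = min(grid, key=lambda t: cross_entropy(p, q_theta(t)))
best_by_kl = min(grid, key=lambda t: kl(p, q_theta(t)))

# KL = cross-entropy - entropy, so this gap should be zero up to rounding.
gap = kl(p, q_theta(best_by_kl)) - (cross_entropy(p, q_theta(best_by_ce)) - entropy)
print(best_by_ce, best_by_kl, gap)
```

In a real problem we could never evaluate `kl` directly, since it needs $p(x)$; the cross-entropy only needs samples from $p$, which is the whole point of the decomposition.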

There are two caveats worth discussing but I've pushed them to appendices. The first is that it bugs me that splitting the log density ratio is awkward in terms of dimensional analysis, and the second is that while this gives us a meaningful objective, it requires that we be able to take expectations with respect to the true distribution. If we have only finite samples in the form of a training set, that introduces complications. I want to acknowledge that reusing a fixed dataset is a problem that has to be dealt with, but I also want to highlight that it isn't a problem with the objective. Our KL divergence objective is telling us the right thing to do; we need to work out real world issues about how to best implement that objective. Those complications are outside the scope of this discussion.

Supervised Learning

Let's complicate things slightly. Instead of imagining that we have a single random variable in the real world, imagine instead we have a pair of variables, $X$ and $Y$. For concreteness, imagine the $X$ are images and the $Y$ are their associated labels in some dataset.

What are we after? What does success look like? Let's imagine that what we desire is the ability to assign labels to data. What we wish were the case was that we used the same process to draw the images $q(x) = p(x)$, but instead of using the real world process to assign labels, ideally the labels would instead come from a device under our control: $q_\theta(y|x)$.⁷ Just as before, we simply minimize the KL divergence between these two joints and we obtain an objective:

Figure 4. Supervised Learning.

$$ \left\langle \log \frac{p(x,y)}{p(x)q(y|x)} \right\rangle, $$

Just as above, when we drop constants outside of our control, we end up with the usual maximum likelihood objective we are used to:

$$ \left\langle \log \frac{p(x)p(y|x)}{p(x)q(y|x)} \right\rangle = \left\langle \log \frac{p(y|x)}{q(y|x)} \right\rangle, $$ with the same caveats about proper handling of dimensions and issues stemming from using a fixed set of finite samples.
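To see the maximum likelihood objective explicitly, we can split the conditional KL the same way we split the density estimation objective (writing $H[Y|X]$ for the conditional entropy of the real-world labels, a symbol I'm introducing just for this line):

$$ \left\langle \log \frac{p(y|x)}{q_\theta(y|x)} \right\rangle_p = \underbrace{\left\langle \log p(y|x) \right\rangle_p}_{-H[Y|X]} + \left\langle -\log q_\theta(y|x) \right\rangle_p . $$

The first term doesn't depend on $\theta$, so minimizing the KL is the same as minimizing the second term, the expected negative conditional log-likelihood.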

This conditional likelihood optimization objective is truly the workhorse of modern machine learning. However, I feel as though it's a bit dishonest. In practice we rarely care too much about the actual predictive task we are mimicking with our parametric conditional density. Very few people actually care about assigning ImageNet labels to images. Instead, the explosion in deep learning is mostly due to a happy little accident. When we train very large, very expressive conditional distributions to minimize the conditional KL for something like ImageNet labeling with large datasets, we've discovered that the representations formed by some intermediate (usually penultimate) layer in that neural network are useful for a wide array of different image tasks. This didn't have to be the case, but we got a bit lucky.

What if we wanted to learn a useful representation? What would true representation learning look like?

Variational Autoencoders

So far we've only ever represented the world as it is and haven't yet taken the step of augmenting the real world with something new. If we want to learn a representation, that's something that lives in the real world. That's a new random variable.

Let's start with an unsupervised case. We have images and we want to form a representation of those images. In our real world, we have the images $X$ drawn from some distribution outside our control ($p(x)$). Now we'll augment the real world with a new random variable $Z$, our representation. We'll parameterize this with a neural network $p(z|x)$ that defines a tractable distribution for our stochastic representation $Z$. This is our encoder, which maps an image $X$ to a distribution for its representation. We want to consider a whole slew of possible real worlds, each world consisting of a different setting of the parameters of our encoder, and thus each world consisting of a different joint distribution $p(x,z)$. Now our parameters $\theta$ essentially index one of a wide array of possible joint distributions $p(x,z)$. How do we decide amongst these? What does success look like? We are seeking a world in which we can encode images into a useful representation $p(z|x)$. One way to define success would be if those learned representations were really like latents for the images themselves. Wouldn't it be swell if instead the world worked by looking at our own learned representation and used that to formulate the images themselves? Wouldn't it be grand if that joint distribution factorized in the opposite direction: $q(x,z) = q(z)q(x|z)$? This is the usual generative model story, where we first draw a latent variable $z$ from some prior distribution and then decode it through a stochastic map $q(x|z)$ to formulate our image. Such a latent would be demonstrably useful for generating images.

Figure 5. Variational Autoencoders.

Having defined both the real worlds under consideration $p(x,z)$ and the definition of success $q(x,z)$, our objective is the universal one of minimizing the KL divergence betwixt the two, from $p$ to $q$. We try to make it as hard as possible for us or anyone else to distinguish between the real world in which we send images forward through an encoder to form a representation and some hypothetical world in which those representations were drawn from some prior and acted as a latent for a decoder that generated images. We've just recreated the ELBO or Evidence Lower Bound Objective:

$$ \left\langle \log \frac{p(x,z)}{q(x,z)} \right\rangle_p = \left\langle \log \frac{p(x)p(z|x)}{q(x|z)q(z)} \right\rangle_p \geq 0. $$

Since this is a joint KL and all KLs are nonnegative, this objective is non-negative. Furthermore, because of the monotonicity of KL, we know this is a bound on something we might care about, the marginal KL of our generative or reverse path: $$ \left\langle \log \frac{p(x)p(z|x)}{q(x|z)q(z)} \right\rangle_p \geq \left\langle \log \frac{p(x)}{q(x)} \right\rangle_p \geq 0. $$ So, as a bonus, if we push down on this joint KL objective, since this bounds the marginal KL on $X$, we can be assured that this machine composed of three parts, the encoder $p(z|x)$, decoder $q(x|z)$ and marginal (or prior) $q(z)$ will, as we adjust their tunable parameters, additionally make progress on the generative path: $z \sim q(z), x \sim q(x|z)$ itself being as indistinguishable as possible from the original image generating process $p(x)$. Building and training the representation learning objective, as a side effect, ensures we also manage to build a good generative model.

We can split this objective up and name the various terms: $$ \underbrace{\left\langle -\log q(x|z) \vphantom{\left\langle \frac p q \right\rangle} \right\rangle_p}_{D} + \underbrace{\left\langle \log \frac{p(z|x)}{q(z)}\right\rangle_p}_{R} \geq \underbrace{\left\langle -\log q(x) \vphantom{\left\langle \frac p q \right\rangle} \right\rangle_p}_{L} \geq \underbrace{\left\langle -\log p(x) \vphantom{\left\langle \frac p q \right\rangle} \right\rangle_p}_{H}, $$ or in short: $$ D + R \geq L \geq H, $$

a geometric story we tell in more detail in prior work.⁸ The first term, the *distortion*, measures how well we are able to recover the original image after encoding it with the encoder $z \sim p(z|x)$ and then trying to decode back to the original image $q(x|z)$. The second term in the objective is the *rate*, which measures the information theoretic cost of the encoding itself. If Alice and Bob were attempting to communicate the encoding $z$, the KL between the encoding distribution and the prior measures the excess cost of communicating the encoding.
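The chain $D + R \geq L \geq H$ can be checked on a tiny discrete example. Everything below is a made-up toy (two "images", two latent codes, hand-picked probabilities), intended only as a sketch of how the four quantities are computed from an encoder, decoder and prior.

```python
import math

# Toy discrete world: 2 images, 2 latent codes. All numbers are made up.
p_x = [0.7, 0.3]                         # real-world p(x)
enc = [[0.9, 0.1], [0.2, 0.8]]           # encoder p(z|x), rows indexed by x
q_z = [0.5, 0.5]                         # prior / marginal q(z)
dec = [[0.8, 0.2], [0.3, 0.7]]           # decoder q(x|z), rows indexed by z

pairs = [(x, z) for x in range(2) for z in range(2)]
joint = {(x, z): p_x[x] * enc[x][z] for x, z in pairs}   # p(x, z) = p(x) p(z|x)

# Distortion: <-log q(x|z)>_p
D = sum(joint[x, z] * -math.log(dec[z][x]) for x, z in pairs)
# Rate: <log p(z|x) / q(z)>_p
R = sum(joint[x, z] * math.log(enc[x][z] / q_z[z]) for x, z in pairs)
# Model marginal q(x) = sum_z q(z) q(x|z), then cross-entropy L = <-log q(x)>_p
q_x = [sum(q_z[z] * dec[z][x] for z in range(2)) for x in range(2)]
L = sum(p_x[x] * -math.log(q_x[x]) for x in range(2))
# Entropy of the real world: H = <-log p(x)>_p
H = sum(p_x[x] * -math.log(p_x[x]) for x in range(2))

print(f"D + R = {D + R:.4f} >= L = {L:.4f} >= H = {H:.4f}")
```

Changing any of the hand-picked tables moves $D$, $R$ and $L$ around, but the ordering survives, since it follows from non-negativity and monotonicity of KL rather than from these particular numbers.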

If we are careful to split up the objective into its various reparameterization independent components, we can also explore some trade-offs between the different terms in the objective, adding some Lagrange multipliers, obtaining the $\beta$-VAE:⁹ $$ \left\langle -\log q(x|z) \right\rangle_p + \beta \left\langle \log \frac{p(z|x)}{q(z)}\right\rangle_p. $$

All told, the universal recipe has given us a proper representation learning objective, albeit unsupervised. We have defined what it could mean for a representation to be a good one and we are able to search now in the space of all possible representations. Unfortunately, a bit is a bit and unless we bring some kind of auxiliary information to the table, the success and utility of this objective is often left to inductive biases in our particular choices of variational families.

Variational Information Bottleneck

If we want to be a bit more explicit in our representation learning objectives, we could color the bits by bringing an auxiliary variable to the table. Imagine our real world distribution consists of pairs, $(x,y)$ drawn from some joint distribution $p(x,y)$ outside of our control. Imagine images $X$ and labels $Y$. As before, we can augment this world with a new random variable $Z$, a representation, which, in this example, we are interested in depending only on the image part, $p(z|x)$. We do this because we'd like to be able to compute the representation of some downstream image without having access to its label. As before, we've now defined a whole slew of possible worlds, consisting of all possible encoding distributions paired with our joint input distribution $p(x,y,z) =p(x,y)p(z|x)$. How do we decide amongst these? What does success look like? Let's define success as being able to use our learned representation $Z$, not to recreate the image, but only predict the auxiliary information $Y$. This gives us a set of diagrams as in Figure 6 below.

Figure 6. Variational Information Bottleneck.

Following the universal recipe and taking the KL divergence between these two joints lets us reinvent the Variational Information Bottleneck:¹⁰

$$ \left\langle \log \frac{p(y|x) p(z|x)}{q(y|z) q(z)} \right\rangle_p \geq \left\langle \log \frac{p(y|x)}{q(y|x)}\right\rangle_p \geq 0. $$ Because KL is monotonic, this joint objective bounds the marginal conditional likelihood and we can rest assured that our predictive engine is still trying to mimic the labeling distribution. This objective learns a representation that specifically aims to retain only the information that is relevant to predicting the auxiliary information contained in $Y$. Because the objective is representation centric, we also learn a stochastic representation that can truly compress the inputs.
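As with the VAE, it can help to split the joint objective into named pieces. Here I'm borrowing the rate label $R$ from the VAE discussion and using $C$ for the classification term; treat the labels as my shorthand:

$$ \left\langle \log \frac{p(y|x) p(z|x)}{q(y|z) q(z)} \right\rangle_p = \left\langle \log p(y|x) \right\rangle_p + \underbrace{\left\langle -\log q(y|z) \right\rangle_p}_{C} + \underbrace{\left\langle \log \frac{p(z|x)}{q(z)} \right\rangle_p}_{R} . $$

The first term is fixed by the real world, so what we actually optimize is the classification error $C$ plus the rate $R$ of the representation.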

Semi-Supervised Learning

We saw that VAEs came from asking that the learned representation be able to recreate the images, and that VIB was motivated by asking that the learned representation be able to predict an auxiliary variable. What if we instead wanted to do both?

Figure 7. Semi-Supervised Variational Autoencoder.

We then obtain a type of semi-supervised VAE:

$$ \left\langle -\beta \log q(x|z) - \gamma \log q(y|z) + \log \frac{p(z|x)}{q(z)} \right\rangle_p. $$ Here $\beta$ and $\gamma$ have been inserted to let us play with the trade-offs between how much emphasis we place on the reconstruction and auxiliary variable respectively.

Diffusion

As I outline in more detail in an earlier post, modern diffusion models can also be cast in this universal objective form. We imagine a simple fixed forward process that iteratively adds Gaussian noise to an image, and try to learn a reverse process parameterized in a clever way.

Figure 8. Variational Diffusion.

The variational interpretation of diffusion models makes clear that they are little more than deep hierarchical VAEs, though with some tricks that make training them much more tractable than a general hierarchical VAE.

Bayesian Inference

So far we've focused on local representation learning, wherein we want to form a representation of each example or image. Let's now think a bit about global representation learning. We are going to observe an entire dataset and want to somehow summarize what we've learned. Now we imagine a forward process in which we sample a whole set of data, $D$, and need to form some kind of summary statistic or description of the data: $p(\theta|D)$. What would success look like here? Well, if we aren't willing to assume much, we still might be willing to assume our data is exchangeable, that is, that the order the data was generated in doesn't matter. De Finetti tells us this is equivalent to being able to describe the data as being conditionally i.i.d. (independent and identically distributed). That is, we will describe success as taking the form of a sort of generative story: $$ q(\theta) q(D|\theta), $$ where we draw the summary $\theta$ from some prior and use it to generate the data with some likelihood which we can take to decompose: $q(D|\theta) = \prod_i q(x_i|\theta)$.

Figure 9. (Variational) Bayesian Inference.

It's the same story we've told several times now, our universal recipe gives us an objective, the KL divergence between these two joints which aims to make them as indistinguishable as possible: $$ \left\langle \log \frac{p(D)p(\theta|D)}{q(\theta)q(D|\theta)} \right\rangle_p . $$ If we drop the constant terms outside of our control and separate terms into pieces and insert a trade-off parameter, we've reinvented a generalized form of variational Bayesian inference: $$ \left\langle -\beta \log q(D|\theta) + \log \frac{p(\theta|D)}{q(\theta)} \right\rangle_p. $$ If we set $\beta=1$ and make our $p(\theta|D)$ expressive enough to cover the space of all possible distributions, minimizing this objective recovers the Bayesian posterior. If we simply restrict our attention to some kind of parametric family of distributions $p(\theta|D)$ this is the ELBO used in variational Bayes. Lots of names for the same idea: try to form a global representation of data that is as indistinguishable as possible from the data being exchangeable.
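The claim that $\beta = 1$ with an unrestricted $p(\theta|D)$ recovers the Bayesian posterior can be checked on a grid. The sketch below makes several assumptions of its own: a coin-bias parameter restricted to 19 grid points, a uniform prior over that grid, and a single fixed dataset of 6 heads and 4 tails (so the outer expectation over $D$ is dropped). At the exact posterior the objective equals the negative log evidence, and other candidates, such as the prior itself, score worse.

```python
import math

# Hypothetical setup: coin bias theta on a grid, uniform prior q(theta),
# one observed dataset D of 6 heads and 4 tails in a known order.
thetas = [i / 20 for i in range(1, 20)]
prior = [1 / len(thetas)] * len(thetas)
heads, tails = 6, 4
log_lik = [heads * math.log(t) + tails * math.log(1 - t) for t in thetas]  # log q(D|theta)

def objective(summary, beta=1.0):
    """<-beta log q(D|theta) + log summary(theta)/q(theta)> under summary(theta)."""
    return sum(s * (-beta * ll + math.log(s / pr))
               for s, ll, pr in zip(summary, log_lik, prior) if s > 0)

evidence = sum(pr * math.exp(ll) for pr, ll in zip(prior, log_lik))        # q(D)
posterior = [pr * math.exp(ll) / evidence for pr, ll in zip(prior, log_lik)]

f_posterior = objective(posterior)   # should equal -log q(D)
f_prior = objective(prior)           # ignoring the data entirely does worse
print(f_posterior, -math.log(evidence), f_prior)
```

Restricting `summary` to a parametric family instead of the full simplex would turn the same `objective` into the usual variational Bayes bound.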

Bayesian Neural Network

We don't have to stop now, let's imagine we want to generate a global summary of data in the form of the best settings of the parameters of a neural network to make some supervised predictions. We can do that too, we simply follow the universal recipe. We draw the real world and the world of our desires.

Figure 10. Bayesian Neural Networks.

And take the KL betwixt them: $$ \left\langle -\beta \log q(y|x,\theta) + \log \frac{p(\theta|D)}{q(\theta)} \right\rangle_p, $$ and we've reinvented Bayes By Backprop.¹¹

TherML

From here you might be wondering what it would look like if we tried to be as honest as possible about the sort of standard practice in machine learning today. In our earlier work ¹² we did exactly that and came up with the following diagram:

Figure 11. TherML.

This gave us an objective that seemed to include all of the previous things discussed as special cases and left open the door for interesting behavior on the spots in between.

Rearranging the objective into terms: $$ \left\langle \gamma \underbrace{\left(-\log q(y|z)\right) \vphantom{\log \frac{p(x)}{q(x)}}}_{C} + \delta \underbrace{\left(-\log q(x|z)\right) \vphantom{\log \frac{p(x)}{q(x)}}}_{D} + \sigma \underbrace{\log \frac{p(\theta|D)}{q(\theta)}}_{S} + \underbrace{\log \frac{p(z|x,\theta)}{q(z)}}_{R} \right\rangle_p \geq 0, $$ as we discuss in the paper, we get an objective that lets us trade off between the ability of our representation to do reconstruction ($D$ term) and predict auxiliary variables ($C$ term), all the while being honest about the information our learning algorithm extracts from the dataset ($S$ term) and how expensive our learned representation is ($R$ term). Inserting tradeoff parameters ($\gamma,\delta,\sigma$) would let you explore an entire three dimensional frontier of optimal solutions that explore all tradeoffs between these different criteria.

Variational Prediction

While most of the previous diagrams were all retellings of essentially the same story, more recently we've begun to wonder what it might look like if we try some more extreme rewirings of these kinds of diagrams. What if we wanted to try to be so brazen as to invent something that might be an alternative to Bayesian inference, as a different sort of diagram that could provide a global representation learning objective? One candidate would be the following:

Figure 12. Variational Prediction.

We explore this in some detail in our recent work.¹³

I'm not sure this is better, but it's certainly different.

Closing

This post got fairly repetitive, but honestly that was the point. A whole slew of existing and not yet invented machine learning objectives all seem to follow a very simple universal recipe. Simply draw an accurate causal model of the world, then augment it with anything you wish and finally draw a second diagram in the same random variables that corresponds to your marker of success. Take the KL between the two and you've got yourself a reasonable objective. I hope this helps you understand some of these and potentially invent new ones of your own.

Special thanks to Mark Kurzeja, John Stout and Mallory Alemi for helpful feedback on this post.

Appendix A - Dimensional Consistency

There is one caveat: I'm a particular stickler about decomposing KL divergences in this way. I don't think it makes any dimensional sense. I can't take the logarithm of a dimensional quantity, let alone a density. To fix the glitch, let's instead explicitly choose some tractable base measure $m(x)$ and insert it into our original objective:

$$ \left\langle \log \frac{p(x)}{q_\theta(x)} \right\rangle_p = \left\langle \log \frac{p(x) m(x)}{q_\theta(x)m(x)} \right\rangle_p = \left\langle \log \frac{p(x)}{m(x)} \right\rangle_p + \left\langle \log \frac{m(x)}{q_\theta(x)} \right\rangle_p . $$

Now we've decomposed the KL divergence between $P$ and $Q$ into two terms. The first is the KL divergence between $P$ and $M$, our base measure. Just as before, this is some constant outside our control. As long as we fix $m(x)$, given that $p(x)$ is fixed, their KL divergence is fixed and no changes we make to $\theta$ have any effect, so we can drop this (now appropriately reparameterization-independent) term from our objective. We're left with the weight of evidence that samples from $p$ provide in favor of $m$ against $q$. If we adjust the parameters of $q_\theta(x)$ to make it as easy as possible to distinguish it from the base measure $m(x)$ under samples from $p$, we ensure that we drive $q$ towards $p$. If we use ordinary path gradients, the choice of $m(x)$ here won't actually affect the optimization trajectory. It will, however, help us sleep at night, ensuring that our objective is a truly reparameterization-invariant quantity. ⁵
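To make the units argument concrete, here is a minimal sketch using Gaussians, where everything is available in closed form. The particular means, standard deviations and the meters-to-millimeters rescaling are assumptions chosen purely for illustration. The bare term $-\left\langle \log q \right\rangle_p$ shifts by $\log 1000$ when we change units, while the relative term $\left\langle \log \frac{m}{q} \right\rangle_p$ does not move.

```python
import math

def cross_entropy(mu_p, sd_p, mu_q, sd_q):
    """Closed form for -<log q(x)>_p in nats, with p and q both Gaussian."""
    return (0.5 * math.log(2 * math.pi * sd_q**2)
            + (sd_p**2 + (mu_p - mu_q)**2) / (2 * sd_q**2))

def objectives(scale):
    """Return (-<log q>_p, <log m/q>_p) with x measured in rescaled units."""
    # Illustrative (mean, std) choices, all expressed in the rescaled units.
    p = (0.0 * scale, 1.0 * scale)    # "true" distribution
    q = (0.5 * scale, 2.0 * scale)    # parametric model
    m = (0.0 * scale, 10.0 * scale)   # fixed base measure
    raw = cross_entropy(*p, *q)              # -<log q>_p
    relative = raw - cross_entropy(*p, *m)   # <log m/q>_p
    return raw, relative

raw_meters, rel_meters = objectives(1.0)
raw_millimeters, rel_millimeters = objectives(1000.0)

print("shift in -<log q>_p   :", raw_millimeters - raw_meters)
print("log(1000)             :", math.log(1000.0))
print("shift in <log m/q>_p  :", rel_millimeters - rel_meters)
```

Since the two objectives differ only by $\left\langle \log m \right\rangle_p$, which does not depend on $\theta$, their gradients with respect to the parameters of $q$ agree. That is the sense in which the choice of $m$ leaves the optimization trajectory alone.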

Appendix B - Finite Samples and the Empirical Distribution

We motivated that a useful objective for learning a parametric distribution is to minimize the KL divergence between the true distribution and our parametric distribution, i.e. we should adjust the parameters of our distribution to maximize the likelihood of samples from the true distribution. In practice, however, we typically only have access to a finite number of samples from the true distribution, and this introduces a difficulty. If we wanted to, we could generate an unbiased estimate of the expected log-likelihood of our model using a finite number of samples from the true distribution: $$ -\left\langle \log q(x|\theta) \right\rangle_p \approx -\frac 1 N \sum_{i=1}^N \log q(x_i|\theta). $$ Nothing wrong here. There is similarly nothing wrong with taking the gradient of this Monte Carlo estimate to generate an unbiased estimate of the gradient of the true expected log-likelihood: $$ -\nabla_\theta \left\langle \log q(x|\theta) \right\rangle_p \approx -\frac 1 N \sum_{i=1}^N \nabla_\theta \log q(x_i|\theta). $$ The problem only occurs if we start to reuse the same samples. These Monte Carlo estimates are only unbiased estimates of the true expectation if the samples are independent. If we start to take multiple gradient steps with overlapping samples, we start to introduce some bias. Taken to the extreme, if we simply maximize the empirical likelihood on a fixed set of finite samples: $$ \sum_{i=1}^N \log q(x_i|\theta), $$ we are no longer minimizing the KL divergence between the true distribution $p(x)$ and our parametric distribution $q(x|\theta)$; instead we are minimizing the KL divergence between the empirical distribution $\hat p$ and our parametric distribution $q(x|\theta)$: $$ \hat p \equiv \frac 1 N \sum_{i=1}^N \delta(x - x_i). $$ If we had a very large number of samples, this empirical distribution would be pretty close to the true one, $\hat p \approx p$, but with finite samples it is always a distinct distribution from the true one.
If we minimize the empirical risk, or maximize the empirical likelihood, what we are really doing is getting our parametric distribution to be as indistinguishable as possible from the empirical distribution. This is equivalent to saying we should match sampling with replacement from our training set. This is really where all of the issues of over-fitting come from. The degree to which matching the empirical distribution rather than the true distribution is a problem depends on how little data we have (relative to its extent or coverage) and how flexible our parametric model is (the degree to which it can memorize the data we show it and nothing else). In the context of classical machine learning, this is where regularization comes to bear: we typically add some additional terms to our objective beyond just the empirical likelihood to attempt to get our learned model to better approximate the true distribution rather than the empirical one.
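A small discrete sketch makes this concrete. The three-outcome "true" distribution, the sample size and the seed below are all assumptions picked for illustration. With a fully flexible categorical model, the maximizer of the empirical likelihood is exactly the vector of observed frequencies, so the fitted model sits at zero KL from $\hat p$ while generally remaining a nonzero distance from $p$.

```python
import math
import random
from collections import Counter

random.seed(0)  # arbitrary seed, only so the run is repeatable

p_true = [0.5, 0.3, 0.2]   # hypothetical true distribution over 3 outcomes
N = 200                    # hypothetical (finite) training set size
samples = random.choices(range(3), weights=p_true, k=N)

# Empirical distribution: observed frequencies. For an unconstrained
# categorical model this is also the maximum-likelihood fit.
counts = Counter(samples)
p_hat = [counts[k] / N for k in range(3)]

def avg_log_lik(q, xs):
    """Empirical average log-likelihood, (1/N) sum_i log q(x_i)."""
    return sum(math.log(q[x]) for x in xs) / len(xs)

def kl(p, q):
    """KL(p || q) in nats for discrete distributions."""
    total = 0.0
    for pk, qk in zip(p, q):
        if pk > 0:
            if qk == 0:
                return math.inf
            total += pk * math.log(pk / qk)
    return total

ll_at_p_hat = avg_log_lik(p_hat, samples)
ll_at_p_true = avg_log_lik(p_true, samples)

print("empirical distribution       :", p_hat)
print("avg log-lik at q = p_hat     :", ll_at_p_hat)
print("avg log-lik at q = p_true    :", ll_at_p_true)
print("KL(p_hat  || q = p_hat)      :", kl(p_hat, p_hat))
print("KL(p_true || q = p_hat)      :", kl(p_true, p_hat))
```

On the training samples the truth can never beat the empirical frequencies, since the gap between the two average log-likelihoods is exactly $\mathrm{KL}(\hat p \,\|\, p) \geq 0$.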

I want to acknowledge that this is a problem, but in the context of the current discussion I want to point out that this isn't a problem with our objective. It is a good idea to try to minimize the KL divergence between the true distribution and our parametric model. After we decide on this objective, unfortunately, there are practical issues we have to consider about how to target this objective tractably and accurately.