Non-equilibrium Thermodynamics Results Seemingly from Nothing
Alexander A. Alemi.
Let's see if we can very quickly prove the Jarzynski Equality and related non-equilibrium statistical mechanics results. Much as the mathematical underpinnings of thermodynamics are fairly simple (e.g. the existence of a convex surface on which mixed partial derivatives commute), I believe most of the results in non-equilibrium statistical mechanics similarly come down to a rhetorical reinterpretation of a simple mathematical manipulation.
This post will assume some familiarity with physics.
Basic Facts
The underlying math in our case consists of two facts: first, that probability distributions are normalized:
$$\int dx\, p(x) = 1,$$
and second, that the KL divergence is non-negative:[1]
$$\int dx\, p(x) \log \frac{p(x)}{q(x)} \geq 0.$$
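To make these two facts concrete, here is a minimal numerical sketch (Python with NumPy, distributions chosen arbitrarily for illustration) that checks normalization and the non-negativity of the KL divergence for a pair of discrete distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two arbitrary discrete distributions over the same 10 outcomes.
p = rng.random(10); p /= p.sum()
q = rng.random(10); q /= q.sum()

# Fact 1: probability distributions are normalized.
print(p.sum(), q.sum())            # both 1.0 (up to floating point)

# Fact 2: the KL divergence is non-negative.
kl = np.sum(p * np.log(p / q))
print(kl, kl >= 0)                 # KL >= 0, with equality only when p == q
```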
Density Ratios
To generate the classic non-equilibrium statistical mechanics results we start by considering a simple ratio of two joint probability distributions:
$$\frac{q(x_0, x_1)}{p(x_0, x_1)}$$
Clearly we have a tremendous freedom here in our choices for the distributions $p$ and $q$. Mathematically it's uninteresting, but we can start to build some rhetorical weight by factoring our two distributions in two distinct ways:
$$\frac{q(x_1)\, q(x_0|x_1)}{p(x_0)\, p(x_1|x_0)}$$
Despite still not having done anything, we can start to build an interpretation here. Imagine $x_0$ and $x_1$ as being two configurations of a system, with $x_1$ happening after $x_0$. Now, though we're allowed by the chain rule to factor distributions any way we wish, here we've chosen to factor $p$ to be suggestive of some kind of forward process wherein we first sample some $x_0$ from a distribution $p(x_0)$ and then evolve it according to some potentially stochastic process to generate our next state $x_1$ conditioned on the first: $p(x_1|x_0)$. At the same time, we've factored $q$ the other way, evocative of a reverse process that starts at $x_1$ and then evolves backward to $x_0$.
To make further progress, let's specialize a bit. Let's imagine that $x_0$ and $x_1$ are configurations of a physical system evolving according to Hamiltonian dynamics, with a Hamiltonian governed by some kind of control parameter $\lambda$. Let's further imagine that at the beginning of either our forward or reverse process our system is in thermodynamic equilibrium at the same temperature, and in particular in a canonical ensemble:[2]
$$p(x_0) = \frac{e^{-\beta H(x_0, \lambda_0)}}{Z(\beta, \lambda_0)}, \qquad q(x_1) = \frac{e^{-\beta H(x_1, \lambda_1)}}{Z(\beta, \lambda_1)}.$$
Substituting these into the ratio above gives:
$$\frac{q(x_0, x_1)}{p(x_0, x_1)} = \frac{Z(\beta, \lambda_0)}{Z(\beta, \lambda_1)}\, e^{-\beta\left(H(x_1, \lambda_1) - H(x_0, \lambda_0)\right)}\, \frac{q(x_0|x_1)}{p(x_1|x_0)}.$$
We can clean this up a bit and give it a cleaner physical interpretation. Let's identify the change in the Hamiltonian with the work:
$$W \equiv H(x_1, \lambda_1) - H(x_0, \lambda_0).$$
And let's use the standard definition of the free energy:
$$\beta F = -\log Z,$$
to rewrite the ratio of partition functions as a difference in free energies:
$$e^{-\beta \Delta F} = e^{\log Z(\beta, \lambda_1) - \log Z(\beta, \lambda_0)} = \frac{Z(\beta, \lambda_1)}{Z(\beta, \lambda_0)},$$
with $\Delta F \equiv F(\beta, \lambda_1) - F(\beta, \lambda_0)$.
Combining these results gives:
$$\frac{q(x_0, x_1)}{p(x_0, x_1)} = e^{-\beta(W - \Delta F)}\, \frac{q(x_0|x_1)}{p(x_1|x_0)}.$$
I'm going to anticipate some of the things we're going to talk about below and define the heat in terms of the log ratio of the reverse to the forward transition probabilities:
$$Q \equiv \frac{1}{\beta} \log \frac{q(x_0|x_1)}{p(x_1|x_0)}.$$
With this final identification we end up with the general statement:
$$\frac{q_R}{p_F} = e^{-\beta(W - Q - \Delta F)}.$$
The density ratio of the reverse process (shortened here as $q_R$) to the forward process $p_F$ is given by the exponential of $-\beta$ times the work, minus the heat, minus the change in free energy.
Hamiltonian Dynamics
First, if we assume that our dynamics is Hamiltonian, and thus deterministic and reversible, we know that the probability that we start at $x_0$ and end up at $x_1$ if we evolve forward in time is the same as the probability that we start at $x_1$ and end up at $x_0$ if we reverse our time evolution ($q(x_0|x_1) = p(x_1|x_0)$),[3]
so the ratio of conditional probabilities actually cancels and we generate the Crooks Fluctuation Theorem:
$$\frac{q_R}{p_F} = e^{-\beta(W - \Delta F)}.$$
The ratio of the reverse process probability to the forward probability for a given initial and final point is given by the exponential $e^{-\beta(W - \Delta F)}$. If we now take the expectation of this with respect to the forward process, we generate the Jarzynski equality:[4]
$$\int dx_0\, dx_1\, p(x_0, x_1)\, \frac{q(x_0, x_1)}{p(x_0, x_1)} = 1 = \left\langle e^{-\beta(W - \Delta F)} \right\rangle_p,$$
which simplifies to:[5]
$$\left\langle e^{-\beta W} \right\rangle_p = e^{-\beta \Delta F}.$$
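As a quick sanity check (my own toy example, not part of the original argument), consider the simplest possible protocol: an instantaneous switch of a harmonic potential from stiffness $k_0$ to $k_1$, so that $x_1 = x_0$ and $W = \tfrac{1}{2}(k_1 - k_0) x_0^2$. The Gaussian average can be done in closed form and reproduces the free energy difference exactly:
$$\left\langle e^{-\beta W} \right\rangle_p = \int dx_0\, \sqrt{\frac{\beta k_0}{2\pi}}\, e^{-\frac{1}{2}\beta k_0 x_0^2}\, e^{-\frac{1}{2}\beta (k_1 - k_0) x_0^2} = \sqrt{\frac{k_0}{k_1}} = \frac{Z(\beta, k_1)}{Z(\beta, k_0)} = e^{-\beta \Delta F}.$$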
So, recapping, what have we just done?
Since we can take density ratios of arbitrary probability distributions, we could choose those two densities to mean something we care about. Consider $p$ the forward, Hamiltonian evolution of a system from $x_0$ to $x_1$ and $q$ the reverse process. If we imagine that both the forward and reverse processes start in a state of canonical equilibrium, we can generate both the Crooks Fluctuation Theorem and the Jarzynski equality.
The power of this result is that it allows us to relate an expectation computed with respect to non-equilibrium processes (the exponential of minus $\beta$ times the stochastic work measured over a bunch of non-equilibrium realizations of our trajectory) to a pure equilibrium quantity (a difference of equilibrium free energies).
In the context of the physical sciences, this lets us perform non-equilibrium simulations or experiments, and provided we measure the work performed over many such runs, even with the system driven far from equilibrium, we can estimate equilibrium free energy differences.
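In practice this suggests a simple estimator. The sketch below (my own illustration; the instantaneous-quench work distribution from the worked example above is used only so the estimate can be compared against the exact answer) recovers the free energy difference from a batch of measured work values:

```python
import numpy as np

def jarzynski_free_energy(work, beta):
    """Estimate dF from non-equilibrium work samples via <exp(-beta W)> = exp(-beta dF)."""
    w = -beta * np.asarray(work)
    log_mean = np.logaddexp.reduce(w) - np.log(len(w))  # log-mean-exp for numerical stability
    return -log_mean / beta

# Toy work samples: instantaneous quench of a harmonic potential from k0 to k1.
rng = np.random.default_rng(1)
beta, k0, k1 = 1.0, 1.0, 4.0
x0 = rng.normal(scale=1.0 / np.sqrt(beta * k0), size=200_000)  # x0 drawn from the canonical ensemble for k0
work = 0.5 * (k1 - k0) * x0**2

print(jarzynski_free_energy(work, beta))   # estimated dF
print(0.5 * np.log(k1 / k0) / beta)        # exact dF = log(k1/k0) / (2 beta) for this toy problem
```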
Stochastic Dynamics
But let's say you don't like the assumption that the dynamics are Hamiltonian. We can imagine something else: let the dynamics be stochastic, and discretize them in time. We still need to make some kind of assumption; in this case, we'll imagine that our process consists of $N$ steps, each of which is governed by a Markov transition kernel. Finally, we'll assume that each transition kernel has a stationary distribution and satisfies detailed balance with respect to it.
What this means is that we'll imagine that our forward process now takes the form:
$$p_F = p(x_0)\, p(x_1|x_0)\, p(x_2|x_1) \cdots p(x_N|x_{N-1}) = p(x_0)\, T_1(x_1|x_0)\, T_2(x_2|x_1) \cdots T_N(x_N|x_{N-1}).$$
Here we've denoted the intermediate conditional distributions as being governed by our transition kernels, labeled with the corresponding stationary distribution. Saying that our kernels have a stationary distribution that they respect according to detailed balance means that:
$$T_k(x'|x)\, \sigma_k(x) = T_k(x|x')\, \sigma_k(x'),$$
for the stationary distribution $\sigma_k$.
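As a concrete (and entirely illustrative) instance of such a kernel, here is a discrete Metropolis kernel built for an arbitrary stationary distribution $\sigma$, together with a numerical check of the detailed balance condition above:

```python
import numpy as np

def metropolis_kernel(sigma):
    """Dense transition matrix T[x_next, x] with uniform proposals and Metropolis acceptance."""
    n = len(sigma)
    T = np.zeros((n, n))
    for x in range(n):
        for xp in range(n):
            if xp != x:
                T[xp, x] = (1.0 / (n - 1)) * min(1.0, sigma[xp] / sigma[x])
        T[x, x] = 1.0 - T[:, x].sum()  # rejected proposals stay put
    return T

rng = np.random.default_rng(2)
sigma = rng.random(5); sigma /= sigma.sum()
T = metropolis_kernel(sigma)

# Detailed balance: T(x'|x) sigma(x) == T(x|x') sigma(x') for every pair of states.
flux = T * sigma[np.newaxis, :]       # flux[xp, x] = T(xp|x) sigma(x)
print(np.allclose(flux, flux.T))      # True
print(np.allclose(T.sum(axis=0), 1))  # columns sum to one: a valid transition kernel
```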
We've defined our forward process; now we need to define our reverse process. We'll imagine that the reverse process is governed by the same transition kernels but running in reverse:[6]
$$q_R = q(x_N)\, T_N(x_{N-1}|x_N)\, T_{N-1}(x_{N-2}|x_{N-1}) \cdots T_1(x_0|x_1).$$
Now if we look at the ratio of our reverse to our forward process, things simplify a bit:
$$\frac{q_R}{p_F} = \frac{q(x_N)\, T_N(x_{N-1}|x_N) \cdots T_2(x_1|x_2)\, T_1(x_0|x_1)}{p(x_0)\, T_1(x_1|x_0)\, T_2(x_2|x_1) \cdots T_N(x_N|x_{N-1})} = \frac{q(x_N)}{p(x_0)}\, \frac{T_1(x_0|x_1)}{T_1(x_1|x_0)}\, \frac{T_2(x_1|x_2)}{T_2(x_2|x_1)} \cdots \frac{T_N(x_{N-1}|x_N)}{T_N(x_N|x_{N-1})} = \frac{q(x_N)}{p(x_0)}\, \frac{\sigma_1(x_0)}{\sigma_1(x_1)}\, \frac{\sigma_2(x_1)}{\sigma_2(x_2)} \cdots \frac{\sigma_N(x_{N-1})}{\sigma_N(x_N)}.$$
Finally, as we did above, let's imagine that all of these marginal distributions take the form of canonical distributions:[7]
$$q(x_N) \equiv \frac{1}{Z_N} e^{-\beta H_N(x_N)}, \qquad p(x_0) \equiv \frac{1}{Z_0} e^{-\beta H_0(x_0)}, \qquad \sigma_k(x_j) \equiv \frac{1}{Z_k} e^{-\beta E_k(x_j)}.$$
Notice the nice simplification that happens here: since we imagined our reverse process as the reverse of the forward process, the partition functions of the intermediate stationary distributions cancel within each of the $\sigma_k$ ratios, leaving only $Z_0$ and $Z_N$ behind. Putting this all together we obtain the general result:
$$\frac{q_R}{p_F} = e^{-\beta(W - Q - \Delta F)},$$
if we identify $W$ with the total energy change of the system ($H_N(x_N) - H_0(x_0)$), $\Delta F$ with the change in free energy coming from the partition functions (as above, $\beta \Delta F = \log Z_0/Z_N$), and now identify the heat as the additional energy changes in each of the intermediate processes:[8]
$$Q \equiv \sum_{k=1}^{N} Q_k, \qquad Q_k = \Delta E_k = E_k(x_k) - E_k(x_{k-1}).$$
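Spelling out the algebra behind that identification (nothing more than taking the logarithm of the previous expression and collecting terms):
$$\log \frac{q_R}{p_F} = -\beta\left[H_N(x_N) - H_0(x_0)\right] + \log \frac{Z_0}{Z_N} + \beta \sum_{k=1}^{N}\left[E_k(x_k) - E_k(x_{k-1})\right] = -\beta\left(W - Q - \Delta F\right).$$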
And I believe we've done it. Taking the expectation of this quantity with respect to the forward process will give us the Jarzynski equality again:[9]
$$\left\langle e^{-\beta(W - Q)} \right\rangle_p = e^{-\beta \Delta F}.$$
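None of this requires the kernels to actually equilibrate. The sketch below (my own toy construction, essentially annealed importance sampling on a Gaussian whose stiffness is ramped; all parameter choices are arbitrary) runs a few Metropolis steps per stage, accumulates $W$ and $Q$ exactly as defined above, and checks the equality numerically:

```python
import numpy as np

rng = np.random.default_rng(3)
beta = 1.0
stiffness = np.linspace(1.0, 5.0, 11)    # kappa_0 ... kappa_N with N = 10 stages

def energy(k, x):
    return 0.5 * k * x**2                # E_k(x) for stiffness k

def metropolis_steps(x, k, n_steps=5):
    """A few Metropolis updates targeting exp(-beta * energy(k, x)); satisfies detailed balance."""
    for _ in range(n_steps):
        xp = x + rng.normal(scale=0.5, size=x.shape)
        accept = rng.random(x.shape) < np.exp(-beta * (energy(k, xp) - energy(k, x)))
        x = np.where(accept, xp, x)
    return x

n_samples = 100_000
x = rng.normal(scale=1.0 / np.sqrt(beta * stiffness[0]), size=n_samples)  # p(x_0): canonical at kappa_0
x0 = x.copy()
Q = np.zeros(n_samples)
for k in stiffness[1:]:
    x_prev = x
    x = metropolis_steps(x, k)
    Q += energy(k, x) - energy(k, x_prev)                 # heat absorbed during this kernel's step

W = energy(stiffness[-1], x) - energy(stiffness[0], x0)   # total energy change H_N(x_N) - H_0(x_0)
print(np.mean(np.exp(-beta * (W - Q))))                   # <exp(-beta (W - Q))>
print(np.sqrt(stiffness[0] / stiffness[-1]))              # exp(-beta dF) = Z_N / Z_0 for these Gaussians
```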
Taking the expectation of the log of the forward-to-reverse ratio under the forward process gives the KL divergence between the forward and reverse processes, which we know must be non-negative:
$$D(p_F; q_R) = \left\langle \log \frac{p_F}{q_R} \right\rangle_{p_F} = \beta \langle W - Q \rangle - \beta \Delta F \geq 0,$$
which naturally generates the inequality (a version of the second law):
$$\langle W - Q \rangle \geq \Delta F.$$
As a reminder, in this case we have generalized to a situation where our initial distributions are canonical, but our dynamics can be any sequence of Markovian transition kernels, provided only that each kernel has a stationary distribution and satisfies detailed balance with respect to it.
Generalized Landauer Bound
Wolpert says that, from stochastic thermodynamics we know:
$$-\Delta Q = \Delta\Sigma + S(p_0) - S(p_1),$$
which, with $\Delta\Sigma \geq 0$, gives us the generalized Landauer bound:
$$-\Delta Q \geq S(p_0) - S(p_1).$$
For the classic case of bit erasure the change in entropy is $\log 2$, and (restoring energy units, since the heat here is measured in units of $kT$) we get Landauer's bound:
$$-\Delta Q \geq kT \log 2.$$
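Plugging in numbers (standard constants, my own arithmetic): at room temperature, $T \approx 300\,\mathrm{K}$,
$$kT \log 2 \approx (1.38 \times 10^{-23}\,\mathrm{J/K}) \times (300\,\mathrm{K}) \times 0.693 \approx 2.9 \times 10^{-21}\,\mathrm{J} \approx 0.018\,\mathrm{eV},$$
a tiny amount of energy per erased bit compared to what present-day hardware dissipates.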
So, where does this come from? Honestly, it doesn't seem like there is much to it. Imagine two joint distributions $p(x_0, x_1)$ and $q(x_0, x_1)$ describing a forward and a reverse process that move between two states. The KL divergence between these two is non-negative and monotonic (the divergence between the joints is at least the divergence between the marginals):
$$\left\langle \log \frac{p(x_0, x_1)}{q(x_0, x_1)} \right\rangle_p \geq \left\langle \log \frac{p(x_1)}{q(x_1)} \right\rangle_p \geq 0.$$
We can simply rearrange terms. Subtracting $\left\langle \log p(x_1)/q(x_1) \right\rangle$ from both sides, we first find the entropy production:
$$\Delta\Sigma \equiv \left\langle \log \frac{p(x_1|x_0)\, p(x_0)}{q(x_0|x_1)\, p(x_1)} \right\rangle \geq 0,$$
and we can establish the identity:
$$\left\langle \log \frac{p(x_1|x_0)\, p(x_0)}{q(x_0|x_1)\, p(x_1)} \right\rangle_p = \left\langle \log \frac{p(x_1|x_0)}{q(x_0|x_1)} \right\rangle_p + \left\langle \log \frac{p(x_0)}{p(x_1)} \right\rangle_p.$$
If we simply identify terms, we recover the Wolpert form:
$$\Delta\Sigma = -\Delta Q + S(p_1) - S(p_0).$$
To make these identifications, we can see that:
$$S(p_0) = -\langle \log p(x_0) \rangle, \qquad S(p_1) = -\langle \log p(x_1) \rangle.$$
And for the heat flow:
$$-\Delta Q \equiv \left\langle \log \frac{p(x_1|x_0)}{q(x_0|x_1)} \right\rangle,$$
which is the expected log-likelihood ratio of our forward and reverse conditional processes, i.e. some characterization of the irreversibility of our system.
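Since every term here is just an expectation over a joint distribution, the decomposition is easy to verify numerically in the discrete case. A minimal sketch (arbitrary random distributions, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6

# Forward process p(x0, x1) = p(x0) p(x1|x0) and an arbitrary reverse kernel q(x0|x1).
p0 = rng.random(n); p0 /= p0.sum()
p_fwd = rng.random((n, n)); p_fwd /= p_fwd.sum(axis=1, keepdims=True)   # p(x1|x0)
q_rev = rng.random((n, n)); q_rev /= q_rev.sum(axis=1, keepdims=True)   # q(x0|x1)

joint = p0[:, None] * p_fwd              # p(x0, x1)
p1 = joint.sum(axis=0)                   # p(x1)

S0 = -np.sum(p0 * np.log(p0))
S1 = -np.sum(p1 * np.log(p1))
neg_dQ = np.sum(joint * np.log(p_fwd / q_rev.T))                        # <log p(x1|x0)/q(x0|x1)>
dSigma = np.sum(joint * np.log((p_fwd * p0[:, None]) / (q_rev.T * p1[None, :])))

print(dSigma >= 0)                            # entropy production is non-negative
print(np.isclose(dSigma, neg_dQ + S1 - S0))   # the decomposition above holds exactly
```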
If we happen to be in a system that satisfies local detailed balance, we know that there should be some kind of steady state distribution for which:
$$p(x_1|x_0)\, \pi(x_0) = q(x_0|x_1)\, \pi(x_1),$$
so that:
$$\log \frac{p(x_1|x_0)}{q(x_0|x_1)} = \log \frac{\pi(x_1)}{\pi(x_0)},$$
and if we further imagine that the steady state distribution is Boltzmann-like and the system is in contact with some kind of heat bath, we see that:
$$\log \frac{\pi(x_1)}{\pi(x_0)} = \log \frac{\frac{1}{Z_1} e^{-\beta H_1}}{\frac{1}{Z_0} e^{-\beta H_0}} = \log \frac{Z_0}{Z_1} - \beta(H_1 - H_0) = \beta \Delta F - \beta \Delta U = -\Delta Q,$$
so we can identify the log ratio of the forward to the reverse transition probabilities with (minus) the heat flow from the bath.
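In the undriven case (a single fixed Hamiltonian, so $Z_1 = Z_0$ and $\Delta F = 0$) this reduces to
$$\log \frac{p(x_1|x_0)}{q(x_0|x_1)} = -\beta\left[H(x_1) - H(x_0)\right],$$
which is just minus $\beta$ times the energy the system picks up from the bath during the step, matching the heat $Q$ we identified in the stochastic dynamics section above.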
Variational Autoencoder
To show some of the generality of what we're doing here, let's do it again but for a completely different kind of system, this time a Variational Autoencoder. In a variational autoencoder there are two joint distributions at play. One is a representational model $p(x, z) = p(x)\, p(z|x)$, which starts with a draw from some true data distribution $p(x)$ and then uses an encoder to map that datum to some kind of representative code, or summary, or representation $z$: $p(z|x)$. The other joint distribution is a generative model $q(x, z) = q(z)\, q(x|z)$ that imagines a joint distribution over the same space but works in reverse. First, we generate a latent variable $z$ from some prior distribution $q(z)$ and then we use a decoder to stochastically turn that latent variable into a generated datum $x$: $q(x|z)$.
We can easily imagine the ratio of these two densities:
$$\frac{q(x, z)}{p(x, z)} = \frac{q(z)\, q(x|z)}{p(x)\, p(z|x)}.$$
As we saw above, the way to generate an inequality here is to turn this into a KL divergence:
$$D(p(x, z); q(x, z)) = \left\langle \log \frac{p(x)\, p(z|x)}{q(z)\, q(x|z)} \right\rangle_p = -\langle -\log p(x) \rangle_p + \langle -\log q(x|z) \rangle_p + \left\langle \log \frac{p(z|x)}{q(z)} \right\rangle_p \equiv -H + D + R \geq 0.$$
Here, just as above, we've only rearranged terms, but this time organized them into three contributions: the entropy of the true data generating process:
$$H \equiv \langle -\log p(x) \rangle_p,$$
the distortion, a measure of the likelihood that when we encode and then decode an image we get back the one we started with:
$$D \equiv \langle -\log q(x|z) \rangle_p = -\int dx\, p(x) \int dz\, p(z|x) \log q(x|z),$$
and the rate, a measure of the excess cost required to communicate this message $z$ over a wire designed to be optimal for the prior $q(z)$:
$$R \equiv \left\langle \log \frac{p(z|x)}{q(z)} \right\rangle_p = \left\langle D(p(z|x); q(z)) \right\rangle_{p(x)}.$$
We've just rederived the ELBO,[10] rendered in the form presented in Fixing a Broken ELBO:[11]
$$\mathrm{ELBO} \equiv D + R \geq H.$$
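The same trick is easy to verify numerically. A minimal discrete sketch (random stochastic matrices standing in for the encoder and decoder; entirely illustrative) confirms that the distortion plus the rate upper-bounds the data entropy:

```python
import numpy as np

rng = np.random.default_rng(5)
nx, nz = 8, 3

p_x = rng.random(nx); p_x /= p_x.sum()                               # true data distribution p(x)
enc = rng.random((nx, nz)); enc /= enc.sum(axis=1, keepdims=True)    # encoder p(z|x)
dec = rng.random((nz, nx)); dec /= dec.sum(axis=1, keepdims=True)    # decoder q(x|z)
prior = rng.random(nz); prior /= prior.sum()                         # prior q(z)

joint = p_x[:, None] * enc                                           # p(x, z)

H = -np.sum(p_x * np.log(p_x))                                       # entropy of the data
D = -np.sum(joint * np.log(dec.T))                                   # distortion <-log q(x|z)>
R = np.sum(joint * np.log(enc / prior[None, :]))                     # rate <log p(z|x)/q(z)>

print(D + R >= H)   # True for any choice of encoder, decoder, and prior
```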
Conclusion
We've managed to derive several non-equilibrium statistical mechanical equalities and inequalities seemingly from nothing. All of these results were powered by the facts we opened with: that probability distributions integrate to one and that KL divergences are non-negative. The only challenge here was one of semantics. To get power out of such trivial mathematical manipulations required us to make judicious choices in how we interpreted them.
Special thanks to Sam Schoenholz, Srinivas Vasudevan, Yasaman Bahri and Jim Sethna for helpful feedback on this post.