Probability
\(\newcommand{\R}{\mathbb{R}}\)
Distributions¶
Given a space \(A\), a probability distribution is described by a function \(f : A \to \R\), such that \(\sum_{a \in A} f(a) = 1\) and \(\forall a, f(a) \geq 0\).
For example, any function \(f\) with \(f(0) + f(1) = 1\) and \(f(0), f(1) \geq 0\) describes a distribution over the set \(B=\{0, 1\}\). Such a function \(f\) is known as the probability mass function (PMF), or the probability density function (PDF) if the domain is \(\R^n\).
We refer to the space of all possible distributions over \(B\), known as the space of Bernoulli distributions, as \(I(B)\). While \(B\) is finite, \(I(B)\) is infinite and in fact is a manifold.
Note
For each \(p : [0,1]\), there is exactly one distribution \((0 \mapsto p, \quad 1 \mapsto 1-p)\) which lives in \(I(B)\), so \([0,1] \cong I(B)\).
Note
When \(A\) is a discrete space, this is all quite simple, but in a continuous space, the sum becomes an integral, and we have to be more careful. This is one of the applications of measure theory. However, discrete probability distributions are enough to understand all the fundamental concepts of probability.
A category of distributions¶
\(I\) is a functor. Concretely, consider SET (or a category of manifolds) as the domain. For every object \(S\), \(I(S)\) is an object of DIST; it is a space the points of which are distributions \(f\). Ignoring technicalities, we'll use \(f\) to also refer to the PDF of a given distribution.
The morphisms are given by the functor; for \(g : M \to N\), we get \(I(g) : I(M) \to I(N)\), defined (in the discrete case) by:

$$I(g)(f)(n) = \sum_{m \in g^{-1}(n)} f(m)$$
Note
Statisticians often talk about random variables, a term I try to avoid here. But, roughly, a random variable \(X\) is a distribution over \(\R\), and when they write \(f(X)\) for some function \(f : \R \to \R\), they really mean the distribution \(I(f)(X)\).
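The pushforward \(I(g)\) can be sketched concretely for discrete distributions. This is a minimal illustration, with `pushforward` a hypothetical helper name rather than a library function:

```python
# A discrete distribution represented as a dict from outcomes to probabilities.
# I(g) pushes a distribution on M forward to one on N by summing the mass of
# every m that g sends to the same n.
def pushforward(g, dist):
    out = {}
    for m, p in dist.items():
        n = g(m)
        out[n] = out.get(n, 0.0) + p
    return out

# Distribution of a fair die, pushed forward along parity.
die = {k: 1 / 6 for k in range(1, 7)}
parity = pushforward(lambda m: m % 2, die)
# parity gives ≈ 0.5 to each of 0 and 1
```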
Joint distributions¶
A distribution \(P : I(A\times B)\), over a product space \(A \times B\), is known as a joint distribution.
Example: the distribution \(P\) on \(\{0,1\}\times\{0,1\}\) with \(P(0,0)=0.4\), \(P(0,1)=0.1\), \(P(1,0)=0.2\), \(P(1,1)=0.3\).
Note
The name is misleading because it suggests that it is composed of two distributions. It isn't.
Marginals as a special case¶
Given a joint distribution \(P\), say the one given above, we can take the function \(f = (a,b) \mapsto a\), and obtain \(P_1 = I(f)(P)\). This is known as a marginal distribution. We say that the variable \(B\) is marginalized out.
Note
Obviously we could do the same for functions like \((a,b,c) \mapsto (a,c)\), and anything of this sort is referred to as a marginal distribution.
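Marginalization is just the pushforward along a projection. A small sketch, with a made-up joint distribution and the helper name `marginal` chosen for illustration:

```python
# A toy joint distribution over ('rain'|'sun') x ('wet'|'dry').
joint = {('rain', 'wet'): 0.3, ('rain', 'dry'): 0.1,
         ('sun', 'wet'): 0.05, ('sun', 'dry'): 0.55}

# Marginalize by pushing forward along a projection function.
def marginal(joint, proj):
    out = {}
    for ab, p in joint.items():
        key = proj(ab)
        out[key] = out.get(key, 0.0) + p
    return out

# Marginalize out the second component, i.e. push forward along (a, b) -> a.
p_weather = marginal(joint, lambda ab: ab[0])
# p_weather ≈ {'rain': 0.4, 'sun': 0.6}
```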
Conditional distributions¶
"Conditional distribution" is a misleading name; it refers to a function \(f\) from a set \(A\) to distributions over a set \(B\). So given any \(a \in A\), \(f(a) \in I(B)\).
Conditional distributions are the morphisms of a category, with spaces of distributions again being the objects.
Note
For readers familiar with category theory: this category is the Kleisli category of the probability monad. Probabilistic programming implicitly revolves around constructing morphisms in this category.
Expectation¶
\(E_{x \sim p}[f(x)] := \sum_x p(x)f(x)\), for a PMF \(p\) and a function \(f\), is the standard way of writing the expectation. It can be viewed via DIST: push \(p\) forward along \(f\) and take the mean of the resulting distribution, since \(E_{x\sim p}[f(x)] = E_{y\sim I(f)(p)}[y]\).
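The definition is a one-liner for discrete distributions; a minimal sketch with a fair die as the toy example:

```python
# E_{x~p}[f(x)] = sum_x p(x) f(x) for a discrete PMF p.
def expectation(pmf, f):
    return sum(p * f(x) for x, p in pmf.items())

die = {k: 1 / 6 for k in range(1, 7)}
mean = expectation(die, lambda x: x)              # ≈ 3.5
second_moment = expectation(die, lambda x: x * x)
variance = second_moment - mean ** 2              # ≈ 35/12
```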
The geometry of distributions¶
Given a set \(S\), \(I(S)\) is in fact a manifold, in the usual sense that every point on the manifold can be mapped to its parameters under some chart.
For example, if \(S=\mathbb{R}\), then the unit normal distribution corresponds to a single point on the manifold. If we consider the submanifold of all normal distributions, one particular chart maps each normal to \((\mu, \sigma^2)\), its mean and variance; under this chart, the unit normal maps to \((0,1)\). So this submanifold is 2D.
The most important fact about this manifold is that there is a canonical way to make it Riemannian, i.e. a canonical choice of metric tensor \(g\), known as the Fisher information, which turns out to be the Hessian of the relative entropy:

$$g_{ij}(\theta) = \left.\frac{\partial^2}{\partial \theta'_i \partial \theta'_j} KL\left(p_{\theta} \,\|\, p_{\theta'}\right)\right|_{\theta'=\theta}$$
The upshot is that we can talk about the distance between distributions in a parametrization-invariant (i.e. coordinate free) way.
PDF of function of random variable¶
Suppose \(Y = g(X)\) and \(g\) is increasing. Then \(F_Y(y) = P(Y\leq y) = P(g(X)\leq y) = P(X\leq g^{-1}(y)) = F_X(g^{-1}(y))\).

Then \(f_Y(y) = F'_Y(y) = \frac{d}{dy}F_X(g^{-1}(y)) = f_X(g^{-1}(y))\frac{d}{dy}g^{-1}(y) = f_X(g^{-1}(y))\frac{dg^{-1}(y)}{dy}\)
Or:
By the general form of the substitution rule, we then have that

$$P(Y\in A) = \int_{g^{-1}(A)} f_X(x)\,dx = \int_A f_X(g^{-1}(y))\,J_{g^{-1}}(y)\,dy$$

But \(J_{g^{-1}}(y) = \det D(g^{-1})(y) = \det\left(Dg(g^{-1}(y))\right)^{-1} = J_g(g^{-1}(y))^{-1}\), so

$$P(Y\in A) = \int_A \frac{(f_X\circ g^{-1})(y)}{J_g(g^{-1}(y))}\,dy$$

So \(f_Y(y) = \frac{(f_X\circ g^{-1})(y)}{J_g(g^{-1}(y))}\), i.e. \((f_X\circ g^{-1})(y) = f_Y(y)\cdot J_g(g^{-1}(y))\).
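The change-of-variables formula can be sanity-checked numerically. A sketch, assuming \(X \sim \mathrm{Uniform}(0,1)\) and \(g(x) = x^2\) (increasing on \((0,1)\)), so that \(f_Y(y) = 1/(2\sqrt{y})\); the midpoint quadrature below is a rough check that this density integrates to 1:

```python
import math

# f_Y(y) = f_X(g^{-1}(y)) / g'(g^{-1}(y)) with f_X = 1 on (0,1) and g(x) = x^2.
def f_Y(y):
    x = math.sqrt(y)           # g^{-1}(y)
    return 1.0 / (2.0 * x)     # g'(x) = 2x

# Midpoint rule on (0, 1); the integrable singularity at 0 means the
# approximation is only close to 1, not exact.
n = 100_000
total = sum(f_Y((i + 0.5) / n) for i in range(n)) / n
# total ≈ 1.0
```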
Sum Random Variables by Convolving PDFs¶
The CDF of the sum is \(F_{X+Y}(z) = P(X+Y \leq z) = \int_{-\infty}^{\infty} f_X(x)\,F_Y(z-x)\,dx\), assuming independence of \(X\) and \(Y\). Then, taking the derivative with respect to \(z\), using the fundamental theorem of calculus, we have that \(f_{X+Y}(z) = \int_{-\infty}^{\infty} f_X(x)\cdot f_Y(z-x)\,dx\).
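The discrete analogue is immediate: the PMF of a sum of independent discrete variables is the convolution of their PMFs. A toy sketch with two fair dice:

```python
# PMF of X + Y for independent discrete X, Y: convolve the PMFs.
def convolve_pmfs(p, q):
    out = {}
    for x, px in p.items():
        for y, qy in q.items():
            out[x + y] = out.get(x + y, 0.0) + px * qy
    return out

die = {k: 1 / 6 for k in range(1, 7)}
two_dice = convolve_pmfs(die, die)
# two_dice[7] ≈ 6/36, the most likely total
```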
Characteristic Function¶
The characteristic function of a distribution is the (inverse) Fourier transform of its density. The characteristic function of a Gaussian is a Gaussian, since the Gaussian is an eigenfunction of the Fourier transform.
Central Limit Theorem¶
Various increasingly powerful versions exist, but the simplest is that for a sequence \(\{X_i\}\) of independent and identically distributed (iid) random variables with mean \(\mu\) and variance \(\sigma^2\), with \(S_n = \sum_i^n X_i\), the random variable \(Z_n=\frac{S_n - n\mu}{\sqrt{n\sigma^2}}\) converges in distribution to \(N(0,1)\), i.e. to a standard normal.
The proof uses Fourier methods. The PDF of a sum of \(n\) iid random variables \(X_i\) is the \(n\)-fold convolution of their common PDF; the Fourier transform turns this convolution into the Fourier transform of that PDF raised to the \(n\)th power, and Taylor expanding inside the Fourier integral gives us the result (see e.g. page 130 of this).
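The theorem is easy to see empirically. A sketch with iid \(\mathrm{Uniform}(0,1)\) draws (so \(\mu = 1/2\), \(\sigma^2 = 1/12\)); the sample sizes and the seed are arbitrary illustrative choices:

```python
import random, math

# Standardized sums of iid Uniform(0,1) variables should look approximately
# standard normal. Fixed seed for reproducibility.
random.seed(0)
mu, sigma2, n = 0.5, 1 / 12, 500

def z_n():
    s = sum(random.random() for _ in range(n))
    return (s - n * mu) / math.sqrt(n * sigma2)

zs = [z_n() for _ in range(2000)]
m = sum(zs) / len(zs)                         # ≈ 0
v = sum((z - m) ** 2 for z in zs) / len(zs)   # ≈ 1
```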
Inference¶
It is common to be in the situation that we know a PDF \(p\) only up to proportionality, i.e. we know that \(p(x) \propto \hat p(x)\) for some unnormalized probability function \(\hat p\). Inference is the goal of recovering \(p\), which amounts to finding the constant of proportionality \(1/Z\), i.e. \(p(x) = \frac{1}{Z}\hat p(x)\) for some constant \(Z\).
Variational Inference (VI)¶
In this approach, we specify a submanifold of distributions \(q\) with coordinates \(\lambda\) that we vary in order to move as close as possible to \(p\) according to the metric.
That is, we optimize \(\lambda\) in some \(q_{\lambda}(\cdot)\) to minimize \(KL(q \,\|\, p) := \sum_x q(x) \log\frac{q(x)}{p(x)} = E_q[\log\frac{q(x)}{p(x)}]\). Once we have performed this optimization to find some \(\lambda^\star\), \(q_{\lambda^\star}\) is our approximation of \(p\).
The challenge is to minimize \(KL(q \,\|\, p)\) without knowing \(p\). Here's the idea: let \(\hat{p}\) be the unnormalized version of \(p\), i.e. \(p = \hat p/Z\), and let's assume we have that. Then

$$KL(q \,\|\, p) = E_q\left[\log\frac{q(x)}{\hat p(x)}\right] + \log Z = -L(q) + \log Z$$

where \(L(q) := E_q[\log \hat p(x) - \log q(x)]\). KL divergence is always nonnegative (a basic fact of information theory), so:

$$\log Z \geq L(q)$$

which means that \(L(q)\) is a lower bound (in Bayesian inference contexts known as the evidence lower bound, or ELBO) on \(\log Z\). So we maximize \(L\) to minimize the KL.
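The whole scheme fits in a few lines on a two-point space. A toy sketch: `p_hat` and the grid over \(\lambda\) are made up for illustration, with the true \(p = (0.25, 0.75)\) and \(Z = 8\):

```python
import math

# Toy VI: p is known only via the unnormalized p_hat; we grid-search over
# Bernoulli q_lambda to maximize the ELBO
#   L(q) = E_q[log p_hat(x) - log q(x)] = log Z - KL(q || p).
p_hat = {0: 2.0, 1: 6.0}          # true p = (0.25, 0.75), Z = 8

def elbo(lam):
    q = {0: 1 - lam, 1: lam}
    return sum(q[x] * (math.log(p_hat[x]) - math.log(q[x]))
               for x in q if q[x] > 0)

grid = [i / 1000 for i in range(1, 1000)]
lam_star = max(grid, key=elbo)
# lam_star ≈ 0.75, and elbo(lam_star) ≈ log 8, the true log Z
```

Note that the maximizer recovers \(q = p\) exactly here because the family contains \(p\); in general the ELBO gap at the optimum is the residual KL.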
Mean Field Variational inference¶
If the distribution you are trying to approximate is a joint distribution (e.g. the Ising model) then as your approximating family, you can use a product of marginals (each independently parametrized). This is a familiar idea in statistical physics.
Monte Carlo Methods¶
Another method for inference is to find a good way to "draw samples" from a distribution, such that the empirical distribution you obtain converges to the true distribution. In a sense, this is also a variational method, since you are moving around a space of empirical distributions to get as close as possible to the true distribution.
Typically, the goal is not to obtain the PDF of the distribution of interest, but rather some expectation value.
In general, you want samples from the typical set, i.e. the part of the support of \(P\) which has both large volume and high probability density (speaking informally: it isn't always a region with sharp boundaries). It turns out that in high dimensions, almost all of the mass is concentrated in such a set.
Markov Chain Monte Carlo (MCMC)¶
This is a particular type of Monte Carlo method. The idea is that you choose a transition kernel \(T(x|x')\) in such a way that \(P\) is a fixed point of \(T\), that is, \(P(x) = (TP)(x) = \int T(x|x')P(x')dx'\). There are different ways to do this, and one common one, described below, is Metropolis-Hastings. Then you invoke some theorems (largely, I think, the Perron-Frobenius theorem) which tell you that given some assumptions (your \(T\) should be ergodic), \(T\) has a unit eigenvector (i.e. a distribution which is the fixed point of the kernel) which is maximal and unique. That means that \(\lim_{n\to\infty}T^nA\) for any initial distribution \(A\) converges to this fixed point, because the largest eigenvector eventually dominates. Because it's unique, this fixed point must be \(P\). Taking a Markovian walk along the kernel, starting from any \(A\), amounts to sampling from the fixed point, i.e. from \(P\).
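The fixed-point picture can be checked numerically on a finite state space. A sketch with a made-up 3-state ergodic kernel (entries chosen arbitrarily, columns normalized):

```python
# T[x][y] = probability of transitioning y -> x; each column sums to 1.
T = [[0.5, 0.2, 0.3],
     [0.3, 0.6, 0.1],
     [0.2, 0.2, 0.6]]

def step(T, p):
    # One application of the kernel: (Tp)(x) = sum_y T(x|y) p(y).
    return [sum(T[x][y] * p[y] for y in range(len(p))) for x in range(len(T))]

p = [1.0, 0.0, 0.0]        # arbitrary initial distribution A
for _ in range(200):       # T^n A for large n
    p = step(T, p)
q = step(T, p)
# q ≈ p: we have converged to the unique fixed point, P = TP
```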
Metropolis Hastings (MH)¶
\(P(x)T(x'|x) = P(x')T(x|x')\) is the condition of detailed balance for a pair of a distribution and a kernel. Detailed balance is sufficient (but not necessary) for \(P\) to be a fixed point of \(T\): integrating both sides over \(x'\) gives \(P(x) = \int T(x|x')P(x')dx' = (TP)(x)\), since \(\int T(x'|x)dx' = 1\).
MH is one method to obtain a kernel which satisfies detailed balance. The idea is that you have another kernel \(Q\) which you can choose pretty freely, and let \(T(x'|x) = A(x'|x)Q(x'|x)\), where \(A(x'|x)=\min\left(1,\frac{P(x')Q(x|x')}{P(x)Q(x'|x)}\right)\). Note that you don't have to know \(P\) exactly, only up to normalization, since it appears in a ratio. Clever. By design, some simple maths shows that with \(T\) as defined, \((P,T)\) satisfies detailed balance, so you're in business. Caveat: you want to make sure that \(T\) is ergodic.
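A minimal MH sketch, targeting \(p(x) \propto e^{-x^2/2}\) known only up to normalization. The Gaussian proposal is symmetric, so the \(Q\)-ratio cancels in the acceptance probability; the proposal scale, chain length, and seed are arbitrary illustrative choices:

```python
import random, math

random.seed(1)

def p_hat(x):
    # Unnormalized target density: standard normal up to a constant.
    return math.exp(-x * x / 2)

def mh_chain(steps, x0=0.0, scale=1.0):
    xs, x = [], x0
    for _ in range(steps):
        x_new = x + random.gauss(0, scale)      # symmetric proposal Q
        # Accept with probability min(1, p_hat(x_new) / p_hat(x)).
        if random.random() < min(1.0, p_hat(x_new) / p_hat(x)):
            x = x_new
        xs.append(x)
    return xs

xs = mh_chain(50_000)
mean = sum(xs) / len(xs)                           # ≈ 0
var = sum((x - mean) ** 2 for x in xs) / len(xs)   # ≈ 1
```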
Hamiltonian Monte Carlo (HMC)¶
The idea is that we derive a better transition kernel by moving into a phase space and exploiting the nice volume preserving nature of Hamiltonian dynamics (hence the name).
We start with the canonical distribution (see notes on statistical mechanics)

$$\pi(q,p) \propto e^{-H(q,p)}$$

where \(\pi\) is our density. We assume that \(\pi(q,p)=\pi(p|q)\pi(q)\), so that (up to additive constants)

$$H(q,p) = -\log\pi(p|q) - \log\pi(q) =: K(p,q) + V(q)$$

In other words, choosing \(\pi(p|q)\) amounts to choosing a kinetic energy \(K\), while the potential energy \(V\) is determined by \(\pi(q)\), which is known.
The transition kernel at a point \(q\) is a three step procedure. First, sample \(p\) from \(\pi(p|q)\). Then push \((q,p)\) forward using Hamiltonian dynamics, approximated with a symplectic integrator. Finally, discard \(p\), i.e. marginalize it out (since \(\sum_p \pi(q,p) = \pi(q)\)), which leaves a sample from \(\pi(q)\).

What remains is a good choice of Hamiltonian, and a good choice of symplectic integrator.
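The standard symplectic integrator for HMC is the leapfrog scheme. A minimal sketch, assuming \(H(q,p) = p^2/2 + V(q)\) with the illustrative choice \(V(q) = q^2/2\) (so `grad_V` is the identity); step size and step count are arbitrary:

```python
def grad_V(q):
    # Gradient of the illustrative potential V(q) = q^2 / 2.
    return q

def leapfrog(q, p, eps, steps):
    # Half step in momentum, alternating full steps, final half step.
    p = p - 0.5 * eps * grad_V(q)
    for _ in range(steps - 1):
        q = q + eps * p
        p = p - eps * grad_V(q)
    q = q + eps * p
    p = p - 0.5 * eps * grad_V(q)
    return q, p

q, p = 1.0, 0.0
H0 = 0.5 * p * p + 0.5 * q * q
q, p = leapfrog(q, p, eps=0.01, steps=1000)
H1 = 0.5 * p * p + 0.5 * q * q
# H1 ≈ H0: the energy error stays bounded, which is what makes the
# scheme usable for long Hamiltonian trajectories
```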
Entropy¶
Entropy \((H : DIST \to \mathbb{R}_+)\) is a function that maps a distribution to a real number. It is best defined via the relative entropy, a.k.a. KL divergence:

$$KL(p \,\|\, q) := \sum_x p(x)\log\frac{p(x)}{q(x)}$$
Conceptually, this is a measure of the information gained by moving from \(q\) to \(p\). It is (evidently) not symmetric.
When \(q(x)\propto C\), i.e. when \(q\) is uniform, and when the support \(A\) of the distribution is discrete, we define \(H\) as follows:

$$H(p) := \log|A| - KL(p \,\|\, q) = -\sum_x p(x)\log p(x)$$
So, entropy is the measure of how much more information a distribution contains than the uniform one.
Entropy in a continuous space¶
The obvious extension of entropy to a continuous variable, namely

$$H(p) = -\int p(x)\log p(x)\,dx$$

does not represent the right definition. The most obvious way to see that it cannot be right is that it isn't invariant under a change of variables, which means that for \(f : X \to Y\), and \(y=f(x)\):

$$-\int p'(y)\log p'(y)\,dy = -\int p(x)\log\left(p(x)\frac{dx}{dy}\right)dx = -\int p(x)\log p(x)\,dx + \int p(x)\log\frac{dy}{dx}\,dx$$

(This uses \(p(x)dx = p'(y)dy\) and \(dy = \frac{dy}{dx}dx\).) The extra term means the value depends on the parametrization.
A similar derivation shows that relative entropy is reparametrization invariant, which is an indication that it is the more fundamental quantity.
Entropy and Communication¶
Suppose you have a distribution \(p : I(A)\) over some set \(A\), from which you will be given a sample. Your goal is to come up with an injection \(f : A \to List(\{0,1\})\) (i.e. from \(A\) to arbitrary length bitstrings; for concatenated messages one wants it prefix-free) such that \(E_{x\sim p}[len(f(x))]\) is minimized.

The intuition: you are trying to minimize the expected length of the messages which losslessly communicate the sample.
The upshot is that one can talk of the information contained in a distribution, measured in the expected number of bits (Booleans) needed to encode a draw from the distribution.
The source coding theorem states that the minimum expected length is \(H(p)\), known as the entropy of \(p\), which is, up to a factor, the KL distance from a uniform distribution.
Intuition: if the entropy is low, this means the distribution puts a lot of mass on certain elements of the support, and accordingly, these can be assigned short messages, minimizing the expected message length.
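This intuition can be made concrete by computing entropies in bits; the two toy distributions below are made up, with probabilities chosen as powers of two so the optimal code lengths are whole bits:

```python
import math

# Entropy in bits: the expected number of bits an optimal code needs
# per draw from p (source coding theorem).
def entropy_bits(pmf):
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

uniform4 = {a: 0.25 for a in 'abcd'}
skewed = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
# entropy_bits(uniform4) ≈ 2.0: no code beats 2 bits per symbol
# entropy_bits(skewed) ≈ 1.75: short codes for likely symbols pay off
```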
This is the foundation of a much larger relationship between information theory and communication, which seeks efficient ways to send data over noisy or noiseless channels and has many applications in modern communication.
Properties of entropy¶
As the source coding theorem makes clear, information, as measured by entropy, is an additive quantity, which is why a logarithm appears in the definition.
It has various intuitive properties, not least the grouping property: if you partition the support into cells, your uncertainty about which cell a sample falls in, plus your expected uncertainty within each cell (weighted by the cell's probability), equals the entropy of the unpartitioned distribution.
Probability distributions¶
Gaussian (a.k.a. Normal) Distribution¶
$$ p(x) \propto e^{-x^2/2} $$ This is the Normal distribution with variance \(1\) and mean \(0\).
The more complicated looking form of a normal distribution with general mean and variance simply follows from change of variables. Similarly for the multivariate Gaussian, a change from \(X\) to \(\mu + \Sigma X\) justifies its form, where \(\Sigma\) is a linear map.
The normalization constant can be computed analytically, by using the fact that \(\int_{-\infty}^{\infty}e^{-x^2}dx=\sqrt{\pi}\).
Derivation
Smoothing over a bunch of analytic difficulties, here's the basic idea: square the integral and pass to polar coordinates, so that \(I^2 = \int\int e^{-(x^2+y^2)}\,dx\,dy = \int_0^{2\pi}\int_0^\infty e^{-r^2}r\,dr\,d\theta = \pi\).
Exponential Distribution¶
Markov Random Field¶
$$p(x \mid \theta) \propto e^{\theta^T f(x)}$$

where \(f(x)\) is a vector of indicator functions and \(\theta\) is a vector of parameters. This parametrization shows that an MRF is an exponential family distribution (see below).
Empirical Distribution¶
A distribution over any set \(X\). Given a data set \(D\) of points \(d \in X\),
\(p(y) \propto \sum_{d \in D} I[d=y]\)
Exponential Family¶
Not to be confused with the exponential distribution. For fixed functions \(b\) and \(T\) but with \(\eta\) varying, we have a manifold of distributions:

$$p(x \mid \eta) = b(x)\,e^{\eta^T T(x) - a(\eta)}$$

with \(T(x)\) the sufficient statistic, \(\eta\) the natural parameter, \(a(\eta)\) the log-normalizer, and \(b\) the base measure.

So \(\eta\) parametrizes a family. For instance, we can choose \(T\) and \(b\) so that, as \(\eta\) ranges, we get all Gaussians, or the Bernoulli, Poisson, exponential, von Mises, or gamma distributions. \(a\) is determined by \(T\) and \(b\) and is the log of the normalizing constant.
Important facts about exponential family distributions:

- \(\frac{da(\eta)}{d\eta} = E[T(x);\eta]\)
- \(H_{\eta}\,a(\eta) = Cov[T(x);\eta]\), where \(H_\eta\) is the Hessian w.r.t. \(\eta\)
- \(a\) is convex in \(\eta\)
- Each exponential family is conjugate to an exponential family
- Maximum entropy: exponential families are the most entropic distributions satisfying the constraint that the expectation of \(T\) equals some \(\alpha\), for a fixed base measure
To find the maximum likelihood \(\theta\) for an observation \(D\), one wants to take the gradient of \(\log P(x\mid\theta)\); in exponential families the negative log-likelihood is convex, so this finds the global optimum. Taking the gradient of the log of the numerator is easy. But the gradient of the log of the denominator (the normalizer \(Z(\theta)\)) is as follows:

$$\nabla_\theta \log Z(\theta) = E_{P(x\mid\theta)}[f(x)]$$

The Hessian w.r.t. \(\theta\) of \(\log Z\) is the covariance of \(f(x)\) w.r.t. the MRF. Covariance matrices are positive semidefinite, so \(\log Z\) is convex, proving that the negative log-likelihood is convex.
Example of how to put Bernoulli distribution in exponential family form:
\(T=id\), \(b(x)=1\), \(\eta=\sigma^{-1}(\mu)\).
Crucial point: how do we convert from the mean parametrized form of the Bernoulli distribution, namely \(P(x\mid\mu) = \mu^x\cdot (1-\mu)^{1-x}\), to the naturally parametrized form?

Easy: take the log odds: \(\eta=\log\frac{\mu}{1-\mu}\). And the inverse is the sigmoid function.
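The logit/sigmoid pair is a two-liner; a quick round-trip check (the value \(\mu = 0.8\) is an arbitrary example):

```python
import math

def logit(mu):
    # Natural parameter from mean parameter: eta = log(mu / (1 - mu)).
    return math.log(mu / (1 - mu))

def sigmoid(eta):
    # Mean parameter from natural parameter: the inverse of logit.
    return 1 / (1 + math.exp(-eta))

mu = 0.8
eta = logit(mu)        # log 4
# sigmoid(eta) recovers mu up to float error
```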
Distributions over functions of time (stochastic processes)¶
We can have distributions that live in \(I(\R\to\R)\), i.e. where each element of the support is a function.
One can marginalize such a distribution \(f : I(\R\to\R)\) at a particular time \(t\) by pushing it forward along evaluation: \(\rho(t) := I(g \mapsto g(t))(f)\), giving \(\rho : \R \to I(\R)\), i.e. a time varying distribution.
A physical application is to model the time varying motion of a particle with unknown dynamics (see e.g. Brownian motion).
Stochastic integration¶
For a process \(P\):

$$\int_0^\tau g(t)\,dP(t) := \lim_{v\to\infty}\sum_{k=0}^{v-1} g(t_k)\left(P(t_{k+1})-P(t_k)\right)$$

where \(t_k = k(\tau/v)\) in the Ito formulation of the integral, and \(t_k = (k + \frac{1}{2})(\tau/v)\) in the Stratonovich formulation.
Wiener process¶
So \(W\) is the limit of a random walk in the regime of a high density of (suitably scaled) steps.

We have that \(W(t) \sim N(0,t)\), and \(W(t)-W(s) \sim N(0, t-s)\) for \(t > s\).
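These properties are easy to check by simulating the random-walk limit directly; the discretization, path counts, and seed below are arbitrary illustrative choices:

```python
import random

# Simulate W(t) as a scaled random walk: independent N(0, dt) increments.
random.seed(2)
dt, steps, paths = 0.01, 100, 2000    # steps * dt = 1, so we sample W(1)

def w_at_1():
    w = 0.0
    for _ in range(steps):
        w += random.gauss(0, dt ** 0.5)
    return w

ws = [w_at_1() for _ in range(paths)]
mean = sum(ws) / paths
var = sum((w - mean) ** 2 for w in ws) / paths
# mean ≈ 0 and var ≈ 1, consistent with W(1) ~ N(0, 1)
```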
Moreover, for \(Y(t) = \int_0^t g(s)\,dW(s)\), we have \(Y(t) \sim N\left(0, \int_0^t g^2(s)\,ds\right)\), with the variance given by the Ito isometry.
Processes via Stochastic Differential Equations (SDEs)¶
It is typical to use the Wiener process to create more complex processes, e.g. \(dX = a(X(t),t)dt + b(X(t),t)dW\)
Generators¶
For a given process \(dX = a(X(t),t)dt + b(X(t),t)dW\), the associated generator is \(\mathcal{L} = a\frac{d}{dx} + \frac{1}{2}b^2\frac{d^2}{dx^2}\).
OrnsteinUhlenbeck SDE¶
Let \(dP = -\gamma P\,dt + \sqrt{2\gamma T m}\,dW\). Then it isn't hard to show that for \(t\) large, \(P\) approaches the stationary distribution \(\rho(p) \propto e^{-p^2/2mT}\).

Physically, this is a model of the approach to the equilibrium distribution (of momenta) via an SDE.
Langevin SDE¶
This is basically Ornstein-Uhlenbeck coupled to Hamiltonian dynamics, in the following sense:

$$dq = \frac{p}{m}\,dt, \qquad dp = -\nabla V(q)\,dt - \gamma p\,dt + \sqrt{2\gamma T m}\,dW$$
Physically, it is used as a model of a particle in a "bath" of smaller particles.
FokkerPlanck¶
Considering the definition of \(\rho\) above, we may ask about its dynamics, in terms of the defining process \(X\).
The answer is a continuity equation \(\partial^\mu j_\mu=0\) for flux \(j^0=\rho\) and \(j^i := \frac{dx^i}{dt}\rho - D\frac{d}{dx^i}\rho\) (for \(i\) ranging over space dimensions).
Equivalently:
\(\frac{d}{dt}\rho = \mathcal{L}^\dagger \rho\)
where (via integration by parts) we have that \(\mathcal{L}^\dagger\rho = -\frac{d}{dx}(a\rho) + \frac{1}{2}\frac{d^2}{dx^2}(b^2\rho)\)
Or more explicitly in 1D:

$$\frac{\partial\rho}{\partial t} = -\frac{\partial}{\partial x}\left(a\rho\right) + \frac{1}{2}\frac{\partial^2}{\partial x^2}\left(b^2\rho\right)$$
Example
For \(dX = dW\) (i.e. \(a = 0\) and \(b = 1\)), \(\mathcal{L}^\dagger = \frac{1}{2}\frac{d^2}{dx^2}\), so

$$\frac{\partial\rho}{\partial t} = \frac{1}{2}\frac{\partial^2\rho}{\partial x^2}$$

This is known as the diffusion equation.
Derivation
We proceed in a recursive fashion, taking \(P(f(t)=x \mid x_0, t_0)\) to be the marginal probability conditional on an initial condition, to obtain:

$$P(x, t+dt) = \int dx'\, P(x', t)\, E\left[\delta(x - x' - \Delta X)\right] = \int dx'\, P(x', t)\left[\delta(x-x') - E[\Delta X]\,\delta'(x-x') + \frac{1}{2}E[(\Delta X)^2]\,\delta''(x-x') + \cdots\right]$$

where we have expanded \(g(a) := \delta(x-a)\) around \(x'\), so that \(\Delta X = x(t+dt)-x'\). Then to evaluate \(E[\Delta X]\) and \(E[(\Delta X)^2]\), we'll need to know the SDE for \(f\).
To go further, let us suppose that we have

$$\frac{df}{dt} = v(f) + \xi(t)$$

where \(v\) is deterministic and \(\xi\) stochastic with \(E[\xi(t)\xi(t')]=2D\delta(t-t')\). We could do other cases, but this gives the general idea. Then \(f_{x_0=x'}(t+dt) = x' + v(x')dt + \int_t^{t+dt}\xi(t')dt'\), so \(E[\Delta X] = v(x')dt\) and \(E[(\Delta X)^2] = 2D\,dt\),

so that

$$\frac{\partial P}{\partial t} = -\frac{\partial}{\partial x}\left(v(x)P\right) + D\frac{\partial^2 P}{\partial x^2}$$
Dynamical systems¶
References
Chapter 18 of Machine Learning: A Probabilistic Perspective is great on this topic.
One can also have stochastic dynamical systems, in the sense of a distribution over functions of time, conditional on a function of time.
Discrete time cases of stochastic dynamical systems are known as state space models. For example, imagine a distribution over the position of a particle that at each time \(t\) is a function of the position at time \(t-1\).

The goal is to infer the current or future states from previous states. Inference methods take advantage of this recurrence inherent in the model.
A linear dynamical system is one where the temporal transitions and the production of observations are linear, with additive Gaussian noise. Such systems are solvable in closed form by the Kalman filter.
In the case where the transition and observation models are not linear, one option (the Extended Kalman Filter) is to take the Jacobian in order to obtain a linear approximation, and basically do a Kalman filter with that. The unscented Kalman filter is similar, but instead first passes the distribution (or sample points) through the nonlinear function, and then fits a Gaussian.
More generally, one can do Monte Carlo inference in the sense of approximating the state at time \(t\) by an empirical distribution, and then update this empirical distribution at each time.
Distributions over functions of space¶
This mathematics is crucial for field theories in physics (particularly statistical and quantum field theories), since they amount to distributions over functions of space.
Recall that we can view functions \(f: \R^n \to \R\) as vectors. In terms of coefficients, we can regard the values \(f(x)\), \(x\in\R^n\), as the coefficients of the vector \(|f\rangle\). Then \(\langle f, g \rangle := \int dx\, f(x)g(x)\), and \((Af)(x) = \int dy\, A(x,y)f(y)\).
We are interested in distributions on the space, in particular the Boltzmann distribution \(p(\phi) \propto e^{\beta F(\phi)}\) for some \(F : (\R^n \to \R) \to \R\).
Since the Boltzmann distribution is in the exponential family, the normalizing constant \(Z = \int d\phi e^{\beta F(\phi)}\) is a generating function for the expectations we want to calculate.
When \(F\) only has quadratic terms, \(Z\) is a Gaussian integral:

$$Z = \int d\phi\, e^{-\frac{1}{2}\int dx\,dy\; \phi(x)A(x,y)\phi(y)}$$

for \(A(x,y) := \delta(x-y)(-\partial^2_x + \mu^2)\) (where we have integrated by parts to obtain the \(-\partial^2_x\) term).
In finite dimensional vector spaces, we use the formula \(Z = \int dx\, e^{-\frac{1}{2}x^TAx + B^Tx} = \sqrt{\det(2\pi A^{-1})}\,e^{\frac{1}{2}B^TA^{-1}B}\). This requires the inverse of \(A\). In the infinite dimensional case, the same formula holds, but \(A^{-1}\) is defined by the property:

$$\int dy\, A(x,y)A^{-1}(y,z) = \delta(x-z)$$
and known as the Green's function of \(A\). In the finite dimensional case, \(A^{1}\) is the covariance of the Gaussian, and generated by the second derivative of \(\log Z\), as we can see explicitly above. Higher moments are functions of the covariance. Similarly in the infinite dimensional case, \(A^{1}(x,y) = \langle \phi(x), \phi(y)\rangle\) and Wick's theorem gives that higher order correlations are sums of products of the twopoint correlation.
Then since plane waves diagonalize \(A\):

$$(-\nabla^2_x + \mu^2)\,e^{ik\cdot x} = (k^2+\mu^2)\,e^{ik\cdot x}$$

we find that \(\langle \phi(x), \phi(y)\rangle = A^{-1}(x,y) = (-\nabla^2_x + \mu^2)^{-1}(x,y) = \int \frac{d^4k}{(2\pi)^4}\,\frac{e^{ik\cdot(x-y)}}{k^2+\mu^2 + i\epsilon}\)
Under construction