I’m trying something different today. There are a handful of topics I’ve wanted to practice writing about. And I do have a blog. So this begins a series of posts about non-political, probably-technical topics I’ve wanted to write up.
Gaussian Processes are a robust, flexible way to fit models to data. But it’s hard to figure out what they are! Tutorials almost all begin with a definition of a random process (a probability measure over functions…?), and you immediately get lost in the weeds.
It took me several years, and a few great textbooks, to build intuition for what a Gaussian Process actually is, and I think there's a much better way to introduce it, at the expense of some of the rigor.
Let's discuss GPs the way I imagine they were actually discovered.
Fitting curvy lines through points
Suppose we have some data, defined as pairs of $x$ and $y$. $x$ can be a vector. (So can $y$, but we’ll pretend it’s scalar in this post.) In the example below, the data is missing for $x=7$, and that’s the point we want to predict.
```python
import numpy as np
from matplotlib import pyplot as plt
from collections.abc import Callable

np.random.seed(215)

# Observations at every integer except x = 7, which we want to predict
x = np.array([0, 1, 2, 3, 4, 5, 6, 8, 9])
y = np.sin(x)

plt.scatter(x, y)
plt.vlines(x=7, ymin=-1, ymax=1, colors="lightblue", linestyles="dashed")
plt.xlabel("X")
plt.ylabel("Y")
pass
```
Experienced statistical modelers will have a number of techniques for modeling this curve and predicting missing data. They include:
- Regression using powers of $x$: $y_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 + …$
- Loess Regression
- Splines
To invent Gaussian Processes, instead, let’s start from first principles.
- We can view this entire set of nine datapoints as one nine-dimensional random vector.
- The points that are closer to each other on the $x$-axis will have more similar $y$-values.
This leads to the key insight for a GP.
A simple GP
Let’s model the data as a giant multivariate normal random variable with a covariance matrix where things that are closer in $x$ have more positive covariance:
$$ \vec{y} \sim MVN(\mu=0, \Sigma(\vec{x})) $$where I’ve added the vector notation to emphasize these are all the $y$s, instead of a single one. We’ll use a prior mean of 0. This doesn’t mean we think the data themselves are zero, but that before we ever saw the data we would expect them to be centered around zero. If you have a real prior about the values, you can add it, but it rarely matters in practice.
We need a covariance matrix $\Sigma$. Let’s make one up. We’ll create a kernel function, which takes in two $x$ values and spits out a covariance. We’ll arbitrarily use

$$ k(x_i, x_j) = \exp\left\{ -\left( \frac{x_i - x_j}{2} \right)^2 \right\} $$
This is a squared exponential kernel (it’s often called a Gaussian kernel, but I’ll use Gauss’s name to refer to only one thing in this post). Notice that on the diagonal the value will be $\exp\{0\} = 1$, and the covariance between two points decays to zero as the distance gets larger. According to this kernel, the covariance falls to half its maximum once two points are about 1.66 apart, and, because the exponent is squared, it decays even faster beyond that. Is that right? Who knows!
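To make that concrete, here's a quick check of the kernel at a few distances, using the bandwidth of 2 from the formula above:

```python
import numpy as np

# Squared exponential kernel as a function of distance, bandwidth 2
def k(d, bandwidth=2):
    return np.exp(-((d / bandwidth) ** 2))

for d in [0, 1, 1.665, 3, 5]:
    print(f"distance {d}: covariance {k(d):.3f}")
# distance 0 gives exactly 1; distance ~1.66 gives ~0.5
```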
With this setup, it becomes easy to predict $y$ at $x=7$:
- Add the correct row and column to the covariance matrix for $x = 7$.
- Use the conditional normal equations to find the distribution of $y_p$.
To use the equations, let’s concatenate the data pairs $x_d, y_d$ with the value we want to predict: $y_p$ at $x_p$:
\begin{align*} y &= \begin{pmatrix} y_d \\ y_p \end{pmatrix} \\ x &= \begin{pmatrix} x_d \\ x_p \end{pmatrix} \end{align*}to give \begin{align*} y &\sim MVN(\mu, \Sigma) \\ \mu &= \begin{pmatrix} \mu_d \\ \mu_p \end{pmatrix} = \begin{pmatrix} \vec{0} \\ 0 \end{pmatrix} \\ \Sigma &= \begin{pmatrix} \Sigma_{dd} & \Sigma_{dp} \\ \Sigma_{pd} & \Sigma_{pp} \end{pmatrix} \\ \end{align*}
The conditional normal equations tell us that conditional on $y_d$, $y_p$ is normally distributed with
$$ y_p | y_d \sim N(\mu_{post}, \Sigma_{post}) $$where
\begin{align*} & \mu_{post} = \mu_p + \Sigma_{pd} \Sigma_{dd}^{-1} (y_d - \mu_d) \\ & \Sigma_{post} = \Sigma_{pp} - \Sigma_{pd} \Sigma_{dd}^{-1} \Sigma_{dp} \end{align*}(Aside: I love these equations, and it’s worth spending ten minutes building intuition here. They just translate information from the observed dimensions to the desired dimension, where the scaling and amount of information translated depends on the covariance.)
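One way to build that intuition is a toy two-dimensional case, where the equations reduce to scalars and can be checked against brute-force sampling. The numbers here are made up for illustration:

```python
import numpy as np

# Two jointly normal variables with mean zero, unit variance,
# and covariance rho. We observe the first and predict the second.
rho = 0.8
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])
y_d_observed = 1.5

# The conditional normal equations, reduced to scalars:
# mu_post    = mu_p + Sigma_pd Sigma_dd^{-1} (y_d - mu_d)
# Sigma_post = Sigma_pp - Sigma_pd Sigma_dd^{-1} Sigma_dp
mu_cond = 0.0 + rho * (1 / 1.0) * (y_d_observed - 0.0)  # = 1.2
var_cond = 1.0 - rho * (1 / 1.0) * rho                  # = 0.36

# Empirical check: condition a large sample on y_d being near 1.5
rng = np.random.default_rng(0)
samples = rng.multivariate_normal([0, 0], Sigma, size=500_000)
near = samples[np.abs(samples[:, 0] - y_d_observed) < 0.05]
print(mu_cond, var_cond)
print(near[:, 1].mean(), near[:, 1].var())  # close to 1.2 and 0.36
```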
And we’re ready! We just plug in the values from our example data:
```python
def kernel_square_exponential(x_0, x_1, bandwidth=2):
    return np.exp(-np.power((x_0[:, np.newaxis] - x_1[np.newaxis, :]) / bandwidth, 2))


def conditional_normal_mu_sigma(
    mu_d: np.ndarray,
    x_d: np.ndarray,
    y_d: np.ndarray,
    mu_p: np.ndarray,
    x_p: np.ndarray,
    kernel: Callable[[np.ndarray, np.ndarray], np.ndarray],
) -> tuple[np.ndarray, np.ndarray]:
    sigma_dd = kernel(x_d, x_d)
    sigma_pd = kernel(x_p, x_d)
    sigma_pp = kernel(x_p, x_p)
    sigma_dd_inv = np.linalg.inv(sigma_dd)
    mu_post = mu_p + sigma_pd @ sigma_dd_inv @ (y_d - mu_d)
    sigma_post = sigma_pp - sigma_pd @ sigma_dd_inv @ sigma_pd.T
    return mu_post, sigma_post


x_p = np.array([7])
mu_post, sigma_post = conditional_normal_mu_sigma(
    mu_d=np.zeros_like(x),
    x_d=x,
    y_d=y,
    mu_p=np.zeros_like(x_p),
    x_p=x_p,
    kernel=kernel_square_exponential,
)
print(f"{mu_post=}, {sigma_post=}")

plt.scatter(x, y)
plt.scatter(x_p, mu_post, c="r")
plt.vlines(
    x=7,
    # use a 95% posterior predictive interval
    ymin=mu_post - 1.96 * np.sqrt(sigma_post),
    ymax=mu_post + 1.96 * np.sqrt(sigma_post),
    colors="r",
)
plt.xlabel("X")
plt.ylabel("Y")
pass
```
```
mu_post=array([0.68350561]), sigma_post=array([[0.01330855]])
```
In fact, we could use this function to produce predictions at any $x$, giving us a full line of expected values.
```python
x_p = np.arange(-1, 12, 0.5)
x_p = x_p[~np.isin(x_p, x)]  # drop the x values we already observed

mu_post, sigma_post = conditional_normal_mu_sigma(
    mu_d=np.zeros_like(x),
    x_d=x,
    y_d=y,
    x_p=x_p,
    mu_p=np.zeros_like(x_p),
    kernel=kernel_square_exponential,
)

plt.scatter(x, y)
plt.scatter(x_p, mu_post, c="r")
plt.vlines(
    x=x_p,
    ymin=mu_post - 1.96 * np.sqrt(np.diag(sigma_post)),
    ymax=mu_post + 1.96 * np.sqrt(np.diag(sigma_post)),
    colors="r",
)
plt.xlabel("X")
plt.ylabel("Y")
pass
```
Notice that, compellingly, the posterior uncertainty is much smaller for interpolated points within the range of our data than for extrapolated points outside of it. And as we get far away, the posterior distribution converges to our prior distribution, $N(0, 1)$.
That’s it! We’ve built a simple Gaussian Process. The key points are:
- The data is modeled as a giant multivariate normal vector, together with the points to predict.
- The covariance is parametrized based on some kernel function such that “closer” points in $x$ have more positive covariance.
- The conditional normal equations do the rest.
Let’s discuss some finer points.
First, notice that we don’t have a parametrized line. Statisticians who are used to equations of the form $y = \alpha + \beta x$, or splines, or some other “parametrized” model where you have a functional form whose coefficients you are trying to estimate, will find this unsettling. Among other things, the model has no inherent notion of increasing or decreasing trends. There’s no slope. Instead, if one $y$ is low and the next $y$ is high, the points in between are expected to be increasing purely because of proximity. Predicting a new point requires using every observed datapoint every time, and handling the possibly-growing $\Sigma_{dd}$ and $\Sigma_{pd}$ matrices.
Second, relatedly, the only way $x$ matters is via the kernel function. This makes the method extend really nicely to multi-dimensional $x$, or even more complicated structures (maybe some elements in $x$ are categorical? and the kernel function is 1 if they match, 0 if they don’t?). As long as you can provide a kernel function that spits out a covariance between any two $\vec{x_i}, \vec{x_j}$, you’re in business.
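As an illustration (this exact kernel is made up for the example), here's a sketch of a kernel over a numeric column and a categorical column. It multiplies a squared exponential decay with a match/no-match indicator; the product of two valid kernels is itself a valid kernel:

```python
import numpy as np

# Hypothetical kernel for mixed inputs: squared exponential decay on the
# numeric part, times a 1-if-match / 0-if-not indicator on the categorical part
def kernel_mixed(x0_num, x0_cat, x1_num, x1_cat, bandwidth=2):
    numeric = np.exp(-(((x0_num[:, None] - x1_num[None, :]) / bandwidth) ** 2))
    categorical = (x0_cat[:, None] == x1_cat[None, :]).astype(float)
    return numeric * categorical

x_num = np.array([0.0, 1.0, 1.0])
x_cat = np.array(["a", "a", "b"])
K = kernel_mixed(x_num, x_cat, x_num, x_cat)
print(K)
# Points 0 and 1 share a category, so they covary; point 2 does not,
# so its off-diagonal covariances are zero
```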
Third, this method doesn’t work with repeated $x$ values. If you had two observations at $x=5$ with different $y$ values, the math would break: the kernel says they are perfectly correlated, so $\Sigma_{dd}$ becomes singular. Implicitly, we’re assuming that $y$ is exactly observed at each $x$. But that’s easy to fix by allowing for noise in the observation of $y$, which just becomes an additional $\sigma_e$ added to the diagonal of $\Sigma_{dd}$.
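A minimal sketch of that fix, reusing the squared exponential kernel from earlier (the noise level `sigma_e` here is an arbitrary choice):

```python
import numpy as np

def kernel_square_exponential(x_0, x_1, bandwidth=2):
    return np.exp(-np.power((x_0[:, np.newaxis] - x_1[np.newaxis, :]) / bandwidth, 2))

# Two repeated observations at x = 5: without noise, Sigma_dd is singular
x_d = np.array([5.0, 5.0, 6.0])
sigma_e = 0.1  # assumed observation-noise standard deviation

sigma_dd = kernel_square_exponential(x_d, x_d)
print(np.linalg.cond(sigma_dd))  # enormous: the matrix is (near-)singular

# Adding sigma_e^2 to the diagonal makes it comfortably invertible
sigma_dd_noisy = sigma_dd + sigma_e**2 * np.eye(len(x_d))
print(np.linalg.cond(sigma_dd_noisy))
```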
Fourth, what’s up with this whole “Process” thing? That deserves its own section.
A “Process”?
What makes this a Process is that it’s defined over infinite $x$s. We only ever have to deal with a finite number of points–our data points plus the points we want to predict–and that’s what makes it tractable, but the methodology is capable of producing values at any $x$. Rigorously defining this random distribution over infinite-dimensional vectors (or “lines”) requires defining a stochastic process.
If you consider this methodology before you’ve seen any data, it defines a prior over types of lines, with the kernel function determining the wiggliness of the probable lines. The output of your GP will be a compromise between the data itself and the wiggliness the prior expects. Here are some plots of lines sampled from the data-less prior for different denominators (or “bandwidths”) in our kernel function:
```python
n_sim = 5
x_sim = np.arange(0, 10, 0.1)
for bandwidth in (0.5, 1, 2, 4):
    sigma_sim = kernel_square_exponential(x_sim, x_sim, bandwidth=bandwidth)
    y_sim = np.random.multivariate_normal(np.zeros_like(x_sim), sigma_sim, n_sim)
    plt.plot(x_sim, y_sim.T)
    plt.title(f"{bandwidth=}")
    plt.show()
```
This inspires a totally different way to talk about the GP, which is the formal way they are usually taught. Suppose we write our infinitely many pairs of $y$, $x$ as $y = f(x)$. The values of $x$ are fixed but $y$ is random, which means $f$ itself is random! This random function is our “Process”. $f$ is a magical random function that is defined purely by the property that it makes the above methodology correct: for any finite set of $x$, $f(x)$ will have a multivariate normal distribution $N(\mu, \Sigma(x))$. I want to emphasize that I’m certain this definition of a “Process” did not emerge from first principles, but instead was developed by people (Andrey Kolmogorov and Norbert Wiener?) who wanted to try the intuitive thing above, and reverse engineered the definition that would make it valid. (Which honestly is probably how all definitions are developed.)
You never have to actually care about $f$ itself, since we only ever deal with it on a finite set of observed and predicted points, but it makes this well-defined. (If you want $y$ to have noise, just say $y = f(x) + \epsilon$, which turns into our friend $\sigma_\epsilon$ on the diagonal of $\Sigma$.)
Extensions of our simple GP
The Gaussian Process above isn’t ready for the real world. There are a number of extensions that have been built to make this workable. Those include:
- Estimating a kernel function. Instead of choosing a fixed kernel function, we usually want to allow for a parametrized family of kernel functions (allowing for e.g. different distance decays), and letting the model learn the optimal value. In this world, it helps to consider a two-step kernel function: first calculate a distance $d(x_i, x_j)$, and then use a parametrized decay $k(d; \theta)$, where we learn the $\theta$.
- Reducing computational complexity. Our GP algorithm required using every data point to make a new prediction, which obviously grows unwieldy. Sparse GP algorithms keep track of a smaller number of “inducing points”, and make new predictions by interpolating between them (among other tricks).
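As a sketch of the first extension: one common approach (though not the only one) is to choose the kernel parameter that maximizes the log marginal likelihood of the observed data under the prior. Here's a minimal grid-search version, with a small noise term on the diagonal for numerical stability; the grid and noise level are arbitrary choices:

```python
import numpy as np

def kernel_square_exponential(x_0, x_1, bandwidth=2):
    return np.exp(-np.power((x_0[:, np.newaxis] - x_1[np.newaxis, :]) / bandwidth, 2))

def log_marginal_likelihood(x_d, y_d, bandwidth, sigma_e=0.1):
    # log density of y_d under N(0, Sigma_dd + sigma_e^2 I)
    sigma = kernel_square_exponential(x_d, x_d, bandwidth) + sigma_e**2 * np.eye(len(x_d))
    _, logdet = np.linalg.slogdet(sigma)
    quad = y_d @ np.linalg.solve(sigma, y_d)
    return -0.5 * (logdet + quad + len(x_d) * np.log(2 * np.pi))

x_d = np.array([0, 1, 2, 3, 4, 5, 6, 8, 9], dtype=float)
y_d = np.sin(x_d)

bandwidths = np.arange(0.5, 5.0, 0.25)
lls = [log_marginal_likelihood(x_d, y_d, b) for b in bandwidths]
best = bandwidths[np.argmax(lls)]
print(f"best bandwidth on this grid: {best}")
```

In a real implementation you would optimize the kernel parameters with gradients rather than a grid, but the objective is the same.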