Bayesian Optimization

Imagine you have a function that you want to optimize, but it costs a lot to actually run this function.

This could be for a variety of reasons. Maybe the function is compute intensive, maybe it requires a lot of setup to run, or maybe evaluating it disturbs the very process you’re trying to study.

Lots of functions are fairly cheap to compute on modern hardware (especially with GPUs and other purpose-built silicon). A case in which we might actually want to reach for a more advanced technique is something like a large climate model that takes hours to run. Perhaps we take a whole host of variables: temperature, humidity, pressure, wind speed, and lay them out over some three-dimensional space. Each factor interacts with the others, there are millions of grid points, and as a result the combinatorial space of configurations is enormous.

Let’s imagine that we want to fine-tune the parameters of this climate model so that its output matches real-world conditions for, say, cloud formation. There are many parameters to evaluate, and for some of them there may be no “known” or even “best guess” value. On top of that, every evaluation is expensive, even with only a few parameters.

Luckily, we can exploit a method called Bayesian optimization, which is also useful when no gradient information is available. Since this method relies on generalized information (not specific to the exact problem it’s applied to), we call it a heuristic method. In the case of a climate model, the objective might measure how well the simulated output variables match some observed historical climate variable. We thus want to maximize the additive inverse (negative) of the root mean square error between the observed and simulated outputs.
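
Concretely, that objective might look something like this minimal sketch (the `simulated` and `observed` arrays are hypothetical stand-ins for model output and historical data):

```python
import numpy as np

def objective(simulated: np.ndarray, observed: np.ndarray) -> float:
    """Negative RMSE: maximizing this minimizes the mismatch."""
    return -np.sqrt(np.mean((simulated - observed) ** 2))
```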

To put the whole process rather simply, we assume only that we can query some function $f$ at any point $x$ to get $f(x)$.

Note that $x$ is not limited to being a single real number; it’s a stand-in for more complex kinds of inputs (vectors, configurations, and so on).

We solve this problem:

$$ \max_{x \in A} f(x) $$

Before we get to specifying the algorithm to solve this problem, let’s start with some preliminary knowledge.

Recall Bayes’ theorem for calculating conditional probabilities:

$$ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} = \frac{P(B|A) \cdot P(A)}{\int P(B|A) P(A) dA} $$

If we assign $A$ to some hypothesis and $B$ to some evidence, then we can view Bayes’ theorem as updating our belief in the hypothesis in light of the evidence: the prior $P(A)$ gets reweighted by the likelihood $P(B|A)$ to produce the posterior $P(A|B)$.

Imagine two cases of $P(B|A)$, one in which $P(B|A)$ is high and the other where it is low. If $P(B|A)$ is high (relative to $P(B)$), then $P(A|B)$ will be higher than $P(A)$, which means that based on the evidence $B$, our belief in $A$ is stronger.

In the second case, where we find a low likelihood $P(B|A)$, $P(A|B)$ will generally be lower than $P(A)$, which means that based on the evidence $B$, our belief in $A$ is weaker.

Based on this (and some continual sampling), we can iteratively update our estimate through Bayesian inference.

| Event/Hypothesis (H)      | Prior P(H) | Likelihood P(E\|H) | Prior × Likelihood P(E and H) | Posterior P(H\|E) |
|---------------------------|------------|--------------------|-------------------------------|-------------------|
| Has Disease               | 0.01       | 0.95               | 0.0095                        | 0.1610 (16.1%)    |
| Doesn't Have Disease      | 0.99       | 0.05               | 0.0495                        | 0.8390 (83.9%)    |
| Sum (P(E))                |            |                    | 0.0590                        | 1.00              |

You may recall doing this on an exam. If you do, you might be entitled to compensation for your experiences.
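
If you’d rather check the arithmetic in code, here’s a minimal Python sketch reproducing the table above (the prevalence and test rates are just the table’s illustrative numbers):

```python
# Bayes' theorem for the disease-test example from the table above.
prior = 0.01        # P(H): prevalence of the disease
likelihood = 0.95   # P(E|H): positive test given disease
false_pos = 0.05    # P(E|not H): positive test given no disease

# P(E) = P(E|H) * P(H) + P(E|not H) * P(not H)
evidence = likelihood * prior + false_pos * (1 - prior)

posterior = likelihood * prior / evidence  # P(H|E)
print(f"P(H|E) = {posterior:.4f}")  # 0.1610, matching the table
```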

Let’s also recall what covariance is:

(If you learn visually, you might consider looking at the graphic on the Wikipedia article for covariance.)

Ah, shoot, first you need to know what a random variable is.

Notably, a random variable is like a guinea pig: it is neither random nor a variable. (Just as the guinea pig is neither a pig nor from Guinea.) A random variable (paraphrasing Wikipedia here) is a function that maps the possible outcomes of a random experiment to some set of values that we can use to distinguish those outcomes.

The Wikipedian example is the act of flipping a coin. Here the domain (the possible outcomes, heads and tails) maps to the range ($\{1, -1\}$ respectively). Note that the random variable itself does not define the probabilities of those outcomes.

The utility of a random variable is, thus, to quantify the outcomes of random experiments.
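
As a tiny sketch of that, here’s the coin-flip random variable in Python (assuming a fair coin for the simulation, since the random variable itself doesn’t fix the probabilities):

```python
import random

# A random variable X: it maps coin-flip outcomes to the values {1, -1}.
def X(outcome: str) -> int:
    return {"heads": 1, "tails": -1}[outcome]

# The probabilities live in the experiment, not in X; we assume a fair coin here.
flips = [random.choice(["heads", "tails"]) for _ in range(10_000)]
mean = sum(X(o) for o in flips) / len(flips)
print(mean)  # empirical mean, close to 0 for a fair coin
```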

Now, the covariance of two jointly distributed real-valued random variables $X$ and $Y$ is defined as:

$$ \operatorname{cov}(X,Y) = E[(X-E[X])(Y-E[Y])] $$

Imagine this value as how we expect $X$ and $Y$ to move together. For a positive covariance, when $X$ is above its mean, $Y$ also tends to be above its mean. For a negative covariance, they tend to move in opposite directions. At zero covariance, we say they have no linear relationship. (Bummer for $X$ and $Y$, but it’s probably for the best. They never were a good couple after all.)
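
Here’s a quick numerical check of that intuition (a sketch; the linear relationships and noise levels are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=10_000)
y_pos = 2.0 * x + rng.normal(scale=0.5, size=10_000)   # moves with x
y_neg = -2.0 * x + rng.normal(scale=0.5, size=10_000)  # moves against x
y_ind = rng.normal(size=10_000)                        # unrelated to x

# cov(X, Y) = E[(X - E[X])(Y - E[Y])], estimated from samples
print(np.cov(x, y_pos)[0, 1])  # ~ +2: positive covariance
print(np.cov(x, y_neg)[0, 1])  # ~ -2: negative covariance
print(np.cov(x, y_ind)[0, 1])  # ~  0: no linear relationship
```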

Understanding covariance is crucial for Bayesian optimization because we need to model how similar function values are at nearby points; this similarity relationship is what allows us to make informed predictions about unexplored regions.

Okay, let’s get back to defining Bayesian optimization.

We define a model called a Gaussian process that effectively models the expensive function $f(x)$. It stipulates that for any $x_1, x_2, \cdots, x_t \in \mathbb{R}^n$, the vector comprised of $f(x_1), f(x_2), \cdots, f(x_t)$ follows a multivariate Gaussian distribution.

As a result, we get to describe this vector entirely by a mean at each point and the covariance between each pair of points.

This covariance function is called the kernel of the Gaussian process. It’s just a function! (woooh! no new complicated math!!) A common choice, and the one we’ll use, is the squared-exponential (RBF) kernel:

$$ K(x,y) := k \cdot \exp \left( -\frac{||x-y||_2^2}{2\sigma^2} \right) $$

where $k > 0$ and $\sigma > 0$ are defined as parameters of the kernel.

The $k$ parameter affects the amplitude of the function, whereas $\sigma$ controls how quickly the correlation between two points decays as they move apart. These parameters can be learned to fit the data observed so far.
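
To make that concrete, here is a minimal numpy sketch of the kernel above (the parameter values are arbitrary):

```python
import numpy as np

def rbf_kernel(x, y, k=1.0, sigma=1.0):
    """Squared-exponential kernel: K(x, y) = k * exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = np.sum((np.asarray(x) - np.asarray(y)) ** 2)
    return k * np.exp(-sq_dist / (2 * sigma ** 2))

print(rbf_kernel(0.0, 0.0))  # 1.0: a point is perfectly correlated with itself
print(rbf_kernel(0.0, 1.0))  # ~0.61: correlation decays with distance
print(rbf_kernel(0.0, 3.0))  # ~0.011: far-apart points are nearly uncorrelated
```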

The process of fitting to the data is referred to as “conditioning” the Gaussian process.

After we condition the Gaussian process on some existing sampled points, we get to make predictions at test points based on this conditioned Gaussian process. For some test point $x^*$, we predict a mean and a variance: $\mu(x^*)$ and $\sigma^2(x^*)$.

The utility of getting a mean and variance is that we have some sense of how “uncertain” the values around some prediction are.

I will avoid writing out the exact equations by which we calculate the mean and variance of a new estimated point; in short, we evaluate the kernel between each sampled point and the new point (and among the sampled points themselves). As we add more sampled points, the uncertainty around any given estimated point shrinks.
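
For the curious, here’s a sketch of what that conditioning looks like numerically, using the standard Gaussian process regression formulas with a zero prior mean (the training points, kernel parameters, and jitter term are my own choices for the example):

```python
import numpy as np

def rbf(a, b, k=1.0, sigma=1.0):
    # Pairwise squared-exponential kernel between 1-D arrays a and b.
    sq = (a[:, None] - b[None, :]) ** 2
    return k * np.exp(-sq / (2 * sigma ** 2))

# A few expensive evaluations we have already made (made-up data).
X_train = np.array([-2.0, 0.0, 1.5])
y_train = np.sin(X_train)

K = rbf(X_train, X_train) + 1e-8 * np.eye(len(X_train))  # jitter for stability
X_test = np.linspace(-3.0, 3.0, 5)
K_s = rbf(X_train, X_test)  # kernel between sampled points and test points

# Posterior mean and variance of the conditioned GP (zero prior mean assumed).
alpha = np.linalg.solve(K, y_train)
mu = K_s.T @ alpha
v = np.linalg.solve(K, K_s)
var = rbf(X_test, X_test).diagonal() - np.sum(K_s * v, axis=0)

for x_star, m, s2 in zip(X_test, mu, var):
    print(f"x* = {x_star:+.2f}   mu = {m:+.3f}   sigma^2 = {s2:.3f}")
```

Notice that the variance collapses near the sampled points and grows in the gaps between them, which is exactly the “sense of uncertainty” described above.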

It might be starting to become clear that we’re effectively building a function that models an expensive function, in the form of a multivariate Gaussian, for the purpose of getting much cheaper evaluations. Compared with a sample-hungry technique like deep learning, this is a dramatic improvement in sampling cost.
