Weights & Biases recently launched Sweeps, a sophisticated way to optimize your model hyperparameters. Since then, the team has been getting a lot of questions about Bayesian hyperparameter search: Is it faster than random search? When should I use it? How exactly does it work?

In this post, I’ll try to answer some of your most pressing questions about Bayesian hyperparameter search. I’ll also show you how to run a Bayesian search in 3 steps using Weights & Biases.

You can find the accompanying code here, and see a comparison of Bayesian, grid, and random sweeps here.

Hyperparameters in a machine learning model are the knobs used to optimize the performance of your model, e.g. *learning rate* in neural networks or *depth* in random forests. It’s tricky to find the right hyperparameter combination for a machine learning model on a specific task. What’s even more concerning is that machine learning models are very sensitive to their hyperparameter configurations: a model that performs well with one configuration may perform quite differently when that configuration is changed.

To address these problems, we resort to *hyperparameter tuning*, and the general process looks like this:

- Select the hyperparameters to be tuned (there can be a number of hyperparameters in a machine learning model).
- Specify a grid of acceptable values for the selected hyperparameters, or specify distributions that will generate the acceptable values.
- Train one machine learning model for each of the hyperparameter configurations that result from the above two steps.
- Select the model that performs the best from the pool of models.

Although there are many niche techniques that help us tune hyperparameters effectively, the following two are the most predominant:

- Grid search
- Random search

In grid search, we start by defining a grid: the list of hyperparameters along with the lists of acceptable values you want the search process to try. Following is a sample hyperparameter grid for a *Logistic Regression* model:

In this case, the search process would end up training a total of 2x3x3x4 = 72 different Logistic Regression models.
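To make that count concrete, here is a minimal sketch that enumerates a grid with the same shape. The hyperparameter names and values are hypothetical stand-ins (they are not the grid from the original figure); only the sizes 2, 3, 3 and 4 are taken from the text.

```python
from itertools import product

# Hypothetical grid for illustration only; the names and values are assumptions.
# The list sizes (2, 3, 3, 4) are chosen to match the 72 models mentioned in the text.
param_grid = {
    "penalty": ["l1", "l2"],
    "C": [0.1, 1.0, 10.0],
    "max_iter": [100, 200, 500],
    "tol": [1e-5, 1e-4, 1e-3, 1e-2],
}


def expand_grid(grid):
    """Return every combination of values in the grid as a list of dicts."""
    names = list(grid)
    value_lists = (grid[name] for name in names)
    return [dict(zip(names, values)) for values in product(*value_lists)]


combinations = expand_grid(param_grid)
print(len(combinations))  # 2 * 3 * 3 * 4 = 72
```

Each dictionary in `combinations` is one configuration that grid search would train a model for, which is why the cost grows multiplicatively with every hyperparameter added.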

In random search, rather than defining a grid of hyperparameter values, we specify a distribution from which the acceptable values for each hyperparameter can be sampled. The main difference from grid search is that, instead of trying every possible combination, each combination of hyperparameters is sampled randomly from the distributions we specify at the beginning of the search.
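A small sketch of that sampling step, using only the Python standard library. The search space below (names, ranges, and the choice of a log-uniform distribution for the learning rate) is an assumption made for illustration, as is the helper `sample_configs`.

```python
import random

# Hypothetical search space: each entry draws one value from its own distribution.
search_space = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-4, -1),   # log-uniform in [1e-4, 1e-1]
    "batch_size": lambda rng: rng.choice([32, 64, 96, 128]),  # uniform over a discrete set
    "dropout": lambda rng: rng.uniform(0.0, 0.5),             # uniform in [0, 0.5]
}


def sample_configs(space, n_samples, seed=0):
    """Draw n_samples independent configurations from the search space."""
    rng = random.Random(seed)
    return [{name: draw(rng) for name, draw in space.items()} for _ in range(n_samples)]


configs = sample_configs(search_space, n_samples=10)
for cfg in configs[:3]:
    print(cfg)
```

Note that every configuration is drawn independently of the others: nothing learned from one run influences the next draw.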

In the general hyperparameter tuning framework that you just saw, were you able to spot anything that could be improved? *Well, the different hyperparameter configurations that we specify are all independent of each other*. Could those configurations be made a bit more *informed*? What if we could guide the hyperparameter search process by using the past results (in this case, the results from running different hyperparameter configurations)? Would we be able to make the tuning process better? Let’s find out!

When it comes to using Bayesian principles in hyperparameter tuning, the following steps are generally followed:

- Pick a combination of hyperparameter values (our belief) and train the machine learning model with it.
- Get the evidence (i.e. the score of the model).
- Update our belief in a way that can lead to model improvement.
- Terminate when a stopping criterion is met (for example, when a loss is minimized or classification accuracy is maximized).

To understand this in a bit more detail, let’s introduce Bayes’ rule, which is *250+ years old*. It helps us reason about an event Y given an observation X: for example, what is the probability of John playing soccer tomorrow given that it rains today? Here, Y is the event that John plays soccer tomorrow and X is the observation that it rained today. In probability notation, you can denote it like so -

$$P(Y \mid X)$$

It reads as *what is the probability of Y given X*. Now, this has got an RHS as well:

$$P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}$$

where,

- P(X) is the probability of observing this new evidence.
- P(X|Y) is the probability of observing the new evidence X given the event Y that we care about.
- P(Y) is the initial hypothesis about the event Y that we care about (treat it like an initial belief about the event Y).
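As a quick worked example of Bayes’ rule, P(Y|X) = P(X|Y) P(Y) / P(X), here is a sketch for the soccer scenario. All three probabilities are made-up numbers chosen purely for illustration, and `bayes_posterior` is a hypothetical helper name.

```python
def bayes_posterior(prior_y, likelihood_x_given_y, evidence_x):
    """Bayes' rule: P(Y|X) = P(X|Y) * P(Y) / P(X)."""
    if evidence_x <= 0:
        raise ValueError("P(X) must be positive")
    return likelihood_x_given_y * prior_y / evidence_x


# Made-up numbers, for illustration only.
p_y = 0.6          # P(Y): John plays soccer on a given day (initial belief)
p_x_given_y = 0.2  # P(X|Y): it rained the day before, given that he plays
p_x = 0.3          # P(X): it rains on a given day

p_y_given_x = bayes_posterior(p_y, p_x_given_y, p_x)
print(round(p_y_given_x, 3))  # 0.4
```

With these numbers, observing rain lowers the belief that John plays tomorrow from 0.6 (the prior) to 0.4 (the posterior).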

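So, if we apply the same principle to hyperparameter tuning, the equation becomes -

$$P(\text{hyperparameter combination} \mid \text{metric}) = \frac{P(\text{metric} \mid \text{hyperparameter combination})\, P(\text{hyperparameter combination})}{P(\text{metric})}$$

Each of the above terms is then: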

- P(metric | hyperparameter combination) is the probability of the metric being minimized/maximized given the combination of hyperparameter values.
- P(hyperparameter combination | metric) is the probability of a certain hyperparameter combination given that the metric is minimized/maximized.
- P(metric) is the overall probability of observing that metric value, regardless of the hyperparameter combination.
- P(hyperparameter combination) is the probability of getting that particular hyperparameter combination.

As the different hyperparameter configurations are navigated, each one takes advantage of the results of the previous ones, so the machine learning model trains with a better hyperparameter combination with each passing run. A question that might still be bugging you is: which region of the hyperparameter space should we explore next in order to improve the most? Let’s find that out.

Take a look at the following figure which shows how training accuracy of a certain neural network varies with respect to the number of epochs -

Looking at the figure, we can clearly see that we get better results with more epochs. So the question now is: *how can we leverage this knowledge in the hyperparameter tuning process?*

Bayesian hyperparameter tuning allows us to do so by *building a probabilistic model for the objective function* we are trying to minimize/maximize in order to train our machine learning model. Such objective functions are nothing scary: *accuracy*, *root mean squared error*, and so on. This probabilistic model is referred to as a surrogate model for the objective function. It is represented as P(metric | hyperparameter combination), or more generically P(y|x).

Experiments show that this surrogate model is much easier to optimize than the actual objective function. It’s important to keep in mind that the next set of hyperparameters in the tuning process is chosen as the one that performs best on the surrogate function. That set is then evaluated using the actual objective function. An important catch here is -

- The fewer calls we make to the objective function, the more efficient the hyperparameter search is. Every call to the objective function means training and evaluating a model, and this becomes more computationally costly as the dimensionality of the data increases. Having the surrogate model helps us reduce the number of calls to the actual objective function by picking out promising sets of hyperparameters.

This is exactly why *Bayesian hyperparameter tuning is preferable when the hyperparameter tuning task includes a lot of different combinations*.

Let’s go back to the way we humans approach things in general: we first build an initial model (also called a prior) of the world we are about to step into, and based on our experiences of interactions (evidence) we update that model (the updated model is called the posterior). Now, let’s replicate this to tune hyperparameters! We start with an initial estimate of the objective function over the hyperparameters and update it gradually based on the past results. Consider the following to be a representation of that initial estimate -

Here, the black line is the initial estimate made by the surrogate model and the red line represents the true values of the objective function. As we proceed, the surrogate model manages to mimic the true values of the objective function much more closely:

The gray areas represent the uncertainty of the surrogate model, which is defined by a mean and a standard deviation.

In order to select the next set of hyperparameters in this process, we need a way to assign a probabilistic score to a candidate set of hyperparameters. The better the score, the more likely that set of hyperparameters is to be selected. This is generally done via *expected improvement*. Other methods include probability of improvement, lower confidence bound, and so on.

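It works by introducing a threshold value for the objective function; we are tasked with finding a set of hyperparameters that beats that threshold. For a metric we want to maximize, it can be written mathematically as follows (for a metric to be minimized, the sign inside the max flips) -

$$EI_{y^*}(x) = \int_{-\infty}^{\infty} \max(y - y^*,\, 0)\; p(y \mid x)\, dy$$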

where -

- y^* is a threshold value for the objective function
- y is the actual value of the objective function
- p(y|x) is the surrogate model

The above equation says the following:

- If, for a certain x (a combination of hyperparameters), p(y|x) puts all of its probability on values where y^* > y, then that hyperparameter combination is not expected to yield a better score than the threshold. If, on the other hand, p(y|x) gives some probability to values of y above y^*, then it’s worth pursuing that combination of hyperparameters.
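To see how such a score behaves, here is a minimal sketch of expected improvement under two assumptions that are mine rather than the article’s: the surrogate p(y|x) is a Gaussian with mean `mu` and standard deviation `sigma`, and the metric is to be maximized, so the improvement over the threshold is max(y - y^*, 0). Under those assumptions the integral has a well-known closed form, which the sketch checks against a simple numerical integration. The candidate values are made up.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, y_star):
    """Closed-form EI for maximization when p(y|x) is Normal(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(mu - y_star, 0.0)
    z = (mu - y_star) / sigma
    return (mu - y_star) * normal_cdf(z) + sigma * normal_pdf(z)


def expected_improvement_numeric(mu, sigma, y_star, steps=20000):
    """Midpoint-rule estimate of the integral of max(y - y_star, 0) * p(y|x) dy."""
    lo, hi = mu - 8.0 * sigma, mu + 8.0 * sigma
    dy = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        y = lo + (i + 0.5) * dy
        density = normal_pdf((y - mu) / sigma) / sigma
        total += max(y - y_star, 0.0) * density * dy
    return total


# Made-up surrogate predictions (mean, std) for two candidate configurations.
y_star = 0.90  # hypothetical best accuracy seen so far, used as the threshold
candidates = {"A": (0.91, 0.01), "B": (0.89, 0.05)}
scores = {name: expected_improvement(mu, sd, y_star) for name, (mu, sd) in candidates.items()}
best = max(scores, key=scores.get)
print(best)  # B
```

Candidate B has a lower predicted mean than A but a much larger uncertainty, and that is enough for it to score higher here. This is how expected improvement trades off exploiting confident predictions against exploring uncertain regions.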

The last piece of the puzzle still remains to be added, however: the *surrogate model*.

The problem of constructing a surrogate model is generally framed as a regression problem: we feed in the data (sets of hyperparameters along with their observed scores) and it returns an approximation of the objective function parameterized by a mean and a standard deviation. The common choices for surrogate models are:

- Gaussian Process Regression
- Random Forest Regression
- Tree-structured Parzen Estimator

Let’s talk about the first one briefly.

The Gaussian Process works by constructing a joint probability distribution over the input features and the true values of the objective function. In that way, with enough iterations, it is able to form an effective estimate of the objective function. To know more about the process, you are encouraged to check the “Surrogate Function” section of this article.
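For intuition, here is a bare-bones sketch of Gaussian Process regression over a single hyperparameter, written with the standard library only. It assumes a zero-mean prior and an RBF kernel with a fixed length scale, and it adds a small noise term for numerical stability; these choices, the helper names, and the toy observations are all assumptions for illustration and not what any particular library does internally.

```python
import math


def rbf_kernel(a, b, length_scale=1.0):
    """Similarity between two hyperparameter values."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def gp_posterior(x_train, y_train, x_new, length_scale=1.0, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    n = len(x_train)
    K = [[rbf_kernel(x_train[i], x_train[j], length_scale) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [rbf_kernel(x, x_new, length_scale) for x in x_train]
    alpha = solve(K, y_train)  # K^{-1} y
    v = solve(K, k_star)       # K^{-1} k_*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf_kernel(x_new, x_new, length_scale) - sum(k * w for k, w in zip(k_star, v))
    return mean, math.sqrt(max(var, 0.0))


# Made-up observations: x is one hyperparameter value, y is the (centered) score.
x_obs = [0.0, 1.0, 2.0]
y_obs = [0.2, 0.8, 0.5]
mean_near, std_near = gp_posterior(x_obs, y_obs, 1.0)  # at an observed point
mean_far, std_far = gp_posterior(x_obs, y_obs, 5.0)    # far from every observation
print(std_near < std_far)  # True
```

The posterior mean reproduces the observed score at a point that has already been evaluated, while the standard deviation grows as we move away from the data. That mean and standard deviation are exactly what an acquisition function such as expected improvement consumes.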

If all of the above seemed a bit heavy, then just stick to the idea that *the main objective of Bayesian reasoning is to become “less wrong” with more data*. Let’s now see Bayesian hyperparameter tuning in action.

Even though Bayesian hyperparameter tuning makes the most sense compared to the other approaches of hyperparameter tuning, it has got some downsides:

- The Bayesian search process is sequential in nature, so it’s extremely hard to parallelize, which might be necessary in order to scale.
- Defining a well-suited surrogate model can be challenging, and it has got its own hyperparameters.

I have made the code snippets shown in this section available as a Colab notebook here (no setup is required to run it).

Before diving into the code that deals with Bayesian hyperparameter tuning, let’s put together the components we need.

We will be using the Keras library for our experiments (more specifically tf.keras with a TensorFlow 2.0 environment). We will be using the FashionMNIST dataset and a shallow convolutional neural network as our machine learning model. Our humble model is defined using -

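```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Dense, GlobalAveragePooling2D, MaxPooling2D

# `config` is the wandb.config object created inside the train function shown later
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D((2,2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2,2)),
    Conv2D(64, (3, 3), activation='relu'),
    GlobalAveragePooling2D(),
    Dense(config.layers, activation=tf.nn.relu),
    Dense(10, activation='softmax')
])
```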

The images in the dataset are 28x28 pixels, but we need to reshape them to 28x28x1 to make them work with the Conv2D layer of Keras. Proceeding further, here is what our hyperparameter search configuration looks like -

```python
sweep_config = {
    'method': method,
    'metric': {
        'name': 'accuracy',
        'goal': 'maximize'
    },
    'parameters': {
        'layers': {
            'values': [32, 64, 96, 128, 256]
        },
        'batch_size': {
            'values': [32, 64, 96, 128]
        },
        'epochs': {
            'values': [5, 10, 15]
        }
    }
}
```

In our case, we will run the same experiments with three different hyperparameter tuning methods (grid, random, and Bayesian), and we specify which one to use via the method key (the three options available to us are grid, random, and bayes). The metric key is our objective function, and we would like to *maximize* it. What follows that is the grid of hyperparameters we are interested in tuning, and they are -

- number of units in the *dense* layer of our model (the layers key)
- batch size
- number of epochs
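As a small sanity check on the size of this search space, the sketch below rebuilds the configuration above for each of the three methods and counts how many combinations an exhaustive grid search would have to cover. `make_sweep_config` and `grid_size` are hypothetical helper names introduced only for this example; no W&B call is made.

```python
from itertools import product


def make_sweep_config(method):
    """Build the sweep configuration from the article for one search method."""
    if method not in ("grid", "random", "bayes"):
        raise ValueError(f"unknown method: {method}")
    return {
        "method": method,
        "metric": {"name": "accuracy", "goal": "maximize"},
        "parameters": {
            "layers": {"values": [32, 64, 96, 128, 256]},
            "batch_size": {"values": [32, 64, 96, 128]},
            "epochs": {"values": [5, 10, 15]},
        },
    }


def grid_size(sweep_config):
    """Number of distinct hyperparameter combinations in the configuration."""
    value_lists = [spec["values"] for spec in sweep_config["parameters"].values()]
    return len(list(product(*value_lists)))


configs = {m: make_sweep_config(m) for m in ("grid", "random", "bayes")}
print(grid_size(configs["grid"]))  # 5 * 4 * 3 = 60
```

A full grid sweep therefore means 60 training runs, which gives a useful yardstick when comparing how many runs random and Bayesian search need.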

Weights and Biases (W&B) allows us to efficiently tune hyperparameters via Sweeps. Running a hyperparameter sweep is extremely simple, and you can see that in action in the article I just linked. Once we have prepared the dataset, defined the model, and set up the hyperparameter search configuration, we can proceed to running the sweep, which is just a matter of a few keystrokes -

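```python
import wandb

# Register the sweep with W&B, then start an agent that calls train() for each run
sweep_id = wandb.sweep(sweep_config, project='project-name')
wandb.agent(sweep_id, function=train)
```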

And here is the train function that actually trains our model with the provided set of hyperparameters -

```python
def train():
    # train_images, train_labels, test_images, test_labels, labels and METHOD
    # are assumed to be defined earlier in the notebook

    # Prepare data tuples
    (X_train, y_train) = train_images, train_labels
    (X_test, y_test) = test_images, test_labels

    # Default values for hyper-parameters we're going to sweep over
    configs = {
        'layers': 128,
        'batch_size': 64,
        'epochs': 5,
        'method': METHOD
    }

    # Initialize a new wandb run
    wandb.init(project='hyperparameter-sweeps-comparison', config=configs)

    # Config is a variable that holds and saves hyperparameters and inputs
    config = wandb.config

    # Define the model
    model = Sequential([
        Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
        MaxPooling2D((2,2)),
        Conv2D(64, (3, 3), activation='relu'),
        MaxPooling2D((2,2)),
        Conv2D(64, (3, 3), activation='relu'),
        GlobalAveragePooling2D(),
        Dense(config.layers, activation=tf.nn.relu),
        Dense(10, activation='softmax')
    ])

    # Compile the model
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    # Train the model
    model.fit(X_train, y_train,
              epochs=config.epochs,
              batch_size=config.batch_size,
              validation_data=(X_test, y_test),
              callbacks=[WandbCallback(data_type="image",
                                       validation_data=(X_test, y_test), labels=labels)])
```

With each run, the hyperparameter combination is updated and made available via the config object of wandb. Diving deep into the details of hyperparameter sweeps using W&B is out of the scope of this article, but if you are interested, I recommend the following articles -

- Sweeps Overview
- Introduction to Hyperparameter Sweeps – A Model Battle Royale To Find The Best Model In 3 Steps
- Running Hyperparameter Sweeps to Pick the Best Model

As mentioned earlier, we will be running the sweeps using three different methods - grid search, random search and Bayesian search. Let’s do a battle between the three!

The beauty of using W&B is that it creates a separate Sweeps page each time you kickstart a new sweep. The runs under the grid search experiment are available here to visualize. The major plot to notice there is the following -

Before I proceed any further let me show you plots I got from random search and Bayesian search -

From the above three plots, the most important thing to notice is the number of runs each method took to reach the maximum val_accuracy. Each different combination of hyperparameters is referred to as a *different run* here. Clearly enough, the Bayesian search takes the fewest runs to get to our desired result.

An important note here is that for both random search and Bayesian search, you need to manually terminate the search process once you get to the desired result. You can easily do that by going to the respective Sweeps page while the sweep is running and navigating to *Sweep Controls* -

Upon clicking that you will get a page like so and from there you can have complete control over your sweeps -
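If you would rather cap the search in code than stop it by hand, `wandb.agent` also accepts a `count` argument that limits how many runs the agent executes. The wrapper below is a hedged sketch: `run_capped_sweep` and its default of 10 runs are hypothetical choices of mine, and it is worth checking the behavior against the documentation for your wandb version.

```python
def run_capped_sweep(sweep_config, train_fn, project, max_runs=10):
    """Create a sweep and let one agent execute at most max_runs runs.

    Hypothetical convenience wrapper; it needs wandb installed and a logged-in
    account when called.
    """
    import wandb  # imported lazily so this sketch can be read without wandb installed

    sweep_id = wandb.sweep(sweep_config, project=project)
    # count limits the number of runs this agent will try before exiting
    wandb.agent(sweep_id, function=train_fn, count=max_runs)
    return sweep_id
```

Calling it would look like `run_capped_sweep(sweep_config, train, 'project-name', max_runs=10)`, using the configuration and train function defined earlier.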

What is more convincing is that with W&B you get a nice overview of Sweeps, and it looks like so -

Now on the respective workspace, you will have a collective overview of the *first ten* runs that came out as results of running the different methods -

What is even more convincing for going with Bayesian search is that, if we group all the important metrics like loss, val_loss, accuracy, and val_accuracy, we get the following, which clearly tells us about the supremacy of the Bayesian search process -

On your Sweeps Overview page (a sample), you will find a Create sweep button, and when clicked it shows something like the following -

It lets you take an existing W&B Sweeps project with some runs and generate a new sweep based on your configuration variables quickly.

Now what I would do is -

- Download the default configuration file (sweep.yaml) (you can, of course, configure your own).
- Package the train function in a Python script and name it as train.py. You can check out the script here.
- Follow the instructions from the above snapshot.

And voila! Your sweep should be up and running -

That’s it for this article. If the topic of hyperparameter tuning interests you then you should check out James Bergstra’s works -

- Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures
- Algorithms for Hyper-Parameter Optimization

I hope you got a good introduction to the Bayesian search process for tuning hyperparameters for Machine Learning models. I cannot wait to see how beneficial Bayesian search would be for your projects.