30 Jan 2020 / Sayak Paul, Deep Learning Associate at PyImageSearch

Bayesian Hyperparameter Optimization - A Primer

Weights & Biases recently launched Sweeps, a sophisticated way to optimize your model hyperparameters. Since then, the team has been getting a lot of questions about Bayesian hyperparameter search: Is it faster than random search? When should I use it? How does it work exactly?

In this post, I’ll try to answer some of your most pressing questions about Bayesian hyperparameter search. I’ll also show you how to run a Bayesian search in 3 steps using Weights & Biases.

You can find the accompanying code here, and see a comparison of Bayesian, grid, and random sweeps here.


Hyperparameters in a machine learning model are the knobs used to optimize its performance, e.g. the learning rate in neural networks or the depth in random forests. It’s tricky to find the right hyperparameter combination for a machine learning model on a given task. What’s even more concerning is that machine learning models are very sensitive to their hyperparameter configurations: the performance of a model under one hyperparameter configuration can change drastically when that configuration is changed.

To address these problems, we resort to hyperparameter tuning, and the general process looks like this:

Although there are many niche techniques that help us tune hyperparameters effectively, the following two are the most predominant ones:

In grid search, we start by defining a grid containing the hyperparameters along with the list of acceptable values we would want the search process to try for each of them. The following is a sample hyperparameter grid for a Logistic Regression model:
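(The original grid isn’t reproduced here, so the hyperparameter names and values below are only an illustration of what such a grid could look like; treat them as placeholders.)

from itertools import product

# A hypothetical grid for a scikit-learn LogisticRegression model; the
# hyperparameter names and values are illustrative placeholders.
param_grid = {
    'penalty': ['l1', 'l2'],                    # 2 values
    'C': [0.01, 0.1, 1.0],                      # 3 values
    'solver': ['liblinear', 'saga', 'lbfgs'],   # 3 values
    'max_iter': [100, 200, 500, 1000],          # 4 values
}

# Grid search enumerates every possible combination of these values
all_combinations = list(product(*param_grid.values()))
print(len(all_combinations))  # 2 x 3 x 3 x 4 = 72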

In this case, the search process would end up training a total of 2x3x3x4 = 72 different Logistic Regression models.  

In random search, rather than defining a grid of hyperparameter values, we specify a distribution from which acceptable values for each hyperparameter can be sampled. The main difference from grid search is that instead of trying all possible combinations, each combination of hyperparameters is sampled randomly, with the values drawn from the distributions we specify at the beginning of the random search process.
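To make the sampling idea concrete, here is a minimal sketch of how random search draws configurations; the hyperparameter names and distributions are made up for the example:

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical distributions to sample from; the names and ranges are
# illustrative, not tied to any particular model.
def sample_configuration():
    return {
        'learning_rate': 10 ** rng.uniform(-4, -1),    # log-uniform between 1e-4 and 1e-1
        'batch_size': int(rng.choice([32, 64, 128])),  # discrete choice
        'dropout': rng.uniform(0.1, 0.5),              # uniform
    }

# Random search draws a fixed budget of independent configurations
configurations = [sample_configuration() for _ in range(20)]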

In the general hyperparameter tuning framework you just saw, were you able to spot anything that could be improved? Well, the different hyperparameter configurations we specify are all independent of each other. Could those configurations be made a bit more informed? What if we could guide the hyperparameter search process using past results (in this case, results from running different hyperparameter configurations)? Would we be able to make the tuning process better? Let’s find out!

Bayesian hyperparameter tuning: Nuts & bolts

When it comes to using Bayesian principles in hyperparameter tuning, the following steps are generally followed:

To understand this in a bit more detail, let’s introduce Bayes’ rule, which is more than 250 years old. It helps us predict Y given X: for example, what is the probability of John playing soccer tomorrow given that it rained today? Here, Y denotes John playing soccer tomorrow and X denotes that it rained today. In probabilistic notation, you can write it as P(Y | X).

It reads as: what is the probability of Y given X? Now, this also has a right-hand side:

P(Y | X) = P(X | Y) × P(Y) / P(X)

where P(X | Y) is the probability of X given Y (the likelihood), P(Y) is the prior probability of Y, and P(X) is the probability of X (the evidence).
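As a quick, entirely made-up numeric example of the rule:

# Invented numbers, just to show how the pieces of Bayes' rule combine
p_rain_given_soccer = 0.2   # P(X|Y): probability of rain given that John plays
p_soccer = 0.6              # P(Y): prior probability that John plays
p_rain = 0.3                # P(X): probability that it rains on any given day

p_soccer_given_rain = p_rain_given_soccer * p_soccer / p_rain
print(p_soccer_given_rain)  # roughly 0.4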

So, if we apply the same principle to hyperparameter tuning, the equation becomes:

P(metric | hyperparameter combination) = P(hyperparameter combination | metric) × P(metric) / P(hyperparameter combination)

Here, the metric (for example, validation accuracy) plays the role of Y, the quantity we want to reason about, and the hyperparameter combination plays the role of X, the evidence we gather from the runs we have already executed.

Now, as the different hyperparameter configurations are explored, each one takes advantage of the results of the previous ones, eventually helping the machine learning model train with a better hyperparameter combination with each passing run. A question that might still be bugging you is: which part of the hyperparameter space should we explore next in order to improve the most? Let’s find that out.

The idea of informed search in Bayesian hyperparameter tuning

Take a look at the following figure, which shows how the training accuracy of a certain neural network varies with the number of epochs -


Looking at the figure, we can clearly see that we get better results as the number of epochs increases. So, the question now is: how can we leverage this knowledge in the hyperparameter tuning process?

Bayesian hyperparameter tuning allows us to do so by building a probabilistic model of the objective function we are trying to minimize or maximize when training our machine learning model. Such objective functions are not scary: accuracy, root mean squared error, and so on. This probabilistic model is referred to as a surrogate model for the objective function. It is represented as P(metric | hyperparameter combination), or more generically P(y|x).

Experiments show that this surrogate model is much easier to optimize than the actual objective function. It’s important to keep in mind that the next set of hyperparameters in the tuning process is chosen based on how well it performs on the surrogate model; that set is then evaluated using the actual objective function. An important catch here is that the actual objective function is expensive to evaluate (it requires a full training run), whereas the surrogate model is cheap to evaluate.

This is exactly why Bayesian hyperparameter tuning is preferable when the hyperparameter tuning task involves a lot of different combinations.

Let’s go back to the way we humans approach things in general: we first build an initial model (also called a prior) of the world we are about to step into and, based on our interactions with it (evidence), we update that model (the updated model is called the posterior). Now, let’s replicate this to tune hyperparameters! We start with an initial estimate of the hyperparameters and update it gradually based on past results. Consider the following to be a representation of that initial estimate -

Here, the black line is the initial estimate made by the surrogate model and the red line represents the true values of the objective function. As we proceed, the surrogate model manages to mimic the true values of the objective function much more closely:


The gray areas represent the uncertainty of the surrogate model, which is defined by a mean and a standard deviation.

In order to select the next set of hyperparameters, we need a way to assign a probabilistic score to a candidate set of hyperparameters: the better the score, the more likely that set is to be selected. This is generally done via expected improvement. Other methods include probability of improvement, lower confidence bound, and so on.

Expected improvement

It works by introducing a threshold value for the objective function; we are then tasked with finding a set of hyperparameters that beats that threshold. Mathematically, for a minimization problem, it can be written as:

EI_{y*}(x) = ∫_{-∞}^{y*} (y* − y) P(y | x) dy

where y* is the threshold value of the objective function, x is the proposed set of hyperparameters, y is the actual value of the objective function for x, and P(y | x) is the surrogate model.

The above equation enforces the following: if the surrogate puts no probability mass on values of y better than the threshold y*, the expected improvement for x is zero; the more probability the surrogate assigns to values that beat the threshold (and the larger the margin), the higher the expected improvement, and the more attractive x becomes as the next set of hyperparameters to try.
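When the surrogate’s prediction at a candidate x is Gaussian with mean mu and standard deviation sigma, the integral above has a closed form. The sketch below computes it for a minimization problem; it is a generic illustration with invented numbers, not the exact formulation W&B uses:

import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    # Closed-form expected improvement for minimization, assuming the
    # surrogate predicts a Gaussian N(mu, sigma^2) at the candidate point;
    # y_best is the threshold (best objective value observed so far).
    sigma = np.maximum(sigma, 1e-9)          # guard against division by zero
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# Two hypothetical candidates: one likely to improve, one not
print(expected_improvement(mu=0.30, sigma=0.05, y_best=0.32))  # noticeably positive
print(expected_improvement(mu=0.40, sigma=0.01, y_best=0.32))  # effectively zero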


However, the last piece of the puzzle still remains to be added: the surrogate model.

Surrogate model

The problem of constructing a surrogate model is generally framed as a regression problem: we feed in the data (a set of hyperparameters) and it returns an approximation of the objective function, parameterized by a mean and a standard deviation. Common choices for surrogate models are Gaussian Processes, Random Forest regression, and Tree-structured Parzen Estimators (TPE).

Let’s talk about the first one briefly.

A Gaussian Process works by constructing a joint probability distribution over the input features and the true values of the objective function. In that way, with enough iterations, it is able to capture an effective estimate of the objective function. To learn more about the process, you are encouraged to check the “Surrogate Function” section of this article.
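As a hedged sketch of what a Gaussian Process surrogate looks like in code (the observed learning rates and losses below are invented, and scikit-learn’s GaussianProcessRegressor stands in for whatever implementation a real tuner uses):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Hypothetical observations: learning rates tried so far and the
# validation losses they produced (illustrative numbers only)
X_observed = np.array([[1e-4], [1e-3], [1e-2], [1e-1]])
y_observed = np.array([0.45, 0.32, 0.30, 0.55])

# Fit the Gaussian Process surrogate to the (hyperparameter, loss) pairs
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X_observed, y_observed)

# Querying the surrogate at new candidates returns a mean and a standard
# deviation, i.e., the P(y|x) discussed above
X_candidates = np.array([[5e-4], [5e-3], [5e-2]])
mu, sigma = gp.predict(X_candidates, return_std=True)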

If all of the above seemed a bit heavy, just stick to the idea that the main objective of Bayesian reasoning is to become “less wrong” with more data. Let’s now see Bayesian hyperparameter tuning in action.
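Before we do, here is a small self-contained sketch that ties the surrogate model and expected improvement together on a toy one-dimensional problem. It is purely illustrative (the objective, search range, and iteration count are made up) and is not how W&B implements Sweeps internally:

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Toy objective: pretend this is a validation loss as a function of a single
# hyperparameter. In practice this would be a full (expensive) training run.
def objective(x):
    return np.sin(3 * x) + 0.1 * x ** 2

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(3, 1))   # a few initial random evaluations
y = objective(X).ravel()

candidates = np.linspace(-2, 2, 200).reshape(-1, 1)

for _ in range(10):
    # 1. Fit the surrogate to everything observed so far
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X, y)

    # 2. Score the candidates with expected improvement (minimization)
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    y_best = y.min()
    z = (y_best - mu) / sigma
    ei = (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

    # 3. Evaluate the true (expensive) objective at the most promising point
    x_next = candidates[np.argmax(ei)].reshape(1, -1)
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next).ravel())

print("Best hyperparameter found:", X[np.argmin(y)], "loss:", y.min())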

Pitfalls of Bayesian hyperparameter tuning

Even though Bayesian hyperparameter tuning makes the most sense compared to the other approaches, it does have some downsides:

Bayesian hyperparameter tuning: In action

I have made the code snippets shown in this section available as a Colab notebook here (no setup is required to run it).  

Before diving into the code that deals with Bayesian hyperparameter tuning, let’s put together the components we will need.

We will be using the Keras library for our experiments (more specifically, tf.keras in a TensorFlow 2.0 environment), the FashionMNIST dataset, and a shallow convolutional neural network as our machine learning model. Our humble model is defined as follows -

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, MaxPooling2D,
                                     GlobalAveragePooling2D, Dense)

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    GlobalAveragePooling2D(),
    Dense(config.layers, activation=tf.nn.relu),  # config comes from wandb (see below)
    Dense(10, activation='softmax')
])


The images in the dataset come as 28x28 pixels, but we need to reshape them to 28x28x1 to make them work with Keras’ Conv2D layer.
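As a quick sketch of that preprocessing step (assuming the standard tf.keras FashionMNIST loader), the loading and reshaping might look like this:

from tensorflow.keras.datasets import fashion_mnist

# Load FashionMNIST and add a trailing channel dimension for Conv2D
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
train_images = train_images.reshape(-1, 28, 28, 1) / 255.0
test_images = test_images.reshape(-1, 28, 28, 1) / 255.0

Proceeding further, here is what our hyperparameter search configuration looks like -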

sweep_config = {
    'method': method,
    'metric': {
        'name': 'accuracy',
        'goal': 'maximize'
    },
    'parameters': {
        'layers': {
            'values': [32, 64, 96, 128, 256]
        },
        'batch_size': {
            'values': [32, 64, 96, 128]
        },
        'epochs': {
            'values': [5, 10, 15]
        }
    }
}

In our case, we will be running the same experiment with different hyperparameter tuning methods - grid, random, and Bayesian - and we specify that via method (three options are available to us: grid, random, and bayes). metric is our objective function, and we would like to maximize it. What follows is the grid of hyperparameters we are interested in tuning: the number of units in the penultimate Dense layer (layers), the batch size (batch_size), and the number of epochs (epochs).

Weights & Biases (W&B) allows us to efficiently tune hyperparameters via Sweeps. Running a hyperparameter sweep is extremely simple, and you can see that in action in the article I just linked. Once we have prepared the dataset, defined the model, and configured the hyperparameter search, running the sweep is just a matter of a few keystrokes -

sweep_id = wandb.sweep(sweep_config, project='project-name')
wandb.agent(sweep_id, function=train)

And the train function, which actually trains our model with the provided set of hyperparameters, looks like this -

# (tf, Sequential, and the layer classes were imported in the snippet above)
import wandb
from wandb.keras import WandbCallback  # logs metrics and sample predictions to W&B

def train():
    # Prepare data tuples
    (X_train, y_train) = train_images, train_labels
    (X_test, y_test) = test_images, test_labels

    # Default values for hyperparameters we're going to sweep over
    configs = {
        'layers': 128,
        'batch_size': 64,
        'epochs': 5,
        'method': METHOD
    }

    # Initialize a new wandb run
    wandb.init(project='hyperparameter-sweeps-comparison', config=configs)

    # Config is a variable that holds and saves hyperparameters and inputs
    config = wandb.config

    # Define the model
    model = Sequential([
        Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
        MaxPooling2D((2, 2)),
        Conv2D(64, (3, 3), activation='relu'),
        MaxPooling2D((2, 2)),
        Conv2D(64, (3, 3), activation='relu'),
        GlobalAveragePooling2D(),
        Dense(config.layers, activation=tf.nn.relu),
        Dense(10, activation='softmax')
    ])

    # Compile the model
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    # Train the model
    model.fit(X_train, y_train,
              epochs=config.epochs,
              batch_size=config.batch_size,
              validation_data=(X_test, y_test),
              callbacks=[WandbCallback(data_type="image",
                                       validation_data=(X_test, y_test),
                                       labels=labels)])

With each run, the hyperparameter combination is updated and made available via wandb’s config object. Diving deep into the details of hyperparameter sweeps using W&B is out of the scope of this article, but if you are interested, I recommend the following articles -

As mentioned earlier, we will be running the sweeps using three different methods - grid search, random search and Bayesian search. Let’s do a battle between the three!

The beauty of using W&B is that it creates a separate Sweeps page each time you kick off a new sweep. The runs under the grid search experiment are available here to visualize. The major plot to notice there is the following -


Before I proceed any further, let me show you the plots I got from random search and Bayesian search -



From the above three plots, the most important thing to notice is the number of runs each method took to reach the maximum val_accuracy. (A different combination of hyperparameters is referred to as a different run here.) Clearly, Bayesian search takes the fewest runs to get to our desired result.

An important note: for both random search and Bayesian search, you need to manually terminate the search process once you reach the desired result. You can easily do that by going to the respective Sweeps page while the sweep is running and navigating to Sweep Controls -

Upon clicking that, you will get a page like the following, from which you have complete control over your sweeps -


What is more, with W&B you get a nice overview of your Sweeps, which looks like this -

On the respective workspace, you will also find a collective overview of the first ten runs produced by the different methods -



What makes an even more convincing case for Bayesian search is what we get if we group all the important metrics, like loss, val_loss, accuracy, and val_accuracy: the resulting plots clearly show the superiority of the Bayesian search process -



Bonus: Create sweeps at a blazing speed

On your Sweeps Overview page (a sample), you will find a Create sweep button; when clicked, it shows something like the following -

It lets you quickly take an existing W&B Sweeps project with some runs and generate a new sweep based on your configuration variables.

Now what I would do is -

And voila! Your sweep should be up and running -


So, how Bayesian are you?

That’s it for this article. If the topic of hyperparameter tuning interests you, you should check out James Bergstra’s work -

I hope you got a good introduction to the Bayesian search process for tuning the hyperparameters of machine learning models. I cannot wait to see how beneficial Bayesian search will be for your projects.
