Boris Dayma, ML Engineer

Iteratively Fine-Tuning Neural Networks with Weights & Biases

When I’m trying to solve a problem with machine learning, I always follow three steps:

  1. Inspect the data
  2. Find typical architectures for this type of problem
  3. Train and fine-tune my neural network

In this article I’ll dive into the third step, training and fine-tuning.

Before Weights & Biases

When I first tried to optimize neural networks, I was using my local desktop machine to test different combinations of hyperparameters, and I took a lot of notes.

Does this picture remind you of some of your model fine-tuning?

As I started using multiple remote machines, tracking became more difficult. In some cases, I had local changes that weren’t reflected on remote machines, and it was hard to notice the errors.

Out of desperation (and an inability to read my terrible handwriting), I turned to Excel… but manual spreadsheets still didn’t solve my real problem.

I was using handwritten notes and Excel because there were no tools or frameworks to automatically track and compare my training runs. TensorBoard solves only one part of the problem: visualizing experiments on graphs.

Unfortunately, I found that TensorBoard made it hard to compare multiple experiments, especially when I was using multiple servers. I was also trying PyTorch and Fast.ai, which made it hard to continue using TensorBoard.

I took the Full Stack Deep Learning class with Josh Tobin from OpenAI, and he guided us through best practices for training models. One of the tools he mentioned was Weights & Biases for experiment tracking, so I picked it up to see if it would help with my organization problem.

The wandb tool helped in a few different ways:

Here’s an example of an experiment where I was doing semantic segmentation.

In the web interface, it was easy to:

How I use Weights & Biases

When I have to optimize a model, I run into one of the two following cases:

When trying to observe the best runs, I first look at the parallel coordinates graph.

If I cannot easily see which runs are good or bad, I also plot graphs grouped by hyperparameter. In those cases, you need to check that a value gives better results on average, but you may also want to compare the best result of each group (the top of the band).
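As a rough sketch of that comparison outside the web interface, assume the runs have been exported as plain Python dictionaries. The `runs` records, their numbers, and the `summarize_by` helper are all hypothetical, made up for illustration, and not part of the wandb API.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical run records: the hyperparameters and scores are invented
# for illustration and do not come from real experiments.
runs = [
    {"lr": 0.001, "batch_size": 16, "val_acc": 0.81},
    {"lr": 0.001, "batch_size": 32, "val_acc": 0.79},
    {"lr": 0.01, "batch_size": 16, "val_acc": 0.86},
    {"lr": 0.01, "batch_size": 32, "val_acc": 0.90},
    {"lr": 0.1, "batch_size": 16, "val_acc": 0.62},
    {"lr": 0.1, "batch_size": 32, "val_acc": 0.70},
]


def summarize_by(runs, param, metric="val_acc"):
    """Group runs by one hyperparameter; return {value: (mean, best)}."""
    groups = defaultdict(list)
    for run in runs:
        groups[run[param]].append(run[metric])
    return {value: (mean(scores), max(scores)) for value, scores in groups.items()}


summary = summarize_by(runs, "lr")
for value, (avg, best) in sorted(summary.items()):
    # The mean is the middle of the band, the best score is its top.
    print(f"lr={value}: mean={avg:.3f}, best={best:.3f}")
```

Looking at both numbers matters because a value can have a mediocre average and still contain the single best run.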

Once I find a parameter value that is much better (or much worse) than the others, I filter all my runs on it, keeping the good value or excluding the bad one, and look for the next parameter to select.
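The filter-then-repeat step can be sketched in a few lines of plain Python. This assumes the runs are available as dictionaries; the records, their values, and the `filter_runs` and `mean_by` helpers are hypothetical names introduced here, not wandb functions.

```python
from statistics import mean

# Hypothetical run records, invented for illustration only.
runs = [
    {"lr": 0.01, "batch_size": 16, "val_acc": 0.86},
    {"lr": 0.01, "batch_size": 16, "val_acc": 0.84},
    {"lr": 0.01, "batch_size": 32, "val_acc": 0.90},
    {"lr": 0.01, "batch_size": 32, "val_acc": 0.88},
    {"lr": 0.001, "batch_size": 16, "val_acc": 0.81},
    {"lr": 0.001, "batch_size": 32, "val_acc": 0.79},
]


def filter_runs(runs, **fixed):
    """Keep only the runs whose hyperparameters match every fixed value."""
    return [r for r in runs if all(r[k] == v for k, v in fixed.items())]


def mean_by(runs, param, metric="val_acc"):
    """Average the metric for each distinct value of one hyperparameter."""
    values = sorted({r[param] for r in runs})
    return {v: mean(r[metric] for r in runs if r[param] == v) for v in values}


# Step 1: the learning rate has been selected, so fix it.
kept = filter_runs(runs, lr=0.01)

# Step 2: within the remaining runs, examine the next hyperparameter.
next_summary = mean_by(kept, "batch_size")
print(next_summary)
```

Each round shrinks the set of runs, so it is worth checking that enough runs remain in every group before drawing the next conclusion.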

Drawing a conclusion from a group of runs is much more reliable than drawing it from a single run, because it reduces the effect of the random noise that exists in every experiment.
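To illustrate why groups are more trustworthy, here is a small sketch with invented numbers: two hypothetical settings, each repeated four times. The `dropout` labels, the scores, and the `mean_and_stderr` helper are assumptions made for this example. For independent runs, the standard error of the mean shrinks with the square root of the number of runs.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical validation accuracies from repeated runs of two settings.
# The numbers are invented to show how a single run can mislead.
scores = {
    "dropout=0.1": [0.84, 0.88, 0.86, 0.87],
    "dropout=0.5": [0.85, 0.80, 0.83, 0.81],
}


def mean_and_stderr(values):
    """Return the sample mean and the standard error of that mean."""
    return mean(values), stdev(values) / sqrt(len(values))


stats = {name: mean_and_stderr(values) for name, values in scores.items()}
for name, (avg, err) in stats.items():
    print(f"{name}: mean={avg:.4f} +/- {err:.4f} over {len(scores[name])} runs")
```

Comparing only the first run of each setting would favor `dropout=0.5`, while the group means point the other way.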

This process is iterative: as I refine my values, I may decide to run more experiments within my reduced range of hyperparameters until I am completely satisfied. I also save my reasoning in a report so that I can remember why I selected a particular filter.

I often like to start with shorter runs or a reduced input size to get a few insights quickly. This is not completely reliable, but it is useful when you have a limited amount of time to solve a problem.
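One way to keep quick runs and full runs consistent is to derive the quick settings from the full ones instead of editing values by hand. The sketch below is only an assumption about how this could be organized: `TrainConfig`, its fields and defaults, and `quick_version` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainConfig:
    # Hypothetical settings; the defaults are placeholders.
    epochs: int = 50
    image_size: int = 512
    lr: float = 0.01


def quick_version(cfg, epoch_divisor=5, size_divisor=2):
    """Return a copy with fewer epochs and smaller inputs for fast exploration."""
    return replace(
        cfg,
        epochs=max(1, cfg.epochs // epoch_divisor),
        image_size=max(1, cfg.image_size // size_divisor),
    )


full = TrainConfig()
quick = quick_version(full)
print(full)
print(quick)
```

Because every other field is copied unchanged, a promising value found with the quick configuration can be confirmed with the full one without retyping anything.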

I haven’t ventured into AutoML techniques yet. I plan to start with Bayesian hyperparameter optimization and see how I can integrate it into my current workflow.

Please share your own workflow or any suggestions and comments you may have to make it better!

