One of the exciting things about running Weights and Biases is that we can research how models actually use their computational resources in real-world scenarios. Since we are simplifying the task of monitoring GPU and memory usage, we wanted to help our users by taking a look at how well folks utilize those resources.
For users training on GPUs, I looked at their average utilization across all runs. Nearly a third of our users are averaging less than 15% utilization. Average GPU memory usage is quite similar. Our users tend to be experienced deep learning practitioners, and GPUs are an expensive resource, so I was surprised to see such low average usage.
Here are a few easy, concrete suggestions for improving GPU usage that apply to almost everyone:
1) Measure your GPU usage consistently over your entire training run
You can’t improve GPU usage without measuring it. It’s not hard to take a snapshot of your usage with useful tools like nvidia-smi, but a simple way to find issues is to track usage over time. Anyone can turn on system monitoring in the background, which will track GPU, CPU, and memory usage, among other metrics, over time by adding two lines to their code:
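As a minimal sketch, the two lines are the import and the wandb.init() call. The project name, the function wrapper, and the placeholder training loop below are hypothetical and only there to make the example complete:

```python
def train_with_monitoring(num_steps=100):
    # Line 1: import the client library. It is imported inside the function
    # only so this sketch can be loaded on a machine without wandb installed.
    import wandb

    # Line 2: start a run. This is the call that turns on background
    # collection of system metrics (GPU, CPU, memory, and so on).
    run = wandb.init(project="gpu-utilization-demo")  # hypothetical project name

    for step in range(num_steps):
        loss = 1.0 / (step + 1)  # placeholder standing in for a real training step
        wandb.log({"loss": loss})

    # Mark the run as finished so the last metrics are flushed.
    run.finish()
```

In a real script the import would normally sit at the top of the file and wandb.init() would be called once before training starts.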
The wandb.init() function will create a lightweight child process that will collect system metrics and send them to a wandb server where you can look at them and compare across runs with graphs like these:
The danger of taking a single measurement is that GPU usage can change over time. Here is a common pattern we see: our user Boris is training an RNN, and mid-training his usage plummets from 80 percent to around 25 percent.
You can see his complete set of stats and training log at https://app.wandb.ai/borisd13/char-RNN/runs/cw9gnx9z/system.
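If you only want to watch for this kind of drift on a local machine, one option is to poll nvidia-smi on a timer instead of looking at it once. The sketch below assumes nvidia-smi is on the PATH and that every queried field comes back numeric (a GPU that reports "[N/A]" would need extra handling). The function names and the default interval are illustrative choices.

```python
import subprocess
import time

QUERY = "utilization.gpu,memory.used,memory.total"


def parse_nvidia_smi_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        util, mem_used, mem_total = (float(x) for x in line.split(","))
        rows.append({"util_pct": util, "mem_pct": 100.0 * mem_used / mem_total})
    return rows


def sample_gpus():
    """Take one snapshot of every visible GPU (requires nvidia-smi)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_nvidia_smi_csv(out)


def monitor(interval_s=5.0, num_samples=12):
    """Collect (timestamp, per-GPU stats) pairs at a fixed interval."""
    history = []
    for _ in range(num_samples):
        history.append((time.time(), sample_gpus()))
        time.sleep(interval_s)
    return history
```

Running monitor() in a separate terminal while training gives a rough utilization trace; a drop like the one above shows up as a sustained fall in util_pct rather than a single low reading.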
A related case we commonly see with multiple GPUs is that, mid-training, some of the GPUs stop handling any load. In this example both GPUs started off doing computations, but a few minutes in, all the load is sent to a single GPU. This could be intentional, but it is often the sign of a hard-to-catch bug in the code.
Another common issue we see is long periods in which the GPUs go unused, often corresponding to a testing or validation phase in training or to a bottleneck in data preprocessing. Here is a typical graph from training on 8 GPUs, where all of them turn off and wait for some time at regular intervals.
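Given a utilization trace for one GPU, idle stretches like these can be picked out automatically. The following is a rough sketch: the 5 percent threshold and the minimum run length of three samples are arbitrary assumptions, and the function name is made up for illustration.

```python
def find_idle_periods(utilization, threshold_pct=5.0, min_samples=3):
    """Return (start, end) index pairs, end exclusive, for each run of at
    least `min_samples` consecutive readings below `threshold_pct`."""
    periods = []
    start = None
    for i, value in enumerate(utilization):
        if value < threshold_pct:
            if start is None:
                start = i  # a new idle run begins here
        else:
            if start is not None and i - start >= min_samples:
                periods.append((start, i))
            start = None
    # Close a run that is still open when the trace ends.
    if start is not None and len(utilization) - start >= min_samples:
        periods.append((start, len(utilization)))
    return periods


trace = [90, 92, 0, 0, 0, 0, 91, 0, 88, 0, 0, 0]
print(find_idle_periods(trace))  # [(2, 6), (9, 12)]
```

Applying the same function to each GPU's trace separately also helps with the multi-GPU case: a device whose idle period starts partway through training and never ends has probably stopped receiving work.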
2) Make sure your GPU is the bottleneck
Here is a common situation we see: system memory is heavily used, and usage keeps gradually increasing. As the memory usage goes up, the GPU usage goes down. We also often see the network being the bottleneck when people try to train on datasets that aren’t available locally.
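One quick way to check whether the input pipeline, rather than the GPU, is the bottleneck is to time how long the training loop spends waiting for the next batch. The sketch below is framework-agnostic; `batches` and `train_step` are hypothetical stand-ins for your data loader and your per-batch training function. With frameworks that launch GPU work asynchronously, the training step would need to synchronize with the device for the numbers to be meaningful.

```python
import time


def data_wait_fraction(batches, train_step):
    """Return the fraction of wall-clock time spent waiting on `batches`."""
    wait = 0.0
    total_start = time.perf_counter()
    iterator = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # time spent here is input-pipeline wait
        except StopIteration:
            break
        wait += time.perf_counter() - t0
        train_step(batch)  # time spent here is compute
    total = time.perf_counter() - total_start
    return wait / total if total > 0 else 0.0
```

A fraction close to 1 suggests the loop is mostly waiting on data loading, disk, or the network; a fraction close to 0 suggests the compute step dominates.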
3) Try increasing your batch size
It doesn’t work in every case, but one simple way to possibly increase GPU utilization is to increase batch size. Gradients for a batch are generally calculated in parallel on a GPU, so as long as there is enough memory to fit the full batch and multiple copies of the neural network into GPU memory, increasing the batch size should increase the speed of calculation.
If I increase the batch size and change nothing else, I might conclude that increasing the batch size speeds up computation but reduces model performance. Here are my results training on CIFAR with batch sizes 32, 64 and 128.
Indeed, there are many papers and a top post on Stack Overflow warning about large batch sizes. There is a simple way to make larger batch sizes work reasonably well: increase the learning rate along with the batch size. Intuitively, this makes sense. Batch size is how many examples a training algorithm looks at before taking a step, and learning rate is roughly the size of that step, so if the model looks at more examples it should probably be comfortable taking a larger step. This is recommended in the paper "One weird trick for parallelizing convolutional neural networks" and later in "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour", where the authors managed to increase the batch size to around 8,000 before they saw a loss in performance.
I tried increasing the learning rate with the batch size on my own model and reached the same conclusion. When I multiply both the learning rate and the batch size by 4, my model trains faster and learns faster at each step.
The Facebook paper does some fancy things to make the model work well at very large batch sizes, and the authors get the same performance at much higher speeds with up to roughly 8,000 samples per batch.
These huge batch sizes make sense for distributed training, and the paper’s scheme of starting with a lower learning rate and then ramping it up looks very valuable in that context. If you’re training on one GPU and not maxing out your utilization, I have a quick recommendation: double your batch size and double your learning rate.
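The two ideas here, scaling the learning rate in proportion to the batch size and ramping up to the scaled rate at the start of training, can be sketched in a few lines. This assumes a simple linear ramp; the function names and the specific numbers in the usage lines are illustrative, not values taken from the paper or from my runs.

```python
def scaled_learning_rate(base_lr, base_batch_size, batch_size):
    """Scale the learning rate by the same factor as the batch size."""
    return base_lr * batch_size / base_batch_size


def warmup_learning_rate(step, warmup_steps, start_lr, target_lr):
    """Ramp linearly from start_lr to target_lr over warmup_steps steps,
    then hold target_lr."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return target_lr
    return start_lr + (target_lr - start_lr) * step / warmup_steps


# Doubling the batch size doubles the learning rate.
target = scaled_learning_rate(base_lr=0.1, base_batch_size=256, batch_size=512)

# Illustrative schedule: start at the original rate and reach the scaled
# rate after 500 steps.
schedule = [warmup_learning_rate(s, 500, 0.1, target) for s in range(0, 1000, 250)]
```

On a single GPU with a modest change in batch size, the scaling step alone is usually the first thing to try; the warmup matters more as the scaling factor gets large.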
You can dive into more data from my runs in the Batch Size Report.
GPUs are getting faster and faster, but it doesn’t matter if the training code doesn’t fully use them. The good news is that for most people training machine learning models there are still a lot of simple things to do that will significantly improve efficiency.
There’s another, probably larger, waste of resources: GPUs that sit unused. We don’t measure this, but I’ve heard it anecdotally from many of the companies we work with. It’s hard to queue up work efficiently for GPUs. In a typical workflow, a researcher will set up a large number of experiments, wait for them to finish, and then spend quite a lot of time digesting the results while the GPUs sit idle. This is outside the scope of wandb, but tools like Paperspace and Amazon SageMaker make it easy to spin resources up and down as needed.
Thanks to Sam Pottinger, Carey Phelps, James Cham, Yanda Erlich, and Stephanie Sher for edits and feedback.