TL;DR: To understand the significance of hyperparameter tuning outcomes, quantify the random variance of your experiments
Hyperparameter tuning is a game of accumulating tiny victories. When trying lots of different values in many combinations, it can be hard to tell the difference between background noise and real improvement in target metrics. Is a 0.6% uptick in validation accuracy when I increase the learning rate meaningful? What if I’m also changing the batch size? How many values and combinations should I keep trying to make sure? Maybe my model’s not that sensitive to these, and I should focus on more interesting variables.
One approach is to compare my observations across two conditions: how much do the results change in my active experiments versus in a random background noise condition (e.g. when the random seed is not set)? The magnitude of the variance in the signal condition (changing hyperparameters, fixed random seed) versus the noise condition (changing random seed, fixed hyperparameters) quantifies the observed improvement relative to a baseline (a null hypothesis, if you will). If the variance in validation accuracy over many trials from noise alone is 0.5%, then my 0.6% change isn’t very interesting. However, if the variance from noise alone is 0.06%, then my learning rate tuning produced a change ten times larger than the noise floor, which is much more promising.
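As a minimal sketch of this comparison, the spread of a metric across hyperparameter trials (signal) can be divided by its spread across seed-only reruns (noise). The function name and the accuracy values below are illustrative assumptions, not real runs:

```python
import statistics

def signal_to_noise(signal_accs, noise_accs):
    """Ratio of metric spread across hyperparameter trials (signal)
    to metric spread across random seeds alone (noise)."""
    return statistics.stdev(signal_accs) / statistics.stdev(noise_accs)

# Hypothetical validation accuracies (illustrative numbers only):
noise_accs = [0.971, 0.972, 0.970, 0.971, 0.972]   # fixed hyperparameters, varying seed
signal_accs = [0.965, 0.971, 0.976, 0.969, 0.978]  # varying learning rate, fixed seed

ratio = signal_to_noise(signal_accs, noise_accs)
# A ratio well above 1 suggests the hyperparameter actually matters;
# a ratio near 1 means the tuning results are indistinguishable from seed noise.
```

A single ratio like this is only a rough screen; with enough trials per condition, a proper variance test would sharpen the comparison.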
In this report I use W&B Sweeps to explore this approach on a simple example (a bidirectional RNN trained on MNIST) and visualize the difference between noise (left image below) and signal (right image). Click here for the full report.
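The noise condition can be expressed directly as a sweep configuration: hold every hyperparameter fixed and sweep only over the seed. The parameter names (`seed`, `learning_rate`) and values below are assumptions for illustration, not the report’s actual config:

```yaml
# Hypothetical W&B sweep config for the noise condition:
# vary only the random seed, fix all hyperparameters.
method: grid
metric:
  name: val_accuracy
  goal: maximize
parameters:
  seed:
    values: [0, 1, 2, 3, 4]
  learning_rate:
    value: 0.001
```

Swapping which block is fixed and which is swept (fixed seed, varying `learning_rate`) gives the matching signal condition.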