Carey Phelps, Product Lead

ML Best Practices: Test Driven Development at Latent Space

I sat down with the Latent Space team to talk about best practices around collaboration and managing model iteration. In machine learning, bugs may affect the distribution of possible models more than any particular instance, making traditional deterministic tests misleading. Because of this, a test-driven development framework for large ML models must account for the statistical nature of training. This is especially crucial when multiple researchers and engineers are contributing to the same model, as it’s easy to silently introduce regressions into a codebase. Here, the team shares some insights about how this new form of test-driven development has been the key to moving quickly on a large-scale collaborative project.

Latent Space is democratizing creativity through AI. They are making the first fully AI-rendered 3D engine, where users create the world itself and objects in it. Creation is as easy as uploading a picture to Instagram. But when you upload in Latent Space, you see it come to life in 3D. Bring your family car or the Millennium Falcon into a shared world that could look like anything from the Taj Mahal to Winterfell. This new level of self-expression enables greater diversity in 3D content, tackling one of the largest problems in the games industry.

Process in a nutshell

To iterate quickly and with confidence on cutting-edge generative models with a team, Latent Space has found two crucial strategies:

  1. Fully modular models.
  2. Testing with rigor.

Fully Modular Models

Latent Space built what they call the “block system” combined with the gin configuration framework to allow them to create new generators, discriminators, encoders, arbitrary skip connections, or tuning parameters in a high-level configuration file. The name was inspired by MILA’s old Theano “Blocks” system. Darryl Barnhart, architect of the block system, describes its benefits:

  1. Clarity of abstraction. By separating hyperparameters from the business logic of forward passes and model construction, configs become a higher-level way to design a model.
  2. Ease of ablation. Comparing two variants of a model is as easy as changing a few parameters in a config, lowering the barrier to experimentation significantly.
  3. Ease of hyperparameter search. They use W&B’s sweep feature to run thousands of variations of a model to find the best setup. (An even more extreme extension of this is Neural Architecture Search where an ML model drives architecture selection with even finer granularity.)
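A W&B sweep like the one described above can be defined in a few lines. This is a minimal sketch, not Latent Space's actual setup: the project name, metric, and parameter ranges are illustrative.

```python
# Hypothetical W&B sweep definition: Bayesian search over learning rate,
# batch size, and optimizer type, minimizing FID (all names illustrative).
sweep_config = {
    "method": "bayes",  # "grid" and "random" are the other search methods
    "metric": {"name": "fid", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"distribution": "log_uniform_values",
                          "min": 1e-5, "max": 1e-2},
        "batch_size": {"values": [16, 32, 64]},
        "optimizer": {"values": ["adam", "sgd"]},
    },
}

def launch(train_fn, trials=100):
    """Register the sweep and run `trials` variations of `train_fn`."""
    import wandb  # imported here so the sketch stays importable without wandb
    sweep_id = wandb.sweep(sweep_config, project="block-system-sweeps")
    wandb.agent(sweep_id, function=train_fn, count=trials)
```

Each agent run reads its assigned hyperparameters from `wandb.config`, which is what makes the thousands-of-variations workflow practical.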

A W&B “parallel coordinates” visualization of sweep results over learning rate, batch size and optimizer type, with the best performing runs interactively highlighted

Their config structure is hierarchical. They have one HEAD model that represents the latest and most effective changes. Everything else is a ‘child’ configuration. This makes ablations (A/B tests of changes) clear: inherit from the HEAD model, change the settings that are different/new and run that new config. Each config is essentially a model definition, and in W&B each config gets its own tag.

Here’s an example gin config:
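Their actual config isn't reproduced here, but a hypothetical child config in gin syntax, inheriting from a HEAD config and ablating a couple of settings, might look like this (module and parameter names are illustrative):

```
# Hypothetical child config: inherit everything from HEAD, then override
# only what this ablation changes. Names below are illustrative.
include 'configs/head.gin'

# Swap the generator's block type and widen it.
Generator.block = @ResidualBlock
ResidualBlock.channels = 256

# Tune the discriminator's optimizer for this ablation only.
discriminator/Adam.learning_rate = 1e-4
```

Because the child only states the diff against HEAD, the config file itself documents exactly what the ablation is testing.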

Chase Basich on the config system:

One thing about our config system that has been very helpful is the checkpoint-like nature of the system. For any feature, we can easily find the config that first introduced that feature to see a canonical example of how it was setup and then run that config. We can then use that config to find the corresponding W&B runs and reports. For example, I was recently investigating a regression and this provided a quick way to re-run the ablations and compare the new and old W&B data.

Testing with Rigor

Finding the new SOTA

One of the things that initially attracted them to W&B was how shareable W&B reports are. They loved the idea of associating reports with pull requests. Pull requests happen when code merges into their develop branch and when their HEAD model configs change. Having stats to prove that a set of changes improves their model was very enticing.

Initially, they compared the average and min/max of a few runs from a branch vs HEAD.

Two ablations (orange and red) compared with one of their HEAD models over the course of training in W&B. Each config comprises 5–10 separate runs. The solid lines indicate the mean and the vertical bands indicate min/max values at a given iteration.

If you run the same config many times, you’ll get different results depending on model initialization, randomized data-loader shuffling, or the nondeterministic parallel nature of GPU operations. Moreover, they realized they don’t just care about improving the mean, or even the min/max. If a change didn’t move the mean or min/max but significantly reduced the variance, it would still be a win: these changes make the signal much clearer and reduce the number of runs needed.
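That kind of comparison can be sketched with synthetic final-metric values (lower is better). The numbers below are made up for illustration; the point is that a candidate with a similar mean but much tighter spread is still an improvement.

```python
# Compare two configs' final metric across repeated runs (synthetic data).
import statistics

baseline = [41.2, 39.8, 44.1, 40.5, 43.0, 38.9, 42.7, 40.1]   # HEAD runs
candidate = [40.9, 41.3, 40.6, 41.0, 40.8, 41.1, 40.7, 41.2]  # ablation runs

def summarize(scores):
    return {
        "mean": statistics.mean(scores),
        "min": min(scores),
        "max": max(scores),
        "stdev": statistics.stdev(scores),
    }

base, cand = summarize(baseline), summarize(candidate)
# Similar means, but the candidate's spread is far smaller -- a clearer
# signal, which means fewer runs are needed to detect future changes.
lower_variance = cand["stdev"] < base["stdev"]
```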

Even during development, the team launches more than 10 training runs for each change. Using an in-house cloud setup, they can run them all in parallel, then analyze the results in W&B.

To communicate how changes improve their models, all their code pull requests use the following template:

They use one or more W&B tags per config, and then use different run sets (with their corresponding tags) in the same section of a report to pull the different arms of the experiment together.

Tips from Jeff Ling:

In W&B, each of our configs is mapped to one or many tags. That's how we find relevant runs. We use tags with reports to show the A/B comparison of a config change. For example, if I'm testing the effects of a new feature, I would find the head cloud model and the config with the feature, and put the same tag in both. When I run both, it appears in my report.
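Pulling both arms of an ablation by tag can be sketched with W&B's public API. The entity, project, tag, and metric names below are illustrative, not Latent Space's actual values.

```python
# Hypothetical sketch: fetch the runs for two tags (the two arms of an
# ablation) and collect each run's final value of a key metric.
def fetch_ablation_arms(tag_a="head-cloud", tag_b="new-feature"):
    import wandb  # local import so the sketch stays importable without wandb
    api = wandb.Api()
    arms = {}
    for tag in (tag_a, tag_b):
        runs = api.runs("my-entity/my-project",
                        filters={"tags": {"$in": [tag]}})
        # run.summary holds the last logged value of each metric.
        arms[tag] = [r.summary.get("fid") for r in runs]
    return arms
```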

Preventing Regressions

According to Sarah Jane Hong:

ML models are difficult to debug due to their adaptive nature; if you have a bug (or many!) the model can still perform adequately. The most insidious forms of bugs are the ones that "fail" silently. These can range from something unintended, such as passing softmax outputs to a loss that expects raw logits, to something as subtle as accidentally broadcasting a tensor somewhere in your model.
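The softmax/logits bug is a good example of why such failures are silent. Here is a framework-free sketch: the loss with the bug is still finite and plausible-looking, so nothing crashes and training proceeds, just worse.

```python
# A "silent" bug: applying softmax before a loss that already applies
# softmax internally. Nothing errors out; the loss is just wrong.
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, target):
    """Standard cross-entropy: softmax is applied *inside* the loss."""
    return -math.log(softmax(logits)[target])

logits = [2.0, -1.0, 0.5]
correct = cross_entropy_from_logits(logits, target=0)
# Bug: softmax applied twice. The loss is still a finite, "reasonable"
# number, so the mistake fails silently -- gradients are just badly scaled.
buggy = cross_entropy_from_logits(softmax(logits), target=0)
```

Distribution-level tests catch this class of bug: the buggy model still trains, but its metric distribution degrades relative to the baseline.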

They have hourly, nightly, and weekly integration tests, and unit tests on every PR. The tests are used to ensure that their models won’t perform worse as new changes come in. The more frequent the integration test, the shorter the duration of the run. Runs have several key metrics (for them this includes Fréchet Inception Distance), and any of these can be compared to a distribution of baseline runs.

They use a W&B report to track all their hourly integration test results over time. Each data point is the final score from a test (lower is better). Notice around 19-07-24 13:26 a bug results in spikes across all metrics for several runs, followed by a lasting improvement.

Getting fast, confident feedback on new features is central to Latent Space’s automated testing. They use a mixture of frequentist and Bayesian methods to compare the old and new distributions, but have increasingly relied on Bayesian approaches to update their beliefs more quickly as new evidence arrives (each run, and each sub-metric within a run, is new evidence). Most tests compare two distributions: a previous distribution from an earlier config/commit, and a new distribution from the integration test, PR, or ablation/sweep being run. As Latent Space scales their generative models, automated statistical analysis becomes increasingly valuable: it creates a fast feedback loop for adding new features and understanding their effect at scale, at minimal cost.
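A minimal sketch of the Bayesian flavor of such a comparison, assuming independent normal approximations and flat priors (synthetic numbers, not their actual tests): the posterior probability that the new config's mean is lower reduces to a Gaussian tail probability on the difference of sample means.

```python
# P(mean_new < mean_old) for a lower-is-better metric, under independent
# normal approximations of the two run distributions.
import math
import statistics

def prob_new_is_better(old, new):
    m_old, m_new = statistics.mean(old), statistics.mean(new)
    # Standard error of the difference of the two sample means.
    se = math.sqrt(statistics.variance(old) / len(old) +
                   statistics.variance(new) / len(new))
    z = (m_old - m_new) / se
    # Gaussian CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

old_runs = [42.0, 40.5, 43.1, 41.2, 39.9, 42.6]  # earlier config/commit
new_runs = [39.0, 38.4, 40.1, 38.8, 39.5, 38.2]  # candidate change
p = prob_new_is_better(old_runs, new_runs)
```

Each new run simply gets appended to its list and the probability is recomputed, which is the fast belief-updating loop the team describes.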

They also treat extremely out-of-distribution “good” results as failures. If such results were expected, they should have come in with a PR that established new baseline runs. If unexpected, there’s likely a new bug in how the metric is calculated, and the model hasn’t actually improved.
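A gate like that is two-sided: it fails a run for being significantly worse than baseline, and also for being suspiciously better. The threshold of three standard deviations below is an assumption for illustration, not their actual cutoff.

```python
# Two-sided gate for a lower-is-better metric against a baseline
# distribution of runs (k-sigma threshold is an assumed value).
import statistics

def gate(metric, baseline, k=3.0):
    """Return 'pass', 'regression', or 'suspicious'."""
    mean = statistics.mean(baseline)
    std = statistics.stdev(baseline)
    if metric > mean + k * std:
        return "regression"   # significantly worse than baseline
    if metric < mean - k * std:
        return "suspicious"   # "too good": likely a metric bug, or it
                              # should have arrived via a PR + new baselines
    return "pass"

baseline = [10.0, 10.5, 9.8, 10.2, 9.9, 10.1]
```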

Slack tells them when integration tests pass or fail and posts links to W&B runs.

Finally, they have unit tests that run via CircleCI on every PR to ensure the model can still mechanically train and that key functionality works as expected.


Latent Space’s meticulous approach to coordinating model improvements stood out to me as an example of a thoughtful and repeatable structure for pushing the envelope of best practices in the industry. 

One of the most interesting parts of Latent Space’s system is their use of composable blocks. They’ll dive into more detail on that subject in a future post. Get ready to learn about how this design choice allows the team to sweep through different architectures instead of just tweaking hyperparameters!

If you found this blog post interesting or valuable feel free to share with a friend! Their team is always happy to hear feedback - contact me at carey@wandb.com for any questions or ari@wandb.com to learn how to use W&B at your company.

Learn more about Latent Space here.

Weights & Biases

We're building lightweight, flexible experiment tracking tools for deep learning. Add a couple of lines to your Python script, and we'll keep track of your hyperparameters and output metrics, making it easy to compare runs and see the whole history of your progress. Think of us like GitHub for deep learning.

Partner Program

We are building our library of deep learning articles, and we're delighted to feature the work of community members. Contact Carey to learn about opportunities to share your research and insights.

Try our free tools for experiment tracking →