26 Feb 2020 / Chris Van Pelt, CVP of Weights & Biases

Distributing Our Self-Hosted Software

At Weights & Biases, we are building tools that help machine learning teams version, optimize, and understand their models.  Often these models rely on sensitive or proprietary data that is not allowed to leave a customer’s network.  We also have many users who are more familiar with running popular open source tools such as TensorBoard or Visdom locally.  I’m excited to announce the launch of wandb/local and to share the saga of how it came to be.

The wandb/local Docker image contains all of our backend services and UIs.  It enables you to use our tools entirely in-house, without any data leaving your secure networks.  wandb/local can be run on your laptop, a server, or in a Kubernetes cluster.  The production version can connect to an external MySQL instance and Cloud Storage and scale with your team's needs.  Using a single Docker image makes upgrades as simple as docker pull wandb/local:latest.
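
If you just want to kick the tires on a single machine, the raw Docker invocation looks roughly like the following (the port and container name here are assumptions on my part; the wandb local command shown at the end of this post wraps this up for you):

# pull the image and start it in the background, exposing the web UI
docker pull wandb/local:latest
docker run -d -p 8080:8080 --name wandb-local wandb/local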

How Local Came to Be

We started delivering a self-hosted version of our software a year ago using Terraform and Kubernetes.  This approach had three key problems:

  1. It assumed the customer was hosted on a public cloud.
  2. It was complicated to install and upgrade.
  3. It didn't support the use case of running wandb quickly on a local PC.

To simplify the installation of our software, we used Packer to build a single VM image for all major cloud vendors as well as VirtualBox / VMware.  This solved #1 and partially addressed #2, but didn't address #3.  Some prospective customers started asking if we could provide our service as a Docker image.  I understood how convenient it would be to simply type docker run wandb/local, but the general rule of thumb with Docker is one process per container.  I needed to run 10+ services in the container—did I dare try?

Attempt 1

Packer has a Docker builder, so I repurposed our Packer config and gave it a whirl.  The trickiest part of the build was installing MySQL: Ubuntu's init system does not play nicely inside Docker, and without an init system it was going to be really challenging to configure everything and keep all of the processes running.  I found some ways to hack around it, and after hours of trial and error, I finally got a build.  Then I checked the size of my new Docker image and realized I had another problem: a 10GB image wasn't gonna fly.

Attempt 2

I went back to the drawing board and did some googling.  The first resource I found was this very helpful page on the Docker documentation site, and it gave me new hope: I could build a Docker image from scratch.  I went to work and got something minimal running on a Sunday afternoon.  Midway through my implementation I stumbled on the phusion/baseimage-docker GitHub repo.  This was exactly what I needed!  The Phusion base image makes building multi-process Docker images really easy by replacing Ubuntu's init system with Runit.  All I had to do was add a startup script at /etc/service/NAME/run for each of my services, and I was off to the races.
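
As a rough sketch (the service and paths below are illustrative, not our exact setup), a Runit service is nothing more than an executable run script that execs the process in the foreground:

#!/bin/sh
# /etc/service/redis/run: Runit launches this script and restarts the process if it ever exits
exec 2>&1
exec /usr/bin/redis-server /etc/redis/redis.conf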

Our container runs the following services by default:

  1. Four Go services
  2. One Python service
  3. Nginx
  4. MySQL
  5. Redis
  6. Minio
  7. Cron

Attempt 3?

At this point, some of you may be thinking: why not use docker-compose?  We actually attempted this and quickly found it difficult to manage the dependencies between services.  For instance, our graphql service needs to connect to a MySQL instance that has already been migrated to the most recent schema.  Doing this with Compose would mean telling Nginx to display a loading state, preventing the graphql service from starting, running migrations in a special container, and then telling Nginx and the graphql service they could start.  Instead of adding complexity to manage these dependencies—and introducing the possibility of users running incompatible versions of our services—we opted for the single-image approach.  It let us handle migrations and loading screens with a simple bash script on startup.
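
The gist of that startup script is nothing more than wait-then-migrate; the commands below are placeholders rather than our actual binaries:

#!/bin/bash
# startup sketch: block until the bundled MySQL is accepting connections, then bring the schema current
set -e
until mysqladmin ping --silent; do
  sleep 1
done
/usr/local/bin/migrate-db   # placeholder for our real migration step; the app services come up once this succeeds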

Persistence

For persistence, we configured MySQL and Minio to store all of their data under /vol.  As long as persistent storage is mounted at /vol, data will survive restarts.  This also means that once the disk fills up, the services will crash.  In production environments, our customers use an external MySQL instance and a Cloud Storage bucket.  The system is configurable through an administration interface that persists all settings to Cloud Storage, removing the need to mount any persistent storage at all.
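
For example, running the container with a host directory mounted at /vol looks something like this (the host path is illustrative):

# anything MySQL and Minio write will land in /mnt/wandb on the host
docker run -d -p 8080:8080 -v /mnt/wandb:/vol wandb/local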

Performance

Because our backend services are written in Go, a single process can take advantage of every core available to the container.  We suggest 4 cores and 8GB of RAM as the minimum for running wandb/local in a production environment, but the services will use all the cores you give them.
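
If you want to pin the container to that recommendation explicitly, Docker's standard resource flags work as you'd expect (the values shown are just the suggested minimums):

# cap the container at 4 cores and 8GB of RAM
docker run -d --cpus=4 --memory=8g -p 8080:8080 -v /mnt/wandb:/vol wandb/local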

Availability

Having a single Docker container run a production service is generally a non-starter.  We are working on a simple path to a highly available version of wandb/local, but it's less of a concern for us because we designed our clients to be fault tolerant.  Any training run that can't connect to a W&B backend goes into a retry loop for up to 24 hours.  We also persist all training metrics on the training instance, so users can run wandb sync after a run has exited.
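
For example, if a run finished while the backend was unreachable, its metrics can be pushed up afterwards (the run directory name below is illustrative):

# upload the metrics that were written locally during the outage
wandb sync wandb/run-20200226_120000-abc123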

Size

In the end, the image comes in at just under 1.5GB.  We could make it smaller, but with PyTorch alone weighing in at 750MB, we don't think size will be an issue for the majority of our users.

Conclusion

Having a single Docker image as the entry point for our users makes it really easy to get started with W&B in an isolated network.  We're able to scale the service by moving key data stores out of the container, and distributing updates is as simple as docker pull wandb/local:latest.  If you would like to try wandb/local today, install Docker and run:

pip install wandb --upgrade
wandb local
python -c 'import wandb; wandb.init(project="local"); wandb.log({"hello": "world"})'

We’ll be updating the image regularly with new features. To upgrade your instance, run:

wandb local --upgrade

If you’re interested in running wandb/local in a production setting, send us a note at contact@wandb.com, and we’ll set you up with a free trial!
