1 Apr 2020 / Robert Porsch, Data Scientist at Apoidea

Model explorations and hyperparameter search with W&B and Kubernetes

In every machine learning project we have to continuously tweak and experiment with our models. This is necessary not only to further improve performance, but also to explore underlying model characteristics. These constant experiments require rigorous logging and performance tracking. Hence, various providers have come up with solutions to facilitate this tracking, such as TensorBoard, Comet, and W&B. Here at Apoidea we make use of W&B.

Within this blog post we would like to give a more practical overview of how we run machine learning experiments and track their performance, specifically how we quickly set up clusters in the cloud and train our models. We hope this might help others, as well as improve our current practices by engaging in a discussion with the wider machine learning community.

Within this post we will outline the following:

  1. How to get your training code ready and accessible as a docker container
  2. How to start a GPU-enabled Kubernetes cluster on GCP and run your container on it
  3. How to run hyperparameter sweeps with W&B across multiple nodes

We assume the reader is familiar with deep learning models, Docker, and some basics of GCP command-line tools such as gcloud.

Running experiments in the cloud

Fortunately for many machine learning practitioners there has been a real explosion of tools to facilitate machine learning and deep learning development and deployment.

These tools often allow you to not only deploy your training applications in the cloud but also to fine-tune hyperparameters as well as deploy your models in production.

These tools, such as SageMaker from AWS or Google's AI Platform, are great for fine-tuning well-established models on known problems, but come up rather short for more research-focused development. In addition, these products are often rather pricey.

Luckily, it is not very difficult to deploy your own deep learning cluster, which will give you more fine-grained control over your experiments (and potentially save you money).

Hence, we will give you a step by step tutorial on how we run deep learning experiments in the cloud with the help of Kubernetes.

Deploying on a Kubernetes cluster

Let us first discuss what Kubernetes is. Kubernetes is a "portable, extensible, open-source platform for managing containerized workloads and services, that facilitates both declarative configuration and automation".

Kubernetes has been used for a number of different deployments and it would be outside the scope of this blog post to discuss all its components. In essence, it allows you to manage your containerized training models on a cluster.

The great benefits of Kubernetes from a machine learning point of view are that it provides you with an easy way to scale experiments across multiple machines, fine-grained control over compute resources such as GPUs, and reproducible, containerized training environments.

Getting your code ready

As we have seen above, Kubernetes manages dockerized applications; hence, it needs a version of your training code packaged in a docker container.

We will not go into great detail on how to do this since there are some great online resources available (see here or here for example). However, let's assume we have the following basic Dockerfile available already:

FROM gcr.io/deeplearning-platform-release/pytorch-gpu
RUN pip install wandb # install W&B resources
COPY model-training-code /train # copy the training code into the image
CMD ["python", "/train/trainer.py"] # run the training script on container start

Hence, we have a dockerized version of our training application ready. Within this application we log our training process with W&B (see here on how to do this).
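
For completeness, here is a minimal sketch of what such logging could look like inside trainer.py. This is an illustrative placeholder, not our actual training code; the project name and metric names are made up.

# train/trainer.py -- simplified logging sketch
import random
import wandb

wandb.init(project="my-project")  # picks up WANDB_API_KEY from the environment

for epoch in range(10):
    # stand-ins for a real training and validation step
    train_loss = random.random()
    val_loss = random.random()
    wandb.log({"epoch": epoch, "train_loss": train_loss, "val_loss": val_loss})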

Making your dockerized application accessible

Next we need to push the dockerized version of the application to a private docker repository. This can be done via Google's Container Registry, which doesn't cost anything except storage.

This, again, is done quite simply. Just run the following script, which will:

  1. Build the container
  2. Push the container to the Google Container Registry.
#!/bin/bash
PROJECT_ID=$(gcloud config list project --format "value(core.project)")
IMAGE_REPO_NAME=pytorch_custom_container
IMAGE_TAG="latest"
IMAGE_URI=gcr.io/$PROJECT_ID/$IMAGE_REPO_NAME:$IMAGE_TAG
docker build -f Dockerfile -t "${IMAGE_URI}" ./
docker push "${IMAGE_URI}"

Now your containerized version is accessible throughout your GCP project. However, we now need to create a cluster of machines in order to train our model.

Starting a GCP Cluster

Let's start up a simple cluster with only one node. We will use an n1-standard-4 machine with 4 vCPUs as well as an nvidia-tesla-p100 GPU for training.

#!/bin/bash
name_of_your_cluster="pytorch-training-cluster"
gcloud container clusters create $name_of_your_cluster \
   --num-nodes=1 \
   --zone=asia-east1-a \
   --accelerator="type=nvidia-tesla-p100,count=1" \
   --machine-type="n1-standard-4" \
   --scopes="gke-default,storage-rw"
# install gpu drivers across all machines
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

Do not forget to run the last line of that code, since it installs the necessary NVIDIA drivers on each machine within the cluster.
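
To double-check that the drivers came up and that the GPUs are schedulable, the standard kubectl commands below should do; the exact output will vary with your cluster.

# the driver installer runs as pods in the kube-system namespace
kubectl get pods -n kube-system | grep nvidia
# once the drivers are ready, nodes report nvidia.com/gpu as an allocatable resource
kubectl describe nodes | grep -i "nvidia.com/gpu"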

Running your container

Now we have an up and running cluster of nodes with GPU support. Next we need to run our container. Container specifications as well as individual resource requests are specified in YAML files. For example, the following shows how one could configure the deployment of your container:

apiVersion: v1
kind: Pod
metadata:
  name: gke-training-pod
spec:
  restartPolicy: Never
  containers:
  - name: my-custom-container
    image: url_to_container_image
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
    - name: WANDB_API_KEY
      valueFrom:
        secretKeyRef:
          name: wandb-secret
          key: secret

Let's break down the key aspects of this YAML file: restartPolicy: Never makes sure the pod is not restarted once training has finished, the resources section requests one GPU for the container, and the env section injects our W&B credentials.

As you can see, we define the environment variable WANDB_API_KEY within the YAML description. This allows us to deploy our secret W&B key into the container without storing it anywhere in clear text.

Indeed, there are various ways we can store secrets (for a full overview please see here). In this particular example we have chosen to expose the secret as an environment variable. This can be done with a separate YAML file which is then applied to the cluster. Let us see how to do that:

  1. Convert your wandb key into base64 with echo -n 'my-wandb-key' | base64
  2. Create a new YAML file, let's call it wandb_kubernetes.yaml, which looks something like this:

apiVersion: v1
kind: Secret
metadata:
  name: wandb-secret
data:
  secret: your_secret_in_base64

  3. Deploy the secret with kubectl apply -f wandb_kubernetes.yaml

Please make sure not to commit this wandb_kubernetes.yaml file to your git repository.

Now we are ready to deploy our container on the GCP cluster. Simply run kubectl apply -f pod.yaml, where pod.yaml contains the pod specification shown above. Your containerized training application should now be running on the cluster and logging training metrics to W&B.
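
If you want to keep an eye on the run from the command line, the usual kubectl commands work here, for example (pod name taken from the specification above):

kubectl get pods
kubectl logs -f gke-training-pod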

Hyperparameter tuning with W&B sweeps

Now that we have set up our cluster and container, it is an easy step to run multiple experiments in parallel with slightly different parameters in order to explore our model or fine-tune hyperparameters. We use W&B sweeps, which help us do this kind of exploration in an automated fashion.

Setting up your experiment

W&B expects that parameters within your training script can be changed via the command line. Hence, your training script needs to accept calls such as the following:

python train/trainer.py --learning_rate=0.005 --optimizer=adam
python train/trainer.py --learning_rate=0.03 --optimizer=sgd

This can easily be achieved with packages such as argparse.
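
As a rough sketch, and assuming a trainer similar to the logging example above, the command-line interface could look something like this; the argument names are chosen to match the sweep configuration below, everything else is a placeholder:

# train/trainer.py -- argument parsing sketch
import argparse
import wandb

parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate", type=float, default=0.01)
parser.add_argument("--optimizer", type=str, default="adam")
args = parser.parse_args()

# register the hyperparameters with W&B so the sweep can track them
wandb.init(project="my-project", config=vars(args))
# args.learning_rate and args.optimizer are now available to build the model and optimizer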

We can then set up our experiment's parameter space with a simple YAML file:

program: train/trainer.py
method: bayes
metric:
  name: val_loss
  goal: minimize
parameters:
  learning_rate:
    min: 0.001
    max: 0.1
  optimizer:
    values: ["adam", "sgd"]

Then running wandb sweep sweep.yaml will initialize the sweep (but not yet run any code) and give you your sweep ID. Please take a look at the W&B documentation for a more thorough explanation of how to configure your sweep.

Running your experiments

As before, we set up our Kubernetes cluster. However, this time we increase the number of nodes from 1 to 4.

#!/bin/bash
name_of_your_cluster="pytorch-training-cluster"
gcloud container clusters create $name_of_your_cluster \
   --num-nodes=4 \
   --zone=asia-east1-a \
   --accelerator="type=nvidia-tesla-p100,count=1" \
   --machine-type="n1-standard-4" \
   --scopes="gke-default,storage-rw"

# install gpu drivers across all machines
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
# deploy secrets for W&B
kubectl apply -f wandb_kubernetes.yaml

Since W&B will now orchestrate the training, we need to connect each pod to the W&B server. This is simply done with wandb agent your_sweep_id. We can automatically call this command in all our deployments by overriding the container's default command within our deployment YAML specification:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sweep-model-quality
spec:
  replicas: 4
  selector:
    matchLabels:
      app: model-quality
  template:
    metadata:
      labels:
        app: model-quality
    spec:
      containers:
      - name: model-quality
        image: url_to_container
        resources:
          limits:
            nvidia.com/gpu: 1
        command: ["wandb", "agent", "your_sweep_id"]
        env:
        - name: WANDB_API_KEY
          valueFrom:
            secretKeyRef:
              name: wandb-secret
              key: secret

As you can see above, we have now specified that Kubernetes should replicate our container 4 times and changed the command to wandb agent your_sweep_id. Deploy this file with kubectl apply just as before; each container will then connect to the W&B server, which orchestrates the hyperparameter search.

And that is it. You can now go to your W&B site to monitor all your models.
