Multitask Learning with Weights & Biases

Ayush Thakur

In this post, I’ll walk you through my project “Faceless”. Some of the ideas are inspired by the article “Formulate your problem as an ML problem”. We’ll apply its best practices for problem formulation and cover multi-output classification extensively. Weights & Biases was very useful for iterating through model architectures quickly, finding a good architecture for this project, and monitoring model performance. You can find the code for the project here, and the W&B dashboard with the metrics here.

Defining the problem statement

The problem we’re solving is detecting faces in an input image and predicting the attributes (age, gender and ethnicity) of each detected face. Thus we need to build a computer vision pipeline that takes an image as input and predicts these attributes. Our pipeline will have two parts:

  1. Face detection: Questions – How to approach this? Should I try DL just because it’s cool?
  2. Attributes prediction: Questions – How to classify the images? Should there be three distinct classifiers? Should I take a multi-label approach or multi-output approach?


With all the hype about Deep Learning, our first instinct is to approach such problems with ML/DL. But we need to be thoughtful about the problem-solving approach we take, so here’s what my thought process was. For the sake of simplicity, I divided the problem into two parts. The first part is to detect all the faces that appear in the input image. The second part is to determine the age, gender and ethnicity of each detected face.

For the first part, the simplest method I know of is a Haar cascade. But Haar cascades have many limitations and are therefore not widely used. Another method is to pair Histogram of Oriented Gradients (HOG) feature descriptors with an SVM (a classical classification technique). This can certainly work, but it is a tedious approach and will not generalize well. Next, we need to predict the attributes of the detected faces, which can be framed as a classification task. Simple classification algorithms like k-nearest neighbors or SVM can be used, but images are high-dimensional data and generalization can be an issue. Thus the best tool in my arsenal was a Convolutional Neural Network based face detector and classifier.

I made the following decisions for this project:

  1. Finalize the architecture for face detection and find an open-source implementation, to avoid training the model from scratch.
  2. Build a multi-output classifier from scratch for face attribute classification and integrate it with the face detection pipeline.


More on face detection later, but since we will be building our face attributes classifier we need to collect the data and prepare it before we start training.

I will be using the UTKFace dataset, which has over 20,000 face images with annotations of age, gender, and ethnicity. The images cover large variations in pose, facial expression, illumination, occlusion, resolution, etc. I used the “Aligned and Cropped” version of the dataset, since my face detection pipeline was modified to output aligned and cropped faces. Click here to download the raw dataset.

The labels for each face image are embedded in the file name, in the form [age]_[gender]_[race]_[date&time].jpg. I wrote a simple Python script to extract this information and prepare the final dataset with randomly generated image ids and a CSV file. To prepare this dataset,

python run
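The label extraction described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the function names, the CSV column names, and the example file names are all assumptions made for the sketch.

```python
import uuid
from pathlib import Path


def parse_utkface_name(filename):
    """Parse '[age]_[gender]_[race]_[date&time].jpg' into (age, gender, race).

    Returns None when the name does not have exactly four '_'-separated
    fields or the first three are not integers (a "bad" file name).
    """
    parts = Path(filename).name.split("_")
    if len(parts) != 4:
        return None
    try:
        age, gender, race = (int(p) for p in parts[:3])
    except ValueError:
        return None
    return age, gender, race


def build_rows(filenames):
    """Build CSV-ready rows with a random image id, skipping bad names."""
    rows = []
    for name in filenames:
        labels = parse_utkface_name(name)
        if labels is None:
            continue  # drop file names that do not follow the pattern
        age, gender, race = labels
        rows.append({
            "image_id": uuid.uuid4().hex,  # hypothetical id scheme
            "age": age,
            "gender": gender,
            "ethnicity": race,
        })
    return rows


# Illustrative names only: the second one is missing the race field.
rows = build_rows(["25_0_2_20170116174525125.jpg",
                   "39_1_20170116174525125.jpg"])
```

The rows could then be written out with `csv.DictWriter`; skipping malformed names here is one way to handle bad file names during data preparation.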

The generated CSV file can be found in the datasets dir (face_dataset.csv). Next up, we do a simple exploratory data analysis (EDA) on the CSV file.

Fire up a Jupyter Notebook in the root directory of this project and open `EDA of Face Dataset.ipynb`. The important things to note are:

  1. There are three bad file names in the dataset; these should be removed during data preparation.
  2. The age label is continuous, as it was originally provided for a regression task. I grouped it into 12 categories. The age range covered is not the same for every category, because the boundaries were chosen to keep the classes balanced.
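One way to get roughly balanced age groups is to place the boundaries at quantiles of the observed ages. The sketch below illustrates that idea only; the project's real boundaries are not given in this post, and `balanced_age_edges` and `group_age` are hypothetical names, not the project's groupAge function.

```python
from bisect import bisect_right


def balanced_age_edges(ages, num_groups=12):
    """Pick group boundaries at quantiles so each group holds a similar count."""
    ordered = sorted(ages)
    edges = []
    for i in range(1, num_groups):
        cut = ordered[(i * len(ordered)) // num_groups]
        # Repeated ages can yield duplicate cuts; keep edges strictly increasing.
        if not edges or cut > edges[-1]:
            edges.append(cut)
    return edges


def group_age(age, edges):
    """Map a continuous age to a group index in [0, len(edges)]."""
    return bisect_right(edges, age)


# Toy data: one sample for every age from 1 to 120.
ages = list(range(1, 121))
edges = balanced_age_edges(ages)
groups = [group_age(a, edges) for a in ages]
```

On real data the groups span unequal age ranges, because ages that occur often get narrower bins.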

Face Detection

Let’s come back to where we left off with face detection. Following the decisions made earlier, I needed an architecture with an open-source implementation. It’s good practice to find the paper on the topic or architecture for an in-depth understanding, and having the code available helps you test the concept quickly. I decided to use Multi-Task Cascaded Convolutional Neural Networks (MTCNN) for the face detection pipeline. I quickly tested the architecture and was satisfied with the results. I was also able to find a pretrained weight file, which was another reason for choosing this architecture. I could therefore either use the weights as they are or retrain them on my dataset (transfer learning).

For the sake of this blog, I am not fine-tuning the weights. Retraining the model to fine-tune the weights on your own dataset is generally recommended.

Let’s quickly go through the MTCNN architecture. You can find the research paper here. The high-level overview of this paper can be summarized in Figure 1.

The face detection architecture has several components:

  1. Image Pyramid: The input image is resized to different scales to build an image pyramid, which is the input to the next three stages.
  2. Stage 1: A fully convolutional network produces the candidate windows and their bounding box regression vectors. Non-maximum suppression (NMS) is applied to merge highly overlapped candidates.
  3. Stage 2: The candidates from the first stage are passed through another CNN, which rejects a large number of false candidates, calibrates the remaining ones with bounding box regression, and applies NMS.
  4. Stage 3: Similar to stage 2 but describes the face in more detail. In the original implementation, the network would output the five facial landmarks’ positions. Since facial landmarks are not required for my pipeline, I discarded them.
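Since NMS appears in the stages above, here is a minimal sketch of greedy non-maximum suppression in plain Python. It is not the code from the MTCNN implementation I used; the (x1, y1, x2, y2) box format and the 0.5 threshold are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, visiting candidates from highest score down.

    A candidate is kept only if it does not overlap an already kept
    (higher scoring) box by more than iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

In this toy example the second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the distant third box survives.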

The output of this architecture is the bounding box coordinates of the detected faces. I created a function that takes the image and the bounding boxes as input and returns tightly cropped faces. These cropped face images are the input to the face attribute classifier.
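A cropping helper of this kind might look like the sketch below. It assumes the image is a NumPy array of shape (height, width, channels) and that each box is (x1, y1, x2, y2) in pixels; the project's own method may use a different format, and `crop_faces` is a hypothetical name.

```python
import numpy as np


def crop_faces(image, boxes):
    """Return one cropped array per box, clamping boxes to the image bounds."""
    height, width = image.shape[:2]
    crops = []
    for x1, y1, x2, y2 in boxes:
        # Detectors can return coordinates slightly outside the image.
        x1, y1 = max(0, int(x1)), max(0, int(y1))
        x2, y2 = min(width, int(x2)), min(height, int(y2))
        if x2 > x1 and y2 > y1:  # skip boxes that end up empty
            crops.append(image[y1:y2, x1:x2].copy())
    return crops


image = np.zeros((100, 80, 3), dtype=np.uint8)
crops = crop_faces(image, [(10, 20, 50, 60), (-5, -5, 20, 20), (90, 90, 95, 95)])
```

The third box lies entirely outside the 80-pixel-wide image, so it produces no crop.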

You can find the open-source implementation that I used here.

To learn more about this architecture you can go through this blog.

If you want to train this architecture this blog is a starting point.

You can find my extension of MTCNN in the face_detector dir, including the main definition of the MTCNN class. The output of this architecture is produced by a script in the project, and the getCropedImages method is responsible for cropping the faces.

Attributes Prediction

For attribute prediction, I chose to build my classifier and train it from scratch. There are two ways we can approach this.

  1. Multi-label classification: In this method, your network has only one set of fully connected (FC) layers at the end, which is responsible for all the labels.
  2. Multi-output classification: In this method, your network has at least two sets of FC layers (heads) at the end, each responsible for its own classification task.

In multi-label classification, the network is learning the relationship between the input and the labels, where the labels are learnt as joint sets; whereas multi-output classification is best suited for disjoint labels. Go through this amazing blog to learn more.
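To make the difference concrete, the sketch below encodes the labels of one face in both styles. The class counts are 12 age groups (from this post), 2 genders, and 5 ethnicity labels (as in UTKFace); the function names are hypothetical.

```python
NUM_AGE_GROUPS, NUM_GENDERS, NUM_ETHNICITIES = 12, 2, 5


def one_hot(index, size):
    """Return a list of zeros with a single 1 at position index."""
    vector = [0] * size
    vector[index] = 1
    return vector


def multi_label_target(age_group, gender, ethnicity):
    """One joint vector: a single head predicts all 19 entries together."""
    return (one_hot(age_group, NUM_AGE_GROUPS)
            + one_hot(gender, NUM_GENDERS)
            + one_hot(ethnicity, NUM_ETHNICITIES))


def multi_output_targets(age_group, gender, ethnicity):
    """Separate vectors: each head gets its own target and its own loss."""
    return {
        "age": one_hot(age_group, NUM_AGE_GROUPS),
        "gender": one_hot(gender, NUM_GENDERS),
        "ethnicity": one_hot(ethnicity, NUM_ETHNICITIES),
    }


joint = multi_label_target(3, 1, 2)
separate = multi_output_targets(3, 1, 2)
```

With separate targets, each head can use a loss suited to its own task, and the losses can be weighted independently.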

The face attributes (age, gender and ethnicity) are not related to each other – they are disjoint labels. So we’ll take a multi-output approach. Now that I’ve covered most of the theoretical aspects of this project, let’s dive into building this classifier.

Classifier and Weights and Biases integration

I used TensorFlow 2.0 for modelling and processing the input pipeline. By integrating Weights and Biases, I was able to find a good architecture for my model quickly.

I used GitHub to host the dataset used for training. In my Google Colab notebook, I simply cloned the repo. One can also upload the data to Google Drive. But in my experience, it’s slower to read the data from the drive.

Let’s set up W&B for tracking our experiments. The following code shows you the basic steps to integrate W&B with your machine learning experiments.

I built my input pipeline in TensorFlow. There were a total of 23,705 images and labels. It is good practice to start with a smaller dataset, so I used 5,000 images to train and 1,000 to validate. I also wrote a custom function to split the dataset into training, validation and test sets. This function also creates the age groups using the groupAge function and generates one-hot encoded labels. For a multi-output classifier, each head gets its own ground-truth label.
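The splitting step could be sketched as below. This is an illustration under assumptions rather than my actual function: `split_ids` is a hypothetical name, the ids are plain integers, and putting every remaining image into the test partition is an assumption.

```python
import random


def split_ids(image_ids, num_train, num_val, seed=42):
    """Shuffle ids reproducibly and cut them into three partitions."""
    ids = list(image_ids)
    if num_train + num_val > len(ids):
        raise ValueError("not enough images for the requested split")
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split repeatable
    return {
        "train": ids[:num_train],
        "validation": ids[num_train:num_train + num_val],
        "test": ids[num_train + num_val:],  # assumption: the rest is test data
    }


partitions = split_ids(range(23705), num_train=5000, num_val=1000)
```

A purely random split like this does not control the label distribution in each partition; a stratified split would be one way to address that.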

We have now created our labels. To get training and validation images I wrote a function loadImages, and used the partitions dictionary generated previously to load the images into trainX and validationX. I then scaled down the image pixels and finally applied the input pipeline.

I used tf.keras to build my model. Since there are three attributes to predict, the classifier has three legs. All three legs were simple VGG-inspired models. I didn’t experiment with the layer depth, the kernel size in the convolution layers, or the activation functions, but I highly encourage you to tweak them to see if you can beat my model’s performance.
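A three-legged model of this kind can be sketched with the Keras functional API as shown below. This is not my exact architecture: the 64x64 input size, the filter counts, the dense layer width, and the single sigmoid unit for gender are assumptions. The 12 age groups come from this post, and 5 is the number of ethnicity labels in UTKFace.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_leg(inputs, num_units, activation, name):
    """One VGG-style leg: stacked 3x3 conv blocks followed by a dense head."""
    x = inputs
    for filters in (32, 64):  # assumed depths, chosen for illustration
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)
    return layers.Dense(num_units, activation=activation, name=name)(x)


def build_model(input_shape=(64, 64, 3)):
    """Three independent legs that share only the input image."""
    inputs = tf.keras.Input(shape=input_shape)
    outputs = {
        "age": build_leg(inputs, 12, "softmax", "age"),
        "gender": build_leg(inputs, 1, "sigmoid", "gender"),
        "ethnicity": build_leg(inputs, 5, "softmax", "ethnicity"),
    }
    return tf.keras.Model(inputs=inputs, outputs=outputs, name="face_attributes")


model = build_model()
```

Naming the output layers is what later lets you attach a separate loss and loss weight to each leg by name.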

I was able to visualize the model using the model visualization tool offered by W&B. Navigate to a run and click on Model in the nav bar on the left-hand side.

At this stage you could spend a lot of time tweaking hyperparameters and experimenting with model architectures.

The next step is to compile the model with an optimizer, a loss function, and the metrics to track. We define a loss function for each leg of a multi-output classifier. Here age and ethnicity have multiple classes, while gender is binary, so a natural choice is a categorical cross-entropy loss for the age and ethnicity legs and a binary cross-entropy loss for the gender leg.

One of the most important hyperparameters in a multi-output classifier is the weight given to each of the loss functions. The overall loss is the weighted sum of the per-leg losses, so these weights determine how much each attribute contributes to training. I experimented with a few combinations of loss weights.
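In tf.keras, per-leg losses and their weights can be passed to `compile` as dictionaries keyed by output name. The sketch below uses a tiny stand-in model so that it is self-contained; the loss weight values are placeholders, not the combinations from my runs, and the single sigmoid unit for gender is an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Tiny stand-in model with three named outputs (not the real architecture).
inputs = tf.keras.Input(shape=(64, 64, 3))
features = layers.Conv2D(8, 3, activation="relu")(inputs)
features = layers.GlobalAveragePooling2D()(features)
outputs = {
    "age": layers.Dense(12, activation="softmax", name="age")(features),
    "gender": layers.Dense(1, activation="sigmoid", name="gender")(features),
    "ethnicity": layers.Dense(5, activation="softmax", name="ethnicity")(features),
}
model = tf.keras.Model(inputs=inputs, outputs=outputs)

# One loss per leg: categorical for the one-hot multi-class labels,
# binary for the single 0/1 gender label.
LOSSES = {
    "age": "categorical_crossentropy",
    "gender": "binary_crossentropy",
    "ethnicity": "categorical_crossentropy",
}

# Placeholder weights: total loss = sum of weight * per-leg loss.
LOSS_WEIGHTS = {"age": 1.5, "gender": 0.5, "ethnicity": 1.0}

model.compile(
    optimizer="adam",
    loss=LOSSES,
    loss_weights=LOSS_WEIGHTS,
    metrics={name: ["accuracy"] for name in LOSSES},
)
```

When calling `fit`, the labels would then be passed as a dictionary with the same three keys.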

I compiled the model with the Adam optimizer and trained for 5 epochs. You can find all the runs here. There are 14 important metrics to look at, and W&B let me visualize each of them separately with no effort on my part.

Putting it all together

Now we have our face detector and face attribute classifier. We have treated them as two separate pipelines so far; now they should be cascaded so that the faces detected by the MTCNN go in as inputs to the face attribute classifier. I talked about the face detection code earlier. On the classifier side, a separate piece of code is responsible for loading the saved model, and the predict() method in the IdentifyFace class outputs the predictions for each detected face.

Finally, one script puts everything together.

Simply run the script from your command line.

If you don’t want to see the bounding boxes on the faces, you can simply pass False to the `-b` argument. I also gave each face an ID.

The program outputs two dictionaries. The first dictionary contains the bounding box for each detected face, along with a confidence score. The second dictionary contains the attributes of each face: age, gender and ethnicity. (face_0 represents the prediction for the detected face with ID 0.)
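The cascading logic and the shape of the two output dictionaries can be sketched as follows. The detector and classifier here are stub functions standing in for MTCNN and the attribute model, and the inner key names ("box", "confidence") and example values are made up for illustration; only the face_0 style ids come from the project.

```python
def run_pipeline(image, detect_faces, crop_face, predict_attributes):
    """Detect faces, then classify each cropped face.

    Returns (detections, attributes), both keyed by 'face_<id>'.
    """
    detections, attributes = {}, {}
    for face_id, (box, confidence) in enumerate(detect_faces(image)):
        key = f"face_{face_id}"
        detections[key] = {"box": box, "confidence": confidence}
        attributes[key] = predict_attributes(crop_face(image, box))
    return detections, attributes


# Stubs standing in for the real detector and classifier.
def fake_detector(image):
    return [((10, 20, 50, 60), 0.99), ((70, 15, 110, 65), 0.95)]


def fake_crop(image, box):
    return box  # a real implementation would slice the image here


def fake_classifier(face):
    return {"age": "group_3", "gender": "female", "ethnicity": "class_2"}


detections, attributes = run_pipeline(
    image=None,
    detect_faces=fake_detector,
    crop_face=fake_crop,
    predict_attributes=fake_classifier,
)
```

Passing the detector and classifier in as callables keeps the two pipelines separate, so either one can be swapped or retrained without touching the other.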

That’s it!

If you want to make improvements to the pipeline, you can try:

  1. Fine-tuning the MTCNN face detector. I used an off-the-shelf weights file because it was good enough for the scope of my project.
  2. Training the face attributes classifier more thoroughly. I didn’t train it hard, and the model can be tweaked in many ways. I also did not check the distribution of labels after the train-validation-test split.
  3. Building both pipelines as a single architecture trained end to end. In my opinion this should give much better results, but you need the right kind of dataset for it.

I hope this project helps you build your own machine learning pipeline. I tried to document my thought process while working on it. I would like to thank Sayak Paul (da) for his guidance and the opportunities he gave me. I would also like to thank Weights and Biases for the amazing service and all the support I received.

You can find the code for this project here, and the W&B dashboard with the metrics here.

Final words: We should respect the ethnicity, age and gender of every individual. This project is in no way intended to demean any ethnic group or gender. I hope readers will not use it for any nefarious purposes. Harmony is important for true happiness. Thank you.
