2 Apr 2020 / Ayush Thakur & Sayak Paul

Introduction to image inpainting with deep learning

In this article, we are going to learn how to do “image inpainting”, i.e. fill in missing parts of images precisely using deep learning. We’ll first discuss what image inpainting really means and the possible use cases it can cater to. Next, we’ll discuss some traditional image inpainting techniques and their shortcomings. Finally, we’ll see how to train a neural network that is capable of performing image inpainting with the CIFAR10 dataset.


Grab a cup of coffee and let’s dive in! This is going to be a long one.

Fig 1: Masked images, ground truth and deep inpainting result. Link to the Weights and Biases page from where it was captured.

What is image inpainting?

Image inpainting is the art of reconstructing damaged or missing parts of an image, and it can be extended to videos easily. A plethora of use cases have been made possible due to image inpainting.

Fig 2: Image inpainting results gathered from NVIDIA’s web playground

Imagine having a favorite old photograph with your grandparents from when you were a child, but for some reason portions of that photograph got corrupted. That is the last thing you would want given how special the photograph is to you. Image inpainting can be a life saver here.

Image inpainting can be immensely useful for museums that might not have the budget to hire a skilled artist to restore deteriorated paintings.

Now, think about your favorite photo editor. Having an image inpainting function in there would be kind of cool, wouldn’t it?

Image inpainting can also be extended to videos (videos are a series of image frames, after all). Due to over-compression, certain parts of a video can sometimes get corrupted. Modern image inpainting techniques are capable of handling this gracefully as well.

Producing images where the missing parts have been filled with both visually and semantically plausible content is the main objective of an artificial image inpainter. It is safe to admit that this is indeed a challenging task.

Now that we have some sense of what image inpainting means (we will go through a more formal definition later) and some of its use cases, let’s switch gears and discuss some common techniques used to inpaint images (spoiler alert: classical computer vision).

Doing image inpainting: The traditional way

There is an entire world of computer vision without deep learning. Before Single Shot Detectors (SSD) came into existence, object detection was still possible (although the precision was not anywhere near what SSDs are capable of). Similarly, there are a handful of classical computer vision techniques for doing image inpainting. In this section, we are going to discuss two of them. First, let’s introduce ourselves to the central themes these techniques are based on - either texture synthesis or patch synthesis.

To inpaint a particular missing region in an image, these techniques borrow pixels from surrounding regions of the given image that are not missing. It’s worth noting that they are good at inpainting backgrounds in an image but fail to generalize when the missing region covers objects rather than plain background texture.

For the latter case, there have still been good results with traditional systems when the objects have a repetitive structure. But when those objects are non-repetitive in structure, it again becomes difficult for the inpainting system to infer what to fill in.

If we think about it, at a very granular level image inpainting is nothing but the restoration of missing pixel values. So we might ask ourselves - why can’t we just treat it as another missing value imputation problem? Well, images are not just any random collection of pixel values; they are a spatial collection of pixel values. So treating the task of image inpainting as a mere missing value imputation problem is a bit irrational. We will answer the following question in a moment - why not simply use a CNN for predicting the missing pixels?

Now, coming to the two techniques -

Fig 3: Edges in an image are assumed to be continuous in nature
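If you want to experiment with classical inpainting yourself, OpenCV ships two such algorithms, a Navier-Stokes based method and Telea’s fast marching method, behind a single cv2.inpaint call. Here is a minimal sketch (the file names are placeholders):

import cv2

# Hypothetical file names, just for illustration
img = cv2.imread("damaged_photo.png")                       # 8-bit, 3-channel image
mask = cv2.imread("damage_mask.png", cv2.IMREAD_GRAYSCALE)  # non-zero pixels mark the region to fill

# Telea's fast marching method
restored_telea = cv2.inpaint(img, mask, 3, cv2.INPAINT_TELEA)
# Navier-Stokes based method
restored_ns = cv2.inpaint(img, mask, 3, cv2.INPAINT_NS)

cv2.imwrite("restored_telea.png", restored_telea)
cv2.imwrite("restored_ns.png", restored_ns)

The third argument is the inpainting radius, i.e. the size of the neighborhood each algorithm considers around a point being filled.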

To have a taste of the results that these two methods can produce, refer to this article. Now that we have familiarized ourselves with the traditional ways of doing image inpainting, let’s see how to do it the modern way, i.e. with deep learning.

Doing image inpainting: The modern way

In this approach, we train a neural network to predict missing parts of an image such that the predictions are both visually and semantically consistent. Let’s take a step back and think about how we humans would do image inpainting. This will help us formulate the basis of a deep learning-based approach, and it will also help us frame the problem statement for the task of image inpainting.

When trying to reconstruct a missing part in an image, we make use of our understanding of the world and incorporate the context that is needed to do the task. This is one example where we elegantly marry a certain context with a global understanding. So, could we instill this in a deep learning model? We will see.

We humans rely on the knowledge base (understanding of the world) that we have acquired over time. Current deep learning approaches are far from harnessing such a knowledge base in any sense, but we sure can capture spatial context in an image using deep learning. A convolutional neural network (CNN) is a specialized neural network for processing data that has a known grid-like topology - for example, an image can be thought of as a 2D grid of pixels. Ours will be a learning-based approach in which we train a deep CNN-based architecture to predict missing pixels.

A simple image inpainting model with the CIFAR10 dataset

ML/DL concepts are best understood by actually implementing them. In this section, we will walk you through the implementation of deep image inpainting while discussing its few key components. We first need a dataset and, most importantly, need to prepare it to suit the objective task. A quick spoiler before we discuss the architecture: this DL task is set up in a self-supervised learning fashion.

Why choose a simple dataset?

Since inpainting is a process of reconstructing lost or deteriorated parts of images, we can take any image dataset and add artificial deterioration to it. For this specific DL task we have a plethora of datasets to work with. That said, real-life applications of image inpainting are usually done on high resolution images (e.g. 512 x 512 pixels). But according to this paper, to allow a pixel to be influenced by content 64 pixels away requires at least 6 layers of 3×3 convolutions with dilation factor 2.

Thus, using such high resolution images does not fit our purpose here. It’s a general practice to apply ML/DL concepts to toy datasets first. To keep the computational requirements low and the implementation quick, we will use the CIFAR10 dataset.

Data Preparation

The entry step to any DL task is data preparation. In our case, as mentioned, we need to add artificial deterioration to our images. This can be done using the standard image processing idea of masking an image. Since this is done in a self-supervised learning setting, we need (X, y) pairs derived from the same images to train our model. Here X will be batches of masked images, while y will be the original/ground truth images.

Fig 4: Original images, masks and masked images on CIFAR10

To simplify masking we first assumed that the missing section is a square hole. To prevent overfitting to such an artifact, we randomized the position of the square along with its dimensions.

Using these square holes significantly limits the utility of the model in real applications, because deterioration in real images is rarely a neat square blob. Thus, inspired by this paper, we implemented irregular holes as masks: we simply drew lines of random length and thickness using OpenCV.

We will implement a Keras data generator to do this. It will be responsible for creating random batches of X and y pairs of the desired batch size, applying the masks to X, and making the batches available on the fly. For high resolution images, using a data generator is the only cost-effective option. Our data generator createAugment is inspired by this amazing blog. Please give it a read.

import numpy as np
import cv2
from tensorflow import keras

class createAugment(keras.utils.Sequence):
 # Generates masked_image, masks, and target images for training
 def __init__(self, X, y, batch_size=32, dim=(32, 32),
   n_channels=3, shuffle=True):
     # Initialize the constructor
     self.batch_size = batch_size
     self.X = X
     self.y = y
      self.dim = dim
     self.n_channels = n_channels
     self.shuffle = shuffle
     self.on_epoch_end()

 def __len__(self):
   # Denotes the number of batches per epoch
   return int(np.floor(len(self.X) / self.batch_size))

 def __getitem__(self, index):
   # Generate one batch of data
   indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
   # Generate data
   X_inputs, y_output = self.__data_generation(indexes)
   return X_inputs, y_output

 def on_epoch_end(self):
   # Updates indexes after each epoch
   self.indexes = np.arange(len(self.X))
   if self.shuffle:
     np.random.shuffle(self.indexes)

The methods in the code block above are self-explanatory. Let’s talk about the methods __data_generation and __createMask, implemented specifically for our use case. As the name suggests, the private method __createMask is responsible for generating a binary mask for each image in a batch of a given batch size: it draws black lines of random length and thickness on a white background. You may notice that it returns the mask along with the masked image. Why do we need this mask? We will see soon.

def __createMask(self, img):
 ## Prepare masking matrix
 mask = np.full((32,32,3), 255, np.uint8) ## White background
 for _ in range(np.random.randint(1, 10)):
    # Get random x coordinates for the two endpoints of the line
    x1, x2 = np.random.randint(1, 32), np.random.randint(1, 32)
    # Get random y coordinates for the two endpoints of the line
    y1, y2 = np.random.randint(1, 32), np.random.randint(1, 32)
   # Get random thickness of the line drawn
   thickness = np.random.randint(1, 3)
   # Draw black line on the white mask
   cv2.line(mask,(x1,y1),(x2,y2),(0,0,0),thickness)
 ## Mask the image
 masked_image = img.copy()
 masked_image[mask==0] = 255
 return masked_image, mask

Keras' model.fit requires input and target data, for which it calls __getitem__ under the hood. If traingen is an instance of createAugment, then traingen[i] is roughly equivalent to traingen.__getitem__(i), where i ranges from 0 to len(traingen) - 1. This special method internally calls __data_generation, which is responsible for preparing batches of Masked_images, Mask_batch, and y_batch.

def __data_generation(self, idxs):
 # Masked_images is a matrix of masked images used as input
 Masked_images = np.empty((self.batch_size, self.dim[0], self.dim[1], self.n_channels)) # Masked image
 # Mask_batch is a matrix of binary masks used as input
 Mask_batch = np.empty((self.batch_size, self.dim[0], self.dim[1], self.n_channels)) # Binary Masks
 # y_batch is a matrix of original images used for computing error from reconstructed image
 y_batch = np.empty((self.batch_size, self.dim[0], self.dim[1], self.n_channels)) # Original image
 
 ## Iterate through random indexes
 for i, idx in enumerate(idxs):
   image_copy = self.X[idx].copy()
   ## Get mask associated to that image
   masked_image, mask = self.__createMask(image_copy)
   
   ## Append and scale down.
   Masked_images[i,] = masked_image/255
   Mask_batch[i,] = mask/255
   y_batch[i] = self.y[idx]/255

 return [Masked_images, Mask_batch], y_batch
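
Here is a minimal usage sketch: we load CIFAR10 and wrap the train and test splits in the generator. X and y are the same images, since the ground truth is simply the unmasked original (traingen and testgen are the instance names used throughout the rest of this post).

from tensorflow import keras

# CIFAR10 ships with class labels, but we only need the images here
(x_train, _), (x_test, _) = keras.datasets.cifar10.load_data()

# X and y are the same images; the generator masks X on the fly
traingen = createAugment(x_train, x_train)
testgen = createAugment(x_test, x_test, shuffle=False)

[masked_images, masks], originals = traingen[0]
print(masked_images.shape, masks.shape, originals.shape)  # (32, 32, 32, 3) each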

Architecture

Inpainting is part of a large set of image generation problems. The goal of inpainting is to fill in the missing pixels; it can be seen as a form of creating or modifying pixels, which also includes tasks like deblurring, denoising and artifact removal, to name a few. Methods for solving these problems usually rely on an Autoencoder - a neural network that is trained to copy its input to its output. It is comprised of an encoder which learns a code to describe the input, h = f(x), and a decoder that produces the reconstruction, r = g(h) = g(f(x)).

Vanilla Convolutional Autoencoder

An Autoencoder is trained to reconstruct the input, i.e. g(f(x)) = x, but copying is not the end goal. We hope that training the Autoencoder will result in h taking on discriminative features. It has been noticed that if the Autoencoder is not trained carefully, it tends to memorize the data rather than learn any useful salient features.

Rather than limiting the capacity of the encoder and decoder (i.e. using a shallow network), regularized Autoencoders are used. Usually a loss function is chosen that encourages the model to learn properties beyond the ability to copy the input. These other properties can include sparsity of the representation and robustness to noise or to missing inputs. This is where image inpainting can benefit from an Autoencoder-based architecture. Let’s build one.

Fig 5: Simple deep autoencoder architecture. (Source)

To set a baseline we will build an Autoencoder using a vanilla CNN. It’s always a good practice to first build a simple model to set a benchmark and then make incremental improvements. If you want to refresh your concepts on Autoencoders, this article by PyImageSearch is a good starting point. As stated previously, the aim is not to master copying, so we design the training setup such that the model learns to fill in the missing points. We use mean_squared_error as the loss to start with and the dice coefficient as the metric for evaluation.

def dice_coef(y_true, y_pred):
   y_true_f = keras.backend.flatten(y_true)
   y_pred_f = keras.backend.flatten(y_pred)
   intersection = keras.backend.sum(y_true_f * y_pred_f)
   return (2. * intersection) / (keras.backend.sum(y_true_f + y_pred_f))

For tasks like image segmentation and image inpainting, pixel-wise accuracy is not a good metric because of the heavy imbalance between the pixel classes (most pixels are not part of the hole). Though it’s easy to interpret, the accuracy score is often misleading. Two commonly used alternatives are IoU (Intersection over Union) and the Dice coefficient. They are similar in the sense that both measure how well the predicted pixels overlap with the ground truth pixels. You can check out this amazing explanation here.
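
To make the baseline concrete, here is a hedged sketch of what such a vanilla convolutional Autoencoder can look like for 32x32x3 CIFAR10 inputs (the exact depth and filter counts we used are in the notebook linked at the end of this section). It takes both the masked image and the mask, to stay compatible with our data generator, and simply concatenates them:

def build_baseline_autoencoder(input_shape=(32, 32, 3)):
  masked_image = keras.layers.Input(shape=input_shape)
  mask = keras.layers.Input(shape=input_shape)

  # Concatenate the mask so the network knows which pixels are missing
  x = keras.layers.Concatenate(axis=-1)([masked_image, mask])

  # Encoder: compress the masked image into a smaller spatial code
  x = keras.layers.Conv2D(32, (3, 3), activation='relu', padding='same')(x)
  x = keras.layers.MaxPooling2D((2, 2))(x)   # 16x16
  x = keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same')(x)
  x = keras.layers.MaxPooling2D((2, 2))(x)   # 8x8

  # Decoder: reconstruct the full image, including the masked region
  x = keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same')(x)
  x = keras.layers.UpSampling2D((2, 2))(x)   # 16x16
  x = keras.layers.Conv2D(32, (3, 3), activation='relu', padding='same')(x)
  x = keras.layers.UpSampling2D((2, 2))(x)   # 32x32
  outputs = keras.layers.Conv2D(3, (3, 3), activation='sigmoid', padding='same')(x)

  model = keras.models.Model(inputs=[masked_image, mask], outputs=outputs)
  model.compile(optimizer='adam', loss='mean_squared_error', metrics=[dice_coef])
  return model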


We implemented a simple PredictionLogger callback that, after each epoch completes, calls model.predict() on the same test batch of size 32. Using wandb.log() we can easily log the masked images, masks, predictions, and ground truth images. Fig 1 is the result of this callback. Here’s the full callback that implements this -

import tensorflow as tf
import wandb

class PredictionLogger(tf.keras.callbacks.Callback):
    def __init__(self):
        super(PredictionLogger, self).__init__()

    # The callback will be executed after an epoch is completed
    def on_epoch_end(self, epoch, logs=None):
       # Pick a batch, and sample the masked images, masks, and the labels
       sample_idx = 54
       [masked_images, masks], sample_labels = testgen[sample_idx]  
       
        # Initialize empty lists to store the intermediate results
       m_images = []
       binary_masks = []
       predictions = []
       labels = []
       
       # Iterate over the batch
       for i in range(32):
          # Our inpainting model accepts a masked image and a mask as its inputs,
          # so add a batch dimension to both and perform inference
          inputs = [masked_images[i][np.newaxis, ...], masks[i][np.newaxis, ...]]
          inpainted_image = model.predict(inputs)
       
         # Append the results to the respective lists
         m_images.append(masked_images[i])
         binary_masks.append(masks[i])
          predictions.append(inpainted_image.reshape(inpainted_image.shape[1:]))
         labels.append(sample_labels[i])

       # Log the results on wandb run page and voila!
       wandb.log({"masked_images": [wandb.Image(m_image)
                             for m_image in m_images]})
       wandb.log({"masks": [wandb.Image(mask)
                             for mask in binary_masks]})
       wandb.log({"predictions": [wandb.Image(inpainted_image)
                             for inpainted_image in predictions]})
       wandb.log({"labels": [wandb.Image(label)
                             for label in labels]})

Fig 6: Result of the baseline implementation.
Fig 7: You can see the progression using PredictionLogger.

You can find the notebook for this baseline implementation here. The associated W&B run page can be found here.  

Even though the results are satisfactory for the CIFAR10 dataset, the authors of this paper point out that the convolution operation is ineffective at modeling long-term correlations between distant contextual information (groups of pixels) and the hole regions. CNN-based methods can create boundary artifacts and distorted, blurry patches. Post-processing is usually used to reduce such artifacts, but it is computationally expensive and does not generalize well. This compelled many researchers to find ways to achieve human-level image inpainting.

Partial Convolutions

We will now talk about Image Inpainting for Irregular Holes Using Partial Convolutions as a strong alternative to vanilla CNNs. Partial convolution was proposed to handle missing data such as holes in images. The original formulation is as follows: suppose X contains the feature values for the current sliding (convolution) window and M is the corresponding binary mask, where holes are denoted by 0 and non-holes by 1. Mathematically, partial convolution can be expressed as,
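
x' = W^T (X ⊙ M) * sum(1)/sum(M) + b,   if sum(M) > 0
x' = 0,                                  otherwise

where W are the convolution filter weights, b is the corresponding bias, and ⊙ denotes element-wise multiplication (notation as in the paper).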

The scaling factor sum(1)/sum(M) applies appropriate scaling to adjust for the varying number of valid (unmasked) inputs. After each partial convolution operation, we also update the mask: if the convolution was able to condition its output on at least one valid input (feature) value, then we mark that location as valid. This can be expressed as,
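
m' = 1,   if sum(M) > 0
m' = 0,   otherwise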

With enough successive layers of partial convolutions, any mask will eventually become all ones, as long as the input contained any valid pixels. In order to replace the vanilla CNN with a partial convolution layer in our image inpainting task, we need an implementation of this layer.
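
Before looking at a full layer, the rule is easy to sanity-check for a single window with plain NumPy (the numbers below are made up):

import numpy as np

# One 3x3 sliding window: X holds feature values, M the binary mask
# (1 = valid pixel, 0 = hole), W is a 3x3 filter and b its bias.
X = np.array([[0.2, 0.0, 0.5],
              [0.1, 0.0, 0.0],
              [0.3, 0.4, 0.0]])
M = np.array([[1., 0., 1.],
              [1., 0., 0.],
              [1., 1., 0.]])
W = np.full((3, 3), 1.0 / 9.0)
b = 0.0

if M.sum() > 0:
    # Convolve only over the valid pixels and rescale by sum(1)/sum(M)
    x_out = np.sum(W * (X * M)) * (M.size / M.sum()) + b
    m_out = 1.0  # at least one valid input, so this output location becomes valid
else:
    x_out, m_out = 0.0, 0.0

print(x_out, m_out)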

Fig 8: Use of partial convolutions in an encoder-decoder architecture. (Source)

Unfortunately, since there is no official implementation in TensorFlow or PyTorch, we have to implement this custom layer ourselves. This TensorFlow tutorial on how to build a custom layer is a good starting point. Luckily, we could find a Keras implementation of partial convolution here. That codebase used TF 1.x as the Keras backend, which we upgraded to TF 2.x. We have provided this upgraded implementation along with the GitHub repo for this blog post. Find the PConv2D layer here.

Let’s implement the model in code and train it on the CIFAR10 dataset. We implemented a class inpaintingModel; to build the model, you need to call its prepare_model() method.

def prepare_model(self, input_size=(32,32,3)):
   input_image = keras.layers.Input(input_size)
   input_mask = keras.layers.Input(input_size)
 
   conv1, mask1, conv2, mask2 = self.__encoder_layer(32, input_image, input_mask)
   conv3, mask3, conv4, mask4 = self.__encoder_layer(64, conv2, mask2)
   conv5, mask5, conv6, mask6 = self.__encoder_layer(128, conv4, mask4)
   conv7, mask7, conv8, mask8 = self.__encoder_layer(256, conv6, mask6)

   conv9, mask9, conv10, mask10 = self.__decoder_layer(256, 128, conv8, mask8, conv7, mask7)
   conv11, mask11, conv12, mask12 = self.__decoder_layer(128, 64, conv10, mask10, conv5, mask5)
   conv13, mask13, conv14, mask14 = self.__decoder_layer(64, 32, conv12, mask12, conv3, mask3)
   conv15, mask15, conv16, mask16 = self.__decoder_layer(32, 3, conv14, mask14, conv1, mask1)

   outputs = keras.layers.Conv2D(3, (3, 3), activation='sigmoid', padding='same')(conv16)

   return keras.models.Model(inputs=[input_image, input_mask], outputs=[outputs])

As it’s an Autoencoder, this architecture has two components - the encoder and the decoder - which we have discussed already. In order to reuse the encoder and decoder conv blocks, we built two simple utility methods, __encoder_layer and __decoder_layer.

def __encoder_layer(self, filters, in_layer, in_mask):
  conv1, mask1 = PConv2D(filters, (3,3), strides=1, padding='same')([in_layer, in_mask])
 conv1 = keras.activations.relu(conv1)

  conv2, mask2 = PConv2D(filters, (3,3), strides=2, padding='same')([conv1, mask1])
 conv2 = keras.layers.BatchNormalization()(conv2, training=True)
 conv2 = keras.activations.relu(conv2)

 return conv1, mask1, conv2, mask2

def __decoder_layer(self, filter1, filter2, in_img, in_mask, share_img, share_mask):
 up_img = keras.layers.UpSampling2D(size=(2,2))(in_img)
 up_mask = keras.layers.UpSampling2D(size=(2,2))(in_mask)
 concat_img = keras.layers.Concatenate(axis=3)([share_img, up_img])
 concat_mask = keras.layers.Concatenate(axis=3)([share_mask, up_mask])

 conv1, mask1 = PConv2D(filter1, (3,3), padding='same')([concat_img, concat_mask])
 conv1 = keras.activations.relu(conv1)

 conv2, mask2 = PConv2D(filter2, (3,3), padding='same')([conv1, mask1])
 conv2 = keras.layers.BatchNormalization()(conv2)
 conv2 = keras.activations.relu(conv2)

 return conv1, mask1, conv2, mask2

The essence of this Autoencoder implementation lies in the UpSampling2D and Concatenate layers. An alternative is to use a Conv2DTranspose layer for the upsampling.

We compiled the model with the Adam optimizer (default parameters), mean_squared_error as the loss, and dice_coef as the metric. We trained the model using model.fit(), and the results were logged using the WandbCallback and PredictionLogger callbacks.
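
As a hedged sketch, the compile-and-train step can look like this (the epoch count is arbitrary, traingen and testgen are the createAugment instances from earlier, and we assume wandb.init() has already been called):

from wandb.keras import WandbCallback

keras.backend.clear_session()
model = inpaintingModel().prepare_model()
model.compile(optimizer='adam', loss='mean_squared_error', metrics=[dice_coef])

model.fit(traingen,
          validation_data=testgen,
          epochs=20,  # arbitrary, tune to your budget
          callbacks=[WandbCallback(), PredictionLogger()])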

Fig 9: Result of the partial convolution based deep inpainting.
Fig 10: Some sample predictions on the test dataset as our network is training

You will notice that the vanilla CNN based image inpainting worked a bit better than the partial convolution based approach. This boils down to the fact that partial convolution is a complex architecture for the CIFAR10 dataset; the layer was designed for high resolution images (greater than 256x256 pixels).

Unlike the authors of this paper, who used loss functions targeting both the per-pixel reconstruction loss and a composition loss, i.e. how smoothly the predicted hole values transition into their surrounding context, we simply used an L2 loss. The holes also present a problem for the batch normalization layer because the mean and variance are computed over the hole pixels as well. To deal with this, the authors initially trained with batch normalization enabled in the encoder and then turned it off for the final stage of training. We did not train using this two-stage method.

Conclusion

Let’s conclude with some additional pointers on the topic, including how it relates to self-supervised learning, and some recent approaches for doing image inpainting.

A very interesting property of an image inpainting model is that it is capable of understanding an image to some extent, much like in NLP, where we use embeddings to capture the semantic relationships between words and then reuse those embeddings for downstream tasks like text classification.

The premise here is, when you start to fill in the missing pieces of an image with both semantic and visual appeal, you start to understand the image. This is more along the lines of self-supervised learning where you take advantage of the implicit labels present in your input data when you do not have any explicit labels.

This is particularly interesting because we can use the knowledge of an image inpainting model in a computer vision task as we would use the embeddings for an NLP task. For learning more about this, we highly recommend this excellent article by Jeremy Howard.

So far, we have only used a pixel-wise comparison as our loss function. This often forces our network to learn very rigid and not-so-rich feature representations. A very interesting yet simple idea, approximate exact matching, was presented by Charles et al. in this report. According to their study, if we shift the pixel values of an image by a small constant, that does not make the image visually very different from its original form. So, they added an additional term to the pixel-wise comparison loss to incorporate this idea.

Another interesting tweak to our network would be to enable it to attend to related feature patches at distant spatial locations in an image. In the paper Generative Image Inpainting with Contextual Attention, Jiahui et al. introduced the idea of contextual attention, which allows the network to explicitly utilize neighboring image features as references during its training.

Thanks for reading this article until the end. Image inpainting is a very interesting computer vision task and we hope this article gave you a fair introduction to the topic. Please feel free to let us know about any feedback you might have on the article via Twitter (Ayush and Sayak). We would really appreciate it :)
