Recurrent Neural Networks (RNNs) are difficult to train on long sequences, owing to the exploding and vanishing gradient problems. Many strategies have been developed to address this, such as gradient clipping, constraining model parameters, and self-attention. In this report I show how attention can be leveraged to solve problems involving long sequences, using RNNs augmented with self-attention. I propose that attention solves these problems by creating skip connections between later points in the sequence and earlier ones. Two tasks are considered:
RNNs with attention use both their hidden state and the attention module to solve these problems. I explore the strategies the networks adopt by considering attention heat maps like this:
These heat maps give intuition on how attention is leveraged to solve these problems and how it mitigates the vanishing gradient problem.
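To make the skip-connection idea concrete, here is a minimal sketch of an RNN whose update at each step also attends over all of its past hidden states, so the current step has a direct path back to every earlier step. The class name, weight names, and the use of simple dot-product attention are my own illustrative choices, not the exact model from this report:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class AttnRNN:
    """Toy RNN with self-attention over its own history.

    At step t, dot-product attention over past hidden states produces a
    context vector c_t, giving a direct (skip) connection from step t to
    every earlier step -- gradients need not flow only through the
    step-by-step recurrence.
    """

    def __init__(self, n_in, n_hid):
        s = 1.0 / np.sqrt(n_hid)
        self.Wx = rng.normal(0, s, (n_hid, n_in))   # input weights
        self.Wh = rng.normal(0, s, (n_hid, n_hid))  # recurrent weights
        self.Wc = rng.normal(0, s, (n_hid, n_hid))  # context weights

    def run(self, xs):
        h = np.zeros(self.Wh.shape[0])
        history, weights = [], []
        for x in xs:
            if history:
                H = np.stack(history)       # (t, n_hid): all past states
                a = softmax(H @ h)          # attention scores over the past
                c = a @ H                   # context: weighted sum of past states
            else:
                a = np.array([])            # nothing to attend to at t = 0
                c = np.zeros_like(h)
            h = np.tanh(self.Wx @ x + self.Wh @ h + self.Wc @ c)
            history.append(h)
            weights.append(a)
        return np.stack(history), weights

# Run on a random sequence; the rows of `weights` are exactly what the
# attention heat maps in this report visualize.
net = AttnRNN(n_in=4, n_hid=8)
xs = rng.normal(size=(6, 4))
states, weights = net.run(xs)
```

Plotting `weights` as a lower-triangular heat map (one row per time step) shows which earlier steps each later step attends to, which is the visualization explored below.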
Code can be found here; feel free to try this at home.