Transformers: Age of Attention

November 26, 2020

In their highly-memorable paper titled “Attention Is All You Need”, Google Brain researchers introduced the Transformer, a new type of encoder-decoder model that relies solely on attention for sequence-to-sequence modelling. Before the Transformer, attention was used to help improve the performance of the likes of Recurrent Neural Networks (RNNs) on sequential data.

Now, this is a lot. You might be wondering, “What the hell is sequence-to-sequence modelling, Jeanne?” You may also suffer from an attention deficiency, so allow me to introduce you to…

Seq2seq models

Sequence-to-sequence models (or seq2seq, for shorthand) are a class of machine learning models that translates an input sequence to an output sequence. Typically, seq2seq models consist of two distinct components: an encoder and a decoder. The encoder constructs a fixed-length latent vector(or context vector) of the input sequence. The decoder uses the latent vector to (re)construct the output or target sequence. Both the input and output sequence can be of variable length.

The applications of seq2seq models extend beyond simply machine translation. The architecture of the encoder-decoder model can be used for question-answering, mapping speech-to-text and vice versa, text summarisation, image captioning, as well as learning contextualised word embeddings.

Unlike (context-independent) word embeddings, which have static representations, contextualized embeddings have dynamic representations that are sensitive to their context. Sourced from Researchgate

The novel RNN Encoder-Decoder model was introduced in 2014 by renowned researcher Kyunghyun Cho and his team, to perform statistical machine translation. This model uses the final hidden representations of the encoder RNN as context vector for the decoder RNN. This approach works fine for short sequences, but fails to accurately encode longer sequences. In addition, RNNs suffer from Vanishing Gradients, and are slow to train. Adding the attention mechanism to the RNN encoder-decoder architecture helps improve on its ability to model long-term dependencies.

Attention? Attention!

Attention is a function of the hidden states of the encoder to help the decoder decide which parts of the input sequence are most important for generating the next output token. Attention allows the decoder to focus on different parts of the input sequence at every step of the output sequence generation. This means that dependencies can be identified and modeled, regardless of their distance in the sequences.

Attention applied to an input sequence to assist in machine translation

When attention is added to the RNN encoder-decoder, all the encoder’s hidden representations are used during the decoding process. The attention mechanism creates a unique mapping between the output of the decoder and the hidden states of the encoder at each time step. These mappings reflects how important that part of the input is for generating the next output token .Thus, the decoder can “see” the entire input sequence and decide which elements to pay attention to when generating the next output token.

There are two major types of attentions: Bahdanau Attention, and Luong Attention.

Bahdanau Attention works by aligning the decoder with the relevant input sentences. The alignment scores is a function of the hidden states produced by the decoder in the previous time step and the encoder outputs. The attention weights are the output of softmax applied to the alignment scores. After this, the encoder’s outputs and their attention weights are multiplied element-wise to form the context vector.

The context vector of Luong Attention is calculated similarly to Bahdanau’s Attention. The key differences between the two are as follows:

  • with Luong Attention, the context vector is only utilized after the RNN produced the output for that time step,

  • the ways in which the alignment scores are calculated, and

  • the context vector is concatenated with the decoder hidden state to produce a new output.

There are three different ways to compute the alignment scores: dot-product (multiply the hidden states of the encoder and decoder), general (multiply the hidden states of the encoder and decoder, preceded by a multiplication with a weight matrix), and concatenation (a function applied on top of adding together the hidden states of the encoder and decoder). More information on this can be found here. Subsequently, the context vector, together with the previous output will determine the new hidden state of the decoder.

The Transformer

Google Brain researchers flipped the tables on the NLP community and showed that sequential modelling can be done using just attention, in their 2017 paper titled “Attention is all you need”. In this paper, they introduce the Transformer, a simplistic architecture that makes use of only attention mechanisms to draw global dependencies between input and output sequences.

The Transformer consists of two parts: an encoding component and a decoding component. Additionally, positional encodings are injected to give information about the absolute and relative positions of the tokens.

The Encoder-Decoder Model

The encoding component is a stack of 6 encoders. Although identical, the encoders do not have shared weights. Each encoder can be deconstructed into a multi-head attention part and a fully connected feed-forward network.

Similarly, the decoding component is a stack of 6 identical decoders, whose architecture is similar to the encoder’s, except it masks its multi-head attention to ensure that the next output can only depend on the known outputs of the previous tokens. Also, the decoder has an extra multi-head attention component that applies self-attention over the output of the encoding component, providing access to the inputs during decoding.

Multi-head attention

Each multi-head attention component consists of several attention layers running in parallel. The Transformer makes use of scaled dot-product attention, which is very similar to Luong’s dot-product attention, except it is scaled. Multi-headed attention allows for scaled dot-product attention to be aggregated across n different, randomly-initialized representation subspaces. This multi-headed attention function can also be parallelized and trained across multiple computers.

Why Transformers?

Much like Recurrent Neural Networks (RNNs), Transformers allows for processing sequential information, but in a much more efficient manner. The Transformer outperform RNNs, both in terms of accuracy and computational efficiency. Its architecture is devoid of any recurrence or convolutions, and thus training can be parallelizable across multiple processing units. It has achieved state-of-the-art performance on several tasks, and, even more importantly, was found to generalize very well to other NLP tasks, even with limited data.

In conclusion

The Transformer has taken the NLP community by storm, earning a place among the ranks of Word2Vec and LSTMs. Today, some of the state-of-the-art language models are based on the Transformer architecture, such as BERTand GPT-3. In this blog post, we discussed the evolution of sequence-to-sequence modelling, from RNNs, to RNNs with Attention, to solely relying on attention to model input to output sequences with the Transformer.

If you enjoyed this blog post, please consider supporting my writing by buying me a coffee.

Previous
Previous

What In The Corpus is a Word Embedding?

Next
Next

Amazon’s Meteoric Rise