Self-Attention is not a typical Attention model

Understanding the Transformer model

Chamanth mvs
DataDrivenInvestor


Self-Attention is at the core of the Transformer model, and in fact the mechanisms of all LLMs rely on the Transformer model. This is one of the trickiest models I have studied to date.

The Transformer is an advanced sequence-to-sequence model, which consists of a group of Encoder-Decoder blocks. BUT the underlying model for each Encoder-Decoder block is NOT a recurrent neural network; the base model is Self-Attention.

This is the reason I said that Self-Attention is not a typical Attention model: the classic attention model is built on top of recurrent models like LSTM/RNN/GRU, whereas the Self-Attention model is not.

Top-level view of Transformer model

It is an advanced sequence-to-sequence model

The Transformer is an advanced sequence-to-sequence model
The Transformer consists of a group of Encoders and a group of Decoders

6-ft view of Transformer model

The Transformer model consists of a group of Encoder-Decoder blocks.

The research paper states that the model consists of 6 Encoders and 6 Decoders

The original research paper used 6 Encoders and 6 Decoders, but we can stack any number of them; the paper simply keeps the encoder stack and the decoder stack at the same depth.

Underlying view of Transformer model

working view of Transformer model

Structure of an Encoder model

Each ENCODER network consists of 2 parts: Self-Attention and a feed-forward neural network.

Encoder network consists of 2 parts

The feed-forward neural network is a fully connected MLP (multi-layer perceptron); in the original paper, its input and output are 512-dimensional and its inner layer has 2048 hidden units.
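
As a rough sketch of this sublayer (assuming the sizes from the original paper: 512-dimensional input/output and a 2048-unit inner layer, with random placeholder weights instead of trained ones), the position-wise feed-forward network can be written as:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward sublayer: Linear -> ReLU -> Linear."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048                      # dimensions used in the original paper
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

x = rng.normal(size=(4, d_model))              # 4 word positions, each processed independently
print(feed_forward(x, W1, b1, W2, b2).shape)   # (4, 512)
```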

Before stepping on to the other part, Self-Attention, let's understand how the input is given to the Encoder unit.

Consider the entire sentence as x1. The words in the sentence would then be split into x11, x12, x13 and x14.

Each word in the sentence is encoded into a 512-dimensional vector

word-encoded vectors

All these 512-dimensional encoded vectors are passed as an input to the first Encoder unit. The output of this Encoder unit will be passed as an input to the next Encoder unit.

structure of an Encoder network

All the inputs are passed simultaneously (at the same time) to the self-attention layer in the encoder network. This is unlike recurrent networks, where the inputs are provided one timestep at a time.

z_1 is the output corresponding to x_11

z_2 is the output corresponding to x_12

z_3 is the output corresponding to x_13

z_4 is the output corresponding to x_14

These z_1, z_2, z_3 and z_4 are fed into the Feed-forward neural network.

How does the Self-Attention model generate z_1, z_2, z_3 and z_4 given x_11, x_12, x_13 and x_14? So, the task now is to understand the Self-Attention model.

Dive into Self-Attention model

Before diving deep into the Self-Attention model, let's consider a simpler input than the one used in the explanation above.

Input : Learning data

So, x11 = learning and x12 = data

Input to Encoder-1

Overview of the above Encoder network

This Encoder-1 structure states that the inputs x11 and x12 are encoded as 512-dimensional vectors. These 512-dimensional vectors of x11 and x12 are passed to the Self-Attention layer as input, which gives z1 and z2 as output corresponding to x11 and x12. z1 and z2 are then passed to the feed-forward neural network, which outputs r1 and r2; this is the input to Encoder-2.

What is Self-Attention?

Input: The animal didn't cross the street because it was too tired.

In the above sentence, what does the word it refer to? Does it refer to animal or street? We as humans know that the word it refers to animal and not street.

We need to train the self-attention model so that the model understands that the word it refers to animal and not street.

When the model is processing the word 'it', self-attention helps in associating the word 'it' with 'animal'. As the model processes each word (each position in the input sequence), self-attention allows it to look at other positions in the input sequence for clues, which can help in encoding the word 'it' in a better way.

As discussed above, our task is to generate zi given xi. To do this, the attention or focus should be on the other words in the same sentence.

For instance: in the input sentence 'The animal didn't cross the street because it was too tired', the word 'it' is xi, and zi should be generated based on this xi. To generate zi, the focus or attention should be on the remaining words in the same sentence.

Task: Using Self-Attention model, generate zi given xi

Using the example input, 'Learning data', we need to generate zi for each of the xi in this input.

Step-1

The first step in calculating self-attention is to create three vectors from each of the encoder’s input vectors (the embedding of each word).

For each word, a Query vector, a Key vector, and a Value vector need to be created. These three vectors are created by multiplying the embedding vector by the respective matrices that are trained during the training process.

The entire process is illustrated below for two words (‘learning’ and ‘data’)
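
A minimal numpy sketch of Step-1, assuming 512-dimensional embeddings and 64-dimensional Query/Key/Value vectors as in the paper; the weight matrices below are random placeholders standing in for the trained WQ/WK/WV matrices:

```python
import numpy as np

d_model, d_k = 512, 64               # embedding size and query/key/value size
rng = np.random.default_rng(0)

# Embedding of one word, e.g. "learning" (a random placeholder, not a real embedding)
x11 = rng.normal(size=(d_model,))

# Projection matrices; in a real model these are learned during training
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# Step-1: create the Query, Key and Value vectors for this word
q1, k1, v1 = x11 @ W_Q, x11 @ W_K, x11 @ W_V
print(q1.shape, k1.shape, v1.shape)  # (64,) (64,) (64,)
```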

Step-2

The second step is to calculate a score. The input sentence we considered is 'learning data', so self-attention is first computed for the first word, 'learning'; after calculating the scores for this word, the scores are calculated for the second word, 'data'.

The score determines — How much focus to place on other parts of the input sentence as we encode a word at a certain position.

The score is calculated by taking the dot product of the query vector with the key vector of the word against which the score is being calculated.

Step-3

The third step is to divide the scores by 8 (why 8? It is the square root of the dimension of the key vectors; as 64-dimensional key vectors are used, we take the square root of 64). This division leads to more stable gradients.

Step-4

A Softmax function is applied to these values. Softmax normalizes the scores, so the results are positive and add up to 1.

The same Step-3 and Step-4 are also to be applied for the word x12 (data)
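
Continuing with hypothetical 64-dimensional query and key vectors for 'learning' and 'data' (random placeholders again), Steps 2 to 4 for the word 'learning' look roughly like this:

```python
import numpy as np

d_k = 64
rng = np.random.default_rng(0)
q1 = rng.normal(size=(d_k,))                                 # query for "learning"
k1, k2 = rng.normal(size=(d_k,)), rng.normal(size=(d_k,))    # keys for "learning" and "data"

# Step-2: dot product of the query with every key in the sentence
scores = np.array([q1 @ k1, q1 @ k2])

# Step-3: divide by sqrt(d_k) = 8 for more stable gradients
scaled = scores / np.sqrt(d_k)

# Step-4: softmax, so the scores are positive and add up to 1
weights = np.exp(scaled - scaled.max())
weights /= weights.sum()
print(weights, weights.sum())
```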

Step-5

Multiply each value vector by its Softmax score. The intuition is to keep intact the values of the words on which focus is required, and to drown out irrelevant words (for example, by multiplying irrelevant words by tiny numbers like 0.001).

Step-6

Sum up the weighted value vectors. This produces the output of the self-attention layer at the position of each word in a sentence.

The same Step-5 and Step-6 are also to be applied for the word x12 (data)

The resulting vector obtained after this entire 6-step process can be sent as input to the feed-forward neural network. This 6-step process is called the self-attention calculation.
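
A small sketch of Steps 5 and 6 for the word 'learning', using made-up softmax scores and random placeholder value vectors:

```python
import numpy as np

d_k = 64
rng = np.random.default_rng(0)
v1, v2 = rng.normal(size=(d_k,)), rng.normal(size=(d_k,))  # value vectors for "learning" and "data"
weights = np.array([0.88, 0.12])                           # softmax scores from Step-4 (made-up numbers)

# Step-5: weight each value vector; Step-6: sum the weighted vectors
z1 = weights[0] * v1 + weights[1] * v2
print(z1.shape)                                            # (64,) -> sent on to the feed-forward network
```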

When this model is implemented or trained, this entire calculation will happen in the matrix form for faster processing.

Self-Attention calculation in matrix form

The reason to understand the self-attention representation in matrix form is to build intuition for multi-head attention.
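
A compact numpy sketch of that matrix form, Z = softmax(Q K^T / sqrt(d_k)) V, again with random placeholder matrices for the two-word input:

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Matrix form of the 6-step process: Z = softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 512))                    # rows are the embeddings of "learning" and "data"
W_Q, W_K, W_V = (rng.normal(size=(512, 64)) for _ in range(3))
print(self_attention(X, W_Q, W_K, W_V).shape)    # (2, 64): z1 and z2 stacked as rows
```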

So far, we have seen the Self-Attention model. This can be scaled up to the Multi-head Attention model.

The intention behind the Multi-head Attention model is to improve the performance of the attention layer:

Multi-head Attention helps in focusing on different positions. In the above example, z1 also contains a bit of the encoding of the other words; sometimes, because of these other words, the actual word itself might be dominated.

Multiple representational subspaces can be provided by the Multi-head Attention layer. In the Multi-head attention layer, there are multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, which gives eight sets for each encoder/decoder).

Multi-head Attention representation

In multi-headed attention, separate Q/K/V weight matrices are maintained for each head, and Q/K/V are obtained by multiplying X with the WQ/WK/WV matrices.

Let's assume that the self-attention mechanism is performed eight times with eight different sets of weight matrices; then eight different Z matrices are obtained.

But the feed-forward layer doesn't expect eight matrices; it expects a single matrix (a vector for each word). So, how should the eight matrices be condensed into a single one? Concatenate them and then multiply the concatenated matrix by an additional weight matrix Wo. The result is a single Z matrix that captures information from all the attention heads, and this resultant Z matrix is sent to the feed-forward neural network.

If self-attention is applied only once on the given input in the initial encoder network, it is termed a Single-head Self-attention model. If multiple self-attention mechanisms are used as discussed above, it is called a Multi-head Self-attention mechanism.
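
A rough numpy sketch of the multi-head version with eight heads, following the concatenate-and-multiply-by-Wo recipe described above; every weight matrix here is a random placeholder, and the attention helper simply repeats the single-head matrix calculation:

```python
import numpy as np

def attention(Q, K, V):
    """Single-head scaled dot-product attention."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
n_heads, d_model, d_k = 8, 512, 64
X = rng.normal(size=(2, d_model))                          # embeddings of "learning" and "data"

heads = []
for _ in range(n_heads):                                   # separate W_Q/W_K/W_V per head
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))     # one Z matrix per head, shape (2, 64)

Z_concat = np.concatenate(heads, axis=-1)                  # (2, 512): eight Z matrices side by side
W_O = rng.normal(size=(n_heads * d_k, d_model))
Z = Z_concat @ W_O                                         # (2, 512): single matrix for the feed-forward layer
print(Z.shape)
```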

Positional Encoding

Consider the Telugu sentence నేను డేటా సైన్స్ నేర్చుకుంటున్నాను ('I am learning data science') as the input to the model.

If recurrent networks such as RNN/LSTM/GRU are used, then the order (or position) of each word can be captured based on the timesteps.

But the self-attention mechanism is different: we don't deal with timesteps (all the input is fed into the model at once). So, how does the model know that డేటా comes before సైన్స్ and comes after the word నేను?

So, positional encoded vectors need to be incorporated into the input of the model.

Positional encoded vectors help us to find the sequence (position) of the words and also the distance between words.

These positional encoded vectors are added to the input word-embedding vectors to generate embedded vectors with a time signal.

In the input sentence నేను డేటా సైన్స్ నేర్చుకుంటున్నాను, x11 = నేను, x12 = డేటా, x13 = సైన్స్ and x14 = నేర్చుకుంటున్నాను. As we have 4 words in the sentence, 4 word-embedding vectors are created, each of 512 dimensions. So, there would be 4 corresponding positional encoded vectors (t1, t2, t3 and t4).

The positional encoded vectors t1, t2, t3 and t4 help in understanding that the word x11 is closer to x12, that x13 is farther from x11 than x12 is, and that x14 is even farther from x11 than x13. This conveys the ordering of the words.

It is important to acknowledge that t1, t2, t3 and t4 are static vectors, each of 512 dimensions. Static means these vectors are NOT TRAINABLE (there are no trainable parameters). They are designed in such a way that the distance between the t1 and t2 vectors is less than the distance between the t1 and t3 vectors. Similarly, the distance between t2 and t3 is less than the distance between t2 and t4. The same pattern holds for all the positional encoded vectors.
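
One standard way to build such static vectors is the sinusoidal scheme from the original paper; a minimal sketch for 4 positions of 512 dimensions:

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal positional vectors: fixed by formula, with no trainable parameters."""
    pos = np.arange(n_positions)[:, None]                      # positions 0 .. n-1
    i = np.arange(d_model)[None, :]                            # dimension index 0 .. d_model-1
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                      # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                      # odd dimensions use cosine
    return pe

t = positional_encoding(4, 512)                                # t1..t4 for the 4 words
# nearby positions end up closer to each other than distant ones
print(np.linalg.norm(t[0] - t[1]) < np.linalg.norm(t[0] - t[2]))  # True
```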

Summarization of Encoder block architecture

This is an overview of what one encoder layer looks like! Each Encoder layer consists of self-attention and a feed-forward neural network, and a layer-normalization step follows both the self-attention layer and the feed-forward neural network.

Self-attention layer consists of multiple attention heads

Skip-connections in Encoder block

The dotted lines indicate the skip-connections. They are considered residuals because, in case the self-attention layer doesn't help in generating the z-vectors, the input can be passed directly to the layer-normalization step.

Encoder block inner structure
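
A hypothetical sketch of this skip-connection ('Add & Norm') pattern around a sublayer; the sublayer below is just a placeholder function standing in for self-attention or the feed-forward network:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    """Skip-connection: add the sublayer's output back to its input, then layer-normalize."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 512))                  # 4 positions, 512 dimensions each
out = add_and_norm(x, lambda h: 0.1 * h)       # placeholder sublayer standing in for self-attention
print(out.shape)                               # (4, 512)
```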

The output of this encoder block goes in as input to the next encoder block, and this process continues till the last Encoder block. Then the output of the last Encoder block goes in as input to all the Decoders in the network.

Example Transformer model

In the actual paper, the Transformer model is designed with 6 Encoder blocks and 6 Decoder blocks.

Shown below is a Transformer model with 2 stacked encoders and decoders

entire transformer model structure

Decoder

Input Embedded vectors & Positional Encoded vectors

Even in the Decoder network, the input words are converted into embedded vectors and then into positional encoded vectors using the positional encoding layer. These positional embeddings are fed into the first multi-head attention layer, which computes the attention scores for the decoder's input. BUT the first multi-head attention layer works differently.

Decoder’s First Multi-Headed Attention (Decoder’s self-attention)

The Decoder is autoregressive and generates the sequence word by word.

For example, when computing attention scores for the word "am", you should not have access to the word "learning", because "learning" is a future word that would be generated after the word "am". The word "am" should only have access to itself and the words before it. The same holds true for all the other words in the sentence.

Showing a sample of Decoder’s first Multi-headed Attention scaled attention scores

The method used to prevent computing attention scores for future words is called masking

The mask is added before applying the softmax function and after scaling the scores.

Look-Ahead Mask: the mask is a matrix of the same size as the attention-scores matrix, filled with 0's and negative infinities. When the mask is added to the scaled attention scores, a new matrix of masked scores is obtained.

masked scores matrix

The reason for the mask is that once the softmax function is applied to the masked scores, the negative infinities get zeroed out, leaving zero attention weight for future tokens. In the above image, it is clear that the attention scores for the word "am" have values for itself and all words before it, but the value for the word "learning" is zero. It is evident from this that the model puts no focus on future words; each word only attends to itself and the words before it.

Applying softmax function to the masked values

This masking is the only difference in how the attention scores are calculated in the first multi-headed attention layer. This layer still has multiple attention heads; in the same way as discussed above, masking is applied to all the scaled Q-K score matrices before they are concatenated and fed through a linear layer.
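
A minimal numpy sketch of the look-ahead mask for a 3-word target such as 'I am learning'; the scaled scores below are made-up numbers:

```python
import numpy as np

# Made-up scaled attention scores for the words "I", "am", "learning"
scores = np.array([[0.9, 0.3, 0.2],
                   [0.4, 1.1, 0.7],
                   [0.5, 0.8, 1.3]])

# Look-ahead mask: 0 on and below the diagonal, -inf above it (the future positions)
mask = np.triu(np.full_like(scores, -np.inf), k=1)
masked = scores + mask

# Row-wise softmax: the -inf entries become exactly 0 attention weight
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))   # the row for "am" puts zero weight on "learning"
```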

Decoder’s Second Multi-Headed Attention (Encoder-Decoder Attention)

The second sublayer implements a multi-head attention mechanism similar to the one implemented in the first sublayer of the encoder.

Example of how the decoder gets its input

On the decoder side, this multi-head attention mechanism receives its queries from the previous decoder sublayer and its keys and values from the output of the encoder. This allows the decoder to attend to all the words in the input sequence.
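
In code terms, this only changes which tensors feed the attention calculation: queries come from the decoder, while keys and values come from the encoder output. A minimal sketch (the W_Q/W_K/W_V projections are omitted and the vectors are random placeholders):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention, as used throughout the model."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
encoder_output = rng.normal(size=(4, 64))   # keys and values: the 4 encoded input words
decoder_state = rng.normal(size=(2, 64))    # queries: the 2 decoder positions generated so far

# Encoder-Decoder attention: Q from the decoder, K and V from the encoder output
z = attention(decoder_state, encoder_output, encoder_output)
print(z.shape)                              # (2, 64): each decoder position attends over all input words
```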

Feed Forward Layer

The feed-forward layer is the third layer in the decoder. The output of the second multi-headed attention layer is the input to the feed-forward layer, and it is similar to the one implemented in the second sublayer of the encoder.

The three sublayers of the decoder also have residual connections around them, each followed by a normalization layer

Example of Decoder stack’s output conversion to word

The decoder stack outputs a vector of floats. How do we turn that into a word? The final Linear layer does this job, together with a Softmax layer.

The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders into a larger vector called a logits vector.

For instance: let's assume that the model knows 1000 unique English words (the model's output vocabulary), learned from its training dataset. This makes the logits vector 1000 cells wide, with each cell corresponding to the score of a unique word. That is how the output of the model is interpreted.

The softmax layer then turns those scores into probabilities (all positive, all add up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.

conversion of decoder's output vector to word
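
A toy sketch of this last step, assuming a hypothetical 1000-word vocabulary (with placeholder word names) and a random vector standing in for the decoder stack's output:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 512
vocab = [f"word_{i}" for i in range(vocab_size)]     # placeholder output vocabulary

decoder_output = rng.normal(size=(d_model,))         # vector produced by the decoder stack
W_linear = rng.normal(size=(d_model, vocab_size))    # final Linear layer (trained in practice)

logits = decoder_output @ W_linear                   # logits vector: one score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                 # softmax: all positive, sums to 1.0

print(vocab[int(np.argmax(probs))])                  # the highest-probability word is emitted this time step
```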

Transformer model summarized

  1. Every word from an input sequence is transformed into a d-dimensional word-embedded vector.
  2. These word-embedded vectors are combined with positional encoded vectors of the same d-dimensional length, which introduces the positional information into the input.
  3. The positional encoded vectors are fed into the encoder block, which consists of two sublayers (a Self-attention layer and a Feed-forward layer). The encoder attends to all the words in the input sequence, irrespective of whether they precede or succeed the word under consideration; thus the Transformer encoder is bidirectional.
  4. The decoder receives as input its own predicted output word at time-step t-1.
  5. The Decoder's input is also converted into a positional encoded vector (in a similar way to how it is done on the Encoder side).
  6. Then the Decoder's positional encoded input is fed into the three sublayers (the Decoder's self-attention, the Encoder-Decoder attention and the Feed-forward layer). Masking is applied in the first sublayer to stop the decoder from attending to succeeding words. At the second sublayer, the decoder also receives the output of the encoder, which allows the decoder to attend to all the words in the input sequence.
  7. Finally, the output of the decoder passes through a fully connected layer and then a softmax layer, to generate a prediction for the next word of the output sequence.

Reference

I took most of the images from Jay Alammar's blog, along with a few concepts, but the Attention Is All You Need research paper also helped me understand it much better.
