Demystifying the Confusing Dimensions in the Attention Mechanism
Many people understand the main ideas behind Transformers and attention. But when it comes to the small details inside, like the shapes of the tensors and how each component changes them, they often get confused. This blog post traces how the data actually moves through the attention mechanism and what happens at each step.
Step 1: Input Representation
Let’s say we have a sentence to process. We split that sentence into 4 tokens, and each token is mapped to a 512-dimensional embedding. That means each token is a list of 512 numbers, and we have 4 such lists for the 4 tokens. Since in deep learning we batch our data in order to process it, we batch it too, so our input has the following shape:
\[\text{Input: } X \in \mathbb{R}^{(\text{batch\_size, seq\_length, } d_{\text{model}})}\]
where
batch_size = 1 (one sentence)
seq_length = 4 (4 tokens in the sentence)
d_model = 512 (each token is a 512-dimensional vector)
So, the input tensor shape is:
\[X \in \mathbb{R}^{(\text{1, 4, 512})}\]
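Here is a minimal PyTorch sketch of this step. The values are random placeholders standing in for real token embeddings:

```python
import torch

# Assumed setup: 1 sentence, 4 tokens, each embedded into 512 dimensions.
# torch.randn is just a placeholder for real token embeddings.
batch_size, seq_length, d_model = 1, 4, 512
X = torch.randn(batch_size, seq_length, d_model)

print(X.shape)  # torch.Size([1, 4, 512])
```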
Step 2: Linear Transformation to Q, K, V
For self-attention, we compute queries (Q), keys (K), and values (V) using three separate linear layers. These are just matrix multiplications:
\(Q = X W_Q\) , \(K = X W_K\), \(V = X W_V\)
where \(W_Q, W_K, W_V\) are the weight matrices of shape (512, 512).
The new shape is:
\[Q, K, V \in \mathbb{R}^{(\text{1, 4, 512})}\]
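A minimal sketch of the projections, using nn.Linear (without bias) as a stand-in for the weight matrices \(W_Q, W_K, W_V\); X is random placeholder data:

```python
import torch
import torch.nn as nn

d_model = 512
X = torch.randn(1, 4, d_model)  # placeholder input from Step 1

# Each nn.Linear holds a (512, 512) weight matrix, playing the role of W_Q, W_K, W_V.
W_Q = nn.Linear(d_model, d_model, bias=False)
W_K = nn.Linear(d_model, d_model, bias=False)
W_V = nn.Linear(d_model, d_model, bias=False)

Q, K, V = W_Q(X), W_K(X), W_V(X)
print(Q.shape, K.shape, V.shape)  # each torch.Size([1, 4, 512])
```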
Step 3: Split into Multiple Heads
Let’s say we have 8 attention heads (you can use any number that divides d_model evenly). We split the 512-dimensional vector into 8 pieces, so each head gets a 64-dimensional slice.
\[d_{\text{head}} = \frac{d_{\text{model}}}{n_{\text{heads}}} = \frac{512}{8} = 64\]
We reshape our Q, K, V into multiple heads:
\[Q, K, V \in \mathbb{R}^{(\text{1, 8, 4, 64})}\]
(1, 8, 4, 64) means:
1- Batch Size
8- Number of Heads
4- Sequence length (4 tokens in the sentence)
64- Features per head
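The split itself is just a reshape followed by a transpose. Here is a sketch for Q (K and V are reshaped the same way), with random placeholder values:

```python
import torch

batch_size, seq_length, d_model, n_heads = 1, 4, 512, 8
d_head = d_model // n_heads  # 64

Q = torch.randn(batch_size, seq_length, d_model)  # placeholder for the projected Q

# (1, 4, 512) -> (1, 4, 8, 64) -> (1, 8, 4, 64)
Q = Q.view(batch_size, seq_length, n_heads, d_head).transpose(1, 2)

print(Q.shape)  # torch.Size([1, 8, 4, 64])
```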
Step 4: Compute Attention Scores (QK^T)
Now we compute the attention scores:
\[\text{Scores} = \frac{Q K^T}{\sqrt{d_{\text{head}}}}\]
Each attention head computes:
\[Q K^T\]
\(Q\) has a shape of (1, 8, 4, 64)
\(K^T\) has a shape of (1, 8, 64, 4)
This results in a square attention matrix of size (4,4) for each head where:
Row i contains the attention scores of token i over all tokens.
Entry (i, j) represents how much attention token i pays to token j.
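In PyTorch this is a batched matrix multiplication over the last two dimensions; here is a sketch with random placeholder tensors:

```python
import torch

Q = torch.randn(1, 8, 4, 64)  # placeholders for the per-head Q and K
K = torch.randn(1, 8, 4, 64)
d_head = Q.size(-1)  # 64

# Q @ K^T over the last two dimensions, scaled by sqrt(d_head)
scores = Q @ K.transpose(-2, -1) / d_head ** 0.5

print(scores.shape)  # torch.Size([1, 8, 4, 4])
```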
Step 5: Apply Softmax
To normalize the scores into a probability distribution, we apply Softmax:
\[\text{Attention} = \text{Softmax}\left(\frac{Q K^T}{\sqrt{64}}\right)\]
The Softmax function ensures that each row sums to 1, so each token assigns some probability to every token in the sequence.
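A quick sketch to verify the row-sum property, using random scores as a placeholder:

```python
import torch
import torch.nn.functional as F

scores = torch.randn(1, 8, 4, 4)  # placeholder for the scaled scores from Step 4

# Softmax over the last dimension, so each row of every 4x4 matrix sums to 1
attention = F.softmax(scores, dim=-1)

print(attention.shape)        # torch.Size([1, 8, 4, 4])
print(attention.sum(dim=-1))  # each entry is (approximately) 1.0
```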
Step 6: Multiply the Attention Scores with Values
Now, we multiply the attention weights by V:
\[\text{Output} = \text{Attention} \times V\]
Attention has a shape of (1, 8, 4, 4)
V has a shape of (1, 8, 4, 64)
The result has shape (1, 8, 4, 64): an updated representation of each token, for every head.
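A sketch of this weighted sum, again with placeholder tensors:

```python
import torch

attention = torch.softmax(torch.randn(1, 8, 4, 4), dim=-1)  # placeholder weights
V = torch.randn(1, 8, 4, 64)                                # placeholder values

# (1, 8, 4, 4) @ (1, 8, 4, 64) -> (1, 8, 4, 64)
output = attention @ V

print(output.shape)  # torch.Size([1, 8, 4, 64])
```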
Step 7: Concatenate Heads & Apply Final Linear Layer
Then we concatenate those 8 heads back into one tensor.
\[\text{Concat}(\text{Head}_1, \text{Head}_2, \ldots, \text{Head}_8) \in \mathbb{R}^{(1, 4, 512)}\]
Finally, a linear layer projects it back to 512 dimensions:
\[\text{Final Output} = \text{Concat} \times W_O\]
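A sketch of the concatenation and final projection, with nn.Linear standing in for \(W_O\) and random per-head outputs as placeholders:

```python
import torch
import torch.nn as nn

batch_size, n_heads, seq_length, d_head = 1, 8, 4, 64
d_model = n_heads * d_head  # 512

output = torch.randn(batch_size, n_heads, seq_length, d_head)  # placeholder per-head outputs

# (1, 8, 4, 64) -> (1, 4, 8, 64) -> (1, 4, 512): concatenating the heads via reshape
concat = output.transpose(1, 2).contiguous().view(batch_size, seq_length, d_model)

# Final linear projection back to d_model, playing the role of W_O
W_O = nn.Linear(d_model, d_model, bias=False)
final_output = W_O(concat)

print(final_output.shape)  # torch.Size([1, 4, 512])
```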
Now it’s ready to go on to its next adventure…
We will discuss the next part in another blog post.
I am creating the PyTorch code for it. You can check out this Colab notebook; I will update the code there in the future.