If Attention Were a Study Group

When it comes to understanding the Transformer model, nothing beats 3Blue1Brown’s YouTube series on Neural Networks. As I was reviewing the model, I found myself deeply contemplating the meanings of queries ($Q$), keys ($K$), and values ($V$) in the attention layer, which I thought I had already grasped. This article serves as a supplementary explanation for the video “Attention in transformers, step-by-step | Deep Learning Chapter 6”.

A Note on Notation

There is a difference in notation between the Attention Is All You Need paper and the 3Blue1Brown video. This article will follow the notation of the paper.

Notation in “Attention Is All You Need”

  • Vectors are represented as row vectors ($1 \times d$).
  • The result of the attention layer is as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$$
  • Softmax is applied row-wise.
$$\mathrm{softmax}\left( \begin{bmatrix} A_1 \\ \vdots \\ A_n \end{bmatrix} \right) = \begin{bmatrix} \mathrm{softmax}(A_1) \\ \vdots \\ \mathrm{softmax}(A_n) \end{bmatrix}$$
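The paper’s convention can be sketched directly in NumPy. This is a minimal toy implementation with made-up shapes (none of the values come from the article); each row is one token, and softmax is applied row-wise:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention in the paper's row-vector convention."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n, n): scores[i, j] = Q_i · K_j / sqrt(d_k)
    # Row-wise softmax: each row of weights sums to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                          # (n, d_v): weighted sums of the value rows

# Toy example: n = 4 tokens, d = 8 dimensions
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4, 8))
out = attention(Q, K, V)                        # shape (4, 8)
```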

Notation in the 3Blue1Brown Video

  • Vectors are represented as column vectors ($d \times 1$).
  • The result of the attention layer is as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{K^T Q}{\sqrt{d_k}}\right) V$$
  • However, the correct formula should be:
$$\mathrm{Attention}(Q, K, V) = V \,\mathrm{softmax}\!\left(\frac{K^T Q}{\sqrt{d_k}}\right)$$
  • Softmax is applied column-wise.
$$\mathrm{softmax}\left( \begin{bmatrix} A_1 & \cdots & A_n \end{bmatrix} \right) = \begin{bmatrix} \mathrm{softmax}(A_1) & \cdots & \mathrm{softmax}(A_n) \end{bmatrix}$$
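The two conventions are just transposes of each other, which a quick numerical check makes concrete. In this toy sketch (all shapes and values are made up), columns are tokens in the video’s convention, and the corrected column-vector formula matches the paper’s row-vector formula applied to the transposed matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 6, 4
Q, K, V = rng.normal(size=(3, d, n))   # column-vector convention: each column is one token

def softmax_cols(A):
    """Column-wise softmax: each column sums to 1."""
    e = np.exp(A - A.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def softmax_rows(A):
    """Row-wise softmax: each row sums to 1."""
    e = np.exp(A - A.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Corrected video convention: V on the left, column-wise softmax
out_cols = V @ softmax_cols(K.T @ Q / np.sqrt(d))

# Paper convention applied to the transposed (row-vector) matrices
out_rows = softmax_rows(Q.T @ K / np.sqrt(d)) @ V.T

# out_cols and out_rows.T agree elementwise
```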

The Meaning of the Attention Layer

Let’s imagine that the $n$ words in a context window are in a study group. Their goal is to predict the next word through discussion. For simplicity, let’s consider one word as one token.

Our example sentence is “I have a lot”. The combined embedding from token and position embeddings is as follows:

$$E = \begin{bmatrix} E_{\mathrm{I}} \\ E_{\mathrm{have}} \\ E_{\mathrm{a}} \\ E_{\mathrm{lot}} \end{bmatrix}$$

Queries ($Q$): The Question and Its Nature

The queries ($Q$) represent the topic and the questions of this study group.

Let’s consider the topic: “What word generally follows a specific word?” The questions related to this topic can be represented by a matrix $W_Q$. When each word asks its version of the topic’s question, i.e., when we apply $W_Q$ to $E$, we get:

$$Q = E W_Q = \begin{bmatrix} Q_{\mathrm{I}} \\ Q_{\mathrm{have}} \\ Q_{\mathrm{a}} \\ Q_{\mathrm{lot}} \end{bmatrix}$$

Each resulting query becomes a different question under the same theme.

  • $Q_{\mathrm{I}}$: What word generally follows “I”?
  • $Q_{\mathrm{have}}$: What word generally follows “have”?
  • $Q_{\mathrm{a}}$: What word generally follows “a”?
  • $Q_{\mathrm{lot}}$: What word generally follows “lot”?

Keys ($K$): The Nature of the Answer

The keys ($K$) represent the nature of the answer or information that each word can provide. Similar to the queries ($Q$), applying $W_K$ to $E$ gives us:

$$K = E W_K = \begin{bmatrix} K_{\mathrm{I}} \\ K_{\mathrm{have}} \\ K_{\mathrm{a}} \\ K_{\mathrm{lot}} \end{bmatrix}$$

This represents the nature of the answer each word can offer.

For the question “What word generally follows a specific word?”, we can think that the word itself holds the most relevant answer. In response to “What word generally follows ‘I’?” ($Q_{\mathrm{I}}$), the answer from “I” (perhaps suggesting “am”) would be the most relevant. The word “a” might give a less relevant response to this question, like, “Hmm… I’m not sure… Is it me? ‘a’?”

We can say that the nature of this answer, $K_{\mathrm{I}}$, is “close” to the nature of the question, $Q_{\mathrm{I}}$. In other words, the dot product of a well-matched question and answer is large, while the dot product of a mismatched pair is small. The dot product of the nature of the question $Q_i$ and the nature of the answer $K_j$, $Q_i \cdot K_j$, is called the attention score.
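The score computation for our four-word example can be sketched in a few lines. Everything here is a toy: the embeddings and weight matrices are random, and the dimensions ($d_{\mathrm{model}} = 8$, $d_k = 4$) are arbitrary choices, not values from the article:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = ["I", "have", "a", "lot"]
d_model, d_k = 8, 4

E = rng.normal(size=(len(tokens), d_model))   # one embedding row per token
W_Q = rng.normal(size=(d_model, d_k))         # the topic's questions
W_K = rng.normal(size=(d_model, d_k))         # the nature of each answer

Q = E @ W_Q   # row Q_i: the question that token i asks
K = E @ W_K   # row K_j: the kind of answer token j can give

# scores[i, j] = Q_i · K_j / sqrt(d_k): how well answer j matches question i
scores = Q @ K.T / np.sqrt(d_k)
```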

Softmax: From Absolute Attention Score to Relative Weight

For example, in “I have a lot”, the word “lot” might have an overwhelmingly high attention score for the query $Q_{\mathrm{lot}}$. Let’s say “lot” has a score of 100, while the other words have a score of 1. In this case, we might conclude that the next word should be a period, following the answer from “lot”. The four words, through their discussion, would conclude that the sentence should be completed as “I have a lot.”

But what if this sentence was preceded by a question? If the context is “Cash or credit card? I have a lot”, then the words from the preceding question would also have something to say. They might strongly argue, with an attention score of 1,000, that “Since we just asked ‘Cash or credit card?’, the sentence should end with ‘of cash’ or ‘of credit card’,” suggesting the next word must be “of”.

In this scenario, the voice of “lot” becomes relatively weaker. The group has to weigh the two potential answers—a period or “of”—and might conclude that the sentence is becoming “Cash or credit card? I have a lot of”.
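We can check this relative-weighting story numerically. The scores below are the illustrative values from the story (100 for “lot”, 1,000 for the preceding question), not real model outputs:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array of scores."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

# "I have a lot": "lot" (score 100) drowns out the rest (score 1)
w1 = softmax([1.0, 1.0, 1.0, 100.0])            # ≈ [0, 0, 0, 1]: "lot" gets all the weight

# Prepend a voice with score 1,000 (the "Cash or credit card?" context):
w2 = softmax([1000.0, 1.0, 1.0, 1.0, 100.0])    # "lot"'s weight collapses to ≈ 0
```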

Values ($V$): The Actual Answer

The values ($V$) represent the actual answer or information that each word provides. Again, by applying $W_V$ to $E$, we can represent the actual answer from each word.

$$V = E W_V = \begin{bmatrix} V_{\mathrm{I}} \\ V_{\mathrm{have}} \\ V_{\mathrm{a}} \\ V_{\mathrm{lot}} \end{bmatrix}$$

By using softmax to normalize the attention scores for each query into weights between 0 and 1 that sum to 1, we can filter the words’ answers by these weights. That is, we create a weighted sum to produce the final result.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(\cdot)\, V$$
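To make the weighted sum concrete, here is a toy sketch for a single query. The weights and value rows are made up for illustration; the point is only that the output row is a weighted combination of the value rows:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_v = 4, 8
V = rng.normal(size=(n, d_v))            # one value row per token

# Hypothetical softmax weights for one query (non-negative, sum to 1)
w = np.array([0.05, 0.05, 0.05, 0.85])

out = w @ V                              # the weighted sum of the value rows

# Same thing written out elementwise
assert np.allclose(out, sum(w_i * V_i for w_i, V_i in zip(w, V)))
```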

The part that used to confuse me was that it seemed like queries ($Q$) and keys ($K$) could be interchangeable. However, when you consider the direction in which softmax is applied and the position where the values ($V$) are multiplied, you can see how their roles differ. During the training process, they are each trained to fit their respective roles.

Conclusion

Let’s look at the attention layer formula again.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$$

Although queries ($Q$) and keys ($K$) may appear symmetrical, now that we understand their meanings, we know in which direction to apply softmax and where to multiply the values ($V$).

© 2018 - 2025 Junhee Cho