What Is Attention?
At its simplest, attention allows a model to focus on different parts of the input when producing each element of the output. Unlike earlier sequence models that compressed all input information into a fixed-size vector, attention mechanisms let the model selectively draw information from the entire input sequence.
The key insight: not all parts of the input are equally relevant for generating each part of the output.
How Attention Works in Transformers
In transformer models like GPT or BERT, the attention mechanism operates through three vectors computed for each token:
- Query (Q): What the current token is looking for
- Key (K): What each token in the sequence offers
- Value (V): The actual content of each token
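To make this concrete, here is a minimal NumPy sketch (toy dimensions, random weights standing in for learned parameters) showing how Q, K, and V are typically produced: the same token embeddings are multiplied by three separate learned projection matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model, d_k = 4, 8, 8          # toy sizes, chosen for illustration
X = rng.normal(size=(seq_len, d_model))  # embeddings for a 4-token sequence

# Learned projection matrices (random stand-ins here; trained in a real model)
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q  # what each token is looking for
K = X @ W_k  # what each token offers
V = X @ W_v  # the content each token carries

print(Q.shape, K.shape, V.shape)  # (4, 8) each
```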
For each position, the model calculates compatibility scores between its query and all keys in the sequence. These scores determine how much each value contributes to the output at that position.
The formula looks something like this:
$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
Where $d_k$ is the dimension of the key vectors. Dividing the scores by $\sqrt{d_k}$ keeps the dot products from growing so large that the softmax saturates and its gradients become extremely small.
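A minimal NumPy sketch of this formula (the function name, shapes, and random Q, K, V below are illustrative, not from any particular library):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                         # compatibility of each query with every key
    scores = scores - scores.max(axis=-1, keepdims=True)    # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights                             # weighted sum of values, plus the weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))  # toy 4-token sequence, d_k = 8
output, attn = scaled_dot_product_attention(Q, K, V)
print(output.shape)       # (4, 8): one output vector per position
print(attn.sum(axis=-1))  # each row of attention weights sums to 1
```

The returned `attn` matrix is exactly the set of weights that determines how much each value contributes to each position's output.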
Multi-Head Attention
What makes transformers even more powerful is multi-head attention. Instead of performing attention once, the model performs it multiple times in parallel with different, learned linear projections. This allows the model to jointly attend to information from different representation subspaces.
Each attention “head” can specialize in different linguistic patterns: some might focus on syntactic relationships, others on semantic connections.
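To make the “multiple heads in parallel” idea concrete, here is a hedged NumPy sketch: each head gets its own projections of the input, attends independently, and the head outputs are concatenated and passed through a final projection. Dimensions and weight names are illustrative assumptions, and the weights are again random stand-ins for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads, rng):
    """Toy multi-head self-attention over token embeddings X of shape (seq_len, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    head_outputs = []
    for _ in range(num_heads):
        # Each head has its own learned projections (random stand-ins here)
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(d_head))   # this head's attention pattern
        head_outputs.append(weights @ V)
    concat = np.concatenate(head_outputs, axis=-1)     # (seq_len, d_model)
    W_o = rng.normal(size=(d_model, d_model))          # final output projection
    return concat @ W_o

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))                 # 6 tokens, d_model = 16
out = multi_head_attention(X, num_heads=4, rng=rng)
print(out.shape)                             # (6, 16)
```

Because each head works in its own lower-dimensional subspace (here d_head = d_model / num_heads), the total cost stays comparable to single-head attention while the heads are free to learn different patterns.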
Why Attention Matters
Attention mechanisms solved several critical limitations of previous models:
- Long-range dependencies: Models can now connect relevant information regardless of how far apart it appears in the text
- Parallelization: Unlike RNNs, transformer computations can be highly parallelized
- Interpretability: Attention weights provide insights into which parts of the input the model considers relevant
Understanding attention helps us grasp how modern LLMs manage to generate coherent, contextually appropriate text across long contexts.
In future posts, we’ll explore how these attention patterns manifest in different types of generation tasks and how they contribute to the capabilities and limitations of current AI systems.