Transformer Architecture: How Attention Powers LLMs (2026)
Learning Objectives
- Understand why transformers replaced RNNs for language tasks
- Explain self-attention and why it matters
- Walk through the full transformer architecture layer by layer
- Distinguish between encoder-only, decoder-only, and encoder-decoder models
- Know how pre-training and fine-tuning work at a high level
Why Transformers?
Before transformers (2017), sequence modeling relied on recurrent networks: RNNs and their gated variants such as LSTMs. These had two critical limitations:
- Sequential processing — each time step depends on the previous step's output, so the sequence cannot be processed in parallel during training
- Vanishing gradients — long-range dependencies were hard to learn; the model "forgot" tokens from early in the sequence
The transformer architecture, introduced in "Attention Is All You Need" (Vaswani et al., 2017), solved both problems:
- Parallel processing — all tokens are processed simultaneously
- Direct attention — any token can attend to any other token in a single step, regardless of distance
The Big Picture
A transformer processes a sequence of tokens. Each token is first converted to a vector (embedding), then processed through multiple transformer blocks, and finally decoded into a prediction.
Input Tokens → Embeddings + Positional Encoding
↓
[Transformer Block] × N layers
↓
Output (next token logits, classification, etc.)

Each Transformer Block contains:
- Multi-Head Self-Attention
- Add & Norm (residual connection + layer normalization)
- Feed-Forward Network
- Add & Norm
Token Embeddings
Tokens (words or subwords) are mapped to dense vectors via an embedding matrix.
```python
import torch
import torch.nn as nn

vocab_size = 50000
d_model = 512  # embedding dimension

embedding = nn.Embedding(vocab_size, d_model)
tokens = torch.tensor([[1, 542, 23, 9, 1002]])  # batch of 1 sequence
x = embedding(tokens)  # shape: (1, 5, 512)
```

The embedding dimension d_model is a key hyperparameter. GPT-2 uses 768; GPT-3 uses 12288.
Positional Encoding
Self-attention has no inherent sense of order. Positional encoding adds position information to each token embedding.
Sinusoidal positional encoding (original paper):
```python
import torch
import math

def positional_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)  # even dims
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dims
    return pe  # shape: (seq_len, d_model)
```

Modern LLMs often use Rotary Positional Embedding (RoPE) or ALiBi instead, which better handle longer sequences.
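For intuition, here is a minimal, illustrative sketch of the RoPE idea in the rotate-pairs formulation. The function name and shapes are chosen for this example and are not from any library; real implementations apply this to queries and keys inside each attention head.

```python
import torch

def rope(x, base=10000.0):
    """Rotary positional embedding sketch for x of shape (batch, seq_len, d).

    Each pair of dimensions (2i, 2i+1) is treated as a 2D point and
    rotated by the angle position * base**(-2i/d). Because rotation
    preserves dot-product geometry, q·k ends up depending only on the
    relative distance between positions.
    """
    batch, seq_len, d = x.shape
    half = d // 2
    # Per-pair rotation frequency, shape (half,)
    inv_freq = base ** (-torch.arange(0, half).float() * 2 / d)
    # Rotation angle for every (position, pair), shape (seq_len, half)
    angles = torch.arange(seq_len).float().unsqueeze(1) * inv_freq
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]   # even / odd dims form the pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin  # standard 2D rotation
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Note that position 0 gets a zero rotation (identity), and every vector keeps its norm, since rotation changes only direction.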
Self-Attention: The Core Mechanism
Self-attention lets each token look at all other tokens and decide which ones to focus on.
Queries, Keys, and Values
Each token embedding is projected into three vectors via learned weight matrices:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "What information do I pass on?"
```python
d_k = 64  # dimension of queries and keys

# Learned projections
W_q = nn.Linear(d_model, d_k, bias=False)
W_k = nn.Linear(d_model, d_k, bias=False)
W_v = nn.Linear(d_model, d_k, bias=False)

Q = W_q(x)  # shape: (batch, seq_len, d_k)
K = W_k(x)
V = W_v(x)
```

Scaled Dot-Product Attention
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) × V

```python
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    attn_weights = F.softmax(scores, dim=-1)
    output = torch.matmul(attn_weights, V)
    return output, attn_weights
```

The intuition: QK^T computes a similarity score between every pair of tokens. Softmax converts scores to probabilities (attention weights). The output is a weighted average of the value vectors — tokens attend more to semantically related tokens.
The 1/sqrt(d_k) scaling prevents softmax from saturating when d_k is large (which would push gradients toward zero).
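A quick numerical check of this claim (an illustration, not from the original paper): dot products of d_k independent unit-variance terms have variance d_k, so their standard deviation grows like sqrt(d_k), and dividing by sqrt(d_k) restores a standard deviation near 1.

```python
import torch

torch.manual_seed(0)
d_k = 512
# 1000 random unit-variance query/key pairs
q = torch.randn(1000, d_k)
k = torch.randn(1000, d_k)

raw = (q * k).sum(-1)        # unscaled dot products, std ~ sqrt(512) ~ 22.6
scaled = raw / d_k ** 0.5    # scaled as in attention, std ~ 1
print(raw.std(), scaled.std())
```

Without the scaling, softmax over scores in the tens puts nearly all mass on one token, and its gradient vanishes.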
Multi-Head Attention
Instead of one attention head, transformers run H attention heads in parallel, each with its own Q/K/V projections. Each head can learn to attend to different types of relationships.
```python
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        batch, seq_len, d_model = x.size()
        h = self.num_heads
        # Project and split into heads
        Q = self.W_q(x).view(batch, seq_len, h, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(batch, seq_len, h, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(batch, seq_len, h, self.d_k).transpose(1, 2)
        # Attention per head
        attn_out, _ = scaled_dot_product_attention(Q, K, V, mask)
        # Concatenate heads and project
        attn_out = attn_out.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.W_o(attn_out)
```

GPT-3 uses 96 attention heads. GPT-2 (small) uses 12 heads.
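As a shape sanity check, PyTorch ships an equivalent built-in, nn.MultiheadAttention; this short sketch runs it on random input (the sizes are arbitrary examples):

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8
mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

x = torch.randn(2, 10, d_model)  # (batch, seq_len, d_model)
out, weights = mha(x, x, x)      # self-attention: query = key = value = x
print(out.shape)      # (2, 10, 512): one contextualized vector per token
print(weights.shape)  # (2, 10, 10): pairwise weights, averaged over heads
```

Passing the same tensor as query, key, and value is exactly what makes this *self*-attention; cross-attention in encoder-decoder models passes different tensors.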
Feed-Forward Network
After attention, each token passes through a two-layer fully connected network independently:
FFN(x) = max(0, xW₁ + b₁)W₂ + b₂

```python
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),  # modern models use GELU over ReLU
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        return self.net(x)
```

The FFN dimension d_ff is typically 4× the model dimension.
Layer Normalization and Residual Connections
Residual connections (skip connections) allow gradients to flow directly through the network:
x = x + MultiHeadAttention(LayerNorm(x))
x = x + FeedForward(LayerNorm(x))

Pre-norm (shown above, used in GPT models) is more stable during training than the original post-norm formulation.
```python
class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        x = x + self.drop(self.attn(self.norm1(x), mask))
        x = x + self.drop(self.ffn(self.norm2(x)))
        return x
```

Model Variants
Encoder-Only (BERT family)
- Bidirectional: each token sees all other tokens
- Used for: text classification, NER, question answering
- Pre-trained with masked language modeling (MLM)
- Examples: BERT, RoBERTa, DeBERTa
Decoder-Only (GPT family)
- Causal/autoregressive: each token sees only previous tokens (causal mask)
- Used for: text generation, chat, completion
- Pre-trained with next-token prediction (CLM)
- Examples: GPT-2, GPT-4, LLaMA, Mistral, Gemma
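The causal masking and next-token head described above can be sketched as a toy decoder-only model. This uses PyTorch's built-in pre-norm encoder layers as a stand-in for the blocks derived earlier; the sizes are illustrative toy values, not a real GPT configuration.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy decoder-only language model (illustrative sizes only)."""
    def __init__(self, vocab_size=1000, d_model=64, num_heads=4,
                 num_layers=2, max_len=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)   # learned positions
        layer = nn.TransformerEncoderLayer(
            d_model, num_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)      # pre-norm, as in GPT
        self.blocks = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, vocab_size)  # next-token logits

    def forward(self, tokens):
        seq_len = tokens.size(1)
        # Causal mask: True above the diagonal marks future positions
        # a token must NOT attend to (PyTorch's bool-mask convention)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), 1)
        x = self.embed(tokens) + self.pos(torch.arange(seq_len))
        x = self.blocks(x, mask=mask)
        return self.head(x)  # (batch, seq_len, vocab_size)

model = TinyLM()
logits = model(torch.randint(0, 1000, (2, 16)))
print(logits.shape)  # (2, 16, 1000): next-token logits at every position
```

Training minimizes cross-entropy between these logits and the input shifted left by one token; that shift is the entirety of "next-token prediction".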
Encoder-Decoder (T5 / BART family)
- Encoder processes input, decoder generates output
- Used for: translation, summarization, structured generation
- Examples: T5, BART, mT5
Pre-training vs Fine-tuning
Pre-training: Train on massive text corpus (terabytes) to predict next tokens. This teaches general language understanding. Very expensive (millions of dollars in compute).
Fine-tuning: Take a pre-trained model and continue training on a smaller task-specific dataset. Adapts the model to a specific task or domain. Affordable.
PEFT (Parameter-Efficient Fine-Tuning): Instead of updating all parameters, methods like LoRA add small trainable adapter matrices. Reduces memory and compute by 10–100×.
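The LoRA idea can be sketched in a few lines: freeze a pre-trained linear layer and learn only a low-rank update W + (alpha/r)·BA. The class below is a minimal illustration (names and defaults are this example's, not the peft library's API).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus a trainable rank-r update (sketch)."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pre-trained weights
        # A is small random, B is zero, so training starts from the
        # unmodified base layer: B @ A = 0 at initialization.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

base = nn.Linear(512, 512)
lora = LoRALinear(base)
trainable = sum(p.numel() for p in lora.parameters() if p.requires_grad)
total = sum(p.numel() for p in lora.parameters())
print(trainable, total)  # only the rank-8 factors (2 * 8 * 512) train
```

The memory saving comes from the optimizer: Adam keeps two extra states per trainable parameter, so shrinking the trainable set shrinks optimizer memory proportionally.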
Troubleshooting
Out-of-memory errors during training
- Reduce batch size and increase gradient accumulation steps
- Use gradient checkpointing (Hugging Face models): model.gradient_checkpointing_enable()
- Use mixed precision: torch.amp.autocast('cuda') (the older torch.cuda.amp.autocast() spelling is deprecated)
Model generates repetitive text
- Adjust temperature (higher = more diverse)
- Use repetition penalty
- Try top-p (nucleus) sampling instead of greedy decoding
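Top-p sampling from the last bullet can be sketched directly (function name and defaults are this example's; real generation loops in libraries like Transformers implement the same idea):

```python
import torch
import torch.nn.functional as F

def sample_top_p(logits, temperature=1.0, top_p=0.9):
    """Nucleus sampling sketch: sample only from the smallest set of
    tokens whose cumulative probability reaches top_p."""
    probs = F.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Drop a token if the mass accumulated BEFORE it already exceeds
    # top_p (the top token is therefore never dropped)
    cutoff = cumulative - sorted_probs > top_p
    sorted_probs[cutoff] = 0.0
    sorted_probs /= sorted_probs.sum()  # renormalize the nucleus
    choice = torch.multinomial(sorted_probs, 1)
    return sorted_idx[choice]
```

Raising temperature flattens the distribution before the cutoff; lowering top_p shrinks the nucleus. Both trade diversity against coherence.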
Attention weights don't look meaningful
- This is normal for lower layers. Deeper layers develop more interpretable attention patterns.
Key Takeaways
- Transformers replaced RNNs by processing all tokens simultaneously via self-attention, solving the vanishing gradient and sequential bottleneck problems
- Self-attention computes relevance scores between every pair of tokens using Query, Key, and Value projections — this is how the model understands context
- Multi-head attention runs multiple attention heads in parallel, each learning to focus on different types of relationships (syntax, semantics, coreference)
- Feed-forward layers store "knowledge" from training — they act independently on each token after attention has gathered contextual information
- Residual connections and layer normalization are critical for training stability, especially in deep networks with 32–96 layers
- Encoder-only models (BERT) use bidirectional attention for classification; decoder-only models (GPT, Llama) use causal masking for text generation
- Modern LLMs use RoPE or ALiBi positional encodings instead of sinusoidal, which better generalize to longer sequences than seen during training
- LoRA (Low-Rank Adaptation) reduces fine-tuning cost by 10–100× by adding small trainable adapter matrices rather than updating all parameters
FAQ
How many parameters does a transformer have? It depends on depth and width. GPT-2: 117M to 1.5B. GPT-3: 175B. Llama 3: 8B to 70B. Most parameters live in the FFN and attention projection matrices. The FFN alone accounts for roughly two-thirds of total parameters.
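The two-thirds figure follows from counting the weight matrices in one block (a rough estimate that ignores biases, norms, and embeddings):

```python
def block_params(d_model, d_ff=None):
    """Per-layer weight count: attention has four d_model x d_model
    projections (Q, K, V, output); the FFN has two d_model x d_ff
    matrices. Biases and LayerNorm parameters are negligible."""
    d_ff = d_ff or 4 * d_model  # the typical 4x expansion
    attn = 4 * d_model * d_model
    ffn = 2 * d_model * d_ff
    return attn, ffn

attn, ffn = block_params(4096)
print(ffn / (attn + ffn))  # -> 0.666...: the FFN is 2/3 of block weights
```

With d_ff = 4·d_model the ratio is exactly 8/(4+8) = 2/3, independent of d_model.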
What is the context window? The maximum sequence length the model can process at once. It is determined by positional encoding design and the attention mask. GPT-4 supports 128K tokens. Llama 3.1 supports 128K tokens. Mistral 7B v0.3 supports 32K tokens.
What is a token? A token is typically a word piece (subword unit). "unbelievable" might tokenize to ["un", "believ", "able"]. Most LLMs use BPE (Byte Pair Encoding) tokenization. One token averages about 0.75 English words.
Why does attention scale by sqrt(d_k)? Without scaling, dot products between large query and key vectors become very large, pushing softmax into saturation regions where gradients approach zero. Dividing by sqrt(d_k) keeps the scale stable regardless of the key dimension size.
What is the difference between pre-norm and post-norm? Pre-norm applies layer normalization before the attention and FFN sub-layers; post-norm applies it after. Pre-norm (used in GPT models) is more stable during training of very deep networks and has largely replaced post-norm in modern architectures.
Why do some models use Grouped Query Attention (GQA)? GQA reduces the number of key and value heads while keeping the same number of query heads. This cuts KV cache size (and thus VRAM usage) significantly — Llama 3 uses GQA to make 70B models more memory-efficient without major quality loss.
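The sharing in GQA can be sketched by expanding each K/V head across its group of query heads (shapes and names are this example's; production code keeps the small K/V cache and expands on the fly):

```python
import torch

def grouped_query_attention(Q, K, V):
    """GQA sketch. Q: (batch, hq, seq, d); K, V: (batch, hkv, seq, d)
    with hkv < hq. Each group of hq/hkv query heads shares one KV head."""
    b, hq, s, d = Q.shape
    hkv = K.size(1)
    assert hq % hkv == 0
    group = hq // hkv
    # Duplicate each KV head across its group of query heads
    K = K.repeat_interleave(group, dim=1)
    V = V.repeat_interleave(group, dim=1)
    scores = Q @ K.transpose(-2, -1) / d ** 0.5
    weights = torch.softmax(scores, dim=-1)
    return weights @ V

Q = torch.randn(1, 8, 16, 64)  # 8 query heads
K = torch.randn(1, 2, 16, 64)  # only 2 KV heads: 4x smaller KV cache
V = torch.randn(1, 2, 16, 64)
out = grouped_query_attention(Q, K, V)
print(out.shape)  # (1, 8, 16, 64)
```

Only K and V are cached during generation, so cutting KV heads from 8 to 2 cuts cache memory by 4x while the model keeps all 8 query heads.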
What is the difference between BERT and GPT architecturally? BERT is encoder-only and uses bidirectional attention — every token can attend to every other token. GPT is decoder-only and uses causal (left-to-right) masking — each token only attends to previous tokens. BERT is better for understanding tasks; GPT is better for generation.