Transformer Architecture: How Attention Powers LLMs (2026)

M
Mamta Chauhan
Content Creator and AI Enthusiast

Learning Objectives

  • Understand why transformers replaced RNNs for language tasks
  • Explain self-attention and why it matters
  • Walk through the full transformer architecture layer by layer
  • Distinguish between encoder-only, decoder-only, and encoder-decoder models
  • Know how pre-training and fine-tuning work at a high level

Why Transformers?

Before transformers (2017), sequence modeling used RNNs and LSTMs. These had two critical limitations:

  1. Sequential processing — each token depends on the previous, making parallelization impossible during training
  2. Vanishing gradients — long-range dependencies were hard to learn; the model "forgot" tokens from early in the sequence

The transformer architecture, introduced in "Attention Is All You Need" (Vaswani et al., 2017), solved both problems:

  • Parallel processing — all tokens are processed simultaneously
  • Direct attention — any token can attend to any other token in a single step, regardless of distance

The Big Picture

A transformer processes a sequence of tokens. Each token is first converted to a vector (embedding), then processed through multiple transformer blocks, and finally decoded into a prediction.

plaintext
Input TokensEmbeddings + Positional Encoding[Transformer Block] × N layersOutput (next token logits, classification, etc.)

Each Transformer Block contains:

  1. Multi-Head Self-Attention
  2. Add & Norm (residual connection + layer normalization)
  3. Feed-Forward Network
  4. Add & Norm

Token Embeddings

Tokens (words or subwords) are mapped to dense vectors via an embedding matrix.

Python
import torch
import torch.nn as nn

vocab_size = 50000
d_model    = 512  # embedding dimension

embedding = nn.Embedding(vocab_size, d_model)
tokens    = torch.tensor([[1, 542, 23, 9, 1002]])  # batch of 1 sequence
x         = embedding(tokens)  # shape: (1, 5, 512)

The embedding dimension d_model is a key hyperparameter. GPT-2 uses 768; GPT-3 uses 12288.


Positional Encoding

Self-attention has no inherent sense of order. Positional encoding adds position information to each token embedding.

Sinusoidal positional encoding (original paper):

Python
import torch
import math

def positional_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

    pe[:, 0::2] = torch.sin(position * div_term)  # even dims
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dims
    return pe  # shape: (seq_len, d_model)

Modern LLMs often use Rotary Positional Embedding (RoPE) or ALiBi instead, which better handle longer sequences.


Self-Attention: The Core Mechanism

Self-attention lets each token look at all other tokens and decide which ones to focus on.

Queries, Keys, and Values

Each token embedding is projected into three vectors via learned weight matrices:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I contain?"
  • Value (V): "What information do I pass on?"
Python
d_k = 64  # dimension of queries and keys

# Learned projections
W_q = nn.Linear(d_model, d_k, bias=False)
W_k = nn.Linear(d_model, d_k, bias=False)
W_v = nn.Linear(d_model, d_k, bias=False)

Q = W_q(x)  # shape: (batch, seq_len, d_k)
K = W_k(x)
V = W_v(x)

Scaled Dot-Product Attention

plaintext
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) × V
Python
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)

    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    attn_weights = F.softmax(scores, dim=-1)
    output = torch.matmul(attn_weights, V)
    return output, attn_weights

The intuition: QK^T computes a similarity score between every pair of tokens. Softmax converts scores to probabilities (attention weights). The output is a weighted average of the value vectors — tokens attend more to semantically related tokens.

The 1/sqrt(d_k) scaling prevents softmax from saturating when d_k is large (which would push gradients toward zero).


Multi-Head Attention

Instead of one attention head, transformers run H attention heads in parallel, each with its own Q/K/V projections. Each head can learn to attend to different types of relationships.

Python
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.num_heads = num_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        batch, seq_len, d_model = x.size()
        h = self.num_heads

        # Project and split into heads
        Q = self.W_q(x).view(batch, seq_len, h, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(batch, seq_len, h, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(batch, seq_len, h, self.d_k).transpose(1, 2)

        # Attention per head
        attn_out, _ = scaled_dot_product_attention(Q, K, V, mask)

        # Concatenate heads and project
        attn_out = attn_out.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.W_o(attn_out)

GPT-3 uses 96 attention heads. GPT-2 (small) uses 12 heads.


Feed-Forward Network

After attention, each token passes through a two-layer fully connected network independently:

plaintext
FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
Python
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),            # modern models use GELU over ReLU
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        return self.net(x)

The FFN dimension d_ff is typically 4× the model dimension.


Layer Normalization and Residual Connections

Residual connections (skip connections) allow gradients to flow directly through the network:

plaintext
x = x + MultiHeadAttention(LayerNorm(x))
x = x + FeedForward(LayerNorm(x))

Pre-norm (shown above, used in GPT models) is more stable during training than the original post-norm formulation.

Python
class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attn  = MultiHeadAttention(d_model, num_heads)
        self.ffn   = FeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop  = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        x = x + self.drop(self.attn(self.norm1(x), mask))
        x = x + self.drop(self.ffn(self.norm2(x)))
        return x

Model Variants

Encoder-Only (BERT family)

  • Bidirectional: each token sees all other tokens
  • Used for: text classification, NER, question answering
  • Pre-trained with masked language modeling (MLM)
  • Examples: BERT, RoBERTa, DeBERTa

Decoder-Only (GPT family)

  • Causal/autoregressive: each token sees only previous tokens (causal mask)
  • Used for: text generation, chat, completion
  • Pre-trained with next-token prediction (CLM)
  • Examples: GPT-2, GPT-4, LLaMA, Mistral, Gemma

Encoder-Decoder (T5 / BART family)

  • Encoder processes input, decoder generates output
  • Used for: translation, summarization, structured generation
  • Examples: T5, BART, mT5

Pre-training vs Fine-tuning

Pre-training: Train on massive text corpus (terabytes) to predict next tokens. This teaches general language understanding. Very expensive (millions of dollars in compute).

Fine-tuning: Take a pre-trained model and continue training on a smaller task-specific dataset. Adapts the model to a specific task or domain. Affordable.

PEFT (Parameter-Efficient Fine-Tuning): Instead of updating all parameters, methods like LoRA add small trainable adapter matrices. Reduces memory and compute by 10–100×.


Troubleshooting

Out-of-memory errors during training

  • Reduce batch size and increase gradient accumulation steps
  • Use gradient checkpointing: model.gradient_checkpointing_enable()
  • Use mixed precision: torch.cuda.amp.autocast()

Model generates repetitive text

  • Adjust temperature (higher = more diverse)
  • Use repetition penalty
  • Try top-p (nucleus) sampling instead of greedy decoding

Attention weights don't look meaningful

  • This is normal for lower layers. Deeper layers develop more interpretable attention patterns.

Key Takeaways

  • Transformers replaced RNNs by processing all tokens simultaneously via self-attention, solving the vanishing gradient and sequential bottleneck problems
  • Self-attention computes relevance scores between every pair of tokens using Query, Key, and Value projections — this is how the model understands context
  • Multi-head attention runs multiple attention heads in parallel, each learning to focus on different types of relationships (syntax, semantics, coreference)
  • Feed-forward layers store "knowledge" from training — they act independently on each token after attention has gathered contextual information
  • Residual connections and layer normalization are critical for training stability, especially in deep networks with 32–96 layers
  • Encoder-only models (BERT) use bidirectional attention for classification; decoder-only models (GPT, Llama) use causal masking for text generation
  • Modern LLMs use RoPE or ALiBi positional encodings instead of sinusoidal, which better generalize to longer sequences than seen during training
  • LoRA (Low-Rank Adaptation) reduces fine-tuning cost by 10–100× by adding small trainable adapter matrices rather than updating all parameters

FAQ

How many parameters does a transformer have? It depends on depth and width. GPT-2: 117M to 1.5B. GPT-3: 175B. Llama 3: 8B to 70B. Most parameters live in the FFN and attention projection matrices. The FFN alone accounts for roughly two-thirds of total parameters.

What is the context window? The maximum sequence length the model can process at once. It is determined by positional encoding design and the attention mask. GPT-4 supports 128K tokens. Llama 3.1 supports 128K tokens. Mistral 7B v0.3 supports 32K tokens.

What is a token? A token is typically a word piece (subword unit). "unbelievable" might tokenize to ["un", "believ", "able"]. Most LLMs use BPE (Byte Pair Encoding) tokenization. One token averages about 0.75 English words.

Why does attention scale by sqrt(d_k)? Without scaling, dot products between large query and key vectors become very large, pushing softmax into saturation regions where gradients approach zero. Dividing by sqrt(d_k) keeps the scale stable regardless of the key dimension size.

What is the difference between pre-norm and post-norm? Pre-norm applies layer normalization before the attention and FFN sub-layers; post-norm applies it after. Pre-norm (used in GPT models) is more stable during training of very deep networks and has largely replaced post-norm in modern architectures.

Why do some models use Grouped Query Attention (GQA)? GQA reduces the number of key and value heads while keeping the same number of query heads. This cuts KV cache size (and thus VRAM usage) significantly — Llama 3 uses GQA to make 70B models more memory-efficient without major quality loss.

What is the difference between BERT and GPT architecturally? BERT is encoder-only and uses bidirectional attention — every token can attend to every other token. GPT is decoder-only and uses causal (left-to-right) masking — each token only attends to previous tokens. BERT is better for understanding tasks; GPT is better for generation.


What to Learn Next

MC
Mamta Chauhan
Content Creator and AI Enthusiast

Mamta Chauhan is an AI enthusiast and content creator behind ailearnings.in. She writes practical guides on LLMs, RAG, and AI engineering to help developers navigate the fast-moving world of artificial intelligence. Passionate about bridging the gap between cutting-edge research and real-world application.

← Back to all articles