Decoder-Only Transformers
Decoder-only transformers predict the next token given all previous tokens — causal, left-to-right attention. This simple objective, scaled to trillions of tokens, produces models that can generate text, code, reason through multi-step problems, follow complex instructions, and learn new tasks from a few examples in context. This is the architecture behind every modern LLM.
When to Use Decoder-Only
Text Generation
Writing, summarization, translation, creative content. The autoregressive loop generates one token at a time, each conditioned on all previous tokens. Temperature and top-p sampling control creativity vs. determinism.
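As a minimal sketch (not any particular library's API), here is how temperature and top-p sampling are typically applied to a model's next-token logits; `logits` is assumed to be a 1-D tensor of vocabulary scores.

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 0.8, top_p: float = 0.9) -> int:
    """Sample one token id from next-token logits with temperature + nucleus (top-p) filtering."""
    # Lower temperature sharpens the distribution (more deterministic); higher flattens it.
    probs = torch.softmax(logits / temperature, dim=-1)

    # Nucleus filtering: keep the smallest set of tokens whose cumulative probability reaches top_p.
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < top_p          # the top token is always kept
    nucleus = torch.where(keep, sorted_probs, torch.zeros_like(sorted_probs))
    nucleus = nucleus / nucleus.sum()                 # renormalize over the kept tokens

    # Sample within the nucleus, then map back to the original vocabulary index.
    choice = torch.multinomial(nucleus, num_samples=1)
    return int(sorted_ids[choice])
```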
Code Generation & Completion
Copilot, Cursor, code assistants. Code is sequential and highly structured — perfect for autoregressive generation. Models trained on code (CodeLlama, StarCoder, DeepSeek-Coder) generate functions, debug errors, and write tests.
Conversational AI & Instruction Following
Chatbots, assistants, customer support. After SFT (supervised fine-tuning) and RLHF/DPO alignment, decoder models follow complex multi-turn instructions, maintain context, and refuse harmful requests.
Reasoning & Chain-of-Thought
Mathematical proofs, logical deduction, planning. Decoder models "think out loud" by generating intermediate reasoning steps before the final answer. This emergent capability appears at scale (>60B parameters) and dramatically improves accuracy on complex tasks.
Few-Shot & Zero-Shot Learning
Solve new tasks from examples in the prompt (few-shot) or from instructions alone (zero-shot). No fine-tuning needed. GPT-3 demonstrated this: include 3–5 examples in the prompt and the model generalizes. This is in-context learning — a capability that emerges with scale in decoder models.
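To make in-context learning concrete, here is a hypothetical few-shot prompt (the reviews and labels are invented for illustration); the model simply continues the established pattern.

```python
# Few-shot prompt: the task is defined entirely by examples in the prompt, no training.
few_shot_prompt = """Classify the sentiment of each review as positive or negative.

Review: The battery lasts all day and the screen is gorgeous.
Sentiment: positive

Review: Stopped working after two weeks and support never replied.
Sentiment: negative

Review: Setup took five minutes and it just works.
Sentiment:"""
# A decoder-only model continues the text; the expected completion is " positive".
```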
Structured Output & Tool Use
JSON generation, function calling, API orchestration. Modern LLMs generate structured outputs reliably with constrained decoding or fine-tuning. This enables agents that call tools, query databases, and execute code.
What to Look for in Your Data
Pretraining Data: Scale Is (Almost) Everything
- GPT-3 (2020): 300B tokens from Common Crawl, WebText2, Books, Wikipedia
- LLaMA 1 (2023): 1.4T tokens — 4.7× more, showing that a smaller model trained on more tokens can beat a larger model trained on fewer
- LLaMA 3 (2024): 15T tokens — 10× more again, with aggressive data quality filtering
- The pattern: Data quantity matters, but data quality matters more. LLaMA 3 spent more effort on deduplication, filtering, and mixing ratios than on architecture changes.
Fine-Tuning Data: Instructions & Preferences
- SFT data: (instruction, response) pairs. FLAN (~1,800 tasks), Alpaca (52K GPT-generated), ShareGPT (user conversations), OpenAssistant
- Preference data: (prompt, chosen_response, rejected_response) triples for RLHF/DPO alignment. Critical for safety and helpfulness.
- Quality over quantity: LIMA (2023) showed that just 1,000 carefully curated SFT examples can match models trained on 50K+ examples. Data curation matters enormously.
In-Context Learning vs. Fine-Tuning
- Few-shot prompting: No training needed. Include 3–10 examples in the prompt. Works for classification, extraction, formatting. Limited by context window.
- Fine-tuning: When you have >1K examples and need consistent, cost-efficient performance. LoRA/QLoRA make fine-tuning accessible on consumer GPUs.
- Rule of thumb: Start with prompting. If accuracy is <90% or per-query cost is too high, fine-tune. A fine-tuned 7B model often beats a prompted 70B model on domain-specific tasks.
Task Framing: Everything as Generation
Decoder models cast every task as "generate the next tokens":
- Classification: "Classify this review as positive or negative: {text}\nLabel:" → generate "positive"
- NER: "Extract entities from: {text}\nEntities:" → generate JSON
- Translation: "Translate to French: {text}\n" → generate French text
- This works surprisingly well but is less efficient than a fine-tuned encoder for pure classification (a 110M BERT is faster and cheaper than a 7B LLM for sentiment analysis).
Context Length & Token Efficiency
- Original GPT-2/3: 1K–2K token context with absolute positional embeddings
- LLaMA 2: 4K context with RoPE (extends to 32K+ with interpolation)
- LLaMA 3: 128K context natively
- Cost scales with context: Self-attention is O(n²), so the attention for a 128K-token prompt costs roughly (128K / 4K)² ≈ 1,000× that of a 4K prompt (the rest of the model scales linearly). Use only the context you need.
Architecture
Key Architectural Innovations
Modern decoder-only models (LLaMA, Mistral, Phi) share a common set of architectural upgrades over the original GPT. These are the building blocks worth understanding.
1 KV Cache
During autoregressive generation, the model generates one token at a time. Without KV cache, generating token N requires recomputing attention over all N previous tokens — O(n²) total work. With KV cache, we store the Key and Value projections from previous tokens and only compute the new token's Q, K, V. Each new token is O(n) instead of O(n²).
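A single-head sketch of the idea, with hypothetical projection matrices `W_q`, `W_k`, `W_v`; real implementations keep one cache per layer and per KV head.

```python
import torch

def cached_decode_step(x_new, W_q, W_k, W_v, k_cache, v_cache):
    """One generation step: compute Q/K/V only for the new token, reuse cached K/V.

    x_new:   (1, d_model)  hidden state of the newly generated token
    k_cache: (t, d_head)   keys of all previous tokens
    v_cache: (t, d_head)   values of all previous tokens
    """
    q = x_new @ W_q                                       # only the new token's query
    k_cache = torch.cat([k_cache, x_new @ W_k], dim=0)    # append the new key
    v_cache = torch.cat([v_cache, x_new @ W_v], dim=0)    # append the new value

    # Attention over the full cached prefix: O(t) work per step instead of O(t^2).
    scores = (q @ k_cache.T) / k_cache.shape[-1] ** 0.5   # (1, t+1)
    weights = torch.softmax(scores, dim=-1)
    out = weights @ v_cache                               # (1, d_head)
    return out, k_cache, v_cache
```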
2 Grouped-Query Attention (GQA)
Standard multi-head attention (MHA) has separate K, V projections per head. This means KV cache grows linearly with the number of heads. GQA shares K, V across groups of query heads, reducing KV cache size by 4–8× with minimal quality loss.
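A sketch of the core GQA computation: query heads are grouped so each group shares one K/V head, and only `n_kv_heads` keys and values ever need to be cached.

```python
import torch

def grouped_query_attention(q, k, v):
    """q: (n_q_heads, seq, d);  k, v: (n_kv_heads, seq, d), with n_q_heads % n_kv_heads == 0."""
    n_q_heads, n_kv_heads = q.shape[0], k.shape[0]
    group_size = n_q_heads // n_kv_heads

    # Each KV head serves `group_size` query heads. Only n_kv_heads of K/V are stored in the
    # cache; the repetition here is just to line shapes up for the matmul.
    k = k.repeat_interleave(group_size, dim=0)   # (n_q_heads, seq, d)
    v = v.repeat_interleave(group_size, dim=0)

    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v
```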
3 RoPE (Rotary Position Embeddings)
Unlike the learned absolute positional embeddings of GPT-2/GPT-3, RoPE encodes position by rotating the query and key vectors. The dot product between rotated Q and K depends only on their relative distance, not absolute position. This enables length generalization — train on 4K context, extend to 128K with interpolation (NTK-aware RoPE, YaRN).
- How: Split each head dimension into pairs. Rotate each pair by an angle proportional to its position × a frequency. Different dimension pairs use different frequencies.
- Why it works: The attention score between positions i and j depends only on (i - j), making the model naturally translation-invariant.
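A simplified sketch of applying RoPE to one head's queries or keys, using the common base-10000 frequency schedule; production code usually precomputes the cos/sin tables and operates on (batch, heads, seq, dim) tensors.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate (seq, d_head) query or key vectors in place-pairs; d_head must be even."""
    seq, d = x.shape
    # One frequency per dimension pair: base^(-2i/d), i = 0 .. d/2 - 1.
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)            # (d/2,)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs[None, :]    # (seq, d/2)

    x1, x2 = x[:, 0::2], x[:, 1::2]            # split each head dimension into pairs
    cos, sin = angles.cos(), angles.sin()
    rotated = torch.empty_like(x)
    rotated[:, 0::2] = x1 * cos - x2 * sin     # 2-D rotation of each pair by its angle
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return rotated
```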
4 SwiGLU Activation
Replaces the standard ReLU or GELU FFN with a gated linear unit using Swish activation:
FFN(x) = (Swish(xW₁) ⊙ xW₃)W₂
The gate (xW₃) controls information flow element-wise. This adds one more weight matrix but consistently outperforms ReLU/GELU by 1–2% on benchmarks. The FFN hidden dimension is typically 8/3 × the model dimension (e.g., 11,008 for LLaMA 7B, whose model dimension is 4,096).
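A direct translation of the formula above into a module; `d_ff` stands for the FFN hidden dimension (≈ 8/3 × d_model in LLaMA-style models).

```python
import torch
import torch.nn as nn

class SwiGLU(nn.Module):
    """FFN(x) = (Swish(x W1) ⊙ x W3) W2, the gated FFN used in LLaMA-style models."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)   # "up" projection fed to Swish
        self.w3 = nn.Linear(d_model, d_ff, bias=False)   # gate projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)   # "down" projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # silu is Swish with beta = 1; the gate modulates it element-wise.
        return self.w2(torch.nn.functional.silu(self.w1(x)) * self.w3(x))
```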
5 Mixture of Experts (MoE)
Replace the single FFN with N expert FFNs (typically 8). A router network selects the top-K experts (usually 2) for each token. Only the selected experts run — FFN compute per token is 2/8 = 25% of a dense layer holding all eight experts, but the model has roughly 8× the FFN parameters.
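A minimal sketch of top-2 routing: the router scores all experts, each token runs through only its two selected expert FFNs, and the outputs are mixed with the renormalized router weights. Real implementations add load-balancing losses and batched expert dispatch.

```python
import torch
import torch.nn as nn

class Top2MoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (n_tokens, d_model)
        gate_logits = self.router(x)                            # (n_tokens, n_experts)
        weights, indices = gate_logits.topk(2, dim=-1)          # top-2 experts per token
        weights = torch.softmax(weights, dim=-1)                # renormalize over the chosen 2

        out = torch.zeros_like(x)
        for slot in range(2):                                   # run only the selected experts
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```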
Training Pipeline
Modern LLMs follow a three-stage training process. Each stage uses different data, objectives, and compute budgets.
Stage 1: Pretraining
Pure next-token prediction on trillions of tokens from the web. The model learns language, facts, reasoning patterns, code syntax, and world knowledge. This is the most expensive stage — LLaMA 3 70B required ~6M GPU-hours on H100s. The key decisions are data mixture ratios (web, code, math, multilingual) and data quality filtering.
Stage 2: Supervised Fine-Tuning (SFT)
Train on (instruction, response) pairs to teach the model to follow instructions, answer questions, and format outputs. This transforms the base model from a text completer into an assistant. Key datasets: FLAN (~1,800 tasks), OpenAssistant, ShareGPT. LIMA showed 1K high-quality examples can suffice.
Stage 3: RLHF / DPO
Align the model with human preferences. Two dominant approaches:
- RLHF: Train a reward model on human preference data. Use PPO to optimize the LLM against the reward model. Complex but proven (ChatGPT, Claude).
- DPO: Direct Preference Optimization skips the reward model. Directly optimize the policy from preference pairs using a classification-like loss. Simpler, no RL infrastructure needed. Used by LLaMA 3, Zephyr.
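A sketch of the DPO loss on a batch of preference pairs; each argument is assumed to be the summed log-probability of a full response under the policy or the frozen reference model, computed elsewhere.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """DPO: -log sigmoid(beta * [(logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)]).

    Each argument is the summed log-probability of a full response (scalar tensor or batch);
    beta controls how far the policy may drift from the reference model.
    """
    chosen_reward = policy_chosen_logp - ref_chosen_logp         # implicit reward of chosen
    rejected_reward = policy_rejected_logp - ref_rejected_logp   # implicit reward of rejected
    return -F.logsigmoid(beta * (chosen_reward - rejected_reward)).mean()
```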
Key Models
Model Comparison
| Model | Params | Training Data | Context | Key Innovation | Best For |
|---|---|---|---|---|---|
| GPT-3 | 175B | 300B tokens | 2K | In-context learning | Historical reference |
| LLaMA 2 | 7–70B | 2T tokens | 4K | GQA, open weights | Open-source baseline |
| LLaMA 3 | 8–405B | 15T tokens | 128K | Data scale + quality | Best open model |
| Mistral 7B | 7B | Undisclosed | 32K | Sliding window attn | Efficient inference |
| Mixtral 8×7B | 47B (13B active) | Undisclosed | 32K | Sparse MoE | Quality/cost ratio |
| Phi-3 Mini | 3.8B | 3.3T tokens | 128K | Data curation | On-device / edge |
When NOT to Use Decoder-Only
Pure Classification with Small Labeled Data
If you have 5K labeled examples and need to classify text into 10 categories, a fine-tuned DeBERTa (400M params) will match or beat a prompted LLaMA 70B at 1/175th the compute. Encoders are purpose-built for this. Don't use a cannon to kill a mosquito.
Retrieval & Embedding Tasks
Dense retrieval, semantic search, sentence similarity. Encoder-only models (Sentence-BERT, E5, GTE) produce better embeddings than decoder models because bidirectional attention captures meaning from both directions. A 110M encoder for embeddings + a vector database is the standard retrieval stack.
Latency-Critical Understanding-Only Tasks
If you only need to understand text (not generate it) and latency matters, encoder-only is faster. A single forward pass through BERT takes ~5ms. Generating 100 tokens from an LLM takes 500ms–2s. For real-time classification pipelines processing millions of items, this difference is decisive.
Decision Framework
Choose Decoder-Only If...
- Your task requires generating text, code, or structured output
- You need conversational / multi-turn interaction
- Complex reasoning or chain-of-thought is required
- You want few-shot / zero-shot learning without fine-tuning
- The task requires world knowledge or commonsense reasoning
- You need flexibility across many task types (one model for everything)
- You're building an agent that uses tools, calls APIs, or writes code
- Your budget allows for the higher compute cost
Use Cases in Practice
Code Completion & Generation
Developer writes a function signature and docstring. The model autoregressively generates the implementation. CodeLlama, StarCoder, and DeepSeek-Coder are trained on hundreds of billions of code tokens. Fill-in-the-middle (FIM) training allows completion at any cursor position, not just the end.
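As an illustration of the FIM prompt format, the file is split at the cursor and reordered with sentinel tokens so the model generates the missing middle. Sentinel names vary by model; the ones below follow the StarCoder-style convention and are shown purely for illustration.

```python
# Fill-in-the-middle (FIM): prefix = text before the cursor, suffix = text after it.
prefix = 'def mean(values):\n    """Return the arithmetic mean."""\n    '
suffix = "\n    return total / len(values)\n"
fim_prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
# The model's completion fills the cursor position, e.g. "total = sum(values)".
```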
Customer Support Chatbot
Fine-tune a 7B–13B model on your company's support conversations. RAG retrieves relevant docs from a knowledge base. The model generates human-like responses grounded in your documentation. LoRA fine-tuning on 10K conversations takes hours on a single A100. Serve at ~50 tokens/sec per user.
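A hedged sketch of a LoRA setup using the Hugging Face peft library; the base model, rank, and target modules are illustrative choices rather than recommendations, and the training loop itself is omitted.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"                  # illustrative base model
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# LoRA: train small low-rank adapters on the attention projections, freeze everything else.
config = LoraConfig(
    r=16,                                # adapter rank
    lora_alpha=32,                       # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()       # typically well under 1% of the base model's weights
```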
Document Summarization
Input: long document (legal filing, research paper, earnings call transcript). Output: concise summary. With 128K context (LLaMA 3), most documents fit in a single prompt. For longer documents, hierarchical summarization: summarize chunks, then summarize summaries.
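A sketch of hierarchical summarization; `summarize(text) -> str` is a placeholder for whatever LLM call you use, and the character-based chunking is a stand-in for real token counting.

```python
def hierarchical_summary(document: str, summarize, chunk_tokens: int = 8000) -> str:
    """Summarize a document that exceeds the context window by summarizing chunks, then
    summarizing the summaries; `summarize(text) -> str` is any LLM summarization call."""
    chunk_chars = chunk_tokens * 4                      # rough chars-per-token heuristic
    chunks = [document[i:i + chunk_chars] for i in range(0, len(document), chunk_chars)]

    if len(chunks) == 1:
        return summarize(document)                      # fits in one prompt: summarize directly

    partial = [summarize(chunk) for chunk in chunks]    # level 1: summarize each chunk
    return hierarchical_summary("\n\n".join(partial), summarize, chunk_tokens)   # recurse
```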
Structured Data Extraction via Prompting
Extract structured information from unstructured text without training a custom model. "Extract the following fields from this invoice: vendor_name, date, total, line_items. Output as JSON." With constrained decoding or tool calling, the output is guaranteed to be valid JSON. Replaces custom NER pipelines for many extraction tasks.
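A sketch of prompt-based extraction with a validation step; `llm_complete(prompt) -> str` is a placeholder for any LLM call, ideally run with constrained or JSON-mode decoding so the output is guaranteed to parse.

```python
import json

EXTRACTION_PROMPT = """Extract the following fields from this invoice: vendor_name, date,
total, line_items. Output only valid JSON with exactly those keys.

Invoice text:
{invoice_text}
"""

def extract_invoice(invoice_text: str, llm_complete) -> dict:
    """Run the extraction prompt and validate the returned JSON."""
    raw = llm_complete(EXTRACTION_PROMPT.format(invoice_text=invoice_text))
    data = json.loads(raw)                              # fails loudly if the output isn't JSON
    missing = {"vendor_name", "date", "total", "line_items"} - data.keys()
    if missing:
        raise ValueError(f"model omitted fields: {missing}")
    return data
```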
Scaling & Emergent Abilities
What Emerges at Each Scale
- 1–3B params: Fluent text generation, basic instruction following, simple code completion. Good for on-device use. (Phi-3 Mini, LLaMA 3.2 3B)
- 7–13B params: Reliable multi-turn conversation, moderate reasoning, consistent code generation, basic math. The sweet spot for fine-tuning. (LLaMA 2 13B, Mistral 7B)
- 30–70B params: Complex reasoning, chain-of-thought, multi-step problem solving, strong code generation, nuanced instruction following. (LLaMA 3 70B)
- 100B+ params: Advanced mathematical reasoning, sophisticated multi-step planning, reliable tool use, strong multilingual performance. (GPT-4, Claude, LLaMA 3 405B)
The Chinchilla Insight
Hoffmann et al. (DeepMind 2022) showed that for a fixed compute budget, you should scale model size and data roughly equally. A 70B model trained on 1.4T tokens outperforms a 175B model trained on 300B tokens. LLaMA's success built on this insight — training smaller models on much more data than conventional wisdom suggested.
The Data Wall
The current frontier challenge: we may be running out of high-quality training data. LLaMA 3 used 15T tokens — approaching the total amount of quality text on the internet. The response: synthetic data (Phi-3 was trained on textbook-quality synthetic data), multimodal data (images, video, audio as additional training signal), and inference-time compute (spending more time thinking per query rather than pretraining longer).
The Trajectory
The open questions now are about efficiency (MoE, speculative decoding, quantization), alignment (how to make models reliably safe and helpful), and capability (test-time compute, reasoning, tool use, agents). The architecture itself has largely stabilized — the innovations are in data, training, and inference.