Decoder-Only Transformers
Decoder-only transformers predict the next token given all previous tokens — causal, left-to-right attention. This simple objective, scaled to trillions of tokens, produces models that can generate text, code, reason through multi-step problems, follow complex instructions, and learn new tasks from a few examples in context. This is the architecture behind every modern LLM.
When to Use Decoder-Only
Text Generation
Writing, summarization, translation, creative content. The autoregressive loop generates one token at a time, each conditioned on all previous tokens. Temperature and top-p sampling control creativity vs. determinism.
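As a minimal sketch (not any particular library's API), here is how temperature and top-p sampling are typically applied to a model's next-token logits; `logits` is assumed to be a 1-D tensor of vocabulary scores.

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 0.8, top_p: float = 0.9) -> int:
    """Sample one token id from next-token logits with temperature + nucleus (top-p) filtering."""
    # Lower temperature sharpens the distribution (more deterministic); higher flattens it.
    probs = torch.softmax(logits / temperature, dim=-1)

    # Nucleus filtering: keep the smallest set of tokens whose cumulative probability reaches top_p.
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < top_p          # the top token is always kept
    nucleus = torch.where(keep, sorted_probs, torch.zeros_like(sorted_probs))
    nucleus = nucleus / nucleus.sum()                 # renormalize over the kept tokens

    # Sample within the nucleus, then map back to the original vocabulary index.
    choice = torch.multinomial(nucleus, num_samples=1)
    return int(sorted_ids[choice])
```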
Code Generation & Completion
Copilot, Cursor, code assistants. Code is sequential and highly structured — perfect for autoregressive generation. Models trained on code (CodeLlama, StarCoder, DeepSeek-Coder) generate functions, debug errors, and write tests.
Conversational AI & Instruction Following
Chatbots, assistants, customer support. After SFT (supervised fine-tuning) and RLHF/DPO alignment, decoder models follow complex multi-turn instructions, maintain context, and refuse harmful requests.
Reasoning & Chain-of-Thought
Mathematical proofs, logical deduction, planning. Decoder models "think out loud" by generating intermediate reasoning steps before the final answer. This emergent capability appears at scale (>60B parameters) and dramatically improves accuracy on complex tasks.
Few-Shot & Zero-Shot Learning
Solve new tasks from examples in the prompt (few-shot) or from instructions alone (zero-shot). No fine-tuning needed. GPT-3 demonstrated this: include 3–5 examples in the prompt and the model generalizes. This is in-context learning — a capability that emerges with scale in decoder models.
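To make in-context learning concrete, here is a hypothetical few-shot prompt (the reviews and labels are invented for illustration); the model simply continues the established pattern.

```python
# Few-shot prompt: the task is defined entirely by examples in the prompt, no training.
few_shot_prompt = """Classify the sentiment of each review as positive or negative.

Review: The battery lasts all day and the screen is gorgeous.
Sentiment: positive

Review: Stopped working after two weeks and support never replied.
Sentiment: negative

Review: Setup took five minutes and it just works.
Sentiment:"""
# A decoder-only model continues the text; the expected completion is " positive".
```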
Structured Output & Tool Use
JSON generation, function calling, API orchestration. Modern LLMs generate structured outputs reliably with constrained decoding or fine-tuning. This enables agents that call tools, query databases, and execute code.
What to Look for in Your Data
Pretraining Data: Scale Is (Almost) Everything
- GPT-3 (2020): 300B tokens from Common Crawl, WebText2, Books, Wikipedia
- LLaMA 1 (2023): 1.4T tokens — 4.7× more, showing that a smaller model trained on more tokens can beat a larger model trained on fewer
- LLaMA 3 (2024): 15T tokens — 10× more again, with aggressive data quality filtering
- The pattern: Data quantity matters, but data quality matters more. LLaMA 3 spent more effort on deduplication, filtering, and mixing ratios than on architecture changes.
Fine-Tuning Data: Instructions & Preferences
- SFT data: (instruction, response) pairs. FLAN (~1,800 tasks), Alpaca (52K GPT-generated), ShareGPT (user conversations), OpenAssistant
- Preference data: (prompt, chosen_response, rejected_response) triples for RLHF/DPO alignment. Critical for safety and helpfulness.
- Quality over quantity: LIMA (2023) showed that just 1,000 carefully curated SFT examples can match models trained on 50K+ examples. Data curation matters enormously.
In-Context Learning vs. Fine-Tuning
- Few-shot prompting: No training needed. Include 3–10 examples in the prompt. Works for classification, extraction, formatting. Limited by context window.
- Fine-tuning: When you have >1K examples and need consistent, cost-efficient performance. LoRA/QLoRA make fine-tuning accessible on consumer GPUs.
- Rule of thumb: Start with prompting. If accuracy is <90% or per-query cost is too high, fine-tune. A fine-tuned 7B model often beats a prompted 70B model on domain-specific tasks.
Task Framing: Everything as Generation
Decoder models cast every task as "generate the next tokens":
- Classification: "Classify this review as positive or negative: {text}\nLabel:" → generate "positive"
- NER: "Extract entities from: {text}\nEntities:" → generate JSON
- Translation: "Translate to French: {text}\n" → generate French text
- This works surprisingly well but is less efficient than a fine-tuned encoder for pure classification (a 110M BERT is faster and cheaper than a 7B LLM for sentiment analysis).
Context Length & Token Efficiency
- Original GPT-2/3: 1K–2K token context with absolute positional embeddings
- LLaMA 2: 4K context with RoPE (extends to 32K+ with interpolation)
- LLaMA 3: 128K context natively
- Cost scales with context: Self-attention is O(n²), so the attention for a 128K-token prompt costs roughly (128K / 4K)² ≈ 1,000× that of a 4K prompt (the rest of the model scales linearly). Use only the context you need.
Architecture
Key Architectural Innovations
Modern decoder-only models (LLaMA, Mistral, Phi) share a common set of architectural upgrades over the original GPT. These are the building blocks worth understanding.
1 KV Cache
During autoregressive generation, the model generates one token at a time. Without KV cache, generating token N requires recomputing attention over all N previous tokens — O(n²) total work. With KV cache, we store the Key and Value projections from previous tokens and only compute the new token's Q, K, V. Each new token is O(n) instead of O(n²).
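A single-head sketch of the idea, with hypothetical projection matrices `W_q`, `W_k`, `W_v`; real implementations keep one cache per layer and per KV head.

```python
import torch

def cached_decode_step(x_new, W_q, W_k, W_v, k_cache, v_cache):
    """One generation step: compute Q/K/V only for the new token, reuse cached K/V.

    x_new:   (1, d_model)  hidden state of the newly generated token
    k_cache: (t, d_head)   keys of all previous tokens
    v_cache: (t, d_head)   values of all previous tokens
    """
    q = x_new @ W_q                                       # only the new token's query
    k_cache = torch.cat([k_cache, x_new @ W_k], dim=0)    # append the new key
    v_cache = torch.cat([v_cache, x_new @ W_v], dim=0)    # append the new value

    # Attention over the full cached prefix: O(t) work per step instead of O(t^2).
    scores = (q @ k_cache.T) / k_cache.shape[-1] ** 0.5   # (1, t+1)
    weights = torch.softmax(scores, dim=-1)
    out = weights @ v_cache                               # (1, d_head)
    return out, k_cache, v_cache
```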
2 Grouped-Query Attention (GQA)
Standard multi-head attention (MHA) has separate K, V projections per head. This means KV cache grows linearly with the number of heads. GQA shares K, V across groups of query heads, reducing KV cache size by 4–8× with minimal quality loss.
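A sketch of the core GQA computation: query heads are grouped so each group shares one K/V head, and only `n_kv_heads` keys and values ever need to be cached.

```python
import torch

def grouped_query_attention(q, k, v):
    """q: (n_q_heads, seq, d);  k, v: (n_kv_heads, seq, d), with n_q_heads % n_kv_heads == 0."""
    n_q_heads, n_kv_heads = q.shape[0], k.shape[0]
    group_size = n_q_heads // n_kv_heads

    # Each KV head serves `group_size` query heads. Only n_kv_heads of K/V are stored in the
    # cache; the repetition here is just to line shapes up for the matmul.
    k = k.repeat_interleave(group_size, dim=0)   # (n_q_heads, seq, d)
    v = v.repeat_interleave(group_size, dim=0)

    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v
```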
3 RoPE (Rotary Position Embeddings)
Unlike the learned absolute positional embeddings of GPT-2/GPT-3, RoPE encodes position by rotating the query and key vectors. The dot product between rotated Q and K depends only on their relative distance, not absolute position. This enables length generalization — train on 4K context, extend to 128K with interpolation (NTK-aware RoPE, YaRN).
- How: Split each head dimension into pairs. Rotate each pair by an angle proportional to its position × a frequency. Different dimension pairs use different frequencies.
- Why it works: The attention score between positions i and j depends only on (i - j), making the model naturally translation-invariant.
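A simplified sketch of applying RoPE to one head's queries or keys, using the common base-10000 frequency schedule; production code usually precomputes the cos/sin tables and operates on (batch, heads, seq, dim) tensors.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate (seq, d_head) query or key vectors in place-pairs; d_head must be even."""
    seq, d = x.shape
    # One frequency per dimension pair: base^(-2i/d), i = 0 .. d/2 - 1.
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)            # (d/2,)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs[None, :]    # (seq, d/2)

    x1, x2 = x[:, 0::2], x[:, 1::2]            # split each head dimension into pairs
    cos, sin = angles.cos(), angles.sin()
    rotated = torch.empty_like(x)
    rotated[:, 0::2] = x1 * cos - x2 * sin     # 2-D rotation of each pair by its angle
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return rotated
```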
4 SwiGLU Activation
Replaces the standard ReLU or GELU FFN with a gated linear unit using Swish activation:
FFN(x) = (Swish(xW₁) ⊙ xW₃)W₂
The gate (xW₃) controls information flow element-wise. This adds one more weight matrix but consistently outperforms ReLU/GELU by 1–2% on benchmarks. The FFN hidden dimension is typically 8/3 × the model dimension (e.g., 11,008 for LLaMA 7B, whose model dimension is 4,096).
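A direct translation of the formula above into a module; `d_ff` stands for the FFN hidden dimension (≈ 8/3 × d_model in LLaMA-style models).

```python
import torch
import torch.nn as nn

class SwiGLU(nn.Module):
    """FFN(x) = (Swish(x W1) ⊙ x W3) W2, the gated FFN used in LLaMA-style models."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)   # "up" projection fed to Swish
        self.w3 = nn.Linear(d_model, d_ff, bias=False)   # gate projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)   # "down" projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # silu is Swish with beta = 1; the gate modulates it element-wise.
        return self.w2(torch.nn.functional.silu(self.w1(x)) * self.w3(x))
```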
5 Mixture of Experts (MoE)
Replace the single FFN with N expert FFNs (typically 8). A router network selects the top-K experts (usually 2) for each token. Only the selected experts run — FFN compute per token is 2/8 = 25% of a dense layer holding all eight experts, but the model has roughly 8× the FFN parameters.
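A minimal sketch of top-2 routing: the router scores all experts, each token runs through only its two selected expert FFNs, and the outputs are mixed with the renormalized router weights. Real implementations add load-balancing losses and batched expert dispatch.

```python
import torch
import torch.nn as nn

class Top2MoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (n_tokens, d_model)
        gate_logits = self.router(x)                            # (n_tokens, n_experts)
        weights, indices = gate_logits.topk(2, dim=-1)          # top-2 experts per token
        weights = torch.softmax(weights, dim=-1)                # renormalize over the chosen 2

        out = torch.zeros_like(x)
        for slot in range(2):                                   # run only the selected experts
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```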
Training Pipeline
Modern LLMs follow a three-stage training process. Each stage uses different data, objectives, and compute budgets.
Stage 1: Pretraining
Pure next-token prediction on trillions of tokens from the web. The model learns language, facts, reasoning patterns, code syntax, and world knowledge. This is the most expensive stage — LLaMA 3 70B required ~6M GPU-hours on H100s. The key decisions are data mixture ratios (web, code, math, multilingual) and data quality filtering.
Stage 2: Supervised Fine-Tuning (SFT)
Train on (instruction, response) pairs to teach the model to follow instructions, answer questions, and format outputs. This transforms the base model from a text completer into an assistant. Key datasets: FLAN (~1,800 tasks), OpenAssistant, ShareGPT. LIMA showed 1K high-quality examples can suffice.
Stage 3: RLHF / DPO
Align the model with human preferences. Two dominant approaches:
- RLHF: Train a reward model on human preference data. Use PPO to optimize the LLM against the reward model. Complex but proven (ChatGPT, Claude).
- DPO: Direct Preference Optimization skips the reward model. Directly optimize the policy from preference pairs using a classification-like loss. Simpler, no RL infrastructure needed. Used by LLaMA 3, Zephyr.
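A sketch of the DPO loss on a batch of preference pairs; each argument is assumed to be the summed log-probability of a full response under the policy or the frozen reference model, computed elsewhere.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """DPO: -log sigmoid(beta * [(logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)]).

    Each argument is the summed log-probability of a full response (scalar tensor or batch);
    beta controls how far the policy may drift from the reference model.
    """
    chosen_reward = policy_chosen_logp - ref_chosen_logp         # implicit reward of chosen
    rejected_reward = policy_rejected_logp - ref_rejected_logp   # implicit reward of rejected
    return -F.logsigmoid(beta * (chosen_reward - rejected_reward)).mean()
```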
Key Models
Model Comparison
| Model | Params | Training Data | Context | Key Innovation | Best For |
|---|---|---|---|---|---|
| GPT-3 | 175B | 300B tokens | 2K | In-context learning | Historical reference |
| LLaMA 2 | 7–70B | 2T tokens | 4K | GQA, open weights | Open-source baseline |
| LLaMA 3 | 8–405B | 15T tokens | 128K | Data scale + quality | Best open model |
| Mistral 7B | 7B | Undisclosed | 32K | Sliding window attn | Efficient inference |
| Mixtral 8×7B | 47B (13B active) | Undisclosed | 32K | Sparse MoE | Quality/cost ratio |
| Phi-3 Mini | 3.8B | 3.3T tokens | 128K | Data curation | On-device / edge |
When NOT to Use Decoder-Only
Pure Classification with Small Labeled Data
If you have 5K labeled examples and need to classify text into 10 categories, a fine-tuned DeBERTa (400M params) will match or beat a prompted LLaMA 70B at 1/175th the compute. Encoders are purpose-built for this. Don't use a cannon to kill a mosquito.
Retrieval & Embedding Tasks
Dense retrieval, semantic search, sentence similarity. Encoder-only models (Sentence-BERT, E5, GTE) produce better embeddings than decoder models because bidirectional attention captures meaning from both directions. A 110M encoder for embeddings + a vector database is the standard retrieval stack.
Latency-Critical Understanding-Only Tasks
If you only need to understand text (not generate it) and latency matters, encoder-only is faster. A single forward pass through BERT takes ~5ms. Generating 100 tokens from an LLM takes 500ms–2s. For real-time classification pipelines processing millions of items, this difference is decisive.
Decision Framework
Choose Decoder-Only If...
- Your task requires generating text, code, or structured output
- You need conversational / multi-turn interaction
- Complex reasoning or chain-of-thought is required
- You want few-shot / zero-shot learning without fine-tuning
- The task requires world knowledge or commonsense reasoning
- You need flexibility across many task types (one model for everything)
- You're building an agent that uses tools, calls APIs, or writes code
- Your budget allows for the higher compute cost
Use Cases in Practice
Code Completion & Generation
Developer writes a function signature and docstring. The model autoregressively generates the implementation. CodeLlama, StarCoder, and DeepSeek-Coder are trained on hundreds of billions of code tokens. Fill-in-the-middle (FIM) training allows completion at any cursor position, not just the end.
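As an illustration of the FIM prompt format, the file is split at the cursor and reordered with sentinel tokens so the model generates the missing middle. Sentinel names vary by model; the ones below follow the StarCoder-style convention and are shown purely for illustration.

```python
# Fill-in-the-middle (FIM): prefix = text before the cursor, suffix = text after it.
prefix = 'def mean(values):\n    """Return the arithmetic mean."""\n    '
suffix = "\n    return total / len(values)\n"
fim_prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
# The model's completion fills the cursor position, e.g. "total = sum(values)".
```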
Customer Support Chatbot
Fine-tune a 7B–13B model on your company's support conversations. RAG retrieves relevant docs from a knowledge base. The model generates human-like responses grounded in your documentation. LoRA fine-tuning on 10K conversations takes hours on a single A100. Serve at ~50 tokens/sec per user.
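A hedged sketch of a LoRA setup using the Hugging Face peft library; the base model, rank, and target modules are illustrative choices rather than recommendations, and the training loop itself is omitted.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"                  # illustrative base model
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# LoRA: train small low-rank adapters on the attention projections, freeze everything else.
config = LoraConfig(
    r=16,                                # adapter rank
    lora_alpha=32,                       # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()       # typically well under 1% of the base model's weights
```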
Document Summarization
Input: long document (legal filing, research paper, earnings call transcript). Output: concise summary. With 128K context (LLaMA 3), most documents fit in a single prompt. For longer documents, hierarchical summarization: summarize chunks, then summarize summaries.
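A sketch of hierarchical summarization; `summarize(text) -> str` is a placeholder for whatever LLM call you use, and the character-based chunking is a stand-in for real token counting.

```python
def hierarchical_summary(document: str, summarize, chunk_tokens: int = 8000) -> str:
    """Summarize a document that exceeds the context window by summarizing chunks, then
    summarizing the summaries; `summarize(text) -> str` is any LLM summarization call."""
    chunk_chars = chunk_tokens * 4                      # rough chars-per-token heuristic
    chunks = [document[i:i + chunk_chars] for i in range(0, len(document), chunk_chars)]

    if len(chunks) == 1:
        return summarize(document)                      # fits in one prompt: summarize directly

    partial = [summarize(chunk) for chunk in chunks]    # level 1: summarize each chunk
    return hierarchical_summary("\n\n".join(partial), summarize, chunk_tokens)   # recurse
```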
Structured Data Extraction via Prompting
Extract structured information from unstructured text without training a custom model. "Extract the following fields from this invoice: vendor_name, date, total, line_items. Output as JSON." With constrained decoding or tool calling, the output is guaranteed to be valid JSON. Replaces custom NER pipelines for many extraction tasks.
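A sketch of prompt-based extraction with a validation step; `llm_complete(prompt) -> str` is a placeholder for any LLM call, ideally run with constrained or JSON-mode decoding so the output is guaranteed to parse.

```python
import json

EXTRACTION_PROMPT = """Extract the following fields from this invoice: vendor_name, date,
total, line_items. Output only valid JSON with exactly those keys.

Invoice text:
{invoice_text}
"""

def extract_invoice(invoice_text: str, llm_complete) -> dict:
    """Run the extraction prompt and validate the returned JSON."""
    raw = llm_complete(EXTRACTION_PROMPT.format(invoice_text=invoice_text))
    data = json.loads(raw)                              # fails loudly if the output isn't JSON
    missing = {"vendor_name", "date", "total", "line_items"} - data.keys()
    if missing:
        raise ValueError(f"model omitted fields: {missing}")
    return data
```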
Scaling & Emergent Abilities
What Emerges at Each Scale
- 1–3B params: Fluent text generation, basic instruction following, simple code completion. Good for on-device use. (Phi-3 Mini, LLaMA 3.2 3B)
- 7–13B params: Reliable multi-turn conversation, moderate reasoning, consistent code generation, basic math. The sweet spot for fine-tuning. (LLaMA 2 13B, Mistral 7B)
- 30–70B params: Complex reasoning, chain-of-thought, multi-step problem solving, strong code generation, nuanced instruction following. (LLaMA 3 70B)
- 100B+ params: Advanced mathematical reasoning, sophisticated multi-step planning, reliable tool use, strong multilingual performance. (GPT-4, Claude, LLaMA 3 405B)
The Chinchilla Insight
Hoffmann et al. (DeepMind 2022) showed that for a fixed compute budget, you should scale model size and data roughly equally. A 70B model trained on 1.4T tokens outperforms a 175B model trained on 300B tokens. LLaMA's success built on this insight — training smaller models on much more data than conventional wisdom suggested.
The Data Wall
The current frontier challenge: we may be running out of high-quality training data. LLaMA 3 used 15T tokens — approaching the total amount of quality text on the internet. The response: synthetic data (Phi-3 was trained on textbook-quality synthetic data), multimodal data (images, video, audio as additional training signal), and inference-time compute (spending more time thinking per query rather than pretraining longer).
The Trajectory
The open questions now are about efficiency (MoE, speculative decoding, quantization), alignment (how to make models reliably safe and helpful), and capability (test-time compute, reasoning, tool use, agents). The architecture itself has largely stabilized — the innovations are in data, training, and inference.