Transformer Distillation
Knowledge distillation transfers the learned behavior of a large "teacher" model into a smaller "student" model. In the transformer era, this means compressing 70B+ parameter LLMs into models that are 5–50x smaller while retaining most of their capability. Distillation has become the primary method for creating efficient, deployable language models — from DistilBERT to Orca to the Phi series.
When to Use Distillation
Deploying LLMs at Lower Cost
Compute & Memory: A 70B model requires multiple GPUs and costs $2–5 per million tokens to serve. A distilled 7B student can run on a single GPU at 10–20x lower cost while retaining 85–95% of the teacher's quality on targeted tasks.
Latency-Critical Serving
Real-Time Inference: Chat applications, autocomplete, code suggestions, search ranking — anywhere users expect sub-200ms responses. Smaller students generate tokens faster due to fewer parameters, fewer attention heads, and shorter forward passes.
Edge & Mobile Inference
On-Device: Running LLMs on phones, laptops, or embedded devices. Models like Phi-3-mini (3.8B) and Gemma-2B are designed for on-device deployment, built using distillation and careful data curation to maximize capability per parameter.
Specializing for a Domain
Task Focus: A general-purpose 70B model knows everything but is expensive. Distill its knowledge for medical Q&A, legal analysis, or code review into a 7B specialist that outperforms the teacher on that narrow domain while being deployable on modest hardware.
Reducing API Dependency
Open Models: Distill a proprietary model's outputs (GPT-4, Claude) into an open-weight student you control. This eliminates vendor lock-in, recurring API costs, and latency from network calls. The Alpaca and Vicuna projects demonstrated the pattern: GPT-3.5/4 outputs used to fine-tune LLaMA.
On-Device Privacy
Data Sovereignty: Sensitive data (medical records, legal documents, financial data) cannot leave the device or network. A distilled on-device model processes everything locally — no data sent to external APIs, full compliance with data residency requirements.
What to Look for in Your Data
Teacher-Generated Synthetic Data
Primary Signal: The teacher generates responses to a diverse set of prompts. These outputs — including the full probability distributions (logits) over the vocabulary — become the student's training data. The soft targets contain "dark knowledge": which alternative tokens the teacher considered plausible.
Prompt Diversity
Coverage: The student can only learn what it sees. Diverse prompts across topics, difficulty levels, formats, and languages ensure broad coverage. Orca used 5M diverse prompts from FLAN; Phi used synthetically generated "textbook" exercises spanning math, code, and reasoning.
Chain-of-Thought Traces
Reasoning Signal: When the teacher explains its reasoning step-by-step, the student learns how to think, not just what to answer. Orca showed that training on CoT explanations from GPT-4 dramatically improved a 13B model's reasoning. The traces serve as an implicit curriculum.
Instruction-Response Pairs
Behavioral Alignment: Structured pairs of (instruction, response) teach the student to follow instructions. Alpaca used 52K instruction-response pairs generated by GPT-3.5. The format matters: system prompts, multi-turn conversations, and varied instruction styles all contribute to robust instruction-following.
Quality vs. Quantity Tradeoffs
Data Curation: More data is not always better. Phi-1 demonstrated that 1.3B parameters trained on 6B tokens of "textbook quality" synthetic data outperformed models trained on 100x more web data. Careful filtering, deduplication, and curation matter more than raw volume.
Filtering & Curation Strategies
Phi Approach: Microsoft's Phi series pioneered aggressive data curation: use a strong model to classify and filter web data by educational value, then generate synthetic "textbook" data to fill gaps. The key insight: a small model trained on excellent data beats a larger model trained on mediocre data.
Architecture
The core distillation setup: a large teacher transformer generates soft logits and optionally chain-of-thought traces. The student transformer processes the same input and is trained to match the teacher's output distribution via KL divergence, with optional hidden-state alignment between corresponding layers.
Key Distillation Methods
Logit Distillation for Transformers
logits: [B, L, V], V = 32K–128K. The foundational method: the student is trained to match the teacher's softened output distribution. Both models produce logits of shape [B, L, V] — batch B, sequence length L, vocabulary size V. The teacher's logits z_T are divided by temperature τ before the softmax, producing a smoother distribution that reveals inter-token relationships:
L_KD = τ² · KL( softmax(z_T/τ) ∥ softmax(z_S/τ) )
The KL divergence is averaged over all B · L positions. The τ² weighting compensates for the fact that the gradient of the soft-target loss scales as 1/τ², keeping its magnitude comparable to the hard-label loss.
For LLMs with vocabularies of 32K–128K tokens, the soft distribution over the full vocabulary provides a massively richer training signal than a single ground-truth token label.
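As a concrete sketch of the loss above (assuming PyTorch; the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, tau=4.0):
    """Temperature-scaled KL(teacher || student), averaged over all
    B*L positions. Both logit tensors have shape [B, L, V]."""
    vocab = student_logits.size(-1)
    log_p_student = F.log_softmax(student_logits.reshape(-1, vocab) / tau, dim=-1)
    p_teacher = F.softmax(teacher_logits.reshape(-1, vocab) / tau, dim=-1)
    # F.kl_div expects log-probs for the input and probs for the target;
    # the tau**2 factor restores the gradient magnitude.
    return tau**2 * F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```

Note the direction: the target is the teacher's distribution, so identical logits give zero loss regardless of τ.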
Chain-of-Thought Distillation
Orca / WizardLM Pattern: The teacher generates step-by-step reasoning traces, and the student is trained to reproduce both the reasoning process and the final answer. Microsoft's Orca showed that a 13B model trained on GPT-4's chain-of-thought explanations matched GPT-3.5's performance on reasoning benchmarks. The key: prompt the teacher with "explain your reasoning step by step" to extract detailed traces.
Instruction Tuning as Distillation
Alpaca / Vicuna Pattern: Use a large model to generate (instruction, response) pairs, then fine-tune a smaller model on those pairs. Stanford's Alpaca: 52K instructions generated by GPT-3.5, used to fine-tune LLaMA-7B. Vicuna: 70K ShareGPT conversations used to fine-tune LLaMA-13B to ~90% of ChatGPT quality. This is "black-box" distillation — you only need the teacher's text outputs, not its logits.
- Self-Instruct pipeline: Seed with a few examples, generate diverse instructions, filter for quality, generate responses
- Key limitation: Without access to the teacher's logits, you lose the dark knowledge in the soft distribution
- Practical advantage: Works with any teacher, including proprietary APIs where logits are unavailable
Progressive / Layer-wise Distillation
hidden: [B, L, d_S] ↔ [B, L, d_T]. Rather than distilling all at once, transfer knowledge layer by layer or in stages. TinyBERT aligns each student layer to a corresponding teacher layer, matching both attention maps and hidden states. Typical dimensions: student has d_S = 312–768, teacher has d_T = 768–1024, with H = 8–16 attention heads.
- Layer mapping: student layer i maps to teacher layer f(i) — typically evenly spaced (e.g., 6-to-12 maps student layer 3 to teacher layer 6)
- Attention transfer: MSE between attention matrices A_S, A_T ∈ [B, H, L, L] at mapped layers
- Hidden-state transfer: MSE between hidden states H_S ∈ [B, L, d_S] and W_proj · H_T ∈ [B, L, d_S], where W_proj ∈ [d_S, d_T] aligns the dimensions
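A minimal PyTorch sketch of the hidden-state alignment and layer mapping described above (class and function names, and the specific dimensions, are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HiddenAlign(nn.Module):
    """Projects teacher hidden states [B, L, d_T] down to the student
    width d_S and penalizes the MSE against the student's states."""
    def __init__(self, d_student, d_teacher):
        super().__init__()
        self.w_proj = nn.Linear(d_teacher, d_student, bias=False)

    def forward(self, h_student, h_teacher):
        # h_student: [B, L, d_S], h_teacher: [B, L, d_T]
        return F.mse_loss(h_student, self.w_proj(h_teacher))

def layer_map(n_student, n_teacher):
    """Evenly spaced mapping: student layer i -> teacher layer f(i),
    e.g. 6-to-12 maps student layer 3 to teacher layer 6."""
    stride = n_teacher // n_student
    return {i: i * stride for i in range(1, n_student + 1)}
```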
Self-Distillation in Transformers
Model as Its Own Teacher: A model distills knowledge into itself across training stages. The model at epoch N becomes the teacher for epoch N+1. Born-Again Networks showed this improves accuracy even without compression. In LLMs, self-distillation appears in iterative refinement: the model generates, evaluates, and retrains on its own improved outputs. Variants include deeper layers teaching shallower layers within the same forward pass.
Mean Teacher / EMA Parameter Updates
Momentum-Based Self-Distillation: Instead of training a separate teacher, the teacher is the student — but a slow-moving exponential moving average (EMA) of its parameters. After each gradient step on the student, the teacher weights are updated as θ_T ← m · θ_T + (1 − m) · θ_S, where m is a momentum coefficient (typically 0.996–0.999). This gives the teacher a smoother, more stable representation than any single training snapshot. Introduced as Mean Teacher (Tarvainen & Valpola, 2017) for semi-supervised learning, this pattern became foundational in self-supervised methods like BYOL and DINO, where the momentum teacher provides stable targets that prevent representation collapse without requiring negative pairs.
- DINO / DINOv2: Vision Transformer self-distillation using a momentum teacher (m=0.996→1.0 cosine schedule), centering, and multi-crop augmentation — no labels required
- BYOL: Bootstrap Your Own Latent uses an EMA teacher to provide regression targets, showing that negative pairs are unnecessary for self-supervised representation learning
- Practical benefit: The EMA teacher is smoother than any checkpoint — it averages over the noisy optimization trajectory, acting as a form of ensemble without extra cost
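The EMA update is a one-liner per parameter; a PyTorch sketch (the function name is illustrative):

```python
import copy
import torch

@torch.no_grad()
def ema_update(teacher, student, m=0.999):
    """Mean Teacher update after each student gradient step:
    theta_T <- m * theta_T + (1 - m) * theta_S."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_(p_s, alpha=1.0 - m)

# Usage sketch: the teacher starts as a frozen copy of the student.
student = torch.nn.Linear(8, 8)
teacher = copy.deepcopy(student).requires_grad_(False)
```

Only the student receives gradients; the teacher is updated purely by this averaging step.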
On-Policy Distillation (GKD)
Student Generates, Teacher Scores: Standard distillation is "off-policy": the student learns from the teacher's outputs. In on-policy / Generalized Knowledge Distillation, the student generates outputs, and the teacher provides feedback on those specific outputs. This avoids the train-test distribution mismatch that occurs when the student only ever sees the teacher's generations during training.
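A minimal on-policy step might look like this (a simplified sketch, not the exact GKD objective; `student_lm` and `teacher_lm` are hypothetical callables mapping a token sequence [B, L] to next-token logits [B, V]):

```python
import torch
import torch.nn.functional as F

def on_policy_kd_loss(student_lm, teacher_lm, prompt, steps=8, tau=1.0):
    """Sample a continuation from the student, then penalize the KL
    between teacher and student distributions at each sampled position."""
    seq, loss = prompt, 0.0
    for _ in range(steps):
        s_logits = student_lm(seq)                      # [B, V]
        with torch.no_grad():
            t_logits = teacher_lm(seq)                  # [B, V]
            next_tok = torch.multinomial(               # sample on-policy
                F.softmax(s_logits / tau, dim=-1), 1)   # [B, 1]
        loss = loss + tau**2 * F.kl_div(
            F.log_softmax(s_logits / tau, dim=-1),
            F.softmax(t_logits / tau, dim=-1),
            reduction="batchmean",
        )
        seq = torch.cat([seq, next_tok], dim=1)
    return loss / steps
```

The key difference from off-policy distillation is that the sequence being scored was sampled from the student, so the loss covers states the student will actually visit at inference time.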
Training Pipeline
1. Choose or Access Teacher Model
Select a high-quality teacher: GPT-4, Claude, LLaMA-70B, Mixtral-8x22B. If you have API-only access (no logits), you'll use black-box distillation (instruction tuning on teacher outputs). If you have full model weights, white-box distillation with logit matching is more effective.
2. Generate Synthetic Dataset
Design diverse prompts spanning your target use cases. Run the teacher to generate high-quality responses. For reasoning tasks, request chain-of-thought. For factual tasks, request citations. Volume: 50K–5M examples depending on domain breadth. Filter for quality, remove duplicates, balance topic distribution.
3. Design Student Architecture
Typical compression: reduce layers from 80 to 32, heads from 64 to 32, hidden dim from 8192 to 4096. Common ratios: 10:1 to 50:1 parameter reduction. Initialize from a pretrained base (e.g., LLaMA-7B) rather than from scratch — the student needs a foundation of language understanding to absorb the teacher's knowledge effectively.
4. Train with Distillation Loss
Combine KL divergence on soft targets with cross-entropy on hard labels. Start with high alpha (0.7–0.9) to emphasize the teacher signal, then optionally anneal toward hard labels. Monitor the student's performance on held-out examples from the target distribution, not just the training loss.
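The combined objective in this step can be sketched as follows (assuming PyTorch; the α and τ defaults are illustrative):

```python
import torch
import torch.nn.functional as F

def distill_step_loss(student_logits, teacher_logits, labels, alpha=0.7, tau=4.0):
    """alpha-weighted mix of soft-target KL and hard-label cross-entropy.
    Shapes: logits [B, L, V], labels [B, L] (token ids)."""
    vocab = student_logits.size(-1)
    s = student_logits.reshape(-1, vocab)
    t = teacher_logits.reshape(-1, vocab)
    soft = tau**2 * F.kl_div(
        F.log_softmax(s / tau, dim=-1),
        F.softmax(t / tau, dim=-1),
        reduction="batchmean",
    )
    hard = F.cross_entropy(s, labels.reshape(-1))
    return alpha * soft + (1.0 - alpha) * hard
```

Annealing toward hard labels amounts to decaying `alpha` over the course of training.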
5. Fine-Tune on Task-Specific Data
After general distillation, fine-tune on your specific downstream task with real (non-synthetic) data. This "sharpens" the student for your use case. Use a lower learning rate (1/10th of distillation LR) to avoid catastrophic forgetting. Evaluate on real-world benchmarks relevant to your deployment.
Key Models & Papers
| Model | Teacher | Student Size | Method | Key Result |
|---|---|---|---|---|
| DistilBERT | BERT-base (110M) | 66M | Logit + embedding | 97% of BERT, 60% faster |
| TinyBERT | BERT-base (110M) | 14.5M | Attention + hidden state | 96% of BERT, 9.4x faster |
| Alpaca | GPT-3.5 (API) | 7B (LLaMA) | Instruction tuning | ~80% of ChatGPT quality |
| Orca | GPT-4 (API) | 13B (LLaMA) | CoT distillation | Matched GPT-3.5 on reasoning |
| Phi-3-mini | Larger Phi + synthetic | 3.8B | Data curation + distillation | Matched Mixtral-8x7B |
| Minitron-8B | Nemotron-15B | 8B | Pruning + distillation | 95% of teacher accuracy |
When NOT to Use Distillation
When the Teacher Isn't Good Enough
Distillation transfers the teacher's behavior — including its errors and biases. If the teacher hallucinates on medical questions, the student will too. For safety-critical domains, distillation must be paired with careful evaluation and alignment. You cannot distill capabilities the teacher doesn't have.
When You Need Frontier-Level Performance
Distillation inherently involves a quality loss. If your application requires the absolute best possible quality (competitive coding, advanced math, nuanced legal reasoning), the distilled student will always trail the teacher. For these cases, serve the large model directly and optimize with quantization instead.
When Quantization or Pruning Suffices
If you only need 2–4x compression, quantization (INT8, INT4, GPTQ, AWQ) is simpler, faster, and requires no training data or teacher access. Distillation shines at 10x+ compression ratios where quantization alone cannot maintain quality.
| Method | Compression | Quality Loss | Training Needed | Best For |
|---|---|---|---|---|
| Distillation | 10–50x | Moderate (5–20%) | Extensive (days–weeks) | Massive compression, domain specialization |
| Quantization | 2–4x | Minimal (1–5%) | None or minimal | Quick deployment, same architecture |
| Pruning | 2–10x | Low-Moderate | Moderate (fine-tuning) | Structured removal of redundancy |
| Train from Scratch | N/A | Depends on data/scale | Massive (weeks–months) | When you have abundant data and compute |
Decision Framework
Choose Distillation If You Need 10x+ Compression
When quantization alone cannot give you the speedup or memory savings you need. If you must go from 70B to 7B or from 13B to 1.3B, distillation is the right tool. The student model is a fundamentally different (smaller) architecture.
Choose Distillation If You Have a Clear Target Domain
A general-purpose 7B model trained from scratch is mediocre at everything. A 7B model distilled from a 70B teacher specifically for your domain can be excellent at that domain. The teacher's knowledge is focused through the lens of your domain-specific training data.
Choose Distillation If You Have Teacher Access
You need either (a) API access to a strong model to generate training data (black-box), or (b) full model weights to extract logits (white-box). Without teacher access, you're training from scratch. White-box distillation generally transfers a richer signal, but it requires open-weight teachers.
Choose Distillation If Latency Matters More Than Peak Quality
If you'd rather have 90% quality at 10ms than 100% quality at 200ms, distillation is for you. This is the right tradeoff for production systems serving millions of requests: search ranking, content filtering, chat, autocomplete.
Choose Distillation If You Want to Own Your Model
API dependency means variable costs, rate limits, potential discontinuation, and no control over model updates. Distilling into your own model gives you a fixed asset you control: predictable costs, offline capability, version stability, and the ability to further fine-tune.
Practical Recipes
Recipe 1: Domain-Specific GPT-4-Level 7B Model
Most Common Use Case. Goal: Compress GPT-4-class performance on a narrow domain (e.g., customer support, medical triage) into a deployable 7B model.
- Collect 10K domain-specific questions/scenarios
- Generate GPT-4 responses with chain-of-thought for each
- Start with a pretrained 7B base (LLaMA-3, Mistral, Qwen)
- Fine-tune with standard cross-entropy on teacher outputs
- Evaluate on held-out domain data; iterate on prompt diversity
- Expected: 85–95% of GPT-4 quality on your domain at 1/50th the cost
Recipe 2: Fast Inference from Large Ensemble
Multi-Teacher. Goal: Combine knowledge from multiple large models into a single fast student.
- Run 3–5 teacher models on the same prompts
- Average their logit distributions (or take majority vote for hard labels)
- Train student on the ensembled soft targets
- The student can outperform any individual teacher because the ensemble smooths out individual model errors
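The ensembling step can be sketched as averaging the teachers' temperature-softened distributions (one reasonable choice; averaging probabilities rather than raw logits keeps differently calibrated teachers comparable):

```python
import torch
import torch.nn.functional as F

def ensemble_soft_targets(teacher_logits_list, tau=2.0):
    """Average the teachers' softened distributions into a single
    soft target of shape [B, L, V] for the student to match."""
    probs = [F.softmax(z / tau, dim=-1) for z in teacher_logits_list]
    return torch.stack(probs, dim=0).mean(dim=0)
```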
Recipe 3: Task-Specific Small Model
Focused Compression. Goal: Create a tiny (1–3B) model that does one thing exceptionally well (e.g., sentiment analysis, entity extraction, SQL generation).
- Generate 100K–500K task-specific examples from a large teacher
- Use a very small student architecture (1–3B, 12–24 layers)
- Heavy data augmentation: paraphrases, edge cases, adversarial examples
- Distilled small models often beat general-purpose models 10x their size on the target task
Recipe 4: Progressive Distillation for Generation Speedup
Diffusion / Iterative Models. Goal: Reduce the number of sampling steps in diffusion or iterative generation models.
- Teacher: original model running N steps (e.g., 128 diffusion steps)
- Student: same architecture but trained to produce equivalent output in N/2 steps
- Repeat: distill N/2 into N/4, then N/4 into N/8
- Result: 8–16x generation speedup with minimal quality loss
- Applied successfully in Stable Diffusion step distillation; a related idea (distilling a small draft model) underlies speculative decoding for LLMs
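The halving schedule above can be sketched as a loop (conceptual; `train_half_steps` is a hypothetical routine that trains a student to reproduce the teacher's output in the given number of steps):

```python
def progressive_distill(model, train_half_steps, start_steps=128, target_steps=8):
    """Repeatedly distill an N-step sampler into an N/2-step student:
    128 -> 64 -> 32 -> 16 -> 8. Each stage's student becomes the
    next stage's teacher."""
    teacher, steps = model, start_steps
    while steps > target_steps:
        steps //= 2
        teacher = train_half_steps(teacher, steps)
    return teacher
```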
Hyperparameter Guide
| Hyperparameter | Typical Range | Guidance |
|---|---|---|
| Temperature (τ) | 2–20 | Start at τ=4. Higher τ for diverse tasks, lower τ for factual/classification tasks. τ=1 reduces to standard training. |
| Alpha (α) | 0.5–0.9 | Weight of distillation loss vs. hard-label loss. Start at 0.7. Higher alpha = more teacher reliance. Reduce alpha late in training. |
| Student Architecture | 1/4 to 1/10 of teacher | Common: halve layers and hidden dim. 70B → 7B, 13B → 1.3B. Use teacher's tokenizer for vocab compatibility. |
| Learning Rate | 1e-5 to 5e-4 | Lower than pretraining LR. Use cosine schedule with warmup. Typical: 2e-5 for fine-tuning, 1e-4 for full distillation. |
| Batch Size | 64–512 | Larger batches stabilize KL divergence optimization. Scale with gradient accumulation if GPU-limited. |
| Loss Type | Logit-only vs. hidden-state | Start logit-only (simpler, fewer hyperparams). Add hidden-state alignment only if logit-only plateaus. Requires layer mapping. |
| Data Volume | 50K–5M examples | Domain-specific: 50K–500K. General-purpose: 1M–5M. Quality > quantity. Filter aggressively. |
The Trajectory
Synthetic data scaling: As teacher models improve, the quality of synthetic training data improves. This creates a virtuous cycle: better teachers produce better synthetic data, which trains better students, which in turn can become teachers for the next generation. The Phi series demonstrated this can work across multiple generations.
Distillation-aware pretraining: Future models may be pretrained with distillation in mind — explicitly designed to be good teachers by producing informative output distributions. This means training objectives that encourage richer soft targets and more transferable hidden representations.
Open-weight ecosystem: The availability of strong open-weight models (LLaMA, Mistral, Qwen, Gemma) as both teachers and student bases has democratized distillation. Any organization can now create competitive domain-specific models without training from scratch.
The smaller-is-better trend: Phi-3-mini (3.8B) matching Mixtral-8x7B (46.7B) showed that the floor for "useful" model size keeps dropping. With better distillation techniques and data curation, we may see 1–3B models that match today's 7–13B models within a year. The practical implication: on-device LLMs are becoming viable for an expanding range of tasks.