Transformer Distillation

Student-Teacher Methods for Compressing Large Language Models

Knowledge distillation transfers the learned behavior of a large "teacher" model into a smaller "student" model. In the transformer era, this means compressing 70B+ parameter LLMs into models that are 5–50x smaller while retaining most of their capability. Distillation has become the primary method for creating efficient, deployable language models — from DistilBERT to Orca to the Phi series.

When to Use Distillation

The core insight: A large teacher model encodes rich knowledge in its output distribution — the relative probabilities across all tokens contain far more information than the hard label alone. A student trained on these soft targets learns faster and generalizes better than one trained only on ground-truth data.

Deploying LLMs at Lower Cost

Compute & Memory

A 70B model requires multiple GPUs and costs $2–5 per million tokens to serve. A distilled 7B student can run on a single GPU at 10–20x lower cost while retaining 85–95% of the teacher's quality on targeted tasks.

Latency-Critical Serving

Real-Time Inference

Chat applications, autocomplete, code suggestions, search ranking — anywhere users expect sub-200ms responses. Smaller students generate tokens faster due to fewer parameters, fewer attention heads, and shorter forward passes.

Edge & Mobile Inference

On-Device

Running LLMs on phones, laptops, or embedded devices. Models like Phi-3-mini (3.8B) and Gemma-2B are designed for on-device deployment, built using distillation and careful data curation to maximize capability per parameter.

Specializing for a Domain

Task Focus

A general-purpose 70B model knows everything but is expensive. Distill its knowledge for medical Q&A, legal analysis, or code review into a 7B specialist that outperforms the teacher on that narrow domain while being deployable on modest hardware.

Reducing API Dependency

Open Models

Distill a proprietary model's outputs (GPT-4, Claude) into an open-weight student you control. Eliminates vendor lock-in, recurring API costs, and latency from network calls. The Alpaca and Vicuna projects demonstrated this pattern: GPT-3.5/4 outputs used to fine-tune LLaMA.

On-Device Privacy

Data Sovereignty

Sensitive data (medical records, legal documents, financial data) cannot leave the device or network. A distilled on-device model processes everything locally — no data sent to external APIs, full compliance with data residency requirements.

What to Look for in Your Data

Teacher-Generated Synthetic Data

Primary Signal

The teacher generates responses to a diverse set of prompts. These outputs — including the full probability distributions (logits) over the vocabulary — become the student's training data. The soft targets contain "dark knowledge": which alternative tokens the teacher considered plausible.

Prompt Diversity

Coverage

The student can only learn what it sees. Diverse prompts across topics, difficulty levels, formats, and languages ensure broad coverage. Orca used 5M diverse prompts from FLAN; Phi used synthetically generated "textbook" exercises spanning math, code, and reasoning.

Chain-of-Thought Traces

Reasoning Signal

When the teacher explains its reasoning step-by-step, the student learns how to think, not just what to answer. Orca showed that training on CoT explanations from GPT-4 dramatically improved a 13B model's reasoning. The traces serve as an implicit curriculum.

Instruction-Response Pairs

Behavioral Alignment

Structured pairs of (instruction, response) teach the student to follow instructions. Alpaca used 52K instruction-response pairs generated by GPT-3.5. The format matters: system prompts, multi-turn conversations, and varied instruction styles all contribute to robust instruction-following.

Quality vs. Quantity Tradeoffs

Data Curation

More data is not always better. Phi-1 demonstrated that 1.3B parameters trained on 6B tokens of "textbook quality" synthetic data outperformed models trained on 100x more web data. Careful filtering, deduplication, and curation matter more than raw volume.

Filtering & Curation Strategies

Phi Approach

Microsoft's Phi series pioneered aggressive data curation: use a strong model to classify and filter web data by educational value, then generate synthetic "textbook" data to fill gaps. The key insight: a small model trained on excellent data beats a larger model trained on mediocre data.

Rule of thumb: Start with 50K–500K high-quality teacher-generated examples for a domain-specific distillation. For general-purpose student models, 1M–5M diverse examples are typical. Always validate on held-out data from the target distribution.

Architecture

The core distillation setup: a large teacher transformer generates soft logits and optionally chain-of-thought traces. The student transformer processes the same input and is trained to match the teacher's output distribution via KL divergence, with optional hidden-state alignment between corresponding layers.

(Architecture diagram: a teacher — e.g. an 80-layer 70B LLM with 64 attention heads and a large d_ff — and a student — e.g. a 32-layer 7B LLM with 32 heads and a smaller d_ff — process the same input tokens x1, x2, ..., xn. The teacher's temperature-softened softmax over its logits zT provides soft targets, optionally alongside chain-of-thought traces; the student's distribution softmax(zS / T) is matched to it via KL divergence, with optional hidden-state alignment (MSE) between mapped layers. Combined loss: L = α · T² · KL(pT ∥ pS) + (1 − α) · CE(y, pS), where T is the temperature and α the distillation weight.)
Temperature scaling: Higher T (2–20) softens the teacher's output distribution, exposing more of the "dark knowledge" — which tokens the teacher considered plausible alternatives. At T=1, the distribution is peaked; at T=10, it reveals subtle relationships between tokens that help the student generalize.
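As a toy illustration (invented logits, not from any real model), dividing the logits by T before the softmax flattens the distribution and exposes the plausible alternatives:

```python
import math

def softmax(logits, T=1.0):
    """Softmax with temperature T; higher T flattens the distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)                            # stabilize before exponentiating
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Invented teacher logits for candidate tokens ["cat", "kitten", "dog", "car"]
logits = [8.0, 5.0, 3.0, 0.5]
hard = softmax(logits, T=1.0)   # peaked: nearly all mass on "cat"
soft = softmax(logits, T=5.0)   # softened: alternatives become visible
```

At T=1 almost all probability sits on the top token; at T=5 the runner-up tokens carry meaningful probability, which is exactly the signal the student trains on.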

Key Distillation Methods

Logit Distillation for Transformers

logits: [B, L, V], with vocabulary size V = 32K–128K

The foundational method: the student is trained to match the teacher's softened output distribution. Both models produce logits of shape [B, L, V] — batch B, sequence length L, vocabulary size V. The teacher's logits zT are divided by temperature τ before softmax, producing a smoother distribution that reveals inter-token relationships:

LKD = τ² · KL( softmax(zT/τ) ∥ softmax(zS/τ) )

The KL divergence is averaged over all B · L positions. The τ² factor compensates for the soft-target gradients scaling as 1/τ², keeping the distillation term's magnitude comparable to the hard-label loss across temperatures.

(Figure: at T = 1 the distribution is peaked, with one dominant token; dividing the logits by T = 5 yields a smooth distribution that reveals dark knowledge.)

For LLMs with vocabularies of 32K–128K tokens, the soft distribution over the full vocabulary provides a massively richer training signal than a single ground-truth token label.
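A minimal NumPy sketch of this loss combined with a hard-label term, L = α · τ² · KL + (1 − α) · CE, assuming teacher and student logits are available as [B, L, V] arrays (function and variable names are illustrative):

```python
import numpy as np

def softmax(z, tau=1.0):
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(z_teacher, z_student, labels, tau=4.0, alpha=0.7):
    """L = alpha * tau^2 * KL(p_T || p_S) + (1 - alpha) * CE(y, p_S).

    z_teacher, z_student: [B, L, V] logits; labels: [B, L] token ids.
    KL is computed at temperature tau and averaged over all B*L positions;
    CE uses the student's unsoftened (tau = 1) distribution.
    """
    p_t = softmax(z_teacher, tau)
    log_p_s = np.log(softmax(z_student, tau) + 1e-12)
    kl = (p_t * (np.log(p_t + 1e-12) - log_p_s)).sum(axis=-1).mean()

    log_p_hard = np.log(softmax(z_student) + 1e-12)
    ce = -np.take_along_axis(log_p_hard, labels[..., None], axis=-1).mean()
    return alpha * tau ** 2 * kl + (1 - alpha) * ce
```

In a real training loop this would run on framework tensors with autograd; the sketch only shows the shape of the computation.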

Chain-of-Thought Distillation

Orca / WizardLM Pattern

The teacher generates step-by-step reasoning traces, and the student is trained to reproduce both the reasoning process and the final answer. Microsoft's Orca showed that a 13B model trained on GPT-4's chain-of-thought explanations matched GPT-3.5's performance on reasoning benchmarks. The key: prompt the teacher with "explain your reasoning step by step" to extract detailed traces.

(Figure: the 70B teacher answers "What is 47 × 23?" with a full trace — Step 1: 47 × 20 = 940; Step 2: 47 × 3 = 141; Step 3: 940 + 141 = 1081 — and the 7B student, trained on that full trace, learns to break the problem into parts, compute sub-products, and combine the results, producing chain-of-thought output itself: a small model that reasons step-by-step like the teacher.)

Instruction Tuning as Distillation

Alpaca / Vicuna Pattern

Use a large model to generate (instruction, response) pairs, then fine-tune a smaller model on those pairs. Stanford's Alpaca: 52K instructions generated by GPT-3.5, used to fine-tune LLaMA-7B. Vicuna: 70K ShareGPT conversations used to fine-tune LLaMA-13B to ~90% of ChatGPT quality. This is "black-box" distillation — you only need the teacher's text outputs, not its logits.

  • Self-Instruct pipeline: Seed with a few examples, generate diverse instructions, filter for quality, generate responses
  • Key limitation: Without access to the teacher's logits, you lose the dark knowledge in the soft distribution
  • Practical advantage: Works with any teacher, including proprietary APIs where logits are unavailable
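The Self-Instruct-style loop can be sketched as follows, with the teacher call stubbed out (`call_teacher` is a placeholder for whatever API you use, and the quality filtering shown is deliberately minimal):

```python
def call_teacher(prompt: str) -> str:
    """Stub — replace with a real API call to the teacher model."""
    return f"[teacher response to: {prompt}]"

def generate_pairs(seed_instructions, per_seed=2):
    """Expand seed instructions into (instruction, response) training pairs."""
    pairs = []
    for seed in seed_instructions:
        # 1. Ask the teacher to propose new instructions similar to the seed.
        new_instructions = [
            call_teacher(f"Write a task similar to: {seed} (variant {i})")
            for i in range(per_seed)
        ]
        # 2. Ask the teacher to answer each proposed instruction.
        for instr in new_instructions:
            response = call_teacher(instr)
            # 3. Filter: drop empty or duplicate pairs before keeping.
            if response and (instr, response) not in pairs:
                pairs.append((instr, response))
    return pairs

pairs = generate_pairs(["Summarize a news article."])
```

A production pipeline would add deduplication across the whole corpus, quality scoring, and topic balancing on top of this skeleton.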

Progressive / Layer-wise Distillation

hidden states: student [B, L, dS], teacher [B, L, dT]

Rather than distilling all at once, transfer knowledge layer by layer or in stages. TinyBERT aligns each student layer to a corresponding teacher layer, matching both attention maps and hidden states. Typical dimensions: student has dS=312–768, teacher has dT=768–1024, H=8–16 attention heads.

  • Layer mapping: student layer i maps to teacher layer f(i) — typically evenly spaced (e.g., 6-to-12 maps student layer 3 to teacher layer 6)
  • Attention transfer: MSE between attention matrices AS, AT ∈ [B, H, L, L] at mapped layers
  • Hidden-state transfer: MSE between hidden states HS ∈ [B, L, dS] and Wproj · HT ∈ [B, L, dS], where Wproj ∈ [dS, dT] aligns the dimensions
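The three transfer components above can be sketched as follows (shapes follow the definitions in the bullets; function names are illustrative):

```python
import numpy as np

def layer_map(i, n_student, n_teacher):
    """Evenly spaced mapping: student layer i -> teacher layer f(i)
    (e.g. 6-to-12 maps student layer 3 to teacher layer 6)."""
    return i * (n_teacher // n_student)

def hidden_state_loss(h_student, h_teacher, w_proj):
    """MSE between H_S [B, L, d_S] and the projected teacher states
    W_proj . H_T, with W_proj of shape [d_S, d_T]."""
    projected = h_teacher @ w_proj.T          # [B, L, d_T] -> [B, L, d_S]
    return ((h_student - projected) ** 2).mean()

def attention_loss(a_student, a_teacher):
    """MSE between attention maps A_S, A_T of shape [B, H, L, L]."""
    return ((a_student - a_teacher) ** 2).mean()
```

In TinyBERT-style training these per-layer losses are summed over the mapped layer pairs and combined with the logit-distillation loss.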

Self-Distillation in Transformers

Model as Its Own Teacher

A model distills knowledge into itself across training stages. The model at epoch N becomes the teacher for epoch N+1. Born-Again Networks showed this improves accuracy even without compression. In LLMs, self-distillation appears in iterative refinement: the model generates, evaluates, and retrains on its own improved outputs. Variants include deeper layers teaching shallower layers within the same forward pass.

Mean Teacher / EMA Parameter Updates

Momentum-Based Self-Distillation

Instead of training a separate teacher, the teacher is the student — but a slow-moving exponential moving average (EMA) of its parameters. After each gradient step on the student, the teacher weights are updated as: θT ← m · θT + (1 − m) · θS, where m is a momentum coefficient (typically 0.996–0.999). This gives the teacher a smoother, more stable representation than any single training snapshot. Introduced as Mean Teacher (Tarvainen & Valpola, 2017) for semi-supervised learning, this pattern became foundational in self-supervised methods like BYOL and DINO, where the momentum teacher provides stable targets that prevent representation collapse without requiring negative pairs.

(Figure: gradients (SGD/Adam) update only the student θS; the teacher θT receives no gradients and is updated as m·θT + (1 − m)·θS, providing stable pseudo-targets for the loss; the momentum m follows a cosine schedule from 0.996 to 1.0 over the training duration.)
  • DINO / DINOv2: Vision Transformer self-distillation using a momentum teacher (m=0.996→1.0 cosine schedule), centering, and multi-crop augmentation — no labels required
  • BYOL: Bootstrap Your Own Latent uses an EMA teacher to provide regression targets, proving negative pairs are unnecessary for contrastive learning
  • Practical benefit: The EMA teacher is smoother than any checkpoint — it averages over the noisy optimization trajectory, acting as a form of ensemble without extra cost
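The EMA update and a cosine momentum schedule can be sketched as below, with parameters represented as flat lists of floats (a real implementation would iterate over the model's parameter tensors):

```python
import math

def ema_update(theta_teacher, theta_student, m=0.996):
    """EMA update: teacher <- m * teacher + (1 - m) * student.

    Parameters are flat lists of floats here for simplicity."""
    return [m * t + (1 - m) * s for t, s in zip(theta_teacher, theta_student)]

def momentum_at(step, total_steps, m0=0.996):
    """Cosine schedule ramping the momentum from m0 to 1.0 (as in DINO),
    so the teacher gradually freezes toward the end of training."""
    progress = (1 - math.cos(math.pi * step / total_steps)) / 2
    return m0 + (1.0 - m0) * progress
```

Calling `ema_update` after every optimizer step keeps the teacher a smoothed trailing average of the student's trajectory.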

On-Policy Distillation (GKD)

Student Generates, Teacher Scores

Standard distillation is "off-policy": the student learns from the teacher's outputs. In on-policy / Generalized Knowledge Distillation, the student generates outputs, and the teacher provides feedback on those specific outputs. This avoids the train-test distribution mismatch that occurs when the student only ever sees the teacher's generations during training.

(Figure: the student generates an output, the teacher scores or corrects that specific output, and the feedback is used to update the student on its own output distribution.)
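One on-policy data-collection round can be sketched with both models stubbed (names and stub behaviors are illustrative; a real teacher would return log-probabilities of the student's sampled tokens):

```python
import random

def student_generate(prompt):
    """Stub sampler — stands in for sampling from the student model."""
    return random.choice(["answer A", "answer B"])

def teacher_logprob(prompt, output):
    """Stub scorer — stands in for the teacher's log-probability of `output`."""
    return -1.0 if output == "answer A" else -3.0

def on_policy_step(prompts):
    """Collect (prompt, student_output, teacher_score) triples for training."""
    batch = []
    for p in prompts:
        out = student_generate(p)          # sample from the *student*
        score = teacher_logprob(p, out)    # teacher feedback on that sample
        batch.append((p, out, score))
    return batch   # a training step would push the student toward high scores
```

The key difference from off-policy distillation is visible in the loop: the sequences being scored come from the student's own distribution, not the teacher's.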

Training Pipeline

Pipeline overview: (1) choose teacher model (API or local) → (2) generate synthetic data (diverse prompts) → (3) design student architecture (layers, heads, dim) → (4) train with distillation loss (KL + CE) → (5) fine-tune task-specific (optional SFT).

1 Choose or Access Teacher Model

Select a high-quality teacher: GPT-4, Claude, LLaMA-70B, Mixtral-8x22B. If you have API-only access (no logits), you'll use black-box distillation (instruction tuning on teacher outputs). If you have full model weights, white-box distillation with logit matching is more effective.

2 Generate Synthetic Dataset

Design diverse prompts spanning your target use cases. Run the teacher to generate high-quality responses. For reasoning tasks, request chain-of-thought. For factual tasks, request citations. Volume: 50K–5M examples depending on domain breadth. Filter for quality, remove duplicates, balance topic distribution.

3 Design Student Architecture

Typical compression: reduce layers from 80 to 32, heads from 64 to 32, hidden dim from 8192 to 4096. Common ratios: 10:1 to 50:1 parameter reduction. Initialize from a pretrained base (e.g., LLaMA-7B) rather than from scratch — the student needs a foundation of language understanding to absorb the teacher's knowledge effectively.
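As a rough sizing check for these choices (a back-of-envelope approximation that assumes d_ff = 4·d_model and ignores norms, biases, and attention variants like GQA), transformer block parameters scale as roughly 12 · layers · d²:

```python
def approx_params(n_layers, d_model, vocab=32000):
    """Back-of-envelope parameter count: each block has ~4*d^2 attention
    weights plus ~8*d^2 FFN weights (d_ff = 4*d), i.e. ~12*d^2 per layer,
    plus vocab*d embedding parameters."""
    return 12 * n_layers * d_model ** 2 + vocab * d_model

teacher = approx_params(80, 8192)   # on the order of a 70B-class model
student = approx_params(32, 4096)   # on the order of a 7B-class model
```

Halving the layers and the hidden dimension cuts the block parameter count by roughly 8x, which is why the 80-layer/8192-dim to 32-layer/4096-dim reduction lands near a 10:1 ratio.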

4 Train with Distillation Loss

Combine KL divergence on soft targets with cross-entropy on hard labels. Start with high alpha (0.7–0.9) to emphasize the teacher signal, then optionally anneal toward hard labels. Monitor the student's performance on held-out examples from the target distribution, not just the training loss.

5 Fine-Tune on Task-Specific Data

After general distillation, fine-tune on your specific downstream task with real (non-synthetic) data. This "sharpens" the student for your use case. Use a lower learning rate (1/10th of distillation LR) to avoid catastrophic forgetting. Evaluate on real-world benchmarks relevant to your deployment.

Key Models & Papers

DistilBERT: A distilled version of BERT
Sanh et al. (HuggingFace), 2019 — The first major transformer distillation. 6-layer student from 12-layer BERT teacher. 40% smaller, 60% faster, retains 97% of BERT's performance. Used triple loss: masked LM, distillation, cosine embedding.
TinyBERT: Distilling BERT for Natural Language Understanding
Jiao et al., 2020 — Went beyond logit matching to align attention matrices and hidden states between teacher and student at every layer. 4-layer TinyBERT matched BERT-base while being 7.5x smaller and 9.4x faster.
Alpaca: A Strong, Replicable Instruction-Following Model
Stanford, 2023 — Demonstrated that 52K GPT-3.5-generated instruction-response pairs could fine-tune LLaMA-7B into a surprisingly capable instruction follower. Cost: under $600. Opened the floodgates for LLM distillation research.
Orca / Orca 2: Progressive Learning from Complex Explanation Traces
Microsoft Research, 2023 — Chain-of-thought distillation from GPT-4. Orca-13B matched GPT-3.5 on reasoning benchmarks. Orca 2 introduced "cautious reasoning" — the student learns when to use different reasoning strategies (step-by-step, recall-then-generate, etc.).
Phi-1 / Phi-2 / Phi-3 / Phi-4 (Microsoft Research)
2023–2024 — Proved that data quality trumps model size. Phi-1 (1.3B) outperformed models 10x larger on code benchmarks using "textbook quality" synthetic data. Phi-3-mini (3.8B) matched Mixtral-8x7B. The Phi series combines aggressive data curation with distillation from larger models.
Gemma (Google) / Minitron (NVIDIA)
2024 — Gemma models use distillation from larger Gemini models plus careful data curation. NVIDIA's Minitron combines structured pruning with knowledge distillation: start with a large model, prune width/depth, then distill to recover accuracy. Minitron-8B from Nemotron-15B retained 95% of performance.
| Model | Teacher | Student Size | Method | Key Result |
|---|---|---|---|---|
| DistilBERT | BERT-base (110M) | 66M | Logit + embedding | 97% of BERT, 60% faster |
| TinyBERT | BERT-base (110M) | 14.5M | Attention + hidden state | 96% of BERT, 9.4x faster |
| Alpaca | GPT-3.5 (API) | 7B (LLaMA) | Instruction tuning | ~80% of ChatGPT quality |
| Orca | GPT-4 (API) | 13B (LLaMA) | CoT distillation | Matched GPT-3.5 on reasoning |
| Phi-3-mini | Larger Phi + synthetic | 3.8B | Data curation + distillation | Matched Mixtral-8x7B |
| Minitron-8B | Nemotron-15B | 8B | Pruning + distillation | 95% of teacher accuracy |

When NOT to Use Distillation

When the Teacher Isn't Good Enough

Distillation transfers the teacher's behavior — including its errors and biases. If the teacher hallucinates on medical questions, the student will too. For safety-critical domains, distillation must be paired with careful evaluation and alignment. You cannot distill capabilities the teacher doesn't have.

When You Need Frontier-Level Performance

Distillation inherently involves a quality loss. If your application requires the absolute best possible quality (competitive coding, advanced math, nuanced legal reasoning), the distilled student will always trail the teacher. For these cases, serve the large model directly and optimize with quantization instead.

When Quantization or Pruning Suffices

If you only need 2–4x compression, quantization (INT8, INT4, GPTQ, AWQ) is simpler, faster, and requires no training data or teacher access. Distillation shines at 10x+ compression ratios where quantization alone cannot maintain quality.

| Method | Compression | Quality Loss | Training Needed | Best For |
|---|---|---|---|---|
| Distillation | 10–50x | Moderate (5–20%) | Extensive (days–weeks) | Massive compression, domain specialization |
| Quantization | 2–4x | Minimal (1–5%) | None or minimal | Quick deployment, same architecture |
| Pruning | 2–10x | Low-Moderate | Moderate (fine-tuning) | Structured removal of redundancy |
| Train from Scratch | N/A | Depends on data/scale | Massive (weeks–months) | When you have abundant data and compute |
Distillation + Quantization is powerful. Distill a 70B teacher into a 7B student (10x), then quantize the student to INT4 (4x). Net result: ~40x compression. This combination is how most production LLM deployments work in practice.

Decision Framework

Choose Distillation If You Need 10x+ Compression

When quantization alone cannot give you the speedup or memory savings you need. If you must go from 70B to 7B or from 13B to 1.3B, distillation is the right tool. The student model is a fundamentally different (smaller) architecture.

Choose Distillation If You Have a Clear Target Domain

A general-purpose 7B model trained from scratch is mediocre at everything. A 7B model distilled from a 70B teacher specifically for your domain can be excellent at that domain. The teacher's knowledge is focused through the lens of your domain-specific training data.

Choose Distillation If You Have Teacher Access

You need either (a) API access to a strong model to generate training data (black-box), or (b) full model weights to extract logits (white-box). Without teacher access, you're training from scratch. White-box distillation is generally more effective because of the richer soft-target signal, but it requires open-weight teachers.

Choose Distillation If Latency Matters More Than Peak Quality

If you'd rather have 90% quality at 10ms than 100% quality at 200ms, distillation is for you. This is the right tradeoff for production systems serving millions of requests: search ranking, content filtering, chat, autocomplete.

Choose Distillation If You Want to Own Your Model

API dependency means variable costs, rate limits, potential discontinuation, and no control over model updates. Distilling into your own model gives you a fixed asset you control: predictable costs, offline capability, version stability, and the ability to further fine-tune.

Practical Recipes

Recipe 1: Domain-Specific GPT-4-Level 7B Model

Most Common Use Case

Goal: Compress GPT-4-class performance on a narrow domain (e.g., customer support, medical triage) into a deployable 7B model.

  • Collect 10K domain-specific questions/scenarios
  • Generate GPT-4 responses with chain-of-thought for each
  • Start with a pretrained 7B base (LLaMA-3, Mistral, Qwen)
  • Fine-tune with standard cross-entropy on teacher outputs
  • Evaluate on held-out domain data; iterate on prompt diversity
  • Expected: 85–95% of GPT-4 quality on your domain at 1/50th the cost

Recipe 2: Fast Inference from Large Ensemble

Multi-Teacher

Goal: Combine knowledge from multiple large models into a single fast student.

  • Run 3–5 teacher models on the same prompts
  • Average their logit distributions (or take majority vote for hard labels)
  • Train student on the ensembled soft targets
  • The student often outperforms any individual teacher because the ensemble smooths out individual model errors
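The ensembling step can be sketched as follows (illustrative; averaging in probability space rather than over raw logits keeps teachers with different logit scales comparable):

```python
import numpy as np

def ensemble_soft_targets(teacher_logits_list, tau=4.0):
    """Average several teachers' softened distributions into one soft target.

    Each element of teacher_logits_list is a [B, L, V] logits array."""
    def softmax(z):
        z = z / tau
        z = z - z.max(axis=-1, keepdims=True)    # numerical stability
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    probs = [softmax(z) for z in teacher_logits_list]
    return np.mean(probs, axis=0)                # [B, L, V], rows sum to 1
```

The averaged distribution then replaces the single-teacher soft targets in the distillation loss; this assumes the teachers share a vocabulary (otherwise the distributions must first be mapped to a common token space).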

Recipe 3: Task-Specific Small Model

Focused Compression

Goal: Create a tiny (1–3B) model that does one thing exceptionally well (e.g., sentiment analysis, entity extraction, SQL generation).

  • Generate 100K–500K task-specific examples from a large teacher
  • Use a very small student architecture (1–3B, 12–24 layers)
  • Heavy data augmentation: paraphrases, edge cases, adversarial examples
  • Distilled small models often beat general-purpose models 10x their size on the target task

Recipe 4: Progressive Distillation for Generation Speedup

Diffusion / Iterative Models

Goal: Reduce the number of sampling steps in diffusion or iterative generation models.

  • Teacher: original model running N steps (e.g., 128 diffusion steps)
  • Student: same architecture but trained to produce equivalent output in N/2 steps
  • Repeat: distill N/2 into N/4, then N/4 into N/8
  • Result: 8–16x generation speedup with minimal quality loss
  • Applied successfully in Stable Diffusion distillation and speculative decoding for LLMs

Hyperparameter Guide

Temperature intuition: Temperature T controls how much "dark knowledge" flows from teacher to student. At T=1, the teacher says "the answer is definitely 'cat'." At T=10, the teacher says "it's probably 'cat', but 'kitten', 'feline', and 'pet' are reasonable too." The student learns richer representations from the softer distribution.
| Hyperparameter | Typical Range | Guidance |
|---|---|---|
| Temperature (T) | 2–20 | Start at T=4. Higher T for diverse tasks, lower T for factual/classification tasks. T=1 reduces to standard training. |
| Alpha (α) | 0.5–0.9 | Weight of distillation loss vs. hard-label loss. Start at 0.7. Higher alpha = more teacher reliance. Reduce alpha late in training. |
| Student Architecture | 1/4 to 1/10 of teacher | Common: halve layers and hidden dim. 70B → 7B, 13B → 1.3B. Use teacher's tokenizer for vocab compatibility. |
| Learning Rate | 1e-5 to 5e-4 | Lower than pretraining LR. Use cosine schedule with warmup. Typical: 2e-5 for fine-tuning, 1e-4 for full distillation. |
| Batch Size | 64–512 | Larger batches stabilize KL divergence optimization. Scale with gradient accumulation if GPU-limited. |
| Loss Type | Logit-only vs. hidden-state | Start logit-only (simpler, fewer hyperparams). Add hidden-state alignment only if logit-only plateaus. Requires layer mapping. |
| Data Volume | 50K–5M examples | Domain-specific: 50K–500K. General-purpose: 1M–5M. Quality > quantity. Filter aggressively. |

The Trajectory

Distillation is becoming the default way to deploy LLMs. The frontier labs train one massive model, then distill it into a family of smaller models for different deployment targets. GPT-4 → GPT-4o-mini, Gemini Ultra → Gemini Flash, Claude 3.5 Sonnet → Claude 3.5 Haiku. The "train big, deploy small" paradigm is now standard.

Synthetic data scaling: As teacher models improve, the quality of synthetic training data improves. This creates a virtuous cycle: better teachers produce better synthetic data, which trains better students, which in turn can become teachers for the next generation. The Phi series demonstrated this can work across multiple generations.

Distillation-aware pretraining: Future models may be pretrained with distillation in mind — explicitly designed to be good teachers by producing informative output distributions. This means training objectives that encourage richer soft targets and more transferable hidden representations.

Open-weight ecosystem: The availability of strong open-weight models (LLaMA, Mistral, Qwen, Gemma) as both teachers and student bases has democratized distillation. Any organization can now create competitive domain-specific models without training from scratch.

The smaller-is-better trend: Phi-3-mini (3.8B) matching Mixtral-8x7B (46.7B) showed that the floor for "useful" model size keeps dropping. With better distillation techniques and data curation, we may see 1–3B models that match today's 7–13B models within a year. The practical implication: on-device LLMs are becoming viable for an expanding range of tasks.