DINO: Self-Distillation with No Labels

2021–2025 — Meta AI (FAIR)
Tags: Self-Supervised Learning · Vision Transformer · Self-Distillation · Foundation Model · Feature Learning · Meta AI

1 — The Problem

Supervised learning in computer vision has a fundamental bottleneck: it requires enormous volumes of human-annotated labels. ImageNet provides 1.2 million labeled images across 1,000 classes, but the features learned are often tied to those specific categories. Can a model learn universal visual representations — features that transfer to any downstream task — without any labels at all?

Self-supervised methods in NLP (BERT, GPT) had already demonstrated that pre-training on unlabeled text produces powerful general-purpose representations. Vision was lagging behind. Contrastive learning methods like MoCo and SimCLR made progress but typically relied on CNNs and required careful negative sampling or large memory banks.

The DINO insight: Vision Transformers, when trained with self-distillation (a student network learning from a momentum-updated teacher), spontaneously learn features that segment objects, understand scene structure, and transfer broadly — all without ever seeing a single label. The name says it all: self-DIstillation with NO labels.

Three Generations

The DINO family spans three generations, each pushing the frontier of self-supervised vision: DINOv1 (2021), DINOv2 (2023), and DINOv3 (2025).

2 — DINOv1 Architecture

DINOv1 introduced a self-distillation framework where a student network learns to match the output distribution of a teacher network. Crucially, the teacher is not a separate pre-trained model — it is an exponential moving average (EMA) of the student's own weights, updated after each training step.

[Architecture diagram: an input image is augmented into global crops (fed to the teacher) and local + global crops (fed to the student); the teacher ViT backbone (EMA of the student) and the student ViT backbone (trained) each feed their own projection head; the teacher output passes through centering & sharpening to give P_t (softmax), the student gives P_s (softmax), and a cross-entropy loss drives a gradient update of the student only, with the teacher updated by EMA.]

1. Multi-Crop Strategy

2 global crops (224×224) + N local crops (96×96)

Each input image is augmented into multiple views: two global crops covering a large portion of the image (>50%), and several local crops covering smaller regions (~5–25%). The teacher sees only global crops, while the student sees all crops. This asymmetry forces the student to learn that local patches belong to the same global concept — encouraging learning of both local texture and global structure.
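As a rough sketch (not the official recipe), the multi-crop augmentation can be expressed with torchvision transforms; the crop-scale ranges below mirror the coverage figures above, and the full recipe additionally applies color jitter, Gaussian blur, and solarization, which are omitted here:

```python
# Minimal multi-crop sketch: 2 global crops + N local crops per image.
# Scale ranges and the omission of photometric augmentations are simplifications.
import torchvision.transforms as T

def make_multicrop(n_local: int = 10):
    global_crop = T.Compose([
        T.RandomResizedCrop(224, scale=(0.4, 1.0)),   # large portions of the image
        T.RandomHorizontalFlip(),
        T.ToTensor(),
    ])
    local_crop = T.Compose([
        T.RandomResizedCrop(96, scale=(0.05, 0.4)),   # small regions
        T.RandomHorizontalFlip(),
        T.ToTensor(),
    ])
    def augment(img):
        global_views = [global_crop(img) for _ in range(2)]      # teacher sees only these
        local_views = [local_crop(img) for _ in range(n_local)]  # student sees all views
        return global_views, local_views
    return augment
```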

2. Vision Transformer Backbone

ViT-S/16: [B, 197, 384]

Both student and teacher share the same ViT architecture. The image is split into 16×16 patches, linearly embedded, and processed through transformer blocks with multi-head self-attention. A [CLS] token is prepended and its final representation serves as the image-level feature. DINOv1 reports ViT-S (21M) and ViT-B (86M) variants at patch sizes 16 and 8, along with a ResNet-50 baseline.

3. Projection Head

[B, 384] → [B, K] (K=65,536)

The [CLS] token embedding passes through a 3-layer MLP projection head with an L2-normalized bottleneck, followed by a weight-normalized linear layer. The output is a K-dimensional vector (K=65,536 by default) whose softmax is interpreted as a probability distribution over K prototype dimensions. Both student and teacher have their own projection heads.
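A minimal sketch of such a projection head is shown below, assuming the 3-layer MLP, L2-normalized bottleneck, and weight-normalized prototype layer described above; the hidden and bottleneck widths (2048 and 256) are typical values, not guarantees:

```python
# Sketch of a DINO-style projection head: MLP -> L2-normalized bottleneck ->
# weight-normalized linear layer over K prototypes.
import torch.nn as nn
import torch.nn.functional as F

class DINOHead(nn.Module):
    def __init__(self, in_dim=384, hidden_dim=2048, bottleneck_dim=256, out_dim=65536):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, bottleneck_dim),
        )
        self.prototypes = nn.utils.weight_norm(
            nn.Linear(bottleneck_dim, out_dim, bias=False))

    def forward(self, x):                       # x: [B, in_dim] ([CLS] embedding)
        x = self.mlp(x)
        x = F.normalize(x, dim=-1)              # L2-normalize the bottleneck
        return self.prototypes(x)               # [B, out_dim] prototype logits
```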

4. Centering & Sharpening

Collapse prevention — teacher output only

Without supervision, self-distillation can collapse to trivial solutions (all images mapped to the same output). DINO prevents this with two mechanisms applied to the teacher output only:

  • Centering: Subtracts a running mean of teacher outputs, preventing any single dimension from dominating
  • Sharpening: Uses a low temperature (τt = 0.04) in the teacher softmax to produce peaked distributions, encouraging confident predictions

The student uses a higher temperature (τs = 0.1), making its distribution softer. The cross-entropy loss between the sharp teacher and soft student drives meaningful feature learning.
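A compact sketch of this loss follows, using the temperatures and center momentum quoted in this section (the bookkeeping for multiple crops is omitted for brevity):

```python
# DINO-style loss: centered + sharpened teacher softmax vs. softer student softmax.
import torch
import torch.nn.functional as F

class DINOLoss(torch.nn.Module):
    def __init__(self, out_dim=65536, tau_s=0.1, tau_t=0.04, center_momentum=0.9):
        super().__init__()
        self.tau_s, self.tau_t, self.m = tau_s, tau_t, center_momentum
        self.register_buffer("center", torch.zeros(1, out_dim))

    def forward(self, student_logits, teacher_logits):
        # Teacher: center, then sharpen with a low temperature; no gradients flow back.
        p_t = F.softmax((teacher_logits - self.center) / self.tau_t, dim=-1).detach()
        # Student: higher temperature gives a softer distribution.
        log_p_s = F.log_softmax(student_logits / self.tau_s, dim=-1)
        loss = -(p_t * log_p_s).sum(dim=-1).mean()     # cross-entropy H(P_t, P_s)
        self._update_center(teacher_logits)
        return loss

    @torch.no_grad()
    def _update_center(self, teacher_logits):
        batch_mean = teacher_logits.mean(dim=0, keepdim=True)
        self.center = self.center * self.m + batch_mean * (1 - self.m)
```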

Emergent segmentation: A striking discovery in DINOv1 is that the self-attention maps of the final ViT layer naturally highlight object boundaries and segment foreground from background — despite never being trained with segmentation labels. This property is unique to ViTs trained with DINO; supervised ViTs and contrastive CNN methods do not exhibit this behavior.
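These attention maps are easy to inspect. The sketch below assumes the publicly released DINOv1 ViT-S/16 checkpoint and its get_last_selfattention helper from the official repository; treat the names and shapes as assumptions if you use a different ViT implementation:

```python
# Extract the [CLS] token's attention over image patches from the final block.
import torch

model = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
model.eval()

img = torch.randn(1, 3, 224, 224)              # stand-in for a normalized input image
with torch.no_grad():
    attn = model.get_last_selfattention(img)   # [1, num_heads, 197, 197]

cls_attn = attn[0, :, 0, 1:]                   # attention of [CLS] onto the 196 patches
maps = cls_attn.reshape(-1, 14, 14)            # one 14x14 map per head (224 / 16 = 14)
# Upsampling these maps to the image size reveals the emergent object segmentation.
```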

3 — DINOv2: Scaling to Universal Features

DINOv2 asked a bold question: can we build a visual feature extractor so good that it works as a universal backbone across classification, segmentation, depth estimation, and retrieval — all without any task-specific fine-tuning? The answer was a resounding yes, achieved through three pillars: better data, better training, and larger models.

1. LVD-142M: Curated Data at Scale

142 million images — automatic curation pipeline

Previous self-supervised methods typically trained on ImageNet-1K (1.2M images) or uncurated web-scale datasets. DINOv2 introduced an automatic data curation pipeline that built LVD-142M from diverse web sources. The pipeline uses a pre-trained embedding model to de-duplicate images, balance concepts, and remove near-duplicates — producing a dataset that is both large and diverse without requiring manual annotation.

The pipeline works in three stages: retrieval of candidate images using embedding similarity to curated seed data, de-duplication via copy detection, and rebalancing to ensure concept diversity. This is far more effective than simply scraping billions of uncurated images.
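The snippet below is only a toy illustration of the retrieval and de-duplication ideas (not the actual LVD-142M pipeline, which runs at web scale with dedicated copy-detection and clustering); the thresholds and the helper name are assumptions:

```python
# Toy curation sketch: drop near-duplicates, then retrieve pool images closest
# to a curated seed set in embedding space.
import torch
import torch.nn.functional as F

def curate(seed_emb, pool_emb, dup_thresh=0.95, k_per_seed=4):
    seed = F.normalize(seed_emb, dim=-1)       # [S, D] embeddings of curated seed images
    pool = F.normalize(pool_emb, dim=-1)       # [N, D] embeddings of uncurated images

    # De-duplication: remove pool images that are near-copies of another pool image.
    sim_pool = pool @ pool.T
    sim_pool.fill_diagonal_(0)
    keep = sim_pool.max(dim=1).values < dup_thresh
    pool, kept_idx = pool[keep], keep.nonzero(as_tuple=True)[0]

    # Retrieval: for each seed image, keep its k nearest neighbors from the pool.
    sim = seed @ pool.T                        # [S, N_kept] cosine similarities
    nn_idx = sim.topk(k_per_seed, dim=1).indices
    return kept_idx[nn_idx.unique()]           # indices into the original pool
```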

2. Combined Training Objective

DINO loss + iBOT masked image modeling loss

DINOv2 does not rely on the DINO self-distillation loss alone. It combines two complementary objectives:

  • DINO loss — the original [CLS] token self-distillation between student and teacher, encouraging global image understanding
  • iBOT loss — a masked image modeling objective applied to patch tokens. Random patches are masked in the student input, and the student must reconstruct the teacher's patch-level representations for those positions

This combination gives the model both global (image-level) and local (patch-level) understanding. The iBOT loss ensures the patch tokens contain rich spatial information, which is critical for dense prediction tasks like segmentation and depth estimation.
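A simplified sketch of how the two objectives combine is given below; it reuses the DINOLoss above for the [CLS] term and applies a masked cross-entropy over patch tokens for the iBOT term (the real iBOT head also centers the teacher's patch outputs, which is omitted here):

```python
# Combined image-level (DINO) + patch-level masked (iBOT-style) objective.
import torch
import torch.nn.functional as F

def combined_loss(student_cls, teacher_cls,      # [B, K] projection-head outputs
                  student_patch, teacher_patch,  # [B, P, K] patch-level head outputs
                  mask,                          # [B, P] bool, True = patch masked for the student
                  dino_loss, tau_s=0.1, tau_t=0.04):
    # Global term: [CLS] self-distillation between student and teacher.
    l_dino = dino_loss(student_cls, teacher_cls)

    # Local term: student reconstructs teacher patch distributions at masked positions.
    p_t = F.softmax(teacher_patch / tau_t, dim=-1).detach()
    log_p_s = F.log_softmax(student_patch / tau_s, dim=-1)
    per_patch = -(p_t * log_p_s).sum(dim=-1)                 # [B, P]
    l_ibot = (per_patch * mask).sum() / mask.sum().clamp(min=1)

    return l_dino + l_ibot                                   # equal weighting, per the text
```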

3. KoLeo Regularizer

Uniform feature distribution on the hypersphere

A novel KoLeo regularizer (based on the Kozachenko-Leonenko differential entropy estimator) encourages features within each batch to be uniformly distributed on the hypersphere. This prevents feature collapse more robustly than centering alone and ensures the model uses the full capacity of its embedding space. The regularizer is applied to the normalized features before the projection head.
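One simple way to realize this idea (a simplified reading of the estimator, not the exact DINOv2 implementation) is to penalize small nearest-neighbor distances between L2-normalized features within a batch:

```python
# KoLeo-style regularizer: maximize the log distance to each feature's nearest
# neighbor, spreading the batch uniformly over the unit hypersphere.
import torch
import torch.nn.functional as F

def koleo_loss(features, eps=1e-8):
    x = F.normalize(features, dim=-1)             # [B, D] features on the unit sphere
    sim = x @ x.T                                 # pairwise cosine similarities
    sim.fill_diagonal_(-2.0)                      # exclude self-similarity
    nn_sim = sim.max(dim=1).values                # similarity to the nearest neighbor
    nn_dist = torch.sqrt(2 - 2 * nn_sim + eps)    # Euclidean distance for unit vectors
    return -torch.log(nn_dist + eps).mean()       # small distances are penalized heavily
```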

4. Model Distillation

ViT-g/14 (1.1B) → ViT-S/B/L via distillation

The largest DINOv2 model is a ViT-g/14 with 1.1 billion parameters. To make the features accessible at different compute budgets, Meta distills the giant model into smaller variants (ViT-S, ViT-B, ViT-L) using knowledge distillation. The distilled smaller models significantly outperform models trained from scratch at the same size, inheriting much of the giant model's representation quality.

Universal features: DINOv2 features work as frozen, general-purpose features. A simple linear head on top of frozen DINOv2 ViT-g/14 features achieves 84.5% on ImageNet classification, state-of-the-art on ADE20K segmentation, competitive depth estimation on NYUd, and strong image retrieval — all without fine-tuning the backbone. This is the promise of a true visual foundation model.
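The frozen-backbone protocol itself is simple; the sketch below assumes the public DINOv2 torch.hub entry point (ViT-L/14 here rather than ViT-g/14, for modest hardware) and trains only the linear head:

```python
# Linear probe on frozen DINOv2 features: the backbone is never fine-tuned.
import torch
import torch.nn as nn

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False                       # frozen, general-purpose features

head = nn.Linear(1024, 1000)                      # ViT-L/14 feature dim -> ImageNet classes

images = torch.randn(8, 3, 224, 224)              # stand-in batch (side length multiple of 14)
with torch.no_grad():
    feats = backbone(images)                      # [8, 1024] image-level features
logits = head(feats)                              # only the linear head receives gradients
```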

4 — DINOv3: The 7B Frontier

DINOv3 represents the next leap in scaling vision foundation models. Building on the DINOv2 recipe, it pushes to 7 billion parameters trained on 1.7 billion images, confronting the fundamental challenges that emerge when self-supervised training meets extreme scale.

1. The Scaling Challenge

Training instability at 7B parameters

Naively scaling the DINOv2 recipe to 7B parameters causes training instabilities: loss spikes, gradient explosions, and eventual divergence. These issues are well-known in large language model training but manifest differently in self-supervised vision models because the teacher-student dynamics create additional feedback loops. Small perturbations in the student propagate through EMA to the teacher, which then amplifies them back to the student.

2. Gram Anchoring

Stabilizing training at extreme scale

The key innovation in DINOv3 is Gram Anchoring, a technique that stabilizes self-supervised training at very large scale. It works by periodically anchoring the Gram matrix (the matrix of feature-feature correlations) to a reference state, preventing the representation geometry from drifting into degenerate configurations during training.

Gram Anchoring acts as a soft constraint on the feature space's covariance structure. When the correlations between features begin to shift too far from the anchor, a regularization term pulls them back. This is gentler than hard resets and more principled than ad-hoc learning rate adjustments.
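Since the precise formulation is specific to DINOv3, the following is only an illustrative sketch built from the description above: the anchor is a reference Gram matrix of patch features, and a penalty pulls the current correlations back toward it.

```python
# Illustrative Gram-anchoring penalty (an assumption, not the published recipe):
# keep the patch-feature correlation structure close to a reference anchor.
import torch
import torch.nn.functional as F

def gram_matrix(patch_feats):
    x = F.normalize(patch_feats, dim=-1)          # [B, P, D] patch features
    return x @ x.transpose(1, 2)                  # [B, P, P] feature-feature correlations

def gram_anchor_loss(current_feats, anchor_feats):
    g_cur = gram_matrix(current_feats)
    with torch.no_grad():
        g_ref = gram_matrix(anchor_feats)         # e.g. features from an earlier reference model
    return F.mse_loss(g_cur, g_ref)               # soft constraint on representation geometry
```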

3. Expanded Data & Architecture

1.7B images — ViT with 7B parameters

DINOv3 trains on a substantially expanded dataset of 1.7 billion images, roughly 12x the LVD-142M dataset. The architecture remains a Vision Transformer but scaled to 7B parameters with wider hidden dimensions and more transformer layers. The multi-crop strategy and combined DINO + iBOT losses from v2 are retained.

Scaling laws for vision: DINOv3 provides empirical evidence that self-supervised vision models follow scaling laws similar to LLMs — performance on downstream tasks improves predictably as a function of model size and training data, provided training stability is maintained. Gram Anchoring is the key enabler that unlocks this scaling regime.

5 — Evolution: v1 → v2 → v3

[Timeline: DINOv1 (2021) — self-distillation, momentum teacher, multi-crop strategy, centering & sharpening, ViT-S/B (21M–86M), ImageNet-1K (1.2M imgs), 77.0% linear (ViT-S/16), emergent segmentation, [CLS] token features → (+data, +iBOT) → DINOv2 (2023) — DINO + iBOT losses, KoLeo regularizer, automatic data curation, LVD-142M (142M imgs), ViT-g/14 (1.1B params), model distillation, 84.5% linear (ViT-g), universal features, dense prediction tasks → (+scale, +Gram) → DINOv3 (2025) — Gram Anchoring, 7B parameters, 1.7B images, stable extreme-scale training, DINOv2 recipe retained, vision scaling laws, SOTA universal features, foundation model era.]
| Attribute | DINOv1 (2021) | DINOv2 (2023) | DINOv3 (2025) |
|---|---|---|---|
| Training Loss | DINO self-distillation | DINO + iBOT | DINO + iBOT + Gram Anchoring |
| Largest Model | ViT-B/16 (86M) | ViT-g/14 (1.1B) | ViT (7B) |
| Training Data | ImageNet-1K (1.2M) | LVD-142M (142M) | ~1.7B images |
| ImageNet Accuracy | 77.0% (linear, ViT-S/16) | 84.5% (linear, ViT-g/14) | Further improvements |
| Feature Tokens | [CLS] only | [CLS] + patch tokens | [CLS] + patch tokens |
| Collapse Prevention | Centering + sharpening | Centering + KoLeo | Centering + KoLeo + Gram Anchoring |
| Key Contribution | Emergent ViT properties | Universal visual features | Stable training at extreme scale |
| Downstream Tasks | Classification, retrieval | Classification, segmentation, depth, retrieval | All tasks with improved quality |

6 — Training Details

DINOv1 Training

Optimization Setup

16 V100 GPUs — 300–800 epochs on ImageNet

Optimizer: AdamW with weight decay 0.04–0.4 (cosine schedule). Learning rate: linearly warmed up for 10 epochs, then cosine decay. Batch size: 1024. Multi-crop: 2 global crops at 224×224 + 10 local crops at 96×96. Teacher EMA: momentum starts at 0.996 and increases to 1.0 with a cosine schedule during training.

Teacher temperature: τt is linearly warmed up from 0.04 to 0.07 over the first 30 epochs. Student temperature: τs = 0.1. The centering vector c is updated with momentum 0.9 as an exponential moving average of the teacher's output batch means.
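The teacher update and its momentum schedule fit in a few lines; the sketch below follows the 0.996 → 1.0 cosine schedule quoted above, with everything else simplified:

```python
# EMA teacher update with a cosine momentum schedule from 0.996 to 1.0.
import math
import torch

@torch.no_grad()
def update_teacher(student, teacher, step, total_steps, m_base=0.996):
    m = 1.0 - (1.0 - m_base) * (math.cos(math.pi * step / total_steps) + 1) / 2
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(m).add_(p_s.detach(), alpha=1 - m)   # teacher <- m * teacher + (1 - m) * student
```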

DINOv2 Training

Large-Scale Setup

ViT-g/14 on LVD-142M — A100 GPU cluster

Optimizer: AdamW. Batch size: scaled up to 3072 for larger models. Resolution: 224×224 during pre-training, with a short 518×518 fine-tuning phase for high-resolution features. Mixed precision: bfloat16 for memory efficiency. Training duration: 625,000 iterations for the ViT-g/14 model.

Loss weighting: The DINO (image-level) and iBOT (patch-level) losses are combined with equal weight. The iBOT masking follows a block-wise random strategy masking ~50% of patches. The KoLeo regularizer weight is tuned to balance uniform distribution without disrupting the primary losses.
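As a toy illustration of block-wise masking (a simplified stand-in for the actual iBOT masking strategy), random rectangular blocks can be drawn on the 14×14 patch grid until roughly half the patches are covered:

```python
# Toy block-wise mask over a 14x14 patch grid, targeting ~50% masked patches.
import torch

def block_mask(grid=14, target_ratio=0.5, max_block=6):
    mask = torch.zeros(grid, grid, dtype=torch.bool)
    while mask.float().mean() < target_ratio:
        h = torch.randint(1, max_block + 1, (1,)).item()
        w = torch.randint(1, max_block + 1, (1,)).item()
        top = torch.randint(0, grid - h + 1, (1,)).item()
        left = torch.randint(0, grid - w + 1, (1,)).item()
        mask[top:top + h, left:left + w] = True    # mask a random rectangular block
    return mask.flatten()                          # [196] bool, True = masked position
```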

Distillation recipe: After training the ViT-g teacher, smaller models (ViT-S/14, ViT-B/14, ViT-L/14) are trained via knowledge distillation by matching the teacher's patch and CLS token representations. This takes significantly less compute than training from scratch and produces models that are notably stronger than self-supervised training at the same scale.

DINOv3 Training

Extreme-Scale Setup

7B parameters — massive GPU cluster

DINOv3 training requires large clusters of high-bandwidth GPUs (H100 or equivalent). The Gram Anchoring mechanism adds minimal computational overhead — periodically computing and comparing Gram matrices is inexpensive relative to the forward/backward pass of a 7B model. The anchoring interval and regularization strength are key hyperparameters that balance stability against constraining the model's learning capacity.

7 — Results

Image Classification

| Model | Params | ImageNet k-NN | ImageNet Linear | Method |
|---|---|---|---|---|
| DINOv1 ViT-S/16 | 21M | 74.5% | 77.0% | Self-distillation only |
| DINOv1 ViT-B/16 | 86M | 76.1% | 78.2% | Self-distillation only |
| DINOv2 ViT-S/14 | 21M | — | 81.1% | Distilled from ViT-g |
| DINOv2 ViT-B/14 | 86M | — | 82.1% | Distilled from ViT-g |
| DINOv2 ViT-L/14 | 304M | — | 83.5% | Distilled from ViT-g |
| DINOv2 ViT-g/14 | 1.1B | — | 84.5% | DINO + iBOT + KoLeo |

Dense Prediction (DINOv2 ViT-g/14, frozen backbone)

| Task | Dataset | Metric | DINOv2 (linear) | Notes |
|---|---|---|---|---|
| Semantic Segmentation | ADE20K | mIoU | 49.0 | Linear head on frozen features |
| Depth Estimation | NYUd | RMSE | 0.326 | Linear head on frozen features |
| Image Retrieval | Oxford / Paris | mAP | SOTA | Nearest neighbor, no fine-tuning |

The universality gap: Prior self-supervised methods (e.g., MAE, MoCo) required fine-tuning to match supervised baselines. DINOv2 is the first model where frozen features with a simple linear head match or exceed fine-tuned supervised models across multiple tasks simultaneously. This closes the gap between self-supervised and supervised pre-training.

Emergent Properties (DINOv1)

Beyond quantitative benchmarks, DINOv1 revealed qualitative properties that were not present in supervised models or prior self-supervised approaches: attention maps that act as unsupervised object segmentations, and features that serve directly as strong k-NN classifiers and retrieval descriptors without any fine-tuning.

8 — Key Takeaways

Self-Distillation Is All You Need

DINO demonstrated that a momentum teacher-student framework, without negative pairs, contrastive losses, or cluster assignments, is sufficient to learn excellent visual representations. The simplicity of the approach — student matches teacher, teacher is EMA of student — belies its power. Centering and sharpening are the only additions needed to prevent collapse.

ViTs Unlock Emergent Properties

The emergent segmentation in DINOv1 attention maps was a watershed moment for the field. It showed that Vision Transformers, when freed from supervised label constraints, learn to decompose scenes into semantic parts. This property does not emerge with CNNs or with supervised ViT training — it is unique to the combination of ViTs and self-supervised self-distillation.

Data Curation Matters as Much as Scale

DINOv2's LVD-142M dataset demonstrated that curated data consistently outperforms uncurated data at the same scale. The automatic pipeline for retrieval, de-duplication, and rebalancing is as important as the training algorithm itself. This lesson carries forward into DINOv3, where the dataset scales to 1.7B images while maintaining curation quality.

Vision Foundation Models Are Here

The DINO family established that self-supervised vision models can serve as universal feature extractors, much like large language models serve as universal text processors. DINOv2 features work across classification, segmentation, depth estimation, and retrieval with frozen backbones. DINOv3 pushes this to 7B parameters, suggesting that vision scaling laws parallel those in NLP.

Stability Is the Bottleneck at Scale

DINOv3's Gram Anchoring highlights that the primary challenge in scaling self-supervised vision models is not compute or data but training stability. The teacher-student feedback loop in self-distillation creates unique instability modes that worsen with scale. Solving this problem unlocks the next order of magnitude in model size and downstream performance.

Impact on the field: DINO features have become the default visual backbone for numerous downstream systems. DINOv2 is widely used in robotics (visual representations for manipulation), autonomous driving (depth and segmentation), medical imaging (transfer learning without domain-specific labels), and multimodal models (visual encoders for VLMs). The DINO family fundamentally shifted computer vision from supervised pre-training to self-supervised foundation models.

9 — References & Further Reading