DINO: Self-Distillation with No Labels

2021–2025 — Meta AI (FAIR)
Tags: Self-Supervised Learning · Vision Transformer · Self-Distillation · Foundation Model · Feature Learning · Meta AI

1 — The Problem

Supervised learning in computer vision has a fundamental bottleneck: it requires enormous volumes of human-annotated labels. ImageNet provides 1.2 million labeled images across 1,000 classes, but the features learned are often tied to those specific categories. Can a model learn universal visual representations — features that transfer to any downstream task — without any labels at all?

Self-supervised methods in NLP (BERT, GPT) had already demonstrated that pre-training on unlabeled text produces powerful general-purpose representations. Vision was lagging behind. Contrastive learning methods like MoCo and SimCLR made progress but typically relied on CNNs and required careful negative sampling or large memory banks.

The DINO insight: Vision Transformers, when trained with self-distillation (a student network learning from a momentum-updated teacher), spontaneously learn features that segment objects, understand scene structure, and transfer broadly — all without ever seeing a single label. The name says it all: self-DIstillation with NO labels.

Three Generations

The DINO family spans three generations, each pushing the frontier of self-supervised vision: DINOv1 (2021), DINOv2 (2023), and DINOv3 (2025).

2 — DINOv1 Architecture

DINOv1 introduced a self-distillation framework where a student network learns to match the output distribution of a teacher network. Crucially, the teacher is not a separate pre-trained model — it is an exponential moving average (EMA) of the student's own weights, updated after each training step.

[Architecture diagram: an input image is augmented into global crops (fed to the teacher) and local + global crops (fed to the student); the teacher ViT backbone (EMA of the student) and the student ViT backbone (trained) each feed their own projection head; the teacher output passes through centering & sharpening to give P_t (softmax), the student gives P_s (softmax), and a cross-entropy loss drives a gradient update of the student only, with the teacher updated by EMA.]

1. Multi-Crop Strategy

2 global crops (224×224) + N local crops (96×96)

Each input image is augmented into multiple views: two global crops covering a large portion of the image (>50%), and several local crops covering smaller regions (~5–25%). The teacher sees only global crops, while the student sees all crops. This asymmetry forces the student to learn that local patches belong to the same global concept — encouraging learning of both local texture and global structure.
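As a rough sketch (not the official recipe), the multi-crop augmentation can be expressed with torchvision transforms; the crop-scale ranges below mirror the coverage figures above, and the full recipe additionally applies color jitter, Gaussian blur, and solarization, which are omitted here:

```python
# Minimal multi-crop sketch: 2 global crops + N local crops per image.
# Scale ranges and the omission of photometric augmentations are simplifications.
import torchvision.transforms as T

def make_multicrop(n_local: int = 10):
    global_crop = T.Compose([
        T.RandomResizedCrop(224, scale=(0.4, 1.0)),   # large portions of the image
        T.RandomHorizontalFlip(),
        T.ToTensor(),
    ])
    local_crop = T.Compose([
        T.RandomResizedCrop(96, scale=(0.05, 0.4)),   # small regions
        T.RandomHorizontalFlip(),
        T.ToTensor(),
    ])
    def augment(img):
        global_views = [global_crop(img) for _ in range(2)]      # teacher sees only these
        local_views = [local_crop(img) for _ in range(n_local)]  # student sees all views
        return global_views, local_views
    return augment
```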

2. Vision Transformer Backbone

ViT-S/16: [B, 197, 384]

Both student and teacher share the same ViT architecture. The image is split into 16×16 patches, linearly embedded, and processed through transformer blocks with multi-head self-attention. A [CLS] token is prepended and its final representation serves as the image-level feature. DINOv1 reports ViT-S (21M) and ViT-B (86M) variants at patch sizes 16 and 8, along with a ResNet-50 baseline.

3. Projection Head

[B, 384] → [B, K] (K=65,536)

The [CLS] token embedding passes through a 3-layer MLP projection head with an L2-normalized bottleneck, followed by a weight-normalized linear layer. The output is a K-dimensional vector (K=65,536 by default) whose softmax is interpreted as a probability distribution over K prototype dimensions. Both student and teacher have their own projection heads.
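A minimal sketch of such a projection head is shown below, assuming the 3-layer MLP, L2-normalized bottleneck, and weight-normalized prototype layer described above; the hidden and bottleneck widths (2048 and 256) are typical values, not guarantees:

```python
# Sketch of a DINO-style projection head: MLP -> L2-normalized bottleneck ->
# weight-normalized linear layer over K prototypes.
import torch.nn as nn
import torch.nn.functional as F

class DINOHead(nn.Module):
    def __init__(self, in_dim=384, hidden_dim=2048, bottleneck_dim=256, out_dim=65536):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, bottleneck_dim),
        )
        self.prototypes = nn.utils.weight_norm(
            nn.Linear(bottleneck_dim, out_dim, bias=False))

    def forward(self, x):                       # x: [B, in_dim] ([CLS] embedding)
        x = self.mlp(x)
        x = F.normalize(x, dim=-1)              # L2-normalize the bottleneck
        return self.prototypes(x)               # [B, out_dim] prototype logits
```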

4. Centering & Sharpening

Collapse prevention — teacher output only

Without supervision, self-distillation can collapse to trivial solutions (all images mapped to the same output). DINO prevents this with two mechanisms applied to the teacher output only:

  • Centering: Subtracts a running mean of teacher outputs, preventing any single dimension from dominating
  • Sharpening: Uses a low temperature (τt = 0.04) in the teacher softmax to produce peaked distributions, encouraging confident predictions

The student uses a higher temperature (τs = 0.1), making its distribution softer. The cross-entropy loss between the sharp teacher and soft student drives meaningful feature learning.
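A compact sketch of this loss follows, using the temperatures and center momentum quoted in this section (the bookkeeping for multiple crops is omitted for brevity):

```python
# DINO-style loss: centered + sharpened teacher softmax vs. softer student softmax.
import torch
import torch.nn.functional as F

class DINOLoss(torch.nn.Module):
    def __init__(self, out_dim=65536, tau_s=0.1, tau_t=0.04, center_momentum=0.9):
        super().__init__()
        self.tau_s, self.tau_t, self.m = tau_s, tau_t, center_momentum
        self.register_buffer("center", torch.zeros(1, out_dim))

    def forward(self, student_logits, teacher_logits):
        # Teacher: center, then sharpen with a low temperature; no gradients flow back.
        p_t = F.softmax((teacher_logits - self.center) / self.tau_t, dim=-1).detach()
        # Student: higher temperature gives a softer distribution.
        log_p_s = F.log_softmax(student_logits / self.tau_s, dim=-1)
        loss = -(p_t * log_p_s).sum(dim=-1).mean()     # cross-entropy H(P_t, P_s)
        self._update_center(teacher_logits)
        return loss

    @torch.no_grad()
    def _update_center(self, teacher_logits):
        batch_mean = teacher_logits.mean(dim=0, keepdim=True)
        self.center = self.center * self.m + batch_mean * (1 - self.m)
```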

Emergent segmentation: A striking discovery in DINOv1 is that the self-attention maps of the final ViT layer naturally highlight object boundaries and segment foreground from background — despite never being trained with segmentation labels. This property is unique to ViTs trained with DINO; supervised ViTs and contrastive CNN methods do not exhibit this behavior.
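These attention maps are easy to inspect. The sketch below assumes the publicly released DINOv1 ViT-S/16 checkpoint and its get_last_selfattention helper from the official repository; treat the names and shapes as assumptions if you use a different ViT implementation:

```python
# Extract the [CLS] token's attention over image patches from the final block.
import torch

model = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
model.eval()

img = torch.randn(1, 3, 224, 224)              # stand-in for a normalized input image
with torch.no_grad():
    attn = model.get_last_selfattention(img)   # [1, num_heads, 197, 197]

cls_attn = attn[0, :, 0, 1:]                   # attention of [CLS] onto the 196 patches
maps = cls_attn.reshape(-1, 14, 14)            # one 14x14 map per head (224 / 16 = 14)
# Upsampling these maps to the image size reveals the emergent object segmentation.
```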

3 — DINOv2: Scaling to Universal Features

DINOv2 asked a bold question: can we build a visual feature extractor so good that it works as a universal backbone across classification, segmentation, depth estimation, and retrieval — all without any task-specific fine-tuning? The answer was a resounding yes, achieved through three pillars: better data, better training, and larger models.

1. LVD-142M: Curated Data at Scale

142 million images — automatic curation pipeline

Previous self-supervised methods typically trained on ImageNet-1K (1.2M images) or uncurated web-scale datasets. DINOv2 introduced an automatic data curation pipeline that built LVD-142M from diverse web sources. The pipeline uses a pre-trained embedding model to de-duplicate images, balance concepts, and remove near-duplicates — producing a dataset that is both large and diverse without requiring manual annotation.

The pipeline works in three stages: retrieval of candidate images using embedding similarity to curated seed data, de-duplication via copy detection, and rebalancing to ensure concept diversity. This is far more effective than simply scraping billions of uncurated images.
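The snippet below is only a toy illustration of the retrieval and de-duplication ideas (not the actual LVD-142M pipeline, which runs at web scale with dedicated copy-detection and clustering); the thresholds and the helper name are assumptions:

```python
# Toy curation sketch: drop near-duplicates, then retrieve pool images closest
# to a curated seed set in embedding space.
import torch
import torch.nn.functional as F

def curate(seed_emb, pool_emb, dup_thresh=0.95, k_per_seed=4):
    seed = F.normalize(seed_emb, dim=-1)       # [S, D] embeddings of curated seed images
    pool = F.normalize(pool_emb, dim=-1)       # [N, D] embeddings of uncurated images

    # De-duplication: remove pool images that are near-copies of another pool image.
    sim_pool = pool @ pool.T
    sim_pool.fill_diagonal_(0)
    keep = sim_pool.max(dim=1).values < dup_thresh
    pool, kept_idx = pool[keep], keep.nonzero(as_tuple=True)[0]

    # Retrieval: for each seed image, keep its k nearest neighbors from the pool.
    sim = seed @ pool.T                        # [S, N_kept] cosine similarities
    nn_idx = sim.topk(k_per_seed, dim=1).indices
    return kept_idx[nn_idx.unique()]           # indices into the original pool
```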

2. Combined Training Objective

DINO loss + iBOT masked image modeling loss

DINOv2 does not rely on the DINO self-distillation loss alone. It combines two complementary objectives:

  • DINO loss — the original [CLS] token self-distillation between student and teacher, encouraging global image understanding
  • iBOT loss — a masked image modeling objective applied to patch tokens. Random patches are masked in the student input, and the student must reconstruct the teacher's patch-level representations for those positions

This combination gives the model both global (image-level) and local (patch-level) understanding. The iBOT loss ensures the patch tokens contain rich spatial information, which is critical for dense prediction tasks like segmentation and depth estimation.
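A simplified sketch of how the two objectives combine is given below; it reuses the DINOLoss above for the [CLS] term and applies a masked cross-entropy over patch tokens for the iBOT term (the real iBOT head also centers the teacher's patch outputs, which is omitted here):

```python
# Combined image-level (DINO) + patch-level masked (iBOT-style) objective.
import torch
import torch.nn.functional as F

def combined_loss(student_cls, teacher_cls,      # [B, K] projection-head outputs
                  student_patch, teacher_patch,  # [B, P, K] patch-level head outputs
                  mask,                          # [B, P] bool, True = patch masked for the student
                  dino_loss, tau_s=0.1, tau_t=0.04):
    # Global term: [CLS] self-distillation between student and teacher.
    l_dino = dino_loss(student_cls, teacher_cls)

    # Local term: student reconstructs teacher patch distributions at masked positions.
    p_t = F.softmax(teacher_patch / tau_t, dim=-1).detach()
    log_p_s = F.log_softmax(student_patch / tau_s, dim=-1)
    per_patch = -(p_t * log_p_s).sum(dim=-1)                 # [B, P]
    l_ibot = (per_patch * mask).sum() / mask.sum().clamp(min=1)

    return l_dino + l_ibot                                   # equal weighting, per the text
```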

3. KoLeo Regularizer

Uniform feature distribution on the hypersphere

A novel KoLeo regularizer (based on the Kozachenko-Leonenko differential entropy estimator) encourages features within each batch to be uniformly distributed on the hypersphere. This prevents feature collapse more robustly than centering alone and ensures the model uses the full capacity of its embedding space. The regularizer is applied to the normalized features before the projection head.
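One simple way to realize this idea (a simplified reading of the estimator, not the exact DINOv2 implementation) is to penalize small nearest-neighbor distances between L2-normalized features within a batch:

```python
# KoLeo-style regularizer: maximize the log distance to each feature's nearest
# neighbor, spreading the batch uniformly over the unit hypersphere.
import torch
import torch.nn.functional as F

def koleo_loss(features, eps=1e-8):
    x = F.normalize(features, dim=-1)             # [B, D] features on the unit sphere
    sim = x @ x.T                                 # pairwise cosine similarities
    sim.fill_diagonal_(-2.0)                      # exclude self-similarity
    nn_sim = sim.max(dim=1).values                # similarity to the nearest neighbor
    nn_dist = torch.sqrt(2 - 2 * nn_sim + eps)    # Euclidean distance for unit vectors
    return -torch.log(nn_dist + eps).mean()       # small distances are penalized heavily
```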

4. Model Distillation

ViT-g/14 (1.1B) → ViT-S/B/L via distillation

The largest DINOv2 model is a ViT-g/14 with 1.1 billion parameters. To make the features accessible at different compute budgets, Meta distills the giant model into smaller variants (ViT-S, ViT-B, ViT-L) using knowledge distillation. The distilled smaller models significantly outperform models trained from scratch at the same size, inheriting much of the giant model's representation quality.

Universal features: DINOv2 features work as frozen, general-purpose features. A simple linear head on top of frozen DINOv2 ViT-g/14 features achieves 84.5% on ImageNet classification, state-of-the-art on ADE20K segmentation, competitive depth estimation on NYUd, and strong image retrieval — all without fine-tuning the backbone. This is the promise of a true visual foundation model.
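The frozen-backbone protocol itself is simple; the sketch below assumes the public DINOv2 torch.hub entry point (ViT-L/14 here rather than ViT-g/14, for modest hardware) and trains only the linear head:

```python
# Linear probe on frozen DINOv2 features: the backbone is never fine-tuned.
import torch
import torch.nn as nn

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False                       # frozen, general-purpose features

head = nn.Linear(1024, 1000)                      # ViT-L/14 feature dim -> ImageNet classes

images = torch.randn(8, 3, 224, 224)              # stand-in batch (side length multiple of 14)
with torch.no_grad():
    feats = backbone(images)                      # [8, 1024] image-level features
logits = head(feats)                              # only the linear head receives gradients
```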

4 — DINOv3: The 7B Frontier

DINOv3 represents the next leap in scaling vision foundation models. Building on the DINOv2 recipe, it pushes to 7 billion parameters trained on 1.7 billion images, confronting the fundamental challenges that emerge when self-supervised training meets extreme scale.

1. The Scaling Challenge

Training instability at 7B parameters

Naively scaling the DINOv2 recipe to 7B parameters causes training instabilities: loss spikes, gradient explosions, and eventual divergence. These issues are well-known in large language model training but manifest differently in self-supervised vision models because the teacher-student dynamics create additional feedback loops. Small perturbations in the student propagate through EMA to the teacher, which then amplifies them back to the student.

2. Gram Anchoring

Stabilizing training at extreme scale

The key innovation in DINOv3 is Gram Anchoring, a technique that stabilizes self-supervised training at very large scale. It works by periodically anchoring the Gram matrix (the matrix of feature-feature correlations) to a reference state, preventing the representation geometry from drifting into degenerate configurations during training.

Gram Anchoring acts as a soft constraint on the feature space's covariance structure. When the correlations between features begin to shift too far from the anchor, a regularization term pulls them back. This is gentler than hard resets and more principled than ad-hoc learning rate adjustments.
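Since the precise formulation is specific to DINOv3, the following is only an illustrative sketch built from the description above: the anchor is a reference Gram matrix of patch features, and a penalty pulls the current correlations back toward it.

```python
# Illustrative Gram-anchoring penalty (an assumption, not the published recipe):
# keep the patch-feature correlation structure close to a reference anchor.
import torch
import torch.nn.functional as F

def gram_matrix(patch_feats):
    x = F.normalize(patch_feats, dim=-1)          # [B, P, D] patch features
    return x @ x.transpose(1, 2)                  # [B, P, P] feature-feature correlations

def gram_anchor_loss(current_feats, anchor_feats):
    g_cur = gram_matrix(current_feats)
    with torch.no_grad():
        g_ref = gram_matrix(anchor_feats)         # e.g. features from an earlier reference model
    return F.mse_loss(g_cur, g_ref)               # soft constraint on representation geometry
```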

3. Expanded Data & Architecture

1.7B images — ViT with 7B parameters

DINOv3 trains on a substantially expanded dataset of 1.7 billion images, roughly 12x the LVD-142M dataset. The architecture remains a Vision Transformer but scaled to 7B parameters with wider hidden dimensions and more transformer layers. The multi-crop strategy and combined DINO + iBOT losses from v2 are retained.

Scaling laws for vision: DINOv3 provides empirical evidence that self-supervised vision models follow scaling laws similar to LLMs — performance on downstream tasks improves predictably as a function of model size and training data, provided training stability is maintained. Gram Anchoring is the key enabler that unlocks this scaling regime.

5 — Evolution: v1 → v2 → v3

[Timeline: DINOv1 (2021) — self-distillation, momentum teacher, multi-crop strategy, centering & sharpening, ViT-S/B (21M–86M), ImageNet-1K (1.2M imgs), 77.0% linear (ViT-S/16), emergent segmentation, [CLS] token features → (+data, +iBOT) → DINOv2 (2023) — DINO + iBOT losses, KoLeo regularizer, automatic data curation, LVD-142M (142M imgs), ViT-g/14 (1.1B params), model distillation, 84.5% linear (ViT-g), universal features, dense prediction tasks → (+scale, +Gram) → DINOv3 (2025) — Gram Anchoring, 7B parameters, 1.7B images, stable extreme-scale training, DINOv2 recipe retained, vision scaling laws, SOTA universal features, foundation model era.]
| Attribute | DINOv1 (2021) | DINOv2 (2023) | DINOv3 (2025) |
|---|---|---|---|
| Training Loss | DINO self-distillation | DINO + iBOT | DINO + iBOT + Gram Anchoring |
| Largest Model | ViT-B/16 (86M) | ViT-g/14 (1.1B) | ViT (7B) |
| Training Data | ImageNet-1K (1.2M) | LVD-142M (142M) | ~1.7B images |
| ImageNet Accuracy | 77.0% (linear, ViT-S/16) | 84.5% (linear, ViT-g/14) | Further improvements |
| Feature Tokens | [CLS] only | [CLS] + patch tokens | [CLS] + patch tokens |
| Collapse Prevention | Centering + sharpening | Centering + KoLeo | Centering + KoLeo + Gram Anchoring |
| Key Contribution | Emergent ViT properties | Universal visual features | Stable training at extreme scale |
| Downstream Tasks | Classification, retrieval | Classification, segmentation, depth, retrieval | All tasks with improved quality |

6 — Training Details

DINOv1 Training

Optimization Setup

16 V100 GPUs — 300–800 epochs on ImageNet

Optimizer: AdamW with weight decay 0.04–0.4 (cosine schedule). Learning rate: linearly warmed up for 10 epochs, then cosine decay. Batch size: 1024. Multi-crop: 2 global crops at 224×224 + 10 local crops at 96×96. Teacher EMA: momentum starts at 0.996 and increases to 1.0 with a cosine schedule during training.

Teacher temperature: τt is linearly warmed up from 0.04 to 0.07 over the first 30 epochs. Student temperature: τs = 0.1. The centering vector c is updated with momentum 0.9 as an exponential moving average of the teacher's output batch means.
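The teacher update and its momentum schedule fit in a few lines; the sketch below follows the 0.996 → 1.0 cosine schedule quoted above, with everything else simplified:

```python
# EMA teacher update with a cosine momentum schedule from 0.996 to 1.0.
import math
import torch

@torch.no_grad()
def update_teacher(student, teacher, step, total_steps, m_base=0.996):
    m = 1.0 - (1.0 - m_base) * (math.cos(math.pi * step / total_steps) + 1) / 2
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(m).add_(p_s.detach(), alpha=1 - m)   # teacher <- m * teacher + (1 - m) * student
```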

DINOv2 Training

Large-Scale Setup

ViT-g/14 on LVD-142M — A100 GPU cluster

Optimizer: AdamW. Batch size: scaled up to 3072 for larger models. Resolution: 224×224 during pre-training, with a short 518×518 fine-tuning phase for high-resolution features. Mixed precision: bfloat16 for memory efficiency. Training duration: 625,000 iterations for the ViT-g/14 model.

Loss weighting: The DINO (image-level) and iBOT (patch-level) losses are combined with equal weight. The iBOT masking follows a block-wise random strategy masking ~50% of patches. The KoLeo regularizer weight is tuned to balance uniform distribution without disrupting the primary losses.
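As a toy illustration of block-wise masking (a simplified stand-in for the actual iBOT masking strategy), random rectangular blocks can be drawn on the 14×14 patch grid until roughly half the patches are covered:

```python
# Toy block-wise mask over a 14x14 patch grid, targeting ~50% masked patches.
import torch

def block_mask(grid=14, target_ratio=0.5, max_block=6):
    mask = torch.zeros(grid, grid, dtype=torch.bool)
    while mask.float().mean() < target_ratio:
        h = torch.randint(1, max_block + 1, (1,)).item()
        w = torch.randint(1, max_block + 1, (1,)).item()
        top = torch.randint(0, grid - h + 1, (1,)).item()
        left = torch.randint(0, grid - w + 1, (1,)).item()
        mask[top:top + h, left:left + w] = True    # mask a random rectangular block
    return mask.flatten()                          # [196] bool, True = masked position
```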

Distillation recipe: After training the ViT-g teacher, smaller models (ViT-S/14, ViT-B/14, ViT-L/14) are trained via knowledge distillation by matching the teacher's patch and CLS token representations. This takes significantly less compute than training from scratch and produces models that are notably stronger than self-supervised training at the same scale.

DINOv3 Training

Extreme-Scale Setup

7B parameters — massive GPU cluster

DINOv3 training requires large clusters of high-bandwidth GPUs (H100 or equivalent). The Gram Anchoring mechanism adds minimal computational overhead — periodically computing and comparing Gram matrices is inexpensive relative to the forward/backward pass of a 7B model. The anchoring interval and regularization strength are key hyperparameters that balance stability against constraining the model's learning capacity.

7 — Results

Image Classification

| Model | Params | ImageNet k-NN | ImageNet Linear | Method |
|---|---|---|---|---|
| DINOv1 ViT-S/16 | 21M | 74.5% | 77.0% | Self-distillation only |
| DINOv1 ViT-B/16 | 86M | 76.1% | 78.2% | Self-distillation only |
| DINOv2 ViT-S/14 | 21M | — | 81.1% | Distilled from ViT-g |
| DINOv2 ViT-B/14 | 86M | — | 82.1% | Distilled from ViT-g |
| DINOv2 ViT-L/14 | 304M | — | 83.5% | Distilled from ViT-g |
| DINOv2 ViT-g/14 | 1.1B | — | 84.5% | DINO + iBOT + KoLeo |

Dense Prediction (DINOv2 ViT-g/14, frozen backbone)

| Task | Dataset | Metric | DINOv2 (linear) | Notes |
|---|---|---|---|---|
| Semantic Segmentation | ADE20K | mIoU | 49.0 | Linear head on frozen features |
| Depth Estimation | NYUd | RMSE | 0.326 | Linear head on frozen features |
| Image Retrieval | Oxford / Paris | mAP | SOTA | Nearest neighbor, no fine-tuning |

The universality gap: Prior self-supervised methods (e.g., MAE, MoCo) required fine-tuning to match supervised baselines. DINOv2 is the first model where frozen features with a simple linear head match or exceed fine-tuned supervised models across multiple tasks simultaneously. This closes the gap between self-supervised and supervised pre-training.

Emergent Properties (DINOv1)

Beyond quantitative benchmarks, DINOv1 revealed qualitative properties that were not present in supervised models or prior self-supervised approaches: attention maps that act as unsupervised object segmentations, and features that serve directly as strong k-NN classifiers and retrieval descriptors without any fine-tuning.

8 — Key Takeaways

Self-Distillation Is All You Need

DINO demonstrated that a momentum teacher-student framework, without negative pairs, contrastive losses, or cluster assignments, is sufficient to learn excellent visual representations. The simplicity of the approach — student matches teacher, teacher is EMA of student — belies its power. Centering and sharpening are the only additions needed to prevent collapse.

ViTs Unlock Emergent Properties

The emergent segmentation in DINOv1 attention maps was a watershed moment for the field. It showed that Vision Transformers, when freed from supervised label constraints, learn to decompose scenes into semantic parts. This property does not emerge with CNNs or with supervised ViT training — it is unique to the combination of ViTs and self-supervised self-distillation.

Data Curation Matters as Much as Scale

DINOv2's LVD-142M dataset demonstrated that curated data consistently outperforms uncurated data at the same scale. The automatic pipeline for retrieval, de-duplication, and rebalancing is as important as the training algorithm itself. This lesson carries forward into DINOv3, where the dataset scales to 1.7B images while maintaining curation quality.

Vision Foundation Models Are Here

The DINO family established that self-supervised vision models can serve as universal feature extractors, much like large language models serve as universal text processors. DINOv2 features work across classification, segmentation, depth estimation, and retrieval with frozen backbones. DINOv3 pushes this to 7B parameters, suggesting that vision scaling laws parallel those in NLP.

Stability Is the Bottleneck at Scale

DINOv3's Gram Anchoring highlights that the primary challenge in scaling self-supervised vision models is not compute or data but training stability. The teacher-student feedback loop in self-distillation creates unique instability modes that worsen with scale. Solving this problem unlocks the next order of magnitude in model size and downstream performance.

Impact on the field: DINO features have become the default visual backbone for numerous downstream systems. DINOv2 is widely used in robotics (visual representations for manipulation), autonomous driving (depth and segmentation), medical imaging (transfer learning without domain-specific labels), and multimodal models (visual encoders for VLMs). The DINO family fundamentally shifted computer vision from supervised pre-training to self-supervised foundation models.

9 — References & Further Reading