JEPA: Joint-Embedding Predictive Architectures
1 — The Problem
Self-supervised vision models generally fall into two camps: invariance-based methods (DINO, SimCLR, MoCo) that pull different views of the same image together in embedding space, and generative methods (MAE, BEiT) that reconstruct missing pixels. Each has a characteristic flaw. Invariance methods require carefully hand-crafted augmentations and can learn features too invariant for dense prediction. Generative methods waste capacity modeling high-frequency pixel noise that carries no semantic content.
Yann LeCun's Joint-Embedding Predictive Architecture proposes a third path: predict the representation of one part of the signal from the representation of another part, inside a learned embedding space. The network is never asked to generate pixels, only to fill in features. This sidesteps both hand-crafted augmentations and pixel-level reconstruction.
Three Instances of the Same Idea
The JEPA family applies this recipe across three data regimes:
- I-JEPA (2023) — predicts masked image blocks in embedding space; no augmentations
- V-JEPA (2024) — extends the block-prediction recipe to spatio-temporal video tubes
- MC-JEPA (2023) — jointly learns content (JEPA) and motion (self-supervised optical flow) in one shared encoder
2 — I-JEPA: Image JEPA
I-JEPA (Assran et al., CVPR 2023) is the canonical JEPA instantiation. Given a single image, it masks out several target blocks and asks a small predictor network to reconstruct their encoded representations from the representations of a separately sampled context block. No pixels are ever decoded; no color-jitter or crop augmentations are used.
1. Multi-Block Masking
The image is split into 14×14 patches (for a 224×224 input with patch size 16). Four target blocks are sampled with scale 0.15–0.20 of the image and aspect ratios in [0.75, 1.5]. A single large context block is sampled with scale 0.85–1.0; any patches overlapping a target are removed from the context to prevent information leakage. This asymmetry — small scattered targets vs. one large context — forces genuinely semantic prediction rather than local copy-paste.
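A minimal sketch of this sampling logic, assuming a 14×14 patch grid (the helper names are ours; the scale and aspect ranges follow the numbers above):

```python
import math
import random


def sample_block(grid=14, scale=(0.15, 0.20), aspect=(0.75, 1.5)):
    """Sample one rectangular block of patch indices on a grid x grid layout."""
    area = random.uniform(*scale) * grid * grid        # target area in patches
    ratio = random.uniform(*aspect)                     # aspect ratio h/w
    h = max(1, min(grid, round(math.sqrt(area * ratio))))
    w = max(1, min(grid, round(math.sqrt(area / ratio))))
    top = random.randint(0, grid - h)
    left = random.randint(0, grid - w)
    return {(top + i) * grid + (left + j) for i in range(h) for j in range(w)}


def sample_masks(grid=14, num_targets=4):
    """Four small target blocks plus one large context block, with overlap removed."""
    targets = [sample_block(grid) for _ in range(num_targets)]
    context = sample_block(grid, scale=(0.85, 1.0), aspect=(1.0, 1.0))
    # drop any context patch that overlaps a target so no target information leaks
    context -= set().union(*targets)
    return context, targets
```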
2. Context Encoder (ViT)
The context patches (typically ~50–100 tokens depending on sampling) pass through a standard Vision Transformer. I-JEPA uses ViT-B (86M), ViT-L (307M), and ViT-H (632M) variants with patch size 14 or 16. The encoder output is a sequence of patch embeddings used both as a source of prediction signal and, after training, as the general-purpose feature extractor for downstream tasks.
3. Predictor
The predictor is a small Vision Transformer (much narrower than the encoder) that takes (a) the context encoder output and (b) learnable mask tokens — one per patch position inside each target block, augmented with positional embeddings. It outputs predicted feature vectors ŷ for every masked position. Crucially, the predictor conditions on the target position, so the model must learn a conditional distribution over features, not just a constant mean.
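A schematic of that forward pass; the predictor width (384), depth (6), and head count (12) follow the training-details section below, the encoder dimension of 1024 assumes ViT-L, and a plain `TransformerEncoder` stands in for the actual ViT blocks:

```python
import torch
import torch.nn as nn


class JEPAPredictor(nn.Module):
    """Narrow transformer that predicts target-patch features from context features."""

    def __init__(self, enc_dim=1024, pred_dim=384, depth=6, heads=12, num_patches=196):
        super().__init__()
        self.proj_in = nn.Linear(enc_dim, pred_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, pred_dim))
        self.pos_emb = nn.Parameter(torch.zeros(1, num_patches, pred_dim))
        layer = nn.TransformerEncoderLayer(pred_dim, heads, 4 * pred_dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.proj_out = nn.Linear(pred_dim, enc_dim)

    def forward(self, encoder_out, context_pos, target_pos):
        B = encoder_out.size(0)
        # context tokens keep their positional embeddings
        ctx = self.proj_in(encoder_out) + self.pos_emb[:, context_pos]
        # one mask token per target position, tagged with that position's embedding
        tgt = self.mask_token.expand(B, len(target_pos), -1) + self.pos_emb[:, target_pos]
        x = self.blocks(torch.cat([ctx, tgt], dim=1))
        return self.proj_out(x[:, -len(target_pos):])   # predicted features ŷ
```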
4. Target Encoder (EMA)
The target encoder is an exponential moving average of the context encoder's weights, updated after every step with momentum typically 0.996 → 1.0 on a cosine schedule. It encodes the full image (not just the context) and provides the target vectors y for the masked positions. Its output is stop-gradient: gradients flow only through the context encoder and predictor. This asymmetry plus feature-space prediction is what prevents collapse — no centering, sharpening, or negative pairs required.
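The update itself is only a few lines; a minimal PyTorch sketch (the momentum value follows the cosine schedule described above, and the target encoder starts as a deep copy of the context encoder):

```python
import torch


@torch.no_grad()
def ema_update(target_encoder, context_encoder, momentum):
    """Nudge the target encoder toward the current context-encoder weights."""
    for t, c in zip(target_encoder.parameters(), context_encoder.parameters()):
        t.mul_(momentum).add_(c, alpha=1.0 - momentum)
```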
5. Loss
The loss is the mean squared error (or smooth-L1) between predicted and target patch embeddings, averaged over all masked positions across all four target blocks:
L = (1/M) Σ_{i ∈ targets} ‖ŷ_i − sg(y_i)‖₂²
where sg(·) denotes stop-gradient. No contrastive term, no reconstruction term, no pixel loss.
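In code, with `pred` the predictor output at the masked positions and `target_feats` the EMA target encoder's features at the same positions (variable names are ours):

```python
import torch.nn.functional as F


def jepa_loss(pred, target_feats):
    """Mean squared error between predicted and target features.

    detach() is the stop-gradient: no gradient flows into the target encoder.
    The smooth-L1 variant would simply swap in F.smooth_l1_loss.
    """
    return F.mse_loss(pred, target_feats.detach())
```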
3 — V-JEPA: Video JEPA
V-JEPA (Bardes et al., 2024) carries the I-JEPA recipe into the temporal domain. Instead of 2D patches of a single image, the input is a short video clip tokenized into 3D spatio-temporal tubes, and the masking is done on tubes rather than 2D blocks.
1. Spatio-Temporal Tokenization
A video clip is a stack of RGB frames. V-JEPA tokenizes it into 3D patches of size 2×16×16 (2 frames × 16×16 pixels), giving 8 temporal steps × 14×14 spatial positions = 1,568 tube tokens per clip. Each tube is linearly embedded into the transformer dimension along with a 3D positional encoding.
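The tokenizer is effectively a 3D convolution whose kernel and stride equal the tube size; a sketch of the shapes, with the embedding dimension as a placeholder:

```python
import torch
import torch.nn as nn

# 16-frame RGB clip at 224x224: (batch, channels, time, height, width)
clip = torch.randn(1, 3, 16, 224, 224)

# Tube embedding: each 2x16x16 tube becomes one token (embed dim 1024 is a placeholder)
tube_embed = nn.Conv3d(3, 1024, kernel_size=(2, 16, 16), stride=(2, 16, 16))
tokens = tube_embed(clip)                    # (1, 1024, 8, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 1568, 1024): 8 * 14 * 14 = 1,568 tokens
```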
2. Tube Masking Strategy
V-JEPA samples two kinds of masks per clip: short-range masks (the union of several smaller spatial blocks) and long-range masks (fewer, larger spatial blocks). In both cases the spatial mask is held fixed across every frame, so the model cannot recover the masked content simply by copying it from a neighbouring frame. Roughly 90% of tubes are masked, forcing the model to predict broad regions from a sparse context, far more aggressive than I-JEPA's ~50% mask rate.
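A sketch of the defining property: the same spatial block is removed at every temporal step (block count and size here are illustrative, not the paper's exact settings):

```python
import random


def sample_tube_mask(num_blocks=8, t=8, grid=14, block=(6, 6)):
    """Union of spatial blocks, each replicated across ALL temporal steps.

    Returns flat token indices into the (t * grid * grid) token sequence.
    """
    masked = set()
    h, w = block
    for _ in range(num_blocks):
        top, left = random.randint(0, grid - h), random.randint(0, grid - w)
        spatial = {(top + i) * grid + (left + j) for i in range(h) for j in range(w)}
        # replicate the same spatial mask at every temporal position
        masked |= {step * grid * grid + s for step in range(t) for s in spatial}
    return masked


mask = sample_tube_mask()
print(f"masked fraction: {len(mask) / (8 * 14 * 14):.2f}")
```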
3. Feature Prediction, Not Pixel Prediction
Exactly as in I-JEPA: the context encoder sees the unmasked tubes, the predictor ingests the encoder output plus learnable mask tokens at the missing positions, and the loss is an L1 distance between the predictor's output and the EMA target encoder's full-clip output at the masked positions. V-JEPA explicitly ablates against a pixel-reconstruction variant (MVP-style) and shows that feature-space prediction produces markedly better motion and action features.
4. VideoMix2M Pre-training
V-JEPA is pre-trained on a mixture of ~2 million unlabeled videos pooled from Kinetics-710, Something-Something-v2, and HowTo100M. The paper trains ViT-L/16 and ViT-H/16 3D-ViT variants. Training runs for ~90K iterations on H100/A100 GPU clusters with mixed precision.
4 — MC-JEPA: Motion & Content
MC-JEPA (Bardes et al., 2023) is a different axis of extension. Rather than scale to video clips, it asks whether a single shared encoder can learn both content representations (semantic, object-level) and motion representations (pixel-level optical flow) at the same time — two objectives that are usually served by separate networks.
1. Shared Encoder, Two Heads
A single ConvNeXt-style multi-scale encoder produces features for both frames t and t+1. Those features are fed into two task heads that share no parameters beyond the backbone. The content head applies a dense, patch-level JEPA objective; the motion head regresses dense optical flow between the two frames using a classical self-supervised flow objective (photometric reconstruction + smoothness).
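The control flow is simple; a schematic sketch (class and method signatures are ours, not the MC-JEPA code):

```python
import torch.nn as nn


class MCJEPA(nn.Module):
    """One shared backbone, two heads: content features and optical flow."""

    def __init__(self, backbone, content_head, flow_head):
        super().__init__()
        self.backbone = backbone          # shared ConvNeXt-style encoder
        self.content_head = content_head  # dense JEPA + VICReg objective
        self.flow_head = flow_head        # RAFT-style iterative flow regression

    def forward(self, frame_t, frame_t1):
        feats_t = self.backbone(frame_t)      # multi-scale features for frame t
        feats_t1 = self.backbone(frame_t1)    # multi-scale features for frame t+1
        content_loss = self.content_head(feats_t, feats_t1)
        flow, flow_loss = self.flow_head(feats_t, feats_t1, frame_t, frame_t1)
        return content_loss, flow_loss
```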
2. Content Branch: Dense JEPA (VICReg-style)
The content head applies JEPA not at the image level but per spatial patch: predict the feature of a masked patch from its neighbors. Collapse is prevented by a VICReg-style regularization — explicit variance and covariance terms on the feature embeddings force each dimension to be active and decorrelated. This replaces the EMA target trick of I-JEPA and is cheaper since both frames go through the online encoder.
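A minimal sketch of the two regularizers, following the standard VICReg formulation (the threshold `gamma` and the final weighting are simplified):

```python
import torch
import torch.nn.functional as F


def vicreg_regularizer(z, gamma=1.0, eps=1e-4):
    """Variance + covariance penalties on a batch of embeddings z of shape (N, D)."""
    z = z - z.mean(dim=0)
    # variance term: push every dimension's standard deviation above gamma
    std = torch.sqrt(z.var(dim=0) + eps)
    var_loss = F.relu(gamma - std).mean()
    # covariance term: penalize off-diagonal covariance so dimensions decorrelate
    cov = (z.T @ z) / (z.size(0) - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = off_diag.pow(2).sum() / z.size(1)
    return var_loss, cov_loss
```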
3. Motion Branch: Self-Supervised Optical Flow
A correlation volume is built between the shared-encoder features of frames t and t+1. A small flow head (iterative refinement, RAFT-style) produces a dense per-pixel flow field. Training uses the classical self-supervised flow recipe: warp frame t+1 back to t using the predicted flow, minimize a photometric (census / SSIM) reconstruction loss on the warped result, and add an edge-aware smoothness prior. No ground-truth flow is used anywhere.
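A sketch of the warping step at the heart of this loss, using `grid_sample`; the census/SSIM term, smoothness prior, and occlusion masking are omitted for brevity:

```python
import torch
import torch.nn.functional as F


def warp(frame_t1, flow):
    """Warp frame t+1 back to frame t using a dense flow field of shape (B, 2, H, W)."""
    B, _, H, W = frame_t1.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).float().to(flow)     # (2, H, W), x before y
    coords = grid.unsqueeze(0) + flow                          # where each pixel came from
    # normalize coordinates to [-1, 1] as grid_sample expects
    coords_x = 2.0 * coords[:, 0] / (W - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid_norm = torch.stack([coords_x, coords_y], dim=-1)      # (B, H, W, 2)
    return F.grid_sample(frame_t1, grid_norm, align_corners=True)


def photometric_loss(frame_t, frame_t1, flow):
    """Plain L1 photometric loss between frame t and the warped frame t+1."""
    return (frame_t - warp(frame_t1, flow)).abs().mean()
```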
4. Joint Optimization
The two losses are summed with a scalar weight λ. The crucial empirical finding: the shared encoder trained on both tasks produces better content features and better flow than encoders trained on either task alone. Content training regularizes flow against degenerate solutions; flow training forces content features to respect temporal geometry. The model beats VICReg (content-only) on ImageNet linear probe and competes with supervised RAFT on flow benchmarks (Sintel, KITTI) despite never seeing flow labels.
5 — Comparison: I-JEPA / V-JEPA / MC-JEPA
| Attribute | I-JEPA (2023) | V-JEPA (2024) | MC-JEPA (2023) |
|---|---|---|---|
| Input | Single image | Video clip (16 frames) | Frame pair (t, t+1) |
| Tokenization | 2D patches (patch size 14 or 16) | 3D tubes (2×16×16) | ConvNeXt multi-scale features |
| Masking | 1 context + 4 target blocks | Short/long-range tubes (~90%) | Patch-level dense |
| Prediction space | Patch features | Tube features | Patch features + flow |
| Anti-collapse | EMA target encoder | EMA target encoder | VICReg variance/covariance |
| Augmentations | None | None | Minimal |
| Backbone | ViT-B/L/H | 3D ViT-L/H | ConvNeXt |
| Key extension | Canonical JEPA | Temporal dimension | Joint content + motion |
6 — Training Details
I-JEPA Training
Optimization
Optimizer: AdamW, weight decay 0.04 → 0.4 cosine. LR: warm-up 15 epochs, base 1e-3, cosine decay. Batch size: 2048. EMA: momentum 0.996 → 1.0 on cosine. Masking: 4 target blocks per image, scale 0.15–0.20, aspect 0.75–1.5; context block scale 0.85–1.0 with overlapping patches removed. Predictor: 6 layers, 384 embed dim, 12 heads. No color-jitter, no horizontal flip, no crop — only standard resize.
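A compact sketch of how those schedules fit together (steps per epoch and the floor LR are illustrative placeholders; the released code applies warm-up and decay per iteration and per parameter group):

```python
import math


def cosine(step, total, start, end):
    """Cosine interpolation from start to end over `total` steps."""
    return end + (start - end) * 0.5 * (1.0 + math.cos(math.pi * step / total))


def lr_at(step, warmup, total, base_lr, final_lr=1e-6):
    """Linear warm-up to base_lr, then cosine decay toward final_lr."""
    if step < warmup:
        return base_lr * step / warmup
    return cosine(step - warmup, total - warmup, base_lr, final_lr)


steps_per_epoch = 625                 # e.g. ~1.28M images / batch 2048 (illustrative)
total = 300 * steps_per_epoch
for step in (0, total // 2, total - 1):
    print(f"step {step:>6}: lr={lr_at(step, 15 * steps_per_epoch, total, 1e-3):.2e}  "
          f"wd={cosine(step, total, 0.04, 0.4):.3f}  "
          f"ema={cosine(step, total, 0.996, 1.0):.4f}")
```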
V-JEPA Training
Video Pre-training
Clip length: 16 frames at 224×224, sampled with random temporal stride. Tube size: 2×16×16. Mask ratio: ~90% of tubes (short- and long-range masks combined). Optimizer: AdamW with bf16 mixed precision. Iterations: 90K with cosine LR schedule and 12K warm-up. Loss: smooth-L1 over masked tubes, no reconstruction loss. Evaluation: frozen backbone + attentive probe on Kinetics, SSv2, ImageNet.
MC-JEPA Training
Joint Content + Flow
Input: two adjacent frames at ~384×832. Backbone: ConvNeXt-style encoder shared between both branches. Content loss: dense JEPA + VICReg (λ_var, λ_cov) to enforce non-collapse. Flow loss: census-transform photometric + edge-aware smoothness + occlusion-aware masking. Combined: L = L_content + λ · L_flow, with λ tuned on validation. Optimizer: Adam, LR 1e-4 cosine.
7 — Results
I-JEPA — ImageNet Linear Probe
| Model | Params | Pre-training | IN-1K Linear | IN-1K 1% Few-shot |
|---|---|---|---|---|
| MAE ViT-H/14 | 632M | IN-1K, 800 ep | 76.6% | — |
| I-JEPA ViT-H/14 | 632M | IN-1K, 300 ep | 77.3% | 65.2% |
| I-JEPA ViT-H/16 | 632M | IN-22K, 300 ep | 77.5% | — |
I-JEPA matches or beats pixel-reconstruction methods (MAE, data2vec) at comparable scale while using fewer pre-training epochs and no augmentations.
V-JEPA — Frozen Evaluation
| Task | Dataset | Metric | V-JEPA ViT-H/16 | Notes |
|---|---|---|---|---|
| Action classification | Kinetics-400 | Top-1 (frozen) | 81.9% | Attentive probe, frozen backbone |
| Action classification | Something-Something-v2 | Top-1 (frozen) | 72.2% | Motion-heavy benchmark |
| Image classification | ImageNet-1K | Top-1 (frozen) | 77.4% | Appearance transfer from video |
MC-JEPA — Content & Flow
| Task | Benchmark | Metric | Score |
|---|---|---|---|
| Image classification (frozen) | ImageNet-1K | Linear top-1 | Competitive w/ VICReg / BYOL |
| Optical flow (zero-shot) | Sintel clean | EPE (lower = better) | On par with self-sup SOTA |
| Optical flow (zero-shot) | KITTI 2015 | EPE | Strong self-sup flow |
8 — Key Takeaways
Predict in Feature Space, Not Pixel Space
The unifying thesis of the JEPA family is that self-supervised learning should predict abstract features rather than raw pixels. Pixel-level reconstruction wastes model capacity on unpredictable high-frequency detail (noise, exact textures, lighting). Feature-space prediction lets the encoder discard that entropy and focus on the structured, predictable part of the signal — which turns out to be the semantic part.
No Augmentations Needed
Unlike SimCLR, MoCo, and DINO, I-JEPA and V-JEPA use no color-jitter, no random-resized-crops, and no horizontal flips. Prediction between different spatial regions of the same image is the augmentation. This removes a significant inductive-bias knob — one that had been quietly doing a lot of the work in invariance-based SSL — and makes results more portable across domains where the standard augmentations don't make sense (e.g., medical imaging, remote sensing).
EMA + Stop-Gradient Prevents Collapse
I-JEPA and V-JEPA train stably without centering, sharpening, negative pairs, or variance regularization. The combination of (a) a stop-gradient EMA target, (b) a position-conditioned predictor, and (c) diverse mask locations is enough. MC-JEPA demonstrates a complementary recipe — VICReg-style variance/covariance regularization — that also works, removing the need for an EMA copy.
The Same Recipe Crosses Modalities
The JEPA recipe — encode context, predict masked target features, EMA stop-gradient target — ports cleanly from 2D images (I-JEPA) to 3D video tubes (V-JEPA) with only surface-level changes to tokenization and masking. MC-JEPA shows it also composes with other self-supervised tasks (optical flow) under a shared encoder. This generality is what makes JEPA a candidate recipe rather than just a single model.
JEPA as a Path Toward World Models
LeCun has argued that JEPA-style predictive architectures are a necessary ingredient for AI systems that build world models — learning to predict future states of the world in an abstract feature space where planning is tractable. V-JEPA is the first step in that direction: a video model that predicts how the world looks next, in features rather than pixels. Whether that path delivers on the larger ambition remains open, but I-JEPA, V-JEPA, and MC-JEPA together establish that feature-space prediction is a robust and general recipe.
9 — References & Further Reading
- Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (I-JEPA) — Assran, Duval, Misra, Bojanowski, Vincent, Rabbat, LeCun, Ballas — CVPR 2023
- Revisiting Feature Prediction for Learning Visual Representations from Video (V-JEPA) — Bardes, Garrido, Ponce, Chen, Rabbat, LeCun, Assran, Ballas — 2024
- MC-JEPA: A Joint-Embedding Predictive Architecture for Self-Supervised Learning of Motion and Content Features — Bardes, Ponce, LeCun — 2023
- A Path Towards Autonomous Machine Intelligence — Yann LeCun position paper — 2022
- Official I-JEPA GitHub Repository — facebookresearch/ijepa
- Official V-JEPA GitHub Repository — facebookresearch/jepa
- Our DINO Walkthrough — companion self-supervised vision model (self-distillation rather than predictive)
- Our ViT Walkthrough — background on Vision Transformer architecture