JEPA: Joint-Embedding Predictive Architectures

2023–2024 — Meta AI (FAIR) — I-JEPA, V-JEPA, MC-JEPA
Self-Supervised Learning · Vision Transformer · Predictive Architecture · Video Understanding · Optical Flow · Meta AI

1 — The Problem

Self-supervised vision models generally fall into two camps: invariance-based methods (DINO, SimCLR, MoCo) that pull different views of the same image together in embedding space, and generative methods (MAE, BEiT) that reconstruct missing pixels. Each has a characteristic flaw. Invariance methods require carefully hand-crafted augmentations and can learn features too invariant for dense prediction. Generative methods waste capacity modeling high-frequency pixel noise that carries no semantic content.

Yann LeCun's Joint-Embedding Predictive Architecture proposes a third path: predict the representation of one part of the signal from the representation of another part, inside a learned embedding space. The network is never asked to generate pixels, only to fill in features. This sidesteps both hand-crafted augmentations and pixel-level reconstruction.

The JEPA insight: Prediction in representation space is both easier and more useful than prediction in pixel space. The encoder can discard unpredictable low-level detail (textures, exact pixel values) and focus capacity on semantic content — the part of the signal that is predictable from context.

Three Instances of the Same Idea

The JEPA family applies this recipe across three data regimes: single images (I-JEPA), video clips (V-JEPA), and frame pairs with a joint content-and-motion objective (MC-JEPA).

2 — I-JEPA: Image JEPA

I-JEPA (Assran et al., CVPR 2023) is the canonical JEPA instantiation. Given a single image, it masks out several target blocks and asks a small predictor network to reconstruct their encoded representations from the representations of a separately sampled context block. No pixels are ever decoded; no color-jitter or crop augmentations are used.

Figure: I-JEPA architecture. A context encoder (trained ViT) encodes the context block; an EMA target encoder encodes the full image with stop-gradient; a narrow ViT predictor with target mask tokens produces predictions ŷ, compared against targets y with an L2 / smooth-L1 loss in feature space. Gradients flow only to the context encoder and predictor.

1 — Multi-Block Masking

1 context + 4 target blocks per image

The image is split into a 14×14 grid of patches (224×224 input, patch size 16). Four target blocks are sampled with scale 0.15–0.20 of the image and aspect ratios in [0.75, 1.5]. A single large context block is sampled with scale 0.85–1.0; any patches overlapping a target are removed from the context to prevent information leakage. This asymmetry — small scattered targets vs. one large context — forces genuinely semantic prediction rather than local copy-paste.
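
To make the geometry concrete, here is a minimal sketch of the block-sampling logic in Python; the helper names and exact rounding are illustrative, not the official I-JEPA implementation:

```python
import math
import random

def sample_block(grid=14, scale=(0.15, 0.20), aspect=(0.75, 1.5)):
    """Sample one rectangular block of patch indices on a grid x grid patch map."""
    area = random.uniform(*scale) * grid * grid
    ar = random.uniform(*aspect)
    h = min(grid, max(1, round(math.sqrt(area * ar))))
    w = min(grid, max(1, round(math.sqrt(area / ar))))
    top, left = random.randint(0, grid - h), random.randint(0, grid - w)
    return {r * grid + c for r in range(top, top + h) for c in range(left, left + w)}

def sample_masks(grid=14, num_targets=4):
    """One large context block + four small target blocks; overlap removed from the context."""
    targets = [sample_block(grid, (0.15, 0.20), (0.75, 1.5)) for _ in range(num_targets)]
    context = sample_block(grid, (0.85, 1.0), (1.0, 1.0))
    context -= set().union(*targets)   # no information leakage into the context
    return context, targets
```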

2 — Context Encoder (ViT)

[B, N_ctx, D] — ViT-B/14, L/14, or H/14

The context patches (typically ~50–100 tokens depending on sampling) pass through a standard Vision Transformer. I-JEPA uses ViT-B (86M), ViT-L (307M), and ViT-H (632M) variants with patch size 14 or 16. The encoder output is a sequence of patch embeddings used both as a source of prediction signal and, after training, as the general-purpose feature extractor for downstream tasks.

3 — Predictor

Narrow ViT — 384 dims, 6 layers

The predictor is a small Vision Transformer (much narrower than the encoder) that takes (a) the context encoder output and (b) learnable mask tokens — one per patch position inside each target block, augmented with positional embeddings. It outputs predicted feature vectors ŷ for every masked position. Crucially, the predictor conditions on the target position, so the model must learn a conditional distribution over features, not just a constant mean.
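
A hedged sketch of what such a predictor can look like in PyTorch — the module structure, the use of nn.TransformerEncoder, and the index handling are assumptions for illustration, not the paper's code:

```python
import torch
import torch.nn as nn

class Predictor(nn.Module):
    """Narrow ViT mapping context features + positioned mask tokens to predicted target features."""
    def __init__(self, enc_dim=1280, pred_dim=384, depth=6, heads=12, num_patches=196):
        super().__init__()
        self.proj_in = nn.Linear(enc_dim, pred_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, pred_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, pred_dim))
        layer = nn.TransformerEncoderLayer(pred_dim, heads, 4 * pred_dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.proj_out = nn.Linear(pred_dim, enc_dim)

    def forward(self, ctx_feats, ctx_idx, tgt_idx):
        # ctx_feats: [B, N_ctx, enc_dim]; ctx_idx: [B, N_ctx]; tgt_idx: [B, N_tgt] patch positions
        B = ctx_feats.size(0)
        pos = self.pos_embed.expand(B, -1, -1)
        ctx = self.proj_in(ctx_feats) + torch.gather(
            pos, 1, ctx_idx.unsqueeze(-1).expand(-1, -1, pos.size(-1)))
        tgt = self.mask_token.expand(B, tgt_idx.size(1), -1) + torch.gather(
            pos, 1, tgt_idx.unsqueeze(-1).expand(-1, -1, pos.size(-1)))
        x = self.blocks(torch.cat([ctx, tgt], dim=1))   # joint self-attention over context + mask tokens
        return self.proj_out(x[:, ctx.size(1):])        # predictions ŷ at the target positions
```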

4 — Target Encoder (EMA)

EMA of context encoder — no gradient

The target encoder is an exponential moving average of the context encoder's weights, updated after every step with momentum typically 0.996 → 1.0 on a cosine schedule. It encodes the full image (not just the context) and provides the target vectors y for the masked positions. Its output is stop-gradient: gradients flow only through the context encoder and predictor. This asymmetry plus feature-space prediction is what prevents collapse — no centering, sharpening, or negative pairs required.
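
The EMA update itself is a handful of lines; a sketch assuming two PyTorch modules with identical architectures:

```python
import math
import torch

@torch.no_grad()
def ema_update(target_encoder, context_encoder, momentum):
    """target <- m * target + (1 - m) * context, applied once per optimizer step."""
    for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
        p_t.mul_(momentum).add_(p_c, alpha=1.0 - momentum)

def ema_momentum(step, total_steps, m_start=0.996, m_end=1.0):
    """Cosine ramp of the momentum from m_start to m_end over the full run."""
    cos = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))   # 1 at step 0, 0 at the end
    return m_end - (m_end - m_start) * cos
```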

5 — Loss

Average L2 over masked patch positions

The loss is the mean squared error (or smooth-L1) between predicted and target patch embeddings, averaged over all masked positions across all four target blocks:

L = (1/M) Σ_{i ∈ targets} ||ŷ_i − sg(y_i)||₂²

where sg(·) denotes stop-gradient. No contrastive term, no reconstruction term, no pixel loss.
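
In code, the whole objective reduces to gathering the target features at the masked positions under no_grad and regressing the predictor output onto them; a sketch with assumed tensor shapes:

```python
import torch
import torch.nn.functional as F

def ijepa_loss(pred, target_feats, tgt_idx):
    """Average regression loss in feature space over all masked positions.

    pred:         [B, N_tgt, D] predictor outputs ŷ for the target positions
    target_feats: [B, N_all, D] EMA target-encoder output for the full image
    tgt_idx:      [B, N_tgt]    patch indices of the target positions
    """
    with torch.no_grad():   # stop-gradient on the target branch
        y = torch.gather(
            target_feats, 1,
            tgt_idx.unsqueeze(-1).expand(-1, -1, target_feats.size(-1)))
    return F.smooth_l1_loss(pred, y)   # or F.mse_loss for plain L2
```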

Why no collapse? With an EMA target and feature-space prediction, the trivial solution "output a constant vector" is achievable — but the predictor is conditioned on position and must produce different features for different target locations. The optimization landscape favors informative features that let the predictor succeed, not collapsed ones. Empirically, I-JEPA trains stably without centering, sharpening, or any auxiliary anti-collapse term.

3 — V-JEPA: Video JEPA

V-JEPA (Bardes et al., 2024) carries the I-JEPA recipe into the temporal domain. Instead of 2D patches of a single image, the input is a short video clip tokenized into 3D spatio-temporal tubes, and the masking is done on tubes rather than 2D blocks.

Figure: V-JEPA architecture. A 16-frame clip with a spatio-temporal tube mask is processed by a 3D ViT-L or ViT-H context encoder; an EMA target encoder sees the full clip; a narrow ViT predictor with mask tokens predicts features, trained with a feature-space L1 loss over the masked tubes.

1 — Spatio-Temporal Tokenization

Clip: 16 × 224 × 224 → 2 × 16 × 16 tubes

A video clip is a stack of RGB frames. V-JEPA tokenizes it into 3D patches of size 2×16×16 (2 frames × 16×16 pixels), giving 8 temporal steps × 14×14 spatial positions = 1,568 tube tokens per clip. Each tube is linearly embedded into the transformer dimension along with a 3D positional encoding.
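
A small sketch of tube tokenization via a 3D convolution stem; the Conv3d-based embedding is a standard 3D-ViT pattern, assumed here rather than taken from the V-JEPA code:

```python
import torch
import torch.nn as nn

class TubeEmbed(nn.Module):
    """Embed a clip [B, 3, T, H, W] into tube tokens with a 3D conv stem."""
    def __init__(self, dim=1024, tube=(2, 16, 16)):
        super().__init__()
        self.proj = nn.Conv3d(3, dim, kernel_size=tube, stride=tube)

    def forward(self, clip):
        x = self.proj(clip)                    # [B, dim, T/2, H/16, W/16]
        return x.flatten(2).transpose(1, 2)    # [B, N_tubes, dim]

clip = torch.randn(1, 3, 16, 224, 224)
tokens = TubeEmbed()(clip)
print(tokens.shape)                            # torch.Size([1, 1568, 1024]) = 8 × 14 × 14 tubes
```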

2 — Tube Masking Strategy

Short-range + long-range masks — ~90% of tubes

V-JEPA samples two kinds of masks per clip: short-range masks (a rectangular region held fixed across all frames — tests if the model understands temporal extension of a spatial region) and long-range masks (a larger rectangular region also held across time). Roughly 90% of tubes are masked, forcing the model to predict broad regions from a sparse context — far more aggressive than I-JEPA's ~50% mask rate.
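
An illustrative sketch of tube-mask sampling — the block counts and scales below are assumptions chosen to land near the ~90% mask rate, not the paper's exact hyperparameters:

```python
import random

def sample_tube_mask(t_steps=8, grid=14, spatial_scale=0.7):
    """Mask a square spatial block in every temporal step (mask held fixed across time)."""
    side = max(1, round(grid * spatial_scale ** 0.5))
    top, left = random.randint(0, grid - side), random.randint(0, grid - side)
    masked = set()
    for t in range(t_steps):                     # same spatial block at every time step
        for r in range(top, top + side):
            for c in range(left, left + side):
                masked.add(t * grid * grid + r * grid + c)
    return masked

# combine several short-range and long-range masks to cover most of the tubes
mask = set()
for scale in [0.15] * 8 + [0.7] * 2:             # illustrative block counts and scales
    mask |= sample_tube_mask(spatial_scale=scale)
```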

3 — Feature Prediction, Not Pixel Prediction

Loss: L1 between predictor output and EMA-target features

Exactly as in I-JEPA: the context encoder sees the unmasked tubes, the predictor ingests the encoder output plus learnable mask tokens at the missing positions, and the loss is an L1 distance between the predictor's output and the EMA target encoder's full-clip output at the masked positions. V-JEPA explicitly ablates against a pixel-reconstruction variant (MVP-style) and shows that feature-space prediction produces markedly better motion and action features.

4 — VideoMix2M Pre-training

2M videos — Kinetics-710, SSv2, HowTo100M

V-JEPA is pre-trained on a mixture of ~2 million unlabeled videos pooled from Kinetics-710, Something-Something-v2, and HowTo100M. The paper trains 3D ViT-L/16 and ViT-H/16 variants. Training runs for ~90K iterations on H100/A100 GPU clusters with mixed precision.

Frozen-feature evaluation: V-JEPA features are evaluated by freezing the backbone and training only a small attentive pooler + linear head. Under this protocol, V-JEPA ViT-H/16 matches or beats pixel-prediction methods (VideoMAE, OmniMAE) on action classification (Kinetics-400, SSv2) and appearance-based tasks (ImageNet linear probe), demonstrating that feature-space video prediction produces more transferable representations than pixel-space reconstruction.

4 — MC-JEPA: Motion & Content

MC-JEPA (Bardes et al., 2023) is a different axis of extension. Rather than scale to video clips, it asks whether a single shared encoder can learn both content representations (semantic, object-level) and motion representations (pixel-level optical flow) at the same time — two objectives that are usually served by separate networks.

Figure: MC-JEPA architecture. Frames t and t+1 pass through one shared encoder (ConvNeXt / ViT-like, one set of weights); a content branch applies a dense JEPA objective with a VICReg-style variance/covariance predictive loss on features, a motion branch predicts dense optical flow with a photometric + smoothness loss, and the two are combined as L = L_content + λ · L_flow.

1 — Shared Encoder, Two Heads

Single backbone → content JEPA head + optical-flow head

A single ConvNeXt-style multi-scale encoder produces features for both frames t and t+1. Those features are fed into two task heads that share no parameters beyond the backbone. The content head applies a dense, patch-level JEPA objective; the motion head regresses dense optical flow between the two frames using a classical self-supervised flow objective (photometric reconstruction + smoothness).

2 — Content Branch: Dense JEPA (VICReg-style)

Per-patch prediction with variance/covariance regularization

The content head applies JEPA not at the image level but per spatial patch: predict the feature of a masked patch from its neighbors. Collapse is prevented by a VICReg-style regularization — explicit variance and covariance terms on the feature embeddings force each dimension to be active and decorrelated. This replaces the EMA target trick of I-JEPA and is cheaper since both frames go through the online encoder.
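
For reference, the variance and covariance terms look roughly like this — a sketch of the standard VICReg formulation, not MC-JEPA's exact code:

```python
import torch
import torch.nn.functional as F

def vicreg_var_cov(z, gamma=1.0, eps=1e-4):
    """Variance/covariance regularization on a batch of embeddings z: [N, D].

    The variance term pushes every dimension's std above gamma; the covariance term
    decorrelates dimensions. Together they penalize collapsed (constant) features.
    """
    z = z - z.mean(dim=0)
    std = torch.sqrt(z.var(dim=0) + eps)
    var_loss = F.relu(gamma - std).mean()              # hinge on per-dimension std
    cov = (z.T @ z) / (z.size(0) - 1)                  # [D, D] covariance matrix
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = off_diag.pow(2).sum() / z.size(1)
    return var_loss, cov_loss
```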

3 — Motion Branch: Self-Supervised Optical Flow

Photometric + smoothness loss — no flow labels

A correlation volume is built between the shared-encoder features of frames t and t+1. A small flow head (iterative refinement, RAFT-style) produces a dense per-pixel flow field. Training uses the classical self-supervised flow recipe: warp frame t+1 back to t using the predicted flow, minimize a photometric (census / SSIM) reconstruction loss on the warped result, and add an edge-aware smoothness prior. No ground-truth flow is used anywhere.
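
A simplified sketch of the flow objective — it substitutes a Charbonnier photometric term for the census transform and plain first-order smoothness for the edge-aware prior, so treat it as the shape of the loss rather than the exact recipe:

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp img (frame t+1) toward frame t using flow [B, 2, H, W] in pixels (x, y order)."""
    B, _, H, W = img.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    coords = torch.stack((xs, ys), dim=0).float().to(img.device) + flow     # sampling coordinates
    grid = torch.stack((2 * coords[:, 0] / (W - 1) - 1,                     # normalize to [-1, 1]
                        2 * coords[:, 1] / (H - 1) - 1), dim=-1)
    return F.grid_sample(img, grid, align_corners=True)

def flow_loss(frame_t, frame_t1, flow, lam_smooth=0.1):
    recon = warp(frame_t1, flow)
    photo = torch.sqrt((frame_t - recon) ** 2 + 1e-6).mean()                # Charbonnier photometric
    dx = (flow[:, :, :, 1:] - flow[:, :, :, :-1]).abs().mean()              # first-order smoothness
    dy = (flow[:, :, 1:, :] - flow[:, :, :-1, :]).abs().mean()
    return photo + lam_smooth * (dx + dy)
```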

4 — Joint Optimization

L = L_content + λ · L_flow

The two losses are summed with a scalar weight λ. The crucial empirical finding: the shared encoder trained on both tasks produces better content features and better flow than encoders trained on either task alone. Content training regularizes flow against degenerate solutions; flow training forces content features to respect temporal geometry. The model is competitive with VICReg and BYOL on ImageNet linear probing and on par with state-of-the-art self-supervised flow methods on Sintel and KITTI, despite never seeing flow labels.

Why multi-task works here: The two objectives are complementary, not redundant. Optical flow is a low-level geometric task that demands fine-grained, localized features; JEPA is a high-level semantic task that demands invariant, object-level features. Asking a single encoder to serve both ends of that spectrum forces it to produce features that are simultaneously fine-grained and semantically organized — which is exactly what downstream vision tasks want.

5 — Comparison: I-JEPA / V-JEPA / MC-JEPA

Figure: side-by-side summary cards for I-JEPA (2023, images), V-JEPA (2024, which adds the temporal dimension), and MC-JEPA (2023, which adds optical flow under a shared encoder); the table below lists the same attributes.

| Attribute | I-JEPA (2023) | V-JEPA (2024) | MC-JEPA (2023) |
|---|---|---|---|
| Input | Single image | Video clip (16 frames) | Frame pair (t, t+1) |
| Tokenization | 2D patches (14 or 16) | 3D tubes (2×16×16) | ConvNeXt multi-scale |
| Masking | 1 context + 4 target blocks | Short/long-range tubes (~90%) | Patch-level dense |
| Prediction space | Patch features | Tube features | Patch features + flow |
| Anti-collapse | EMA target encoder | EMA target encoder | VICReg variance/covariance |
| Augmentations | None | None | Minimal |
| Backbone | ViT-B/L/H | 3D ViT-L/H | ConvNeXt |
| Key extension | Canonical JEPA | Temporal dimension | Joint content + motion |

6 — Training Details

I-JEPA Training

Optimization

16–64 A100 GPUs — 300 epochs on ImageNet / IN-22K

Optimizer: AdamW, weight decay 0.04 → 0.4 cosine. LR: warm-up 15 epochs, base 1e-3, cosine decay. Batch size: 2048. EMA: momentum 0.996 → 1.0 on cosine. Masking: 4 target blocks per image, scale 0.15–0.20, aspect 0.75–1.5; context block scale 0.85–1.0 with overlapping patches removed. Predictor: 6 layers, 384 embed dim, 12 heads. No color-jitter, no horizontal flip, no crop — only standard resize.
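
As a rough sketch, the optimizer and schedule setup might look like the following in PyTorch; the placeholder modules and steps_per_epoch value are illustrative, and the weight-decay ramp is noted rather than implemented:

```python
import torch
import torch.nn as nn

# Placeholders standing in for the real ViT context encoder and narrow-ViT predictor.
context_encoder, predictor = nn.Linear(768, 768), nn.Linear(768, 768)
steps_per_epoch = 1000                      # illustrative; depends on dataset size and batch size

# Only the context encoder and predictor receive gradients; the EMA target encoder does not.
params = list(context_encoder.parameters()) + list(predictor.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-3, weight_decay=0.04)

warmup_steps = 15 * steps_per_epoch
total_steps = 300 * steps_per_epoch
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[
        torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1e-3, total_iters=warmup_steps),
        torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps),
    ],
    milestones=[warmup_steps],
)
# The weight-decay ramp 0.04 -> 0.4 and the EMA momentum ramp 0.996 -> 1.0 are typically
# applied per step by mutating optimizer.param_groups and the momentum value directly.
```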

V-JEPA Training

Video Pre-training

~2M clips — multi-node H100 cluster

Clip length: 16 frames at 224×224, sampled with random temporal stride. Tube size: 2×16×16. Mask ratio: ~90% of tubes (short- and long-range masks combined). Optimizer: AdamW with bf16 mixed precision. Iterations: 90K with cosine LR schedule and 12K warm-up. Loss: smooth-L1 over masked tubes, no reconstruction loss. Evaluation: frozen backbone + attentive probe on Kinetics, SSv2, ImageNet.

MC-JEPA Training

Joint Content + Flow

Adjacent frame pairs from KITTI / YouTube-VOS

Input: two adjacent frames at ~384×832. Backbone: ConvNeXt-style encoder shared between both branches. Content loss: dense JEPA + VICReg (λ_var, λ_cov) to enforce non-collapse. Flow loss: census-transform photometric + edge-aware smoothness + occlusion-aware masking. Combined: L = L_content + λ · L_flow, with λ tuned on validation. Optimizer: Adam, LR 1e-4 cosine.

What's conspicuously missing: there are no color augmentations, no contrastive losses, no reconstruction pixel losses, no clustering assignments, and — for I-JEPA and V-JEPA — no hand-engineered anti-collapse tricks beyond the EMA target. The recipe is deliberately minimal: mask, encode, predict in feature space, stop-gradient the target.

7 — Results

I-JEPA — ImageNet Linear Probe

| Model | Params | Pre-training | IN-1K Linear | IN-1K 1% Few-shot |
|---|---|---|---|---|
| MAE ViT-H/14 | 632M | IN-1K, 800 ep | 76.6% | — |
| I-JEPA ViT-H/14 | 632M | IN-1K, 300 ep | 77.3% | 65.2% |
| I-JEPA ViT-H/16 | 632M | IN-22K, 300 ep | 77.5% | — |

I-JEPA matches or beats pixel-reconstruction methods (MAE, data2vec) at comparable scale while using fewer pre-training epochs and no augmentations.

V-JEPA — Frozen Evaluation

| Task | Dataset | Metric | V-JEPA ViT-H/16 | Notes |
|---|---|---|---|---|
| Action classification | Kinetics-400 | Top-1 (frozen) | 81.9% | Attentive probe, frozen backbone |
| Action classification | Something-Something-v2 | Top-1 (frozen) | 72.2% | Motion-heavy benchmark |
| Image classification | ImageNet-1K | Top-1 (frozen) | 77.4% | Appearance transfer from video |

MC-JEPA — Content & Flow

| Task | Benchmark | Metric | Score |
|---|---|---|---|
| Image classification (frozen) | ImageNet-1K | Linear top-1 | Competitive w/ VICReg / BYOL |
| Optical flow (zero-shot) | Sintel clean | EPE (lower = better) | On par with self-sup SOTA |
| Optical flow (zero-shot) | KITTI 2015 | EPE | Strong self-sup flow |

The common thread: across all three variants, prediction in feature space produces representations that transfer as frozen features. No fine-tuning of the backbone is required for competitive results — which is exactly the promise of a foundation model.

8 — Key Takeaways

Predict in Feature Space, Not Pixel Space

The unifying thesis of the JEPA family is that self-supervised learning should predict abstract features rather than raw pixels. Pixel-level reconstruction wastes model capacity on unpredictable high-frequency detail (noise, exact textures, lighting). Feature-space prediction lets the encoder discard that entropy and focus on the structured, predictable part of the signal — which turns out to be the semantic part.

No Augmentations Needed

Unlike SimCLR, MoCo, and DINO, I-JEPA and V-JEPA use no color-jitter, no random-resized-crops, and no horizontal flips. Prediction between different spatial regions of the same image is the augmentation. This removes a significant inductive-bias knob — one that had been quietly doing a lot of the work in invariance-based SSL — and makes results more portable across domains where the standard augmentations don't make sense (e.g., medical imaging, remote sensing).

EMA + Stop-Gradient Prevents Collapse

I-JEPA and V-JEPA train stably without centering, sharpening, negative pairs, or variance regularization. The combination of (a) a stop-gradient EMA target, (b) a position-conditioned predictor, and (c) diverse mask locations is enough. MC-JEPA demonstrates a complementary recipe — VICReg-style variance/covariance regularization — that also works, removing the need for an EMA copy.

The Same Recipe Crosses Modalities

The JEPA recipe — encode context, predict masked target features, EMA stop-gradient target — ports cleanly from 2D images (I-JEPA) to 3D video tubes (V-JEPA) with only surface-level changes to tokenization and masking. MC-JEPA shows it also composes with other self-supervised tasks (optical flow) under a shared encoder. This generality is what makes JEPA a candidate recipe rather than just a single model.

JEPA as a Path Toward World Models

LeCun has argued that JEPA-style predictive architectures are a necessary ingredient for AI systems that build world models — learning to predict future states of the world in an abstract feature space where planning is tractable. V-JEPA is the first step in that direction: a video model that predicts how the world looks next, in features rather than pixels. Whether that path delivers on the larger ambition remains open, but I-JEPA, V-JEPA, and MC-JEPA together establish that feature-space prediction is a robust and general recipe.

Context in the field: JEPA sits alongside DINO (self-distillation), MAE (pixel reconstruction), and data2vec (target from teacher at multiple layers) as one of a handful of dominant self-supervised recipes. Each has strengths; JEPA's are data efficiency, no hand-crafted augmentations, and clean generalization across modalities.

9 — References & Further Reading