FoundationStereo & Fast-FoundationStereo

Zero-Shot Stereo Matching — From Accurate to Real-Time
Tags: Stereo Matching · Depth Estimation · Zero-Shot · Foundation Model · NVIDIA · CVPR 2025 & 2026

1 — The Problem to Solve

Stereo matching is the task of estimating depth from two images taken simultaneously by a pair of cameras (like human eyes). For each pixel in the left image, find the corresponding pixel in the right image. The horizontal displacement between them — called disparity — is inversely proportional to depth: close objects have large disparity, far objects have small disparity.

[Figure: stereo geometry. An object at x_L in the left image appears at x_R in the right image; disparity d = x_L - x_R (near objects bright, far objects dark in the disparity map); depth = f·B / d converts the disparity map into a 3D point cloud for robotics, AR, and autonomous driving.]
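The depth-from-disparity relation can be checked numerically. In the sketch below the focal length and baseline are arbitrary illustration values, not calibration from any real rig:

```python
import numpy as np

# Hypothetical calibration values for illustration (not from the paper).
f = 700.0   # focal length in pixels
B = 0.12    # stereo baseline in meters

def disparity_to_depth(disparity, f, B, eps=1e-6):
    """depth = f * B / disparity; guard against zero disparity."""
    return f * B / np.maximum(disparity, eps)

d = np.array([70.0, 7.0, 0.7])       # large disparity -> near object
depth = disparity_to_depth(d, f, B)  # ≈ [1.2, 12.0, 120.0] meters
```

Note the inverse relationship: each 10× drop in disparity is a 10× increase in depth, which is why stereo depth precision degrades with distance.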

Previous stereo matching models worked well on specific benchmarks but failed on new domains — a model trained on indoor scenes would struggle with outdoor driving, and vice versa. FoundationStereo achieves zero-shot generalization: train once, deploy anywhere, without fine-tuning.

Why "foundation model"? Like GPT for language or SAM for segmentation, FoundationStereo achieves strong performance across diverse domains without task-specific fine-tuning. It does this through two pillars: massive diverse synthetic training data and injecting pre-trained monocular depth priors.

2 — Architecture Overview

FoundationStereo follows the RAFT-Stereo paradigm — extract features, build a cost volume, then iteratively refine disparity estimates — but adds three key innovations: a Side-Tuning Adapter (STA) for the encoder, a hybrid cost volume, and Attentive Hybrid Cost Filtering (AHCF).

[Figure: architecture pipeline. Left/right images → Feature Encoder (STA: multi-scale EdgeNeXt-S CNN backbone + frozen DepthAnythingV2 ViT, concatenated) → Hybrid Cost Volume (group-wise correlation, 8 groups, plus feature concat at shifted disparities; 4D: H × W × D × C) → Cost Filtering (AHCF: Axial-Planar 3D Conv + Disparity Transformer for spatial + disparity filtering) → Iterative Refinement (ConvGRU, coarse-to-fine, ×32 iterations) → Disparity map (H × W) → depth = f·B / disparity → 3D point cloud.]

3 — Layer-by-Layer Walkthrough

1 Feature Encoder: Side-Tuning Adapter (STA)

EdgeNeXt-S CNN + Frozen DepthAnythingV2 ViT

input [B, 3, H, W] → pyramid at 1/4, 1/8, 1/16, 1/32

Both left and right images pass through two parallel encoders (weights shared across left/right):

EdgeNeXt-S (trainable CNN) produces a feature pyramid at 1/4, 1/8, 1/16, and 1/32 resolutions. Concretely for a 540×960 input: [B, 48, 135, 240], [B, 96, 68, 120], [B, 160, 34, 60], [B, 304, 17, 30].

DepthAnythingV2 (frozen ViT-Large) outputs a single-scale patch grid. For 540×960 it produces [B, 1024, 38, 68] tokens (patch 14) which are bilinearly resized and projected to each pyramid scale, then concatenated with the CNN features along the channel axis. Resulting fused pyramid channels are 48+128, 96+128, 160+128, 304+128.
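The resize-project-concatenate step can be sketched in numpy. This is a minimal illustration: nearest-neighbor resizing stands in for bilinear, and random weights stand in for the learned 1×1 projection; none of the helper names come from the paper.

```python
import numpy as np

def resize_nearest(x, h, w):
    """Nearest-neighbor resize of [B, C, H, W] features (bilinear in the real model)."""
    B, C, H, W = x.shape
    rows = np.arange(h) * H // h
    cols = np.arange(w) * W // w
    return x[:, :, rows][:, :, :, cols]

B = 1
vit = np.random.randn(B, 1024, 38, 68)   # frozen DepthAnythingV2 patch grid
cnn_pyramid = [np.random.randn(B, c, h, w) for c, h, w in
               [(48, 135, 240), (96, 68, 120), (160, 34, 60), (304, 17, 30)]]

fused = []
for feat in cnn_pyramid:
    _, c, h, w = feat.shape
    v = resize_nearest(vit, h, w)
    # 1x1 projection 1024 -> 128 channels (random weights as a stand-in)
    W_proj = np.random.randn(128, 1024) * 0.01
    v = np.einsum('oc,bchw->bohw', W_proj, v)
    fused.append(np.concatenate([feat, v], axis=1))

[x.shape[1] for x in fused]   # channel counts 176, 224, 288, 432
```

The resulting channel counts match the 48+128, 96+128, 160+128, 304+128 pyramid described above.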

[Figure: side-tuning, best of both worlds. The trainable EdgeNeXt-S CNN and the frozen DepthAnythingV2 ViT run in parallel; their features are concatenated into rich stereo + monocular features. Why side-tuning? The CNN learns domain-specific stereo matching from synthetic data, the ViT provides real-world semantic priors from internet-scale training, and together they bridge the sim-to-real gap.]
Why freeze the ViT? DepthAnythingV2 learned rich depth priors from millions of real images. Fine-tuning it on synthetic stereo data would destroy this knowledge (catastrophic forgetting). By keeping it frozen, FoundationStereo preserves real-world understanding while the CNN learns stereo-specific patterns.

2 Hybrid Cost Volume

Correlation + Feature Concatenation

fused features → cost volume [B, C_cv, D, H/4, W/4]

The cost volume measures how well left and right image features match at every possible disparity. Built at 1/4 resolution with D = 192/4 = 48 candidate disparities. FoundationStereo uses a hybrid cost volume combining two signals along the channel dimension:

  • Group-wise correlation — split the 176-channel features into 8 groups, compute dot product per group across all disparity shifts. Output shape: [B, 8, 48, H/4, W/4].
  • Feature concatenation — concatenate left and right features at each disparity shift. Output shape: [B, 2·176, 48, H/4, W/4].

Final hybrid cost volume: [B, 8 + 352, 48, H/4, W/4], i.e. [B, 360, 48, 135, 240] for a 540×960 input. The concatenation branch preserves DepthAnythingV2's monocular priors; the correlation branch carries direct matching similarity.
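Both branches can be sketched in numpy. Shapes are scaled down here, and the shift and grouping logic is a simplified stand-in for the actual implementation:

```python
import numpy as np

def hybrid_cost_volume(fL, fR, D, groups=8):
    """fL, fR: [B, C, H, W] fused features at 1/4 resolution.
    Returns [B, groups + 2*C, D, H, W]."""
    Bn, C, H, W = fL.shape
    cost = np.zeros((Bn, groups + 2 * C, D, H, W), dtype=fL.dtype)
    gL = fL.reshape(Bn, groups, C // groups, H, W)
    for d in range(D):
        fRs = np.zeros_like(fR)
        if d == 0:
            fRs = fR
        else:
            fRs[..., d:] = fR[..., :-d]       # shift right features by disparity d
        gR = fRs.reshape(Bn, groups, C // groups, H, W)
        corr = (gL * gR).mean(axis=2)         # group-wise correlation: [B, 8, H, W]
        cost[:, :groups, d] = corr
        cost[:, groups:, d] = np.concatenate([fL, fRs], axis=1)  # concat branch
    return cost

fL = np.random.randn(1, 176, 12, 16).astype(np.float32)
fR = np.random.randn(1, 176, 12, 16).astype(np.float32)
cv = hybrid_cost_volume(fL, fR, D=8)   # shape (1, 360, 8, 12, 16)
```

With 176-channel features, the channel axis is 8 correlation groups plus 2·176 concatenated channels, matching the 360 channels above.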

3 Attentive Hybrid Cost Filtering (AHCF)

APC + Disparity Transformer

cost volume in/out: [B, C, D, H/4, W/4]

Raw cost volumes are noisy — reflections, textureless surfaces, and repetitive patterns create matching ambiguities. AHCF filters the cost volume with two complementary mechanisms:

Axial-Planar Convolution (APC) decomposes a full 3D convolution with a (Ks, Ks, Kd) kernel into a spatial conv (Ks, Ks, 1) followed by a disparity conv (1, 1, Kd). The parameter count drops from Ks²·Kd·C² to (Ks² + Kd)·C²: at Ks = Kd = 5, that is 125·C² versus 30·C², roughly a 4× reduction.
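The parameter arithmetic is easy to verify; `full3d_params` and `apc_params` are helper names introduced here, not from the paper:

```python
# Parameter counts for a C -> C 3D convolution vs. its axial-planar decomposition.
def full3d_params(ks, kd, c):
    return ks * ks * kd * c * c        # single (Ks, Ks, Kd) kernel

def apc_params(ks, kd, c):
    return (ks * ks + kd) * c * c      # (Ks, Ks, 1) conv then (1, 1, Kd) conv

ks, kd, c = 5, 5, 128
ratio = full3d_params(ks, kd, c) / apc_params(ks, kd, c)   # 125/30 ≈ 4.17
```

The savings grow with kernel size: the full kernel scales multiplicatively in Ks²·Kd while the decomposition scales additively.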

Disparity Transformer (DT) reshapes the cost volume to [B·H/4·W/4, D, C] and applies self-attention across the D=48 disparity tokens. When a textureless wall creates ambiguity across many disparities, global attention lets the model reason about the overall disparity distribution instead of just local maxima. Attention is O(D²) = 2304 per spatial location — tractable because D is small.
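A single-head, numpy-only sketch of attention over disparity tokens follows; the real Disparity Transformer uses multi-head attention with positional encodings and learned projections, so treat the details as illustrative:

```python
import numpy as np

def disparity_self_attention(cost, Wq, Wk, Wv):
    """cost: [N, D, C], where N = B * H/4 * W/4 spatial locations.
    Single-head self-attention across the D disparity tokens."""
    q, k, v = cost @ Wq, cost @ Wk, cost @ Wv                  # each [N, D, C]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])   # [N, D, D]
    scores -= scores.max(axis=-1, keepdims=True)               # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)                   # softmax over disparities
    return attn @ v                                            # [N, D, C]

N, D, C = 4, 48, 32
cost = np.random.randn(N, D, C)
Wq, Wk, Wv = (np.random.randn(C, C) * 0.1 for _ in range(3))
out = disparity_self_attention(cost, Wq, Wk, Wv)   # (4, 48, 32)
```

The attention matrix is only D × D = 48 × 48 per location, which is why full attention over disparities stays cheap.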

4 Iterative Refinement (ConvGRU)

Coarse-to-Fine GRU Updates

disparity: [B, 1, H/4, W/4] → [B, 1, H, W]

Starting from an initial disparity estimate d0 ∈ [B, 1, H/4, W/4], a ConvGRU iteratively refines the prediction. At each iteration, a lookup kernel samples the filtered cost volume at the current disparity's neighborhood, producing a local cost vector [B, C_lookup, H/4, W/4]; the GRU cell combines this with its hidden state ht ∈ [B, C_h, H/4, W/4] to produce a residual update Δdt.

22 iterations during training, 32 at inference. The final disparity is upsampled 4× via learned convex combination to full resolution [B, 1, H, W].
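A toy version of the refinement loop is sketched below: a random 1×1 linear update stands in for the real ConvGRU cell, and nearest-neighbor cost lookup stands in for linear interpolation, so only the loop structure is faithful:

```python
import numpy as np

def lookup(cost, disp, radius=2):
    """Sample the cost volume around the current disparity estimate."""
    B, D, H, W = cost.shape
    idx = np.clip(np.round(disp).astype(int), 0, D - 1)   # [B, H, W]
    samples = []
    for r in range(-radius, radius + 1):
        j = np.clip(idx + r, 0, D - 1)
        samples.append(np.take_along_axis(cost, j[:, None], axis=1))
    return np.concatenate(samples, axis=1)                # [B, 2*radius+1, H, W]

B, D, H, W = 1, 48, 8, 8
cost = np.random.randn(B, D, H, W)
disp = np.zeros((B, H, W))             # initial disparity estimate
Wu = np.random.randn(1, 5) * 0.1       # random stand-in for the ConvGRU update
for _ in range(32):                    # 32 iterations at inference
    local = lookup(cost, disp)                            # local cost vector
    delta = np.einsum('oc,bchw->bohw', Wu, local)[:, 0]   # residual update
    disp = np.clip(disp + delta, 0, D - 1)
```

The real model additionally maintains a GRU hidden state and finishes with a learned convex-combination upsampler to reach full resolution.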

4 — Training Data Strategy

FoundationStereo Dataset (FSD): 1 Million Synthetic Pairs

The key to zero-shot generalization is diverse, high-quality training data. NVIDIA generated 1 million stereo pairs using Omniverse with RTX path tracing (32-128 samples per pixel for photorealism). The dataset covers:

  • Structured indoor and outdoor scenes (rooms, streets, warehouses)
  • Navigation, driving, and manipulation scenarios
  • Randomized "flying object" scenes for extreme variety
  • Varying camera intrinsics and baselines
  • Challenging conditions: reflections, transparency, low texture, heavy occlusion
Self-curation: An automatic pipeline trains a model, runs inference on its own training set, and removes samples where the bad-pixel rate exceeds 60% — filtering out ambiguous or mislabeled examples that would confuse training.
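The curation rule can be sketched directly. The 1 px error threshold inside `bad_pixel_rate` is an assumption for illustration; only the 60% rejection cutoff comes from the description above:

```python
import numpy as np

def bad_pixel_rate(pred, gt, thresh=1.0):
    """Fraction of pixels whose disparity error exceeds thresh."""
    return float(np.mean(np.abs(pred - gt) > thresh))

def curate(samples, max_bp=0.6):
    """Keep samples whose bad-pixel rate is at most 60%."""
    return [s for s in samples if bad_pixel_rate(s['pred'], s['gt']) <= max_bp]

gt = np.full((4, 4), 10.0)
good = {'pred': gt + 0.5, 'gt': gt}   # all pixels within threshold
bad = {'pred': gt + 5.0, 'gt': gt}    # all pixels outside threshold
kept = curate([good, bad])            # only the good sample survives
```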

5 — Results

Zero-shot performance (no fine-tuning on target domains):

Benchmark    Metric   FoundationStereo         Domain
Middlebury   BP-2     1.2%                     Indoor close-range
ETH3D        BP-1     1.4%                     Indoor/outdoor
KITTI-12     D1       1.9%                     Driving
KITTI-15     D1       2.2%                     Driving
SceneFlow    EPE      0.33 (prev. best 0.41)   Synthetic
1st place on both Middlebury and ETH3D leaderboards — as a zero-shot model competing against methods that were fine-tuned specifically for each benchmark. This demonstrates true foundation model generalization.

6 — Fast-FoundationStereo (CVPR 2026)

Wen et al. — NVIDIA — GitHub — Making FoundationStereo Real-Time

FoundationStereo achieved state-of-the-art accuracy but is too slow for robotics and real-time applications. Fast-FoundationStereo achieves a ~10× speedup while closely matching accuracy through three complementary techniques: knowledge distillation, Neural Architecture Search (NAS), and structured pruning.

The core challenge: FoundationStereo's accuracy comes from expensive components — a large ViT encoder (DepthAnythingV2), a heavy hybrid cost volume, and 32 GRU iterations. Fast-FoundationStereo systematically compresses each component while preserving the knowledge that makes the model generalize.

1 Knowledge Distillation: Teacher-Student Training

FoundationStereo as Teacher

Large model (teacher) → trains → small model (student)

The full FoundationStereo model serves as a teacher. A smaller, faster student network is trained to match the teacher's outputs — not just the final disparity map, but also intermediate features and cost volume representations. This transfers the teacher's generalization ability to the student without requiring the expensive ViT encoder at inference.
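A sketch of a multi-term distillation objective follows; the loss weights and the use of plain L1 on each term are illustrative assumptions, not the paper's exact recipe:

```python
import numpy as np

def distillation_loss(student, teacher, w_disp=1.0, w_feat=0.5, w_cost=0.5):
    """Weighted L1 losses on disparity, intermediate features, and cost volume."""
    l_disp = np.abs(student['disp'] - teacher['disp']).mean()
    l_feat = np.abs(student['feat'] - teacher['feat']).mean()
    l_cost = np.abs(student['cost'] - teacher['cost']).mean()
    return w_disp * l_disp + w_feat * l_feat + w_cost * l_cost

t = {'disp': np.random.randn(1, 1, 8, 8),
     'feat': np.random.randn(1, 64, 8, 8),
     'cost': np.random.randn(1, 16, 8, 8, 8)}
s = {k: v + 0.1 for k, v in t.items()}   # student off by 0.1 everywhere
loss = distillation_loss(s, t)           # 1.0*0.1 + 0.5*0.1 + 0.5*0.1 ≈ 0.2
```

Supervising intermediate features and the cost volume, not just the final disparity, is what lets the student absorb the teacher's representations rather than only its answers.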

[Figure: knowledge distillation. Teacher: FoundationStereo (EdgeNeXt-S + DepthAnythingV2, full cost volume, 32 GRU iterations) distills into student: Fast-FoundationStereo (lightweight backbone, fewer iterations) via feature-level, cost-volume, and output-disparity distillation. The student retains ~98% of the teacher's accuracy.]

2 Neural Architecture Search (NAS)

Automated Model Compression

Search over backbone, cost volume, and GRU configurations

Rather than manually designing the student architecture, NAS automatically discovers a configuration that balances speed and accuracy. The search space includes backbone width/depth, cost volume resolution, number of GRU iterations, and channel dimensions throughout the network. This surfaces architectures a human designer would miss, sometimes counterintuitive ones: a wider-but-shallower network can outperform a narrow-but-deep one at the same latency.
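The search loop can be illustrated with toy random search. The search space, latency model, and accuracy proxy below are entirely made up; a real NAS pipeline profiles candidates on hardware and validates them on held-out data:

```python
import random

# Hypothetical search space over student hyperparameters.
SPACE = {'width': [32, 48, 64], 'cost_channels': [8, 16, 32], 'gru_iters': [8, 12, 16]}

def latency_ms(cfg):
    """Fabricated linear latency model."""
    return 0.2 * cfg['width'] + 0.3 * cfg['cost_channels'] + 0.8 * cfg['gru_iters']

def accuracy_proxy(cfg):
    """Fabricated stand-in for validation accuracy."""
    return 0.1 * cfg['width'] + 0.2 * cfg['cost_channels'] + 0.3 * cfg['gru_iters']

def search(budget_ms=30.0, trials=200, seed=0):
    rng = random.Random(seed)
    best, best_acc = None, -1.0
    for _ in range(trials):
        cfg = {k: rng.choice(v) for k, v in SPACE.items()}
        if latency_ms(cfg) <= budget_ms and accuracy_proxy(cfg) > best_acc:
            best, best_acc = cfg, accuracy_proxy(cfg)
    return best

best = search()   # most accurate sampled config under the 30 ms budget
```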

3 Structured Pruning

Remove Redundant Channels

Identify and prune low-importance filters

Structured pruning removes entire channels (filters) from convolutional and linear layers based on importance scores. Unlike unstructured pruning (which creates sparse weights that don't speed up on GPUs), structured pruning gives real speedups because it reduces matrix dimensions. The pruned model is then fine-tuned to recover accuracy.
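A minimal L1-norm filter-pruning sketch in numpy; a real pipeline would also rewire the input channels of the consuming layer and fine-tune afterward:

```python
import numpy as np

def prune_filters(W, keep_ratio=0.5):
    """W: [C_out, C_in, k, k] conv weights. Drop the output channels with the
    smallest L1 norms; returns pruned weights and the kept channel indices."""
    norms = np.abs(W).reshape(W.shape[0], -1).sum(axis=1)
    n_keep = max(1, int(W.shape[0] * keep_ratio))
    keep = np.sort(np.argsort(norms)[-n_keep:])   # indices of largest-norm filters
    return W[keep], keep

W = np.random.randn(64, 32, 3, 3)
W_pruned, keep = prune_filters(W, keep_ratio=0.25)   # shape (16, 32, 3, 3)
```

Because whole output channels disappear, the pruned layer is a genuinely smaller dense matrix multiply, which is what delivers real GPU speedups.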

FoundationStereo vs. Fast-FoundationStereo

[Figure: accuracy vs. latency trade-off. FoundationStereo sits at the best accuracy but highest latency; Fast-FoundationStereo retains ~98% of its accuracy at ~10× lower latency, ahead of RAFT-Stereo, CREStereo, and AANet.]
Feature           FoundationStereo (CVPR 2025)              Fast-FoundationStereo (CVPR 2026)
Encoder           EdgeNeXt-S + frozen DepthAnythingV2 ViT   Compressed encoder (NAS-optimized, no ViT at inference)
Cost volume       Full hybrid (correlation + concat)        Lightweight cost volume (distilled)
Cost filtering    APC + Disparity Transformer               Pruned filtering network
GRU iterations    32                                        Fewer iterations (NAS-discovered)
Key techniques    Side-tuning, AHCF, synthetic data         Knowledge distillation, NAS, structured pruning
Speed             Baseline                                  ~10× faster
Accuracy          SOTA (1st on Middlebury/ETH3D)            ~98% of FoundationStereo accuracy
Target use case   Offline processing, benchmarks            Real-time robotics, AR, autonomous driving
Pattern recognition: The FoundationStereo → Fast-FoundationStereo progression mirrors a common pattern in deep learning: first build the most accurate model possible (regardless of cost), then compress it for deployment. The same pattern appears in LLMs (GPT-4 → distilled models), detection (DETR → RT-DETR), and segmentation (SAM → EfficientSAM).

7 — References & Further Reading