FoundationStereo & Fast-FoundationStereo

Zero-Shot Stereo Matching — From Accurate to Real-Time
Tags: Stereo Matching · Depth Estimation · Zero-Shot · Foundation Model · NVIDIA · CVPR 2025 & 2026

1 — The Problem to Solve

Stereo matching is the task of estimating depth from two images taken simultaneously by a pair of cameras (like human eyes). For each pixel in the left image, find the corresponding pixel in the right image. The horizontal displacement between them — called disparity — is inversely proportional to depth: close objects have large disparity, far objects have small disparity.
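
The depth = f·B / d relation can be sketched in a few lines. The focal length and baseline values here are illustrative, not from any particular camera rig:

```python
# Disparity-to-depth conversion: depth = f * B / d.
# f = focal length in pixels, B = stereo baseline in meters,
# d = disparity in pixels. Values below are illustrative.

def disparity_to_depth(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Larger disparity -> smaller depth (closer object)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# A near object (large disparity) versus a far one (small disparity):
near = disparity_to_depth(64.0, focal_px=720.0, baseline_m=0.12)  # ≈ 1.35 m
far = disparity_to_depth(4.0, focal_px=720.0, baseline_m=0.12)    # ≈ 21.6 m
```

Note the inverse relationship: halving the disparity doubles the depth, which is why stereo depth precision degrades quadratically with distance.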

[Figure: An object appears at position x_L in the left image and x_R in the right image; disparity = x_L − x_R produces a disparity map (near = bright, far = dark), converted via depth = f·B / d into a 3D point cloud for robotics, AR, and autonomous driving.]

Previous stereo matching models worked well on specific benchmarks but failed on new domains — a model trained on indoor scenes would struggle with outdoor driving, and vice versa. FoundationStereo achieves zero-shot generalization: train once, deploy anywhere, without fine-tuning.

Why "foundation model"? Like GPT for language or SAM for segmentation, FoundationStereo achieves strong performance across diverse domains without task-specific fine-tuning. It does this through two pillars: massive diverse synthetic training data and injecting pre-trained monocular depth priors.

2 — Architecture Overview

FoundationStereo follows the RAFT-Stereo paradigm — extract features, build a cost volume, then iteratively refine disparity estimates — but adds three key innovations: a Side-Tuning Adapter (STA) for the encoder, a hybrid cost volume, and Attentive Hybrid Cost Filtering (AHCF).

[Figure: Pipeline — left/right images → feature encoder (STA: EdgeNeXt-S CNN backbone + frozen DepthAnything V2 ViT, concatenated at multiple scales) → hybrid cost volume (group-wise correlation, 8 groups, plus feature concatenation at shifted disparities; 4D: H × W × D × C) → cost filtering (AHCF: Axial-Planar 3D Convolution + Disparity Transformer for spatial and disparity filtering) → iterative refinement (ConvGRU, coarse-to-fine, ×32 iterations) → disparity map (H × W) → depth = f·B / disparity → 3D point cloud.]

3 — Layer-by-Layer Walkthrough

1 Feature Encoder: Side-Tuning Adapter (STA)

EdgeNeXt-S CNN + Frozen DepthAnythingV2 ViT

Stereo pair → multi-scale feature pyramids

Both left and right images pass through two parallel encoders:

EdgeNeXt-S — a lightweight CNN that produces multi-level feature pyramids at 1/4, 1/8, 1/16, and 1/32 scales. This CNN is trained end-to-end and learns stereo-specific features.

DepthAnythingV2 — a frozen (non-trainable) ViT foundation model pre-trained on massive real-world monocular depth data. Its features are downscaled and concatenated with the CNN features at corresponding scales.

[Figure: Side-tuning, best of both worlds — the trainable EdgeNeXt-S CNN learns domain-specific stereo cues from synthetic data, while the frozen DepthAnythingV2 ViT contributes real-world semantic priors from internet-scale training; concatenating the two bridges the sim-to-real gap.]
Why freeze the ViT? DepthAnythingV2 learned rich depth priors from millions of real images. Fine-tuning it on synthetic stereo data would destroy this knowledge (catastrophic forgetting). By keeping it frozen, FoundationStereo preserves real-world understanding while the CNN learns stereo-specific patterns.
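
The fusion step can be sketched with numpy: resize the frozen ViT features to the CNN's spatial scale and concatenate along channels. The shapes, the 18×18 token grid, and the nearest-neighbor resize are all illustrative stand-ins, not the paper's exact operations:

```python
import numpy as np

# Side-Tuning Adapter fusion, a toy sketch: trainable CNN features
# concatenated with frozen ViT features at the same spatial scale.

def resize_nearest(feat, out_h, out_w):
    """Nearest-neighbor resize of a (C, H, W) feature map."""
    c, h, w = feat.shape
    ys = np.arange(out_h) * h // out_h
    xs = np.arange(out_w) * w // out_w
    return feat[:, ys][:, :, xs]

cnn_feat = np.random.rand(64, 32, 32)   # EdgeNeXt-S features at 1/4 scale (toy shape)
vit_feat = np.random.rand(128, 18, 18)  # frozen DepthAnythingV2 token grid (toy shape)

vit_resized = resize_nearest(vit_feat, 32, 32)
fused = np.concatenate([cnn_feat, vit_resized], axis=0)  # (192, 32, 32)
```

In training, gradients would flow only into the CNN branch; the ViT weights stay fixed.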

2 Hybrid Cost Volume

Correlation + Feature Concatenation

Left/right features → 4D cost volume (H × W × D × C)

The cost volume measures how well left and right image features match at every possible disparity. FoundationStereo uses a hybrid approach combining two signals:

  • Group-wise correlation (8 groups) — dot product between left and shifted-right features, split across channel groups. This captures matching similarity.
  • Feature concatenation — raw left and right features concatenated at each disparity. This preserves the monocular depth priors from DepthAnythingV2 that would be lost in a pure correlation volume.
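
The two volumes can be sketched in numpy. Shapes, the group count, and the disparity range here are illustrative; the real model builds these from the STA feature maps:

```python
import numpy as np

# Hybrid cost volume, a minimal sketch.
# left, right: (C, H, W) feature maps; D candidate disparities; G groups.

def build_hybrid_volume(left, right, D=8, G=4):
    C, H, W = left.shape
    corr = np.zeros((G, D, H, W))        # group-wise correlation volume
    concat = np.zeros((2 * C, D, H, W))  # feature concatenation volume
    Lg = left.reshape(G, C // G, H, W)
    for d in range(D):
        # Left pixel x matches right pixel x - d: shift right features by d.
        shifted = np.zeros_like(right)
        if d == 0:
            shifted[:] = right
        else:
            shifted[:, :, d:] = right[:, :, :-d]
        Rg = shifted.reshape(G, C // G, H, W)
        corr[:, d] = (Lg * Rg).mean(axis=1)                      # similarity per group
        concat[:, d] = np.concatenate([left, shifted], axis=0)   # raw features preserved
    return corr, concat

left = np.random.rand(16, 8, 12)
right = np.random.rand(16, 8, 12)
corr, concat = build_hybrid_volume(left, right)  # (4, 8, 8, 12) and (32, 8, 8, 12)
```

The correlation branch is compact but lossy (only similarity survives); the concatenation branch is heavier but keeps the monocular priors intact, which is why the model uses both.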

3 Attentive Hybrid Cost Filtering (AHCF)

APC + Disparity Transformer

4D cost volume → filtered cost volume

Raw cost volumes are noisy — reflections, textureless surfaces, and repetitive patterns create matching ambiguities. AHCF filters the cost volume with two complementary mechanisms:

Axial-Planar Convolution (APC) decomposes a full 3D convolution into a spatial conv (Ks×Ks×1) and a disparity conv (1×1×Kd). This reduces memory dramatically while capturing local spatial and disparity patterns.

Disparity Transformer (DT) applies self-attention across the entire disparity dimension, providing global reasoning. When a textureless wall creates ambiguity at many disparities, the transformer can reason about the overall disparity distribution to pick the correct match.
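
The savings from the APC decomposition are easy to quantify. A back-of-envelope sketch with illustrative kernel sizes (the decomposition is what makes long disparity kernels affordable):

```python
# Axial-Planar Convolution (APC) cost, a back-of-envelope sketch.
# A full 3D conv uses a Ks x Ks x Kd kernel; APC splits it into a
# spatial conv (Ks x Ks x 1) plus a disparity conv (1 x 1 x Kd).

def kernel_taps(ks: int, kd: int) -> dict:
    full = ks * ks * kd            # multiply-adds per voxel, full 3D conv
    apc = ks * ks * 1 + 1 * 1 * kd # multiply-adds per voxel, APC pair
    return {"full3d": full, "apc": apc, "ratio": full / apc}

# Illustrative sizes: a 3x3 spatial kernel with a long disparity kernel.
stats = kernel_taps(ks=3, kd=17)
# full3d: 153 taps vs apc: 26 taps -> roughly 6x fewer operations per voxel
```

The gap widens as the disparity kernel grows, since the full 3D cost scales multiplicatively with Kd while APC scales additively.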

4 Iterative Refinement (ConvGRU)

Coarse-to-Fine GRU Updates

Initial estimate → 32 iterations → final disparity

Starting from a coarse disparity estimate, a ConvGRU (Convolutional Gated Recurrent Unit) iteratively refines the prediction. Each iteration looks up the cost volume at the current disparity estimate, computes a correction, and updates the estimate: 22 iterations during training, 32 at inference. The coarse-to-fine structure starts at low resolution and progressively recovers fine detail.
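
The loop structure can be sketched schematically. In the real model the correction is predicted by a learned ConvGRU from looked-up cost features; here a crude local-argmax step stands in for it, purely to show the lookup-correct-update cycle:

```python
import numpy as np

# Iterative refinement loop, a schematic sketch. The local-argmax
# "correction" below is a stand-in for the learned ConvGRU update.

def refine(cost_volume, disparity, radius=2, iters=32):
    """cost_volume: (D, H, W) matching scores (higher = better).
    disparity: (H, W) initial estimate, refined in place over iters."""
    D, H, W = cost_volume.shape
    h = np.arange(H)[:, None]
    w = np.arange(W)[None, :]
    offsets = np.arange(-radius, radius + 1)
    for _ in range(iters):
        # Look up costs in a small window around the current estimate.
        idx = np.clip(np.round(disparity).astype(int)[None]
                      + offsets[:, None, None], 0, D - 1)
        local = cost_volume[idx, h[None], w[None]]        # (2r+1, H, W)
        # Step toward the locally best disparity (stand-in correction).
        best = offsets[np.argmax(local, axis=0)]
        disparity = np.clip(disparity + 0.5 * best, 0, D - 1)
    return disparity

# Toy cost volume whose score peaks at the true disparity d* = 10.
D, H, W = 16, 4, 5
d_axis = np.arange(D)[:, None, None]
cost_volume = -(d_axis - 10.0) ** 2 * np.ones((1, H, W))
disp = refine(cost_volume, np.zeros((H, W)))
```

Even this crude stand-in converges near the true disparity, illustrating why iterating cheap local corrections can replace one expensive global optimization.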

4 — Training Data Strategy

FoundationStereo Dataset (FSD): 1 Million Synthetic Pairs

The key to zero-shot generalization is diverse, high-quality training data. NVIDIA generated 1 million stereo pairs using Omniverse with RTX path tracing (32-128 samples per pixel for photorealism). The dataset covers:

  • Structured indoor and outdoor scenes (rooms, streets, warehouses)
  • Navigation, driving, and manipulation scenarios
  • Randomized "flying object" scenes for extreme variety
  • Varying camera intrinsics and baselines
  • Challenging conditions: reflections, transparency, low texture, heavy occlusion
Self-curation: An automatic pipeline trains a model, runs inference on its own training set, and removes samples where the bad-pixel rate exceeds 60% — filtering out ambiguous or mislabeled examples that would confuse training.
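
The curation rule is a simple threshold on the bad-pixel rate. A minimal sketch, where `predict` is a stand-in for running the current model on a training pair:

```python
import numpy as np

# Self-curation sketch: drop samples whose bad-pixel rate exceeds a
# threshold. The 2 px error tolerance is an illustrative choice.

def bad_pixel_rate(pred, gt, thresh_px=2.0):
    """Fraction of pixels whose disparity error exceeds thresh_px."""
    return float(np.mean(np.abs(pred - gt) > thresh_px))

def curate(samples, predict, max_bp=0.60):
    """Keep only (left, right, gt) samples the current model can mostly explain."""
    kept = []
    for left, right, gt in samples:
        pred = predict(left, right)
        if bad_pixel_rate(pred, gt) <= max_bp:
            kept.append((left, right, gt))
    return kept

# Tiny demo with a stand-in predictor that just echoes its first input:
gt = np.zeros((4, 4))
samples = [(gt, None, gt), (gt + 10.0, None, gt)]
kept = curate(samples, predict=lambda left, right: left)
# The second sample has a 100% bad-pixel rate and is dropped.
```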

5 — Results

Zero-shot performance (no fine-tuning on target domains):

Benchmark    Metric   FoundationStereo    Domain
Middlebury   BP-2     1.2%                Indoor close-range
ETH3D        BP-1     1.4%                Indoor/outdoor
KITTI-12     D1       1.9%                Driving
KITTI-15     D1       2.2%                Driving
SceneFlow    EPE      0.33 (prev. 0.41)   Synthetic
1st place on both Middlebury and ETH3D leaderboards — as a zero-shot model competing against methods that were fine-tuned specifically for each benchmark. This demonstrates true foundation model generalization.

6 — Fast-FoundationStereo (CVPR 2026)

Wen et al. (NVIDIA) — Making FoundationStereo Real-Time

FoundationStereo achieved state-of-the-art accuracy but is too slow for robotics and real-time applications. Fast-FoundationStereo achieves a ~10× speedup while closely matching accuracy through three complementary techniques: knowledge distillation, Neural Architecture Search (NAS), and structured pruning.

The core challenge: FoundationStereo's accuracy comes from expensive components — a large ViT encoder (DepthAnythingV2), a heavy hybrid cost volume, and 32 GRU iterations. Fast-FoundationStereo systematically compresses each component while preserving the knowledge that makes the model generalize.

1 Knowledge Distillation: Teacher-Student Training

FoundationStereo as Teacher

Large model (teacher) → trains → small model (student)

The full FoundationStereo model serves as a teacher. A smaller, faster student network is trained to match the teacher's outputs — not just the final disparity map, but also intermediate features and cost volume representations. This transfers the teacher's generalization ability to the student without requiring the expensive ViT encoder at inference.
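
The multi-level objective can be sketched as a weighted sum of per-level mismatch terms. The weights and the choice of L2 for features/cost versus L1 for disparity are illustrative, not the paper's exact loss:

```python
import numpy as np

# Multi-level distillation objective, a schematic sketch.
# student/teacher: dicts holding intermediate features, the cost
# volume, and the output disparity map.

def distill_loss(student, teacher, w_feat=1.0, w_cost=1.0, w_disp=1.0):
    feat = np.mean((student["feat"] - teacher["feat"]) ** 2)   # feature-level
    cost = np.mean((student["cost"] - teacher["cost"]) ** 2)   # cost-volume
    disp = np.mean(np.abs(student["disp"] - teacher["disp"]))  # output disparity
    return w_feat * feat + w_cost * cost + w_disp * disp

# Toy check: a perfect student pays zero loss; a student whose
# disparity is off by 1 px everywhere pays exactly the L1 term.
teacher = {"feat": np.ones((8, 4, 4)),
           "cost": np.ones((8, 2, 4, 4)),
           "disp": np.full((16, 16), 5.0)}
off = dict(teacher, disp=teacher["disp"] + 1.0)
```

Supervising intermediate representations, not just the output, is what lets the student mimic *how* the teacher matches, rather than only *what* it predicts.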

[Figure: Knowledge distillation — the teacher (FoundationStereo: EdgeNeXt-S + DepthAnythingV2, full cost volume, 32 GRU iterations) supervises the student (lightweight backbone, fewer iterations) through feature-level, cost-volume, and output-disparity distillation; the student retains ~98% of the teacher's accuracy.]

2 Neural Architecture Search (NAS)

Automated Model Compression

Search over backbone, cost volume, and GRU configurations

Rather than manually designing the student architecture, NAS automatically discovers the optimal configuration that balances speed and accuracy. The search space includes: backbone width/depth, cost volume resolution, number of GRU iterations, and channel dimensions throughout the network. This finds architectures that a human designer would miss — sometimes counterintuitive choices like wider-but-shallower networks outperform narrow-but-deep ones at the same latency.
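
A latency-constrained search can be sketched with random sampling. Everything here is a toy stand-in: the search-space fields, and the latency and accuracy proxies, are invented for illustration (real NAS would measure latency on target hardware and train or estimate accuracy per candidate):

```python
import random

# Latency-constrained architecture search, a toy random-search sketch.
random.seed(0)

SPACE = {
    "width": [32, 48, 64, 96],       # backbone channel width
    "depth": [2, 3, 4],              # backbone depth
    "gru_iters": [4, 8, 12, 16],     # refinement iterations
}

def latency_ms(cfg):  # toy proxy: cost grows with every knob
    return 0.05 * cfg["width"] * cfg["depth"] + 0.4 * cfg["gru_iters"]

def accuracy(cfg):    # toy proxy: bigger is better, saturating
    return 1.0 - 1.0 / (cfg["width"] * cfg["depth"] + 4 * cfg["gru_iters"])

def search(budget_ms=20.0, trials=200):
    """Keep the most accurate sampled config that fits the budget."""
    best, best_acc = None, -1.0
    for _ in range(trials):
        cfg = {k: random.choice(v) for k, v in SPACE.items()}
        if latency_ms(cfg) <= budget_ms and accuracy(cfg) > best_acc:
            best, best_acc = cfg, accuracy(cfg)
    return best

best_cfg = search()
```

Even this naive search exhibits the trade-off described above: under a tight budget, it may spend latency on width rather than depth or iterations, the kind of counterintuitive allocation a hand-designed student would miss.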

3 Structured Pruning

Remove Redundant Channels

Identify and prune low-importance filters

Structured pruning removes entire channels (filters) from convolutional and linear layers based on importance scores. Unlike unstructured pruning (which creates sparse weights that don't speed up on GPUs), structured pruning gives real speedups because it reduces matrix dimensions. The pruned model is then fine-tuned to recover accuracy.
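
The core operation can be sketched directly: rank a conv layer's output filters by an importance score (L1 norm is a common choice) and keep only the strongest. Shapes and the keep ratio are illustrative:

```python
import numpy as np

# Structured channel pruning sketch: drop whole output filters by
# L1-norm importance, shrinking the actual weight tensor.

def prune_channels(weight, keep_ratio=0.5):
    """weight: (out_ch, in_ch, kh, kw). Returns (pruned weight, kept indices)."""
    out_ch = weight.shape[0]
    importance = np.abs(weight).reshape(out_ch, -1).sum(axis=1)  # L1 per filter
    n_keep = max(1, int(out_ch * keep_ratio))
    kept = np.sort(np.argsort(importance)[::-1][:n_keep])  # top-n, original order
    return weight[kept], kept

w = np.random.randn(64, 32, 3, 3)
pruned, kept = prune_channels(w, keep_ratio=0.5)  # (32, 32, 3, 3)
```

Because the weight tensor genuinely shrinks, the next layer's input dimension shrinks too; in a full pipeline the consumer layer's weights must be sliced with the same `kept` indices before fine-tuning.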

FoundationStereo vs. Fast-FoundationStereo

[Figure: Accuracy vs. latency, the compression trade-off — FoundationStereo sits at the best accuracy but highest latency; Fast-FoundationStereo reaches ~98% of that accuracy at ~10× lower latency, plotted against RAFT-Stereo, CREStereo, and AANet.]
Feature           FoundationStereo (CVPR 2025)               Fast-FoundationStereo (CVPR 2026)
Encoder           EdgeNeXt-S + frozen DepthAnythingV2 ViT    Compressed encoder (NAS-optimized, no ViT at inference)
Cost volume       Full hybrid (correlation + concat)         Lightweight cost volume (distilled)
Cost filtering    APC + Disparity Transformer                Pruned filtering network
GRU iterations    32                                         Fewer iterations (NAS-discovered)
Key techniques    Side-tuning, AHCF, synthetic data          Knowledge distillation, NAS, structured pruning
Speed             Baseline                                   ~10× faster
Accuracy          SOTA (1st on Middlebury/ETH3D)             ~98% of FoundationStereo accuracy
Target use case   Offline processing, benchmarks             Real-time robotics, AR, autonomous driving
Pattern recognition: The FoundationStereo → Fast-FoundationStereo progression mirrors a common pattern in deep learning: first build the most accurate model possible (regardless of cost), then compress it for deployment. The same pattern appears in LLMs (GPT-4 → distilled models), detection (DETR → RT-DETR), and segmentation (SAM → EfficientSAM).

7 — References & Further Reading