FoundationStereo & Fast-FoundationStereo

Zero-Shot Stereo Matching — From Accurate to Real-Time
Tags: Stereo Matching · Depth Estimation · Zero-Shot · Foundation Model · NVIDIA · CVPR 2025 & 2026

1 — The Problem to Solve

Stereo matching is the task of estimating depth from two images taken simultaneously by a pair of cameras (like human eyes). For each pixel in the left image, find the corresponding pixel in the right image. The horizontal displacement between them — called disparity — is inversely proportional to depth: close objects have large disparity, far objects have small disparity.

[Figure: stereo geometry. An object at x_L in the left image appears at x_R in the right image; disparity d = x_L - x_R (near objects bright, far objects dark in the disparity map); depth = f·B / d converts the disparity map into a 3D point cloud for robotics, AR, and autonomous driving.]
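The depth-from-disparity relation can be checked numerically. In the sketch below the focal length and baseline are arbitrary illustration values, not calibration from any real rig:

```python
import numpy as np

# Hypothetical calibration values for illustration (not from the paper).
f = 700.0   # focal length in pixels
B = 0.12    # stereo baseline in meters

def disparity_to_depth(disparity, f, B, eps=1e-6):
    """depth = f * B / disparity; guard against zero disparity."""
    return f * B / np.maximum(disparity, eps)

d = np.array([70.0, 7.0, 0.7])       # large disparity -> near object
depth = disparity_to_depth(d, f, B)  # ≈ [1.2, 12.0, 120.0] meters
```

Note the inverse relationship: each 10× drop in disparity is a 10× increase in depth, which is why stereo depth precision degrades with distance.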

Previous stereo matching models worked well on specific benchmarks but failed on new domains — a model trained on indoor scenes would struggle with outdoor driving, and vice versa. FoundationStereo achieves zero-shot generalization: train once, deploy anywhere, without fine-tuning.

Why "foundation model"? Like GPT for language or SAM for segmentation, FoundationStereo achieves strong performance across diverse domains without task-specific fine-tuning. It does this through two pillars: massive diverse synthetic training data and injecting pre-trained monocular depth priors.

2 — Architecture Overview

FoundationStereo follows the RAFT-Stereo paradigm — extract features, build a cost volume, then iteratively refine disparity estimates — but adds three key innovations: a Side-Tuning Adapter (STA) for the encoder, a hybrid cost volume, and Attentive Hybrid Cost Filtering (AHCF).

[Figure: architecture pipeline. Left/right images → Feature Encoder (STA: multi-scale EdgeNeXt-S CNN backbone + frozen DepthAnythingV2 ViT, concatenated) → Hybrid Cost Volume (group-wise correlation, 8 groups, plus feature concat at shifted disparities; 4D: H × W × D × C) → Cost Filtering (AHCF: Axial-Planar 3D Conv + Disparity Transformer for spatial + disparity filtering) → Iterative Refinement (ConvGRU, coarse-to-fine, ×32 iterations) → Disparity map (H × W) → depth = f·B / disparity → 3D point cloud.]

3 — Layer-by-Layer Walkthrough

1 Feature Encoder: Side-Tuning Adapter (STA)

EdgeNeXt-S CNN + Frozen DepthAnythingV2 ViT

input [B, 3, H, W] → pyramid at 1/4, 1/8, 1/16, 1/32

Both left and right images pass through two parallel encoders (weights shared across left/right):

EdgeNeXt-S (trainable CNN) produces a feature pyramid at 1/4, 1/8, 1/16, and 1/32 resolutions. Concretely for a 540×960 input: [B, 48, 135, 240], [B, 96, 68, 120], [B, 160, 34, 60], [B, 304, 17, 30].

DepthAnythingV2 (frozen ViT-Large) outputs a single-scale patch grid. For 540×960 it produces [B, 1024, 38, 68] tokens (patch 14) which are bilinearly resized and projected to each pyramid scale, then concatenated with the CNN features along the channel axis. Resulting fused pyramid channels are 48+128, 96+128, 160+128, 304+128.
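The resize-project-concatenate step can be sketched in numpy. This is a minimal illustration: nearest-neighbor resizing stands in for bilinear, and random weights stand in for the learned 1×1 projection; none of the helper names come from the paper.

```python
import numpy as np

def resize_nearest(x, h, w):
    """Nearest-neighbor resize of [B, C, H, W] features (bilinear in the real model)."""
    B, C, H, W = x.shape
    rows = np.arange(h) * H // h
    cols = np.arange(w) * W // w
    return x[:, :, rows][:, :, :, cols]

B = 1
vit = np.random.randn(B, 1024, 38, 68)   # frozen DepthAnythingV2 patch grid
cnn_pyramid = [np.random.randn(B, c, h, w) for c, h, w in
               [(48, 135, 240), (96, 68, 120), (160, 34, 60), (304, 17, 30)]]

fused = []
for feat in cnn_pyramid:
    _, c, h, w = feat.shape
    v = resize_nearest(vit, h, w)
    # 1x1 projection 1024 -> 128 channels (random weights as a stand-in)
    W_proj = np.random.randn(128, 1024) * 0.01
    v = np.einsum('oc,bchw->bohw', W_proj, v)
    fused.append(np.concatenate([feat, v], axis=1))

[x.shape[1] for x in fused]   # channel counts 176, 224, 288, 432
```

The resulting channel counts match the 48+128, 96+128, 160+128, 304+128 pyramid described above.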

[Figure: side-tuning, best of both worlds. The trainable EdgeNeXt-S CNN and the frozen DepthAnythingV2 ViT run in parallel; their features are concatenated into rich stereo + monocular features. Why side-tuning? The CNN learns domain-specific stereo matching from synthetic data, the ViT provides real-world semantic priors from internet-scale training, and together they bridge the sim-to-real gap.]
Why freeze the ViT? DepthAnythingV2 learned rich depth priors from millions of real images. Fine-tuning it on synthetic stereo data would destroy this knowledge (catastrophic forgetting). By keeping it frozen, FoundationStereo preserves real-world understanding while the CNN learns stereo-specific patterns.

2 Hybrid Cost Volume

Correlation + Feature Concatenation

fused features → cost volume [B, C_cv, D, H/4, W/4]

The cost volume measures how well left and right image features match at every possible disparity. Built at 1/4 resolution with D = 192/4 = 48 candidate disparities. FoundationStereo uses a hybrid cost volume combining two signals along the channel dimension:

  • Group-wise correlation — split the 176-channel features into 8 groups, compute dot product per group across all disparity shifts. Output shape: [B, 8, 48, H/4, W/4].
  • Feature concatenation — concatenate left and right features at each disparity shift. Output shape: [B, 2·176, 48, H/4, W/4].

Final hybrid cost volume: [B, 8 + 352, 48, H/4, W/4], i.e. [B, 360, 48, 135, 240] for a 540×960 input. The concatenation branch preserves DepthAnythingV2's monocular priors; the correlation branch carries direct matching similarity.
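Both branches can be sketched in numpy. Shapes are scaled down here, and the shift and grouping logic is a simplified stand-in for the actual implementation:

```python
import numpy as np

def hybrid_cost_volume(fL, fR, D, groups=8):
    """fL, fR: [B, C, H, W] fused features at 1/4 resolution.
    Returns [B, groups + 2*C, D, H, W]."""
    Bn, C, H, W = fL.shape
    cost = np.zeros((Bn, groups + 2 * C, D, H, W), dtype=fL.dtype)
    gL = fL.reshape(Bn, groups, C // groups, H, W)
    for d in range(D):
        fRs = np.zeros_like(fR)
        if d == 0:
            fRs = fR
        else:
            fRs[..., d:] = fR[..., :-d]       # shift right features by disparity d
        gR = fRs.reshape(Bn, groups, C // groups, H, W)
        corr = (gL * gR).mean(axis=2)         # group-wise correlation: [B, 8, H, W]
        cost[:, :groups, d] = corr
        cost[:, groups:, d] = np.concatenate([fL, fRs], axis=1)  # concat branch
    return cost

fL = np.random.randn(1, 176, 12, 16).astype(np.float32)
fR = np.random.randn(1, 176, 12, 16).astype(np.float32)
cv = hybrid_cost_volume(fL, fR, D=8)   # shape (1, 360, 8, 12, 16)
```

With 176-channel features, the channel axis is 8 correlation groups plus 2·176 concatenated channels, matching the 360 channels above.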

3 Attentive Hybrid Cost Filtering (AHCF)

APC + Disparity Transformer

cost volume in/out: [B, C, D, H/4, W/4]

Raw cost volumes are noisy — reflections, textureless surfaces, and repetitive patterns create matching ambiguities. AHCF filters the cost volume with two complementary mechanisms:

Axial-Planar Convolution (APC) decomposes a full 3D convolution with a (Ks, Ks, Kd) kernel into a spatial conv (Ks, Ks, 1) followed by a disparity conv (1, 1, Kd). The parameter count drops from Ks²·Kd·C² to (Ks² + Kd)·C²: at Ks = Kd = 5, that is 125·C² versus 30·C², roughly a 4× reduction.
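The parameter arithmetic is easy to verify; `full3d_params` and `apc_params` are helper names introduced here, not from the paper:

```python
# Parameter counts for a C -> C 3D convolution vs. its axial-planar decomposition.
def full3d_params(ks, kd, c):
    return ks * ks * kd * c * c        # single (Ks, Ks, Kd) kernel

def apc_params(ks, kd, c):
    return (ks * ks + kd) * c * c      # (Ks, Ks, 1) conv then (1, 1, Kd) conv

ks, kd, c = 5, 5, 128
ratio = full3d_params(ks, kd, c) / apc_params(ks, kd, c)   # 125/30 ≈ 4.17
```

The savings grow with kernel size: the full kernel scales multiplicatively in Ks²·Kd while the decomposition scales additively.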

Disparity Transformer (DT) reshapes the cost volume to [B·H/4·W/4, D, C] and applies self-attention across the D=48 disparity tokens. When a textureless wall creates ambiguity across many disparities, global attention lets the model reason about the overall disparity distribution instead of just local maxima. Attention is O(D²) = 2304 per spatial location — tractable because D is small.
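A single-head, numpy-only sketch of attention over disparity tokens follows; the real Disparity Transformer uses multi-head attention with positional encodings and learned projections, so treat the details as illustrative:

```python
import numpy as np

def disparity_self_attention(cost, Wq, Wk, Wv):
    """cost: [N, D, C], where N = B * H/4 * W/4 spatial locations.
    Single-head self-attention across the D disparity tokens."""
    q, k, v = cost @ Wq, cost @ Wk, cost @ Wv                  # each [N, D, C]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])   # [N, D, D]
    scores -= scores.max(axis=-1, keepdims=True)               # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)                   # softmax over disparities
    return attn @ v                                            # [N, D, C]

N, D, C = 4, 48, 32
cost = np.random.randn(N, D, C)
Wq, Wk, Wv = (np.random.randn(C, C) * 0.1 for _ in range(3))
out = disparity_self_attention(cost, Wq, Wk, Wv)   # (4, 48, 32)
```

The attention matrix is only D × D = 48 × 48 per location, which is why full attention over disparities stays cheap.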

4 Iterative Refinement (ConvGRU)

Coarse-to-Fine GRU Updates

disparity: [B, 1, H/4, W/4] → [B, 1, H, W]

Starting from an initial disparity estimate d0 ∈ [B, 1, H/4, W/4], a ConvGRU iteratively refines the prediction. At each iteration, a lookup kernel samples the filtered cost volume at the current disparity's neighborhood, producing a local cost vector [B, C_lookup, H/4, W/4]; the GRU cell combines this with its hidden state ht ∈ [B, C_h, H/4, W/4] to produce a residual update Δdt.

22 iterations during training, 32 at inference. The final disparity is upsampled 4× via learned convex combination to full resolution [B, 1, H, W].
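A toy version of the refinement loop is sketched below: a random 1×1 linear update stands in for the real ConvGRU cell, and nearest-neighbor cost lookup stands in for linear interpolation, so only the loop structure is faithful:

```python
import numpy as np

def lookup(cost, disp, radius=2):
    """Sample the cost volume around the current disparity estimate."""
    B, D, H, W = cost.shape
    idx = np.clip(np.round(disp).astype(int), 0, D - 1)   # [B, H, W]
    samples = []
    for r in range(-radius, radius + 1):
        j = np.clip(idx + r, 0, D - 1)
        samples.append(np.take_along_axis(cost, j[:, None], axis=1))
    return np.concatenate(samples, axis=1)                # [B, 2*radius+1, H, W]

B, D, H, W = 1, 48, 8, 8
cost = np.random.randn(B, D, H, W)
disp = np.zeros((B, H, W))             # initial disparity estimate
Wu = np.random.randn(1, 5) * 0.1       # random stand-in for the ConvGRU update
for _ in range(32):                    # 32 iterations at inference
    local = lookup(cost, disp)                            # local cost vector
    delta = np.einsum('oc,bchw->bohw', Wu, local)[:, 0]   # residual update
    disp = np.clip(disp + delta, 0, D - 1)
```

The real model additionally maintains a GRU hidden state and finishes with a learned convex-combination upsampler to reach full resolution.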

4 — Training Data Strategy

FoundationStereo Dataset (FSD): 1 Million Synthetic Pairs

The key to zero-shot generalization is diverse, high-quality training data. NVIDIA generated 1 million stereo pairs using Omniverse with RTX path tracing (32-128 samples per pixel for photorealism). The dataset covers:

  • Structured indoor and outdoor scenes (rooms, streets, warehouses)
  • Navigation, driving, and manipulation scenarios
  • Randomized "flying object" scenes for extreme variety
  • Varying camera intrinsics and baselines
  • Challenging conditions: reflections, transparency, low texture, heavy occlusion
Self-curation: An automatic pipeline trains a model, runs inference on its own training set, and removes samples where the bad-pixel rate exceeds 60% — filtering out ambiguous or mislabeled examples that would confuse training.
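The curation rule can be sketched directly. The 1 px error threshold inside `bad_pixel_rate` is an assumption for illustration; only the 60% rejection cutoff comes from the description above:

```python
import numpy as np

def bad_pixel_rate(pred, gt, thresh=1.0):
    """Fraction of pixels whose disparity error exceeds thresh."""
    return float(np.mean(np.abs(pred - gt) > thresh))

def curate(samples, max_bp=0.6):
    """Keep samples whose bad-pixel rate is at most 60%."""
    return [s for s in samples if bad_pixel_rate(s['pred'], s['gt']) <= max_bp]

gt = np.full((4, 4), 10.0)
good = {'pred': gt + 0.5, 'gt': gt}   # all pixels within threshold
bad = {'pred': gt + 5.0, 'gt': gt}    # all pixels outside threshold
kept = curate([good, bad])            # only the good sample survives
```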

5 — Results

Zero-shot performance (no fine-tuning on target domains):

Benchmark    Metric   FoundationStereo         Domain
Middlebury   BP-2     1.2%                     Indoor close-range
ETH3D        BP-1     1.4%                     Indoor/outdoor
KITTI-12     D1       1.9%                     Driving
KITTI-15     D1       2.2%                     Driving
SceneFlow    EPE      0.33 (prev. best 0.41)   Synthetic
1st place on both Middlebury and ETH3D leaderboards — as a zero-shot model competing against methods that were fine-tuned specifically for each benchmark. This demonstrates true foundation model generalization.

6 — Fast-FoundationStereo (CVPR 2026)

Wen et al. — NVIDIA — GitHub — Making FoundationStereo Real-Time

FoundationStereo achieved state-of-the-art accuracy but is too slow for robotics and real-time applications. Fast-FoundationStereo achieves a ~10× speedup while closely matching accuracy through three complementary techniques: knowledge distillation, Neural Architecture Search (NAS), and structured pruning.

The core challenge: FoundationStereo's accuracy comes from expensive components — a large ViT encoder (DepthAnythingV2), a heavy hybrid cost volume, and 32 GRU iterations. Fast-FoundationStereo systematically compresses each component while preserving the knowledge that makes the model generalize.

1 Knowledge Distillation: Teacher-Student Training

FoundationStereo as Teacher

Large model (teacher) → trains → small model (student)

The full FoundationStereo model serves as a teacher. A smaller, faster student network is trained to match the teacher's outputs — not just the final disparity map, but also intermediate features and cost volume representations. This transfers the teacher's generalization ability to the student without requiring the expensive ViT encoder at inference.
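A sketch of a multi-term distillation objective follows; the loss weights and the use of plain L1 on each term are illustrative assumptions, not the paper's exact recipe:

```python
import numpy as np

def distillation_loss(student, teacher, w_disp=1.0, w_feat=0.5, w_cost=0.5):
    """Weighted L1 losses on disparity, intermediate features, and cost volume."""
    l_disp = np.abs(student['disp'] - teacher['disp']).mean()
    l_feat = np.abs(student['feat'] - teacher['feat']).mean()
    l_cost = np.abs(student['cost'] - teacher['cost']).mean()
    return w_disp * l_disp + w_feat * l_feat + w_cost * l_cost

t = {'disp': np.random.randn(1, 1, 8, 8),
     'feat': np.random.randn(1, 64, 8, 8),
     'cost': np.random.randn(1, 16, 8, 8, 8)}
s = {k: v + 0.1 for k, v in t.items()}   # student off by 0.1 everywhere
loss = distillation_loss(s, t)           # 1.0*0.1 + 0.5*0.1 + 0.5*0.1 ≈ 0.2
```

Supervising intermediate features and the cost volume, not just the final disparity, is what lets the student absorb the teacher's representations rather than only its answers.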

[Figure: knowledge distillation. Teacher: FoundationStereo (EdgeNeXt-S + DepthAnythingV2, full cost volume, 32 GRU iterations) distills into student: Fast-FoundationStereo (lightweight backbone, fewer iterations) via feature-level, cost-volume, and output-disparity distillation. The student retains ~98% of the teacher's accuracy.]

2 Neural Architecture Search (NAS)

Automated Model Compression

Search over backbone, cost volume, and GRU configurations

Rather than manually designing the student architecture, NAS automatically discovers a configuration that balances speed and accuracy. The search space includes backbone width/depth, cost volume resolution, number of GRU iterations, and channel dimensions throughout the network. This surfaces architectures a human designer would miss, sometimes counterintuitive ones: a wider-but-shallower network can outperform a narrow-but-deep one at the same latency.
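The search loop can be illustrated with toy random search. The search space, latency model, and accuracy proxy below are entirely made up; a real NAS pipeline profiles candidates on hardware and validates them on held-out data:

```python
import random

# Hypothetical search space over student hyperparameters.
SPACE = {'width': [32, 48, 64], 'cost_channels': [8, 16, 32], 'gru_iters': [8, 12, 16]}

def latency_ms(cfg):
    """Fabricated linear latency model."""
    return 0.2 * cfg['width'] + 0.3 * cfg['cost_channels'] + 0.8 * cfg['gru_iters']

def accuracy_proxy(cfg):
    """Fabricated stand-in for validation accuracy."""
    return 0.1 * cfg['width'] + 0.2 * cfg['cost_channels'] + 0.3 * cfg['gru_iters']

def search(budget_ms=30.0, trials=200, seed=0):
    rng = random.Random(seed)
    best, best_acc = None, -1.0
    for _ in range(trials):
        cfg = {k: rng.choice(v) for k, v in SPACE.items()}
        if latency_ms(cfg) <= budget_ms and accuracy_proxy(cfg) > best_acc:
            best, best_acc = cfg, accuracy_proxy(cfg)
    return best

best = search()   # most accurate sampled config under the 30 ms budget
```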

3 Structured Pruning

Remove Redundant Channels

Identify and prune low-importance filters

Structured pruning removes entire channels (filters) from convolutional and linear layers based on importance scores. Unlike unstructured pruning (which creates sparse weights that don't speed up on GPUs), structured pruning gives real speedups because it reduces matrix dimensions. The pruned model is then fine-tuned to recover accuracy.
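A minimal L1-norm filter-pruning sketch in numpy; a real pipeline would also rewire the input channels of the consuming layer and fine-tune afterward:

```python
import numpy as np

def prune_filters(W, keep_ratio=0.5):
    """W: [C_out, C_in, k, k] conv weights. Drop the output channels with the
    smallest L1 norms; returns pruned weights and the kept channel indices."""
    norms = np.abs(W).reshape(W.shape[0], -1).sum(axis=1)
    n_keep = max(1, int(W.shape[0] * keep_ratio))
    keep = np.sort(np.argsort(norms)[-n_keep:])   # indices of largest-norm filters
    return W[keep], keep

W = np.random.randn(64, 32, 3, 3)
W_pruned, keep = prune_filters(W, keep_ratio=0.25)   # shape (16, 32, 3, 3)
```

Because whole output channels disappear, the pruned layer is a genuinely smaller dense matrix multiply, which is what delivers real GPU speedups.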

FoundationStereo vs. Fast-FoundationStereo

[Figure: accuracy vs. latency trade-off. FoundationStereo sits at the best accuracy but highest latency; Fast-FoundationStereo retains ~98% of its accuracy at ~10× lower latency, ahead of RAFT-Stereo, CREStereo, and AANet.]
Feature           FoundationStereo (CVPR 2025)              Fast-FoundationStereo (CVPR 2026)
Encoder           EdgeNeXt-S + frozen DepthAnythingV2 ViT   Compressed encoder (NAS-optimized, no ViT at inference)
Cost volume       Full hybrid (correlation + concat)        Lightweight cost volume (distilled)
Cost filtering    APC + Disparity Transformer               Pruned filtering network
GRU iterations    32                                        Fewer iterations (NAS-discovered)
Key techniques    Side-tuning, AHCF, synthetic data         Knowledge distillation, NAS, structured pruning
Speed             Baseline                                  ~10× faster
Accuracy          SOTA (1st on Middlebury/ETH3D)            ~98% of FoundationStereo accuracy
Target use case   Offline processing, benchmarks            Real-time robotics, AR, autonomous driving
Pattern recognition: The FoundationStereo → Fast-FoundationStereo progression mirrors a common pattern in deep learning: first build the most accurate model possible (regardless of cost), then compress it for deployment. The same pattern appears in LLMs (GPT-4 → distilled models), detection (DETR → RT-DETR), and segmentation (SAM → EfficientSAM).

7 — References & Further Reading