FoundationStereo & Fast-FoundationStereo
1 — The Problem to Solve
Stereo matching is the task of estimating depth from two images taken simultaneously by a pair of cameras (like human eyes). For each pixel in the left image, the model must find the corresponding pixel in the right image. The horizontal displacement between them — called disparity — is inversely proportional to depth: close objects have large disparity, far objects have small disparity.
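For a rectified stereo pair this inverse relationship is depth = f·B / d, where f is the focal length in pixels and B the camera baseline. A minimal sketch, using hypothetical camera values:

```python
# Depth from disparity for a rectified stereo pair:
#   depth = (focal_length * baseline) / disparity
# f = 600 px and B = 0.5 m are illustrative values, not from the paper.

def depth_from_disparity(disparity_px: float, focal_px: float = 600.0,
                         baseline_m: float = 0.5) -> float:
    """Convert a disparity in pixels to metric depth in meters."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# Close object: large disparity -> small depth.
print(depth_from_disparity(300.0))   # 1.0 (meters)
# Far object: small disparity -> large depth.
print(depth_from_disparity(15.0))    # 20.0 (meters)
```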
Previous stereo matching models worked well on specific benchmarks but failed on new domains — a model trained on indoor scenes would struggle with outdoor driving, and vice versa. FoundationStereo achieves zero-shot generalization: train once, deploy anywhere, without fine-tuning.
2 — Architecture Overview
FoundationStereo follows the RAFT-Stereo paradigm — extract features, build a cost volume, then iteratively refine disparity estimates — but adds three key innovations: a Side-Tuning Adapter (STA) for the encoder, a hybrid cost volume, and Attentive Hybrid Cost Filtering (AHCF).
3 — Layer-by-Layer Walkthrough
1 Feature Encoder: Side-Tuning Adapter (STA)
EdgeNeXt-S CNN + Frozen DepthAnythingV2 ViT
Both left and right images pass through two parallel encoders:
EdgeNeXt-S — a lightweight CNN that produces multi-level feature pyramids at 1/4, 1/8, 1/16, and 1/32 scales. This CNN is trained end-to-end and learns stereo-specific features.
DepthAnythingV2 — a frozen (non-trainable) ViT foundation model pre-trained on massive real-world monocular depth data. Its features are downscaled and concatenated with the CNN features at corresponding scales.
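The fusion step can be sketched roughly as follows, with random NumPy arrays standing in for the learned CNN features, the frozen ViT features, and a learned 1×1 projection (all shapes and channel counts here are illustrative assumptions, not the paper's exact values):

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 32, 32                               # feature map size at 1/4 scale (assumed)
cnn_feat = rng.standard_normal((64, H, W))  # trainable EdgeNeXt-S features
vit_feat = rng.standard_normal((128, H, W)) # frozen DepthAnythingV2 features,
                                            # already resized to the same scale

# Side-tuning idea (simplified): concatenate frozen ViT features with the
# trainable CNN features channel-wise, then mix them with a learned 1x1
# projection (here a random matrix stands in for the learned weights).
fused = np.concatenate([cnn_feat, vit_feat], axis=0)   # (192, H, W)
proj = rng.standard_normal((96, 192)) * 0.01           # 1x1 conv as a matrix
out = np.einsum('oc,chw->ohw', proj, fused)            # (96, H, W)
print(out.shape)   # (96, 32, 32)
```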
2 Hybrid Cost Volume
Correlation + Feature Concatenation
The cost volume measures how well left and right image features match at every possible disparity. FoundationStereo uses a hybrid approach combining two signals:
- Group-wise correlation (8 groups) — dot product between left features and horizontally shifted right features, split across channel groups. This captures matching similarity.
- Feature concatenation — raw left and right features concatenated at each disparity. This preserves the monocular depth priors from DepthAnythingV2 that would be lost in a pure correlation volume.
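A toy version of both volumes, with illustrative shapes and random features (only the structure matches the description above; real features are learned):

```python
import numpy as np

C, H, W, D = 32, 8, 16, 6            # channels, height, width, disparity range
G = 8                                 # correlation groups
rng = np.random.default_rng(0)
fl = rng.standard_normal((C, H, W))   # left features
fr = rng.standard_normal((C, H, W))   # right features

corr = np.zeros((G, D, H, W))
concat = np.zeros((2 * C, D, H, W))
for d in range(D):
    # Shift right features by d pixels (zero-pad the left border).
    fr_d = np.zeros_like(fr)
    fr_d[:, :, d:] = fr[:, :, :W - d]
    # Group-wise correlation: mean dot product within each channel group.
    gl = fl.reshape(G, C // G, H, W)
    gr = fr_d.reshape(G, C // G, H, W)
    corr[:, d] = (gl * gr).mean(axis=1)
    # Concatenation volume: keep the raw features from both views.
    concat[:, d] = np.concatenate([fl, fr_d], axis=0)

print(corr.shape, concat.shape)   # (8, 6, 8, 16) (64, 6, 8, 16)
```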
3 Attentive Hybrid Cost Filtering (AHCF)
APC + Disparity Transformer
Raw cost volumes are noisy — reflections, textureless surfaces, and repetitive patterns create matching ambiguities. AHCF filters the cost volume with two complementary mechanisms:
Axial-Planar Convolution (APC) decomposes a full 3D convolution into a spatial conv (Ks×Ks×1) and a disparity conv (1×1×Kd). This reduces memory dramatically while capturing local spatial and disparity patterns.
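A quick parameter count shows why the factorization helps. The channel counts and the large disparity kernel below are illustrative assumptions:

```python
# Parameter comparison for the APC factorization: a full 3D conv with kernel
# Ks x Ks x Kd versus a spatial conv (Ks x Ks x 1) followed by a disparity
# conv (1 x 1 x Kd). Channel counts are made up for illustration.
C_in, C_out = 32, 32
Ks, Kd = 3, 17          # the factorization makes a large disparity kernel affordable

full_3d = C_in * C_out * Ks * Ks * Kd
apc = C_in * C_out * Ks * Ks * 1 + C_out * C_out * 1 * 1 * Kd
print(full_3d, apc, round(full_3d / apc, 1))   # 156672 26624 5.9
```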
Disparity Transformer (DT) applies self-attention across the entire disparity dimension, providing global reasoning. When a textureless wall creates ambiguity at many disparities, the transformer can reason about the overall disparity distribution to pick the correct match.
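Stripped to its core, self-attention along the disparity axis at a single pixel looks like this (random matrices stand in for learned projections; multi-head attention and feed-forward layers are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
D, C = 16, 8                            # disparity bins, channels per bin
tokens = rng.standard_normal((D, C))    # cost features at one pixel

# Learned Q/K/V projections, here random stand-ins.
Wq, Wk, Wv = (rng.standard_normal((C, C)) for _ in range(3))
q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv

scores = q @ k.T / np.sqrt(C)           # (D, D): every disparity attends to every other
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
out = attn @ v                          # globally filtered cost features
print(out.shape)   # (16, 8)
```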
4 Iterative Refinement (ConvGRU)
Coarse-to-Fine GRU Updates
Starting from a coarse disparity estimate, a ConvGRU (Convolutional Gated Recurrent Unit) iteratively refines the prediction. Each iteration looks up the cost volume at the current disparity estimate, computes a correction, and updates. The model runs 22 refinement iterations during training and 32 at inference. The coarse-to-fine structure starts at low resolution and progressively refines details.
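The loop structure can be sketched as follows; `lookup` and `gru_update` are heavily simplified stand-ins for the real cost-volume lookup and ConvGRU:

```python
import numpy as np

def lookup(cost, disp):
    """Sample the cost volume at the (rounded) current disparity per pixel."""
    D = cost.shape[0]
    idx = np.clip(np.rint(disp).astype(int), 0, D - 1)
    return np.take_along_axis(cost, idx[None], axis=0)[0]

def gru_update(hidden, features):
    # Placeholder dynamics: the real ConvGRU predicts a per-pixel delta.
    hidden = 0.9 * hidden + 0.1 * features
    return hidden, 0.05 * hidden          # (new hidden state, disparity correction)

H, W, D = 8, 8, 32
cost = np.random.default_rng(0).standard_normal((D, H, W))
disp = np.zeros((H, W))                   # coarse initial disparity
hidden = np.zeros((H, W))
for _ in range(32):                       # 32 iterations at inference
    sampled = lookup(cost, disp)          # index cost volume at current estimate
    hidden, delta = gru_update(hidden, sampled)
    disp = disp + delta                   # apply the correction
print(disp.shape)   # (8, 8)
```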
4 — Training Data Strategy
FoundationStereo Dataset (FSD): 1 Million Synthetic Pairs
The key to zero-shot generalization is diverse, high-quality training data. NVIDIA generated 1 million stereo pairs using Omniverse with RTX path tracing (32-128 samples per pixel for photorealism). The dataset covers:
- Structured indoor and outdoor scenes (rooms, streets, warehouses)
- Navigation, driving, and manipulation scenarios
- Randomized "flying object" scenes for extreme variety
- Varying camera intrinsics and baselines
- Challenging conditions: reflections, transparency, low texture, heavy occlusion
5 — Results
Zero-shot performance (no fine-tuning on target domains):
| Benchmark | Metric | FoundationStereo | Domain |
|---|---|---|---|
| Middlebury | BP-2 | 1.2% | Indoor close-range |
| ETH3D | BP-1 | 1.4% | Indoor/outdoor |
| KITTI-12 | D1 | 1.9% | Driving |
| KITTI-15 | D1 | 2.2% | Driving |
| SceneFlow | EPE | 0.33 (prev: 0.41) | Synthetic |
6 — Fast-FoundationStereo (CVPR 2026)
Wen et al. — NVIDIA — Making FoundationStereo Real-Time
FoundationStereo achieved state-of-the-art accuracy but is too slow for robotics and real-time applications. Fast-FoundationStereo achieves a ~10× speedup while closely matching accuracy through three complementary techniques: knowledge distillation, Neural Architecture Search (NAS), and structured pruning.
1 Knowledge Distillation: Teacher-Student Training
FoundationStereo as Teacher
The full FoundationStereo model serves as a teacher. A smaller, faster student network is trained to match the teacher's outputs — not just the final disparity map, but also intermediate features and cost volume representations. This transfers the teacher's generalization ability to the student without requiring the expensive ViT encoder at inference.
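A distillation objective of this shape might look like the following sketch; the specific loss terms and weights are assumptions, not the paper's exact formulation:

```python
import numpy as np

def distill_loss(student_disp, teacher_disp, student_feat, teacher_feat,
                 w_disp=1.0, w_feat=0.5):
    """Supervise the student with the teacher's disparity AND features."""
    disp_term = np.abs(student_disp - teacher_disp).mean()    # L1 on disparity maps
    feat_term = ((student_feat - teacher_feat) ** 2).mean()   # L2 on intermediate features
    return w_disp * disp_term + w_feat * feat_term

# Toy inputs: student is uniformly 0.5 px off, features match exactly.
s_disp, t_disp = np.ones((8, 8)), np.full((8, 8), 1.5)
s_feat, t_feat = np.zeros((16, 8, 8)), np.zeros((16, 8, 8))
print(distill_loss(s_disp, t_disp, s_feat, t_feat))   # 0.5
```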
2 Neural Architecture Search (NAS)
Automated Model Compression
Rather than manually designing the student architecture, NAS automatically discovers the configuration that best balances speed and accuracy. The search space includes: backbone width/depth, cost volume resolution, number of GRU iterations, and channel dimensions throughout the network. This can surface architectures a human designer would miss; counterintuitive choices such as wider-but-shallower networks sometimes outperform narrow-but-deep ones at the same latency.
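A toy random search over a search space of this kind, with made-up latency and accuracy models standing in for real measurements:

```python
import random

# Hypothetical search space; the real NAS space is described in the text above.
space = {"width": [32, 48, 64], "depth": [2, 3, 4], "gru_iters": [4, 8, 16]}

def latency_ms(cfg):      # made-up latency model; a real NAS measures on hardware
    return cfg["width"] * cfg["depth"] * 0.01 + cfg["gru_iters"] * 0.5

def accuracy(cfg):        # made-up accuracy proxy; a real NAS validates on data
    return cfg["width"] * cfg["depth"] * 0.001 + cfg["gru_iters"] * 0.01

# Keep the most accurate sampled configuration under a latency budget.
random.seed(0)
budget_ms = 8.0
best = None
for _ in range(50):
    cfg = {k: random.choice(v) for k, v in space.items()}
    if latency_ms(cfg) <= budget_ms and (best is None or accuracy(cfg) > accuracy(best)):
        best = cfg
print(best)
```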
3 Structured Pruning
Remove Redundant Channels
Structured pruning removes entire channels (filters) from convolutional and linear layers based on importance scores. Unlike unstructured pruning (which creates sparse weights that don't speed up on GPUs), structured pruning gives real speedups because it reduces matrix dimensions. The pruned model is then fine-tuned to recover accuracy.
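The core operation, sketched with an L1-norm importance score (a common heuristic; the actual pruning criterion used here is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32, 3, 3))   # conv weight: (out_ch, in_ch, kH, kW)

# Rank output channels by L1 norm and keep the strongest 75%. The weight
# tensor genuinely shrinks, which is what yields real GPU speedups.
keep = 48
scores = np.abs(W).sum(axis=(1, 2, 3))    # per-channel importance
kept_idx = np.sort(np.argsort(scores)[-keep:])
W_pruned = W[kept_idx]                    # (48, 32, 3, 3)
print(W.shape, "->", W_pruned.shape)
```

After pruning, the downstream layer's input channels must be sliced to match, and the whole network is fine-tuned to recover accuracy.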
FoundationStereo vs. Fast-FoundationStereo
| Feature | FoundationStereo (CVPR 2025) | Fast-FoundationStereo (CVPR 2026) |
|---|---|---|
| Encoder | EdgeNeXt-S + frozen DepthAnythingV2 ViT | Compressed encoder (NAS-optimized, no ViT at inference) |
| Cost Volume | Full hybrid (correlation + concat) | Lightweight cost volume (distilled) |
| Cost Filtering | APC + Disparity Transformer | Pruned filtering network |
| GRU Iterations | 32 | Fewer iterations (NAS-discovered) |
| Key Techniques | Side-tuning, AHCF, synthetic data | Knowledge distillation, NAS, structured pruning |
| Speed | Baseline | ~10× faster |
| Accuracy | SOTA (1st on Middlebury/ETH3D) | ~98% of FoundationStereo accuracy |
| Target Use Case | Offline processing, benchmarks | Real-time robotics, AR, autonomous driving |
7 — References & Further Reading
- FoundationStereo: Zero-Shot Stereo Matching — Wen et al., NVIDIA, CVPR 2025
- Official GitHub Repository
- Project Page
- Fast-FoundationStereo (CVPR 2026)
- RAFT-Stereo — Lipson et al., 2021 (iterative refinement baseline)