FoundationStereo & Fast-FoundationStereo
1 — The Problem to Solve
Stereo matching is the task of estimating depth from two images taken simultaneously by a pair of cameras (like human eyes). For each pixel in the left image, find the corresponding pixel in the right image. The horizontal displacement between them — called disparity — is inversely proportional to depth: close objects have large disparity, far objects have small disparity.
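The disparity-to-depth relation can be sketched in a few lines (the focal length and baseline values here are illustrative, not from the paper):

```python
import numpy as np

# Depth from disparity for a rectified stereo pair:
#   depth = focal_length * baseline / disparity
focal_px = 700.0      # focal length in pixels (illustrative)
baseline_m = 0.12     # camera baseline in meters (illustrative)

disparity_px = np.array([96.0, 48.0, 12.0])   # large -> close, small -> far
depth_m = focal_px * baseline_m / disparity_px

print(depth_m)  # [0.875 1.75  7.   ]
```

Halving the disparity doubles the depth, which is why small disparity errors on far objects translate into large depth errors.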
Previous stereo matching models worked well on specific benchmarks but failed on new domains — a model trained on indoor scenes would struggle with outdoor driving, and vice versa. FoundationStereo achieves zero-shot generalization: train once, deploy anywhere, without fine-tuning.
2 — Architecture Overview
FoundationStereo follows the RAFT-Stereo paradigm — extract features, build a cost volume, then iteratively refine disparity estimates — but adds three key innovations: a Side-Tuning Adapter (STA) for the encoder, a hybrid cost volume, and Attentive Hybrid Cost Filtering (AHCF).
3 — Layer-by-Layer Walkthrough
1 Feature Encoder: Side-Tuning Adapter (STA)
EdgeNeXt-S CNN + Frozen DepthAnythingV2 ViT
Both left and right images pass through two parallel encoders (weights shared across left/right):
EdgeNeXt-S (trainable CNN) produces a feature pyramid at 1/4, 1/8, 1/16, and 1/32 resolutions. Concretely for a 540×960 input: [B, 48, 135, 240], [B, 96, 68, 120], [B, 160, 34, 60], [B, 304, 17, 30].
DepthAnythingV2 (frozen ViT-Large) outputs a single-scale patch grid. For 540×960 it produces [B, 1024, 38, 68] tokens (patch 14) which are bilinearly resized and projected to each pyramid scale, then concatenated with the CNN features along the channel axis. Resulting fused pyramid channels are 48+128, 96+128, 160+128, 304+128.
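A shape-level sketch of this fusion at the 1/4 scale, using nearest-neighbor resizing and a zero-initialized matrix as stand-ins for the bilinear resize and the learned projection:

```python
import numpy as np

B, Hq, Wq = 1, 135, 240
cnn_feat = np.zeros((B, 48, Hq, Wq), dtype=np.float32)      # EdgeNeXt-S, 1/4 scale

vit_tokens = np.zeros((B, 1024, 38, 68), dtype=np.float32)  # frozen DepthAnythingV2
# Nearest-neighbor resize as a stand-in for bilinear interpolation:
yi = np.arange(Hq) * 38 // Hq
xi = np.arange(Wq) * 68 // Wq
vit_resized = vit_tokens[:, :, yi][:, :, :, xi]             # [B, 1024, 135, 240]

proj = np.zeros((128, 1024), dtype=np.float32)              # 1x1-conv projection
vit_proj = np.einsum('oc,bchw->bohw', proj, vit_resized)    # [B, 128, 135, 240]

fused = np.concatenate([cnn_feat, vit_proj], axis=1)        # channel concat
print(fused.shape)  # (1, 176, 135, 240)
```

The same resize-project-concat step repeats at the other pyramid scales, which is where the 96+128, 160+128, and 304+128 channel counts come from.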
2 Hybrid Cost Volume
Correlation + Feature Concatenation
The cost volume measures how well left and right image features match at every possible disparity. Built at 1/4 resolution with D = 192/4 = 48 candidate disparities. FoundationStereo uses a hybrid cost volume combining two signals along the channel dimension:
- Group-wise correlation — split the 176-channel features into 8 groups, compute dot product per group across all disparity shifts. Output shape: [B, 8, 48, H/4, W/4].
- Feature concatenation — concatenate left and right features at each disparity shift. Output shape: [B, 2·176, 48, H/4, W/4].
Final hybrid cost volume: [B, 8 + 352, 48, H/4, W/4] — for the 540×960 example, [B, 360, 48, 135, 240]. The concatenation branch preserves DepthAnythingV2's monocular priors; the correlation branch carries direct matching similarity.
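The two branches can be sketched at toy sizes (the real model uses C = 176 channels, D = 48 disparities, and 8 groups; the helper name below is an assumption):

```python
import numpy as np

def hybrid_cost_volume(fl, fr, max_disp, groups):
    """Toy hybrid cost volume: group-wise correlation + feature concatenation."""
    B, C, H, W = fl.shape
    gc = C // groups
    corr = np.zeros((B, groups, max_disp, H, W), dtype=fl.dtype)
    cat = np.zeros((B, 2 * C, max_disp, H, W), dtype=fl.dtype)
    for d in range(max_disp):
        # Shift the right features by d pixels (zero-pad the vacated columns):
        fr_shift = np.zeros_like(fr)
        if d > 0:
            fr_shift[..., d:] = fr[..., :-d]
        else:
            fr_shift = fr
        # Group-wise correlation: mean dot product within each channel group
        fg_l = fl.reshape(B, groups, gc, H, W)
        fg_r = fr_shift.reshape(B, groups, gc, H, W)
        corr[:, :, d] = (fg_l * fg_r).mean(axis=2)
        # Feature concatenation along the channel axis
        cat[:, :, d] = np.concatenate([fl, fr_shift], axis=1)
    return np.concatenate([corr, cat], axis=1)  # [B, groups + 2C, D, H, W]

fl = np.random.rand(1, 16, 8, 10).astype(np.float32)
fr = np.random.rand(1, 16, 8, 10).astype(np.float32)
vol = hybrid_cost_volume(fl, fr, max_disp=4, groups=4)
print(vol.shape)  # (1, 36, 4, 8, 10)
```

With C = 176 and 8 groups, the channel count becomes 8 + 352 = 360, matching the shape above.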
3 Attentive Hybrid Cost Filtering (AHCF)
APC + Disparity Transformer
Raw cost volumes are noisy — reflections, textureless surfaces, and repetitive patterns create matching ambiguities. AHCF filters the cost volume with two complementary mechanisms:
Axial-Planar Convolution (APC) decomposes a full 3D convolution (Ks, Ks, Kd) into a spatial conv (Ks, Ks, 1) followed by a disparity conv (1, 1, Kd). Parameter count drops from Ks²·Kd·C² to (Ks² + Kd)·C² — roughly a 4× reduction at Ks = Kd = 5 (125C² vs. 30C²).
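The parameter arithmetic is easy to verify:

```python
# Parameter count of a full 3D conv vs. the APC factorization
# (assuming C input channels and C output channels, no bias).
Ks, Kd, C = 5, 5, 64
full_3d = Ks * Ks * Kd * C * C          # (Ks, Ks, Kd) kernel: 125 * C^2
apc = (Ks * Ks + Kd) * C * C            # (Ks, Ks, 1) + (1, 1, Kd): 30 * C^2
print(full_3d / apc)  # 4.1666... regardless of C
```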
Disparity Transformer (DT) reshapes the cost volume to [B·H/4·W/4, D, C] and applies self-attention across the D=48 disparity tokens. When a textureless wall creates ambiguity across many disparities, global attention lets the model reason about the overall disparity distribution instead of just local maxima. Attention cost is D² = 48² = 2304 score pairs per spatial location — tractable because D is small.
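A single-head sketch of self-attention over disparity tokens (multi-head attention, positional encodings, and feed-forward layers in the real model are omitted here; the function name and projections are assumptions):

```python
import numpy as np

def disparity_self_attention(cost, wq, wk, wv):
    """Softmax self-attention over disparity tokens, per spatial location.

    cost: [N, D, C] with N = B * H/4 * W/4 flattened spatial positions.
    wq/wk/wv: [C, C] projection matrices (learned in the real model).
    """
    q, k, v = cost @ wq, cost @ wk, cost @ wv                 # [N, D, C]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # [N, D, D]
    scores -= scores.max(axis=-1, keepdims=True)              # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)                  # softmax over D
    return attn @ v                                           # [N, D, C]

N, D, C = 6, 48, 32                                           # D = 48 as in the text
rng = np.random.default_rng(0)
cost = rng.standard_normal((N, D, C))
w = [rng.standard_normal((C, C)) * 0.1 for _ in range(3)]
out = disparity_self_attention(cost, *w)
print(out.shape)  # (6, 48, 32)
```

Each spatial position attends only across its own 48 disparity hypotheses, which is why the quadratic attention cost stays small.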
4 Iterative Refinement (ConvGRU)
Coarse-to-Fine GRU Updates
Starting from an initial disparity estimate d0 ∈ [B, 1, H/4, W/4], a ConvGRU iteratively refines the prediction. At each iteration, a lookup kernel samples the filtered cost volume at the current disparity's neighborhood, producing a local cost vector [B, C_lookup, H/4, W/4]; the GRU cell combines this with its hidden state ht ∈ [B, C_h, H/4, W/4] to produce a residual update Δdt.
22 iterations during training, 32 at inference. The final disparity is upsampled 4× via learned convex combination to full resolution [B, 1, H, W].
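A toy stand-in for the refinement loop: the residual below is computed analytically from a fixed target, whereas the real ConvGRU predicts it from cost-volume lookups and its hidden state.

```python
import numpy as np

def refine(d0, target, n_iters, step=0.5):
    """Toy iterative refinement: each step moves the estimate part-way
    toward a fixed point, mimicking the shrinking residual updates of
    the real GRU loop (which derives updates from learned features)."""
    d = d0.copy()
    for _ in range(n_iters):
        delta = step * (target - d)   # stand-in for the predicted residual
        d = d + delta
    return d

d0 = np.zeros((1, 1, 4, 4))                 # initial disparity at 1/4 resolution
target = np.full((1, 1, 4, 4), 20.0)        # "true" disparity in this toy setup
d_final = refine(d0, target, n_iters=32)    # 32 iterations, as at inference
print(np.abs(d_final - target).max())       # residual shrinks geometrically
```

In the real model the converged 1/4-resolution map is then upsampled 4× with learned convex-combination weights (and the disparity values scaled accordingly) to reach [B, 1, H, W].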
4 — Training Data Strategy
FoundationStereo Dataset (FSD): 1 Million Synthetic Pairs
The key to zero-shot generalization is diverse, high-quality training data. NVIDIA generated 1 million stereo pairs using Omniverse with RTX path tracing (32-128 samples per pixel for photorealism). The dataset covers:
- Structured indoor and outdoor scenes (rooms, streets, warehouses)
- Navigation, driving, and manipulation scenarios
- Randomized "flying object" scenes for extreme variety
- Varying camera intrinsics and baselines
- Challenging conditions: reflections, transparency, low texture, heavy occlusion
5 — Results
Zero-shot performance (no fine-tuning on target domains):
| Benchmark | Metric | FoundationStereo | Domain |
|---|---|---|---|
| Middlebury | BP-2 | 1.2% | Indoor close-range |
| ETH3D | BP-1 | 1.4% | Indoor/outdoor |
| KITTI-12 | D1 | 1.9% | Driving |
| KITTI-15 | D1 | 2.2% | Driving |
| SceneFlow | EPE | 0.33 (previous best: 0.41) | Synthetic |
6 — Fast-FoundationStereo (CVPR 2026)
Wen et al. — NVIDIA — Making FoundationStereo Real-Time
FoundationStereo achieved state-of-the-art accuracy but is too slow for robotics and real-time applications. Fast-FoundationStereo achieves a ~10× speedup while closely matching accuracy through three complementary techniques: knowledge distillation, Neural Architecture Search (NAS), and structured pruning.
1 Knowledge Distillation: Teacher-Student Training
FoundationStereo as Teacher
The full FoundationStereo model serves as a teacher. A smaller, faster student network is trained to match the teacher's outputs — not just the final disparity map, but also intermediate features and cost volume representations. This transfers the teacher's generalization ability to the student without requiring the expensive ViT encoder at inference.
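A toy version of such a distillation objective; the loss terms and weights below are generic stand-ins, not the paper's exact formulation:

```python
import numpy as np

def distill_loss(student_disp, teacher_disp, student_feat, teacher_feat,
                 w_disp=1.0, w_feat=0.1):
    """Toy teacher-student loss: match the teacher's disparity map (L1)
    and an intermediate feature map (L2). Weights are assumptions."""
    disp_term = np.abs(student_disp - teacher_disp).mean()   # output matching
    feat_term = ((student_feat - teacher_feat) ** 2).mean()  # feature matching
    return w_disp * disp_term + w_feat * feat_term

t_disp = np.full((1, 1, 8, 8), 5.0)     # teacher's disparity prediction
s_disp = t_disp + 1.0                   # student is off by 1 px everywhere
t_feat = np.zeros((1, 16, 8, 8))        # teacher intermediate features
s_feat = np.ones((1, 16, 8, 8))
print(distill_loss(s_disp, t_disp, s_feat, t_feat))  # 1.0 + 0.1 = 1.1
```

Because the teacher's outputs are available for any input image, distillation can also run on unlabeled real data, not just the synthetic set with ground truth.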
2 Neural Architecture Search (NAS)
Automated Model Compression
Rather than manually designing the student architecture, NAS automatically discovers a configuration that balances speed and accuracy. The search space includes backbone width/depth, cost volume resolution, number of GRU iterations, and channel dimensions throughout the network. This surfaces architectures a human designer would miss — sometimes counterintuitive choices, such as wider-but-shallower networks outperforming narrow-but-deep ones at the same latency.
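A toy random-search loop over a miniature version of such a search space; the latency and accuracy functions are made-up stand-ins for on-device profiling and validation, and the search-space values are illustrative:

```python
import random

random.seed(0)
space = {"width": [32, 48, 64], "depth": [2, 4, 6], "gru_iters": [4, 8, 12]}

def latency_ms(cfg):   # stand-in for profiling on target hardware
    return 0.05 * cfg["width"] + 0.4 * cfg["depth"] + 0.3 * cfg["gru_iters"]

def accuracy(cfg):     # stand-in for validation accuracy
    return cfg["width"] * cfg["depth"] * cfg["gru_iters"] ** 0.5

budget_ms = 8.0
best = None
for _ in range(200):   # sample candidates, keep the best one under budget
    cfg = {k: random.choice(v) for k, v in space.items()}
    if latency_ms(cfg) <= budget_ms and (best is None or accuracy(cfg) > accuracy(best)):
        best = cfg
print(best)  # highest-scoring config within the latency budget
```

Real NAS systems use far larger spaces and smarter search strategies (evolutionary search, weight-sharing supernets), but the constrained-optimization framing is the same.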
3 Structured Pruning
Remove Redundant Channels
Structured pruning removes entire channels (filters) from convolutional and linear layers based on importance scores. Unlike unstructured pruning (which creates sparse weights that don't speed up on GPUs), structured pruning gives real speedups because it reduces matrix dimensions. The pruned model is then fine-tuned to recover accuracy.
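A minimal sketch of L1-norm channel pruning for one conv layer; the importance score and keep ratio here are common generic choices, not necessarily the paper's:

```python
import numpy as np

def prune_conv_channels(weight, keep_ratio):
    """Prune output channels of a conv weight [C_out, C_in, kH, kW] by L1 norm.

    Returns the pruned weight and the kept channel indices; the next layer's
    input channels must be sliced with the same indices to stay consistent."""
    importance = np.abs(weight).sum(axis=(1, 2, 3))      # L1 norm per filter
    n_keep = max(1, int(weight.shape[0] * keep_ratio))
    keep = np.sort(np.argsort(importance)[-n_keep:])     # top-n_keep filters
    return weight[keep], keep

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32, 3, 3))                  # 64 filters, 32 in-channels
w_pruned, kept = prune_conv_channels(w, keep_ratio=0.5)
print(w_pruned.shape)  # (32, 32, 3, 3): real dimension reduction, real speedup
```

Because whole filters are removed, the resulting tensors are smaller dense matrices, which is what yields actual GPU speedups (unlike sparse masks from unstructured pruning).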
FoundationStereo vs. Fast-FoundationStereo
| Feature | FoundationStereo (CVPR 2025) | Fast-FoundationStereo (CVPR 2026) |
|---|---|---|
| Encoder | EdgeNeXt-S + frozen DepthAnythingV2 ViT | Compressed encoder (NAS-optimized, no ViT at inference) |
| Cost Volume | Full hybrid (correlation + concat) | Lightweight cost volume (distilled) |
| Cost Filtering | APC + Disparity Transformer | Pruned filtering network |
| GRU Iterations | 32 | Fewer iterations (NAS-discovered) |
| Key Techniques | Side-tuning, AHCF, synthetic data | Knowledge distillation, NAS, structured pruning |
| Speed | Baseline | ~10× faster |
| Accuracy | SOTA (1st on Middlebury/ETH3D) | ~98% of FoundationStereo accuracy |
| Target Use Case | Offline processing, benchmarks | Real-time robotics, AR, autonomous driving |
7 — References & Further Reading
- FoundationStereo: Zero-Shot Stereo Matching — Wen et al., NVIDIA, CVPR 2025
- Official GitHub Repository
- Project Page
- Fast-FoundationStereo (CVPR 2026)
- RAFT-Stereo — Lipson et al., 2021 (iterative refinement baseline)