π0: A Vision-Language-Action Flow Model
1 — What the Robot Actually Sees
Before understanding the model, you need to understand the inputs. π0 controls real physical robots — it sees through cameras, reads language instructions, and feels its own joint positions. Let's make each input concrete.
Camera Images (2–3 per robot)
[n, 224, 224, 3] RGB
Each robot has 2–3 cameras providing different viewpoints: typically an overhead/third-person view showing the workspace and a wrist-mounted camera for close-up manipulation detail. Some bimanual robots add a third camera. Images are 224×224 RGB, processed by a pre-trained SigLIP vision encoder.
Language Instruction
tokenized text → embeddings
Natural language commands like "fold the shirt," "pick up the egg and place it in the carton," or "clear the table." These can be high-level task descriptions or fine-grained segment labels (~2-second sub-instructions). Tokenized by the PaliGemma tokenizer and processed through the VLM backbone.
Proprioceptive State qt
[18] joint angles (zero-padded)
The robot's own joint angles, its sense of where its arms are. Standardized to 18 dimensions across all 7 robot embodiments via zero-padding. A 7-DoF single arm uses 8 values (7 joints + gripper); a bimanual 14-DoF system uses 16 values; remaining dimensions are zeros.
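A minimal sketch of this zero-padding scheme; the helper name is made up, and only the 18-dim maximum and the example dimension counts come from the text above:

```python
import numpy as np

MAX_STATE_DIM = 18  # shared state width across all embodiments

def pad_state(q: np.ndarray, max_dim: int = MAX_STATE_DIM) -> np.ndarray:
    """Zero-pad a robot-specific joint/gripper vector to the shared width."""
    padded = np.zeros(max_dim, dtype=np.float32)
    padded[: q.shape[0]] = q
    return padded

# 7-DoF single arm: 7 joint angles + 1 gripper value = 8 dims used, 10 zeros
q_single = np.random.randn(8).astype(np.float32)
print(pad_state(q_single).shape)    # (18,)

# bimanual 14-DoF system: 14 joints + 2 grippers = 16 dims used, 2 zeros
q_bimanual = np.random.randn(16).astype(np.float32)
print(pad_state(q_bimanual).shape)  # (18,)
```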
2 — What the Model Outputs
π0 doesn't output a single action — it outputs an entire action chunk of H=50 future steps. This is critical for smooth, coordinated motion.
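A rough sketch of how an action chunk might be consumed at control time. The policy call is a stand-in, and the control rate and fully open-loop execution are illustrative assumptions, not the paper's exact controller:

```python
import numpy as np

H, ACTION_DIM = 50, 18           # chunk length and padded action width
CONTROL_HZ = 50                  # assumed control rate

def dummy_policy(observation) -> np.ndarray:
    """Stand-in for pi0: returns one chunk of H future actions."""
    return np.zeros((H, ACTION_DIM), dtype=np.float32)

observation = None               # images + language + state in the real system
chunk = dummy_policy(observation)        # [50, 18]
for t in range(H):                       # step through the chunk open-loop...
    action = chunk[t]                    # ...one action per 1/CONTROL_HZ seconds
    # send `action` to the robot controller here
```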
3 — Architecture Overview
π0 combines a pre-trained Vision-Language Model (PaliGemma, 3B params) with a dedicated Action Expert module (300M params). The VLM understands what it sees and reads; the Action Expert generates smooth motor commands via flow matching.
4 — The VLM Backbone (PaliGemma)
π0 builds on PaliGemma, a 3B-parameter vision-language model from Google. This gives π0 a rich understanding of what it sees and reads — pre-trained on billions of image-text pairs from the internet before ever touching a robot.
SigLIP Vision Encoder (per camera)
[B, 3, 224, 224] → [B, 256, 1152]
Each 224×224 image is split into a 16×16 grid of 14×14-pixel patches (256 patches per image) and encoded by the frozen SigLIP-So400m ViT. For n cameras the outputs are concatenated into a [B, n·256, 1152] token sequence (typically 512–768 tokens total). A learned projection maps these to the 2048-dim width of the VLM backbone: [B, n·256, 2048].
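The patch arithmetic and projection spelled out as a sketch; the three-camera count is taken from the typical setup above, and the random matrix is only a stand-in for the learned projection:

```python
import numpy as np

IMG, PATCH = 224, 14
n_patches = (IMG // PATCH) ** 2          # 16 x 16 grid -> 256 patches of 14x14 px
print(n_patches)                         # 256

B, n_cams = 1, 3
siglip_tokens = np.random.randn(B, n_cams * n_patches, 1152)  # SigLIP output width

W_proj = np.random.randn(1152, 2048) * 0.02                   # stand-in for the learned projection
vlm_tokens = siglip_tokens @ W_proj
print(vlm_tokens.shape)                  # (1, 768, 2048) -> matches [B, n·256, 2048]
```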
Gemma Language Model
[B, L, 2048] → [B, L, 2048]
PaliGemma's 2B Gemma backbone (width = 2048, MLP dim = 8192, 18 layers) jointly processes the projected image tokens and text tokens. Sequence length L is typically ~600–800 (n·256 image tokens + language tokens). Attention is bidirectional within this block (see §5). By starting from a pre-trained VLM, π0 inherits web-scale object and spatial knowledge before seeing a single robot trajectory.
5 — The Action Expert
The Action Expert is π0's secret weapon — a separate 300M-parameter module that lives inside the transformer and is dedicated exclusively to generating smooth, continuous robot actions.
Why a Separate Expert?
300M params, width = 1024, 50 action tokens
Previous VLA models like RT-2 and OpenVLA discretized actions into text tokens (e.g., "move left 0.3"). This is lossy and slow: it forces a language model to output numbers as words. π0's Action Expert operates in continuous space via flow matching, producing precise floating-point joint commands at much higher bandwidth. It attends to the VLM's image and language tokens through the shared attention layers, but its own hidden states use a narrower 1024-dim width, giving it dedicated capacity for motor control without bloating the whole model.
Token Shapes Inside the Transformer
image + language + state + action tokens
The fused token stream at each transformer layer has roughly this structure:
- Image tokens: [B, n·256, 2048] — from SigLIP, ~512–768 total
- Language tokens: [B, L_lang, 2048] — up to ~100 tokens
- State token: [B, 1, 1024] — one token encoding qt (18 dims projected up)
- Action tokens: [B, 50, 1024] — one per future step in the chunk, plus sinusoidal flow-time embedding φ(τ)
State and action tokens live in the 1024-dim Action Expert subspace; image/language tokens live in the 2048-dim VLM subspace. The blockwise causal mask (below) controls which tokens see which.
Blockwise Causal Attention
The transformer uses three attention blocks with a specific causal structure: (1) the image and language tokens, (2) the single state token, and (3) the 50 action tokens. Attention is full (bidirectional) within a block, and each block can also attend to every earlier block, but earlier blocks cannot attend to later ones. The image/language prefix therefore never sees the state or the noisy actions, which lets its keys and values be computed once and reused across the flow-matching integration steps.
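A minimal sketch of that blockwise causal mask; the three-camera and 20-language-token counts are illustrative assumptions, and `True` marks allowed attention:

```python
import numpy as np

def blockwise_causal_mask(block_sizes):
    """Full attention within a block; each block also attends to all earlier blocks."""
    total = sum(block_sizes)
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for size in block_sizes:
        end = start + size
        mask[start:end, :end] = True    # this block sees itself + everything before it
        start = end
    return mask

n_img, n_lang = 3 * 256, 20             # image tokens (3 cameras) + language tokens
mask = blockwise_causal_mask([n_img + n_lang, 1, 50])   # prefix | state | actions
print(mask.shape)                        # (839, 839)
print(mask[0, -1], mask[-1, 0])          # False, True: prefix can't see actions; actions see prefix
```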
Action Expert Dimensions
| Component | VLM Backbone | Action Expert |
|---|---|---|
| Parameters | 3B | 300M |
| Hidden width | 2048 | 1024 |
| MLP dim | 8192 | 4096 |
| Tokens handled | image + language (~600–800) | 1 state + 50 action (51) |
| Token shape | [B, L_vlm, 2048] | [B, 51, 1024] |
| Processes | Images + language | State + noisy actions |
| Initialized from | PaliGemma (internet) | Random |
6 — Flow Matching: From Noise to Actions
The core generative mechanism in π0 is flow matching — a modern alternative to diffusion that learns to transform Gaussian noise into coherent action sequences via a continuous velocity field.
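A minimal sketch of what that noise-to-actions transformation looks like at inference time, assuming the convention used in the loss below (τ = 1 is pure noise, τ = 0 is a clean chunk), forward Euler integration, and the 10 steps quoted in the timing table (§9). The velocity network is a placeholder:

```python
import numpy as np

H, ACTION_DIM, STEPS = 50, 18, 10        # chunk length, padded action width, Euler steps

def velocity_net(noisy_actions, tau, observation):
    """Placeholder for v_theta(A_t^tau, o_t); the real model conditions on
    images, language and proprioceptive state through the transformer."""
    return np.zeros_like(noisy_actions)

def sample_action_chunk(observation):
    actions = np.random.randn(H, ACTION_DIM).astype(np.float32)  # pure noise at tau = 1
    tau, delta = 1.0, -1.0 / STEPS
    for _ in range(STEPS):
        actions = actions + delta * velocity_net(actions, tau, observation)
        tau += delta                                             # walk tau from 1 down to 0
    return actions                                               # [50, 18] action chunk

print(sample_action_chunk(observation=None).shape)               # (50, 18)
```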
Training Loss: Conditional Flow Matching
tensors: A_t, A_t^τ, v_θ, u all [B, 50, 18]
L(θ) = E[ || v_θ(A_t^τ, o_t) − u(A_t^τ | A_t) ||² ]
During training, we know the ground-truth clean actions A_t ∈ [B, 50, 18] from demonstrations. We sample a flow time τ ∈ (0, 1) (the paper draws it from a beta distribution that emphasizes noisier timesteps) and Gaussian noise ε ∈ [B, 50, 18], then form the noisy input A_t^τ = τ·ε + (1−τ)·A_t, so τ = 1 is pure noise and τ = 0 is the clean chunk. The model predicts the velocity field v_θ ∈ [B, 50, 18], and the MSE loss matches it against the target direction u = ε − A_t; both have the same shape, so the loss is a plain elementwise squared error.
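A sketch of one training step under the definitions above. The velocity network is a stand-in for the transformer + Action Expert, and the uniform τ draw is a simplification:

```python
import numpy as np

B, H, ACTION_DIM = 32, 50, 18

def velocity_net(noisy_actions, tau, observation):
    """Placeholder for v_theta: the real model is the transformer + action expert."""
    return np.zeros_like(noisy_actions)

def flow_matching_loss(clean_actions, observation):
    tau = np.random.rand(B, 1, 1).astype(np.float32)              # U(0,1); the paper skews toward noisier tau
    eps = np.random.randn(B, H, ACTION_DIM).astype(np.float32)    # Gaussian noise
    noisy = tau * eps + (1.0 - tau) * clean_actions               # A_t^tau, pure noise at tau = 1
    target = eps - clean_actions                                  # u = eps - A_t
    pred = velocity_net(noisy, tau, observation)                  # v_theta, [B, 50, 18]
    return np.mean((pred - target) ** 2)                          # plain elementwise MSE

A_t = np.random.randn(B, H, ACTION_DIM).astype(np.float32)
print(flow_matching_loss(A_t, observation=None))
```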
Timestep Embedding
Each noisy action token is embedded with both its own value and the flow timestep τ: the noisy action vector is projected into the expert's 1024-dim width and combined with the sinusoidal flow-time embedding φ(τ), so every action token knows how far along the noise-to-data path the current sample sits.
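One common way to realize a sinusoidal embedding φ(τ), sketched below; the embedding dimension, frequency range, and the simple additive fusion with the projected action are illustrative assumptions rather than the model's exact learned fusion:

```python
import numpy as np

def phi(tau: float, dim: int = 1024, max_period: float = 1e4) -> np.ndarray:
    """Sinusoidal embedding of the flow time tau in [0, 1]."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)  # geometric frequency ladder
    angles = tau * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])       # [dim]

W_action = np.random.randn(18, 1024) * 0.02   # stand-in for the learned action projection
noisy_action = np.random.randn(18)            # one noisy action step, padded to 18 dims
token = noisy_action @ W_action + phi(0.3)    # action token carries its value and tau
print(token.shape)                             # (1024,)
```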
7 — Cross-Embodiment: 7 Robots, 1 Model
π0 trains a single model across 7 different robot configurations simultaneously — from single-arm table robots to mobile bimanual platforms.
| Robot | DoF | Cameras | Action Dim | Tasks |
|---|---|---|---|---|
| UR5e (single) | 7 | 2 | 8 | Table tasks |
| Franka (single) | 8 | 2 | 8 | Dexterous manipulation |
| Bimanual UR5e | 14 | 3 | 16 | Folding, assembly |
| Bimanual ALOHA (Trossen) | 14 | 3 | 14 | Dual-arm tasks |
| Bimanual ARX/AgileX | 14 | 3 | 14 | Dual-arm tasks |
| Mobile Trossen/ARX | 14 | 3 | 16 | Mobile manipulation |
| Mobile Fibocom | 14 | 3 | 17 | Holonomic mobile |
8 — Pre-training & Post-training
π0 follows the LLM playbook: massive pre-training on diverse data, then task-specific fine-tuning (which they call "post-training").
Pre-training
903M timesteps, 700K gradient steps
Data: ~10,000 hours of robot demonstrations across all 7 embodiments. 106M timesteps from single-arm robots, 797M from dual-arm robots. Plus 9.1% open-source data (OXE, Bridge v2, DROID).
Task weighting: Data from each task-robot combination is weighted by n^0.43 (n = number of samples) to prevent over-representation of the most common tasks; see the sketch after these notes.
Language labels: Both high-level task names ("fold the shirt") and fine-grained segment annotations (~2-second sub-instructions like "grasp the right sleeve").
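A tiny sketch of that n^0.43 weighting; the task names and sample counts are made up, and only the exponent comes from the text above:

```python
counts = {"fold_shirt": 12000, "bus_table": 4000, "bag_groceries": 800}  # hypothetical sample counts
weights = {task: n ** 0.43 for task, n in counts.items()}
total = sum(weights.values())
probs = {task: w / total for task, w in weights.items()}
print(probs)  # common tasks are down-weighted relative to their raw share of the data
```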
Post-training (Fine-tuning)
5–100+ hours per task
Specializes the pre-trained model to specific downstream tasks using curated, high-quality demonstrations. The key finding: pre-training provides recovery behaviors (what to do when things go wrong) that transfer even to novel tasks, while post-training data teaches efficient task-specific strategies.
9 — Results
| Task | π0 | OpenVLA (7B) | Octo (93M) |
|---|---|---|---|
| Shirt folding | ~95% | ~40% | <10% |
| Table bussing (easy) | ~90% | ~50% | ~20% |
| Table bussing (hard) | ~70% | ~30% | <10% |
| Grocery bagging | ~85% | ~45% | <10% |
| Toast extraction | ~80% | ~35% | <10% |
Inference Speed
| Stage | Time |
|---|---|
| Image encoding (SigLIP) | 14 ms |
| Observation processing (VLM) | 32 ms |
| Flow matching (10 steps) | 27 ms |
| Total on-board (RTX 4090) | 73 ms |
10 — References
Black, K., et al. (2024). π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv:2410.24164.
Brohan, A., et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. arXiv:2307.15818.
Kim, M.J., et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model. arXiv:2406.09246.
Lipman, Y., et al. (2023). Flow Matching for Generative Modeling. ICLR 2023.
Chi, C., et al. (2023). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. RSS 2023.
Beyer, L., et al. (2024). PaliGemma: A versatile 3B VLM for transfer. arXiv:2407.07726.