π0: A Vision-Language-Action Flow Model

General Robot Control via Flow Matching
Physical Intelligence · 2024 · VLA · Flow Matching · Robot Foundation Model

1 — What the Robot Actually Sees

Before understanding the model, you need to understand the inputs. π0 controls real physical robots — it sees through cameras, reads language instructions, and feels its own joint positions. Let's make each input concrete.

[Figure: what π0 receives at every timestep. Two to three camera images (overhead and wrist, 224×224 RGB), a language instruction ("Pick up the yellow ball and place it in the bowl") tokenized into a text sequence, and an 18-dim proprioceptive state qt (J1: +0.32 rad, J2: -1.47 rad, J3: +0.85 rad, ...). These are combined as ot = [I1, I2, ..., In, lt, qt] and fed to the π0 model.]

Camera Images (2–3 per robot)

[n, 224, 224, 3] RGB

Each robot has 2–3 cameras providing different viewpoints: typically an overhead/third-person view showing the workspace and a wrist-mounted camera for close-up manipulation detail. Some bimanual robots add a third camera. Images are 224×224 RGB, processed by a pre-trained SigLIP vision encoder.

Language Instruction

tokenized text → embeddings

Natural language commands like "fold the shirt," "pick up the egg and place it in the carton," or "clear the table." These can be high-level task descriptions or fine-grained segment labels (~2-second sub-instructions). Tokenized by the PaliGemma tokenizer and processed through the VLM backbone.

Proprioceptive State qt

[18] joint angles (zero-padded)

The robot's own joint angles — its sense of where its arms are. Standardized to 18 dimensions across all 7 robot embodiments via zero-padding. A 7-DoF single arm uses 8 values (7 joints + gripper); a bimanual 14-DoF system uses 16 values; remaining dimensions are zeros.
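As a concrete sketch of that standardization (assuming a plain numpy representation; `pad_state` is an illustrative helper, not the paper's code):

```python
import numpy as np

MAX_STATE_DIM = 18  # unified state/action dimension across all embodiments

def pad_state(q: np.ndarray) -> np.ndarray:
    """Zero-pad a robot's joint state into the shared 18-dim space."""
    padded = np.zeros(MAX_STATE_DIM, dtype=np.float32)
    padded[: q.shape[0]] = q
    return padded

# A 7-DoF single arm: 7 joint angles + 1 gripper value = 8 active dims.
q_single = np.array([0.32, -1.47, 0.85, 0.10, -0.55, 1.20, 0.00, 0.73])
print(pad_state(q_single))  # 8 real values followed by 10 zeros
```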

2 — What the Model Outputs

π0 doesn't output a single action — it outputs an entire action chunk of H=50 future steps. This is critical for smooth, coordinated motion.

[Figure: action chunk output, 50 steps into the future. The model predicts at, at+1, ..., at+49; the first 16 are executed (at 20 Hz, 0.8 s) and the remaining 34 are discarded before re-planning with a new observation. Each action at+i contains Δjoint1, Δjoint2, ..., Δjoint7 plus a gripper command, all continuous values produced via flow matching.]
Open-loop chunking: The robot executes 16 of the 50 predicted actions (0.8 seconds at 20Hz), then re-plans from scratch with a new observation. This provides reactive closed-loop behavior at the re-planning frequency while maintaining smooth, coordinated motion within each chunk. Mobile robots at 50Hz execute 25 steps (0.5s) before re-planning.
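The control loop this implies is short. A sketch, assuming hypothetical `policy` and `robot` interfaces (not part of any released API):

```python
# Receding-horizon execution: predict 50 actions, execute 16, re-plan.
CHUNK_LEN = 50      # actions predicted per inference call
EXEC_STEPS = 16     # actions actually executed (0.8 s at 20 Hz);
                    # mobile robots at 50 Hz would use 25 steps instead

def control_loop(policy, robot):
    while not robot.task_done():
        obs = robot.get_observation()        # images + language + q_t
        chunk = policy.predict_chunk(obs)    # shape [50, action_dim]
        for action in chunk[:EXEC_STEPS]:    # execute the first 16 steps
            robot.apply(action)              # 20 Hz control
        # remaining 34 actions discarded; loop re-plans from a fresh observation
```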

3 — Architecture Overview

π0 combines a pre-trained Vision-Language Model (PaliGemma, 3B params) with a dedicated Action Expert module (300M params). The VLM understands what it sees and reads; the Action Expert generates smooth motor commands via flow matching.

[Figure: π0 full architecture. Images I1..n, language lt, state qt, and noisy actions Atτ (50 steps × action_dim) feed a single transformer with blockwise causal attention. Images and language pass through the frozen SigLIP encoder into the 3B-param PaliGemma VLM backbone (bidirectional attention); state and the 50 action tokens, together with the flow-timestep embedding φ(τ), pass through the 300M-param Action Expert (width 1024, full attention to all inputs). The expert outputs the velocity field vθ(Atτ, ot), which Euler integration over 10 steps (Aτ+δ = Aτ + δ·vθ, δ=0.1) turns into clean 50-step actions At.]

4 — The VLM Backbone (PaliGemma)

π0 builds on PaliGemma, a 3B-parameter vision-language model from Google. This gives π0 a rich understanding of what it sees and reads — pre-trained on billions of image-text pairs from the internet before ever touching a robot.

SigLIP Vision Encoder (per camera)

[B, 3, 224, 224] → [B, 256, 1152]

Each 224×224 image is split into a 16×16 grid of 14×14-pixel patches (256 tokens) and encoded by the frozen SigLIP-So400m ViT. For n cameras the outputs are concatenated into a [B, n·256, 1152] token sequence (typically 512–768 tokens total). A learned projection maps these to the 2048-dim width of the VLM backbone: [B, n·256, 2048].
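The patch arithmetic is easy to sanity-check (a trivial snippet, nothing π0-specific):

```python
# Token-count arithmetic for SigLIP patching: 224 / 14 = 16 patches per side.
image_size, patch_size = 224, 14
tokens_per_image = (image_size // patch_size) ** 2   # 16 × 16 = 256
for n_cams in (2, 3):
    print(n_cams, n_cams * tokens_per_image)          # 512 and 768 total tokens
```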

Gemma Language Model

[B, L, 2048] → [B, L, 2048]

PaliGemma's 2B Gemma backbone (width = 2048, MLP dim = 8192, 18 layers) jointly processes the projected image tokens and the text tokens. Sequence length L is typically ~600–800 (n·256 image tokens + language tokens). Attention is bidirectional within this block (see §5).

Why start from a VLM? Robot demonstration data is scarce (~10,000 hours). Internet data is effectively infinite. By initializing from PaliGemma, π0 already knows what a "shirt" looks like, understands "left of," and can parse complex language instructions — all before seeing a single robot trajectory. This is the same pre-training/fine-tuning paradigm that made LLMs successful.

5 — The Action Expert

The Action Expert is π0's secret weapon — a separate 300M-parameter module that lives inside the transformer and is dedicated exclusively to generating smooth, continuous robot actions.

Why a Separate Expert?

300M params, width = 1024, 50 action tokens

Previous VLA models like RT-2 and OpenVLA discretized actions into text tokens (e.g., "move left 0.3"). This is lossy and slow — you're forcing a language model to output numbers as words. π0's Action Expert operates in continuous space via flow matching, producing precise floating-point joint commands at much higher bandwidth. Its tokens attend to the VLM's image and language tokens through the shared attention operation, but its own residual stream and MLPs run at a narrower 1024-dim width, giving it dedicated capacity for motor control without bloating the whole model.

Token Shapes Inside the Transformer

image + language + state + action tokens

The fused token stream at each transformer layer has roughly this structure:

  • Image tokens: [B, n·256, 2048] — from SigLIP, ~512–768 total
  • Language tokens: [B, L_lang, 2048] — up to ~100 tokens
  • State token: [B, 1, 1024] — one token encoding qt (18 dims projected up)
  • Action tokens: [B, 50, 1024] — one per future step in the chunk, plus sinusoidal flow-time embedding φ(τ)

State and action tokens live in the 1024-dim Action Expert subspace; image/language tokens live in the 2048-dim VLM subspace. The blockwise causal mask (below) controls which tokens see which.

Blockwise Causal Attention

The transformer uses three attention blocks with a specific causal structure:

[Figure: blockwise causal attention mask over three blocks. Block 1 (images+language) attends bidirectionally within itself but is blocked from state and actions. Block 2 (state) attends to block 1 and itself, but not to actions. Block 3 (actions) attends to everything: full attention.]
Why blockwise causal? The VLM tokens (images + language) never attend to action tokens — this lets us cache the VLM computation across flow matching steps. The action tokens attend to everything (they need full context). The state block bridges the two: it can see the VLM outputs but not the actions, maintaining a clean information flow.
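A minimal sketch of that mask as a boolean matrix (True means the query row may attend to the key column), using the approximate token counts from above:

```python
import numpy as np

def blockwise_causal_mask(n_prefix: int, n_state: int, n_action: int) -> np.ndarray:
    """Build the three-block attention mask.

    Block order along both axes: [images+language | state | actions].
    """
    n = n_prefix + n_state + n_action
    mask = np.zeros((n, n), dtype=bool)
    p, s = n_prefix, n_prefix + n_state
    mask[:p, :p] = True   # images+language: bidirectional, blind to the rest
    mask[p:s, :s] = True  # state: sees VLM tokens and itself, not actions
    mask[s:, :] = True    # actions: full attention to all inputs
    return mask

m = blockwise_causal_mask(n_prefix=768, n_state=1, n_action=50)
print(m.shape)  # (819, 819)
```

Because the top-left block never attends rightward, the VLM's keys and values can be computed once per observation and reused across all 10 flow matching steps.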

Action Expert Dimensions

Component        | VLM Backbone                 | Action Expert
Parameters       | 3B                           | 300M
Hidden width     | 2048                         | 1024
MLP dim          | 8192                         | 4096
Tokens handled   | image + language (~600–800)  | 1 state + 50 action (51)
Token shape      | [B, L_vlm, 2048]             | [B, 51, 1024]
Processes        | images + language            | state + noisy actions
Initialized from | PaliGemma (internet)         | random

6 — Flow Matching: From Noise to Actions

The core generative mechanism in π0 is flow matching — a modern alternative to diffusion that learns to transform Gaussian noise into coherent action sequences via a continuous velocity field.

[Figure: flow matching, noise to actions in 10 steps. Starting from pure noise A0 ~ N(0, I) at τ = 0 (50 × action_dim), vθ predicts a velocity at each τ (0.1, 0.2, 0.3, ..., 0.9) until clean actions (50 joint commands) emerge at τ = 1.0. At each step: Aτ+0.1 = Aτ + 0.1 · vθ(Aτ, observation), i.e. Euler integration. The model runs 10 forward passes through the Action Expert, each refining the action sequence.]

Training Loss: Conditional Flow Matching

tensors: At, Atτ, v, u all [B, 50, 18]

L(θ) = E[ || vθ(Atτ, ot) − u(Atτ | At) ||² ]

During training, we know the ground-truth clean actions At ∈ [B, 50, 18] from demonstrations. We sample flow time τ ∼ U(0, 1) and Gaussian noise ε ∈ [B, 50, 18], then form the noisy input Atτ = τ At + (1−τ) ε. The model predicts the velocity field vθ ∈ [B, 50, 18], and the MSE loss matches it against the target velocity u = At − ε, the direction that carries the noisy sample toward the clean actions under this τ convention. Both tensors have the same shape, so the loss is a plain elementwise squared error.
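As a concrete illustration, a numpy sketch of one training step under the conventions above; `v_theta` is a stand-in for the Action Expert (the real model also conditions on the full observation ot):

```python
import numpy as np

rng = np.random.default_rng(0)
B, H, D = 32, 50, 18                       # batch, chunk length, padded action dim

def v_theta(A_tau, tau):
    """Stand-in for the Action Expert's velocity prediction."""
    return np.zeros_like(A_tau)            # a real model would condition on o_t

A   = rng.normal(size=(B, H, D))           # ground-truth action chunks A_t
eps = rng.normal(size=(B, H, D))           # Gaussian noise ε
tau = rng.uniform(size=(B, 1, 1))          # flow time τ, broadcast over (H, D)

A_tau = tau * A + (1.0 - tau) * eps        # noisy actions A_t^τ
u     = A - eps                            # target velocity: noise toward data
loss  = np.mean((v_theta(A_tau, tau) - u) ** 2)  # conditional flow matching MSE
print(loss)
```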

Flow matching vs diffusion: Both transform noise into structured outputs. Diffusion uses a discrete Markov chain with many steps (50–1000). Flow matching defines a continuous ODE that can be integrated with fewer steps (just 10 in π0). Inference is 10 forward passes through the Action Expert — total inference time is ~73ms on an RTX 4090.
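Inference is then a 10-step Euler loop. A minimal sketch, assuming a `v_theta(A, obs, tau)` callable that wraps the Action Expert:

```python
import numpy as np

def sample_actions(v_theta, obs, H=50, D=18, steps=10, seed=0):
    """Integrate the learned ODE from noise (τ=0) to clean actions (τ=1)."""
    A = np.random.default_rng(seed).normal(size=(H, D))  # A^0 ~ N(0, I)
    delta = 1.0 / steps                                  # δ = 0.1
    for i in range(steps):
        tau = i * delta
        A = A + delta * v_theta(A, obs, tau)             # A^{τ+δ} = A^τ + δ·v_θ
    return A                                             # 50-step action chunk

# Example with a dummy velocity field that pulls every value toward zero:
chunk = sample_actions(lambda A, obs, tau: -A, obs=None)
print(chunk.shape)  # (50, 18)
```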

Timestep Embedding

Each noisy action token is embedded with both its own value and the flow timestep τ:

Input embedding: W3 · swish(W2 · concat(W1 · a'tτ, φ(τ))) — where φ(τ) is a sinusoidal encoding of the flow timestep. This tells the Action Expert how noisy the current actions are, so it can calibrate the appropriate denoising velocity.
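A minimal sketch of that embedding, assuming a 256-dim sinusoidal encoding and randomly initialized projection matrices (the exact widths are internal details not spelled out here):

```python
import numpy as np

def phi(tau: float, dim: int = 256) -> np.ndarray:
    """Sinusoidal encoding of the flow timestep τ ∈ [0, 1]."""
    half = dim // 2
    freqs = np.exp(-np.log(10_000.0) * np.arange(half) / half)
    return np.concatenate([np.sin(tau * freqs), np.cos(tau * freqs)])

def swish(x):
    return x / (1.0 + np.exp(-x))          # x · sigmoid(x)

def embed_action(a_tau, tau, W1, W2, W3):
    """W3 · swish(W2 · concat(W1 · a_tau, φ(τ)))."""
    h = np.concatenate([W1 @ a_tau, phi(tau)])
    return W3 @ swish(W2 @ h)

rng = np.random.default_rng(0)
width = 1024                               # Action Expert width
W1 = rng.normal(size=(width, 18)) * 0.02   # projects one 18-dim noisy action
W2 = rng.normal(size=(width, width + 256)) * 0.02
W3 = rng.normal(size=(width, width)) * 0.02
tok = embed_action(rng.normal(size=18), tau=0.3, W1=W1, W2=W2, W3=W3)
print(tok.shape)  # (1024,) — one action token for the expert
```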

7 — Cross-Embodiment: 7 Robots, 1 Model

π0 trains a single model across 7 different robot configurations simultaneously — from single-arm table robots to mobile bimanual platforms.

[Figure: 7 robot embodiments, 1 unified model. UR5e (7-DoF, 2 cams), Franka (8-DoF, 2 cams), bimanual UR5e (14-DoF, 3 cams), bimanual ALOHA (14-DoF, 3 cams), and mobile bimanual platforms (16–17 DoF, 3 cams), all mapped into a unified 18-dim action/state space by zero-padding robots with fewer joints. Example, 7-DoF single arm: [j1, j2, j3, j4, j5, j6, j7, grip, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], active joints zero-padded to 18.]
Robot                    | DoF | Cameras | Action Dim | Tasks
UR5e (single)            | 7   | 2       | 8          | Table tasks
Franka (single)          | 8   | 2       | 8          | Dexterous manipulation
Bimanual UR5e            | 14  | 3       | 16         | Folding, assembly
Bimanual ALOHA (Trossen) | 14  | 3       | 14         | Dual-arm tasks
Bimanual ARX/AgileX      | 14  | 3       | 14         | Dual-arm tasks
Mobile Trossen/ARX       | 14  | 3       | 16         | Mobile manipulation
Mobile Fibocom           | 14  | 3       | 17         | Holonomic mobile base

8 — Pre-training & Post-training

π0 follows the LLM playbook: massive pre-training on diverse data, then task-specific fine-tuning (which they call "post-training").

Pre-training

903M timesteps, 700K gradient steps

Data: ~10,000 hours of robot demonstrations across all 7 embodiments. 106M timesteps from single-arm, 797M from dual-arm. Plus 9.1% open-source data (OXE, Bridge v2, DROID).

Task weighting: Data from each task-robot combination is weighted by n^0.43, where n is that combination's sample count, to prevent over-representation of common tasks.
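A quick illustration of the effect, with made-up sample counts n per task-robot combination:

```python
import numpy as np

counts = np.array([200_000, 50_000, 8_000, 1_000])  # hypothetical n per combo
weights = counts ** 0.43
probs = weights / weights.sum()
print(probs.round(3))  # ≈ [0.525, 0.289, 0.132, 0.054]
# Naive proportional sampling would give ≈ [0.772, 0.193, 0.031, 0.004];
# the n^0.43 weighting flattens the distribution considerably.
```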

Language labels: Both high-level task names ("fold the shirt") and fine-grained segment annotations (~2-second sub-instructions like "grasp the right sleeve").

Post-training (Fine-tuning)

5–100+ hours per task

Specializes the pre-trained model to specific downstream tasks using curated, high-quality demonstrations. The key finding: pre-training provides recovery behaviors (what to do when things go wrong) that transfer even to novel tasks, while post-training data teaches efficient task-specific strategies.

9 — Results

Task                 | π0   | OpenVLA (7B) | Octo (93M)
Shirt folding        | ~95% | ~40%         | <10%
Table bussing (easy) | ~90% | ~50%         | ~20%
Table bussing (hard) | ~70% | ~30%         | <10%
Grocery bagging      | ~85% | ~45%         | <10%
Toast extraction     | ~80% | ~35%         | <10%
Complex multi-stage tasks (10+ minutes): π0 achieves >50% success on laundry folding from arbitrary crumpled configurations, box assembly with deformable cardboard, egg packing (delicate objects), and mobile manipulation — tasks that require dozens of sequential sub-skills coordinated over minutes.

Inference Speed

Stage                        | Time
Image encoding (SigLIP)      | 14 ms
Observation processing (VLM) | 32 ms
Flow matching (10 steps)     | 27 ms
Total on-board (RTX 4090)    | 73 ms

10 — References

Black, K., et al. (2024). π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv:2410.24164.

Brohan, A., et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. arXiv:2307.15818.

Kim, M.J., et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model. arXiv:2406.09246.

Lipman, Y., et al. (2023). Flow Matching for Generative Modeling. ICLR 2023.

Chi, C., et al. (2024). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. RSS 2024.

Beyer, L., et al. (2024). PaliGemma: A versatile 3B VLM for transfer. arXiv:2407.07726.