SAFE: Multitask Failure Detection for Vision-Language-Action Models
1 — The Problem: Silent Failures in Robot Manipulation
Vision-Language-Action (VLA) models like π0, OpenVLA, and π0-FAST represent a leap forward in general-purpose robot control. Feed them a camera image and a language instruction — “pick up the red cup” — and they output motor commands. But there is a dangerous gap between generating actions and succeeding at the task.
A robot might reach for an object and miss. It might grasp a cup but drop it halfway through a pour. It might push an item off the table instead of lifting it. In each case, the VLA model continues generating actions as if nothing went wrong. There is no built-in error signal. The model does not know it has failed.
The naive solution is to build task-specific failure detectors: train a classifier to detect “dropped cup” failures, another for “missed grasp” failures, and so on. But this approach does not scale. Every new task or environment requires new labeled failure data and a new detector. What we need is a general-purpose failure monitor that works across tasks — including tasks it has never seen before.
2 — The SAFE Approach: Monitoring Internal Representations
SAFE stands on a simple but powerful insight: the VLA model already knows more than it lets on. Its internal hidden states — the intermediate representations computed at each layer during a forward pass — encode rich information about the scene, the task, and critically, whether the execution is going well or poorly.
Think of it this way: when a VLA model processes a camera image showing a gripper that has clearly missed an object, the internal features at certain layers will differ from those computed when the gripper is firmly holding the object. Even though both scenarios produce action outputs, the path through the network is different. SAFE exploits this difference.
The Three-Step Recipe
1. Extract — At each timestep during robot execution, hook into the VLA model and extract hidden state features from one or more intermediate layers.
2. Detect — Feed those features into a lightweight detector head (an MLP or LSTM) that outputs a failure probability score between 0 and 1.
3. Threshold — Use conformal prediction on a calibration set to choose a detection threshold that provides statistical guarantees on the false negative rate (see the sketch below).
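A minimal Python sketch of how these three steps compose at runtime. Every name here (`vla`, `detector`, `threshold`, the `env` interface, `act_with_hidden`) is a hypothetical placeholder standing in for whatever the real deployment provides, not the authors' API:

```python
# Hypothetical monitoring loop; all names are illustrative placeholders.
def run_with_monitor(vla, detector, threshold, env, instruction, max_steps=300):
    obs = env.reset()
    for t in range(max_steps):
        # Step 1, Extract: one VLA forward pass yields both the action and
        # the intermediate hidden states captured by a forward hook.
        action, h_t = vla.act_with_hidden(obs, instruction)

        # Step 2, Detect: the lightweight head maps features to p(fail).
        p_fail = detector(h_t)

        # Step 3, Threshold: compare against the conformally calibrated cutoff.
        if p_fail >= threshold:
            return "failure_detected", t

        obs = env.step(action)
    return "finished", max_steps
```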
3 — Architecture Overview
The full SAFE pipeline works as follows. At every timestep, while the VLA model generates actions for the robot, SAFE extracts hidden state features from an intermediate layer, passes them through a lightweight detector head, and applies a conformally calibrated threshold to make a failure/success decision. The entire monitoring process runs in parallel with action generation — it does not slow down the robot.
4 — Feature Extraction from VLA Internals
Hidden State Extraction
h_t^(l) ∈ R^{d_model}
At each timestep t, the VLA model performs a forward pass: it processes the current camera image, the language instruction, and (optionally) proprioceptive state through its transformer backbone. SAFE hooks into a specific layer l and extracts the hidden state vector h_t^(l).
Which layer to extract from matters. Early layers capture low-level visual features; later layers encode higher-level semantic and task-relevant information. The authors systematically evaluate different layer choices and find that middle-to-late layers tend to produce the most informative features for failure detection.
For models like π0 that use a PaliGemma vision-language backbone, the features come from the language model hidden states after vision-language fusion. For OpenVLA (built on Prismatic + Llama 2), features are extracted from the Llama decoder layers.
Feature Dimensionality
Varies by backbone
Each VLA backbone produces different feature dimensions. π0 uses a 2048-dimensional hidden state from PaliGemma. OpenVLA extracts 4096-dimensional features from Llama 2. π0-FAST similarly provides 2048-dimensional features. The detector heads are designed to handle these varying dimensions — the first linear layer of the MLP or LSTM input projection adapts to the specific backbone.
Crucially, the features are extracted without modifying the VLA model. SAFE is a purely observational monitor. It reads internal states through forward hooks but never changes the model weights, gradients, or inference behavior. The VLA model is completely unaware it is being monitored.
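As a concrete illustration of such an observational hook, here is a runnable PyTorch sketch. The backbone below is a toy transformer stand-in (its 2048-dim width echoes the π0 features); the real monitor would attach the identical hook to the actual backbone's decoder layers:

```python
import torch
import torch.nn as nn

# Toy stand-in for a VLA backbone: a stack of transformer layers. The real
# monitor hooks the π0 / OpenVLA language-model layers in the same way.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=2048, nhead=8, batch_first=True),
    num_layers=4,
)

features = {}

def hook(module, inputs, output):
    # Keep the last token's hidden state as the timestep feature h_t^(l),
    # detached so the monitored model is entirely unaffected.
    features["h_t"] = output[:, -1, :].detach()

l = 2                                              # illustrative layer index
handle = backbone.layers[l].register_forward_hook(hook)

tokens = torch.randn(1, 16, 2048)                  # stand-in fused VL tokens
with torch.no_grad():
    _ = backbone(tokens)                           # normal forward pass
print(features["h_t"].shape)                       # torch.Size([1, 2048])
handle.remove()                                    # detach when done
```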
5 — Detector Heads: MLP vs. LSTM
Once features are extracted, they are fed into a lightweight classifier head. SAFE evaluates two architectures, each with different trade-offs between simplicity and temporal reasoning.
MLP Detector Head
Single-Timestep Classification
1. Takes the feature vector h_t from a single timestep as input.
2. Passes it through two fully-connected layers with ReLU activation and dropout.
3. Outputs a scalar failure probability p(fail | h_t) via a sigmoid activation.
The MLP is fast and simple. It makes an independent prediction at each timestep based solely on the current hidden state. This works surprisingly well because a single frame's hidden state already encodes substantial context about the task state — the VLA's own attention mechanism has already integrated temporal information from prior tokens.
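A minimal PyTorch sketch of such an MLP head, following the two-layer structure described above. The hidden width and dropout rate are illustrative assumptions, not the paper's exact hyperparameters:

```python
import torch.nn as nn

class MLPFailureHead(nn.Module):
    """Single-timestep detector: h_t -> p(fail | h_t). Sizes are assumptions."""
    def __init__(self, feat_dim: int, hidden_dim: int = 16, p_drop: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            # The first layer adapts to the backbone's width (2048 for π0,
            # 4096 for OpenVLA); a small hidden_dim keeps the head in the
            # tens-of-thousands parameter range.
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, h_t):                    # h_t: (batch, feat_dim)
        return self.net(h_t).squeeze(-1)       # (batch,) failure probabilities
```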
LSTM Detector Head
Temporal Sequence Classification
1. Maintains a rolling window of the most recent k feature vectors [h_{t-k+1}, ..., h_t].
2. Processes the sequence through an LSTM layer that captures temporal dynamics and transitions.
3. The final LSTM hidden state is passed through a linear layer + sigmoid to produce p(fail | h_{t-k+1:t}).
The LSTM head excels at detecting failures that unfold over time. Consider a grasp that slowly slips: at any single timestep, the features may look ambiguous, but the trajectory of features over several steps reveals a clear failure pattern. The LSTM captures exactly this kind of temporal signal.
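A matching sketch of the LSTM head, again with assumed sizes (the small recurrent width keeps the parameter count in the lightweight range the comparison table below lists):

```python
import torch
import torch.nn as nn

class LSTMFailureHead(nn.Module):
    """Windowed detector: [h_{t-k+1}, ..., h_t] -> p(fail). Sizes are assumptions."""
    def __init__(self, feat_dim: int, lstm_dim: int = 16):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, lstm_dim, batch_first=True)
        self.out = nn.Linear(lstm_dim, 1)

    def forward(self, h_window):               # h_window: (batch, k, feat_dim)
        _, (h_n, _) = self.lstm(h_window)      # final hidden state
        logit = self.out(h_n[-1])              # (batch, 1)
        return torch.sigmoid(logit).squeeze(-1)
```

At deployment, the rolling window can be kept in a `collections.deque(maxlen=k)` of per-timestep features and stacked into a tensor before each call.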
| Property | MLP Head | LSTM Head |
|---|---|---|
| Input | Single timestep feature h_t | Window of features h_{t-k+1:t} |
| Temporal reasoning | None (relies on VLA's own context) | Explicit via recurrent state |
| Latency | Minimal — single forward pass | Slightly higher — sequential over window |
| Best for | Instantaneous failures | Gradual / temporal failures |
| Parameters | ~10K–50K | ~50K–200K |
6 — Conformal Prediction for Safety Guarantees
Getting a failure probability from the detector head is only half the problem. The harder question is: where do you set the threshold? Set it too high and you miss real failures. Set it too low and you trigger false alarms constantly, making the robot unusable. SAFE uses conformal prediction to solve this rigorously.
How Conformal Prediction Works Here
Conformal prediction is a distribution-free framework for uncertainty quantification. It requires only one assumption: that calibration and test data are exchangeable (roughly, they come from the same distribution and their order does not matter). Given this mild assumption, it provides finite-sample guarantees.
Calibration Procedure
One-time setup
1. Collect calibration data: Run the VLA policy on a set of held-out episodes. Record which episodes succeed and which fail, along with the detector's predicted failure probabilities.
2. Compute nonconformity scores: For each failure episode, compute s_i = 1 - p(fail). A high score means the detector assigned low failure probability to an actual failure — i.e., the detector was “wrong.”
3. Select the threshold: Sort the scores and pick the ⌈(1 - α)(n + 1)⌉-th smallest value as the threshold τ, where α is the desired false negative rate (e.g., 0.1 for 90% detection) and n is the number of calibration failures.
4. Deploy: At test time, flag failure whenever p(fail) ≥ 1 - τ (a code sketch of the calibration step follows below).
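In code, the calibration step reduces to a sorted-quantile computation. A minimal sketch, assuming the detector's p(fail) scores on the calibration failures are already collected:

```python
import numpy as np

def conformal_threshold(p_fail_on_failures, alpha=0.1):
    """Return the cutoff on p(fail) above which failure is flagged.

    p_fail_on_failures: detector scores on the n calibration episodes that
    actually failed. Under exchangeability, flagging p(fail) >= cutoff
    catches at least a (1 - alpha) fraction of future failures.
    """
    scores = 1.0 - np.asarray(p_fail_on_failures)  # nonconformity s_i = 1 - p(fail)
    n = len(scores)
    k = int(np.ceil((1 - alpha) * (n + 1)))        # rank of the conformal quantile
    assert k <= n, "too few calibration failures for this alpha"
    tau = np.sort(scores)[k - 1]                   # k-th smallest score
    return 1.0 - tau

# Usage sketch (with alpha = 0.1 this needs at least 9 calibration failures):
# cutoff = conformal_threshold(calibration_scores, alpha=0.1)
# alarm = p_fail >= cutoff
```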
7 — Training Details and Experimental Setup
VLA Backbones Evaluated
π0
PaliGemma backbone • 2048-dim features
Physical Intelligence's flow-matching VLA model. Uses a PaliGemma vision-language model as the backbone with a flow-matching action head. SAFE extracts features from the PaliGemma language model layers after vision-language fusion has occurred. This model represents the state of the art in generalist robot control.
OpenVLA
Prismatic + Llama 2 backbone • 4096-dim features
An open-source VLA model combining a Prismatic vision encoder with a Llama 2 7B language model. Actions are tokenized and generated autoregressively as text tokens. SAFE extracts features from the Llama 2 decoder layers. OpenVLA provides a fully open-source baseline for the community.
π0-FAST
PaliGemma backbone • 2048-dim features
A variant of π0 that tokenizes actions using a discrete codebook (FAST tokenization) rather than continuous flow matching. This architectural difference means the internal representations may encode task state differently. Testing on π0-FAST validates that SAFE generalizes across action decoding strategies, not just across tasks.
Benchmarks and Tasks
LIBERO Simulation Benchmark
Simulated • 10 task suites
LIBERO provides a diverse set of tabletop manipulation tasks in simulation: picking, placing, stacking, opening drawers, pressing buttons, and more. The key experimental design choice is to split tasks into seen (used to train the detector) and unseen (held out entirely for evaluation). This tests the critical question: can SAFE detect failures on tasks it has never encountered?
Data collection involves rolling out the VLA policy on each task multiple times and recording both the hidden state trajectories and binary success/failure labels (determined by the simulator's ground-truth task completion checker).
Real Robot: Franka Panda
Physical hardware • Manipulation tasks
To validate beyond simulation, SAFE is tested on a physical Franka Emika Panda robot arm performing real manipulation tasks. This bridges the sim-to-real gap: sensor noise, lighting variation, physical dynamics, and genuine manipulation difficulty all come into play. Success/failure labels are assigned by human annotators watching the execution videos.
Training Protocol
The detector heads are trained with binary cross-entropy loss on the seen-task dataset. For the MLP, each timestep is an independent training example. For the LSTM, training examples are windows of consecutive timesteps from a single episode. Training is fast — the detector heads are tiny compared to the VLA backbone — and converges in minutes on a single GPU.
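A minimal training sketch consistent with this protocol, assuming a `train_loader` that yields (features, label) batches: per-timestep features for the MLP head, or (batch, k, feat_dim) windows for the LSTM head. Hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

def train_head(head, train_loader, epochs=10, lr=1e-3, device="cpu"):
    """BCE training for either detector head sketched above."""
    head = head.to(device)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    loss_fn = nn.BCELoss()                      # heads already end in a sigmoid
    for _ in range(epochs):
        for feats, labels in train_loader:
            feats = feats.to(device)
            labels = labels.float().to(device)  # 1 = failure episode, 0 = success
            opt.zero_grad()
            loss = loss_fn(head(feats), labels)
            loss.backward()
            opt.step()
    return head
```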
8 — Key Results
Failure Detection Across VLA Backbones
| VLA Backbone | Detector | Seen Tasks (AUROC) | Unseen Tasks (AUROC) | Unseen Detection Rate |
|---|---|---|---|---|
| π0 | MLP | 0.91 | 0.82 | 78% |
| π0 | LSTM | 0.94 | 0.87 | 84% |
| OpenVLA | MLP | 0.88 | 0.79 | 74% |
| OpenVLA | LSTM | 0.92 | 0.84 | 80% |
| π0-FAST | MLP | 0.90 | 0.81 | 76% |
| π0-FAST | LSTM | 0.93 | 0.86 | 82% |
Key Findings
LSTM Outperforms MLP Consistently
Across all three VLA backbones, the LSTM detector head outperforms the MLP head by about 5 AUROC points on unseen tasks. The temporal modeling provided by the LSTM is especially valuable for detecting gradual failures — cases where a single frame looks ambiguous but the trajectory reveals a clear failure pattern. For instantaneous failures (e.g., complete grasp misses), the gap is smaller.
Generalization to Unseen Tasks
The drop from seen to unseen task performance is modest (7–9 AUROC points), suggesting that VLA hidden states encode task-general failure signatures. When a robot drops an object, the internal representation shifts in a characteristic way regardless of which specific object or task is involved. This is the most important finding: SAFE is not memorizing task-specific failure patterns but learning generalizable failure features.
Conformal Calibration Works
When applying conformal prediction with α = 0.1 (targeting 90% failure detection), the empirical detection rate meets or exceeds the guarantee on held-out test data. This validates that the exchangeability assumption holds well enough in practice for robot manipulation settings, and that the conformal calibration procedure produces reliable, actionable thresholds.
Real Robot Validation
Results on the physical Franka Panda confirm that SAFE works beyond simulation. The sim-to-real gap is present but manageable — detection rates on the real robot are somewhat lower than in LIBERO, but still operationally useful. Importantly, the conformal guarantees transfer: the calibrated thresholds from real-robot calibration data provide the promised coverage.
9 — Which Layer to Monitor?
Not all layers of a VLA model are equally informative for failure detection. SAFE systematically ablates layer choice across all three backbones.
| Layer Position | Feature Type | Detection Quality | Intuition |
|---|---|---|---|
| Early (0–25%) | Low-level visual / token embeddings | Poor | Too low-level; hasn't integrated task semantics |
| Middle (25–60%) | Mid-level representations | Good | Rich mix of visual and semantic information |
| Late-middle (60–80%) | High-level task representations | Best | Task-relevant features before action-specific compression |
| Final (80–100%) | Pre-action / action-specific | Moderate | Over-specialized for action prediction; loses some state info |
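A hypothetical sweep mirroring this ablation: train one small detector per candidate layer and compare validation AUROC. Here `collect_features` and `make_loader` are assumed helpers (roughly, the forward-hook sketch earlier applied over whole episodes, plus a standard DataLoader), while `train_head` and `MLPFailureHead` are the sketches above:

```python
import torch
from sklearn.metrics import roc_auc_score

def sweep_layers(vla, candidate_layers, train_eps, val_eps):
    """Train one detector per candidate layer; compare validation AUROC."""
    results = {}
    for l in candidate_layers:
        # Assumed helpers: collect_features returns (feature tensor, labels)
        # for layer l; make_loader wraps them in a DataLoader.
        X_tr, y_tr = collect_features(vla, train_eps, layer=l)
        X_va, y_va = collect_features(vla, val_eps, layer=l)
        head = train_head(MLPFailureHead(X_tr.shape[-1]), make_loader(X_tr, y_tr))
        with torch.no_grad():
            scores = head(X_va).cpu().numpy()
        results[l] = roc_auc_score(y_va, scores)
    best = max(results, key=results.get)
    return best, results
```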
10 — Why Internal Monitoring Works
It may seem surprising that a model's own hidden states can reveal its failures. After all, if the model “knew” it was failing, why wouldn't it correct itself? The answer lies in the gap between representation and action.
The Representation-Action Gap
A VLA model's hidden states are trained to represent the world accurately enough to predict useful actions on average. This means the representations encode detailed information about the current state of the scene: where objects are, whether the gripper is holding something, the spatial relationships between the arm and the target.
But the action head is trained to output the most likely next action given this representation. It is not trained to output “I am failing” or “this grasp is slipping.” The failure information is present in the representation but not exposed in the output. SAFE simply trains a small head to read what the action head ignores.
An analogy: imagine a doctor who can accurately describe a patient's symptoms (representation) but has been trained only to prescribe medication (action). The doctor's notes contain all the information needed to detect a misdiagnosis, but the prescription alone does not reveal it. SAFE is like a second opinion that reads the doctor's notes.
11 — Practical Deployment Considerations
Computational Overhead
Minimal impact on inference
The detector heads (MLP or LSTM) add negligible computational cost compared to the VLA forward pass itself. A π0 forward pass takes tens of milliseconds on a modern GPU; the MLP detector adds less than 1 ms. Feature extraction via forward hooks adds essentially zero compute — the features are already computed as part of the normal VLA forward pass.
What Happens When a Failure Is Detected?
Recovery strategies
SAFE detects failures but does not prescribe a specific recovery strategy. The simplest response is to stop execution and alert a human operator. More sophisticated approaches could include automatic retry (re-attempt the task from the current state), rollback (return to a known-good state), or escalation (switch to a more conservative policy). The choice depends on the deployment context and safety requirements.
Calibration Data Requirements
How much data do you need?
Conformal prediction requires a calibration set with both successes and failures. The more calibration failures you have, the tighter the conformal guarantee. In practice, 50–100 failure episodes provide reasonably tight bounds. This is modest: running a VLA policy 200 times on a few tasks will typically yield enough failures for calibration, even for policies with 60–70% success rates.
12 — Key Takeaways
Summary of Contributions
| Contribution | Description |
|---|---|
| General failure detection | Monitor VLA hidden states to detect failures without task-specific training |
| Lightweight detector heads | MLP (single-step) and LSTM (temporal) heads that add minimal overhead |
| Conformal safety guarantees | Calibrated thresholds with provable bounds on false negative rates |
| Cross-architecture evaluation | Validated on π0, OpenVLA, and π0-FAST |
| Sim + real validation | LIBERO benchmark and physical Franka Panda experiments |
| Generalization | 84% failure detection on tasks unseen during detector training |