SAFE: Multitask Failure Detection for Vision-Language-Action Models

2025
Safety · Failure Detection · VLA · Conformal Prediction · Robot Monitoring · Franka Panda

1 — The Problem: Silent Failures in Robot Manipulation

Vision-Language-Action (VLA) models like π0, OpenVLA, and π0-FAST represent a leap forward in general-purpose robot control. Feed them a camera image and a language instruction — “pick up the red cup” — and they output motor commands. But there is a dangerous gap between generating actions and succeeding at the task.

A robot might reach for an object and miss. It might grasp a cup but drop it halfway through a pour. It might push an item off the table instead of lifting it. In each case, the VLA model continues generating actions as if nothing went wrong. There is no built-in error signal. The model does not know it has failed.

The core danger: VLA models fail silently. A robot executing a failed grasp looks exactly the same to the policy as a robot executing a successful one — both just keep generating the next action token. Without external monitoring, failures go undetected until a human notices.

The naive solution is to build task-specific failure detectors: train a classifier to detect “dropped cup” failures, another for “missed grasp” failures, and so on. But this approach does not scale. Every new task or environment requires new labeled failure data and a new detector. What we need is a general-purpose failure monitor that works across tasks — including tasks it has never seen before.

Why not just use vision? You could train an external vision model to watch the robot and detect failures from camera images alone. But this requires extensive per-task labeling, struggles with occlusion, and adds another large model to the inference stack. SAFE takes a fundamentally different approach: it looks inside the VLA model itself.

2 — The SAFE Approach: Monitoring Internal Representations

SAFE stands on a simple but powerful insight: the VLA model already knows more than it lets on. Its internal hidden states — the intermediate representations computed at each layer during a forward pass — encode rich information about the scene, the task, and critically, whether the execution is going well or poorly.

Think of it this way: when a VLA model processes a camera image showing a gripper that has clearly missed an object, the internal features at certain layers will differ from those computed when the gripper is firmly holding the object. Even though both scenarios produce action outputs, the path through the network is different. SAFE exploits this difference.

The Three-Step Recipe

1. Extract — At each timestep during robot execution, hook into the VLA model and extract hidden state features from one or more intermediate layers.

2. Detect — Feed those features into a lightweight detector head (an MLP or LSTM) that outputs a failure probability score between 0 and 1.

3. Threshold — Use conformal prediction on a calibration set to choose a detection threshold that provides statistical guarantees on the false negative rate (the full loop is sketched below).
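
To make the recipe concrete, here is a minimal Python sketch of the monitoring loop. Every name in it (act_with_features, detector, env) is a placeholder of our own, not the paper's actual API:

```python
# Pseudocode for the per-timestep monitoring loop; all names are
# placeholders, not the paper's actual interfaces.
def run_with_monitoring(vla, detector, tau, env, instruction):
    obs = env.reset()
    while not env.done():
        action, f_t = vla.act_with_features(obs, instruction)  # 1. Extract
        p_fail = detector(f_t)                                  # 2. Detect
        if p_fail > 1 - tau:                                    # 3. Threshold
            return "ALERT"   # stop, retry, or escalate to a human
        obs = env.step(action)
    return "SUCCESS"
```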

Key insight: SAFE is trained on a set of “seen” tasks but generalizes to entirely new, unseen tasks. The internal features learned by VLA models capture task-general notions of success and failure — not just task-specific visual patterns. This is what makes the approach practical.

3 — Architecture Overview

[Diagram: SAFE monitoring pipeline. The VLA model (π0, OpenVLA, π0-FAST) takes a camera image and language instruction and outputs action a_t; SAFE taps an intermediate layer l for hidden states h_t and forms the feature vector f_t = h_t^(l) of dimension d_model. A detector head (MLP for single steps, LSTM for temporal windows) outputs p(failure), which is compared against a conformally calibrated threshold τ with guaranteed coverage 1 − α: p(failure) > τ triggers ALERT (stop/retry), otherwise execution continues. The loop repeats at every timestep t, with monitoring running in parallel with action generation.]

The diagram above shows the full SAFE pipeline. At every timestep, while the VLA model generates actions for the robot, SAFE extracts hidden state features from an intermediate layer, passes them through a lightweight detector head, and applies a conformally calibrated threshold to make a failure/success decision. The entire monitoring process runs in parallel with action generation — it does not slow down the robot.

4 — Feature Extraction from VLA Internals

Hidden State Extraction

h_t^(l) ∈ ℝ^{d_model}

At each timestep t, the VLA model performs a forward pass: it processes the current camera image, the language instruction, and (optionally) proprioceptive state through its transformer backbone. SAFE hooks into a specific layer l and extracts the hidden state vector h_t^(l).

Which layer to extract from matters. Early layers capture low-level visual features; later layers encode higher-level semantic and task-relevant information. The authors systematically evaluate different layer choices and find that middle-to-late layers tend to produce the most informative features for failure detection.

For models like π0 that use a PaliGemma vision-language backbone, the features come from the language model hidden states after vision-language fusion. For OpenVLA (built on Prismatic + Llama 2), features are extracted from the Llama decoder layers.

Feature Dimensionality

Varies by backbone

Each VLA backbone produces different feature dimensions. π0 uses a 2048-dimensional hidden state from PaliGemma. OpenVLA extracts 4096-dimensional features from Llama 2. π0-FAST similarly provides 2048-dimensional features. The detector heads are designed to handle these varying dimensions — the first linear layer of the MLP or LSTM input projection adapts to the specific backbone.

Crucially, the features are extracted without modifying the VLA model. SAFE is a purely observational monitor. It reads internal states through forward hooks but never changes the model weights, gradients, or inference behavior. The VLA model is completely unaware it is being monitored.
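
As an illustration, a PyTorch forward hook can read these states without touching the model. A minimal sketch, assuming a HuggingFace-style transformer backbone — the language_model.layers module path is hypothetical and differs per model:

```python
# A read-only forward hook on one intermediate layer of the VLA backbone.
# The module path below is illustrative; it varies by model family.
captured = {}

def save_last_token(module, inputs, output):
    # Transformer layers often return a tuple; hidden states come first.
    hidden = output[0] if isinstance(output, tuple) else output
    # Keep the last token's hidden state, detached from the graph:
    # SAFE observes the frozen VLA and never backpropagates into it.
    captured["h_t"] = hidden[:, -1, :].detach()

def attach_monitor(vla_model, layer_idx):
    layer = vla_model.language_model.layers[layer_idx]  # model-specific path
    return layer.register_forward_hook(save_last_token)

# Usage: run inference exactly as before, then read the captured feature.
# handle = attach_monitor(vla, layer_idx=18)
# actions = vla(image, instruction)   # unchanged forward pass
# f_t = captured["h_t"]               # (batch, d_model) feature for the detector
# handle.remove()                     # detach the monitor when done
```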

Why internal features, not outputs? The action outputs of a VLA model are low-dimensional motor commands (typically 7-dimensional for a Franka arm: 3 position, 3 orientation, 1 gripper). These commands alone carry very little information about whether the task is succeeding. The internal hidden states, by contrast, are high-dimensional and encode the model's full “understanding” of the scene and task state.

5 — Detector Heads: MLP vs. LSTM

Once features are extracted, they are fed into a lightweight classifier head. SAFE evaluates two architectures, each with different trade-offs between simplicity and temporal reasoning.

MLP Detector Head

Single-Timestep Classification

1. Takes the feature vector h_t from a single timestep as input.

2. Passes it through two fully-connected layers with ReLU activation and dropout.

3. Outputs a scalar failure probability p(fail | h_t) via a sigmoid activation.

The MLP is fast and simple. It makes an independent prediction at each timestep based solely on the current hidden state. This works surprisingly well because a single frame's hidden state already encodes substantial context about the task state — the VLA's own attention mechanism has already integrated temporal information from prior tokens.
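
A minimal PyTorch sketch of such an MLP head; the hidden width and dropout rate are our assumptions, not the paper's exact hyperparameters:

```python
import torch
import torch.nn as nn

class MLPFailureDetector(nn.Module):
    """Single-timestep detector: two FC layers with ReLU and dropout,
    sigmoid output. Layer sizes are assumed, not from the paper."""
    def __init__(self, d_model: int, hidden: int = 256, p_drop: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden, 1),
        )

    def forward(self, h_t: torch.Tensor) -> torch.Tensor:
        # h_t: (batch, d_model)  ->  p(fail | h_t): (batch,)
        return torch.sigmoid(self.net(h_t)).squeeze(-1)
```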

LSTM Detector Head

Temporal Sequence Classification

1. Maintains a rolling window of the most recent k feature vectors [h_{t-k+1}, ..., h_t].

2. Processes the sequence through an LSTM layer that captures temporal dynamics and transitions.

3. The final LSTM hidden state is passed through a linear layer + sigmoid to produce p(fail | h_{t-k+1:t}).

The LSTM head excels at detecting failures that unfold over time. Consider a grasp that slowly slips: at any single timestep, the features may look ambiguous, but the trajectory of features over several steps reveals a clear failure pattern. The LSTM captures exactly this kind of temporal signal.
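
A matching PyTorch sketch of the LSTM head, again with assumed layer sizes:

```python
import torch
import torch.nn as nn

class LSTMFailureDetector(nn.Module):
    """Temporal detector over a rolling window of k feature vectors.
    The hidden size is an assumption, not the paper's configuration."""
    def __init__(self, d_model: int, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(d_model, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, h_window: torch.Tensor) -> torch.Tensor:
        # h_window: (batch, k, d_model); classify from the final hidden state.
        _, (h_n, _) = self.lstm(h_window)
        return torch.sigmoid(self.head(h_n[-1])).squeeze(-1)  # p(fail | window)
```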

MLP vs. LSTM results: In the paper's experiments, the LSTM head consistently outperforms the MLP head, particularly on tasks where failures are gradual (e.g., an object slowly slipping from the gripper). The MLP is competitive on tasks where failures are instantaneous and visible in a single frame (e.g., a complete grasp miss).

| Property | MLP Head | LSTM Head |
| --- | --- | --- |
| Input | Single-timestep feature h_t | Window of features h_{t-k+1:t} |
| Temporal reasoning | None (relies on VLA's own context) | Explicit via recurrent state |
| Latency | Minimal — single forward pass | Slightly higher — sequential over window |
| Best for | Instantaneous failures | Gradual / temporal failures |
| Parameters | ~10K–50K | ~50K–200K |

6 — Conformal Prediction for Safety Guarantees

Getting a failure probability from the detector head is only half the problem. The harder question is: where do you set the threshold? Set it too high and you miss real failures. Set it too low and you trigger false alarms constantly, making the robot unusable. SAFE uses conformal prediction to solve this rigorously.

[Diagram: Conformal prediction threshold calibration. A calibration set of success and failure trajectories is scored; nonconformity scores s_i = 1 − p(fail | x_i) are computed for the failure examples only and sorted, s_(1) ≤ s_(2) ≤ ..., with n the number of calibration failures. The threshold is the quantile τ = s_(⌈(1−α)(n+1)⌉), where α is the desired false negative rate (e.g., α = 0.1 catches ≥90% of failures). This yields the statistical guarantee P(detecting a true failure) ≥ 1 − α, which holds under exchangeability, with no distributional assumptions on the model.]

How Conformal Prediction Works Here

Conformal prediction is a distribution-free framework for uncertainty quantification. It requires only one assumption: that calibration and test data are exchangeable (roughly, they come from the same distribution and their order does not matter). Given this mild assumption, it provides finite-sample guarantees.

Calibration Procedure

One-time setup

1. Collect calibration data: Run the VLA policy on a set of held-out episodes. Record which episodes succeed and which fail, along with the detector's predicted failure probabilities.

2. Compute nonconformity scores: For each failure episode, compute s_i = 1 - p(fail). A high score means the detector assigned low failure probability to an actual failure — i.e., the detector was “wrong.”

3. Select the threshold: Sort the scores and pick the ⌈(1 - α)(n + 1)⌉-th smallest value as the threshold τ, where α is the desired false negative rate (e.g., 0.1 for 90% detection) and n is the number of calibration failures.

4. Deploy: At test time, flag failure whenever p(fail) > 1 - τ (see the sketch below).
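
A minimal NumPy sketch of the threshold selection, with variable names of our own choosing; the clamp for very small calibration sets is a practical convention, not from the paper:

```python
import numpy as np

def conformal_threshold(p_fail_calib: np.ndarray, alpha: float = 0.1) -> float:
    """Split-conformal threshold from calibration failures.
    p_fail_calib holds the detector's p(fail) on true failure episodes."""
    scores = 1.0 - p_fail_calib              # nonconformity: s_i = 1 - p(fail)
    n = len(scores)
    k = int(np.ceil((1 - alpha) * (n + 1)))  # the ceil((1-alpha)(n+1))-th smallest
    k = min(k, n)                            # clamp for tiny calibration sets
    return float(np.sort(scores)[k - 1])

# Deploy: alert whenever p(fail) > 1 - tau.
# tau = conformal_threshold(calib_p_fail, alpha=0.1)  # target: catch >=90% of failures
# alert = p_fail_now > 1 - tau
```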

Why this matters for safety: Traditional ML thresholds are chosen by optimizing a metric on a validation set — say, the F1 score. But this gives no formal guarantee. The threshold might work well on average but fail catastrophically in certain scenarios. Conformal prediction provides a provable bound: the false negative rate will not exceed α on future data (under exchangeability). For safety-critical robotics, this is the difference between “it usually works” and “it is guaranteed to catch at least 90% of failures.”

7 — Training Details and Experimental Setup

VLA Backbones Evaluated

π0

PaliGemma backbone • 2048-dim features

Physical Intelligence's flow-matching VLA model. Uses a PaliGemma vision-language model as the backbone with a flow-matching action head. SAFE extracts features from the PaliGemma language model layers after vision-language fusion has occurred. This model represents the state-of-the-art in generalist robot control.

OpenVLA

Prismatic + Llama 2 backbone • 4096-dim features

An open-source VLA model combining a Prismatic vision encoder with a Llama 2 7B language model. Actions are tokenized and generated autoregressively as text tokens. SAFE extracts features from the Llama 2 decoder layers. OpenVLA provides a fully open-source baseline for the community.

π0-FAST

PaliGemma backbone • 2048-dim features

A variant of π0 that tokenizes actions using a discrete codebook (FAST tokenization) rather than continuous flow matching. This architectural difference means the internal representations may encode task state differently. Testing on π0-FAST validates that SAFE generalizes across action decoding strategies, not just across tasks.

Benchmarks and Tasks

LIBERO Simulation Benchmark

Simulated • 10 task suites

LIBERO provides a diverse set of tabletop manipulation tasks in simulation: picking, placing, stacking, opening drawers, pressing buttons, and more. The key experimental design splits tasks into seen (used for training the detector) and unseen (held out entirely for evaluation). This tests the critical question: can SAFE detect failures on tasks it has never encountered?

Data collection involves rolling out the VLA policy on each task multiple times and recording both the hidden state trajectories and binary success/failure labels (determined by the simulator's ground-truth task completion checker).
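
Schematically, that collection loop might look like the following, where rollout_and_capture and check_success are hypothetical stand-ins for the hook-based feature capture shown earlier and the simulator's ground-truth checker:

```python
# Hypothetical collection loop for the seen-task training set.
dataset = []
for task in seen_tasks:
    for _ in range(rollouts_per_task):
        features = rollout_and_capture(vla, task)  # list of h_t, one per timestep
        failed = not task.check_success()          # binary episode label
        dataset.append((features, int(failed)))
```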

Real Robot: Franka Panda

Physical hardware • Manipulation tasks

To validate beyond simulation, SAFE is tested on a physical Franka Emika Panda robot arm performing real manipulation tasks. This bridges the sim-to-real gap: sensor noise, lighting variation, physical dynamics, and genuine manipulation difficulty all come into play. Success/failure labels are assigned by human annotators watching the execution videos.

Training Protocol

The detector heads are trained with binary cross-entropy loss on the seen-task dataset. For the MLP, each timestep is an independent training example. For the LSTM, training examples are windows of consecutive timesteps from a single episode. Training is fast — the detector heads are tiny compared to the VLA backbone — and converges in minutes on a single GPU.
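
For concreteness, a minimal training sketch reusing the MLPFailureDetector from Section 5; the optimizer, learning rate, and data loader are assumptions:

```python
import torch
import torch.nn as nn

# Train only the lightweight head; the VLA backbone is never touched.
detector = MLPFailureDetector(d_model=2048)   # e.g., pi0's feature width
optimizer = torch.optim.Adam(detector.parameters(), lr=1e-3)
criterion = nn.BCELoss()

for features, labels in seen_task_loader:     # features: (B, d_model)
    p_fail = detector(features)               # labels: 1 = failure, 0 = success
    loss = criterion(p_fail, labels.float())
    optimizer.zero_grad()
    loss.backward()                           # gradients touch only the tiny head;
    optimizer.step()                          # the VLA backbone stays frozen
```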

Frozen backbone: The VLA model's weights are never modified during SAFE training. Only the lightweight detector head is trained. This means SAFE can be attached to any pre-trained VLA model as a post-hoc safety monitor without re-training or fine-tuning the base model. It is truly plug-and-play.

8 — Key Results

Failure Detection Across VLA Backbones

| VLA Backbone | Detector | Seen Tasks (AUROC) | Unseen Tasks (AUROC) | Unseen Detection Rate |
| --- | --- | --- | --- | --- |
| π0 | MLP | 0.91 | 0.82 | 78% |
| π0 | LSTM | 0.94 | 0.87 | 84% |
| OpenVLA | MLP | 0.88 | 0.79 | 74% |
| OpenVLA | LSTM | 0.92 | 0.84 | 80% |
| π0-FAST | MLP | 0.90 | 0.81 | 76% |
| π0-FAST | LSTM | 0.93 | 0.86 | 82% |

84% detection on unseen tasks. The best configuration (π0 + LSTM) detects 84% of failures on tasks the detector has never seen during training. This demonstrates genuine generalization: the failure signatures in VLA hidden states transfer across tasks.

Key Findings

LSTM Outperforms MLP Consistently

Across all three VLA backbones, the LSTM detector head outperforms the MLP head by 3–6 AUROC points on unseen tasks. The temporal modeling provided by the LSTM is especially valuable for detecting gradual failures — cases where a single frame looks ambiguous but the trajectory reveals a clear failure pattern. For instantaneous failures (e.g., complete grasp misses), the gap is smaller.

Generalization to Unseen Tasks

The drop from seen to unseen task performance is modest (5–9 AUROC points), suggesting that VLA hidden states encode task-general failure signatures. When a robot drops an object, the internal representation shifts in a characteristic way regardless of which specific object or task is involved. This is the most important finding: SAFE is not memorizing task-specific failure patterns but learning generalizable failure features.

Conformal Calibration Works

When applying conformal prediction with α = 0.1 (targeting 90% failure detection), the empirical detection rate meets or exceeds the guarantee on held-out test data. This validates that the exchangeability assumption holds well enough in practice for robot manipulation settings, and that the conformal calibration procedure produces reliable, actionable thresholds.

Real Robot Validation

Results on the physical Franka Panda confirm that SAFE works beyond simulation. The sim-to-real gap is present but manageable — detection rates on the real robot are somewhat lower than in LIBERO, but still operationally useful. Importantly, the conformal guarantees transfer: the calibrated thresholds from real-robot calibration data provide the promised coverage.

9 — Which Layer to Monitor?

Not all layers of a VLA model are equally informative for failure detection. SAFE systematically ablates layer choice across all three backbones.

| Layer Position | Feature Type | Detection Quality | Intuition |
| --- | --- | --- | --- |
| Early (0–25%) | Low-level visual / token embeddings | Poor | Too low-level; hasn't integrated task semantics |
| Middle (25–60%) | Mid-level representations | Good | Rich mix of visual and semantic information |
| Late-middle (60–80%) | High-level task representations | Best | Task-relevant features before action-specific compression |
| Final (80–100%) | Pre-action / action-specific | Moderate | Over-specialized for action prediction; loses some state info |

Sweet spot: The late-middle layers (roughly 60–80% depth) consistently produce the best features for failure detection. At this depth, the model has fully integrated visual and language information into high-level task representations, but has not yet compressed everything down to the narrow bottleneck needed for action prediction. These layers carry the richest “world state” information.

10 — Why Internal Monitoring Works

It may seem surprising that a model's own hidden states can reveal its failures. After all, if the model “knew” it was failing, why wouldn't it correct itself? The answer lies in the gap between representation and action.

The Representation-Action Gap

A VLA model's hidden states are trained to represent the world accurately enough to predict useful actions on average. This means the representations encode detailed information about the current state of the scene: where objects are, whether the gripper is holding something, the spatial relationships between the arm and the target.

But the action head is trained to output the most likely next action given this representation. It is not trained to output “I am failing” or “this grasp is slipping.” The failure information is present in the representation but not exposed in the output. SAFE simply trains a small head to read what the action head ignores.

An analogy: imagine a doctor who can accurately describe a patient's symptoms (representation) but has been trained only to prescribe medication (action). The doctor's notes contain all the information needed to detect a misdiagnosis, but the prescription alone does not reveal it. SAFE is like a second opinion that reads the doctor's notes.

Generalization explained: Why does SAFE generalize to unseen tasks? Because VLA models are themselves trained on diverse tasks. Their representations learn general concepts like “the gripper is empty when it should be holding something” or “the object is not where the instruction said to put it.” These concepts are task-independent, and SAFE's detector learns to read them.

11 — Practical Deployment Considerations

Computational Overhead

Minimal impact on inference

The detector heads (MLP or LSTM) add negligible computational cost compared to the VLA forward pass itself. A π0 forward pass takes tens of milliseconds on a modern GPU; the MLP detector adds less than 1ms. Feature extraction via forward hooks has zero additional compute cost — the features are already computed as part of the normal VLA forward pass.

What Happens When a Failure Is Detected?

Recovery strategies

SAFE detects failures but does not prescribe a specific recovery strategy. The simplest response is to stop execution and alert a human operator. More sophisticated approaches could include automatic retry (re-attempt the task from the current state), rollback (return to a known-good state), or escalation (switch to a more conservative policy). The choice depends on the deployment context and safety requirements.

Calibration Data Requirements

How much data do you need?

Conformal prediction requires a calibration set with both successes and failures. The more calibration failures you have, the tighter the conformal guarantee. In practice, 50–100 failure episodes provide reasonably tight bounds. This is modest: running a VLA policy 200 times on a few tasks will typically yield enough failures for calibration, even for policies with 60–70% success rates.

12 — Key Takeaways

First general-purpose failure detector for VLAs. SAFE is the first method to detect failures across tasks and VLA architectures without task-specific training, opening the door to scalable safety monitoring for robot foundation models.
Internal representations are the key. By monitoring hidden states rather than actions or external observations, SAFE accesses rich, high-dimensional signals that already encode whether execution is succeeding. The VLA model does the hard perceptual work; SAFE just reads the result.
Conformal prediction closes the safety gap. Rather than relying on arbitrary thresholds, SAFE uses conformal prediction to provide provable guarantees on the failure detection rate. This moves robot safety monitoring from empirical heuristics to statistical rigor.
Generalization is real. 84% failure detection on completely unseen tasks demonstrates that VLA hidden states encode task-general notions of success and failure. This is not overfitting to specific failure modes — it is learning what “going wrong” looks like in the model's own language.

Summary of Contributions

| Contribution | Description |
| --- | --- |
| General failure detection | Monitor VLA hidden states to detect failures without task-specific training |
| Lightweight detector heads | MLP (single-step) and LSTM (temporal) heads that add minimal overhead |
| Conformal safety guarantees | Calibrated thresholds with provable bounds on false negative rates |
| Cross-architecture evaluation | Validated on π0, OpenVLA, and π0-FAST |
| Sim + real validation | LIBERO benchmark and physical Franka Panda experiments |
| Generalization | 84% failure detection on tasks unseen during detector training |