Segment Anything: From Images to Video

SAM (2023) & SAM 2 (2024) — Meta AI
Tags: Segmentation · Foundation Model · Promptable · Zero-Shot · Video · Meta FAIR

1 — The Problem to Solve

Image segmentation is the task of labeling every pixel in an image — deciding exactly which pixels belong to which object. Unlike detection (bounding boxes), segmentation gives pixel-precise outlines.

Before SAM, segmentation models were trained on specific datasets for specific categories (people, cars, buildings). You couldn't just point at a random object and say "segment that." SAM changed this by creating a promptable foundation model that segments anything — any object in any image — given just a point, box, or text prompt.

[Figure: image + click prompt → SAM → precise pixel mask]

Prompt types:
  • Point click(s) — foreground / background
  • Bounding box
  • Rough mask (coarse input)
  • Text description (with CLIP)
The SAM project included three things: (1) A new task — "promptable segmentation," (2) A new model — the SAM architecture, and (3) A new dataset — SA-1B with 1.1 billion masks on 11 million images, by far the largest segmentation dataset ever created.

2 — SAM Architecture (2023)

Kirillov et al. — arXiv:2304.02643 — Meta FAIR

SAM has three components: an image encoder (runs once per image), a prompt encoder (lightweight, runs per prompt), and a mask decoder (lightweight, produces the mask). This design means you can encode an image once and then interactively try different prompts in real time.

[Figure: SAM's three-component pipeline]

IMAGE ENCODER — ViT-H, MAE pre-trained, 632M params. Input: 1024×1024 image. Output: 64×64×256 image embedding. Runs ONCE per image (~0.15s on GPU).
PROMPT ENCODER — points and boxes as sparse positional encodings; mask input as a dense conv-downsampled map. Runs per prompt (~1ms).
MASK DECODER — two-way cross-attention (prompts ↔ image), ×2 transformer layers. Outputs 3 masks at 3 levels of granularity plus IoU confidence scores. Runs per prompt (~10ms).

1 Image Encoder: ViT-H

ViT-H/16 (MAE Pre-trained)

1024×1024×3 → 64×64×256

The heaviest component. A ViT-Huge with 632M parameters (see our ViT walkthrough) processes the input image. Pre-trained with MAE (Masked Autoencoder) self-supervised learning on massive data. The 16×16 patch size on a 1024×1024 image gives 4,096 tokens. After the ViT, neck layers (1×1 and 3×3 convolutions) reduce the output to 64×64×256.
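The patch arithmetic above can be checked directly — a small worked example, with the neck's 256-channel output taken from the text:

```python
# Token-count arithmetic for SAM's ViT-H image encoder.
patch_size = 16
image_size = 1024
tokens_per_side = image_size // patch_size   # 1024 / 16 = 64
num_tokens = tokens_per_side ** 2            # 64 * 64 = 4096 patch tokens
# After the neck (1x1 and 3x3 convs), the embedding is 64x64x256:
embedding_shape = (tokens_per_side, tokens_per_side, 256)
```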

This is intentionally the expensive part — it runs once per image, then the embedding is reused for all prompts.
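The encode-once, prompt-many design can be sketched as follows — a minimal illustration with hypothetical class and method names, not the actual segment-anything API:

```python
# Sketch of SAM's amortized inference pattern (hypothetical names).
class PromptableSegmenter:
    def __init__(self, image_encoder, prompt_encoder, mask_decoder):
        self.image_encoder = image_encoder    # heavy: ViT-H, ~0.15s/image
        self.prompt_encoder = prompt_encoder  # light: ~1ms/prompt
        self.mask_decoder = mask_decoder      # light: ~10ms/prompt
        self._embedding = None

    def set_image(self, image):
        # Run the expensive encoder ONCE; cache the 64x64x256 embedding.
        self._embedding = self.image_encoder(image)

    def predict(self, prompt):
        # Reuse the cached embedding for every new prompt.
        tokens = self.prompt_encoder(prompt)
        return self.mask_decoder(self._embedding, tokens)
```

Because `set_image` caches the embedding, trying ten different prompts costs ten cheap decoder passes but only one encoder pass.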

2 Prompt Encoder

Sparse + Dense Prompt Encoding

Points/boxes → 256-d tokens; masks → 256-ch map

Sparse prompts (points and box corners) are represented using positional encodings plus learned embeddings indicating whether each point is foreground or background. Each prompt becomes a 256-d token.

Dense prompts (input masks from a previous iteration) are downsampled via two 2×2 stride-2 convolutions to produce a 256-channel map matching the image embedding resolution. This enables iterative refinement — the user provides a rough mask and SAM refines it.
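A small shape check of the dense path, assuming the mask input arrives at 256×256 (4× below the 1024×1024 image, as in the SAM paper) and using the standard output-size formula for a kernel-2, stride-2 convolution:

```python
# Shape bookkeeping for SAM's dense (mask) prompt path: two stride-2
# convolutions take a 256x256 input mask down to the 64x64 embedding grid.
def conv_out(size, kernel=2, stride=2):
    # Output spatial size of a convolution with no padding.
    return (size - kernel) // stride + 1

mask_res = 256
after_conv1 = conv_out(mask_res)      # 128
after_conv2 = conv_out(after_conv1)   # 64 -> matches the 64x64x256 embedding
```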

3 Mask Decoder

Two-Way Transformer + Upsampling

64×64×256 + prompts → 3 masks + IoU scores

The decoder is deliberately lightweight (just 2 Transformer layers) so it can run interactively. It uses two-way cross-attention: prompt tokens attend to the image embedding (asking "what's at my location?") AND image tokens attend to prompts (letting the image "see" what's being asked for).
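The two-way idea can be illustrated with a toy single-head attention in NumPy — a sketch of the information flow only, omitting the projections, self-attention, and MLPs of the real layer:

```python
import numpy as np

def attend(q, k, v):
    # Scaled dot-product attention (single head, no learned projections).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v

def two_way_layer(prompt_tokens, image_tokens):
    # 1. Prompts attend to the image ("what's at my location?").
    prompt_tokens = prompt_tokens + attend(prompt_tokens, image_tokens, image_tokens)
    # 2. Image attends to the prompts (the image "sees" what is asked for).
    image_tokens = image_tokens + attend(image_tokens, prompt_tokens, prompt_tokens)
    return prompt_tokens, image_tokens
```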

The decoder outputs 3 masks at different granularity levels (whole object, part, sub-part) plus an IoU confidence score for each. This handles ambiguity — when you click on a person's shirt, SAM gives you the shirt, the torso, and the whole person as three options.

Ambiguity-Aware Outputs: Why Three Masks?

Multi-Granularity Mask Prediction

1 prompt → 3 masks at different levels

When you click on a shirt button, do you mean the button, the shirt, or the whole person? SAM doesn't guess — it gives you all three. Each mask comes with an IoU confidence score so you can pick the best one (or the model can auto-select).

[Figure: one click, three masks — clicking on a shirt yields Mask 1: part (IoU 0.72), Mask 2: object (IoU 0.91), Mask 3: whole (IoU 0.85); auto-select picks the highest IoU → Mask 2. IoU prediction head: 3-layer MLP, 256 → 256 → 3.]
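Auto-selection is then just an argmax over the predicted IoU scores — a trivial sketch, with scores taken from the illustrative numbers above:

```python
# Pick the mask whose predicted IoU confidence is highest.
def select_mask(masks, iou_scores):
    best = max(range(len(iou_scores)), key=lambda i: iou_scores[i])
    return masks[best], iou_scores[best]

masks = ["part", "object", "whole"]
scores = [0.72, 0.91, 0.85]            # illustrative IoU predictions
mask, score = select_mask(masks, scores)  # -> ("object", 0.91)
```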

The Data Engine: How Meta Built SA-1B

Three-Phase Data Engine

4.3M → 5.9M → 1.1B masks

SA-1B wasn't built all at once. Meta used a three-phase data engine where SAM and human annotators improved each other iteratively:

[Figure: SA-1B data engine — model-in-the-loop annotation]

  • Phase 1 — Assisted manual: annotators click, SAM suggests masks, humans correct & approve → 4.3M masks (120K images)
  • Phase 2 — Semi-automatic: SAM auto-generates masks, annotators fill in the gaps → 5.9M masks (180K images)
  • Phase 3 — Fully automatic: SAM generates all masks alone, no human annotation needed → 1.1B masks (11M images)

After each phase SAM is retrained on the new data — the model improves, so it generates better data.

3 — SAM 2: Extending to Video (2024)

Ravi et al. — arXiv:2408.00714 — Meta FAIR

SAM worked on single images. SAM 2 extends it to video — give a prompt on one frame, and SAM 2 tracks and segments the object through the entire video. This is fundamentally harder because objects move, get occluded, change shape, and the camera moves too.

What SAM 2 Adds

Streaming Architecture

Process video frame-by-frame with memory

SAM 2 processes video as a stream — one frame at a time, left to right. It doesn't need to see the whole video at once. This makes it practical for real-time applications and arbitrarily long videos.

[Figure: streaming video processing — frame 1 (prompted), frame 2, frame 3, …, frame N, each reading from and writing to a memory bank of spatial memory + object pointers]

Each frame: (1) encode image, (2) read memory, (3) predict mask, (4) write to memory.
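That four-step loop, as a minimal sketch with hypothetical callables for each stage:

```python
# Streaming segmentation loop: each frame is encoded, conditioned on the
# memory bank, decoded, and written back. Interfaces are illustrative.
def segment_video(frames, encode, memory_attend, decode, memory_bank):
    masks = []
    for frame in frames:
        feats = encode(frame)                      # 1. encode image
        feats = memory_attend(feats, memory_bank)  # 2. read memory
        mask = decode(feats)                       # 3. predict mask
        memory_bank.append((frame, mask))          # 4. write to memory
        masks.append(mask)
    return masks
```

Nothing in the loop depends on the total video length, which is what makes arbitrarily long videos practical.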

Hiera Image Encoder (replacing ViT-H)

Hierarchical ViT — faster than ViT-H

SAM 2 replaces the heavyweight ViT-H with Hiera (Hierarchical Vision Transformer), also from Meta. Hiera uses MAE pre-training but with a hierarchical design that produces multi-scale features naturally (like a CNN) while being 6× faster than ViT-H. This is critical for video — processing every frame with ViT-H would be too slow.

Memory Attention Module

Current frame attends to memory bank

The key new component. After encoding the current frame, a memory attention module performs cross-attention between the current frame's features and a memory bank that stores information from previous frames. This bank contains:

  • Spatial memory — per-frame feature maps from recent frames and the prompted frame
  • Object pointers — lightweight tokens summarizing each object's appearance, enabling tracking through occlusions

The memory bank has a fixed capacity (keeping the most recent N frames plus the prompted frame), so it works on arbitrarily long videos without growing memory.
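A fixed-capacity bank that pins the prompted frame while recent frames roll over can be sketched with a bounded deque — an illustration of the eviction policy, not SAM 2's actual data structure:

```python
from collections import deque

class MemoryBank:
    """Bounded memory: the prompted frame is pinned, recent frames rotate."""
    def __init__(self, capacity=6):
        self.prompted = None                  # never evicted
        self.recent = deque(maxlen=capacity)  # most recent N frames only

    def write(self, memory, prompted=False):
        if prompted:
            self.prompted = memory
        else:
            self.recent.append(memory)        # oldest entry falls off

    def read(self):
        entries = list(self.recent)
        if self.prompted is not None:
            entries = [self.prompted] + entries
        return entries
```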

Mask Decoder (Same Design as SAM)

Prompt tokens + memory-conditioned features → masks

The mask decoder uses the same two-way cross-attention design as SAM, but now operates on memory-conditioned features instead of raw image features. It still produces multiple mask hypotheses with confidence scores.

SAM 2 Per-Frame Processing Pipeline

Detailed Frame Processing Flow

image → encode → memory attend → decode → mask + update memory

For each frame in the video, SAM 2 executes a precise sequence. When no prompt is given (non-prompted frames), the model relies entirely on the memory bank to identify and segment the tracked object.

[Figure: SAM 2 per-frame processing for frame t]

Hiera encoder: frame_t → F_t. Memory attention: L transformer blocks (self-attention, plus cross-attention into the memory bank of recent-frame memories, prompted-frame memories, and object pointers). Prompt encoder: runs only if frame t is prompted. Mask decoder: two-way cross-attention + occlusion head → Mask_t and occlusion score. Memory encoder: mask + features → new memory, written back to the memory bank.

On single images the memory bank is empty, so SAM 2 reduces to SAM — a strict generalization.
Occlusion handling: SAM 2 adds a dedicated occlusion prediction head — a small MLP that estimates whether the target object is currently visible. When the occlusion score is high, the model knows not to produce a mask for that frame and relies on the memory bank to re-identify the object when it reappears. This is critical for videos where objects go behind other objects, leave the frame, or get temporarily hidden.
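The gating logic amounts to a simple threshold on the occlusion score — the 0.5 threshold here is an illustrative choice, not a value from the paper:

```python
# Gate mask output on the occlusion head's visibility estimate.
def gate_on_occlusion(mask, occlusion_score, threshold=0.5):
    # High score => object judged not visible: emit no mask this frame;
    # the memory bank re-identifies the object when it reappears.
    if occlusion_score > threshold:
        return None
    return mask
```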

Interactive Video Annotation

Click on frame 1 → track through video → correct on frame N → refine

SAM 2's streaming architecture enables interactive video annotation: a user clicks on an object in one frame, SAM 2 propagates the mask through the video, and if the tracking goes wrong, the user can add corrections on later frames. Each correction is fed back into the memory bank, improving tracking for all subsequent frames. This achieves equivalent accuracy with 3× fewer user interactions than previous annotation tools.
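The correction loop can be sketched as follows, with a hypothetical `model` object whose `prompt` call writes into the memory bank and whose `track` call propagates forward:

```python
# Interactive video annotation sketch: corrections on later frames
# overwrite the propagated masks from that frame onward.
def annotate(video, model, prompts):
    """prompts: {frame_index: user click or mask correction}."""
    masks = {}
    for t, prompt in sorted(prompts.items()):
        model.prompt(t, prompt)                 # write into memory bank
        for f in range(t, len(video)):
            masks[f] = model.track(video[f])    # re-propagate forward
    return masks
```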

SA-V Dataset

Meta created the SA-V (Segment Anything in Video) dataset — 50,900 videos with 642,600 masklets (per-object mask tracks across frames). This is 53× more mask annotations than previous video segmentation datasets. Like SA-1B, it was built with a human-in-the-loop data engine using SAM 2 itself.

4 — SAM vs. SAM 2 Comparison

| Feature | SAM (2023) | SAM 2 (2024) |
| --- | --- | --- |
| Domain | Images only | Images AND video |
| Image encoder | ViT-H (632M params) | Hiera (hierarchical ViT, ~6× faster) |
| Memory | None — each image independent | Memory bank + object pointers |
| Video tracking | Not supported | Frame-by-frame with memory attention |
| Occlusion | N/A | Occlusion prediction + re-identification |
| Training data | SA-1B (1.1B masks, 11M images) | SA-V (642K masklets, 51K videos) + SA-1B |
| Image segmentation | Baseline | Better — SAM 2 also improves on images |
| Interactive speed | ~10ms per prompt | ~10ms per prompt, with ~6× faster per-frame encoding |
Surprising result: SAM 2 isn't just better at video — it's also better at images. Despite being trained jointly on images and video, SAM 2 outperforms the original SAM on image segmentation benchmarks. The video training acts as a form of augmentation, teaching the model to understand objects from multiple viewpoints.

5 — References & Further Reading