Segment Anything: From Images to Video

SAM (2023) & SAM 2 (2024) — Meta AI
Tags: Segmentation · Foundation Model · Promptable · Zero-Shot · Video · Meta FAIR

1 — The Problem to Solve

Image segmentation is the task of labeling every pixel in an image — deciding exactly which pixels belong to which object. Unlike detection (bounding boxes), segmentation gives pixel-precise outlines.

Before SAM, segmentation models were trained on specific datasets for specific categories (people, cars, buildings). You couldn't just point at a random object and say "segment that." SAM changed this by creating a promptable foundation model that segments anything — any object in any image — given just a point, box, or text prompt.

[Figure: image + click prompt → SAM → precise pixel mask]

Prompt types:
  • Point click(s) — foreground / background
  • Bounding box
  • Rough mask (coarse input)
  • Text description (with CLIP)
The SAM project included three things: (1) A new task — "promptable segmentation," (2) A new model — the SAM architecture, and (3) A new dataset — SA-1B with 1.1 billion masks on 11 million images, by far the largest segmentation dataset ever created.

2 — SAM Architecture (2023)

Kirillov et al. — arXiv:2304.02643 — Meta FAIR

SAM has three components: an image encoder (runs once per image), a prompt encoder (lightweight, runs per prompt), and a mask decoder (lightweight, produces the mask). This design means you can encode an image once and then interactively try different prompts in real time.

[Figure: SAM's three-component pipeline]

IMAGE ENCODER — ViT-H, MAE pre-trained, 632M params. Input: 1024×1024 image. Output: 64×64×256 image embedding. Runs ONCE per image (~0.15s on GPU).
PROMPT ENCODER — points and boxes as sparse positional encodings; mask input as a dense conv-downsampled map. Runs per prompt (~1ms).
MASK DECODER — two-way cross-attention (prompts ↔ image), ×2 transformer layers. Outputs 3 masks at 3 levels of granularity plus IoU confidence scores. Runs per prompt (~10ms).

1 Image Encoder: ViT-H

ViT-H/16 (MAE Pre-trained)

1024×1024×3 → 64×64×256

The heaviest component. A ViT-Huge with 632M parameters (see our ViT walkthrough) processes the input image. Pre-trained with MAE (Masked Autoencoder) self-supervised learning on massive data. The 16×16 patch size on a 1024×1024 image gives 4,096 tokens. After the ViT, neck layers (1×1 and 3×3 convolutions) reduce the output to 64×64×256.
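The patch arithmetic above can be checked directly — a small worked example, with the neck's 256-channel output taken from the text:

```python
# Token-count arithmetic for SAM's ViT-H image encoder.
patch_size = 16
image_size = 1024
tokens_per_side = image_size // patch_size   # 1024 / 16 = 64
num_tokens = tokens_per_side ** 2            # 64 * 64 = 4096 patch tokens
# After the neck (1x1 and 3x3 convs), the embedding is 64x64x256:
embedding_shape = (tokens_per_side, tokens_per_side, 256)
```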

This is intentionally the expensive part — it runs once per image, then the embedding is reused for all prompts.
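The encode-once, prompt-many design can be sketched as follows — a minimal illustration with hypothetical class and method names, not the actual segment-anything API:

```python
# Sketch of SAM's amortized inference pattern (hypothetical names).
class PromptableSegmenter:
    def __init__(self, image_encoder, prompt_encoder, mask_decoder):
        self.image_encoder = image_encoder    # heavy: ViT-H, ~0.15s/image
        self.prompt_encoder = prompt_encoder  # light: ~1ms/prompt
        self.mask_decoder = mask_decoder      # light: ~10ms/prompt
        self._embedding = None

    def set_image(self, image):
        # Run the expensive encoder ONCE; cache the 64x64x256 embedding.
        self._embedding = self.image_encoder(image)

    def predict(self, prompt):
        # Reuse the cached embedding for every new prompt.
        tokens = self.prompt_encoder(prompt)
        return self.mask_decoder(self._embedding, tokens)
```

Because `set_image` caches the embedding, trying ten different prompts costs ten cheap decoder passes but only one encoder pass.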

2 Prompt Encoder

Sparse + Dense Prompt Encoding

Points/boxes → 256-d tokens; masks → 256-ch map

Sparse prompts (points and box corners) are represented using positional encodings plus learned embeddings indicating whether each point is foreground or background. Each prompt becomes a 256-d token.

Dense prompts (input masks from a previous iteration) are downsampled via two 2×2 stride-2 convolutions to produce a 256-channel map matching the image embedding resolution. This enables iterative refinement — the user provides a rough mask and SAM refines it.
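A small shape check of the dense path, assuming the mask input arrives at 256×256 (4× below the 1024×1024 image, as in the SAM paper) and using the standard output-size formula for a kernel-2, stride-2 convolution:

```python
# Shape bookkeeping for SAM's dense (mask) prompt path: two stride-2
# convolutions take a 256x256 input mask down to the 64x64 embedding grid.
def conv_out(size, kernel=2, stride=2):
    # Output spatial size of a convolution with no padding.
    return (size - kernel) // stride + 1

mask_res = 256
after_conv1 = conv_out(mask_res)      # 128
after_conv2 = conv_out(after_conv1)   # 64 -> matches the 64x64x256 embedding
```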

3 Mask Decoder

Two-Way Transformer + Upsampling

64×64×256 + prompts → 3 masks + IoU scores

The decoder is deliberately lightweight (just 2 Transformer layers) so it can run interactively. It uses two-way cross-attention: prompt tokens attend to the image embedding (asking "what's at my location?") AND image tokens attend to prompts (letting the image "see" what's being asked for).
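The two-way idea can be illustrated with a toy single-head attention in NumPy — a sketch of the information flow only, omitting the projections, self-attention, and MLPs of the real layer:

```python
import numpy as np

def attend(q, k, v):
    # Scaled dot-product attention (single head, no learned projections).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v

def two_way_layer(prompt_tokens, image_tokens):
    # 1. Prompts attend to the image ("what's at my location?").
    prompt_tokens = prompt_tokens + attend(prompt_tokens, image_tokens, image_tokens)
    # 2. Image attends to the prompts (the image "sees" what is asked for).
    image_tokens = image_tokens + attend(image_tokens, prompt_tokens, prompt_tokens)
    return prompt_tokens, image_tokens
```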

The decoder outputs 3 masks at different granularity levels (whole object, part, sub-part) plus an IoU confidence score for each. This handles ambiguity — when you click on a person's shirt, SAM gives you the shirt, the torso, and the whole person as three options.

Ambiguity-Aware Outputs: Why Three Masks?

Multi-Granularity Mask Prediction

1 prompt → 3 masks at different levels

When you click on a shirt button, do you mean the button, the shirt, or the whole person? SAM doesn't guess — it gives you all three. Each mask comes with an IoU confidence score so you can pick the best one (or the model can auto-select).

[Figure: one click, three masks — clicking on a shirt yields Mask 1: part (IoU 0.72), Mask 2: object (IoU 0.91), Mask 3: whole (IoU 0.85); auto-select picks the highest IoU → Mask 2. IoU prediction head: 3-layer MLP, 256 → 256 → 3.]
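Auto-selection is then just an argmax over the predicted IoU scores — a trivial sketch, with scores taken from the illustrative numbers above:

```python
# Pick the mask whose predicted IoU confidence is highest.
def select_mask(masks, iou_scores):
    best = max(range(len(iou_scores)), key=lambda i: iou_scores[i])
    return masks[best], iou_scores[best]

masks = ["part", "object", "whole"]
scores = [0.72, 0.91, 0.85]            # illustrative IoU predictions
mask, score = select_mask(masks, scores)  # -> ("object", 0.91)
```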

The Data Engine: How Meta Built SA-1B

Three-Phase Data Engine

4.3M → 5.9M → 1.1B masks

SA-1B wasn't built all at once. Meta used a three-phase data engine where SAM and human annotators improved each other iteratively:

[Figure: SA-1B data engine — model-in-the-loop annotation]

  • Phase 1 — Assisted manual: annotators click, SAM suggests masks, humans correct & approve → 4.3M masks (120K images)
  • Phase 2 — Semi-automatic: SAM auto-generates masks, annotators fill in the gaps → 5.9M masks (180K images)
  • Phase 3 — Fully automatic: SAM generates all masks alone, no human annotation needed → 1.1B masks (11M images)

After each phase SAM is retrained on the new data — the model improves, so it generates better data.

3 — SAM 2: Extending to Video (2024)

Ravi et al. — arXiv:2408.00714 — Meta FAIR

SAM worked on single images. SAM 2 extends it to video — give a prompt on one frame, and SAM 2 tracks and segments the object through the entire video. This is fundamentally harder because objects move, get occluded, change shape, and the camera moves too.

What SAM 2 Adds

Streaming Architecture

Process video frame-by-frame with memory

SAM 2 processes video as a stream — one frame at a time, left to right. It doesn't need to see the whole video at once. This makes it practical for real-time applications and arbitrarily long videos.

[Figure: streaming video processing — frame 1 (prompted), frame 2, frame 3, …, frame N, each reading from and writing to a memory bank of spatial memory + object pointers]

Each frame: (1) encode image, (2) read memory, (3) predict mask, (4) write to memory.
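That four-step loop, as a minimal sketch with hypothetical callables for each stage:

```python
# Streaming segmentation loop: each frame is encoded, conditioned on the
# memory bank, decoded, and written back. Interfaces are illustrative.
def segment_video(frames, encode, memory_attend, decode, memory_bank):
    masks = []
    for frame in frames:
        feats = encode(frame)                      # 1. encode image
        feats = memory_attend(feats, memory_bank)  # 2. read memory
        mask = decode(feats)                       # 3. predict mask
        memory_bank.append((frame, mask))          # 4. write to memory
        masks.append(mask)
    return masks
```

Nothing in the loop depends on the total video length, which is what makes arbitrarily long videos practical.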

Hiera Image Encoder (replacing ViT-H)

Hierarchical ViT — faster than ViT-H

SAM 2 replaces the heavyweight ViT-H with Hiera (Hierarchical Vision Transformer), also from Meta. Hiera uses MAE pre-training but with a hierarchical design that produces multi-scale features naturally (like a CNN) while being 6× faster than ViT-H. This is critical for video — processing every frame with ViT-H would be too slow.

Memory Attention Module

Current frame attends to memory bank

The key new component. After encoding the current frame, a memory attention module performs cross-attention between the current frame's features and a memory bank that stores information from previous frames. This bank contains:

  • Spatial memory — per-frame feature maps from recent frames and the prompted frame
  • Object pointers — lightweight tokens summarizing each object's appearance, enabling tracking through occlusions

The memory bank has a fixed capacity (keeping the most recent N frames plus the prompted frame), so it works on arbitrarily long videos without growing memory.
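A fixed-capacity bank that pins the prompted frame while recent frames roll over can be sketched with a bounded deque — an illustration of the eviction policy, not SAM 2's actual data structure:

```python
from collections import deque

class MemoryBank:
    """Bounded memory: the prompted frame is pinned, recent frames rotate."""
    def __init__(self, capacity=6):
        self.prompted = None                  # never evicted
        self.recent = deque(maxlen=capacity)  # most recent N frames only

    def write(self, memory, prompted=False):
        if prompted:
            self.prompted = memory
        else:
            self.recent.append(memory)        # oldest entry falls off

    def read(self):
        entries = list(self.recent)
        if self.prompted is not None:
            entries = [self.prompted] + entries
        return entries
```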

Mask Decoder (Same Design as SAM)

Prompt tokens + memory-conditioned features → masks

The mask decoder uses the same two-way cross-attention design as SAM, but now operates on memory-conditioned features instead of raw image features. It still produces multiple mask hypotheses with confidence scores.

SAM 2 Per-Frame Processing Pipeline

Detailed Frame Processing Flow

image → encode → memory attend → decode → mask + update memory

For each frame in the video, SAM 2 executes a precise sequence. When no prompt is given (non-prompted frames), the model relies entirely on the memory bank to identify and segment the tracked object.

[Figure: SAM 2 per-frame processing for frame t]

Hiera encoder: frame_t → F_t. Memory attention: L transformer blocks (self-attention, plus cross-attention into the memory bank of recent-frame memories, prompted-frame memories, and object pointers). Prompt encoder: runs only if frame t is prompted. Mask decoder: two-way cross-attention + occlusion head → Mask_t and occlusion score. Memory encoder: mask + features → new memory, written back to the memory bank.

On single images the memory bank is empty, so SAM 2 reduces to SAM — a strict generalization.
Occlusion handling: SAM 2 adds a dedicated occlusion prediction head — a small MLP that estimates whether the target object is currently visible. When the occlusion score is high, the model knows not to produce a mask for that frame and relies on the memory bank to re-identify the object when it reappears. This is critical for videos where objects go behind other objects, leave the frame, or get temporarily hidden.
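The gating logic amounts to a simple threshold on the occlusion score — the 0.5 threshold here is an illustrative choice, not a value from the paper:

```python
# Gate mask output on the occlusion head's visibility estimate.
def gate_on_occlusion(mask, occlusion_score, threshold=0.5):
    # High score => object judged not visible: emit no mask this frame;
    # the memory bank re-identifies the object when it reappears.
    if occlusion_score > threshold:
        return None
    return mask
```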

Interactive Video Annotation

Click on frame 1 → track through video → correct on frame N → refine

SAM 2's streaming architecture enables interactive video annotation: a user clicks on an object in one frame, SAM 2 propagates the mask through the video, and if the tracking goes wrong, the user can add corrections on later frames. Each correction is fed back into the memory bank, improving tracking for all subsequent frames. This achieves equivalent accuracy with 3× fewer user interactions than previous annotation tools.
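The correction loop can be sketched as follows, with a hypothetical `model` object whose `prompt` call writes into the memory bank and whose `track` call propagates forward:

```python
# Interactive video annotation sketch: corrections on later frames
# overwrite the propagated masks from that frame onward.
def annotate(video, model, prompts):
    """prompts: {frame_index: user click or mask correction}."""
    masks = {}
    for t, prompt in sorted(prompts.items()):
        model.prompt(t, prompt)                 # write into memory bank
        for f in range(t, len(video)):
            masks[f] = model.track(video[f])    # re-propagate forward
    return masks
```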

SA-V Dataset

Meta created the SA-V (Segment Anything in Video) dataset — 50,900 videos with 642,600 masklets (per-object mask tracks across frames). This is 53× more mask annotations than previous video segmentation datasets. Like SA-1B, it was built with a human-in-the-loop data engine using SAM 2 itself.

4 — SAM vs. SAM 2 Comparison

| Feature | SAM (2023) | SAM 2 (2024) |
| --- | --- | --- |
| Domain | Images only | Images AND video |
| Image encoder | ViT-H (632M params) | Hiera (hierarchical ViT, ~6× faster) |
| Memory | None — each image independent | Memory bank + object pointers |
| Video tracking | Not supported | Frame-by-frame with memory attention |
| Occlusion | N/A | Occlusion prediction + re-identification |
| Training data | SA-1B (1.1B masks, 11M images) | SA-V (642K masklets, 51K videos) + SA-1B |
| Image segmentation | Baseline | Better — SAM 2 also improves on images |
| Interactive speed | ~10ms per prompt | ~10ms per prompt, with ~6× faster per-frame encoding |
Surprising result: SAM 2 isn't just better at video — it's also better at images. Despite being trained jointly on images and video, SAM 2 outperforms the original SAM on image segmentation benchmarks. The video training acts as a form of augmentation, teaching the model to understand objects from multiple viewpoints.

5 — References & Further Reading