Segment Anything: From Images to Video
1 — The Problem to Solve
Image segmentation is the task of labeling every pixel in an image — deciding exactly which pixels belong to which object. Unlike detection (bounding boxes), segmentation gives pixel-precise outlines.
Before SAM, segmentation models were trained on specific datasets for specific categories (people, cars, buildings). You couldn't just point at a random object and say "segment that." SAM changed this by creating a promptable foundation model that segments anything — any object in any image — given just a point, a box, or a rough mask as a prompt (the paper also explores free-text prompts).
2 — SAM Architecture (2023)
Kirillov et al. — arXiv:2304.02643 — Meta FAIR
SAM has three components: an image encoder (runs once per image), a prompt encoder (lightweight, runs per prompt), and a mask decoder (lightweight, produces the mask). This design means you can encode an image once and then interactively try different prompts in real time.
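The encode-once, prompt-many-times design can be sketched as follows. This is a toy illustration of the control flow, not SAM's real API — all function names and the random "features" are placeholders; only the tensor shapes (64×64×256 embedding, 256-d prompt tokens, 3 output masks) follow the paper.

```python
import numpy as np

# Illustrative stubs for SAM's three components (shapes from the paper,
# everything else made up).
def image_encoder(image):
    """Heavy ViT: run ONCE per image. Returns a 64x64x256 embedding."""
    return np.random.default_rng(0).standard_normal((64, 64, 256))

def prompt_encoder(points):
    """Lightweight: one 256-d token per prompt point."""
    return np.random.default_rng(1).standard_normal((len(points), 256))

def mask_decoder(image_embedding, prompt_tokens):
    """Lightweight: 3 candidate masks plus an IoU score for each."""
    rng = np.random.default_rng(2)
    masks = rng.standard_normal((3, 256, 256)) > 0
    iou_scores = rng.random(3)
    return masks, iou_scores

image = np.zeros((1024, 1024, 3))
embedding = image_encoder(image)                 # expensive, once per image
for prompt in ([(512, 300)], [(100, 40), (620, 700)]):
    tokens = prompt_encoder(prompt)              # cheap, once per prompt
    masks, scores = mask_decoder(embedding, tokens)
```

Because the expensive encoding is amortized across prompts, each new click only pays for the two lightweight stages — this is what makes interactive use feel instant.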
1 Image Encoder: ViT-H
ViT-H/16 (MAE Pre-trained)
The heaviest component. A ViT-Huge with 632M parameters (see our ViT walkthrough) processes the input image. Pre-trained with MAE (Masked Autoencoder) self-supervised learning on massive data. The 16×16 patch size on a 1024×1024 image gives 4,096 tokens. After the ViT, neck layers (1×1 and 3×3 convolutions) reduce the output to 64×64×256.
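The token arithmetic above is easy to verify directly (the 1280-channel ViT-H width is from the standard ViT-Huge configuration):

```python
# Patch-token arithmetic for SAM's ViT-H/16 image encoder.
image_size = 1024
patch_size = 16

tokens_per_side = image_size // patch_size   # 1024 / 16 = 64
num_tokens = tokens_per_side ** 2            # 64 * 64 = 4,096 patch tokens
print(num_tokens)  # 4096

# The neck (1x1 then 3x3 convolutions) reduces the ViT-H channel width
# (1280) down to 256, giving the final 64x64x256 image embedding.
embedding_shape = (tokens_per_side, tokens_per_side, 256)
```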
This is intentionally the expensive part — it runs once per image, then the embedding is reused for all prompts.
2 Prompt Encoder
Sparse + Dense Prompt Encoding
Sparse prompts (points and box corners) are represented using positional encodings plus learned embeddings indicating whether each point is foreground or background. Each prompt becomes a 256-d token.
Dense prompts (input masks from a previous iteration) are downsampled via two 2×2 stride-2 convolutions to produce a 256-channel map matching the image embedding resolution. This enables iterative refinement — the user provides a rough mask and SAM refines it.
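The dense-prompt path is easiest to see as shape arithmetic. The sketch below assumes a 256×256 low-resolution input mask (as in the released model); each 2×2 stride-2 convolution halves both spatial dimensions, so two of them land exactly on the 64×64 embedding grid.

```python
# Shape sketch of SAM's dense-prompt downsampling (illustrative only;
# real layers also change channel counts, omitted here).
def conv2x2_stride2_shape(h, w):
    # A 2x2 kernel with stride 2 halves each spatial dimension.
    return h // 2, w // 2

h, w = 256, 256                       # low-res input mask
h, w = conv2x2_stride2_shape(h, w)    # 128 x 128 after the first conv
h, w = conv2x2_stride2_shape(h, w)    # 64 x 64 after the second
print((h, w))  # (64, 64) -> matches the 64x64x256 image embedding
```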
3 Mask Decoder
Two-Way Transformer + Upsampling
The decoder is deliberately lightweight (just 2 Transformer layers) so it can run interactively. It uses two-way cross-attention: prompt tokens attend to the image embedding (asking "what's at my location?") AND image tokens attend to prompts (letting the image "see" what's being asked for).
The decoder outputs 3 masks at different granularity levels (whole object, part, sub-part) plus an IoU confidence score for each. This handles ambiguity — when you click on a person's shirt, SAM gives you the shirt, the torso, and the whole person as three options.
Ambiguity-Aware Outputs: Why Three Masks?
Multi-Granularity Mask Prediction
When you click on a shirt button, do you mean the button, the shirt, or the whole person? SAM doesn't guess — it gives you all three. Each mask comes with an IoU confidence score so you can pick the best one (or the model can auto-select).
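Auto-selection reduces to an argmax over the predicted IoU scores. A minimal sketch with made-up masks and scores:

```python
import numpy as np

# Three nested candidate masks (sub-part, part, whole object) with
# hypothetical predicted-IoU scores.
masks = np.zeros((3, 8, 8), dtype=bool)
masks[0, :2, :2] = True   # sub-part  (e.g. the button)
masks[1, :4, :4] = True   # part      (e.g. the shirt)
masks[2, :6, :6] = True   # whole     (e.g. the person)
iou_scores = np.array([0.71, 0.88, 0.95])

best = int(np.argmax(iou_scores))   # auto-select the most confident mask
selected = masks[best]
print(best)  # 2
```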
The Data Engine: How Meta Built SA-1B
Three-Phase Data Engine
SA-1B wasn't built all at once. Meta used a three-phase data engine in which SAM and human annotators improved each other iteratively:
- Phase 1 — assisted-manual: professional annotators labeled masks interactively using an early version of SAM, which was periodically retrained on the growing dataset.
- Phase 2 — semi-automatic: SAM automatically proposed masks it was confident about; annotators focused on the objects it missed, increasing mask diversity.
- Phase 3 — fully automatic: SAM was prompted with a regular grid of points and generated all masks itself, producing the final 1.1B masks across 11M images.
3 — SAM 2: Extending to Video (2024)
Ravi et al. — arXiv:2408.00714 — Meta FAIR
SAM worked on single images. SAM 2 extends it to video — give a prompt on one frame, and SAM 2 tracks and segments the object through the entire video. This is fundamentally harder because objects move, get occluded, change shape, and the camera moves too.
What SAM 2 Adds
Streaming Architecture
SAM 2 processes video as a stream — one frame at a time, left to right. It doesn't need to see the whole video at once. This makes it practical for real-time applications and arbitrarily long videos.
Hiera Image Encoder (replacing ViT-H)
SAM 2 replaces the heavyweight ViT-H with Hiera (Hierarchical Vision Transformer), also from Meta. Hiera uses MAE pre-training but with a hierarchical design that produces multi-scale features naturally (like a CNN) while being 6× faster than ViT-H. This is critical for video — processing every frame with ViT-H would be too slow.
Memory Attention Module
The key new component. After encoding the current frame, a memory attention module performs cross-attention between the current frame's features and a memory bank that stores information from previous frames. This bank contains:
- Spatial memory — per-frame feature maps from recent frames and the prompted frame
- Object pointers — lightweight tokens summarizing each object's appearance, enabling tracking through occlusions
The memory bank has a fixed capacity (keeping the most recent N frames plus the prompted frame), so it works on arbitrarily long videos without growing memory.
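The fixed-capacity policy is essentially a FIFO window plus one pinned entry. A minimal sketch, with illustrative class and method names (not SAM 2's real API):

```python
from collections import deque

class MemoryBank:
    """Toy memory bank: the N most recent frame memories plus the
    prompted frame, which is never evicted."""

    def __init__(self, capacity=6):
        self.recent = deque(maxlen=capacity)  # oldest entries fall off
        self.prompted = None                  # always retained

    def add(self, frame_idx, features, prompted=False):
        entry = (frame_idx, features)
        if prompted:
            self.prompted = entry
        else:
            self.recent.append(entry)

    def entries(self):
        out = [self.prompted] if self.prompted else []
        return out + list(self.recent)

bank = MemoryBank(capacity=3)
bank.add(0, "feat0", prompted=True)   # the user-prompted frame
for i in range(1, 6):
    bank.add(i, f"feat{i}")           # streaming frames
print([idx for idx, _ in bank.entries()])  # [0, 3, 4, 5]
```

Because the window size is constant, memory attention costs the same on frame 100,000 as on frame 10 — the property that makes arbitrarily long videos tractable.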
Mask Decoder (Same Design as SAM)
The mask decoder uses the same two-way cross-attention design as SAM, but now operates on memory-conditioned features instead of raw image features. It still produces multiple mask hypotheses with confidence scores.
SAM 2 Per-Frame Processing Pipeline
Detailed Frame Processing Flow
For each frame in the video, SAM 2 executes the same sequence: (1) the Hiera encoder produces the frame's features; (2) memory attention conditions those features on the memory bank; (3) the mask decoder predicts the object's mask from the conditioned features; (4) a memory encoder fuses the frame features with the predicted mask and writes the result into the memory bank. When no prompt is given (non-prompted frames), the model relies entirely on the memory bank to identify and segment the tracked object.
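The per-frame loop can be sketched as follows. Every function here is a toy stub with a placeholder name — the point is the order of operations and the fact that only the first frame receives the prompt.

```python
# Toy stubs standing in for SAM 2's components (names are hypothetical).
def hiera_encode(frame):
    return {"frame": frame}                         # stand-in for Hiera features

def memory_attention(feats, bank):
    return {**feats, "memory_used": len(bank)}      # condition on memory

def decode_mask(conditioned, prompt):
    return f"mask@{conditioned['frame']}", 0.9      # (mask, confidence)

def encode_memory(feats, mask):
    return (feats["frame"], mask)                   # entry for the bank

def process_video(frames, prompt, capacity=6):
    bank, outputs = [], []
    for t, frame in enumerate(frames):
        feats = hiera_encode(frame)                          # 1. encode frame
        conditioned = memory_attention(feats, bank)          # 2. attend to memory
        mask, score = decode_mask(conditioned,
                                  prompt if t == 0 else None)  # 3. decode
        bank.append(encode_memory(feats, mask))              # 4. write memory
        bank = bank[-capacity:]                              # recent-N window
        outputs.append(mask)
    return outputs

masks = process_video(frames=list(range(4)), prompt=(512, 300))
print(len(masks))  # 4
```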
Interactive Video Annotation
SAM 2's streaming architecture enables interactive video annotation: a user clicks on an object in one frame, SAM 2 propagates the mask through the video, and if the tracking goes wrong, the user can add corrections on later frames. Each correction is fed back into the memory bank, improving tracking for all subsequent frames. This achieves equivalent accuracy with 3× fewer user interactions than previous annotation tools.
SA-V Dataset
Meta created the SA-V (Segment Anything in Video) dataset — 50,900 videos with 642,600 masklets (per-object mask tracks across frames). This is 53× more mask annotations than previous video segmentation datasets. Like SA-1B, it was built with a human-in-the-loop data engine using SAM 2 itself.
4 — SAM vs. SAM 2 Comparison
| Feature | SAM (2023) | SAM 2 (2024) |
|---|---|---|
| Domain | Images only | Images AND video |
| Image encoder | ViT-H (632M params) | Hiera (hierarchical ViT, 6× faster) |
| Memory | None — each image independent | Memory bank + object pointers |
| Video tracking | Not supported | Frame-by-frame with memory attention |
| Occlusion | N/A | Occlusion prediction + re-identification |
| Training data | SA-1B (1.1B masks, 11M images) | SA-V (642K masklets, 51K videos) + SA-1B |
| Image segmentation | Baseline | Better — SAM 2 also improves on images |
| Interactive speed | ~10ms per prompt | ~10ms per prompt + 3× faster per frame |
5 — References & Further Reading
- Segment Anything — Kirillov et al., Meta AI, 2023
- SAM 2: Segment Anything in Images and Videos — Ravi et al., Meta FAIR, 2024
- Official SAM 2 GitHub Repository
- SAM 2 Project Page
- Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles — Ryali et al., Meta, 2023