FastVLM: Efficient Vision Encoding for Vision Language Models

CVPR 2025 — Apple
Vision-Language · Efficient Inference · Hybrid Encoder · On-Device · Mobile · Apple

1 — The Problem to Solve

Vision-Language Models (VLMs) combine a vision encoder with a large language model to understand images and answer questions about them. You show the model a photo, a chart, or a document, and it can describe what it sees, answer questions, or extract text. Models like LLaVA, GPT-4V, and Gemini all follow this pattern.

The problem is speed. Current VLMs use heavy vision encoders like ViT-L/14 (304M parameters) or SigLIP-SO400M (430M parameters). These encoders are slow, especially at high resolutions — and high resolution is critical for reading text in documents, charts, and fine-grained visual details. Worse, higher resolution means more visual tokens fed to the LLM, which increases the time-to-first-token (TTFT) — the latency a user experiences before the model starts responding.

Figure: the VLM latency bottleneck. A high-resolution input (a 1024×1024 chart, document, or photo) goes through a heavy ViT-L/14 encoder (304M params; 576 tokens already at 336px), an MLP projection, and the LLM decoder prefill. More visual tokens mean more LLM prefill time and a higher TTFT before the text response starts.

This creates a three-way tension: you want high resolution for accuracy, few tokens for fast LLM prefilling, and a small, fast encoder for low latency. Most prior work addressed this by keeping heavy encoders and adding token pruning or merging on top. FastVLM takes a different approach — it redesigns the vision encoder itself.

FastVLM's thesis: Instead of using a heavy ViT encoder and then pruning its tokens, build a hybrid convolutional-transformer encoder that natively produces fewer tokens and runs faster — especially at high resolutions. Simplicity over complexity.

2 — Architecture Overview

FastVLM follows the standard three-component VLM design: a vision encoder (FastViTHD), an MLP projection layer, and an LLM decoder. The innovation is entirely in the vision encoder — the rest is deliberately kept simple.

Figure: FastVLM architecture. The vision encoder is FastViTHD (125M params): a 4× convolutional stem, three RepMixer (conv) stages (S1: d=2, c=96; S2: d=12, c=192; S3: d=24, c=384) and two multi-head self-attention stages (S4: d=4, c=768; S5: d=2, c=1536), where d = depth in blocks, c = channels, and MHSA = multi-head self-attention. Total downsampling is 64×, so a 768×768 input yields a 12×12 map — 144 tokens versus 576 for ViT-L/14 at 336px. An MLP projection maps the 1536-dim tokens to the LLM dimension, and a Qwen2 decoder (0.5B / 1.5B / 7B) prefills the 144 visual tokens plus the text query tokens and generates the answer autoregressively (e.g., "What text is in this chart?" → "The chart shows revenue..."). Vision is fast thanks to the hybrid conv-transformer encoder; the LLM is fast because it has fewer tokens to prefill.
Key design principle: FastVLM does not use token pruning, token merging, or any post-hoc compression. The efficiency comes entirely from the encoder architecture itself — a hybrid design where early convolution stages do the heavy spatial processing cheaply, and self-attention is only applied to already-downsampled feature maps in the final stages.

3 — FastViTHD: The Hybrid Vision Encoder

The core contribution — a 125M parameter hybrid encoder built on Apple's FastViT and MobileCLIP research

FastViTHD extends Apple's earlier FastViT architecture (ICCV 2023) with an additional fifth stage. The key insight is that convolutions are efficient at processing high-resolution spatial features, while self-attention excels at capturing global context. By stacking convolutions first and self-attention last, you get both — without the cost of running self-attention on high-resolution feature maps.

1 Convolutional Stem

Patch Embedding Stem

768×768×3 → 192×192×96

The stem downsamples the input image by 4× using strided convolutions. This is a critical efficiency gain over ViT, which uses a single large-stride patch embedding (14×14 or 16×16) — the stem gives the convolutional stages a head start at reduced resolution while preserving fine spatial detail through learned filters rather than hard patch boundaries.
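
As a rough illustration, here is a minimal PyTorch sketch of a strided-convolution stem with the same 4× downsampling and 96 output channels; the exact layer layout and activations of the real FastViT stem are assumptions here, not the released code.

```python
import torch
import torch.nn as nn

# Minimal sketch (assumed layout, not the released FastViT stem): two stride-2
# convolutions give the 4x spatial downsample, producing 96-channel features
# at 192x192 for a 768x768 input.
stem = nn.Sequential(
    nn.Conv2d(3, 48, kernel_size=3, stride=2, padding=1),
    nn.GELU(),
    nn.Conv2d(48, 96, kernel_size=3, stride=2, padding=1),
    nn.GELU(),
)

x = torch.randn(1, 3, 768, 768)
print(stem(x).shape)  # torch.Size([1, 96, 192, 192])
```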

2 Stages 1–3: RepMixer Convolutional Blocks

RepMixer Blocks (Efficient Token Mixing)

Stage 1: 96-ch, depth 2 → Stage 2: 192-ch, depth 12 → Stage 3: 384-ch, depth 24

The first three stages use RepMixer blocks — a structural reparameterization technique from FastViT. During training, each block uses a depthwise convolution branch and a skip connection for token mixing. At inference time, these branches are fused into a single depthwise convolution through reparameterization, eliminating the skip connection overhead entirely.

Each stage also includes a ConvFFN (convolutional feed-forward network) with 7×7 depthwise convolutions preceding a standard FFN with 4× expansion ratio. The depthwise convolutions inject local spatial information that pure FFNs miss.
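
The sketch below shows one way such a ConvFFN could look in PyTorch. The 7×7 depthwise convolution and 4× expansion follow the description above; the omission of normalization layers and the placement of the residual connection are simplifying assumptions.

```python
import torch
import torch.nn as nn

# Sketch of a ConvFFN block: 7x7 depthwise conv injects local spatial context,
# followed by a pointwise FFN with 4x expansion (normalization omitted).
class ConvFFN(nn.Module):
    def __init__(self, dim: int = 384, expansion: int = 4):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.ffn = nn.Sequential(
            nn.Conv2d(dim, dim * expansion, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(dim * expansion, dim, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ffn(self.dwconv(x))  # residual placement is an assumption

print(ConvFFN()(torch.randn(1, 384, 24, 24)).shape)  # torch.Size([1, 384, 24, 24])
```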

Each inter-stage transition applies a 2× downsampling via strided patch embedding layers. After three convolutional stages, the feature map is downsampled by 32× total (4× from stem, 2× per stage transition).

Figure: RepMixer, training vs. inference. During training, the input X passes through an identity skip and a 3×3 depthwise conv whose outputs are summed to give Y; at inference the two branches are reparameterized into a single fused depthwise conv, so there is no skip overhead.
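
A simplified sketch of this fusion is below: it folds the identity skip into the depthwise kernel, which is the core of the reparameterization trick. The real RepMixer block also folds its normalization layers, which we omit here.

```python
import torch
import torch.nn as nn

channels, k = 96, 3
dwconv = nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels, bias=True)

def fuse(conv: nn.Conv2d) -> nn.Conv2d:
    """Fold the identity skip connection into the depthwise conv kernel."""
    fused = nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels, bias=True)
    w = conv.weight.detach().clone()        # shape (C, 1, k, k)
    w[:, 0, k // 2, k // 2] += 1.0          # identity = 1 at each kernel centre
    fused.weight.data.copy_(w)
    fused.bias.data.copy_(conv.bias.detach())
    return fused

x = torch.randn(1, channels, 24, 24)
train_out = x + dwconv(x)                   # training-time: skip + depthwise conv
infer_out = fuse(dwconv)(x)                 # inference-time: single fused conv
print(torch.allclose(train_out, infer_out, atol=1e-5))  # True
```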

3 Stages 4–5: Multi-Head Self-Attention

Transformer Blocks (Global Context)

Stage 4: 768-ch, depth 4 → Stage 5: 1536-ch, depth 2

The final two stages switch from convolutions to multi-head self-attention (MHSA). By this point, the spatial resolution has been reduced dramatically — at 768×768 input, Stage 4 operates on a 24×24 feature map (576 tokens) and Stage 5 on 12×12 (144 tokens). Self-attention at these resolutions is cheap.
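
For concreteness, here is a toy sketch of self-attention over the Stage 5 map: the 12×12×1536 feature map is flattened into 144 tokens and attended over. The head count is our assumption, not a value from the paper.

```python
import torch
import torch.nn as nn

# Toy illustration: Stage-5 attention only ever sees the 12x12 (= 144 token) map.
mhsa = nn.MultiheadAttention(embed_dim=1536, num_heads=16, batch_first=True)  # num_heads assumed

feat = torch.randn(1, 1536, 12, 12)          # Stage-5 feature map at 768px input
tokens = feat.flatten(2).transpose(1, 2)     # (1, 144, 1536)
out, _ = mhsa(tokens, tokens, tokens)
print(out.shape)                             # torch.Size([1, 144, 1536])
```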

This is the critical difference from ViT: in a standard ViT-L/14, self-attention runs on all patches from the very first layer — at 1024×1024, that means attention over more than 4,000 tokens in every one of 24 layers. In FastViTHD, self-attention only runs on the final 6 layers with heavily downsampled maps. The expensive spatial processing is already done by the convolutional stages.

The extra stage that matters: FastViTHD adds Stage 5 (with 2× more downsampling) on top of the original FastViT architecture. This ensures self-attention operates on tensors downsampled by 64× instead of 32×, reducing both encoding latency and the number of tokens fed to the LLM. This single change generates 4× fewer tokens than a standard 32× downsampled encoder.

4 Token Count: Why It Matters

Resolution → Token Count Scaling

64× downsampling means far fewer tokens

Every visual token that enters the LLM adds to the prefill cost. FastViTHD's 64× total downsampling produces dramatically fewer tokens than ViT encoders:

Input Resolution | FastViTHD Tokens | ViT-L/14 Tokens | Reduction
256 × 256        | 16               | 256             | 16×
512 × 512        | 64               | 1,024           | 16×
768 × 768        | 144              | 2,304           | 16×
1024 × 1024      | 256              | 4,096           | 16×

At 768×768, FastViTHD produces just 144 tokens — fewer than ViT-L/14 produces at 336×336 (576 tokens). This means FastVLM can run at higher resolution with fewer tokens, getting better visual detail while being faster for the LLM.

FastViTHD Stage-by-Stage Summary

Stage   | Type         | Depth | Channels | Spatial (at 768px) | Role
Stem    | Strided Conv | –     | 96       | 192 × 192          | Initial 4× downsample
Stage 1 | RepMixer     | 2     | 96       | 96 × 96            | Low-level features
Stage 2 | RepMixer     | 12    | 192      | 48 × 48            | Mid-level features
Stage 3 | RepMixer     | 24    | 384      | 24 × 24            | High-level local features
Stage 4 | MHSA         | 4     | 768      | 24 × 24            | Global context (no downsample)
Stage 5 | MHSA         | 2     | 1536     | 12 × 12            | Global context + final downsample

4 — Projection & Language Model

1 MLP Projection Layer

Vision-Language Connector

N×1536 → N×d_LLM

The projection layer maps the 1536-dimensional output tokens from FastViTHD into the embedding space of the LLM decoder. It is a simple two-layer MLP with GELU activation, following the LLaVA-1.5 design — no cross-attention, no complex fusion, just a small MLP. The resulting visual tokens are concatenated with the text tokens and fed to the LLM.

For a 768×768 input image, this projects 144 visual tokens of dimension 1536 into the LLM's hidden dimension (e.g., 896 for Qwen2-0.5B, 1536 for Qwen2-1.5B, or 3584 for Qwen2-7B).
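
Below is a minimal sketch of such a two-layer GELU MLP connector. The class name and any structure beyond the dimensions quoted above are our own choices, not the released implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of a LLaVA-1.5-style two-layer MLP connector.
# Dimensions follow the text: 1536 encoder dim -> 896 for Qwen2-0.5B.
class Projector(nn.Module):
    def __init__(self, vision_dim: int = 1536, llm_dim: int = 896):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        return self.net(visual_tokens)

tokens = torch.randn(1, 144, 1536)      # 144 tokens from FastViTHD at 768x768
print(Projector()(tokens).shape)        # torch.Size([1, 144, 896])
```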

2 LLM Decoder

Autoregressive Language Model

Visual tokens + text tokens → generated response

FastVLM uses Qwen2 models as the LLM decoder, available in three sizes to match deployment targets:

  • Qwen2-0.5B — for mobile and edge devices (iPhone, iPad)
  • Qwen2-1.5B — mid-range balance of capability and speed
  • Qwen2-7B — full capability for desktop/server deployment

The LLM receives the projected visual tokens followed by the user's text query tokens. It then generates a text response autoregressively. Because FastViTHD produces so few visual tokens, the LLM prefill phase is fast — this is where the 85× TTFT improvement over LLaVA-OneVision comes from.

Figure: token flow into the LLM. The 144 visual tokens from FastViTHD (after MLP projection) are concatenated with the text query tokens (e.g., "What does this chart show?"); Qwen2 prefills all tokens, then generates the response autoregressively ("The chart shows quarterly revenue growth..."). 144 tokens instead of 576+ for ViT means a much faster TTFT.
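
The sketch below illustrates this prefill input at the tensor level; the query length and the random tensors are placeholders for real embeddings.

```python
import torch

d_llm = 896                                   # Qwen2-0.5B hidden size (from the text)
visual_tokens = torch.randn(1, 144, d_llm)    # projected FastViTHD tokens (768px input)
text_tokens = torch.randn(1, 32, d_llm)       # embedded user query (length assumed)

# The LLM prefill processes visual and text tokens in one sequence,
# so prefill cost grows directly with the number of visual tokens.
prefill_input = torch.cat([visual_tokens, text_tokens], dim=1)
print(prefill_input.shape)                    # torch.Size([1, 176, 896])
```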

5 — Training Pipeline

Multi-stage training following the LLaVA-1.5 recipe, scaled up for high resolution

FastVLM uses a three-stage training pipeline. The vision encoder (FastViTHD) is initialized from MobileCLIP pre-trained weights. The LLM decoder starts from Qwen2 pre-trained weights. Training progressively unfreezes components:

Stage 1: Projector Alignment

LLaVA-558K — 1 epoch — ~30 min on 8×H100

Only the MLP projection layer is trained. The vision encoder and LLM are both frozen. This stage teaches the projection to map visual tokens into the LLM's embedding space using 558K image-text pairs from LLaVA. Learning rate: 10⁻³, batch size: 256.

This is fast — about 30 minutes on a single 8-GPU node — because only the small MLP is being updated.

Stage 1.5: Resolution Scaling

CC3M + CC12M — 15M samples — vision encoder + projector

The vision encoder and projector are fine-tuned on 15 million samples from CC3M and CC12M at increasing resolutions. This teaches FastViTHD to handle high-resolution inputs that go beyond its CLIP pre-training distribution. The LLM remains frozen. Learning rate: 2×10⁻⁵.

Stage 2: Full Instruction Tuning

1.1M–11.9M instruction samples — all components trained

All three components (vision encoder, projector, and LLM) are jointly fine-tuned on instruction-following data — image-question-answer triplets that teach the model to follow user instructions about images. Learning rate: 2×10⁻⁵, batch size: 128. Uses cosine decay with 3% warmup and the AdamW optimizer.
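
As a rough sketch of that optimizer setup (our own code, with `params` and `total_steps` as placeholders), the cosine schedule with 3% linear warmup could be written as:

```python
import math
import torch

def make_optimizer_and_scheduler(params, total_steps: int):
    # AdamW with lr 2e-5, 3% linear warmup, then cosine decay to zero.
    optimizer = torch.optim.AdamW(params, lr=2e-5)
    warmup_steps = int(0.03 * total_steps)

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```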

Training stages — what gets updated:

  • Stage 1: encoder frozen, MLP trained, LLM frozen (558K pairs, ~30 min)
  • Stage 1.5: encoder trained, MLP trained, LLM frozen (15M samples, high-res scaling)
  • Stage 2: encoder trained, MLP trained, LLM trained (1.1M–11.9M instruction samples, full fine-tune)

All training runs on a single node with 8× NVIDIA H100-80GB GPUs.
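
A hedged sketch of that freeze schedule is shown below; `vision_encoder`, `projector`, and `llm` are placeholder attribute names, not the released training code.

```python
import torch.nn as nn

def set_trainable(model: nn.Module, stage: str) -> None:
    # Which sub-modules receive gradients in each training stage (per the summary above).
    trainable = {
        "stage1":   {"projector"},
        "stage1.5": {"vision_encoder", "projector"},
        "stage2":   {"vision_encoder", "projector", "llm"},
    }[stage]
    for name in ("vision_encoder", "projector", "llm"):
        for p in getattr(model, name).parameters():
            p.requires_grad = name in trainable
```
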
Pre-training lineage: FastViTHD is initialized from Apple's MobileCLIP MCi2 encoder (35.7M params, trained on DataCompDR). The extra Stage 5 parameters are initialized randomly and learned during training. This CLIP pre-training gives FastViTHD strong visual representations from the start — the VLM training then adapts these for instruction following.

6 — Performance & Benchmarks

FastVLM achieves the state-of-the-art trade-off between accuracy, latency, and model size. The key insight is that you don't need a large, slow vision encoder to get good VLM performance — a well-designed efficient encoder can match or beat heavy alternatives while running dramatically faster.

Speed Comparison

Time-to-First-Token (TTFT)

The metric users feel most
Figure: TTFT comparison, lower is better (FastVLM-0.5B vs. LLaVA-OV-0.5B: 85×; vs. Cambrian-1-8B: 7.9×; vs. ConvLLaVA-7B: 1.22×). Key results:

  • 85× faster TTFT than LLaVA-OneVision-0.5B
  • 3.4× smaller vision encoder
  • 7.9× faster TTFT than Cambrian-1-8B
  • 22% faster than ConvLLaVA

Benchmark Results

Benchmark           | FastVLM-0.5B | LLaVA-OV-0.5B | FastVLM-7B | Cambrian-1-8B
SeedBench           | Better       | Baseline      | Better     | Baseline
MMMU                | 49.9         | Comparable    | Better     | Comparable
TextVQA             | 74.8         | Baseline      | –          | –
DocVQA              | 78.9         | Baseline      | –          | –
GQA                 | 65.8         | Baseline      | –          | –
Vision Encoder Size | 125M         | 430M          | 125M       | 304M
TTFT (relative)     | 1× (ref.)    | 85× slower    | 1× (ref.)  | 7.9× slower
Compared to ConvLLaVA (the closest competitor using the same LLM and similar training data), FastVLM achieves 8.4% better performance on TextVQA and 12.5% improvement on DocVQA while running 22% faster. This demonstrates that the efficiency doesn't come at the cost of accuracy.

7 — Why Hybrid Encoders Scale Better

The paper includes a detailed efficiency analysis showing why the hybrid convolutional-transformer approach fundamentally scales better than pure ViTs as resolution increases.

ViT Scaling Problem

Self-attention cost grows quadratically with token count

In a standard ViT, doubling the image resolution quadruples the number of tokens (patches). Since self-attention is O(n²) in token count, quadrupling the tokens means roughly a 16× increase in attention computation for each layer, across all layers. At 1024×1024 with a patch size of 14, you get 5,329 tokens — and ViT runs self-attention on all of them in every one of 24+ layers.

FastViTHD Scaling Advantage

Convolutions handle spatial resolution; attention handles semantics

FastViTHD's convolutional stages (Stages 1–3) scale linearly with resolution because convolutions are local operations — they don't attend to every other position. The expensive self-attention in Stages 4–5 only sees the already-downsampled feature maps. This means doubling the input resolution adds mostly cheap convolutional work, not expensive attention work.
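
A quick back-of-the-envelope comparison (our own arithmetic, counting only the quadratic attention-matrix term and using the patch-14 figure from above) makes the gap concrete:

```python
# Token counts at 1024x1024 and the per-layer n^2 attention-matrix term.
res = 1024
vit_tokens = (res // 14) ** 2     # ViT-L/14: 73 x 73 = 5,329 tokens, in every layer
hd_tokens = (res // 64) ** 2      # FastViTHD final stage: 16 x 16 = 256 tokens

print(vit_tokens, hd_tokens)                  # 5329 256
print(round(vit_tokens**2 / hd_tokens**2))    # ~433x smaller attention matrix per layer
```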

Figure: computational cost vs. input resolution (256–1024px) for ViT-L/14 and FastViTHD; the gap grows with resolution.
No token pruning needed: Many VLMs use techniques like token merging (ToMe) or learned token pruning to reduce the token count after encoding. FastVLM eliminates the need for these entirely — the encoder natively outputs the right number of tokens. This is architecturally simpler and avoids the information loss that comes with post-hoc token reduction.

8 — On-Device Deployment

FastVLM is designed from the ground up for on-device inference — running directly on iPhones, iPads, and Macs without a cloud backend. This has privacy, latency, and offline availability benefits.

Apple Silicon Deployment

MLX framework — iPhone, iPad, Mac

Apple provides an iOS/macOS demo app built on the MLX framework. The models are available in Apple Silicon-compatible formats:

  • FastVLM-0.5B — runs on iPhone and iPad with minimal memory
  • FastVLM-1.5B — suitable for iPad Pro and Mac
  • FastVLM-7B — for Mac with sufficient unified memory

The small model size (roughly 0.6B total parameters: a 0.5B LLM plus the 125M encoder) combined with the low token count makes real-time vision-language interaction possible on mobile hardware. Users can point their camera at a document, chart, or scene and get instant language model responses — all processed locally.

Privacy by design: Because FastVLM runs entirely on-device, images never leave the user's device. This is critical for processing sensitive documents, medical images, or private photos — the model sees the image locally and generates a response without any cloud upload.

9 — Model Variants

Variant      | Vision Encoder   | LLM Decoder | Total Params | Target Device
FastVLM-0.5B | FastViTHD (125M) | Qwen2-0.5B  | ~625M        | iPhone, iPad
FastVLM-1.5B | FastViTHD (125M) | Qwen2-1.5B  | ~1.6B        | iPad Pro, Mac
FastVLM-7B   | FastViTHD (125M) | Qwen2-7B    | ~7.1B        | Mac, Server

All three variants share the same FastViTHD vision encoder — only the LLM decoder size changes. This means the vision encoding speed is identical across all variants; the difference is in the language model's capacity and generation quality.

10 — FastVLM vs. Prior Approaches

Approach        | Vision Encoder     | Token Strategy      | Encoder Size | Tokens (768px) | Speed
FastVLM         | FastViTHD (hybrid) | Native low-count    | 125M         | 144            | Fastest
LLaVA-1.5       | CLIP ViT-L/14      | All tokens          | 304M         | 2,304          | Slow
LLaVA-OneVision | SigLIP-SO400M      | All tokens + AnyRes | 430M         | 2,500+         | Slowest
ConvLLaVA       | ConvNeXt           | Conv downsample     | Similar      | ~576           | Moderate
Cambrian-1      | Multi-encoder      | Token merging       | 304M+        | 576            | Moderate
Architectural lineage: FastViTHD builds on two prior Apple works: FastViT (ICCV 2023), which introduced the RepMixer hybrid architecture for image classification, and MobileCLIP (CVPR 2024), which adapted FastViT for contrastive language-image pre-training. FastVLM extends this line by adding the extra downsampling stage and integrating with an LLM decoder for full vision-language capability.

11 — References & Further Reading