FastVLM: Efficient Vision Encoding for Vision Language Models

CVPR 2025 — Apple
Vision-Language · Efficient Inference · Hybrid Encoder · On-Device · Mobile · Apple

1 — The Problem to Solve

Vision-Language Models (VLMs) combine a vision encoder with a large language model to understand images and answer questions about them. You show the model a photo, a chart, or a document, and it can describe what it sees, answer questions, or extract text. Models like LLaVA, GPT-4V, and Gemini all follow this pattern.

The problem is speed. Current VLMs use heavy vision encoders like ViT-L/14 (304M parameters) or SigLIP-SO400M (430M parameters). These encoders are slow, especially at high resolutions — and high resolution is critical for reading text in documents, charts, and fine-grained visual details. Worse, higher resolution means more visual tokens fed to the LLM, which increases the time-to-first-token (TTFT) — the latency a user experiences before the model starts responding.

Figure: the VLM latency bottleneck. A high-resolution input (a 1024×1024 chart, document, or photo) goes through a heavy ViT-L/14 encoder (304M params; 576 tokens already at 336px), an MLP projection, and the LLM decoder prefill. More visual tokens mean more LLM prefill time and a higher TTFT before the text response starts.

This creates a three-way tension: you want high resolution for accuracy, few tokens for fast LLM prefilling, and a small, fast encoder for low latency. Most prior work addressed this by keeping heavy encoders and adding token pruning or merging on top. FastVLM takes a different approach — it redesigns the vision encoder itself.

FastVLM's thesis: Instead of using a heavy ViT encoder and then pruning its tokens, build a hybrid convolutional-transformer encoder that natively produces fewer tokens and runs faster — especially at high resolutions. Simplicity over complexity.

2 — Architecture Overview

FastVLM follows the standard three-component VLM design: a vision encoder (FastViTHD), an MLP projection layer, and an LLM decoder. The innovation is entirely in the vision encoder — the rest is deliberately kept simple.

Figure: FastVLM architecture. The vision encoder is FastViTHD (125M params): a 4× convolutional stem, three RepMixer (conv) stages (S1: d=2, c=96; S2: d=12, c=192; S3: d=24, c=384) and two multi-head self-attention stages (S4: d=4, c=768; S5: d=2, c=1536), where d = depth in blocks, c = channels, and MHSA = multi-head self-attention. Total downsampling is 64×, so a 768×768 input yields a 12×12 map — 144 tokens versus 576 for ViT-L/14 at 336px. An MLP projection maps the 1536-dim tokens to the LLM dimension, and a Qwen2 decoder (0.5B / 1.5B / 7B) prefills the 144 visual tokens plus the text query tokens and generates the answer autoregressively (e.g., "What text is in this chart?" → "The chart shows revenue..."). Vision is fast thanks to the hybrid conv-transformer encoder; the LLM is fast because it has fewer tokens to prefill.
Key design principle: FastVLM does not use token pruning, token merging, or any post-hoc compression. The efficiency comes entirely from the encoder architecture itself — a hybrid design where early convolution stages do the heavy spatial processing cheaply, and self-attention is only applied to already-downsampled feature maps in the final stages.

3 — FastViTHD: The Hybrid Vision Encoder

The core contribution — a 125M parameter hybrid encoder built on Apple's FastViT and MobileCLIP research

FastViTHD extends Apple's earlier FastViT architecture (ICCV 2023) with an additional fifth stage. The key insight is that convolutions are efficient at processing high-resolution spatial features, while self-attention excels at capturing global context. By stacking convolutions first and self-attention last, you get both — without the cost of running self-attention on high-resolution feature maps.

1 Convolutional Stem

Patch Embedding Stem

768×768×3 → 192×192×96

The stem downsamples the input image by 4× using strided convolutions. This is a critical efficiency gain over ViT, which uses a single large-stride patch embedding (14×14 or 16×16) — the stem gives the convolutional stages a head start at reduced resolution while preserving fine spatial detail through learned filters rather than hard patch boundaries.
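
As a rough illustration, here is a minimal PyTorch sketch of a strided-convolution stem with the same 4× downsampling and 96 output channels; the exact layer layout and activations of the real FastViT stem are assumptions here, not the released code.

```python
import torch
import torch.nn as nn

# Minimal sketch (assumed layout, not the released FastViT stem): two stride-2
# convolutions give the 4x spatial downsample, producing 96-channel features
# at 192x192 for a 768x768 input.
stem = nn.Sequential(
    nn.Conv2d(3, 48, kernel_size=3, stride=2, padding=1),
    nn.GELU(),
    nn.Conv2d(48, 96, kernel_size=3, stride=2, padding=1),
    nn.GELU(),
)

x = torch.randn(1, 3, 768, 768)
print(stem(x).shape)  # torch.Size([1, 96, 192, 192])
```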

2 Stages 1–3: RepMixer Convolutional Blocks

RepMixer Blocks (Efficient Token Mixing)

Stage 1: 96-ch, depth 2 → Stage 2: 192-ch, depth 12 → Stage 3: 384-ch, depth 24

The first three stages use RepMixer blocks — a structural reparameterization technique from FastViT. During training, each block uses a depthwise convolution branch and a skip connection for token mixing. At inference time, these branches are fused into a single depthwise convolution through reparameterization, eliminating the skip connection overhead entirely.

Each stage also includes a ConvFFN (convolutional feed-forward network) with 7×7 depthwise convolutions preceding a standard FFN with 4× expansion ratio. The depthwise convolutions inject local spatial information that pure FFNs miss.
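
The sketch below shows one way such a ConvFFN could look in PyTorch. The 7×7 depthwise convolution and 4× expansion follow the description above; the omission of normalization layers and the placement of the residual connection are simplifying assumptions.

```python
import torch
import torch.nn as nn

# Sketch of a ConvFFN block: 7x7 depthwise conv injects local spatial context,
# followed by a pointwise FFN with 4x expansion (normalization omitted).
class ConvFFN(nn.Module):
    def __init__(self, dim: int = 384, expansion: int = 4):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.ffn = nn.Sequential(
            nn.Conv2d(dim, dim * expansion, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(dim * expansion, dim, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ffn(self.dwconv(x))  # residual placement is an assumption

print(ConvFFN()(torch.randn(1, 384, 24, 24)).shape)  # torch.Size([1, 384, 24, 24])
```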

Each inter-stage transition applies a 2× downsampling via strided patch embedding layers. After three convolutional stages, the feature map is downsampled by 32× total (4× from stem, 2× per stage transition).

Figure: RepMixer, training vs. inference. During training, the input X passes through an identity skip and a 3×3 depthwise conv whose outputs are summed to give Y; at inference the two branches are reparameterized into a single fused depthwise conv, so there is no skip overhead.
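
A simplified sketch of this fusion is below: it folds the identity skip into the depthwise kernel, which is the core of the reparameterization trick. The real RepMixer block also folds its normalization layers, which we omit here.

```python
import torch
import torch.nn as nn

channels, k = 96, 3
dwconv = nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels, bias=True)

def fuse(conv: nn.Conv2d) -> nn.Conv2d:
    """Fold the identity skip connection into the depthwise conv kernel."""
    fused = nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels, bias=True)
    w = conv.weight.detach().clone()        # shape (C, 1, k, k)
    w[:, 0, k // 2, k // 2] += 1.0          # identity = 1 at each kernel centre
    fused.weight.data.copy_(w)
    fused.bias.data.copy_(conv.bias.detach())
    return fused

x = torch.randn(1, channels, 24, 24)
train_out = x + dwconv(x)                   # training-time: skip + depthwise conv
infer_out = fuse(dwconv)(x)                 # inference-time: single fused conv
print(torch.allclose(train_out, infer_out, atol=1e-5))  # True
```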

3 Stages 4–5: Multi-Head Self-Attention

Transformer Blocks (Global Context)

Stage 4: 768-ch, depth 4 → Stage 5: 1536-ch, depth 2

The final two stages switch from convolutions to multi-head self-attention (MHSA). By this point, the spatial resolution has been reduced dramatically — at 768×768 input, Stage 4 operates on a 24×24 feature map (576 tokens) and Stage 5 on 12×12 (144 tokens). Self-attention at these resolutions is cheap.
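
For concreteness, here is a toy sketch of self-attention over the Stage 5 map: the 12×12×1536 feature map is flattened into 144 tokens and attended over. The head count is our assumption, not a value from the paper.

```python
import torch
import torch.nn as nn

# Toy illustration: Stage-5 attention only ever sees the 12x12 (= 144 token) map.
mhsa = nn.MultiheadAttention(embed_dim=1536, num_heads=16, batch_first=True)  # num_heads assumed

feat = torch.randn(1, 1536, 12, 12)          # Stage-5 feature map at 768px input
tokens = feat.flatten(2).transpose(1, 2)     # (1, 144, 1536)
out, _ = mhsa(tokens, tokens, tokens)
print(out.shape)                             # torch.Size([1, 144, 1536])
```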

This is the critical difference from ViT: in a standard ViT-L/14, self-attention runs on all patches from the very first layer — at 1024×1024, that means attention over more than 4,000 tokens in every one of 24 layers. In FastViTHD, self-attention only runs on the final 6 layers with heavily downsampled maps. The expensive spatial processing is already done by the convolutional stages.

The extra stage that matters: FastViTHD adds Stage 5 (with 2× more downsampling) on top of the original FastViT architecture. This ensures self-attention operates on tensors downsampled by 64× instead of 32×, reducing both encoding latency and the number of tokens fed to the LLM. This single change generates 4× fewer tokens than a standard 32× downsampled encoder.

4 Token Count: Why It Matters

Resolution → Token Count Scaling

64× downsampling means far fewer tokens

Every visual token that enters the LLM adds to the prefill cost. FastViTHD's 64× total downsampling produces dramatically fewer tokens than ViT encoders:

Input Resolution | FastViTHD Tokens | ViT-L/14 Tokens | Reduction
256 × 256        | 16               | 256             | 16×
512 × 512        | 64               | 1,024           | 16×
768 × 768        | 144              | 2,304           | 16×
1024 × 1024      | 256              | 4,096           | 16×

At 768×768, FastViTHD produces just 144 tokens — fewer than ViT-L/14 produces at 336×336 (576 tokens). This means FastVLM can run at higher resolution with fewer tokens, getting better visual detail while being faster for the LLM.

FastViTHD Stage-by-Stage Summary

Stage   | Type         | Depth | Channels | Spatial (at 768px) | Role
Stem    | Strided Conv | –     | 96       | 192 × 192          | Initial 4× downsample
Stage 1 | RepMixer     | 2     | 96       | 96 × 96            | Low-level features
Stage 2 | RepMixer     | 12    | 192      | 48 × 48            | Mid-level features
Stage 3 | RepMixer     | 24    | 384      | 24 × 24            | High-level local features
Stage 4 | MHSA         | 4     | 768      | 24 × 24            | Global context (no downsample)
Stage 5 | MHSA         | 2     | 1536     | 12 × 12            | Global context + final downsample

4 — Projection & Language Model

1 MLP Projection Layer

Vision-Language Connector

N×1536 → N×d_LLM

The projection layer maps the 1536-dimensional output tokens from FastViTHD into the embedding space of the LLM decoder. It is a simple two-layer MLP with GELU activation, following the LLaVA-1.5 design — no cross-attention, no complex fusion, just a small MLP. The resulting visual tokens are concatenated with the text tokens and fed to the LLM.

For a 768×768 input image, this projects 144 visual tokens of dimension 1536 into the LLM's hidden dimension (e.g., 896 for Qwen2-0.5B, 1536 for Qwen2-1.5B, or 3584 for Qwen2-7B).
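
Below is a minimal sketch of such a two-layer GELU MLP connector. The class name and any structure beyond the dimensions quoted above are our own choices, not the released implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of a LLaVA-1.5-style two-layer MLP connector.
# Dimensions follow the text: 1536 encoder dim -> 896 for Qwen2-0.5B.
class Projector(nn.Module):
    def __init__(self, vision_dim: int = 1536, llm_dim: int = 896):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        return self.net(visual_tokens)

tokens = torch.randn(1, 144, 1536)      # 144 tokens from FastViTHD at 768x768
print(Projector()(tokens).shape)        # torch.Size([1, 144, 896])
```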

2 LLM Decoder

Autoregressive Language Model

Visual tokens + text tokens → generated response

FastVLM uses Qwen2 models as the LLM decoder, available in three sizes to match deployment targets:

  • Qwen2-0.5B — for mobile and edge devices (iPhone, iPad)
  • Qwen2-1.5B — mid-range balance of capability and speed
  • Qwen2-7B — full capability for desktop/server deployment

The LLM receives the projected visual tokens followed by the user's text query tokens. It then generates a text response autoregressively. Because FastViTHD produces so few visual tokens, the LLM prefill phase is fast — this is where the 85× TTFT improvement over LLaVA-OneVision comes from.

Figure: token flow into the LLM. The 144 visual tokens from FastViTHD (after MLP projection) are concatenated with the text query tokens (e.g., "What does this chart show?"); Qwen2 prefills all tokens, then generates the response autoregressively ("The chart shows quarterly revenue growth..."). 144 tokens instead of 576+ for ViT means a much faster TTFT.
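
The sketch below illustrates this prefill input at the tensor level; the query length and the random tensors are placeholders for real embeddings.

```python
import torch

d_llm = 896                                   # Qwen2-0.5B hidden size (from the text)
visual_tokens = torch.randn(1, 144, d_llm)    # projected FastViTHD tokens (768px input)
text_tokens = torch.randn(1, 32, d_llm)       # embedded user query (length assumed)

# The LLM prefill processes visual and text tokens in one sequence,
# so prefill cost grows directly with the number of visual tokens.
prefill_input = torch.cat([visual_tokens, text_tokens], dim=1)
print(prefill_input.shape)                    # torch.Size([1, 176, 896])
```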

5 — Training Pipeline

Multi-stage training following the LLaVA-1.5 recipe, scaled up for high resolution

FastVLM uses a three-stage training pipeline. The vision encoder (FastViTHD) is initialized from MobileCLIP pre-trained weights. The LLM decoder starts from Qwen2 pre-trained weights. Training progressively unfreezes components:

Stage 1: Projector Alignment

LLaVA-558K — 1 epoch — ~30 min on 8×H100

Only the MLP projection layer is trained. The vision encoder and LLM are both frozen. This stage teaches the projection to map visual tokens into the LLM's embedding space using 558K image-text pairs from LLaVA. Learning rate: 10⁻³, batch size: 256.

This is fast — about 30 minutes on a single 8-GPU node — because only the small MLP is being updated.

Stage 1.5: Resolution Scaling

CC3M + CC12M — 15M samples — vision encoder + projector

The vision encoder and projector are fine-tuned on 15 million samples from CC3M and CC12M at increasing resolutions. This teaches FastViTHD to handle high-resolution inputs that go beyond its CLIP pre-training distribution. The LLM remains frozen. Learning rate: 2×10⁻⁵.

Stage 2: Full Instruction Tuning

1.1M–11.9M instruction samples — all components trained

All three components (vision encoder, projector, and LLM) are jointly fine-tuned on instruction-following data — image-question-answer triplets that teach the model to follow user instructions about images. Learning rate: 2×10⁻⁵, batch size: 128. Uses cosine decay with 3% warmup and the AdamW optimizer.
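
As a rough sketch of that optimizer setup (our own code, with `params` and `total_steps` as placeholders), the cosine schedule with 3% linear warmup could be written as:

```python
import math
import torch

def make_optimizer_and_scheduler(params, total_steps: int):
    # AdamW with lr 2e-5, 3% linear warmup, then cosine decay to zero.
    optimizer = torch.optim.AdamW(params, lr=2e-5)
    warmup_steps = int(0.03 * total_steps)

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```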

Training stages — what gets updated:

  • Stage 1: encoder frozen, MLP trained, LLM frozen (558K pairs, ~30 min)
  • Stage 1.5: encoder trained, MLP trained, LLM frozen (15M samples, high-res scaling)
  • Stage 2: encoder trained, MLP trained, LLM trained (1.1M–11.9M instruction samples, full fine-tune)

All training runs on a single node with 8× NVIDIA H100-80GB GPUs.
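
A hedged sketch of that freeze schedule is shown below; `vision_encoder`, `projector`, and `llm` are placeholder attribute names, not the released training code.

```python
import torch.nn as nn

def set_trainable(model: nn.Module, stage: str) -> None:
    # Which sub-modules receive gradients in each training stage (per the summary above).
    trainable = {
        "stage1":   {"projector"},
        "stage1.5": {"vision_encoder", "projector"},
        "stage2":   {"vision_encoder", "projector", "llm"},
    }[stage]
    for name in ("vision_encoder", "projector", "llm"):
        for p in getattr(model, name).parameters():
            p.requires_grad = name in trainable
```
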
Pre-training lineage: FastViTHD is initialized from Apple's MobileCLIP MCi2 encoder (35.7M params, trained on DataCompDR). The extra Stage 5 parameters are initialized randomly and learned during training. This CLIP pre-training gives FastViTHD strong visual representations from the start — the VLM training then adapts these for instruction following.

6 — Performance & Benchmarks

FastVLM achieves the state-of-the-art trade-off between accuracy, latency, and model size. The key insight is that you don't need a large, slow vision encoder to get good VLM performance — a well-designed efficient encoder can match or beat heavy alternatives while running dramatically faster.

Speed Comparison

Time-to-First-Token (TTFT)

The metric users feel most
Figure: TTFT comparison, lower is better (FastVLM-0.5B vs. LLaVA-OV-0.5B: 85×; vs. Cambrian-1-8B: 7.9×; vs. ConvLLaVA-7B: 1.22×). Key results:

  • 85× faster TTFT than LLaVA-OneVision-0.5B
  • 3.4× smaller vision encoder
  • 7.9× faster TTFT than Cambrian-1-8B
  • 22% faster than ConvLLaVA

Benchmark Results

Benchmark           | FastVLM-0.5B | LLaVA-OV-0.5B | FastVLM-7B | Cambrian-1-8B
SeedBench           | Better       | Baseline      | Better     | Baseline
MMMU                | 49.9         | Comparable    | Better     | Comparable
TextVQA             | 74.8         | Baseline      | –          | –
DocVQA              | 78.9         | Baseline      | –          | –
GQA                 | 65.8         | Baseline      | –          | –
Vision Encoder Size | 125M         | 430M          | 125M       | 304M
TTFT (relative)     | 1× (ref.)    | 85× slower    | 1× (ref.)  | 7.9× slower
Compared to ConvLLaVA (the closest competitor using the same LLM and similar training data), FastVLM achieves 8.4% better performance on TextVQA and 12.5% improvement on DocVQA while running 22% faster. This demonstrates that the efficiency doesn't come at the cost of accuracy.

7 — Why Hybrid Encoders Scale Better

The paper includes a detailed efficiency analysis showing why the hybrid convolutional-transformer approach fundamentally scales better than pure ViTs as resolution increases.

ViT Scaling Problem

Self-attention cost grows quadratically with token count

In a standard ViT, doubling the image resolution quadruples the number of tokens (patches). Since self-attention is O(n²) in token count, quadrupling the tokens means roughly a 16× increase in attention computation for each layer, across all layers. At 1024×1024 with a patch size of 14, you get 5,329 tokens — and ViT runs self-attention on all of them in every one of 24+ layers.

FastViTHD Scaling Advantage

Convolutions handle spatial resolution; attention handles semantics

FastViTHD's convolutional stages (Stages 1–3) scale linearly with resolution because convolutions are local operations — they don't attend to every other position. The expensive self-attention in Stages 4–5 only sees the already-downsampled feature maps. This means doubling the input resolution adds mostly cheap convolutional work, not expensive attention work.
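
A quick back-of-the-envelope comparison (our own arithmetic, counting only the quadratic attention-matrix term and using the patch-14 figure from above) makes the gap concrete:

```python
# Token counts at 1024x1024 and the per-layer n^2 attention-matrix term.
res = 1024
vit_tokens = (res // 14) ** 2     # ViT-L/14: 73 x 73 = 5,329 tokens, in every layer
hd_tokens = (res // 64) ** 2      # FastViTHD final stage: 16 x 16 = 256 tokens

print(vit_tokens, hd_tokens)                  # 5329 256
print(round(vit_tokens**2 / hd_tokens**2))    # ~433x smaller attention matrix per layer
```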

Figure: computational cost vs. input resolution (256–1024px) for ViT-L/14 and FastViTHD; the gap grows with resolution.
No token pruning needed: Many VLMs use techniques like token merging (ToMe) or learned token pruning to reduce the token count after encoding. FastVLM eliminates the need for these entirely — the encoder natively outputs the right number of tokens. This is architecturally simpler and avoids the information loss that comes with post-hoc token reduction.

8 — On-Device Deployment

FastVLM is designed from the ground up for on-device inference — running directly on iPhones, iPads, and Macs without a cloud backend. This has privacy, latency, and offline availability benefits.

Apple Silicon Deployment

MLX framework — iPhone, iPad, Mac

Apple provides an iOS/macOS demo app built on the MLX framework. The models are available in Apple Silicon-compatible formats:

  • FastVLM-0.5B — runs on iPhone and iPad with minimal memory
  • FastVLM-1.5B — suitable for iPad Pro and Mac
  • FastVLM-7B — for Mac with sufficient unified memory

The small model size (roughly 0.6B total parameters: a 0.5B LLM plus the 125M encoder) combined with the low token count makes real-time vision-language interaction possible on mobile hardware. Users can point their camera at a document, chart, or scene and get instant language model responses — all processed locally.

Privacy by design: Because FastVLM runs entirely on-device, images never leave the user's device. This is critical for processing sensitive documents, medical images, or private photos — the model sees the image locally and generates a response without any cloud upload.

9 — Model Variants

Variant      | Vision Encoder   | LLM Decoder | Total Params | Target Device
FastVLM-0.5B | FastViTHD (125M) | Qwen2-0.5B  | ~625M        | iPhone, iPad
FastVLM-1.5B | FastViTHD (125M) | Qwen2-1.5B  | ~1.6B        | iPad Pro, Mac
FastVLM-7B   | FastViTHD (125M) | Qwen2-7B    | ~7.1B        | Mac, Server

All three variants share the same FastViTHD vision encoder — only the LLM decoder size changes. This means the vision encoding speed is identical across all variants; the difference is in the language model's capacity and generation quality.

10 — FastVLM vs. Prior Approaches

Approach        | Vision Encoder     | Token Strategy      | Encoder Size | Tokens (768px) | Speed
FastVLM         | FastViTHD (hybrid) | Native low-count    | 125M         | 144            | Fastest
LLaVA-1.5       | CLIP ViT-L/14      | All tokens          | 304M         | 2,304          | Slow
LLaVA-OneVision | SigLIP-SO400M      | All tokens + AnyRes | 430M         | 2,500+         | Slowest
ConvLLaVA       | ConvNeXt           | Conv downsample     | Similar      | ~576           | Moderate
Cambrian-1      | Multi-encoder      | Token merging       | 304M+        | 576            | Moderate
Architectural lineage: FastViTHD builds on two prior Apple works: FastViT (ICCV 2023), which introduced the RepMixer hybrid architecture for image classification, and MobileCLIP (CVPR 2024), which adapted FastViT for contrastive language-image pre-training. FastVLM extends this line by adding the extra downsampling stage and integrating with an LLM decoder for full vision-language capability.

11 — References & Further Reading