FastVLM: Efficient Vision Encoding for Vision Language Models
1 — The Problem to Solve
Vision-Language Models (VLMs) combine a vision encoder with a large language model to understand images and answer questions about them. You show the model a photo, a chart, or a document, and it can describe what it sees, answer questions, or extract text. Models like LLaVA, GPT-4V, and Gemini all follow this pattern.
The problem is speed. Current VLMs use heavy vision encoders like ViT-L/14 (304M parameters) or SigLIP-SO400M (430M parameters). These encoders are slow, especially at high resolutions — and high resolution is critical for reading text in documents, charts, and fine-grained visual details. Worse, higher resolution means more visual tokens fed to the LLM, which increases the time-to-first-token (TTFT) — the latency a user experiences before the model starts responding.
This creates a three-way tension: you want high resolution for accuracy, few tokens for fast LLM prefilling, and a small, fast encoder for low latency. Most prior work addressed this by keeping heavy encoders and adding token pruning or merging on top. FastVLM takes a different approach — it redesigns the vision encoder itself.
2 — Architecture Overview
FastVLM follows the standard three-component VLM design: a vision encoder (FastViTHD), an MLP projection layer, and an LLM decoder. The innovation is entirely in the vision encoder — the rest is deliberately kept simple.
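A minimal sketch of this composition, with toy stand-ins for all three components (the dimensions and modules below are illustrative, not the released FastVLM implementation):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the three components; only the wiring matters here.
vision_encoder = nn.Sequential(                    # plays the role of FastViTHD
    nn.Conv2d(3, 64, kernel_size=16, stride=16),   # toy patchifier
    nn.Flatten(2),                                 # (B, C, H*W)
)
projector = nn.Sequential(nn.Linear(64, 896), nn.GELU(), nn.Linear(896, 896))
llm = nn.Identity()                                # stands in for the Qwen2 decoder

def vlm_prefill(image, text_embeds):
    vis = vision_encoder(image).transpose(1, 2)    # (B, N_vis, 64) visual tokens
    vis = projector(vis)                           # map into the LLM embedding space
    return llm(torch.cat([vis, text_embeds], dim=1))  # visual tokens, then prompt

out = vlm_prefill(torch.randn(1, 3, 256, 256), torch.randn(1, 12, 896))
print(out.shape)  # torch.Size([1, 268, 896]): 256 visual tokens + 12 prompt tokens
```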
3 — FastViTHD: The Hybrid Vision Encoder
The core contribution — a 125M parameter hybrid encoder built on Apple's FastViT and MobileCLIP research
FastViTHD extends Apple's earlier FastViT architecture (ICCV 2023) with an additional fifth stage. The key insight is that convolutions are efficient at processing high-resolution spatial features, while self-attention excels at capturing global context. By stacking convolutions first and self-attention last, you get both — without the cost of running self-attention on high-resolution feature maps.
1 Convolutional Stem
Patch Embedding Stem
The stem downsamples the input image by 4× using strided convolutions. This is a critical efficiency gain over ViT, which flattens the image in one step with a single large-patch embedding (14×14 patches in ViT-L/14) — the stem gives the convolutional stages a head start at reduced resolution while preserving fine spatial detail through learned filters rather than hard patch boundaries.
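A sketch of this kind of stem, assuming two stride-2 3×3 convolutions (the exact layer count, widths, and normalization of the real FastViTHD stem may differ):

```python
import torch
import torch.nn as nn

# Illustrative 4x-downsampling stem: two stride-2 3x3 convs with learned filters,
# rather than one hard 16x16 (or 14x14) patch cut.
stem = nn.Sequential(
    nn.Conv2d(3, 48, kernel_size=3, stride=2, padding=1),
    nn.BatchNorm2d(48),
    nn.GELU(),
    nn.Conv2d(48, 96, kernel_size=3, stride=2, padding=1),
    nn.BatchNorm2d(96),
    nn.GELU(),
)

x = torch.randn(1, 3, 768, 768)
print(stem(x).shape)  # torch.Size([1, 96, 192, 192]): 4x smaller spatially
```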
2 Stages 1–3: RepMixer Convolutional Blocks
RepMixer Blocks (Efficient Token Mixing)
The first three stages use RepMixer blocks — a structural reparameterization technique from FastViT. During training, each block uses a depthwise convolution branch and a skip connection for token mixing. At inference time, these branches are fused into a single depthwise convolution through reparameterization, eliminating the skip connection overhead entirely.
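The reparameterization trick itself is easy to see in a toy form. The sketch below trains with a depthwise convolution plus a skip connection and folds the skip into the kernel afterwards; it omits the BatchNorm folding that the real FastViT blocks also perform:

```python
import torch
import torch.nn as nn

class RepMixerSketch(nn.Module):
    """Toy structural reparameterization: skip + depthwise conv -> one conv."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, kernel_size,
                            padding=kernel_size // 2, groups=channels)

    def forward(self, x):                  # training-time form
        return x + self.dw(x)

    def reparameterize(self) -> nn.Conv2d:
        """Return a single depthwise conv that computes skip + conv exactly."""
        fused = nn.Conv2d(self.dw.in_channels, self.dw.out_channels,
                          self.dw.kernel_size, padding=self.dw.padding,
                          groups=self.dw.groups)
        identity = torch.zeros_like(self.dw.weight)
        center = self.dw.kernel_size[0] // 2
        identity[:, 0, center, center] = 1.0     # the skip, written as a kernel
        fused.weight.data = self.dw.weight.data + identity
        fused.bias.data = self.dw.bias.data.clone()
        return fused

block = RepMixerSketch(96)
x = torch.randn(1, 96, 48, 48)
fused = block.reparameterize()
print(torch.allclose(block(x), fused(x), atol=1e-5))  # True: same function, one conv
```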
Each stage also includes a ConvFFN (convolutional feed-forward network) with 7×7 depthwise convolutions preceding a standard FFN with 4× expansion ratio. The depthwise convolutions inject local spatial information that pure FFNs miss.
Each inter-stage transition applies a 2× downsampling via strided patch embedding layers. After the three convolutional stages, the feature map is downsampled by 16× in total (4× from the stem, then 2× at each of the two stage transitions); the transitions into Stages 4 and 5 bring the overall reduction to 64×.
3 Stages 4–5: Multi-Head Self-Attention
Transformer Blocks (Global Context)
The final two stages switch from convolutions to multi-head self-attention (MHSA). By this point, the spatial resolution has been reduced dramatically — at 768×768 input, Stage 4 operates on a 24×24 feature map (576 tokens) and Stage 5 on 12×12 (144 tokens). Self-attention at these resolutions is cheap.
This is the critical difference from ViT: in a standard ViT-L/14, self-attention runs on all patches from the very first layer — at 1024×1024, that means attention over roughly 5,300 tokens (73 × 73 patches) in every one of 24 layers. In FastViTHD, self-attention only runs in the final six blocks (Stages 4–5) on heavily downsampled maps. The expensive spatial processing is already done by the convolutional stages.
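A back-of-the-envelope comparison makes the gap concrete. The count below includes only the quadratic attention-matrix term (n² score entries per layer) and uses the layer and resolution figures quoted above:

```python
# Pairwise attention-score entries at 1024x1024 input (quadratic term only).
def attention_pairs(tokens_per_layer):
    return sum(n * n for n in tokens_per_layer)

# ViT-L/14: ~73 x 73 = 5,329 tokens in every one of 24 layers.
vit = attention_pairs([73 * 73] * 24)

# FastViTHD: attention only in the last 6 blocks,
# 4 blocks on the 1/32-scale map (32x32 tokens) and 2 on the 1/64-scale map (16x16).
fastvithd = attention_pairs([32 * 32] * 4 + [16 * 16] * 2)

print(f"ViT-L/14:  {vit:,}")        # ~680 million score entries
print(f"FastViTHD: {fastvithd:,}")  # ~4.3 million score entries
print(f"ratio:     {vit / fastvithd:.0f}x")
```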
4 Token Count: Why It Matters
Resolution → Token Count Scaling
Every visual token that enters the LLM adds to the prefill cost. FastViTHD's 64× total downsampling produces dramatically fewer tokens than ViT encoders:
| Input Resolution | FastViTHD Tokens | ViT-L/14 Tokens | Reduction |
|---|---|---|---|
| 256 × 256 | 16 | ~324 | ~20× |
| 512 × 512 | 64 | ~1,296 | ~20× |
| 768 × 768 | 144 | ~2,916 | ~20× |
| 1024 × 1024 | 256 | 5,329 | ~21× |
At 768×768, FastViTHD produces just 144 tokens — fewer than ViT-L/14 produces at 336×336 (576 tokens). This means FastVLM can run at higher resolution with fewer tokens, getting better visual detail while being faster for the LLM.
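The pattern in the table is just the output stride squared. A one-liner reproduces it, approximating ViT-L/14's token count as ⌊resolution/14⌋²:

```python
# Tokens handed to the LLM for a square input, given the encoder's output stride.
def visual_tokens(resolution: int, stride: int) -> int:
    return (resolution // stride) ** 2

for res in (256, 512, 768, 1024):
    # FastViTHD (total stride 64) vs. a ViT-L/14-style encoder (patch stride 14)
    print(res, visual_tokens(res, 64), visual_tokens(res, 14))
```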
FastViTHD Stage-by-Stage Summary
| Stage | Type | Depth | Channels | Spatial (at 768px) | Role |
|---|---|---|---|---|---|
| Stem | Strided Conv | — | 96 | 192 × 192 | Initial 4× downsample |
| Stage 1 | RepMixer | 2 | 96 | 192 × 192 | Low-level features |
| Stage 2 | RepMixer | 12 | 192 | 96 × 96 | Mid-level features |
| Stage 3 | RepMixer | 24 | 384 | 48 × 48 | High-level local features |
| Stage 4 | MHSA | 4 | 768 | 24 × 24 | Global context |
| Stage 5 | MHSA | 2 | 1536 | 12 × 12 | Global context, 64× final output |
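The table translates almost directly into code. The skeleton below mirrors the depths, widths, and downsampling schedule with simplified stand-in blocks (plain convolutions everywhere, including Stages 4–5, so it is a shape sketch rather than the real RepMixer/MHSA implementation):

```python
import torch
import torch.nn as nn

def block(c):               # stand-in for a RepMixer/ConvFFN or attention block
    return nn.Sequential(nn.Conv2d(c, c, 3, padding=1, groups=c),
                         nn.Conv2d(c, c, 1), nn.GELU())

def downsample(cin, cout):  # strided patch embedding between stages
    return nn.Conv2d(cin, cout, 3, stride=2, padding=1)

stem = nn.Sequential(nn.Conv2d(3, 96, 3, stride=2, padding=1), nn.GELU(),
                     nn.Conv2d(96, 96, 3, stride=2, padding=1), nn.GELU())

stages = nn.Sequential(
    *[block(96) for _ in range(2)],     downsample(96, 192),    # Stage 1
    *[block(192) for _ in range(12)],   downsample(192, 384),   # Stage 2
    *[block(384) for _ in range(24)],   downsample(384, 768),   # Stage 3
    *[block(768) for _ in range(4)],    downsample(768, 1536),  # Stage 4 (MHSA in the real model)
    *[block(1536) for _ in range(2)],                           # Stage 5 (MHSA in the real model)
)

feat = stages(stem(torch.randn(1, 3, 768, 768)))
print(feat.shape)  # torch.Size([1, 1536, 12, 12]): 144 tokens of width 1536
```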
4 — Projection & Language Model
1 MLP Projection Layer
Vision-Language Connector
The projection layer maps the 1536-dimensional output tokens from FastViTHD into the embedding space of the LLM decoder. This is a simple two-layer MLP with GELU activation — following the LLaVA-1.5 design. No cross-attention, no complex fusion — just a small MLP applied to each token independently. The resulting visual tokens are concatenated with the text tokens and fed to the LLM.
For a 768×768 input image, this projects 144 visual tokens of dimension 1536 into the LLM's hidden dimension (e.g., 896 for Qwen2-0.5B, 1536 for Qwen2-1.5B, or 3584 for Qwen2-7B).
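Because the connector is applied per token, a sketch is only a few lines (dimensions follow the text: 1536-wide FastViTHD tokens into Qwen2-0.5B's 896-wide embedding space):

```python
import torch
import torch.nn as nn

# LLaVA-1.5-style connector: linear -> GELU -> linear, applied to each token.
projector = nn.Sequential(
    nn.Linear(1536, 896),
    nn.GELU(),
    nn.Linear(896, 896),
)

visual_tokens = torch.randn(1, 144, 1536)   # 768x768 input -> 144 tokens
print(projector(visual_tokens).shape)       # torch.Size([1, 144, 896])
```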
2 LLM Decoder
Autoregressive Language Model
FastVLM uses Qwen2 models as the LLM decoder, available in three sizes to match deployment targets:
- Qwen2-0.5B — for mobile and edge devices (iPhone, iPad)
- Qwen2-1.5B — mid-range balance of capability and speed
- Qwen2-7B — full capability for desktop/server deployment
The LLM receives the projected visual tokens followed by the user's text query tokens. It then generates a text response autoregressively. Because FastViTHD produces so few visual tokens, the LLM prefill phase is fast — this, together with the fast encoder itself, is where the 85× TTFT improvement over LLaVA-OneVision-0.5B comes from.
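A rough sense of the prefill saving, using the token counts from the table in the previous section and an assumed 50-token prompt:

```python
# Prefill length = visual tokens + prompt tokens; the LLM processes all of them
# before it can emit its first output token.
prompt_tokens = 50  # illustrative query length

for name, vis in [("FastVLM @ 768px", 144), ("ViT-L/14-style VLM @ 768px", 2916)]:
    print(f"{name}: prefill over {vis + prompt_tokens} tokens")
```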
5 — Training Pipeline
Multi-stage training following the LLaVA-1.5 recipe, scaled up for high resolution
FastVLM uses a three-stage training pipeline. The vision encoder (FastViTHD) is initialized from MobileCLIP pre-trained weights. The LLM decoder starts from Qwen2 pre-trained weights. Training progressively unfreezes components:
Stage 1: Projector Alignment
Only the MLP projection layer is trained. The vision encoder and LLM are both frozen. This stage teaches the projection to map visual tokens into the LLM's embedding space using 558K image-text pairs from LLaVA. Learning rate: 1×10⁻³, batch size: 256.
This is fast — about 30 minutes on a single 8-GPU node — because only the small MLP is being updated.
Stage 1.5: Resolution Scaling
The vision encoder and projector are fine-tuned on 15 million samples from CC3M and CC12M at increasing resolutions. This teaches FastViTHD to handle high-resolution inputs that go beyond its CLIP pre-training distribution. The LLM remains frozen. Learning rate: 2×10⁻⁵.
Stage 2: Full Instruction Tuning
All three components (vision encoder, projector, and LLM) are jointly fine-tuned on instruction-following data — image-question-answer triplets that teach the model to follow user instructions about images. Learning rate: 2×10⁻⁵, batch size: 128. Uses cosine decay with 3% warmup and AdamW optimizer.
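A sketch of the freeze/unfreeze schedule, using the learning rates quoted above (the attribute names vision_encoder, projector, and llm and the configure_stage helper are illustrative, not the official training code):

```python
import torch
import torch.nn as nn

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(model, stage):
    if stage == "1":               # projector alignment: only the MLP trains
        set_trainable(model, False)
        set_trainable(model.projector, True)
        lr = 1e-3
    elif stage == "1.5":           # resolution scaling: encoder + projector train
        set_trainable(model, True)
        set_trainable(model.llm, False)
        lr = 2e-5
    else:                          # stage 2: full instruction tuning
        set_trainable(model, True)
        lr = 2e-5
    params = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(params, lr=lr)

# Tiny stand-in model just to exercise the schedule.
model = nn.Module()
model.vision_encoder, model.projector, model.llm = nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4)
for stage in ("1", "1.5", "2"):
    opt = configure_stage(model, stage)
    n = sum(p.numel() for g in opt.param_groups for p in g["params"])
    print(stage, n, "trainable parameters")
```

The batch sizes and the cosine schedule with 3% warmup would sit in the trainer's LR scheduler rather than the optimizer itself.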
6 — Performance & Benchmarks
FastVLM achieves the state-of-the-art trade-off between accuracy, latency, and model size. The key insight is that you don't need a large, slow vision encoder to get good VLM performance — a well-designed efficient encoder can match or beat heavy alternatives while running dramatically faster.
Speed Comparison
Time-to-First-Token (TTFT)
TTFT measures the latency from submitting an image and prompt to receiving the first generated token: vision encoding plus LLM prefill. FastVLM-0.5B reaches its first token roughly 85× faster than LLaVA-OneVision-0.5B, and FastVLM-7B roughly 7.9× faster than Cambrian-1-8B (see the TTFT row in the table below).
Benchmark Results
| Benchmark | FastVLM-0.5B | LLaVA-OV-0.5B | FastVLM-7B | Cambrian-1-8B |
|---|---|---|---|---|
| SeedBench | Better | Baseline | Better | Baseline |
| MMMU | 49.9 | Comparable | Better | Comparable |
| TextVQA | — | — | 74.8 | Baseline |
| DocVQA | — | — | 78.9 | Baseline |
| GQA | — | — | 65.8 | Baseline |
| Vision Encoder Size | 125M | 430M | 125M | 304M |
| TTFT (relative to FastVLM) | 1× (baseline) | 85× slower | 1× (baseline) | 7.9× slower |
7 — Why Hybrid Encoders Scale Better
The paper includes a detailed efficiency analysis showing why the hybrid convolutional-transformer approach fundamentally scales better than pure ViTs as resolution increases.
ViT Scaling Problem
In a standard ViT, doubling the image resolution quadruples the number of tokens (patches). Since self-attention is O(n²) in token count, quadrupling the tokens multiplies the attention computation by roughly 16× in each layer, across all layers. At 1024×1024 with a patch size of 14, you get 5,329 tokens — and ViT runs self-attention on all of them in every one of 24+ layers.
FastViTHD Scaling Advantage
FastViTHD's convolutional stages (Stages 1–3) scale linearly with resolution because convolutions are local operations — they don't attend to every other position. The expensive self-attention in Stages 4–5 only sees the already-downsampled feature maps. This means doubling the input resolution adds mostly cheap convolutional work, not expensive attention work.
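The same arithmetic as before, but tracking growth as resolution doubles (again counting only the quadratic n² attention term; the linear-cost convolutional work is omitted):

```python
# Quadratic attention term as resolution doubles.
def vit_attention(res, patch=14, layers=24):
    n = (res // patch) ** 2
    return layers * n * n

def hybrid_attention(res):
    # attention only in the last two stages, on 1/32- and 1/64-scale maps
    n4, n5 = (res // 32) ** 2, (res // 64) ** 2
    return 4 * n4 * n4 + 2 * n5 * n5

for res in (512, 1024):
    print(res, f"ViT-L/14: {vit_attention(res):,}", f"hybrid: {hybrid_attention(res):,}")
```

Both quadratic terms grow roughly 16× when the resolution doubles, but the hybrid encoder's term starts two orders of magnitude smaller because attention only ever sees the 1/32- and 1/64-scale maps; the extra work at high resolution is dominated by the linear-cost convolutional stages.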
8 — On-Device Deployment
FastVLM is designed from the ground up for on-device inference — running directly on iPhones, iPads, and Macs without a cloud backend. This has privacy, latency, and offline availability benefits.
Apple Silicon Deployment
Apple provides an iOS/macOS demo app built on the MLX framework. The models are available in Apple Silicon-compatible formats:
- FastVLM-0.5B — runs on iPhone and iPad with minimal memory
- FastVLM-1.5B — suitable for iPad Pro and Mac
- FastVLM-7B — for Mac with sufficient unified memory
The small model size (roughly 0.6B total parameters for the smallest variant, including the 125M encoder) combined with the low token count makes real-time vision-language interaction possible on mobile hardware. Users can point their camera at a document, chart, or scene and get instant language model responses — all processed locally.
9 — Model Variants
| Variant | Vision Encoder | LLM Decoder | Total Params | Target Device |
|---|---|---|---|---|
| FastVLM-0.5B | FastViTHD (125M) | Qwen2-0.5B | ~625M | iPhone, iPad |
| FastVLM-1.5B | FastViTHD (125M) | Qwen2-1.5B | ~1.6B | iPad Pro, Mac |
| FastVLM-7B | FastViTHD (125M) | Qwen2-7B | ~7.1B | Mac, Server |
All three variants share the same FastViTHD vision encoder — only the LLM decoder size changes. This means the vision encoding speed is identical across all variants; the difference is in the language model's capacity and generation quality.
10 — FastVLM vs. Prior Approaches
| Approach | Vision Encoder | Token Strategy | Encoder Size | Tokens (768px) | Speed |
|---|---|---|---|---|---|
| FastVLM | FastViTHD (hybrid) | Native low-count | 125M | 144 | Fastest |
| LLaVA-1.5 | CLIP ViT-L/14 | All tokens | 304M | ~2,900 | Slow |
| LLaVA-OneVision | SigLIP-SO400M | All tokens + AnyRes | 430M | 2,500+ | Slowest |
| ConvLLaVA | ConvNeXt | Conv downsample | Similar | ~576 | Moderate |
| Cambrian-1 | Multi-encoder | Token merging | 304M+ | 576 | Moderate |
11 — References & Further Reading
- FastVLM: Efficient Vision Encoding for Vision Language Models — Vasu, Faghri, Li, Koc, True, Antony, Santhanam, Gabriel, Grasch, Tuzel, Pouransari — CVPR 2025
- Official GitHub Repository — apple/ml-fastvlm
- Apple ML Research Project Page
- FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization — Vasu et al. — ICCV 2023
- MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training — Vasu et al. — CVPR 2024
- Visual Instruction Tuning (LLaVA) — Liu et al. — NeurIPS 2023
- Our ViT Walkthrough — for comparison with the standard Vision Transformer approach