Multimodal Transformers
Multimodal transformers fuse information across modalities — vision, language, audio, actions — into shared representations. This walkthrough covers the three dominant architectural paradigms: contrastive two-tower (CLIP), cross-attention fusion (Flamingo, LLaVA), and unified embedding (Gemini, GPT-4V) — when to use each, what to look for in your data, and how they compare.
When Do You Need a Multimodal Transformer?
Problem Signals
Reach for multimodal when your task exhibits one or more of these patterns:
Cross-Modal Grounding
The answer depends on connecting a concept in one modality to a region/span in another. "Where is the dog?" requires grounding language to image pixels. Examples: visual question answering, referring expression comprehension, image captioning.
Multi-Modal Generation
The output modality differs from the input, or you generate in multiple modalities jointly. Examples: text-to-image (Stable Diffusion, DALL-E), image-conditioned text generation, vision-language-action models for robotics.
Retrieval Across Modalities
Given a query in modality A (text), find matches in modality B (images/video/audio). Examples: zero-shot image search, video retrieval from natural-language queries, audio-visual correspondence.
Complementary Evidence Fusion
Each modality provides partial evidence that must be integrated — like a medical AI combining radiology images with clinical notes, or a robot combining camera feeds with language instructions.
What to Look for in Your Data
Paired vs. Unpaired Data
- Paired data (image-caption pairs, video-transcript): Enables contrastive learning (CLIP) and supervised fusion. The quality of pairing matters enormously — noisy web-scraped alt-text vs. human-written captions produce very different models.
- Unpaired data (images without captions, text without images): Can still be used for pretraining individual encoders. Multimodal alignment then requires a bridge — typically a projection layer or adapter trained on a smaller paired set.
Data Scale & Modality Balance
- Contrastive models (CLIP, SigLIP) need hundreds of millions of image-text pairs: CLIP was trained on 400M pairs, SigLIP on billions.
- Fusion models (LLaVA, Flamingo) can work with much less paired data because they leverage frozen pretrained backbones — LLaVA-1.5 used only 665K instruction pairs for fine-tuning.
- Modality imbalance: If you have 10× more text than images, a two-tower contrastive approach won't utilize the extra text. A fusion model with a frozen LLM backbone can leverage it.
Alignment Granularity
- Global alignment: "This image shows a dog" — whole-image to whole-sentence. Good for retrieval. CLIP excels here.
- Region-level: "The red car on the left" — requires spatial grounding. Needs cross-attention or region features.
- Token-level: Pixel-by-pixel or frame-by-frame correspondence. Needs dense fusion (unified embedding or heavy cross-attention).
Latency & Deployment Constraints
- Two-tower: Encode each modality once, compare with dot product. Sub-millisecond retrieval over millions of items. Best for serving at scale.
- Cross-attention fusion: Requires running both modalities through shared layers. More compute per query, but richer understanding.
- Unified embedding: Most expensive at inference, but most capable. Appropriate when quality matters more than latency.
1 Contrastive Two-Tower (CLIP / SigLIP)
How It Works
image: [B, 3, 224, 224] → [B, d] · text: [B, L] → [B, d]

Both towers project to a shared embedding dimension d (typically 512 or 768). For CLIP ViT-L/14: the image encoder produces a [B, 257, 1024] patch sequence (1 [CLS] token plus 16×16 patches), and the [CLS] token is projected to [B, 768]; the text encoder produces [B, L, 768], and the [EOS] token is projected to [B, 768]. Both embeddings are L2-normalized.
Training: with L2-normalized image embeddings V (shape [B, d]) and text embeddings T (shape [B, d]), the similarity matrix S = V Tᵀ / τ (shape [B, B], with τ a learned temperature) has true pairs on its diagonal and false pairs off-diagonal. InfoNCE treats each row as a B-way classification over the texts (and symmetrically, each column over the images).
SigLIP improvement: Replaces the softmax-based InfoNCE with a pairwise sigmoid loss. Each (i, j) entry of S is independently classified as matching or not, removing the need for synchronized batch statistics across GPUs. This allows larger effective batch sizes and better scaling.
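As a minimal sketch of both objectives in PyTorch (function names are mine; the log-space temperature and the sigmoid-loss bias follow the CLIP and SigLIP papers, while initialization and loss reduction are simplified):

```python
import torch
import torch.nn.functional as F

def clip_infonce_loss(img_emb, txt_emb, log_temp):
    """Symmetric InfoNCE over a batch of paired, L2-normalized embeddings.

    img_emb, txt_emb: [B, d]; log_temp: learned scalar (CLIP stores the
    temperature in log space).
    """
    logits = img_emb @ txt_emb.t() * log_temp.exp()    # S / tau: [B, B]
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)        # each row: B-way over texts
    loss_t2i = F.cross_entropy(logits.t(), targets)    # each column: B-way over images
    return (loss_i2t + loss_t2i) / 2

def siglip_sigmoid_loss(img_emb, txt_emb, log_temp, bias):
    """Pairwise sigmoid loss: every (i, j) entry is an independent binary
    match/no-match decision, so no softmax normalization over the batch."""
    logits = img_emb @ txt_emb.t() * log_temp.exp() + bias            # [B, B]
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1  # +1 diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()
```

Because each entry is classified independently, the sigmoid loss needs no row or column normalization, which is exactly what removes the synchronized-softmax bottleneck described above.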
When to Use Two-Tower
- Zero-shot classification: Compare an image against many class-name prompts. No training needed for new categories (see the sketch after this list).
- Cross-modal retrieval: Pre-compute embeddings offline, retrieve in sub-millisecond with approximate nearest neighbors.
- Building a multimodal backbone: CLIP/SigLIP vision encoders are used as the "eyes" for downstream models (LLaVA, Flamingo, PaliGemma).
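A minimal zero-shot classification sketch (the `encode_text` callable and the prompt template are assumptions; any CLIP/SigLIP-style text tower that returns L2-normalized embeddings fits):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_emb, class_names, encode_text):
    """Classify one L2-normalized image embedding against class-name prompts.

    image_emb:   [d] tensor from the image tower.
    encode_text: assumed callable mapping a list of strings to
                 L2-normalized [N, d] text embeddings.
    """
    prompts = [f"a photo of a {name}" for name in class_names]
    text_emb = encode_text(prompts)             # [N, d]
    sims = text_emb @ image_emb                 # [N] cosine similarities
    probs = F.softmax(100.0 * sims, dim=0)      # 100 approximates CLIP's learned logit scale
    return class_names[probs.argmax().item()], probs
```

Adding a new category is just adding a string to `class_names`; no gradient step is involved.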
Limitations
- No fine-grained interaction: A single global vector per modality can't represent spatial relationships, counting, or compositional reasoning ("the cat is to the left of the dog").
- No generation: Two-tower models produce embeddings, not text or images. You can retrieve, not generate.
- Bag-of-concepts bias: CLIP often behaves like a bag of words — "a dog biting a man" and "a man biting a dog" get similar scores.
2 Cross-Attention Fusion (Flamingo / LLaVA)
How It Works
vision: [B, N_v, d_v] → [B, K, d_llm]

The vision encoder (typically a CLIP/SigLIP ViT-L/14 or ViT-H/14) processes the image into a grid of N_v patch tokens (N_v = 256 for a 224×224 input at patch size 14). The projection module then maps these into the LLM's d_llm-dimensional embedding space and (optionally) compresses the sequence length from N_v down to K.
The critical design choice is how visual tokens interact with text:
- Flamingo: A Perceiver Resampler first compresses [B, N_v, d_v] → [B, 64, d_llm] via learned query tokens. Gated cross-attention layers (Q = text, K/V = visual) are inserted between frozen LLM layers; gates are zero-initialized so the model starts as a pure LLM.
- LLaVA: Simpler — an MLP projection maps [B, N_v, d_v] → [B, N_v, d_llm] and the projected tokens are concatenated with text tokens along the sequence axis. The LLM's existing self-attention handles the interaction. No architectural changes to the LLM.
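To make the LLaVA-style path concrete, a minimal sketch (module names are mine; assumes access to the LLM's token-embedding layer):

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """LLaVA-1.5-style two-layer MLP: d_v -> d_llm, one output token per patch."""
    def __init__(self, d_v: int, d_llm: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_v, d_llm), nn.GELU(), nn.Linear(d_llm, d_llm)
        )

    def forward(self, patch_feats):              # [B, N_v, d_v]
        return self.mlp(patch_feats)             # [B, N_v, d_llm]

def build_multimodal_inputs(patch_feats, text_ids, projector, embed_tokens):
    """Concatenate projected visual tokens with text embeddings along the
    sequence axis; the LLM's ordinary self-attention then does the fusion."""
    vis_emb = projector(patch_feats)             # [B, N_v, d_llm]
    txt_emb = embed_tokens(text_ids)             # [B, L, d_llm]
    return torch.cat([vis_emb, txt_emb], dim=1)  # [B, N_v + L, d_llm]
```

The entire vision-to-language bridge is the `VisionProjector`; everything else is an unmodified LLM.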
When to Use Cross-Attention Fusion
- Visual question answering & reasoning: "What color is the car in front of the house?" requires grounding and spatial understanding.
- Image/video captioning & description: Generate detailed, contextual text about visual content.
- Document understanding: Process charts, diagrams, screenshots with text questions.
- Leveraging existing LLMs: You have a strong pretrained LLM and want to add vision without retraining from scratch.
- Limited paired data: Freeze both encoders, only train the projection layer and adapters — LLaVA-1.5 fine-tuned on 665K examples.
Flamingo vs. LLaVA: The Architecture Decision
| Dimension | Flamingo-style | LLaVA-style |
|---|---|---|
| Fusion mechanism | Dedicated cross-attention layers | Concatenate in token space |
| LLM modification | Inserts new layers (gated) | None — LLM unchanged |
| Visual token count | Fixed (64 via Perceiver) | Variable (256+ for ViT-L/14) |
| Multi-image | Natural (interleaved) | Requires context management |
| Training cost | Higher (new cross-attn params) | Lower (just projection layer) |
| Inference cost | Lower (fewer visual tokens) | Higher (many visual tokens in context) |
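For the Flamingo-style column, a minimal sketch of one gated cross-attention block (names are mine; layer norms omitted; the zero-initialized tanh gates make the block an identity at the start of training, so the frozen LLM's behavior is initially unchanged):

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Inserted between frozen LLM layers: text queries attend to visual K/V."""
    def __init__(self, d_llm: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_llm, n_heads, batch_first=True)
        self.ffw = nn.Sequential(
            nn.Linear(d_llm, 4 * d_llm), nn.GELU(), nn.Linear(4 * d_llm, d_llm)
        )
        self.attn_gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0: identity at init
        self.ffw_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text, visual):          # text: [B, L, d], visual: [B, K, d]
        attn_out, _ = self.attn(text, visual, visual)   # Q = text, K/V = visual
        x = text + self.attn_gate.tanh() * attn_out
        x = x + self.ffw_gate.tanh() * self.ffw(x)
        return x
```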
3 Unified Embedding (Gemini / GPT-4V)
How It Works
all modalities → [B, L, d_model]

Each modality has a tokenizer that converts raw input into tokens, all of which land in a shared d_model-dimensional embedding space (typically 2048–8192). The concatenated sequence [B, L, d_model], where L mixes text, image, and audio tokens, is processed by a standard transformer with full self-attention. Every token attends to every other, regardless of modality.
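A minimal sketch of the input-side unification (the modality encoders and module names are assumptions; the point is that one transformer sees one mixed sequence):

```python
import torch
import torch.nn as nn

class UnifiedInputBuilder(nn.Module):
    """Map every modality into the shared d_model space, then concatenate.

    text_ids:   [B, L_t] token ids        -> embedding table
    image_feat: [B, L_i, d_img] patches   -> linear projection
    audio_feat: [B, L_a, d_aud] frames    -> linear projection
    """
    def __init__(self, vocab_size, d_model, d_img, d_aud):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.img_proj = nn.Linear(d_img, d_model)
        self.aud_proj = nn.Linear(d_aud, d_model)

    def forward(self, text_ids, image_feat, audio_feat):
        seq = torch.cat([
            self.img_proj(image_feat),    # image tokens (one common layout: vision first)
            self.aud_proj(audio_feat),    # audio tokens
            self.text_embed(text_ids),    # text tokens
        ], dim=1)                         # [B, L_i + L_a + L_t, d_model]
        return seq  # feeds a standard decoder-only transformer with full self-attention
```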
Tokenization strategies vary:
- Gemini: SigLIP-derived visual encoder produces ~256 tokens per image; these are interleaved with SentencePiece text tokens. Audio uses USM-derived features — ~6.25 tokens/s. All project to the same dmodel.
- Chameleon (Meta): Discretizes each 512×512 image into 1024 VQ-VAE tokens that share the same vocabulary as text. True token-level unification: image and text tokens use the same softmax head over a combined vocab of ~65K.
- Gemini Robotics: Extends to action tokens for robot control — continuous joint commands are discretized into a small vocab per joint and generated autoregressively.
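For the action-token idea, a minimal sketch of the discretization round trip (the bin count and joint limits are illustrative; real systems tune both per robot):

```python
import numpy as np

N_BINS = 256  # hypothetical per-joint vocabulary size

def actions_to_tokens(joint_cmds, low, high):
    """Map continuous joint commands in [low, high] to integer token ids."""
    norm = (np.asarray(joint_cmds) - low) / (high - low)          # scale to [0, 1]
    return np.clip(np.round(norm * (N_BINS - 1)).astype(int), 0, N_BINS - 1)

def tokens_to_actions(tokens, low, high):
    """Inverse map: token ids back to (quantized) joint commands."""
    return low + (np.asarray(tokens) / (N_BINS - 1)) * (high - low)

# Round trip for a 3-DoF arm with joint limits [-1.0, 1.0] rad:
cmds = [0.12, -0.85, 0.40]
toks = actions_to_tokens(cmds, -1.0, 1.0)   # e.g. [143, 19, 178]
back = tokens_to_actions(toks, -1.0, 1.0)   # close to cmds, up to bin width
```

At inference, the model autoregressively emits one token per joint and the controller applies the decoded commands.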
When to Use Unified Embedding
- Any-to-any generation: Tasks that require generating in multiple modalities (text → image, image → text, interleaved).
- Complex multi-modal reasoning: Math problems with diagrams, multi-step visual reasoning, science questions with charts.
- Embodied AI / robotics: Vision + language understanding + action generation in a single model (Gemini Robotics, RT-2).
- You have massive compute and data: These models require training at unprecedented scale. Gemini Ultra used thousands of TPUs.
Limitations
- Compute cost: Full self-attention over all modality tokens is O(n²) where n includes visual tokens (often 256–4096+). Much more expensive than two-tower.
- Training complexity: Balancing loss across modalities is hard. Image generation and text generation compete for capacity. Modality-specific data ratios matter enormously.
- Not open-source: The most capable unified models (Gemini, GPT-4V) are proprietary. Open alternatives (Chameleon) exist but lag in capability.
Paradigm Comparison
| Dimension | Two-Tower (CLIP) | Cross-Attention (LLaVA) | Unified (Gemini) |
|---|---|---|---|
| Interaction depth | None — dot product only | Medium — cross-attn layers | Deep — full self-attention |
| Training data needed | 400M+ pairs | 600K–2M (fine-tune) | Billions (end-to-end) |
| Retrieval speed | Sub-ms (precomputed) | Not applicable | Not applicable |
| Generation quality | No generation | Good text generation | Best — any-to-any |
| Spatial reasoning | Weak | Good | Best |
| Compositionality | Weak (bag of concepts) | Good | Best |
| Training cost | Moderate | Low (frozen backbone) | Extreme |
| Open models | CLIP, SigLIP, OpenCLIP | LLaVA, PaliGemma, Qwen-VL | Chameleon (limited) |
| Best for | Retrieval, zero-shot, backbones | VQA, captioning, document AI | General-purpose reasoning |
Decision Framework
Choose Two-Tower (CLIP/SigLIP) if...
- You need fast retrieval across millions of items
- Zero-shot classification without training is sufficient
- You need a vision backbone for a downstream model
- You have hundreds of millions of image-text pairs
- Latency constraints are strict (<10ms per query)
Choose Cross-Attention Fusion (LLaVA/Flamingo) if...
- You need to generate text conditioned on images (VQA, captioning)
- You want to add vision to an existing LLM without retraining it
- You have limited paired data (<1M examples)
- You need a good balance of quality and training cost
- Fine-grained spatial understanding matters
Choose Unified Embedding (Gemini) if...
- You need any-to-any generation (text, image, action)
- Complex multi-step reasoning across modalities is required
- You're building an embodied agent (robot, game AI)
- You have massive compute budget and diverse multimodal data
- Maximum capability matters more than efficiency
Use Cases in Practice
E-Commerce Product Search
User types "blue running shoes" or uploads a photo. CLIP/SigLIP encodes the query and catalog images independently. Retrieve top-K by cosine similarity in <5ms over millions of products, with all product embeddings precomputed offline. This is the standard pattern behind large retailers' visual search.
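A minimal sketch of the serving path (`encode_text` stands in for whichever query tower produced the precomputed catalog embeddings):

```python
import numpy as np

def top_k_products(query_text, product_embs, product_ids, encode_text, k=10):
    """Cosine-similarity retrieval over precomputed catalog embeddings.

    product_embs: [N, d] L2-normalized image embeddings, computed offline.
    encode_text:  assumed callable mapping a string to an L2-normalized [d] vector.
    """
    q = encode_text(query_text)               # [d]
    scores = product_embs @ q                 # [N] in one matmul (cosine, since normalized)
    top = np.argpartition(-scores, k)[:k]     # O(N) partial selection of the k best
    top = top[np.argsort(-scores[top])]       # order the k winners
    return [(product_ids[i], float(scores[i])) for i in top]
```

At catalog scale the brute-force matmul is replaced by an approximate nearest-neighbor index (e.g. FAISS or ScaNN); the two-tower property, with query and catalog encoded independently, is what makes the offline precomputation possible.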
Medical Image Report Generation
Radiologist uploads a chest X-ray. A frozen CheXNet or BiomedCLIP encodes the image. A medical LLM generates structured findings via cross-attention to the visual features. Training data: ~200K image-report pairs from hospital archives.
Robot Manipulation from Language Instructions
"Pick up the red block and place it on the blue plate." Camera feeds + language instruction are tokenized into a single sequence. The unified model reasons about spatial relationships and outputs action tokens for the robot arm. This is the RT-2 / Gemini Robotics / π₀ approach.
The Trajectory
The field's center of gravity is shifting toward unified models, but two-tower models aren't going away. They remain the backbone: SigLIP powers PaliGemma, CLIP powers LLaVA, DINOv2 powers many vision pipelines. The practical pattern is: contrastive pretraining → cross-attention fine-tuning → (optionally) unified end-to-end training. Each paradigm builds on the last.