Multimodal Transformers

Cross-Attention, Unified Embeddings & Contrastive Learning

Multimodal transformers fuse information across modalities — vision, language, audio, actions — into shared representations. This walkthrough covers the three dominant architectural paradigms: contrastive two-tower (CLIP), cross-attention fusion (Flamingo, LLaVA), and unified embedding (Gemini, GPT-4V) — when to use each, what to look for in your data, and how they compare.

When Do You Need a Multimodal Transformer?

The core question: Does your task require joint reasoning across modalities — not just processing each input independently? If yes, you need multimodal. If each modality can be handled in isolation and combined by simple rules, you probably don't.

Problem Signals

Reach for multimodal when your task exhibits one or more of these patterns:

Cross-Modal Grounding

The answer depends on connecting a concept in one modality to a region/span in another. "Where is the dog?" requires grounding language to image pixels. Visual question answering, referring expression comprehension, image captioning.

Multi-Modal Generation

The output modality differs from the input, or you generate in multiple modalities jointly. Text-to-image (Stable Diffusion/DALL-E), image-conditioned text generation, vision-language-action models for robotics.

Retrieval Across Modalities

Given a query in modality A (text), find matches in modality B (images/video/audio). Zero-shot image search, video retrieval from natural language queries, audio-visual correspondence.

Complementary Evidence Fusion

Each modality provides partial evidence that must be integrated — like a medical AI combining radiology images with clinical notes, or a robot combining camera feeds with language instructions.

What to Look for in Your Data

Paired vs. Unpaired Data

  • Paired data (image-caption pairs, video-transcript): Enables contrastive learning (CLIP) and supervised fusion. The quality of pairing matters enormously — noisy web-scraped alt-text vs. human-written captions produce very different models.
  • Unpaired data (images without captions, text without images): Can still be used for pretraining individual encoders. Multimodal alignment then requires a bridge — typically a projection layer or adapter trained on a smaller paired set.

Data Scale & Modality Balance

  • Contrastive models (CLIP, SigLIP) need hundreds of millions of image-text pairs. CLIP trained on 400M, SigLIP on billions.
  • Fusion models (LLaVA, Flamingo) can work with much less paired data because they leverage frozen pretrained backbones — LLaVA-1.5 used only 665K instruction pairs for fine-tuning.
  • Modality imbalance: If you have 10× more text than images, a two-tower contrastive approach won't utilize the extra text. A fusion model with a frozen LLM backbone can leverage it.

Alignment Granularity

  • Global alignment: "This image shows a dog" — whole-image to whole-sentence. Good for retrieval. CLIP excels here.
  • Region-level: "The red car on the left" — requires spatial grounding. Needs cross-attention or region features.
  • Token-level: Pixel-by-pixel or frame-by-frame correspondence. Needs dense fusion (unified embedding or heavy cross-attention).

Latency & Deployment Constraints

  • Two-tower: Encode each modality once, compare with dot product. Sub-millisecond retrieval over millions of items. Best for serving at scale.
  • Cross-attention fusion: Requires running both modalities through shared layers. More compute per query, but richer understanding.
  • Unified embedding: Most expensive at inference, but most capable. Appropriate when quality matters more than latency.

1 Contrastive Two-Tower (CLIP / SigLIP)

Key idea: Train two separate encoders (one per modality) to embed matching pairs close together and non-matching pairs far apart in a shared vector space. At inference, compare any image to any text via cosine similarity — no fusion needed.
[Figure: Two-Tower Contrastive Architecture. A 224×224 image feeds an image encoder (ViT-L/14) producing v ∈ Rᵈ; a caption ("a photo of...") feeds a text encoder (Transformer) producing t ∈ Rᵈ; cos(v, t) gives the similarity, trained with an InfoNCE or sigmoid loss.]

How It Works

image: [B, 3, 224, 224] → [B, d] · text: [B, L] → [B, d]

Both towers project to a shared embedding dimension d (typically 512 or 768). For CLIP ViT-L/14: image encoder produces a [B, 257, 1024] patch sequence (1 [CLS] + 16×16 patches), the [CLS] token is projected to [B, 768]; the text encoder produces [B, L, 512] and the [EOS] token is projected to [B, 768]. Both are ℓ2-normalized.

Training: stack the ℓ2-normalized embeddings into V and T (each [B, d]); the similarity matrix S = V Tᵀ (shape [B, B]) has true pairs on the diagonal and mismatched pairs off-diagonal. InfoNCE treats each row as a B-way classification (and symmetrically for columns).
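A minimal PyTorch sketch of this objective, assuming the two towers already produce [B, d] embeddings (the function name and fixed temperature are illustrative; CLIP learns the temperature as a parameter):

```python
import torch
import torch.nn.functional as F

def clip_infonce_loss(v, t, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    v: [B, d] image embeddings; t: [B, d] text embeddings.
    Matching pairs share the same row index.
    """
    v = F.normalize(v, dim=-1)                     # l2-normalize both towers
    t = F.normalize(t, dim=-1)
    S = (v @ t.T) / temperature                    # [B, B] similarity matrix
    labels = torch.arange(S.size(0), device=S.device)
    loss_i2t = F.cross_entropy(S, labels)          # each row: image -> which text?
    loss_t2i = F.cross_entropy(S.T, labels)        # each column: text -> which image?
    return 0.5 * (loss_i2t + loss_t2i)
```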

SigLIP improvement: Replaces the softmax-based InfoNCE with a pairwise sigmoid loss. Each (i, j) entry of S is independently classified as matching or not, removing the need for synchronized batch statistics across GPUs. This allows larger effective batch sizes and better scaling.
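For comparison, a sketch of the pairwise sigmoid objective under the same assumptions; in SigLIP the logit scale and bias are learned scalars (the paper initializes them near log 10 and -10), and the sum is normalized by batch size:

```python
import torch
import torch.nn.functional as F

def siglip_loss(v, t, logit_scale, logit_bias):
    """Pairwise sigmoid loss: every (i, j) entry of the similarity
    matrix is an independent binary decision, so no softmax over the
    batch and no synchronized batch statistics are needed."""
    v = F.normalize(v, dim=-1)
    t = F.normalize(t, dim=-1)
    logits = (v @ t.T) * logit_scale + logit_bias               # [B, B]
    labels = 2.0 * torch.eye(v.size(0), device=v.device) - 1.0  # +1 diag, -1 off-diag
    return -F.logsigmoid(labels * logits).sum() / v.size(0)
```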

When to Use Two-Tower

  • Zero-shot classification: Compare an image against many class-name prompts. No training needed for new categories (see the sketch after this list).
  • Cross-modal retrieval: Pre-compute embeddings offline, retrieve in sub-millisecond with approximate nearest neighbors.
  • Building a multimodal backbone: CLIP/SigLIP vision encoders are used as the "eyes" for downstream models (LLaVA, Flamingo, PaliGemma).
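A sketch of prompt-based zero-shot classification built on such a model; encode_text is a hypothetical stand-in for your text tower (list of strings in, [N, d] embeddings out), and the fixed ×100 logit scale approximates CLIP's learned temperature at convergence:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_emb, class_names, encode_text):
    """Score one image embedding against one prompt per class."""
    prompts = [f"a photo of a {name}" for name in class_names]
    text_emb = F.normalize(encode_text(prompts), dim=-1)       # [N, d]
    image_emb = F.normalize(image_emb, dim=-1)                 # [d]
    probs = (image_emb @ text_emb.T * 100.0).softmax(dim=-1)   # [N]
    return class_names[probs.argmax().item()], probs
```

Adding a new category is just adding a prompt; nothing on the image side needs retraining or re-indexing.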

Limitations

  • No fine-grained interaction: A single global vector per modality can't represent spatial relationships, counting, or compositional reasoning ("the cat is to the left of the dog").
  • No generation: Two-tower models produce embeddings, not text or images. You can retrieve, not generate.
  • Bag-of-concepts bias: CLIP often behaves like a bag of words — "a dog biting a man" and "a man biting a dog" get similar scores.

Key Papers & Models

CLIP: Learning Transferable Visual Models From Natural Language Supervision
Radford et al., OpenAI 2021 · 400M image-text pairs · ViT-L/14 + Transformer text encoder
SigLIP: Sigmoid Loss for Language Image Pre-Training
Zhai et al., Google 2023 · Pairwise sigmoid loss · Scales to larger batches without cross-GPU sync
ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
Jia et al., Google 2021 · 1.8B noisy image-text pairs · EfficientNet + BERT

2 Cross-Attention Fusion (Flamingo / LLaVA)

Key idea: Keep pretrained unimodal encoders frozen (or lightly tuned), and add cross-attention layers that let one modality attend to the other. The language model "looks at" visual features through learned attention gates, enabling rich interaction without training from scratch.
[Figure: Cross-Attention Fusion Architecture. A frozen vision encoder (ViT) yields visual tokens; a projection / Perceiver compresses them to K tokens; the language model (LLaMA, Vicuna, etc.) runs, in each of its L layers, self-attention over text, cross-attention with Q = text and K/V = visual tokens, and a feed-forward (FFN) block, producing the generated text. Projection variants: Flamingo, Perceiver resampler; LLaVA, linear/MLP projection; Qwen-VL, cross-attn compressor; PaliGemma, SigLIP + linear.]

How It Works

vision: [B, N_v, d_v] → [B, K, d_llm]

The vision encoder (typically CLIP/SigLIP ViT-L/14 or ViT-H/14) processes the image into a grid of N_v patch tokens — N_v = 256 for 224×224 at patch size 14. The projection module then maps these into the LLM's d_llm-dim embedding space and (optionally) compresses the sequence length.

The critical design choice is how visual tokens interact with text:

  • Flamingo: A Perceiver Resampler first compresses [B, N_v, d_v] → [B, 64, d_llm] via learned query tokens. Gated cross-attention layers (Q = text, K/V = visual) are inserted between frozen LLM layers; gates are zero-initialized so the model starts as a pure LLM.
  • LLaVA: Simpler — an MLP projection maps [B, N_v, d_v] → [B, N_v, d_llm] and the projected tokens are concatenated with text tokens along the sequence axis (sketched after this list). The LLM's existing self-attention handles the interaction. No architectural changes to the LLM.
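A minimal sketch of the LLaVA-style path, assuming a frozen vision tower with feature width d_vision and an LLM with embedding width d_llm (the dimensions are illustrative; LLaVA-1.5 uses a two-layer MLP of this shape):

```python
import torch
import torch.nn as nn

class LlavaStyleProjector(nn.Module):
    """Map frozen ViT patch features into the LLM token-embedding space,
    then splice them into the text sequence."""
    def __init__(self, d_vision=1024, d_llm=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_vision, d_llm),
            nn.GELU(),
            nn.Linear(d_llm, d_llm),
        )

    def forward(self, visual_feats, text_embeds):
        # visual_feats: [B, N_v, d_vision] from the frozen vision encoder
        # text_embeds:  [B, L, d_llm] from the LLM's embedding table
        visual_tokens = self.mlp(visual_feats)       # [B, N_v, d_llm]
        # Concatenate along the sequence axis; the LLM's ordinary
        # self-attention then handles all text-image interaction.
        return torch.cat([visual_tokens, text_embeds], dim=1)  # [B, N_v + L, d_llm]
```

The trade-off is visible in the shapes: every image costs N_v positions of LLM context, which is why Flamingo-style compression to a fixed 64 tokens pays off at inference time.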

When to Use Cross-Attention Fusion

  • Visual question answering & reasoning: "What color is the car in front of the house?" requires grounding and spatial understanding.
  • Image/video captioning & description: Generate detailed, contextual text about visual content.
  • Document understanding: Process charts, diagrams, screenshots with text questions.
  • Leveraging existing LLMs: You have a strong pretrained LLM and want to add vision without retraining from scratch.
  • Limited paired data: Freeze both encoders, only train the projection layer and adapters — LLaVA-1.5 fine-tuned on 665K examples.

Flamingo vs. LLaVA: The Architecture Decision

Dimension | Flamingo-style | LLaVA-style
Fusion mechanism | Dedicated cross-attention layers | Concatenate in token space
LLM modification | Inserts new layers (gated) | None — LLM unchanged
Visual token count | Fixed (64 via Perceiver) | Variable (256+ for ViT-L/14)
Multi-image | Natural (interleaved) | Requires context management
Training cost | Higher (new cross-attn params) | Lower (just projection layer)
Inference cost | Lower (fewer visual tokens) | Higher (many visual tokens in context)

Key Papers & Models

Flamingo: a Visual Language Model for Few-Shot Learning
Alayrac et al., DeepMind 2022 · Perceiver resampler + gated cross-attention · 80B params
LLaVA: Visual Instruction Tuning
Liu et al., 2023 · CLIP ViT-L + Vicuna · MLP projection · 665K instruction pairs
PaliGemma: A versatile 3B VLM for transfer
Google 2024 · SigLIP-So400m + Gemma 2B · Linear projection · Strong at fine-grained tasks
Qwen-VL: A Versatile Vision-Language Model
Alibaba 2023 · Cross-attention compressor (256 queries) · Grounding & multi-image support

3 Unified Embedding (Gemini / GPT-4V)

Key idea: All modalities are tokenized into a single sequence and processed by one transformer. No separate encoders at inference — vision tokens, text tokens, and (optionally) audio/action tokens are all first-class citizens in the same attention mechanism. The model is trained end-to-end on interleaved multimodal data.
[Figure: Unified Embedding Architecture. Vision, text, audio, and action tokenizers emit token streams (img_1, img_2, ..., text_1, text_2, ..., aud_1, ..., act_1, ...) into one unified transformer with full self-attention across all modality tokens (× N layers), which can produce image, text, or action output. Any-to-any: image→text, text→image, image+text→action. Natively multimodal: no modality is privileged.]

How It Works

all modalities → [B, L, d_model]

Each modality has a tokenizer that converts raw input into tokens, all of which land in a shared d_model-dim embedding space (typically 2048–8192). The concatenated sequence [B, L, d_model] — where L mixes text, image, and audio tokens — is processed by a standard transformer with full self-attention. Every token attends to every other, regardless of modality.

Tokenization strategies vary (a toy sketch of the shared-vocabulary approach follows this list):

  • Gemini: SigLIP-derived visual encoder produces ~256 tokens per image; these are interleaved with SentencePiece text tokens. Audio uses USM-derived features — ~6.25 tokens/s. All project to the same d_model.
  • Chameleon (Meta): Discretizes each 512×512 image into 1024 VQ-VAE tokens that share the same vocabulary as text. True token-level unification — image and text tokens use the same softmax head over a combined vocab of ~65K.
  • Gemini Robotics: Extends to action tokens for robot control — continuous joint commands are discretized into a small vocab per joint and generated autoregressively.
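A toy sketch of the shared-vocabulary idea (closest in spirit to the Chameleon recipe): each modality's token ids are shifted into a disjoint slice of one combined vocabulary, embedded into a shared d_model space, and attended to jointly. Vocabulary sizes and dimensions are illustrative, and a real model would be a causal decoder rather than this encoder stack:

```python
import torch
import torch.nn as nn

class UnifiedStub(nn.Module):
    """One embedding table and one transformer for all modalities."""
    def __init__(self, text_vocab=32000, image_vocab=8192, action_vocab=256,
                 d_model=2048, n_heads=16, n_layers=4):
        super().__init__()
        self.image_offset = text_vocab
        self.action_offset = text_vocab + image_vocab
        self.embed = nn.Embedding(text_vocab + image_vocab + action_vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)

    def forward(self, text_ids, image_ids, action_ids):
        # Shift each modality into its slice of the combined vocabulary,
        # then concatenate into a single [B, L] token sequence.
        seq = torch.cat([image_ids + self.image_offset,
                         text_ids,
                         action_ids + self.action_offset], dim=1)
        x = self.embed(seq)           # [B, L, d_model], one shared space
        return self.transformer(x)    # full self-attention across modalities
```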

When to Use Unified Embedding

  • Any-to-any generation: Tasks that require generating in multiple modalities (text → image, image → text, interleaved).
  • Complex multi-modal reasoning: Math problems with diagrams, multi-step visual reasoning, science questions with charts.
  • Embodied AI / robotics: Vision + language understanding + action generation in a single model (Gemini Robotics, RT-2).
  • You have massive compute and data: These models require training at unprecedented scale. Gemini Ultra used thousands of TPUs.

Limitations

  • Compute cost: Full self-attention over all modality tokens is O(n²) where n includes visual tokens (often 256–4096+). Much more expensive than two-tower.
  • Training complexity: Balancing loss across modalities is hard. Image generation and text generation compete for capacity. Modality-specific data ratios matter enormously.
  • Not open-source: The most capable unified models (Gemini, GPT-4V) are proprietary. Open alternatives (Chameleon) exist but lag in capability.

Key Papers & Models

Gemini: A Family of Highly Capable Multimodal Models
Google DeepMind 2023 · Natively multimodal · Text, image, audio, video, code
GPT-4V(ision) System Card
OpenAI 2023 · Unified multimodal reasoning · Image understanding + text generation
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Meta 2024 · VQ-VAE image tokenization · Shared vocabulary for image & text generation
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Google DeepMind 2023 · PaLM-E/PaLI backbone · Actions as text tokens

Paradigm Comparison

Dimension | Two-Tower (CLIP) | Cross-Attention (LLaVA) | Unified (Gemini)
Interaction depth | None — dot product only | Medium — cross-attn layers | Deep — full self-attention
Training data needed | 400M+ pairs | 600K–2M (fine-tune) | Billions (end-to-end)
Retrieval speed | Sub-ms (precomputed) | Not applicable | Not applicable
Generation quality | No generation | Good text generation | Best — any-to-any
Spatial reasoning | Weak | Good | Best
Compositionality | Weak (bag of concepts) | Good | Best
Training cost | Moderate | Low (frozen backbone) | Extreme
Open models | CLIP, SigLIP, OpenCLIP | LLaVA, PaliGemma, Qwen-VL | Chameleon (limited)
Best for | Retrieval, zero-shot, backbones | VQA, captioning, document AI | General-purpose reasoning

Decision Framework

Start with the task, not the architecture. The right multimodal approach follows from what your system needs to do and what data you have.

Choose Two-Tower (CLIP/SigLIP) if...

  • You need fast retrieval across millions of items
  • Zero-shot classification without training is sufficient
  • You need a vision backbone for a downstream model
  • You have hundreds of millions of image-text pairs
  • Latency constraints are strict (<10ms per query)

Choose Cross-Attention Fusion (LLaVA/Flamingo) if...

  • You need to generate text conditioned on images (VQA, captioning)
  • You want to add vision to an existing LLM without retraining it
  • You have limited paired data (<1M examples)
  • You need a good balance of quality and training cost
  • Fine-grained spatial understanding matters

Choose Unified Embedding (Gemini) if...

  • You need any-to-any generation (text, image, action)
  • Complex multi-step reasoning across modalities is required
  • You're building an embodied agent (robot, game AI)
  • You have massive compute budget and diverse multimodal data
  • Maximum capability matters more than efficiency

Use Cases in Practice

E-Commerce Product Search

Two-Tower

User types "blue running shoes" or uploads a photo. CLIP/SigLIP encodes the query and catalog images independently. Retrieve top-K by cosine similarity in <5ms over millions of products. Pre-compute all product embeddings offline. This is the approach Nike and similar retailers use.
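A minimal sketch of the serving path, assuming catalog embeddings were precomputed offline and both sides are ℓ2-normalized so the dot product equals cosine similarity; a production system would swap the brute-force scan for an approximate nearest-neighbor index such as FAISS or HNSW:

```python
import numpy as np

def top_k_products(query_emb, catalog_embs, k=10):
    """query_emb: [d]; catalog_embs: [N, d]. Returns indices of the
    k best-matching products, best first."""
    scores = catalog_embs @ query_emb        # [N] cosine similarities
    top = np.argpartition(-scores, k)[:k]    # top-k, unordered, O(N)
    return top[np.argsort(-scores[top])]     # sort only the k winners
```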

Medical Image Report Generation

Cross-Attention

Radiologist uploads a chest X-ray. A frozen CheXNet or BiomedCLIP encodes the image. A medical LLM generates structured findings via cross-attention to the visual features. Training data: ~200K image-report pairs from hospital archives.

Robot Manipulation from Language Instructions

Unified

"Pick up the red block and place it on the blue plate." Camera feeds + language instruction are tokenized into a single sequence. The unified model reasons about spatial relationships and outputs action tokens for the robot arm. This is the RT-2 / Gemini Robotics / π₀ approach.

The Trajectory

The field is converging toward unified models. CLIP (2021) proved contrastive pretraining works. Flamingo (2022) showed you can graft vision onto LLMs. LLaVA (2023) made it accessible. Gemini (2023) and GPT-4V showed end-to-end multimodal training at scale produces the best results. The open-source ecosystem is following: LLaVA → LLaVA-NeXT → models with video, audio, and action support.

But two-tower models aren't going away. They remain the backbone: SigLIP powers PaliGemma, CLIP powers LLaVA, DINOv2 powers many vision pipelines. The practical pattern is: contrastive pretraining → cross-attention fine-tuning → (optionally) unified end-to-end training. Each paradigm builds on the last.