Multimodal Transformers

Cross-Attention, Unified Embeddings & Contrastive Learning

Multimodal transformers fuse information across modalities — vision, language, audio, actions — into shared representations. This walkthrough covers the three dominant architectural paradigms: contrastive two-tower (CLIP), cross-attention fusion (Flamingo, LLaVA), and unified embedding (Gemini, GPT-4V) — when to use each, what to look for in your data, and how they compare.

When Do You Need a Multimodal Transformer?

The core question: Does your task require joint reasoning across modalities — not just processing each input independently? If yes, you need multimodal. If each modality can be handled in isolation and combined by simple rules, you probably don't.

Problem Signals

Reach for multimodal when your task exhibits one or more of these patterns:

Cross-Modal Grounding

The answer depends on connecting a concept in one modality to a region/span in another. "Where is the dog?" requires grounding language to image pixels. Visual question answering, referring expression comprehension, image captioning.

Multi-Modal Generation

The output modality differs from the input, or you generate in multiple modalities jointly. Text-to-image (Stable Diffusion/DALL-E), image-conditioned text generation, vision-language-action models for robotics.

Retrieval Across Modalities

Given a query in modality A (text), find matches in modality B (images/video/audio). Zero-shot image search, video retrieval from natural language queries, audio-visual correspondence.

Complementary Evidence Fusion

Each modality provides partial evidence that must be integrated — like a medical AI combining radiology images with clinical notes, or a robot combining camera feeds with language instructions.

What to Look for in Your Data

Paired vs. Unpaired Data

  • Paired data (image-caption pairs, video-transcript): Enables contrastive learning (CLIP) and supervised fusion. The quality of pairing matters enormously — noisy web-scraped alt-text vs. human-written captions produce very different models.
  • Unpaired data (images without captions, text without images): Can still be used for pretraining individual encoders. Multimodal alignment then requires a bridge — typically a projection layer or adapter trained on a smaller paired set.

Data Scale & Modality Balance

  • Contrastive models (CLIP, SigLIP) need hundreds of millions of image-text pairs. CLIP trained on 400M, SigLIP on billions.
  • Fusion models (LLaVA, Flamingo) can work with much less paired data because they leverage frozen pretrained backbones — LLaVA-1.5 used only 665K instruction pairs for fine-tuning.
  • Modality imbalance: If you have 10× more text than images, a two-tower contrastive approach won't utilize the extra text. A fusion model with a frozen LLM backbone can leverage it.

Alignment Granularity

  • Global alignment: "This image shows a dog" — whole-image to whole-sentence. Good for retrieval. CLIP excels here.
  • Region-level: "The red car on the left" — requires spatial grounding. Needs cross-attention or region features.
  • Token-level: Pixel-by-pixel or frame-by-frame correspondence. Needs dense fusion (unified embedding or heavy cross-attention).

Latency & Deployment Constraints

  • Two-tower: Encode each modality once, compare with dot product. Sub-millisecond retrieval over millions of items. Best for serving at scale.
  • Cross-attention fusion: Requires running both modalities through shared layers. More compute per query, but richer understanding.
  • Unified embedding: Most expensive at inference, but most capable. Appropriate when quality matters more than latency.

1 Contrastive Two-Tower (CLIP / SigLIP)

Key idea: Train two separate encoders (one per modality) to embed matching pairs close together and non-matching pairs far apart in a shared vector space. At inference, compare any image to any text via cosine similarity — no fusion needed.
[Figure: Two-Tower Contrastive Architecture. A 224×224 image feeds an image encoder (ViT-L/14) producing v ∈ Rᵈ; a caption ("a photo of...") feeds a text encoder (Transformer) producing t ∈ Rᵈ; cos(v, t) gives the similarity, trained with an InfoNCE or sigmoid loss.]

How It Works

image: [B, 3, 224, 224] → [B, d] · text: [B, L] → [B, d]

Both towers project to a shared embedding dimension d (typically 512 or 768). For CLIP ViT-L/14: image encoder produces a [B, 257, 1024] patch sequence (1 [CLS] + 16×16 patches), the [CLS] token is projected to [B, 768]; the text encoder produces [B, L, 512] and the [EOS] token is projected to [B, 768]. Both are ℓ2-normalized.

Training: stack the ℓ2-normalized embeddings into V and T (each [B, d]); the similarity matrix S = V Tᵀ (shape [B, B]) has true pairs on the diagonal and mismatched pairs off-diagonal. InfoNCE treats each row as a B-way classification (and symmetrically for columns).
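A minimal PyTorch sketch of this objective, assuming the two towers already produce [B, d] embeddings (the function name and fixed temperature are illustrative; CLIP learns the temperature as a parameter):

```python
import torch
import torch.nn.functional as F

def clip_infonce_loss(v, t, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    v: [B, d] image embeddings; t: [B, d] text embeddings.
    Matching pairs share the same row index.
    """
    v = F.normalize(v, dim=-1)                     # l2-normalize both towers
    t = F.normalize(t, dim=-1)
    S = (v @ t.T) / temperature                    # [B, B] similarity matrix
    labels = torch.arange(S.size(0), device=S.device)
    loss_i2t = F.cross_entropy(S, labels)          # each row: image -> which text?
    loss_t2i = F.cross_entropy(S.T, labels)        # each column: text -> which image?
    return 0.5 * (loss_i2t + loss_t2i)
```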

SigLIP improvement: Replaces the softmax-based InfoNCE with a pairwise sigmoid loss. Each (i, j) entry of S is independently classified as matching or not, removing the need for synchronized batch statistics across GPUs. This allows larger effective batch sizes and better scaling.
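For comparison, a sketch of the pairwise sigmoid objective under the same assumptions; in SigLIP the logit scale and bias are learned scalars (the paper initializes them near log 10 and -10), and the sum is normalized by batch size:

```python
import torch
import torch.nn.functional as F

def siglip_loss(v, t, logit_scale, logit_bias):
    """Pairwise sigmoid loss: every (i, j) entry of the similarity
    matrix is an independent binary decision, so no softmax over the
    batch and no synchronized batch statistics are needed."""
    v = F.normalize(v, dim=-1)
    t = F.normalize(t, dim=-1)
    logits = (v @ t.T) * logit_scale + logit_bias               # [B, B]
    labels = 2.0 * torch.eye(v.size(0), device=v.device) - 1.0  # +1 diag, -1 off-diag
    return -F.logsigmoid(labels * logits).sum() / v.size(0)
```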

When to Use Two-Tower

  • Zero-shot classification: Compare an image against many class-name prompts. No training needed for new categories (see the sketch after this list).
  • Cross-modal retrieval: Pre-compute embeddings offline, retrieve in sub-millisecond with approximate nearest neighbors.
  • Building a multimodal backbone: CLIP/SigLIP vision encoders are used as the "eyes" for downstream models (LLaVA, Flamingo, PaliGemma).
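A sketch of prompt-based zero-shot classification built on such a model; encode_text is a hypothetical stand-in for your text tower (list of strings in, [N, d] embeddings out), and the fixed ×100 logit scale approximates CLIP's learned temperature at convergence:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_emb, class_names, encode_text):
    """Score one image embedding against one prompt per class."""
    prompts = [f"a photo of a {name}" for name in class_names]
    text_emb = F.normalize(encode_text(prompts), dim=-1)       # [N, d]
    image_emb = F.normalize(image_emb, dim=-1)                 # [d]
    probs = (image_emb @ text_emb.T * 100.0).softmax(dim=-1)   # [N]
    return class_names[probs.argmax().item()], probs
```

Adding a new category is just adding a prompt; nothing on the image side needs retraining or re-indexing.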

Limitations

  • No fine-grained interaction: A single global vector per modality can't represent spatial relationships, counting, or compositional reasoning ("the cat is to the left of the dog").
  • No generation: Two-tower models produce embeddings, not text or images. You can retrieve, not generate.
  • Bag-of-concepts bias: CLIP often behaves like a bag of words — "a dog biting a man" and "a man biting a dog" get similar scores.

Key Papers & Models

CLIP: Learning Transferable Visual Models From Natural Language Supervision
Radford et al., OpenAI 2021 · 400M image-text pairs · ViT-L/14 + Transformer text encoder
SigLIP: Sigmoid Loss for Language Image Pre-Training
Zhai et al., Google 2023 · Pairwise sigmoid loss · Scales to larger batches without cross-GPU sync
ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
Jia et al., Google 2021 · 1.8B noisy image-text pairs · EfficientNet + BERT

2 Cross-Attention Fusion (Flamingo / LLaVA)

Key idea: Keep pretrained unimodal encoders frozen (or lightly tuned), and add cross-attention layers that let one modality attend to the other. The language model "looks at" visual features through learned attention gates, enabling rich interaction without training from scratch.
[Figure: Cross-Attention Fusion Architecture. A frozen vision encoder (ViT) yields visual tokens; a projection / Perceiver compresses them to K tokens; the language model (LLaMA, Vicuna, etc.) runs, in each of its L layers, self-attention over text, cross-attention with Q = text and K/V = visual tokens, and a feed-forward (FFN) block, producing the generated text. Projection variants: Flamingo, Perceiver resampler; LLaVA, linear/MLP projection; Qwen-VL, cross-attn compressor; PaliGemma, SigLIP + linear.]

How It Works

vision: [B, N_v, d_v] → [B, K, d_llm]

The vision encoder (typically CLIP/SigLIP ViT-L/14 or ViT-H/14) processes the image into a grid of N_v patch tokens — N_v = 256 for 224×224 at patch size 14. The projection module then maps these into the LLM's d_llm-dim embedding space and (optionally) compresses the sequence length.

The critical design choice is how visual tokens interact with text:

  • Flamingo: A Perceiver Resampler first compresses [B, N_v, d_v] → [B, 64, d_llm] via learned query tokens. Gated cross-attention layers (Q = text, K/V = visual) are inserted between frozen LLM layers; gates are zero-initialized so the model starts as a pure LLM.
  • LLaVA: Simpler — an MLP projection maps [B, N_v, d_v] → [B, N_v, d_llm] and the projected tokens are concatenated with text tokens along the sequence axis (sketched after this list). The LLM's existing self-attention handles the interaction. No architectural changes to the LLM.
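A minimal sketch of the LLaVA-style path, assuming a frozen vision tower with feature width d_vision and an LLM with embedding width d_llm (the dimensions are illustrative; LLaVA-1.5 uses a two-layer MLP of this shape):

```python
import torch
import torch.nn as nn

class LlavaStyleProjector(nn.Module):
    """Map frozen ViT patch features into the LLM token-embedding space,
    then splice them into the text sequence."""
    def __init__(self, d_vision=1024, d_llm=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_vision, d_llm),
            nn.GELU(),
            nn.Linear(d_llm, d_llm),
        )

    def forward(self, visual_feats, text_embeds):
        # visual_feats: [B, N_v, d_vision] from the frozen vision encoder
        # text_embeds:  [B, L, d_llm] from the LLM's embedding table
        visual_tokens = self.mlp(visual_feats)       # [B, N_v, d_llm]
        # Concatenate along the sequence axis; the LLM's ordinary
        # self-attention then handles all text-image interaction.
        return torch.cat([visual_tokens, text_embeds], dim=1)  # [B, N_v + L, d_llm]
```

The trade-off is visible in the shapes: every image costs N_v positions of LLM context, which is why Flamingo-style compression to a fixed 64 tokens pays off at inference time.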

When to Use Cross-Attention Fusion

  • Visual question answering & reasoning: "What color is the car in front of the house?" requires grounding and spatial understanding.
  • Image/video captioning & description: Generate detailed, contextual text about visual content.
  • Document understanding: Process charts, diagrams, screenshots with text questions.
  • Leveraging existing LLMs: You have a strong pretrained LLM and want to add vision without retraining from scratch.
  • Limited paired data: Freeze both encoders, only train the projection layer and adapters — LLaVA-1.5 fine-tuned on 665K examples.

Flamingo vs. LLaVA: The Architecture Decision

Dimension | Flamingo-style | LLaVA-style
Fusion mechanism | Dedicated cross-attention layers | Concatenate in token space
LLM modification | Inserts new layers (gated) | None — LLM unchanged
Visual token count | Fixed (64 via Perceiver) | Variable (256+ for ViT-L/14)
Multi-image | Natural (interleaved) | Requires context management
Training cost | Higher (new cross-attn params) | Lower (just projection layer)
Inference cost | Lower (fewer visual tokens) | Higher (many visual tokens in context)

Key Papers & Models

Flamingo: a Visual Language Model for Few-Shot Learning
Alayrac et al., DeepMind 2022 · Perceiver resampler + gated cross-attention · 80B params
LLaVA: Visual Instruction Tuning
Liu et al., 2023 · CLIP ViT-L + Vicuna · MLP projection · 665K instruction pairs
PaliGemma: A versatile 3B VLM for transfer
Google 2024 · SigLIP-So400m + Gemma 2B · Linear projection · Strong at fine-grained tasks
Qwen-VL: A Versatile Vision-Language Model
Alibaba 2023 · Cross-attention compressor (256 queries) · Grounding & multi-image support

3 Unified Embedding (Gemini / GPT-4V)

Key idea: All modalities are tokenized into a single sequence and processed by one transformer. No separate encoders at inference — vision tokens, text tokens, and (optionally) audio/action tokens are all first-class citizens in the same attention mechanism. The model is trained end-to-end on interleaved multimodal data.
[Figure: Unified Embedding Architecture. Vision, text, audio, and action tokenizers emit token streams (img_1, img_2, ..., text_1, text_2, ..., aud_1, ..., act_1, ...) into one unified transformer with full self-attention across all modality tokens (× N layers), which can produce image, text, or action output. Any-to-any: image→text, text→image, image+text→action. Natively multimodal: no modality is privileged.]

How It Works

all modalities → [B, L, d_model]

Each modality has a tokenizer that converts raw input into tokens, all of which land in a shared d_model-dim embedding space (typically 2048–8192). The concatenated sequence [B, L, d_model] — where L mixes text, image, and audio tokens — is processed by a standard transformer with full self-attention. Every token attends to every other, regardless of modality.

Tokenization strategies vary (a toy sketch of the shared-vocabulary approach follows this list):

  • Gemini: SigLIP-derived visual encoder produces ~256 tokens per image; these are interleaved with SentencePiece text tokens. Audio uses USM-derived features — ~6.25 tokens/s. All project to the same d_model.
  • Chameleon (Meta): Discretizes each 512×512 image into 1024 VQ-VAE tokens that share the same vocabulary as text. True token-level unification — image and text tokens use the same softmax head over a combined vocab of ~65K.
  • Gemini Robotics: Extends to action tokens for robot control — continuous joint commands are discretized into a small vocab per joint and generated autoregressively.
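A toy sketch of the shared-vocabulary idea (closest in spirit to the Chameleon recipe): each modality's token ids are shifted into a disjoint slice of one combined vocabulary, embedded into a shared d_model space, and attended to jointly. Vocabulary sizes and dimensions are illustrative, and a real model would be a causal decoder rather than this encoder stack:

```python
import torch
import torch.nn as nn

class UnifiedStub(nn.Module):
    """One embedding table and one transformer for all modalities."""
    def __init__(self, text_vocab=32000, image_vocab=8192, action_vocab=256,
                 d_model=2048, n_heads=16, n_layers=4):
        super().__init__()
        self.image_offset = text_vocab
        self.action_offset = text_vocab + image_vocab
        self.embed = nn.Embedding(text_vocab + image_vocab + action_vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)

    def forward(self, text_ids, image_ids, action_ids):
        # Shift each modality into its slice of the combined vocabulary,
        # then concatenate into a single [B, L] token sequence.
        seq = torch.cat([image_ids + self.image_offset,
                         text_ids,
                         action_ids + self.action_offset], dim=1)
        x = self.embed(seq)           # [B, L, d_model], one shared space
        return self.transformer(x)    # full self-attention across modalities
```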

When to Use Unified Embedding

  • Any-to-any generation: Tasks that require generating in multiple modalities (text → image, image → text, interleaved).
  • Complex multi-modal reasoning: Math problems with diagrams, multi-step visual reasoning, science questions with charts.
  • Embodied AI / robotics: Vision + language understanding + action generation in a single model (Gemini Robotics, RT-2).
  • You have massive compute and data: These models require training at unprecedented scale. Gemini Ultra used thousands of TPUs.

Limitations

  • Compute cost: Full self-attention over all modality tokens is O(n²) where n includes visual tokens (often 256–4096+). Much more expensive than two-tower.
  • Training complexity: Balancing loss across modalities is hard. Image generation and text generation compete for capacity. Modality-specific data ratios matter enormously.
  • Not open-source: The most capable unified models (Gemini, GPT-4V) are proprietary. Open alternatives (Chameleon) exist but lag in capability.

Key Papers & Models

Gemini: A Family of Highly Capable Multimodal Models
Google DeepMind 2023 · Natively multimodal · Text, image, audio, video, code
GPT-4V(ision) System Card
OpenAI 2023 · Unified multimodal reasoning · Image understanding + text generation
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Meta 2024 · VQ-VAE image tokenization · Shared vocabulary for image & text generation
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Google DeepMind 2023 · PaLM-E/PaLI backbone · Actions as text tokens

Paradigm Comparison

Dimension | Two-Tower (CLIP) | Cross-Attention (LLaVA) | Unified (Gemini)
Interaction depth | None — dot product only | Medium — cross-attn layers | Deep — full self-attention
Training data needed | 400M+ pairs | 600K–2M (fine-tune) | Billions (end-to-end)
Retrieval speed | Sub-ms (precomputed) | Not applicable | Not applicable
Generation quality | No generation | Good text generation | Best — any-to-any
Spatial reasoning | Weak | Good | Best
Compositionality | Weak (bag of concepts) | Good | Best
Training cost | Moderate | Low (frozen backbone) | Extreme
Open models | CLIP, SigLIP, OpenCLIP | LLaVA, PaliGemma, Qwen-VL | Chameleon (limited)
Best for | Retrieval, zero-shot, backbones | VQA, captioning, document AI | General-purpose reasoning

Decision Framework

Start with the task, not the architecture. The right multimodal approach follows from what your system needs to do and what data you have.

Choose Two-Tower (CLIP/SigLIP) if...

  • You need fast retrieval across millions of items
  • Zero-shot classification without training is sufficient
  • You need a vision backbone for a downstream model
  • You have hundreds of millions of image-text pairs
  • Latency constraints are strict (<10ms per query)

Choose Cross-Attention Fusion (LLaVA/Flamingo) if...

  • You need to generate text conditioned on images (VQA, captioning)
  • You want to add vision to an existing LLM without retraining it
  • You have limited paired data (<1M examples)
  • You need a good balance of quality and training cost
  • Fine-grained spatial understanding matters

Choose Unified Embedding (Gemini) if...

  • You need any-to-any generation (text, image, action)
  • Complex multi-step reasoning across modalities is required
  • You're building an embodied agent (robot, game AI)
  • You have massive compute budget and diverse multimodal data
  • Maximum capability matters more than efficiency

Use Cases in Practice

E-Commerce Product Search

Two-Tower

User types "blue running shoes" or uploads a photo. CLIP/SigLIP encodes the query and catalog images independently. Retrieve top-K by cosine similarity in <5ms over millions of products. Pre-compute all product embeddings offline. This is the approach Nike and similar retailers use.
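A minimal sketch of the serving path, assuming catalog embeddings were precomputed offline and both sides are ℓ2-normalized so the dot product equals cosine similarity; a production system would swap the brute-force scan for an approximate nearest-neighbor index such as FAISS or HNSW:

```python
import numpy as np

def top_k_products(query_emb, catalog_embs, k=10):
    """query_emb: [d]; catalog_embs: [N, d]. Returns indices of the
    k best-matching products, best first."""
    scores = catalog_embs @ query_emb        # [N] cosine similarities
    top = np.argpartition(-scores, k)[:k]    # top-k, unordered, O(N)
    return top[np.argsort(-scores[top])]     # sort only the k winners
```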

Medical Image Report Generation

Cross-Attention

Radiologist uploads a chest X-ray. A frozen CheXNet or BiomedCLIP encodes the image. A medical LLM generates structured findings via cross-attention to the visual features. Training data: ~200K image-report pairs from hospital archives.

Robot Manipulation from Language Instructions

Unified

"Pick up the red block and place it on the blue plate." Camera feeds + language instruction are tokenized into a single sequence. The unified model reasons about spatial relationships and outputs action tokens for the robot arm. This is the RT-2 / Gemini Robotics / π₀ approach.

The Trajectory

The field is converging toward unified models. CLIP (2021) proved contrastive pretraining works. Flamingo (2022) showed you can graft vision onto LLMs. LLaVA (2023) made it accessible. Gemini (2023) and GPT-4V showed end-to-end multimodal training at scale produces the best results. The open-source ecosystem is following: LLaVA → LLaVA-NeXT → models with video, audio, and action support.

But two-tower models aren't going away. They remain the backbone: SigLIP powers PaliGemma, CLIP powers LLaVA, DINOv2 powers many vision pipelines. The practical pattern is: contrastive pretraining → cross-attention fine-tuning → (optionally) unified end-to-end training. Each paradigm builds on the last.