Encoder-Only Transformers

BERT, DeBERTa & Bidirectional Understanding

Encoder-only transformers see the entire input at once — every token attends to every other token bidirectionally. This makes them the natural choice when you need to understand text rather than generate it. Classification, named entity recognition, semantic similarity, retrieval, and embeddings are their home turf.

When to Use Encoder-Only

The core insight: Bidirectional attention means each token sees the full context — both left and right. This is ideal when the task is understanding a complete input, not generating text token by token. If your output is a label, a span, or an embedding — not a sentence — start here.

Text Classification

Sentiment analysis, spam detection, topic labeling, intent classification. The [CLS] token pools the full sequence into a single vector; a linear head maps it to class probabilities. Fine-tuning BERT on 1K–10K labeled examples often beats prompting a 70B LLM.
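
A minimal fine-tuning sketch using the Hugging Face Transformers Trainer; the dataset variables (raw_train, raw_eval), the label count, and the hyperparameters are placeholders for illustration, not a tuned recipe.

```python
# Fine-tuning a BERT classifier: the [CLS] representation feeds a linear head.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)  # e.g. positive / negative / neutral

def tokenize(batch):
    # [CLS] and [SEP] are added automatically; truncate at the 512-token limit
    return tokenizer(batch["text"], truncation=True, max_length=512)

# raw_train / raw_eval: datasets.Dataset objects with "text" and "label" columns (placeholders)
train_ds = raw_train.map(tokenize, batched=True)
eval_ds = raw_eval.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```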

Named Entity Recognition & Token-Level Tasks

NER, POS tagging, slot filling. A shared classification head is applied to each token's output representation, giving one label per token. Bidirectional context is critical: recognizing "Apple" as ORG vs. FOOD depends on what comes after it, not just before.
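
For a quick sense of token-level output, a hedged inference sketch using the Transformers pipeline; the checkpoint name is an assumption, and any BERT-style NER model fine-tuned with BIO tags behaves the same way.

```python
# Token-classification inference; aggregation_strategy merges subword pieces
# back into whole-word entities.
from transformers import pipeline

ner = pipeline("token-classification",
               model="dslim/bert-base-NER",      # assumed public NER checkpoint
               aggregation_strategy="simple")

print(ner("Apple hired Tim Cook in Cupertino."))
# e.g. [{'entity_group': 'ORG', 'word': 'Apple', ...},
#       {'entity_group': 'PER', 'word': 'Tim Cook', ...},
#       {'entity_group': 'LOC', 'word': 'Cupertino', ...}]
```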

Semantic Similarity & Sentence Pairs

Natural language inference (NLI), semantic textual similarity (STS), duplicate detection, paraphrase identification. Encode the two sentences jointly with a [SEP] separator, or use Sentence-BERT's siamese architecture for fast pairwise comparison.
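
A small sketch of the Sentence-BERT route, assuming the sentence-transformers library and a generic embedding checkpoint (the model name is illustrative):

```python
# Bi-encoder similarity: embed each sentence independently, compare by cosine.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
emb_a = model.encode("A man is playing a guitar.", convert_to_tensor=True)
emb_b = model.encode("Someone is strumming an instrument.", convert_to_tensor=True)

print(util.cos_sim(emb_a, emb_b))  # high cosine score -> semantically similar
```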

Retrieval & Ranking

Encode queries and documents into dense vectors and retrieve via approximate nearest-neighbor search. Use a bi-encoder (query and document encoded separately) for first-stage retrieval at scale, and a cross-encoder (query and document encoded jointly) for reranking the top candidates.
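
A sketch of the two-stage pattern under those assumptions: a bi-encoder recalls candidates, a cross-encoder reranks them. Both model names are illustrative.

```python
# Stage 1: bi-encoder recall. Stage 2: cross-encoder rerank of the candidates.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

docs = ["BERT is an encoder-only transformer.",
        "GPT generates text autoregressively.",
        "The weather in Paris is mild in spring."]
query = "Which model is encoder-only?"

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = bi_encoder.encode(docs, convert_to_tensor=True)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, doc_emb, top_k=2)[0]  # first-stage recall

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, docs[h["corpus_id"]]) for h in hits]
scores = cross_encoder.predict(pairs)  # scores query+doc encoded jointly
print(sorted(zip(scores, pairs), reverse=True)[0])  # best-ranked candidate
```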

Feature Extraction & Embeddings

Use the encoder as a feature extractor for downstream ML pipelines. Pool the final hidden states into fixed-size vectors. Feed into XGBoost, logistic regression, or clustering. Often the fastest path to a production system.
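
One way this looks in practice, as a sketch: mean-pool the frozen encoder's final hidden states (ignoring padding) and train a scikit-learn classifier on the resulting vectors. The train_texts / train_labels variables are placeholders.

```python
# Frozen BERT as a feature extractor feeding a classical classifier.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state       # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1)          # zero out padding
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy() # mean pooling

X_train = embed(train_texts)  # train_texts / train_labels: your labeled data
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
```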

What to Look for in Your Data

Labeled vs. Unlabeled Data

  • Unlabeled text: Pretrain (or use an existing pretrained model). BERT was pretrained on 3.3B words from BooksCorpus + English Wikipedia.
  • Labeled data for fine-tuning: Encoder models are remarkably sample-efficient. BERT fine-tunes effectively on as few as 1K–10K labeled examples. Compare this to decoder-only models that often need 50K+ for instruction tuning.
  • Rule of thumb: If you have <50K labeled examples and the task is classification or extraction, encoder-only will almost certainly outperform a prompted LLM at lower cost.

Task Type: Sequence-Level vs. Token-Level

  • Sequence-level (classification, sentiment, NLI): Use the [CLS] token representation. One label per input.
  • Token-level (NER, POS tagging, slot filling): Use per-token representations. One label per token. Requires token-aligned labels; be careful with subword tokenization splitting entities (see the alignment sketch after this list).
  • Span extraction (question answering, extractive summarization): Predict start and end positions in the input. SQuAD-style — the answer must be a contiguous span.
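
A sketch of the subword alignment step mentioned above, using a fast tokenizer's word_ids() mapping. Labeling only the first subword of each word and masking the rest with -100 is a common convention, not the only option.

```python
# Align word-level BIO tags with subword tokens so the loss ignores
# special tokens and continuation subwords.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
words  = ["Angela", "Merkel", "visited", "Washington"]
labels = ["B-PER",  "I-PER",  "O",       "B-LOC"]   # word-level tags

enc = tokenizer(words, is_split_into_words=True)
aligned, prev = [], None
for word_id in enc.word_ids():
    if word_id is None:          # special tokens [CLS] / [SEP]
        aligned.append(-100)
    elif word_id != prev:        # first subword of a word: keep its label
        aligned.append(labels[word_id])
    else:                        # continuation subword: ignored by the loss
        aligned.append(-100)
    prev = word_id

print(list(zip(enc.tokens(), aligned)))
```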

Single Input vs. Sentence Pairs

  • Single sentence: Classification, NER. Format: [CLS] tokens [SEP]
  • Sentence pairs: NLI, STS, duplicate detection, reranking. Format: [CLS] sent_A [SEP] sent_B [SEP]. Segment embeddings distinguish which sentence each token belongs to.
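
The pair format falls out of the tokenizer automatically when you pass two texts, as in this sketch:

```python
# Passing two texts builds [CLS] sent_A [SEP] sent_B [SEP] and emits
# segment ids (token_type_ids) distinguishing the two sentences.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("A man inspects a uniform.", "The man is sleeping.")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'a', 'man', ..., '[SEP]', 'the', 'man', ..., '[SEP]']
print(enc["token_type_ids"])  # 0s for the sentence-A span, 1s for sentence B
```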

Domain Specificity

If your text is outside the general web domain, consider domain-adapted models:

  • SciBERT — Scientific papers (Semantic Scholar corpus)
  • BioBERT / PubMedBERT — Biomedical literature
  • FinBERT — Financial text (SEC filings, earnings calls)
  • LegalBERT — Legal documents and contracts
  • CodeBERT — Source code and documentation

Domain pretraining typically adds 2–5% accuracy on in-domain tasks compared to general BERT.

Sequence Length

  • BERT/RoBERTa: 512 tokens max. Fine for most sentences and short documents.
  • Longformer / BigBird: 4,096 tokens with sparse attention. For long documents.
  • ModernBERT: 8,192 tokens with Flash Attention. The modern default.
  • If your inputs regularly exceed 512 tokens, use a long-context model or truncation strategy (head-only, tail-only, or head+tail).
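
A minimal head+tail truncation helper, shown as an illustration rather than a library built-in (handling of special tokens is left out for brevity):

```python
# Keep the first `head` tokens and fill the remaining budget from the end of
# the document -- useful when key information sits in the intro and conclusion.
def head_tail_truncate(token_ids, max_len=512, head=128):
    if len(token_ids) <= max_len:
        return token_ids
    tail = max_len - head
    return token_ids[:head] + token_ids[-tail:]
```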

Architecture

Key distinction from decoders: No causal mask. Every token attends to every other token. The attention matrix is fully dense, not triangular. This is what gives encoders their understanding power — and why they can't generate text autoregressively.
[Figure: Encoder-only architecture (BERT). An input such as "[CLS] The cat sat on the [MASK] [SEP]" passes through an embedding layer (token + segment + position embeddings, summed), then through a stack of 12 layers, each combining bidirectional multi-head self-attention (every token attends to every other token; no causal mask) with a GELU feed-forward network, LayerNorm, and residual connections. The pooled [CLS] output feeds a sequence-classification head (e.g., Positive / Negative / Neutral); the per-token outputs feed a token-classification head (e.g., O, B-PER, I-PER). BERT-Base: 12 layers, 768 hidden, 12 heads, 110M params. BERT-Large: 24 layers, 1024 hidden, 16 heads, 340M params.]

Pretraining Objectives

Encoder-only models learn representations through self-supervised objectives on unlabeled text. The choice of pretraining objective has a massive impact on downstream performance.

1 Masked Language Modeling (MLM)

BERT, RoBERTa, DeBERTa

Randomly mask 15% of input tokens. The model predicts the original token at each masked position using bidirectional context. This forces the encoder to build rich contextual representations.

Input:   The [MASK] sat on the [MASK]
Predict: cat, mat (maximizing P(cat | context) · P(mat | context))

Of the 15% of tokens selected for masking, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. This keeps the model from relying on the literal [MASK] token and reduces the mismatch with fine-tuning, where [MASK] never appears.
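
In the Hugging Face stack, this 80/10/10 recipe is what the standard MLM data collator applies; a small sketch of wiring it up:

```python
# The MLM collator masks 15% of tokens per batch using the 80/10/10 split and
# sets labels to -100 everywhere except the masked positions.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

batch = collator([tokenizer("The cat sat on the mat.")])
print(batch["input_ids"])  # some tokens replaced by [MASK] / random ids
print(batch["labels"])     # original ids at masked positions, -100 elsewhere
```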

2 Next Sentence Prediction (NSP)

BERT only

Given two sentences, predict whether sentence B actually follows sentence A in the original text. 50% real pairs, 50% random pairs. BERT used this to learn sentence-level relationships.

The verdict: RoBERTa showed that removing NSP and training with longer sequences and more data improves performance. NSP is now considered unnecessary — MLM alone is sufficient when trained properly.

3 Replaced Token Detection (RTD)

ELECTRA

Instead of masking, a small generator network replaces tokens with plausible alternatives. The encoder (discriminator) must detect which tokens were replaced. Every token gets a training signal, not just the 15% that are masked — making ELECTRA 4× more sample-efficient than BERT.

  • Generator: Small masked LM that produces replacement tokens
  • Discriminator: The encoder being trained — binary classification at each position (original vs. replaced)
  • At fine-tuning time, discard the generator, use only the discriminator
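
A small inference sketch with the published small discriminator checkpoint, to make the per-token original-vs-replaced signal concrete (whether it flags this particular swap is not guaranteed):

```python
# ELECTRA discriminator: one replaced/original score per input token.
import torch
from transformers import AutoTokenizer, ElectraForPreTraining

name = "google/electra-small-discriminator"
tokenizer = AutoTokenizer.from_pretrained(name)
model = ElectraForPreTraining.from_pretrained(name)

# "ate" was swapped in for the original token; the discriminator should flag it
inputs = tokenizer("The cat ate on the mat.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits       # shape: (1, seq_len)
print(torch.sigmoid(logits).round())      # values near 1 mark predicted replacements
```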

Key Models

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Devlin et al., Google 2018 · MLM + NSP · 110M/340M params · 512 token context · The paper that started it all
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Liu et al., Meta 2019 · No NSP, dynamic masking, larger batches, 160GB text · Same architecture, better training recipe
DeBERTa: Decoding-enhanced BERT with Disentangled Attention
He et al., Microsoft 2020 · Disentangled attention (separate content & position vectors) · Enhanced mask decoder · Surpassed the human baseline on SuperGLUE
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
Clark et al., Google/Stanford 2020 · Replaced token detection · 4× more sample-efficient · ELECTRA-Small outperforms the original GPT on GLUE at a fraction of the compute
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Reimers & Gurevych, 2019 · Siamese/triplet architecture · Cosine similarity for sentence comparison · 5000× faster than cross-encoder
ModernBERT: Smarter, Better, Faster, Longer
Warner et al., 2024 · RoPE + Flash Attention + GeGLU + 8192 context · Alternating local/global attention · Modern BERT for the 2020s

Model Comparison

| Model      | Pretraining | Key Innovation             | Context | Params      | Best For                 |
|------------|-------------|----------------------------|---------|-------------|--------------------------|
| BERT       | MLM + NSP   | Bidirectional pretraining  | 512     | 110M / 340M | Baseline, fine-tuning    |
| RoBERTa    | MLM only    | Better training recipe     | 512     | 125M / 355M | General NLU tasks        |
| DeBERTa    | MLM         | Disentangled attention     | 512     | 140M / 400M | Highest accuracy         |
| ELECTRA    | RTD         | All-token training signal  | 512     | 14M / 335M  | Low-compute training     |
| ModernBERT | MLM         | RoPE, Flash Attn, GeGLU    | 8192    | 150M / 395M | Long docs, modern stack  |

When NOT to Use Encoder-Only

Encoder-only models cannot generate text. They produce representations, not sequences. If your task requires producing new text, you need a decoder.

Text Generation

Chatbots, content creation, code generation, summarization (abstractive), translation. These require autoregressive decoding — use decoder-only (GPT, LLaMA) or encoder-decoder (T5, BART).

Long-Form Reasoning & Chain-of-Thought

Multi-step mathematical reasoning, complex instruction following, planning. Decoder-only models excel here because they can "think out loud" token by token. Encoders produce a fixed set of representations for the input; they have no mechanism to generate intermediate reasoning steps.

Sequence-to-Sequence Tasks

Translation, abstractive summarization, data-to-text. These need both an encoder (understand input) and a decoder (generate output). Use T5, BART, or mBART.

Decision Framework

The practical test: If your output is a label (classification), a span (extraction), or a vector (embedding) — use encoder-only. If your output is text — use a decoder.

Choose Encoder-Only If...

  • Your task is classification, NER, extraction, similarity, or retrieval
  • You have 1K–100K labeled examples (sweet spot for fine-tuning)
  • Latency matters — encoder inference is fast (single forward pass, no autoregressive loop)
  • You need deterministic, reproducible outputs (no sampling variability)
  • Cost matters — a fine-tuned 110M BERT beats a prompted 70B LLM on many classification tasks at 1/600th the compute
  • Your text is short to medium (<512 tokens, or <8K with ModernBERT)
  • You need embeddings for downstream ML pipelines or vector databases

Use Cases in Practice

Sentiment Analysis at Scale

Classification

E-commerce reviews, social media monitoring, customer feedback. Fine-tune BERT or RoBERTa on 5K–10K labeled reviews. Single forward pass per review — process millions per hour on a single GPU. Accuracy typically 92–95% on binary sentiment, competitive with GPT-4 at a fraction of the cost.

Named Entity Recognition in Medical Records

Token Classification

Extract drug names, dosages, conditions, procedures from clinical notes. Fine-tune BioBERT or PubMedBERT with BIO tagging. Token-level classification with bidirectional context is critical — "left" means different things in "patient left the hospital" vs. "pain in left knee."

Semantic Search with Sentence-BERT

Retrieval

Encode your document corpus into vectors with Sentence-BERT (or the newer E5/GTE models). Store in a vector database (Pinecone, Weaviate, pgvector). At query time, encode the query and retrieve top-K by cosine similarity. Sub-10ms latency over millions of documents. Powers RAG pipelines.
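
A compact sketch of that pipeline, with FAISS standing in for the managed vector database and an assumed embedding checkpoint:

```python
# Encode the corpus once, index the vectors, and search by inner product
# (equivalent to cosine similarity when embeddings are L2-normalized).
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus = ["Return policy for damaged items",
          "How to reset your password",
          "Shipping times for international orders"]

doc_vecs = model.encode(corpus, normalize_embeddings=True)  # (N, 384) float32
index = faiss.IndexFlatIP(doc_vecs.shape[1])
index.add(doc_vecs)

query_vec = model.encode(["I forgot my login credentials"],
                         normalize_embeddings=True)
scores, ids = index.search(query_vec, 2)     # top-2 nearest documents
print([corpus[i] for i in ids[0]])
```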

Document Classification for Compliance

Classification

Classify legal documents, SEC filings, insurance claims into categories. Fine-tune FinBERT or LegalBERT on domain-specific labels. For documents exceeding 512 tokens, use ModernBERT (8K context) or Longformer. Deterministic outputs are critical for audit trails — no sampling variance.

The Trajectory

Encoder-only isn't dead — it's specialized. The LLM wave shifted attention to decoders, but encoders remain the right tool for classification, retrieval, and embedding tasks. ModernBERT (2024) showed there's still room for improvement with modern techniques (RoPE, Flash Attention, longer context). Meanwhile, the embedding model space (E5, GTE, Nomic) continues to advance encoder architectures for retrieval.

The practical reality: most production NLP systems use encoder-only models for understanding tasks and decoder-only models for generation. A fine-tuned DeBERTa for classification + a vector database with E5 embeddings for retrieval + an LLM for generation covers most enterprise use cases.