Encoder-Only Transformers

BERT, DeBERTa & Bidirectional Understanding

Encoder-only transformers see the entire input at once — every token attends to every other token bidirectionally. This makes them the natural choice when you need to understand text rather than generate it. Classification, named entity recognition, semantic similarity, retrieval, and embeddings are their home turf.

When to Use Encoder-Only

The core insight: Bidirectional attention means each token sees the full context — both left and right. This is ideal when the task is understanding a complete input, not generating text token by token. If your output is a label, a span, or an embedding — not a sentence — start here.

Text Classification

Sentiment analysis, spam detection, topic labeling, intent classification. The [CLS] token pools the full sequence into a single vector; a linear head maps it to class probabilities. Fine-tuning BERT on 1K–10K labeled examples often beats prompting a 70B LLM.
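
A minimal fine-tuning sketch using the Hugging Face Transformers Trainer; the dataset variables (raw_train, raw_eval), the label count, and the hyperparameters are placeholders for illustration, not a tuned recipe.

```python
# Fine-tuning a BERT classifier: the [CLS] representation feeds a linear head.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)  # e.g. positive / negative / neutral

def tokenize(batch):
    # [CLS] and [SEP] are added automatically; truncate at the 512-token limit
    return tokenizer(batch["text"], truncation=True, max_length=512)

# raw_train / raw_eval: datasets.Dataset objects with "text" and "label" columns (placeholders)
train_ds = raw_train.map(tokenize, batched=True)
eval_ds = raw_eval.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```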

Named Entity Recognition & Token-Level Tasks

NER, POS tagging, slot filling. A shared classification head is applied to each token's output representation, giving one label per token. Bidirectional context is critical: recognizing "Apple" as ORG vs. FOOD depends on what comes after it, not just before.
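
For a quick sense of token-level output, a hedged inference sketch using the Transformers pipeline; the checkpoint name is an assumption, and any BERT-style NER model fine-tuned with BIO tags behaves the same way.

```python
# Token-classification inference; aggregation_strategy merges subword pieces
# back into whole-word entities.
from transformers import pipeline

ner = pipeline("token-classification",
               model="dslim/bert-base-NER",      # assumed public NER checkpoint
               aggregation_strategy="simple")

print(ner("Apple hired Tim Cook in Cupertino."))
# e.g. [{'entity_group': 'ORG', 'word': 'Apple', ...},
#       {'entity_group': 'PER', 'word': 'Tim Cook', ...},
#       {'entity_group': 'LOC', 'word': 'Cupertino', ...}]
```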

Semantic Similarity & Sentence Pairs

Natural language inference (NLI), semantic textual similarity (STS), duplicate detection, paraphrase identification. Encode the two sentences jointly with a [SEP] separator, or use Sentence-BERT's siamese architecture for fast pairwise comparison.
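
A small sketch of the Sentence-BERT route, assuming the sentence-transformers library and a generic embedding checkpoint (the model name is illustrative):

```python
# Bi-encoder similarity: embed each sentence independently, compare by cosine.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
emb_a = model.encode("A man is playing a guitar.", convert_to_tensor=True)
emb_b = model.encode("Someone is strumming an instrument.", convert_to_tensor=True)

print(util.cos_sim(emb_a, emb_b))  # high cosine score -> semantically similar
```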

Retrieval & Ranking

Encode queries and documents into dense vectors and retrieve via approximate nearest-neighbor search. Use a bi-encoder (query and document encoded separately) for first-stage retrieval at scale, and a cross-encoder (query and document encoded jointly) for reranking the top candidates.
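
A sketch of the two-stage pattern under those assumptions: a bi-encoder recalls candidates, a cross-encoder reranks them. Both model names are illustrative.

```python
# Stage 1: bi-encoder recall. Stage 2: cross-encoder rerank of the candidates.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

docs = ["BERT is an encoder-only transformer.",
        "GPT generates text autoregressively.",
        "The weather in Paris is mild in spring."]
query = "Which model is encoder-only?"

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = bi_encoder.encode(docs, convert_to_tensor=True)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, doc_emb, top_k=2)[0]  # first-stage recall

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, docs[h["corpus_id"]]) for h in hits]
scores = cross_encoder.predict(pairs)  # scores query+doc encoded jointly
print(sorted(zip(scores, pairs), reverse=True)[0])  # best-ranked candidate
```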

Feature Extraction & Embeddings

Use the encoder as a feature extractor for downstream ML pipelines. Pool the final hidden states into fixed-size vectors. Feed into XGBoost, logistic regression, or clustering. Often the fastest path to a production system.
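
One way this looks in practice, as a sketch: mean-pool the frozen encoder's final hidden states (ignoring padding) and train a scikit-learn classifier on the resulting vectors. The train_texts / train_labels variables are placeholders.

```python
# Frozen BERT as a feature extractor feeding a classical classifier.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state       # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1)          # zero out padding
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy() # mean pooling

X_train = embed(train_texts)  # train_texts / train_labels: your labeled data
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
```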

What to Look for in Your Data

Labeled vs. Unlabeled Data

  • Unlabeled text: Pretrain (or use an existing pretrained model). BERT was pretrained on 3.3B words from BooksCorpus + English Wikipedia.
  • Labeled data for fine-tuning: Encoder models are remarkably sample-efficient. BERT fine-tunes effectively on as few as 1K–10K labeled examples. Compare this to decoder-only models that often need 50K+ for instruction tuning.
  • Rule of thumb: If you have <50K labeled examples and the task is classification or extraction, encoder-only will almost certainly outperform a prompted LLM at lower cost.

Task Type: Sequence-Level vs. Token-Level

  • Sequence-level (classification, sentiment, NLI): Use the [CLS] token representation. One label per input.
  • Token-level (NER, POS tagging, slot filling): Use per-token representations. One label per token. Requires token-aligned labels; be careful with subword tokenization splitting entities (see the alignment sketch after this list).
  • Span extraction (question answering, extractive summarization): Predict start and end positions in the input. SQuAD-style — the answer must be a contiguous span.
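
A sketch of the subword alignment step mentioned above, using a fast tokenizer's word_ids() mapping. Labeling only the first subword of each word and masking the rest with -100 is a common convention, not the only option.

```python
# Align word-level BIO tags with subword tokens so the loss ignores
# special tokens and continuation subwords.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
words  = ["Angela", "Merkel", "visited", "Washington"]
labels = ["B-PER",  "I-PER",  "O",       "B-LOC"]   # word-level tags

enc = tokenizer(words, is_split_into_words=True)
aligned, prev = [], None
for word_id in enc.word_ids():
    if word_id is None:          # special tokens [CLS] / [SEP]
        aligned.append(-100)
    elif word_id != prev:        # first subword of a word: keep its label
        aligned.append(labels[word_id])
    else:                        # continuation subword: ignored by the loss
        aligned.append(-100)
    prev = word_id

print(list(zip(enc.tokens(), aligned)))
```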

Single Input vs. Sentence Pairs

  • Single sentence: Classification, NER. Format: [CLS] tokens [SEP]
  • Sentence pairs: NLI, STS, duplicate detection, reranking. Format: [CLS] sent_A [SEP] sent_B [SEP]. Segment embeddings distinguish which sentence each token belongs to.
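
The pair format falls out of the tokenizer automatically when you pass two texts, as in this sketch:

```python
# Passing two texts builds [CLS] sent_A [SEP] sent_B [SEP] and emits
# segment ids (token_type_ids) distinguishing the two sentences.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("A man inspects a uniform.", "The man is sleeping.")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'a', 'man', ..., '[SEP]', 'the', 'man', ..., '[SEP]']
print(enc["token_type_ids"])  # 0s for the sentence-A span, 1s for sentence B
```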

Domain Specificity

If your text is outside the general web domain, consider domain-adapted models:

  • SciBERT — Scientific papers (Semantic Scholar corpus)
  • BioBERT / PubMedBERT — Biomedical literature
  • FinBERT — Financial text (SEC filings, earnings calls)
  • LegalBERT — Legal documents and contracts
  • CodeBERT — Source code and documentation

Domain pretraining typically adds 2–5% accuracy on in-domain tasks compared to general BERT.

Sequence Length

  • BERT/RoBERTa: 512 tokens max. Fine for most sentences and short documents.
  • Longformer / BigBird: 4,096 tokens with sparse attention. For long documents.
  • ModernBERT: 8,192 tokens with Flash Attention. The modern default.
  • If your inputs regularly exceed 512 tokens, use a long-context model or truncation strategy (head-only, tail-only, or head+tail).
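
A minimal head+tail truncation helper, shown as an illustration rather than a library built-in (handling of special tokens is left out for brevity):

```python
# Keep the first `head` tokens and fill the remaining budget from the end of
# the document -- useful when key information sits in the intro and conclusion.
def head_tail_truncate(token_ids, max_len=512, head=128):
    if len(token_ids) <= max_len:
        return token_ids
    tail = max_len - head
    return token_ids[:head] + token_ids[-tail:]
```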

Architecture

Key distinction from decoders: No causal mask. Every token attends to every other token. The attention matrix is fully dense, not triangular. This is what gives encoders their understanding power — and why they can't generate text autoregressively.
[Figure: Encoder-only architecture (BERT). An input such as "[CLS] The cat sat on the [MASK] [SEP]" passes through an embedding layer (token + segment + position embeddings, summed), then through a stack of 12 layers, each combining bidirectional multi-head self-attention (every token attends to every other token; no causal mask) with a GELU feed-forward network, LayerNorm, and residual connections. The pooled [CLS] output feeds a sequence-classification head (e.g., Positive / Negative / Neutral); the per-token outputs feed a token-classification head (e.g., O, B-PER, I-PER). BERT-Base: 12 layers, 768 hidden, 12 heads, 110M params. BERT-Large: 24 layers, 1024 hidden, 16 heads, 340M params.]

Pretraining Objectives

Encoder-only models learn representations through self-supervised objectives on unlabeled text. The choice of pretraining objective has a massive impact on downstream performance.

1 Masked Language Modeling (MLM)

BERT, RoBERTa, DeBERTa

Randomly mask 15% of input tokens. The model predicts the original token at each masked position using bidirectional context. This forces the encoder to build rich contextual representations.

Input:   The [MASK] sat on the [MASK]
Predict: cat, mat (maximizing P(cat | context) · P(mat | context))

Of the 15% of tokens selected for masking, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. This keeps the model from relying on the literal [MASK] token and reduces the mismatch with fine-tuning, where [MASK] never appears.
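
In the Hugging Face stack, this 80/10/10 recipe is what the standard MLM data collator applies; a small sketch of wiring it up:

```python
# The MLM collator masks 15% of tokens per batch using the 80/10/10 split and
# sets labels to -100 everywhere except the masked positions.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

batch = collator([tokenizer("The cat sat on the mat.")])
print(batch["input_ids"])  # some tokens replaced by [MASK] / random ids
print(batch["labels"])     # original ids at masked positions, -100 elsewhere
```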

2 Next Sentence Prediction (NSP)

BERT only

Given two sentences, predict whether sentence B actually follows sentence A in the original text. 50% real pairs, 50% random pairs. BERT used this to learn sentence-level relationships.

The verdict: RoBERTa showed that removing NSP and training with longer sequences and more data improves performance. NSP is now considered unnecessary — MLM alone is sufficient when trained properly.

3 Replaced Token Detection (RTD)

ELECTRA

Instead of masking, a small generator network replaces tokens with plausible alternatives. The encoder (discriminator) must detect which tokens were replaced. Every token gets a training signal, not just the 15% that are masked — making ELECTRA 4× more sample-efficient than BERT.

  • Generator: Small masked LM that produces replacement tokens
  • Discriminator: The encoder being trained — binary classification at each position (original vs. replaced)
  • At fine-tuning time, discard the generator, use only the discriminator
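
A small inference sketch with the published small discriminator checkpoint, to make the per-token original-vs-replaced signal concrete (whether it flags this particular swap is not guaranteed):

```python
# ELECTRA discriminator: one replaced/original score per input token.
import torch
from transformers import AutoTokenizer, ElectraForPreTraining

name = "google/electra-small-discriminator"
tokenizer = AutoTokenizer.from_pretrained(name)
model = ElectraForPreTraining.from_pretrained(name)

# "ate" was swapped in for the original token; the discriminator should flag it
inputs = tokenizer("The cat ate on the mat.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits       # shape: (1, seq_len)
print(torch.sigmoid(logits).round())      # values near 1 mark predicted replacements
```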

Key Models

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Devlin et al., Google 2018 · MLM + NSP · 110M/340M params · 512 token context · The paper that started it all
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Liu et al., Meta 2019 · No NSP, dynamic masking, larger batches, 160GB text · Same architecture, better training recipe
DeBERTa: Decoding-enhanced BERT with Disentangled Attention
He et al., Microsoft 2020 · Disentangled attention (separate content & position vectors) · Enhanced mask decoder · Surpassed the human baseline on SuperGLUE
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
Clark et al., Google/Stanford 2020 · Replaced token detection · 4× more sample-efficient · ELECTRA-Small outperforms the original GPT on GLUE at a fraction of the compute
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Reimers & Gurevych, 2019 · Siamese/triplet architecture · Cosine similarity for sentence comparison · 5000× faster than cross-encoder
ModernBERT: Smarter, Better, Faster, Longer
Warner et al., 2024 · RoPE + Flash Attention + GeGLU + 8192 context · Alternating local/global attention · Modern BERT for the 2020s

Model Comparison

| Model      | Pretraining | Key Innovation             | Context | Params      | Best For                 |
|------------|-------------|----------------------------|---------|-------------|--------------------------|
| BERT       | MLM + NSP   | Bidirectional pretraining  | 512     | 110M / 340M | Baseline, fine-tuning    |
| RoBERTa    | MLM only    | Better training recipe     | 512     | 125M / 355M | General NLU tasks        |
| DeBERTa    | MLM         | Disentangled attention     | 512     | 140M / 400M | Highest accuracy         |
| ELECTRA    | RTD         | All-token training signal  | 512     | 14M / 335M  | Low-compute training     |
| ModernBERT | MLM         | RoPE, Flash Attn, GeGLU    | 8192    | 150M / 395M | Long docs, modern stack  |

When NOT to Use Encoder-Only

Encoder-only models cannot generate text. They produce representations, not sequences. If your task requires producing new text, you need a decoder.

Text Generation

Chatbots, content creation, code generation, summarization (abstractive), translation. These require autoregressive decoding — use decoder-only (GPT, LLaMA) or encoder-decoder (T5, BART).

Long-Form Reasoning & Chain-of-Thought

Multi-step mathematical reasoning, complex instruction following, planning. Decoder-only models excel here because they can "think out loud" token by token. Encoders produce a fixed set of representations for the input; they have no mechanism to generate intermediate reasoning steps.

Sequence-to-Sequence Tasks

Translation, abstractive summarization, data-to-text. These need both an encoder (understand input) and a decoder (generate output). Use T5, BART, or mBART.

Decision Framework

The practical test: If your output is a label (classification), a span (extraction), or a vector (embedding) — use encoder-only. If your output is text — use a decoder.

Choose Encoder-Only If...

  • Your task is classification, NER, extraction, similarity, or retrieval
  • You have 1K–100K labeled examples (sweet spot for fine-tuning)
  • Latency matters — encoder inference is fast (single forward pass, no autoregressive loop)
  • You need deterministic, reproducible outputs (no sampling variability)
  • Cost matters — a fine-tuned 110M BERT beats a prompted 70B LLM on many classification tasks at 1/600th the compute
  • Your text is short to medium (<512 tokens, or <8K with ModernBERT)
  • You need embeddings for downstream ML pipelines or vector databases

Use Cases in Practice

Sentiment Analysis at Scale

Classification

E-commerce reviews, social media monitoring, customer feedback. Fine-tune BERT or RoBERTa on 5K–10K labeled reviews. Single forward pass per review — process millions per hour on a single GPU. Accuracy typically 92–95% on binary sentiment, competitive with GPT-4 at a fraction of the cost.

Named Entity Recognition in Medical Records

Token Classification

Extract drug names, dosages, conditions, procedures from clinical notes. Fine-tune BioBERT or PubMedBERT with BIO tagging. Token-level classification with bidirectional context is critical — "left" means different things in "patient left the hospital" vs. "pain in left knee."

Semantic Search with Sentence-BERT

Retrieval

Encode your document corpus into vectors with Sentence-BERT (or the newer E5/GTE models). Store in a vector database (Pinecone, Weaviate, pgvector). At query time, encode the query and retrieve top-K by cosine similarity. Sub-10ms latency over millions of documents. Powers RAG pipelines.
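
A compact sketch of that pipeline, with FAISS standing in for the managed vector database and an assumed embedding checkpoint:

```python
# Encode the corpus once, index the vectors, and search by inner product
# (equivalent to cosine similarity when embeddings are L2-normalized).
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus = ["Return policy for damaged items",
          "How to reset your password",
          "Shipping times for international orders"]

doc_vecs = model.encode(corpus, normalize_embeddings=True)  # (N, 384) float32
index = faiss.IndexFlatIP(doc_vecs.shape[1])
index.add(doc_vecs)

query_vec = model.encode(["I forgot my login credentials"],
                         normalize_embeddings=True)
scores, ids = index.search(query_vec, 2)     # top-2 nearest documents
print([corpus[i] for i in ids[0]])
```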

Document Classification for Compliance

Classification

Classify legal documents, SEC filings, insurance claims into categories. Fine-tune FinBERT or LegalBERT on domain-specific labels. For documents exceeding 512 tokens, use ModernBERT (8K context) or Longformer. Deterministic outputs are critical for audit trails — no sampling variance.

The Trajectory

Encoder-only isn't dead — it's specialized. The LLM wave shifted attention to decoders, but encoders remain the right tool for classification, retrieval, and embedding tasks. ModernBERT (2024) showed there's still room for improvement with modern techniques (RoPE, Flash Attention, longer context). Meanwhile, the embedding model space (E5, GTE, Nomic) continues to advance encoder architectures for retrieval.

The practical reality: most production NLP systems use encoder-only models for understanding tasks and decoder-only models for generation. A fine-tuned DeBERTa for classification + a vector database with E5 embeddings for retrieval + an LLM for generation covers most enterprise use cases.