Encoder-Only Transformers
Encoder-only transformers see the entire input at once — every token attends to every other token bidirectionally. This makes them the natural choice when you need to understand text rather than generate it. Classification, named entity recognition, semantic similarity, retrieval, and embeddings are their home turf.
When to Use Encoder-Only
Text Classification
Sentiment analysis, spam detection, topic labeling, intent classification. The [CLS] token pools the full sequence into a single vector; a linear head maps it to class probabilities. Fine-tuning BERT on 1K–10K labeled examples often beats prompting a 70B LLM.
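The [CLS]-plus-linear-head recipe above is simple enough to sketch directly. This is a minimal illustration with toy dimensions (hidden size 4, three classes) and made-up weights, not a real fine-tuned model: the point is that the entire task-specific machinery is one matrix, one bias, and a softmax on top of the pooled vector.

```python
import numpy as np

def classify_from_cls(cls_vec, W, b):
    """Map a pooled [CLS] vector to class probabilities via a linear head.

    cls_vec: (H,) final hidden state of the [CLS] token
    W: (H, C) weights, b: (C,) bias -- the only task-specific parameters
    """
    logits = cls_vec @ W + b
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

# Toy example: hidden size 4, three classes, random "encoder output"
rng = np.random.default_rng(0)
probs = classify_from_cls(rng.normal(size=4), rng.normal(size=(4, 3)), np.zeros(3))
```

During fine-tuning, the gradient flows through this head back into the encoder, so the [CLS] representation itself adapts to the task.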
Named Entity Recognition & Token-Level Tasks
NER, POS tagging, slot filling. Each output token gets its own classification head. Bidirectional context is critical — recognizing "Apple" as ORG vs. FOOD depends on what comes after it, not just before.
Semantic Similarity & Sentence Pairs
Natural language inference (NLI), semantic textual similarity (STS), duplicate detection, paraphrase identification. Encode the two sentences jointly with a [SEP] separator, or use Sentence-BERT's siamese architecture for fast comparison.
Retrieval & Ranking
Encode queries and documents into dense vectors, then retrieve via approximate nearest neighbors. Use a bi-encoder (query and document encoded separately) for first-stage retrieval at scale, and a cross-encoder (query and document encoded together) to rerank the shortlist.
Feature Extraction & Embeddings
Use the encoder as a feature extractor for downstream ML pipelines. Pool the final hidden states into fixed-size vectors. Feed into XGBoost, logistic regression, or clustering. Often the fastest path to a production system.
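Pooling the final hidden states is usually a mask-aware mean over real tokens; averaging padding positions in would skew the embedding. A minimal sketch with toy per-token vectors (the `mean_pool` helper name is ours, not a library API):

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average final hidden states over real (non-padding) tokens.

    hidden_states: (T, H) per-token vectors; attention_mask: (T,) with 1 for
    real tokens, 0 for padding. Returns a fixed-size (H,) embedding.
    """
    mask = attention_mask[:, None].astype(float)
    return (hidden_states * mask).sum(axis=0) / mask.sum()

states = np.array([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]])  # last row is padding
emb = mean_pool(states, np.array([1, 1, 0]))
# emb == [2.0, 3.0]: the padding row is excluded from the average
```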
What to Look for in Your Data
Labeled vs. Unlabeled Data
- Unlabeled text: Pretrain (or use an existing pretrained model). BERT was pretrained on 3.3B words from BooksCorpus + English Wikipedia.
- Labeled data for fine-tuning: Encoder models are remarkably sample-efficient. BERT fine-tunes effectively on as few as 1K–10K labeled examples. Compare this to decoder-only models that often need 50K+ for instruction tuning.
- Rule of thumb: If you have <50K labeled examples and the task is classification or extraction, encoder-only will almost certainly outperform a prompted LLM at lower cost.
Task Type: Sequence-Level vs. Token-Level
- Sequence-level (classification, sentiment, NLI): Use the [CLS] token representation. One label per input.
- Token-level (NER, POS tagging, slot filling): Use per-token representations. One label per token. Requires token-aligned labels — be careful with subword tokenization splitting entities.
- Span extraction (question answering, extractive summarization): Predict start and end positions in the input. SQuAD-style — the answer must be a contiguous span.
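The subword-alignment caveat for token-level tasks deserves a concrete sketch. A common convention (used, for example, in Hugging Face's token-classification examples) is to label only the first subword of each word and mark continuations and special tokens with -100 so the loss ignores them. The helper below assumes a `word_ids` mapping like the one tokenizers expose: for each subword, the index of the source word, or None for [CLS]/[SEP].

```python
def align_labels(word_labels, word_ids):
    """Spread word-level labels onto subword tokens.

    word_ids: for each subword, the index of the word it came from (None for
    special tokens like [CLS]/[SEP]). Only the first subword of each word
    keeps the label; continuations get -100 so the loss ignores them.
    """
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None or wid == prev:
            aligned.append(-100)
        else:
            aligned.append(word_labels[wid])
        prev = wid
    return aligned

# "Apple unveiled" -> [CLS] App ##le unveiled [SEP]; word labels: B-ORG=1, O=0
labels = align_labels([1, 0], [None, 0, 0, 1, None])
# labels == [-100, 1, -100, 0, -100]
```

Without this step, an entity like "Apple" split into "App" + "##le" would silently receive misaligned labels.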
Single Input vs. Sentence Pairs
- Single sentence: Classification, NER. Format: [CLS] tokens [SEP]
- Sentence pairs: NLI, STS, duplicate detection, reranking. Format: [CLS] sent_A [SEP] sent_B [SEP]. Segment embeddings distinguish which sentence each token belongs to.
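The pair format above can be assembled mechanically. This sketch (with a hypothetical `build_pair_input` helper; real tokenizers do this for you) shows how segment ids line up with the two sentences:

```python
def build_pair_input(tokens_a, tokens_b):
    """Assemble [CLS] sent_A [SEP] sent_B [SEP] with segment (token type) ids.

    Segment id 0 covers [CLS] sent_A [SEP]; segment id 1 covers sent_B [SEP],
    matching BERT's convention for sentence-pair tasks.
    """
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segs = build_pair_input(["it", "rains"], ["ground", "is", "wet"])
# segs == [0, 0, 0, 0, 1, 1, 1, 1]
```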
Domain Specificity
If your text is outside the general web domain, consider domain-adapted models:
- SciBERT — Scientific papers (Semantic Scholar corpus)
- BioBERT / PubMedBERT — Biomedical literature
- FinBERT — Financial text (SEC filings, earnings calls)
- LegalBERT — Legal documents and contracts
- CodeBERT — Source code and documentation
Domain pretraining typically adds 2–5% accuracy on in-domain tasks compared to general BERT.
Sequence Length
- BERT/RoBERTa: 512 tokens max. Fine for most sentences and short documents.
- Longformer / BigBird: 4,096 tokens with sparse attention. For long documents.
- ModernBERT: 8,192 tokens with Flash Attention. The modern default.
- If your inputs regularly exceed 512 tokens, use a long-context model or truncation strategy (head-only, tail-only, or head+tail).
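The head+tail strategy mentioned above is a few lines of list slicing. The split point below (128 head tokens) is an illustrative choice, not a fixed standard:

```python
def truncate_head_tail(token_ids, max_len, head=128):
    """Keep the first `head` tokens and the last `max_len - head` tokens.

    A common heuristic for classifying documents longer than the model's
    window: the opening and closing of a document usually carry the most
    signal. Returns the ids unchanged if they already fit.
    """
    if len(token_ids) <= max_len:
        return token_ids
    return token_ids[:head] + token_ids[-(max_len - head):]

ids = list(range(1000))
out = truncate_head_tail(ids, max_len=512, head=128)
# len(out) == 512: ids 0-127 plus ids 616-999
```

Remember to reserve room for [CLS] and [SEP] when computing `max_len` against a real model's limit.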
Architecture
Pretraining Objectives
Encoder-only models learn representations through self-supervised objectives on unlabeled text. The choice of pretraining objective has a massive impact on downstream performance.
1. Masked Language Modeling (MLM)
Randomly mask 15% of input tokens. The model predicts the original token at each masked position using bidirectional context. This forces the encoder to build rich contextual representations.
Of the 15% of positions selected for prediction, 80% are replaced with the [MASK] token, 10% with a random token, and 10% are left unchanged. This prevents the model from relying on [MASK] as the sole signal of a prediction target, since [MASK] never appears at fine-tuning time.
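The masking scheme is easy to implement concretely. A minimal sketch (the `mlm_mask` helper is ours; real pipelines work on token ids, not strings, and mask whole words or spans in some variants):

```python
import random

def mlm_mask(tokens, vocab, mask_prob=0.15, rng=None):
    """Apply BERT-style MLM corruption.

    Select ~15% of positions as prediction targets; of those, 80% become
    [MASK], 10% a random vocab token, and 10% keep the original token.
    Returns the corrupted sequence and the list of target positions.
    """
    rng = rng or random.Random(0)
    out, targets = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < mask_prob:
            targets.append(i)
            r = rng.random()
            if r < 0.8:
                out[i] = "[MASK]"
            elif r < 0.9:
                out[i] = rng.choice(vocab)
            # else: keep the original token (it is still a prediction target)
    return out, targets

tokens = ["the", "cat", "sat", "on", "the", "mat"] * 10
corrupted, targets = mlm_mask(tokens, vocab=["dog", "ran", "hat"])
```

The loss is computed only at the positions in `targets`; every other position contributes nothing, which is the inefficiency Replaced Token Detection (below) addresses.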
2. Next Sentence Prediction (NSP)
Given two sentences, predict whether sentence B actually follows sentence A in the original text. 50% real pairs, 50% random pairs. BERT used this to learn sentence-level relationships.
The verdict: RoBERTa showed that removing NSP and training with longer sequences and more data improves performance. NSP is now considered unnecessary — MLM alone is sufficient when trained properly.
3. Replaced Token Detection (RTD)
Instead of masking, a small generator network replaces some tokens with plausible alternatives. The encoder (discriminator) must detect which tokens were replaced. Every token position gets a training signal, not just the ~15% that MLM masks, which makes ELECTRA roughly 4× more compute-efficient than BERT.
- Generator: Small masked LM that produces replacement tokens
- Discriminator: The encoder being trained — binary classification at each position (original vs. replaced)
- At fine-tuning time, discard the generator, use only the discriminator
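The discriminator's training targets follow directly from comparing the original and corrupted sequences. A minimal sketch (the `rtd_labels` helper is ours):

```python
def rtd_labels(original, corrupted):
    """Per-position targets for ELECTRA's discriminator.

    1 if the generator replaced the token, 0 if it matches the original.
    Note that when the generator happens to sample the correct token, the
    position is labeled 0 (original). Every position contributes to the
    loss, unlike MLM's ~15%.
    """
    return [int(o != c) for o, c in zip(original, corrupted)]

labels = rtd_labels(["the", "chef", "cooked", "the", "meal"],
                    ["the", "chef", "ate", "the", "meal"])
# labels == [0, 0, 1, 0, 0]
```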
Key Models
Model Comparison
| Model | Pretraining | Key Innovation | Context | Params | Best For |
|---|---|---|---|---|---|
| BERT | MLM + NSP | Bidirectional pretraining | 512 | 110M/340M | Baseline, fine-tuning |
| RoBERTa | MLM only | Better training recipe | 512 | 125M/355M | General NLU tasks |
| DeBERTa | MLM | Disentangled attention | 512 | 140M/400M | Highest accuracy |
| ELECTRA | RTD | All-token training signal | 512 | 14M/335M | Low-compute training |
| ModernBERT | MLM | RoPE, Flash Attn, GeGLU | 8192 | 150M/395M | Long docs, modern stack |
When NOT to Use Encoder-Only
Text Generation
Chatbots, content creation, code generation, summarization (abstractive), translation. These require autoregressive decoding — use decoder-only (GPT, LLaMA) or encoder-decoder (T5, BART).
Long-Form Reasoning & Chain-of-Thought
Multi-step mathematical reasoning, complex instruction following, planning. Decoder-only models excel here because they can "think out loud" token by token. An encoder produces its representations in a single fixed pass and has no mechanism to emit intermediate reasoning tokens.
Sequence-to-Sequence Tasks
Translation, abstractive summarization, data-to-text. These need both an encoder (understand input) and a decoder (generate output). Use T5, BART, or mBART.
Decision Framework
Choose Encoder-Only If...
- Your task is classification, NER, extraction, similarity, or retrieval
- You have 1K–100K labeled examples (sweet spot for fine-tuning)
- Latency matters — encoder inference is fast (single forward pass, no autoregressive loop)
- You need deterministic, reproducible outputs (no sampling variability)
- Cost matters — a fine-tuned 110M BERT beats a prompted 70B LLM on many classification tasks at 1/600th the compute
- Your text is short to medium (<512 tokens, or <8K with ModernBERT)
- You need embeddings for downstream ML pipelines or vector databases
Use Cases in Practice
Sentiment Analysis at Scale
E-commerce reviews, social media monitoring, customer feedback. Fine-tune BERT or RoBERTa on 5K–10K labeled reviews. Single forward pass per review — process millions per hour on a single GPU. Accuracy typically 92–95% on binary sentiment, competitive with GPT-4 at a fraction of the cost.
Named Entity Recognition in Medical Records
Extract drug names, dosages, conditions, procedures from clinical notes. Fine-tune BioBERT or PubMedBERT with BIO tagging. Token-level classification with bidirectional context is critical — "left" means different things in "patient left the hospital" vs. "pain in left knee."
Semantic Search with Sentence-BERT
Encode your document corpus into vectors with Sentence-BERT (or the newer E5/GTE models). Store in a vector database (Pinecone, Weaviate, pgvector). At query time, encode the query and retrieve top-K by cosine similarity. Sub-10ms latency over millions of documents. Powers RAG pipelines.
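The scoring step behind that pipeline is just cosine similarity over embedding matrices. A brute-force sketch with tiny 2-d toy vectors (production systems swap the exhaustive scan for an ANN index, but the math is identical):

```python
import numpy as np

def top_k(query_vec, doc_matrix, k=3):
    """Brute-force cosine-similarity retrieval.

    Normalize the query and every document embedding, score with a single
    matrix-vector product, and return the indices of the k best matches.
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(-scores)[:k]

docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
idx = top_k(np.array([1.0, 0.1]), docs, k=2)
# idx[0] == 0: the document most aligned with the query comes first
```

Because both encodings are precomputed, query latency is one encoder forward pass plus this lookup.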
Document Classification for Compliance
Classify legal documents, SEC filings, insurance claims into categories. Fine-tune FinBERT or LegalBERT on domain-specific labels. For documents exceeding 512 tokens, use ModernBERT (8K context) or Longformer. Deterministic outputs are critical for audit trails — no sampling variance.
The Trajectory
The practical reality: most production NLP systems use encoder-only models for understanding tasks and decoder-only models for generation. A fine-tuned DeBERTa for classification + a vector database with E5 embeddings for retrieval + an LLM for generation covers most enterprise use cases.