The DETR Family: From Transformers to Real-Time Detection

DETR → RT-DETR → RF-DETR
Tags: Object Detection · Transformer · Set Prediction · No NMS · No Anchors

1 — The Problem & Why DETR Matters

Object detection traditionally requires a messy pipeline: a backbone extracts features, a neck fuses scales, region proposals or anchor boxes generate candidates, and Non-Maximum Suppression (NMS) removes duplicates. Every stage has hand-designed hyperparameters.

DETR's radical idea: treat object detection as a direct set prediction problem. Use a Transformer to predict a fixed set of detections in parallel — no anchors, no NMS, no hand-designed post-processing. The model learns to output exactly one prediction per object through a bipartite matching loss.

[Diagram: Traditional pipeline vs. DETR. Traditional (Faster R-CNN, YOLO): Backbone → FPN → Anchors → RPN → RoI Pool → Head → NMS (7+ stages). DETR (end-to-end): Backbone → Transformer → Set Prediction (3 stages, no NMS).]

2 — DETR (Facebook AI, 2020)

Carion et al. — arXiv:2005.12872

1 Backbone: CNN Feature Extraction

ResNet-50 Backbone

800×800×3 → 25×25×2048

A standard ResNet-50 (see our ResNet walkthrough) extracts features. The output from the last stage is a 25×25×2048 feature map (for an 800×800 input). A 1×1 convolution reduces channels to 256, giving 25×25×256.

2 Positional Encoding + Flatten

Spatial Positional Encoding

25×25×256 → 625×256

The 2D feature map is flattened to a sequence of 625 tokens (25×25). Fixed sinusoidal 2D positional encodings are added so the Transformer knows where each token sits spatially. This is analogous to positional embeddings in ViT, but using the original Transformer's sine/cosine formulation extended to 2D.
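A minimal sketch of the flatten-plus-encoding step; the `sine_pos_2d` helper is a simplified stand-in for DETR's exact formulation (half the channels encode the row, half the column):

```python
import torch

def sine_pos_2d(h, w, dim=256, temp=10000.0):
    """Simplified fixed 2D sine/cosine positional encoding, (h*w, dim)."""
    half = dim // 2
    freqs = temp ** (torch.arange(0, half, 2) / half)   # (half/2,) frequencies
    ys = torch.arange(h).float()[:, None] / freqs       # (h, half/2)
    xs = torch.arange(w).float()[:, None] / freqs       # (w, half/2)
    pos_y = torch.cat([ys.sin(), ys.cos()], dim=1)      # (h, half)
    pos_x = torch.cat([xs.sin(), xs.cos()], dim=1)      # (w, half)
    pos = torch.cat([
        pos_y[:, None, :].expand(h, w, half),           # row encoding
        pos_x[None, :, :].expand(h, w, half),           # column encoding
    ], dim=-1)                                          # (h, w, dim)
    return pos.flatten(0, 1)                            # (h*w, dim)

feat = torch.randn(25 * 25, 256)       # flattened 25x25 feature map
tokens = feat + sine_pos_2d(25, 25)    # position-aware encoder input
print(tokens.shape)                    # torch.Size([625, 256])
```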

3 Transformer Encoder

6 Encoder Layers

625×256 → 625×256

Standard Transformer encoder with multi-head self-attention and feed-forward networks. Each image feature token attends to all others, building global context. After 6 layers, every spatial position has information from the entire image — crucial for reasoning about object relationships and removing duplicate detections.
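Using PyTorch's stock layers (sizes from the paper: 6 layers, 8 heads, width 256, FFN width 2048), the encoder stage looks roughly like:

```python
import torch
import torch.nn as nn

# Stock Transformer encoder with the paper's sizes.
layer = nn.TransformerEncoderLayer(d_model=256, nhead=8,
                                   dim_feedforward=2048, dropout=0.1)
encoder = nn.TransformerEncoder(layer, num_layers=6)

tokens = torch.randn(625, 1, 256)   # (sequence, batch, dim): 625 image tokens
memory = encoder(tokens)            # same shape; every token saw the whole image
```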

4 Transformer Decoder + Object Queries

6 Decoder Layers with 100 Object Queries

100×256 (queries) + 625×256 (memory)

This is DETR's most novel component. 100 learned object queries — fixed-size set of 256-dimensional vectors — attend to the encoder output via cross-attention. Each query "specializes" in detecting one object. Through 6 decoder layers, queries compete for objects and learn to avoid predicting the same one.

[Diagram: 100 learned object queries ("find object #1", "find object #2", ..., 100 total) cross-attend to the 625-token encoder memory and emit parallel predictions, e.g. Q1: person (0.95) [x,y,w,h]; Q2: car (0.89) [x,y,w,h]; Q3: ∅ (no object). Most queries predict ∅; one query = one object, so no NMS is needed.]
Hungarian matching loss: During training, DETR uses the Hungarian algorithm to find the optimal one-to-one assignment between the 100 predictions and ground-truth objects. This bipartite matching ensures each ground-truth gets exactly one prediction — eliminating the need for NMS.
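A simplified sketch of the query-decoder interaction (real DETR feeds zeros as the target sequence and re-adds the query embeddings at every layer; here the embeddings are passed directly for brevity):

```python
import torch
import torch.nn as nn

# 100 learned query embeddings cross-attend to the encoder memory
# through 6 stock decoder layers.
num_queries, d = 100, 256
queries = nn.Embedding(num_queries, d)   # learned "object slots"
layer = nn.TransformerDecoderLayer(d_model=d, nhead=8, dim_feedforward=2048)
decoder = nn.TransformerDecoder(layer, num_layers=6)

memory = torch.randn(625, 1, d)          # encoder output (seq, batch, dim)
tgt = queries.weight.unsqueeze(1)        # (100, 1, 256)
out = decoder(tgt, memory)               # (100, 1, 256): one token per query
```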

5 Prediction Heads (FFN)

Class + Box FFN per Query

100 × 256 → 100 × (class + 4 coords)

Each decoder output token independently passes through two feed-forward networks: one predicts the class label (91 COCO categories + "no object") and one predicts normalized box coordinates (center_x, center_y, width, height). The "no object" class (∅) is predicted for most queries — it means that query slot found nothing.

[Diagram: Per-query prediction heads. Each 256-dim query token feeds a class FFN (256 → 92, e.g. "person" at 0.95) and a box FFN (256 → 4, e.g. [0.4, 0.3, 0.5, 0.7]). Most queries predict ∅ (no object); typically only 5-20 of the 100 queries find objects.]
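The two heads can be sketched as follows (the 3-layer box MLP with a sigmoid output matches the paper's description; exact layer names are illustrative):

```python
import torch
import torch.nn as nn

# Class head: 92 logits (91 COCO categories + "no object").
class_head = nn.Linear(256, 92)
# Box head: 3-layer MLP, sigmoid so (cx, cy, w, h) land in [0, 1].
box_head = nn.Sequential(
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 4), nn.Sigmoid(),
)

decoder_out = torch.randn(100, 256)   # one token per query
logits = class_head(decoder_out)      # (100, 92)
boxes = box_head(decoder_out)         # (100, 4), normalized coordinates
```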

Hungarian Matching: How DETR Trains Without NMS

Bipartite Matching Loss

N predictions × M ground truths → 1-to-1 assignment

During training, DETR uses the Hungarian algorithm to find the optimal one-to-one matching between its 100 predictions and the ground-truth objects. The matching cost combines classification probability, L1 box distance, and GIoU. Each ground truth is assigned exactly one prediction; unmatched predictions are trained to predict ∅.

[Diagram: Hungarian matching, one prediction per object. The 100 predictions (P1: dog [0.4,0.3,0.5,0.6], P2: cat [0.2,0.5,0.3,0.4], P3-P100: ∅) and the ground truths (GT1: dog, GT2: cat) enter the Hungarian algorithm, which finds the min-cost 1-to-1 matching with L_match = class + L1 + GIoU. After matching, P1 ↔ GT1 and P2 ↔ GT2 are trained on their boxes, while P3-P100 are trained to predict ∅: no duplicates possible.]
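The matching step can be sketched with SciPy's Hungarian solver (`linear_sum_assignment`); for brevity this toy cost uses only the class probability and L1 box distance, omitting the GIoU term:

```python
import torch
from scipy.optimize import linear_sum_assignment

def match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """Min-cost 1-to-1 assignment of predictions to ground truths."""
    prob = pred_logits.softmax(-1)                       # (N, num_classes)
    cost_class = -prob[:, gt_labels]                     # (N, M): high prob = low cost
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)    # (N, M): L1 distance
    cost = (cost_class + cost_box).detach().numpy()
    pred_idx, gt_idx = linear_sum_assignment(cost)       # Hungarian algorithm
    return pred_idx, gt_idx   # matched pairs; the rest train toward "no object"

pred_logits = torch.randn(100, 92)
pred_boxes = torch.rand(100, 4)
gt_labels = torch.tensor([17, 3])   # two ground-truth objects (toy labels)
gt_boxes = torch.rand(2, 4)
pi, gi = match(pred_logits, pred_boxes, gt_labels, gt_boxes)
```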

DETR Limitations

Slow convergence: DETR requires 500 epochs to train (vs. 36 for Faster R-CNN). Weak on small objects: Using only the final backbone feature map loses fine-grained detail. These limitations motivated the next generation.

Decoder Cross-Attention: Where Queries Look

Attention Visualization Across Decoder Layers

A fascinating property of DETR's decoder: in early layers, queries attend broadly across the image (exploring). In later layers, attention sharpens to focus precisely on the object boundaries. This progressive refinement is why the decoder needs multiple layers.

[Diagram: Decoder attention sharpening. Layer 1: broad, diffuse attention. Layer 3: narrowing to objects. Layer 6: attending to object extremities (head, feet, tail) to precisely locate the box.]

3 — RT-DETR (Baidu, 2023)

Zhao et al. — arXiv:2304.08069 — First real-time DETR

RT-DETR asked: can we make DETR real-time? The answer required three changes: a better backbone, an efficient hybrid encoder, and IoU-aware query selection.

HGNetv2 Backbone (or ResNet)

Multi-scale features: S3, S4, S5

RT-DETR uses a fast CNN backbone (HGNetv2) that outputs multi-scale feature maps — unlike DETR's single-scale approach. This gives access to fine-grained features for small objects (S3 at stride 8) alongside high-level features for large objects (S5 at stride 32).

Hybrid Encoder: Intra-Scale + Cross-Scale

Efficient multi-scale fusion

Instead of DETR's expensive all-to-all attention, RT-DETR uses a two-stage encoder: Attention-based Intra-scale Feature Interaction (AIFI) applies self-attention within a scale rather than across the concatenated multi-scale sequence (much cheaper), then the CNN-based Cross-scale Feature Fusion Module (CCFM) merges information across scales using efficient convolutions rather than attention.

[Diagram: RT-DETR hybrid encoder. S3 (80×80), S4 (40×40), and S5 (20×20) pass through AIFI (self-attention within a scale only), then CCFM fuses across scales with CNNs rather than attention. IoU-aware query selection picks the top-K encoder features as decoder queries (not randomly initialized learned embeddings), which feed the 6-layer decoder.]
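The cost argument behind intra-scale attention can be illustrated with a toy sketch (shapes assume a 640×640 input; the module layout is illustrative, not RT-DETR's actual code). Attention within one scale costs O(n_i²) per scale, versus O((Σ n_i)²) for all-to-all attention over the concatenated scales:

```python
import torch
import torch.nn as nn

# One shared self-attention layer applied per scale (batch_first tensors).
attn = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)

scales = {                                # (batch, tokens, dim) per scale
    "S3": torch.randn(1, 80 * 80, 256),   # stride 8
    "S4": torch.randn(1, 40 * 40, 256),   # stride 16
    "S5": torch.randn(1, 20 * 20, 256),   # stride 32
}
out = {name: attn(x) for name, x in scales.items()}  # intra-scale only

# Quadratic cost comparison: per-scale vs. all-to-all.
per_scale = sum(x.shape[1] ** 2 for x in scales.values())
all_to_all = sum(x.shape[1] for x in scales.values()) ** 2
print(all_to_all / per_scale)   # all-to-all is several times more expensive
```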
Model       | Backbone  | COCO AP | FPS (T4)
RT-DETR-R50 | ResNet-50 | 53.1%   | 108
RT-DETR-L   | HGNetv2   | 53.0%   | 114
RT-DETR-X   | HGNetv2   | 54.8%   | 74
Key achievement: RT-DETR proved that DETR-family models can be real-time. It matched YOLO accuracy while keeping DETR's elegant NMS-free design. The flexible decoder also allows speed-accuracy trade-off by adjusting the number of decoder layers at inference time.

4 — RF-DETR (Roboflow, 2025)

Robinson et al. — arXiv:2511.09554 — ICLR 2026 — First real-time detector to exceed 60 AP on COCO

RF-DETR's key insight: CNN backbones don't benefit from large-scale pre-training the way Transformers do. By swapping in a pre-trained DINOv2 ViT backbone (which learned rich visual representations from massive unsupervised data), and using Neural Architecture Search to find optimal sub-networks, RF-DETR pushes accuracy far beyond CNN-based detectors.

DINOv2 ViT Backbone (Pre-trained)

Single-scale ViT → projected multi-scale features

Unlike RT-DETR's CNN backbone that naturally produces multi-scale features, RF-DETR uses a single-scale DINOv2 Vision Transformer (see our ViT walkthrough). A projector module bilinearly interpolates the ViT output to create multi-scale feature maps. The backbone interleaves windowed attention (efficient, local) with global attention (expensive, full-image) blocks to balance accuracy and speed.
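A toy sketch of the projector idea (all shapes hypothetical; the actual projector also applies learned convolutions, not just resampling):

```python
import torch
import torch.nn.functional as F

# A single-scale ViT feature map is resampled into a multi-scale pyramid.
vit_feat = torch.randn(1, 256, 40, 40)   # hypothetical ViT output, stride 16

pyramid = {
    "S3": F.interpolate(vit_feat, scale_factor=2, mode="bilinear",
                        align_corners=False),   # upsample  -> (1, 256, 80, 80)
    "S4": vit_feat,                             # identity  -> (1, 256, 40, 40)
    "S5": F.interpolate(vit_feat, scale_factor=0.5, mode="bilinear",
                        align_corners=False),   # downsample -> (1, 256, 20, 20)
}
```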

Why DINOv2? DINOv2 was pre-trained on 142M images with self-supervised learning. It learned to recognize objects, textures, and spatial relationships without any labels. RF-DETR inherits all this knowledge, so it converges faster and generalizes better than CNN backbones that must learn everything from detection data alone.

LW-DETR Decoder with Deformable Cross-Attention

Lightweight decoder from LW-DETR

The decoder follows the LW-DETR framework, using deformable cross-attention (from Deformable DETR) instead of standard cross-attention. Rather than attending to all spatial positions, each query only attends to a small set of learned sampling points — dramatically reducing computation while maintaining accuracy.
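A minimal single-head sketch of the deformable idea: each query predicts a few sampling offsets around a reference point, reads the feature map there with `grid_sample`, and takes an attention-weighted sum. All names and sizes here are illustrative, not RF-DETR's actual module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableCrossAttn(nn.Module):
    """Toy single-head deformable cross-attention over one feature map."""
    def __init__(self, dim=256, num_points=4):
        super().__init__()
        self.offsets = nn.Linear(dim, num_points * 2)   # (dx, dy) per point
        self.weights = nn.Linear(dim, num_points)       # per-point attn weights
        self.num_points = num_points

    def forward(self, query, ref_points, feat):
        # query: (N, dim); ref_points: (N, 2) in [-1, 1]; feat: (1, dim, H, W)
        n = query.shape[0]
        offs = self.offsets(query).view(n, self.num_points, 2).tanh() * 0.1
        grid = (ref_points[:, None, :] + offs).clamp(-1, 1)   # (N, K, 2)
        sampled = F.grid_sample(feat, grid[None], align_corners=False)
        sampled = sampled[0].permute(1, 2, 0)                 # (N, K, dim)
        w = self.weights(query).softmax(-1)                   # (N, K)
        return (w[..., None] * sampled).sum(1)                # (N, dim)

attn = DeformableCrossAttn()
out = attn(torch.randn(100, 256),          # 100 queries
           torch.rand(100, 2) * 2 - 1,     # reference points in [-1, 1]
           torch.randn(1, 256, 25, 25))    # encoder feature map
```

Each query touches only `num_points` locations instead of all 625, which is where the computational savings come from.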

Neural Architecture Search (NAS)

Discover optimal accuracy-latency trade-offs

RF-DETR's most distinctive contribution: weight-sharing NAS. After training a single large "supernet," thousands of sub-network configurations are evaluated without retraining. This discovers accuracy-latency Pareto curves — the nano, small, and medium variants are NAS-discovered, while base and large are hand-designed.

[Diagram: NAS. Train the supernet once, then evaluate thousands of sub-networks (fast, balanced, accurate) without retraining to trace the accuracy-latency Pareto curve from nano through base to 2XL.]
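The search loop can be illustrated with a toy Pareto filter (the latency and AP formulas are stand-ins; a real search benchmarks and evaluates each sub-net of the trained supernet):

```python
import itertools

def pareto_front(candidates):
    """Keep configs not dominated by another that is both faster and more accurate."""
    return [c for c in candidates
            if not any(o["latency"] <= c["latency"] and o["ap"] >= c["ap"]
                       and o != c for o in candidates)]

# Hypothetical search space: decoder depth x hidden width.
candidates = [
    {"layers": l, "dim": d,
     "latency": l * d / 128,          # stand-in for a measured benchmark
     "ap": 40 + 2 * l + d / 100}      # stand-in for a measured COCO AP
    for l, d in itertools.product([2, 4, 6], [128, 256])
]
front = pareto_front(candidates)      # the accuracy-latency trade-off curve
for cfg in front:
    print(cfg)
```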
Model         | Params | COCO AP | Design
RF-DETR Nano  | 30.5M  | 48.0    | NAS-discovered
RF-DETR Base  | 29.0M  | 53.3    | Hand-designed
RF-DETR Large | 129.0M | ~56     | Hand-designed
RF-DETR 2XL   | —      | 60+     | NAS + hand

5 — Architecture Comparison

Component    | DETR (2020)                         | RT-DETR (2023)                                   | RF-DETR (2025)
Backbone     | ResNet-50 (single scale)            | HGNetv2 (multi-scale S3/S4/S5)                   | DINOv2 ViT (pre-trained, multi-scale via projector)
Encoder      | 6× standard self-attention          | AIFI (intra-scale) + CCFM (cross-scale CNN)      | ViT backbone + windowed/global attention mix
Decoder      | 6× cross-attention + self-attention | 6× standard decoder (flexible: 1-6 at inference) | LW-DETR decoder w/ deformable cross-attention
Queries      | 100 fixed learned embeddings        | IoU-aware top-K from encoder features            | Content-based from encoder features
NMS          | Not needed                          | Not needed                                       | Not needed
Training     | 500 epochs                          | 72 epochs                                        | Fast convergence via pre-trained backbone
Best COCO AP | 43.3                                | 54.8                                             | 60+
Real-time?   | No (~12 FPS)                        | Yes (~114 FPS)                                   | Yes (NAS-optimized)
The DETR family trend: Each generation keeps the core idea (set prediction, no NMS, bipartite matching) while fixing a specific limitation: DETR fixed the pipeline complexity, RT-DETR fixed the speed, RF-DETR fixed the accuracy ceiling by leveraging self-supervised pre-training.

6 — Evolution Summary

[Diagram: DETR family evolution. DETR (2020): ResNet backbone, 43.3 AP, 500 epochs, not real-time; pioneered end-to-end detection with Transformers. RT-DETR (2023): CNN backbone, 54.8 AP, real-time, multi-scale; made it fast with a hybrid encoder, multi-scale features, and IoU-aware queries. RF-DETR (2025): DINOv2 ViT, 60+ AP, NAS; made it accurate with a pre-trained ViT backbone and NAS-discovered configs.]

7 — References & Further Reading