Standard Attention
Compute Everything
Every token attends to every other token. Full n×n matrix.
Complexity: O(n²)
Scores @ 128K context: ~16B
✓ Exact
✗ Brutal at scale
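As a concrete illustration, here is a minimal NumPy sketch of exact attention (function names, shapes, and the example sizes are illustrative, not from any particular library). The n×n score matrix it materializes is exactly the O(n²) cost noted above.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(Q, K, V):
    # Exact attention: materializes the full n x n score matrix.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n, n): the O(n^2) term
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # (n, d)

rng = np.random.default_rng(0)
n, d = 1024, 64
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = full_attention(Q, K, V)            # exact, but n*n scores were computed
```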
Linear Attention
Approximate Everything
Never builds the full n×n matrix. A kernel trick factors the computation into two smaller matrix products.
Instead of n×n scores, it accumulates a small d×d state, so the cost scales linearly with n.
Complexity: O(n)
Scores @ 128K context: ~128K
✓ Fast
✗ Imprecise
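A hedged sketch of the kernel trick, assuming the commonly used elu(x)+1 feature map (different linear-attention papers choose different maps): the n×n matrix is never formed, only a d×d state and a d-dimensional normalizer.

```python
import numpy as np

def phi(x):
    # Positive feature map, elu(x) + 1. The specific kernel is an assumption;
    # the choice of feature map is what controls the approximation quality.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Kernel trick: reorder the computation so only a d x d summary state
    # and a d-dim normalizer are accumulated, never the n x n score matrix.
    Qf, Kf = phi(Q), phi(K)              # (n, d) feature-mapped queries/keys
    S = Kf.T @ V                         # (d, d) summary state
    z = Kf.sum(axis=0)                   # (d,)  normalizer state
    return (Qf @ S) / (Qf @ z)[:, None]  # (n, d), cost linear in n
```

For d = 64 the accumulated state holds 4,096 numbers regardless of context length, which is where the speed (and the imprecision) comes from.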
Sparse Attention
Compute What Matters
Only compute scores for the token pairs that matter. Precise, but deciding which pairs to keep is hard.
Complexity: O(n√n)
Scores @ 128K context: varies
✓ Precise
✗ Hard to target
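One possible sparse pattern, sketched with a sliding window in NumPy. The window size and pattern are illustrative assumptions; block-sparse, strided, or learned top-k patterns make different trade-offs, including the O(n√n) figure above.

```python
import numpy as np

def sliding_window_attention(Q, K, V, window=128):
    # Each query only scores keys inside a local window, so the total number
    # of scores is O(n * window) rather than n^2. Real sparse schemes pick
    # the "important" pairs differently; targeting them well is the hard part.
    n, d = Q.shape
    out = np.zeros_like(V)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        s = Q[i] @ K[lo:hi].T / np.sqrt(d)   # only a small slice of scores
        w = np.exp(s - s.max())
        out[i] = (w / w.sum()) @ V[lo:hi]
    return out
```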
The New Frontier
Hybrid Attention Routing
Train a router to dynamically assign each computation — exact attention where it matters, cheap approximation everywhere else.
Input: token pairs per layer
↓
Learned Router: decides per token pair or per layer
↙ ↘
Sparse path: exact softmax for high-importance pairs
Linear path: cheap approximation for low-importance pairs
↘ ↙
✓ Precise where it matters
✓ Cheap everywhere else
✓ Model learns the split
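A hedged sketch of the idea, not any specific paper's method: a hypothetical learned gate vector (w_gate) scores each query, the top fraction is routed to exact softmax attention, and the rest goes through the linear approximation. Per-layer routing, as in MiniCPM-SALA, would apply the same split at layer granularity instead of per query.

```python
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))   # elu(x)+1 feature map (assumption)

def route_and_attend(Q, K, V, w_gate, keep_ratio=0.25):
    # w_gate is a hypothetical learned router vector: the highest-scoring
    # queries get exact softmax attention, the rest get the cheap linear path.
    n, d = Q.shape
    importance = Q @ w_gate                        # (n,) router scores
    k = max(1, int(keep_ratio * n))
    exact_idx = np.argsort(-importance)[:k]        # high-importance queries
    cheap_idx = np.setdiff1d(np.arange(n), exact_idx)

    out = np.zeros_like(V)

    # Exact path: full softmax rows, computed only for the routed queries.
    s = Q[exact_idx] @ K.T / np.sqrt(d)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    out[exact_idx] = (w / w.sum(axis=-1, keepdims=True)) @ V

    # Linear path: kernelized approximation for everything else.
    Kf = phi(K)
    Qf = phi(Q[cheap_idx])
    out[cheap_idx] = (Qf @ (Kf.T @ V)) / (Qf @ Kf.sum(axis=0))[:, None]
    return out

rng = np.random.default_rng(0)
n, d = 1024, 64
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
w_gate = rng.standard_normal(d)          # stands in for learned router weights
out = route_and_attend(Q, K, V, w_gate)
```

In a trained model the router weights and the keep ratio would be learned or tuned, so the model itself decides where exactness is worth paying for.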
Recent examples
SLA2
Zhang et al., Feb 2026
Learned router for video diffusion. 97% sparsity, 18.6× attention speedup, minimal quality degradation.
MiniCPM-SALA
MiniCPM Team, Feb 2026
Deterministic per-layer routing for LLMs. 1:3 sparse-to-linear ratio, ~75% training cost reduction.