
How Attention Is Evolving

Self-attention scales quadratically — every token scores against every other token. The field is converging on hybrid routing to break that wall.

Standard Attention
Compute Everything
Every token attends to every other token. Full n×n matrix.
[Diagram: 8×8 grid of queries (t₁…t₈) against keys (t₁…t₈); every cell of the matrix is computed.]
O(n²) complexity
~16B scores @ 128K context
✓ Exact ✗ Brutal at scale
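A minimal NumPy sketch of the full computation (the names n, d, Q, K, V and the softmax helper are illustrative, not tied to any particular library). The n×n score matrix is built explicitly, which is why 128K tokens means roughly 128K² ≈ 16 billion scores per head.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(Q, K, V):
    """Standard attention: materializes the full n x n score matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n, n) -- quadratic in sequence length
    weights = softmax(scores, axis=-1)   # every query attends to every key
    return weights @ V                   # (n, d)

n, d = 8, 4                              # at n = 128K, the score matrix holds ~16B entries
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(full_attention(Q, K, V).shape)     # (8, 4)
```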
Linear Attention
Approximate Everything
Never builds the full n×n matrix. A kernel trick factors the computation into products of small matrices.
[Diagram: output = φ(Q) · (φ(K)ᵀV), with φ(Q) of shape n×d and φ(K)ᵀ of shape d×n. Instead of n×n scores, accumulate a small d×d state φ(K)ᵀV, which scales linearly with n.]
O(n) complexity
~128K scores @ 128K context
✓ Fast ✗ Imprecise
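A sketch of the kernelized form, assuming the elu(x)+1 feature map popularized by Katharopoulos et al. (2020); other feature maps work too. The n×n matrix is never formed: only the d×d state φ(K)ᵀV and a d-dimensional normalizer are kept.

```python
import numpy as np

def phi(x):
    """Positive feature map: elu(x) + 1."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Kernelized attention: keeps a d x d state instead of n x n scores."""
    Qf, Kf = phi(Q), phi(K)              # (n, d) each
    kv = Kf.T @ V                        # (d, d) accumulated state
    z = Kf.sum(axis=0)                   # (d,) normalizer
    return (Qf @ kv) / (Qf @ z)[:, None] # (n, d), linear in n

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(linear_attention(Q, K, V).shape)   # (8, 4)
```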
Sparse Attention
Compute What Matters
Only compute scores for important token pairs. Precise but hard to route.
[Diagram: 8×8 grid of queries (t₁…t₈) against keys (t₁…t₈); only a selected subset of cells is computed.]
O(n√n) complexity
Scores @ 128K context: varies with the sparsity pattern
✓ Precise ✗ Hard to target
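A sketch of one common sparsity pattern, a sliding window plus a few global tokens (in the spirit of Longformer/BigBird-style masks; the exact pattern and parameters here are illustrative). A dense boolean mask is used for readability; real sparse kernels only ever compute the selected entries.

```python
import numpy as np

def sparse_attention(Q, K, V, window=2, global_tokens=(0,)):
    """Attention restricted to a local window plus designated global tokens."""
    n, d = Q.shape
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True                 # local sliding window
    mask[:, list(global_tokens)] = True       # everyone attends to global tokens
    mask[list(global_tokens), :] = True       # global tokens attend everywhere

    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)  # unselected pairs contribute nothing
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(sparse_attention(Q, K, V).shape)        # (8, 4)
```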
None alone is enough: what if you combine them?
The New Frontier
Hybrid Attention Routing
Train a router to dynamically assign each computation — exact attention where it matters, cheap approximation everywhere else.
📥 Input: token pairs per layer
🧠 Learned Router: decides per token pair or per layer, splitting the work two ways
Sparse path: exact softmax for high-importance pairs
Linear path: cheap approximation for low-importance pairs
Blended Output: the two paths are merged
✓ Precise where it matters ✓ Cheap everywhere else ✓ Model learns the split
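A toy illustration of the routing idea, reusing the sparse_attention and linear_attention sketches above. The scalar gate is a hypothetical stand-in for a learned router output; real systems such as SLA2 and MiniCPM-SALA route per token pair or per layer with dedicated kernels, not a fixed scalar blend.

```python
import numpy as np

def routed_attention(Q, K, V, gate):
    """gate near 1 -> exact sparse path; gate near 0 -> cheap linear path."""
    exact = sparse_attention(Q, K, V)      # precise, used where it matters
    approx = linear_attention(Q, K, V)     # cheap, used everywhere else
    return gate * exact + (1.0 - gate) * approx

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = routed_attention(Q, K, V, gate=0.75)  # hypothetical learned gate value
print(out.shape)                            # (8, 4)
```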
Recent examples
SLA2
Zhang et al., Feb 2026
Learned router for video diffusion. 97% sparsity, 18.6× attention speedup, minimal quality degradation.
MiniCPM-SALA
MiniCPM Team, Feb 2026
Deterministic per-layer routing for LLMs. 1:3 sparse-to-linear ratio, ~75% training cost reduction.