
How Attention Is Evolving

Self-attention scales quadratically — every token scores against every other token. The field is converging on hybrid routing to break that wall.

Standard Attention
Compute Everything
Every token attends to every other token. Full n×n matrix.
[Diagram: 8×8 grid of queries (t₁…t₈) against keys (t₁…t₈); every cell of the matrix is computed.]
O(n²) complexity
~16B scores @ 128K context
✓ Exact ✗ Brutal at scale
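A minimal NumPy sketch of the full computation (the names n, d, Q, K, V and the softmax helper are illustrative, not tied to any particular library). The n×n score matrix is built explicitly, which is why 128K tokens means roughly 128K² ≈ 16 billion scores per head.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(Q, K, V):
    """Standard attention: materializes the full n x n score matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n, n) -- quadratic in sequence length
    weights = softmax(scores, axis=-1)   # every query attends to every key
    return weights @ V                   # (n, d)

n, d = 8, 4                              # at n = 128K, the score matrix holds ~16B entries
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(full_attention(Q, K, V).shape)     # (8, 4)
```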
Linear Attention
Approximate Everything
Never builds the full n×n matrix. A kernel trick factors the computation into products of small matrices.
[Diagram: output = φ(Q) · (φ(K)ᵀV), with φ(Q) of shape n×d and φ(K)ᵀ of shape d×n. Instead of n×n scores, accumulate a small d×d state φ(K)ᵀV, which scales linearly with n.]
O(n) complexity
~128K scores @ 128K context
✓ Fast ✗ Imprecise
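A sketch of the kernelized form, assuming the elu(x)+1 feature map popularized by Katharopoulos et al. (2020); other feature maps work too. The n×n matrix is never formed: only the d×d state φ(K)ᵀV and a d-dimensional normalizer are kept.

```python
import numpy as np

def phi(x):
    """Positive feature map: elu(x) + 1."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Kernelized attention: keeps a d x d state instead of n x n scores."""
    Qf, Kf = phi(Q), phi(K)              # (n, d) each
    kv = Kf.T @ V                        # (d, d) accumulated state
    z = Kf.sum(axis=0)                   # (d,) normalizer
    return (Qf @ kv) / (Qf @ z)[:, None] # (n, d), linear in n

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(linear_attention(Q, K, V).shape)   # (8, 4)
```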
Sparse Attention
Compute What Matters
Only compute scores for important token pairs. Precise but hard to route.
[Diagram: 8×8 grid of queries (t₁…t₈) against keys (t₁…t₈); only a selected subset of cells is computed.]
O(n√n) complexity
Scores @ 128K context: varies with the sparsity pattern
✓ Precise ✗ Hard to target
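A sketch of one common sparsity pattern, a sliding window plus a few global tokens (in the spirit of Longformer/BigBird-style masks; the exact pattern and parameters here are illustrative). A dense boolean mask is used for readability; real sparse kernels only ever compute the selected entries.

```python
import numpy as np

def sparse_attention(Q, K, V, window=2, global_tokens=(0,)):
    """Attention restricted to a local window plus designated global tokens."""
    n, d = Q.shape
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True                 # local sliding window
    mask[:, list(global_tokens)] = True       # everyone attends to global tokens
    mask[list(global_tokens), :] = True       # global tokens attend everywhere

    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)  # unselected pairs contribute nothing
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(sparse_attention(Q, K, V).shape)        # (8, 4)
```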
None alone is enough: what if you combine them?
The New Frontier
Hybrid Attention Routing
Train a router to dynamically assign each computation — exact attention where it matters, cheap approximation everywhere else.
📥 Input: token pairs per layer
🧠 Learned Router: decides per token pair or per layer, splitting the work two ways
Sparse path: exact softmax for high-importance pairs
Linear path: cheap approximation for low-importance pairs
Blended Output: the two paths are merged
✓ Precise where it matters ✓ Cheap everywhere else ✓ Model learns the split
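A toy illustration of the routing idea, reusing the sparse_attention and linear_attention sketches above. The scalar gate is a hypothetical stand-in for a learned router output; real systems such as SLA2 and MiniCPM-SALA route per token pair or per layer with dedicated kernels, not a fixed scalar blend.

```python
import numpy as np

def routed_attention(Q, K, V, gate):
    """gate near 1 -> exact sparse path; gate near 0 -> cheap linear path."""
    exact = sparse_attention(Q, K, V)      # precise, used where it matters
    approx = linear_attention(Q, K, V)     # cheap, used everywhere else
    return gate * exact + (1.0 - gate) * approx

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = routed_attention(Q, K, V, gate=0.75)  # hypothetical learned gate value
print(out.shape)                            # (8, 4)
```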
Recent examples
SLA2
Zhang et al., Feb 2026
Learned router for video diffusion. 97% sparsity, 18.6× attention speedup, minimal quality degradation.
MiniCPM-SALA
MiniCPM Team, Feb 2026
Deterministic per-layer routing for LLMs. 1:3 sparse-to-linear ratio, ~75% training cost reduction.