
SEQUOIA: Optimal Token Trees for Speculative Decoding

LLMs generate tokens one at a time, bottlenecked by memory bandwidth. Speculative decoding uses a small model to draft tokens and a big model to verify them in parallel. SEQUOIA finds the mathematically optimal draft tree shape using dynamic programming.

The Problem
The Bottleneck
Every token requires a complete forward pass through all model parameters. This is brutally sequential — you can't generate token 5 until you have tokens 1–4.
Standard autoregressive decoding (interactive demo)
The model emits "The", "model", "generates", … one token at a time. Each big-model call produces exactly 1 token, so the call counter and the token counter always match.
The bottleneck: For a 70B parameter model, generating each token means reading ~140GB of weights. The GPU spends most of its time on memory I/O, not actual compute.
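The arithmetic behind that claim, with assumed illustrative numbers (fp16 weights, ~2 TB/s of HBM bandwidth, roughly an A100-class GPU):

```python
# Upper bound on decode speed when every token must stream all weights.
# Assumed numbers: 70B params at fp16 (2 bytes each), ~2 TB/s of HBM
# bandwidth (roughly an A100-class GPU). Both are illustrative.
params = 70e9
bytes_per_param = 2                        # fp16
weight_bytes = params * bytes_per_param    # ~140 GB read per token

bandwidth = 2e12                           # bytes/sec
max_tokens_per_sec = bandwidth / weight_bytes

print(f"{weight_bytes / 1e9:.0f} GB per token")    # 140 GB
print(f"<= {max_tokens_per_sec:.1f} tokens/sec")   # <= 14.3 tokens/sec
```

Even at full memory bandwidth, the model is capped at roughly 14 tokens per second, regardless of how much compute sits idle.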
Key insight: Verifying a proposed token is much cheaper than generating one from scratch — and you can verify many tokens in a single forward pass.
1 token per forward pass
~140 GB memory reads per token (70B)
✗ Brutally sequential ✗ Memory-bound
What if you could verify multiple tokens at once?
The Idea
Speculate & Verify
Use a tiny "draft" model to guess what the big model would say. Then verify multiple guesses in one shot — accept the good ones, throw away the bad ones.
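The draft-then-verify loop can be sketched for greedy decoding. Here `draft_model` and `big_model` are toy stand-in functions (each maps a token prefix to its single most likely next token), and the single verification pass is simulated with per-position calls; on a GPU all positions come out of one forward pass.

```python
# Toy sketch of one speculate-and-verify round (greedy decoding).
# `draft_model` and `big_model` are stand-ins: prefix -> next token.

def speculative_round(prefix, draft_model, big_model, k=5):
    # 1) DRAFT: the small model proposes k tokens autoregressively.
    guesses = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_model(ctx)
        guesses.append(t)
        ctx.append(t)

    # 2) VERIFY: the big model scores every position. (Simulated here
    #    with k cheap calls; in reality this is one parallel pass.)
    accepted = []
    ctx = list(prefix)
    for g in guesses:
        truth = big_model(ctx)      # what the big model would emit here
        if g == truth:
            accepted.append(g)      # draft matched: keep it
            ctx.append(g)
        else:
            accepted.append(truth)  # mismatch: take the big model's
            break                   # token "for free", end the round
    return accepted

# Toy usage: the big model continues a fixed sentence; the draft
# agrees for two tokens, then starts guessing "cat".
target = "the quick brown fox jumps over".split()
big = lambda ctx: target[len(ctx)]
draft = lambda ctx: target[len(ctx)] if len(ctx) < 3 else "cat"
print(speculative_round(["the"], draft, big, k=5))
# -> ['quick', 'brown', 'fox']: 3 tokens from a single big-model pass
```

Even a wrong guess is not a total loss: the verification pass already computed the big model's correct token at the mismatch position, so the round still advances.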
Without spec decoding
STEPS 1–5: the 70B model emits "tokens", "one", "at", "a", "time", one call each.
5 big model calls = 5 tokens
With spec decoding
DRAFT (fast): the 7B model guesses 5 tokens.
VERIFY (1 call!): the 70B model checks all 5 at once.
ACCEPT/REJECT: keep the correct guesses, discard the wrong ones.
1 big call → up to 5 tokens
Interactive: spec decoding step-by-step
Prefix: "The quick brown fox"
DRAFT MODEL GUESSES (small, fast, sometimes wrong): "jumps", "over", "a", "lazy", "cat"
BIG MODEL VERIFICATION (single forward pass): each guess is compared with the token the big model would emit at that position; the accepted run ends at the first mismatch.
Zero quality loss: The output distribution is identical to just using the big model. If the small model guesses wrong, the big model's correct token is used instead.
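The lossless guarantee comes from the standard speculative-sampling accept/reject rule. A minimal sketch using plain dicts as distributions (token names and probabilities are illustrative, and `rng` is injectable for testing):

```python
import random

def accept_or_resample(p, q, draft_token, rng=random.random):
    """Standard speculative-sampling accept/reject step.

    p: big-model next-token distribution (dict token -> prob)
    q: draft-model distribution that `draft_token` was sampled from
    Returns (token, accepted?). The returned token is distributed
    exactly according to p -- that is the lossless guarantee.
    """
    # Accept the draft token with probability min(1, p(x)/q(x)).
    if rng() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token, True
    # Otherwise resample from the residual distribution
    # r(x) proportional to max(p(x) - q(x), 0).
    residual = {t: max(p[t] - q.get(t, 0.0), 0.0) for t in p}
    z = sum(residual.values())
    u = rng() * z
    for t, w in residual.items():
        u -= w
        if u <= 0:
            return t, False
    return t, False  # numerical fallback

# If the big model puts zero mass on the draft's token, it is always
# rejected and the residual sample replaces it:
print(accept_or_resample({"a": 0.0, "b": 1.0},
                         {"a": 1.0, "b": 0.0}, "a"))  # -> ('b', False)
```

The rejection path is what preserves the distribution: where the draft over-proposed a token, the residual re-weights toward the tokens it under-proposed.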
1 call verifies up to 5 tokens
Exact output distribution preserved
✓ Parallel verification ✓ Lossless
Sequences waste tokens on a single bad guess —
what if you branch?
Token Trees
Why Trees?
Instead of one guess-sequence, propose a branching tree of possibilities. The big model verifies the whole tree in one pass, accepting the longest valid path.
Compare tree structures (same budget = 7 draft tokens)
The sequential ceiling: If you draft 7 tokens in a chain, and position 1 is wrong, you get 0 bonus tokens. All that drafting was wasted. The ceiling is fixed by P(first token correct).
Optimal trees branch where it matters most. Wide near the root (hedge your bets), narrow deeper (more confident). The shape is computed to maximize expected accepted tokens — not hardcoded.
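One way to see why shape matters is to score candidate trees under a simple acceptance model: assume the k-th ranked guess at any position is accepted with probability `p[k]` (disjoint across siblings, independent across depths). The probabilities and the two example shapes below are made up for illustration; SEQUOIA estimates the real acceptance profile from the actual model pair.

```python
# Score a draft tree: expected number of accepted tokens, assuming the
# k-th ranked sibling guess is accepted with probability p[k].
# These probabilities are illustrative, not measured.
p = [0.6, 0.2, 0.1]

def expected_accepted(children, path_prob=1.0):
    """A tree is a list of subtrees; each subtree is the list of its
    node's children. Returns the expected accepted-token count."""
    total = 0.0
    for rank, sub in enumerate(children):
        prob = path_prob * p[rank]   # whole path to this node accepted
        total += prob + expected_accepted(sub, prob)
    return total

chain7 = [[[[[[[[]]]]]]]]            # 7 nodes in a single chain
tree7 = [[[[[[]]]]], [[]]]           # same 7-node budget, but hedged:
                                     # rank-1 branch carries a 5-node
                                     # chain, rank-2 branch a 2-node one
print(round(expected_accepted(chain7), 3))  # ~1.458
print(round(expected_accepted(tree7), 3))   # ~1.703: the tree wins
```

With the same 7-token budget, hedging one extra bet at the root beats the pure chain, because the chain's value is capped by the chance that its very first token is accepted.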
✓ Multiple bets per position ✓ Shape adapts to model
But which tree shape is best?
Dynamic Programming
The Optimal Shape
The tree shape that maximizes expected accepted tokens can't be brute-forced (exponentially many shapes). DP solves it optimally in polynomial time.

The key question: given a budget of N draft tokens, what tree shape maximizes the expected number of tokens the big model accepts?

DP's trick: Any tree is made of subtrees. If you know the optimal subtree for size 3, you can build on it when solving for size 7. You never re-solve the same subproblem.

Each cell T[n, b] = "best expected tokens from a tree of size n, where root has b branches."

DP Table (interactive): rows = branches (b), columns = tree size (n = 1–7); each cell holds T[n, b].
Adapts to your model pair: A pair where the first token is usually accepted gives a shallow, wide tree. Fast-dropping acceptance rates give a deeper, narrower tree. No hardcoded assumptions.
Free at inference time. You run this DP once offline. During serving, the tree shape is just a fixed lookup. Zero overhead.
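The offline recurrence can be sketched under the same illustrative acceptance model (rank-k sibling guess accepted with probability `P[k]`); `best(n, k)` below plays the role of the T[n, b] cells, indexing by the next free sibling rank rather than branch count. The probabilities are assumptions; real values would come from profiling the draft/target pair.

```python
from functools import lru_cache

# Illustrative acceptance profile: probability that the k-th ranked
# sibling guess at any position is accepted (disjoint across siblings,
# independent across depths).
P = [0.6, 0.2, 0.1]

@lru_cache(maxsize=None)
def best(n, k=0):
    """Best expected accepted tokens from a budget of n draft nodes
    whose top-level branches start at sibling rank k."""
    if n == 0 or k >= len(P):
        return 0.0
    value = 0.0
    for m in range(1, n + 1):
        # Give m nodes to the rank-k branch: its root is accepted with
        # prob P[k], and the m-1 nodes below it restart at rank 0.
        sub = P[k] * (1.0 + best(m - 1, 0))
        rest = best(n - m, k + 1)   # leftover budget, next sibling rank
        value = max(value, sub + rest)
    return max(value, best(n, k + 1))  # or skip rank k entirely

print(best(7))   # optimal expected accepted tokens for a 7-node budget
```

Because subproblems are memoized, the table fills in polynomial time; tracking which `m` achieved each max recovers the tree shape itself, which is then a fixed lookup at serving time.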
Polynomial time complexity
Zero inference overhead
✓ Polynomial time ✓ Zero inference overhead
Optimal trees + sampling without replacement = how big a speedup?
Results
The Payoff
SEQUOIA's optimal tree shapes + sampling without replacement compound into significant real-world speedups.
Llama2-7B on A100
4.04×
vs standard decoding
Llama3-70B offloading
9.5×
vs DeepSpeed
Latency: Llama3-70B on a single L40 GPU
DeepSpeed
5.7s
SEQUOIA
0.60s
Per-token latency. Offloading = the model is too big for GPU VRAM, so weights page in from CPU RAM.
The two innovations — impact breakdown
Optimal tree
+33%
No-replacement
+65%
Budget scaling
unbounded
4.04× on-chip speedup
9.5× offloading speedup
✓ Scales with budget ✓ No quality loss
Paper
SEQUOIA: Scalable, Robust, and Hardware-aware Speculative Decoding
Chen et al., 2024
Optimal token tree construction via DP + sampling without replacement for 4–10× faster LLM inference with zero quality loss.