
SEQUOIA: Optimal Token Trees for Speculative Decoding

LLMs generate tokens one at a time, bottlenecked by memory bandwidth. Speculative decoding uses a small model to draft tokens and a big model to verify them in parallel. SEQUOIA finds the mathematically optimal draft tree shape using dynamic programming.

The Problem
The Bottleneck
Every token requires a complete forward pass through all model parameters. This is brutally sequential — you can't generate token 5 until you have tokens 1–4.
Standard autoregressive decoding (interactive demo)
The model emits "The", "model", "generates", … one token at a time. Each big-model call produces exactly 1 token, so the call counter and the token counter always match.
The bottleneck: For a 70B parameter model, generating each token means reading ~140GB of weights. The GPU spends most of its time on memory I/O, not actual compute.
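The arithmetic behind that claim, with assumed illustrative numbers (fp16 weights, ~2 TB/s of HBM bandwidth, roughly an A100-class GPU):

```python
# Upper bound on decode speed when every token must stream all weights.
# Assumed numbers: 70B params at fp16 (2 bytes each), ~2 TB/s of HBM
# bandwidth (roughly an A100-class GPU). Both are illustrative.
params = 70e9
bytes_per_param = 2                        # fp16
weight_bytes = params * bytes_per_param    # ~140 GB read per token

bandwidth = 2e12                           # bytes/sec
max_tokens_per_sec = bandwidth / weight_bytes

print(f"{weight_bytes / 1e9:.0f} GB per token")    # 140 GB
print(f"<= {max_tokens_per_sec:.1f} tokens/sec")   # <= 14.3 tokens/sec
```

Even at full memory bandwidth, the model is capped at roughly 14 tokens per second, regardless of how much compute sits idle.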
Key insight: Verifying a proposed token is much cheaper than generating one from scratch — and you can verify many tokens in a single forward pass.
1 token per forward pass
~140 GB memory reads per token (70B)
✗ Brutally sequential ✗ Memory-bound
What if you could verify multiple tokens at once?
The Idea
Speculate & Verify
Use a tiny "draft" model to guess what the big model would say. Then verify multiple guesses in one shot — accept the good ones, throw away the bad ones.
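The draft-then-verify loop can be sketched for greedy decoding. Here `draft_model` and `big_model` are toy stand-in functions (each maps a token prefix to its single most likely next token), and the single verification pass is simulated with per-position calls; on a GPU all positions come out of one forward pass.

```python
# Toy sketch of one speculate-and-verify round (greedy decoding).
# `draft_model` and `big_model` are stand-ins: prefix -> next token.

def speculative_round(prefix, draft_model, big_model, k=5):
    # 1) DRAFT: the small model proposes k tokens autoregressively.
    guesses = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_model(ctx)
        guesses.append(t)
        ctx.append(t)

    # 2) VERIFY: the big model scores every position. (Simulated here
    #    with k cheap calls; in reality this is one parallel pass.)
    accepted = []
    ctx = list(prefix)
    for g in guesses:
        truth = big_model(ctx)      # what the big model would emit here
        if g == truth:
            accepted.append(g)      # draft matched: keep it
            ctx.append(g)
        else:
            accepted.append(truth)  # mismatch: take the big model's
            break                   # token "for free", end the round
    return accepted

# Toy usage: the big model continues a fixed sentence; the draft
# agrees for two tokens, then starts guessing "cat".
target = "the quick brown fox jumps over".split()
big = lambda ctx: target[len(ctx)]
draft = lambda ctx: target[len(ctx)] if len(ctx) < 3 else "cat"
print(speculative_round(["the"], draft, big, k=5))
# -> ['quick', 'brown', 'fox']: 3 tokens from a single big-model pass
```

Even a wrong guess is not a total loss: the verification pass already computed the big model's correct token at the mismatch position, so the round still advances.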
Without spec decoding
STEPS 1–5: the 70B model emits "tokens", "one", "at", "a", "time", one call each.
5 big model calls = 5 tokens
With spec decoding
DRAFT (fast): the 7B model guesses 5 tokens.
VERIFY (1 call!): the 70B model checks all 5 at once.
ACCEPT/REJECT: keep the correct guesses, discard the wrong ones.
1 big call → up to 5 tokens
Interactive: spec decoding step-by-step
Prefix: "The quick brown fox"
DRAFT MODEL GUESSES (small, fast, sometimes wrong): "jumps", "over", "a", "lazy", "cat"
BIG MODEL VERIFICATION (single forward pass): each guess is compared with the token the big model would emit at that position; the accepted run ends at the first mismatch.
Zero quality loss: The output distribution is identical to just using the big model. If the small model guesses wrong, the big model's correct token is used instead.
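The lossless guarantee comes from the standard speculative-sampling accept/reject rule. A minimal sketch using plain dicts as distributions (token names and probabilities are illustrative, and `rng` is injectable for testing):

```python
import random

def accept_or_resample(p, q, draft_token, rng=random.random):
    """Standard speculative-sampling accept/reject step.

    p: big-model next-token distribution (dict token -> prob)
    q: draft-model distribution that `draft_token` was sampled from
    Returns (token, accepted?). The returned token is distributed
    exactly according to p -- that is the lossless guarantee.
    """
    # Accept the draft token with probability min(1, p(x)/q(x)).
    if rng() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token, True
    # Otherwise resample from the residual distribution
    # r(x) proportional to max(p(x) - q(x), 0).
    residual = {t: max(p[t] - q.get(t, 0.0), 0.0) for t in p}
    z = sum(residual.values())
    u = rng() * z
    for t, w in residual.items():
        u -= w
        if u <= 0:
            return t, False
    return t, False  # numerical fallback

# If the big model puts zero mass on the draft's token, it is always
# rejected and the residual sample replaces it:
print(accept_or_resample({"a": 0.0, "b": 1.0},
                         {"a": 1.0, "b": 0.0}, "a"))  # -> ('b', False)
```

The rejection path is what preserves the distribution: where the draft over-proposed a token, the residual re-weights toward the tokens it under-proposed.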
1 call verifies up to 5 tokens
Exact output distribution preserved
✓ Parallel verification ✓ Lossless
Sequences waste tokens on a single bad guess —
what if you branch?
Token Trees
Why Trees?
Instead of one guess-sequence, propose a branching tree of possibilities. The big model verifies the whole tree in one pass, accepting the longest valid path.
Compare tree structures (same budget = 7 draft tokens)
The sequential ceiling: If you draft 7 tokens in a chain, and position 1 is wrong, you get 0 bonus tokens. All that drafting was wasted. The ceiling is fixed by P(first token correct).
Optimal trees branch where it matters most. Wide near the root (hedge your bets), narrow deeper (more confident). The shape is computed to maximize expected accepted tokens — not hardcoded.
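One way to see why shape matters is to score candidate trees under a simple acceptance model: assume the k-th ranked guess at any position is accepted with probability `p[k]` (disjoint across siblings, independent across depths). The probabilities and the two example shapes below are made up for illustration; SEQUOIA estimates the real acceptance profile from the actual model pair.

```python
# Score a draft tree: expected number of accepted tokens, assuming the
# k-th ranked sibling guess is accepted with probability p[k].
# These probabilities are illustrative, not measured.
p = [0.6, 0.2, 0.1]

def expected_accepted(children, path_prob=1.0):
    """A tree is a list of subtrees; each subtree is the list of its
    node's children. Returns the expected accepted-token count."""
    total = 0.0
    for rank, sub in enumerate(children):
        prob = path_prob * p[rank]   # whole path to this node accepted
        total += prob + expected_accepted(sub, prob)
    return total

chain7 = [[[[[[[[]]]]]]]]            # 7 nodes in a single chain
tree7 = [[[[[[]]]]], [[]]]           # same 7-node budget, but hedged:
                                     # rank-1 branch carries a 5-node
                                     # chain, rank-2 branch a 2-node one
print(round(expected_accepted(chain7), 3))  # ~1.458
print(round(expected_accepted(tree7), 3))   # ~1.703: the tree wins
```

With the same 7-token budget, hedging one extra bet at the root beats the pure chain, because the chain's value is capped by the chance that its very first token is accepted.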
✓ Multiple bets per position ✓ Shape adapts to model
But which tree shape is best?
Dynamic Programming
The Optimal Shape
The tree shape that maximizes expected accepted tokens can't be brute-forced (exponentially many shapes). DP solves it optimally in polynomial time.

The key question: given a budget of N draft tokens, what tree shape maximizes the expected number of tokens the big model accepts?

DP's trick: Any tree is made of subtrees. If you know the optimal subtree for size 3, you can build on it when solving for size 7. You never re-solve the same subproblem.

Each cell T[n, b] = "best expected tokens from a tree of size n, where root has b branches."

DP Table (interactive): rows = branches (b), columns = tree size (n = 1–7); each cell holds T[n, b].
Adapts to your model pair: A pair where the first token is usually accepted gives a shallow, wide tree. Fast-dropping acceptance rates give a deeper, narrower tree. No hardcoded assumptions.
Free at inference time. You run this DP once offline. During serving, the tree shape is just a fixed lookup. Zero overhead.
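The offline recurrence can be sketched under the same illustrative acceptance model (rank-k sibling guess accepted with probability `P[k]`); `best(n, k)` below plays the role of the T[n, b] cells, indexing by the next free sibling rank rather than branch count. The probabilities are assumptions; real values would come from profiling the draft/target pair.

```python
from functools import lru_cache

# Illustrative acceptance profile: probability that the k-th ranked
# sibling guess at any position is accepted (disjoint across siblings,
# independent across depths).
P = [0.6, 0.2, 0.1]

@lru_cache(maxsize=None)
def best(n, k=0):
    """Best expected accepted tokens from a budget of n draft nodes
    whose top-level branches start at sibling rank k."""
    if n == 0 or k >= len(P):
        return 0.0
    value = 0.0
    for m in range(1, n + 1):
        # Give m nodes to the rank-k branch: its root is accepted with
        # prob P[k], and the m-1 nodes below it restart at rank 0.
        sub = P[k] * (1.0 + best(m - 1, 0))
        rest = best(n - m, k + 1)   # leftover budget, next sibling rank
        value = max(value, sub + rest)
    return max(value, best(n, k + 1))  # or skip rank k entirely

print(best(7))   # optimal expected accepted tokens for a 7-node budget
```

Because subproblems are memoized, the table fills in polynomial time; tracking which `m` achieved each max recovers the tree shape itself, which is then a fixed lookup at serving time.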
Polynomial time complexity
Zero inference overhead
✓ Polynomial time ✓ Zero inference overhead
Optimal trees + sampling without replacement = how big a speedup?
Results
The Payoff
SEQUOIA's optimal tree shapes + sampling without replacement compound into significant real-world speedups.
Llama2-7B on A100
4.04×
vs standard decoding
Llama3-70B offloading
9.5×
vs DeepSpeed
Latency: Llama3-70B on a single L40 GPU
DeepSpeed
5.7s
SEQUOIA
0.60s
Per-token latency. Offloading = the model is too big for GPU VRAM, so weights page in from CPU RAM.
The two innovations — impact breakdown
Optimal tree
+33%
No-replacement
+65%
Budget scaling
unbounded
4.04× on-chip speedup
9.5× offloading speedup
✓ Scales with budget ✓ No quality loss
Paper
SEQUOIA: Scalable, Robust, and Hardware-aware Speculative Decoding
Chen et al., 2024
Optimal token tree construction via DP + sampling without replacement for 4–10× faster LLM inference with zero quality loss.