Speculative Decoding Speedup Calculator

Expected speedup from draft-model speculation given acceptance rate, draft length and model cost ratio.

Token acceptance rate (well-matched draft: 0.7–0.9)
Draft tokens per round (γ)
Draft cost / target cost (1B draft for 70B target ≈ 0.015–0.05)

Expected tokens per round

Speedup (×)

Speculation exploits decode being bandwidth-bound: the target model verifies γ drafted tokens in ONE forward pass costing barely more than generating one. Output distribution is provably unchanged — it's exact, not approximate.
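The exactness claim rests on the accept/reject rule from the speculative sampling papers: a drafted token x is kept with probability min(1, p(x)/q(x)), and on rejection a replacement is drawn from the normalized positive part of p − q. Below is a minimal sketch of that single-token step, assuming p and q are plain probability lists over the same vocabulary; the function names are illustrative and not taken from any library.

```python
import random
from typing import List, Sequence, Tuple


def residual_distribution(p: Sequence[float], q: Sequence[float]) -> List[float]:
    """Normalized max(0, p - q): the distribution to resample from after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0.0:
        # p and q are identical, so a rejection cannot occur; fall back to p.
        return list(p)
    return [d / total for d in diff]


def verify_token(
    x: int, p: Sequence[float], q: Sequence[float], rng: random.Random
) -> Tuple[int, bool]:
    """Verify one drafted token x (sampled from q) against the target distribution p.

    Returns (emitted_token, accepted). The emitted token is distributed as p.
    """
    # q[x] > 0 is assumed because x was sampled from q.
    accept_prob = min(1.0, p[x] / q[x])
    if rng.random() < accept_prob:
        return x, True
    # Rejected: draw a corrected token from the residual distribution.
    residual = residual_distribution(p, q)
    corrected = rng.choices(range(len(p)), weights=residual, k=1)[0]
    return corrected, False
```

In a full decoder this step would run left to right over the γ drafted positions and stop at the first rejection; that loop is omitted here.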

Formula

E[tokens/round] = (1 − α^(γ+1)) / (1 − α)
speedup = E[tokens/round] ÷ (γ·c + 1)
Source: Leviathan et al. (2023)
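The formula translates directly into a few lines of code. This is a minimal sketch rather than the page's actual implementation; the function names are illustrative, and the α = 1 branch is an added guard that uses the limit of the geometric sum.

```python
def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per round: (1 - alpha^(gamma+1)) / (1 - alpha)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Limit of the geometric sum: every draft is accepted, plus one bonus token.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected speedup: tokens per round divided by the round's relative cost gamma*c + 1."""
    if c < 0.0:
        raise ValueError("c must be non-negative")
    return expected_tokens_per_round(alpha, gamma) / (gamma * c + 1.0)
```

A value below 1.0 from `speedup` means speculation is slower than plain decoding under this cost model.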

References: Leviathan et al. (2023), Fast Inference from Transformers via Speculative Decoding; Chen et al. (2023), Accelerating Large Language Model Decoding with Speculative Sampling

About the Speculative Decoding Speedup Calculator

Speculative decoding is the rare free lunch of LLM serving: a small draft model guesses several tokens, the big model verifies them all in a single forward pass, and rejection sampling guarantees the output distribution is EXACTLY the target model's. This calculator implements the expected-speedup algebra from the original paper: the expected tokens per round, (1−α^(γ+1))/(1−α), divided by the relative cost of a round, γ·c + 1. That lets you tune the two knobs that matter (draft quality α, draft length γ) before benchmarking. The verdict flags the configurations that backfire, which happens whenever the draft is too costly or too poorly matched for the expected tokens per round to exceed γ·c + 1.
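The backfire condition can be made concrete: the speedup exceeds 1 only when the expected tokens per round exceed γ·c + 1, which rearranges to c < (E − 1)/γ. The sketch below computes that break-even cost ratio; the function name is hypothetical, and the result is only as good as the simplified cost model behind the formula.

```python
def break_even_cost_ratio(alpha: float, gamma: int) -> float:
    """Largest draft/target cost ratio c at which speculation still breaks even.

    Derived from E / (gamma*c + 1) >= 1, i.e. c <= (E - 1) / gamma,
    where E = (1 - alpha^(gamma+1)) / (1 - alpha).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 1:
        raise ValueError("gamma must be at least 1")
    if alpha == 1.0:
        expected = float(gamma + 1)  # limit of the geometric sum
    else:
        expected = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    return (expected - 1.0) / gamma
```

For example, a poorly matched draft with α = 0.3 and γ = 5 breaks even at roughly c ≈ 0.086, so a draft costing 20% of the target would slow decoding down.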

How to use the Speculative Decoding Speedup Calculator

1. Enter your values into the calculator; sensible, domain-typical defaults are pre-filled so you see a real result immediately.
2. The result recomputes live using the formula shown on the page; there is no button to press.
3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.
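As an illustration of what the substituted numbers look like, here is the formula evaluated by hand. The inputs α = 0.8, γ = 4 and c = 0.05 are chosen for this example and are not necessarily the page's defaults.

```latex
\begin{align*}
E[\text{tokens/round}] &= \frac{1-\alpha^{\gamma+1}}{1-\alpha}
  = \frac{1-0.8^{5}}{1-0.8}
  = \frac{0.67232}{0.2}
  = 3.3616 \\
\text{speedup} &= \frac{E[\text{tokens/round}]}{\gamma c + 1}
  = \frac{3.3616}{4 \cdot 0.05 + 1}
  = \frac{3.3616}{1.2}
  \approx 2.80
\end{align*}
```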

Why use the Speculative Decoding Speedup Calculator?

✓ Computes Speculative Decoding Speedup instantly in your browser: no sign-up, no upload, no server round-trip.
✓ 100% free and unlimited, with the exact formula shown: E[tokens/round] = (1−α^(γ+1))/(1−α).
✓ Runs entirely client-side, so every value you enter stays private on your device.
✓ Live recompute as you type, with a worked example and authoritative references for trust.

Frequently asked questions

Why can the target model verify γ tokens for the price of one?

Single-token decode is memory-bandwidth-bound: weights stream from HBM while compute idles. Verifying γ tokens reuses the same weight stream with γ-fold more arithmetic — nearly free until you become compute-bound. Speculation converts idle FLOPs into accepted tokens.
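A back-of-envelope roofline estimate shows how much headroom there is before verification becomes compute-bound. The sketch assumes a dense model costing about 2 FLOPs per parameter per token, weights read once per forward pass, and no KV-cache or attention traffic; the hardware figures in the usage note are hypothetical round numbers, not the specs of any particular GPU.

```python
def compute_bound_token_count(
    peak_flops: float, mem_bandwidth: float, bytes_per_param: float = 2.0
) -> float:
    """Tokens per forward pass at which arithmetic time equals weight-streaming time.

    Streaming time:  params * bytes_per_param / mem_bandwidth
    Arithmetic time: 2 * params * tokens / peak_flops   (assumed ~2 FLOPs/param/token)
    Setting them equal, the parameter count cancels out.
    """
    return peak_flops * bytes_per_param / (2.0 * mem_bandwidth)


def relative_pass_cost(
    tokens: float, peak_flops: float, mem_bandwidth: float, bytes_per_param: float = 2.0
) -> float:
    """Time of a forward pass over `tokens` positions relative to a one-token pass."""
    ridge = compute_bound_token_count(peak_flops, mem_bandwidth, bytes_per_param)
    # In units of the weight-streaming time, a pass costs max(1, tokens / ridge).
    return max(1.0, tokens / ridge) / max(1.0, 1.0 / ridge)
```

With a hypothetical accelerator at 1e15 FLOP/s and 3e12 B/s, the crossover sits near 333 tokens per pass, so verifying a handful of drafted tokens for a single request costs about the same as one token. Batching many requests consumes the same headroom, which is why the benefit shrinks at large batch sizes.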

What acceptance rate is realistic?

A same-family draft (Llama-68M→7B, 1B→70B) on natural text reaches α≈0.7–0.9; mismatched tokenizers or domains can halve that. Code and predictable formats accept more; creative high-temperature sampling accepts less. Measure on YOUR traffic before committing.
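One way to measure α on your own traffic is to log how many drafted tokens were accepted in each round and divide by how many were actually tested. The sketch below assumes a hypothetical log of per-round acceptance counts and the same independence assumption the formula makes (each drafted token is accepted with probability α until the first rejection).

```python
from typing import Sequence


def estimate_acceptance_rate(accepted_per_round: Sequence[int], gamma: int) -> float:
    """Estimate alpha as accepted drafts / tested drafts over logged rounds.

    accepted_per_round[i] is the number of drafted tokens accepted in round i
    (between 0 and gamma). Drafts after the first rejection are never tested,
    so they are excluded from the denominator.
    """
    accepted = 0
    tested = 0
    for n in accepted_per_round:
        if not 0 <= n <= gamma:
            raise ValueError("each count must lie in [0, gamma]")
        accepted += n
        # If fewer than gamma were accepted, one more draft was tested and rejected.
        tested += min(n + 1, gamma)
    if tested == 0:
        raise ValueError("no drafted tokens were tested")
    return accepted / tested
```

For example, rounds with 4, 4, 2 and 0 accepted tokens at γ = 4 give 10 accepted out of 12 tested. Real acceptance is position- and context-dependent, so treat the single number as an average rather than a constant.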

Is longer drafting always better?

No. Expected accepted tokens saturate at 1/(1−α) while draft cost grows linearly in γ. With α=0.8 the ceiling is 5 tokens, so each extra drafted token beyond the first few adds less than it costs. In practice the optimum is usually γ=3–6, and it moves higher as c shrinks or α rises; this tool shows the curve's bend instantly.
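The bend in the curve can be located by sweeping γ and keeping the value with the highest predicted speedup. This is a minimal sketch under the page's simplified cost model (a round costs γ·c + 1); the function names and the `max_gamma` search bound are illustrative assumptions.

```python
def speedup(alpha: float, gamma: int, c: float) -> float:
    """Predicted speedup: (1 - alpha^(gamma+1)) / (1 - alpha) / (gamma*c + 1)."""
    if alpha == 1.0:
        expected = float(gamma + 1)  # limit of the geometric sum
    else:
        expected = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    return expected / (gamma * c + 1.0)


def best_gamma(alpha: float, c: float, max_gamma: int = 32) -> int:
    """Draft length in 1..max_gamma that maximizes the predicted speedup."""
    return max(range(1, max_gamma + 1), key=lambda g: speedup(alpha, g, c))
```

The model ignores per-round overheads and the drop in acceptance at later draft positions, so the γ it suggests is better read as an upper bound to benchmark around than as a final setting.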

How do Medusa/EAGLE differ from draft-model speculation?

They generate drafts from the target's OWN hidden states via small extra heads: no separate model, no tokenizer mismatch, and an effective c≈0.05. EAGLE-2 reports 2.5–4× on chat workloads. The same algebra applies here once you plug in their measured α and c.
