
Speculative Decoding Speedup Calculator

Expected speedup from draft-model speculation given acceptance rate, draft length and model cost ratio.

Outputs: expected tokens per round; speedup (×)

Speculation exploits decode being bandwidth-bound: the target model verifies γ drafted tokens in ONE forward pass costing barely more than generating one. The output distribution is provably unchanged: it's exact, not approximate.

Formula

E[tokens/round] = (1 − α^(γ+1)) / (1 − α); speedup = E / (γ·c + 1), where α is the acceptance rate, γ the draft length and c the draft-to-target cost ratio (Leviathan et al., 2023)
References: Leviathan et al. (2023), Fast Inference from Transformers via Speculative Decoding; Chen et al. (2023), Accelerating LLM Decoding with Speculative Sampling
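As a minimal sketch of the two formulas above (not the calculator's own source, which is not shown on this page), the Python below computes both quantities. The function names are hypothetical, c is assumed to be the cost of one draft step relative to one target pass, and the sample inputs are illustrative rather than the tool's defaults.

```python
def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    """E[tokens/round] = (1 - alpha^(gamma+1)) / (1 - alpha)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Limit of the geometric sum as alpha -> 1:
        # all gamma drafts accepted plus the target's own token.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected speedup = E / (gamma * c + 1)."""
    if c < 0.0:
        raise ValueError("c must be non-negative")
    return expected_tokens_per_round(alpha, gamma) / (gamma * c + 1.0)


# Illustrative inputs only: alpha = 0.8, gamma = 4, c = 0.05.
print(round(expected_tokens_per_round(0.8, 4), 4))
print(round(speedup(0.8, 4, 0.05), 4))
```

The α = 1 branch avoids a division by zero; it returns the limit of the geometric sum rather than a separately derived value.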

About Speculative Decoding Speedup Calculator

Speculative decoding is the rare free lunch of LLM serving: a small draft model guesses several tokens, the big model verifies them all in a single forward pass, and rejection sampling guarantees the output distribution is EXACTLY the target model's. This calculator implements the expected-speedup algebra from the original paper: expected tokens per round, a geometric sum in the acceptance rate α over the draft length γ, divided by the combined draft-plus-verify cost. Use it to tune the two knobs that matter (draft quality α, draft length γ) before benchmarking. The verdict flags configurations that backfire, meaning they run slower than plain decoding, which happens when the draft model is too costly or poorly matched to the target.
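To illustrate how a configuration can backfire, here is a small sketch that applies the page's formula and labels any result below 1× as a slowdown. The `verdict` function, its wording and the single 1.0 threshold are assumptions made for this example; the calculator's actual verdict rules are not documented here.

```python
def speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected speedup = E[tokens/round] / (gamma * c + 1)."""
    if alpha == 1.0:
        expected = float(gamma + 1)  # limit of the geometric sum
    else:
        expected = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    return expected / (gamma * c + 1.0)


def verdict(alpha: float, gamma: int, c: float) -> str:
    """Hypothetical verdict: anything under 1x is slower than plain decoding."""
    s = speedup(alpha, gamma, c)
    if s < 1.0:
        return f"backfires: {s:.2f}x (slower than plain decoding)"
    return f"helps: {s:.2f}x"


# A poorly matched, relatively costly draft with a long draft length.
print(verdict(alpha=0.3, gamma=8, c=0.2))
# A well matched, cheap draft with a short draft length.
print(verdict(alpha=0.8, gamma=4, c=0.05))
```

The first call shows the failure mode: with a low acceptance rate, the expected tokens per round stay near 1.4 while the round cost grows to 2.6 target-pass equivalents.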

How to use Speculative Decoding Speedup Calculator

  1. Enter your values into the Speculative Decoding Speedup Calculator. Sensible, domain-typical defaults are pre-filled so you see a real result immediately.
  2. The result recomputes live using the formula shown on the page; there is no button to press.
  3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.

Why use Speculative Decoding Speedup Calculator?

  • Computes Speculative Decoding Speedup instantly in your browser: no sign-up, no upload, no server round-trip.
  • 100% free and unlimited, with the exact formula shown: E[tokens/round] = (1 − α^(γ+1)) / (1 − α).
  • Runs entirely client-side, so every value you enter stays private on your device.
  • Live recompute as you type, with a worked example and authoritative references for trust.

Frequently asked questions

Why can the target model verify γ tokens for the price of one?

Single-token decode is memory-bandwidth-bound: weights stream from HBM while compute idles. Verifying γ tokens reuses the same weight stream with γ-fold more arithmetic, which is nearly free until you become compute-bound. Speculation converts idle FLOPs into accepted tokens.
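A rough roofline-style sketch can make this concrete. It assumes a forward pass takes the larger of the time to stream the weights once and the time to do about 2 FLOPs per parameter per token. The hardware figures are hypothetical round numbers, and the model ignores KV-cache traffic, attention cost and batching, so treat it as an order-of-magnitude illustration only.

```python
def pass_time_s(
    n_tokens: int,
    params: float,
    bytes_per_param: float = 2.0,   # assumption: 16-bit weights
    mem_bw: float = 2.0e12,         # assumption: 2 TB/s memory bandwidth
    peak_flops: float = 4.0e14,     # assumption: 400 TFLOP/s peak compute
) -> float:
    """Simplified time for one forward pass over n_tokens positions."""
    t_memory = params * bytes_per_param / mem_bw      # stream weights once
    t_compute = 2.0 * params * n_tokens / peak_flops  # ~2 FLOPs/param/token
    return max(t_memory, t_compute)


params = 7e9  # a hypothetical 7B-parameter target model
base = pass_time_s(1, params)
for n in (1, 5, 50, 400):
    ratio = pass_time_s(n, params) / base
    print(f"{n:>3} tokens per pass: {ratio:.2f}x the cost of one token")
```

Under these assumed numbers the pass stays memory-bound for small token counts, so verifying a handful of drafted tokens costs the same as generating one; only far larger counts (or large batches, which this sketch does not model) push the pass into the compute-bound regime.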

What acceptance rate is realistic?

A same-family draft (Llama-68M→7B, 1B→70B) on natural text reaches α≈0.7–0.9; mismatched tokenizers or domains can halve that. Code and predictable formats accept more; creative high-temperature sampling accepts less. Measure on YOUR traffic before committing.
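One way to measure α on your own traffic is to log how many drafted tokens were accepted in each round and divide total acceptances by total acceptance attempts. The sketch below assumes each drafted token is accepted independently with the same probability (the same simplification the formula relies on) and that the logged count excludes the extra token the target model emits itself. The function name and the sample log are hypothetical.

```python
from typing import Sequence


def estimate_alpha(accepted_per_round: Sequence[int], gamma: int) -> float:
    """Estimate the per-token acceptance rate from per-round accepted counts.

    A round that accepts k < gamma drafts made k + 1 attempts (k accepted,
    then one rejection); a round that accepts all gamma made gamma attempts.
    """
    if gamma < 1:
        raise ValueError("gamma must be at least 1")
    accepted = 0
    attempts = 0
    for k in accepted_per_round:
        if not 0 <= k <= gamma:
            raise ValueError("accepted count must lie in [0, gamma]")
        accepted += k
        attempts += min(k + 1, gamma)
    if attempts == 0:
        raise ValueError("no rounds logged")
    return accepted / attempts


# Hypothetical log of five rounds with gamma = 4.
rounds = [4, 2, 0, 4, 1]
print(f"estimated alpha = {estimate_alpha(rounds, gamma=4):.3f}")
```

Counting attempts rather than drafted tokens matters: tokens drafted after the first rejection are never tested, so dividing by γ per round would understate α.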

Is longer drafting always better?

No. Expected accepted tokens saturate at 1/(1−α) while draft cost grows linearly in γ. With α=0.8 the ceiling is 5 tokens; drafting 10 just wastes draft compute. The optimum is usually γ=3–6, depending on the cost ratio c; this tool shows the curve's bend instantly.
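The saturation is easy to see by sweeping γ with the page's formula. This sketch uses α = 0.8 as in the answer above and an assumed cost ratio c = 0.1; the function names and the search bound of 16 are illustrative choices, not part of the calculator.

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    """E[tokens/round] = (1 - alpha^(gamma+1)) / (1 - alpha)."""
    if alpha == 1.0:
        return float(gamma + 1)  # limit of the geometric sum
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected speedup = E / (gamma * c + 1)."""
    return expected_tokens(alpha, gamma) / (gamma * c + 1.0)


def best_gamma(alpha: float, c: float, max_gamma: int = 16) -> int:
    """Draft length in 1..max_gamma with the highest expected speedup."""
    return max(range(1, max_gamma + 1), key=lambda g: speedup(alpha, g, c))


alpha, c = 0.8, 0.1  # c = 0.1 is an assumed cost ratio for illustration
for g in range(1, 11):
    print(f"gamma={g:>2}  E={expected_tokens(alpha, g):.3f}  "
          f"speedup={speedup(alpha, g, c):.3f}")
print("best gamma:", best_gamma(alpha, c))
```

With these inputs the expected tokens per round creep toward the ceiling of 5 while the denominator keeps growing, so the speedup peaks at a moderate γ and then declines. A cheaper draft (smaller c) moves the peak to a longer draft length.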

How do Medusa/EAGLE differ from draft-model speculation?

They generate drafts from the target's OWN hidden states via small extra heads: no separate model, no tokenizer mismatch, and effectively c≈0.05. EAGLE-2 reports 2.5–4× on chat workloads. The same algebra here applies with their measured α and c.
