LLM Batching Throughput & Latency Calculator

How batch size trades per-user latency for aggregate tokens/sec — the roofline crossover made interactive.

Outputs: single-stream speed (tok/s), aggregate throughput (tok/s), per-user speed (tok/s), and the batch size past which decode becomes compute-bound.

This is the economics of every LLM API: weights stream from HBM once per step regardless of batch, so serving 32 users costs barely more than serving one — until arithmetic intensity hits the compute roofline.

Formula

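single-stream ≈ 0.7 × BW ÷ weights
critical batch ≈ compute_FLOPS ÷ (2P × single-stream)
Here BW is memory bandwidth, weights is the model's weight bytes, and P is the parameter count. Below the critical batch, batching adds users at no per-user cost.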
References: Pope et al. (2022), Efficiently Scaling Transformer Inference; Yu et al. (2022), Orca: A Distributed Serving System for Transformer-Based Generative Models
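The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the page's actual implementation: the function name, the parameter names, the `efficiency` default, and the way aggregate throughput is capped at the compute ceiling are all assumptions, and the example inputs are hypothetical rather than the specs of any particular GPU.

```python
from dataclasses import dataclass


@dataclass
class RooflineResult:
    single_stream_tok_s: float  # decode speed for one user
    critical_batch: float       # batch where compute time equals weight-streaming time
    aggregate_tok_s: float      # total tokens/sec across the batch
    per_user_tok_s: float       # tokens/sec seen by each user
    compute_bound: bool         # True once batch exceeds the critical batch


def batching_roofline(bandwidth_gb_s, weights_gb, params_billion,
                      compute_tflops, batch, efficiency=0.7):
    """Simple roofline model of batched decode (assumed, simplified)."""
    # Memory-bound speed: one full pass over the weights per decode step.
    single = efficiency * bandwidth_gb_s / weights_gb
    # About 2 FLOPs per parameter per generated token.
    flops_per_token = 2.0 * params_billion * 1e9
    # Compute ceiling on aggregate tokens/sec.
    ceiling = compute_tflops * 1e12 / flops_per_token
    critical = ceiling / single
    # Below the critical batch every user runs at single-stream speed;
    # above it, the compute ceiling is shared between users.
    aggregate = min(batch * single, ceiling)
    return RooflineResult(single, critical, aggregate,
                          aggregate / batch, batch > critical)


# Hypothetical inputs: 8B parameters at 16-bit (16 GB), 1000 GB/s, 200 TFLOPS.
result = batching_roofline(1000, 16, 8, 200, batch=32)
print(result)
```

With these hypothetical inputs the single-stream speed is about 43.75 tok/s and the critical batch is roughly 286, so a batch of 32 stays memory-bound and each user still sees the single-stream speed.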

About LLM Batching Throughput & Latency Calculator

Why can an API sell 70B-model tokens so cheaply? Because decode is bandwidth-bound: the GPU streams all weights from HBM to produce one token per user, and producing thirty-two users' tokens in that same stream costs almost nothing extra. This calculator makes the roofline interactive: single-stream speed from bandwidth over weight bytes, the critical batch where you cross into compute-bound territory, and aggregate versus per-user throughput on either side. It is the one mental model that explains continuous batching, why your local single-user speed says little about serving economics, and where adding concurrency starts hurting latency.

How to use LLM Batching Throughput & Latency Calculator

  1. Enter your values into LLM Batching Throughput & Latency Calculator; sensible, domain-typical defaults are pre-filled so you see a real result immediately.
  2. The result recomputes live using the formula shown on the page; there is no button to press.
  3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.

Why use LLM Batching Throughput & Latency Calculator?

  • Computes LLM Batching Throughput & Latency instantly in your browser — no sign-up, no upload, no server round-trip.
  • 100% free and unlimited, with the exact formula shown: single-stream ≈ 0.7 × BW ÷ weights.
  • Runs entirely client-side, so every value you enter stays private on your device.
  • Live recompute as you type, with a worked example and authoritative references for trust.

Frequently asked questions

Why does batch 32 barely slow down each user?

Below the critical batch, the bottleneck is streaming weights — identical work for 1 or 32 users; the extra multiply-adds ride along free in otherwise-idle compute units. Per-user speed only degrades once total arithmetic saturates the tensor cores.
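The critical batch can be derived from a simple two-term step-time model. This is a sketch under stated assumptions: the symbols are introduced here for illustration (W for weight bytes, BW for memory bandwidth, η ≈ 0.7 for achieved bandwidth efficiency, P for parameter count, F for peak FLOP/s, B for batch size), and the model ignores KV-cache reads and attention FLOPs.

```latex
\begin{align*}
t_{\mathrm{mem}} &= \frac{W}{\eta\,\mathrm{BW}} \\
t_{\mathrm{comp}}(B) &= \frac{2PB}{F} \\
t(B) &\approx \max\bigl(t_{\mathrm{mem}},\; t_{\mathrm{comp}}(B)\bigr) \\
s_1 &= \frac{1}{t_{\mathrm{mem}}} = \frac{\eta\,\mathrm{BW}}{W} \\
t_{\mathrm{mem}} = t_{\mathrm{comp}}(B^{*})
  \;&\Longrightarrow\; B^{*} = \frac{F}{2P\,s_1}
\end{align*}
```

For B below B*, the step time stays at t_mem, so each user keeps the single-stream speed s_1 and aggregate throughput grows as B × s_1. Past B*, aggregate throughput saturates at F ÷ 2P and per-user speed falls as F ÷ (2P × B).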

What is continuous batching and why did it boost serving throughput by roughly 10×?

Static batching waits for the whole batch to finish, so one long generation stalls everyone. Continuous (iteration-level) batching swaps finished sequences for queued ones at every step, keeping the batch full: roughly 5–20× real-world throughput (introduced by Orca, then made standard by vLLM).
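The difference can be seen in a toy simulation that counts decode steps for a queue of requests. This is an illustrative sketch, not how Orca or vLLM are implemented: the function names are hypothetical, each request is reduced to its output length in tokens, and prefill cost, scheduling overhead, and memory limits are ignored.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Each group of `slots` requests runs until its longest member finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths, slots):
    """Finished sequences are replaced from the queue before every step."""
    queue = deque(lengths)
    active = []  # remaining tokens for each in-flight sequence
    steps = 0
    while queue or active:
        # Refill free slots from the queue (iteration-level scheduling).
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1
        # Every active sequence emits one token; drop those that just finished.
        active = [r - 1 for r in active if r > 1]
    return steps


# Hypothetical workload: a few long generations mixed with short ones.
lengths = [8, 2, 2, 2, 8, 2, 2, 2]
print(static_batching_steps(lengths, slots=2))      # 20
print(continuous_batching_steps(lengths, slots=2))  # 14
```

In this toy workload the same 28 tokens take 20 steps with static batching and 14 with continuous batching, because no slot sits idle while a long generation finishes. The gap widens as output lengths become more uneven.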

What limits batch size before compute does?

KV-cache memory: each concurrent sequence reserves its own cache (use our per-model KV-cache calculators to size it), and the feasible batch is the VRAM left after weights divided by that per-sequence cache. Long contexts shrink feasible batches drastically, and this is often the binding constraint well before this tool's compute crossover.
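A rough version of that division can be sketched as follows. The function and parameter names are hypothetical, the model shape in the example is illustrative, and the estimate assumes a standard attention KV cache (one key and one value vector per layer per KV head) with no allowance for activations, fragmentation, or paged allocation.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies: keys plus values, all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def max_batch_from_kv(vram_gb, weights_gb, context_tokens,
                      layers, kv_heads, head_dim, bytes_per_value=2):
    """Largest number of concurrent sequences whose caches fit beside the weights."""
    free_bytes = (vram_gb - weights_gb) * 1e9
    per_sequence = kv_bytes_per_token(
        layers, kv_heads, head_dim, bytes_per_value) * context_tokens
    return max(0, int(free_bytes // per_sequence))


# Hypothetical shape: 32 layers, 8 KV heads, head_dim 128, 16-bit cache,
# 16 GB of weights on an 80 GB device.
for context in (8192, 32768):
    print(context, max_batch_from_kv(80, 16, context, 32, 8, 128))
```

With these hypothetical numbers, an 8,192-token context leaves room for 59 sequences and a 32,768-token context for only 14, which would sit far below a compute crossover in the hundreds.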

Does this math change with quantization?

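Helpfully: 4-bit weights stream 4× fewer bytes than 16-bit weights, so single-stream speed roughly quadruples. By the formula above, the critical batch falls by about the same factor, and dequantization adds some compute on top, so the compute-bound regime arrives at a smaller batch. That is why quantized models don't just fit better, they SERVE each user faster per GPU.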
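Plugging two bit widths into the page's formula shows how both quantities move. This is a minimal sketch with hypothetical inputs: the function name and parameters are assumptions, weight size is taken as parameters × bits ÷ 8, and dequantization overhead is not modeled.

```python
def decode_roofline(params_billion, bits_per_weight, bandwidth_gb_s,
                    compute_tflops, efficiency=0.7):
    """Return (single-stream tok/s, critical batch) for a given weight precision."""
    weights_gb = params_billion * bits_per_weight / 8.0
    single_stream = efficiency * bandwidth_gb_s / weights_gb
    # critical batch ≈ compute_FLOPS ÷ (2P × single-stream)
    critical_batch = compute_tflops * 1e12 / (
        2.0 * params_billion * 1e9 * single_stream)
    return single_stream, critical_batch


# Hypothetical device: 1000 GB/s bandwidth, 200 TFLOPS, 8B-parameter model.
for bits in (16, 4):
    speed, critical = decode_roofline(8, bits, 1000, 200)
    print(f"{bits}-bit: {speed:.2f} tok/s single-stream, "
          f"critical batch about {critical:.0f}")
```

Because the critical batch is inversely proportional to single-stream speed in this model, quadrupling the speed cuts the critical batch to a quarter, while the compute ceiling on aggregate throughput stays where it was.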
