LLM FLOPs-per-Token Calculator

Forward-pass compute per token (≈2P) plus attention's context-dependent term — and tokens/s on your GPU.

Model parameters (B)
Current context length (tokens)
Layers
Hidden size
GPU effective TFLOPS: RTX 4090 FP16 ≈ 165, A100 ≈ 312, H100 ≈ 990 (dense)
Decode efficiency (%): single-stream decode is memory-bound, 20–40% typical

GFLOPs per token
of which attention (ctx term) (%)
Est. tokens/s (compute bound)

The 2P rule: each parameter does one multiply-add per token. Attention adds a term that grows with context — at 128K context it can exceed the dense compute for small models.

Formula

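FLOPs/token ≈ 2·P + 4·L·H·ctx, where P is the parameter count, L the number of layers, H the hidden size, and ctx the current context length. The 2·P term covers every weight participating in one multiply-add; the 4·L·H·ctx term covers the attention scores and values.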
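
As a rough illustration, the formula and the three outputs can be sketched in Python. The function names, dictionary keys, and example inputs (an 8B model with 32 layers and hidden size 4096 at 8,192 tokens of context, on a 165 TFLOPS GPU at 30% efficiency) are assumptions chosen for illustration, not the calculator's actual code.

```python
def flops_per_token(params_b, ctx, layers, hidden):
    """Return (dense, attention) forward-pass FLOPs for one token.

    params_b is the parameter count in billions, matching the input field.
    """
    dense = 2.0 * params_b * 1e9             # 2 FLOPs per parameter
    attention = 4.0 * layers * hidden * ctx  # scores + values, grows with ctx
    return dense, attention


def estimate(params_b, ctx, layers, hidden, gpu_tflops, efficiency_pct):
    """Combine both terms and convert to a compute-bound tokens/s ceiling."""
    dense, attention = flops_per_token(params_b, ctx, layers, hidden)
    total = dense + attention
    effective_flops = gpu_tflops * 1e12 * efficiency_pct / 100.0
    return {
        "gflops_per_token": total / 1e9,
        "attention_share_pct": 100.0 * attention / total,
        "tokens_per_s": effective_flops / total,
    }


# Hypothetical inputs, chosen only to exercise the formula.
result = estimate(params_b=8, ctx=8192, layers=32, hidden=4096,
                  gpu_tflops=165, efficiency_pct=30)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

With these assumed inputs the sketch gives about 20.3 GFLOPs per token, with attention contributing roughly 21% of the total.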

References: Kaplan et al. (2020), Scaling Laws for Neural Language Models (Appendix); Pope et al. (2022), Efficiently Scaling Transformer Inference

About LLM FLOPs-per-Token Calculator

How much compute does one token actually cost? The famous rule of thumb says 2 FLOPs per parameter — one multiply-add for every weight — but that ignores attention, whose cost grows linearly with how much context the token attends over. This calculator computes both terms, shows attention's share, and converts the total into a ceiling tokens-per-second figure for your GPU at a realistic decode efficiency. It explains at a glance why long-context decoding slows down even when the KV cache fits, and why small models feel disproportionately fast at short contexts.

How to use LLM FLOPs-per-Token Calculator

1. Enter your values into the LLM FLOPs-per-Token Calculator. Sensible, domain-typical defaults are pre-filled so you see a real result immediately.
2. The result recomputes live using the formula shown on the page; there is no button to press.
3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.

Why use LLM FLOPs-per-Token Calculator?

✓ Computes LLM FLOPs per token instantly in your browser: no sign-up, no upload, no server round-trip.
✓ 100% free and unlimited, with the exact formula shown: FLOPs/token ≈ 2·P + 4·L·H·ctx.
✓ Runs entirely client-side, so every value you enter stays private on your device.
✓ Live recompute as you type, with a worked example and authoritative references for trust.

Frequently asked questions

Where does the 2P rule come from?

A forward pass multiplies each weight by an activation and accumulates: one multiply plus one add = 2 FLOPs per parameter per token. Training costs ~6P per token (the forward pass plus a backward pass of roughly twice the forward cost). The 2P rule is accurate within ~10% for dense transformers at short context.
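
The arithmetic behind the 2P and ~6P figures can be written out as a small sketch. The function names and the example model size and token count are hypothetical, and the 6P figure is the approximation quoted above, not an exact count.

```python
def forward_flops_per_token(params):
    """One multiply plus one add per weight: 2 FLOPs per parameter."""
    return 2.0 * params


def training_flops_per_token(params):
    """Forward (2P) plus a backward pass of roughly twice that (4P)."""
    forward = forward_flops_per_token(params)
    backward = 2.0 * forward
    return forward + backward


def training_flops(params, tokens):
    """Approximate total training compute over a token budget."""
    return training_flops_per_token(params) * tokens


# Hypothetical example: an 8B-parameter model trained on 1e12 tokens.
print(f"{training_flops(8e9, 1e12):.2e} FLOPs")
```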

Why is real decode throughput far below GPU peak TFLOPS?

Single-stream decoding is memory-bandwidth bound — every token must stream all weights from HBM. Hence the 20–40% effective-efficiency default. Batching restores compute-bound behavior, which is why servers quote massively higher aggregate tokens/s.

When does attention compute overtake the 2P term?

When 4·L·H·ctx ≈ 2P, i.e. ctx ≈ P/(2LH). For Llama 3 8B (L = 32, H = 4096) that is ~30K tokens; for a 1B model with L = 16 and H = 2048 it is ~15K. FlashAttention reduces memory traffic, not these FLOPs, so long context is intrinsically more expensive per token.
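
The crossover point follows directly from setting the two terms equal, and a short sketch makes it easy to evaluate for any model shape. The function name is hypothetical, and the shapes below (P = 8e9, L = 32, H = 4096 and P = 1e9, L = 16, H = 2048) are assumed round figures, so results for real checkpoints will differ slightly.

```python
def crossover_context(params, layers, hidden):
    """Context length at which 4*L*H*ctx equals 2*P, i.e. ctx = P / (2*L*H)."""
    return params / (2.0 * layers * hidden)


# Assumed model shapes, for illustration only.
shapes = {
    "8B (L=32, H=4096)": (8e9, 32, 4096),
    "1B (L=16, H=2048)": (1e9, 16, 2048),
}
for name, (params, layers, hidden) in shapes.items():
    ctx = crossover_context(params, layers, hidden)
    print(f"{name}: attention matches the 2P term at ~{ctx:,.0f} tokens")
```

Under these assumptions the crossover lands near 30.5K tokens for the 8B shape and near 15.3K for the 1B shape.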

Is prefill different from decode?

Yes — prefill processes the whole prompt in parallel and is compute-bound (often hitting 50–70% MFU), while decode emits one token at a time and is bandwidth-bound. This tool models per-token cost, which applies to both, but achievable efficiency differs sharply.
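
To see how the per-token cost adds up over a prompt, the sketch below sums the formula across every prompt position, assuming causal attention in which the token at position i attends over i tokens. The function names and the idea of dividing by an assumed MFU to get a time estimate are illustrative assumptions, not something this tool computes.

```python
def prefill_flops(params, layers, hidden, prompt_len):
    """Sum of 2P + 4*L*H*i over positions i = 1..prompt_len (closed form)."""
    dense = 2.0 * params * prompt_len
    # 1 + 2 + ... + n = n * (n + 1) / 2
    attention = 4.0 * layers * hidden * prompt_len * (prompt_len + 1) / 2.0
    return dense + attention


def prefill_seconds(params, layers, hidden, prompt_len, gpu_tflops, mfu_pct):
    """Rough prefill time at an assumed model FLOPs utilization (MFU)."""
    achieved_flops = gpu_tflops * 1e12 * mfu_pct / 100.0
    return prefill_flops(params, layers, hidden, prompt_len) / achieved_flops


# Hypothetical example: 8B model, 4,096-token prompt, 312 TFLOPS at 60% MFU.
seconds = prefill_seconds(8e9, 32, 4096, 4096, gpu_tflops=312, mfu_pct=60)
print(f"estimated prefill time: {seconds:.3f} s")
```

The same function evaluated at a decode-like efficiency would show the sharp gap in achievable throughput between the two phases.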
