Chinchilla Training FLOPs Calculator
Total training compute C = 6·N·D, the Chinchilla-optimal token count, and your over/under-training ratio.
Defaults are Llama 3 8B: 15T tokens is ~94× the Chinchilla optimum — deliberately 'over-trained' because a small model served billions of times repays extra training compute many times over.
Formula
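Written out, the quantities described above are as follows, with the Llama 3 8B defaults substituted (N = 8×10^9 parameters, D = 1.5×10^13 tokens). The symbols D_opt and r are labels chosen here for illustration, and the factor of 20 is the rule-of-thumb ratio rather than an exact constant.

```latex
\begin{align*}
C &= 6\,N\,D = 6 \times (8\times10^{9}) \times (1.5\times10^{13}) = 7.2\times10^{23}\ \text{FLOPs} \\
D_{\text{opt}} &\approx 20\,N = 20 \times (8\times10^{9}) = 1.6\times10^{11}\ \text{tokens} \\
r &= \frac{D}{D_{\text{opt}}} = \frac{1.5\times10^{13}}{1.6\times10^{11}} \approx 94
\end{align*}
```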
About the Chinchilla Training FLOPs Calculator
The Chinchilla paper reset how labs budget training runs: for a fixed compute budget, model size and data should scale together, with roughly 20 tokens per parameter at the optimum. This calculator computes total training FLOPs from the 6ND rule, the compute-optimal token count for your model size, and how far above or below that count your plan sits. Modern open models intentionally blow past the optimum: Llama 3 8B was trained on roughly 94× the Chinchilla-optimal token count, because serving economics, not training economics, dominate. The verdict badge tells you which regime you are in.
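A minimal Python sketch of the three outputs described above, assuming the 6ND rule and a fixed 20 tokens per parameter. The function name, the `TrainingPlan` fields, and the verdict labels are hypothetical; the page's actual badge thresholds are not stated, so the simple "above or below 1×" cut used here is an assumption.

```python
from dataclasses import dataclass

FLOPS_PER_PARAM_PER_TOKEN = 6.0  # the 6ND rule
TOKENS_PER_PARAM = 20.0          # Chinchilla rule of thumb (approximate)


@dataclass
class TrainingPlan:
    total_flops: float     # C = 6 * N * D
    optimal_tokens: float  # D_opt = 20 * N
    ratio: float           # D / D_opt
    verdict: str           # hypothetical label, not the page's real badge logic


def chinchilla_plan(n_params: float, n_tokens: float) -> TrainingPlan:
    """Summarise a training plan against the Chinchilla rule of thumb."""
    if n_params <= 0 or n_tokens <= 0:
        raise ValueError("n_params and n_tokens must be positive")

    total_flops = FLOPS_PER_PARAM_PER_TOKEN * n_params * n_tokens
    optimal_tokens = TOKENS_PER_PARAM * n_params
    ratio = n_tokens / optimal_tokens

    # Assumed thresholds: anything above 1x counts as over-trained.
    if ratio > 1.0:
        verdict = "over-trained"
    elif ratio < 1.0:
        verdict = "under-trained"
    else:
        verdict = "compute-optimal"

    return TrainingPlan(total_flops, optimal_tokens, ratio, verdict)


# Llama 3 8B defaults from the page: 8B parameters, 15T tokens.
plan = chinchilla_plan(n_params=8e9, n_tokens=15e12)
print(
    f"{plan.total_flops:.2e} FLOPs, optimum {plan.optimal_tokens:.2e} tokens, "
    f"ratio {plan.ratio:.1f}x ({plan.verdict})"
)
```

With the defaults this gives about 7.2×10^23 FLOPs and a ratio of roughly 94×.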
How to use the Chinchilla Training FLOPs Calculator
1. Enter your values into the Chinchilla Training FLOPs Calculator. Sensible, domain-typical defaults are pre-filled so you see a real result immediately.
2. The result recomputes live using the formula shown on the page; there is no button to press.
3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.
Why use the Chinchilla Training FLOPs Calculator?
- ✓ Computes Chinchilla training FLOPs instantly in your browser: no sign-up, no upload, no server round-trip.
- ✓ 100% free and unlimited, with the exact formula shown: C = 6·N·D.
- ✓ Runs entirely client-side, so every value you enter stays private on your device.
- ✓ Live recompute as you type, with a worked example and authoritative references for trust.
Frequently asked questions
Why 6 FLOPs per parameter per token?
The forward pass costs about 2 FLOPs per parameter per token (one multiply and one add). The backward pass computes gradients with respect to both activations and weights, so it costs roughly twice the forward pass: 4 more. The total is ≈ 6·N per token, the standard accounting used in scaling-law papers.
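The accounting in this answer can be written as a short derivation. The subscripted symbols are labels chosen here for illustration; N is the parameter count and D the number of training tokens.

```latex
\begin{align*}
C_{\text{fwd}} &\approx 2N \quad \text{(one multiply and one add per parameter, per token)} \\
C_{\text{bwd}} &\approx 2\,C_{\text{fwd}} = 4N \quad \text{(gradients w.r.t. activations and weights)} \\
C_{\text{token}} &= C_{\text{fwd}} + C_{\text{bwd}} \approx 6N \\
C &= C_{\text{token}} \cdot D \approx 6\,N\,D
\end{align*}
```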
Is 20 tokens/parameter still the right optimum?
As a compute-optimal rule it has held up well, with refinements (data quality, repeated epochs, learning-rate schedules) shifting it modestly. But 'optimal' assumes you only care about training cost — inference-aware analyses justify training small models far longer.
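To see what "compute-optimal" means for a fixed budget, the sketch below inverts the 6ND rule under the simplifying assumption that the optimum sits at exactly 20 tokens per parameter: substituting D = 20·N into C = 6·N·D gives C = 120·N², so N = sqrt(C / 120). The function name and default are hypothetical, and real scaling-law fits use refined exponents rather than a fixed ratio.

```python
import math


def compute_optimal_split(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into (parameters, tokens) at a fixed tokens/param ratio.

    Assumes C = 6 * N * D and D = tokens_per_param * N, which is a rule of
    thumb rather than an exact scaling-law fit.
    """
    if compute_flops <= 0 or tokens_per_param <= 0:
        raise ValueError("compute_flops and tokens_per_param must be positive")
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# The same budget as the Llama 3 8B defaults (about 7.2e23 FLOPs).
n, d = compute_optimal_split(7.2e23)
print(f"compute-optimal: {n:.3g} parameters on {d:.3g} tokens")
```

Under these assumptions, the budget spent on the 8B defaults would correspond to a compute-optimal model of roughly 77B parameters trained on about 1.5T tokens, which illustrates the trade described above: more training tokens on a smaller model that is cheaper to serve.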
What does it mean that Llama 3 is '94× over-trained'?
Meta trained on ~94× as many tokens as the compute-optimal recipe prescribes for an 8B model (15T versus roughly 160B). The result is quality approaching that of much larger compute-optimal models, in a package that is cheap to serve. This is the dominant strategy when a model will handle trillions of inference tokens.
How do I convert these FLOPs into GPU-hours and dollars?
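Divide by your fleet's sustained throughput, GPUs × peak FLOPS per GPU × MFU, to get wall-clock seconds; multiply by the GPU count and divide by 3,600 to get GPU-hours. MFU (model FLOPs utilization) is typically 30–45% for large-scale training. Our companion training GPU-hours calculator does exactly this and adds electricity and rental-cost estimates.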
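The conversion can be sketched in a few lines of Python. Every number in the example call is an illustrative placeholder, not a measured or quoted figure: 1,000 GPUs, a round 1e15 peak FLOPS per GPU, 40% MFU, and a hypothetical $2 per GPU-hour rental rate. The function name and return shape are likewise assumptions, not the companion calculator's interface.

```python
def training_cost(
    total_flops: float,
    n_gpus: int,
    peak_flops_per_gpu: float,
    mfu: float,
    usd_per_gpu_hour: float,
):
    """Return (wall_clock_hours, gpu_hours, usd) for a training run."""
    if not 0.0 < mfu <= 1.0:
        raise ValueError("mfu must be in (0, 1]")
    if n_gpus <= 0 or peak_flops_per_gpu <= 0:
        raise ValueError("n_gpus and peak_flops_per_gpu must be positive")

    # Sustained fleet throughput in FLOPs per second.
    sustained_flops = n_gpus * peak_flops_per_gpu * mfu
    wall_clock_hours = total_flops / sustained_flops / 3600.0
    gpu_hours = wall_clock_hours * n_gpus
    usd = gpu_hours * usd_per_gpu_hour
    return wall_clock_hours, gpu_hours, usd


# Placeholder inputs: 7.2e23 FLOPs is the 6ND total for the page's defaults.
hours, gpu_hours, usd = training_cost(
    total_flops=7.2e23,
    n_gpus=1000,
    peak_flops_per_gpu=1e15,  # illustrative round number, not a real GPU spec
    mfu=0.40,
    usd_per_gpu_hour=2.0,     # hypothetical rental rate
)
print(f"{hours:,.0f} wall-clock hours, {gpu_hours:,.0f} GPU-hours, ${usd:,.0f}")
```

Note that GPU-hours depend only on per-GPU peak throughput and MFU, not on fleet size; adding GPUs shortens wall-clock time but leaves the GPU-hour total unchanged as long as MFU holds.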
Related ML & AI tools
ROC-AUC Calculator (from TPR/FPR points)
Trapezoidal area under the ROC curve from your (FPR, TPR) operating points — the threshold-independent ranking score.
Classification Threshold Cost Calculator
Find the probability cutoff that minimizes expected cost given your false-positive and false-negative penalties.
Silhouette Score Calculator
Cluster cohesion vs separation for one point — the building block of the silhouette metric for choosing K.