Chinchilla Training FLOPs Calculator

Total training compute C = 6·N·D, the Chinchilla-optimal token count, and your over/under-training ratio.

Training compute (×10²³ FLOPs)
Chinchilla-optimal tokens (B)
Tokens vs optimal (×)

Defaults are Llama 3 8B: 15T tokens is ~94× the Chinchilla optimum — deliberately 'over-trained' because a small model served billions of times repays extra training compute many times over.

Formula

C = 6·N·D FLOPs, where N is the parameter count and D is the number of training tokens · Chinchilla optimum: D* ≈ 20·N (about 20 tokens per parameter)
References: Hoffmann et al. (2022), Training Compute-Optimal Large Language Models (Chinchilla); Llama Team, AI @ Meta (2024), The Llama 3 Herd of Models
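
As a worked example with the page's default inputs, the sketch below substitutes Llama 3 8B into the formulas above. It assumes N = 8×10⁹ parameters and D = 15×10¹² tokens; the function and variable names are illustrative and are not taken from the calculator's own source.

```python
TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb: D* ≈ 20·N


def training_flops(n_params: float, n_tokens: float) -> float:
    """Total training compute from the C = 6·N·D approximation."""
    return 6 * n_params * n_tokens


def chinchilla_optimal_tokens(n_params: float) -> float:
    """Compute-optimal token count under the 20 tokens/parameter rule."""
    return TOKENS_PER_PARAM * n_params


def tokens_vs_optimal(n_params: float, n_tokens: float) -> float:
    """Ratio of planned tokens to the Chinchilla-optimal count (>1 = over-trained)."""
    return n_tokens / chinchilla_optimal_tokens(n_params)


# Assumed defaults: Llama 3 8B, treated as exactly 8e9 parameters and 15e12 tokens.
n, d = 8e9, 15e12
print(f"Training compute: {training_flops(n, d) / 1e23:.2f} x10^23 FLOPs")
print(f"Chinchilla-optimal tokens: {chinchilla_optimal_tokens(n) / 1e9:.0f} B")
print(f"Tokens vs optimal: {tokens_vs_optimal(n, d):.2f}x")
```

With these inputs the sketch reports about 7.2 ×10²³ FLOPs, 160B optimal tokens, and a ratio of roughly 94×, matching the figures quoted for the defaults.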

About Chinchilla Training FLOPs Calculator

The Chinchilla paper reset how labs budget training runs: for a fixed compute budget, model size and data should scale together, with roughly 20 tokens per parameter at the optimum. This calculator computes total training FLOPs from the 6·N·D rule, the compute-optimal token count for your model size, and how far above or below it your plan sits. Modern open models intentionally blow past the optimum (Llama 3 8B was trained on about 94× the Chinchilla-optimal token count) because serving economics, not training economics, dominate; the verdict badge tells you which regime you are in.

How to use Chinchilla Training FLOPs Calculator

  1. Enter your values into Chinchilla Training FLOPs Calculator. Sensible, domain-typical defaults are pre-filled so you see a real result immediately.
  2. The result recomputes live using the formula shown on the page; there is no button to press.
  3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.

Why use Chinchilla Training FLOPs Calculator?

  • Computes Chinchilla Training FLOPs instantly in your browser — no sign-up, no upload, no server round-trip.
  • 100% free and unlimited, with the exact formula shown: C = 6·N·D.
  • Runs entirely client-side, so every value you enter stays private on your device.
  • Live recompute as you type, with a worked example and authoritative references for trust.

Frequently asked questions

Why 6 FLOPs per parameter per token?

The forward pass costs about 2 FLOPs per parameter per token (one multiply and one add). The backward pass computes gradients with respect to both activations and weights, which costs roughly twice the forward pass: about 4 more. The total is ≈ 6·N FLOPs per token, the standard accounting used in scaling-law papers.
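
To make this accounting concrete, the sketch below counts FLOPs for a single dense layer processing one token, using the convention that one multiply-add is 2 FLOPs. The layer sizes are hypothetical, and the count deliberately ignores biases, attention-score computation, and other non-matmul work, as the 6·N approximation does.

```python
def dense_layer_flops_per_token(d_in: int, d_out: int) -> dict:
    """FLOPs for one token through a d_in x d_out weight matrix, forward and backward."""
    params = d_in * d_out              # weight matrix only, no bias
    forward = 2 * params               # y = W x: one multiply-add per weight
    backward_activations = 2 * params  # gradient w.r.t. the layer input
    backward_weights = 2 * params      # gradient w.r.t. the weight matrix
    total = forward + backward_activations + backward_weights
    return {
        "params": params,
        "forward": forward,
        "backward": backward_activations + backward_weights,
        "total": total,
    }


# Hypothetical layer size, chosen only for illustration.
counts = dense_layer_flops_per_token(4096, 4096)
print(counts["forward"] / counts["params"])   # forward FLOPs per parameter
print(counts["backward"] / counts["params"])  # backward FLOPs per parameter
print(counts["total"] / counts["params"])     # total FLOPs per parameter per token
```

Summing the same count over every weight matrix in a model gives the 6·N per-token figure, which multiplied by D tokens yields C = 6·N·D.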

Is 20 tokens/parameter still the right optimum?

As a compute-optimal rule it has held up well, with refinements (data quality, repeated epochs, learning-rate schedules) shifting it modestly. But 'optimal' assumes you only care about training cost — inference-aware analyses justify training small models far longer.
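
Taken at face value, the 20 tokens-per-parameter rule means a fixed compute budget pins down both model size and token count: substituting D = 20·N into C = 6·N·D gives C = 120·N², so N = √(C/120). The sketch below applies that substitution. It illustrates the rule of thumb only, not the fitted scaling law from Hoffmann et al., and the function name is hypothetical.

```python
import math


def compute_optimal_split(flops_budget: float, tokens_per_param: float = 20.0):
    """Return (N, D) that spend flops_budget with D = tokens_per_param * N.

    Derived from C = 6·N·D with D = k·N, so C = 6·k·N² and N = sqrt(C / (6·k)).
    """
    n_params = math.sqrt(flops_budget / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Sanity check against Chinchilla itself (70B parameters, 1.4T tokens).
budget = 6 * 70e9 * 1.4e12
n_opt, d_opt = compute_optimal_split(budget)
print(f"N ≈ {n_opt / 1e9:.1f} B parameters, D ≈ {d_opt / 1e12:.2f} T tokens")
```

Passing a different `tokens_per_param` is one way to explore the refinements mentioned above without changing the 6·N·D accounting.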

What does it mean that Llama 3 is '94× over-trained'?

Meta trained the 8B model on ~94× as many tokens as the compute-optimal recipe calls for (15T versus roughly 160B). The result is quality approaching that of much larger compute-optimal models in a package that is cheap to serve, the dominant strategy when a model will handle trillions of inference tokens.

How do I convert these FLOPs into GPU-hours and dollars?

Divide by your fleet's sustained throughput, GPUs × peak FLOPS per GPU × MFU (typically 30–45% for large-scale training), to get wall-clock seconds; multiply by the GPU count and divide by 3,600 to get GPU-hours. Our companion training GPU-hours calculator does exactly this and adds electricity and rental-cost estimates.
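
The conversion can be sketched as follows. Every input here is a placeholder assumption: the per-GPU peak of 1×10¹⁵ FLOPS, the 40% MFU, the 1,000-GPU fleet, and the $2.00 per GPU-hour rate are illustrative values, not measurements or quotes, and the function names are hypothetical rather than taken from the companion calculator.

```python
def training_gpu_hours(total_flops: float, peak_flops_per_gpu: float, mfu: float) -> float:
    """GPU-hours needed to execute total_flops at a given model FLOPs utilization."""
    if not 0 < mfu <= 1:
        raise ValueError("mfu must be in (0, 1]")
    sustained_per_gpu = peak_flops_per_gpu * mfu  # FLOPs per second actually achieved
    gpu_seconds = total_flops / sustained_per_gpu
    return gpu_seconds / 3600


def wall_clock_days(total_flops: float, n_gpus: int, peak_flops_per_gpu: float, mfu: float) -> float:
    """Elapsed days when the work is spread evenly over n_gpus."""
    return training_gpu_hours(total_flops, peak_flops_per_gpu, mfu) / n_gpus / 24


# Placeholder inputs (assumptions for illustration only).
total = 7.2e23        # FLOPs, e.g. 6·N·D for N = 8e9, D = 15e12
peak = 1e15           # peak FLOPS per GPU
mfu = 0.40            # model FLOPs utilization
price_per_hour = 2.0  # USD per GPU-hour

hours = training_gpu_hours(total, peak, mfu)
print(f"GPU-hours: {hours:,.0f}")
print(f"Wall-clock on 1,000 GPUs: {wall_clock_days(total, 1000, peak, mfu):.1f} days")
print(f"Rental cost: ${hours * price_per_hour:,.0f}")
```

With these placeholder values the estimate comes to roughly 500,000 GPU-hours. GPU-hours do not depend on fleet size under this model; only wall-clock time does.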
