Training Tokens-per-Dollar Calculator
Normalize GPU offers into the metric that decides training budgets — billions of tokens trained per dollar.
This is how to compare a $2.50 H100 against a $0.80 A100 or a cheap spot 4090 honestly: per token trained, not per hour. With the default inputs, a 1B-parameter model reaches its Chinchilla-optimal 20B tokens for around $90 of compute.
Formula
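The conversion can be written out as below. This is a reconstruction from the page's description, not the page's own rendering: the symbol names are chosen here for illustration, and the factor of 3600 assumes peak throughput is given in FLOP/s.

```latex
\begin{align*}
\text{tokens/hr} &= \frac{3600 \cdot P_{\text{peak}} \cdot \text{MFU}}{6N} \\
\text{tokens/\$} &= \frac{\text{tokens/hr}}{\text{price per hour}} \\
\text{Chinchilla cost} &= \frac{20N}{\text{tokens/\$}}
\end{align*}
```

Here N is the parameter count, P_peak the GPU's peak FLOP/s, and 6N the approximate training FLOPs per token.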
Disclaimer: This tool is for general informational and estimation purposes only and is not professional financial, tax, accounting or legal advice. All figures are estimates; verify with a qualified professional before making decisions.
About Training Tokens-per-Dollar Calculator
GPU marketplaces quote dollars per hour; training budgets are spent in tokens. The conversion takes achieved throughput (peak FLOPS times achievable MFU) and divides it by the roughly 6N FLOPs each training token costs, which follows from the 6ND estimate of total training compute for N parameters and D tokens. That collapses every offer into one comparable number: tokens trained per dollar. A discounted A100 frequently beats a premium H100 on this metric for small models, and a spot-priced 4090 can beat both until interconnect requirements rule it out. The third output makes the result concrete: the dollar cost to take your model size to its Chinchilla-optimal token count at that hardware's rate.
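A minimal Python sketch of that conversion, assuming peak throughput is given in FLOP/s and training costs about 6N FLOPs per token. The function names and the example inputs (989e12 FLOP/s peak, 40% MFU) are illustrative assumptions, not the calculator's defaults, so the result need not match the page's own example figure.

```python
def tokens_per_dollar(peak_flops, mfu, price_per_hour, n_params):
    """Tokens trained per dollar for one GPU.

    peak_flops: peak throughput in FLOP/s (assumed unit)
    mfu: achieved fraction of peak, e.g. 0.40
    price_per_hour: offer price in dollars per GPU-hour
    n_params: model parameter count N
    """
    achieved_flops = peak_flops * mfu                  # sustained FLOP/s
    tokens_per_sec = achieved_flops / (6 * n_params)   # ~6N FLOPs per token
    tokens_per_hour = tokens_per_sec * 3600
    return tokens_per_hour / price_per_hour


def chinchilla_cost(peak_flops, mfu, price_per_hour, n_params, tokens_per_param=20):
    """Dollar cost to train N parameters on tokens_per_param * N tokens."""
    target_tokens = tokens_per_param * n_params
    return target_tokens / tokens_per_dollar(peak_flops, mfu, price_per_hour, n_params)


# Illustrative inputs only: assumed 989e12 FLOP/s peak, 40% MFU, $2.50/hr, 1B params.
tpd = tokens_per_dollar(989e12, 0.40, 2.50, 1e9)
print(f"{tpd / 1e9:.4f} B tokens per dollar")
print(f"${chinchilla_cost(989e12, 0.40, 2.50, 1e9):.2f} to reach 20 tokens/param")
```

Because price sits in the denominator, halving the hourly rate doubles tokens per dollar, while a lower MFU reduces it proportionally.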
How to use Training Tokens-per-Dollar Calculator
- 1. Enter your values into Training Tokens-per-Dollar Calculator; sensible, domain-typical defaults are pre-filled so you see a real result immediately.
- 2. The result recomputes live using the formula shown on the page; there is no button to press.
- 3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.
Why use Training Tokens-per-Dollar Calculator?
- ✓ Computes Training Tokens-per-Dollar instantly in your browser: no sign-up, no upload, no server round-trip.
- ✓ 100% free and unlimited, with the exact formula shown: tokens/hr = peak × MFU ÷ (6.
- ✓ Runs entirely client-side, so every value you enter stays private on your device.
- ✓ Live recompute as you type, with a worked example and authoritative references for trust.
Frequently asked questions
Why does the cheaper, slower GPU often win this metric?
Because training small models can't exploit big-GPU advantages: an H100 at 3× the A100's price needs to deliver 3× the achieved (not peak) throughput, but kernel overheads, batch limits and interconnect mean it often delivers 2–2.5× on small runs. Tokens/$ exposes that gap; $/hr hides it.
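The break-even condition above can be checked with a few lines of Python. This is a sketch using the $2.50 and $0.80 prices from the page's example; the function names are hypothetical.

```python
def breakeven_speedup(price_fast, price_slow):
    """Achieved-throughput ratio the pricier GPU needs just to tie on tokens/$."""
    return price_fast / price_slow


def tokens_per_dollar_ratio(achieved_speedup, price_fast, price_slow):
    """Pricier GPU's tokens/$ relative to the cheaper one; above 1.0 it wins."""
    return achieved_speedup / breakeven_speedup(price_fast, price_slow)


needed = breakeven_speedup(2.50, 0.80)
print(f"H100 must be {needed:.3f}x faster in achieved throughput to tie")

# Try the 2x to 2.5x achieved speedups mentioned for small runs.
for speedup in (2.0, 2.5):
    ratio = tokens_per_dollar_ratio(speedup, 2.50, 0.80)
    print(f"{speedup}x speedup -> {ratio:.2f}x the A100's tokens per dollar")
```

Any ratio below 1.0 means the cheaper card trains more tokens per dollar despite being slower.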
What MFU should I plug in for each card class?
Realistic single-node figures: H100 with a tuned stack 38–45%; A100 35–42%; consumer 4090s 25–35% (no NVLink, and smaller VRAM forces gradient accumulation). Multi-node Ethernet setups subtract 5–15 points. When comparing offers, use an honest MFU for each platform rather than copying one number across all of them.
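One way to apply per-platform MFU is to sweep each card's range and compare the resulting bands. In this sketch the MFU ranges and the H100 and A100 prices come from this page; the peak FLOP/s values are assumed spec-sheet style figures and the 4090 price is hypothetical, so substitute the numbers from your own quotes.

```python
# Peak values are ASSUMED dense 16-bit tensor peaks in FLOP/s; verify per SKU.
# The 4090 price is HYPOTHETICAL (the page gives no figure for it).
CARDS = {
    "H100": {"peak": 989e12, "price": 2.50, "mfu": (0.38, 0.45)},
    "A100": {"peak": 312e12, "price": 0.80, "mfu": (0.35, 0.42)},
    "4090": {"peak": 165e12, "price": 0.35, "mfu": (0.25, 0.35)},
}


def tokens_per_dollar(peak_flops, mfu, price_per_hour, n_params):
    """Tokens per dollar at ~6N FLOPs per token, with peak in FLOP/s."""
    return peak_flops * mfu / (6 * n_params) * 3600 / price_per_hour


def tokens_per_dollar_range(card, n_params):
    """(low, high) tokens per dollar across the card's MFU range."""
    mfu_low, mfu_high = card["mfu"]
    low = tokens_per_dollar(card["peak"], mfu_low, card["price"], n_params)
    high = tokens_per_dollar(card["peak"], mfu_high, card["price"], n_params)
    return low, high


for name, card in CARDS.items():
    low, high = tokens_per_dollar_range(card, n_params=1e9)
    print(f"{name}: {low / 1e9:.3f} to {high / 1e9:.3f} B tokens per dollar")
```

Overlapping bands mean the ranking depends on the MFU you actually achieve, which is the reason to measure it per platform.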
Does this account for multi-GPU scaling losses?
Not directly: it prices one GPU's tokens. For clusters, multiply tokens/hr by GPU count times a scaling efficiency (0.85–0.95 within a node, lower across nodes). The per-dollar ranking between hardware types usually survives; the absolute cost shifts.
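The cluster adjustment is a single multiplication. A short sketch follows, with hypothetical function names and example values, assuming every GPU in the cluster is billed at the same hourly rate.

```python
def cluster_tokens_per_hour(single_gpu_tokens_per_hour, gpu_count, scaling_efficiency):
    """Single-GPU rate times GPU count times scaling efficiency."""
    return single_gpu_tokens_per_hour * gpu_count * scaling_efficiency


def cluster_tokens_per_dollar(single_gpu_tokens_per_hour, gpu_count,
                              scaling_efficiency, price_per_gpu_hour):
    """Cluster throughput divided by the total hourly bill for all GPUs."""
    throughput = cluster_tokens_per_hour(
        single_gpu_tokens_per_hour, gpu_count, scaling_efficiency
    )
    return throughput / (gpu_count * price_per_gpu_hour)


# Example values only: 100M tokens/hr per GPU, 8 GPUs, 0.90 efficiency, $2.00/GPU-hr.
print(cluster_tokens_per_hour(100e6, 8, 0.90))
print(cluster_tokens_per_dollar(100e6, 8, 0.90, 2.00))
```

The GPU count cancels in the per-dollar figure, so the efficiency factor alone sets how far a cluster falls below the single-GPU number.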
How accurate is the Chinchilla cost output?
It is the compute floor: 20 tokens/param at your sustained rate, no restarts, no evals, no failed runs. Real projects land 1.3–2× higher (see our GPU-hours calculator's FAQ). It is best used comparatively — the same optimism applies to every GPU offer equally.
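To turn the floor into a planning range, apply the 1.3–2× overhead multipliers quoted above. A minimal sketch with a hypothetical helper name, using the roughly $90 floor from the page's 1B-model example:

```python
def realistic_cost_range(floor_cost, low_multiplier=1.3, high_multiplier=2.0):
    """Scale the compute-floor cost by the overhead range for real projects."""
    return floor_cost * low_multiplier, floor_cost * high_multiplier


low, high = realistic_cost_range(90.0)
print(f"Floor $90 -> plan for roughly ${low:.0f} to ${high:.0f}")
```

The multipliers apply equally to every offer, so they change the budget but not the ranking between GPUs.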
Related ML & AI tools
ROC-AUC Calculator (from TPR/FPR points)
Trapezoidal area under the ROC curve from your (FPR, TPR) operating points — the threshold-independent ranking score.
Classification Threshold Cost Calculator
Find the probability cutoff that minimizes expected cost given your false-positive and false-negative penalties.
Silhouette Score Calculator
Cluster cohesion vs separation for one point: the building block of the silhouette metric for choosing K.