Training Tokens-per-Dollar Calculator

Normalize GPU offers into the metric that decides training budgets: millions of tokens trained per dollar.

Inputs
Model parameters (B)
GPU price ($/hr)
GPU peak (TFLOPS)
Achievable MFU (%)

Outputs
Training throughput (M tok/hr)
Tokens per dollar (M)
Cost to Chinchilla-train this model ($)

This is how to compare a $2.50 H100 against a $0.80 A100 or a cheap spot 4090 honestly: per token trained, not per hour. With the default inputs, a 1B model Chinchilla-trains (20B tokens) for around $90 of compute.

Formula

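tokens/hr = peak × MFU ÷ (6·P) × 3600
tokens/$ = tokens/hr ÷ price
Chinchilla cost = 20·P tokens ÷ tokens/$

Here peak is the GPU's peak throughput in FLOP/s, MFU is a fraction, P is the parameter count, and price is in $/hr.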
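The three formulas can be sketched in a few lines of Python. This is a minimal illustration, not the page's own implementation: the function names are hypothetical, and the example inputs (1B parameters, 300 TFLOPS peak, 40% MFU, $2.00/hr) are made-up values rather than the calculator's defaults.

```python
def tokens_per_hour(params_b: float, peak_tflops: float, mfu_pct: float) -> float:
    """Training throughput in tokens/hr, using 6*P FLOPs per trained token."""
    achieved_flops_per_sec = peak_tflops * 1e12 * (mfu_pct / 100.0)
    flops_per_token = 6.0 * params_b * 1e9  # forward + backward pass
    return achieved_flops_per_sec / flops_per_token * 3600.0


def tokens_per_dollar(params_b: float, peak_tflops: float,
                      mfu_pct: float, price_per_hr: float) -> float:
    """Tokens trained per dollar of GPU rental."""
    return tokens_per_hour(params_b, peak_tflops, mfu_pct) / price_per_hr


def chinchilla_cost(params_b: float, peak_tflops: float,
                    mfu_pct: float, price_per_hr: float) -> float:
    """Dollar cost to train 20 tokens per parameter at the sustained rate."""
    target_tokens = 20.0 * params_b * 1e9
    return target_tokens / tokens_per_dollar(params_b, peak_tflops,
                                             mfu_pct, price_per_hr)


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
rate = tokens_per_hour(1.0, 300.0, 40.0)
per_dollar = tokens_per_dollar(1.0, 300.0, 40.0, 2.00)
cost = chinchilla_cost(1.0, 300.0, 40.0, 2.00)
print(f"{rate / 1e6:.1f} M tok/hr, {per_dollar / 1e6:.1f} M tok/$, ${cost:,.0f}")
```

With these inputs the sketch gives 72 M tok/hr, 36 M tok/$ and about $556. Because both the token target (20·P) and the FLOPs per token (6·P) grow with P, the Chinchilla cost grows with the square of the model size on fixed hardware.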

References: Hoffmann et al. (2022), Chinchilla; Chowdhery et al. (2022), PaLM (MFU)

Disclaimer: This tool is for general informational and estimation purposes only and is not professional financial, tax, accounting or legal advice. All figures are estimates; verify with a qualified professional before making decisions.

About Training Tokens-per-Dollar Calculator

GPU marketplaces quote dollars per hour; training budgets are spent in tokens. The conversion (peak FLOPS times achievable MFU, divided by the 6·P FLOPs each token costs under the 6ND training-cost rule) collapses every offer into one comparable number: tokens trained per dollar. A discounted A100 frequently beats a premium H100 on this metric for small models, and a spot-priced 4090 can beat both until the job's interconnect requirements rule it out. The third output makes the result concrete: the dollar cost to take your model size to its Chinchilla-optimal token count at that hardware's rate.

How to use Training Tokens-per-Dollar Calculator

1. Enter your values into the Training Tokens-per-Dollar Calculator; sensible, domain-typical defaults are pre-filled so you see a real result immediately.
2. The result recomputes live using the formula shown on the page; there is no button to press.
3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.

Why use Training Tokens-per-Dollar Calculator?

✓ Computes training tokens per dollar instantly in your browser: no sign-up, no upload, no server round-trip.
✓ 100% free and unlimited, with the exact formula shown: tokens/hr = peak × MFU ÷ (6·P) × 3600.
✓ Runs entirely client-side, so every value you enter stays private on your device.
✓ Live recompute as you type, with a worked example and authoritative references for trust.

Frequently asked questions

Why does the cheaper, slower GPU often win this metric?

Because training small models can't exploit big-GPU advantages: an H100 at 3× the A100's price needs to deliver 3× the achieved (not peak) throughput, but kernel overheads, batch limits and interconnect mean it often delivers 2–2.5× on small runs. Tokens/$ exposes that gap; $/hr hides it.
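The break-even argument above reduces to one division, sketched here in Python. The function name is hypothetical, and the price and speedup ratios are the ones quoted in this answer, not measurements.

```python
def relative_tokens_per_dollar(price_ratio: float, achieved_speedup: float) -> float:
    """Tokens/$ of the pricier GPU relative to the cheaper one.

    price_ratio      -- pricier GPU's $/hr divided by the cheaper GPU's $/hr
    achieved_speedup -- pricier GPU's achieved (not peak) throughput divided
                        by the cheaper GPU's achieved throughput
    A result below 1.0 means the cheaper GPU wins on tokens per dollar.
    """
    return achieved_speedup / price_ratio


# An H100 at 3x the A100's price, across a range of achieved speedups.
for speedup in (2.0, 2.5, 3.0):
    ratio = relative_tokens_per_dollar(3.0, speedup)
    print(f"speedup {speedup:.1f}x -> {ratio:.2f}x the A100's tokens/$")
```

At a 3× price, a 2–2.5× achieved speedup leaves the pricier card at roughly 0.67–0.83 of the cheaper card's tokens per dollar; it only breaks even once the achieved speedup reaches the full 3×.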

What MFU should I plug in for each card class?

Realistic single-node figures: H100 with a tuned stack, 38–45%; A100, 35–42%; consumer 4090s, 25–35% (no NVLink, and the smaller VRAM forces gradient accumulation). Multi-node Ethernet setups subtract 5–15 points. When comparing offers, use a realistic MFU for each platform rather than copying one number across all of them.

Does this account for multi-GPU scaling losses?

Not directly: it prices a single GPU's tokens. For clusters, multiply tokens/hr by the GPU count and by a scaling efficiency (0.85–0.95 within a node, lower across nodes). The per-dollar ranking between hardware types usually survives; the absolute cost shifts.
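A rough Python sketch of that cluster adjustment follows. The function names are hypothetical, the example numbers are illustrative, and the per-dollar version assumes every GPU in the cluster is billed at the same hourly price.

```python
def cluster_tokens_per_hour(single_gpu_tokens_per_hour: float,
                            gpu_count: int,
                            scaling_efficiency: float) -> float:
    """Scale one GPU's throughput to a cluster, discounted for scaling losses."""
    if not 0.0 < scaling_efficiency <= 1.0:
        raise ValueError("scaling_efficiency must be in (0, 1]")
    return single_gpu_tokens_per_hour * gpu_count * scaling_efficiency


def cluster_tokens_per_dollar(single_gpu_tokens_per_hour: float,
                              gpu_count: int,
                              scaling_efficiency: float,
                              price_per_gpu_hr: float) -> float:
    """Tokens per dollar for the whole cluster (assumes uniform per-GPU pricing)."""
    cluster_price_per_hr = gpu_count * price_per_gpu_hr
    throughput = cluster_tokens_per_hour(single_gpu_tokens_per_hour,
                                         gpu_count, scaling_efficiency)
    return throughput / cluster_price_per_hr


# Illustrative: 8 GPUs at 72 M tok/hr each, 0.90 efficiency, $2.00 per GPU-hour.
print(cluster_tokens_per_hour(72e6, 8, 0.90) / 1e6, "M tok/hr")
print(cluster_tokens_per_dollar(72e6, 8, 0.90, 2.00) / 1e6, "M tok/$")
```

Under the uniform-pricing assumption the GPU count cancels out, so cluster tokens per dollar is simply the single-GPU figure multiplied by the scaling efficiency.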

How accurate is the Chinchilla cost output?

It is the compute floor: 20 tokens/param at your sustained rate, no restarts, no evals, no failed runs. Real projects land 1.3–2× higher (see our GPU-hours calculator's FAQ). It is best used comparatively — the same optimism applies to every GPU offer equally.
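To turn the floor into a planning range, the 1.3–2× overhead quoted above can be applied directly. This is a small sketch with a hypothetical function name; the multipliers are the ones from this answer, not an independent estimate.

```python
def realistic_cost_range(compute_floor_usd: float,
                         low_factor: float = 1.3,
                         high_factor: float = 2.0) -> tuple[float, float]:
    """Bracket a real project's cost from the calculator's compute floor.

    The default factors are the 1.3-2x overhead range for restarts,
    evals and failed runs; adjust them to your own experience.
    """
    return compute_floor_usd * low_factor, compute_floor_usd * high_factor


# Illustrative floor of $100 from the calculator's third output.
low, high = realistic_cost_range(100.0)
print(f"plan for roughly ${low:,.0f} to ${high:,.0f}")
```

Since the same multipliers apply to every offer, the ranking between GPUs is unchanged; only the budget line moves.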
