LLM Training GPU-Hours & Cost Calculator
Turn a training budget (params × tokens) into GPU-hours, wall-clock days and rental dollars at your MFU.
Defaults: a 7B model on 2T tokens at 40% MFU on 64 H100s ≈ 38 days (about 59,000 GPU-hours, taking the H100's dense BF16 peak as roughly 989 TFLOPS). MFU of 35–45% is realistic at scale; small clusters with slow interconnect often sit nearer 25%.
Formula
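A sketch of the arithmetic the calculator describes, written out as equations. The symbols ($N$ parameters, $D$ training tokens, $P$ peak FLOPS per GPU, $G$ GPU count, $r$ rental rate per GPU-hour) are notation chosen here for illustration and may differ from the page's own labels:

```latex
\begin{align*}
C &= 6\,N\,D && \text{(total training FLOPs)}\\
\text{GPU-hours} &= \frac{C}{P \cdot \mathrm{MFU} \cdot 3600}\\
\text{wall-clock days} &= \frac{\text{GPU-hours}}{24\,G}\\
\text{cost} &= \text{GPU-hours} \cdot r
\end{align*}
```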
Disclaimer: This tool is for general informational and estimation purposes only and is not professional financial, tax, accounting or legal advice. All figures are estimates; verify with a qualified professional before making decisions.
About LLM Training GPU-Hours & Cost Calculator
Before any training run, three numbers decide feasibility: GPU-hours, calendar time and dollars. This calculator chains the standard 6ND compute rule (training FLOPs ≈ 6 × parameters N × training tokens D) with your hardware's peak throughput and a realistic Model FLOPs Utilization (MFU) to produce all three. MFU, the fraction of theoretical peak FLOPS your training loop actually sustains, is the efficiency metric introduced by the PaLM paper; 40% is good at scale, 50%+ is exceptional. Compare GPUs, cluster sizes and token budgets to find where your budget actually breaks.
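The chain described above can be sketched in a few lines of Python. This is a minimal illustration, not the page's actual implementation: the function name, parameter names and the rental rate are hypothetical, and the 989 TFLOPS figure used in the usage note is an assumed dense BF16 peak for an H100 SXM.

```python
def training_estimate(params, tokens, peak_tflops, mfu, n_gpus, usd_per_gpu_hour):
    """Return (gpu_hours, wall_clock_days, cost_usd) from the 6ND rule.

    All argument names are illustrative; peak_tflops is the per-GPU
    theoretical peak for the precision you train in.
    """
    total_flops = 6.0 * params * tokens        # 6ND training compute
    sustained = peak_tflops * 1e12 * mfu       # FLOP/s one GPU actually delivers
    gpu_hours = total_flops / sustained / 3600.0
    days = gpu_hours / n_gpus / 24.0           # wall-clock time on the whole cluster
    cost = gpu_hours * usd_per_gpu_hour        # rental cost scales with GPU-hours
    return gpu_hours, days, cost
```

For example, `training_estimate(7e9, 1e9, 989, 0.40, 1, 2.0)` (a 7B model on 1B tokens, one GPU, a hypothetical $2 per GPU-hour) gives roughly 29.5 GPU-hours. Note that cost depends only on GPU-hours, so adding GPUs shortens the calendar time without changing the bill, as long as MFU holds.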
How to use LLM Training GPU-Hours & Cost Calculator
1. Enter your values into the LLM Training GPU-Hours & Cost Calculator; sensible, domain-typical defaults are pre-filled so you see a real result immediately.
2. The result recomputes live using the formula shown on the page; there is no button to press.
3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.
Why use LLM Training GPU-Hours & Cost Calculator?
- ✓ Computes LLM training GPU-hours and cost instantly in your browser: no sign-up, no upload, no server round-trip.
- ✓ 100% free and unlimited, with the exact formula shown: GPU-hours = 6 × N × D ÷ (peak FLOPS per GPU × MFU × 3600).
- ✓ Runs entirely client-side, so every value you enter stays private on your device.
- ✓ Live recompute as you type, with a worked example and authoritative references for trust.
Frequently asked questions
What MFU should I assume?
Well-tuned large-scale runs on H100s with FlashAttention and fused kernels report 38–45%. Small clusters bottlenecked on PCIe or Ethernet, or runs with heavy activation checkpointing, can fall to 20–30%. Measure a short run before extrapolating a long one.
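One way to turn a short measured run into an MFU figure is to invert the 6ND rule: multiply the observed token throughput by 6N and divide by the cluster's peak. The sketch below assumes that simplification (it ignores attention FLOPs, which fuller MFU definitions include), and the function and parameter names are hypothetical.

```python
def measured_mfu(params, tokens_per_second, n_gpus, peak_tflops):
    """Estimate MFU from observed throughput under the 6ND approximation."""
    achieved = 6.0 * params * tokens_per_second   # model FLOP/s actually sustained
    peak = n_gpus * peak_tflops * 1e12            # theoretical cluster FLOP/s
    return achieved / peak
```

As an illustration, a 7B model processing 600,000 tokens per second across 64 GPUs with an assumed 989 TFLOPS peak each comes out at about 0.40. Take the throughput from steady-state steps, after warm-up and compilation, so the figure reflects the long run you plan to extrapolate to.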
Why do real runs cost more than this estimate?
Restarts and node failures, evaluation passes, ablations, data pipeline stalls and the runs you throw away. Teams commonly budget 1.3–2× the napkin number. The calculator gives the irreducible floor — the 6ND physics — not the project cost.
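The 1.3–2× rule of thumb above can be applied as a simple range on top of the calculator's floor. A minimal sketch, with a hypothetical function name and the two multipliers taken directly from the text:

```python
def project_budget(floor_cost_usd, low=1.3, high=2.0):
    """Return a (low, high) project budget range from the 6ND floor cost."""
    return floor_cost_usd * low, floor_cost_usd * high
```

For a floor estimate of $100,000 this gives a planning range of roughly $130,000 to $200,000. Where a given project lands in that range depends on how many ablations and discarded runs it expects, which this sketch does not model.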
Does this apply to fine-tuning too?
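Yes. Fine-tuning is the same 6ND arithmetic with a much smaller D. A 7B full fine-tune on 1B tokens is ~42e18 FLOPs ≈ 30 H100-hours at 40% MFU. LoRA mainly reduces optimizer and gradient memory: the forward pass and the activation-gradient backward pass through the base model remain, and only the weight-gradient matmuls for the frozen base weights (roughly a third of the 6ND total) are skipped.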
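The full fine-tune figure can be checked by hand. The working below assumes a dense BF16 peak of about 989 TFLOPS per H100, a value not stated in the text:

```latex
\begin{align*}
C &= 6 \times (7 \times 10^{9}) \times (1 \times 10^{9}) = 4.2 \times 10^{19}\ \text{FLOPs}\\
\text{sustained rate} &= 0.40 \times 989 \times 10^{12} \approx 3.96 \times 10^{14}\ \text{FLOP/s}\\
\text{GPU-hours} &= \frac{4.2 \times 10^{19}}{3.96 \times 10^{14} \times 3600} \approx 29.5
\end{align*}
```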
FP8 training: does it halve the time?
On H100-class hardware FP8 roughly doubles peak matmul FLOPS, but end-to-end speedups are typically 1.2–1.5× because non-matmul work and communication do not accelerate. Model it here by raising effective TFLOPS or MFU accordingly.
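One way to see why a 2× matmul speedup yields only 1.2–1.5× end to end is Amdahl's law, which is brought in here as a modeling assumption rather than something the page states. The function and parameter names are hypothetical; `matmul_fraction` is the share of step time spent in matmuls that FP8 accelerates.

```python
def end_to_end_speedup(matmul_fraction, matmul_speedup=2.0):
    """Amdahl's-law estimate of whole-step speedup when only matmuls get faster."""
    unaccelerated = 1.0 - matmul_fraction            # communication, non-matmul work
    accelerated = matmul_fraction / matmul_speedup   # matmul time after the speedup
    return 1.0 / (unaccelerated + accelerated)
```

Under this model, a matmul share of one third gives 1.2× and a share of two thirds gives 1.5×, so the quoted range corresponds to matmuls occupying roughly a third to two thirds of step time. Multiplying the baseline effective TFLOPS by the result is one way to feed it into the calculator.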
Related ML & AI tools
ROC-AUC Calculator (from TPR/FPR points)
Trapezoidal area under the ROC curve from your (FPR, TPR) operating points — the threshold-independent ranking score.
Classification Threshold Cost Calculator
Find the probability cutoff that minimizes expected cost given your false-positive and false-negative penalties.
Silhouette Score Calculator
Cluster cohesion vs separation for one point — the building block of the silhouette metric for choosing K.