MFU (Model FLOPs Utilization) Calculator
Measure how efficiently your training run uses its GPUs — observed tokens/s vs theoretical peak.
MFU counts only the model's useful 6ND FLOPs, where N is the parameter count and D is the number of tokens processed. Recomputation from gradient checkpointing deliberately does NOT count, which is why MFU is stricter and more honest than hardware FLOPs utilization (HFU).
Formula
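A minimal sketch of the calculation described above, assuming the standard definition: achieved model FLOPs per second (6 × parameters × observed tokens per second) divided by the cluster's theoretical peak. The function name, argument names and example inputs are illustrative assumptions, not the calculator's actual code. The 989e12 peak is the commonly quoted dense BF16 figure for an H100 SXM; substitute your own hardware's spec.

```python
def mfu(params: float, tokens_per_sec: float,
        num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Return MFU as a fraction, using 6ND accounting per token."""
    if num_gpus <= 0 or peak_flops_per_gpu <= 0:
        raise ValueError("num_gpus and peak_flops_per_gpu must be positive")
    # Useful model FLOPs per second: 6 FLOPs per parameter per token.
    achieved_flops_per_sec = 6.0 * params * tokens_per_sec
    # Theoretical peak of the whole cluster.
    peak_flops_per_sec = num_gpus * peak_flops_per_gpu
    return achieved_flops_per_sec / peak_flops_per_sec


# Hypothetical run: 7e9-parameter dense model on 64 GPUs at 600,000 tokens/s.
example = mfu(params=7e9, tokens_per_sec=600_000,
              num_gpus=64, peak_flops_per_gpu=989e12)
print(f"MFU = {example:.1%}")
```

Tokens per second must be the cluster-wide figure measured over wall-clock time, so that stalls and restarts show up in the result.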
About MFU (Model FLOPs Utilization) Calculator
MFU answers the only efficiency question that matters in training: of the FLOPS you are paying for, how many go into the model? Feed it your observed cluster-wide tokens per second and it returns a utilization percentage you can compare against published figures: PaLM reported 46%, well-tuned Llama-style runs land at 38–45%, and a naive PyTorch loop on Ethernet-linked nodes can languish under 20%. Because the numerator uses the canonical 6ND accounting, time spent on recomputation and failed steps earns no credit and lowers your MFU, exactly as it should when each H100-hour costs real money.
How to use MFU (Model FLOPs Utilization) Calculator
1. Enter your values into MFU (Model FLOPs Utilization) Calculator. Sensible, domain-typical defaults are pre-filled so you see a real result immediately.
2. The result recomputes live using the formula shown on the page; there is no button to press.
3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.
Why use MFU (Model FLOPs Utilization) Calculator?
- ✓ Computes MFU (Model FLOPs Utilization) instantly in your browser, with no sign-up, no upload and no server round-trip.
- ✓ 100% free and unlimited, with the exact formula shown on the page.
- ✓ Runs entirely client-side, so every value you enter stays private on your device.
- ✓ Live recompute as you type, with a worked example and authoritative references for trust.
Frequently asked questions
What is a good MFU in 2026?
On H100/H200 clusters with FlashAttention-3, tensor-parallel overlap and fused optimizers: 40–50% for dense models at scale. MoE models run lower (25–35%) due to routing imbalance and all-to-all communication. Below 25% on a dense model usually indicates an addressable bottleneck.
MFU vs HFU: what is the difference?
HFU (hardware FLOPs utilization) counts every executed FLOP, including activation recomputation. MFU counts only the ideal 6ND. With full checkpointing HFU ≈ 1.3–1.4× MFU; quoting HFU makes a run look better without making it cheaper.
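The 1.3–1.4× figure follows from simple accounting. A sketch, assuming the usual split of 6ND into 2ND for the forward pass and 4ND for the backward pass, and assuming full checkpointing re-runs exactly one extra forward pass (real frameworks may recompute somewhat more or less). The function and parameter names are illustrative.

```python
def hfu_over_mfu(recompute_fraction: float = 1.0) -> float:
    """Ratio of executed FLOPs (HFU numerator) to 6ND FLOPs (MFU numerator).

    recompute_fraction is the share of the forward pass that is re-executed
    during the backward pass: 1.0 for full checkpointing, 0.0 for none.
    """
    if not 0.0 <= recompute_fraction <= 1.0:
        raise ValueError("recompute_fraction must be between 0 and 1")
    forward, backward = 2.0, 4.0          # in units of N*D
    executed = forward + backward + recompute_fraction * forward
    return executed / (forward + backward)


print(hfu_over_mfu(1.0))   # full checkpointing: 8/6
print(hfu_over_mfu(0.0))   # no recomputation: HFU equals MFU
```

Selective checkpointing corresponds to a value between 0 and 1, which is why reported ratios vary between runs.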
Why does my MFU drop when I add nodes?
Communication: gradient all-reduce and tensor-parallel all-gathers grow with cluster size while compute per GPU stays fixed. Once comms can no longer hide behind compute, MFU decays. Better overlap, larger micro-batches or a faster interconnect restore it.
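As a toy model rather than a measurement: if the compute time per step is fixed and part of the communication time cannot be overlapped with it, MFU shrinks by the ratio of compute time to total step time. All names and numbers below are hypothetical.

```python
def mfu_with_exposed_comm(base_mfu: float, compute_s: float,
                          comm_s: float, overlap_fraction: float) -> float:
    """Scale a compute-only MFU by the step-time inflation from exposed comms.

    overlap_fraction is the share of communication hidden behind compute
    (1.0 means fully hidden, 0.0 means fully exposed).
    """
    if not 0.0 <= overlap_fraction <= 1.0:
        raise ValueError("overlap_fraction must be between 0 and 1")
    exposed_s = comm_s * (1.0 - overlap_fraction)
    return base_mfu * compute_s / (compute_s + exposed_s)


# Hypothetical step: 1.0 s of compute, 0.4 s of communication.
for overlap in (1.0, 0.5, 0.0):
    result = mfu_with_exposed_comm(0.45, 1.0, 0.4, overlap)
    print(f"overlap={overlap:.0%}: MFU = {result:.1%}")
```

Adding nodes typically raises the communication time, so the same overlap fraction leaves more of it exposed.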
Does sequence length affect MFU?
Yes, in both directions: longer sequences make attention's quadratic FLOPs (excluded from 6ND) a larger share of work — depressing measured MFU — while improving matmul shapes and amortizing fixed costs. Compare runs at matching sequence lengths only.
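A rough sketch of the first effect, using the per-token attention term from the PaLM paper's accounting (12 × layers × heads × head dimension × sequence length) alongside the 6N term. The model shape is a hypothetical 7B-class configuration, and the formula ignores any savings from causal masking.

```python
def attention_flop_share(params: float, n_layers: int, n_heads: int,
                         head_dim: int, seq_len: int) -> float:
    """Share of per-token training FLOPs spent in attention score/value matmuls."""
    dense_flops = 6.0 * params                                   # counted by MFU
    attn_flops = 12.0 * n_layers * n_heads * head_dim * seq_len  # not counted
    return attn_flops / (dense_flops + attn_flops)


# Hypothetical 7e9-parameter model: 32 layers, 32 heads, head dimension 128.
for seq_len in (2048, 8192, 32768):
    share = attention_flop_share(7e9, 32, 32, 128, seq_len)
    print(f"seq_len={seq_len:>6}: attention share = {share:.1%}")
```

Because that share is work the GPUs perform but 6ND never credits, a long-context run can report lower MFU than a short-context one on identical hardware.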