
MFU (Model FLOPs Utilization) Calculator

Measure how efficiently your training run uses its GPUs — observed tokens/s vs theoretical peak.

Outputs: MFU (%) and achieved throughput (TFLOPS/GPU)

MFU counts only the model's useful 6ND FLOPs, where N is the parameter count and D is the number of tokens processed. Recomputation from gradient checkpointing deliberately does not count, which is why MFU is stricter and more honest than hardware FLOPs utilization (HFU).

Formula

MFU = 6·N·tokens_per_sec ÷ (n_GPUs × peak_FLOPS)
References: Chowdhery et al. (2022), PaLM: Scaling Language Modeling with Pathways (App. B)
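As a rough illustration, the formula can be sketched in Python as below. The function names and the example run are hypothetical, and the 989 TFLOPS peak is an assumed dense BF16 figure; substitute the datasheet value for your own GPU and precision.

```python
def achieved_flops_per_gpu(n_params, tokens_per_sec, n_gpus):
    """Useful model FLOP/s per GPU, using the 6 * N FLOPs-per-token rule."""
    return 6.0 * n_params * tokens_per_sec / n_gpus


def mfu(n_params, tokens_per_sec, n_gpus, peak_flops_per_gpu):
    """MFU as a fraction: achieved per-GPU FLOP/s divided by per-GPU peak."""
    return achieved_flops_per_gpu(n_params, tokens_per_sec, n_gpus) / peak_flops_per_gpu


# Hypothetical run: every value below is an illustrative assumption.
N_PARAMS = 7e9           # model parameters (N)
TOKENS_PER_SEC = 1.2e6   # observed cluster-wide throughput
N_GPUS = 128
PEAK_FLOPS = 989e12      # assumed dense BF16 peak per GPU, in FLOP/s

achieved = achieved_flops_per_gpu(N_PARAMS, TOKENS_PER_SEC, N_GPUS)
utilization = mfu(N_PARAMS, TOKENS_PER_SEC, N_GPUS, PEAK_FLOPS)
print(f"Achieved: {achieved / 1e12:.1f} TFLOPS/GPU")
print(f"MFU: {utilization * 100:.1f}%")
```

With these assumed inputs the run delivers roughly 394 TFLOPS per GPU, or just under 40% MFU. Note that tokens_per_sec is the cluster-wide figure, so it is divided by the GPU count before comparing against a per-GPU peak.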

About MFU (Model FLOPs Utilization) Calculator

MFU answers the only efficiency question that matters in training: of the FLOPS you are paying for, how many go into the model? Feed it your observed cluster-wide tokens per second and it returns the utilization percentage benchmarked across the industry — PaLM reported 46%, well-tuned Llama-style runs land 38–45%, and a naive PyTorch loop on Ethernet-linked nodes can languish under 20%. Because the numerator uses the canonical 6ND accounting, recomputation and failed steps count against you — exactly as they should when each H100-hour costs real money.

How to use MFU (Model FLOPs Utilization) Calculator

  1. Enter your values into the MFU (Model FLOPs Utilization) Calculator. Sensible, domain-typical defaults are pre-filled so you see a real result immediately.
  2. The result recomputes live using the formula shown on the page; there is no button to press.
  3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.

Why use MFU (Model FLOPs Utilization) Calculator?

  • Computes MFU (Model FLOPs Utilization) instantly in your browser — no sign-up, no upload, no server round-trip.
  • 100% free and unlimited, with the exact formula shown: MFU = 6·N·tokens_per_sec ÷ (n_GPUs × peak_FLOPS).
  • Runs entirely client-side, so every value you enter stays private on your device.
  • Live recompute as you type, with a worked example and authoritative references for trust.

Frequently asked questions

What is a good MFU in 2026?

On H100/H200 clusters with FlashAttention-3, tensor-parallel overlap and fused optimizers: 40–50% for dense models at scale. MoE models run lower (25–35%) due to routing imbalance and all-to-all communication. An MFU below 25% on a dense model usually points to an addressable bottleneck.

MFU vs HFU: what is the difference?

HFU (hardware FLOPs utilization) counts every executed FLOP, including activation recomputation. MFU counts only the ideal 6ND. With full checkpointing HFU ≈ 1.3–1.4× MFU; quoting HFU makes a run look better without making it cheaper.
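The ratio can be sanity-checked with a small sketch. It assumes the usual split of the 6ND budget into 2ND for the forward pass and 4ND for the backward pass, and that checkpointing re-runs some fraction of the forward pass; the function name and the recompute_fraction parameter are hypothetical.

```python
def hfu_over_mfu(recompute_fraction):
    """Ratio of executed FLOPs to model FLOPs, in units of N * D.

    Assumes forward = 2, backward = 4, and that checkpointing re-runs
    `recompute_fraction` of the forward pass (1.0 = full recomputation).
    """
    if not 0.0 <= recompute_fraction <= 1.0:
        raise ValueError("recompute_fraction must be between 0 and 1")
    model_flops = 2.0 + 4.0                      # what MFU counts
    recompute_flops = 2.0 * recompute_fraction   # extra work only HFU counts
    return (model_flops + recompute_flops) / model_flops


for fraction in (0.0, 0.5, 1.0):
    print(f"recompute {fraction:.0%}: HFU = {hfu_over_mfu(fraction):.2f} x MFU")
```

Under these assumptions full recomputation gives 8ND executed against 6ND counted, a ratio of about 1.33, which sits inside the 1.3–1.4× range quoted above.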

Why does my MFU drop when I add nodes?

Communication: gradient all-reduce and tensor-parallel all-gathers grow with cluster size while compute per GPU stays fixed. Past the point where comms can no longer hide behind compute, MFU decays — better overlap, larger micro-batches or faster interconnect restore it.
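A toy model makes the "hide behind compute" point concrete. This is a deliberately simplified sketch, not a performance model of any real framework: it assumes a fixed compute time per step, a communication time you supply, and a hypothetical overlap_fraction saying how much of the compute window can mask communication.

```python
def mfu_with_exposed_comm(base_mfu, compute_s, comm_s, overlap_fraction):
    """Scale a compute-only MFU by the share of step time spent computing.

    Communication up to overlap_fraction * compute_s is treated as hidden;
    anything beyond that is exposed and lengthens the step.
    """
    hidden = min(comm_s, overlap_fraction * compute_s)
    exposed = comm_s - hidden
    return base_mfu * compute_s / (compute_s + exposed)


# Illustrative assumptions: 45% MFU with no exposed comms, 1.0 s of compute
# per step, and half of that window available for overlap.
for comm_s in (0.2, 0.5, 0.8, 1.2):
    value = mfu_with_exposed_comm(0.45, 1.0, comm_s, 0.5)
    print(f"comm {comm_s:.1f} s -> MFU {value * 100:.1f}%")
```

In this sketch MFU stays flat while communication fits inside the overlap window and falls once it spills over, which is the pattern described above. Raising overlap_fraction (better overlap) or compute_s relative to comm_s (larger micro-batches, faster interconnect) pushes the knee further out.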

Does sequence length affect MFU?

Yes, in both directions: longer sequences make attention's quadratic FLOPs (excluded from 6ND) a larger share of work — depressing measured MFU — while improving matmul shapes and amortizing fixed costs. Compare runs at matching sequence lengths only.
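To get a feel for how large the excluded attention term becomes, the sketch below estimates its share of per-token training FLOPs. It assumes a per-token attention cost of 12·L·d_model·T (layers L, hidden size d_model, sequence length T), the form used alongside 6N in the PaLM appendix cited above; the 7B-style model shape is a hypothetical example.

```python
def attention_flop_share(n_params, n_layers, d_model, seq_len):
    """Fraction of per-token training FLOPs spent in attention score/value matmuls.

    Assumes dense cost 6 * N per token and attention cost
    12 * n_layers * d_model * seq_len per token.
    """
    dense = 6.0 * n_params
    attention = 12.0 * n_layers * d_model * seq_len
    return attention / (dense + attention)


# Hypothetical 7B-style shape: 32 layers, hidden size 4096.
for seq_len in (2048, 8192, 32768):
    share = attention_flop_share(7e9, 32, 4096, seq_len)
    print(f"seq_len {seq_len}: attention is {share * 100:.0f}% of FLOPs")
```

Under these assumptions attention is a single-digit percentage of the work at 2k tokens but more than half at 32k, so a 6ND-only MFU increasingly understates how busy the hardware is as sequences grow.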
