Matrix Multiplication FLOPs Calculator

Shape check + exact FLOPs/memory of (M×K)·(K×N) — and whether the matmul is compute- or bandwidth-bound on your GPU.

Inputs: M (rows of A), K (inner / shared dim), N (cols of B), Precision

Outputs: FLOPs (G), Memory moved (A+B+C) (MB), Arithmetic intensity (FLOP/byte)

Set M=1 to see decode-time reality: a (1×4096)·(4096×4096) GEMV has an intensity of about 1 FLOP/byte at BF16 (about 2 at 8-bit precision), which is hopelessly bandwidth-bound and the entire reason batching exists.

Formula

FLOPs = 2·M·K·N
bytes = (M·K + K·N + M·N)·sizeof
intensity = FLOPs / bytes
Compare the intensity to your GPU's FLOPS/bandwidth ratio.
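The formula can be sketched in a few lines of Python. The function name `matmul_cost` and the `BYTES_PER_ELEMENT` table are illustrative assumptions, not the calculator's actual source. The byte count is the idealized minimum, in which A and B are each read once and C is written once.

```python
# Illustrative sketch; names are hypothetical, not the page's own code.
BYTES_PER_ELEMENT = {"fp32": 4, "bf16": 2, "fp16": 2, "fp8": 1, "int8": 1}

def matmul_cost(m, k, n, precision="bf16"):
    """Return (flops, bytes_moved, intensity) for an (m x k) @ (k x n) product."""
    size = BYTES_PER_ELEMENT[precision]
    flops = 2 * m * k * n                         # one multiply + one add per MAC
    bytes_moved = (m * k + k * n + m * n) * size  # read A, read B, write C once
    return flops, bytes_moved, flops / bytes_moved

flops, nbytes, intensity = matmul_cost(4096, 4096, 4096)
print(f"{flops / 1e9:.1f} GFLOPs, {nbytes / 1e6:.1f} MB, {intensity:.0f} FLOP/byte")
# prints "137.4 GFLOPs, 100.7 MB, 1365 FLOP/byte"
```

Real kernels usually move more data than this minimum because tiles of A and B are re-read from cache or HBM, so the computed intensity is best read as an upper bound.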

References: Williams et al. (2009), Roofline: An Insightful Visual Performance Model; NVIDIA, GPU Performance Background (matmul efficiency guide)


About Matrix Multiplication FLOPs Calculator

2·M·K·N: three numbers and a doubling, yet this formula plus the roofline model explains most of modern AI systems engineering. This calculator gives the FLOPs of any matrix product, the minimum memory traffic, and their ratio, the arithmetic intensity. Compare that intensity against your GPU's FLOPS-to-bandwidth ratio (H100: ~295 FLOP/byte at BF16) and you know immediately whether the operation can saturate the tensor cores or will idle waiting on HBM. The M=1 case is the punchline: single-token LLM decode is a GEMV with an intensity of about 1 FLOP/byte at BF16, which is why batching, speculative decoding and weight quantization dominate inference engineering.
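As a rough sketch of that comparison, the snippet below computes a ridge point from peak throughput and memory bandwidth, then classifies a matmul against it. The H100 figures (about 989 TFLOP/s dense BF16 and 3.35 TB/s of HBM bandwidth) are assumed spec-sheet values for the SXM part, so substitute your own GPU's numbers. All function names are hypothetical.

```python
def ridge_point(peak_flops, bandwidth_bytes):
    """FLOP/byte at which the compute roof and the memory roof meet."""
    return peak_flops / bandwidth_bytes

def attainable_flops(intensity, peak_flops, bandwidth_bytes):
    """Roofline bound: the lower of the compute roof and bandwidth * intensity."""
    return min(peak_flops, bandwidth_bytes * intensity)

def intensity_of(m, k, n, sizeof=2):
    """Idealized arithmetic intensity of an (m x k) @ (k x n) matmul."""
    return 2 * m * k * n / ((m * k + k * n + m * n) * sizeof)

# Assumed spec-sheet values, not measurements.
H100_PEAK = 989e12   # FLOP/s, dense BF16
H100_BW = 3.35e12    # bytes/s

ridge = ridge_point(H100_PEAK, H100_BW)
for m in (1, 4096):
    ai = intensity_of(m, 4096, 4096)
    bound = "compute" if ai >= ridge else "bandwidth"
    frac = attainable_flops(ai, H100_PEAK, H100_BW) / H100_PEAK
    print(f"M={m}: {ai:.1f} FLOP/byte, {bound}-bound, at most {frac:.1%} of peak")
```

The roofline bound is optimistic: it assumes perfect overlap of compute and memory traffic, so measured throughput will sit below it.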

How to use Matrix Multiplication FLOPs Calculator

1. Enter your values into Matrix Multiplication FLOPs Calculator. Sensible, domain-typical defaults are pre-filled so you see a real result immediately.
2. The result recomputes live using the formula shown on the page; there is no button to press.
3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.

Why use Matrix Multiplication FLOPs Calculator?

✓ Computes matrix multiplication FLOPs instantly in your browser: no sign-up, no upload, no server round-trip.
✓ 100% free and unlimited, with the exact formula shown: FLOPs = 2·M·K·N.
✓ Runs entirely client-side, so every value you enter stays private on your device.
✓ Live recompute as you type, with a worked example and authoritative references for trust.

Frequently asked questions

Why 2·M·K·N and not M·K·N?

Each output element is a length-K dot product: K multiplies and K−1 adds ≈ 2K operations, times M·N outputs. Hardware specs and papers consistently use this 2× convention, while 'MACs' counts multiply-accumulates without it — know which one a number means.
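To make the count concrete, here is a minimal sketch of a naive triple-loop matmul that tallies scalar multiplies and adds as it goes. The function name `matmul_with_counts` is hypothetical, and the loop is written for clarity, not speed.

```python
def matmul_with_counts(a, b):
    """Naive matmul on nested lists that also counts scalar multiplies and adds."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    muls = adds = 0
    for i in range(m):
        for j in range(n):
            # One output element is a length-k dot product.
            acc = a[i][0] * b[0][j]
            muls += 1
            for p in range(1, k):
                acc += a[i][p] * b[p][j]
                muls += 1
                adds += 1
            c[i][j] = acc
    return c, muls, adds

m, k, n = 3, 4, 5
a = [[1.0] * k for _ in range(m)]
b = [[1.0] * n for _ in range(k)]
c, muls, adds = matmul_with_counts(a, b)
print(muls, adds, muls + adds, 2 * m * k * n)  # prints "60 45 105 120"
```

The exact total is M·N·(2K−1) = 105, while the 2·M·K·N convention gives 120 and the MAC count is M·K·N = 60. Hardware that accumulates with fused multiply-adds performs K of them per output, which is where the round 2K figure comes from.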

What makes a matmul compute-bound?

Arithmetic intensity above the hardware's FLOPS/bandwidth ratio. Square 4096³ matmuls at BF16 reach ~1365 FLOP/byte — comfortably compute-bound everywhere. Skinny matrices (small M or N) crater intensity, which is why kernel libraries fuse and batch them.

Why is LLM decoding bandwidth-bound?

Generating one token multiplies a 1×H activation by every weight matrix: M=1 GEMVs with intensity ≈ 1 FLOP/byte at BF16 (2 FLOPs per 2-byte weight). The GPU must stream all weights from HBM while using well under 1% of its compute capability (roughly 1/295 ≈ 0.3% on an H100). Batching B requests raises M to B, recovering intensity almost linearly while B stays small relative to H.
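The effect of batching can be sketched by sweeping the batch size for a single H×H weight matrix. This is an idealized estimate under stated assumptions: BF16 (2 bytes per element), one 4096×4096 layer, and no KV-cache or attention traffic. The function names and the ridge value of 295 FLOP/byte (the H100 figure quoted on this page) are illustrative.

```python
def gemm_intensity(batch, hidden_in, hidden_out, sizeof=2):
    """Idealized FLOP/byte for a (batch x hidden_in) @ (hidden_in x hidden_out) matmul."""
    flops = 2 * batch * hidden_in * hidden_out
    nbytes = (batch * hidden_in + hidden_in * hidden_out + batch * hidden_out) * sizeof
    return flops / nbytes

def min_batch_for_compute_bound(ridge, hidden_in, hidden_out, max_batch=1 << 20):
    """Smallest batch whose intensity reaches the ridge point, or None if never."""
    for batch in range(1, max_batch + 1):
        if gemm_intensity(batch, hidden_in, hidden_out) >= ridge:
            return batch
    return None

for batch in (1, 8, 64, 512):
    print(batch, round(gemm_intensity(batch, 4096, 4096), 1))

print(min_batch_for_compute_bound(295, 4096, 4096))  # prints 345
```

Intensity grows nearly in step with the batch at first and flattens as the activation traffic becomes comparable to the weight traffic, so under these assumptions a batch in the low hundreds is needed before this layer stops being bandwidth-bound.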

Does this cover attention and convolutions too?

Yes — both lower to matmuls. Attention is two batched matmuls (see our attention FLOPs tool); a conv is an implicit GEMM of the im2col matrix. The same intensity analysis explains why depthwise convs and small attention heads underutilize GPUs.
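For convolutions, the lowering can be sketched as a mapping from conv parameters to a GEMM shape. The snippet assumes stride 1 with "same" padding (so input and output have equal spatial size), counts bytes from the actual input, weight and output tensors instead of a materialized im2col matrix, and uses hypothetical function names throughout.

```python
def conv2d_gemm_shape(batch, c_in, c_out, h_out, w_out, kh, kw, groups=1):
    """Per-group GEMM shape (M, K, N) for a conv lowered via im2col."""
    m = batch * h_out * w_out        # one row per output pixel
    k = (c_in // groups) * kh * kw   # one column per input value in the window
    n = c_out // groups              # one column per output channel
    return m, k, n

def conv2d_intensity(batch, c_in, c_out, h, w, kh, kw, groups=1, sizeof=2):
    """Idealized FLOP/byte, assuming stride 1 and 'same' padding."""
    m, k, n = conv2d_gemm_shape(batch, c_in, c_out, h, w, kh, kw, groups)
    flops = 2 * m * k * n * groups
    elements = (batch * c_in * h * w                   # input activations
                + c_out * (c_in // groups) * kh * kw   # weights
                + batch * c_out * h * w)               # output activations
    return flops / (elements * sizeof)

dense = conv2d_intensity(1, 256, 256, 56, 56, 3, 3)
depthwise = conv2d_intensity(1, 256, 256, 56, 56, 3, 3, groups=256)
print(conv2d_gemm_shape(1, 256, 256, 56, 56, 3, 3))  # prints "(3136, 2304, 256)"
print(f"dense: {dense:.0f} FLOP/byte, depthwise: {depthwise:.1f} FLOP/byte")
```

With these example sizes the dense 3×3 conv lands in the hundreds of FLOP/byte while the depthwise variant stays in the single digits, because each depthwise group collapses to K=9 and N=1.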
