Attention FLOPs Calculator
Quadratic cost of self-attention: score and value FLOPs for a full sequence, vs the linear projection cost.
The crossover s = 2H (8192 tokens at H=4096) is where the quadratic term overtakes the projections — exactly why long-context models invest in attention-efficiency tricks.
Formula
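The formula can be written out in a form consistent with the figures quoted on this page (4·L·s²·H for score plus value, crossover at s = 2H). This is a sketch: it assumes L layers, sequence length s, hidden size H, and 2 FLOPs per multiply-add. The projection term is an assumption inferred from the s = 2H crossover rather than a figure stated on the page.

```latex
\begin{aligned}
F_{\text{scores}} &= 2\,L\,s^{2}H && (QK^{\top}) \\
F_{\text{values}} &= 2\,L\,s^{2}H && (\text{attention weights} \times V) \\
F_{\text{attn}}   &= F_{\text{scores}} + F_{\text{values}} = 4\,L\,s^{2}H \\
F_{\text{proj}}   &= 8\,L\,s\,H^{2} && (\text{Q, K, V and output projections, assumed}) \\
F_{\text{attn}} = F_{\text{proj}} &\iff s = 2H
\end{aligned}
```

With a causal mask that the kernel actually exploits, the attention term drops to roughly 2·L·s²·H.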
About Attention FLOPs Calculator
Self-attention's quadratic cost is the most quoted scaling fact in deep learning, but the actual numbers surprise people: at short contexts attention is nearly free compared to the projection matmuls around it, and the famous s² term only dominates past s = 2H. This calculator computes both parts for a full sequence (prefill) — the QKᵀ score matrix and the attention-weighted value sum — with the causal-mask halving that real LLMs enjoy. Use it to see precisely when your context length makes attention the budget item, and what FlashAttention does and does not save.
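A minimal Python sketch of the comparison described above, not the calculator's own code. The function names and the model shape (32 layers, H = 4096) are illustrative assumptions. The projection cost of 8·L·s·H² assumes four dense H×H projections per layer with no GQA.

```python
def attention_flops(num_layers: int, seq_len: int, hidden: int, causal: bool = False) -> int:
    """Score (QK^T) plus value (weights @ V) FLOPs for a full-sequence prefill."""
    # Per layer: 2*s^2*H for the scores and 2*s^2*H for the weighted value sum.
    full = 4 * num_layers * seq_len**2 * hidden
    # A kernel that skips the masked triangle does roughly half the work.
    return full // 2 if causal else full


def projection_flops(num_layers: int, seq_len: int, hidden: int) -> int:
    """Q, K, V and output projections: four H x H matmuls per token (assumed)."""
    return 8 * num_layers * seq_len * hidden**2


L, H = 32, 4096  # hypothetical model shape
for s in (512, 2048, 8192, 32768):
    ratio = attention_flops(L, s, H) / projection_flops(L, s, H)
    print(f"s={s:>6}: attention / projection = {ratio:.3f}")
```

Under these assumptions the ratio is simply s / (2H), so it reaches 1 at s = 8192 for H = 4096. Passing `causal=True` halves the attention count and moves that crossover to s = 4H.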
How to use Attention FLOPs Calculator
1. Enter your values into Attention FLOPs Calculator; sensible, domain-typical defaults are pre-filled so you see a real result immediately.
2. The result recomputes live using the formula shown on the page; there is no button to press.
3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.
Why use Attention FLOPs Calculator?
- ✓ Computes Attention FLOPs instantly in your browser: no sign-up, no upload, no server round-trip.
- ✓ 100% free and unlimited, with the exact formula shown: score+value FLOPs = 4·L·s²·H.
- ✓ Runs entirely client-side, so every value you enter stays private on your device.
- ✓ Live recompute as you type, with a worked example and authoritative references for trust.
Frequently asked questions
Does FlashAttention reduce these FLOPs?
No. FlashAttention performs the same 4·L·s²·H forward-pass FLOPs. Its win is memory: it never materializes the s×s score matrix in HBM, tiling it through on-chip SRAM instead. That removes the O(s²) memory wall and speeds things up by cutting memory traffic, not arithmetic.
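To put a number on the O(s²) memory wall, the sketch below estimates how large the score matrix would be if an implementation stored it in full. The head count (32), fp16 storage (2 bytes per element), and batch size of 1 are illustrative assumptions, and `score_matrix_bytes` is a hypothetical helper.

```python
def score_matrix_bytes(seq_len: int, num_heads: int, bytes_per_element: int = 2) -> int:
    """Bytes needed to hold one layer's s x s attention scores for every head."""
    return num_heads * seq_len * seq_len * bytes_per_element


for s in (4096, 32768, 131072):
    gib = score_matrix_bytes(s, num_heads=32) / 2**30
    print(f"s={s:>6}: {gib:,.1f} GiB of scores per layer")
```

With these assumptions the 4,096-token case is exactly 1 GiB per layer, and the figure grows 64× for every 8× increase in context. That intermediate is what tiling avoids writing out, while the arithmetic stays the same.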
Why does causal masking halve the work?
Each token attends only to itself and earlier positions, so the score matrix is lower-triangular — half the s² entries. Kernels that skip the masked upper triangle (as FlashAttention does block-wise) realize the ~2× saving; naive implementations compute then mask, wasting it.
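The "half" can be checked by counting unmasked entries directly. This is a small sketch with a hypothetical helper name. It counts score entries only and ignores the block granularity at which a real kernel skips work.

```python
def causal_entries(seq_len: int) -> int:
    """Unmasked score entries: token i attends to positions 0..i, so 1 + 2 + ... + s."""
    return seq_len * (seq_len + 1) // 2


for s in (16, 1024, 8192):
    fraction = causal_entries(s) / s**2
    print(f"s={s:>5}: {fraction:.4f} of the full s x s matrix")
```

The exact fraction is (s + 1) / (2s): slightly above one half because the diagonal is kept, and converging to 0.5 as s grows.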
At what context does attention dominate total model FLOPs?
Attention FLOPs equal projection FLOPs at s = 2H. Adding the FFN (about 2× the projection cost), attention overtakes the FFN alone at s = 4H and exceeds projections and FFN combined at s = 6H, roughly 25K tokens for H=4096. Past 100K, prefill is overwhelmingly attention. These thresholds use the unmasked attention count.
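The breakdown can be reproduced with a short sketch. It assumes unmasked attention at 4·s²·H, projections at 8·s·H², and an FFN costing twice the projections, all per layer. The function names are hypothetical, and the layer count cancels out of the share.

```python
def prefill_flops(seq_len: int, hidden: int) -> dict[str, int]:
    """Per-layer FLOPs by component (unmasked attention, FFN = 2x projections)."""
    projection = 8 * seq_len * hidden**2
    return {
        "attention": 4 * seq_len**2 * hidden,
        "projection": projection,
        "ffn": 2 * projection,
    }


def attention_share(seq_len: int, hidden: int) -> float:
    """Fraction of the layer's FLOPs spent on attention scores and values."""
    parts = prefill_flops(seq_len, hidden)
    return parts["attention"] / sum(parts.values())


H = 4096  # hypothetical hidden size
for multiple in (2, 4, 6, 24):
    s = multiple * H
    print(f"s = {multiple:>2}H = {s:>6}: attention share = {attention_share(s, H):.2f}")
```

Under these assumptions the share simplifies to s / (s + 6H). Attention matches the projections at 2H, matches the FFN at 4H, and reaches half of the total at 6H.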
How do GQA and sliding windows change this?
GQA shrinks the KV cache, not these FLOPs — scores still involve all query heads. Sliding-window attention genuinely cuts FLOPs to 4·L·s·w·H (window w), turning quadratic into linear; that is Mistral's trick for cheap 32K contexts.
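A sketch comparing the two formulas above. The function names and the example shape (32 layers, H = 4096, window w = 4096) are illustrative assumptions. Clamping the window to the sequence length is an added safeguard, not something stated on the page.

```python
def full_attention_flops(num_layers: int, seq_len: int, hidden: int) -> int:
    """Unmasked score plus value FLOPs: 4 * L * s^2 * H."""
    return 4 * num_layers * seq_len**2 * hidden


def sliding_window_flops(num_layers: int, seq_len: int, window: int, hidden: int) -> int:
    """Each token attends to at most `window` positions: 4 * L * s * w * H."""
    effective_window = min(window, seq_len)  # a window longer than the sequence saves nothing
    return 4 * num_layers * seq_len * effective_window * hidden


L, H, w = 32, 4096, 4096  # hypothetical model shape and window size
for s in (4096, 32768, 131072):
    saving = full_attention_flops(L, s, H) / sliding_window_flops(L, s, w, H)
    print(f"s={s:>6}: full attention costs {saving:.0f}x the windowed version")
```

The saving is s / w whenever s exceeds the window, so doubling the context doubles the windowed cost instead of quadrupling it.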