MoE Active-Parameter Calculator
Total vs active parameters of a mixture-of-experts transformer — experts, top-k routing and shared blocks.
Defaults reconstruct Mixtral 8x7B: ~46.7B total / ~12.9B active. VRAM follows the total; latency and per-token compute follow the active count.
Formula
About MoE Active-Parameter Calculator
Mixture-of-experts models advertise two parameter counts — '8x7B', '671B total, 37B active' — and the relationship between them is exactly what this calculator exposes. Only the FFN is replicated into experts; attention, embeddings and norms stay shared. Routing picks the top-k experts per token, so per-token compute follows the active count while memory follows the total. The defaults rebuild Mixtral 8x7B's published figures; change expert count or top-k to design your own sparsity ratio and understand the serving consequences.
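The total/active split described above can be sketched in a few lines of Python. This is a minimal illustration, not the page's own implementation: the class and function names are hypothetical, and the defaults assume the hyperparameters of Mixtral 8x7B's public configuration (grouped-query attention, a gated three-matrix FFN per expert, untied input and output embeddings).

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    # Assumed Mixtral-8x7B-style defaults; all field names are illustrative.
    hidden: int = 4096        # model width
    layers: int = 32          # transformer blocks
    n_heads: int = 32         # query heads
    n_kv_heads: int = 8       # key/value heads (grouped-query attention)
    ffn_hidden: int = 14336   # expert FFN inner width
    vocab: int = 32000
    n_experts: int = 8        # experts per layer
    top_k: int = 2            # experts routed per token
    tied_embeddings: bool = False


def param_counts(cfg: MoEConfig) -> tuple[int, int]:
    """Return (total, active) parameter counts, ignoring biases."""
    head_dim = cfg.hidden // cfg.n_heads
    kv_dim = cfg.n_kv_heads * head_dim
    # Q and O projections are hidden x hidden; K and V are hidden x kv_dim.
    attn = 2 * cfg.hidden * cfg.hidden + 2 * cfg.hidden * kv_dim
    # Gated FFN: gate, up and down projections.
    expert = 3 * cfg.hidden * cfg.ffn_hidden
    router = cfg.hidden * cfg.n_experts
    norms = 2 * cfg.hidden
    shared_per_layer = attn + router + norms
    # Input embedding, output head (unless tied) and the final norm.
    embed = cfg.vocab * cfg.hidden * (1 if cfg.tied_embeddings else 2) + cfg.hidden

    total = cfg.layers * (shared_per_layer + cfg.n_experts * expert) + embed
    active = cfg.layers * (shared_per_layer + cfg.top_k * expert) + embed
    return total, active


total, active = param_counts(MoEConfig())
print(f"total  = {total / 1e9:.1f}B")   # total  = 46.7B
print(f"active = {active / 1e9:.1f}B")  # active = 12.9B
```

Only the `n_experts` versus `top_k` multiplier differs between the two sums, which is why memory tracks the total while per-token compute tracks the active count.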
How to use MoE Active-Parameter Calculator
1. Enter your values into the MoE Active-Parameter Calculator. Sensible, domain-typical defaults are pre-filled so you see a real result immediately.
2. The result recomputes live using the formula shown on the page; there is no button to press.
3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.
Why use MoE Active-Parameter Calculator?
- Computes MoE total and active parameter counts instantly in your browser: no sign-up, no upload, no server round-trip.
- 100% free and unlimited, with the exact formula shown: total = attn + L.
- Runs entirely client-side, so every value you enter stays private on your device.
- Live recompute as you type, with a worked example and authoritative references.
Frequently asked questions
Why is 8x7B equal to 46.7B and not 56B?
Because only the FFN experts multiply. The shared attention layers, embeddings and norms of the underlying 7B are counted once, not eight times: 8 expert FFNs + 1 shared backbone ≈ 46.7B. The '8x7B' name is branding, not arithmetic.
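The arithmetic behind that answer can be checked directly. The sketch below assumes Mixtral-8x7B-style dimensions taken from its public configuration; the variable names are illustrative. It compares the naive "eight full copies" count with the count in which only the FFN stack is replicated.

```python
# Assumed Mixtral-8x7B-style dimensions (illustrative constants).
HIDDEN, LAYERS, FFN_HIDDEN, VOCAB = 4096, 32, 14336, 32000
KV_DIM = 1024      # 8 key/value heads x 128 dims (grouped-query attention)
N_EXPERTS = 8

attn_per_layer = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM   # Q, O, K, V
norms_per_layer = 2 * HIDDEN
embeddings = 2 * VOCAB * HIDDEN + HIDDEN    # input + output head + final norm
ffn_per_layer = 3 * HIDDEN * FFN_HIDDEN     # gate, up, down projections

# Shared backbone: counted once regardless of expert count.
backbone = LAYERS * (attn_per_layer + norms_per_layer) + embeddings
# One expert's FFN across all layers.
one_expert_stack = LAYERS * ffn_per_layer
# Small per-layer routing matrices that exist only in the MoE.
routers = LAYERS * HIDDEN * N_EXPERTS

dense_equivalent = backbone + one_expert_stack       # the underlying "7B"
naive_total = N_EXPERTS * dense_equivalent           # eight full copies
moe_total = backbone + N_EXPERTS * one_expert_stack + routers

print(f"backbone         = {backbone / 1e9:.2f}B")          # 1.60B
print(f"one expert stack = {one_expert_stack / 1e9:.2f}B")  # 5.64B
print(f"dense equivalent = {dense_equivalent / 1e9:.2f}B")  # 7.24B
print(f"naive 8 copies   = {naive_total / 1e9:.1f}B")       # 57.9B
print(f"actual MoE total = {moe_total / 1e9:.1f}B")         # 46.7B
```

The gap between the two totals is seven extra copies of the roughly 1.6B backbone that the MoE never duplicates, less the small router matrices it adds.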
Does higher top-k make a MoE better?
Not automatically. Expert FFN compute per token grows roughly linearly with k, while quality gains usually diminish, which is why top-2 became a common efficiency sweet spot for coarse-grained designs. DeepSeek-V3 routes each token to 8 of 256 fine-grained routed experts plus a shared expert, showing the design space is wider than Mixtral's coarse 2-of-8.
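To see how the active count moves with top-k, here is a small sweep. It is a sketch under stated assumptions: the two constants are rounded values for a Mixtral-8x7B-style model (shared backbone plus routers of about 1.61B, one expert's FFN stack of about 5.64B), and `active_params` is a hypothetical helper, not part of any library.

```python
SHARED = 1.61e9        # assumed: attention + embeddings + norms + routers
EXPERT_STACK = 5.64e9  # assumed: one expert's FFN summed over all layers


def active_params(top_k: int, n_experts: int = 8) -> float:
    """Parameters touched per token when routing to top_k of n_experts."""
    if not 1 <= top_k <= n_experts:
        raise ValueError("top_k must be between 1 and n_experts")
    return SHARED + top_k * EXPERT_STACK


total = active_params(8)  # routing to every expert touches all weights
for k in (1, 2, 4, 8):
    a = active_params(k)
    print(f"top-{k}: {a / 1e9:5.1f}B active ({a / total:.0%} of total)")
```

Each extra expert per token adds a fixed increment of expert FFN parameters, so the active count grows linearly in k from the shared backbone upward.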
Can I serve a MoE on less VRAM by loading fewer experts?
Not safely: the router may select any expert for any token, and which experts get selected depends on the input. Offloading cold experts to CPU/NVMe works for latency-insensitive use, but production VRAM sizing must assume all experts are resident.
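A rough weight-memory estimate follows from the total count and the storage precision. The helper below is a hypothetical sketch: it covers weights only (no KV cache, activations, or quantization metadata), uses 1 GB = 10^9 bytes, and assumes the nominal bytes-per-parameter values shown.

```python
# Nominal storage cost per parameter; real quantized formats add some overhead.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(total_params: float, dtype: str = "fp16") -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return total_params * BYTES_PER_PARAM[dtype] / 1e9


TOTAL, ACTIVE = 46.7e9, 12.9e9  # Mixtral 8x7B figures quoted on this page

for dtype in ("fp16", "int8", "int4"):
    print(f"{dtype}: {weight_memory_gb(TOTAL, dtype):.1f} GB for all experts")

# Sizing on the active count would understate the requirement by about 3.6x.
print(f"fp16, active only: {weight_memory_gb(ACTIVE):.1f} GB (not a valid budget)")
```

The last line is included only as a contrast: because any expert can be selected for any token, the budget has to come from the total.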
Is a 47B-total/13B-active MoE better than a dense 13B?
Usually yes, at equal per-token compute; that is the bargain MoE offers. As a rule of thumb it matches dense models 2–3× its active size (Mixtral ≈ dense 30B class) while costing 13B-class FLOPs per token, and it pays for that in memory footprint and serving complexity.
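The compute-versus-memory trade can be put in numbers with a common approximation: a forward pass costs about 2 FLOPs per active parameter per token. The sketch below uses that rule of thumb, which ignores the attention-over-context term, with illustrative model sizes and fp16 weights; the function names are hypothetical.

```python
def forward_flops_per_token(active_params: float) -> float:
    """Rule-of-thumb forward cost: one multiply and one add per active weight."""
    return 2.0 * active_params


def fp16_weight_gb(total_params: float) -> float:
    """Weights only, 2 bytes per parameter, 1 GB = 10^9 bytes."""
    return total_params * 2.0 / 1e9


# (name, total parameters, active parameters per token); sizes are illustrative.
models = [
    ("dense 13B", 13.0e9, 13.0e9),
    ("MoE 47B-total / 13B-active", 46.7e9, 12.9e9),
    ("dense 47B", 46.7e9, 46.7e9),
]

for name, total, active in models:
    gflops = forward_flops_per_token(active) / 1e9
    print(f"{name:28s} {gflops:6.1f} GFLOPs/token  {fp16_weight_gb(total):6.1f} GB weights")
```

Under these assumptions the MoE sits in the dense-13B row for compute and in the dense-47B row for memory.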