MoE Active-Parameter Calculator

Total vs active parameters of a mixture-of-experts transformer — experts, top-k routing and shared blocks.

Inputs: Layers, Hidden size, Expert FFN size, Experts per layer, Experts active per token (top-k), Vocab size

Outputs: Total parameters (B), Active per token (B), Sparsity (total/active) (×)

Defaults reconstruct Mixtral 8x7B: ~46.7B total / ~12.9B active. VRAM follows the total; latency and per-token compute follow the active count.

Formula

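total = attn + L·E·(3HF) + emb
active = attn + L·k·(3HF) + emb

Here L = layers, E = experts per layer, k = experts active per token, H = hidden size and F = expert FFN size. Only the FFN experts are replicated; attention and embeddings are shared.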
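As a rough sketch of the formula above, the Python below recomputes the two counts for a Mixtral-shaped model. The page does not spell out the attn and emb terms, so several things here are assumptions: the default values (taken from Mixtral's public configuration), grouped-query attention with 32 query heads and 8 key/value heads, and untied input and output embeddings. Router and norm weights are ignored as negligible. The function name `moe_param_counts` is hypothetical and is not the calculator's own code.

```python
def moe_param_counts(layers=32, hidden=4096, ffn=14336, experts=8, top_k=2,
                     vocab=32000, n_heads=32, n_kv_heads=8):
    """Return (total, active) parameter counts for a gated-FFN MoE transformer."""
    # One expert = three H x F matrices (gate, up, down): the 3HF term.
    expert = 3 * hidden * ffn
    # ASSUMPTION: grouped-query attention. Q and O are H x H; K and V are H x kv_dim.
    kv_dim = hidden * n_kv_heads // n_heads
    attn = layers * (2 * hidden * hidden + 2 * hidden * kv_dim)
    # ASSUMPTION: untied input and output embedding matrices.
    emb = 2 * vocab * hidden
    total = attn + layers * experts * expert + emb
    active = attn + layers * top_k * expert + emb
    return total, active


total, active = moe_param_counts()
print(f"total    = {total / 1e9:.1f}B")      # 46.7B
print(f"active   = {active / 1e9:.1f}B")     # 12.9B
print(f"sparsity = {total / active:.2f}x")   # 3.63x
```

Under these assumptions, the gap between the two counts is exactly L·(E − k)·(3HF), the experts that a given token never touches.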

References: Jiang et al. (2024), Mixtral of Experts; Fedus et al. (2021), Switch Transformers

About MoE Active-Parameter Calculator

Mixture-of-experts models advertise two parameter counts — '8x7B', '671B total, 37B active' — and the relationship between them is exactly what this calculator exposes. Only the FFN is replicated into experts; attention, embeddings and norms stay shared. Routing picks the top-k experts per token, so per-token compute follows the active count while memory follows the total. The defaults rebuild Mixtral 8x7B's published figures; change expert count or top-k to design your own sparsity ratio and understand the serving consequences.

How to use MoE Active-Parameter Calculator

1. Enter your values into the MoE Active-Parameter Calculator. Sensible, domain-typical defaults are pre-filled, so you see a real result immediately.
2. The result recomputes live using the formula shown on the page; there is no button to press.
3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.

Why use MoE Active-Parameter Calculator?

✓ Computes total and active MoE parameter counts instantly in your browser: no sign-up, no upload, no server round-trip.
✓ 100% free and unlimited, with the exact formula shown: total = attn + L·E·(3HF) + emb.
✓ Runs entirely client-side, so every value you enter stays private on your device.
✓ Recomputes live as you type, with a worked example and authoritative references for trust.

Frequently asked questions

Why is 8x7B equal to 46.7B and not 56B?

Because only the FFN experts multiply. The shared attention layers, embeddings and norms of the underlying 7B are counted once, not eight times: 8 expert FFNs + 1 shared backbone ≈ 46.7B. The '8x7B' name is branding, not arithmetic.
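The arithmetic can be run backwards as a sanity check. The sketch below assumes only the two published figures (46.7B total, 12.9B active) and the two-equation model total = shared + E·x, active = shared + k·x, where x is one expert's FFN weights summed over all layers. The helper name `split_shared_and_expert` is hypothetical.

```python
def split_shared_and_expert(total, active, experts, top_k):
    """Solve total = shared + experts*x and active = shared + top_k*x for (shared, x)."""
    per_expert = (total - active) / (experts - top_k)
    shared = active - top_k * per_expert
    return shared, per_expert


# Published Mixtral 8x7B figures, in billions of parameters.
shared, per_expert = split_shared_and_expert(46.7, 12.9, experts=8, top_k=2)
print(f"one expert (all layers): {per_expert:.2f}B")
print(f"shared backbone:         {shared:.2f}B")
print(f"one dense slice:         {shared + per_expert:.2f}B")
print(f"naive 8 x 7B:            {8 * 7}B")
```

This gives roughly 5.6B per expert and roughly 1.6B shared, so one shared backbone plus one expert is about a 7B dense model, while eight experts plus one backbone is about 46.7B.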

Does higher top-k make a MoE better?

More active experts generally raise quality, and per-token compute grows roughly linearly with k. Top-2 became standard as the efficiency sweet spot. DeepSeek-V3 routes to 8 of 256 fine-grained experts plus shared ones, showing the design space is wider than Mixtral's coarse 2-of-8.
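To see the linear compute cost of raising k, the sketch below sweeps top-k for a Mixtral-shaped model using the page's formula. The shared term (attention plus embeddings) is fixed at an assumed 1.6B, which is an approximation rather than a published figure, and the names are hypothetical. It says nothing about quality, only about parameter counts.

```python
LAYERS, HIDDEN, FFN, EXPERTS = 32, 4096, 14336, 8
EXPERT_PARAMS = 3 * HIDDEN * FFN   # one expert in one layer: the 3HF term
SHARED = 1.6e9                     # ASSUMPTION: approximate attn + emb


def active_params(top_k):
    """Active parameters per token: shared weights plus top_k experts per layer."""
    return SHARED + LAYERS * top_k * EXPERT_PARAMS


total = SHARED + LAYERS * EXPERTS * EXPERT_PARAMS
for k in range(1, EXPERTS + 1):
    a = active_params(k)
    print(f"top-{k}: active = {a / 1e9:5.1f}B, sparsity = {total / a:.2f}x")
```

Each extra active expert adds the same L·(3HF) increment (about 5.6B here), and at k = E the model is effectively dense: active equals total.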

Can I serve a MoE on less VRAM by loading fewer experts?

Not safely. The router may select any expert for any token, and expert-selection distributions are input-dependent. Offloading cold experts to CPU/NVMe works for latency-insensitive use, but VRAM sizing must assume all experts resident for production.
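For a rough sense of the sizing this implies, the sketch below converts parameter counts into weight memory at a few common precisions. It covers weights only; KV cache, activations and framework overhead are excluded, and the precision list and the helper name `weight_memory_gib` are illustrative assumptions.

```python
def weight_memory_gib(params, bytes_per_param):
    """Memory for model weights alone, in GiB."""
    return params * bytes_per_param / 2**30


TOTAL, ACTIVE = 46.7e9, 12.9e9   # Mixtral 8x7B total and active parameters

for name, nbytes in [("fp16/bf16", 2), ("int8", 1), ("4-bit", 0.5)]:
    resident = weight_memory_gib(TOTAL, nbytes)
    active_only = weight_memory_gib(ACTIVE, nbytes)
    print(f"{name:9s}: all experts resident = {resident:5.1f} GiB, "
          f"active-only (not a safe sizing target) = {active_only:5.1f} GiB")
```

The first column is the one to plan around: the active-only figure describes per-token compute, not what has to be reachable by the router.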

Is a 47B-total/13B-active MoE better than a dense 13B?

Substantially, at equal per-token compute — that is MoE's whole bargain. It typically matches dense models 2–3× its active size (Mixtral ≈ dense 30B class) while costing 13B-class FLOPs, paying for it in memory footprint and serving complexity.
