
MoE Active-Parameter Calculator

Total vs active parameters of a mixture-of-experts transformer — experts, top-k routing and shared blocks.

Total parameters (B)
Active per token (B)
Sparsity (total/active) (×)

Defaults reconstruct Mixtral 8x7B: ~46.5B total / ~12.9B active, close to the published 46.7B / 12.9B. VRAM follows the total; latency and per-token compute follow the active count.

Formula

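total = attn + L·E·(3HF) + emb
active = attn + L·k·(3HF) + emb

Only the FFN experts are replicated; attention and embeddings are shared. Reading the symbols in the usual way, L is the number of layers, E the experts per layer, k the experts routed per token, H the hidden size and F the FFN intermediate size, so 3HF is the size of one gated-FFN expert.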
References: Jiang et al. (2024), Mixtral of Experts; Fedus et al. (2021), Switch Transformers
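
The formula above can be sketched in Python as follows. This is a minimal illustration, not the calculator's actual source: the function name `moe_param_counts` and its argument names are hypothetical, `attn` and `emb` are taken as precomputed counts, and `3 * hidden * ffn` assumes each expert is a gated FFN with three H×F weight matrices.

```python
def moe_param_counts(attn, emb, layers, experts, top_k, hidden, ffn):
    """Return (total, active, sparsity) for a MoE transformer.

    Hypothetical helper mirroring the page's formula; router and norm
    weights are ignored, as they are in the formula itself.
    """
    if not 1 <= top_k <= experts:
        raise ValueError("top_k must be between 1 and the number of experts")
    expert_ffn = 3 * hidden * ffn  # assumed: one gated-FFN expert
    total = attn + layers * experts * expert_ffn + emb
    active = attn + layers * top_k * expert_ffn + emb
    return total, active, total / active


# Toy numbers chosen only to make the arithmetic easy to follow.
total, active, sparsity = moe_param_counts(
    attn=1_000, emb=500, layers=2, experts=4, top_k=1, hidden=10, ffn=20
)
print(total, active, round(sparsity, 2))  # 6300 2700 2.33
```

The only term that differs between the two counts is the multiplier on the expert FFN: E for the total, k for the active count.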

About MoE Active-Parameter Calculator

Mixture-of-experts models advertise two parameter counts — '8x7B', '671B total, 37B active' — and the relationship between them is exactly what this calculator exposes. Only the FFN is replicated into experts; attention, embeddings and norms stay shared. Routing picks the top-k experts per token, so per-token compute follows the active count while memory follows the total. The defaults rebuild Mixtral 8x7B's published figures; change expert count or top-k to design your own sparsity ratio and understand the serving consequences.
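
To make the serving consequences concrete, the sketch below turns the two counts into a rough weight-memory and compute estimate. It relies on assumptions that are not from this page: 2 bytes per parameter (fp16/bf16 weights), about 2 FLOPs per active parameter per token for a forward pass, and no allowance for KV cache or activations. The function name `serving_estimate` is hypothetical.

```python
def serving_estimate(total_params, active_params, bytes_per_param=2):
    """Rough weight memory (GiB) and forward FLOPs per token.

    Assumes all experts are resident and ignores KV cache and activations.
    """
    weight_gib = total_params * bytes_per_param / 2**30
    flops_per_token = 2 * active_params  # common rule-of-thumb approximation
    return weight_gib, flops_per_token


# Headline figures for Mixtral 8x7B and for the 671B / 37B example.
models = {
    "46.7B total / 12.9B active": (46.7e9, 12.9e9),
    "671B total / 37B active": (671e9, 37e9),
}
for name, (total, active) in models.items():
    gib, flops = serving_estimate(total, active)
    print(f"{name}: ~{gib:.0f} GiB of weights, "
          f"~{flops / 1e9:.0f} GFLOPs/token, sparsity {total / active:.1f}x")
```

Memory scales with the first number of each pair and compute with the second, which is why the two are quoted separately.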

How to use MoE Active-Parameter Calculator

  1. Enter your values into the MoE Active-Parameter Calculator; sensible, domain-typical defaults are pre-filled so you see a real result immediately.
  2. The result recomputes live using the formula shown on the page; there is no button to press.
  3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.

Why use MoE Active-Parameter Calculator?

  • Computes total and active MoE parameter counts instantly in your browser: no sign-up, no upload, no server round-trip.
  • 100% free and unlimited, with the exact formula shown: total = attn + L·E·(3HF) + emb.
  • Runs entirely client-side, so every value you enter stays private on your device.
  • Live recompute as you type, with a worked example and authoritative references for trust.

Frequently asked questions

Why is 8x7B equal to 46.7B and not 56B?

Because only the FFN experts multiply. The shared attention layers, embeddings and norms of the underlying 7B are counted once, not eight times: 8 expert FFNs + 1 shared backbone ≈ 46.7B. The '8x7B' name is branding, not arithmetic.
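
The arithmetic behind this answer can be checked with a short script. The sketch below plugs in Mixtral 8x7B's published configuration (32 layers, hidden size 4096, FFN size 14336, 32 query heads and 8 key/value heads of dimension 128, vocabulary 32000) and makes two simplifying assumptions: input and output embeddings are counted separately, and router and norm weights are ignored. Variable names are illustrative, and the calculator's own inputs may be set up differently.

```python
LAYERS, HIDDEN, FFN = 32, 4096, 14336
HEADS, KV_HEADS, HEAD_DIM = 32, 8, 128
VOCAB, EXPERTS, TOP_K = 32000, 8, 2

expert_ffn = 3 * HIDDEN * FFN  # one gated-FFN expert
attn_layer = (
    2 * HIDDEN * HEADS * HEAD_DIM        # query and output projections
    + 2 * HIDDEN * KV_HEADS * HEAD_DIM   # grouped key and value projections
)
attn = LAYERS * attn_layer
emb = 2 * VOCAB * HIDDEN  # assumed: untied input and output embeddings

total = attn + LAYERS * EXPERTS * expert_ffn + emb
active = attn + LAYERS * TOP_K * expert_ffn + emb
dense = attn + LAYERS * 1 * expert_ffn + emb  # same backbone, single FFN

print(f"total  = {total / 1e9:.1f}B")   # total  = 46.7B
print(f"active = {active / 1e9:.1f}B")  # active = 12.9B
print(f"dense  = {dense / 1e9:.2f}B, naive 8x = {8 * dense / 1e9:.1f}B")
# dense  = 7.24B, naive 8x = 57.9B
```

The naive product overshoots because it counts the attention and embedding weights eight times; only the expert FFN term is actually multiplied by 8.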

Does higher top-k make a MoE better?

More active experts raise quality, and per-token compute grows roughly linearly with k; top-2 became a common choice as the efficiency sweet spot. DeepSeek-V3 routes each token to 8 of 256 fine-grained experts plus a shared expert, showing the design space is wider than Mixtral's coarse 2-of-8.
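
The linear growth in active parameters with k can be seen by holding everything else fixed. A minimal sketch, assuming Mixtral-like dimensions and a rounded, hypothetical 1.6B of shared attention and embedding weights; the function name is illustrative:

```python
def active_params(shared, layers, top_k, hidden, ffn):
    """Active parameters per token: shared weights plus k experts per layer."""
    return shared + layers * top_k * 3 * hidden * ffn


SHARED = 1_600_000_000  # assumed attention + embeddings (rounded)
for k in (1, 2, 4, 8):
    n = active_params(SHARED, layers=32, top_k=k, hidden=4096, ffn=14336)
    print(f"top-{k}: {n / 1e9:.1f}B active")
```

With these dimensions each additional routed expert adds 32 × 3 × 4096 × 14336 ≈ 5.6B active parameters, so per-token compute rises in equal steps while the total parameter count stays fixed.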

Can I serve a MoE on less VRAM by loading fewer experts?

Not safely — the router may select any expert for any token, and expert-choice distributions are input-dependent. Offloading cold experts to CPU/NVMe works for latency-insensitive use, but VRAM sizing must assume all experts resident for production.

Is a 47B-total/13B-active MoE better than a dense 13B?

Substantially, at equal per-token compute — that is MoE's whole bargain. It typically matches dense models 2–3× its active size (Mixtral ≈ dense 30B class) while costing 13B-class FLOPs, paying for it in memory footprint and serving complexity.
