GPTQ 4-bit Model Size Calculator

VRAM footprint of GPTQ-quantized checkpoints (group size 128) for GPU serving.

Model parameters (billions): e.g. 7.24 for Mistral 7B, 70.6 for Llama 3 70B

Runtime + cache headroom (GB): KV cache + compute buffers; raise for long contexts

Weights / file size (GB)

Total memory to run (GB)

Formula

size(GB) = params × 4.15 bits ÷ 8 ÷ 10⁹ (4.15 = measured effective bits/weight for this format, incl. scales)
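As a worked example, the formula can be sketched in a few lines of Python. The function name `gptq_weights_gb` and its parameters are illustrative and not part of the calculator; the input is the parameter count in billions, and 1 GB is taken as 10⁹ bytes, as in the formula.

```python
def gptq_weights_gb(params_billion: float, bits_per_weight: float = 4.15) -> float:
    """Weights / file size in GB (1 GB = 10^9 bytes) for a GPTQ checkpoint."""
    params = params_billion * 1e9              # billions -> raw parameter count
    size_bytes = params * bits_per_weight / 8  # bits -> bytes
    return size_bytes / 1e9                    # bytes -> GB


# The two example models named in the input hint.
for name, params_b in [("Mistral 7B", 7.24), ("Llama 3 70B", 70.6)]:
    print(f"{name}: {gptq_weights_gb(params_b):.2f} GB")
```

For the two example models this gives roughly 3.76 GB and 36.62 GB of weights, before any runtime or cache headroom is added.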

References: llama.cpp quantization documentation (k-quants); Frantar et al. (2022), GPTQ; Lin et al. (2023), AWQ; NVIDIA FP8 Transformer Engine docs

About GPTQ 4-bit Model Size Calculator

GPTQ quantizes weights one column at a time against calibration data, minimizing layer-output error. With the standard group size of 128, the per-group FP16 scales and packed 4-bit zero-points add ~0.15 bits, giving ≈4.15 effective bits per weight (bpw); the format is served by ExLlama and Marlin GPU kernels. This calculator turns any parameter count into a concrete file size and a realistic total-memory figure, so you can check whether a given checkpoint fits your GPU VRAM or system RAM before downloading tens of gigabytes. It uses the measured effective bits-per-weight of the format, including the per-group scales, rather than the nominal 4-bit width.

How to use GPTQ 4-bit Model Size Calculator

1. Enter your values into the GPTQ 4-bit Model Size Calculator. Sensible, domain-typical defaults are pre-filled so you see a real result immediately.
2. The result recomputes live using the formula shown on the page; there is no button to press.
3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.

Why use GPTQ 4-bit Model Size Calculator?

✓ Computes GPTQ 4-bit model size instantly in your browser: no sign-up, no upload, no server round-trip.
✓ 100% free and unlimited, with the exact formula shown: size(GB) = params × 4.15 bits ÷ 8 ÷ 10⁹ (4.15 = measured effective bits/weight for this format, incl. scales).
✓ Runs entirely client-side, so every value you enter stays private on your device.
✓ Live recompute as you type, with a worked example and authoritative references for trust.

Frequently asked questions

GPTQ vs GGUF Q4_K_M: which is smaller?

GPTQ at group 128 (~4.15 bpw) is ~14% smaller than Q4_K_M (~4.84 bpw) because it keeps every tensor at 4 bits. Quality is comparable; the real decision is your runtime — GPTQ for GPU-only stacks (vLLM, ExLlama), GGUF for llama.cpp.
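The ~14% figure follows directly from the ratio of the two bits-per-weight values quoted above. A minimal sketch, with the helper name `relative_saving` chosen here purely for illustration:

```python
GPTQ_G128_BPW = 4.15  # effective bpw used by this calculator
Q4_K_M_BPW = 4.84     # effective bpw quoted above for GGUF Q4_K_M


def relative_saving(smaller_bpw: float, larger_bpw: float) -> float:
    """Fraction by which the smaller format undercuts the larger one."""
    return 1.0 - smaller_bpw / larger_bpw


saving = relative_saving(GPTQ_G128_BPW, Q4_K_M_BPW)
print(f"GPTQ g128 is {saving:.1%} smaller than Q4_K_M")
```

Because both sizes scale linearly with parameter count, the same percentage applies to any model size.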

What does group size change?

One FP16 scale is shared per group of weights per output channel. Group 128 is standard; group 32 improves accuracy (~4.6 bpw) and -1 (per-channel) shrinks further but hurts quality. This tool's 4.15 figure assumes group 128 with act-order.
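The effect of group size on bits per weight can be approximated as below. This is a sketch under a stated assumption: each group stores one FP16 scale plus one packed 4-bit zero-point, and the act-order index and any unquantized tensors are ignored. The function name and parameters are illustrative.

```python
def effective_bpw(group_size: int, bits: int = 4,
                  scale_bits: int = 16, zero_bits: int = 4) -> float:
    """Approximate bits per weight: payload bits plus per-group metadata.

    Assumes one scale and one zero-point per group of `group_size` weights.
    """
    if group_size <= 0:
        # Per-channel (-1) overhead depends on the layer's input width,
        # so it is not modelled in this sketch.
        raise ValueError("group_size must be a positive integer")
    return bits + (scale_bits + zero_bits) / group_size


for g in (32, 64, 128):
    print(f"group {g:>3}: {effective_bpw(g):.3f} bpw")
```

Under this assumption group 128 lands at about 4.16 bpw and group 32 at about 4.63 bpw, in line with the ~4.15 and ~4.6 figures quoted above.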

Does GPTQ need calibration data?

Yes — typically 128–512 short samples (often C4 or WikiText). The quantizer uses them to weight which directions of error matter. Bad or mismatched calibration (e.g. English-only for a code model) measurably hurts the quantized model.

Why is my GPTQ VRAM higher than the file size?

Dequantization workspace, the KV cache and CUDA context add overhead. Budget file size + ~1.2 GB runtime + cache (use our per-model KV-cache calculators) — that is the number that decides whether a 24 GB card suffices.
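The budget described above can be sketched as a simple fit check. The function names, the 34B model and the 2 GB KV-cache value are hypothetical inputs for illustration; the 1.2 GB runtime figure and 4.15 bpw come from this page.

```python
def total_memory_gb(params_billion: float, kv_cache_gb: float,
                    runtime_gb: float = 1.2,
                    bits_per_weight: float = 4.15) -> float:
    """File size + runtime overhead + KV cache, all in GB (10^9 bytes)."""
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + runtime_gb + kv_cache_gb


def fits(params_billion: float, kv_cache_gb: float,
         vram_gb: float = 24.0) -> bool:
    """True if the estimated total stays within the card's VRAM."""
    return total_memory_gb(params_billion, kv_cache_gb) <= vram_gb


# Hypothetical scenarios on a 24 GB card with a 2 GB KV cache.
for params_b in (34.0, 70.6):
    total = total_memory_gb(params_b, kv_cache_gb=2.0)
    print(f"{params_b}B: {total:.1f} GB total, fits 24 GB: {fits(params_b, 2.0)}")
```

Treat the result as an estimate: the KV cache grows with context length and batch size, so a model that fits at a short context may not fit at a long one.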
