Linear/Dense Layer Calculator

Parameters, FLOPs and activation memory of fully-connected layers — the MLP-head and projection workhorse.

Inputs: Input features, Output features, Batch size (tokens/samples), Bias

Outputs: Parameters (M), FLOPs per batch (M), Output activation (FP32) (MB)

Defaults are a ResNet-50 classifier head (2048→1000): 2.05M parameters. Before global average pooling became the standard head design, the first FC layer of VGG's flatten+FC head alone held about 103M parameters, the design mistake that GAP fixed.

Formula

params = F_in·F_out + F_out (if bias)
FLOPs = 2·B·F_in·F_out
activations = B·F_out·sizeof(dtype)
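The three formulas can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual source: the function name `linear_layer_cost`, its parameter names, and the default batch size of 1 are assumptions made for the example.

```python
def linear_layer_cost(f_in, f_out, batch=1, bias=True, dtype_bytes=4):
    """Return (params, flops, activation_bytes) for one fully-connected layer.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    # Weight matrix plus one bias value per output feature (if enabled).
    params = f_in * f_out + (f_out if bias else 0)
    # One multiply and one add per weight per sample; the bias add is
    # ignored, matching the 2*B*F_in*F_out formula.
    flops = 2 * batch * f_in * f_out
    # Output tensor only: B x F_out elements at dtype_bytes each.
    activation_bytes = batch * f_out * dtype_bytes
    return params, flops, activation_bytes


# ResNet-50 classifier head (2048 -> 1000), assumed batch of 1, FP32 output.
params, flops, act_bytes = linear_layer_cost(2048, 1000)
print(f"{params / 1e6:.2f}M params, {flops / 1e6:.2f}M FLOPs, {act_bytes / 1e6:.3f} MB")
```

With the default inputs this reproduces the 2.05M-parameter figure quoted for the ResNet-50 head; passing `bias=False` drops the extra 1,000 parameters.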

References: Lin et al. (2013), Network in Network (introduced GAP heads)

About Linear/Dense Layer Calculator

The fully-connected layer is deep learning's atom: every attention projection, every MLP block, every classifier head reduces to F_in×F_out weights doing 2·B·F_in·F_out FLOPs. This calculator prices a single one (parameters, compute and the often-forgotten output-activation memory) so you can audit a head design in seconds. The historical footnote in the defaults is instructive: ResNet's 2M-parameter GAP head replaced VGG's flatten head, whose first FC layer alone held about 103M parameters. It is one of the clearest architecture lessons ever taught by a single number.

How to use Linear/Dense Layer Calculator

1. Enter your values into Linear/Dense Layer Calculator; sensible, domain-typical defaults are pre-filled so you see a real result immediately.
2. The result recomputes live using the formula shown on the page; there is no button to press.
3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.

Why use Linear/Dense Layer Calculator?

✓ Computes linear/dense layer parameters, FLOPs and activation memory instantly in your browser: no sign-up, no upload, no server round-trip.
✓ 100% free and unlimited, with the exact formula shown: params = F_in·F_out + F_out (if bias).
✓ Runs entirely client-side, so every value you enter stays private on your device.
✓ Live recompute as you type, with a worked example and authoritative references for trust.

Frequently asked questions

Why was VGG's FC head over 100M parameters?

Flattening a 7×7×512 feature map gives 25,088 inputs; the first FC layer maps them to 4096 outputs, so 25088×4096 ≈ 103M weights sit in one layer, roughly three quarters of VGG-16's ~138M total. Global average pooling collapses 7×7 to 1×1 first, shrinking that layer about 49× (to 512×4096 ≈ 2.1M weights), typically with little or no accuracy loss.
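The arithmetic behind that comparison is short enough to check by hand or in a few lines of Python. The sketch below counts weights only (biases omitted) and assumes, purely for a like-for-like comparison, that the 4096-wide FC layer is kept after pooling; the variable names are illustrative.

```python
# VGG-16 final feature map and first FC width.
channels, height, width, hidden = 512, 7, 7, 4096

flatten_inputs = channels * height * width     # 25,088 features after flatten
flatten_fc_weights = flatten_inputs * hidden   # first FC layer after flatten

gap_inputs = channels                          # 7x7 averaged down to 1x1
gap_fc_weights = gap_inputs * hidden           # same FC width after GAP (assumed)

shrink = flatten_fc_weights / gap_fc_weights   # equals height * width
print(f"{flatten_fc_weights:,} vs {gap_fc_weights:,} weights ({shrink:.0f}x smaller)")
```

The shrink factor is exactly the number of spatial positions that pooling removes, 7×7 = 49, which is where the "about 50×" figure comes from.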

How do linear layers relate to 1×1 convolutions?

A 1×1 conv is a linear layer applied at every spatial position with shared weights. The parameter formula is the same (C_in·C_out), and the FLOPs are multiplied by the H×W positions. Either one can be implemented with the other by reshaping; the distinction is purely which axes the weights sweep.
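The equivalence can be checked numerically. The following is a minimal pure-Python sketch (no framework involved): `conv2d_valid` and `linear` are hypothetical helper names written for this example, and the tensor sizes are arbitrary. It runs a naive convolution with a 1×1 kernel and compares it with a linear layer that treats the H·W positions as its batch dimension.

```python
import random

def conv2d_valid(fmap, kernel):
    """Naive 'valid' convolution. fmap: [H][W][C_in], kernel: [C_out][C_in][kH][kW]."""
    h, w = len(fmap), len(fmap[0])
    c_out, c_in = len(kernel), len(kernel[0])
    kh, kw = len(kernel[0][0]), len(kernel[0][0][0])
    out = []
    for y in range(h - kh + 1):
        row = []
        for x in range(w - kw + 1):
            row.append([
                sum(kernel[o][i][dy][dx] * fmap[y + dy][x + dx][i]
                    for i in range(c_in) for dy in range(kh) for dx in range(kw))
                for o in range(c_out)
            ])
        out.append(row)
    return out

def linear(batch, weight):
    """batch: [B][F_in], weight: [F_out][F_in] -> [B][F_out] (no bias)."""
    return [[sum(w * v for w, v in zip(w_row, vec)) for w_row in weight] for vec in batch]

random.seed(0)
H, W, C_IN, C_OUT = 3, 4, 5, 2
fmap = [[[random.random() for _ in range(C_IN)] for _ in range(W)] for _ in range(H)]
weight = [[random.random() for _ in range(C_IN)] for _ in range(C_OUT)]

# The same weights viewed as a C_out x C_in x 1 x 1 convolution kernel.
kernel = [[[[weight[o][i]]] for i in range(C_IN)] for o in range(C_OUT)]

conv_out = conv2d_valid(fmap, kernel)
# Treat the H*W spatial positions as the batch dimension of a linear layer.
flat_positions = [pixel for row in fmap for pixel in row]
lin_out = linear(flat_positions, weight)

conv_flat = [pixel for row in conv_out for pixel in row]
max_diff = max(abs(a - b) for p, q in zip(conv_flat, lin_out) for a, b in zip(p, q))
print(f"positions: {len(lin_out)}, max difference: {max_diff}")
```

Both paths use C_in·C_out weights, and the linear view makes the FLOP count explicit: 2·B·C_in·C_out with B = H·W.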

When should I drop the bias?

Drop it when normalization directly follows, and in most modern LLM blocks. LayerNorm and BatchNorm have their own learned shift, so the bias is redundant (exactly so for BatchNorm, whose mean subtraction cancels it) and only adds work to the kernel. Keep it on the final logits head and on small MLPs without normalization, where it genuinely shifts decision boundaries.
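The redundancy is easiest to see for BatchNorm, which subtracts the per-feature batch mean and therefore removes any constant added before it. The sketch below is a pure-Python illustration for a single output feature; `batchnorm_feature` is a hypothetical helper without the learned scale and shift, and the bias value is made up.

```python
import random
from statistics import fmean, pvariance

def batchnorm_feature(values, eps=1e-5):
    """Normalize one feature across the batch (no learned scale or shift)."""
    mu = fmean(values)
    var = pvariance(values, mu)
    return [(v - mu) / (var + eps) ** 0.5 for v in values]

random.seed(0)
# Stand-in for x @ W for one output feature over a batch of 16 samples.
pre_activations = [random.gauss(0.0, 1.0) for _ in range(16)]
bias = 3.7  # hypothetical learned bias of the linear layer

without_bias = batchnorm_feature(pre_activations)
with_bias = batchnorm_feature([v + bias for v in pre_activations])

# Adding a constant shifts the mean by the same constant and leaves the
# variance unchanged, so the normalized outputs agree up to rounding.
max_diff = max(abs(a - b) for a, b in zip(without_bias, with_bias))
print(max_diff)
```

The same cancellation does not hold exactly for LayerNorm, which normalizes across features rather than across the batch; there the argument rests on LayerNorm's own shift parameter covering the same role.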

Is the activation memory ever the bottleneck for linear layers?

In training with long sequences, yes: a B×S×F output at F=14336 (Llama FFN), S=8K, B=8 is about 1.9 GB at BF16 (8 × 8192 × 14336 × 2 bytes) for a single layer, and it is stored for backward. This is precisely what activation checkpointing trades compute to avoid; see our activation-memory calculator.
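The size of such a tensor follows directly from B·S·F·sizeof(dtype). A minimal sketch, assuming that S = 8K means 8192 tokens and that BF16 takes 2 bytes per element; `activation_bytes` is a hypothetical helper name.

```python
def activation_bytes(batch, seq_len, features, dtype_bytes):
    """Size in bytes of one B x S x F activation tensor."""
    return batch * seq_len * features * dtype_bytes

# Assumptions: S = 8K read as 8192 tokens, BF16 = 2 bytes per element.
size = activation_bytes(batch=8, seq_len=8192, features=14336, dtype_bytes=2)
print(f"{size / 1e9:.2f} GB ({size / 2**30:.2f} GiB) for one layer's output")
```

This counts a single output tensor. A full training step keeps such activations for many layers at once, which is why the total grows so quickly with depth and sequence length.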

Related ML & AI tools

ROC-AUC Calculator (from TPR/FPR points): Trapezoidal area under the ROC curve from your (FPR, TPR) operating points, the threshold-independent ranking score.

Classification Threshold Cost Calculator: Find the probability cutoff that minimizes expected cost given your false-positive and false-negative penalties.

Silhouette Score Calculator: Cluster cohesion vs separation for one point, the building block of the silhouette metric for choosing K.