Cohen's Kappa Calculator
Inter-rater / model-vs-label agreement corrected for chance — the metric for labeling quality and annotation studies.
Landis & Koch bands: below 0 poor, 0.00–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, 0.81–1.00 almost perfect. Raw agreement % alone is misleading when one class dominates; κ corrects for it.
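The bands above can be expressed as a small lookup. This is a minimal sketch, not the calculator's own code: the function name `landis_koch_band` is hypothetical, and treating each upper bound as inclusive (so a value such as 0.205 falls into "fair") is an assumption, since the published bands are stated to two decimals.

```python
def landis_koch_band(kappa: float) -> str:
    """Map a kappa value to its Landis & Koch descriptive label.

    Assumption: upper bounds are inclusive, so values that fall between
    the two-decimal published bands go to the next band up.
    """
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie in [-1, 1]")
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


for value in (-0.05, 0.15, 0.53, 0.72, 0.91):
    print(value, landis_koch_band(value))
```

The labels are descriptive conventions rather than statistical tests, so the function only classifies a value; it says nothing about sampling uncertainty.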
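Formula
κ = (pₒ − pₑ) / (1 − pₑ), where pₒ is the observed agreement and pₑ is the agreement expected by chance.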
About Cohen's Kappa Calculator
When two annotators (or a model and a gold label) agree 90% of the time, is that good? Not necessarily. If each rater assigns the easy majority class 88% of the time, two independent raters would already agree about 79% of the time by chance (0.88² + 0.12² ≈ 0.79). Cohen's kappa subtracts that chance agreement and rescales, so κ measures agreement beyond luck; in this example κ ≈ 0.53, which is only moderate. It's the standard metric for labeling-quality audits, inter-annotator reliability and dataset validation. Enter the 2×2 agreement table and this calculator returns observed agreement, chance agreement and κ with the Landis-Koch interpretation band.
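The calculation described above can be sketched in a few lines of Python. The function name `cohen_kappa_2x2`, the `KappaResult` type and the cell names `a`, `b`, `c`, `d` are illustrative assumptions, not the calculator's actual implementation. The example table is hypothetical: 100 items with 90% observed agreement, where each rater assigns the majority class 88 times.

```python
from typing import NamedTuple


class KappaResult(NamedTuple):
    observed: float  # p_o: share of items where both raters agree
    chance: float    # p_e: agreement expected from the marginals alone
    kappa: float


def cohen_kappa_2x2(a: int, b: int, c: int, d: int) -> KappaResult:
    """Cohen's kappa for a 2x2 agreement table.

    a: both raters say class 1      b: rater 1 says 1, rater 2 says 0
    c: rater 1 says 0, rater 2 says 1   d: both raters say class 0
    """
    n = a + b + c + d
    if n == 0:
        raise ValueError("table is empty")
    p_o = (a + d) / n
    r1_pos = (a + b) / n  # rater 1's rate of choosing class 1
    r2_pos = (a + c) / n  # rater 2's rate of choosing class 1
    p_e = r1_pos * r2_pos + (1 - r1_pos) * (1 - r2_pos)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return KappaResult(p_o, p_e, (p_o - p_e) / (1 - p_e))


# Hypothetical table: 83 + 7 agreements, 5 + 5 disagreements.
result = cohen_kappa_2x2(83, 5, 5, 7)
print(f"p_o={result.observed:.4f}  p_e={result.chance:.4f}  kappa={result.kappa:.4f}")
```

With this table the observed agreement is 0.90, the chance agreement is about 0.79, and κ comes out near 0.53, so the headline 90% overstates how well the raters align.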
How to use Cohen's Kappa Calculator
1. Enter your values into Cohen's Kappa Calculator; sensible, domain-typical defaults are pre-filled so you see a real result immediately.
2. The result recomputes live using the formula shown on the page; there is no button to press.
3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.
Why use Cohen's Kappa Calculator?
- Computes Cohen's Kappa instantly in your browser: no sign-up, no upload, no server round-trip.
- 100% free and unlimited, with the exact formula shown: κ = (pₒ − pₑ) / (1 − pₑ), that is, observed agreement minus chance agreement, normalized by the room above chance.
- Runs entirely client-side, so every value you enter stays private on your device.
- Live recompute as you type, with a worked example and authoritative references for trust.
Frequently asked questions
Why not just report raw agreement percentage?
Because chance inflates it. On a task that's 95% one class, two raters who label at that base rate without ever looking at the items will still agree ~90% of the time. Kappa removes that floor: κ = 0 means agreement is exactly what chance predicts, κ = 1 means perfect agreement, and a negative κ means agreement worse than chance. It's the honest measure of whether your guidelines actually align raters.
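The ~90% figure follows directly from the base rates. As a small sketch (the helper name `chance_agreement` is hypothetical), the chance agreement of two independent binary raters is the probability that both pick class 1 plus the probability that both pick class 0:

```python
def chance_agreement(p1: float, p2: float) -> float:
    """Expected agreement of two independent binary raters.

    p1, p2: each rater's probability of choosing the majority class.
    """
    return p1 * p2 + (1 - p1) * (1 - p2)


# Both raters pick the majority class 95% of the time, independently.
print(f"{chance_agreement(0.95, 0.95):.3f}")

# For comparison: coin-flip raters on a balanced task.
print(f"{chance_agreement(0.50, 0.50):.3f}")
```

The first call gives 0.905 and the second 0.500, which is why the same raw agreement percentage means very different things on skewed and balanced tasks.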
What kappa is 'good enough' for a labeled dataset?
Convention treats κ ≥ 0.8 as reliable enough to trust labels, 0.6–0.8 as usable with caution, and below 0.6 as a sign your annotation guidelines are ambiguous and need revision before scaling labeling. The exact bar depends on task difficulty and stakes.
What is the 'kappa paradox'?
On highly imbalanced data, κ can be low even when observed agreement is very high, because there's little room above chance. This is real, not a flaw — it's telling you the agreement isn't impressive given the imbalance. For such cases, report κ alongside prevalence, or consider Gwet's AC1.
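To make the paradox concrete, the sketch below compares two hypothetical tables that both show 90% observed agreement, one balanced and one heavily skewed. The function name `agreement_stats` is illustrative. The AC1 calculation uses the two-category form of Gwet's chance term, 2π(1 − π), with π the average of the two raters' class-1 rates; treat it as a simplified illustration rather than a full implementation.

```python
def agreement_stats(a: int, b: int, c: int, d: int) -> dict:
    """Observed agreement, Cohen's kappa and Gwet's AC1 for a 2x2 table.

    a/d are the agreeing cells, b/c the disagreeing cells.
    Assumes the table is non-empty and chance agreement is below 1.
    """
    n = a + b + c + d
    p_o = (a + d) / n
    p1 = (a + b) / n  # rater 1's class-1 rate
    p2 = (a + c) / n  # rater 2's class-1 rate

    # Cohen: chance agreement from the product of the marginals.
    pe_kappa = p1 * p2 + (1 - p1) * (1 - p2)
    # Gwet AC1 (two categories): chance term from the average prevalence.
    pi = (p1 + p2) / 2
    pe_ac1 = 2 * pi * (1 - pi)

    return {
        "observed": p_o,
        "kappa": (p_o - pe_kappa) / (1 - pe_kappa),
        "ac1": (p_o - pe_ac1) / (1 - pe_ac1),
    }


balanced = agreement_stats(45, 5, 5, 45)  # classes split 50/50
skewed = agreement_stats(90, 5, 5, 0)     # class 1 chosen 95% of the time

for name, stats in (("balanced", balanced), ("skewed", skewed)):
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

Both tables have 0.90 observed agreement, yet κ is 0.8 for the balanced table and slightly negative for the skewed one, while AC1 stays high in both. Reporting prevalence next to κ lets readers see which situation they are in.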
How does this extend to more than two raters or categories?
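Cohen's kappa itself extends to more than two nominal categories by using a k×k agreement table. For more than two raters use Fleiss' kappa; for ordinal categories use weighted kappa, which penalizes near-miss disagreements less than distant ones. This calculator handles the foundational two-rater, two-category case, the building block for understanding all of them.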
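As a sketch of the ordinal extension, the function below computes kappa for a k×k table with either linear disagreement weights, |i − j| / (k − 1), or plain 0/1 weights, in which case it reduces to ordinary Cohen's kappa. The function name `kappa_kxk` and the 3×3 example table are hypothetical; linear weighting is one common choice, and quadratic weights are another.

```python
from typing import List


def kappa_kxk(table: List[List[int]], linear_weights: bool = True) -> float:
    """Two-rater kappa for a square k x k agreement table.

    table[i][j] counts items rater 1 put in category i and rater 2 in j.
    With linear_weights=False every disagreement costs 1 (Cohen's kappa).
    """
    k = len(table)
    if k < 2 or any(len(row) != k for row in table):
        raise ValueError("table must be square with at least 2 categories")
    n = sum(sum(row) for row in table)
    row_marg = [sum(row) / n for row in table]
    col_marg = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    def weight(i: int, j: int) -> float:
        # Disagreement weight: 0 on the diagonal, larger for distant cells.
        if linear_weights:
            return abs(i - j) / (k - 1)
        return 0.0 if i == j else 1.0

    cells = [(i, j) for i in range(k) for j in range(k)]
    observed = sum(weight(i, j) * table[i][j] / n for i, j in cells)
    expected = sum(weight(i, j) * row_marg[i] * col_marg[j] for i, j in cells)
    return 1 - observed / expected


# Hypothetical ordinal ratings (low / medium / high); all misses are adjacent.
ratings = [
    [20, 5, 0],
    [5, 20, 5],
    [0, 5, 20],
]
print(round(kappa_kxk(ratings, linear_weights=False), 3))
print(round(kappa_kxk(ratings, linear_weights=True), 3))
```

Because every disagreement in this table is between neighboring categories, the weighted value (about 0.709) is higher than the unweighted one (about 0.624).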
Related tools
- PPV / NPV & Prevalence Calculator
- Balanced Accuracy & Youden's J Calculator
- Classification Metrics — Confusion Matrix & Metrics Calculator
- Spam Filter — Confusion Matrix & Metrics Calculator
- Medical Diagnostic Test — Confusion Matrix & Metrics Calculator
- Fraud Detection — Confusion Matrix & Metrics Calculator
- Levenshtein Edit Distance Calculator
- BLEU Score Calculator
Related ML & AI tools
ROC-AUC Calculator (from TPR/FPR points)
Trapezoidal area under the ROC curve from your (FPR, TPR) operating points — the threshold-independent ranking score.
Classification Threshold Cost Calculator
Find the probability cutoff that minimizes expected cost given your false-positive and false-negative penalties.
Silhouette Score Calculator
Cluster cohesion vs separation for one point — the building block of the silhouette metric for choosing K.