Evaluation & Benchmarks
How we measure whether AI actually works: the science, and the real difficulty, of knowing whether a model is any good.
Evaluation & Benchmarks is one of the core areas in the AI University map of AI. Explore the diagram, then dive into each topic — every subtopic grows into its own deep-dive over time.
```mermaid
flowchart LR
    M[Model] --> BENCH[Benchmarks] --> SCORE[Scores]
    M --> HUMAN[Human ratings] --> SCORE
    M --> JUDGE[LLM-as-judge] --> SCORE
    SCORE --> DECIDE{Ship or iterate?}
```
Key topics
- Metrics
  Accuracy, precision/recall, F1, BLEU/ROUGE, perplexity, and when each misleads.
- Benchmarks & leaderboards
  MMLU, GSM8K, HumanEval, MMMU and friends: standardized tests, and how they get gamed.
- LLM-as-judge
  Using strong models to grade outputs, with their biases and calibration issues.
- Human evaluation
  Preference ratings, head-to-head arenas (Elo), and inter-annotator agreement.
- Red-teaming & safety evals
  Probing for harmful, jailbroken, or unsafe behavior before release.
- Contamination & validity
  Test-set leakage, overfitting to benchmarks, and building evals you can trust.
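To make the Metrics topic concrete, here is a minimal sketch of how accuracy can mislead on imbalanced data while precision, recall, and F1 expose the problem. The function name `classification_metrics`, the two toy "models", and the label data are all hypothetical and made up for illustration; reporting undefined ratios as 0.0 is one common convention, not the only one.

```python
from typing import Dict, Sequence


def classification_metrics(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Assumption: ratios with a zero denominator are reported as 0.0.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical toy labels: 95 negatives followed by 5 positives.
y_true = [0] * 95 + [1] * 5

# "Model A" never predicts the positive class.
always_negative = [0] * 100

# "Model B" finds 4 of the 5 positives but raises 6 false alarms.
detector = [1] * 6 + [0] * 89 + [1] * 4 + [0] * 1

metrics_a = classification_metrics(y_true, always_negative)
metrics_b = classification_metrics(y_true, detector)

for name, metrics in (("always_negative", metrics_a), ("detector", metrics_b)):
    summary = ", ".join(f"{key}={value:.2f}" for key, value in metrics.items())
    print(f"{name}: {summary}")
```

On this toy data the model that never finds a positive example has the higher accuracy (0.95 versus 0.93), yet its recall and F1 are both 0. That is why accuracy alone is a poor choice when classes are imbalanced.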
Related areas
NLP & Large Language Models · AI Safety, Alignment & Ethics · Building with AI