Evaluation & Benchmarks

How we measure whether AI actually works: the science, and the real difficulty, of knowing whether a model is any good.

Evaluation & Benchmarks is one of the core areas in the AI University map of AI. Explore the diagram, then dive into each topic — every subtopic grows into its own deep-dive over time.

flowchart LR
  M[Model] --> BENCH[Benchmarks] --> SCORE[Scores]
  M --> HUMAN[Human ratings] --> SCORE
  M --> JUDGE[LLM-as-judge] --> SCORE
  SCORE --> DECIDE{Ship or iterate?}

Key topics

Metrics

Accuracy, precision/recall, F1, BLEU/ROUGE, perplexity — and when each misleads.
Benchmarks & leaderboards

MMLU, GSM8K, HumanEval, MMMU and friends — standardized tests, and how they get gamed.
LLM-as-judge

Using strong models to grade outputs, with their biases and calibration issues.
Human evaluation

Preference ratings, head-to-head arenas (Elo), and inter-annotator agreement.
Red-teaming & safety evals

Probing for harmful, jailbroken, or unsafe behavior before release.
Contamination & validity

Test-set leakage, overfitting to benchmarks, and building evals you can trust.
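
Two of the topics above lend themselves to a short sketch: how accuracy can mislead on imbalanced data (Metrics), and a crude check for test-set leakage (Contamination & validity). The Python below is illustrative only. The function names, the toy data, and the choice of word-level 5-grams are assumptions made for this example, not part of any particular evaluation library, and real contamination checks are usually more careful about tokenization and normalization.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


def word_ngrams(text, n):
    """Set of lowercase, whitespace-tokenized word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item, corpus_text, n=5):
    """Fraction of the test item's n-grams that also occur in corpus_text.

    A value near 1.0 suggests the item may have leaked into the corpus.
    Items shorter than n words have no n-grams and score 0.0.
    """
    item_grams = word_ngrams(test_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & word_ngrams(corpus_text, n)) / len(item_grams)


if __name__ == "__main__":
    # Hypothetical imbalanced test set: 95 negatives, 5 positives,
    # scored against a "model" that always predicts the negative class.
    y_true = [0] * 95 + [1] * 5
    y_pred = [0] * 100
    acc, prec, rec, f1 = precision_recall_f1(y_true, y_pred)
    print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
    # prints: accuracy=0.95 precision=0.00 recall=0.00 f1=0.00

    # Hypothetical test question and two made-up training snippets.
    question = "what is the capital of the fictional country of zubrowka"
    leaked = "forum post: what is the capital of the fictional country of zubrowka anyway"
    clean = "the capital of france is paris"
    print(overlap_fraction(question, leaked))  # prints: 1.0
    print(overlap_fraction(question, clean))   # prints: 0.0
```

The always-negative model scores 95% accuracy while finding none of the positives, which is the kind of failure precision, recall, and F1 expose. The overlap check only flags verbatim reuse; paraphrased or translated leakage would pass it unnoticed.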

Related areas

NLP & Large Language Models · AI Safety, Alignment & Ethics · Building with AI

Learn this properly

Want hands-on training in evaluation & benchmarks? Explore AI University courses and AI School camps for kids.