Interpretability & Explainability

Opening the black box — understanding why a model made a prediction, and what it has actually learned inside.

Interpretability & Explainability is one of the core areas in the AI University map of AI. Explore the diagram, then dive into each topic — every subtopic grows into its own deep-dive over time.

flowchart LR
  IN[/Input/] --> MODEL[[Neural network]] --> PRED[/Prediction/]
  MODEL -. attributions .-> WHY[Why this output?]
  MODEL -. circuits .-> WHAT[What did it learn?]

Key topics

Feature attribution

Which inputs mattered? SHAP, LIME, integrated gradients, and saliency maps.

Probing representations

Testing what information is encoded in a model's internal activations.

Mechanistic interpretability

Reverse-engineering circuits and features inside networks: induction heads, superposition, sparse autoencoders.

Concept-based explanations

Explaining models in terms of human-understandable concepts.

Global vs local

Explaining one prediction vs a model's overall behavior.

Faithfulness

The hard question of whether an explanation reflects the true reason for a decision.
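
To make the feature attribution topic above a little more concrete, here is a minimal sketch of integrated gradients, one of the methods named in that entry. Everything in it is an assumption made for illustration: the toy logistic model, the weights `w` and `b`, the all-zeros baseline, and the helper names `model`, `model_grad`, and `integrated_gradients`. It uses only NumPy and an analytic gradient, so it is not how any particular library implements the method.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model(x, w, b):
    """Toy differentiable model (hypothetical): logistic regression."""
    return sigmoid(x @ w + b)

def model_grad(x, w, b):
    """Analytic gradient of the model output with respect to the input x."""
    p = model(x, w, b)
    return p * (1.0 - p) * w

def integrated_gradients(x, baseline, w, b, steps=200):
    """Approximate integrated gradients with a midpoint Riemann sum.

    Gradients are averaged along the straight line from the baseline to x,
    then scaled by (x - baseline), giving one attribution per input feature.
    """
    alphas = (np.arange(steps) + 0.5) / steps             # midpoints in (0, 1)
    path = baseline + alphas[:, None] * (x - baseline)    # shape (steps, n_features)
    grads = np.stack([model_grad(p, w, b) for p in path])
    return (x - baseline) * grads.mean(axis=0)

# Illustrative values only; the third feature has zero weight on purpose.
w = np.array([2.0, -1.0, 0.0])
b = 0.1
x = np.array([1.0, 0.5, 3.0])
baseline = np.zeros_like(x)

attributions = integrated_gradients(x, baseline, w, b)
gap = model(x, w, b) - model(baseline, w, b)

print("attributions:", attributions)
print("sum of attributions:", attributions.sum())
print("f(x) - f(baseline):", gap)
```

In this sketch the attributions should sum, up to numerical error, to the difference between the model output at `x` and at the baseline, and the feature with zero weight should receive zero attribution. Checks like these touch on the faithfulness topic as well: they are necessary sanity conditions, not proof that an explanation captures the true reason for a decision.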

Related areas

Deep Learning · AI Safety, Alignment & Ethics · AI Ethics & Governance

