Interpretability & Explainability
Opening the black box — understanding why a model made a prediction, and what it has actually learned inside.
Interpretability & Explainability is one of the core areas in the AI University map of AI. Explore the diagram, then dive into each topic — every subtopic grows into its own deep-dive over time.
flowchart LR
IN[/Input/] --> MODEL[[Neural network]] --> PRED[/Prediction/]
MODEL -. attributions .-> WHY[Why this output?]
MODEL -. circuits .-> WHAT[What did it learn?]
Key topics
- Feature attribution: Which inputs mattered? Methods include SHAP, LIME, integrated gradients, and saliency maps.
- Probing representations: Testing what information is encoded in a model's internal activations.
- Mechanistic interpretability: Reverse-engineering the circuits and features inside networks, with topics such as induction heads, superposition, and sparse autoencoders.
- Concept-based explanations: Explaining models in terms of human-understandable concepts.
- Global vs local: Explaining a model's overall behavior (global) versus a single prediction (local).
- Faithfulness: The hard question of whether an explanation reflects the true reason for a decision.
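As a concrete illustration of feature attribution and of the faithfulness question, here is a minimal sketch of integrated gradients in plain Python. The toy logistic model, its hand-picked `WEIGHTS` and `BIAS`, the all-zero baseline, and the step count are assumptions made up for this example and do not come from any particular library. The sketch approximates the path integral with a midpoint Riemann sum, then checks the completeness property: the attributions should sum to roughly the model's output at the input minus its output at the baseline.

```python
import math

# Hypothetical toy model: logistic regression with hand-picked parameters.
WEIGHTS = [2.0, -1.0, 0.0]
BIAS = 0.5


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x):
    """Scalar prediction of the toy model for one input vector."""
    return sigmoid(sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS)


def model_grad(x):
    """Analytic gradient of the prediction with respect to each input."""
    p = model(x)
    return [p * (1.0 - p) * w for w in WEIGHTS]


def integrated_gradients(x, baseline, steps=200):
    """Approximate integrated gradients along the straight path
    from baseline to x, using a midpoint Riemann sum."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = model_grad(point)
        total = [t + g for t, g in zip(total, grad)]
    # Scale the averaged gradient by the input's distance from the baseline.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


x = [1.0, 1.0, 3.0]
baseline = [0.0, 0.0, 0.0]  # assumed "no signal" reference input

attributions = integrated_gradients(x, baseline)

# Completeness check: attributions should sum to f(x) - f(baseline).
completeness_gap = abs(sum(attributions) - (model(x) - model(baseline)))

print("attributions:", attributions)
print("completeness gap:", completeness_gap)
```

In this toy setup the third feature has zero weight, so it receives zero attribution, while the first and second features receive attributions of opposite sign. A small completeness gap is only a sanity check: it rules out one kind of inconsistency but does not by itself show that an explanation reflects the true reason for a decision.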
Related areas
Deep Learning · AI Safety, Alignment & Ethics · AI Ethics & Governance