How to use and interpret activation patching

Activation patching is a popular mechanistic interpretability technique, but has many subtleties regarding how it is applied and how one may interpret the results. We provide a summary of advice and best practices, based on our experience using this technique in practice. We include an overview of the different ways to apply activation patching and a discussion on how to interpret the results. We focus on what evidence patching experiments provide about circuits, and on the choice of metric and associated pitfalls.
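As a concrete illustration of the basic procedure, the following is a minimal sketch of one common variant: cache activations from a run on a clean input, copy them one component at a time into a run on a corrupted input, and measure how much of the clean behavior is restored. The toy two-layer network, its weights, the function names, and the normalized logit-difference metric are all assumptions made for illustration; they are not taken from the paper.

```python
# Hypothetical toy model: 2 inputs -> 3 ReLU hidden units -> 2 output logits.
# The weights are made up for illustration only.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hidden x input
W2 = [[2.0, 0.0, 0.5], [0.0, 2.0, 0.5]]    # output x hidden


def hidden(x):
    """Hidden-layer activations (ReLU) for input x."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]


def logit_diff(h):
    """Metric: logit of answer 0 minus logit of answer 1, given activations h."""
    out = [sum(w * hi for w, hi in zip(row, h)) for row in W2]
    return out[0] - out[1]


def patch_hidden_units(clean_x, corrupted_x):
    """Patch each hidden unit from the clean run into the corrupted run.

    Returns one score per unit: 0.0 means patching changed nothing,
    1.0 means it fully restored the clean logit difference.
    Assumes the clean and corrupted logit differences are not equal.
    """
    clean_h = hidden(clean_x)          # cached clean activations
    corrupted_h = hidden(corrupted_x)  # baseline corrupted activations
    clean_ld = logit_diff(clean_h)
    corrupted_ld = logit_diff(corrupted_h)

    scores = []
    for i in range(len(clean_h)):
        patched_h = list(corrupted_h)
        patched_h[i] = clean_h[i]      # overwrite a single unit
        patched_ld = logit_diff(patched_h)
        scores.append((patched_ld - corrupted_ld) / (clean_ld - corrupted_ld))
    return scores


scores = patch_hidden_units(clean_x=[1.0, 0.0], corrupted_x=[0.0, 1.0])
print(scores)
```

In this toy setup the third hidden unit takes the same value on both inputs, so patching it has no effect on the metric, while each of the other two units restores part of the clean logit difference. The scores only say which units carry the information that differs between the two inputs, which is one reason the choice of clean and corrupted inputs matters.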

Further reading