Towards Interpretable Hate Speech Detection using Large Language Model-extracted Rationales

Although social media platforms are a prominent arena for users to engage in interpersonal discussions and express opinions, the facade and anonymity offered by social media may allow users to spew hate speech and offensive content. Given the massive scale of such platforms, there arises a need to automatically identify and flag instances of hate speech. Although several hate speech detection methods exist, most of these black-box methods are not interpretable or explainable by design. To address this lack of interpretability, in this paper we propose to use state-of-the-art Large Language Models (LLMs) to extract features in the form of rationales from the input text, which are then used to train a base hate speech classifier, thereby enabling faithful interpretability by design. Our framework effectively combines the textual understanding capabilities of LLMs and the discriminative power of state-of-the-art hate speech classifiers to make these classifiers faithfully interpretable. Our comprehensive evaluation on a variety of English-language social media hate speech datasets demonstrates: (1) the quality of the LLM-extracted rationales, and (2) the surprising retention of detector performance even after training to ensure interpretability. All code and data will be made available at https://github.com/AmritaBh/shield.
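
To make the two-stage idea concrete, the sketch below shows one way such a pipeline could be wired up: an LLM is prompted to extract a rationale from a post, and that rationale is then fed to a conventional classifier. This is not the authors' released code; the model names (gpt2, distilbert-base-uncased-finetuned-sst-2-english), the prompt wording, and the helpers extract_rationale and classify_with_rationale are stand-ins chosen only so the example runs with the Hugging Face transformers library.

```python
from transformers import pipeline

# Stand-in models so the example is runnable; a real setup would use a stronger
# instruction-following LLM and a classifier fine-tuned for hate speech detection.
rationale_generator = pipeline("text-generation", model="gpt2")
base_classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # placeholder classifier
)


def extract_rationale(post: str) -> str:
    """Prompt the LLM for the phrases that make the post potentially hateful."""
    prompt = (
        f"Post: {post}\n"
        "List the exact phrases in the post that could make it hateful or offensive:\n"
    )
    generated = rationale_generator(prompt, max_new_tokens=40, do_sample=False)
    # The pipeline returns the prompt plus its continuation; keep only the continuation.
    return generated[0]["generated_text"][len(prompt):].strip()


def classify_with_rationale(post: str) -> dict:
    """Classify a post using its LLM-extracted rationale as the classifier input."""
    rationale = extract_rationale(post)
    # Fall back to the raw post if the LLM returns an empty rationale.
    prediction = base_classifier(rationale or post)[0]
    return {"rationale": rationale, "label": prediction["label"], "score": prediction["score"]}


if __name__ == "__main__":
    print(classify_with_rationale("an example social media post"))
```

In this sketch the base classifier only ever sees the LLM-extracted rationale, so its prediction is tied to text the LLM surfaced; the actual framework may combine the rationale with the original post differently, but the division of labor between LLM-based rationale extraction and a discriminative classifier is the same.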
