HomeOVERVIEWWhispering Experts: Toxicity Mitigation in Pre-trained Language Models by Dampening Expert Neurons

Whispering Experts: Toxicity Mitigation in Pre-trained Language Models by Dampening Expert Neurons

An important issue with Large Language Models (LLMs) is their undesired ability to generate toxic language. In this work, we show that the neurons responsible for toxicity can be determined by their power to discriminate toxic sentences, and that toxic language can be mitigated by reducing their activation levels proportionally to this power. We propose AUROC adaptation (AURA), an intervention that can be applied to any pre-trained LLM to mitigate toxicity. As the intervention is proportional to the ability of each neuron to discriminate toxic content, it is free of any model-dependent hyperparameters. We show that AURA can achieve up to 2.2× reduction in toxicity with only a 0.72 perplexity increase. We also show that AURA is effective with models of different scale (from 1.5B to 40B parameters), and its effectiveness in mitigating toxic language, while preserving common-sense zero-shot abilities, holds across all scales. AURA can be combined with pre-prompting strategies, boosting its average mitigation potential from 1.28× to 2.35×. Moreover, AURA can counteract adversarial pre-prompts that maliciously elicit toxic content, making it an effective method for deploying safer and less toxic models.
Figure 1: Results pre-prompting Falcon-7B-instruct with a pre-prompt that induces toxicity. AURA mitigates toxicity even when the pre-prompt is adversarial.
Figure 2: AurA reduces toxicity of off-the-shelf LLMs by up to 2.2x with minimal impact in perplexity.

Latest articles

Newbury BS cuts resi, expat, landlord rates by up to 30bps  – Mortgage Strategy

Newbury Building Society has cut fixed-rate offers by up to 30 basis points...

Rate and Term Refinances Are Up a Whopping 300% from a Year Ago

What a difference a year makes.While the mortgage industry has been purchase loan-heavy for...

Goldman Sachs loses profit after hits from GreenSky, real estate

Second-quarter profit fell 58% to $1.22 billion, or $3.08 a share, due to steep...

Why Do AIs Lie?

Zeroth Principles can clarify many issues in the ML/AI domain. As discussed in a...

More like this

OpenAI and the Quest for Ethical AI: Balancing Innovation with Responsibility

Hey there! Last week, I stumbled upon an old sci-fi book I used to...

#432 – Kevin Spacey: Power, Controversy, Betrayal, Truth & Love in Film and Life

Podcast: Play in new window | DownloadSubscribe: Spotify | TuneIn | Kevin Spacey is...

Advanced AI is Redefining the Future of Employee Experience

Creating a futuristic digital workplace is crucial for any organization that’s aiming to attract,...