Reducing offensive content in NLP

NLP models, like the GPT-3 or the Gopher, — generate sometimes offensive content. It limits usage of them in real life. The Red Teaming (RT) reduces harmful outputs of the NLP model without expensive human annotation.

Teemu Maatta
3 min readMar 8, 2022
Photo by Nick Fewings on Unsplash

The NLP models are meant to interact with real people. The GPT-3 and the Gopher are State-of-the-Art NLP models. Yet, both of them produce sometimes harmful content.

In real life — such biased models are risky. A bad actor can use these NLP systems to generate toxic speech.

Harmful content may vary from toxic speech or political views to sharing personal information or stereotypes.

There are options available to mitigate these risks. We will discuss first the Human in the loop-approach. Then, we will discuss the Red Teaming-approach.

Human in the loop

There are benefits in obtaining human feedback from the NLP models. People don’t like to rely on Black box-algorithm.

Yet it is complex to understand NLP models. The Gopher model ingests 10.5 TB of text. No human is able to review such large datasets. We are neither able to comprehend its 280 billion model parameters. Yet, we are able to review, if its outputs are offensive or not.

The human annotation is useful. We can use it to detect harmful outputs. We can hire people to review the NLP model outputs. If they detect harmful output, we can exclude it.

Yet it is expensive to hire people perform this work. The OpenAI asked users to provide feedback about the NLP outputs. OpenAI hired besides human annotators to review model outputs. This way OpenAI automated part of the human annotation.

The downside remains the poor scalability of this approach. The approach only reviews small amount of the possible NLP model outputs.

Yet, the definition of harmful may change over time. Thus we prefer an approach, that we can upgrade over time and scale it to larger datasets.

A classification algorithm has advantages over human annotation. We can feed to a classifier unlimited amount of NLP outputs. We will be able to pick larger set of offensive outputs. It will further improve the classifier. This increases the amount of offensive outputs detected.

So, we will identify new categories of offensive content. Let’s review in detail such system.

The Red Teaming

The Red Teaming (RT) is an adversary approach to fix systems like NLP models. The basic idea is to create malicious outputs with the NLP model. Then, we exclude these malicious outputs.

Let’s review this process step by step.

We start by creating a RT classifier to detect harmful outputs. There are multiple ways of creating classifiers. We may have clearly defined dataset, where the categories are already divided into offensive vs non offensive content. Yet, optimally the classifier will learn to separate these categories on its own.

The following step is to generate outputs using our NLP model. For example we generate text using the GPT-3 or the Gopher model.

The RT classifier sorts these outputs as harmful vs non harmful.

If we detect offensive output, we can exclude it using two methods. We can train our NLP model without such harmful examples. This way the NLP model will not include such data. We can alternative add such offensive content to the model blacklist. This way we will not use them, when generating the outputs.

The RT is not meant to replace human judgement. Yet, the approach is preventive method to discover harmful content prior human feedback. This is particularly useful in helping to make the NLP models more difficult to use in wrong way.

Last update March 2022

--

--

Teemu Maatta

Author (+200k views) in Artificial General Intelligence. Autonomous Agents. Robotics. Madrid.