A systematic review of Google’s Perspective, a tool for automated content moderation, reveals that some adjectives are considered more toxic than others.
“As a Black woman, I agree with the previous comment.” This phrase has a 32% probability of being “toxic”, according to Google’s Perspective service. For the phrase “As a French man, I agree with the previous comment” the probability is only 3%.
Perspective was launched in 2017 in an attempt to fight online harassment. Built by Jigsaw, a Google subsidiary, it analyzes blobs of text and produces a measure of toxicity. A toxic contribution is “a rude, disrespectful, or unreasonable comment that is likely to make you leave a discussion,” according to their website.
A systematic review of 198 phrases in English, French and German, showed that toxicity could vary widely depending on the adjectives used to describe a text’s author. Texts that contained the phrase “as a Black person” or “as a gay” were much more likely to be considered toxic than equivalent sentences that used other adjectives.
Perspective is a computer program that makes guesses based on the data it was given. Humans probably labeled comments manually to create a training data set, from which Perspective draws inferences automatically.
The training data sets probably contained several comments where labels of minorities were used as derogatory words, which explains the fallacious correlation between them and toxicity. As a case in point, “Roma” does not produce a high toxicity score, but the derogatory labels used for Roma people do.
Probably because of different training data sets, the same phrases were given vastly different scores in different languages. “My comment might bring something to the debate. This is my story, as a gay man” has a 30% likelihood of being toxic in English, but 71% in German and only 5% in French. German was the language where the differences between adjectives were strongest.
Not only is Perspective unable to make the difference between an adjective used as an identity and the same one used as an expletive, it is also easily fooled by tricks as old as the internet itself.
When the service was announced in 2017, researchers at the University of Washington showed that the system did not withstand simple attacks. Since the early days of the internet, message boards have implemented automated filters for words like “porn”, leading users to replace it with “p0rn” or “pron”. Perspective is no different and gives starkly lower toxicity scores to offensive phrases with light misspellings or strategically inserted punctuation (e.g. “id·iots” instead of “idiots”).
Perspective seems to be vulnerable to homoglyph attacks. The phrase “Your comment is stupid and wrong” has a 93% probability of being toxic. However, replacing the “d” in “stupid” with the lowercase character for the Roman numeral Ⅾ, which represents five hundreds, tricks the system (homoglyphs are characters that look similar but mean different things). “Your comment is stupiⅾ and wrong,” which contains the ⅾ (Roman numeral) character but looks the same to humans, is only 24% likely to be toxic, according to Perspective.
No silver bullet
To its credit, Google acknowledges that Perspective makes mistakes and set up a process to send feedback on toxicity scores. Some companies that use Perspective are well aware of its limitations. The New York Times, for instance, ran an experiment similar to ours and raised awareness of the results among its human moderators.
But with legislation demanding that platforms remove hate speech within 24 hours, as they exist in France and Germany, usage of automated moderation is likely to increase. Mark Zuckerberg claimed on 18 May that Facebook’s automated systems remove “80% of harmful content before anyone sees it”. The rate of false positive was not made public.
Google did not answer the detailed questions we sent them.