By Nicolas Kayser-Bril • firstname.lastname@example.org
A Google service that automatically labels images produced starkly different results depending on the skin tone shown in a given image. The company fixed the issue, but the problem is likely much broader.
In the fight against the novel coronavirus, many countries ordered that citizens have their temperature checked at train stations or airports. The device needed in such situations, the hand-held thermometer, has gone from a specialist item to a common sight.
A branch of Artificial Intelligence known as “computer vision” focuses on automated image labeling. Most computer vision systems were trained on data sets that contained very few images of hand-held thermometers. As a result, they cannot label the device correctly.
In an experiment that went viral on Twitter, AlgorithmWatch showed that Google Vision Cloud, a computer vision service, labeled an image of a dark-skinned individual holding a thermometer “gun”, while a similar image with a light-skinned individual was labeled “electronic device”. A subsequent experiment showed that an image of a dark-skinned hand holding a thermometer was labeled “gun”, and that adding a salmon-colored overlay to the hand was enough for the computer to label the same image “monocular”.
Google has since updated its algorithm. As of 6 April, it no longer returns a “gun” label.
In a statement to AlgorithmWatch, Tracy Frey, director of Product Strategy and Operations at Google, wrote that “this result [was] unacceptable. The connection with this outcome and racism is important to recognize, and we are deeply sorry for any harm this may have caused.”
“Our investigation found some objects were mis-labeled as firearms and these results existed across a range of skin tones. We have adjusted the confidence scores to more accurately return labels when a firearm is in a photograph.” Ms Frey added that Google had found “no evidence of systemic bias related to skin tone.”
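Google did not detail what “adjusted the confidence scores” means in practice, but the general mechanism is common to labeling services: a model emits candidate labels with confidence scores, and a cutoff decides which labels are returned. The sketch below is a hypothetical illustration of that idea, not Google’s actual code; the label names, scores, and thresholds are invented for the example.

```python
# Illustrative sketch (not Google's implementation): predictions arrive as
# (label, confidence) pairs, and a per-label threshold decides which labels
# are returned. Raising the threshold for a sensitive label such as "gun"
# suppresses low-confidence hits.

DEFAULT_THRESHOLD = 0.5

def filter_labels(predictions, per_label_thresholds=None):
    """Keep only labels whose confidence clears their (possibly raised) cutoff."""
    overrides = per_label_thresholds or {}
    return [
        (label, conf)
        for label, conf in predictions
        if conf >= overrides.get(label, DEFAULT_THRESHOLD)
    ]

# Hypothetical raw output for the thermometer image:
raw = [("hand", 0.97), ("gun", 0.61), ("electronic device", 0.55)]

# With the default cutoff, the spurious "gun" label is returned:
print(filter_labels(raw))

# After raising the "gun" threshold, it is filtered out:
print(filter_labels(raw, {"gun": 0.9}))
```

Such an adjustment changes which labels reach users without retraining the underlying model, which is why it can be deployed quickly after a complaint.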
Agathe Balayn, a PhD candidate at the Delft University of Technology researching bias in automated systems, concurs. She tested several images with Google’s service and concluded that the example might be “a case of inaccuracy without a statistical bias.” In the absence of more rigorous testing, it is impossible to say that the system is biased, she wrote.
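The distinction Ms Balayn draws can be made concrete: rigorous testing would compare mislabeling rates across skin tones over many images and check whether the gap is larger than chance. A minimal sketch of such a check, using a standard two-proportion z-test on purely hypothetical audit counts (the numbers below are invented, not real measurements):

```python
import math

def two_proportion_z(mislabels_a, total_a, mislabels_b, total_b):
    """Z statistic for the difference in mislabel rates between two groups.

    A large |z| (roughly above 1.96 at the 5% level) suggests the gap is
    unlikely to be chance; a small |z| is consistent with "inaccuracy
    without a statistical bias".
    """
    p_a = mislabels_a / total_a
    p_b = mislabels_b / total_b
    p_pool = (mislabels_a + mislabels_b) / (total_a + total_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))
    return (p_a - p_b) / se

# Hypothetical counts: 30 mislabels out of 100 images for one group,
# 12 out of 100 for the other.
z = two_proportion_z(30, 100, 12, 100)
print(round(z, 2))  # well above 1.96, so this gap would suggest bias
```

A single image pair, by contrast, offers a sample size of one per group, which is why Ms Balayn cautions against concluding bias from it alone.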
It is easy to understand why computer vision produces different outcomes based on skin complexion. Such systems have processed millions of pictures that were painstakingly labeled by humans (the work you do when you click on the squares containing cars or bridges to prove that you are not a robot, for instance) and draw automated inferences from them.
Computer vision does not recognize objects in the human sense. It relies on patterns that were relevant in the training data. Research has shown that computer vision systems labeled dogs as wolves when they were photographed against a snowy background, and labeled cows as dogs when they stood on beaches.
Because dark-skinned individuals probably featured much more often in scenes depicting violence in the training data set, a computer making automated inferences on an image of a dark-skinned hand is much more likely to label it with a term from the lexical field of violence.
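This failure mode can be illustrated with a toy classifier. The sketch below is not a real vision model: each “image” is reduced to two invented features, a shape score and a background score, and a nearest-neighbor rule labels a new image by its closest training example. Because the training wolves all stand in snow, the background becomes an accidental predictor, just as in the wolf/snow finding above.

```python
# Toy illustration (not a real vision model) of context-driven mislabeling.
# Each "image" is a pair of hypothetical features: (shape score, background score).

TRAINING = [
    # Wolves were photographed in snow (high background score), dogs were not,
    # so "background" becomes an accidental predictor of the label.
    ((0.90, 0.90), "wolf"),
    ((0.80, 0.95), "wolf"),
    ((0.85, 0.10), "dog"),
    ((0.90, 0.20), "dog"),
]

def classify(features):
    """Label a new image with its nearest training example (squared distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(TRAINING, key=lambda ex: dist(ex[0], features))[1]

# A dog photographed against snow: the shape says dog, the background says wolf.
print(classify((0.87, 0.85)))  # the snowy background wins: "wolf"
```

The model never learned what a wolf is; it learned that snow co-occurs with the “wolf” label. The same mechanism, applied to a data set where dark skin co-occurs with violent scenes, yields the “gun” label.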
Other computer vision systems show similar biases. In December, Facebook refused to let an Instagram user from Brazil advertise a picture, arguing that it contained weapons. In fact, it was a drawing of a boy and Formula One driver Lewis Hamilton. Both characters had dark skin.
Labeling errors could have consequences in the physical world. Deborah Raji, a tech fellow at New York University’s AI Now Institute and a specialist in computer vision, wrote in an email that, in the United States, weapon recognition tools are used in schools, concert halls, apartment complexes and supermarkets. In Europe, automated surveillance deployed by some police forces probably uses it as well. Because most of these systems are similar to Google Vision Cloud, “they could easily have the same biases”, Ms Raji wrote. As a result, dark-skinned individuals are more likely to be flagged as dangerous even if they hold an object as harmless as a hand-held thermometer.
Nakeema Stefflbauer, founder and CEO of FrauenLoop, a community of technologists with a focus on inclusivity, wrote in an email that bias in computer vision software would “definitely” impact the lives of dark-skinned individuals. Because the rate of mis-identification is consistently higher for women and dark-skinned people, the spread of computer vision for surveillance would disproportionately affect them, she added.
Referring to the examples of Ousmane Bah, a teenager who was wrongly accused of theft at an Apple Store because of faulty face recognition, and of Amara K. Majeed, who was wrongly accused of taking part in the 2019 Sri Lanka bombings after her face was misidentified, Ms Stefflbauer foresees that, absent effective regulation, whole groups could end up avoiding certain buildings or neighborhoods. Individuals could face de facto restrictions in their movements, were biased computer vision to be more widely deployed, she added.
In her statement, Ms Frey, the Google director, wrote that fairness was one of Google’s “core AI principles” and that they were “committed to making progress in developing machine learning with fairness as a critical measure of successful machine learning.”
But Google’s image recognition tools have returned racially biased results before. In 2015, Google Photos labeled two dark-skinned individuals “gorillas”. The company apologized but, according to a report by Wired, did not fix the issue. Instead, it simply stopped returning the “gorilla” label, even for pictures of that specific mammal.
That technology companies still produce racially biased products can be explained by at least two reasons, according to AI Now’s Deborah Raji. Firstly, their teams are overwhelmingly white and male, making it unlikely that results that discriminate against other groups will be found and addressed at the development stage. Secondly, “companies are now just beginning to establish formal processes to test for and report these kinds of failures in the engineering of these systems,” she wrote. “External accountability is currently the main method of alerting these engineering teams,” she added.
“Unfortunately, by the time someone complains, many have already been disproportionately impacted by the model’s biased performance.”
8 April: Edited to add Ms Balayn’s statement.