
Scenario: Statistical Aberrations

Christina B. Class & Stefan Ullrich

A little over a year ago, Alex completed his master’s thesis on artificial intelligence and facial recognition. His customizable, self-learning method substantially improved previous results for real-time facial recognition. Last year, after he presented his paper at a conference, including a live proof of concept on stage, he was approached by the head of AI Research and Development at EmbraceTheFuture GmbH. The company was founded three years ago and specializes in the development of custom software systems, especially intelligent systems and security systems. After graduation, Alex took a short vacation and then accepted a position at EmbraceTheFuture GmbH.

He’s currently working in a small team to develop facial recognition software for a new security system called “QuickPicScan” that will be used at airports by the German Federal Police. The faces of passengers at security checkpoints will be compared in real-time with mugshots of fugitives so that suspicious individuals can be singled out and subjected to more intense scrutiny. Authorities hope that this will allow them to identify passengers with warrants within the Schengen area, where there are no passport controls at the borders.

It’s also designed to accelerate the rate at which people are processed through security checkpoints. The system was trained on millions of images. Mugshots and images of criminal suspects are stored in a database that is accessed and updated whenever a new image is captured, so the system can easily be kept up to date with the most recent search warrants. At the airport, low-resolution photos of all passengers are taken as soon as they pass through security.

Whenever the software detects a match, the metal detector is triggered to sound the same alarm used when it detects metal. However, while the passenger is subject only to the routine search, a high-resolution photo is snapped under improved lighting. That image is again run through the system for potential matching. It isn’t until this second test produces a positive result that the passenger is taken aside and subjected to a more thorough search in a separate room where particulars are compared. The results of the second test are displayed on a control terminal. The photos of the passengers are not saved—there’s a separate team assigned to guarantee that these photos are deleted from the main memory and cannot be accessed externally. QuickPicScan was tested extensively in simulations and with actors in a studio set-up staged to replicate the security checkpoint.

Based on these tests, the team estimates a false negative rate of 1%: of every 100 people who should be singled out for closer scrutiny, only one goes undetected. The false positive rate, the share of people incorrectly classified as suspicious, is less than 0.1%. Marketing director Sabine is delighted with these results. A margin of error of 0.1% for falsely targeted innocent subjects: that’s spectacular!
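A minimal sketch, in Python, of how these two rates are commonly defined; the counts used here are invented for illustration and are not QuickPicScan data:

```python
# Illustrative definitions of the two error rates mentioned above.
# The example counts are made up; they are not the QuickPicScan figures.

def false_negative_rate(false_negatives: int, true_positives: int) -> float:
    """Share of genuinely wanted persons the system fails to flag."""
    return false_negatives / (false_negatives + true_positives)

def false_positive_rate(false_positives: int, true_negatives: int) -> float:
    """Share of innocent passengers the system wrongly flags."""
    return false_positives / (false_positives + true_negatives)

# 1 missed match among 100 wanted persons -> 1% false negative rate;
# 100 wrong alarms among 100,000 innocent passengers -> 0.1% false positive rate.
print(false_negative_rate(1, 99))        # 0.01
print(false_positive_rate(100, 99_900))  # 0.001
```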

To test the system under real-world conditions, the company is coordinating with the police to conduct a two-month test run in the summer at a small airport that serves approximately 400,000 passengers per year. One of the client’s employees monitors the control terminal. Mugshots of 370 actors, of varying quality and in various poses, are fed into the system.

During the two-month testing period, the actors pass through the security checkpoint 1,500 times, at randomly selected times determined in advance. After passing through the checkpoint, they identify themselves at the control terminal so the system’s results can be verified. Since the two-month period falls within the summer vacation, only 163,847 passengers are checked. The system incorrectly flags 183 passengers as suspicious. In eight of the 1,500 checkpoint passes logged by the actors, it fails to recognize the match.

Project manager Viktor is thrilled. While the false positive rate of 0.11% is slightly higher than initially hoped, the false negative rate of 0.53% is substantially lower than anticipated. EmbraceTheFuture GmbH goes to press with these numbers and a margin of error of 0.11%. The police announce that the system will soon be operational at a terminal of a major airport.
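These rates follow directly from the test numbers. A short Python sketch recomputes them and, assuming a hypothetical major airport terminal handling roughly 50,000 passengers a day (a figure not taken from the scenario), projects how many wrong alarms that rate would imply:

```python
# Recomputing the rates reported after the two-month test run.
innocent_checked = 163_847   # regular passengers screened during the test
wrongly_flagged = 183        # innocent passengers flagged as suspicious
actor_passes = 1_500         # checkpoint passes by the enrolled actors
missed_matches = 8           # actor passes the system failed to recognize

fp_rate = wrongly_flagged / innocent_checked   # ~0.0011
fn_rate = missed_matches / actor_passes        # ~0.0053
print(f"false positive rate: {fp_rate:.2%}")   # 0.11%
print(f"false negative rate: {fn_rate:.2%}")   # 0.53%

# Projection for a hypothetical terminal handling 50,000 passengers per day
# (an assumed figure, not one given in the scenario):
passengers_per_day = 50_000
print(f"expected wrong alarms per day: {fp_rate * passengers_per_day:.0f}")  # ~56
```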

That evening, Alex gets together with one of his old school friends, Vera, who happens to be in town. She is a history and math teacher. After Vera has brought Alex up to speed on the latest developments in her life and love interests, he gushes to her about his project and tells her about the press conference. Vera’s reaction is rather critical: she’s not keen on automatic facial recognition. They’d often argued about this while he was completing his master’s degree. Alex is eager to tell her how low the margins of error are, about the increased security and the potential for ferreting out individuals who’ve gone into hiding. Vera looks at him skeptically. She doesn’t consider the margin of error low. 0.11%? At a large airport, dozens of people will be singled out for closer inspection every day. And that is no laughing matter, in her view.

She also wonders how many people who’ve had their mugshots taken will likely be boarding a plane. But Alex doesn’t want to hear about it and goes on a tangent outlining details about the algorithm he developed as part of his master’s thesis…

A few months later, the system is installed at AirportCityTerminal. Security officials are trained to use it, and the press reports a successful launch. A couple of days later, Alex flies out of AirportCityTerminal. He’s already looking forward to passing through the QuickPicScan check, basking in the knowledge that he has contributed to improving security. But no sooner has he stepped through the security gate than the metal detector starts beeping. He’s asked to stretch out his arms and to place his feet, one after the other, on a stool, all while staring straight ahead. He peers at the security guard’s screen to his right and sees the tiny light of the QuickPicScan monitor blinking. Let’s hope this doesn’t take long; he’s cutting it close with his flight. They won’t wait for him since he hasn’t checked any bags, and he can’t afford to miss this flight. He’s taken to a separate room and asked to keep his papers ready while he stands there opposite a security guard. Alex tries to hand the guard his passport, but is told to wait: this guard isn’t the one in charge, and his colleague will be by shortly to take care of it. Alex is growing impatient.

He asks them to confirm his identity and is told no—it can’t be done because the officer on duty doesn’t have access credentials for the new system. It takes a full eight minutes for the right person to show up. Once his identity has been confirmed, it’s clear that Alex is not a wanted fugitive.

But his bags are nevertheless subject to meticulous search. “It’s protocol,” the woman in charge tells him. Alex is getting antsy. He’s probably going to miss his flight. Suddenly, he’s reminded of the conversation he had with Vera.

“Does this happen a lot?” he asks, feigning politeness.

“A couple dozen a day, I suppose,” she says as she walks him back to the terminal.

Questions:

  1. Alex was falsely identified as a “suspect” and missed his flight. This is referred to as a “false positive.” How much collateral damage from “false positives” will we take in stride? What kinds of fallout can falsely identified people be expected to accept? How would compensation for such instances be regulated?

  2. People make mistakes, too. Under similar circumstances, Alex might just as well have been singled out for closer inspection by a human security agent. In principle, does it really make a difference whether it’s human error or machine error?

  3. People are prejudiced. For example, it’s well known that men who appear to be foreigners are checked more frequently. What are the chances that software systems will reduce this type of discrimination?

  4. Self-learning algorithms require training data, so their results depend heavily on that data. This can lead to discrimination being built into the algorithm itself.

  5. It’s also conceivable, for example, that facial recognition for certain groups of people is less precise because fewer images of them are available in the training data. This may involve anything from skin color to age, gender, facial hair, etc. A system like the one presented here could lead to an excessive number of people with certain physical features being singled out for closer inspection. What can be done to eliminate the potential for discrimination in training data? How might systems be tested for discrimination?

  6. Is there a conceptual difference between manifest discrimination built into a system and human discrimination? Which of the two is more easily identified?

  7. People tend to readily trust software-generated solutions and relinquish personal responsibility. Does that make discrimination by technical systems all the more dangerous? What are the possibilities for raising awareness about these matters? Should consciousness-raising efforts be introduced to schools, and if so, what form should this take? Is that an integral component of digital competency for the future?

  8. Figures for false positive and false negative rates are often given in percentages. So, margins of error under one percent don’t sound that bad at first glance. People frequently find it difficult to imagine how many individuals would be affected in real life and what the consequences and impact may be. The figures are also often placed side by side without relating them to the number of positives (in our case, the people who appear in the mugshot database) and negatives (in our case, the rest of the passengers). That ratio is usually starkly unbalanced. In the test run described here, with a total of 163,847 passengers, there were 1,500 positives (the actors’ checkpoint passes), so roughly one positive for every hundred passengers (about 1:100); in regular operation the imbalance would be far greater, as the sketch following these questions illustrates. Is this comparison misleading? Should these kinds of figures even show up in product descriptions and marketing brochures? Is it ethical for the responsible parties at EmbraceTheFuture GmbH to go to press with this? Are there other means of measuring margins of error? How can the error rate be represented so systems can be realistically assessed?
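A back-of-the-envelope sketch, in Python, of the base-rate effect behind question 8. It uses the rates published after the test run together with an assumed prevalence of genuinely wanted persons (one in 100,000 passengers); that prevalence and the passenger total are illustrative assumptions, not figures from the scenario:

```python
# Illustrative base-rate calculation for question 8.
prevalence = 1 / 100_000      # assumed share of genuinely wanted passengers
fp_rate = 0.0011              # false positive rate published after the test (0.11%)
fn_rate = 0.0053              # false negative rate published after the test (0.53%)

passengers = 1_000_000        # an arbitrary round number of screened passengers
wanted = passengers * prevalence                 # 10 wanted persons
true_alarms = wanted * (1 - fn_rate)             # ~9.95 correctly flagged
false_alarms = (passengers - wanted) * fp_rate   # ~1,100 innocent people flagged

ppv = true_alarms / (true_alarms + false_alarms)  # positive predictive value
print(f"flagged passengers: {true_alarms + false_alarms:.0f}")            # ~1110
print(f"share of flagged passengers who are actually wanted: {ppv:.1%}")  # ~0.9%
```

Under these assumptions, fewer than one in a hundred alarms concerns a genuinely wanted person, which is the imbalance the question points to.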

Published in Informatik Spektrum 42(5), 2019, pp. 367–369, doi: 10.1007/s00287-019-01213-x

Translated from German by Lillian M. Banks
