How Inter-Annotator Agreement Drives Confidence in Federal AI

With IAA, Agencies Can Measure and Monitor AI Capabilities

When it comes to visual data, artificial intelligence is very good – but it isn’t perfect yet.

IEEE reports on research in which an algorithm was tricked into identifying a cat as an airplane. And beyond ‘adversarial images’ that were purposely manipulated to confuse computer systems, researchers are also cataloguing naturally occurring examples that can reduce computer vision accuracy by 90% or more.   

“We have systems that require millions of images, and people tend to believe that the AI will somehow just figure it out,” said Dr. Marc Bosch, a computer vision senior principal at Accenture Federal Services, where researchers are investigating the science of Inter-Annotator Agreement (IAA) to help bring clarity to the situation. “IAA gives you a way to measure and monitor the process.”

Human judgements

The federal government is investing heavily in AI to speed routine processes and free up human labor for higher-value tasks. Agencies are looking to spend almost $1 billion on artificial intelligence R&D, according to a supplemental report to the president’s budget request.

Citizens expect accountability and transparency around AI uses, and they expect the outputs to be correct. “As the technology advances, we will need to develop rigorous scientific testing that ensures secure, trustworthy and safe AI,” according to the National Institute of Standards and Technology.

Computers learn by ingesting images and data that have been labelled by humans. Problems may arise from the way those images are classified, since humans are intrinsically biased – and not necessarily in the pejorative sense of being prejudiced against individuals or groups. Rather, objective “truth” can simply be hard to define in certain situations.

One way to reduce bias and ensure higher accuracy is to have multiple annotators work on the same data set. Even then, different individuals may view the same object in different ways. IAA offers a quantifiable way to note those points of divergence, to guide that process, and to help improve the standards and criteria for annotation.
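As an illustration of what “quantifiable” can mean here, the sketch below computes Cohen’s kappa, one widely used inter-annotator agreement statistic, for two annotators labeling the same images. The article does not specify which metrics Accenture’s team uses; the labels and the scikit-learn call are purely illustrative.

```python
# Minimal sketch: quantifying divergence between two annotators with
# Cohen's kappa, one common inter-annotator agreement statistic.
# The labels below are invented for illustration.
from sklearn.metrics import cohen_kappa_score

# Two annotators label the same ten images (e.g., vehicle classes).
annotator_a = ["truck", "truck", "car", "tank", "car", "truck", "car", "tank", "truck", "car"]
annotator_b = ["truck", "car",   "car", "tank", "car", "truck", "car", "car",  "truck", "car"]

# Raw percent agreement ignores agreement that would happen by chance.
raw_agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)

# Cohen's kappa corrects for chance agreement; 1.0 is perfect, 0 is chance-level.
kappa = cohen_kappa_score(annotator_a, annotator_b)

print(f"Raw agreement: {raw_agreement:.0%}")   # 80%
print(f"Cohen's kappa: {kappa:.2f}")
```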

“One of the biggest limitations that we have on the accuracy of the system is the training data that goes into them,” said Dr. Ian McCulloh, chief data scientist at Accenture Federal Services. “Humans have gone through and made judgements: ‘Here is a broken femur.’ Sometimes two assessments don’t agree, and that affects the precision of the algorithm.”

Research published by Bosch and McCulloh (with Joseph Nassar and Viveca Pavon-Harr) found that failing to check the annotation process can result in a significant loss of algorithm precision. By highlighting the places where human annotators reached different conclusions in the training data, IAA makes it possible to fine-tune the system. Agencies may retrain or replace an annotator who often gets it wrong and can use IAA to measure improvements.

IAA helps strike a balance between precision (an object labeled a truck really is a truck) and recall (finding all the trucks in an image). Simply excluding an inconsistent annotator improves precision, but it shrinks the pool of labeled data and reduces recall. IAA balances the scales, because it improves label quality while agencies continue to accumulate data. By leveraging IAA up front, you add precision without building up the technical debt that limits long-term performance.
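To make that trade-off concrete, here is a small arithmetic sketch with invented counts: discarding a noisy annotator’s labels removes false positives (raising precision) but also shrinks the training signal, so the model misses more real objects (lowering recall). The numbers are hypothetical, not results from the research described above.

```python
# Illustrative arithmetic only: how dropping labels can trade recall for precision.
# All counts are hypothetical.

def precision_recall(true_positives, false_positives, false_negatives):
    precision = true_positives / (true_positives + false_positives)  # flagged trucks that really are trucks
    recall = true_positives / (true_positives + false_negatives)     # real trucks that were found
    return precision, recall

# Full training set, including a noisy annotator's labels.
print(precision_recall(true_positives=80, false_positives=20, false_negatives=10))  # (0.80, 0.89)

# After simply discarding that annotator's data: fewer bad labels (higher precision),
# but also fewer examples overall, so the model misses more trucks (lower recall).
print(precision_recall(true_positives=70, false_positives=5, false_negatives=25))   # (0.93, 0.74)
```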

Measurability is key to the promise of IAA. “This gives you a metric to compare across the population of annotators, to identify which annotators are not agreeing with the majority,” Bosch said. “You need that metric in order to make corrections.”
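A minimal sketch of the kind of metric Bosch describes: score each annotator by how often their label matches the majority label for the same item, then flag those who fall below a threshold. The annotator names, labels and threshold are hypothetical.

```python
# Sketch: per-annotator agreement with the majority vote, used to flag outliers.
from collections import Counter

# item_id -> {annotator: label}; invented data for illustration.
labels = {
    "img_001": {"ann_1": "truck", "ann_2": "truck", "ann_3": "car"},
    "img_002": {"ann_1": "tank",  "ann_2": "tank",  "ann_3": "tank"},
    "img_003": {"ann_1": "car",   "ann_2": "truck", "ann_3": "car"},
}

def agreement_with_majority(labels):
    """Fraction of items on which each annotator matches the majority label."""
    hits, totals = Counter(), Counter()
    for votes in labels.values():
        majority, _ = Counter(votes.values()).most_common(1)[0]
        for annotator, label in votes.items():
            totals[annotator] += 1
            hits[annotator] += int(label == majority)
    return {annotator: hits[annotator] / totals[annotator] for annotator in totals}

scores = agreement_with_majority(labels)
flagged = [a for a, s in scores.items() if s < 0.8]  # candidates for retraining
print(scores, flagged)
```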

A subset of artificial intelligence, computer vision (CV) enables machines to interpret visual imagery. It’s most often thought of in the defense space – the ability for a satellite to instantly differentiate between a truck and a tank. But it has other government applications as well, from object recognition in logistics to reviewing CT scans in healthcare.

CV is typically highly labor-intensive, with operators classifying tens of thousands of objects to train the algorithm, and it’s notoriously subjective as well. “These systems can fail in ways that seem difficult to understand and hard to predict – such as showing higher rates of error on the faces of people with darker skin,” Pew Research reports.

This is especially concerning when it’s the federal government putting these systems into play. “The use of these systems in areas such as health care, financial services and criminal justice has sparked fears that they may end up amplifying existing cultural and social biases under the guise of algorithmic neutrality,” according to Pew Research.

IAA offers a way to strengthen this potential weak link in the AI chain.

Real-world example

Dr. Katherine Schulz has been on the front lines of improving CV implementation in the federal space. As a senior analytics manager at Accenture, she is working with a large federal agency that uses image recognition to streamline a logistics operation moving millions of units around the globe every day.

“Right now, there is a lot of human labor involved. If something fails a scan, for example, then the human needs to be looking directly at that product to see where it needs to go,” she said. “To reduce that workload, we need better performance and better accuracy in the AI systems.”

By using IAA, it’s possible to home in on the points of disconnect and to quantify which individual operators’ classification efforts are skewing the outcomes.

“You can identify where one annotator interpreted something differently than others,” Schulz said. “Then you can go back to those particular training sets and remove those images that were done by that particular person. You can have them re-annotate it, you can add them back into your set and you can see how that affects your model’s performance as well.”
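A sketch of the remove-and-re-annotate workflow Schulz describes, assuming the labels live in a pandas DataFrame. The train_and_evaluate and reannotate callables stand in for an agency’s own training pipeline and labeling process; they are placeholders, not an actual Accenture or agency API.

```python
# Sketch of the workflow described above. `train_and_evaluate` and `reannotate`
# are placeholders supplied by the caller, not a real library API.
import pandas as pd

def compare_after_reannotation(annotations: pd.DataFrame, flagged_annotator: str,
                               train_and_evaluate, reannotate):
    """annotations has columns: image_id, annotator, label."""
    baseline = train_and_evaluate(annotations)

    # 1. Pull the flagged annotator's images out of the training set.
    mask = annotations["annotator"] == flagged_annotator
    cleaned = annotations[~mask]
    without_flagged = train_and_evaluate(cleaned)

    # 2. Have those images re-labeled, then add them back in.
    redone = reannotate(annotations[mask])
    with_reannotation = train_and_evaluate(pd.concat([cleaned, redone], ignore_index=True))

    # 3. Compare how each change affected model performance.
    return {"baseline": baseline,
            "without_flagged": without_flagged,
            "with_reannotation": with_reannotation}
```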

By measuring and documenting divergence between annotators, it also becomes possible to highlight the gray areas, the situations where an image is legitimately ambiguous and where human intervention can best be applied.

“Those boundary cases are challenging,” Schulz said. “IAA allows us to have those important internal discussions, to say: What do we do with something that looks like this? How would you even annotate that? We can talk through it and determine how best to do it going forward.”

Greater transparency

This practical application of Accenture’s IAA research to computer vision has a couple of important implications. For this particular agency, it has helped to streamline operations, bringing greater accuracy to the system while reducing human interventions. The goal is to cut overall disagreement rates, which might typically average more than 30%, to less than 10% through effective use of IAA.

For government more broadly, the research offers a method to check the quality of annotations during the labeling of images and to measure performance. This can build trust in AI systems by attesting to the quality of the labeling process, the training data and the ground truth or reference data used by the algorithm.

“There is this idea that AI is a black box, that it’s just the result of alchemy or magic,” Bosch said. “There is a misconception that you just start putting data into this model and then the AI will do the rest.”

By putting the human back into the equation, in a measurable way, IAA reminds government technologists that agencies are accountable for the outcomes of these machine-driven processes.

“It's not the responsibility of somebody in Silicon Valley to come and tell you what's right and wrong,” McCulloh said. “There are more than 300,000 individual parts in an F-35 and many may look alike. Domain experts need to weigh in directly to ensure real accuracy. That’s the role that the government must play.”

By highlighting the places where human annotators differ, IAA can help government to better fulfill its responsibilities. Applied over time, IAA can drive effective training – both of human operators and AI algorithms – to enable agencies to deliver on citizen expectations around transparency and accountability.

IAA allows the subject matter experts and the technologists to work together to build and continuously improve a trustworthy AI solution. “By putting metrics to human disagreement, you improve the training data, which improves the way the AI performs,” McCulloh said.

This content is made possible by our sponsor, Accenture. The editorial staff of NextGov was not involved in its preparation.
