How can feds evaluate the effectiveness of different AIs for various government tasks?


If you work with them enough, AI models almost start to seem like people, with each one having a specific set of strengths, weaknesses and quirks.

Lots of people reacted to my previous column, in which former Department of Homeland Security official Joel Meyer responded to a report claiming that, with just a little more time and computing power, artificial intelligences could begin laying the groundwork for eliminating all human life. Meyer debunked much of the popular report, but he did explain that AI technology presents a host of potential dangers that the government is already starting to guard against. And some of those defenses apparently involve AIs themselves, taking a role in countering other AIs acting with malicious intent.

Probably because of that last part, and the image of AI-versus-AI battles, many of the questions I received involved how to tell which AI was the most useful or powerful, not just for fighting or defending against other AIs, but for mundane tasks too. Don’t forget that while the idea of an AI going rogue and attacking humans is a concern that gets a lot of the spotlight, there are also thousands of AI applications on the job right now. AIs can do everything from sorting mail to tirelessly acting as quality control officers on assembly lines. And yes, they are also serving in the military, but only in limited roles for now. These days, feds can even create their own mini-AIs to perform various tasks without too much effort.

But while ChatGPT gets the most attention, there are many other generative AIs available right now. Most of them are designed to be able to do almost anything, with very few created with a specialized task in mind. So, how does someone know which AI is best suited to take on a specific role at an agency?

As a technology reviewer, one of the most interesting tools I use is the Large Model Systems Organization — or LMSYS Org — Chatbot Arena. LMSYS Org is a research organization founded by students and faculty from UC Berkeley in conjunction with a few other universities. It’s totally free to use, and the arena is a powerful way to put generative AIs head to head to see how well they handle specific tasks.

Now, the purpose of the arena is to collect votes from users who evaluate the performance of generative AIs while the models being tested remain anonymous. That eliminates potential bias from the results, but it unfortunately also means that you can’t load up the arena with two AI models of your choosing. If that were possible, it might make for an even more powerful tool for reviewing specific models, but that is not its purpose. Even so, most users are likely to be surprised by just how differently any two generative AIs can sometimes perform. In total, there are 38 generative AIs ready to compete in the arena.

To initiate your own test, first head over to the arena. The website will set up two anonymous AI models for you, designated at first only as Model A and Model B. They will be placed side by side on your screen, with a single input field below for your prompt. You are then free to describe various tasks or ask questions, which will be submitted to each AI at the same time.

You can watch each AI generate a response to your query live, and once finished, you are free to evaluate their answers for as long as you like. You then have the option to vote on which model accomplished your assigned task better, resubmit the same question to both AIs again to measure consistency, or ask as many new questions as you like until you are confident about which model represents the better-performing AI. As soon as you vote, the models are revealed. At that point you can keep submitting questions or assigning tasks to your competitors, or start over with two new, random AI models.

Ready, set, fight!

I set up a new arena battle for this column, and used some of the same questions and tasks that I have used before to evaluate each model’s effectiveness. The first task I assigned to them was to write a program in C that would grab the current time from key cities around the world — I left it up to the AIs to pick which cities they wanted to use — and then label and display those times.

Both AIs started working right away, with Model B blazing through the task much more quickly than Model A. However, the program Model B produced was a bit more primitive and covered only five cities, while Model A picked nine cities and spread them more evenly around the world. Model B also gave a very short description of the program it created, including how it used the time function to get the current time and then offset it for each specific city. That was good; however, Model A gave a much more detailed description, including how I could change the program’s variables to switch from a 12-hour clock to a 24-hour one if I wanted. So, Model A was slower, but its results were much more impressive and useful. I also copied and compiled both programs to make sure that they worked, which they did.
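For anyone curious what that kind of program actually looks like, here is a minimal sketch along the same lines. To be clear, this is my own illustration rather than either model’s output: the city list, the fixed UTC offsets and the formatting are assumptions made for the example, and a real-world version would account for daylight saving time instead of hard-coding offsets.

```c
#include <stdio.h>
#include <time.h>

/* A few example cities with fixed offsets from UTC, in hours.
   Hard-coded offsets ignore daylight saving time, which a real
   program would handle with the system's time zone database. */
struct city {
    const char *name;
    int utc_offset;
};

int main(void) {
    struct city cities[] = {
        { "London",    0 },
        { "New York", -5 },
        { "Moscow",    3 },
        { "Tokyo",     9 },
        { "Sydney",   11 },
    };
    size_t count = sizeof(cities) / sizeof(cities[0]);

    time_t now = time(NULL);        /* seconds since the Unix epoch, in UTC */
    struct tm *utc = gmtime(&now);  /* broken-down UTC time */

    for (size_t i = 0; i < count; i++) {
        /* Apply the city's offset and wrap around midnight. */
        int hour = (utc->tm_hour + cities[i].utc_offset + 24) % 24;

        /* This prints a 24-hour clock; converting the hour and adding
           an AM/PM label here is all it takes to show 12-hour time. */
        printf("%-10s %02d:%02d\n", cities[i].name, hour, utc->tm_min);
    }
    return 0;
}
```

That pull-the-UTC-time-and-apply-an-offset approach is essentially what Model B described in its short write-up, and the 12-hour versus 24-hour switch Model A pointed out comes down to a small formatting change like the one noted in the comments.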

For the second test, I went from programming to a more creative task, asking both models to write me an epic poem about King Arthur and his knights, and specifically about their search for the holy grail. Both jumped to it again, with Model B continuing to be the speed demon and finishing quickly. The much shorter poem offered by Model B was completed while Model A was still composing the middle stanzas of its epic. Both models correctly identified the themes of the search for the holy grail, specifically that it was not actually the grail that was the ultimate reward, but the kinship felt and the lessons learned by the knights doing the searching. In general, Model A’s poem was a bit more epic, with the only strange thing being that it accidentally dropped two words written in Chinese characters into the middle of the poem. Since I had run across that error before, I was pretty sure I knew which AI was hidden behind Model A, but I continued the test with the third task.

My final test was designed to see how up to date the models were. I asked them to explain the importance of, and the technical specifications behind, the new Wi-Fi 6E standard. Both proved surprisingly capable, but only Model A explained how supporting technologies like orthogonal frequency-division multiple access (OFDMA) and multi-user, multiple-input, multiple-output (MU-MIMO) are used to further enhance the speed and reliability of Wi-Fi 6E networks. Model B left those out, presumably because they were introduced with Wi-Fi 6 rather than being unique to the 6E extension. Model A also made another mistake, dropping a Chinese character into the middle of its response. It appeared to have been accidentally swapped in for the English word “density” in a sentence that should have ended with “better performance for areas with high device density.”

Looking back through all of the results, I decided that Model A had won the day. It provided more useful and complete responses, despite the random Chinese character errors. Once I voted, both AIs were revealed. Model A, the winner in this case, was qwen1.5-32b-chat, as I expected. That AI was created by Alibaba Cloud, and I assume its native language is probably Chinese. I have seen Chinese characters slip into its responses from time to time when using that model. Even so, it performed better overall.

Surprisingly, Model B was gpt-3.5-turbo-0125, one of my favorites because it gets used for gaming quite a bit these days. It seemed mostly focused on speed, and while its responses were accurate, they were not nearly as detailed or as helpful as the other model’s answers.

You can check out the leaderboard to see which AIs users currently rate as their top picks, or set up your own test and pit two random AIs head to head in the arena. Even if you don’t want to vote, running an arena battle is an interesting experience that will help show some of the key differences in how AIs act and think. If you work with them enough, they almost start to seem like people, with each one having a specific set of strengths, weaknesses and quirks — something to keep in mind if you are looking to use AI technology to help out with an agency project or critical task.

John Breeden II is an award-winning journalist and reviewer with over 20 years of experience covering technology. He is the CEO of the Tech Writers Bureau, a group that creates technological thought leadership content for organizations of all sizes. Twitter: @LabGuys