Please Test Your AI Agents — Like, At All

Just lately, there’s been some very public (and, frankly, very humorous) AI agent and bot failures.

Like Chipotle’s assistant supporting codegen (since patched): “Cease spending cash on Claude Code. Chipotle’s assist bot is free” (r/ClaudeCode)

And in a surreal trend, Washington state’s call-center hotline offering Spanish assist by talking English with a Spanish accent: “Washington state hotline callers hear AI voice with Spanish accent” (AP Information)

Coinciding with this, different Forrester analysts and I’ve had a spate of calls the place organizations have launched a brand new AI agent with out testing them.

Put merely, please don’t do that.

Please check your AI brokers earlier than launching them — some choices on how to do that are under.

What can we imply by this?

At minimal: Check all your bot’s options (and use circumstances) your self.

For any AI agent, or new function you’re introducing to it, the minimal effort it is best to make investments is to verify somebody has used it as an finish consumer earlier than this goes stay.

This may be so simple as somebody on the developer crew or as concerned as a devoted testing group. However it’s worthwhile to make it possible for somebody has actively used your answer — and all its options. This must also be achieved on an ongoing foundation in order that when new options are launched, they’re examined, too.

This may be time-intensive, however as we see with the general public circumstances, not every little thing works as anticipated on a regular basis.

The truth is, AI can go incorrect in additional sudden methods than earlier than. When you can’t be sure that options are working as meant, you then would possibly find yourself on the information.

Please word that that is the minimal attainable effort. This isn’t sufficient to make sure that one thing gained’t go incorrect or your software gained’t fail — it will solely catch the obvious/embarrassing outcomes. A extra strong testing follow is really useful.

For extra on how agentic techniques fail: Why AI Brokers Fail (And How To Repair Them)

Really useful: Apply purple teaming.

A great way to stop this type of sudden permutation is with purple teaming or deliberately making an attempt to interrupt the bot. We suggest this as a regular follow to your group.

There are two sides to this: One is conventional or infosec purple teaming. That is targeted on discovering safety exploits. The second is behavioral. That is targeted on getting the answer or mannequin to behave in an inappropriate or unintended trend. It’s best to have a follow on each.

On the very least, your crew ought to kick the tires for a day and take a look at as many exploits as attainable. Even when you’ve got a governance layer, you need to be sure that it’s holding up within the wild or, ideally, even post-launch.

For extra on the purple crew follow: Use AI Pink Teaming To Consider The Safety Posture Of AI-Enabled Purposes

For extra on normal governance approaches that ought to be adopted: Introducing Forrester’s AEGIS Framework: Agentic AI Enterprise Guardrails For Data Safety

For particular widespread governance failures, see AIUC-1’s web page, “The world’s first AI agent normal”

For a enjoyable instance of what employee-driven purple teaming can appear like, try Anthropic’s write-up, “Challenge Vend: Can Claude run a small store? (And why does that matter?)”

Really useful: Check utilizing a testing suite and follow.

Testing an AI agent system that has agentic capabilities continues to be an rising discipline, however fast progress is being made. To complement your testing applications (people whose job is to check your AI instruments, purposes, and brokers), testing suites present further built-in assist. There are two methods to think about testing suites at present: artificial and ongoing agentic.

Artificial checks are easy — they check your AI agent in opposition to a pattern of precreated prompts and ultimate solutions to behave as a “golden set” to check in opposition to. This lets you carry out a regression check over time to validate the query, “Does our AI agent present the proper responses?”

However artificial regression checks are sometimes solely carried out for an AI agent after some noteworthy change, resembling switching out the mannequin used or introducing quite a lot of new use circumstances. More and more, bigger testing suites need to check robotically and constantly. Different strategies like massive language model-as-a-judge can present supplementary runtime supervision.

(Additional work is coming from Forrester on artificial testing.)

Please word that in the event you do not need a proper testing program for AI techniques, please both rent folks for this or rent a testing companies firm.

For extra on constructing checks, see Anthropic’s, “Demystifying evals for AI brokers”

For extra on autonomous testing: The Forrester Wave™: Autonomous Testing Platforms, This autumn 2025

For how one can make steady testing work: It’s Time To Get Actually Severe About Testing Your AI: Half Two

Really useful: Check with a consultant pattern.

The last word check of your brokers, nonetheless, will come out of your customers. They alone decide in the event you cross or fail. It’s in your greatest pursuits to make them glad.

The query is: How can we check with actual customers earlier than manufacturing? The reply is a consumer champion group (or related conference). These are customers who’ve both volunteered themselves or been chosen by you to check what your agent is able to.

That is simpler in internal-facing use circumstances, as worker teams are extra simple to assemble, however many customer-facing organizations can obtain the identical factor by way of voluntary check sign-ups.

The danger is that you’ve customers who’re an overeager group who don’t make up a consultant pattern of your consumer base. In different phrases, they don’t essentially symbolize your common consumer. This may be prevented by way of cautious group design or, not less than, asking customers to tackle a persona when conducting the check.

If this isn’t attainable, you might use a canary check/conditional rollout that may function this testbed (although it’s higher when it’s voluntary).

For extra on constructing this consumer champion group internally: Finest Practices For Inside Conversational AI Adoption

Source link