AI Agent Regression Testing

Regression testing for AI agents is not just output scoring. It is verifying that tool use, turn sequence, safety boundaries, latency, and cost still behave the way your team approved. The hardest bugs are silent — the agent returns 200 OK but takes a completely different tool path.

What should be regression-tested

AI agent regression testing covers five critical dimensions that traditional tests miss:

- Tool use — which tools the agent invoked
- Turn sequence — the order of steps the agent took to reach its answer
- Safety boundaries — refusals, escalations, and guardrail behavior
- Latency — end-to-end response time per scenario
- Cost — spend per run
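A minimal sketch of what checking several of these dimensions against an approved baseline might look like. All names, thresholds, and the `AgentRun` shape are hypothetical, not EvalView's actual data model:

```python
from dataclasses import dataclass

@dataclass
class AgentRun:
    tool_calls: list[str]   # ordered tool names the agent invoked
    latency_ms: float       # end-to-end response time
    cost_usd: float         # spend for the run

def find_regressions(baseline: AgentRun, candidate: AgentRun,
                     latency_slack: float = 1.5,
                     cost_slack: float = 1.5) -> list[str]:
    """Compare a new run against the approved baseline on three of
    the five dimensions; returns human-readable findings."""
    findings = []
    if candidate.tool_calls != baseline.tool_calls:
        findings.append(
            f"tool path changed: {baseline.tool_calls} -> {candidate.tool_calls}")
    if candidate.latency_ms > baseline.latency_ms * latency_slack:
        findings.append("latency regression")
    if candidate.cost_usd > baseline.cost_usd * cost_slack:
        findings.append("cost regression")
    return findings
```

The slack multipliers matter: latency and cost vary run to run, so an exact-match check would produce constant false alarms where the tool-path check can demand strict equality.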

How golden baseline testing works

EvalView captures a snapshot of known-good agent behavior and automatically detects when future runs deviate. The first two scoring layers work without LLM-as-judge or API keys — pure deterministic tool-call and sequence comparison.

# 1. Save current behavior as baseline
evalview snapshot

# 2. Make changes to your agent (prompt, model, tools)

# 3. Check for regressions
evalview check
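The deterministic layer can be pictured as a plain sequence comparison — a sketch of the idea, not EvalView's actual implementation:

```python
def diff_tool_path(baseline: list[str], current: list[str]) -> list[str]:
    """Deterministic tool-path comparison: no LLM judge, no API keys.
    Returns [] when the paths match, otherwise a two-line diff."""
    if baseline == current:
        return []
    return [
        "- " + " → ".join(baseline),
        "+ " + " → ".join(current),
    ]
```

Because the comparison is exact equality over an ordered list, it is cheap, reproducible, and usable in CI where an LLM judge would be slow or flaky.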

Four regression statuses

Example regression output

  ✓ login-flow           PASSED
  ⚠ refund-request       TOOLS_CHANGED
      - lookup_order → check_policy → process_refund
      + lookup_order → check_policy → process_refund → escalate_to_human
  ✗ billing-dispute      REGRESSION  -30 pts
      Score: 85 → 55  Output similarity: 35%
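The statuses visible in the example above could be assigned roughly like this — the score-drop threshold is illustrative, not EvalView's documented default:

```python
def classify(score_before: float, score_after: float,
             tools_changed: bool,
             score_drop_threshold: float = 10.0) -> str:
    """Map a run's score delta and tool-path diff onto a status.
    Score regressions take precedence over tool-path changes."""
    if score_before - score_after >= score_drop_threshold:
        return "REGRESSION"
    if tools_changed:
        return "TOOLS_CHANGED"
    return "PASSED"
```

Ordering matters: a run whose tools changed *and* whose score dropped should surface as the more severe REGRESSION, not be masked by the milder TOOLS_CHANGED.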

Configurable strictness

# Default: fail only on score regressions
evalview check --fail-on REGRESSION

# Stricter: also fail on tool changes
evalview check --fail-on REGRESSION,TOOLS_CHANGED

# Strictest: fail on any change
evalview check --strict
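Conceptually, `--fail-on` is a filter over per-scenario statuses that decides the process exit code. A sketch of that gating logic, assuming (as CI use implies) that the CLI exits non-zero on failure:

```python
def ci_gate(statuses: dict[str, str], fail_on: set[str]) -> int:
    """Return a shell-style exit code: 0 when no scenario hit a
    status listed in fail_on, 1 otherwise."""
    failed = {name: s for name, s in statuses.items() if s in fail_on}
    for name, status in failed.items():
        print(f"FAIL {name}: {status}")
    return 1 if failed else 0
```

Widening the `fail_on` set is exactly what moving from the default to `--fail-on REGRESSION,TOOLS_CHANGED` does: the same runs produce a failing exit code sooner.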

Why this matters

Conventional tests catch crashes, and tracing shows what happened only after the fact. EvalView catches the harder class: the agent returns 200 but silently takes the wrong tool path, skips a clarification step, or degrades output quality after a model update. These silent regressions are the most dangerous because they look fine in logs yet produce wrong results for users.
