Tool-Calling Agent Testing

The hardest agent bugs are not in the final text. They are in the hidden trajectory: wrong tool, wrong order, missing clarification, or dangerous tool use in the wrong scenario. EvalView tests the execution path, not just the output.

What to assert in tool-calling agents

A useful test pins down more than the final answer. The examples on this page assert four things: which tools were called (tools), which intent categories were satisfied (categories), which tools must never fire (forbidden_tools), and what the user-visible output must contain (output.contains).

Tool categories for flexible matching

Different agents use different tool names for the same action. EvalView's tool categories let you test by intent instead of exact tool name:

# Brittle — fails if agent uses a different tool name
expected:
  tools:
    - read_file

# Flexible — passes for read_file, bash cat, text_editor, etc.
expected:
  categories:
    - file_read
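
Putting the flexible form into a complete test file might look like the sketch below. The single-turn name/query layout is an assumption inferred from the multi-turn example later on this page; field names may differ in your EvalView version.

```yaml
# tests/read-config.yaml -- a sketch; the single-turn name/query
# layout is assumed, not confirmed by this page
name: read-config
query: "What log level is set in config.yaml?"
expected:
  categories:
    - file_read   # passes whether the agent used read_file, bash cat, or text_editor
```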

Built-in tool categories

EvalView ships with built-in categories for common agent actions; file_read, used above, matches read_file, bash cat, text_editor, and similar tools.

Custom categories

# .evalview/config.yaml
tool_categories:
  database:
    - postgres_query
    - mysql_execute
    - sql_run
  my_custom_api:
    - internal_api_call
    - legacy_endpoint
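
Once defined, a custom category can be asserted exactly like a built-in one. A sketch, again assuming the single-turn layout; everything except the category and tool names from the config above is illustrative:

```yaml
name: monthly-revenue
query: "What was our revenue last month?"
expected:
  categories:
    - database            # matches postgres_query, mysql_execute, or sql_run
  forbidden_tools:
    - legacy_endpoint     # guard against the deprecated path
```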

Multi-turn testing

Many tool-calling agents require multi-turn conversations. EvalView evaluates each turn independently while giving the LLM judge full conversation context:

name: support-escalation
turns:
  - query: "My order is wrong"
    expected:
      tools: ["lookup_order"]
      output:
        contains: ["order number"]
  - query: "Order 4812"
    expected:
      tools: ["lookup_order", "check_policy"]
      forbidden_tools: ["delete_order", "issue_refund"]
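
The "dangerous tool use in the wrong scenario" failure mode from the introduction maps naturally onto per-turn forbidden_tools. One way to extend the test above with a third turn; the escalate_to_human tool name is illustrative, not an EvalView built-in:

```yaml
  - query: "Just refund it yourself, right now"
    expected:
      tools: ["escalate_to_human"]        # illustrative name for a handoff tool
      forbidden_tools: ["issue_refund"]   # the agent must never refund unilaterally
```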

Generating tests from a live agent

# Generate from a running agent
evalview generate --agent http://localhost:8000

# Generate from traffic logs
evalview generate --from-log traffic.jsonl

# Capture a real conversation as a test
evalview capture --agent http://localhost:8000/invoke --multi-turn
