Tool-Calling Agent Testing

The hardest agent bugs are not in the final text. They are in the hidden trajectory: wrong tool, wrong order, missing clarification, or dangerous tool use in the wrong scenario. EvalView tests the execution path, not just the output.

What to assert in tool-calling agents

A useful test pins down more than the final answer. The examples on this page assert four things: which tools were called (tools), which intent categories were satisfied (categories), which tools must never fire (forbidden_tools), and what the user-visible output must contain (output.contains).

Tool categories for flexible matching

Different agents use different tool names for the same action. EvalView's tool categories let you test by intent instead of exact tool name:

# Brittle — fails if agent uses a different tool name
expected:
  tools:
    - read_file

# Flexible — passes for read_file, bash cat, text_editor, etc.
expected:
  categories:
    - file_read
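
Putting the flexible form into a complete test file might look like the sketch below. The single-turn name/query layout is an assumption inferred from the multi-turn example later on this page; field names may differ in your EvalView version.

```yaml
# tests/read-config.yaml -- a sketch; the single-turn name/query
# layout is assumed, not confirmed by this page
name: read-config
query: "What log level is set in config.yaml?"
expected:
  categories:
    - file_read   # passes whether the agent used read_file, bash cat, or text_editor
```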

Built-in tool categories

EvalView ships with built-in categories for common agent actions; file_read, used above, matches read_file, bash cat, text_editor, and similar tools.

Custom categories

# .evalview/config.yaml
tool_categories:
  database:
    - postgres_query
    - mysql_execute
    - sql_run
  my_custom_api:
    - internal_api_call
    - legacy_endpoint
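
Once defined, a custom category can be asserted exactly like a built-in one. A sketch, again assuming the single-turn layout; everything except the category and tool names from the config above is illustrative:

```yaml
name: monthly-revenue
query: "What was our revenue last month?"
expected:
  categories:
    - database            # matches postgres_query, mysql_execute, or sql_run
  forbidden_tools:
    - legacy_endpoint     # guard against the deprecated path
```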

Multi-turn testing

Many tool-calling agents require multi-turn conversations. EvalView evaluates each turn independently while giving the LLM judge full conversation context:

name: support-escalation
turns:
  - query: "My order is wrong"
    expected:
      tools: ["lookup_order"]
      output:
        contains: ["order number"]
  - query: "Order 4812"
    expected:
      tools: ["lookup_order", "check_policy"]
      forbidden_tools: ["delete_order", "issue_refund"]
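
The "dangerous tool use in the wrong scenario" failure mode from the introduction maps naturally onto per-turn forbidden_tools. One way to extend the test above with a third turn; the escalate_to_human tool name is illustrative, not an EvalView built-in:

```yaml
  - query: "Just refund it yourself, right now"
    expected:
      tools: ["escalate_to_human"]        # illustrative name for a handoff tool
      forbidden_tools: ["issue_refund"]   # the agent must never refund unilaterally
```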

Generating tests from a live agent

# Generate from a running agent
evalview generate --agent http://localhost:8000

# Generate from traffic logs
evalview generate --from-log traffic.jsonl

# Capture a real conversation as a test
evalview capture --agent http://localhost:8000/invoke --multi-turn
