The hardest agent bugs are not in the final text. They are in the hidden trajectory: wrong tool, wrong order, missing clarification, or dangerous tool use in the wrong scenario. EvalView tests the execution path, not just the output.
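As an illustration, a single case can pin down which tools should fire, which must never fire, and what the reply should contain. This sketch uses the `tools`, `forbidden_tools`, and `output` fields that appear in the examples on this page; the exact top-level shape of a single-turn case is an assumption here:

```yaml
# Hypothetical single-turn case: the order must be looked up and checked
# against policy, and the agent must never trigger a refund on its own.
name: refund-request
query: "Refund my last order"
expected:
  tools: ["lookup_order", "check_policy"]
  forbidden_tools: ["issue_refund"]
  output:
    contains: ["policy"]
```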
Different agents use different tool names for the same action. EvalView's tool categories let you test by intent instead of exact tool name:
```yaml
# Brittle — fails if agent uses a different tool name
expected:
  tools:
    - read_file
```

```yaml
# Flexible — passes for read_file, bash cat, text_editor, etc.
expected:
  categories:
    - file_read
```
You can also define your own categories in the project config:

```yaml
# .evalview/config.yaml
tool_categories:
  database:
    - postgres_query
    - mysql_execute
    - sql_run
  my_custom_api:
    - internal_api_call
    - legacy_endpoint
```
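Once defined, a custom category is used the same way as a built-in one. A sketch, assuming the `database` category from the config above:

```yaml
expected:
  categories:
    - database   # matches postgres_query, mysql_execute, or sql_run
```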
Many tool-calling agents require multi-turn conversations. EvalView evaluates each turn independently while giving the LLM judge full conversation context:
```yaml
name: support-escalation
turns:
  - query: "My order is wrong"
    expected:
      tools: ["lookup_order"]
      output:
        contains: ["order number"]
  - query: "Order 4812"
    expected:
      tools: ["lookup_order", "check_policy"]
      forbidden_tools: ["delete_order", "issue_refund"]
```
You don't have to write every test by hand. EvalView can generate cases from a running agent, from traffic logs, or by capturing a real conversation:

```bash
# Generate from a running agent
evalview generate --agent http://localhost:8000

# Generate from traffic logs
evalview generate --from-log traffic.jsonl

# Capture a real conversation as a test
evalview capture --agent http://localhost:8000/invoke --multi-turn
```