EvalView — Regression Testing for AI Agents
Snapshot behavior, detect regressions, block broken agents before production.
EvalView sends test queries to your agent, records everything it does (tool calls, parameters, sequence, output, cost, latency), and diffs the run against a golden baseline. When something changes, you know immediately.
Normal tests catch crashes. Tracing shows what happened after the fact. EvalView catches the harder class: the agent returns 200 but silently takes the wrong tool path, skips a clarification, or degrades output quality after a model update.
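The core idea is easy to sketch in a few lines of Python. The data structures and comparison logic below are illustrative only, not EvalView's actual API; the status names mirror the ones EvalView reports:

```python
from dataclasses import dataclass

@dataclass
class Run:
    """Hypothetical record of one agent run (not EvalView's real schema)."""
    tool_calls: list[str]  # ordered tool names the agent invoked
    output: str            # final answer returned to the user

def diff_against_baseline(baseline: Run, current: Run) -> str:
    """Classify a run against its golden baseline, in the spirit of
    EvalView's PASSED / TOOLS_CHANGED / OUTPUT_CHANGED statuses."""
    if current.tool_calls != baseline.tool_calls:
        return "TOOLS_CHANGED"   # agent took a different tool path
    if current.output != baseline.output:
        return "OUTPUT_CHANGED"  # same tools, different answer
    return "PASSED"

baseline = Run(tool_calls=["search", "summarize"], output="Paris")
drifted  = Run(tool_calls=["summarize"], output="Paris")
print(diff_against_baseline(baseline, drifted))  # TOOLS_CHANGED
```

The key point: the drifted run above still returns a 200 and the same output, so an ordinary test would pass, yet the tool path silently changed.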
Quick Start
pip install evalview
evalview init # Detect agent, create starter suite
evalview snapshot # Save current behavior as baseline
evalview check # Catch regressions after every change
Key Features
- Golden baseline regression detection with 4 statuses (PASSED, TOOLS_CHANGED, OUTPUT_CHANGED, REGRESSION)
- Multi-turn conversation testing with per-turn judge scoring
- Production monitoring with Slack alerts
- 14 framework adapters, including LangGraph, CrewAI, OpenAI, Anthropic, HuggingFace, Ollama, and MCP
- Statistical testing with pass@k reliability metrics
- LLM-as-judge evaluation with GPT, Claude, Gemini, DeepSeek, or Ollama (free, local)
- CI/CD with GitHub Actions, PR comments, cost/latency/model alerts
- Real traffic capture via proxy
- Test generation from live agents
- SKILL.md validation for Claude Code and OpenAI Codex
- MCP contract testing for interface drift detection
- Works fully offline with Ollama
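The pass@k metric mentioned above is the standard unbiased estimator for agent reliability under repeated sampling: given n runs of which c passed, it estimates the probability that at least one of k sampled runs passes. A minimal sketch (how EvalView computes its reported numbers may differ):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), the probability that
    at least one of k samples drawn from n runs (c successful) passes."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample with all failures
    return 1.0 - comb(n - c, k) / comb(n, k)

# Agent passed 7 of 10 runs; chance at least one of 3 sampled runs passes:
print(round(pass_at_k(10, 7, 3), 4))  # 0.9917
```

This matters for agents because single-run tests are noisy: an agent that passes 7 of 10 times will intermittently fail CI, while pass@k makes the flakiness itself a measurable, thresholdable number.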
How EvalView Compares
LangSmith is for observability — it shows what your agent did. EvalView is for regression testing — it tells you whether your agent broke. They're complementary.
Braintrust scores agent quality. EvalView automatically detects when behavior changes, through golden baseline diffing.
Supported Frameworks
LangGraph, CrewAI, OpenAI Assistants, Anthropic Claude, HuggingFace, Ollama, MCP servers, and any HTTP API.
Free and Open Source
Apache 2.0 license. Core regression detection works without any API keys. Use Ollama for completely free, fully offline evaluation.
GitHub | PyPI | Documentation