Snapshot behavior, detect regressions, block broken agents before production.
EvalView sends test queries to your agent, records everything (tool calls, parameters, sequence, output, cost, latency), and diffs it against a golden baseline. When something changes, you know immediately.
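The snapshot-and-diff idea is simple to sketch. The following is an illustrative toy, not EvalView's actual API; the `snapshot`/`diff` names and the run dict shape are invented for this example:

```python
def snapshot(run):
    """Record the comparable parts of an agent run: which tools were
    called, with what parameters, in what order, and the final output."""
    return {
        "tools": [(call["name"], call["params"]) for call in run["tool_calls"]],
        "output": run["output"],
    }

def diff(baseline, current):
    """Compare a new run against the golden baseline snapshot."""
    changes = []
    if baseline["tools"] != current["tools"]:
        changes.append("TOOLS_CHANGED")
    if baseline["output"] != current["output"]:
        changes.append("OUTPUT_CHANGED")
    return changes or ["PASSED"]

# Save a baseline once, then diff every later run against it.
baseline = snapshot({
    "tool_calls": [{"name": "search", "params": {"q": "refund policy"}}],
    "output": "Refunds take 5 days.",
})
print(diff(baseline, baseline))  # -> ['PASSED']
```

The real tool also tracks cost and latency per run; the point here is only that a baseline is a recorded snapshot and a regression is a diff against it.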
Normal tests catch crashes. Tracing shows what happened after the fact. EvalView catches the harder class of failure: the agent returns 200 but silently takes the wrong tool path, skips a clarification step, or degrades output quality after a model update.
Quick Start
pip install evalview
evalview init # Detect agent, create starter suite
evalview snapshot # Save current behavior as baseline
evalview check # Catch regressions after every change
evalview demo # See it live, no API key needed
What It Catches
PASSED — Behavior matches baseline. Ship with confidence.
TOOLS_CHANGED — Different tools called. Review the diff.
OUTPUT_CHANGED — Same tools, output shifted. Review the diff.
REGRESSION — Score dropped significantly. Fix before shipping.
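One way to picture how the four statuses relate: a significant score drop outranks a mere diff. This is a hedged sketch; the threshold value and function signature are invented for illustration, not EvalView's configuration:

```python
def classify(tools_match: bool, output_match: bool,
             score: float, baseline_score: float,
             regression_threshold: float = 0.15) -> str:
    """Assign a status to a test run (illustrative priority order)."""
    # A significant score drop trumps everything: fix before shipping.
    if baseline_score - score > regression_threshold:
        return "REGRESSION"
    if not tools_match:
        return "TOOLS_CHANGED"
    if not output_match:
        return "OUTPUT_CHANGED"
    return "PASSED"

print(classify(True, True, 0.91, 0.92))   # -> PASSED
print(classify(True, True, 0.60, 0.92))   # -> REGRESSION
```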
EvalView scores each test with layered checks:
Tool call diffing — tools, parameters, and call sequence (offline, free)
Output diffing — output compared against the baseline (offline, free)
Semantic similarity — output meaning via embeddings (~$0.00004/test)
LLM-as-judge — output quality scored by GPT, Claude, Gemini, DeepSeek, or Ollama (~$0.01/test)
The first two layers alone catch most regressions — fully offline, zero cost.
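The semantic-similarity layer reduces to cosine similarity between embedding vectors. A minimal sketch with toy 3-dimensional vectors; in practice an embedding model supplies the real, high-dimensional ones:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embeddings" of a baseline output and a new output.
baseline_vec = [0.20, 0.70, 0.10]
candidate_vec = [0.21, 0.69, 0.12]
print(cosine_similarity(baseline_vec, candidate_vec))  # close to 1.0: meaning preserved
```

A score near 1.0 means the wording may differ but the meaning matches the baseline; a low score flags a semantic drift worth reviewing.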
Key Features
Golden baseline regression detection with tool call and parameter diffing
Multi-turn conversation testing with per-turn judge scoring
Multi-reference baselines (up to 5 variants for non-deterministic agents)
Production monitoring with Slack alerts
Statistical testing with pass@k reliability metrics
Real traffic capture via proxy
Test generation from live agents
CI/CD with GitHub Actions, PR comments, cost/latency/model change alerts
SKILL.md validation for Claude Code and OpenAI Codex
MCP contract testing for interface drift detection
Works fully offline with Ollama
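Among the features above, pass@k has a standard unbiased estimator worth spelling out: given n sampled runs of which c passed, it estimates the probability that at least one of k fresh runs would pass. This sketch shows the usual formula, not necessarily EvalView's exact implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k runs passes),
    given c of n observed runs passed."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample with all failures
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 1))  # -> 0.3, single-run reliability
print(pass_at_k(10, 3, 5))  # much higher: retries mask flakiness
```

This is why pass@k matters for non-deterministic agents: an agent that passes 3 of 10 runs looks fine under retries (high pass@5) while being unreliable in production (pass@1 of 0.3).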
How EvalView Compares
vs. LangSmith
LangSmith is for observability — it shows what your agent did. EvalView is for regression testing — it tells you whether your agent broke. They're complementary.
vs. Braintrust
Braintrust scores agent quality. EvalView automatically detects behavior changes through golden baseline diffing, and it is fully free and open source.