EvalView — Regression Testing for AI Agents

Snapshot behavior, detect regressions, block broken agents before production.

EvalView sends test queries to your agent, records everything (tool calls, parameters, sequence, output, cost, latency), and diffs it against a golden baseline. When something changes, you know immediately.

Normal tests catch crashes. Tracing shows what happened after the fact. EvalView catches the harder class of failure: the agent returns 200 but silently takes the wrong tool path, skips a clarification step, or degrades output quality after a model update.
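The record-and-diff idea can be sketched in a few lines. This is an illustrative sketch only, not EvalView's actual API; the trace fields (`tool_calls`, `status`) and the runs shown are assumptions made up for the example:

```python
# Hypothetical sketch of golden-baseline diffing (not EvalView's real API).
# Both runs return HTTP 200, so a status check alone would pass; comparing
# the recorded tool path catches the silently skipped clarification step.
golden = {
    "tool_calls": ["search_docs", "ask_clarification", "summarize"],
    "status": 200,
}
current = {
    "tool_calls": ["search_docs", "summarize"],  # clarification silently skipped
    "status": 200,
}

def diff_run(baseline, run):
    """Return human-readable regressions; an empty list means behavior matches."""
    regressions = []
    if run["tool_calls"] != baseline["tool_calls"]:
        regressions.append(
            f"tool path changed: {baseline['tool_calls']} -> {run['tool_calls']}"
        )
    if run["status"] != baseline["status"]:
        regressions.append(f"status changed: {baseline['status']} -> {run['status']}")
    return regressions

report = diff_run(golden, current)
print(report)
```

A real diff would also cover parameters, output, cost, and latency, but the shape is the same: compare each recorded field against the snapshot and surface anything that moved.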

Quick Start

pip install evalview
evalview init        # Detect agent, create starter suite
evalview snapshot    # Save current behavior as baseline
evalview check       # Catch regressions after every change
evalview demo        # See it live, no API key needed
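In CI, the same check can gate merges. A minimal GitHub Actions sketch, assuming `evalview check` exits nonzero when it finds a regression (workflow names and versions here are illustrative):

```yaml
name: agent-regression
on: [pull_request]
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install evalview
      - run: evalview check   # fails the job if behavior diverges from the baseline
```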

What It Catches

Four Scoring Layers

The first two layers alone catch most regressions — fully offline, zero cost.

Key Features

How EvalView Compares

vs. LangSmith

LangSmith is for observability — it shows what your agent did. EvalView is for regression testing — it tells you whether your agent broke. They're complementary.

Read EvalView vs LangSmith comparison

vs. Braintrust

Braintrust scores agent quality. EvalView automatically detects when behavior changes, by diffing each run against a golden baseline. EvalView is also fully free and open source.

Read EvalView vs Braintrust comparison

vs. Langfuse

Langfuse is for LLM observability. EvalView is for regression testing with golden baselines and CI gating.

Read EvalView vs Langfuse comparison | Read EvalView vs DeepEval comparison

Supported Frameworks

Works with LangGraph, CrewAI, OpenAI Assistants, Anthropic Claude, HuggingFace, Ollama, MCP servers, and any HTTP API.
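For the "any HTTP API" case, the target is just an endpoint that accepts a query and returns a response. The stub below is a hypothetical agent endpoint built with the Python standard library; the route and JSON schema (`query`, `output`, `tool_calls`) are assumptions, not a schema EvalView prescribes:

```python
# Hypothetical HTTP agent stub: any service shaped like this can be
# exercised by a black-box test harness. Schema is invented for illustration.
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class AgentHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        # A real agent would route the query through its tools; this stub
        # just echoes the query and reports a fixed tool-call trace.
        reply = {
            "output": f"answered: {body['query']}",
            "tool_calls": ["search_docs"],
        }
        data = json.dumps(reply).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):
        pass  # silence per-request logging

# Serve on an ephemeral port and issue one test query against it.
server = HTTPServer(("127.0.0.1", 0), AgentHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}/run",
    data=json.dumps({"query": "refund policy?"}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    trace = json.loads(resp.read())
print(trace["tool_calls"])
server.shutdown()
```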

Guides

Free and open source under the Apache 2.0 license.

View on GitHub | Install from PyPI