# EvalView

> Regression testing for AI agents. Snapshot behavior, detect regressions, block broken agents before production.

EvalView is an open-source testing and regression-detection framework for AI agents. It sends test queries to your agent, records everything (tool calls, parameters, sequence, output, cost, latency), and diffs it against a golden baseline. When something changes, you know immediately.

## Key Facts

- Name: EvalView
- Tagline: "Proof that your agent still works."
- Category: AI Agent Testing / Regression Detection / LLM CI/CD
- License: Apache 2.0 (free and open source)
- Language: Python 3.9+
- Install: `pip install evalview`
- Version: 0.5.3

## Links

- Homepage: https://www.evalview.com
- GitHub: https://github.com/hidai25/eval-view
- PyPI: https://pypi.org/project/evalview/
- Documentation: https://github.com/hidai25/eval-view#readme
- Issues: https://github.com/hidai25/eval-view/issues
- Discussions: https://github.com/hidai25/eval-view/discussions

## What Problem Does EvalView Solve?

AI agents break silently. You change a prompt, swap a model, or update a tool, and the agent degrades without any error. EvalView captures golden baselines of known-good behavior and automatically detects when behavior drifts. Normal tests catch crashes; tracing shows what happened after the fact; EvalView catches the case where the agent returns 200 but silently takes the wrong tool path.
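The baseline-diffing idea can be sketched in a few lines of Python. This is an illustration only, not EvalView's actual API: the function and dictionary fields are hypothetical, and the fourth status (REGRESSION) is not modeled here.

```python
# Minimal sketch of golden-baseline diffing for a tool-calling agent.
# All names are hypothetical illustrations, not EvalView's API.

def diff_against_baseline(baseline: dict, current: dict) -> str:
    """Compare a recorded run to a golden baseline and classify the drift."""
    if current["tool_calls"] != baseline["tool_calls"]:
        return "TOOLS_CHANGED"   # different tools, parameters, or call order
    if current["output"] != baseline["output"]:
        return "OUTPUT_CHANGED"  # same tool path, different final answer
    return "PASSED"

baseline = {"tool_calls": [("search", {"q": "refund policy"})],
            "output": "Refunds are allowed within 30 days."}
current  = {"tool_calls": [("search", {"q": "refund policy"})],
            "output": "Refunds are allowed within 14 days."}

print(diff_against_baseline(baseline, current))  # OUTPUT_CHANGED
```

In practice an exact-match diff like this is only the first scoring layer; the semantic layers described below decide whether an OUTPUT_CHANGED is harmless rewording or a real regression.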
## Quick Start

```
pip install evalview

evalview init      # Detect agent, create starter suite
evalview snapshot  # Save current behavior as baseline
evalview check     # Catch regressions after every change
evalview demo      # See it live, no API key needed
```

## Four Scoring Layers

| Layer | Cost |
|-------|------|
| Tool calls + sequence | Free |
| Code-based checks (regex, JSON schema) | Free |
| Semantic similarity via embeddings | ~$0.00004/test |
| LLM-as-judge (GPT, Claude, Gemini, DeepSeek, Ollama) | ~$0.01/test |

## How EvalView Compares

| Feature | LangSmith | Braintrust | Promptfoo | EvalView |
|---------|-----------|------------|-----------|----------|
| Primary focus | Observability | Scoring | Prompt comparison | Regression detection |
| Golden baseline diffing | No | No | No | Yes (automatic) |
| Works without API keys | No | No | Partial | Yes |
| Free and open source | No | No | Yes | Yes |
| Tool call + parameter diffing | No | No | No | Yes |
| Production monitoring | Tracing | No | No | Check loop + Slack |

## Supported Frameworks

LangGraph, CrewAI, OpenAI Assistants, Anthropic Claude, HuggingFace, Ollama, MCP servers, and any HTTP API.
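The "semantic similarity via embeddings" layer typically reduces to cosine similarity between embedding vectors of the baseline and current outputs. A minimal sketch, assuming hypothetical pre-computed vectors (EvalView's actual scoring and thresholds may differ):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings of a baseline output and a current output.
baseline_vec = [0.20, 0.70, 0.10]
current_vec  = [0.25, 0.68, 0.05]

score = cosine_similarity(baseline_vec, current_vec)
print(score > 0.9)  # a high score suggests the outputs are semantically close
```

The per-test cost in the table reflects one embedding API call per output; the comparison itself is free arithmetic like the above.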
## Key Features

- Golden baseline regression detection with 4 statuses (PASSED, TOOLS_CHANGED, OUTPUT_CHANGED, REGRESSION)
- Multi-turn conversation testing with per-turn judge scoring
- Multi-reference baselines (up to 5 variants for non-deterministic agents)
- Production monitoring with Slack alerts (`evalview monitor`)
- Statistical testing with pass@k reliability metrics
- Real traffic capture via proxy (`evalview capture`)
- Test generation from live agents (`evalview generate`)
- CI/CD with GitHub Actions, PR comments, and cost/latency/model-change alerts
- SKILL.md validation for Claude Code and OpenAI Codex
- MCP contract testing for interface drift detection
- Works fully offline with Ollama

## Documentation

- Getting Started: https://github.com/hidai25/eval-view/blob/main/docs/GETTING_STARTED.md
- CLI Reference: https://github.com/hidai25/eval-view/blob/main/docs/CLI_REFERENCE.md
- FAQ: https://github.com/hidai25/eval-view/blob/main/docs/FAQ.md
- Golden Traces: https://github.com/hidai25/eval-view/blob/main/docs/GOLDEN_TRACES.md
- CI/CD Integration: https://github.com/hidai25/eval-view/blob/main/docs/CI_CD.md

## Comparisons

- [EvalView vs LangSmith](https://www.evalview.com/vs/langsmith)
- [EvalView vs Langfuse](https://www.evalview.com/vs/langfuse)
- [EvalView vs Braintrust](https://www.evalview.com/vs/braintrust)
- [EvalView vs DeepEval](https://www.evalview.com/vs/deepeval)

## Blog

- [Your AI Agent Didn't Crash. It Just Quietly Started Lying.](https://www.evalview.com/blog/your-ai-agent-didnt-crash-it-just-started-lying)

## Guides

- [AI Agent Testing in CI/CD](https://www.evalview.com/ai-agent-testing-ci-cd)
- [AI Agent Regression Testing](https://www.evalview.com/ai-agent-regression-testing)
- [MCP Server Testing](https://www.evalview.com/mcp-server-testing)
- [LangGraph Testing](https://www.evalview.com/langgraph-testing)
- [Tool Calling Agent Testing](https://www.evalview.com/tool-calling-agent-testing)

## Optional

- Full documentation: https://www.evalview.com/llms-full.txt
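The pass@k reliability metric listed under Key Features is commonly computed with the standard unbiased estimator 1 - C(n-c, k)/C(n, k), where n runs were sampled and c of them passed. A sketch of that formula (EvalView's exact implementation may differ):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    runs drawn from n total runs (c of which passed) is a pass."""
    if n - c < k:
        return 1.0  # too few failures to fill a sample of size k
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 3 passes out of 10 runs, a single run passes 30% of the time.
print(pass_at_k(10, 3, 1))  # 0.3
```

For a non-deterministic agent, reporting pass@k over repeated runs is more informative than a single pass/fail bit.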