# EvalView

> Regression testing for AI agents. Snapshot behavior, detect regressions, block broken agents before production.

EvalView is an open-source testing and regression-detection framework for AI agents. It sends test queries to your agent, records everything (tool calls, parameters, sequence, output, cost, latency), and diffs it against a golden baseline. When something changes, you know immediately.

## Key Facts

- Name: EvalView
- Tagline: "Proof that your agent still works."
- Category: AI Agent Testing / Regression Detection / LLM CI/CD
- License: Apache 2.0 (free and open source)
- Language: Python 3.9+
- Install: `pip install evalview`
- Version: 0.5.3

## Links

- Homepage: https://www.evalview.com
- GitHub: https://github.com/hidai25/eval-view
- PyPI: https://pypi.org/project/evalview/
- Documentation: https://github.com/hidai25/eval-view#readme
- Issues: https://github.com/hidai25/eval-view/issues
- Discussions: https://github.com/hidai25/eval-view/discussions

## What Problem Does EvalView Solve?

AI agents break silently. You change a prompt, swap a model, or update a tool, and the agent degrades without any error. EvalView captures golden baselines of known-good behavior and automatically detects when behavior drifts. Normal tests catch crashes; tracing shows what happened after the fact; EvalView catches the case where the agent returns 200 but silently takes the wrong tool path.
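The baseline-diffing idea can be sketched in a few lines of Python. This is an illustration only, not EvalView's actual API: the function and dictionary fields are hypothetical, and the fourth status (REGRESSION) is not modeled here.

```python
# Minimal sketch of golden-baseline diffing for a tool-calling agent.
# All names are hypothetical illustrations, not EvalView's API.

def diff_against_baseline(baseline: dict, current: dict) -> str:
    """Compare a recorded run to a golden baseline and classify the drift."""
    if current["tool_calls"] != baseline["tool_calls"]:
        return "TOOLS_CHANGED"   # different tools, parameters, or call order
    if current["output"] != baseline["output"]:
        return "OUTPUT_CHANGED"  # same tool path, different final answer
    return "PASSED"

baseline = {"tool_calls": [("search", {"q": "refund policy"})],
            "output": "Refunds are allowed within 30 days."}
current  = {"tool_calls": [("search", {"q": "refund policy"})],
            "output": "Refunds are allowed within 14 days."}

print(diff_against_baseline(baseline, current))  # OUTPUT_CHANGED
```

In practice an exact-match diff like this is only the first scoring layer; the semantic layers described below decide whether an OUTPUT_CHANGED is harmless rewording or a real regression.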
## Quick Start

```
pip install evalview

evalview init      # Detect agent, create starter suite
evalview snapshot  # Save current behavior as baseline
evalview check     # Catch regressions after every change
evalview demo      # See it live, no API key needed
```

## Four Scoring Layers

| Layer | Cost |
|-------|------|
| Tool calls + sequence | Free |
| Code-based checks (regex, JSON schema) | Free |
| Semantic similarity via embeddings | ~$0.00004/test |
| LLM-as-judge (GPT, Claude, Gemini, DeepSeek, Ollama) | ~$0.01/test |

## How EvalView Compares

| Feature | LangSmith | Braintrust | Promptfoo | EvalView |
|---------|-----------|------------|-----------|----------|
| Primary focus | Observability | Scoring | Prompt comparison | Regression detection |
| Golden baseline diffing | No | No | No | Yes (automatic) |
| Works without API keys | No | No | Partial | Yes |
| Free and open source | No | No | Yes | Yes |
| Tool call + parameter diffing | No | No | No | Yes |
| Production monitoring | Tracing | No | No | Check loop + Slack |

## Supported Frameworks

LangGraph, CrewAI, OpenAI Assistants, Anthropic Claude, HuggingFace, Ollama, MCP servers, and any HTTP API.
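The "semantic similarity via embeddings" layer typically reduces to cosine similarity between embedding vectors of the baseline and current outputs. A minimal sketch, assuming hypothetical pre-computed vectors (EvalView's actual scoring and thresholds may differ):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings of a baseline output and a current output.
baseline_vec = [0.20, 0.70, 0.10]
current_vec  = [0.25, 0.68, 0.05]

score = cosine_similarity(baseline_vec, current_vec)
print(score > 0.9)  # a high score suggests the outputs are semantically close
```

The per-test cost in the table reflects one embedding API call per output; the comparison itself is free arithmetic like the above.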
## Key Features

- Golden baseline regression detection with 4 statuses (PASSED, TOOLS_CHANGED, OUTPUT_CHANGED, REGRESSION)
- Multi-turn conversation testing with per-turn judge scoring
- Multi-reference baselines (up to 5 variants for non-deterministic agents)
- Production monitoring with Slack alerts (`evalview monitor`)
- Statistical testing with pass@k reliability metrics
- Real traffic capture via proxy (`evalview capture`)
- Test generation from live agents (`evalview generate`)
- CI/CD with GitHub Actions, PR comments, and cost/latency/model-change alerts
- SKILL.md validation for Claude Code and OpenAI Codex
- MCP contract testing for interface drift detection
- Works fully offline with Ollama

## Documentation

- Getting Started: https://github.com/hidai25/eval-view/blob/main/docs/GETTING_STARTED.md
- CLI Reference: https://github.com/hidai25/eval-view/blob/main/docs/CLI_REFERENCE.md
- FAQ: https://github.com/hidai25/eval-view/blob/main/docs/FAQ.md
- Golden Traces: https://github.com/hidai25/eval-view/blob/main/docs/GOLDEN_TRACES.md
- CI/CD Integration: https://github.com/hidai25/eval-view/blob/main/docs/CI_CD.md

## Comparisons

- [EvalView vs LangSmith](https://www.evalview.com/vs/langsmith)
- [EvalView vs Langfuse](https://www.evalview.com/vs/langfuse)
- [EvalView vs Braintrust](https://www.evalview.com/vs/braintrust)
- [EvalView vs DeepEval](https://www.evalview.com/vs/deepeval)

## Blog

- [Your AI Agent Didn't Crash. It Just Quietly Started Lying.](https://www.evalview.com/blog/your-ai-agent-didnt-crash-it-just-started-lying)

## Guides

- [AI Agent Testing in CI/CD](https://www.evalview.com/ai-agent-testing-ci-cd)
- [AI Agent Regression Testing](https://www.evalview.com/ai-agent-regression-testing)
- [MCP Server Testing](https://www.evalview.com/mcp-server-testing)
- [LangGraph Testing](https://www.evalview.com/langgraph-testing)
- [Tool Calling Agent Testing](https://www.evalview.com/tool-calling-agent-testing)

## Optional

- Full documentation: https://www.evalview.com/llms-full.txt
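The pass@k reliability metric listed under Key Features is commonly computed with the standard unbiased estimator 1 - C(n-c, k)/C(n, k), where n runs were sampled and c of them passed. A sketch of that formula (EvalView's exact implementation may differ):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    runs drawn from n total runs (c of which passed) is a pass."""
    if n - c < k:
        return 1.0  # too few failures to fill a sample of size k
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 3 passes out of 10 runs, a single run passes 30% of the time.
print(pass_at_k(10, 3, 1))  # 0.3
```

For a non-deterministic agent, reporting pass@k over repeated runs is more informative than a single pass/fail bit.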