If an agent only performs well in a curated notebook, it is not production-ready. Real customers expect reliability across dozens of apps, strict compliance, and predictable costs. That is the daily reality for IT middle management.
Recent research shows the gap between benchmark wins and real tasks. On GAIA, humans scored 92 percent while a top model with tools managed about 15 percent.¹² AgentBench finds similar agent shortfalls in interactive environments that look more like the messy, stateful world your systems live in.³⁴
For leaders accountable for uptime and risk, “works on my laptop” is not a test plan. You need to let agents touch your own data, tools, and edge cases, without touching production.
Upsun provides every Git branch with a live, production-grade environment that includes cloned services such as databases and caches. That means you can spin up a realistic production clone in minutes. See how environments map to branches and how preview environments inherit data for realistic testing.
Need to protect sensitive fields while keeping data shape and distributions? Use custom sanitization patterns so preview databases are useful and PII-free. Read the sanitizing guide and examples.⁵
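As a rough sketch, here is one way such a sanitization pass might look, assuming a Python script run only against the preview database; the users table, column names, and connection variable are hypothetical, so adapt them to your own schema:
# sanitize_preview.py: replace PII deterministically so joins and
# distributions stay realistic while real emails and names disappear.
import hashlib
import os
import psycopg2

conn = psycopg2.connect(os.environ["DATABASE_URL"])  # the preview DB, never production
with conn, conn.cursor() as cur:
    cur.execute("SELECT id, email FROM users")
    for user_id, email in cur.fetchall():
        # The same input always maps to the same pseudonym, preserving data shape.
        digest = hashlib.sha256(email.encode()).hexdigest()[:12]
        cur.execute(
            "UPDATE users SET email = %s, full_name = %s WHERE id = %s",
            (f"user-{digest}@example.com", f"User {digest[:6]}", user_id),
        )
conn.close()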
Upsun is built for both humans and AI agents. It exposes structured config and predictable APIs, and your assistants connect through MCP servers for rich, real-time context about your stack. Deploy MCP servers on Upsun and wire PostgreSQL MCP to a clone safely.
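As an illustration of an agent-side smoke test against that setup, here is a sketch assuming the official Python MCP SDK (the mcp package) and a PostgreSQL MCP server that exposes a query tool; the launch command, package name, and tool name are assumptions to adapt to your own wiring:
# mcp_smoke_test.py: connect to a Postgres MCP server pointed at a cloned
# preview database, list its tools, and run one read-only query.
import asyncio
import os

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Hypothetical launch command; feed it the preview DATABASE_URL, never production.
    server = StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-postgres", os.environ["DATABASE_URL"]],
    )
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("exposed tools:", [tool.name for tool in tools.tools])
            # Assumes the server exposes a "query" tool that accepts raw SQL.
            result = await session.call_tool("query", arguments={"sql": "SELECT 1"})
            print(result)

asyncio.run(main())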
A credible AI agent evaluation plan exercises your RAG pipelines, tool calls, timeouts, retries, permissions, and failure paths on the same schemas and services you run in production. Upsun’s branch-per-environment approach standardizes the workflow and reduces the “unknown unknowns” that only appear with real workloads.
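One way to keep that plan honest is to write the scenarios down as data before automating them. A minimal sketch, with hypothetical names, prompts, and budgets:
# agent_eval_scenarios.py: an illustrative evaluation matrix to run against a
# cloned environment; every name, prompt, and expectation here is a placeholder.
SCENARIOS = [
    {
        "name": "rag_lookup_happy_path",
        "prompt": "Summarize the refund policy for EU customers.",
        "expect_tool_calls": ["search_docs"],
        "timeout_s": 30,
    },
    {
        "name": "tool_timeout_then_retry",
        "prompt": "Fetch yesterday's order volume.",
        "inject_fault": "slow_upstream",     # simulate a slow dependency
        "expect_retries": 1,
        "timeout_s": 60,
    },
    {
        "name": "permission_denied_is_surfaced",
        "prompt": "Delete customer 42.",
        "agent_role": "read_only",
        "expect_outcome": "refusal_or_403",  # must not silently succeed
    },
]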
Quick start:
# Create an isolated prod clone for agent testing
upsun branch agent-evals
# Tail logs while agents run their scenarios
upsun log -e agent-evals app
See CLI reference.
Do not just eyeball outputs. Like classic application testing, run proper RAG evaluation against your org’s content. A practical framework scores three things: context relevance, groundedness, and answer relevance.⁶
Many toolkits can handle AI evaluations, but if you are just starting out, LangChain is a solid choice for creating and running your LLM tests.
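For instance, a minimal sketch with LangChain's string evaluators, scoring groundedness and answer relevance for one answer produced against your cloned environment; it assumes an OpenAI key is configured, and the question, context, and criteria wording are placeholders:
# rag_eval.py: score one RAG answer on groundedness and answer relevance.
# Context relevance can be scored the same way by evaluating the retrieved
# chunks against the question instead of the answer.
from langchain.evaluation import load_evaluator

question = "What is our refund window for EU customers?"
retrieved_context = "...chunks pulled from the preview environment's index..."
answer = "...what the agent replied..."

# Groundedness: is the answer fully supported by the retrieved context?
groundedness = load_evaluator(
    "labeled_criteria",
    criteria={"groundedness": "Is the submission fully supported by the reference?"},
)
print(groundedness.evaluate_strings(prediction=answer, reference=retrieved_context, input=question))

# Answer relevance: does the answer actually address the question?
relevance = load_evaluator("criteria", criteria="relevance")
print(relevance.evaluate_strings(prediction=answer, input=question))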
In Upsun, you can run these evaluations as part of your branch workflow and maintain observability with logs and continuous profiling. See log access and profiling.
Agents improve when they can actually fetch, transform, and write through the interfaces your teams use today. With Upsun, MCP and agent-to-agent patterns can run in the cloned environment against your real APIs and data models, so you catch permission gaps, throttling, or schema drift long before release. Explore developer articles.
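As one illustration, checks like the following tend to surface those gaps before an agent ever sees them; the endpoint, token, table, and expected columns are hypothetical stand-ins for your own preview environment:
# preflight_checks.py: probe a cloned environment for permission gaps and
# schema drift before running agent scenarios against it.
import os
import psycopg2
import requests

BASE_URL = os.environ["PREVIEW_BASE_URL"]          # e.g. the agent-evals branch URL
AGENT_TOKEN = os.environ["AGENT_READONLY_TOKEN"]   # the credential the agent will use

# Permission gap: a read-only agent credential must not be able to mutate data.
resp = requests.delete(
    f"{BASE_URL}/api/customers/42",
    headers={"Authorization": f"Bearer {AGENT_TOKEN}"},
    timeout=10,
)
assert resp.status_code in (401, 403), f"unexpected status {resp.status_code}"

# Schema drift: the columns the agent's prompts assume must still exist.
EXPECTED = {"id", "email", "created_at"}
with psycopg2.connect(os.environ["DATABASE_URL"]) as conn, conn.cursor() as cur:
    cur.execute(
        "SELECT column_name FROM information_schema.columns WHERE table_name = %s",
        ("customers",),
    )
    actual = {row[0] for row in cur.fetchall()}
missing = EXPECTED - actual
assert not missing, f"schema drift, missing columns: {missing}"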

