Why you need real-world data to evaluate your AI agents

AI · preview environments · data cloning · privacy
11 November 2025

Problem: A lab demo is not a production test

If an agent only performs well in a curated notebook, it is not production-ready. Real customers expect reliability across dozens of apps, strict compliance, and predictable costs. That is the daily reality for IT middle management.

Recent research shows the gap between benchmark wins and real tasks. On GAIA, humans scored 92 percent while a top model with tools managed about 15 percent.¹² AgentBench finds similar agent shortfalls in interactive environments that look more like the messy, stateful world your systems live in.³⁴

For leaders accountable for uptime and risk, “works on my laptop” is not a test plan. You need to let agents touch your own data, tools, and edge cases, without touching production.

Upsun solution: evaluate against a safe production clone

Upsun provides every Git branch with a live, production-grade environment that includes cloned services such as databases and caches. That means you can spin up a realistic production clone in minutes. See how environments map to branches and how preview environments inherit data for realistic testing.

Need to protect sensitive fields while keeping data shape and distributions? Use custom sanitization patterns so preview databases are useful and PII-free. Read the sanitizing guide and examples.⁵
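As a rough illustration, here is a minimal Python sketch of the idea. The users table and its email and name columns are hypothetical, and DATABASE_URL is assumed to point at the cloned database; in practice you would run something like this from a deploy hook on non-production environments, per the sanitizing guide.

# sanitize_preview.py - minimal sketch; the users table and its
# email/name columns are hypothetical, and DATABASE_URL is assumed
# to point at the cloned database, not production.
import os

import psycopg2

# Upsun sets PLATFORM_ENVIRONMENT_TYPE per environment; never
# sanitize production itself.
if os.environ.get("PLATFORM_ENVIRONMENT_TYPE") != "production":
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    with conn, conn.cursor() as cur:
        # Replace PII with deterministic fakes so the data keeps its
        # shape (unique emails, plausible names) without the PII.
        cur.execute(
            "UPDATE users "
            "SET email = md5(email) || '@example.com', "
            "    name  = 'user_' || id"
        )
    conn.close()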

Upsun is built for both humans and AI agents. It exposes structured config and predictable APIs, and your assistants connect through MCP servers for rich, real-time context about your stack. Deploy MCP servers on Upsun and wire PostgreSQL MCP to a clone safely.
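For a concrete starting point, here is a minimal sketch using the MCP Python SDK to point the reference PostgreSQL MCP server at a clone. DATABASE_URL is assumed to hold the clone's connection string, and listing the tools is just a smoke test before letting agents loose.

# mcp_smoke_test.py - minimal sketch using the MCP Python SDK with
# the reference Postgres MCP server; DATABASE_URL is assumed to
# point at the cloned database, not production.
import asyncio
import os

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    server = StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-postgres",
              os.environ["DATABASE_URL"]],
    )
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # List the tools the agent will see: a quick check that
            # the server is wired to the clone before running evals.
            tools = await session.list_tools()
            for tool in tools.tools:
                print(tool.name, "-", tool.description)

asyncio.run(main())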

AI agent testing needs production data

A credible AI agent evaluation plan exercises your RAG pipelines, tool calls, timeouts, retries, permissions, and failure paths on the same schemas and services you run in production. Upsun’s branch-per-environment approach standardizes the workflow and reduces the “unknown unknowns” that only appear with real workloads.

Quick start:

# Create an isolated prod clone for agent testing
upsun branch agent-evals

# Tail logs while agents run their scenarios
upsun log -e agent-evals app

See CLI reference.

RAG evaluation in your production clone

Do not just eyeball outputs. Like classic application testing, run proper RAG evaluation against your org’s content. A practical framework scores three things: context relevance, groundedness, and answer relevance.⁶ 

Many toolkits can handle AI evaluations, but if you are just starting, LangChain is your best bet for creating and running your LLM tests.
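Ragas, cited in the sources below, is another toolkit that maps directly onto the triad: faithfulness for groundedness, answer relevancy, and context precision for context relevance. Here is a minimal sketch, assuming the classic Ragas evaluate API; exact metric names and dataset columns vary by version, and the judge LLM (for example, an OPENAI_API_KEY) must be configured.

# rag_eval.py - minimal sketch of the RAG triad with Ragas;
# metric names and dataset columns vary by Ragas version.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# One toy sample; in practice, build this from queries run against
# the agent in the cloned environment.
samples = {
    "question": ["What is our refund window?"],
    "answer": ["Refunds are accepted within 30 days of purchase."],
    "contexts": [["Our policy allows refunds within 30 days."]],
    "ground_truth": ["30 days"],
}

result = evaluate(
    Dataset.from_dict(samples),
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)  # per-metric scores, e.g. {'faithfulness': 1.0, ...}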

In Upsun, you can run these evaluations as part of your branch workflow and maintain observability with logs and continuous profiling. See log access and profiling.

Wire the MCP and A2A workflows to real services

Agents improve when they can actually fetch, transform, and write through the interfaces your teams use today. With Upsun, MCP and agent-to-agent (A2A) patterns can run in the cloned environment against your real APIs and data models, so you catch permission gaps, throttling, or schema drift long before release. Explore developer articles.
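Because Upsun injects each environment's own service credentials, agent code that reads them automatically targets the clone when it runs there. A minimal sketch, assuming a PostgreSQL relationship named database in your config:

# clone_connection.py - minimal sketch; assumes a PostgreSQL
# relationship named "database" in the Upsun config.
import base64
import json
import os

# Upsun exposes service credentials per environment through
# PLATFORM_RELATIONSHIPS as base64-encoded JSON, so an agent
# running on the clone automatically gets the clone's database.
rels = json.loads(base64.b64decode(os.environ["PLATFORM_RELATIONSHIPS"]))
db = rels["database"][0]

dsn = (
    f"postgresql://{db['username']}:{db['password']}"
    f"@{db['host']}:{db['port']}/{db['path']}"
)
print("Agent will use:", dsn.replace(db["password"], "****"))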

Implementation details: from lab demo to production 

  1. Create a production clone per feature branch. Each branch gets an environment with cloned services and assets.
  2. Sanitize sensitive data. Use the Upsun patterns to replace PII while maintaining realistic structure and distributions.⁵
  3. Connect your agent to real tools. Add MCP servers and any A2A workflows against the clone’s URLs.
  4. Automate RAG evals. Score context relevance, groundedness, and answer relevance against your own content, and track improvements and regressions per branch (see the sketch after this list).⁶⁷
  5. Observe everything. Stream logs and profiles to catch timeouts, rate limits, and memory leaks early. Profile performance with the included Blackfire service. See observability overview.
  6. Merge to production. Validate under stress and load on the branch, then merge with confidence.
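For step 4, here is a minimal sketch of per-branch tracking, assuming a hypothetical convention where each eval run writes its scores to evals/<branch>.json as a flat metric-to-score map:

# compare_evals.py - minimal sketch for step 4; assumes each eval
# run saved its scores to evals/<branch>.json as {"metric": score}.
import json
import sys

def load(branch):
    with open(f"evals/{branch}.json") as f:
        return json.load(f)

baseline, candidate = load(sys.argv[1]), load(sys.argv[2])

# Flag any metric that regressed by more than two points.
for metric, base_score in baseline.items():
    new_score = candidate.get(metric, 0.0)
    delta = new_score - base_score
    status = "REGRESSION" if delta < -0.02 else "ok"
    print(f"{metric}: {base_score:.3f} -> {new_score:.3f} ({status})")

Run it as, say, python compare_evals.py main agent-evals once both branches have produced scores, and fail the branch in CI on any regression.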

Why Upsun is the best platform for this

  • Speed and simplicity. A single YAML config drives multi-service orchestration and repeatable environments.
  • Standardization. Consistent delivery across teams reduces surprises and simplifies audits.
  • Security and compliance. Policies and protections operate up to the application layer.
  • Multi-cloud options. Keep cost control and vendor freedom as you scale.

Sources

  1. GAIA: a benchmark for General AI Assistants (arXiv)
  2. GAIA: a benchmark for General AI Assistants (ICLR 2024 proceedings)
  3. AgentBench: Evaluating LLMs as Agents (ICLR 2024 proceedings)
  4. AgentBench: Evaluating LLMs as Agents (ar5iv HTML) 
  5. Sanitizing databases on preview environments (Upsun Docs)
  6. IBM RAG Cookbook: Result evaluation and the RAG triad 
  7. NVIDIA NeMo Microservices: RAG evaluation type 
  8. NVIDIA NeMo Evaluator overview 
  9. Increase observability with logs and profiling (Upsun Docs)
  10. Ragas metrics overview
  11. Ragas evaluate API
