Experiments (Beta)¶
Experiments let you systematically compare agent configurations, measure their performance against datasets, and promote the best-performing variant to production. While the Playground is for ad-hoc testing, experiments provide a structured framework for data-driven agent optimization.
What is an Experiment?¶
An experiment is a controlled comparison of two or more agent variants evaluated against a shared dataset using the same scorers and quality gates. The goal is to determine which configuration performs best and should be promoted.
graph TD
Exp["Experiment"] --> VA["Variant A<br/>Claude Sonnet + Prompt v1"]
Exp --> VB["Variant B<br/>Claude Sonnet + Prompt v2"]
Exp --> VC["Variant C<br/>GPT-4o + Prompt v2"]
VA --> DS["Dataset<br/>(shared test cases)"]
VB --> DS
VC --> DS
DS --> Scorers["Scorers<br/>Semantic similarity, LLM judge"]
Scorers --> Gates["Quality Gates<br/>pass_rate >= 0.9"]
Gates --> Results["Results & Ranking"]
Setting Up an Experiment¶
1. Define Variants¶
Each variant is a configuration snapshot of the agent with specific overrides. Variants differ in one or more of:
| Parameter | Example Variation |
|---|---|
| Model | Claude Sonnet vs GPT-4o vs Gemini Pro |
| System prompt | Concise instructions vs detailed instructions |
| Temperature | 0.1 (deterministic) vs 0.5 (balanced) vs 0.9 (creative) |
| Tools | Full toolset vs reduced toolset |
| Max tokens | 1024 vs 4096 |
Change one variable at a time
For clear conclusions, vary only one parameter per variant pair. If Variant A uses a different model and a different prompt than Variant B, you cannot attribute any performance difference to either change alone.
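As a rough illustration of the one-variable rule, the sketch below compares two variant configurations and lists the parameters on which they differ. The `Variant` class, its field names, and the model and prompt identifiers are hypothetical stand-ins, not the platform's schema.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Variant:
    """Hypothetical variant record; field names are illustrative only."""
    name: str
    model: str
    system_prompt: str
    temperature: float = 0.1
    max_tokens: int = 1024


def differing_params(a: Variant, b: Variant) -> list[str]:
    """Return the configuration fields (other than name) on which a and b differ."""
    return [
        f.name
        for f in fields(Variant)
        if f.name != "name" and getattr(a, f.name) != getattr(b, f.name)
    ]


variant_a = Variant("A", model="claude-sonnet", system_prompt="prompt-v1")
variant_b = Variant("B", model="claude-sonnet", system_prompt="prompt-v2")
variant_c = Variant("C", model="gpt-4o", system_prompt="prompt-v2")

print(differing_params(variant_a, variant_b))  # ['system_prompt']
print(differing_params(variant_a, variant_c))  # ['model', 'system_prompt']
```

A pair that reports exactly one differing parameter (A vs B, or B vs C) supports a clean attribution; a pair that reports two (A vs C) does not.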
2. Select a Dataset¶
Choose a dataset that represents the agent's expected workload. Good evaluation datasets include:
- Diverse inputs -- Cover the range of queries users will send
- Expected outputs -- Ground-truth answers for scoring
- Edge cases -- Unusual, adversarial, or ambiguous inputs
- Realistic distribution -- Roughly match production traffic patterns
Datasets require at minimum an input field (the user prompt). For scoring, include an expected_output field.
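A minimal pre-flight check for these two fields might look like the following. It assumes each dataset row is a plain dictionary keyed by `input` and `expected_output`; the `validate_rows` helper and the second sample row are hypothetical.

```python
def validate_rows(rows, require_expected=True):
    """Return a list of (row_index, problem) tuples; empty means the rows are usable."""
    problems = []
    for i, row in enumerate(rows):
        if not row.get("input"):
            problems.append((i, "missing input"))
        if require_expected and "expected_output" not in row:
            problems.append((i, "missing expected_output"))
    return problems


rows = [
    {
        "input": "What is the cancellation policy?",
        "expected_output": "You can cancel within 30 days for a full refund.",
    },
    # Illustrative row with no ground truth: it can be run but not scored.
    {"input": "Do you ship internationally?"},
]

print(validate_rows(rows))  # [(1, 'missing expected_output')]
```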
3. Configure Scorers¶
Scorers evaluate each agent response. You can combine multiple scorers with different weights:
- Semantic similarity -- Measures how close the agent's response is to the expected output using embedding similarity.
- LLM judge -- Uses a second model to evaluate the response against a rubric.
- Exact match -- Strict comparison against expected output.
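One plausible way to combine weighted scorers is a weighted average normalized by the total weight. The sketch below assumes that formula; the scorer keys, weights, and placeholder scores are illustrative, and the platform's actual aggregation may differ.

```python
def exact_match(actual: str, expected: str) -> float:
    """Strict comparison: 1.0 only when the two strings are identical."""
    return 1.0 if actual == expected else 0.0


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-scorer scores, normalized by the total weight."""
    total = sum(weights[name] for name in scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in scores) / total


scores = {
    "semantic_similarity": 0.8,  # placeholder; a real scorer would compare embeddings
    "llm_judge": 0.6,            # placeholder; a real scorer would call a judge model
    "exact_match": exact_match("30 days", "30 days"),
}
weights = {"semantic_similarity": 2.0, "llm_judge": 1.0, "exact_match": 1.0}

print(round(weighted_score(scores, weights), 3))  # 0.8
```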
4. Define Quality Gates¶
Gates set the bar for what constitutes a passing evaluation. A variant must clear all gates to be eligible for promotion.
{
  "gates": [
    {
      "type": "and",
      "operands": [
        {
          "type": "condition",
          "condition": {"metric": "pass_rate", "op": ">=", "value": 0.9}
        },
        {
          "type": "condition",
          "condition": {"metric": "latency_p95_ms", "op": "<=", "value": 3000}
        },
        {
          "type": "condition",
          "condition": {"metric": "weighted_score", "op": ">=", "value": 0.75}
        }
      ]
    }
  ]
}
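To make the gate semantics concrete, here is a small recursive evaluator for the structure above. It is a sketch that handles only the `and` and `condition` node types and the `>=` and `<=` operators that appear in the example; any other node types or operators the platform supports are not covered, and the metric values are illustrative.

```python
import operator

# Only the two operators shown in the example configuration are handled here.
OPS = {">=": operator.ge, "<=": operator.le}


def evaluate_gate(gate: dict, metrics: dict) -> bool:
    """Recursively evaluate one gate node against a dict of aggregated metrics."""
    kind = gate["type"]
    if kind == "condition":
        cond = gate["condition"]
        return OPS[cond["op"]](metrics[cond["metric"]], cond["value"])
    if kind == "and":
        return all(evaluate_gate(child, metrics) for child in gate["operands"])
    raise ValueError(f"unsupported gate type: {kind}")


config = {
    "gates": [
        {
            "type": "and",
            "operands": [
                {"type": "condition",
                 "condition": {"metric": "pass_rate", "op": ">=", "value": 0.9}},
                {"type": "condition",
                 "condition": {"metric": "latency_p95_ms", "op": "<=", "value": 3000}},
                {"type": "condition",
                 "condition": {"metric": "weighted_score", "op": ">=", "value": 0.75}},
            ],
        }
    ]
}

# Illustrative metrics for a single variant.
metrics = {"pass_rate": 0.94, "latency_p95_ms": 2100, "weighted_score": 0.86}
print(all(evaluate_gate(g, metrics) for g in config["gates"]))  # True
```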
Running the Experiment¶
Once configured, the experiment runner:
- Iterates through each dataset sample
- Executes every variant against each sample
- Collects outputs, latencies, and token usage
- Runs all scorers on each output
- Aggregates metrics per variant
- Evaluates quality gates
Execution time
Experiments can take significant time depending on dataset size and variant count. An experiment with 3 variants and 200 dataset rows executes 600 agent calls. The platform runs these asynchronously and notifies you when results are ready.
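The loop structure and the call count can be sketched as follows. This is a synchronous toy version under stated assumptions: the agents are plain string functions standing in for real model calls, and `run_experiment` is a hypothetical helper, whereas the platform itself runs the calls asynchronously.

```python
from time import perf_counter


def run_experiment(variants, rows, scorers):
    """Run every variant on every row and return one result record per agent call."""
    results = []
    for index, row in enumerate(rows):
        for name, agent in variants.items():
            start = perf_counter()
            output = agent(row["input"])
            elapsed_ms = (perf_counter() - start) * 1000
            scores = {
                scorer_name: scorer(output, row["expected_output"])
                for scorer_name, scorer in scorers.items()
            }
            results.append({
                "variant": name,
                "sample_index": index,
                "output": output,
                "scores": scores,
                "execution_time_ms": elapsed_ms,
            })
    return results


# Stand-in agents: plain functions instead of real model calls.
variants = {"A": str.upper, "B": str.lower, "C": str.title}
rows = [{"input": f"question {i}", "expected_output": f"question {i}"} for i in range(200)]
scorers = {"exact_match": lambda actual, expected: float(actual == expected)}

results = run_experiment(variants, rows, scorers)
print(len(results))  # 600 = 3 variants x 200 rows
```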
Analyzing Results¶
Per-Variant Metrics¶
Each variant receives aggregated metrics:
| Metric | Description |
|---|---|
| Pass rate | Fraction of samples (0 to 1) that passed all scorers |
| Weighted score | Weighted average across scorer scores |
| Latency (avg, p50, p95, p99) | Response time distribution |
| Token usage | Average input/output/total tokens per sample |
| Gate status | Whether the variant passed all quality gates |
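The aggregation for a single variant could be computed along these lines. The sketch assumes per-sample records carrying `passed` and `execution_time_ms` fields and uses the nearest-rank percentile definition; the platform may compute percentiles differently, and the sample data is synthetic.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def aggregate(samples):
    """Summarize per-sample records for one variant."""
    latencies = [s["execution_time_ms"] for s in samples]
    return {
        "pass_rate": sum(s["passed"] for s in samples) / len(samples),
        "latency_avg_ms": sum(latencies) / len(latencies),
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
    }


# Synthetic records: 20 samples, 2 failures, latencies of 100 ms to 2000 ms.
samples = [{"passed": i % 10 != 0, "execution_time_ms": 100 * (i + 1)} for i in range(20)]
print(aggregate(samples))
```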
Per-Sample Results¶
Drill into individual samples to understand failures:
{
  "sample_index": 42,
  "input_data": {"prompt": "What is the cancellation policy?"},
  "expected_output": {"answer": "You can cancel within 30 days for a full refund."},
  "actual_output": {"answer": "Contact support for cancellation details."},
  "scorer_results": {
    "answer_relevance": {"score": 0.45, "passed": false},
    "helpfulness": {"score": 0.6, "passed": false}
  },
  "passed": false,
  "execution_time_ms": 1240
}
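When many samples fail, it can help to count which scorers rejected them. The following sketch assumes records shaped like the example above; `failure_counts` is a hypothetical helper, not a platform function.

```python
from collections import Counter


def failure_counts(samples):
    """Count, per scorer, how many failed samples that scorer rejected."""
    counts = Counter()
    for sample in samples:
        if sample["passed"]:
            continue
        for name, result in sample["scorer_results"].items():
            if not result["passed"]:
                counts[name] += 1
    return counts


samples = [
    {
        "sample_index": 42,
        "scorer_results": {
            "answer_relevance": {"score": 0.45, "passed": False},
            "helpfulness": {"score": 0.6, "passed": False},
        },
        "passed": False,
    },
]

print(failure_counts(samples))
```

A scorer that dominates the counts points either to a real weakness in the variant or to a threshold that is too strict for the dataset.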
Comparison Matrix¶
The results include pairwise deltas between all variants:
| Metric | Variant A | Variant B | Variant C |
|---|---|---|---|
| Pass Rate | 0.87 | 0.94 | 0.91 |
| Weighted Score | 0.78 | 0.86 | 0.82 |
| Avg Latency (ms) | 980 | 1,240 | 1,180 |
| Avg Tokens | 1,820 | 2,100 | 1,950 |
| Gates Passed | No | Yes | Yes |
In this example, Variant B has the highest pass rate and weighted score and passes all gates, making it the winner despite having the highest latency and token usage of the three.
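Pairwise deltas can be derived directly from the per-variant values in the table. The sketch below does so for every pair; the dictionary keys and the `pairwise_deltas` helper are illustrative names, and the delta is taken as second variant minus first.

```python
from itertools import combinations

# Values copied from the comparison table above.
table = {
    "A": {"pass_rate": 0.87, "weighted_score": 0.78, "avg_latency_ms": 980, "avg_tokens": 1820},
    "B": {"pass_rate": 0.94, "weighted_score": 0.86, "avg_latency_ms": 1240, "avg_tokens": 2100},
    "C": {"pass_rate": 0.91, "weighted_score": 0.82, "avg_latency_ms": 1180, "avg_tokens": 1950},
}


def pairwise_deltas(table):
    """For each variant pair (first, second), return second minus first per metric."""
    deltas = {}
    for first, second in combinations(table, 2):
        deltas[(first, second)] = {
            metric: round(table[second][metric] - table[first][metric], 4)
            for metric in table[first]
        }
    return deltas


for pair, delta in pairwise_deltas(table).items():
    print(pair, delta)
```

For the A-to-B pair this gives a pass-rate gain of 0.07 and a weighted-score gain of 0.08, at a cost of 260 ms of average latency and 280 tokens per sample.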
Promoting a Winner¶
After identifying the best variant, you can promote its configuration to the agent spec:
- Review the winning variant's configuration -- model, prompt, temperature, tools
- Update the agent spec in the Agent Builder with the winning settings
- Bump the version to track the change in the component changelog
- Deploy the updated solution through deployment rings
Promote through proper channels
Do not copy experiment overrides directly to production. Update the agent spec, run any required approval workflows, and promote through your organization's deployment rings. This ensures all changes are tracked and auditable.
Best Practices¶
Dataset Design¶
- Use at least 50 samples -- Smaller datasets give pass rates too noisy to separate close variants. 200+ is better.
- Include negative cases -- inputs the agent should refuse or escalate.
- Refresh regularly -- Add new samples from production traces using Trace-to-Test conversion.
- Label carefully -- Ambiguous expected outputs produce noisy scores.
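To see why sample count matters, the sketch below estimates the margin of error on an observed pass rate using the normal approximation to the binomial distribution. This is a textbook approximation chosen for illustration, not a calculation the platform is known to perform.

```python
import math


def pass_rate_margin(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation 95% confidence interval for a pass rate."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)


for n in (50, 200):
    print(n, round(pass_rate_margin(0.9, n), 3))
# 50 0.083
# 200 0.042
```

With 50 samples, an observed pass rate of 0.9 carries a margin of roughly 0.08 in either direction, so variants a few points apart cannot be reliably separated; at 200 samples the margin halves.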
Experiment Design¶
- Isolate variables -- Change one thing per variant to identify what drives improvement.
- Run multiple times -- Non-deterministic models produce different outputs each run. Run the experiment 2-3 times and look for consistent patterns.
- Balance quality and cost -- A higher-quality model may win on accuracy but cost 10x more per request. Include cost metrics in your gates.
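One simple way to look for consistent patterns across repeated runs is to compare the gap between variant means with the run-to-run spread. The pass rates below are made-up illustrative numbers, and `summarize_runs` is a hypothetical helper.

```python
from statistics import mean, stdev


def summarize_runs(rates_by_variant):
    """Mean and sample standard deviation of a metric across repeated runs."""
    return {
        name: {"mean": mean(rates), "stdev": stdev(rates)}
        for name, rates in rates_by_variant.items()
    }


# Illustrative pass rates from three hypothetical runs of the same experiment.
rates_by_variant = {"A": [0.87, 0.85, 0.88], "B": [0.94, 0.93, 0.95]}

summary = summarize_runs(rates_by_variant)
gap = summary["B"]["mean"] - summary["A"]["mean"]
noise = max(s["stdev"] for s in summary.values())
print(f"gap={gap:.3f} noise={noise:.3f}")
```

If the gap is not clearly larger than the spread between runs, treat the ranking as inconclusive and either add samples or run again.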
Scorer Selection¶
- Combine scorer types -- Use semantic similarity for content quality and LLM judge for subjective criteria.
- Weight appropriately -- If accuracy matters more than style, give the exact match scorer a higher weight.
- Calibrate thresholds -- Run a baseline experiment first to understand your dataset's score distribution before setting pass/fail thresholds.