Experiments (Beta)

Experiments let you systematically compare agent configurations, measure their performance against datasets, and promote the best-performing variant to production. While the Playground is for ad-hoc testing, experiments provide a structured framework for data-driven agent optimization.

What is an Experiment?

An experiment is a controlled comparison of two or more agent variants evaluated against a shared dataset using the same scorers and quality gates. The goal is to determine which configuration performs best and should be promoted.

graph TD
    Exp["Experiment"] --> VA["Variant A<br/>Claude Sonnet + Prompt v1"]
    Exp --> VB["Variant B<br/>Claude Sonnet + Prompt v2"]
    Exp --> VC["Variant C<br/>GPT-4o + Prompt v2"]

    VA --> DS["Dataset<br/>(shared test cases)"]
    VB --> DS
    VC --> DS

    DS --> Scorers["Scorers<br/>Semantic similarity, LLM judge"]
    Scorers --> Gates["Quality Gates<br/>pass_rate >= 0.9"]
    Gates --> Results["Results & Ranking"]

Setting Up an Experiment

1. Define Variants

Each variant is a configuration snapshot of the agent with specific overrides. Variants differ in one or more of:

Parameter	Example Variation
Model	Claude Sonnet vs GPT-4o vs Gemini Pro
System prompt	Concise instructions vs detailed instructions
Temperature	0.1 (deterministic) vs 0.5 (balanced) vs 0.9 (creative)
Tools	Full toolset vs reduced toolset
Max tokens	1024 vs 4096

Change one variable at a time

For clear conclusions, vary only one parameter per variant pair. If Variant A uses a different model and a different prompt than Variant B, you cannot attribute any performance difference to either change alone.
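One way to check this before running an experiment is to diff the variants pairwise. The sketch below is illustrative only: the `Variant` dataclass and its field names are assumptions modeled on the parameter table above, not the platform's actual schema.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Variant:
    # Hypothetical shape: fields mirror the parameter table, not a platform schema.
    name: str
    model: str
    system_prompt: str
    temperature: float = 0.5
    max_tokens: int = 1024


def changed_parameters(a: Variant, b: Variant) -> list[str]:
    """Return the names of the parameters that differ between two variants."""
    return [
        f.name
        for f in fields(Variant)
        if f.name != "name" and getattr(a, f.name) != getattr(b, f.name)
    ]


variant_a = Variant("A", "claude-sonnet", "prompt-v1")
variant_b = Variant("B", "claude-sonnet", "prompt-v2")
variant_c = Variant("C", "gpt-4o", "prompt-v2")

print(changed_parameters(variant_a, variant_b))  # ['system_prompt']
print(changed_parameters(variant_a, variant_c))  # ['model', 'system_prompt']
```

A pair that differs in exactly one parameter (A vs B, or B vs C) supports a clean attribution; A vs C does not.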

2. Select a Dataset

Choose a dataset that represents the agent's expected workload. Good evaluation datasets include:

Diverse inputs -- Cover the range of queries users will send
Expected outputs -- Ground-truth answers for scoring
Edge cases -- Unusual, adversarial, or ambiguous inputs
Realistic distribution -- Roughly match production traffic patterns

Datasets require at minimum an input field (the user prompt). For scoring, include an expected_output field.
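A quick pre-flight check can catch rows that are missing these fields. This is a minimal sketch that assumes each row is a plain dictionary keyed by `input` and `expected_output`; the `validate_rows` helper is hypothetical and the second row is made up for illustration.

```python
def validate_rows(rows, require_expected=True):
    """Return (row_index, problem) pairs for rows missing required fields."""
    problems = []
    for index, row in enumerate(rows):
        if not row.get("input"):
            problems.append((index, "missing input"))
        if require_expected and "expected_output" not in row:
            problems.append((index, "missing expected_output"))
    return problems


rows = [
    {
        "input": "What is the cancellation policy?",
        "expected_output": "You can cancel within 30 days for a full refund.",
    },
    {"input": "Can I pause my plan?"},  # illustrative row with no ground truth
]

print(validate_rows(rows))  # [(1, 'missing expected_output')]
```

Pass `require_expected=False` when the dataset is only used to collect outputs rather than to score them.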

3. Configure Scorers

Scorers evaluate each agent response. You can combine multiple scorers with different weights:

The examples below cover four scorer types: Semantic Similarity, LLM Judge, Exact Match, and Code Block.

Semantic Similarity -- Measures how close the agent's response is to the expected output using embedding similarity.

{
  "type": "semantic_similarity",
  "name": "answer_relevance",
  "weight": 0.4,
  "config": {
    "min_similarity": 0.8,
    "embedding_model": "text-embedding-3-small"
  }
}
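To make the `min_similarity` threshold concrete, the sketch below scores two embedding vectors with cosine similarity. Cosine similarity is an assumption here (the page only says "embedding similarity"), and the toy vectors stand in for real embeddings, which would come from the configured embedding model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_pass(actual_vec, expected_vec, min_similarity=0.8):
    """Return (score, passed) for one response, mirroring the min_similarity setting."""
    score = cosine_similarity(actual_vec, expected_vec)
    return score, score >= min_similarity


# Toy 3-dimensional vectors; real embeddings have hundreds of dimensions or more.
score, passed = semantic_pass([1.0, 0.0, 1.0], [1.0, 1.0, 1.0])
print(round(score, 3), passed)  # 0.816 True
```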

LLM Judge -- Uses a second model to evaluate the response against a rubric.

{
  "type": "llm_judge",
  "name": "helpfulness",
  "weight": 0.4,
  "config": {
    "rubric": "Rate the response on helpfulness (1-5). Is it accurate? Does it address the question fully? Is it well-organized?",
    "pass_threshold": 0.7,
    "temperature": 0.0
  }
}

Exact Match -- Strict comparison against expected output.

{
  "type": "exact_match",
  "name": "classification_accuracy",
  "weight": 0.2,
  "config": {
    "field_path": "$.category",
    "case_sensitive": false,
    "normalize_whitespace": true
  }
}
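The effect of `case_sensitive` and `normalize_whitespace` can be sketched as follows. Both helpers are hypothetical, and `get_field` handles only simple dotted paths such as `$.category`, not full JSONPath; the platform's own matching rules may differ.

```python
def get_field(obj, field_path):
    """Resolve a simple '$.a.b' path; full JSONPath syntax is not handled."""
    keys = field_path[2:].split(".") if field_path.startswith("$.") else [field_path]
    for key in keys:
        obj = obj[key]
    return obj


def exact_match(actual, expected, case_sensitive=False, normalize_whitespace=True):
    """Compare two values as strings under the configured normalization options."""
    a, e = str(actual), str(expected)
    if normalize_whitespace:
        # Trim the ends and collapse internal runs of whitespace to one space.
        a, e = " ".join(a.split()), " ".join(e.split())
    if not case_sensitive:
        a, e = a.casefold(), e.casefold()
    return a == e


actual = get_field({"category": " Billing  "}, "$.category")
print(exact_match(actual, "billing"))                       # True
print(exact_match(actual, "billing", case_sensitive=True))  # False
```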

Code Block -- Custom Python evaluation logic for domain-specific scoring.

{
  "type": "code_block",
  "name": "business_rules",
  "weight": 0.3,
  "config": {
    "code_block_id": "uuid-of-scoring-code-block",
    "timeout_ms": 10000
  }
}

4. Define Quality Gates

Gates set the bar for what constitutes a passing evaluation. A variant must clear all gates to be eligible for promotion.

{
  "gates": [
    {
      "type": "and",
      "operands": [
        {
          "type": "condition",
          "condition": {"metric": "pass_rate", "op": ">=", "value": 0.9}
        },
        {
          "type": "condition",
          "condition": {"metric": "latency_p95_ms", "op": "<=", "value": 3000}
        },
        {
          "type": "condition",
          "condition": {"metric": "weighted_score", "op": ">=", "value": 0.75}
        }
      ]
    }
  ]
}
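A gate tree like this can be evaluated recursively. The sketch below is an assumption about the semantics, not the platform's evaluator: it handles only the `and` and `condition` node types and the `>=` and `<=` operators that appear in the example, and the metric values are made up.

```python
import operator

# Only the operators used in the example above are mapped.
OPS = {">=": operator.ge, "<=": operator.le}


def evaluate_gate(node, metrics):
    """Return True if the gate node holds for the given metric values."""
    if node["type"] == "condition":
        cond = node["condition"]
        return OPS[cond["op"]](metrics[cond["metric"]], cond["value"])
    if node["type"] == "and":
        return all(evaluate_gate(child, metrics) for child in node["operands"])
    raise ValueError(f"unsupported gate type: {node['type']}")


gate = {
    "type": "and",
    "operands": [
        {"type": "condition", "condition": {"metric": "pass_rate", "op": ">=", "value": 0.9}},
        {"type": "condition", "condition": {"metric": "latency_p95_ms", "op": "<=", "value": 3000}},
        {"type": "condition", "condition": {"metric": "weighted_score", "op": ">=", "value": 0.75}},
    ],
}

# Illustrative metric values, not real results.
good = {"pass_rate": 0.94, "latency_p95_ms": 2100, "weighted_score": 0.86}
bad = {"pass_rate": 0.87, "latency_p95_ms": 2100, "weighted_score": 0.78}

print(evaluate_gate(gate, good))  # True
print(evaluate_gate(gate, bad))   # False
```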

Running the Experiment

Once configured, the experiment runner:

Iterates through each dataset sample
Executes every variant against each sample
Collects outputs, latencies, and token usage
Runs all scorers on each output
Aggregates metrics per variant
Evaluates quality gates
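The first four steps can be sketched as a nested loop. Everything here is hypothetical: variants are modeled as plain callables, scorers as functions returning a `score` and `passed` flag, and token usage is omitted. The real runner executes asynchronously on the platform.

```python
import time


def run_experiment(variants, dataset, scorers):
    """Run every variant on every sample and score each output.

    variants: name -> callable(input) -> output
    scorers:  name -> callable(output, expected) -> {"score": float, "passed": bool}
    """
    results = {name: [] for name in variants}
    for index, sample in enumerate(dataset):
        for name, run_agent in variants.items():
            start = time.perf_counter()
            output = run_agent(sample["input"])
            elapsed_ms = (time.perf_counter() - start) * 1000
            scorer_results = {
                scorer_name: scorer(output, sample.get("expected_output"))
                for scorer_name, scorer in scorers.items()
            }
            results[name].append({
                "sample_index": index,
                "actual_output": output,
                "scorer_results": scorer_results,
                "passed": all(r["passed"] for r in scorer_results.values()),
                "execution_time_ms": elapsed_ms,
            })
    return results


def exact(output, expected):
    matched = output == expected
    return {"score": 1.0 if matched else 0.0, "passed": matched}


# Stub "agents" so the sketch runs without any model calls.
variants = {"A": str.upper, "B": str.lower}
dataset = [
    {"input": "Hi", "expected_output": "hi"},
    {"input": "OK", "expected_output": "ok"},
]
results = run_experiment(variants, dataset, {"exact": exact})
print([r["passed"] for r in results["B"]])  # [True, True]
```

Aggregation and gate evaluation then operate on the per-variant lists this returns.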

Execution time

Experiments can take significant time depending on dataset size and variant count. An experiment with 3 variants and 200 dataset rows executes 3 x 200 = 600 agent calls, and each output is then evaluated by every configured scorer. The platform runs these asynchronously and notifies you when results are ready.

Analyzing Results

Per-Variant Metrics

Each variant receives aggregated metrics:

Metric	Description
Pass rate	Fraction of samples (0 to 1) that passed all scorers
Weighted score	Weighted average across scorer scores
Latency (avg, p50, p95, p99)	Response time distribution
Token usage	Average input/output/total tokens per sample
Gate status	Whether the variant passed all quality gates
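As a rough illustration of how these metrics could be derived from per-sample results, consider the sketch below. It makes two assumptions that this page does not confirm: the weighted score is normalized by the sum of the weights, and percentiles use the nearest-rank method. The second sample is invented for the example.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def aggregate(sample_results, weights):
    """Aggregate per-sample records into variant-level metrics."""
    n = len(sample_results)
    total_weight = sum(weights.values())
    per_sample_scores = [
        sum(weights[name] * r["scorer_results"][name]["score"] for name in weights)
        / total_weight
        for r in sample_results
    ]
    return {
        "pass_rate": sum(r["passed"] for r in sample_results) / n,
        "weighted_score": sum(per_sample_scores) / n,
        "latency_p95_ms": percentile([r["execution_time_ms"] for r in sample_results], 95),
    }


samples = [
    {
        "scorer_results": {"answer_relevance": {"score": 0.45}, "helpfulness": {"score": 0.6}},
        "passed": False,
        "execution_time_ms": 1240,
    },
    {   # illustrative second sample
        "scorer_results": {"answer_relevance": {"score": 0.9}, "helpfulness": {"score": 0.8}},
        "passed": True,
        "execution_time_ms": 980,
    },
]

metrics = aggregate(samples, {"answer_relevance": 0.4, "helpfulness": 0.4})
print(metrics["pass_rate"], round(metrics["weighted_score"], 4), metrics["latency_p95_ms"])
```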

Per-Sample Results

Drill into individual samples to understand failures:

{
  "sample_index": 42,
  "input_data": {"prompt": "What is the cancellation policy?"},
  "expected_output": {"answer": "You can cancel within 30 days for a full refund."},
  "actual_output": {"answer": "Contact support for cancellation details."},
  "scorer_results": {
    "answer_relevance": {"score": 0.45, "passed": false},
    "helpfulness": {"score": 0.6, "passed": false}
  },
  "passed": false,
  "execution_time_ms": 1240
}

Comparison Matrix

The results place all variants side by side so you can read off the pairwise deltas for each metric:

	Variant A	Variant B	Variant C
Pass Rate	0.87	0.94	0.91
Weighted Score	0.78	0.86	0.82
Avg Latency (ms)	980	1,240	1,180
Avg Tokens	1,820	2,100	1,950
Gates Passed	No	Yes	Yes

In this example, Variant B has the highest quality scores and passes all gates, making it the winner despite higher latency.
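The selection logic in this example can be sketched as: drop variants that failed their gates, then rank the rest by quality. The ranking key (weighted score, then pass rate) is an assumption for illustration; the page does not specify how ties or trade-offs are resolved.

```python
# Values copied from the comparison matrix above.
matrix = {
    "A": {"pass_rate": 0.87, "weighted_score": 0.78, "avg_latency_ms": 980, "avg_tokens": 1820, "gates_passed": False},
    "B": {"pass_rate": 0.94, "weighted_score": 0.86, "avg_latency_ms": 1240, "avg_tokens": 2100, "gates_passed": True},
    "C": {"pass_rate": 0.91, "weighted_score": 0.82, "avg_latency_ms": 1180, "avg_tokens": 1950, "gates_passed": True},
}


def pick_winner(matrix):
    """Return the gate-passing variant with the best quality, or None if none passed."""
    eligible = {name: m for name, m in matrix.items() if m["gates_passed"]}
    if not eligible:
        return None
    return max(eligible, key=lambda n: (eligible[n]["weighted_score"], eligible[n]["pass_rate"]))


def delta(matrix, baseline, candidate, metric):
    """Candidate minus baseline for one metric, rounded to avoid float noise."""
    return round(matrix[candidate][metric] - matrix[baseline][metric], 4)


print(pick_winner(matrix))                       # B
print(delta(matrix, "A", "B", "pass_rate"))      # 0.07
print(delta(matrix, "C", "B", "avg_latency_ms")) # 60
```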

Promoting a Winner

After identifying the best variant, you can promote its configuration to the agent spec:

Review the winning variant's configuration -- model, prompt, temperature, tools
Update the agent spec in the Agent Builder with the winning settings
Bump the version to track the change in the component changelog
Deploy the updated solution through deployment rings

Promote through proper channels

Do not copy experiment overrides directly to production. Update the agent spec, run any required approval workflows, and promote through your organization's deployment rings. This ensures all changes are tracked and auditable.

Best Practices

Dataset Design

Use at least 50 samples -- Smaller datasets produce pass rates too noisy to separate variants reliably. 200+ is better.
Include negative cases -- inputs the agent should refuse or escalate.
Refresh regularly -- Add new samples from production traces using Trace-to-Test conversion.
Label carefully -- Ambiguous expected outputs produce noisy scores.
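To see why sample count matters, the sketch below estimates the 95% margin of error on a measured pass rate using the normal approximation to the binomial. It is a rough guide rather than the platform's method, and the approximation is weaker for very small datasets or pass rates near 0 or 1.

```python
import math


def pass_rate_margin(pass_rate, n, z=1.96):
    """Approximate 95% margin of error for a pass rate measured on n samples."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)


for n in (50, 200):
    print(n, round(pass_rate_margin(0.9, n), 3))
# 50 0.083
# 200 0.042
```

At 50 samples, a measured pass rate of 0.90 is uncertain by roughly 8 points in either direction, so a 0.87 vs 0.94 difference is hard to distinguish from noise. At 200 samples the margin is about half that.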

Experiment Design

Isolate variables -- Change one thing per variant to identify what drives improvement.
Run multiple times -- Non-deterministic models produce different outputs each run. Run the experiment 2-3 times and look for consistent patterns.
Balance quality and cost -- A higher-quality model may win on accuracy but cost 10x more per request. Include cost metrics in your gates.
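When comparing repeated runs, it helps to look at the spread as well as the mean. The sketch below uses made-up pass rates from three runs per variant and a deliberately conservative check (every run of one variant beats every run of the other); both the data and the `consistently_better` rule are illustrative assumptions.

```python
from statistics import mean, stdev

# Illustrative pass rates from three repeated runs of each variant.
runs = {
    "A": [0.87, 0.85, 0.88],
    "B": [0.94, 0.92, 0.95],
}


def summarize(scores):
    """Return (mean, sample standard deviation) across runs."""
    return mean(scores), stdev(scores)


def consistently_better(candidate, baseline):
    """True only if the candidate's worst run beats the baseline's best run."""
    return min(candidate) > max(baseline)


for name, scores in runs.items():
    avg, spread = summarize(scores)
    print(name, round(avg, 3), round(spread, 3))

print(consistently_better(runs["B"], runs["A"]))  # True
```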

Scorer Selection

Combine scorer types -- Use semantic similarity for content quality and LLM judge for subjective criteria.
Weight appropriately -- If accuracy matters more than style, give the exact match scorer a higher weight.
Calibrate thresholds -- Run a baseline experiment first to understand your dataset's score distribution before setting pass/fail thresholds.