
Experiments (Beta)

Experiments let you systematically compare agent configurations, measure their performance against datasets, and promote the best-performing variant to production. While the Playground is for ad-hoc testing, experiments provide a structured framework for data-driven agent optimization.


What is an Experiment?

An experiment is a controlled comparison of two or more agent variants evaluated against a shared dataset using the same scorers and quality gates. The goal is to determine which configuration performs best and should be promoted.

graph TD
    Exp["Experiment"] --> VA["Variant A<br/>Claude Sonnet + Prompt v1"]
    Exp --> VB["Variant B<br/>Claude Sonnet + Prompt v2"]
    Exp --> VC["Variant C<br/>GPT-4o + Prompt v2"]

    VA --> DS["Dataset<br/>(shared test cases)"]
    VB --> DS
    VC --> DS

    DS --> Scorers["Scorers<br/>Semantic similarity, LLM judge"]
    Scorers --> Gates["Quality Gates<br/>pass_rate >= 0.9"]
    Gates --> Results["Results & Ranking"]

Setting Up an Experiment

1. Define Variants

Each variant is a configuration snapshot of the agent with specific overrides. Variants differ in one or more of:

| Parameter | Example Variation |
| --- | --- |
| Model | Claude Sonnet vs GPT-4o vs Gemini Pro |
| System prompt | Concise instructions vs detailed instructions |
| Temperature | 0.1 (deterministic) vs 0.5 (balanced) vs 0.9 (creative) |
| Tools | Full toolset vs reduced toolset |
| Max tokens | 1024 vs 4096 |

Change one variable at a time

For clear conclusions, vary only one parameter per variant pair. If Variant A uses a different model and a different prompt than Variant B, you cannot attribute any performance difference to either change alone.
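One way to check a set of variants for confounded comparisons is to diff their configurations pairwise. The sketch below is illustrative only: the dictionary keys (`model`, `prompt`, `temperature`) and the helper names are assumptions, not part of the platform's API.

```python
from itertools import combinations

def differing_parameters(variant_a: dict, variant_b: dict) -> set:
    """Return the parameter names whose values differ between two variants."""
    keys = set(variant_a) | set(variant_b)
    return {key for key in keys if variant_a.get(key) != variant_b.get(key)}

def confounded_pairs(variants: dict) -> list:
    """Return (name_a, name_b, changed_keys) for pairs that differ in more than one parameter."""
    flagged = []
    for (name_a, cfg_a), (name_b, cfg_b) in combinations(variants.items(), 2):
        changed = differing_parameters(cfg_a, cfg_b)
        if len(changed) > 1:
            flagged.append((name_a, name_b, sorted(changed)))
    return flagged

# Hypothetical configurations mirroring the three variants in the diagram above.
variants = {
    "A": {"model": "claude-sonnet", "prompt": "v1", "temperature": 0.1},
    "B": {"model": "claude-sonnet", "prompt": "v2", "temperature": 0.1},
    "C": {"model": "gpt-4o", "prompt": "v2", "temperature": 0.1},
}
print(confounded_pairs(variants))
```

Here A and C differ in both model and prompt, so conclusions should come from comparing A with B (prompt change) and B with C (model change) rather than A with C directly.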

2. Select a Dataset

Choose a dataset that represents the agent's expected workload. Good evaluation datasets include:

  • Diverse inputs -- Cover the range of queries users will send
  • Expected outputs -- Ground-truth answers for scoring
  • Edge cases -- Unusual, adversarial, or ambiguous inputs
  • Realistic distribution -- Roughly match production traffic patterns

Datasets require at minimum an input field (the user prompt). For scoring, include an expected_output field.
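A pre-flight check along these lines can catch rows that would fail or go unscored. It is a minimal sketch that assumes each row is a flat dictionary using the `input` and `expected_output` field names described above; the function name is hypothetical.

```python
def validate_dataset(rows):
    """Split row indexes into hard errors and rows that can run but cannot be scored."""
    errors, unscorable = [], []
    for index, row in enumerate(rows):
        if not row.get("input"):
            errors.append(index)          # no user prompt: the row cannot be executed
        elif "expected_output" not in row:
            unscorable.append(index)      # runs, but scorers have no ground truth
    return errors, unscorable

rows = [
    {"input": "What is the cancellation policy?",
     "expected_output": "You can cancel within 30 days for a full refund."},
    {"input": "Do you ship internationally?"},
    {"expected_output": "An answer with no matching prompt."},
]
errors, unscorable = validate_dataset(rows)
print("missing input:", errors, "missing expected_output:", unscorable)
```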

3. Configure Scorers

Scorers evaluate each agent response. You can combine multiple scorers with different weights. The examples below show one configuration per scorer type.

Semantic similarity: measures how close the agent's response is to the expected output using embedding similarity.

{
  "type": "semantic_similarity",
  "name": "answer_relevance",
  "weight": 0.4,
  "config": {
    "min_similarity": 0.8,
    "embedding_model": "text-embedding-3-small"
  }
}

LLM judge: uses a second model to evaluate the response against a rubric.

{
  "type": "llm_judge",
  "name": "helpfulness",
  "weight": 0.4,
  "config": {
    "rubric": "Rate the response on helpfulness (1-5). Is it accurate? Does it address the question fully? Is it well-organized?",
    "pass_threshold": 0.7,
    "temperature": 0.0
  }
}

Exact match: strict comparison against the expected output.

{
  "type": "exact_match",
  "name": "classification_accuracy",
  "weight": 0.2,
  "config": {
    "field_path": "$.category",
    "case_sensitive": false,
    "normalize_whitespace": true
  }
}

Code block: custom Python evaluation logic for domain-specific scoring.

{
  "type": "code_block",
  "name": "business_rules",
  "weight": 0.3,
  "config": {
    "code_block_id": "uuid-of-scoring-code-block",
    "timeout_ms": 10000
  }
}
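This page does not show the interface a scoring code block must implement, so the following is only a guess at what such logic might look like. The function name `score`, its two arguments, and the `score`/`passed` return shape (borrowed from the per-sample results format on this page) are all assumptions.

```python
import re

# Hypothetical business rule: the answer must state the same refund window
# (a number of days) as the expected output.
DAYS_PATTERN = re.compile(r"(\d+)\s+days")

def score(actual_output: dict, expected_output: dict) -> dict:
    expected = DAYS_PATTERN.search(expected_output.get("answer", ""))
    actual = DAYS_PATTERN.search(actual_output.get("answer", ""))
    matched = bool(expected and actual and expected.group(1) == actual.group(1))
    return {"score": 1.0 if matched else 0.0, "passed": matched}

expected = {"answer": "You can cancel within 30 days for a full refund."}
print(score({"answer": "Contact support for cancellation details."}, expected))
print(score({"answer": "Cancel any time in the first 30 days."}, expected))
```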

4. Define Quality Gates

Gates set the bar for what constitutes a passing evaluation. A variant must clear all gates to be eligible for promotion.

{
  "gates": [
    {
      "type": "and",
      "operands": [
        {
          "type": "condition",
          "condition": {"metric": "pass_rate", "op": ">=", "value": 0.9}
        },
        {
          "type": "condition",
          "condition": {"metric": "latency_p95_ms", "op": "<=", "value": 3000}
        },
        {
          "type": "condition",
          "condition": {"metric": "weighted_score", "op": ">=", "value": 0.75}
        }
      ]
    }
  ]
}
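To make the gate semantics concrete, the sketch below evaluates a configuration shaped like the one above against a dictionary of metrics. It handles only the `and` and `condition` node types and the `>=` and `<=` operators that appear in the example; any other node types or operators the platform supports are not covered, and the function names are hypothetical.

```python
import operator

# Only the two operators used in the example configuration.
COMPARATORS = {">=": operator.ge, "<=": operator.le}

def evaluate_gate(node: dict, metrics: dict) -> bool:
    """Recursively evaluate one gate node against aggregated metrics."""
    if node["type"] == "condition":
        cond = node["condition"]
        return COMPARATORS[cond["op"]](metrics[cond["metric"]], cond["value"])
    if node["type"] == "and":
        return all(evaluate_gate(child, metrics) for child in node["operands"])
    raise ValueError(f"unsupported gate type: {node['type']}")

def passes_all_gates(config: dict, metrics: dict) -> bool:
    """A variant is eligible only if every top-level gate passes."""
    return all(evaluate_gate(gate, metrics) for gate in config["gates"])

config = {"gates": [{"type": "and", "operands": [
    {"type": "condition", "condition": {"metric": "pass_rate", "op": ">=", "value": 0.9}},
    {"type": "condition", "condition": {"metric": "latency_p95_ms", "op": "<=", "value": 3000}},
    {"type": "condition", "condition": {"metric": "weighted_score", "op": ">=", "value": 0.75}},
]}]}

# Made-up metrics for two variants.
strong = {"pass_rate": 0.94, "latency_p95_ms": 2400, "weighted_score": 0.86}
weak = {"pass_rate": 0.87, "latency_p95_ms": 1900, "weighted_score": 0.78}
print(passes_all_gates(config, strong), passes_all_gates(config, weak))
```

The second variant fails because its pass rate is below 0.9, even though it clears the latency and weighted-score conditions.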

Running the Experiment

Once configured, the experiment runner:

  1. Iterates through each dataset sample
  2. Executes every variant against each sample
  3. Collects outputs, latencies, and token usage
  4. Runs all scorers on each output
  5. Aggregates metrics per variant
  6. Evaluates quality gates
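Steps 1 through 4 can be pictured as a nested loop. This is a simplified, synchronous sketch, not the platform's runner: `run_agent`, the scorer callables, and the stubs below are hypothetical stand-ins, token usage is not collected, and aggregation and gate evaluation (steps 5 and 6) are left out.

```python
import time

def run_experiment(variants, dataset, scorers, run_agent):
    """Run every variant on every sample and score each output."""
    results = {name: [] for name in variants}
    for index, sample in enumerate(dataset):                 # step 1
        for name, config in variants.items():                # step 2
            started = time.perf_counter()
            output = run_agent(config, sample["input"])
            elapsed_ms = (time.perf_counter() - started) * 1000   # step 3
            scorer_results = {                                # step 4
                scorer_name: scorer(output, sample.get("expected_output"))
                for scorer_name, scorer in scorers.items()
            }
            results[name].append({
                "sample_index": index,
                "actual_output": output,
                "scorer_results": scorer_results,
                "passed": all(r["passed"] for r in scorer_results.values()),
                "execution_time_ms": elapsed_ms,
            })
    return results

# Stubs so the sketch runs without a real agent or scorer.
def echo_agent(config, prompt):
    return f"[{config['model']}] {prompt}"

def exact_match(output, expected):
    matched = output == expected
    return {"score": 1.0 if matched else 0.0, "passed": matched}

variants = {"A": {"model": "model-a"}, "B": {"model": "model-b"}}
dataset = [
    {"input": "hello", "expected_output": "[model-a] hello"},
    {"input": "bye", "expected_output": "[model-b] bye"},
]
results = run_experiment(variants, dataset, {"exact": exact_match}, echo_agent)
total_calls = sum(len(rows) for rows in results.values())
print(total_calls)  # 2 variants x 2 samples = 4 agent calls
```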

Execution time

Experiments can take significant time depending on dataset size and variant count. An experiment with 3 variants and 200 dataset rows executes 600 agent calls. The platform runs these asynchronously and notifies you when results are ready.


Analyzing Results

Per-Variant Metrics

Each variant receives aggregated metrics:

| Metric | Description |
| --- | --- |
| Pass rate | Share of samples that passed all scorers, expressed as a fraction from 0 to 1 (for example, 0.9) |
| Weighted score | Weighted average across scorer scores |
| Latency (avg, p50, p95, p99) | Response time distribution |
| Token usage | Average input/output/total tokens per sample |
| Gate status | Whether the variant passed all quality gates |
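The sketch below shows one plausible way these aggregates could be derived from per-sample results. It rests on several assumptions: the weighted score is normalized by the sum of the weights and then averaged over samples, p95 latency uses the nearest-rank method, and every sample carries results for every scorer. The platform's exact formulas may differ.

```python
import math

def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def aggregate(samples, weights):
    total_weight = sum(weights.values())
    weighted_scores = [
        sum(weights[name] * result["score"] for name, result in s["scorer_results"].items())
        / total_weight
        for s in samples
    ]
    latencies = [s["execution_time_ms"] for s in samples]
    return {
        "pass_rate": sum(1 for s in samples if s["passed"]) / len(samples),
        "weighted_score": sum(weighted_scores) / len(samples),
        "latency_p95_ms": nearest_rank_percentile(latencies, 95),
    }

# Two made-up samples, one passing and one failing.
samples = [
    {"scorer_results": {"answer_relevance": {"score": 0.9, "passed": True},
                        "helpfulness": {"score": 0.8, "passed": True}},
     "passed": True, "execution_time_ms": 1000},
    {"scorer_results": {"answer_relevance": {"score": 0.45, "passed": False},
                        "helpfulness": {"score": 0.6, "passed": False}},
     "passed": False, "execution_time_ms": 1240},
]
weights = {"answer_relevance": 0.4, "helpfulness": 0.4}
metrics = aggregate(samples, weights)
print(metrics)
```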

Per-Sample Results

Drill into individual samples to understand failures:

{
  "sample_index": 42,
  "input_data": {"prompt": "What is the cancellation policy?"},
  "expected_output": {"answer": "You can cancel within 30 days for a full refund."},
  "actual_output": {"answer": "Contact support for cancellation details."},
  "scorer_results": {
    "answer_relevance": {"score": 0.45, "passed": false},
    "helpfulness": {"score": 0.6, "passed": false}
  },
  "passed": false,
  "execution_time_ms": 1240
}
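When many samples fail, counting failures per scorer shows where to look first. This sketch assumes you have per-sample results in the shape shown above as Python dictionaries; how you export them from the platform is not covered here, and the second sample is made up.

```python
from collections import Counter

def failures_by_scorer(sample_results):
    """Count how many samples each scorer failed."""
    counts = Counter()
    for sample in sample_results:
        for name, result in sample["scorer_results"].items():
            if not result["passed"]:
                counts[name] += 1
    return counts

sample_results = [
    {"sample_index": 42, "passed": False, "scorer_results": {
        "answer_relevance": {"score": 0.45, "passed": False},
        "helpfulness": {"score": 0.6, "passed": False}}},
    {"sample_index": 43, "passed": False, "scorer_results": {   # hypothetical sample
        "answer_relevance": {"score": 0.91, "passed": True},
        "helpfulness": {"score": 0.55, "passed": False}}},
]
counts = failures_by_scorer(sample_results)
print(counts.most_common())
```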

Comparison Matrix

The results include pairwise deltas between all variants. The per-variant values those deltas are derived from look like this:

| Metric | Variant A | Variant B | Variant C |
| --- | --- | --- | --- |
| Pass Rate | 0.87 | 0.94 | 0.91 |
| Weighted Score | 0.78 | 0.86 | 0.82 |
| Avg Latency (ms) | 980 | 1,240 | 1,180 |
| Avg Tokens | 1,820 | 2,100 | 1,950 |
| Gates Passed | No | Yes | Yes |

In this example, Variant B has the highest quality scores and passes all gates, making it the winner despite higher latency.
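The pairwise deltas can be reproduced from the matrix values with a few lines of Python. The metric key names and the sign convention (second variant minus first) are assumptions made for this sketch.

```python
from itertools import combinations

# Values copied from the comparison matrix above.
matrix = {
    "Variant A": {"pass_rate": 0.87, "weighted_score": 0.78, "avg_latency_ms": 980, "avg_tokens": 1820},
    "Variant B": {"pass_rate": 0.94, "weighted_score": 0.86, "avg_latency_ms": 1240, "avg_tokens": 2100},
    "Variant C": {"pass_rate": 0.91, "weighted_score": 0.82, "avg_latency_ms": 1180, "avg_tokens": 1950},
}

def pairwise_deltas(matrix):
    """Return {(first, second): {metric: second - first}} for every variant pair."""
    deltas = {}
    for first, second in combinations(matrix, 2):
        deltas[(first, second)] = {
            metric: round(matrix[second][metric] - matrix[first][metric], 4)
            for metric in matrix[first]
        }
    return deltas

deltas = pairwise_deltas(matrix)
for pair, values in deltas.items():
    print(pair, values)
```

Relative to Variant A, Variant B gains 0.07 in pass rate and 0.08 in weighted score at a cost of 260 ms average latency and 280 tokens per sample.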


Promoting a Winner

After identifying the best variant, you can promote its configuration to the agent spec:

  1. Review the winning variant's configuration -- model, prompt, temperature, tools
  2. Update the agent spec in the Agent Builder with the winning settings
  3. Bump the version to track the change in the component changelog
  4. Deploy the updated solution through deployment rings

Promote through proper channels

Do not copy experiment overrides directly to production. Update the agent spec, run any required approval workflows, and promote through your organization's deployment rings. This ensures all changes are tracked and auditable.


Best Practices

Dataset Design

  • Use at least 50 samples -- Smaller datasets rarely give statistically meaningful results; 200 or more is better.
  • Include negative cases -- inputs the agent should refuse or escalate.
  • Refresh regularly -- Add new samples from production traces using Trace-to-Test conversion.
  • Label carefully -- Ambiguous expected outputs produce noisy scores.

Experiment Design

  • Isolate variables -- Change one thing per variant to identify what drives improvement.
  • Run multiple times -- Non-deterministic models produce different outputs each run. Run the experiment 2-3 times and look for consistent patterns.
  • Balance quality and cost -- A higher-quality model may win on accuracy but cost 10x more per request. Include cost metrics in your gates.
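To judge whether a pattern is consistent across repeated runs, compare the spread within each variant to the gap between variants. The pass rates below are made up for illustration, and the "every run of one variant beats every run of the other" rule is just one simple heuristic, not a statistical test.

```python
from statistics import mean, stdev

def summarize_runs(pass_rates):
    """Summarize one variant's pass rate across repeated experiment runs."""
    return {
        "mean": mean(pass_rates),
        "stdev": stdev(pass_rates),
        "spread": max(pass_rates) - min(pass_rates),
    }

def consistently_better(candidate_runs, baseline_runs):
    """True if the candidate's worst run still beats the baseline's best run."""
    return min(candidate_runs) > max(baseline_runs)

# Hypothetical pass rates from three runs of the same experiment.
runs = {
    "Variant A": [0.87, 0.85, 0.88],
    "Variant B": [0.94, 0.92, 0.95],
}
for name, pass_rates in runs.items():
    print(name, summarize_runs(pass_rates))
print(consistently_better(runs["Variant B"], runs["Variant A"]))
```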

Scorer Selection

  • Combine scorer types -- Use semantic similarity for content quality and LLM judge for subjective criteria.
  • Weight appropriately -- If accuracy matters more than style, give the exact match scorer a higher weight.
  • Calibrate thresholds -- Run a baseline experiment first to understand your dataset's score distribution before setting pass/fail thresholds.
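For threshold calibration, it helps to see how the pass rate would move under different candidate thresholds before committing to one. The baseline scores below are invented, and the helper name is hypothetical.

```python
def pass_rate_at(scores, threshold):
    """Fraction of samples whose score meets or exceeds the threshold."""
    return sum(1 for score in scores if score >= threshold) / len(scores)

# Hypothetical per-sample scores from a baseline run of one scorer.
baseline_scores = [0.62, 0.71, 0.74, 0.78, 0.80, 0.83, 0.85, 0.88, 0.91, 0.95]

for threshold in (0.7, 0.8, 0.9):
    print(f"threshold {threshold}: pass rate {pass_rate_at(baseline_scores, threshold):.2f}")
```

With this distribution, a threshold of 0.8 would already fail 40% of baseline samples, so a gate of `pass_rate >= 0.9` would be unreachable without first improving the agent or lowering the threshold.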