
Playground (Beta)

The Playground is an interactive testing environment for agents. It lets you execute agents against prompts, compare different configurations side by side, and run batch evaluations against datasets -- all without deploying to a ring.


Execution Modes

The Playground supports three modes, each designed for a different stage of agent development:

Single Run (quick testing) -> A/B Compare (variant comparison) -> Dataset Eval (batch testing)

Single Run

The simplest mode. Select an agent, enter a prompt, and see the result.

Basic Execution

  1. Select an agent from the dropdown
  2. Type a prompt
  3. Click Run

The response panel shows:

Field | Description
Output | The agent's text response
Status | success, error, or timeout
Duration | Execution time in milliseconds
Model Used | Resolved model string (e.g., anthropic/claude-sonnet-4-20250514)
Prompt Used | The system message sent to the model
Tools Used | Which tools were invoked during execution
Token Usage | Input tokens, output tokens, total tokens
Messages | Full message history including tool calls and results

Overrides

The Playground lets you override any agent configuration without modifying the saved spec. This is the core workflow for iterating on agent behavior.

Override | What It Does
Model override | Swap the model (e.g., test Claude vs GPT-4o)
Prompt override | Replace the system prompt with custom text
Tool overrides | Change which tools are available
Temperature | Adjust randomness (0.0 - 2.0)
Max tokens | Change response length limit

Overrides are temporary

Playground overrides never modify the saved agent configuration. They only affect the current execution. To persist changes, update the agent spec in the Agent Builder.
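
As a rough sketch, overrides could be assembled into a request body as shown below. The fields `agent_uuid`, `user_prompt`, `model_override`, and `temperature` appear in request bodies on this page; `prompt_override`, `max_tokens`, and the helper name `build_execute_request` are assumptions made for illustration and may not match the actual API.

```python
from typing import Any, Optional


def build_execute_request(
    agent_uuid: str,
    user_prompt: str,
    model_override: Optional[str] = None,
    prompt_override: Optional[str] = None,  # assumed field name
    temperature: Optional[float] = None,
    max_tokens: Optional[int] = None,  # assumed field name
) -> dict[str, Any]:
    """Build a request body that carries only the overrides that were set."""
    if temperature is not None and not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0.0 and 2.0")
    if max_tokens is not None and max_tokens <= 0:
        raise ValueError("max_tokens must be positive")

    body: dict[str, Any] = {"agent_uuid": agent_uuid, "user_prompt": user_prompt}
    overrides = {
        "model_override": model_override,
        "prompt_override": prompt_override,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    # Leave out unset overrides so the saved agent configuration applies to them.
    body.update({key: value for key, value in overrides.items() if value is not None})
    return body
```

Validating the temperature range locally catches an out-of-range value before a request is sent.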

Execution Pipeline

You can choose between two execution paths:

  • In-process (default) -- Runs the agent directly in the backend. Faster for development.
  • Deployment pipeline -- Deploys the agent to a ring, executes, then cleans up. Tests the full production path.

A/B Comparison

Compare two configurations of the same agent on the same prompt. This is useful for evaluating the impact of a model change, prompt edit, or tool selection.

Setting Up a Comparison

  1. Select an agent
  2. Configure Variant A (e.g., current prompt with Claude Sonnet)
  3. Configure Variant B (e.g., revised prompt with GPT-4o)
  4. Enter a prompt
  5. Click Compare

Both variants execute in parallel. The results panel shows side-by-side output with delta metrics.

What Gets Compared

Each variant can override independently:

  • Model
  • System prompt (inline text or prompt component reference)
  • Tools
  • Temperature
  • Max tokens

Delta Metrics

The comparison view highlights differences:

Metric | Description
Token delta | Difference in total tokens consumed
Latency delta | Difference in execution time (ms)
Cost delta | Difference in estimated cost (when cost data is available)
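
To make the delta metrics concrete, here is a minimal sketch of how they could be computed from two variant results. The `VariantResult` type, the output key names, and the "B minus A" sign convention are assumptions; the Playground may report deltas differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VariantResult:
    total_tokens: int
    duration_ms: float
    cost: Optional[float] = None  # None when cost data is unavailable


def compute_deltas(a: VariantResult, b: VariantResult) -> dict:
    """Return B minus A for each metric; the cost delta is None if either cost is missing."""
    cost_delta = None
    if a.cost is not None and b.cost is not None:
        cost_delta = b.cost - a.cost
    return {
        "token_delta": b.total_tokens - a.total_tokens,
        "latency_delta_ms": b.duration_ms - a.duration_ms,
        "cost_delta": cost_delta,
    }
```

Under this convention, a negative latency delta means Variant B ran faster than Variant A.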

N-Way Comparison

For deeper analysis, you can compare up to 8 variants simultaneously:

{
  "agent_uuid": "...",
  "user_prompt": "Summarize the Q3 financial report",
  "variants": [
    {"label": "Claude Sonnet / Temp 0.2", "model_override": "...", "temperature": 0.2},
    {"label": "Claude Sonnet / Temp 0.7", "model_override": "...", "temperature": 0.7},
    {"label": "GPT-4o / Temp 0.2", "model_override": "...", "temperature": 0.2},
    {"label": "GPT-4o / Temp 0.7", "model_override": "...", "temperature": 0.7}
  ]
}

The response includes a summary.deltas matrix with pairwise comparisons across all variants.
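
A grid of variants like the one above can be generated rather than written by hand. The following is a sketch only: the helper `build_variants` and the placeholder model strings are hypothetical, while the `label`, `model_override`, and `temperature` keys and the limit of 8 come from this section.

```python
from itertools import product

MAX_VARIANTS = 8  # limit stated in this section


def build_variants(models: dict[str, str], temperatures: list[float]) -> list[dict]:
    """Create one variant per (model, temperature) pair.

    `models` maps a display name to the model string sent as model_override.
    """
    variants = [
        {
            "label": f"{name} / Temp {temp}",
            "model_override": model,
            "temperature": temp,
        }
        for (name, model), temp in product(models.items(), temperatures)
    ]
    if len(variants) > MAX_VARIANTS:
        raise ValueError(f"{len(variants)} variants exceeds the limit of {MAX_VARIANTS}")
    return variants
```

Two models crossed with two temperatures yields four variants, matching the shape of the request shown above.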


Dataset Evaluation

Run your agent against an entire dataset to measure quality at scale. This mode executes the agent once per dataset row and scores the results.

Running an Evaluation

  1. Select an agent
  2. Select a dataset (must contain an input column and, optionally, an expected_output column)
  3. Configure overrides (optional)
  4. Select scorers and quality gates
  5. Click Evaluate

Scorers

Scorers measure the quality of each agent response:

Scorer Type | Description
Exact Match | Output matches expected output exactly (case-sensitive or normalized)
Semantic Similarity | Embedding-based similarity score (configurable threshold)
JSON Schema Conformance | Output matches a JSON schema
Numeric Threshold | Extracted numeric value falls within a range
LLM Judge | A second model evaluates the response against a rubric
Code Block | Custom Python scoring logic
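
To illustrate what two of the simpler scorers check, here is a minimal sketch of exact-match and numeric-threshold logic. The function names and signatures are hypothetical, and the normalization shown (collapse whitespace, lowercase) and the "first number in the output" extraction rule are assumptions; the built-in scorers may behave differently.

```python
import re


def exact_match(output: str, expected: str, normalize: bool = False) -> bool:
    """Compare strings exactly, or after collapsing whitespace and lowercasing."""
    if normalize:
        output = " ".join(output.split()).lower()
        expected = " ".join(expected.split()).lower()
    return output == expected


def numeric_threshold(output: str, low: float, high: float) -> bool:
    """Pass if the first number found in the output lies within [low, high]."""
    match = re.search(r"-?\d+(?:\.\d+)?", output)
    if match is None:
        return False  # no numeric value could be extracted
    return low <= float(match.group()) <= high
```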

Quality Gates

Gates define pass/fail criteria for the evaluation run:

{
  "gates": [
    {
      "type": "condition",
      "condition": {"metric": "pass_rate", "op": ">=", "value": 0.9}
    },
    {
      "type": "condition",
      "condition": {"metric": "latency_p95_ms", "op": "<=", "value": 5000}
    }
  ]
}

Gates can be combined with and, or, and not operators for complex requirements.
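
The logic of nested gates can be sketched as a small recursive evaluator. The `condition` shape matches the JSON above; the shapes used here for the combinators (`{"type": "and", "gates": [...]}`, `{"type": "or", "gates": [...]}`, `{"type": "not", "gate": {...}}`) and any comparison operators other than `>=` and `<=` are assumptions, since this page does not show them.

```python
import operator

OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,   # assumed operator
    "<": operator.lt,   # assumed operator
    "==": operator.eq,  # assumed operator
}


def evaluate_gate(gate: dict, metrics: dict) -> bool:
    """Recursively evaluate a gate against a dict of evaluation metrics."""
    kind = gate["type"]
    if kind == "condition":
        cond = gate["condition"]
        return OPS[cond["op"]](metrics[cond["metric"]], cond["value"])
    # The combinator shapes below are assumed, not taken from the API.
    if kind == "and":
        return all(evaluate_gate(g, metrics) for g in gate["gates"])
    if kind == "or":
        return any(evaluate_gate(g, metrics) for g in gate["gates"])
    if kind == "not":
        return not evaluate_gate(gate["gate"], metrics)
    raise ValueError(f"unknown gate type: {kind}")
```

For example, a run with a pass rate of 0.93 and a p95 latency of 6200 ms passes the first gate above but fails the second.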

Evaluation Results

The evaluation response includes:

Metric | Description
total_samples | Number of dataset rows evaluated
passed_samples | Rows that passed all scorers
failed_samples | Rows that failed one or more scorers
pass_rate | Overall pass rate (0.0 - 1.0)
latency_avg_ms | Average execution time
latency_p50_ms | Median execution time
latency_p95_ms | 95th percentile execution time
weighted_score | Weighted average across all scorers
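
To show how these metrics relate to per-sample results, here is a minimal sketch that aggregates them locally. The per-sample keys (`passed`, `latency_ms`) are hypothetical, and the nearest-rank percentile method is an assumption; the platform's exact percentile calculation is not documented on this page.

```python
import math
from statistics import mean, median


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile (an assumed method, chosen for simplicity)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples: list[dict]) -> dict:
    """Aggregate samples that each carry 'passed' (bool) and 'latency_ms' (float)."""
    if not samples:
        raise ValueError("at least one sample is required")
    latencies = [s["latency_ms"] for s in samples]
    passed = sum(1 for s in samples if s["passed"])
    return {
        "total_samples": len(samples),
        "passed_samples": passed,
        "failed_samples": len(samples) - passed,
        "pass_rate": passed / len(samples),
        "latency_avg_ms": mean(latencies),
        "latency_p50_ms": median(latencies),
        "latency_p95_ms": percentile(latencies, 95),
    }
```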

Async execution

Dataset evaluations run asynchronously. The Playground returns an eval_run_id that you can poll for status and results.
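
A polling loop for an asynchronous run might look like the sketch below. The status endpoint is not described on this page, so the lookup is passed in as a caller-supplied function; the `status` key and the terminal values `completed` and `failed` are assumptions.

```python
import time
from typing import Callable

TERMINAL_STATES = {"completed", "failed"}  # assumed status values


def wait_for_eval(
    fetch_status: Callable[[str], dict],
    eval_run_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
) -> dict:
    """Poll fetch_status(eval_run_id) until the run reaches a terminal state."""
    deadline = time.monotonic() + timeout_s
    while True:
        run = fetch_status(eval_run_id)
        if run["status"] in TERMINAL_STATES:
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError(f"eval run {eval_run_id} did not finish in {timeout_s}s")
        time.sleep(interval_s)
```

Injecting `fetch_status` keeps the loop independent of any particular HTTP client and makes it easy to test with a stub.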


Trace-to-Test Conversion

Turn production traces into test cases. If you see an interesting or problematic interaction in your gateway traces, you can convert it into a dataset sample for regression testing.

  1. Select one or more trace IDs from the observability dashboard
  2. Choose an existing dataset or create a new one
  3. Optionally include the original response as expected_output

{
  "trace_ids": ["uuid-1", "uuid-2", "uuid-3"],
  "dataset_name": "Support Edge Cases",
  "include_expected_output": true
}

This creates dataset rows from real traffic, building a test suite grounded in actual usage.


Streaming Execution

All three execution modes have SSE (Server-Sent Events) streaming variants. Streaming endpoints emit real-time events as the agent runs, rather than waiting for the full response.

Endpoint | Description | Stream Events
POST /playground/execute/stream | Execute with overrides (streaming) | status, chunk, usage, result, error
POST /playground/compare/stream | N-way comparison (streaming) | Interleaved per-variant events
POST /playground/evaluate/stream | Dataset evaluation (streaming) | Per-sample progress events

All streaming endpoints accept the same request body as their non-streaming counterparts and return Content-Type: text/event-stream.

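import httpx

# Replace these placeholders with your own values.
org_id = "..."
workspace_id = "..."
token = "..."

base = f"https://api.flow.marut.cloud/api/v1/orgs/{org_id}/workspaces/{workspace_id}"

# timeout=None disables httpx's default 5-second timeout, which a long-running stream can exceed.
with httpx.stream(
    "POST",
    f"{base}/playground/execute/stream",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "agent_uuid": "...",
        "user_prompt": "Summarize the Q3 financial report",
    },
    timeout=None,
) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        # SSE payload lines are prefixed with "data:".
        if line.startswith("data:"):
            print(line[5:].strip())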
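
To tell event types such as status, chunk, and result apart, the raw lines can be grouped into events. The sketch below follows the standard SSE wire format (`event:` and `data:` fields, with a blank line ending each event). Whether these endpoints set the `event:` field and send JSON payloads is an assumption, and `parse_sse` is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[tuple[str, object]]:
    """Yield (event_name, payload) pairs from raw SSE lines.

    Payloads are decoded as JSON when possible and returned as text otherwise.
    """
    event, data = "message", []  # "message" is the SSE default event name
    for line in lines:
        if line == "":
            # A blank line dispatches the event collected so far.
            if data:
                raw = "\n".join(data)
                try:
                    payload = json.loads(raw)
                except json.JSONDecodeError:
                    payload = raw
                yield event, payload
            event, data = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            # The SSE format drops a single leading space after the colon.
            data.append(line[len("data:"):].removeprefix(" "))
```

With httpx, `parse_sse(response.iter_lines())` can replace a manual loop over the response lines.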

Feature flag

Streaming endpoints require the playground feature to be enabled on the workspace.


Using the Playground

All Playground operations are available through the web console.

Navigate to Playground. Select an agent from the dropdown, type a message, and click Run. The results panel shows the agent's output, token usage, latency, and tool call details.

Switch to Compare mode. Configure model, prompt, or temperature overrides for each variant. Enter a prompt and click Compare. Results appear side by side with delta metrics.

Switch to Evaluate mode. Select an agent and a dataset, choose scorers and quality gates, then click Evaluate. The evaluation runs asynchronously, and results appear when it completes.