Playground (Beta)¶

The Playground is an interactive testing environment for agents. It lets you execute agents against prompts, compare different configurations side by side, and run batch evaluations against datasets -- all without deploying to a ring.

Execution Modes¶

The Playground supports three modes, each designed for a different stage of agent development:

graph LR
    Single["Single Run<br/>Quick testing"] --> Compare["A/B Compare<br/>Variant comparison"]
    Compare --> Eval["Dataset Eval<br/>Batch testing"]

Single Run¶

The simplest mode. Select an agent, enter a prompt, and see the result.

Basic Execution¶

1. Select an agent from the dropdown.
2. Type a prompt.
3. Click Run.

The response panel shows:

Field	Description
Output	The agent's text response
Status	`success`, `error`, or `timeout`
Duration	Execution time in milliseconds
Model Used	Resolved model string (e.g., `anthropic/claude-sonnet-4-20250514`)
Prompt Used	The system message sent to the model
Tools Used	Which tools were invoked during execution
Token Usage	Input tokens, output tokens, total tokens
Messages	Full message history including tool calls and results

Overrides¶

The Playground lets you override the agent settings listed below without modifying the saved spec. This is the core workflow for iterating on agent behavior.

Override	What It Does
Model override	Swap the model (e.g., test Claude vs GPT-4o)
Prompt override	Replace the system prompt with custom text
Tool overrides	Change which tools are available
Temperature	Adjust randomness (0.0 - 2.0)
Max tokens	Change response length limit

Overrides are temporary

Playground overrides never modify the saved agent configuration. They only affect the current execution. To persist changes, update the agent spec in the Agent Builder.
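
As a rough illustration of how overrides might travel with a single execution, the sketch below builds a request body that carries only the overrides that were set. The `agent_uuid`, `user_prompt`, `model_override`, and `temperature` keys appear in request examples elsewhere on this page; placing them at the top level of an execute request, and the helper name `build_execute_payload`, are assumptions made for illustration.

```python
from typing import Any, Optional


def build_execute_payload(
    agent_uuid: str,
    user_prompt: str,
    model_override: Optional[str] = None,
    temperature: Optional[float] = None,
) -> dict[str, Any]:
    """Build a request body that includes only the overrides that were set.

    Hypothetical helper: the exact request schema is an assumption.
    """
    # The overrides table documents a temperature range of 0.0 - 2.0.
    if temperature is not None and not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0.0 and 2.0")

    payload: dict[str, Any] = {"agent_uuid": agent_uuid, "user_prompt": user_prompt}
    if model_override is not None:
        payload["model_override"] = model_override
    if temperature is not None:
        payload["temperature"] = temperature
    return payload
```

Leaving unset overrides out of the body, rather than sending nulls, keeps the saved agent configuration as the fallback for every field that is not explicitly overridden.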

Execution Pipeline¶

You can choose between two execution paths:

In-process (default) -- Runs the agent directly in the backend. Faster for development.
Deployment pipeline -- Deploys the agent to a ring, executes, then cleans up. Tests the full production path.

A/B Comparison¶

Compare two configurations of the same agent on the same prompt. This is useful for evaluating the impact of a model change, prompt edit, or tool selection.

Setting Up a Comparison¶

1. Select an agent.
2. Configure Variant A (e.g., current prompt with Claude Sonnet).
3. Configure Variant B (e.g., revised prompt with GPT-4o).
4. Enter a prompt.
5. Click Compare.

Both variants execute in parallel. The results panel shows side-by-side output with delta metrics.

What Gets Compared¶

Each variant can override independently:

Model
System prompt (inline text or prompt component reference)
Tools
Temperature
Max tokens

Delta Metrics¶

The comparison view highlights differences:

Metric	Description
Token delta	Difference in total tokens consumed
Latency delta	Difference in execution time (ms)
Cost delta	Difference in estimated cost (when cost data is available)
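
The delta metrics above amount to simple differences between two variant results. A minimal sketch, assuming each result exposes total tokens, duration, and an optional cost; the `VariantResult` type, its field names, and the sign convention (B minus A) are illustrative assumptions, not the Playground's actual response schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VariantResult:
    """Hypothetical container for the metrics of one variant run."""
    total_tokens: int
    duration_ms: float
    cost: Optional[float] = None  # None when cost data is unavailable


def compute_deltas(a: VariantResult, b: VariantResult) -> dict:
    """Return B minus A for each metric.

    The cost delta is None unless both variants report a cost.
    """
    cost_delta = None
    if a.cost is not None and b.cost is not None:
        cost_delta = b.cost - a.cost
    return {
        "token_delta": b.total_tokens - a.total_tokens,
        "latency_delta_ms": b.duration_ms - a.duration_ms,
        "cost_delta": cost_delta,
    }
```

With this convention, a negative latency delta means Variant B was faster than Variant A.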

N-Way Comparison¶

For deeper analysis, you can compare up to 8 variants simultaneously:

{
  "agent_uuid": "...",
  "user_prompt": "Summarize the Q3 financial report",
  "variants": [
    {"label": "Claude Sonnet / Temp 0.2", "model_override": "...", "temperature": 0.2},
    {"label": "Claude Sonnet / Temp 0.7", "model_override": "...", "temperature": 0.7},
    {"label": "GPT-4o / Temp 0.2", "model_override": "...", "temperature": 0.2},
    {"label": "GPT-4o / Temp 0.7", "model_override": "...", "temperature": 0.7}
  ]
}

The response includes a `summary.deltas` matrix with pairwise comparisons across all variants.
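
A variant list like the one above can be generated rather than written by hand. The sketch below crosses a set of models with a set of temperatures and enforces the 8-variant limit; it also lists the label pairs that a pairwise comparison would cover. The helper names and the model identifier strings used in any call are placeholders, not real values.

```python
from itertools import combinations, product

MAX_VARIANTS = 8  # limit stated in this section


def build_variants(models: dict[str, str], temperatures: list[float]) -> list[dict]:
    """Create one variant per (model, temperature) combination.

    `models` maps a display name to a model identifier (placeholder values).
    """
    variants = [
        {
            "label": f"{name} / Temp {temp}",
            "model_override": model_id,
            "temperature": temp,
        }
        for (name, model_id), temp in product(models.items(), temperatures)
    ]
    if len(variants) > MAX_VARIANTS:
        raise ValueError(f"{len(variants)} variants exceeds the limit of {MAX_VARIANTS}")
    return variants


def pairwise_labels(variants: list[dict]) -> list[tuple[str, str]]:
    """List every unordered pair of variant labels."""
    return list(combinations([v["label"] for v in variants], 2))
```

Two models at two temperatures give four variants and six pairs, so the number of pairwise comparisons grows quickly as variants are added.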

Dataset Evaluation¶

Run your agent against an entire dataset to measure quality at scale. This mode executes the agent once per dataset row and scores the results.

Running an Evaluation¶

1. Select an agent.
2. Select a dataset (must contain an `input` column and, optionally, an `expected_output` column).
3. Configure overrides (optional).
4. Select scorers and quality gates.
5. Click Evaluate.

Scorers¶

Scorers measure the quality of each agent response:

Scorer Type	Description
Exact Match	Output matches expected output exactly (case-sensitive or normalized)
Semantic Similarity	Embedding-based similarity score (configurable threshold)
JSON Schema Conformance	Output matches a JSON schema
Numeric Threshold	Extracted numeric value falls within a range
LLM Judge	A second model evaluates the response against a rubric
Code Block	Custom Python scoring logic
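
To make two of these scorer types concrete, here is a minimal sketch of an exact-match scorer with optional normalization and a numeric-threshold scorer. The function signatures, the 0.0/1.0 score convention, and the normalization rules (lowercasing and whitespace collapsing) are assumptions for illustration; the built-in scorers may behave differently.

```python
import re

# Matches the first integer or decimal number in a string.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def exact_match(output: str, expected: str, normalize: bool = False) -> float:
    """Score 1.0 if the strings match, else 0.0.

    With normalize=True, case and whitespace differences are ignored.
    """
    if normalize:
        output = " ".join(output.lower().split())
        expected = " ".join(expected.lower().split())
    return 1.0 if output == expected else 0.0


def numeric_threshold(output: str, low: float, high: float) -> float:
    """Score 1.0 if the first number found in the output lies in [low, high]."""
    match = _NUMBER.search(output)
    if match is None:
        return 0.0
    return 1.0 if low <= float(match.group()) <= high else 0.0
```

Returning a float rather than a boolean lets the same shape serve scorers that produce graded results, such as a similarity score.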

Quality Gates¶

Gates define pass/fail criteria for the evaluation run:

{
  "gates": [
    {
      "type": "condition",
      "condition": {"metric": "pass_rate", "op": ">=", "value": 0.9}
    },
    {
      "type": "condition",
      "condition": {"metric": "latency_p95_ms", "op": "<=", "value": 5000}
    }
  ]
}

Gates can be combined with `and`, `or`, and `not` operators for complex requirements.
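
The sketch below shows one way such gates could be evaluated against a metrics dictionary. The `condition` shape follows the example above; the shape of the combinators (`{"type": "and", "gates": [...]}` and `{"type": "not", "gate": {...}}`) and the supported operator set are assumptions, since the combinator syntax is not shown on this page.

```python
import operator

# Comparison operators a condition may use (assumed set).
_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
    "==": operator.eq,
}


def evaluate_gate(gate: dict, metrics: dict) -> bool:
    """Recursively evaluate a gate against evaluation metrics."""
    kind = gate["type"]
    if kind == "condition":
        cond = gate["condition"]
        return _OPS[cond["op"]](metrics[cond["metric"]], cond["value"])
    if kind == "and":
        return all(evaluate_gate(g, metrics) for g in gate["gates"])
    if kind == "or":
        return any(evaluate_gate(g, metrics) for g in gate["gates"])
    if kind == "not":
        return not evaluate_gate(gate["gate"], metrics)
    raise ValueError(f"unknown gate type: {kind}")
```

For example, a run with a `pass_rate` of 0.93 and a `latency_p95_ms` of 6200 would pass the first gate in the example above and fail the second.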

Evaluation Results¶

The evaluation response includes:

Metric	Description
`total_samples`	Number of dataset rows evaluated
`passed_samples`	Rows that passed all scorers
`failed_samples`	Rows that failed one or more scorers
`pass_rate`	Overall pass rate (0.0 - 1.0)
`latency_avg_ms`	Average execution time
`latency_p50_ms`	Median execution time
`latency_p95_ms`	95th percentile execution time
`weighted_score`	Weighted average across all scorers
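
For intuition, most of these summary metrics can be derived from per-sample results as in the sketch below. The per-sample keys (`passed`, `duration_ms`) and the nearest-rank percentile method are assumptions; the Playground may compute percentiles differently, and `weighted_score` is left out because the scorer weighting scheme is not described here.

```python
import math
import statistics


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples: list[dict]) -> dict:
    """Aggregate per-sample results into run-level metrics."""
    if not samples:
        raise ValueError("at least one sample is required")
    latencies = [s["duration_ms"] for s in samples]
    passed = sum(1 for s in samples if s["passed"])
    return {
        "total_samples": len(samples),
        "passed_samples": passed,
        "failed_samples": len(samples) - passed,
        "pass_rate": passed / len(samples),
        "latency_avg_ms": statistics.fmean(latencies),
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
    }
```

A single slow sample can dominate `latency_p95_ms` on small datasets, which is worth keeping in mind when a latency gate is applied to a run with only a few rows.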

Async execution

Dataset evaluations run asynchronously. The Playground returns an `eval_run_id` that you can poll for status and results.
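
A polling loop for this could look like the sketch below. Because the status endpoint and its response shape are not documented on this page, the fetch step is passed in as a function, and the `status` key and the terminal state names are assumptions.

```python
import time
from typing import Callable

# Assumed names for states in which the run will not change any further.
TERMINAL_STATES = {"completed", "failed"}


def wait_for_eval(
    fetch_status: Callable[[str], dict],
    eval_run_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
) -> dict:
    """Poll until the evaluation run reaches a terminal state or the timeout expires.

    `fetch_status` is a caller-supplied function that returns the current
    run record for the given eval_run_id.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        run = fetch_status(eval_run_id)
        if run.get("status") in TERMINAL_STATES:
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError(f"evaluation {eval_run_id} did not finish in {timeout_s}s")
        time.sleep(interval_s)
```

Injecting `fetch_status` keeps the loop independent of any particular HTTP client and makes it easy to exercise with a stub.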

Trace-to-Test Conversion¶

Turn production traces into test cases. If you see an interesting or problematic interaction in your gateway traces, you can convert it into a dataset sample for regression testing.

1. Select one or more trace IDs from the observability dashboard.
2. Choose an existing dataset or create a new one.
3. Optionally include the original response as `expected_output`.

{
  "trace_ids": ["uuid-1", "uuid-2", "uuid-3"],
  "dataset_name": "Support Edge Cases",
  "include_expected_output": true
}

This creates dataset rows from real traffic, building a test suite grounded in actual usage.
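
Conceptually, the conversion maps each trace to one dataset row. The sketch below illustrates that mapping; the `input` and `expected_output` column names come from the dataset requirements described earlier, while the trace fields `prompt` and `response` are hypothetical stand-ins for whatever the gateway actually records.

```python
def traces_to_rows(traces: list[dict], include_expected_output: bool = True) -> list[dict]:
    """Turn trace records into dataset rows.

    Each trace is assumed to carry `prompt` and `response` fields
    (hypothetical names used for illustration).
    """
    rows = []
    for trace in traces:
        row = {"input": trace["prompt"]}
        if include_expected_output:
            # Treat the original production response as the expected answer.
            row["expected_output"] = trace["response"]
        rows.append(row)
    return rows
```

Including the original response as `expected_output` is most useful for traces where the agent behaved correctly; for problematic traces, the expected value usually needs to be corrected by hand before it is used for scoring.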

Streaming Execution¶

All three execution modes have SSE (Server-Sent Events) streaming variants. Streaming endpoints emit real-time events as the agent runs, rather than waiting for the full response.

Endpoint	Description	Stream Events
`POST /playground/execute/stream`	Execute with overrides (streaming)	`status`, `chunk`, `usage`, `result`, `error`
`POST /playground/compare/stream`	N-way comparison (streaming)	Interleaved per-variant events
`POST /playground/evaluate/stream`	Dataset evaluation (streaming)	Per-sample progress events

All streaming endpoints accept the same request body as their non-streaming counterparts and return `Content-Type: text/event-stream`.

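import httpx

# Placeholders: substitute your own organization ID, workspace ID, and API token.
org_id = "..."
workspace_id = "..."
token = "..."

base = f"https://api.flow.marut.cloud/api/v1/orgs/{org_id}/workspaces/{workspace_id}"

with httpx.stream(
    "POST",
    f"{base}/playground/execute/stream",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "agent_uuid": "...",
        "user_prompt": "Summarize the Q3 financial report",
    },
    timeout=None,  # long-running streams can exceed httpx's default 5-second timeout
) as response:
    response.raise_for_status()  # fail early on 4xx/5xx instead of reading an error body as events
    for line in response.iter_lines():
        # SSE payload lines are prefixed with "data:"
        if line.startswith("data:"):
            print(line[5:].strip())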
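
The loop above prints only `data:` payloads. To tell event types such as `status`, `chunk`, and `result` apart, the lines can be grouped into events first. The sketch below is a simplified SSE parser; that the Playground labels events using the standard `event:` field is an assumption, and the parser skips parts of the SSE format such as `id:` and `retry:` fields.

```python
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (event, data) pairs from SSE lines; a blank line ends each event."""
    event, data = "message", []  # "message" is the default SSE event type
    for line in lines:
        if line == "":
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        # Other lines (comments starting with ":", id:, retry:) are ignored.
    if data:
        # Lenient: flush a final event that was not followed by a blank line.
        yield event, "\n".join(data)
```

Passing `response.iter_lines()` from the example above to `parse_sse` would yield one pair per event, which makes it straightforward to handle `chunk` and `error` events differently.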

Feature flag

Streaming endpoints require the playground feature to be enabled on the workspace.

Using the Playground¶

All Playground operations are available through the web console.

Single Run: Navigate to Playground. Select an agent from the dropdown, type a message, and click Run. The results panel shows the agent's output, token usage, latency, and tool call details.

A/B Compare: Switch to Compare mode. Configure model, prompt, or temperature overrides for each variant. Enter a prompt and click Compare. Results appear side by side with delta metrics.

Dataset Eval: Switch to Evaluate mode. Select an agent and a dataset, choose scorers and quality gates, then click Evaluate. The evaluation runs asynchronously; results appear when complete.