Playground (Beta)¶
The Playground is an interactive testing environment for agents. It lets you execute agents against prompts, compare different configurations side by side, and run batch evaluations against datasets -- all without deploying to a ring.
Execution Modes¶
The Playground supports three modes, each designed for a different stage of agent development:
```mermaid
graph LR
    Single["Single Run<br/>Quick testing"] --> Compare["A/B Compare<br/>Variant comparison"]
    Compare --> Eval["Dataset Eval<br/>Batch testing"]
```
Single Run¶
The simplest mode. Select an agent, enter a prompt, and see the result.
Basic Execution¶
- Select an agent from the dropdown
- Type a prompt
- Click Run
The response panel shows:
| Field | Description |
|---|---|
| Output | The agent's text response |
| Status | success, error, or timeout |
| Duration | Execution time in milliseconds |
| Model Used | Resolved model string (e.g., anthropic/claude-sonnet-4-20250514) |
| Prompt Used | The system message sent to the model |
| Tools Used | Which tools were invoked during execution |
| Token Usage | Input tokens, output tokens, total tokens |
| Messages | Full message history including tool calls and results |
Overrides¶
The Playground lets you override any agent configuration without modifying the saved spec. This is the core workflow for iterating on agent behavior.
| Override | What It Does |
|---|---|
| Model override | Swap the model (e.g., test Claude vs GPT-4o) |
| Prompt override | Replace the system prompt with custom text |
| Tool overrides | Change which tools are available |
| Temperature | Adjust randomness (0.0 - 2.0) |
| Max tokens | Change response length limit |
Overrides are temporary
Playground overrides never modify the saved agent configuration. They only affect the current execution. To persist changes, update the agent spec in the Agent Builder.
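To illustrate how overrides travel with a single request rather than with the saved spec, the sketch below assembles a request body. The `agent_uuid`, `user_prompt`, `model_override`, and `temperature` keys appear in the comparison example on this page. The `prompt_override` and `max_tokens` key names, and the idea that a single run accepts all of these at the top level, are assumptions and not confirmed API.

```python
from typing import Any, Optional


def build_execute_payload(
    agent_uuid: str,
    user_prompt: str,
    model_override: Optional[str] = None,
    prompt_override: Optional[str] = None,  # hypothetical key name
    temperature: Optional[float] = None,
    max_tokens: Optional[int] = None,  # hypothetical key name
) -> dict[str, Any]:
    """Assemble a request body that carries only the overrides that were set."""
    if temperature is not None and not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0.0 and 2.0")
    if max_tokens is not None and max_tokens <= 0:
        raise ValueError("max_tokens must be positive")

    payload: dict[str, Any] = {"agent_uuid": agent_uuid, "user_prompt": user_prompt}
    overrides = {
        "model_override": model_override,
        "prompt_override": prompt_override,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    # Leave out unset overrides so the saved agent configuration applies to them.
    payload.update({key: value for key, value in overrides.items() if value is not None})
    return payload
```

Because unset overrides are omitted instead of being sent as `null`, a request built this way falls back to the saved spec for everything that was not explicitly changed.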
Execution Pipeline¶
You can choose between two execution paths:
- In-process (default) -- Runs the agent directly in the backend. Faster for development.
- Deployment pipeline -- Deploys the agent to a ring, executes, then cleans up. Tests the full production path.
A/B Comparison¶
Compare two configurations of the same agent on the same prompt. This is useful for evaluating the impact of a model change, prompt edit, or tool selection.
Setting Up a Comparison¶
- Select an agent
- Configure Variant A (e.g., current prompt with Claude Sonnet)
- Configure Variant B (e.g., revised prompt with GPT-4o)
- Enter a prompt
- Click Compare
Both variants execute in parallel. The results panel shows side-by-side output with delta metrics.
What Gets Compared¶
Each variant can override independently:
- Model
- System prompt (inline text or prompt component reference)
- Tools
- Temperature
- Max tokens
Delta Metrics¶
The comparison view highlights differences:
| Metric | Description |
|---|---|
| Token delta | Difference in total tokens consumed |
| Latency delta | Difference in execution time (ms) |
| Cost delta | Difference in estimated cost (when cost data is available) |
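The delta metrics above amount to simple differences between two variant summaries. The sketch below shows one way to compute them client-side. The `VariantResult` fields and the B-minus-A sign convention are assumptions for illustration; the Playground may define both differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VariantResult:
    """Hypothetical per-variant summary; field names are illustrative."""

    total_tokens: int
    duration_ms: float
    cost_usd: Optional[float] = None  # None when cost data is unavailable


def compute_deltas(a: VariantResult, b: VariantResult) -> dict:
    """Return B minus A for each metric; the cost delta is None without cost data."""
    cost_delta = None
    if a.cost_usd is not None and b.cost_usd is not None:
        cost_delta = b.cost_usd - a.cost_usd
    return {
        "token_delta": b.total_tokens - a.total_tokens,
        "latency_delta_ms": b.duration_ms - a.duration_ms,
        "cost_delta": cost_delta,
    }
```

Under this convention, a negative value means Variant B used fewer tokens or finished faster than Variant A.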
N-Way Comparison¶
For deeper analysis, you can compare up to 8 variants simultaneously:
```json
{
  "agent_uuid": "...",
  "user_prompt": "Summarize the Q3 financial report",
  "variants": [
    {"label": "Claude Sonnet / Temp 0.2", "model_override": "...", "temperature": 0.2},
    {"label": "Claude Sonnet / Temp 0.7", "model_override": "...", "temperature": 0.7},
    {"label": "GPT-4o / Temp 0.2", "model_override": "...", "temperature": 0.2},
    {"label": "GPT-4o / Temp 0.7", "model_override": "...", "temperature": 0.7}
  ]
}
```
The response includes a summary.deltas matrix with pairwise comparisons across all variants.
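A variant grid like the one above can be generated rather than typed by hand. The following sketch builds the `variants` list from a set of models and temperatures, enforces the limit of 8, and lists the pairs a pairwise comparison would cover. The helper names and the placeholder model strings in the usage note are hypothetical.

```python
from itertools import combinations, product

MAX_VARIANTS = 8  # limit stated in this section


def build_variants(models: list[str], temperatures: list[float]) -> list[dict]:
    """Create one variant per (model, temperature) combination."""
    variants = [
        {"label": f"{model} / Temp {temp}", "model_override": model, "temperature": temp}
        for model, temp in product(models, temperatures)
    ]
    if len(variants) > MAX_VARIANTS:
        raise ValueError(f"{len(variants)} variants exceed the limit of {MAX_VARIANTS}")
    return variants


def pairwise_labels(variants: list[dict]) -> list[tuple[str, str]]:
    """List every unordered pair of variant labels."""
    return [(a["label"], b["label"]) for a, b in combinations(variants, 2)]
```

For example, `build_variants(["model-a", "model-b"], [0.2, 0.7])` yields four variants, and `pairwise_labels` on that list yields six pairs.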
Dataset Evaluation¶
Run your agent against an entire dataset to measure quality at scale. This mode executes the agent once per dataset row and scores the results.
Running an Evaluation¶
- Select an agent
- Select a dataset (must contain an `input` column and, optionally, an `expected_output` column)
- Configure overrides (optional)
- Select scorers and quality gates
- Click Evaluate
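Before running an evaluation, it can help to check a dataset locally against the column requirement in the steps above. This is a minimal sketch that assumes the rows are available as Python dicts; `check_rows` is a hypothetical helper and not part of the platform.

```python
def check_rows(rows: list[dict]) -> dict:
    """Report rows missing `input` and count rows that carry `expected_output`."""
    missing_input = [
        index
        for index, row in enumerate(rows)
        if not str(row.get("input") or "").strip()
    ]
    with_expected = sum(1 for row in rows if row.get("expected_output") is not None)
    return {"missing_input": missing_input, "with_expected_output": with_expected}
```

Rows without `expected_output` are still valid, but scorers that compare against a reference, such as Exact Match, would presumably have nothing to compare with on those rows.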
Scorers¶
Scorers measure the quality of each agent response:
| Scorer Type | Description |
|---|---|
| Exact Match | Output matches expected output exactly (case-sensitive or normalized) |
| Semantic Similarity | Embedding-based similarity score (configurable threshold) |
| JSON Schema Conformance | Output matches a JSON schema |
| Numeric Threshold | Extracted numeric value falls within a range |
| LLM Judge | A second model evaluates the response against a rubric |
| Code Block | Custom Python scoring logic |
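To make two of the scorer types concrete, the sketch below shows what an exact-match check and a numeric-threshold check might compute. It is an illustration only: the normalization rule (trim, collapse whitespace, lowercase) and the "first number in the output" extraction are assumptions, and the function signatures are not the platform's scorer interface.

```python
import re


def exact_match(output: str, expected: str, normalize: bool = False) -> bool:
    """Compare output and expected text, optionally after normalization."""
    if normalize:
        # Assumed normalization: trim, collapse whitespace runs, lowercase.
        return " ".join(output.split()).lower() == " ".join(expected.split()).lower()
    return output == expected


def numeric_threshold(output: str, low: float, high: float) -> bool:
    """Pass when the first number found in the output lies within [low, high]."""
    match = re.search(r"-?\d+(?:\.\d+)?", output)
    if match is None:
        return False
    return low <= float(match.group()) <= high
```

With these definitions, `exact_match("Hello  World", "hello world")` fails in case-sensitive mode and passes with `normalize=True`.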
Quality Gates¶
Gates define pass/fail criteria for the evaluation run:
```json
{
  "gates": [
    {
      "type": "condition",
      "condition": {"metric": "pass_rate", "op": ">=", "value": 0.9}
    },
    {
      "type": "condition",
      "condition": {"metric": "latency_p95_ms", "op": "<=", "value": 5000}
    }
  ]
}
```
Gates can be combined with and, or, and not operators for complex requirements.
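The logic of combined gates can be sketched as a small recursive evaluator. The `condition` shape matches the example above. The shapes used here for combinators (`{"type": "and", "gates": [...]}`, `{"type": "or", "gates": [...]}`, `{"type": "not", "gate": {...}}`) and the operators other than `>=` and `<=` are assumptions, since this page does not show them.

```python
import operator

# Only ">=" and "<=" appear in the documented example; the rest are assumed.
OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
    "==": operator.eq,
}


def evaluate_gate(gate: dict, metrics: dict) -> bool:
    """Evaluate a gate tree against a dict of evaluation metrics."""
    kind = gate["type"]
    if kind == "condition":
        cond = gate["condition"]
        return OPS[cond["op"]](metrics[cond["metric"]], cond["value"])
    if kind == "and":
        return all(evaluate_gate(child, metrics) for child in gate["gates"])
    if kind == "or":
        return any(evaluate_gate(child, metrics) for child in gate["gates"])
    if kind == "not":
        return not evaluate_gate(gate["gate"], metrics)
    raise ValueError(f"unknown gate type: {kind}")
```

Evaluating gates locally like this can be a quick way to sanity-check thresholds against metrics from a previous run before attaching them to a new evaluation.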
Evaluation Results¶
The evaluation response includes:
| Metric | Description |
|---|---|
| `total_samples` | Number of dataset rows evaluated |
| `passed_samples` | Rows that passed all scorers |
| `failed_samples` | Rows that failed one or more scorers |
| `pass_rate` | Overall pass rate (0.0 - 1.0) |
| `latency_avg_ms` | Average execution time |
| `latency_p50_ms` | Median execution time |
| `latency_p95_ms` | 95th percentile execution time |
| `weighted_score` | Weighted average across all scorers |
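The summary metrics relate to per-sample results roughly as in the sketch below. The per-sample dict shape and the nearest-rank percentile method are assumptions; the platform may compute percentiles differently. `weighted_score` is left out because scorer weights are not described on this page.

```python
import math
from statistics import mean, median


def percentile_nearest_rank(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value covering pct percent of the data."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples: list[dict]) -> dict:
    """Each sample is a hypothetical dict: {"passed": bool, "duration_ms": float}."""
    if not samples:
        raise ValueError("at least one sample is required")
    durations = [sample["duration_ms"] for sample in samples]
    passed = sum(1 for sample in samples if sample["passed"])
    return {
        "total_samples": len(samples),
        "passed_samples": passed,
        "failed_samples": len(samples) - passed,
        "pass_rate": passed / len(samples),
        "latency_avg_ms": mean(durations),
        "latency_p50_ms": median(durations),
        "latency_p95_ms": percentile_nearest_rank(durations, 95),
    }
```

Note that with small datasets the nearest-rank 95th percentile is simply the slowest sample, so `latency_p95_ms` gates are most meaningful on larger runs.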
Async execution
Dataset evaluations run asynchronously. The Playground returns an eval_run_id that you can poll for status and results.
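A polling loop for an asynchronous run might look like the sketch below. The status endpoint is not documented here, so the HTTP call is injected as a `fetch_status` callable. The terminal status values `"completed"` and `"failed"` are hypothetical placeholders for whatever the API actually reports.

```python
import time
from typing import Callable


def wait_for_eval(
    fetch_status: Callable[[str], dict],
    eval_run_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the run reports a terminal status or the timeout elapses."""
    terminal = {"completed", "failed"}  # hypothetical status values
    deadline = time.monotonic() + timeout_s
    while True:
        run = fetch_status(eval_run_id)
        if run.get("status") in terminal:
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError(f"evaluation {eval_run_id} did not finish in {timeout_s}s")
        sleep(interval_s)
```

Injecting `fetch_status` and `sleep` keeps the loop independent of any particular HTTP client and makes it easy to exercise in tests without waiting.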
Trace-to-Test Conversion¶
Turn production traces into test cases. If you see an interesting or problematic interaction in your gateway traces, you can convert it into a dataset sample for regression testing.
- Select one or more trace IDs from the observability dashboard
- Choose an existing dataset or create a new one
- Optionally include the original response as `expected_output`
```json
{
  "trace_ids": ["uuid-1", "uuid-2", "uuid-3"],
  "dataset_name": "Support Edge Cases",
  "include_expected_output": true
}
```
This creates dataset rows from real traffic, building a test suite grounded in actual usage.
Streaming Execution¶
All three execution modes have SSE (Server-Sent Events) streaming variants. Streaming endpoints emit real-time events as the agent runs, rather than waiting for the full response.
| Endpoint | Description | Stream Events |
|---|---|---|
| `POST /playground/execute/stream` | Execute with overrides (streaming) | status, chunk, usage, result, error |
| `POST /playground/compare/stream` | N-way comparison (streaming) | Interleaved per-variant events |
| `POST /playground/evaluate/stream` | Dataset evaluation (streaming) | Per-sample progress events |
All streaming endpoints accept the same request body as their non-streaming counterparts and return Content-Type: text/event-stream.
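```python
import httpx

# Placeholder values: substitute your own IDs and API token.
org_id = "<org-id>"
workspace_id = "<workspace-id>"
token = "<api-token>"

base = f"https://api.flow.marut.cloud/api/v1/orgs/{org_id}/workspaces/{workspace_id}"

with httpx.stream(
    "POST",
    f"{base}/playground/execute/stream",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "agent_uuid": "...",
        "user_prompt": "Summarize the Q3 financial report",
    },
) as response:
    response.raise_for_status()  # fail early on auth or validation errors
    for line in response.iter_lines():
        # SSE payload lines are prefixed with "data:"
        if line.startswith("data:"):
            print(line[5:].strip())
```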
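To tell the stream events apart (status, chunk, usage, result, error), a client can group lines into events instead of printing every `data:` line. The sketch below follows the standard SSE framing, in which an optional `event:` field names the event and a blank line ends it. Whether these endpoints label events with an `event:` field is an assumption; `parse_sse` is a hypothetical helper.

```python
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (event, data) pairs from an SSE line stream; blank lines end an event."""
    event, data = "message", []  # "message" is the SSE default event type
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # SSE strips a single leading space after the colon.
            data.append(value[1:] if value.startswith(" ") else value)
    if data:  # flush a final event that was not followed by a blank line
        yield event, "\n".join(data)
```

With httpx, this can be applied as `for event, data in parse_sse(response.iter_lines()): ...`, for example appending `chunk` data to the output and stopping on `result` or `error`.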
Feature flag
Streaming endpoints require the playground feature to be enabled on the workspace.
Using the Playground¶
All Playground operations are available through the web console.
Navigate to Playground. Select an agent from the dropdown, type a message, and click Run. The results panel shows the agent's output, token usage, latency, and tool call details.
Switch to Compare mode. Configure model, prompt, or temperature overrides for each variant. Enter a prompt and click Compare. Results appear side by side with delta metrics.
Switch to Evaluate mode. Select an agent and a dataset, choose scorers and quality gates, then click Evaluate. The evaluation runs asynchronously, and results appear when it completes.