AI Gateway¶
The AI Gateway provides a unified, OpenAI-compatible interface for calling large language models (LLMs) from any provider. All AI model calls in the platform -- whether from agents, workflows, hosted services, or user-facing UIs -- route through the gateway.
What the Gateway Provides¶
| Capability | Description |
|---|---|
| Unified API | Single OpenAI-compatible endpoint for all providers |
| Provider abstraction | Switch models without changing application code |
| Authentication | Automatic credential injection for configured providers |
| Rate limiting | Per-organization and per-model rate controls |
| Cost tracking | Automatic token metering and credit consumption |
| Trace logging | Request/response capture for debugging and compliance |
| Fallback routing | Automatic failover between providers |
| Streaming | SSE streaming for chat completions |
Supported Models and Providers¶
The gateway supports models from multiple providers through a LiteLLM-based routing layer. Model names follow the provider/model convention:
| Provider | Example Models |
|---|---|
| OpenAI | openai/gpt-4o, openai/gpt-4o-mini, openai/o1 |
| Anthropic | anthropic/claude-sonnet-4-20250514, anthropic/claude-haiku-3.5 |
| Google | google/gemini-2.0-flash, google/gemini-2.5-pro |
| Mistral | mistral/mistral-large-latest |
| Cohere | cohere/command-r-plus |
The full list of available models depends on your organization's configuration and which providers have been enabled.
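As a small illustration of the provider/model convention, the sketch below splits a gateway model name into its two parts. The helper name `split_model_name` is hypothetical and not part of the SDK; it assumes only that the provider is everything before the first slash.

```python
def split_model_name(name: str) -> tuple[str, str]:
    """Split 'provider/model' into (provider, model).

    Hypothetical helper for illustration; not part of flow_sdk.
    """
    provider, sep, model = name.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider/model', got {name!r}")
    return provider, model


print(split_model_name("openai/gpt-4o"))  # ('openai', 'gpt-4o')
```

Validating the name on the client side like this can catch a missing provider prefix before a request is sent, although the gateway remains the authority on which models exist.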
Making Requests¶
Chat Completions¶
The gateway exposes an OpenAI-compatible chat completions endpoint:
from flow_sdk import GatewayClient

# Run inside an async function or an asyncio-aware REPL.
async with GatewayClient() as gateway:
    response = await gateway.chat_completion(
        model="openai/gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain quantum computing in one paragraph."},
        ],
        temperature=0.7,
        max_tokens=200,
    )
    print(response["choices"][0]["message"]["content"])
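The same request over plain HTTP with httpx:
import httpx

# api_url and token hold your platform base URL and access token.
resp = httpx.post(
    f"{api_url}/api/v1/gateway/v1/chat/completions",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "model": "openai/gpt-4o",
        "messages": [
            {"role": "user", "content": "Explain quantum computing in one paragraph."}
        ],
        "temperature": 0.7,
        "max_tokens": 200,
    },
)
resp.raise_for_status()  # raise on 4xx/5xx before reading the body
print(resp.json()["choices"][0]["message"]["content"])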
Streaming¶
Stream responses token by token using server-sent events (SSE):
from flow_sdk import GatewayClient

async with GatewayClient() as gateway:
    async for chunk in gateway.stream_chat_completion(
        model="openai/gpt-4o",
        messages=[{"role": "user", "content": "Write a short story."}],
    ):
        content = chunk["choices"][0]["delta"].get("content", "")
        print(content, end="", flush=True)
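For callers that consume the stream over raw HTTP rather than through the SDK, the sketch below shows how OpenAI-style SSE lines can be decoded. It assumes the gateway follows the OpenAI streaming wire format (`data: <json>` lines terminated by `data: [DONE]`), which the description "OpenAI-compatible" suggests but this page does not spell out. `iter_content_deltas` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def iter_content_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from OpenAI-style SSE lines (assumed format)."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        content = choices[0].get("delta", {}).get("content")
        if content:
            yield content


# Hand-written sample lines standing in for a real response body.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Once upon"}}]}',
    'data: {"choices": [{"delta": {"content": " a time"}}]}',
    "data: [DONE]",
]
print("".join(iter_content_deltas(sample)))  # Once upon a time
```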
Embeddings¶
Generate vector embeddings for text:
from flow_sdk import GatewayClient

async with GatewayClient() as gateway:
    response = await gateway.embeddings(
        model="openai/text-embedding-3-small",
        input=["Manifest Platform documentation", "AI Gateway overview"],
    )
    # One embedding vector per input string, in the same order.
    vectors = [item["embedding"] for item in response["data"]]
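A common next step is comparing the returned vectors. The sketch below computes cosine similarity in plain Python. The short vectors are made-up stand-ins for real embeddings, and `cosine_similarity` is an illustrative helper, not an SDK function.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors; real embeddings have hundreds or thousands of dimensions.
vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(cosine_similarity(vectors[0], vectors[1]))  # 0.0
```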
Model Routing and Fallback¶
Default Model¶
Each organization and workspace can configure a default model. When a request does not specify a model, the default is used.
Fallback Chain¶
The gateway supports automatic fallback when a provider is unavailable. If the primary model returns an error (rate limit, outage, etc.), the request is retried against configured fallback models:
Primary: openai/gpt-4o
└── Fallback 1: anthropic/claude-sonnet-4-20250514
└── Fallback 2: google/gemini-2.0-flash
Fallback routing is transparent to the caller -- the response includes a header indicating which model actually served the request.
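Fallback happens inside the gateway, so callers do not need to implement it. Purely to illustrate the logic, the sketch below walks a chain of models in order and returns the first successful result. `ProviderUnavailable`, `call_with_fallback`, and the `call` callback are hypothetical names invented for this example and say nothing about how the gateway is actually implemented.

```python
from typing import Callable, Optional, Sequence


class ProviderUnavailable(Exception):
    """Hypothetical error standing in for a rate limit or outage."""


def call_with_fallback(
    models: Sequence[str],
    call: Callable[[str], str],
) -> tuple[str, str]:
    """Try each model in order; return (model_that_served, result)."""
    last_error: Optional[Exception] = None
    for model in models:
        try:
            return model, call(model)
        except ProviderUnavailable as exc:
            last_error = exc  # remember the failure, move to the next model
    raise RuntimeError("all models in the chain failed") from last_error


def fake_call(model: str) -> str:
    # Simulate an outage at the primary provider only.
    if model.startswith("openai/"):
        raise ProviderUnavailable("simulated outage")
    return f"answer from {model}"


chain = [
    "openai/gpt-4o",
    "anthropic/claude-sonnet-4-20250514",
    "google/gemini-2.0-flash",
]
served_by, _ = call_with_fallback(chain, fake_call)
print(served_by)  # anthropic/claude-sonnet-4-20250514
```

Returning the serving model alongside the result mirrors the idea of the response header that reports which model handled the request.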
Routing Hierarchy¶
Model configuration is resolved in the following order, from highest to lowest precedence. The first level that specifies a model is used:
- Request-level -- model specified in the API call
- Component-level -- model configured on the agent or workflow
- Workspace-level -- workspace default model
- Organization-level -- org default model
- Platform-level -- platform default
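The resolution order above amounts to "first configured value wins". A minimal sketch, assuming each level either supplies a model name or is left unset (`None`). The function name, parameter names, and the placeholder platform default are illustrative and do not correspond to SDK or configuration keys.

```python
from typing import Optional


def resolve_model(
    request: Optional[str] = None,
    component: Optional[str] = None,
    workspace: Optional[str] = None,
    organization: Optional[str] = None,
    platform: str = "openai/gpt-4o-mini",  # placeholder platform default
) -> str:
    """Return the model from the most specific level that sets one."""
    for candidate in (request, component, workspace, organization):
        if candidate:
            return candidate
    return platform


print(resolve_model(workspace="anthropic/claude-sonnet-4-20250514"))
# anthropic/claude-sonnet-4-20250514
```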
Provider Configuration¶
Platform-Hosted Models¶
When enabled, organizations can use platform-hosted model inference without managing their own provider API keys. Platform-hosted models are configured under Admin > AI Gateway.
Self-Managed LLM Proxy¶
Organizations can point to their own LiteLLM proxy for full control over model routing and API keys. Configure your proxy URL and API key under Admin > AI Gateway > LLM Settings.
Workspace-level overrides: When the organization allows it, individual workspaces can override the LLM proxy settings. This is useful for teams that need access to different model providers.
Bring Your Own Keys (BYOK)¶
Organizations can configure their own provider API keys. Enable BYOK under Admin > Settings > Security. When enabled, individual users can also bring their own keys if allow_user_key_override is turned on.
Usage Tracking and Cost Management¶
Every request through the AI Gateway is metered:
- Input tokens -- tokens in the prompt/messages
- Output tokens -- tokens in the completion
- Model -- which model served the request
- Cost -- credit cost computed from the active rate card
Usage data flows into the billing system for real-time credit consumption tracking.
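To make the cost line concrete, the sketch below computes a credit cost from token counts and a per-model rate. The rate card structure (`Rate`, `RATE_CARD`) and every number in it are invented for illustration. This page does not describe the real rate card format or pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    """Credits charged per 1,000 tokens (hypothetical rate card entry)."""

    input_per_1k: float
    output_per_1k: float


# Invented numbers for illustration only; real rates come from the active rate card.
RATE_CARD = {"openai/gpt-4o": Rate(input_per_1k=2.0, output_per_1k=8.0)}


def credit_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Input and output tokens are priced separately and summed."""
    rate = RATE_CARD[model]
    return (
        input_tokens / 1000 * rate.input_per_1k
        + output_tokens / 1000 * rate.output_per_1k
    )


print(credit_cost("openai/gpt-4o", input_tokens=1500, output_tokens=500))  # 7.0
```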
Gateway Traces¶
When tracing is enabled on the organization, the gateway logs request and response metadata for debugging and compliance. View gateway request traces under Admin > AI Gateway > Traces.
Trace records include:
| Field | Description |
|---|---|
| Model | Which model was called |
| Input/output tokens | Token counts |
| Latency | End-to-end response time |
| Status | Success or error code |
| Cost | Credit cost of the call |
| User | Who made the request |
Trace Configuration¶
Control tracing at the organization level:
| Mode | Description |
|---|---|
| all | Log every request |
| errors_only | Log only failed requests |
| sampled | Log a percentage of requests |
| none | No trace logging |
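The four modes can be read as a simple decision function. The sketch below is one possible interpretation, not the gateway's implementation: `should_trace` is a hypothetical name, and the `sample_rate` parameter is an assumption, since this page does not say how the sampling percentage is configured.

```python
import random
from typing import Callable


def should_trace(
    mode: str,
    is_error: bool,
    sample_rate: float = 0.1,  # assumed fraction for "sampled" mode
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide whether a request should be logged under the given mode."""
    if mode == "all":
        return True
    if mode == "errors_only":
        return is_error
    if mode == "sampled":
        return rng() < sample_rate
    if mode == "none":
        return False
    raise ValueError(f"unknown trace mode: {mode!r}")


print(should_trace("errors_only", is_error=True))   # True
print(should_trace("errors_only", is_error=False))  # False
```

Passing the random source in as `rng` keeps the sampled branch deterministic under test.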
Rate Limiting¶
The gateway enforces rate limits at multiple levels:
- Organization-level -- total requests per minute across all users
- User-level -- per-user request limits
- Model-level -- per-model limits (some providers have stricter quotas)
When a rate limit is hit, the gateway returns 429 Too Many Requests with a Retry-After header.
from flow_sdk import GatewayClient, GatewayRateLimitError

async with GatewayClient() as gateway:
    try:
        response = await gateway.chat_completion(
            model="openai/gpt-4o",
            messages=[{"role": "user", "content": "Hello"}],
        )
    except GatewayRateLimitError:
        print("Rate limited -- retry after backoff")
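One way to act on the Retry-After header is to wait for the advertised delay when it is present and fall back to exponential backoff otherwise. The sketch below is synchronous and framework-agnostic. `RateLimited` is a stand-in exception defined here, because this page does not say whether `GatewayRateLimitError` exposes the header value, and `retry_with_backoff` is a hypothetical helper.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Stand-in for a 429 response; retry_after mirrors the Retry-After header."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("rate limited")
        self.retry_after = retry_after


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimited up to max_attempts times in total."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            # Prefer the server's hint; otherwise double the delay each attempt.
            if exc.retry_after is not None:
                delay = exc.retry_after
            else:
                delay = base_delay * 2**attempt
            sleep(delay)
    raise RuntimeError("max_attempts must be at least 1")
```

In async code the same structure applies with `await asyncio.sleep(delay)` in place of the blocking sleep.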
Error Handling¶
The gateway maps provider errors to consistent error responses:
| HTTP Status | Error | Description |
|---|---|---|
| 401 | GatewayAuthError | Invalid or missing authentication |
| 403 | GatewayAuthError | Insufficient permissions |
| 404 | GatewayModelNotFoundError | Requested model not available |
| 429 | GatewayRateLimitError | Rate limit exceeded |
| 502 | GatewayProviderError | Upstream provider failure |
The GatewayClient SDK raises typed exceptions for each error class, making it straightforward to handle failures:
from flow_sdk import (
    GatewayClient,
    GatewayAuthError,
    GatewayModelNotFoundError,
    GatewayProviderError,
    GatewayRateLimitError,
)

async with GatewayClient() as gateway:
    try:
        response = await gateway.chat_completion(model="openai/gpt-4o", messages=[...])
    except GatewayAuthError:
        # Re-authenticate
        pass
    except GatewayModelNotFoundError:
        # Fall back to a different model
        pass
    except GatewayRateLimitError:
        # Backoff and retry
        pass
    except GatewayProviderError:
        # Provider outage -- try alternative
        pass
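For callers using raw HTTP instead of the SDK, the status table above can be turned into a small lookup. The sketch below maps status codes to the error names from the table as plain strings. `classify_status` is a hypothetical helper, and returning `None` for unlisted codes is an assumption.

```python
from typing import Optional

# Status codes and error names copied from the error table.
ERROR_BY_STATUS = {
    401: "GatewayAuthError",
    403: "GatewayAuthError",
    404: "GatewayModelNotFoundError",
    429: "GatewayRateLimitError",
    502: "GatewayProviderError",
}


def classify_status(status: int) -> Optional[str]:
    """Return the documented error name for a status, or None if unlisted."""
    return ERROR_BY_STATUS.get(status)


print(classify_status(429))  # GatewayRateLimitError
print(classify_status(200))  # None
```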