AI Gateway

The AI Gateway provides a unified, OpenAI-compatible interface for calling large language models (LLMs) from any provider. All AI model calls in the platform -- whether from agents, workflows, hosted services, or user-facing UIs -- route through the gateway.

What the Gateway Provides

Capability	Description
Unified API	Single OpenAI-compatible endpoint for all providers
Provider abstraction	Switch models without changing application code
Authentication	Automatic credential injection for configured providers
Rate limiting	Per-organization and per-model rate controls
Cost tracking	Automatic token metering and credit consumption
Trace logging	Request/response capture for debugging and compliance
Fallback routing	Automatic failover between providers
Streaming	SSE streaming for chat completions

Supported Models and Providers

The gateway supports models from multiple providers through a LiteLLM-based routing layer. Model names follow the `provider/model` convention:

Provider	Example Models
OpenAI	`openai/gpt-4o`, `openai/gpt-4o-mini`, `openai/o1`
Anthropic	`anthropic/claude-sonnet-4-20250514`, `anthropic/claude-haiku-3.5`
Google	`google/gemini-2.0-flash`, `google/gemini-2.5-pro`
Mistral	`mistral/mistral-large-latest`
Cohere	`cohere/command-r-plus`

The full list of available models depends on your organization's configuration and which providers have been enabled.
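
As a small illustration of the `provider/model` convention, the sketch below splits a model name into its two parts. The helper `split_model_name` and the `ModelRef` type are hypothetical names introduced here for illustration; they are not part of the SDK.

```python
from typing import NamedTuple


class ModelRef(NamedTuple):
    """Hypothetical container for the two halves of a model name."""
    provider: str
    model: str


def split_model_name(name: str) -> ModelRef:
    """Split 'provider/model' at the first slash only."""
    provider, sep, model = name.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider/model', got {name!r}")
    return ModelRef(provider, model)


ref = split_model_name("anthropic/claude-sonnet-4-20250514")
print(ref.provider, ref.model)
```

Splitting at the first slash only keeps any later slashes inside the model part, which is a design choice of this sketch rather than documented gateway behavior.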

Making Requests

Chat Completions

The gateway exposes an OpenAI-compatible chat completions endpoint:

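The same request is shown three ways, in this order: Python with the SDK's GatewayClient, Python with httpx, and curl. The GatewayClient snippets on this page use `async with` and `await`, so they must run inside a coroutine (for example, one started with `asyncio.run`).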

from flow_sdk import GatewayClient

async with GatewayClient() as gateway:
    response = await gateway.chat_completion(
        model="openai/gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain quantum computing in one paragraph."},
        ],
        temperature=0.7,
        max_tokens=200,
    )
    print(response["choices"][0]["message"]["content"])

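import os

import httpx

# Base URL and bearer token come from the environment, mirroring the
# $API_URL and $TOKEN variables used in the curl example.
api_url = os.environ["API_URL"]
token = os.environ["TOKEN"]

resp = httpx.post(
    f"{api_url}/api/v1/gateway/v1/chat/completions",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "model": "openai/gpt-4o",
        "messages": [
            {"role": "user", "content": "Explain quantum computing in one paragraph."}
        ],
        "temperature": 0.7,
        "max_tokens": 200,
    },
    timeout=60.0,  # completions can take longer than httpx's 5-second default
)
resp.raise_for_status()  # fail on 4xx/5xx before reading the body
print(resp.json()["choices"][0]["message"]["content"])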

curl -X POST \
  "$API_URL/api/v1/gateway/v1/chat/completions" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7
  }'

Streaming

Stream responses token by token using server-sent events (SSE):

from flow_sdk import GatewayClient

async with GatewayClient() as gateway:
    async for chunk in gateway.stream_chat_completion(
        model="openai/gpt-4o",
        messages=[{"role": "user", "content": "Write a short story."}],
    ):
        # Role-only or final chunks may carry no text, so fall back to "".
        content = chunk["choices"][0]["delta"].get("content") or ""
        print(content, end="", flush=True)
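
For callers that read the stream without the SDK, the sketch below shows how SSE lines could be turned into chunks. It assumes the gateway follows the OpenAI wire format, where each event is a `data: {json}` line and the stream ends with `data: [DONE]`; this page does not confirm that detail. The function name `iter_sse_chunks` is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_sse_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON chunks from SSE lines until the [DONE] sentinel."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed OpenAI-style end-of-stream marker
            return
        yield json.loads(payload)


# Simulated stream; a real client would iterate over response lines instead.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "",
    "data: [DONE]",
]
text = "".join(
    chunk["choices"][0]["delta"].get("content") or ""
    for chunk in iter_sse_chunks(sample)
)
print(text)
```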

Embeddings

Generate vector embeddings for text:

from flow_sdk import GatewayClient

async with GatewayClient() as gateway:
    response = await gateway.embeddings(
        model="openai/text-embedding-3-small",
        input=["Manifest Platform documentation", "AI Gateway overview"],
    )
    # Collect one embedding vector per entry in the response data.
    vectors = [item["embedding"] for item in response["data"]]
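
A common next step with embedding vectors is comparing them. The sketch below computes cosine similarity in plain Python; the short vectors are made-up stand-ins for real embeddings, and `cosine_similarity` is a helper defined here, not an SDK function.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy three-dimensional vectors standing in for real embeddings.
example_vectors = [[1.0, 0.0, 1.0], [0.5, 0.5, 0.5]]
score = cosine_similarity(example_vectors[0], example_vectors[1])
print(round(score, 3))
```

Values near 1 indicate vectors pointing in nearly the same direction, and values near 0 indicate unrelated directions.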

Model Routing and Fallback

Default Model

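Each organization and workspace can configure a default model. When a request does not specify a model, the gateway uses the most specific default that applies, following the resolution order described under Routing Hierarchy.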

Fallback Chain

The gateway supports automatic fallback when a provider is unavailable. If the primary model returns an error (rate limit, outage, etc.), the request is retried against configured fallback models:

Primary: openai/gpt-4o
  └── Fallback 1: anthropic/claude-sonnet-4-20250514
       └── Fallback 2: google/gemini-2.0-flash

Fallback routing is transparent to the caller -- the response includes a header indicating which model actually served the request.
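
The fallback behavior described above can be pictured as a loop over the chain. This is a conceptual sketch only: the gateway performs fallback server-side, and `ProviderUnavailable`, `call_with_fallback`, and `fake_call` are hypothetical names used for illustration.

```python
from typing import Callable, Optional, Sequence


class ProviderUnavailable(Exception):
    """Hypothetical stand-in for a rate limit or outage at one provider."""


def call_with_fallback(
    models: Sequence[str],
    call_model: Callable[[str], str],
) -> tuple:
    """Try each model in order; return (served_by, result) for the first success."""
    last_error: Optional[Exception] = None
    for model in models:
        try:
            return model, call_model(model)
        except ProviderUnavailable as exc:
            last_error = exc  # remember the failure and move down the chain
    raise RuntimeError("all models in the fallback chain failed") from last_error


def fake_call(model: str) -> str:
    """Simulated provider call in which the primary model is down."""
    if model.startswith("openai/"):
        raise ProviderUnavailable("simulated outage")
    return f"response from {model}"


chain = [
    "openai/gpt-4o",
    "anthropic/claude-sonnet-4-20250514",
    "google/gemini-2.0-flash",
]
served_by, result = call_with_fallback(chain, fake_call)
print(served_by)
```

Returning the serving model alongside the result mirrors the idea that the caller can see which model actually handled the request.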

Routing Hierarchy

Model configuration is resolved from the most specific level to the least specific. The first level that specifies a model is used:

1. Request-level -- model specified in the API call
2. Component-level -- model configured on the agent or workflow
3. Workspace-level -- workspace default model
4. Organization-level -- org default model
5. Platform-level -- platform default
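
The resolution order above amounts to a first-match lookup. The sketch below shows one way to express it; the function `resolve_model`, its parameter names, and the platform default value are assumptions made for illustration.

```python
from typing import Optional


def resolve_model(
    request: Optional[str] = None,
    component: Optional[str] = None,
    workspace: Optional[str] = None,
    organization: Optional[str] = None,
    platform: str = "openai/gpt-4o-mini",  # hypothetical platform default
) -> str:
    """Return the first configured model, checking the most specific level first."""
    for candidate in (request, component, workspace, organization):
        if candidate:
            return candidate
    return platform


# No model on the request or component, so the workspace default is used.
print(resolve_model(workspace="anthropic/claude-sonnet-4-20250514",
                    organization="openai/gpt-4o"))
```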

Provider Configuration

Platform-Hosted Models

When enabled, organizations can use platform-hosted model inference without managing their own provider API keys. Platform-hosted models are configured under Admin > AI Gateway.

Self-Managed LLM Proxy

Organizations can point to their own LiteLLM proxy for full control over model routing and API keys. Configure your proxy URL and API key under Admin > AI Gateway > LLM Settings.

Workspace-level overrides

When the organization allows it, individual workspaces can override the LLM proxy settings. This is useful for teams that need access to different model providers.

Bring Your Own Keys (BYOK)

Organizations can configure their own provider API keys. Enable BYOK under Admin > Settings > Security. With BYOK enabled, individual users can also supply their own keys, provided `allow_user_key_override` is turned on.

Usage Tracking and Cost Management

Every request through the AI Gateway is metered:

Input tokens -- tokens in the prompt/messages
Output tokens -- tokens in the completion
Model -- which model served the request
Cost -- credit cost computed from the active rate card

Usage data flows into the billing system for real-time credit consumption tracking.
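
To make the metering concrete, the sketch below computes a credit cost from token counts. The rate card structure, the per-1,000-token pricing unit, and the numbers are all invented for illustration; the platform's actual rate card format is not described on this page.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Mapping


@dataclass(frozen=True)
class Rate:
    """Hypothetical rate: credits charged per 1,000 tokens."""
    input_per_1k: Decimal
    output_per_1k: Decimal


# Made-up numbers, not real pricing.
RATE_CARD = {"openai/gpt-4o": Rate(Decimal("2.5"), Decimal("10"))}


def credit_cost(
    model: str,
    input_tokens: int,
    output_tokens: int,
    rate_card: Mapping[str, Rate] = RATE_CARD,
) -> Decimal:
    """Combine input and output token counts into a single credit cost."""
    rate = rate_card[model]
    total = rate.input_per_1k * input_tokens + rate.output_per_1k * output_tokens
    return total / 1000


cost = credit_cost("openai/gpt-4o", input_tokens=1200, output_tokens=300)
print(cost)
```

Decimal is used here to avoid floating-point rounding in billing arithmetic, which is a choice of this sketch.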

Gateway Traces

When tracing is enabled on the organization, the gateway logs request and response metadata for debugging and compliance. View gateway request traces under Admin > AI Gateway > Traces.

Trace records include:

Field	Description
Model	Which model was called
Input/output tokens	Token counts
Latency	End-to-end response time
Status	Success or error code
Cost	Credit cost of the call
User	Who made the request

Trace Configuration

Control tracing at the organization level:

Mode	Description
`all`	Log every request
`errors_only`	Log only failed requests
`sampled`	Log a percentage of requests
`none`	No trace logging
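
The four modes above can be read as a simple decision function. In the sketch below, `should_trace` and its `sample_rate` parameter are hypothetical; this page does not say how the sampling percentage is configured.

```python
import random
from typing import Callable


def should_trace(
    mode: str,
    is_error: bool,
    sample_rate: float = 0.1,  # hypothetical: fraction of requests to log
    rand: Callable[[], float] = random.random,
) -> bool:
    """Decide whether a request is logged under the given trace mode."""
    if mode == "all":
        return True
    if mode == "errors_only":
        return is_error
    if mode == "sampled":
        return rand() < sample_rate
    if mode == "none":
        return False
    raise ValueError(f"unknown trace mode: {mode!r}")


print(should_trace("errors_only", is_error=True))
print(should_trace("errors_only", is_error=False))
```

Passing the random source as a parameter makes the sampled mode deterministic in tests.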

Rate Limiting

The gateway enforces rate limits at multiple levels:

Organization-level -- total requests per minute across all users
User-level -- per-user request limits
Model-level -- per-model limits (some providers have stricter quotas)

When a rate limit is hit, the gateway returns `429 Too Many Requests` with a `Retry-After` header.

from flow_sdk import GatewayClient, GatewayRateLimitError

async with GatewayClient() as gateway:
    try:
        response = await gateway.chat_completion(
            model="openai/gpt-4o",
            messages=[{"role": "user", "content": "Hello"}],
        )
    except GatewayRateLimitError:
        print("Rate limited -- retry after backoff")
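
When calling the endpoint directly, a client can use the `Retry-After` value to decide how long to wait. The sketch below handles both forms the HTTP specification allows for that header (a number of seconds or an HTTP date) and falls back to exponential backoff otherwise. The function `retry_delay` and its defaults are assumptions; this page does not say whether the SDK exception exposes the header.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_delay(
    retry_after: Optional[str],
    attempt: int,
    now: Optional[datetime] = None,
    base: float = 1.0,
    cap: float = 60.0,
) -> float:
    """Seconds to wait before the next attempt, never more than `cap`."""
    if retry_after:
        value = retry_after.strip()
        if value.isdigit():  # delay-seconds form, e.g. "7"
            return min(float(value), cap)
        try:
            when = parsedate_to_datetime(value)  # HTTP-date form
        except (TypeError, ValueError):
            when = None  # unparseable header: use backoff instead
        if when is not None:
            if when.tzinfo is None:
                when = when.replace(tzinfo=timezone.utc)
            current = now or datetime.now(timezone.utc)
            return min(max((when - current).total_seconds(), 0.0), cap)
    # No usable header: exponential backoff (1s, 2s, 4s, ...).
    return min(base * 2 ** attempt, cap)


print(retry_delay("7", attempt=0))
print(retry_delay(None, attempt=3))
```

In production code it is common to add random jitter to the computed delay so that many clients do not retry at the same instant.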

Error Handling

The gateway maps provider errors to consistent error responses:

HTTP Status	Error	Description
`401`	`GatewayAuthError`	Invalid or missing authentication
`403`	`GatewayAuthError`	Insufficient permissions
`404`	`GatewayModelNotFoundError`	Requested model not available
`429`	`GatewayRateLimitError`	Rate limit exceeded
`502`	`GatewayProviderError`	Upstream provider failure
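
For clients that call the HTTP endpoint without the SDK, the table above can be reproduced as a lookup. The classes below are local stand-ins that reuse the names from the table so the example is self-contained; the common `GatewayError` base class and the function `raise_for_gateway_status` are hypothetical, and the real SDK hierarchy may differ.

```python
from typing import Dict, Type


class GatewayError(Exception):
    """Hypothetical common base class for the stand-ins below."""


class GatewayAuthError(GatewayError):
    """Stand-in for 401 and 403 responses."""


class GatewayModelNotFoundError(GatewayError):
    """Stand-in for 404 responses."""


class GatewayRateLimitError(GatewayError):
    """Stand-in for 429 responses."""


class GatewayProviderError(GatewayError):
    """Stand-in for 502 responses."""


STATUS_TO_ERROR: Dict[int, Type[GatewayError]] = {
    401: GatewayAuthError,
    403: GatewayAuthError,
    404: GatewayModelNotFoundError,
    429: GatewayRateLimitError,
    502: GatewayProviderError,
}


def raise_for_gateway_status(status: int, message: str = "") -> None:
    """Raise the mapped exception for an error status; do nothing below 400."""
    if status < 400:
        return
    # Unlisted error statuses fall back to the generic base class.
    error_cls = STATUS_TO_ERROR.get(status, GatewayError)
    raise error_cls(f"{status}: {message}")


try:
    raise_for_gateway_status(429, "rate limit exceeded")
except GatewayRateLimitError as exc:
    print(type(exc).__name__, exc)
```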

The GatewayClient SDK raises typed exceptions for each error class, making it straightforward to handle failures:

from flow_sdk import (
    GatewayClient,
    GatewayAuthError,
    GatewayModelNotFoundError,
    GatewayProviderError,
    GatewayRateLimitError,
)

async with GatewayClient() as gateway:
    try:
        response = await gateway.chat_completion(model="openai/gpt-4o", messages=[...])
    except GatewayAuthError:
        # Re-authenticate
        pass
    except GatewayModelNotFoundError:
        # Fall back to a different model
        pass
    except GatewayRateLimitError:
        # Backoff and retry
        pass
    except GatewayProviderError:
        # Provider outage -- try alternative
        pass