AI Gateway

The AI Gateway provides a unified, OpenAI-compatible interface for calling large language models (LLMs) from any provider. All AI model calls in the platform -- whether from agents, workflows, hosted services, or user-facing UIs -- route through the gateway.

What the Gateway Provides

| Capability | Description |
|---|---|
| Unified API | Single OpenAI-compatible endpoint for all providers |
| Provider abstraction | Switch models without changing application code |
| Authentication | Automatic credential injection for configured providers |
| Rate limiting | Per-organization and per-model rate controls |
| Cost tracking | Automatic token metering and credit consumption |
| Trace logging | Request/response capture for debugging and compliance |
| Fallback routing | Automatic failover between providers |
| Streaming | SSE streaming for chat completions |

Supported Models and Providers

The gateway supports models from multiple providers through a LiteLLM-based routing layer. Model names follow the provider/model convention:

| Provider | Example Models |
|---|---|
| OpenAI | openai/gpt-4o, openai/gpt-4o-mini, openai/o1 |
| Anthropic | anthropic/claude-sonnet-4-20250514, anthropic/claude-haiku-3.5 |
| Google | google/gemini-2.0-flash, google/gemini-2.5-pro |
| Mistral | mistral/mistral-large-latest |
| Cohere | cohere/command-r-plus |

The full list of available models depends on your organization's configuration and which providers have been enabled.
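
Client code sometimes needs to split a model name into its provider and model parts, for example to group usage by provider. The sketch below is a minimal illustration of the provider/model convention; `ModelRef` and `parse_model_name` are hypothetical helper names, not part of the SDK.

```python
from typing import NamedTuple


class ModelRef(NamedTuple):
    """Hypothetical container for the two halves of a gateway model name."""
    provider: str
    model: str


def parse_model_name(name: str) -> ModelRef:
    """Split 'provider/model' on the first slash only."""
    provider, sep, model = name.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider/model', got {name!r}")
    return ModelRef(provider, model)


ref = parse_model_name("anthropic/claude-sonnet-4-20250514")
print(ref.provider, ref.model)
```

Splitting on the first slash only keeps any later slashes inside the model part.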

Making Requests

Chat Completions

The gateway exposes an OpenAI-compatible chat completions endpoint:

from flow_sdk import GatewayClient

async with GatewayClient() as gateway:
    response = await gateway.chat_completion(
        model="openai/gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain quantum computing in one paragraph."},
        ],
        temperature=0.7,
        max_tokens=200,
    )
    print(response["choices"][0]["message"]["content"])
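
The same request over plain HTTP with httpx:

import os

import httpx

# API_URL and TOKEN are read from the environment, matching the curl example.
api_url = os.environ["API_URL"]
token = os.environ["TOKEN"]

resp = httpx.post(
    f"{api_url}/api/v1/gateway/v1/chat/completions",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "model": "openai/gpt-4o",
        "messages": [
            {"role": "user", "content": "Explain quantum computing in one paragraph."}
        ],
        "temperature": 0.7,
        "max_tokens": 200,
    },
    timeout=60.0,  # completions can take longer than httpx's 5-second default
)
resp.raise_for_status()  # surface 4xx/5xx responses instead of failing on the lookup below
print(resp.json()["choices"][0]["message"]["content"])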

Or with curl:

curl -X POST \
  "$API_URL/api/v1/gateway/v1/chat/completions" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7
  }'

Streaming

Stream responses token by token using server-sent events (SSE):

from flow_sdk import GatewayClient

async with GatewayClient() as gateway:
    async for chunk in gateway.stream_chat_completion(
        model="openai/gpt-4o",
        messages=[{"role": "user", "content": "Write a short story."}],
    ):
        # Some chunks carry no text (for example, role-only or final chunks).
        content = chunk["choices"][0]["delta"].get("content") or ""
        print(content, end="", flush=True)
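
When calling the endpoint without the SDK, the stream has to be decoded by hand. The sketch below assumes the gateway emits the same wire format as OpenAI's streaming API: `data:` lines carrying JSON chunks, terminated by `data: [DONE]`. `iter_sse_content` is a hypothetical helper, and the sample lines are hand-written rather than captured from the gateway.

```python
import json
from typing import Iterable, Iterator


def iter_sse_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines (assumed format)."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        content = choices[0].get("delta", {}).get("content")
        if content:
            yield content


# Hand-written sample stream for illustration.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Once"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": " upon"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_sse_content(sample)))
```

In a real client the `lines` argument would come from the HTTP library's line iterator over the streaming response body.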

Embeddings

Generate vector embeddings for text:

from flow_sdk import GatewayClient

async with GatewayClient() as gateway:
    response = await gateway.embeddings(
        model="openai/text-embedding-3-small",
        input=["Manifest Platform documentation", "AI Gateway overview"],
    )
    vectors = [item["embedding"] for item in response["data"]]
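
A common next step is to compare embeddings by cosine similarity. The sketch below is plain Python and independent of the gateway; the three-dimensional vectors are toy values standing in for real embeddings, which are much longer.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors for illustration only.
toy_vectors = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0]]
print(round(cosine_similarity(toy_vectors[0], toy_vectors[1]), 3))
```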

Model Routing and Fallback

Default Model

Each organization and workspace can configure a default model. When a request does not specify a model, the default is used.

Fallback Chain

The gateway supports automatic fallback when a provider is unavailable. If the primary model returns an error (rate limit, outage, etc.), the request is retried against configured fallback models:

Primary: openai/gpt-4o
  └── Fallback 1: anthropic/claude-sonnet-4-20250514
       └── Fallback 2: google/gemini-2.0-flash

Fallback routing is transparent to the caller -- the response includes a header indicating which model actually served the request.
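
To make the behavior concrete, the following sketch models a fallback chain in plain Python. It illustrates the idea only and is not the gateway's implementation: `ProviderUnavailable`, `complete_with_fallback`, and `fake_call` are hypothetical names, and the outage is simulated.

```python
from typing import Callable, Sequence, Tuple


class ProviderUnavailable(Exception):
    """Hypothetical stand-in for a rate limit or outage at one provider."""


def complete_with_fallback(
    chain: Sequence[str],
    call_model: Callable[[str], str],
) -> Tuple[str, str]:
    """Try each model in order; return (model_that_served, response)."""
    last_error = None
    for model in chain:
        try:
            return model, call_model(model)
        except ProviderUnavailable as exc:
            last_error = exc  # remember the failure and try the next model
    raise RuntimeError("all models in the fallback chain failed") from last_error


def fake_call(model: str) -> str:
    # Simulate an outage at the primary provider.
    if model.startswith("openai/"):
        raise ProviderUnavailable("simulated outage")
    return f"response from {model}"


chain = [
    "openai/gpt-4o",
    "anthropic/claude-sonnet-4-20250514",
    "google/gemini-2.0-flash",
]
served_by, text = complete_with_fallback(chain, fake_call)
print(served_by)
```

Returning the serving model alongside the response mirrors the gateway's practice of reporting which model actually handled the request.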

Routing Hierarchy

The gateway resolves which model to use by checking the following levels in order; the first level that specifies a model wins:

  1. Request-level -- model specified in the API call
  2. Component-level -- model configured on the agent or workflow
  3. Workspace-level -- workspace default model
  4. Organization-level -- org default model
  5. Platform-level -- platform default
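
The resolution order above can be sketched as a small function. This assumes the most specific level wins; `resolve_model` and its parameter names are hypothetical, and the platform default shown is a placeholder rather than the platform's actual default.

```python
from typing import Optional


def resolve_model(
    request: Optional[str] = None,
    component: Optional[str] = None,
    workspace: Optional[str] = None,
    organization: Optional[str] = None,
    platform: str = "openai/gpt-4o-mini",  # placeholder, not the real default
) -> str:
    """Return the first configured model, from most to least specific."""
    for candidate in (request, component, workspace, organization):
        if candidate:
            return candidate
    return platform


# No model on the request or component, so the workspace default applies.
print(resolve_model(workspace="anthropic/claude-sonnet-4-20250514",
                    organization="openai/gpt-4o"))
```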

Provider Configuration

Platform-Hosted Models

When enabled, organizations can use platform-hosted model inference without managing their own provider API keys. Platform-hosted models are configured under Admin > AI Gateway.

Self-Managed LLM Proxy

Organizations can point to their own LiteLLM proxy for full control over model routing and API keys. Configure your proxy URL and API key under Admin > AI Gateway > LLM Settings.

Workspace-Level Overrides

When the organization allows it, individual workspaces can override the LLM proxy settings. This is useful for teams that need access to different model providers.

Bring Your Own Keys (BYOK)

Organizations can configure their own provider API keys. Enable BYOK under Admin > Settings > Security. When BYOK is enabled and allow_user_key_override is turned on, individual users can also supply their own keys.

Usage Tracking and Cost Management

Every request through the AI Gateway is metered:

  • Input tokens -- tokens in the prompt/messages
  • Output tokens -- tokens in the completion
  • Model -- which model served the request
  • Cost -- credit cost computed from the active rate card

Usage data flows into the billing system for real-time credit consumption tracking.
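
As a rough illustration of how a credit cost could be derived from token counts, the sketch below assumes a rate card that prices input and output tokens separately per 1,000 tokens. The rate card structure, the `Rate` and `credit_cost` names, and the rates themselves are invented for this example; the platform's actual rate card may be structured differently.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Rate:
    """Hypothetical rate card entry: credits per 1,000 tokens."""
    input_per_1k: float
    output_per_1k: float


# Made-up rates for illustration only.
RATE_CARD: Dict[str, Rate] = {
    "openai/gpt-4o": Rate(input_per_1k=2.0, output_per_1k=8.0),
}


def credit_cost(model: str, input_tokens: int, output_tokens: int,
                rate_card: Dict[str, Rate] = RATE_CARD) -> float:
    """Compute the credit cost of one call under the assumed rate card."""
    rate = rate_card[model]
    return (input_tokens / 1000) * rate.input_per_1k \
        + (output_tokens / 1000) * rate.output_per_1k


print(credit_cost("openai/gpt-4o", input_tokens=1500, output_tokens=500))
```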

Gateway Traces

When tracing is enabled on the organization, the gateway logs request and response metadata for debugging and compliance. View gateway request traces under Admin > AI Gateway > Traces.

Trace records include:

| Field | Description |
|---|---|
| Model | Which model was called |
| Input/output tokens | Token counts |
| Latency | End-to-end response time |
| Status | Success or error code |
| Cost | Credit cost of the call |
| User | Who made the request |

Trace Configuration

Control tracing at the organization level:

| Mode | Description |
|---|---|
| all | Log every request |
| errors_only | Log only failed requests |
| sampled | Log a percentage of requests |
| none | No trace logging |
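
The four modes amount to a per-request decision, sketched below. This is an illustration, not the gateway's code: `should_trace` is a hypothetical name, and `sample_rate` is an assumed setting, since how the sampling percentage is configured is not specified here.

```python
import random
from typing import Callable


def should_trace(
    mode: str,
    is_error: bool,
    sample_rate: float = 0.1,  # assumed setting for the "sampled" mode
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide whether a single request is logged under the given mode."""
    if mode == "all":
        return True
    if mode == "errors_only":
        return is_error
    if mode == "sampled":
        return rng() < sample_rate
    if mode == "none":
        return False
    raise ValueError(f"unknown trace mode: {mode!r}")


print(should_trace("errors_only", is_error=True))
```

Passing the random source as a parameter keeps the sampled branch deterministic in tests.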

Rate Limiting

The gateway enforces rate limits at multiple levels:

  1. Organization-level -- total requests per minute across all users
  2. User-level -- per-user request limits
  3. Model-level -- per-model limits (some providers have stricter quotas)

When a rate limit is hit, the gateway returns 429 Too Many Requests with a Retry-After header.

from flow_sdk import GatewayClient, GatewayRateLimitError

async with GatewayClient() as gateway:
    try:
        response = await gateway.chat_completion(
            model="openai/gpt-4o",
            messages=[{"role": "user", "content": "Hello"}],
        )
    except GatewayRateLimitError:
        print("Rate limited -- retry after backoff")
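
For callers that handle 429 responses themselves, one reasonable policy is to honor Retry-After when present and fall back to capped exponential backoff otherwise. The sketch below computes the wait time only. `retry_delay` and its defaults are hypothetical, and it assumes the header follows standard HTTP semantics (either a number of seconds or an HTTP date).

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_delay(
    retry_after: Optional[str],
    attempt: int,
    base: float = 1.0,
    cap: float = 60.0,
    now: Optional[datetime] = None,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after:
        value = retry_after.strip()
        if value.isdigit():
            return min(float(value), cap)  # delay-seconds form
        try:
            when = parsedate_to_datetime(value)  # HTTP-date form
        except (TypeError, ValueError):
            when = None
        if when is not None:
            if when.tzinfo is None:
                when = when.replace(tzinfo=timezone.utc)
            now = now or datetime.now(timezone.utc)
            return min(max((when - now).total_seconds(), 0.0), cap)
    # No usable header: exponential backoff, capped.
    return min(base * (2 ** attempt), cap)


print(retry_delay("5", attempt=0))
print(retry_delay(None, attempt=3))
```

Adding random jitter to the fallback delay is a common refinement when many clients retry at once.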

Error Handling

The gateway maps provider errors to consistent error responses:

| HTTP Status | Error | Description |
|---|---|---|
| 401 | GatewayAuthError | Invalid or missing authentication |
| 403 | GatewayAuthError | Insufficient permissions |
| 404 | GatewayModelNotFoundError | Requested model not available |
| 429 | GatewayRateLimitError | Rate limit exceeded |
| 502 | GatewayProviderError | Upstream provider failure |

The GatewayClient SDK raises typed exceptions for each error class, making it straightforward to handle failures:

from flow_sdk import (
    GatewayClient,
    GatewayAuthError,
    GatewayModelNotFoundError,
    GatewayProviderError,
    GatewayRateLimitError,
)

async with GatewayClient() as gateway:
    try:
        response = await gateway.chat_completion(model="openai/gpt-4o", messages=[...])
    except GatewayAuthError:
        # Re-authenticate
        pass
    except GatewayModelNotFoundError:
        # Fall back to a different model
        pass
    except GatewayRateLimitError:
        # Backoff and retry
        pass
    except GatewayProviderError:
        # Provider outage -- try alternative
        pass
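
Callers that use raw HTTP instead of the SDK can reproduce the same mapping from status code to error class. In the sketch below the exception classes are local stand-ins that reuse the names from the table; they are not imported from flow_sdk, and the `GatewayError` base class and `error_for_status` helper are assumptions made for the example.

```python
from typing import Dict, Optional, Type


class GatewayError(Exception):
    """Assumed common base class (local stand-in, not from flow_sdk)."""


class GatewayAuthError(GatewayError): ...
class GatewayModelNotFoundError(GatewayError): ...
class GatewayRateLimitError(GatewayError): ...
class GatewayProviderError(GatewayError): ...


# Status-to-error mapping taken from the table above.
STATUS_TO_ERROR: Dict[int, Type[GatewayError]] = {
    401: GatewayAuthError,
    403: GatewayAuthError,
    404: GatewayModelNotFoundError,
    429: GatewayRateLimitError,
    502: GatewayProviderError,
}


def error_for_status(status: int, message: str = "") -> Optional[GatewayError]:
    """Return an exception instance for an error status, or None on success."""
    if status < 400:
        return None
    cls = STATUS_TO_ERROR.get(status, GatewayError)
    return cls(f"{status}: {message}")


err = error_for_status(429, "rate limit exceeded")
print(type(err).__name__)
```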