Models & Providers

AgentFlow uses LiteLLM as its model gateway, providing unified access to 100+ LLM providers through a single interface. Switch between models per-agent, per-request, or per-user — with full visibility into capabilities, context windows, and pricing.

Multi-provider architecture

All LLM calls go through LiteLLM, which handles:

Provider dispatch — routes to OpenAI, Anthropic, Google, xAI, Azure, AWS Bedrock, and 100+ others
Unified API — same request/response format regardless of provider
Automatic retries — exponential backoff with configurable retry count
Rate limit handling — concurrency semaphore prevents flooding provider rate limits
Semantic caching — optional deduplication of identical requests
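
The retry and rate-limit behavior listed above can be illustrated with a small asyncio sketch. This is not AgentFlow's or LiteLLM's actual implementation: `call_with_retries`, `RateLimitError`, and the delay values are hypothetical names and numbers chosen for illustration. It shows a semaphore capping in-flight calls and a retry loop whose delay doubles on each attempt.

```python
import asyncio
import random


class RateLimitError(Exception):
    """Stand-in for a provider's rate-limit (HTTP 429) error."""


async def call_with_retries(call, *, semaphore, max_retries=3, base_delay=0.5):
    """Run `call` under a concurrency cap, retrying with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            # The semaphore bounds how many provider calls are in flight at once.
            async with semaphore:
                return await call()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the error to the caller
            # The delay doubles on each attempt, plus a small random jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay / 10)
            await asyncio.sleep(delay)


async def main():
    semaphore = asyncio.Semaphore(4)  # at most 4 concurrent provider calls
    attempts = 0

    async def flaky_provider_call():
        # Hypothetical provider call: rate-limited twice, then succeeds.
        nonlocal attempts
        attempts += 1
        if attempts < 3:
            raise RateLimitError("429 Too Many Requests")
        return "ok"

    result = await call_with_retries(
        flaky_provider_call, semaphore=semaphore, base_delay=0.01
    )
    print(result, attempts)


if __name__ == "__main__":
    asyncio.run(main())
```

The semaphore is released before the backoff sleep, so a call that is waiting to retry does not hold a concurrency slot.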

Supported providers

Provider	Models	Highlights
OpenAI	GPT-5, GPT-5 Mini, GPT-4.1, o-series	Reasoning effort controls, vision, structured output
Anthropic	Claude Sonnet 4, Claude Opus 4	Large context windows, vision
Google	Gemini 2.5, Gemini 3 Preview	Multi-modal, long context
xAI	Grok 3, Grok 3 Fast	Up to 2M token context
Azure OpenAI	GPT-4.1, GPT-5 (via Azure)	Enterprise compliance, private endpoints
AWS Bedrock	Claude, Titan, Llama	VPC-native, no data leaves AWS
100+ more	Via LiteLLM	Any provider LiteLLM supports works out of the box

Bring your own API keys

AgentFlow supports tenant-scoped, encrypted BYO (bring-your-own) LLM keys. Admins can save more than one provider key for the same tenant (for example, Anthropic and Google Gemini) and then choose exactly which models from those providers are allowed.

BYO mode applies to the whole tenant. When a tenant is in byo mode:

AgentFlow routes LLM calls only through tenant-supplied provider keys.
The public /api/v1/models catalog returns only models allowed by the tenant policy.
Explicit requests for disallowed models fail with 403 instead of falling back to platform defaults.
Backend defaults are remapped per use case, including agent chat, raw LLM chat, tool calls, sub-agents, title generation, KB enrichment, follow-up questions, autocomplete, query processing, planning, reflection, summaries, vision, embeddings, and reranking.
Platform keys are not used for that tenant. If no allowed/default model can satisfy a provider-locked use case such as embeddings or reranking, the call fails closed.
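
The routing rules above can be summarized as a small resolution function. This is a minimal sketch, not AgentFlow's internal code: `TenantPolicy`, `resolve_model`, the two exception classes, the `"platform"` mode name, and the platform default are all hypothetical. It shows explicit requests for disallowed models being rejected (the 403 case) and unmapped use cases failing closed instead of falling back to platform defaults.

```python
from dataclasses import dataclass, field


class ModelNotAllowed(Exception):
    """In this sketch, corresponds to the HTTP 403 response."""


class NoModelForUseCase(Exception):
    """Raised when no allowed model covers a use case: the call fails closed."""


@dataclass
class TenantPolicy:
    mode: str = "platform"  # "byo" comes from the docs; "platform" is assumed
    allowed_models: set = field(default_factory=set)
    default_models: dict = field(default_factory=dict)  # use case -> model ID


def resolve_model(policy, use_case, requested=None, platform_default="openai/gpt-4.1"):
    """Pick the model for one call under the tenant's policy."""
    if policy.mode != "byo":
        return requested or platform_default
    if requested is not None:
        # Explicit requests never fall back to another model.
        if requested not in policy.allowed_models:
            raise ModelNotAllowed(requested)
        return requested
    model = policy.default_models.get(use_case)
    if model is None or model not in policy.allowed_models:
        # No platform fallback in BYO mode.
        raise NoModelForUseCase(use_case)
    return model


policy = TenantPolicy(
    mode="byo",
    allowed_models={"anthropic/claude-sonnet-4-5-20250929"},
    default_models={"chat": "anthropic/claude-sonnet-4-5-20250929"},
)
print(resolve_model(policy, "chat"))
```

With this policy, `resolve_model(policy, "embedding")` raises `NoModelForUseCase` because no embedding model is mapped.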

Keys are encrypted before storage. API responses never return the secret; they only expose non-sensitive metadata such as key_last_four so admins can identify which key is active.

Admin BYO workflow

Call GET /api/v1/llm/config/model-options to list models and backend use cases that can be mapped.
Save one provider at a time with POST /api/v1/llm/config, passing api_key, allowed_models, and default_models.
Repeat for additional providers, for example one Anthropic key for chat/tool workloads and one Google key for Gemini workloads.
Saving a provider key automatically enables BYO mode for the tenant. You can also switch modes explicitly with PUT /api/v1/llm/config/mode and {"mode": "byo"}.
Verify the effective merged policy with GET /api/v1/llm/config.

Example:

curl -X POST https://api.example.com/api/v1/llm/config \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "anthropic",
    "api_key": "sk-ant-...",
    "allowed_models": [
      "anthropic/claude-haiku-4-5-20251001",
      "anthropic/claude-sonnet-4-5-20250929"
    ],
    "default_models": {
      "chat": "anthropic/claude-sonnet-4-5-20250929",
      "tool": "anthropic/claude-haiku-4-5-20251001",
      "sub_agent": "anthropic/claude-haiku-4-5-20251001",
      "title": "anthropic/claude-haiku-4-5-20251001",
      "enrichment": "anthropic/claude-haiku-4-5-20251001"
    }
  }'

Add Google Gemini alongside it:

curl -X POST https://api.example.com/api/v1/llm/config \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "google",
    "api_key": "AIza...",
    "allowed_models": ["google/gemini-2.5-flash"],
    "default_models": {
      "follow_up_questions": "google/gemini-2.5-flash",
      "autocomplete": "google/gemini-2.5-flash"
    }
  }'

For knowledge bases, include an embedding-capable model in the policy and map embedding. Today that is typically an OpenAI embedding model:

{
  "provider": "openai",
  "api_key": "sk-...",
  "allowed_models": ["openai/text-embedding-3-small"],
  "default_models": {
    "embedding": "openai/text-embedding-3-small"
  }
}
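
When scripting the workflow, it can help to check a provider payload before sending it to POST /api/v1/llm/config. The helper below is a sketch under two assumptions that the API may or may not enforce: model IDs carry the provider as a prefix (as in the examples above), and every default model must also appear in allowed_models. `build_provider_config` is a hypothetical name, not part of the SDK.

```python
import json


def build_provider_config(provider, api_key, allowed_models, default_models):
    """Build the JSON body for one provider, checking it for consistency first."""
    prefix = provider + "/"
    # Assumption: model IDs look like "<provider>/<model>".
    foreign = [m for m in allowed_models if not m.startswith(prefix)]
    if foreign:
        raise ValueError(f"models not owned by {provider}: {foreign}")
    # Assumption: a default model must also be an allowed model.
    not_allowed = {
        use_case: model
        for use_case, model in default_models.items()
        if model not in allowed_models
    }
    if not_allowed:
        raise ValueError(f"default models missing from allowed_models: {not_allowed}")
    return {
        "provider": provider,
        "api_key": api_key,
        "allowed_models": list(allowed_models),
        "default_models": dict(default_models),
    }


body = build_provider_config(
    "google",
    "AIza...",  # placeholder, as in the curl example
    ["google/gemini-2.5-flash"],
    {"autocomplete": "google/gemini-2.5-flash"},
)
print(json.dumps(body, indent=2))
```

Catching a mismatch locally avoids saving a key whose defaults could never be served by the allowed set.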

Model catalog API

List all available models with capabilities and pricing:

GET /api/v1/models

[
  {
    "id": "openai/gpt-4.1",
    "name": "GPT-4.1",
    "provider": "openai",
    "description": "Latest GPT-4.1 model",
    "capabilities": {
      "contextWindow": 1048576,
      "supportsVision": true,
      "supportsStreaming": true,
      "supportsTools": true,
      "supportsReasoning": true
    },
    "supportedReasoningEfforts": ["low", "medium", "high"],
    "costPer1kTokens": { "input": 0.002, "output": 0.008 },
    "maxTokens": 32768,
    "available": true
  }
]
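
A client can use the catalog response to filter models by capability and to estimate request cost. The sketch below works on a list shaped like the response above; `models_with` and `estimate_cost` are hypothetical helper names, and the second catalog entry is invented for illustration only.

```python
CATALOG = [
    {
        "id": "openai/gpt-4.1",
        "capabilities": {"supportsVision": True, "supportsTools": True},
        "costPer1kTokens": {"input": 0.002, "output": 0.008},
        "available": True,
    },
    {
        # Hypothetical entry, not a real model.
        "id": "example/text-only",
        "capabilities": {"supportsVision": False, "supportsTools": True},
        "costPer1kTokens": {"input": 0.001, "output": 0.002},
        "available": True,
    },
]


def models_with(catalog, *capabilities):
    """Return IDs of available models that have every requested capability flag."""
    return [
        m["id"]
        for m in catalog
        if m.get("available") and all(m["capabilities"].get(c) for c in capabilities)
    ]


def estimate_cost(model, input_tokens, output_tokens):
    """Estimate cost from the per-1k-token prices in the catalog entry."""
    price = model["costPer1kTokens"]
    return input_tokens / 1000 * price["input"] + output_tokens / 1000 * price["output"]


print(models_with(CATALOG, "supportsVision", "supportsTools"))
print(round(estimate_cost(CATALOG[0], 12_000, 1_500), 4))
```

The currency of costPer1kTokens is not stated in the response, so treat the estimate as being in whatever unit the catalog uses.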

Model selection

Models can be set at three levels. A per-request override takes highest priority, followed by the per-agent default, then the per-user default:

Per-agent default

import asyncio

from agentflow import AsyncAgentFlow

async def main():
    async with AsyncAgentFlow.from_profile("local") as client:
        # Look up the agent ID by name, then set its default model.
        agent_id = {agent.name: agent.id for agent in await client.agents.list()}["AnalyticsAgent"]
        agent = await client.agents.update(
            agent_id,
            llm_config={"model": "openai/gpt-4.1", "temperature": 0.3},
        )

asyncio.run(main())

Per-user default

Users set a preferred model via the settings API. All requests use this model unless overridden at the agent or request level.

PATCH /api/v1/settings
{ "model": { "selectedModel": "anthropic/claude-sonnet-4" } }

Per-request override

POST /api/v1/agent/{agent_id}/chat
{
  "message": "Analyze this image",
  "conversation_id": "conv_001",
  "message_id": "msg_001",
  "model": "openai/gpt-5.4-mini",
  "stream": true
}
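
The three levels can be read as a simple fallback chain. The function below is an illustrative sketch, not AgentFlow's resolver: `select_model` and the platform default are hypothetical, and the agent-over-user ordering is inferred from the per-user description above.

```python
def select_model(request_model=None, agent_model=None, user_model=None,
                 platform_default="openai/gpt-4.1"):
    """Return the first model that is set, checking the highest priority first."""
    for candidate in (request_model, agent_model, user_model):
        if candidate:
            return candidate
    # Nothing was set at any level: fall back to an assumed platform default.
    return platform_default


# The request override wins over both defaults.
print(select_model(
    request_model="openai/gpt-5.4-mini",
    agent_model="openai/gpt-4.1",
    user_model="anthropic/claude-sonnet-4",
))
# Without an override, the agent default wins over the user default.
print(select_model(agent_model="openai/gpt-4.1", user_model="anthropic/claude-sonnet-4"))
```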

Reasoning effort

For models that support reasoning, control reasoning depth with model-specific allowed values. The /api/v1/models response includes each model’s supportedReasoningEfforts.

Level	Use case
`none`	Skip reasoning entirely
`low`	Fast, simple tasks
`medium`	Balanced (default)
`high`	Complex analysis, multi-step reasoning
`xhigh`	Maximum reasoning depth

OpenAI GPT-5.4 models accept "none", "low", "medium", "high", and "xhigh". Some xAI models accept only "low" and "high". Do not send unsupported values such as "minimal" unless the target model advertises them.

POST /api/v1/agent/{agent_id}/chat
{
  "message": "Analyze the risk factors in this deal and recommend a mitigation strategy",
  "conversation_id": "conv_001",
  "message_id": "msg_002",
  "reasoning_effort": "high",
  "reasoning_summary": "concise",
  "stream": true
}
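
Because allowed values differ per model, a client may want to check reasoning_effort against the model's supportedReasoningEfforts before sending a request. The sketch below is one possible client-side policy, not documented AgentFlow behavior: it keeps a supported value as-is and otherwise picks the nearest supported level, preferring the lower one on ties. `nearest_supported_effort` is a hypothetical helper.

```python
# Levels in increasing depth, taken from the table above.
EFFORT_ORDER = ["none", "low", "medium", "high", "xhigh"]


def nearest_supported_effort(requested, supported):
    """Return `requested` if supported, else the closest supported level."""
    if requested in supported:
        return requested
    known = [e for e in supported if e in EFFORT_ORDER]
    if requested not in EFFORT_ORDER or not known:
        raise ValueError(f"cannot map reasoning effort {requested!r} onto {supported}")
    target = EFFORT_ORDER.index(requested)
    # Sort key: distance from the requested level, then lower level first on ties.
    return min(
        known,
        key=lambda e: (abs(EFFORT_ORDER.index(e) - target), EFFORT_ORDER.index(e)),
    )


# A model that accepts only "low" and "high", as some xAI models do.
print(nearest_supported_effort("medium", ["low", "high"]))
print(nearest_supported_effort("xhigh", ["low", "high"]))
```

Values outside the table, such as "minimal", raise an error here unless the model itself advertises them.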

Reasoning summary

Control whether and how the model’s reasoning is surfaced:

Mode	Behavior
`"auto"`	Model decides whether to include reasoning
`"concise"`	Brief reasoning summary included
`"detailed"`	Full reasoning trace included

The reasoning summary appears in the SSE event metadata, separate from the main response content.

Structured output

Force the model to return valid JSON matching a specific schema:

Via SDK (Pydantic models)

from pydantic import BaseModel

class DealAnalysis(BaseModel):
    risk_level: str
    confidence: float
    key_factors: list[str]
    recommendation: str

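# `agent` is an SDK agent object (for example, the value returned by
# client.agents.update above); run this inside an async function.
result = await agent.run(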
    "Analyze the Acme Corp deal",
    response_model=DealAnalysis,
)
print(result.risk_level)  # Typed access

Via REST API (JSON schema)

POST /api/v1/agent/{agent_id}/chat
{
  "message": "Analyze the Acme Corp deal",
  "conversation_id": "conv_001",
  "message_id": "msg_003",
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "deal_analysis",
      "strict": true,
      "schema": {
        "type": "object",
        "properties": {
          "risk_level": { "type": "string" },
          "confidence": { "type": "number" },
          "key_factors": { "type": "array", "items": { "type": "string" } }
        },
        "required": ["risk_level", "confidence", "key_factors"],
        "additionalProperties": false
      }
    }
  },
  "stream": true
}

Supported response_format types:

{"type": "json_object"} — model returns valid JSON (schema not enforced)
{"type": "json_schema", ...} — model returns JSON matching the exact schema (strict mode)
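
Since json_object mode does not enforce a schema, callers that use it may want to check the returned JSON themselves. The following is a deliberately small, standard-library sketch that checks only required fields, top-level property types, and additionalProperties; it is not a full JSON Schema validator, and `check_against_schema` is a hypothetical helper.

```python
import json

_JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def check_against_schema(raw, schema):
    """Return a list of problems found in a JSON response (empty if none)."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    errors = [
        f"missing required field: {key}"
        for key in schema.get("required", [])
        if key not in data
    ]
    properties = schema.get("properties", {})
    for key, value in data.items():
        if key not in properties:
            if schema.get("additionalProperties") is False:
                errors.append(f"unexpected field: {key}")
            continue
        type_name = properties[key].get("type")
        expected = _JSON_TYPES.get(type_name)
        if expected is None:
            continue  # type not covered by this sketch
        # bool is a subclass of int in Python, so reject it for non-boolean types.
        bool_mismatch = isinstance(value, bool) and type_name != "boolean"
        if bool_mismatch or not isinstance(value, expected):
            errors.append(f"field {key} is not of type {type_name}")
    return errors


schema = {
    "type": "object",
    "properties": {
        "risk_level": {"type": "string"},
        "confidence": {"type": "number"},
        "key_factors": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["risk_level", "confidence", "key_factors"],
    "additionalProperties": False,
}
print(check_against_schema('{"risk_level": "high", "confidence": "0.8"}', schema))
```

With json_schema and strict mode, this kind of check should be unnecessary because the schema is enforced at generation time.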

Vision

Models with vision support automatically handle image attachments:

POST /api/v1/agent/{agent_id}/chat
{
  "message": "What does this chart show?",
  "conversation_id": "conv_001",
  "message_id": "msg_004",
  "attachment_ids": ["file_abc123"],
  "image_detail": "high",
  "stream": true
}

image_detail controls resolution: "low" (faster, cheaper), "high" (full resolution), or "auto" (model decides, default). When images are present and no model override is specified, AgentFlow automatically selects a vision-capable model.
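
The automatic selection described above could work along these lines. This is a sketch under assumptions: the docs do not say which vision-capable model is chosen, so taking the first available one in catalog order is a guess, and `pick_model` is a hypothetical name.

```python
def pick_model(catalog, default_model, has_images, override=None):
    """Choose the model for a request, switching to a vision model if needed."""
    if override:
        return override  # an explicit model override is always respected
    if not has_images:
        return default_model
    by_id = {m["id"]: m for m in catalog}
    default = by_id.get(default_model)
    if default and default["capabilities"].get("supportsVision"):
        return default_model
    # Assumption: use the first available vision-capable model in catalog order.
    for model in catalog:
        if model.get("available") and model["capabilities"].get("supportsVision"):
            return model["id"]
    raise LookupError("no vision-capable model is available")


catalog = [
    # Hypothetical text-only entry for illustration.
    {"id": "example/text-only", "capabilities": {"supportsVision": False}, "available": True},
    {"id": "openai/gpt-4.1", "capabilities": {"supportsVision": True}, "available": True},
]
print(pick_model(catalog, "example/text-only", has_images=True))
```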

Context window management

AgentFlow tracks token usage against each model’s context window:

SSE events include context_window_size and context_usage_percentage in metadata
Per-request metrics: primary_total_tokens, primary_model, primary_context_usage_percentage
Chat-history compaction automatically triggers when approaching context limits
The LLM API itself enforces hard limits — oversized requests return clear errors through the SSE error handler
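
The usage percentage reported in the metadata is presumably total tokens divided by the context window. The sketch below computes it that way and shows a threshold check of the kind that could trigger compaction; the 80% threshold and both function names are assumptions, not documented AgentFlow values.

```python
def context_usage_percentage(total_tokens, context_window_size):
    """Share of the context window in use, as a percentage."""
    if context_window_size <= 0:
        raise ValueError("context_window_size must be positive")
    return round(100 * total_tokens / context_window_size, 2)


def should_compact(total_tokens, context_window_size, threshold_pct=80.0):
    """True once usage reaches the (assumed) compaction threshold."""
    return context_usage_percentage(total_tokens, context_window_size) >= threshold_pct


# Half of a 1,048,576-token window is in use.
print(context_usage_percentage(524_288, 1_048_576))
print(should_compact(900_000, 1_048_576))
```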
