Models & Providers

AgentFlow uses LiteLLM as its model gateway, providing unified access to 100+ LLM providers through a single interface. Switch between models per-agent, per-request, or per-user — with full visibility into capabilities, context windows, and pricing.

Multi-provider architecture

All LLM calls go through LiteLLM, which handles:
  • Provider dispatch — routes to OpenAI, Anthropic, Google, xAI, Azure, AWS Bedrock, and 100+ others
  • Unified API — same request/response format regardless of provider
  • Automatic retries — exponential backoff with configurable retry count
  • Rate limit handling — concurrency semaphore prevents flooding provider rate limits
  • Semantic caching — optional deduplication of identical requests
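The retry and rate-limit bullets above can be pictured with a small, self-contained sketch. This is not AgentFlow's or LiteLLM's actual implementation: `RateLimitError`, `call_with_retries`, and the parameter names are hypothetical stand-ins chosen for illustration.

```python
import asyncio
import random


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's rate-limit (HTTP 429) error."""


async def call_with_retries(send, *, semaphore, max_retries=3, base_delay=0.5):
    """Run `send()` under a concurrency cap, retrying with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            async with semaphore:  # caps how many calls are in flight at once
                return await send()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retry budget exhausted
            # The delay doubles on each attempt; a little jitter spreads retries out.
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay / 10))


async def demo():
    attempts = 0

    async def flaky_call():
        nonlocal attempts
        attempts += 1
        if attempts < 3:
            raise RateLimitError("429")
        return "ok"

    semaphore = asyncio.Semaphore(4)  # at most 4 concurrent provider calls
    result = await call_with_retries(flaky_call, semaphore=semaphore, base_delay=0.01)
    return result, attempts
```

The sleep happens outside the semaphore, so a request that is backing off does not hold a concurrency slot.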

Supported providers

Provider | Models | Highlights
OpenAI | GPT-5, GPT-5 Mini, GPT-4.1, o-series | Reasoning effort controls, vision, structured output
Anthropic | Claude Sonnet 4, Claude 4 Opus | Large context windows, vision
Google | Gemini 2.5, Gemini 3 Preview | Multi-modal, long context
xAI | Grok 3, Grok 3 Fast | Up to 2M token context
Azure OpenAI | GPT-4.1, GPT-5 (via Azure) | Enterprise compliance, private endpoints
AWS Bedrock | Claude, Titan, Llama | VPC-native, no data leaves AWS
100+ more | Via LiteLLM | Any provider LiteLLM supports works out of the box

Bring your own API keys

AgentFlow supports tenant-scoped, encrypted BYO LLM keys. Admins can save more than one provider key for the same tenant (for example, one for Anthropic and one for Google Gemini), then choose exactly which models from those providers are allowed. BYO mode applies to the whole tenant. When a tenant is in byo mode:
  • AgentFlow routes LLM calls only through tenant-supplied provider keys.
  • The public /api/v1/models catalog returns only models allowed by the tenant policy.
  • Explicit requests for disallowed models fail with 403 instead of falling back to platform defaults.
  • Backend defaults are remapped per use case, including agent chat, raw LLM chat, tool calls, sub-agents, title generation, KB enrichment, follow-up questions, autocomplete, query processing, planning, reflection, summaries, vision, embeddings, and reranking.
  • Platform keys are not used for that tenant. If no allowed/default model can satisfy a provider-locked use case such as embeddings or reranking, the call fails closed.
Keys are encrypted before storage. API responses never return the secret; they only expose non-sensitive metadata such as key_last_four so admins can identify which key is active.
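The BYO rules above (403 for disallowed models, fail closed when no default fits) can be sketched as a small resolution function. This is an illustration under assumptions: `TenantPolicy`, `resolve_model`, and both exception classes are hypothetical names, not part of the AgentFlow API.

```python
from dataclasses import dataclass, field


class ModelNotAllowed(Exception):
    """In this sketch, corresponds to the 403 returned for disallowed models."""
    status_code = 403


class NoModelForUseCase(Exception):
    """Raised when a use case has no usable default: the call fails closed."""


@dataclass
class TenantPolicy:
    allowed_models: set = field(default_factory=set)
    default_models: dict = field(default_factory=dict)  # use case -> model id


def resolve_model(policy, use_case, requested=None):
    """Pick the model for a call made by a tenant in byo mode."""
    if requested is not None:
        # Explicit requests never fall back to platform defaults.
        if requested not in policy.allowed_models:
            raise ModelNotAllowed(requested)
        return requested
    model = policy.default_models.get(use_case)
    if model is None or model not in policy.allowed_models:
        raise NoModelForUseCase(use_case)
    return model


policy = TenantPolicy(
    allowed_models={"anthropic/claude-sonnet-4-5-20250929"},
    default_models={"chat": "anthropic/claude-sonnet-4-5-20250929"},
)
chat_model = resolve_model(policy, "chat")
```

With this policy, `resolve_model(policy, "embedding")` raises `NoModelForUseCase` because no embedding model has been mapped.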

Admin BYO workflow

  1. Call GET /api/v1/llm/config/model-options to list models and backend use cases that can be mapped.
  2. Save one provider at a time with POST /api/v1/llm/config, passing api_key, allowed_models, and default_models.
  3. Repeat for additional providers, for example one Anthropic key for chat/tool workloads and one Google key for Gemini workloads.
  4. Saving a provider key automatically enables BYO mode for the tenant. You can also switch modes explicitly with PUT /api/v1/llm/config/mode and {"mode": "byo"}.
  5. Verify the effective merged policy with GET /api/v1/llm/config.
Example:
curl -X POST https://api.example.com/api/v1/llm/config \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "anthropic",
    "api_key": "sk-ant-...",
    "allowed_models": [
      "anthropic/claude-haiku-4-5-20251101",
      "anthropic/claude-sonnet-4-5-20250929"
    ],
    "default_models": {
      "chat": "anthropic/claude-sonnet-4-5-20250929",
      "tool": "anthropic/claude-haiku-4-5-20251101",
      "sub_agent": "anthropic/claude-haiku-4-5-20251101",
      "title": "anthropic/claude-haiku-4-5-20251101",
      "enrichment": "anthropic/claude-haiku-4-5-20251101"
    }
  }'
Add Google Gemini alongside it:
curl -X POST https://api.example.com/api/v1/llm/config \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "google",
    "api_key": "AIza...",
    "allowed_models": ["google/gemini-2.5-flash"],
    "default_models": {
      "follow_up_questions": "google/gemini-2.5-flash",
      "autocomplete": "google/gemini-2.5-flash"
    }
  }'
For knowledge bases, include an embedding-capable model in the policy and map embedding. Today that is typically an OpenAI embedding model:
{
  "provider": "openai",
  "allowed_models": ["openai/text-embedding-3-small"],
  "default_models": {
    "embedding": "openai/text-embedding-3-small"
  }
}
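For scripted setups, the same POST can be prepared from Python. The sketch below uses only the standard library and builds the request without sending it. The helper name `build_provider_config_request` and the client-side check are assumptions for illustration; the endpoint, host placeholder, and field names come from the curl examples above.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host, as in the curl examples


def build_provider_config_request(admin_token, provider, api_key,
                                  allowed_models, default_models):
    """Build (but do not send) a POST /api/v1/llm/config request."""
    # Client-side sanity check (an assumption, not documented server behavior):
    # every default should point at a model that is also in the allow-list.
    stray = sorted(set(default_models.values()) - set(allowed_models))
    if stray:
        raise ValueError(f"default_models not in allowed_models: {stray}")
    payload = {
        "provider": provider,
        "api_key": api_key,
        "allowed_models": list(allowed_models),
        "default_models": dict(default_models),
    }
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/llm/config",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {admin_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_provider_config_request(
    admin_token="ADMIN_TOKEN_PLACEHOLDER",
    provider="google",
    api_key="AIza...",
    allowed_models=["google/gemini-2.5-flash"],
    default_models={"autocomplete": "google/gemini-2.5-flash"},
)
# To send it: urllib.request.urlopen(req)
```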

Model catalog API

List all available models with capabilities and pricing:
GET /api/v1/models
[
  {
    "id": "openai/gpt-4.1",
    "name": "GPT-4.1",
    "provider": "openai",
    "description": "Latest GPT-4.1 model",
    "capabilities": {
      "contextWindow": 1048576,
      "supportsVision": true,
      "supportsStreaming": true,
      "supportsTools": true,
      "supportsReasoning": true
    },
    "supportedReasoningEfforts": ["low", "medium", "high"],
    "costPer1kTokens": { "input": 0.002, "output": 0.008 },
    "maxTokens": 32768,
    "available": true
  }
]
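A catalog entry carries enough information to estimate the cost of a request before sending it. The helpers below are a minimal sketch, not part of the SDK: `find_model` and `estimate_cost` are hypothetical names, and the result is in whatever price unit the catalog uses.

```python
def find_model(models, model_id):
    """Return the catalog entry with the given id."""
    for model in models:
        if model["id"] == model_id:
            return model
    raise KeyError(model_id)


def estimate_cost(model, input_tokens, output_tokens):
    """Estimate request cost from the entry's costPer1kTokens rates."""
    rates = model["costPer1kTokens"]
    return (input_tokens / 1000) * rates["input"] + (output_tokens / 1000) * rates["output"]


# Trimmed copy of the example response above.
catalog = [
    {
        "id": "openai/gpt-4.1",
        "costPer1kTokens": {"input": 0.002, "output": 0.008},
        "maxTokens": 32768,
        "available": True,
    }
]

# 10,000 input tokens and 2,000 output tokens: about 0.036
cost = estimate_cost(find_model(catalog, "openai/gpt-4.1"), 10_000, 2_000)
```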

Model selection

Models can be set at three levels. A per-request override takes highest priority, followed by the agent default, then the user default:

Per-agent default

from agentflow import AsyncAgentFlow

async with AsyncAgentFlow.from_profile("local") as client:
    agent_id = {agent.name: agent.id for agent in await client.agents.list()}["AnalyticsAgent"]
    agent = await client.agents.update(
        agent_id,
        llm_config={"model": "openai/gpt-4.1", "temperature": 0.3},
    )

Per-user default

Users set a preferred model via the settings API. All requests use this model unless overridden at the agent or request level.
PATCH /api/v1/settings
{ "model": { "selectedModel": "anthropic/claude-sonnet-4" } }

Per-request override

POST /api/v1/agent/{agent_id}/chat
{
  "message": "Analyze this image",
  "conversation_id": "conv_001",
  "message_id": "msg_001",
  "model": "openai/gpt-5.4-mini",
  "stream": true
}
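The precedence across the three levels in this section (request first, then agent, then user) can be written as a short fallback chain. This is a sketch: `select_model` is a hypothetical helper, and the `platform_default` fallback is an assumption that the text does not describe.

```python
def select_model(request_model=None, agent_model=None, user_model=None,
                 platform_default=None):
    """Return the first model that is set, in priority order."""
    # Highest priority first: per-request override, agent default, user default.
    for candidate in (request_model, agent_model, user_model, platform_default):
        if candidate:
            return candidate
    raise ValueError("no model configured at any level")


# The user prefers Claude, but the agent default wins when no override is sent.
chosen = select_model(
    agent_model="openai/gpt-4.1",
    user_model="anthropic/claude-sonnet-4",
)
```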

Reasoning effort

For models that support reasoning, control reasoning depth with model-specific allowed values. The /api/v1/models response includes each model’s supportedReasoningEfforts.
Level | Use case
none | Skip reasoning entirely
low | Fast, simple tasks
medium | Balanced (default)
high | Complex analysis, multi-step reasoning
xhigh | Maximum reasoning depth
OpenAI GPT-5.4 models accept "none", "low", "medium", "high", and "xhigh". Some xAI models accept only "low" and "high". Do not send unsupported values such as "minimal" unless the target model advertises them.
POST /api/v1/agent/{agent_id}/chat
{
  "message": "Analyze the risk factors in this deal and recommend a mitigation strategy",
  "conversation_id": "conv_001",
  "message_id": "msg_002",
  "reasoning_effort": "high",
  "reasoning_summary": "concise",
  "stream": true
}
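Because allowed values differ per model, a client can check `reasoning_effort` against the catalog before sending the request. The function below is a minimal sketch that operates on an entry shaped like the /api/v1/models response; `check_reasoning_effort` is a hypothetical helper, not an SDK function.

```python
def check_reasoning_effort(model, effort):
    """Return `effort` if the catalog entry advertises it, else raise ValueError."""
    if not model.get("capabilities", {}).get("supportsReasoning", False):
        raise ValueError(f"{model['id']} does not support reasoning")
    supported = model.get("supportedReasoningEfforts", [])
    if effort not in supported:
        raise ValueError(
            f"{model['id']} does not accept reasoning_effort={effort!r}; "
            f"supported values: {supported}"
        )
    return effort


# Entry shaped like the catalog example.
entry = {
    "id": "openai/gpt-4.1",
    "capabilities": {"supportsReasoning": True},
    "supportedReasoningEfforts": ["low", "medium", "high"],
}
effort = check_reasoning_effort(entry, "high")
```

Calling `check_reasoning_effort(entry, "minimal")` raises `ValueError`, since this entry does not advertise that value.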

Reasoning summary

Control whether and how the model’s reasoning is surfaced:
Mode | Behavior
"auto" | Model decides whether to include reasoning
"concise" | Brief reasoning summary included
"detailed" | Full reasoning trace included
The reasoning summary appears in the SSE event metadata, separate from the main response content.

Structured output

Force the model to return valid JSON matching a specific schema:

Via SDK (Pydantic models)

from pydantic import BaseModel

class DealAnalysis(BaseModel):
    risk_level: str
    confidence: float
    key_factors: list[str]
    recommendation: str

# `agent` is an agent object obtained from the SDK client; run this inside an async function.
result = await agent.run(
    "Analyze the Acme Corp deal",
    response_model=DealAnalysis,
)
print(result.risk_level)  # Typed access

Via REST API (JSON schema)

POST /api/v1/agent/{agent_id}/chat
{
  "message": "Analyze the Acme Corp deal",
  "conversation_id": "conv_001",
  "message_id": "msg_003",
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "deal_analysis",
      "strict": true,
      "schema": {
        "type": "object",
        "properties": {
          "risk_level": { "type": "string" },
          "confidence": { "type": "number" },
          "key_factors": { "type": "array", "items": { "type": "string" } }
        },
        "required": ["risk_level", "confidence", "key_factors"],
        "additionalProperties": false
      }
    }
  },
  "stream": true
}
Supported response_format types:
  • {"type": "json_object"} — model returns valid JSON (schema not enforced)
  • {"type": "json_schema", ...} — model returns JSON matching the exact schema (strict mode)
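When calling the REST API without the SDK, the `response_format` payload can be built and the returned JSON checked with the standard library alone. The helpers below are a sketch: `json_schema_format` and `parse_strict` are hypothetical names, and the check covers only required keys and extra keys, not full JSON Schema validation.

```python
import json


def json_schema_format(name, schema):
    """Build a strict json_schema response_format payload."""
    return {
        "type": "json_schema",
        "json_schema": {"name": name, "strict": True, "schema": schema},
    }


def parse_strict(raw, schema):
    """Parse model output and check required and extra top-level keys."""
    data = json.loads(raw)
    missing = [key for key in schema.get("required", []) if key not in data]
    extra = []
    if schema.get("additionalProperties") is False:
        extra = [key for key in data if key not in schema.get("properties", {})]
    if missing or extra:
        raise ValueError(f"missing keys: {missing}, unexpected keys: {extra}")
    return data


schema = {
    "type": "object",
    "properties": {
        "risk_level": {"type": "string"},
        "confidence": {"type": "number"},
        "key_factors": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["risk_level", "confidence", "key_factors"],
    "additionalProperties": False,
}
response_format = json_schema_format("deal_analysis", schema)
# Illustrative model output, not a real response.
parsed = parse_strict(
    '{"risk_level": "medium", "confidence": 0.7, "key_factors": ["pricing"]}',
    schema,
)
```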

Vision

Models with vision support automatically handle image attachments:
POST /api/v1/agent/{agent_id}/chat
{
  "message": "What does this chart show?",
  "conversation_id": "conv_001",
  "message_id": "msg_004",
  "attachment_ids": ["file_abc123"],
  "image_detail": "high",
  "stream": true
}
image_detail controls resolution: "low" (faster, cheaper), "high" (full resolution), or "auto" (model decides, default). When images are present and no model override is specified, AgentFlow automatically selects a vision-capable model.
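The automatic selection described above could work along these lines. How AgentFlow actually chooses a vision-capable model is not documented here, so treat this as one plausible sketch: `pick_model` is a hypothetical function, and the capability values in the sample catalog are illustrative only.

```python
def pick_model(models, has_images, override=None, default=None):
    """Choose a model id, switching to a vision-capable one when images are attached."""
    if override:
        return override  # an explicit model override is always respected
    if not has_images:
        return default
    by_id = {model["id"]: model for model in models}
    # Keep the default if it can already handle images.
    if default in by_id and by_id[default]["capabilities"].get("supportsVision"):
        return default
    for model in models:
        if model.get("available") and model["capabilities"].get("supportsVision"):
            return model["id"]
    raise LookupError("no vision-capable model available")


# Illustrative entries; the capability flags are made up for the example.
catalog = [
    {"id": "example/text-only", "available": True,
     "capabilities": {"supportsVision": False}},
    {"id": "openai/gpt-4.1", "available": True,
     "capabilities": {"supportsVision": True}},
]
chosen = pick_model(catalog, has_images=True, default="example/text-only")
```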

Context window management

AgentFlow tracks token usage against each model’s context window:
  • SSE events include context_window_size and context_usage_percentage in metadata
  • Per-request metrics: primary_total_tokens, primary_model, primary_context_usage_percentage
  • Chat-history compaction automatically triggers when approaching context limits
  • The LLM API itself enforces hard limits — oversized requests return clear errors through the SSE error handler
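The usage percentage reported in the metadata is simple arithmetic over the token count and the context window. The sketch below shows the calculation and a compaction check. The 80% threshold and both function names are assumptions for illustration, since the actual trigger point is not stated.

```python
def context_usage_percentage(total_tokens, context_window_size):
    """Share of the context window consumed, as a percentage."""
    return round(100 * total_tokens / context_window_size, 2)


def should_compact(total_tokens, context_window_size, threshold_pct=80.0):
    """True when usage reaches the (assumed) compaction threshold."""
    return context_usage_percentage(total_tokens, context_window_size) >= threshold_pct


# Half of a 1,048,576-token window is in use.
usage = context_usage_percentage(524_288, 1_048_576)
needs_compaction = should_compact(900_000, 1_048_576)
```

Here `usage` is 50.0 and `needs_compaction` is true, because 900,000 tokens is roughly 86% of the window.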