
Models & Providers

AgentFlow uses LiteLLM as its model gateway, providing unified access to 100+ LLM providers through a single interface. Switch between models per-agent, per-request, or per-user — with full visibility into capabilities, context windows, and pricing.

Multi-provider architecture

All LLM calls go through LiteLLM, which handles:
  • Provider dispatch — routes to OpenAI, Anthropic, Google, xAI, Azure, AWS Bedrock, and 100+ others
  • Unified API — same request/response format regardless of provider
  • Automatic retries — exponential backoff with configurable retry count
  • Rate limit handling — a concurrency semaphore caps in-flight requests so provider rate limits aren't exceeded
  • Semantic caching — optional deduplication of identical requests
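The dispatch-and-retry behavior above can be sketched in a few lines. This is a simplified illustration, not AgentFlow's or LiteLLM's actual implementation; the helper names are hypothetical:

```python
import random
import time

def parse_model_id(model_id: str) -> tuple[str, str]:
    """Split a "provider/model" identifier into its two parts."""
    provider, _, name = model_id.partition("/")
    if not name:
        raise ValueError(f"expected 'provider/model', got {model_id!r}")
    return provider, name

def call_with_retries(send, max_retries: int = 3, base_delay: float = 0.5):
    """Retry a provider call with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return send()
        except Exception:
            # Real code would inspect the error and only retry transient ones.
            if attempt == max_retries:
                raise
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

The configurable retry count maps to `max_retries` here; each attempt doubles the wait, so transient provider errors are absorbed without hammering the API.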

Supported providers

Provider | Models | Highlights
OpenAI | GPT-5, GPT-5 Mini, GPT-4.1, o-series | Reasoning effort controls, vision, structured output
Anthropic | Claude Sonnet 4, Claude 4 Opus | Large context windows, vision
Google | Gemini 2.5, Gemini 3 Preview | Multi-modal, long context
xAI | Grok 3, Grok 3 Fast | Up to 2M token context
Azure OpenAI | GPT-4.1, GPT-5 (via Azure) | Enterprise compliance, private endpoints
AWS Bedrock | Claude, Titan, Llama | VPC-native, no data leaves AWS
100+ more | Via LiteLLM | Any provider LiteLLM supports works out of the box

Bring your own API keys

AgentFlow supports tenant-scoped API key management. Each tenant can configure their own provider API keys, ensuring LLM traffic routes through their accounts for billing, compliance, and data residency requirements.

Model catalog API

List all available models with capabilities and pricing:
GET /models
[
  {
    "id": "openai/gpt-4.1",
    "name": "GPT-4.1",
    "provider": "openai",
    "description": "Latest GPT-4.1 model",
    "capabilities": {
      "contextWindow": 1048576,
      "supportsVision": true,
      "supportsStreaming": true,
      "supportsTools": true,
      "supportsReasoning": true
    },
    "supportedReasoningEfforts": ["low", "medium", "high"],
    "costPer1kTokens": { "input": 0.002, "output": 0.008 },
    "maxTokens": 32768,
    "available": true
  }
]
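The catalog response is easy to filter client-side, for example to find available vision-capable models under a cost ceiling. A sketch using the response shape shown above (the second entry is invented for contrast):

```python
# Two catalog entries in the /models response shape; "example/text-only" is hypothetical.
catalog = [
    {
        "id": "openai/gpt-4.1",
        "capabilities": {"supportsVision": True},
        "costPer1kTokens": {"input": 0.002, "output": 0.008},
        "available": True,
    },
    {
        "id": "example/text-only",
        "capabilities": {"supportsVision": False},
        "costPer1kTokens": {"input": 0.001, "output": 0.004},
        "available": True,
    },
]

def vision_models(models: list[dict], max_input_cost: float) -> list[str]:
    """Return IDs of available vision models at or under the input-cost cap."""
    return [
        m["id"]
        for m in models
        if m["available"]
        and m["capabilities"]["supportsVision"]
        and m["costPer1kTokens"]["input"] <= max_input_cost
    ]
```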

Model selection

Models can be set at three levels; precedence runs per-request override, then per-agent default, then per-user default:

Per-agent default

agent = await af.Agent.create(
    name="AnalyticsAgent",
    llm_config={"model": "openai/gpt-4.1", "temperature": 0.3},
)

Per-user default

Users set a preferred model via the settings API. All requests use this model unless overridden at the agent or request level.
PATCH /settings
{ "model": { "selectedModel": "anthropic/claude-sonnet-4" } }

Per-request override

POST /agent/{agent_id}/chat
{
  "message": "Analyze this image",
  "model_override": "openai/gpt-5",
  "stream": true
}

Reasoning effort

For models that support reasoning (OpenAI o-series, GPT-5, and others), control the depth of chain-of-thought reasoning:
Level | Use case
none | Skip reasoning entirely
minimal | Lightest reasoning pass
low | Fast, simple tasks
medium | Balanced (default)
high | Complex analysis, multi-step reasoning
xhigh | Maximum reasoning depth
POST /agent/{agent_id}/chat
{
  "message": "Analyze the risk factors in this deal and recommend a mitigation strategy",
  "reasoning_effort": "high",
  "reasoning_summary": "concise",
  "stream": true
}

Reasoning summary

Control whether and how the model’s reasoning is surfaced:
Mode | Behavior
"auto" | Model decides whether to include reasoning
"concise" | Brief reasoning summary included
"detailed" | Full reasoning trace included
The reasoning summary appears in the SSE event metadata, separate from the main response content.

Structured output

Force the model to return valid JSON matching a specific schema:

Via SDK (Pydantic models)

from pydantic import BaseModel

class DealAnalysis(BaseModel):
    risk_level: str
    confidence: float
    key_factors: list[str]
    recommendation: str

result = await agent.run(
    "Analyze the Acme Corp deal",
    response_model=DealAnalysis,
)
print(result.risk_level)  # Typed access
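Under the hood this is plain Pydantic validation of the model's JSON output. A self-contained sketch of what the SDK likely does with the raw response (the exact internals are an assumption; the model class repeats the one above so this runs standalone):

```python
from pydantic import BaseModel

class DealAnalysis(BaseModel):
    risk_level: str
    confidence: float
    key_factors: list[str]
    recommendation: str

# Raw JSON as it might come back from the LLM (values invented).
raw = (
    '{"risk_level": "medium", "confidence": 0.82,'
    ' "key_factors": ["pricing pressure"], "recommendation": "Proceed"}'
)

# Raises pydantic.ValidationError if the JSON doesn't match the schema.
analysis = DealAnalysis.model_validate_json(raw)
```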

Via REST API (JSON schema)

POST /agent/{agent_id}/chat
{
  "message": "Analyze the Acme Corp deal",
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "deal_analysis",
      "strict": true,
      "schema": {
        "type": "object",
        "properties": {
          "risk_level": { "type": "string" },
          "confidence": { "type": "number" },
          "key_factors": { "type": "array", "items": { "type": "string" } }
        },
        "required": ["risk_level", "confidence", "key_factors"],
        "additionalProperties": false
      }
    }
  },
  "stream": true
}
Supported response_format types:
  • {"type": "json_object"} — model returns valid JSON (schema not enforced)
  • {"type": "json_schema", ...} — model returns JSON matching the exact schema (strict mode)

Vision

Models with vision support automatically handle image attachments:
POST /agent/{agent_id}/chat
{
  "message": "What does this chart show?",
  "attachment_ids": ["file_abc123"],
  "image_detail": "high",
  "stream": true
}
image_detail controls resolution: "low" (faster, cheaper), "high" (full resolution), or "auto" (model decides, default). When images are present and no model override is specified, AgentFlow automatically selects a vision-capable model.

Context window management

AgentFlow tracks token usage against each model’s context window:
  • SSE events include context_window_size and context_usage_percentage in metadata
  • Per-request metrics: primary_total_tokens, primary_model, primary_context_usage_percentage
  • Conversation memory management automatically triggers when approaching context limits
  • The LLM API itself enforces hard limits — oversized requests return clear errors through the SSE error handler
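The usage metric and the compaction trigger are straightforward arithmetic. A sketch, assuming a percentage threshold (the 85% value is an assumption, not a documented default):

```python
def context_usage_percentage(used_tokens: int, context_window: int) -> float:
    """Percentage of the model's context window consumed so far."""
    return round(100 * used_tokens / context_window, 2)

def should_compact(used_tokens: int, context_window: int,
                   threshold: float = 85.0) -> bool:
    """Trigger conversation memory management near the context limit."""
    return context_usage_percentage(used_tokens, context_window) >= threshold
```

For GPT-4.1's 1,048,576-token window from the catalog example, half a window used reports 50.0, and compaction would kick in well before the provider's hard limit returns an error.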