Models & Providers
AgentFlow uses LiteLLM as its model gateway, providing unified access to 100+ LLM providers through a single interface. Switch between models per-agent, per-request, or per-user, with full visibility into capabilities, context windows, and pricing.
Multi-provider architecture
All LLM calls go through LiteLLM, which handles:
- Provider dispatch: routes to OpenAI, Anthropic, Google, xAI, Azure, AWS Bedrock, and 100+ others
- Unified API: same request/response format regardless of provider
- Automatic retries: exponential backoff with a configurable retry count
- Rate limit handling: a concurrency semaphore prevents flooding provider rate limits
- Semantic caching: optional deduplication of identical requests
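The retry and concurrency behavior listed above can be illustrated in plain `asyncio`. This is a minimal sketch, not LiteLLM's or AgentFlow's code: `call_with_limits`, `ProviderRateLimitError`, `max_retries`, and `base_delay` are hypothetical names, and the semaphore size of 8 is an arbitrary example value.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class ProviderRateLimitError(Exception):
    """Hypothetical transient provider error (for example an HTTP 429)."""


async def call_with_limits(
    call: Callable[[], Awaitable[T]],
    semaphore: asyncio.Semaphore,
    max_retries: int = 3,
    base_delay: float = 0.5,
) -> T:
    """Run `call` under a shared concurrency cap, retrying transient errors
    with exponential backoff: base_delay * 2**attempt plus a little jitter."""
    for attempt in range(max_retries + 1):
        # The semaphore bounds how many provider requests are in flight at once.
        async with semaphore:
            try:
                return await call()
            except ProviderRateLimitError:
                if attempt == max_retries:
                    raise  # retries exhausted: surface the error to the caller
        # Sleep outside the semaphore so a backing-off call does not hold a slot.
        delay = base_delay * (2 ** attempt)
        await asyncio.sleep(delay + random.uniform(0, delay / 10))
    raise AssertionError("unreachable: the loop either returns or raises")


async def _demo() -> None:
    attempts = 0

    async def flaky_provider() -> str:
        nonlocal attempts
        attempts += 1
        if attempts < 3:
            raise ProviderRateLimitError("429 Too Many Requests")
        return "ok"

    semaphore = asyncio.Semaphore(8)  # at most 8 concurrent provider calls
    result = await call_with_limits(flaky_provider, semaphore, base_delay=0.01)
    print(result, attempts)  # ok 3


if __name__ == "__main__":
    asyncio.run(_demo())
```

Backing off outside the semaphore is a deliberate choice in this sketch: a request that is waiting to retry does not block other requests from using the provider.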
Supported providers
| Provider | Models | Highlights |
|---|---|---|
| OpenAI | GPT-5, GPT-5 Mini, GPT-4.1, o-series | Reasoning effort controls, vision, structured output |
| Anthropic | Claude Sonnet 4, Claude 4 Opus | Large context windows, vision |
| Google | Gemini 2.5, Gemini 3 Preview | Multi-modal, long context |
| xAI | Grok 3, Grok 3 Fast | Up to 2M token context |
| Azure OpenAI | GPT-4.1, GPT-5 (via Azure) | Enterprise compliance, private endpoints |
| AWS Bedrock | Claude, Titan, Llama | VPC-native, no data leaves AWS |
| 100+ more | Via LiteLLM | Any provider LiteLLM supports works out of the box |
Bring your own API keys
AgentFlow supports tenant-scoped, encrypted BYO LLM keys. Admins can save more than one provider key for the same tenant (for example, Anthropic and Google Gemini) and then choose exactly which models from those providers are allowed. BYO LLM is tenant-wide. When a tenant is in `byo` mode:
- AgentFlow routes LLM calls only through tenant-supplied provider keys.
- The public `/api/v1/models` catalog returns only models allowed by the tenant policy.
- Explicit requests for disallowed models fail with `403` instead of falling back to platform defaults.
- Backend defaults are remapped per use case, including agent chat, raw LLM chat, tool calls, sub-agents, title generation, KB enrichment, follow-up questions, autocomplete, query processing, planning, reflection, summaries, vision, embeddings, and reranking.
- Platform keys are not used for that tenant. If no allowed or default model can satisfy a provider-locked use case such as embeddings or reranking, the call fails closed.
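The `byo` routing rules above amount to a small decision function. This is a minimal sketch under stated assumptions: `TenantPolicy`, `resolve_byo_model`, and the two exception types are hypothetical; the `allowed_models` and `default_models` fields mirror the admin API parameter names, but their in-memory shape (a set, and a use-case-to-model mapping) is assumed, as are the placeholder model IDs and the `agent_chat` use-case key.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


class ModelNotAllowedError(Exception):
    """Explicit request for a model outside the tenant's allowed list."""

    http_status = 403  # the documented status for disallowed models


class NoAllowedModelError(Exception):
    """No allowed default exists for a use case, so the call fails closed."""


@dataclass
class TenantPolicy:
    # Hypothetical in-memory view of a byo tenant's merged policy.
    allowed_models: Set[str] = field(default_factory=set)
    default_models: Dict[str, str] = field(default_factory=dict)  # use case -> model


def resolve_byo_model(policy: TenantPolicy, use_case: str, requested: Optional[str] = None) -> str:
    """Pick the model for one call in a byo tenant; never falls back to platform keys."""
    if requested is not None:
        if requested not in policy.allowed_models:
            # Reject instead of silently substituting a platform default.
            raise ModelNotAllowedError(f"{requested!r} is not allowed for this tenant")
        return requested
    default = policy.default_models.get(use_case)
    if default is None or default not in policy.allowed_models:
        # Fail closed, e.g. for embeddings or reranking with no allowed model.
        raise NoAllowedModelError(f"no allowed model configured for {use_case!r}")
    return default


policy = TenantPolicy(
    allowed_models={"claude-sonnet-4", "gemini-2.5-pro"},  # placeholder model IDs
    default_models={"agent_chat": "claude-sonnet-4"},      # hypothetical use-case key
)
print(resolve_byo_model(policy, "agent_chat"))  # claude-sonnet-4
```

The two error types are kept separate because the behaviors differ: a disallowed explicit request is a client error (`403`), while a missing default for a provider-locked use case is a configuration gap that the sketch surfaces rather than papering over.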
Saved keys are stored encrypted; the configuration reports each key's `key_last_four` so admins can identify which key is active.
Admin BYO workflow
- Call `GET /api/v1/llm/config/model-options` to list models and backend use cases that can be mapped.
- Save one provider at a time with `POST /api/v1/llm/config`, passing `api_key`, `allowed_models`, and `default_models`.
- Repeat for additional providers, for example one Anthropic key for chat/tool workloads and one Google key for Gemini workloads.
- Saving a provider key automatically enables BYO mode for the tenant. You can also switch modes explicitly with `PUT /api/v1/llm/config/mode` and `{"mode": "byo"}`.
- Verify the effective merged policy with `GET /api/v1/llm/config`.
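The admin workflow can be scripted against the endpoints listed above. A minimal sketch using only the Python standard library: it builds the requests in order but does not send them. The bearer-token header, the host name, the placeholder model IDs, and `default_models` as a use-case-to-model mapping are assumptions; the real use-case keys come from `GET /api/v1/llm/config/model-options`, and whether a separate provider field is required is not covered here.

```python
import json
import urllib.request
from typing import Any, Dict, List, Optional


def _json_request(base_url: str, token: str, method: str, path: str,
                  body: Optional[Dict[str, Any]] = None) -> urllib.request.Request:
    data = None if body is None else json.dumps(body).encode("utf-8")
    req = urllib.request.Request(base_url.rstrip("/") + path, data=data, method=method)
    req.add_header("Authorization", f"Bearer {token}")  # assumed auth scheme
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


def build_byo_setup(base_url: str, token: str,
                    providers: List[Dict[str, Any]]) -> List[urllib.request.Request]:
    """Build the admin BYO workflow as an ordered list of requests."""
    reqs = [_json_request(base_url, token, "GET", "/api/v1/llm/config/model-options")]
    # One POST per provider key.
    reqs += [_json_request(base_url, token, "POST", "/api/v1/llm/config", p) for p in providers]
    # Optional: saving a key already enables BYO mode.
    reqs.append(_json_request(base_url, token, "PUT", "/api/v1/llm/config/mode", {"mode": "byo"}))
    # Read back the effective merged policy.
    reqs.append(_json_request(base_url, token, "GET", "/api/v1/llm/config"))
    return reqs


providers = [
    {   # Anthropic key for chat and tool workloads
        "api_key": "YOUR_ANTHROPIC_KEY",
        "allowed_models": ["claude-sonnet-4"],                # placeholder model ID
        "default_models": {"agent_chat": "claude-sonnet-4"},  # hypothetical use-case key
    },
    {   # Google key for Gemini workloads
        "api_key": "YOUR_GOOGLE_KEY",
        "allowed_models": ["gemini-2.5-pro"],
        "default_models": {"vision": "gemini-2.5-pro"},
    },
]
for req in build_byo_setup("https://agentflow.example.com", "ADMIN_TOKEN", providers):
    print(req.get_method(), req.full_url)  # send with urllib.request.urlopen(req)
```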
If the tenant needs embeddings, allow and set a default model for the `embedding` use case. Today that is typically an OpenAI embedding model.
Model catalog API
Use `GET /api/v1/models` to list all available models with their capabilities and pricing.
Model selection
Models can be set at three levels. A per-request model takes highest priority, then the per-agent default, then the per-user default.
Per-agent default
Per-user default
Users set a preferred model via the settings API. All requests use this model unless it is overridden at the agent or request level.
Per-request override
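A per-request model wins over both defaults. The precedence described above (request, then agent, then user) can be sketched as a small resolver; `effective_model` and the `fallback` parameter standing in for the tenant or platform default are hypothetical, and the model IDs are placeholders.

```python
from typing import Optional


def effective_model(
    request_model: Optional[str],
    agent_default: Optional[str],
    user_default: Optional[str],
    fallback: str,
) -> str:
    """Return the most specific model that is set: request, then agent, then user."""
    for candidate in (request_model, agent_default, user_default):
        if candidate:  # None or "" means "not set at this level"
            return candidate
    return fallback  # hypothetical tenant/platform default


print(effective_model("gpt-5", "claude-sonnet-4", None, "platform-default"))  # gpt-5
print(effective_model(None, None, "gemini-2.5-pro", "platform-default"))      # gemini-2.5-pro
```

In a tenant in `byo` mode, the resolved model would still be subject to the allowed-model checks described under Bring your own API keys.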
Reasoning effort
For models that support reasoning, control reasoning depth with model-specific allowed values. The `/api/v1/models` response includes each model's `supportedReasoningEfforts`.
| Level | Use case |
|---|---|
| `none` | Skip reasoning entirely |
| `low` | Fast, simple tasks |
| `medium` | Balanced (default) |
| `high` | Complex analysis, multi-step reasoning |
| `xhigh` | Maximum reasoning depth |
Accepted values vary by model and can include `"none"`, `"low"`, `"medium"`, `"high"`, and `"xhigh"`. Some xAI models accept only `"low"` and `"high"`. Do not send unsupported values such as `"minimal"` unless the target model advertises them.
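Because allowed values differ per model, a client can check a requested effort against the catalog entry before sending it. A minimal sketch, assuming a catalog entry is a dict carrying the `supportedReasoningEfforts` list; the `id` key, the example entry, and the helper name `pick_reasoning_effort` are hypothetical.

```python
from typing import Any, Dict, Optional


def pick_reasoning_effort(model_entry: Dict[str, Any], requested: str) -> Optional[str]:
    """Return `requested` if the model advertises it, None if the model
    has no reasoning controls, and raise ValueError otherwise."""
    supported = model_entry.get("supportedReasoningEfforts") or []
    if not supported:
        # The model advertises no reasoning controls: omit the parameter entirely.
        return None
    if requested not in supported:
        raise ValueError(
            f"{model_entry.get('id', '<unknown>')} does not support reasoning effort "
            f"{requested!r}; supported: {supported}"
        )
    return requested


grok_entry = {"id": "grok-example", "supportedReasoningEfforts": ["low", "high"]}
print(pick_reasoning_effort(grok_entry, "high"))  # high
```

Raising instead of silently downgrading (for example, `"medium"` to `"low"`) keeps the caller aware that the model does not support the requested depth.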
Reasoning summary
Control whether and how the model's reasoning is surfaced:

| Mode | Behavior |
|---|---|
| `"auto"` | Model decides whether to include reasoning |
| `"concise"` | Brief reasoning summary included |
| `"detailed"` | Full reasoning trace included |
Structured output
Force the model to return valid JSON matching a specific schema.
Via SDK (Pydantic models)
Via REST API (JSON schema)
`response_format` types:
- `{"type": "json_object"}`: the model returns valid JSON (schema not enforced)
- `{"type": "json_schema", ...}`: the model returns JSON matching the exact schema (strict mode)
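With `json_object`, only JSON validity is guaranteed, so a client may still want to check the shape of the result. This is a minimal sketch under stated assumptions: the `json_schema` object uses the OpenAI-style `name`/`schema`/`strict` fields that LiteLLM accepts, which is an assumption about what AgentFlow forwards. `TICKET_SCHEMA` and `parse_structured` are hypothetical, and the checker covers only required keys and simple property types, not full JSON Schema.

```python
import json
from typing import Any, Dict

# Illustrative schema for the object we want back.
TICKET_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "priority": {"type": "integer"},
    },
    "required": ["title", "priority"],
    "additionalProperties": False,
}

# json_object: output is valid JSON, but the schema is not enforced.
JSON_OBJECT_FORMAT = {"type": "json_object"}

# json_schema (strict): assumed OpenAI-style shape as accepted by LiteLLM.
JSON_SCHEMA_FORMAT = {
    "type": "json_schema",
    "json_schema": {"name": "ticket", "schema": TICKET_SCHEMA, "strict": True},
}

_PY_TYPES = {"string": str, "integer": int, "number": (int, float),
             "boolean": bool, "object": dict, "array": list}


def parse_structured(text: str, schema: Dict[str, Any]) -> Dict[str, Any]:
    """Parse model output, then check required keys and simple property types only."""
    data = json.loads(text)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key in schema.get("required", []):
        if key not in data:
            raise ValueError(f"missing required key {key!r}")
    for key, spec in schema.get("properties", {}).items():
        type_name = spec.get("type")
        expected = _PY_TYPES.get(type_name) if isinstance(type_name, str) else None
        if key not in data or expected is None:
            continue  # absent optional key, or a type this sketch does not check
        value = data[key]
        # bool is a subclass of int in Python, so reject it for numeric types.
        if isinstance(value, bool) and type_name in ("integer", "number"):
            raise ValueError(f"{key!r} should be {type_name}, got boolean")
        if not isinstance(value, expected):
            raise ValueError(f"{key!r} should be {type_name}")
    return data


print(parse_structured('{"title": "Login fails", "priority": 2}', TICKET_SCHEMA))
# {'title': 'Login fails', 'priority': 2}
```

On the SDK side, Pydantic v2's `BaseModel.model_json_schema()` produces a JSON Schema dict that could take the place of the hand-written `TICKET_SCHEMA`.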
Vision
Models with vision support automatically handle image attachments. `image_detail` controls resolution: `"low"` (faster, cheaper), `"high"` (full resolution), or `"auto"` (model decides; the default).
When images are present and no model override is specified, AgentFlow automatically selects a vision-capable model.
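The automatic switch to a vision-capable model can be sketched as follows. Assumptions: catalog entries are dicts with an `id` and a hypothetical boolean `supportsVision` flag, and "first vision-capable model in catalog order" is an illustrative rule, not AgentFlow's documented selection policy. The function name and model IDs are placeholders.

```python
from typing import Any, Dict, List, Optional


def choose_model_for_request(
    catalog: List[Dict[str, Any]],
    default_model: str,
    has_images: bool,
    override: Optional[str] = None,
) -> str:
    """Pick a model, switching to a vision-capable one only when needed."""
    if override:
        return override  # an explicit model choice is never switched automatically
    if not has_images:
        return default_model
    by_id = {entry["id"]: entry for entry in catalog}
    if by_id.get(default_model, {}).get("supportsVision"):
        return default_model  # the default already handles images
    for entry in catalog:  # first vision-capable model in catalog order
        if entry.get("supportsVision"):
            return entry["id"]
    raise LookupError("no vision-capable model available")


catalog = [
    {"id": "text-model", "supportsVision": False},   # placeholder entries
    {"id": "vision-model", "supportsVision": True},
]
print(choose_model_for_request(catalog, "text-model", has_images=True))  # vision-model
```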
Context window management
AgentFlow tracks token usage against each model's context window:
- SSE events include `context_window_size` and `context_usage_percentage` in their metadata
- Per-request metrics: `primary_total_tokens`, `primary_model`, `primary_context_usage_percentage`
- Chat-history compaction triggers automatically when usage approaches the context limit
- The LLM API itself enforces hard limits: oversized requests return clear errors through the SSE error handler
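The metadata fields above relate by simple arithmetic. A minimal sketch of how a client might read them, assuming each SSE `data:` line carries a JSON object with a `metadata` key, and that `context_usage_percentage` equals tokens divided by `context_window_size`, times 100. The 80% threshold is a hypothetical value, since the point at which AgentFlow triggers compaction is not stated. The helper names are illustrative.

```python
import json
from typing import Any, Dict

COMPACTION_THRESHOLD = 80.0  # hypothetical trigger point, in percent of the window


def usage_percentage(total_tokens: int, context_window_size: int) -> float:
    """Share of the context window consumed, as a percentage."""
    if context_window_size <= 0:
        raise ValueError("context_window_size must be positive")
    return 100.0 * total_tokens / context_window_size


def parse_sse_metadata(line: str) -> Dict[str, Any]:
    """Return the `metadata` object from one SSE `data:` line (assumed layout)."""
    if not line.startswith("data:"):
        return {}  # comments, event names, and blank lines carry no payload
    payload = json.loads(line[len("data:"):].strip())
    return payload.get("metadata", {}) if isinstance(payload, dict) else {}


def near_limit(metadata: Dict[str, Any], threshold: float = COMPACTION_THRESHOLD) -> bool:
    """True when reported usage has reached the (hypothetical) compaction threshold."""
    return float(metadata.get("context_usage_percentage", 0.0)) >= threshold


event = 'data: {"metadata": {"context_window_size": 200000, "context_usage_percentage": 85.0}}'
meta = parse_sse_metadata(event)
print(near_limit(meta))                   # True
print(usage_percentage(50_000, 200_000))  # 25.0
```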

