Tenant-wide visibility
Track request volume, token usage, cost, failures, embeddings, and model mix across the selected time window.
Per-user accountability
See current-month user spend, effective caps, and selected-range cost by principal.
Live budget enforcement
Monthly tenant and user caps are checked before outbound LLM calls and reset automatically each month.
Flexible operations
Defaults can come from environment variables, then admins can override tenant and user caps in settings.
Dashboard experience
The Usage tab in settings is admin-only and built for operators who need to answer four questions quickly:

| Question | Dashboard section |
|---|---|
| How much have we spent this month? | Tenant budget card with spend, cap, percent used, and remaining amount |
| Is usage accelerating? | Daily cost trend for 7d, 30d, 90d, or all |
| Which models are driving cost? | By-model donut chart with provider/model labels |
| Which agents and users are driving usage? | By-agent chart and by-user table with monthly caps |
Access control
Usage and budget endpoints require an authenticated human user with an admin scope. This prevents ordinary tenant users from viewing tenant-wide spend or changing caps.

| Layer | Behavior |
|---|---|
| API guard | /api/v1/admin/usage/* and /api/v1/admin/budget* reject non-admin users with HTTP 403 |
| Scope source | Tokens can carry one of AGENTFLOW_ADMIN_SCOPES, and Auth0 role or permissionType values of admin map to agentflow:admin |
| SPA guard | The Usage settings tab and cap-edit controls are hidden unless the current user is an admin |
Admin authorization is enforced server-side. The SPA guard is an operator experience improvement, not the security boundary.
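The server-side check described above can be sketched roughly as follows. This is an illustration, not AgentFlow's actual guard: the `Principal` class, the `"user"` principal type value, the claim layout read by `scopes_from_claims`, and the 401 response for unauthenticated callers are all assumptions. Only the 403 for non-admin users and the `agentflow:admin` mapping come from the table.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for the configured AGENTFLOW_ADMIN_SCOPES value.
ADMIN_SCOPES = frozenset({"agentflow:admin"})


@dataclass(frozen=True)
class Principal:
    principal_type: str  # assumed values: "user" for humans, anything else for machines
    scopes: frozenset = field(default_factory=frozenset)


def scopes_from_claims(claims):
    """Derive scopes from token claims (the claim layout here is an assumption)."""
    scopes = set(claims.get("scopes", []))
    # Auth0 role or permissionType values of "admin" map to agentflow:admin.
    if claims.get("role") == "admin" or claims.get("permissionType") == "admin":
        scopes.add("agentflow:admin")
    return frozenset(scopes)


def authorize_admin(principal):
    """Return the HTTP status an admin-only endpoint would answer with."""
    if principal is None:
        return 401  # assumption: unauthenticated callers get 401
    if principal.principal_type != "user":
        return 403  # only human users may use the admin endpoints
    if not (principal.scopes & ADMIN_SCOPES):
        return 403  # authenticated, but no admin scope
    return 200
```

A caller whose token carries `role: admin` would pass the guard, while an ordinary tenant user would receive 403 regardless of what the SPA shows.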
Tracking pipeline
Every tracked LLM call flows through the LiteLLM gateway. The gateway resolves tenant and user attribution from request context, checks budgets and shared request-rate limits, calls the provider, extracts usage, computes cost from the model registry, updates the live budget counter, and writes a durable event through the outbox.

The same accounting path is used for chat completions, Responses API calls, batch secondary model calls, and LiteLLM embeddings used by knowledge-base ingestion or search. That means RAG-heavy tenants see embedding spend in the dashboard instead of only generation spend.

Durable usage events include:

| Field group | Examples |
|---|---|
| Attribution | tenant_name, user_id, principal_type, agent_id, agent_name, conversation_id, call_id |
| Model | provider, model_name, pricing_version |
| Tokens | input, output, cached-read, cache-write, and reasoning token counts |
| Cost | input, output, cached-read, cache-write, and total USD cost |
| Outcome | duration, success flag, error type, and metadata |
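To make the cost fields concrete, here is a minimal sketch of how per-category cost might be derived from token counts and a pricing entry. The `Pricing` and `TokenUsage` shapes, the per-million-token pricing unit, and the treatment of token categories as disjoint are assumptions for illustration; the real model registry may differ. Reasoning tokens are not priced separately here, mirroring the Cost row above.

```python
from dataclasses import dataclass

TOKENS_PER_PRICE_UNIT = 1_000_000  # assumption: prices are USD per million tokens


@dataclass(frozen=True)
class Pricing:
    """Hypothetical registry entry: USD per million tokens for each category."""
    input: float
    output: float
    cached_read: float
    cache_write: float


@dataclass(frozen=True)
class TokenUsage:
    """Token counts for one call; categories are assumed not to overlap."""
    input: int = 0
    output: int = 0
    cached_read: int = 0
    cache_write: int = 0


def compute_cost(usage, pricing):
    """Return per-category USD cost plus the total for a single call."""
    parts = {
        "input": usage.input * pricing.input / TOKENS_PER_PRICE_UNIT,
        "output": usage.output * pricing.output / TOKENS_PER_PRICE_UNIT,
        "cached_read": usage.cached_read * pricing.cached_read / TOKENS_PER_PRICE_UNIT,
        "cache_write": usage.cache_write * pricing.cache_write / TOKENS_PER_PRICE_UNIT,
    }
    parts["total"] = sum(parts.values())
    return parts
```

Recording the pricing version alongside each event, as the Model field group does, lets historical costs stay explainable after registry prices change.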
Budget model
AgentFlow supports two monthly spend caps:

| Cap | Default | Override location | Enforcement |
|---|---|---|---|
| Tenant cap | LLM_BUDGET_PER_TENANT | Settings > Usage, tenant budget modal | Checked before every tracked LLM call |
| User cap | LLM_BUDGET_PER_USER | Settings > Usage, by-user table | Checked for authenticated user-attributed calls |
Once a tenant or user cap is reached, new tracked LLM calls are rejected with HTTP 429, a structured payload, and a Retry-After header. The reset timestamp is the first instant of the next UTC month.
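The reset rule is easy to get wrong around December and time zones, so a short sketch may help. The function names are hypothetical, and expressing Retry-After as whole seconds until the reset is an assumption (the header also permits an HTTP date).

```python
import math
from datetime import datetime, timezone


def next_month_reset(now):
    """Return the first instant of the next UTC month after `now`."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    now_utc = now.astimezone(timezone.utc)
    if now_utc.month == 12:
        year, month = now_utc.year + 1, 1  # roll over into January
    else:
        year, month = now_utc.year, now_utc.month + 1
    return datetime(year, month, 1, tzinfo=timezone.utc)


def retry_after_seconds(now):
    """Whole seconds until the budget resets, rounded up and never below 1."""
    remaining = (next_month_reset(now) - now).total_seconds()
    return max(1, math.ceil(remaining))
```

Converting to UTC before reading the month matters: a request made late on the last day of the month in a western time zone may already fall in the next UTC month.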
Rate controls
AgentFlow’s runtime controls are deliberately layered:

| Control | What it does | Configuration |
|---|---|---|
| LLM concurrency | Caps simultaneous outbound LLM calls per API process | LLM_CONCURRENCY_LIMIT |
| Shared request-rate limits | Fixed-window tenant and user RPM limits shared through Redis | LLM_RATE_LIMIT_RPM_PER_TENANT, LLM_RATE_LIMIT_RPM_PER_USER |
| Monthly spend budgets | Stops new tracked LLM calls after the tenant or user cap is reached | LLM_BUDGET_PER_TENANT, LLM_BUDGET_PER_USER, settings overrides |
| Provider limits | Provider-enforced RPM/TPM and quota limits | Provider account or BYOK key tier |
LLM_CONCURRENCY_LIMIT is a pressure valve. The LLM_RATE_LIMIT_RPM_PER_* settings add shared request-per-minute throttles across horizontally scaled API replicas when Redis is configured. Provider TPM limits, quota limits, and any model-specific enforcement still come from the provider account or BYOK key tier.
When AgentFlow rate-limits a call, it returns HTTP 429 with error: "llm_rate_limit_exceeded", the limit scope, limit value, and a Retry-After header.
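A fixed-window RPM limit of the kind described above can be sketched as follows. This is an in-memory illustration only: a plain dictionary stands in for the shared Redis counter, and the class name, key format, and the `scope` and `limit` payload field names are assumptions. Only the `llm_rate_limit_exceeded` error value and the Retry-After header come from the text.

```python
import math
import time


class FixedWindowLimiter:
    """Per-key requests-per-minute limiter using fixed 60-second windows."""

    def __init__(self, limit_rpm, clock=time.time):
        self.limit_rpm = limit_rpm
        self.clock = clock
        self.state = {}  # key -> (window index, request count in that window)

    def check(self, key):
        """Count one request; return (allowed, retry_after_seconds)."""
        now = self.clock()
        window = int(now // 60)
        seen_window, count = self.state.get(key, (window, 0))
        if seen_window != window:
            count = 0  # a new minute started, so the counter starts over
        count += 1
        self.state[key] = (window, count)
        if count <= self.limit_rpm:
            return True, 0
        # Seconds until the next window opens, rounded up and at least 1.
        return False, max(1, math.ceil((window + 1) * 60 - now))


def rate_limit_response(scope, limit, retry_after):
    """Build a 429 response; field names other than `error` are assumed."""
    body = {"error": "llm_rate_limit_exceeded", "scope": scope, "limit": limit}
    return 429, {"Retry-After": str(retry_after)}, body
```

Fixed windows are cheap to share across replicas because each check is a single counter increment, at the cost of allowing a short burst across a window boundary.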
Because exact spend is known after a response returns, monthly budget enforcement is preflight-based. Concurrent in-flight requests can exceed the cap by the cost of requests that passed the check before earlier requests recorded spend. Use conservative caps, shared RPM limits, provider-side limits, and concurrency tuning for high-volume tenants.
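The overshoot can be illustrated with a small deterministic model. It assumes the preflight check compares recorded spend against the cap and that every in-flight request is admitted before any of them records its cost; amounts are in integer cents to avoid floating-point noise. The function name and the exact comparison are assumptions.

```python
def simulate_preflight(cap_cents, recorded_cents, in_flight_costs_cents):
    """Return (final spend, overshoot) when all requests check before any records."""
    admitted = []
    for cost in in_flight_costs_cents:
        # Preflight sees only spend that has already been recorded.
        if recorded_cents < cap_cents:
            admitted.append(cost)
    final_spend = recorded_cents + sum(admitted)
    return final_spend, max(0, final_spend - cap_cents)


# A $10.00 cap with $9.50 recorded and four concurrent $0.40 calls:
# every call passes the check, so spend ends at $11.10, $1.10 over the cap.
final_spend, overshoot = simulate_preflight(1000, 950, [40, 40, 40, 40])
```

Under these assumptions the worst-case overshoot grows with the number of simultaneous calls and their individual cost, which is why lower concurrency and RPM limits tighten the bound.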
Configuration
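As a rough illustration of how environment defaults and settings overrides might combine, the sketch below resolves an effective cap. The helper names are hypothetical, and two behaviors are assumptions rather than documented facts: that the variables hold decimal USD amounts, and that an unset variable with no override means no cap.

```python
import os


def env_float(name, environ=os.environ):
    """Read a decimal value from the environment; None when unset or blank."""
    raw = environ.get(name, "").strip()
    return float(raw) if raw else None


def effective_cap(env_name, override=None, environ=os.environ):
    """Prefer the admin override from settings, else fall back to the env default."""
    if override is not None:
        return override
    return env_float(env_name, environ)


# Example with an explicit mapping instead of the real process environment.
example_env = {"LLM_BUDGET_PER_TENANT": "500", "LLM_BUDGET_PER_USER": "25.50"}
tenant_cap = effective_cap("LLM_BUDGET_PER_TENANT", environ=example_env)
user_cap = effective_cap("LLM_BUDGET_PER_USER", override=40.0, environ=example_env)
```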
Operations checklist
- Keep pricing entries current in the model registry before enabling a new model for production use.
- Treat the Usage dashboard as the source for LLM call spend and adoption analytics; reconcile with provider invoices for finance-close workflows.
- Grant one of AGENTFLOW_ADMIN_SCOPES only to operators who should see tenant-wide usage and update caps.
- Tune LLM_CONCURRENCY_LIMIT to the provider tier and expected request duration.
- Set tenant and user RPM limits when you need shared burst control across replicas.
- Configure Redis for any horizontally scaled deployment.
- Add the same usage-event and spend-recording hooks when introducing new provider paths or non-chat model calls.

