AI Pilot — Resilience & Circuit Breaker
Audience: Platform engineers, SREs, AI infrastructure owners Time: ~10 min read
AI Pilot moves resiliency out of your application SDK code and into the bouncer. The bouncer trips circuit breakers on cost, latency, and error rate (not just 5xx), retries with a budget, hedges slow requests, and falls back to a different model when the primary is unhealthy.
Why server-side resiliency
Today, AI client SDKs each implement their own retry/timeout/fallback logic. That means:
- Inconsistent behavior across teams and languages
- Redundant code in every microservice
- No central observability
- No way for governance to enforce a "must use fallback X for production" rule
Pushing this to the bouncer fixes all four. Apps call a single endpoint and the bouncer handles failures.
Click to enlarge
Configurable signals
Open Settings -> AI Pilot -> Resilience and configure each backend's circuit breaker with multiple trip conditions:
| Signal | Default | Notes |
|---|---|---|
cost_per_minute_usd | unset | Trip if estimated cost spikes (e.g. runaway agents) |
latency_p95_ms | 30000 | Trip on degraded provider performance |
error_rate_pct | 25 | Per minute |
consecutive_5xx | 5 | Classic Envoy circuit breaker |
cool_down_s | 60 | How long the breaker stays open |
half_open_probes | 3 | Probes during the half-open phase |
fallback_route_id | unset | Where to send traffic when tripped |
The bouncer evaluates all enabled signals; tripping any one trips the breaker.
Retries and retry budget
Configure in Settings -> AI Pilot -> Resilience -> Retries:
max_retries_per_request(default 2)retry_on—5xx,connect-failure,gateway-error,429retry_budget_pct— cap retries at e.g. 10% of total RPS so a storm of failures cannot multiply load on the backendper_try_timeout_ms— bounded per attemptback_off_base_msandback_off_max_ms— exponential
Hedging
For latency-sensitive paths, enable hedging:
hedge_after_ms— after this much elapsed time on the primary attempt, fire a second attempt to the fallbackcancel_on_response— cancel the slower attempt as soon as one returns
This trades a little extra cost for tail-latency improvements.
Fallback chains
A fallback can be:
- A different model on the same provider (e.g.
gpt-4o->gpt-4o-mini) - A different provider (e.g. OpenAI -> Anthropic)
- A different deployment region
- A cached response (when the prompt cache is enabled)
- A graceful denial with a configurable error envelope
Order them by priority. The bouncer skips down the chain when each prior route is breaker-tripped or unhealthy.
Worked example
backends:
- name: gpt-4o-prod
circuit_breaker:
cost_per_minute_usd: 5.0
latency_p95_ms: 15000
error_rate_pct: 25
cool_down_s: 90
retries:
max: 2
retry_on: ["5xx","429","connect-failure"]
retry_budget_pct: 10
per_try_timeout_ms: 12000
hedging:
hedge_after_ms: 8000
cancel_on_response: true
fallback_chain:
- gpt-4o-mini-prod
- claude-3-haiku-bedrock-prod
- cached_response
Observing breaker state
The /pilot -> Resilience tab shows, per backend:
- Current breaker state (
closed,open,half-open) - Last trip (signal + value)
- Recovery countdown
- Probe outcome history
- Fallback hit rate
Audit events: AI_BREAKER_TRIPPED, AI_BREAKER_HALFOPEN, AI_BREAKER_RECOVERED, AI_FALLBACK_USED.
Removing client-side resiliency safely
After AI Pilot is enabled with resilience configured:
- Confirm fallback chain works using the
/pilot -> Resiliencetab and synthetic probes. - Remove client-side retry/timeout/fallback code from your apps.
- Add a short timeout on the client (e.g. 60s) so a stuck connection still surfaces.
- Ship.
The bouncer now owns resiliency for every app calling it.