AI Pilot — Resilience & Circuit Breaker

Audience: Platform engineers, SREs, AI infrastructure owners Time: ~10 min read

AI Pilot moves resiliency out of your application SDK code and into the bouncer. The bouncer trips circuit breakers on cost, latency, and error rate (not just 5xx), retries with a budget, hedges slow requests, and falls back to a different model when the primary is unhealthy.

Why server-side resiliency

Today, AI client SDKs each implement their own retry/timeout/fallback logic. That means:

Inconsistent behavior across teams and languages
Redundant code in every microservice
No central observability
No way for governance to enforce a "must use fallback X for production" rule

Pushing this to the bouncer fixes all four. Apps call a single endpoint and the bouncer handles failures.

Click to enlarge

Configurable signals

Open Settings -> AI Pilot -> Resilience and configure each backend's circuit breaker with multiple trip conditions:

Signal	Default	Notes
`cost_per_minute_usd`	unset	Trip if estimated cost spikes (e.g. runaway agents)
`latency_p95_ms`	30000	Trip on degraded provider performance
`error_rate_pct`	25	Per minute
`consecutive_5xx`	5	Classic Envoy circuit breaker
`cool_down_s`	60	How long the breaker stays open
`half_open_probes`	3	Probes during the half-open phase
`fallback_route_id`	unset	Where to send traffic when tripped

The bouncer evaluates all enabled signals; tripping any one trips the breaker.

Retries and retry budget

Configure in Settings -> AI Pilot -> Resilience -> Retries:

max_retries_per_request (default 2)
retry_on — 5xx, connect-failure, gateway-error, 429
retry_budget_pct — cap retries at e.g. 10% of total RPS so a storm of failures cannot multiply load on the backend
per_try_timeout_ms — bounded per attempt
back_off_base_ms and back_off_max_ms — exponential

Hedging

For latency-sensitive paths, enable hedging:

hedge_after_ms — after this much elapsed time on the primary attempt, fire a second attempt to the fallback
cancel_on_response — cancel the slower attempt as soon as one returns

This trades a little extra cost for tail-latency improvements.

Fallback chains

A fallback can be:

A different model on the same provider (e.g. gpt-4o -> gpt-4o-mini)
A different provider (e.g. OpenAI -> Anthropic)
A different deployment region
A cached response (when the prompt cache is enabled)
A graceful denial with a configurable error envelope

Order them by priority. The bouncer skips down the chain when each prior route is breaker-tripped or unhealthy.

Worked example

backends:
  - name: gpt-4o-prod
    circuit_breaker:
      cost_per_minute_usd: 5.0
      latency_p95_ms: 15000
      error_rate_pct: 25
      cool_down_s: 90
    retries:
      max: 2
      retry_on: ["5xx","429","connect-failure"]
      retry_budget_pct: 10
      per_try_timeout_ms: 12000
    hedging:
      hedge_after_ms: 8000
      cancel_on_response: true
    fallback_chain:
      - gpt-4o-mini-prod
      - claude-3-haiku-bedrock-prod
      - cached_response

Observing breaker state

The /pilot -> Resilience tab shows, per backend:

Current breaker state (closed, open, half-open)
Last trip (signal + value)
Recovery countdown
Probe outcome history
Fallback hit rate

Audit events: AI_BREAKER_TRIPPED, AI_BREAKER_HALFOPEN, AI_BREAKER_RECOVERED, AI_FALLBACK_USED.

Removing client-side resiliency safely

After AI Pilot is enabled with resilience configured:

Confirm fallback chain works using the /pilot -> Resilience tab and synthetic probes.
Remove client-side retry/timeout/fallback code from your apps.
Add a short timeout on the client (e.g. 60s) so a stuck connection still surfaces.
Ship.

The bouncer now owns resiliency for every app calling it.