AI Pilot — Resilience & Circuit Breaker

Audience: Platform engineers, SREs, AI infrastructure owners
Time: ~10 min read

AI Pilot moves resiliency out of your application SDK code and into the bouncer. The bouncer trips circuit breakers on cost, latency, and error rate (not just 5xx), retries with a budget, hedges slow requests, and falls back to a different model when the primary is unhealthy.

Why server-side resiliency

Today, AI client SDKs each implement their own retry/timeout/fallback logic. That means:

  • Inconsistent behavior across teams and languages
  • Redundant code in every microservice
  • No central observability
  • No way for governance to enforce a "must use fallback X for production" rule

Pushing this to the bouncer fixes all four. Apps call a single endpoint and the bouncer handles failures.


Configurable signals

Open Settings -> AI Pilot -> Resilience and configure each backend's circuit breaker with multiple trip conditions:

Signal               Default  Notes
cost_per_minute_usd  unset    Trip if estimated cost spikes (e.g. runaway agents)
latency_p95_ms       30000    Trip on degraded provider performance
error_rate_pct       25       Per minute
consecutive_5xx      5        Classic Envoy circuit breaker
cool_down_s          60       How long the breaker stays open
half_open_probes     3        Probes during the half-open phase
fallback_route_id    unset    Where to send traffic when tripped

The bouncer evaluates all enabled signals; tripping any one trips the breaker.
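
The "any one signal trips the breaker" rule can be sketched as follows. This is a minimal illustration, not the bouncer's implementation: the BreakerConfig and WindowStats classes are hypothetical, the field names are borrowed from the table above, and treating a value at or above the threshold as a trip is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional

SIGNALS = ("cost_per_minute_usd", "latency_p95_ms", "error_rate_pct", "consecutive_5xx")


@dataclass
class BreakerConfig:
    # Field names and defaults mirror the signals table; None means "unset".
    cost_per_minute_usd: Optional[float] = None
    latency_p95_ms: Optional[float] = 30000
    error_rate_pct: Optional[float] = 25
    consecutive_5xx: Optional[int] = 5


@dataclass
class WindowStats:
    # Hypothetical rolling measurements for one backend.
    cost_per_minute_usd: float
    latency_p95_ms: float
    error_rate_pct: float
    consecutive_5xx: int


def tripped_signals(cfg: BreakerConfig, stats: WindowStats) -> List[str]:
    """Return every enabled signal whose measured value is at or above its threshold."""
    hits = []
    for name in SIGNALS:
        threshold = getattr(cfg, name)
        # Unset signals are skipped; ">=" is an assumption about the comparison.
        if threshold is not None and getattr(stats, name) >= threshold:
            hits.append(name)
    return hits


def should_trip(cfg: BreakerConfig, stats: WindowStats) -> bool:
    """Any single signal crossing its threshold is enough to trip."""
    return bool(tripped_signals(cfg, stats))
```

Returning the list of signals rather than a bare boolean keeps the information needed to report which signal caused a trip.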

Retries and retry budget

Configure in Settings -> AI Pilot -> Resilience -> Retries:

  • max_retries_per_request (default 2)
  • retry_on: which failures are retried (5xx, connect-failure, gateway-error, 429)
  • retry_budget_pct: cap retries at e.g. 10% of total RPS so a storm of failures cannot multiply load on the backend
  • per_try_timeout_ms: timeout applied to each individual attempt
  • back_off_base_ms and back_off_max_ms: starting delay and upper bound for exponential backoff between attempts
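
To make the retry budget and backoff settings concrete, here is a rough sketch of how they could interact. The RetryBudget class and backoff_ms function are hypothetical names, the budget is counted over a window of requests rather than true RPS, and doubling the delay per attempt is an assumption about what "exponential" means here.

```python
def backoff_ms(attempt: int, back_off_base_ms: int, back_off_max_ms: int) -> int:
    """Delay before retry number `attempt` (0-based): base * 2**attempt, capped at the max."""
    return min(back_off_max_ms, back_off_base_ms * (2 ** attempt))


class RetryBudget:
    """Permit retries only while they stay within retry_budget_pct of requests seen."""

    def __init__(self, retry_budget_pct: float):
        self.retry_budget_pct = retry_budget_pct
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        # Count every original (non-retry) request in the current window.
        self.requests += 1

    def try_acquire(self) -> bool:
        """Return True and consume budget if one more retry is allowed."""
        allowed = self.requests * self.retry_budget_pct / 100.0
        if self.retries + 1 <= allowed:
            self.retries += 1
            return True
        return False

    def reset_window(self) -> None:
        # Assumed to be called at the start of each measurement window.
        self.requests = 0
        self.retries = 0
```

With a 10% budget and 20 requests in the window, two retries are granted and the third is refused, which is what keeps a failure storm from multiplying backend load.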

Hedging

For latency-sensitive paths, enable hedging:

  • hedge_after_ms — after this much elapsed time on the primary attempt, fire a second attempt to the fallback
  • cancel_on_response — cancel the slower attempt as soon as one returns

This trades a little extra cost for tail-latency improvements.
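
The hedging behavior described above can be approximated with asyncio. This is a sketch under stated assumptions: hedged_call is a hypothetical helper, primary and fallback are placeholder zero-argument coroutine functions, and error handling is left out (an attempt that fails first will raise).

```python
import asyncio


async def hedged_call(primary, fallback, hedge_after_ms: int, cancel_on_response: bool = True):
    """Start primary(); if it is still running after hedge_after_ms, also start fallback().

    Returns the result of whichever attempt completes first.
    """
    first = asyncio.create_task(primary())
    # Give the primary attempt a head start of hedge_after_ms.
    done, _ = await asyncio.wait({first}, timeout=hedge_after_ms / 1000)
    if done:
        return first.result()

    # Primary is slow: fire the hedge and race the two attempts.
    second = asyncio.create_task(fallback())
    done, pending = await asyncio.wait({first, second}, return_when=asyncio.FIRST_COMPLETED)
    if cancel_on_response:
        for task in pending:
            task.cancel()  # stop paying for the slower attempt
    return next(iter(done)).result()
```

The extra cost comes from the window in which both attempts are in flight; a higher hedge_after_ms narrows that window at the price of a smaller tail-latency gain.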

Fallback chains

A fallback can be:

  • A different model on the same provider (e.g. gpt-4o -> gpt-4o-mini)
  • A different provider (e.g. OpenAI -> Anthropic)
  • A different deployment region
  • A cached response (when the prompt cache is enabled)
  • A graceful denial with a configurable error envelope

Order them by priority. The bouncer walks down the chain, skipping any route that is breaker-tripped or unhealthy, and uses the first one that is available.
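
Route selection over a prioritized chain can be sketched in a few lines. The pick_route function and the is_available callback are hypothetical; how the bouncer actually decides that a route is healthy is not specified here.

```python
from typing import Callable, Iterable, Optional


def pick_route(
    primary: str,
    fallback_chain: Iterable[str],
    is_available: Callable[[str], bool],
) -> Optional[str]:
    """Return the first route, in priority order, that is not tripped or unhealthy.

    Returns None when every route is unavailable, which a caller could map to
    the graceful-denial envelope.
    """
    for route in [primary, *fallback_chain]:
        if is_available(route):
            return route
    return None


# Usage: the primary and first fallback are unavailable, so the second fallback is chosen.
unavailable = {"gpt-4o-prod", "gpt-4o-mini-prod"}
chosen = pick_route(
    "gpt-4o-prod",
    ["gpt-4o-mini-prod", "claude-3-haiku-bedrock-prod", "cached_response"],
    lambda route: route not in unavailable,
)
```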

Worked example

backends:
  - name: gpt-4o-prod
    circuit_breaker:
      cost_per_minute_usd: 5.0
      latency_p95_ms: 15000
      error_rate_pct: 25
      cool_down_s: 90
    retries:
      max: 2
      retry_on: ["5xx","429","connect-failure"]
      retry_budget_pct: 10
      per_try_timeout_ms: 12000
    hedging:
      hedge_after_ms: 8000
      cancel_on_response: true
    fallback_chain:
      - gpt-4o-mini-prod
      - claude-3-haiku-bedrock-prod
      - cached_response
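
As a quick sanity check on the numbers in this example, the helper below estimates how long the primary route could hold a request before the bouncer gives up on it. It assumes attempts run one after another, each bounded by per_try_timeout_ms, with exponential backoff between them; the function name and the backoff values in the second call are illustrative, since the example does not set them.

```python
def worst_case_primary_ms(
    max_retries: int,
    per_try_timeout_ms: int,
    back_off_base_ms: int = 0,
    back_off_max_ms: int = 0,
) -> int:
    """Upper bound on time spent on one route: all attempts time out, plus backoff waits."""
    attempts = 1 + max_retries
    # One backoff wait precedes each retry; the delay doubles per retry, up to the cap.
    waits = sum(min(back_off_max_ms, back_off_base_ms * 2 ** i) for i in range(max_retries))
    return attempts * per_try_timeout_ms + waits


# Values from the worked example: 3 attempts of 12 s each, backoff ignored.
no_backoff = worst_case_primary_ms(max_retries=2, per_try_timeout_ms=12000)

# Same, with illustrative backoff of 500 ms doubling up to 4000 ms.
with_backoff = worst_case_primary_ms(2, 12000, back_off_base_ms=500, back_off_max_ms=4000)
```

Under these assumptions the primary route is bounded at 36000 ms without backoff and 37500 ms with the illustrative backoff, which is useful to compare against whatever overall timeout the client keeps.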

Observing breaker state

The /pilot -> Resilience tab shows, per backend:

  • Current breaker state (closed, open, half-open)
  • Last trip (signal + value)
  • Recovery countdown
  • Probe outcome history
  • Fallback hit rate

Audit events: AI_BREAKER_TRIPPED, AI_BREAKER_HALFOPEN, AI_BREAKER_RECOVERED, AI_FALLBACK_USED.
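
The three breaker states and the audit events can be tied together in a small state machine. This is a simplified sketch, not the bouncer's logic: the CircuitBreaker class is hypothetical, it assumes that half_open_probes consecutive successful probes close the breaker and that one failed probe re-opens it, and it does not limit how much traffic passes while half-open.

```python
import time


class CircuitBreaker:
    def __init__(self, cool_down_s: float = 60, half_open_probes: int = 3, clock=time.monotonic):
        self.cool_down_s = cool_down_s
        self.half_open_probes = half_open_probes
        self.clock = clock  # injectable so the cool-down can be tested without sleeping
        self.state = "closed"
        self.opened_at = None
        self.probe_successes = 0
        self.events = []  # (event name, detail) pairs, using the audit event names

    def trip(self, signal: str) -> None:
        """Open the breaker and record which signal caused it."""
        self.state = "open"
        self.opened_at = self.clock()
        self.events.append(("AI_BREAKER_TRIPPED", signal))

    def allow_request(self) -> bool:
        """Closed and half-open allow traffic; open blocks it until cool_down_s elapses."""
        if self.state == "open" and self.clock() - self.opened_at >= self.cool_down_s:
            self.state = "half-open"
            self.probe_successes = 0
            self.events.append(("AI_BREAKER_HALFOPEN", None))
        return self.state != "open"

    def record_probe(self, success: bool) -> None:
        """Feed back the outcome of a probe sent while half-open."""
        if self.state != "half-open":
            return
        if not success:
            self.trip("probe_failed")  # assumption: a failed probe restarts the cool-down
            return
        self.probe_successes += 1
        if self.probe_successes >= self.half_open_probes:
            self.state = "closed"
            self.events.append(("AI_BREAKER_RECOVERED", None))
```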

Removing client-side resiliency safely

After AI Pilot is enabled with resilience configured:

  1. Confirm the fallback chain works using the /pilot -> Resilience tab and synthetic probes.
  2. Remove client-side retry and fallback code, along with per-attempt timeouts, from your apps.
  3. Keep one coarse overall timeout on the client (e.g. 60s) so a stuck connection still surfaces.
  4. Ship.

The bouncer now owns resiliency for every app calling it.