AI Pilot — Cache & Rate Limits

Audience: Platform engineers, SREs
Time: ~10 min read

AI Pilot uses a Redis-backed cache for prompt/semantic caching and a separate Rate Limit Service (RLS) for global token-aware enforcement. Both are optional sidecars deployed next to the bouncer, controllable per-bouncer from the Control Plane.

The three deployment topologies

You pick a topology per bouncer in Settings -> AI Pilot -> Cache & Rate Limits.

Topology	Cache	Rate Limit	Best for
`bundled`	`redis:7-alpine` shipped as sidecar	`envoyproxy/ratelimit:1.4` shipped as sidecar	Self-contained pod, no infra dependency
`external`	BYO Redis URL	BYO RLS endpoint (or none)	Reusing existing Redis/RLS infra
`disabled`	Local in-process cache only	Per-process limits only	Sandbox or air-gapped demos

Bundled topology (recommended for most enterprises)

The Helm chart ships an optional subchart bouncer-cache/ that adds two containers next to the bouncer pod (or as a sibling Deployment in the same namespace, depending on your topology preference):

cc-bouncer-redis — Redis 7 with optional persistence and auth
cc-bouncer-ratelimit — Envoy Rate Limit Service 1.4 with descriptors auto-generated from the AI Pilot configuration

In Docker Compose, the same effect is achieved by enabling the with-cache profile.

Enable

In Helm:

bouncerCache:
  enabled: true
  redis:
    image: redis:7-alpine
    persistence:
      enabled: true
      size: 1Gi
    auth:
      enabled: true
      existingSecret: bouncer-redis-auth
  ratelimit:
    image: envoyproxy/ratelimit:1.4
    descriptorsConfigMap: bouncer-ratelimit-descriptors

In Compose:

docker compose --profile with-cache up -d

Auth

When bouncerCache.redis.auth.enabled is true, the bouncer reads REDIS_PASSWORD from the Kubernetes Secret named in existingSecret. The RLS sidecar uses the same Secret. Connection strings shown to operators in /settings/pilot are masked.
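As an illustration of what masking a connection string can look like, here is a minimal Python sketch. The function name mask_redis_url and the "****" mask are assumptions made for this example; the exact format AI Pilot displays is not specified here.

```python
from urllib.parse import urlsplit, urlunsplit


def mask_redis_url(url: str, mask: str = "****") -> str:
    """Return the URL with any embedded password replaced by a fixed mask.

    Hypothetical helper: the real masking format may differ.
    """
    parts = urlsplit(url)
    if parts.password is None:
        return url  # no credentials embedded, nothing to hide

    user = parts.username or ""
    host = parts.hostname or ""  # note: urlsplit lowercases the hostname
    if ":" in host:
        host = f"[{host}]"  # IPv6 literal: restore the brackets urlsplit strips
    if parts.port is not None:
        host = f"{host}:{parts.port}"

    netloc = f"{user}:{mask}@{host}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

A URL without embedded credentials, such as redis://cc-bouncer-redis:6379/0, is returned unchanged.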

External topology

Tell the bouncer to use an existing Redis (and optionally an existing RLS) by supplying URLs in Settings -> AI Pilot -> Cache & Rate Limits:

Redis URL — e.g. redis://redis.shared.svc.cluster.local:6379/0
Rate Limit endpoint — e.g. ratelimit.shared.svc.cluster.local:8081 (optional; leave blank for "cache only")
Auth secret — name of an existing Secret with the password

PAP probes the connection and shows Connected, Auth failed, or Unreachable on the dashboard.
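The three dashboard states can be thought of as a mapping from probe outcomes to statuses. The Python sketch below shows one way to express that mapping. ProbeStatus, AuthError, and probe are hypothetical names introduced for this example, and the ping callable stands in for whatever client call the real probe makes.

```python
from enum import Enum
from typing import Callable


class ProbeStatus(Enum):
    CONNECTED = "Connected"
    AUTH_FAILED = "Auth failed"
    UNREACHABLE = "Unreachable"


class AuthError(Exception):
    """Hypothetical stand-in for a client library's authentication error."""


def probe(ping: Callable[[], None]) -> ProbeStatus:
    """Run one ping and map its outcome to a dashboard status."""
    try:
        ping()
    except AuthError:
        return ProbeStatus.AUTH_FAILED
    except OSError:
        # OSError covers refused connections, DNS failures, and timeouts.
        return ProbeStatus.UNREACHABLE
    return ProbeStatus.CONNECTED
```

Passing the ping as a callable keeps the classification logic independent of any particular Redis or gRPC client.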

Disabled topology

Pick this when you do not want any shared store. The bouncer falls back to:

per-process LRU cache for prompt cache (no semantic match across replicas)
per-process token counter for rate limits (each replica counts independently, so 429s fire per replica rather than globally)

Useful for demos and air-gapped sandboxes where adding Redis is impractical.
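To make the per-process behavior concrete, the following Python sketch pairs a small LRU cache with a fixed-window token counter. Both classes are illustrative assumptions, not the bouncer's actual implementation; in particular, the one-minute fixed window is chosen here only for simplicity.

```python
import time
from collections import OrderedDict


class LRUCache:
    """In-process cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the oldest entry


class TokenWindow:
    """Per-process token budget over a fixed 60-second window."""

    def __init__(self, limit_per_min, clock=time.monotonic):
        self.limit = limit_per_min
        self._clock = clock
        self._window_start = clock()
        self._used = 0

    def allow(self, tokens):
        now = self._clock()
        if now - self._window_start >= 60:
            self._window_start = now  # start a fresh window
            self._used = 0
        if self._used + tokens > self.limit:
            return False  # the caller would answer with a 429
        self._used += tokens
        return True
```

Because each replica holds its own TokenWindow, N replicas can together admit up to N times the configured limit, which is why this topology is suggested only for demos and sandboxes.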

How rate-limit descriptors are generated

The Control Plane translates each cost rule from Settings -> AI Pilot -> Cost Optimization into one or more Envoy rate-limit descriptors. Examples:

Cost rule	Generated descriptor
`OpenAI / gpt-4o`, 4000 tok/min/user	`("provider","openai"),("model","gpt-4o"),("user","<sub>")`
`Bedrock / claude-3-haiku`, $50/day total	`("provider","bedrock"),("model","claude-3-haiku")`
MCP `weather_lookup`, 100 RPS	`("mcp_server","<id>"),("tool","weather_lookup")`
`App marketing-portal`, 10k tok/min total	`("application","marketing-portal")`

The RLS sidecar applies the descriptors atomically against Redis-backed counters.
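The translation in the table above can be sketched as a small Python function. The rule field names (provider, model, mcp_server, tool, application, per_user) and the key ordering are assumptions inferred from the example rows, not the Control Plane's actual schema.

```python
from typing import Dict, List, Tuple

# Hypothetical key order, inferred from the example descriptors.
KEY_ORDER = ("provider", "model", "mcp_server", "tool", "application")


def to_descriptor(rule: Dict[str, object]) -> List[Tuple[str, str]]:
    """Build an ordered list of (key, value) descriptor entries from a cost rule."""
    entries = []
    for key in KEY_ORDER:
        value = rule.get(key)
        if not value:
            continue
        text = str(value)
        if key == "provider":
            text = text.lower()  # the examples show "OpenAI" becoming "openai"
        entries.append((key, text))
    if rule.get("per_user"):
        # "<sub>" is the placeholder for the caller's subject, as in the table.
        entries.append(("user", "<sub>"))
    return entries
```

For example, a rule with provider "OpenAI", model "gpt-4o", and per_user set yields the three entries shown in the first table row, while a rule that only names an application yields a single entry.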

Health and probing

GET /pep-config/pilot/bouncer/{id}/cache-probe returns:

{
  "topology": "bundled",
  "redis": { "url": "redis://cc-bouncer-redis:6379/0", "connected": true, "rtt_ms": 1.2 },
  "ratelimit": { "endpoint": "cc-bouncer-ratelimit:8081", "connected": true, "rtt_ms": 1.1 },
  "checked_at": "2026-04-29T18:31:02Z"
}

The probe result is surfaced as a card on the /pilot Overview tab.
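A monitoring script might consume the probe response along these lines. This Python sketch assumes only the fields shown in the sample response; unhealthy_components is a hypothetical helper, and it treats a missing redis or ratelimit section as "not configured" rather than as a failure.

```python
import json


def unhealthy_components(probe_body: str) -> list:
    """Return the names of configured components that report connected=false."""
    probe = json.loads(probe_body)
    return [
        name
        for name in ("redis", "ratelimit")
        if name in probe and not probe[name].get("connected", False)
    ]
```

An empty list means every configured component answered the probe, so an alerting rule could fire whenever the list is non-empty.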

Sizing guidance

Small (< 100 RPS, < 10 MB hot cache): redis:7-alpine with 128 Mi memory and no persistence is enough.
Medium (< 1k RPS, < 100 MB hot cache): 512 Mi memory; enable persistence on a dedicated PVC.
Large (1k+ RPS, > 100 MB cache, semantic match): allocate 1+ Gi memory; consider Redis Cluster (v2 follow-up).
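The tiers above can be read as a simple decision rule, sketched here in Python. The function name, the handling of exact boundary values (for example, exactly 100 MB is treated as large), and the choice to let semantic matching alone select the large tier are assumptions of this sketch, not product guidance.

```python
# Memory figures taken from the sizing list above.
TIER_MEMORY = {"small": "128 Mi", "medium": "512 Mi", "large": "1+ Gi"}


def sizing_tier(rps: float, hot_cache_mb: float, semantic_match: bool = False) -> str:
    """Pick the smallest tier whose limits cover the given workload."""
    if semantic_match or rps >= 1000 or hot_cache_mb >= 100:
        return "large"
    if rps >= 100 or hot_cache_mb >= 10:
        return "medium"
    return "small"
```

A workload at 250 RPS with a 40 MB hot cache lands in the medium tier, so TIER_MEMORY would suggest 512 Mi.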

Failure modes

Condition	Result
Redis unreachable	bouncer falls back to per-process cache; logs `AI_CACHE_FALLBACK`
RLS unreachable	rate-limit filter fails open by default; raise an alert, and switch to "fail closed" if your policy requires it
Redis auth fails	fail closed for the cache path only; rate-limit descriptors keep working

Default failure semantics are configurable from Settings -> AI Pilot -> Cache & Rate Limits.
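The difference between failing open and failing closed, and the cache fallback in the first table row, can be sketched in Python as follows. The function names and the use of OSError to represent an unreachable dependency are assumptions for illustration; only the AI_CACHE_FALLBACK log marker comes from the table above.

```python
import logging

log = logging.getLogger("bouncer")


def admit_request(call_rls, fail_open=True):
    """Return True to admit the request, False to reject it (429).

    call_rls is a hypothetical callable that asks the RLS for a decision.
    """
    try:
        return bool(call_rls())
    except OSError:
        # RLS unreachable: admit under fail-open, reject under fail-closed.
        return fail_open


def cache_get(key, shared_get, local_cache):
    """Read from the shared cache, falling back to the per-process one."""
    try:
        return shared_get(key)
    except OSError:
        log.warning("AI_CACHE_FALLBACK")
        return local_cache.get(key)
```

Failing open favors availability at the cost of unenforced limits during an outage; failing closed does the reverse, which is why the default is left configurable.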