AI Pilot — Cache & Rate Limits
Audience: Platform engineers, SREs
Time: ~10 min read
AI Pilot uses a Redis-backed cache for prompt/semantic caching and a separate Rate Limit Service (RLS) for global token-aware enforcement. Both are optional sidecars deployed next to the bouncer, controllable per-bouncer from the Control Plane.
The three deployment topologies
You pick a topology per bouncer in Settings -> AI Pilot -> Cache & Rate Limits.
| Topology | Cache | Rate Limit | Best for |
|---|---|---|---|
| bundled | redis:7-alpine shipped as sidecar | envoyproxy/ratelimit:1.4 shipped as sidecar | Self-contained pod, no infra dependency |
| external | BYO Redis URL | BYO RLS endpoint (or none) | Reusing existing Redis/RLS infra |
| disabled | Local in-process cache only | Per-process limits only | Sandbox or air-gapped demos |
Bundled topology (recommended for most enterprises)
The Helm chart ships an optional subchart bouncer-cache/ that adds two containers next to the bouncer pod (or as a sibling Deployment in the same namespace, depending on your topology preference):
- cc-bouncer-redis: Redis 7 with optional persistence and auth
- cc-bouncer-ratelimit: Envoy Rate Limit Service 1.4 with descriptors auto-generated from the AI Pilot configuration
In Docker Compose, the same effect is achieved by enabling the with-cache profile.
Enable
In Helm:
bouncerCache:
  enabled: true
  redis:
    image: redis:7-alpine
    persistence:
      enabled: true
      size: 1Gi
    auth:
      enabled: true
      existingSecret: bouncer-redis-auth
  ratelimit:
    image: envoyproxy/ratelimit:1.4
    descriptorsConfigMap: bouncer-ratelimit-descriptors
In Compose:
docker compose --profile with-cache up -d
Auth
When redis.auth.enabled=true, the bouncer reads REDIS_PASSWORD from the Kubernetes Secret named in redis.auth.existingSecret. The RLS sidecar reads its password from the same Secret. Connection strings shown to operators in /settings/pilot are masked.
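To illustrate what a masked connection string might look like, here is a minimal sketch using only the Python standard library. It is not AI Pilot's actual masking code: the helper name `mask_redis_url` and the `****` placeholder are assumptions made for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def mask_redis_url(url: str, placeholder: str = "****") -> str:
    """Return url with any embedded password replaced (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.password is None:
        return url  # no password in the URL, nothing to hide

    # Rebuild the network location by hand: user, masked password, host, port.
    user = parts.username or ""
    host = parts.hostname or ""
    if ":" in host:  # IPv6 literals need their brackets restored
        host = f"[{host}]"
    port = f":{parts.port}" if parts.port is not None else ""
    netloc = f"{user}:{placeholder}@{host}{port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(mask_redis_url("redis://:s3cret@cc-bouncer-redis:6379/0"))
```

The password never appears in the returned string, so the result can be logged or rendered in a UI without leaking the Secret's value.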
External topology
Tell the bouncer to use an existing Redis (and optionally an existing RLS) by supplying URLs in Settings -> AI Pilot -> Cache & Rate Limits:
- Redis URL: e.g. redis://redis.shared.svc.cluster.local:6379/0
- Rate Limit endpoint: e.g. ratelimit.shared.svc.cluster.local:8081 (optional; leave blank for "cache only")
- Auth secret: name of an existing Secret with the password
PAP probes the connection and shows Connected, Auth failed, or Unreachable on the dashboard.
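If you want to reproduce the three states by hand before entering a URL, a probe along these lines may help. This is a rough sketch, not the probe the product runs: the function names are hypothetical, it speaks the Redis wire protocol (RESP) directly over a socket, and it assumes the server replies to each command in a single read.

```python
import socket
from typing import Optional


def classify_reply(reply: bytes) -> str:
    """Map a raw RESP reply to PING onto the three dashboard states."""
    if reply.startswith(b"+PONG"):
        return "Connected"
    # Redis rejects unauthenticated or wrongly authenticated clients
    # with error replies that begin with these prefixes.
    if reply.startswith((b"-NOAUTH", b"-WRONGPASS")):
        return "Auth failed"
    return "Unreachable"  # empty or unexpected reply: treat as unusable


def probe_redis(host: str, port: int = 6379,
                password: Optional[str] = None, timeout: float = 2.0) -> str:
    """Open a TCP connection, optionally AUTH, then PING (hypothetical helper)."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            if password is not None:
                pw = password.encode()
                conn.sendall(b"*2\r\n$4\r\nAUTH\r\n$%d\r\n%s\r\n" % (len(pw), pw))
                if not conn.recv(256).startswith(b"+OK"):
                    return "Auth failed"
            conn.sendall(b"*1\r\n$4\r\nPING\r\n")
            return classify_reply(conn.recv(256))
    except OSError:  # refused, timed out, DNS failure, and similar
        return "Unreachable"
```

For example, `probe_redis("redis.shared.svc.cluster.local", password="...")` run from a pod in the bouncer's namespace exercises the same network path the bouncer would use.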
Disabled topology
Pick this when you do not want any shared store. The bouncer falls back to:
- per-process LRU cache for prompt cache (no semantic match across replicas)
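- per-process token counter for rate limits (limits are enforced per replica, so a client spread across N replicas can consume up to N times the configured limit before seeing a 429)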
Useful for demos and air-gapped sandboxes where adding Redis is impractical.
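The per-process behavior can be pictured with the following sketch. It is an illustration under assumptions, not the bouncer's implementation: the class names are hypothetical, the cache is exact-match only, and the limiter uses a simple fixed one-minute window.

```python
import time
from collections import OrderedDict


class LocalPromptCache:
    """Exact-match LRU cache keyed by prompt text (hypothetical name)."""

    def __init__(self, max_entries: int = 1024) -> None:
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, prompt: str):
        if prompt not in self._entries:
            return None
        self._entries.move_to_end(prompt)  # mark as most recently used
        return self._entries[prompt]

    def put(self, prompt: str, response: str) -> None:
        self._entries[prompt] = response
        self._entries.move_to_end(prompt)
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used


class LocalTokenLimiter:
    """Fixed-window tokens-per-minute counter held in process memory."""

    def __init__(self, tokens_per_minute: int, clock=time.monotonic) -> None:
        self.tokens_per_minute = tokens_per_minute
        self._clock = clock
        self._used = {}  # key -> (window index, tokens used in that window)

    def allow(self, key: str, tokens: int) -> bool:
        window = int(self._clock() // 60)
        seen_window, used = self._used.get(key, (window, 0))
        if seen_window != window:
            used = 0  # a new minute started, so the counter resets
        if used + tokens > self.tokens_per_minute:
            return False  # the caller would answer with HTTP 429
        self._used[key] = (window, used + tokens)
        return True
```

Because each replica holds its own `LocalTokenLimiter`, two replicas configured for 4000 tokens per minute will each admit up to 4000 tokens for the same user within one window.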
How rate-limit descriptors are generated
The Control Plane translates each cost rule from Settings -> AI Pilot -> Cost Optimization into one or more Envoy rate-limit descriptors. Examples:
| Cost rule | Generated descriptor |
|---|---|
| OpenAI / gpt-4o, 4000 tok/min/user | ("provider","openai"),("model","gpt-4o"),("user","<sub>") |
| Bedrock / claude-3-haiku, $50/day total | ("provider","bedrock"),("model","claude-3-haiku") |
| MCP weather_lookup, 100 RPS | ("mcp_server","<id>"),("tool","weather_lookup") |
| App marketing-portal, 10k tok/min total | ("application","marketing-portal") |
The RLS sidecar applies the descriptors atomically against Redis-backed counters.
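The mapping in the table can be sketched as a small function. The `CostRule` shape below is an assumption invented for this example (the Control Plane's real rule schema is not documented here); only the resulting key/value pairs follow the table.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Key order mirrors the table: scope keys first, then the per-user key.
SCOPE_KEYS = ("provider", "model", "mcp_server", "tool", "application")


@dataclass(frozen=True)
class CostRule:
    """Hypothetical cost rule; unset fields do not contribute an entry."""
    provider: Optional[str] = None
    model: Optional[str] = None
    mcp_server: Optional[str] = None
    tool: Optional[str] = None
    application: Optional[str] = None
    per_user: bool = False  # True for ".../user" limits


def descriptor_entries(rule: CostRule,
                       subject: Optional[str] = None) -> List[Tuple[str, str]]:
    """Build the (key, value) entries for one request under this rule."""
    entries = [(key, getattr(rule, key)) for key in SCOPE_KEYS
               if getattr(rule, key) is not None]
    if rule.per_user:
        if subject is None:
            raise ValueError("per-user rules need the caller's subject")
        entries.append(("user", subject))
    return entries


if __name__ == "__main__":
    rule = CostRule(provider="openai", model="gpt-4o", per_user=True)
    print(descriptor_entries(rule, subject="alice"))
```

Rules that apply "total" limits omit the `user` entry, so all callers share a single counter for that descriptor.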
Health and probing
GET /pep-config/pilot/bouncer/{id}/cache-probe returns:
{
  "topology": "bundled",
  "redis": { "url": "redis://cc-bouncer-redis:6379/0", "connected": true, "rtt_ms": 1.2 },
  "ratelimit": { "endpoint": "cc-bouncer-ratelimit:8081", "connected": true, "rtt_ms": 1.1 },
  "checked_at": "2026-04-29T18:31:02Z"
}
The result is surfaced as a card on the /pilot Overview tab.
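A monitoring script could consume the probe response roughly as follows. This is a sketch under assumptions: the helper name is hypothetical, and it assumes a component that is not configured (for example, no RLS in a "cache only" setup) is either absent from the response or null.

```python
import json

# Sample body copied from the response shown above.
SAMPLE = """
{
  "topology": "bundled",
  "redis": { "url": "redis://cc-bouncer-redis:6379/0", "connected": true, "rtt_ms": 1.2 },
  "ratelimit": { "endpoint": "cc-bouncer-ratelimit:8081", "connected": true, "rtt_ms": 1.1 },
  "checked_at": "2026-04-29T18:31:02Z"
}
"""


def unhealthy_components(probe: dict) -> list:
    """Return the names of configured components that report connected=false."""
    failing = []
    for name in ("redis", "ratelimit"):
        component = probe.get(name)
        if component is None:
            continue  # assumed to mean "not configured", so not a failure
        if not component.get("connected", False):
            failing.append(name)
    return failing


if __name__ == "__main__":
    print(unhealthy_components(json.loads(SAMPLE)))
```

An empty list means every configured component answered the probe; anything else is a candidate for an alert.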
Sizing guidance
- Small (< 100 RPS, < 10 MB hot cache): redis:7-alpine with 128 Mi memory and no persistence is enough.
- Medium (< 1k RPS, < 100 MB hot cache): 512 Mi memory; enable persistence on a dedicated PVC.
- Large (1k+ RPS, > 100 MB cache, semantic match): allocate 1+ Gi memory; consider Redis Cluster (v2 follow-up).
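The tiers above can be encoded as a quick lookup for capacity planning scripts. This is only a sketch: the function name is hypothetical, and it assumes that exceeding any single threshold (or enabling semantic match) is enough to move up a tier, which the guidance does not state explicitly.

```python
# Memory figures taken from the sizing list above.
MEMORY_BY_TIER = {"small": "128Mi", "medium": "512Mi", "large": "1Gi or more"}


def sizing_tier(peak_rps: float, hot_cache_mb: float,
                semantic_match: bool = False) -> str:
    """Pick a tier; any one criterion being exceeded selects the larger tier."""
    if semantic_match or peak_rps >= 1000 or hot_cache_mb > 100:
        return "large"
    if peak_rps >= 100 or hot_cache_mb >= 10:
        return "medium"
    return "small"


if __name__ == "__main__":
    tier = sizing_tier(peak_rps=500, hot_cache_mb=5)
    print(tier, MEMORY_BY_TIER[tier])
```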
Failure modes
| Condition | Result |
|---|---|
| Redis unreachable | bouncer falls back to per-process cache; logs AI_CACHE_FALLBACK |
| RLS unreachable | rate-limit filter fails open by default; raise an alert and escalate to "fail closed" if your policy demands |
| Redis auth fails | fail closed for the cache path only; rate-limit descriptors keep working |
Default failure semantics are configurable from Settings -> AI Pilot -> Cache & Rate Limits.
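The difference between failing open and failing closed when the RLS is unreachable can be shown in a few lines. This is a conceptual sketch, not the rate-limit filter's code: `FailureMode`, `admit`, and the use of `ConnectionError` to signal an unreachable RLS are all assumptions for illustration.

```python
from enum import Enum
from typing import Callable


class FailureMode(Enum):
    FAIL_OPEN = "fail_open"      # admit traffic when the limiter is down
    FAIL_CLOSED = "fail_closed"  # reject traffic when the limiter is down


def admit(check_limit: Callable[[], bool],
          mode: FailureMode = FailureMode.FAIL_OPEN) -> bool:
    """Return True if the request may proceed.

    check_limit returns True when the request is under its limit and is
    assumed to raise ConnectionError when the RLS cannot be reached.
    """
    try:
        return check_limit()
    except ConnectionError:
        return mode is FailureMode.FAIL_OPEN
```

Failing open favors availability at the cost of unmetered spend during an outage; failing closed does the opposite, which is why the choice is usually driven by policy.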