AI Pilot — Cache & Rate Limits

Audience: Platform engineers, SREs
Time: ~10 min read

AI Pilot uses a Redis-backed cache for prompt/semantic caching and a separate Rate Limit Service (RLS) for global token-aware enforcement. Both are optional sidecars deployed next to the bouncer, controllable per-bouncer from the Control Plane.

The three deployment topologies

You pick a topology per bouncer in Settings -> AI Pilot -> Cache & Rate Limits.

| Topology | Cache | Rate Limit | Best for |
|----------|-------|------------|----------|
| bundled | redis:7-alpine shipped as sidecar | envoyproxy/ratelimit:1.4 shipped as sidecar | Self-contained pod, no infra dependency |
| external | BYO Redis URL | BYO RLS endpoint (or none) | Reusing existing Redis/RLS infra |
| disabled | Local in-process cache only | Per-process limits only | Sandbox or air-gapped demos |

The Helm chart ships an optional subchart bouncer-cache/ that adds two containers next to the bouncer pod (or as a sibling Deployment in the same namespace, depending on your topology preference):

  • cc-bouncer-redis — Redis 7 with optional persistence and auth
  • cc-bouncer-ratelimit — Envoy Rate Limit Service 1.4 with descriptors auto-generated from the AI Pilot configuration

In Docker Compose, the same effect is achieved by enabling the with-cache profile.

Enable

In Helm:

bouncerCache:
  enabled: true
  redis:
    image: redis:7-alpine
    persistence:
      enabled: true
      size: 1Gi
    auth:
      enabled: true
      existingSecret: bouncer-redis-auth
  ratelimit:
    image: envoyproxy/ratelimit:1.4
    descriptorsConfigMap: bouncer-ratelimit-descriptors

In Compose:

docker compose --profile with-cache up -d

Auth

When redis.auth.enabled=true, the bouncer reads REDIS_PASSWORD from the Kubernetes Secret named in redis.auth.existingSecret (bouncer-redis-auth in the Helm example above). The RLS sidecar uses the same Secret. Connection strings shown to operators in /settings/pilot are masked.
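To illustrate what masking a connection string can look like, here is a minimal sketch that hides the password component of a Redis URL before it is displayed. The helper name mask_redis_url and the `****` placeholder are assumptions made for this example; the format that /settings/pilot actually uses is not specified here.

```python
from urllib.parse import urlsplit, urlunsplit


def mask_redis_url(url: str, placeholder: str = "****") -> str:
    """Return the URL with any password replaced by a placeholder.

    Hypothetical helper for illustration only; not part of AI Pilot.
    """
    parts = urlsplit(url)
    if parts.password is None:
        return url  # nothing to mask

    # Rebuild the network location without the real password.
    user = parts.username or ""
    host = parts.hostname or ""
    if ":" in host:  # IPv6 literals need brackets again
        host = f"[{host}]"
    netloc = f"{user}:{placeholder}@{host}"
    if parts.port is not None:
        netloc += f":{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

A URL without credentials is returned unchanged, so the helper is safe to apply to every connection string before rendering.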

External topology

Tell the bouncer to use an existing Redis (and optionally an existing RLS) by supplying URLs in Settings -> AI Pilot -> Cache & Rate Limits:

  • Redis URL — e.g. redis://redis.shared.svc.cluster.local:6379/0
  • Rate Limit endpoint — e.g. ratelimit.shared.svc.cluster.local:8081 (optional; leave blank for "cache only")
  • Auth secret — name of an existing Secret with the password

PAP probes the connection and shows Connected, Auth failed, or Unreachable on the dashboard.

Disabled topology

Pick this when you do not want any shared store. The bouncer falls back to:

  • per-process LRU cache for prompt cache (no semantic match across replicas)
  • per-process token counter for rate limits (each replica counts independently, so 429s fire per replica rather than globally)

Useful for demos and air-gapped sandboxes where adding Redis is impractical.
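The two fallbacks can be pictured with the following sketch. It is not the bouncer's implementation: the class names, the exact-match key, and the fixed one-minute window are assumptions chosen to keep the example small. It shows why state held in one process is invisible to other replicas.

```python
import time
from collections import OrderedDict
from typing import Optional


class LRUCache:
    """Tiny exact-match LRU cache; entries live only in this process."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: str) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used


class TokenWindow:
    """Fixed one-minute token counter, local to this process."""

    def __init__(self, limit_per_min: int, clock=time.monotonic) -> None:
        self.limit = limit_per_min
        self._clock = clock
        self._window_start = clock()
        self._used = 0

    def allow(self, tokens: int) -> bool:
        now = self._clock()
        if now - self._window_start >= 60:
            self._window_start, self._used = now, 0  # start a new window
        if self._used + tokens > self.limit:
            return False  # the caller would answer with a 429
        self._used += tokens
        return True
```

Because each replica holds its own TokenWindow, N replicas can together admit up to N times the configured limit, which is the main reason this topology suits demos rather than production.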

How rate-limit descriptors are generated

The Control Plane translates each cost rule from Settings -> AI Pilot -> Cost Optimization into one or more Envoy rate-limit descriptors. Examples:

| Cost rule | Generated descriptor |
|-----------|----------------------|
| OpenAI / gpt-4o, 4000 tok/min/user | ("provider","openai"),("model","gpt-4o"),("user","<sub>") |
| Bedrock / claude-3-haiku, $50/day total | ("provider","bedrock"),("model","claude-3-haiku") |
| MCP weather_lookup, 100 RPS | ("mcp_server","<id>"),("tool","weather_lookup") |
| App marketing-portal, 10k tok/min total | ("application","marketing-portal") |

The RLS sidecar applies the descriptors atomically against Redis-backed counters.
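The translation from a cost rule to a descriptor can be sketched as below. The CostRule fields and the to_descriptor function are hypothetical names invented for this example; only the key names ("provider", "model", "user", "mcp_server", "tool", "application") and their order come from the examples above. The real Control Plane logic may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Descriptor = List[Tuple[str, str]]


@dataclass
class CostRule:
    """Hypothetical shape of a cost rule; field names are assumptions."""
    provider: Optional[str] = None
    model: Optional[str] = None
    mcp_server: Optional[str] = None
    tool: Optional[str] = None
    application: Optional[str] = None
    per_user: bool = False  # True for "per user" limits such as tok/min/user


def to_descriptor(rule: CostRule, user_sub: Optional[str] = None) -> Descriptor:
    """Build the ordered key/value pairs that identify one rule's counter."""
    entries: Descriptor = []
    for key in ("provider", "model", "mcp_server", "tool", "application"):
        value = getattr(rule, key)
        if value is not None:
            entries.append((key, value))
    if rule.per_user:
        if user_sub is None:
            raise ValueError("per-user rule needs the caller's subject")
        entries.append(("user", user_sub))
    return entries
```

A per-user rule yields one counter per subject, while a "total" rule omits the user entry so that all callers share a single counter.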

Health and probing

GET /pep-config/pilot/bouncer/{id}/cache-probe returns:

{
  "topology": "bundled",
  "redis": { "url": "redis://cc-bouncer-redis:6379/0", "connected": true, "rtt_ms": 1.2 },
  "ratelimit": { "endpoint": "cc-bouncer-ratelimit:8081", "connected": true, "rtt_ms": 1.1 },
  "checked_at": "2026-04-29T18:31:02Z"
}

The result is surfaced as a card on the /pilot Overview tab.
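A monitoring script could consume the probe response along these lines. This is a sketch under assumptions: the function name probe_problems and the 50 ms round-trip threshold are illustrative, and the handling of a missing "ratelimit" entry (for example in a cache-only setup) is a guess, not documented behavior.

```python
import json
from typing import List


def probe_problems(body: str, max_rtt_ms: float = 50.0) -> List[str]:
    """Return human-readable problems found in a cache-probe response body."""
    probe = json.loads(body)
    problems = []
    for name in ("redis", "ratelimit"):
        component = probe.get(name)
        if component is None:
            continue  # assumption: an absent entry means the component is not configured
        if not component.get("connected", False):
            problems.append(f"{name}: not connected")
        elif component.get("rtt_ms", 0.0) > max_rtt_ms:
            problems.append(f"{name}: slow ({component['rtt_ms']} ms)")
    return problems
```

An empty list means both components answered within the threshold; anything else can be forwarded to an alerting channel.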

Sizing guidance

  • Small (< 100 RPS, < 10 MB hot cache): redis:7-alpine with 128 Mi memory and no persistence is enough.
  • Medium (< 1k RPS, < 100 MB hot cache): 512 Mi memory; enable persistence on a dedicated PVC.
  • Large (1k+ RPS, > 100 MB cache, semantic match): allocate 1+ Gi memory; consider Redis Cluster (v2 follow-up).
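The tiers above can be read as a simple lookup. The sketch below assumes that both conditions of a tier must hold for it to apply and that anything else falls through to the next tier; the function name and the Kubernetes-style quantity strings are illustrative only.

```python
def suggest_redis_memory(rps: float, hot_cache_mb: float) -> str:
    """Map expected load to the memory figure from the sizing guidance."""
    if rps < 100 and hot_cache_mb < 10:
        return "128Mi"  # small: no persistence needed
    if rps < 1000 and hot_cache_mb < 100:
        return "512Mi"  # medium: enable persistence on a dedicated PVC
    return "1Gi"  # large: the guidance says 1+ Gi, so treat this as a floor
```

For example, a bouncer at 50 RPS with a 50 MB hot cache lands in the medium tier, because the cache size exceeds the small tier's bound even though the request rate does not.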

Failure modes

| Condition | Result |
|-----------|--------|
| Redis unreachable | bouncer falls back to per-process cache; logs AI_CACHE_FALLBACK |
| RLS unreachable | rate-limit filter fails open by default; raise an alert and escalate to "fail closed" if your policy demands |
| Redis auth fails | fail closed for the cache path only; rate-limit descriptors keep working |

Default failure semantics are configurable from Settings -> AI Pilot -> Cache & Rate Limits.
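The difference between failing open and failing closed for the rate-limit path can be sketched as follows. The enum, the function name, and the returned labels are assumptions for illustration; the sketch does not show which HTTP status the bouncer returns when it fails closed.

```python
from enum import Enum


class FailureMode(Enum):
    FAIL_OPEN = "open"      # let traffic through when the RLS cannot be reached
    FAIL_CLOSED = "closed"  # reject traffic when the RLS cannot be reached


def rate_limit_decision(
    rls_reachable: bool,
    over_limit: bool,
    mode: FailureMode = FailureMode.FAIL_OPEN,
) -> str:
    """Return "allow", "limited" (a 429), or "reject" for one request."""
    if not rls_reachable:
        # Without an answer from the RLS, the configured mode decides.
        return "allow" if mode is FailureMode.FAIL_OPEN else "reject"
    return "limited" if over_limit else "allow"
```

Failing open favors availability at the cost of unmetered spend during an outage, which is why the table above pairs it with an alert.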