🚀 Enterprise Deployment Guide

This guide covers enterprise-scale deployment of Control Core with auto-scaling, high availability, load balancing, and advanced configuration. It is written for organizations that need maximum performance, reliability, and scalability. The same Helm/Kubernetes approach works on any cloud or on-premises (AWS EKS, Azure AKS, GCP GKE, or your own Kubernetes cluster). DevOps teams: follow the 30-minute runbook, and see the companion sections on what to deploy, before you start, and where to run it.

🚀 Developer Portal after deploy

In Enterprise, the Developer Portal is served by your self-hosted Control Plane API deployment (control-plane-api) and remains inside your infrastructure:

  • URL: https://<your-control-plane-host>/devdocs
  • OpenAPI JSON: https://<your-control-plane-host>/openapi.json

Post-deploy checklist:

  1. Open /devdocs and verify that the page title reads "Control Core - Developer".
  2. Use the Swagger onboarding endpoints to generate a token and environment API keys.
  3. Validate platform health with GET /health/ready before onboarding developers.
  4. Optionally mirror openapi.json into internal API catalogs and SDK-generation pipelines.
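
The checklist above can be scripted. A minimal sketch, assuming `curl` is available and the control-plane hostname is reachable; the `smoke_endpoints`/`smoke_check` helper names are illustrative, not part of the product:

```shell
# Post-deploy smoke check (sketch). Pass your control-plane hostname,
# e.g. `smoke_check controlcore.internal.example.com`.
smoke_endpoints() {                        # the URLs from the checklist above
  printf 'https://%s/devdocs\n'      "$1"
  printf 'https://%s/openapi.json\n' "$1"
  printf 'https://%s/health/ready\n' "$1"
}

smoke_check() {
  local url code
  smoke_endpoints "$1" | while read -r url; do
    # -f: fail on HTTP errors; -sS: silent but keep error messages
    code=$(curl -fsS -o /dev/null -w '%{http_code}' "$url") || code="FAIL"
    echo "$url -> $code"
  done
}
```

Run `smoke_check <your-control-plane-host>` and confirm every endpoint reports 200 before onboarding developers.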

📌 Overview

Enterprise deployment is ideal for:

  • Large organizations (100+ users, 1M+ policy evaluations/day)
  • High-traffic applications requiring sub-10ms latency
  • Multi-region deployments with global reach
  • Strict compliance and audit requirements
  • Mission-critical applications requiring 99.99% uptime
  • Organizations with dedicated DevOps/SRE teams

🏗️ Architecture Patterns

Standard Enterprise Architecture

┌─────────────────────────────────────────────────────────────────┐
│                       Load Balancer (Layer 7)                   │
│                  (AWS ALB / NGINX / HAProxy)                    │
│          SSL Termination │ Health Checks │ Routing              │
└────────────────────────────┬────────────────────────────────────┘
                             │
        ┌────────────────────┼────────────────────┐
        │                    │                    │
        ▼                    ▼                    ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Console #1  │    │  Console #2  │    │  Console #3  │
│  (React/TS)  │    │  (React/TS)  │    │  (React/TS)  │
│  Port 3000   │    │  Port 3000   │    │  Port 3000   │
└──────────────┘    └──────────────┘    └──────────────┘
                             │
        ┌────────────────────┼────────────────────┐
        │                    │                    │
        ▼                    ▼                    ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   API #1     │    │   API #2     │    │   API #3     │
│  (FastAPI)   │    │  (FastAPI)   │    │  (FastAPI)   │
│  Port 8082   │    │  Port 8082   │    │  Port 8082   │
└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
       │                   │                    │
       └───────────────────┼────────────────────┘
                           │
                           ▼
        ┌──────────────────────────────────────┐
        │      PostgreSQL Primary-Replica      │
        │  Primary (Write) + 2 Read Replicas   │
        └──────────────────────────────────────┘
                           │
        ┌──────────────────┼──────────────────┐
        │                  │                  │
        ▼                  ▼                  ▼
 ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
 │Policy Bridge│    │Policy Bridge│    │Policy Bridge│
 │ #1 (Leader) │    │#2 (Follower)│    │#3 (Follower)│
 └─────────────┘    └─────────────┘    └─────────────┘
        │                  │                  │
        └──────────────────┼──────────────────┘
                           │ Policy Distribution
                           │
┌──────────────────────────┼──────────────────────────┐
│                          │                          │
▼                          ▼                          ▼
┌─────────────────────────────────────────────────────┐
│           Load Balancer (Bouncer/PEP Fleet)         │
│        (DNS Round-Robin / AWS NLB / HAProxy)        │
└──────────┬──────────────────────────────────────────┘
           │
┌──────────┼──────────┬──────────┬──────────┬─────────┐
│          │          │          │          │         │
▼          ▼          ▼          ▼          ▼         ▼
┌────┐   ┌────┐   ┌────┐   ┌────┐   ┌────┐   ┌─────┐
│PEP1│   │PEP2│   │PEP3│   │PEP4│   │PEP5│   │PEP-N│
└────┘   └────┘   └────┘   └────┘   └────┘   └─────┘
  │        │        │        │        │        │
  └────────┴────────┴────────┴────────┴────────┘
                     │
                     ▼
          ┌──────────────────┐
          │  Protected Apps  │
          └──────────────────┘

Multi-Region Architecture

Region: US-EAST-1                     Region: EU-WEST-1
┌──────────────────────────┐         ┌──────────────────────────┐
│  Control Plane (Primary) │◄───────►│ Control Plane (Replica)  │
│  - Console x3            │  Sync   │  - Console x3            │
│  - API x5                │         │  - API x5                │
│  - Policy Bridge x3      │         │  - Policy Bridge x3      │
│  - DB Primary + Replica  │         │  - DB Read Replicas      │
│  - PEP Fleet (10)        │         │  - PEP Fleet (10)        │
└──────────────────────────┘         └──────────────────────────┘
           │                                    │
           │                                    │
           ▼                                    ▼
    Protected Apps                       Protected Apps
    (US Users)                          (EU Users)

Region: ASIA-PACIFIC-1
┌──────────────────────────┐
│ Control Plane (Replica)  │
│  - Console x3            │
│  - API x5                │
│  - Policy Bridge x3      │
│  - DB Read Replicas      │
│  - PEP Fleet (10)        │
└──────────────────────────┘
           │
           ▼
    Protected Apps
    (APAC Users)

📌 Prerequisites

Infrastructure Requirements

Minimum Production Configuration:

  • Kubernetes Cluster: v1.24+
  • Nodes: 6 nodes minimum (3 control plane, 3 workers)
  • Memory: 16GB RAM per node (96GB total minimum)
  • CPU: 4 cores per node (24 cores total minimum)
  • Storage: 500GB SSD with high IOPS (3000+ IOPS recommended)
  • Network: 10 Gbps between nodes, 1 Gbps external

Recommended Production Configuration:

  • Nodes: 12+ nodes (3 control plane, 9+ workers)
  • Memory: 32GB RAM per node
  • CPU: 8 cores per node
  • Storage: 1TB NVMe SSD with 10,000+ IOPS
  • Network: 25 Gbps between nodes, 10 Gbps external

Software Requirements

  • Kubernetes: 1.24 or higher
  • Helm: 3.0 or higher
  • kubectl: Matching cluster version
  • cert-manager: For SSL certificate management
  • Ingress Controller: NGINX, Traefik, or cloud provider (ALB, etc.)
  • Metrics Server: For HPA (Horizontal Pod Autoscaler)
  • Prometheus: For monitoring (optional but recommended)
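
A quick local check against these version minimums can save a failed install later. A sketch; the `ver_ge` helper is illustrative and relies on `sort -V` (GNU/busybox coreutils):

```shell
# Compare dotted version strings: ver_ge CANDIDATE MINIMUM -> exit 0 if CANDIDATE >= MINIMUM
ver_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Example usage against the requirements above (client-side checks only;
# server-side versions need cluster access):
#   kubectl version --client --output=json   # extract gitVersion, compare to 1.24
#   helm version --template '{{.Version}}'   # compare to v3.0
ver_ge "1.28" "1.24" && echo "kubernetes OK"
```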

Cloud Provider Requirements

AWS:

  • EKS cluster or self-managed Kubernetes
  • RDS PostgreSQL (db.r6g.xlarge or higher)
  • ElastiCache Redis (cache.r6g.large or higher)
  • Application Load Balancer (ALB)
  • Network Load Balancer (NLB)
  • Route 53 for DNS
  • S3 for backups
  • CloudWatch for logging

Azure:

  • AKS cluster
  • Azure Database for PostgreSQL (Flexible Server, Standard_D4s_v3+)
  • Azure Cache for Redis (Standard C1+)
  • Azure Load Balancer
  • Azure DNS
  • Azure Blob Storage for backups
  • Azure Monitor for logging

Google Cloud:

  • GKE cluster
  • Cloud SQL for PostgreSQL (db-custom-4-16384+)
  • Memorystore for Redis (M1 tier+)
  • Cloud Load Balancing
  • Cloud DNS
  • Cloud Storage for backups
  • Cloud Logging

📦 Installation

Step 1: Prepare Kubernetes Cluster

Create EKS Cluster (AWS example):

# Install eksctl
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin

# Create cluster
eksctl create cluster \
  --name controlcore-production \
  --region us-east-1 \
  --version 1.28 \
  --nodegroup-name standard-workers \
  --node-type m5.2xlarge \
  --nodes 6 \
  --nodes-min 6 \
  --nodes-max 20 \
  --managed \
  --with-oidc \
  --ssh-access \
  --ssh-public-key ~/.ssh/id_rsa.pub \
  --enable-ssm

# Verify cluster
kubectl get nodes

Create AKS Cluster (Azure example):

# Create resource group
az group create --name controlcore-production --location eastus

# Create AKS cluster
az aks create \
  --resource-group controlcore-production \
  --name controlcore-cluster \
  --kubernetes-version 1.28.0 \
  --node-count 6 \
  --node-vm-size Standard_D8s_v3 \
  --enable-managed-identity \
  --enable-cluster-autoscaler \
  --min-count 6 \
  --max-count 20 \
  --network-plugin azure \
  --load-balancer-sku standard \
  --generate-ssh-keys

# Get credentials
az aks get-credentials --resource-group controlcore-production --name controlcore-cluster

# Verify
kubectl get nodes

Create GKE Cluster (GCP example):

# Set project and region
gcloud config set project your-project-id
gcloud config set compute/region us-central1

# Create GKE cluster
gcloud container clusters create controlcore-cluster \
  --region us-central1 \
  --cluster-version 1.28 \
  --machine-type n2-standard-8 \
  --num-nodes 2 \
  --min-nodes 2 \
  --max-nodes 7 \
  --enable-autoscaling \
  --enable-autorepair \
  --enable-autoupgrade \
  --disk-type pd-ssd \
  --disk-size 100 \
  --enable-ip-alias \
  --enable-stackdriver-kubernetes \
  --addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver \
  --workload-pool=your-project-id.svc.id.goog \
  --enable-shielded-nodes \
  --shielded-secure-boot \
  --shielded-integrity-monitoring

# Alternative: GKE Autopilot (fully managed)
gcloud container clusters create-auto controlcore-cluster \
  --region us-central1 \
  --cluster-version 1.28

# Get credentials
gcloud container clusters get-credentials controlcore-cluster --region us-central1

# Verify
kubectl get nodes

Step 2: Install Prerequisites

Install cert-manager:

# Add Jetstack Helm repository
helm repo add jetstack https://charts.jetstack.io
helm repo update

# Install cert-manager
kubectl create namespace cert-manager
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --version v1.13.0 \
  --set installCRDs=true

# Verify installation
kubectl get pods -n cert-manager

Install NGINX Ingress Controller:

# Add NGINX Helm repository
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

# Install NGINX Ingress
helm install nginx-ingress ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --set controller.replicaCount=3 \
  --set controller.service.type=LoadBalancer \
  --set controller.metrics.enabled=true \
  --set controller.podAnnotations."prometheus\.io/scrape"=true

# Get Load Balancer IP
kubectl get svc -n ingress-nginx

Install Metrics Server (for HPA):

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify
kubectl get deployment metrics-server -n kube-system

Install Prometheus (optional but recommended):

# Add Prometheus Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install Prometheus
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi \
  --set grafana.enabled=true \
  --set grafana.adminPassword=ChangeMeSecurePassword

# Access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

Step 3: Configure Storage

Create Storage Class (AWS EBS example):

# storage-class.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: controlcore-fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

kubectl apply -f storage-class.yaml

GCP Persistent Disk Storage Class:

# storage-class-gcp.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: controlcore-fast-ssd
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd
  replication-type: regional-pd  # For HA across zones
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

kubectl apply -f storage-class-gcp.yaml

Azure Disk Storage Class:

# storage-class-azure.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: controlcore-fast-ssd
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_LRS  # Premium SSD
  kind: Managed
  cachingMode: ReadOnly
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Retain

kubectl apply -f storage-class-azure.yaml

Step 4: Create Namespace and Secrets

# Create namespace
kubectl create namespace control-core

# Create secrets
kubectl create secret generic controlcore-secrets \
  --namespace control-core \
  --from-literal=database-password='SecureDBPassword123!' \
  --from-literal=redis-password='SecureRedisPassword123!' \
  --from-literal=jwt-secret='SecureJWTSecret123!' \
  --from-literal=admin-password='SecureAdminPassword123!'

# Create TLS secret (if using custom certificate)
kubectl create secret tls controlcore-tls \
  --namespace control-core \
  --cert=path/to/tls.crt \
  --key=path/to/tls.key
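
Rather than hard-coding literal passwords as in the example above, generate random credentials. A sketch using `openssl`; the `gen_secret` helper is illustrative:

```shell
# Generate URL-safe random secrets instead of hard-coded literals.
gen_secret() {
  # $1 = desired length (default 32); strip base64 chars awkward in shells/URLs
  openssl rand -base64 48 | tr -d '=+/\n' | cut -c1-"${1:-32}"
}

DB_PASSWORD=$(gen_secret)
REDIS_PASSWORD=$(gen_secret)
JWT_SECRET=$(gen_secret 48)
ADMIN_PASSWORD=$(gen_secret)

# Then create the secret exactly as above, substituting the generated values:
# kubectl create secret generic controlcore-secrets \
#   --namespace control-core \
#   --from-literal=database-password="$DB_PASSWORD" \
#   --from-literal=redis-password="$REDIS_PASSWORD" \
#   --from-literal=jwt-secret="$JWT_SECRET" \
#   --from-literal=admin-password="$ADMIN_PASSWORD"
```

Store the generated values in your secrets manager; the shell variables are gone when the session ends.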

Step 5: Deploy PostgreSQL (High Availability)

Using Helm (Bitnami PostgreSQL HA):

# Add Bitnami repository
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

# Install PostgreSQL with replication
helm install postgresql bitnami/postgresql-ha \
  --namespace control-core \
  --set postgresql.replicaCount=3 \
  --set postgresql.resources.requests.memory=8Gi \
  --set postgresql.resources.requests.cpu=2000m \
  --set postgresql.resources.limits.memory=16Gi \
  --set postgresql.resources.limits.cpu=4000m \
  --set pgpool.replicaCount=3 \
  --set pgpool.resources.requests.memory=2Gi \
  --set pgpool.resources.requests.cpu=1000m \
  --set persistence.size=200Gi \
  --set persistence.storageClass=controlcore-fast-ssd \
  --set metrics.enabled=true \
  --set volumePermissions.enabled=true

# Or use managed database (AWS RDS example)
# Create RDS instance via AWS Console or CLI:
aws rds create-db-instance \
  --db-instance-identifier controlcore-db \
  --db-instance-class db.r6g.2xlarge \
  --engine postgres \
  --engine-version 15.3 \
  --master-username controlcore \
  --master-user-password SecurePassword123! \
  --allocated-storage 500 \
  --storage-type gp3 \
  --iops 12000 \
  --multi-az \
  --backup-retention-period 30 \
  --preferred-backup-window "03:00-04:00" \
  --preferred-maintenance-window "mon:04:00-mon:05:00" \
  --enable-performance-insights \
  --enable-cloudwatch-logs-exports postgresql

# Google Cloud SQL (GCP example)
gcloud sql instances create controlcore-db \
  --database-version=POSTGRES_15 \
  --tier=db-custom-8-32768 \
  --region=us-central1 \
  --network=default \
  --availability-type=REGIONAL \
  --storage-type=SSD \
  --storage-size=500GB \
  --storage-auto-increase \
  --backup-start-time=03:00 \
  --maintenance-window-day=MON \
  --maintenance-window-hour=04 \
  --enable-point-in-time-recovery \
  --retained-backups-count=30 \
  --root-password=SecurePassword123!

# Set database flags for performance
gcloud sql instances patch controlcore-db \
  --database-flags=shared_buffers=1048576,max_connections=500,effective_cache_size=3145728  # shared_buffers/effective_cache_size in 8kB pages (8GB / 24GB)

# Create database
gcloud sql databases create control_core_db --instance=controlcore-db

# Create user
gcloud sql users create controlcore \
  --instance=controlcore-db \
  --password=SecurePassword123!

# Azure Database for PostgreSQL (Azure example)
az postgres flexible-server create \
  --resource-group controlcore-production \
  --name controlcore-db \
  --location eastus \
  --admin-user controlcore \
  --admin-password SecurePassword123! \
  --sku-name Standard_D8s_v3 \
  --tier GeneralPurpose \
  --version 15 \
  --storage-size 512 \
  --backup-retention 30 \
  --geo-redundant-backup Enabled \
  --high-availability ZoneRedundant \
  --public-access 0.0.0.0-255.255.255.255  # allows all public IPs; restrict to your VNet or office IP ranges in production

# Create database
az postgres flexible-server db create \
  --resource-group controlcore-production \
  --server-name controlcore-db \
  --database-name control_core_db

# Configure server parameters
az postgres flexible-server parameter set \
  --resource-group controlcore-production \
  --server-name controlcore-db \
  --name shared_buffers \
  --value 1048576  # 8GB in 8kB pages (PostgreSQL's shared_buffers unit)

az postgres flexible-server parameter set \
  --resource-group controlcore-production \
  --server-name controlcore-db \
  --name max_connections \
  --value 500

Database Configuration (cloud-agnostic):

# database-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: database-config
  namespace: control-core
data:
  # AWS RDS
  # host: "controlcore-db.cluster-xxxxx.us-east-1.rds.amazonaws.com"
  # GCP Cloud SQL
  # host: "10.x.x.x"  # Private IP or Cloud SQL Proxy
  # Azure Database
  # host: "controlcore-db.postgres.database.azure.com"
  host: "your-database-host"
  port: "5432"
  database: "control_core_db"
  pool_size: "50"
  max_overflow: "20"
  pool_timeout: "30"
  pool_recycle: "3600"

Step 6: Deploy Redis (High Availability)

Using Helm (Redis Cluster):

# Install Redis cluster
helm install redis bitnami/redis-cluster \
  --namespace control-core \
  --set cluster.nodes=6 \
  --set cluster.replicas=1 \
  --set password=SecureRedisPassword123! \
  --set persistence.size=50Gi \
  --set persistence.storageClass=controlcore-fast-ssd \
  --set resources.requests.memory=4Gi \
  --set resources.requests.cpu=1000m \
  --set metrics.enabled=true

# Or use managed cache (AWS ElastiCache example)
aws elasticache create-replication-group \
  --replication-group-id controlcore-cache \
  --replication-group-description "Control Core Redis Cluster" \
  --engine redis \
  --cache-node-type cache.r6g.xlarge \
  --num-cache-clusters 3 \
  --automatic-failover-enabled \
  --at-rest-encryption-enabled \
  --transit-encryption-enabled \
  --auth-token SecureRedisPassword123! \
  --snapshot-retention-limit 7 \
  --snapshot-window "03:00-05:00"

# GCP Memorystore for Redis
gcloud redis instances create controlcore-cache \
  --size=5 \
  --region=us-central1 \
  --tier=standard \
  --redis-version=redis_7_0 \
  --enable-auth \
  --auth-string=SecureRedisPassword123! \
  --transit-encryption-mode=SERVER_AUTHENTICATION \
  --replica-count=2 \
  --read-replicas-mode=READ_REPLICAS_ENABLED \
  --persistence-mode=RDB \
  --rdb-snapshot-period=12h \
  --rdb-snapshot-start-time=03:00

# Get connection info
gcloud redis instances describe controlcore-cache --region=us-central1

# Azure Cache for Redis
az redis create \
  --resource-group controlcore-production \
  --name controlcore-cache \
  --location eastus \
  --sku Premium \
  --vm-size P2 \
  --enable-non-ssl-port false \
  --minimum-tls-version 1.2 \
  --redis-configuration maxmemory-policy=allkeys-lru \
  --replicas-per-primary 2 \
  --zones 1 2 3 \
  --shard-count 2

# Get connection info
az redis list-keys \
  --resource-group controlcore-production \
  --name controlcore-cache

Step 7: Install Control Core Helm Chart

Add Control Core Helm Repository:

# Add repository
helm repo add controlcore https://charts.controlcore.io
helm repo update

# Pull chart to customize
helm pull controlcore/control-core --untar
cd control-core

Configure values.yaml:

# values-production.yaml

global:
  domain: controlcore.yourcompany.com
  environment: production
  
  # Image configuration
  imageRegistry: controlcore.io
  imagePullSecrets:
    - name: controlcore-registry-secret

# Policy Administration Console
console:
  enabled: true
  replicaCount: 3
  
  image:
    repository: controlcore/console
    tag: "2.0.0"
    pullPolicy: IfNotPresent
  
  resources:
    requests:
      memory: "1Gi"
      cpu: "500m"
    limits:
      memory: "2Gi"
      cpu: "1000m"
  
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80
  
  ingress:
    enabled: true
    className: nginx
    annotations:
      cert-manager.io/cluster-issuer: "letsencrypt-prod"
      nginx.ingress.kubernetes.io/ssl-redirect: "true"
      nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    hosts:
      - host: console.controlcore.yourcompany.com
        paths:
          - path: /
            pathType: Prefix
    tls:
      - secretName: console-tls
        hosts:
          - console.controlcore.yourcompany.com

# Policy Administration API
api:
  enabled: true
  replicaCount: 5
  
  image:
    repository: controlcore/api
    tag: "2.0.0"
    pullPolicy: IfNotPresent
  
  resources:
    requests:
      memory: "2Gi"
      cpu: "1000m"
    limits:
      memory: "4Gi"
      cpu: "2000m"
  
  autoscaling:
    enabled: true
    minReplicas: 5
    maxReplicas: 20
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80
    # Custom metrics for scaling
    metrics:
      - type: Pods
        pods:
          metric:
            name: http_requests_per_second
          target:
            type: AverageValue
            averageValue: "1000"
  
  env:
    - name: WORKERS
      value: "4"
    - name: MAX_REQUESTS
      value: "10000"
    - name: MAX_REQUESTS_JITTER
      value: "1000"
    - name: TIMEOUT
      value: "60"
  
  ingress:
    enabled: true
    className: nginx
    annotations:
      cert-manager.io/cluster-issuer: "letsencrypt-prod"
      nginx.ingress.kubernetes.io/ssl-redirect: "true"
      nginx.ingress.kubernetes.io/limit-rps: "1000"
    hosts:
      - host: api.controlcore.yourcompany.com
        paths:
          - path: /
            pathType: Prefix
    tls:
      - secretName: api-tls
        hosts:
          - api.controlcore.yourcompany.com

# Policy Bridge
policyBridge:
  enabled: true
  replicaCount: 3
  
  image:
    repository: controlcore/policy-bridge
    tag: "0.7.0"
    pullPolicy: IfNotPresent
  
  resources:
    requests:
      memory: "1Gi"
      cpu: "500m"
    limits:
      memory: "2Gi"
      cpu: "1000m"
  
  # Leader election for HA
  leaderElection:
    enabled: true
    leaseDuration: 15s
    renewDeadline: 10s
    retryPeriod: 2s
  
  config:
    broadcast_uri: "postgres://controlcore:password@postgresql:5432/policy_bridge_db"
    data_config_sources:
      - uri: "https://api.controlcore.yourcompany.com/api/v1/policy-bridge/config"
        config:
          headers:
            Authorization: "Bearer ${POLICY_SYNC_API_KEY}"
    
# Policy Enforcement Point (Bouncer/PEP)
bouncer:
  enabled: true
  replicaCount: 10
  
  image:
    repository: controlcore/bouncer
    tag: "2.0.0"
    pullPolicy: IfNotPresent
  
  resources:
    requests:
      memory: "1Gi"
      cpu: "1000m"
    limits:
      memory: "2Gi"
      cpu: "2000m"
  
  autoscaling:
    enabled: true
    minReplicas: 10
    maxReplicas: 50
    targetCPUUtilizationPercentage: 60
    targetMemoryUtilizationPercentage: 70
    # Scale based on request rate
    metrics:
      - type: Pods
        pods:
          metric:
            name: http_requests_per_second
          target:
            type: AverageValue
            averageValue: "500"
  
  # Pod Disruption Budget for high availability
  podDisruptionBudget:
    enabled: true
    minAvailable: 5
  
  # Pod Topology Spread for better distribution
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: controlcore-bouncer
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: controlcore-bouncer
  
  config:
    cache:
      enabled: true
      policy_ttl: "5m"
      decision_ttl: "1m"
      max_size: 50000
    
    performance:
      worker_threads: 8
      connection_pool_size: 100
      max_concurrent_requests: 5000
  
  service:
    type: LoadBalancer
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
      service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
      service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "tcp"
    ports:
      - name: http
        port: 80
        targetPort: 8080
        protocol: TCP
      - name: https
        port: 443
        targetPort: 8443
        protocol: TCP

# Database configuration
database:
  # Use external database
  external: true
  host: "controlcore-db.cluster-xxxxx.us-east-1.rds.amazonaws.com"
  port: 5432
  database: "control_core_db"
  username: "controlcore"
  passwordSecret: "controlcore-secrets"
  passwordKey: "database-password"
  
  # Connection pool settings
  pool:
    size: 50
    max_overflow: 20
    timeout: 30
    recycle: 3600

# Redis configuration
redis:
  # Use external Redis
  external: true
  host: "controlcore-cache.xxxxx.cache.amazonaws.com"
  port: 6379
  passwordSecret: "controlcore-secrets"
  passwordKey: "redis-password"
  
  # Cluster mode
  cluster:
    enabled: true
    nodes: 6

# Monitoring & Observability
monitoring:
  enabled: true
  
  prometheus:
    enabled: true
    serviceMonitor:
      enabled: true
      interval: 30s
  
  grafana:
    enabled: true
    dashboards:
      enabled: true
      
  alerting:
    enabled: true
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
      
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"

# Backup configuration
backup:
  enabled: true
  schedule: "0 2 * * *"  # Daily at 2 AM
  retention: 30  # days
  storage:
    type: s3
    bucket: controlcore-backups
    region: us-east-1
    prefix: production/

# Security
security:
  podSecurityPolicy:
    enabled: true
  
  networkPolicy:
    enabled: true
    
  rbac:
    create: true
  
  serviceAccount:
    create: true
    annotations:
      eks.amazonaws.com/role-arn: "arn:aws:iam::ACCOUNT_ID:role/controlcore-sa-role"
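
The bouncer values above define a PodDisruptionBudget; the control-plane API benefits from one as well. If your chart version does not expose it, a standalone manifest can be applied alongside the release. The name and pod labels below are assumptions and must match your chart's rendered labels:

```yaml
# api-pdb.yaml -- keep a quorum of API pods during node drains and upgrades
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: controlcore-api-pdb
  namespace: control-core
spec:
  minAvailable: 3            # of the 5 baseline API replicas configured above
  selector:
    matchLabels:
      app: controlcore-api   # adjust to your chart's pod labels
```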

Install Control Core:

# Install with custom values
helm install control-core controlcore/control-core \
  --namespace control-core \
  --values values-production.yaml \
  --timeout 10m \
  --wait

# Verify installation
helm list -n control-core
kubectl get pods -n control-core
kubectl get svc -n control-core
kubectl get ingress -n control-core
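
The verification commands above can be wrapped in a small wait loop so CI fails fast if a rollout stalls. A sketch; the deployment names are assumptions based on the chart values above:

```shell
# Wait for each Control Core deployment to finish rolling out.
cc_deployments() {           # names assumed from the Helm chart above
  printf '%s\n' controlcore-console controlcore-api \
                controlcore-policy-bridge controlcore-bouncer
}

cc_wait_ready() {
  local d
  cc_deployments | while read -r d; do
    # --timeout makes a stuck rollout fail instead of hanging forever
    kubectl rollout status "deployment/$d" -n control-core --timeout=300s
  done
}
```

Run `cc_wait_ready` after `helm install`; a non-zero exit means at least one deployment never became ready.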

Step 8: Configure DNS

Get Load Balancer Address:

# Get Bouncer LB address
kubectl get svc -n control-core controlcore-bouncer -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'

# Get Ingress LB address (service name follows the Helm release name "nginx-ingress")
kubectl get svc -n ingress-nginx nginx-ingress-ingress-nginx-controller -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'

Configure DNS Records (Route 53 example):

# Console
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890ABC \
  --change-batch '{
    "Changes": [{
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "console.controlcore.yourcompany.com",
        "Type": "CNAME",
        "TTL": 300,
        "ResourceRecords": [{"Value": "INGRESS_LB_HOSTNAME"}]
      }
    }]
  }'

# API
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890ABC \
  --change-batch '{
    "Changes": [{
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "api.controlcore.yourcompany.com",
        "Type": "CNAME",
        "TTL": 300,
        "ResourceRecords": [{"Value": "INGRESS_LB_HOSTNAME"}]
      }
    }]
  }'

# Bouncer (for direct access)
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890ABC \
  --change-batch '{
    "Changes": [{
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "bouncer.controlcore.yourcompany.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{"Value": "BOUNCER_LB_HOSTNAME"}]
      }
    }]
  }'
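
Once the records above propagate, resolution can be verified from a workstation. A sketch, assuming `dig` is installed; the hostnames match the Route 53 records created above:

```shell
# Verify the three CNAMEs created above resolve.
cc_dns_names() {
  printf '%s.controlcore.yourcompany.com\n' console api bouncer
}

cc_dns_check() {
  local name
  cc_dns_names | while read -r name; do
    # +short prints only the resolved target; empty output means the record
    # has not propagated yet (re-run after the TTL elapses)
    dig +short "$name" | head -n1 | sed "s|^|$name -> |"
  done
}
```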

📌 Auto-Scaling Configuration

Horizontal Pod Autoscaler (HPA)

Control Core components auto-scale based on CPU, memory, and custom metrics.

Verify HPA Status:

# Check all HPAs
kubectl get hpa -n control-core

# Expected output:
NAME                     REFERENCE                   TARGETS         MINPODS   MAXPODS   REPLICAS
controlcore-console      Deployment/console          45%/70%         3         10        3
controlcore-api          Deployment/api              60%/70%         5         20        8
controlcore-bouncer      Deployment/bouncer          55%/60%         10        50        15

Custom Metrics for Scaling:

# hpa-custom-metrics.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: controlcore-api-advanced
  namespace: control-core
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: controlcore-api
  minReplicas: 5
  maxReplicas: 50
  metrics:
    # CPU-based scaling
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    
    # Memory-based scaling
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    
    # Request rate scaling
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
    
    # Policy evaluation time scaling
    - type: Pods
      pods:
        metric:
          name: policy_evaluation_duration_seconds
        target:
          type: AverageValue
          averageValue: "0.050"  # Scale when avg > 50ms
  
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
        - type: Pods
          value: 2
          periodSeconds: 60
      selectPolicy: Min
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
      selectPolicy: Max

kubectl apply -f hpa-custom-metrics.yaml

Cluster Autoscaler

AWS EKS:

# Create IAM policy
cat > cluster-autoscaler-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeLaunchConfigurations",
        "autoscaling:DescribeScalingActivities",
        "autoscaling:DescribeTags",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeLaunchTemplateVersions"
      ],
      "Resource": ["*"]
    },
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup",
        "ec2:DescribeImages",
        "ec2:GetInstanceTypesFromInstanceRequirements",
        "eks:DescribeNodegroup"
      ],
      "Resource": ["*"]
    }
  ]
}
EOF

aws iam create-policy \
  --policy-name ClusterAutoscalerPolicy \
  --policy-document file://cluster-autoscaler-policy.json

# Deploy cluster autoscaler
kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml

# Annotate deployment
kubectl -n kube-system annotate deployment.apps/cluster-autoscaler \
  cluster-autoscaler.kubernetes.io/safe-to-evict="false"

# Set cluster name
kubectl -n kube-system edit deployment.apps/cluster-autoscaler
# Add: --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/controlcore-production

Azure AKS (already configured if created with --enable-cluster-autoscaler):

# Update autoscaler settings
az aks update \
  --resource-group controlcore-production \
  --name controlcore-cluster \
  --update-cluster-autoscaler \
  --min-count 6 \
  --max-count 50

Verification:

# Check autoscaler logs
kubectl -n kube-system logs -f deployment/cluster-autoscaler

# Check node status
kubectl get nodes
kubectl top nodes

📌 Load Balancing Configuration

Application Load Balancing (Layer 7)

NGINX Ingress Configuration:

# ingress-advanced.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: controlcore-ingress
  namespace: control-core
  annotations:
    # SSL/TLS
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    
    # Load balancing
    nginx.ingress.kubernetes.io/load-balance: "ewma"  # Exponentially weighted moving average
    # Note: upstream-hash-by takes precedence over load-balance when both are set
    nginx.ingress.kubernetes.io/upstream-hash-by: "$binary_remote_addr"  # IP hash for session affinity
    
    # Rate limiting
    nginx.ingress.kubernetes.io/limit-rps: "1000"
    nginx.ingress.kubernetes.io/limit-burst-multiplier: "5"
    
    # Timeouts
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
    
    # Buffering
    nginx.ingress.kubernetes.io/proxy-buffering: "on"
    nginx.ingress.kubernetes.io/proxy-buffer-size: "8k"
    
    # Connection limits
    nginx.ingress.kubernetes.io/limit-connections: "100"
    
    # CORS
    nginx.ingress.kubernetes.io/enable-cors: "true"
    nginx.ingress.kubernetes.io/cors-allow-origin: "https://yourcompany.com"
    
    # Security headers
    nginx.ingress.kubernetes.io/configuration-snippet: |
      more_set_headers "X-Frame-Options: DENY";
      more_set_headers "X-Content-Type-Options: nosniff";
      more_set_headers "X-XSS-Protection: 1; mode=block";
      more_set_headers "Strict-Transport-Security: max-age=31536000; includeSubDomains";
    
    # Custom error pages
    nginx.ingress.kubernetes.io/custom-http-errors: "404,503"
    nginx.ingress.kubernetes.io/default-backend: custom-error-pages
    
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - console.controlcore.yourcompany.com
        - api.controlcore.yourcompany.com
      secretName: controlcore-tls
  rules:
    - host: console.controlcore.yourcompany.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: controlcore-console
                port:
                  number: 3000
    - host: api.controlcore.yourcompany.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: controlcore-api
                port:
                  number: 8082
kubectl apply -f ingress-advanced.yaml
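The limit-rps and limit-burst-multiplier annotations combine multiplicatively: the burst bucket absorbs short spikes above the sustained rate. A sketch of the effective limits from the values above:

```shell
# Sketch: effective burst bucket for the ingress rate-limit annotations above.
rps=1000          # nginx.ingress.kubernetes.io/limit-rps
burst_mult=5      # nginx.ingress.kubernetes.io/limit-burst-multiplier
burst=$(( rps * burst_mult ))
echo "sustained: ${rps} req/s, burst bucket: ${burst} requests"
```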

Network Load Balancing (Layer 4)

For Bouncer/PEP Fleet:

# bouncer-nlb-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: controlcore-bouncer-nlb
  namespace: control-core
  annotations:
    # AWS NLB annotations
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "tcp"
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "60"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: "http"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: "/health"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "10"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-timeout: "5"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "2"
    
    # Azure Load Balancer annotations
    service.beta.kubernetes.io/azure-load-balancer-health-probe-protocol: "http"
    service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: "/health"
    
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local  # Preserve client IP
  selector:
    app: controlcore-bouncer
  ports:
    - name: http
      port: 80
      targetPort: 8080
      protocol: TCP
    - name: https
      port: 443
      targetPort: 8443
      protocol: TCP
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800  # 3 hours
kubectl apply -f bouncer-nlb-service.yaml

DNS-Based Load Balancing

AWS Route 53 Weighted Routing:

# Create weighted record sets for multi-region
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890ABC \
  --change-batch '{
    "Changes": [
      {
        "Action": "CREATE",
        "ResourceRecordSet": {
          "Name": "bouncer.controlcore.yourcompany.com",
          "Type": "CNAME",
          "SetIdentifier": "us-east-1",
          "Weight": 70,
          "TTL": 60,
          "ResourceRecords": [{"Value": "us-east-1-bouncer-lb.amazonaws.com"}]
        }
      },
      {
        "Action": "CREATE",
        "ResourceRecordSet": {
          "Name": "bouncer.controlcore.yourcompany.com",
          "Type": "CNAME",
          "SetIdentifier": "eu-west-1",
          "Weight": 30,
          "TTL": 60,
          "ResourceRecords": [{"Value": "eu-west-1-bouncer-lb.amazonaws.com"}]
        }
      }
    ]
  }'
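Route 53 resolves each weighted record in proportion to its weight over the sum of all weights for the name. A sketch of the split produced by the record set above:

```shell
# Sketch: expected traffic split from the Route 53 record weights above.
w_us=70; w_eu=30
total=$(( w_us + w_eu ))
us_pct=$(( 100 * w_us / total ))
eu_pct=$(( 100 * w_eu / total ))
echo "us-east-1: ${us_pct}%  eu-west-1: ${eu_pct}%"
```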

Geo-Routing for Global Deployment:

# US users to US region
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890ABC \
  --change-batch '{
    "Changes": [{
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "bouncer.controlcore.yourcompany.com",
        "Type": "CNAME",
        "SetIdentifier": "North America",
        "GeoLocation": {
          "ContinentCode": "NA"
        },
        "TTL": 60,
        "ResourceRecords": [{"Value": "us-east-1-bouncer-lb.amazonaws.com"}]
      }
    }]
  }'

# EU users to EU region
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890ABC \
  --change-batch '{
    "Changes": [{
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "bouncer.controlcore.yourcompany.com",
        "Type": "CNAME",
        "SetIdentifier": "Europe",
        "GeoLocation": {
          "ContinentCode": "EU"
        },
        "TTL": 60,
        "ResourceRecords": [{"Value": "eu-west-1-bouncer-lb.amazonaws.com"}]
      }
    }]
  }'

🤖 High Availability Configuration

Multi-AZ Deployment

Node Distribution:

# pod-topology-spread.yaml
apiVersion: v1
kind: Pod
metadata:
  name: controlcore-api
  labels:
    app: controlcore-api
spec:
  topologySpreadConstraints:
    # Spread across zones
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: controlcore-api
    
    # Spread across nodes
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: controlcore-api

Pod Disruption Budgets

# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: controlcore-api-pdb
  namespace: control-core
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: controlcore-api
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: controlcore-bouncer-pdb
  namespace: control-core
spec:
  minAvailable: 5
  selector:
    matchLabels:
      app: controlcore-bouncer
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: controlcore-policy-bridge-pdb
  namespace: control-core
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: controlcore-policy-bridge
kubectl apply -f pdb.yaml
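minAvailable bounds voluntary disruptions (node drains, rolling upgrades): the eviction API refuses requests that would drop the ready count below the floor. A sketch using the bouncer PDB's minAvailable: 5 and an assumed fleet of 10 running replicas:

```shell
# Sketch: voluntary disruptions a PDB permits at a point in time.
# minAvailable: 5 is from the bouncer PDB above; the 10-replica count is an assumption.
replicas=10
min_available=5
evictable=$(( replicas - min_available ))
echo "pods evictable right now: $evictable"
```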

Database High Availability

PostgreSQL Replication:

# postgresql-replication.yaml (if self-managed)
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: controlcore-db
  namespace: control-core
spec:
  instances: 3
  
  primaryUpdateStrategy: unsupervised
  
  postgresql:
    parameters:
      max_connections: "500"
      shared_buffers: "8GB"
      effective_cache_size: "24GB"
      maintenance_work_mem: "2GB"
      checkpoint_completion_target: "0.9"
      wal_buffers: "16MB"
      default_statistics_target: "100"
      random_page_cost: "1.1"
      effective_io_concurrency: "200"
      work_mem: "20MB"
      min_wal_size: "1GB"
      max_wal_size: "4GB"
      max_worker_processes: "8"
      max_parallel_workers_per_gather: "4"
      max_parallel_workers: "8"
      max_parallel_maintenance_workers: "4"
  
  bootstrap:
    initdb:
      database: control_core_db
      owner: controlcore
      secret:
        name: controlcore-db-secret
  
  storage:
    size: 500Gi
    storageClass: controlcore-fast-ssd
  
  backup:
    barmanObjectStore:
      destinationPath: s3://controlcore-backups/postgresql/
      s3Credentials:
        accessKeyId:
          name: aws-credentials
          key: access-key-id
        secretAccessKey:
          name: aws-credentials
          key: secret-access-key
      wal:
        compression: gzip
        maxParallel: 8
    retentionPolicy: "30d"
  
  monitoring:
    enabled: true

Redis Sentinel (if self-managed)

# redis-sentinel.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: redis-sentinel-config
  namespace: control-core
data:
  sentinel.conf: |
    sentinel monitor mymaster redis-0.redis 6379 2
    sentinel down-after-milliseconds mymaster 5000
    sentinel parallel-syncs mymaster 1
    sentinel failover-timeout mymaster 10000
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-sentinel
  namespace: control-core
spec:
  serviceName: redis-sentinel
  replicas: 3
  selector:
    matchLabels:
      app: redis-sentinel
  template:
    metadata:
      labels:
        app: redis-sentinel
    spec:
      containers:
        - name: sentinel
          image: redis:7-alpine
          command:
            - redis-sentinel
            - /etc/redis/sentinel.conf
          ports:
            - containerPort: 26379
              name: sentinel
          volumeMounts:
            - name: config
              mountPath: /etc/redis
      volumes:
        - name: config
          configMap:
            name: redis-sentinel-config

📌 SSL/TLS Configuration

Certificate Management with cert-manager

ClusterIssuer for Let's Encrypt:

# letsencrypt-issuer.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@yourcompany.com
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
      # HTTP-01 challenge
      - http01:
          ingress:
            class: nginx
      # DNS-01 challenge (for wildcard certs)
      - dns01:
          route53:
            region: us-east-1
            accessKeyID: <set-aws-access-key-id-from-secret-manager>
            secretAccessKeySecretRef:
              name: aws-credentials
              key: secret-access-key
kubectl apply -f letsencrypt-issuer.yaml

Certificate Resource:

# certificate.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: controlcore-tls
  namespace: control-core
spec:
  secretName: controlcore-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - controlcore.yourcompany.com
    - console.controlcore.yourcompany.com
    - api.controlcore.yourcompany.com
    - bouncer.controlcore.yourcompany.com
    - "*.controlcore.yourcompany.com"
  privateKey:
    algorithm: RSA
    size: 4096
kubectl apply -f certificate.yaml

# Check certificate status
kubectl get certificate -n control-core
kubectl describe certificate controlcore-tls -n control-core

mTLS Between Services

Service Mesh with Istio (optional):

# Install Istio
curl -L https://istio.io/downloadIstio | sh -
cd istio-*
export PATH=$PWD/bin:$PATH

# Install Istio with mTLS
istioctl install --set profile=default -y  # "default" is the profile recommended for production use

# Enable automatic sidecar injection
kubectl label namespace control-core istio-injection=enabled

# Apply strict mTLS policy
kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: control-core
spec:
  mtls:
    mode: STRICT
EOF

Policy Bridge Configuration

# policy-bridge-sync-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: policy-bridge-sync-config
  namespace: control-core
data:
  # Policy sync interval (how often to check for policy updates)
  POLICY_REPO_POLLING_INTERVAL: "30"  # seconds
  
  # Data source sync intervals (by type)
  POLICY_DATA_CONFIG_SYNC_INTERVAL: "60"  # seconds
  
  # WebSocket keep-alive
  POLICY_SYNC_KEEPALIVE: "30"  # seconds
  
  # Statistics reporting interval
  POLICY_SYNC_STATS_ENABLED: "true"
  POLICY_SYNC_STATS_INTERVAL: "60"  # seconds
  
  # Broadcast channel (load real credentials from a Secret; do not hard-code passwords in a ConfigMap)
  POLICY_SYNC_BROADCAST_URI: "postgres://controlcore:password@postgresql:5432/policy_bridge_db"
  
  # Client subscriptions
  POLICY_SYNC_CLIENT_TOKEN: "secure-client-token"
  POLICY_SYNC_RECONNECT_INTERVAL: "5"  # seconds
  POLICY_SYNC_MAX_RECONNECT_ATTEMPTS: "10"

Recommendations by Environment:

| Setting              | Development | Staging   | Production | High-Traffic Production |
|----------------------|-------------|-----------|------------|-------------------------|
| Policy Repo Polling  | 60s         | 30s       | 30s        | 30s                     |
| Data Source Sync     | 300s (5m)   | 120s (2m) | 60s (1m)   | 60s (1m)                |
| WebSocket Keep-Alive | 60s         | 30s       | 30s        | 15s                     |
| Statistics Interval  | 300s        | 60s       | 60s        | 30s                     |
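Polling intervals translate directly into steady-state load on the policy repository: each client issues one request per interval. A back-of-envelope sketch, assuming an illustrative 500-instance bouncer fleet:

```shell
# Back-of-envelope load implied by the polling intervals above.
# The 500-client fleet size is an assumption for illustration.
clients=500
interval_s=30        # POLICY_REPO_POLLING_INTERVAL
rps=$(( clients / interval_s ))
echo "policy repo polling load: ~${rps} req/s"
```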

Cache Settings

# cache-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cache-config
  namespace: control-core
data:
  # Policy cache (how long to cache compiled policies)
  POLICY_CACHE_TTL: "300"  # 5 minutes in seconds
  POLICY_CACHE_MAX_SIZE: "50000"  # entries
  
  # Decision cache (how long to cache authorization decisions)
  DECISION_CACHE_TTL: "60"  # 1 minute in seconds
  DECISION_CACHE_MAX_SIZE: "100000"  # entries
  
  # User context cache
  USER_CONTEXT_CACHE_TTL: "300"  # 5 minutes
  USER_CONTEXT_CACHE_MAX_SIZE: "50000"
  
  # Resource metadata cache
  RESOURCE_CACHE_TTL: "600"  # 10 minutes
  RESOURCE_CACHE_MAX_SIZE: "50000"
  
  # Cache eviction policy
  CACHE_EVICTION_POLICY: "lru"  # lru, lfu, or ttl
  
  # Cache warming (preload frequently accessed data)
  CACHE_WARMING_ENABLED: "true"
  CACHE_WARMING_INTERVAL: "3600"  # 1 hour

Recommendations by Load:

| Metric              | Low Load | Medium Load | High Load | Very High Load |
|---------------------|----------|-------------|-----------|----------------|
| Policy Cache TTL    | 10m      | 5m          | 5m        | 3m             |
| Decision Cache TTL  | 5m       | 1m          | 1m        | 30s            |
| Policy Cache Size   | 10,000   | 50,000      | 100,000   | 500,000        |
| Decision Cache Size | 50,000   | 100,000     | 500,000   | 1,000,000      |
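Cache sizes above are entry counts, so the memory cost depends on the average entry size. A rough estimate for a decision cache at the medium-load size, assuming ~512 bytes per cached decision (an assumption; measure your own payloads):

```shell
# Rough memory footprint for a decision cache at the "Medium Load" size.
# The 512-byte average entry size is an assumption; measure your own payloads.
entries=100000
avg_bytes=512
mib=$(( entries * avg_bytes / 1024 / 1024 ))
echo "approx footprint: ${mib} MiB"
```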

Performance Tuning

# performance-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: performance-config
  namespace: control-core
data:
  # API workers
  API_WORKERS: "8"  # per pod
  API_WORKER_CLASS: "uvicorn.workers.UvicornWorker"
  API_WORKER_CONNECTIONS: "1000"
  API_TIMEOUT: "60"
  API_KEEPALIVE: "5"
  
  # Bouncer/PEP workers
  BOUNCER_WORKER_THREADS: "16"  # per pod
  BOUNCER_CONNECTION_POOL_SIZE: "200"
  BOUNCER_MAX_CONCURRENT_REQUESTS: "10000"
  BOUNCER_REQUEST_TIMEOUT: "30s"
  
  # Database connection pool
  DB_POOL_SIZE: "50"  # per API pod
  DB_MAX_OVERFLOW: "20"
  DB_POOL_TIMEOUT: "30"
  DB_POOL_RECYCLE: "3600"
  DB_POOL_PRE_PING: "true"
  
  # Redis connection pool
  REDIS_POOL_SIZE: "50"  # per pod
  REDIS_MAX_CONNECTIONS: "100"
  REDIS_SOCKET_KEEPALIVE: "true"
  REDIS_SOCKET_KEEPALIVE_OPTIONS: "1,10,3"

🔒 Runtime Security

Pod Security Policies

Note: PodSecurityPolicy (policy/v1beta1) was removed in Kubernetes 1.25. On 1.25+ clusters, enforce equivalent restrictions with Pod Security Admission namespace labels (for example, pod-security.kubernetes.io/enforce: restricted). The manifest below applies only to clusters running Kubernetes 1.24 or earlier.

# pod-security-policy.yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: controlcore-restricted
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
    - ALL
  volumes:
    - 'configMap'
    - 'emptyDir'
    - 'projected'
    - 'secret'
    - 'downwardAPI'
    - 'persistentVolumeClaim'
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    rule: 'MustRunAsNonRoot'
  seLinux:
    rule: 'RunAsAny'
  supplementalGroups:
    rule: 'RunAsAny'
  fsGroup:
    rule: 'RunAsAny'
  readOnlyRootFilesystem: false

Network Policies

# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: controlcore-network-policy
  namespace: control-core
spec:
  podSelector:
    matchLabels:
      app: controlcore
  policyTypes:
    - Ingress
    - Egress
  
  ingress:
    # Allow from ingress controller
    - from:
        - namespaceSelector:
            matchLabels:
              name: ingress-nginx
      ports:
        - protocol: TCP
          port: 3000  # Console
        - protocol: TCP
          port: 8082  # API
        - protocol: TCP
          port: 8080  # Bouncer
    
    # Allow internal communication
    - from:
        - podSelector:
            matchLabels:
              app: controlcore
      ports:
        - protocol: TCP
          port: 3000
        - protocol: TCP
          port: 8082
        - protocol: TCP
          port: 8080
        - protocol: TCP
          port: 7000  # Policy Bridge
  
  egress:
    # Allow to database
    - to:
        - podSelector:
            matchLabels:
              app: postgresql
      ports:
        - protocol: TCP
          port: 5432
    
    # Allow to Redis
    - to:
        - podSelector:
            matchLabels:
              app: redis
      ports:
        - protocol: TCP
          port: 6379
    
    # Allow DNS
    - to:
        - namespaceSelector:
            matchLabels:
              name: kube-system
        - podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
    
    # Allow HTTPS egress (for external APIs)
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 443
kubectl apply -f network-policy.yaml

Secrets Management

Using AWS Secrets Manager:

# external-secrets.yaml
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secretsmanager
  namespace: control-core
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: controlcore-sa
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: controlcore-secrets
  namespace: control-core
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secretsmanager
    kind: SecretStore
  target:
    name: controlcore-secrets
    creationPolicy: Owner
  data:
    - secretKey: database-password
      remoteRef:
        key: controlcore/database
        property: password
    - secretKey: redis-password
      remoteRef:
        key: controlcore/redis
        property: password
    - secretKey: jwt-secret
      remoteRef:
        key: controlcore/jwt
        property: secret

Using HashiCorp Vault:

# vault-integration.yaml
apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultAuth
metadata:
  name: controlcore-vault-auth
  namespace: control-core
spec:
  method: kubernetes
  mount: kubernetes
  kubernetes:
    role: controlcore
    serviceAccount: controlcore-sa
---
apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultStaticSecret
metadata:
  name: controlcore-secrets
  namespace: control-core
spec:
  type: kv-v2
  mount: secret
  path: controlcore/production
  destination:
    name: controlcore-secrets
    create: true
  refreshAfter: 30s
  vaultAuthRef: controlcore-vault-auth

📌 SAML and SSO Configuration

Auth0 Integration

# auth0-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: auth0-config
  namespace: control-core
data:
  AUTH0_DOMAIN: "yourcompany.auth0.com"
  AUTH0_CLIENT_ID: "your-client-id"
  AUTH0_AUDIENCE: "https://api.controlcore.yourcompany.com"
  AUTH0_SCOPE: "openid profile email"
  AUTH0_CALLBACK_URL: "https://console.controlcore.yourcompany.com/callback"
---
apiVersion: v1
kind: Secret
metadata:
  name: auth0-secret
  namespace: control-core
type: Opaque
stringData:
  client_secret: "your-auth0-client-secret"

SAML SSO Configuration

# saml-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: saml-config
  namespace: control-core
data:
  SAML_ENABLED: "true"
  SAML_IDP_ENTITY_ID: "https://idp.yourcompany.com/saml"
  SAML_IDP_SSO_URL: "https://idp.yourcompany.com/saml/sso"
  SAML_IDP_SLO_URL: "https://idp.yourcompany.com/saml/slo"
  SAML_SP_ENTITY_ID: "https://console.controlcore.yourcompany.com"
  SAML_SP_ACS_URL: "https://console.controlcore.yourcompany.com/saml/acs"
  SAML_SP_SLO_URL: "https://console.controlcore.yourcompany.com/saml/slo"
  
  # Attribute mapping
  SAML_ATTR_EMAIL: "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress"
  SAML_ATTR_FIRSTNAME: "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/givenname"
  SAML_ATTR_LASTNAME: "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/surname"
  SAML_ATTR_GROUPS: "http://schemas.xmlsoap.org/claims/Group"
---
apiVersion: v1
kind: Secret
metadata:
  name: saml-certificates
  namespace: control-core
type: Opaque
data:
  idp_cert.pem: <base64-encoded-idp-certificate>
  sp_key.pem: <base64-encoded-sp-private-key>
  sp_cert.pem: <base64-encoded-sp-certificate>

SAML Providers:

  • Okta: Configure SAML 2.0 app
  • Azure AD: Enterprise Application with SAML SSO
  • OneLogin: SAML SSO application
  • Google Workspace: Custom SAML app
  • Ping Identity: SAML 2.0 connection

👁️ Monitoring and Observability

Prometheus Metrics

Service Monitors:

# servicemonitors.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: controlcore-api
  namespace: control-core
spec:
  selector:
    matchLabels:
      app: controlcore-api
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: controlcore-bouncer
  namespace: control-core
spec:
  selector:
    matchLabels:
      app: controlcore-bouncer
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s

Key Metrics to Monitor:

# API Metrics
http_requests_total - Total HTTP requests
http_request_duration_seconds - Request latency
http_requests_in_flight - Current requests being processed
policy_evaluations_total - Total policy evaluations
policy_evaluation_duration_seconds - Policy evaluation time
cache_hits_total - Cache hits
cache_misses_total - Cache misses

# Bouncer Metrics
bouncer_requests_total - Total requests through bouncer
bouncer_allowed_requests - Allowed requests
bouncer_denied_requests - Denied requests
bouncer_policy_sync_timestamp - Last policy sync time
bouncer_target_app_reachable - Target app health (1=healthy, 0=unhealthy)

# Database Metrics
db_connections_active - Active database connections
db_connections_idle - Idle database connections
db_query_duration_seconds - Query execution time

# Policy Bridge Metrics (underscores, not hyphens: Prometheus metric names cannot contain "-")
policy_bridge_connected_clients - Number of connected clients
policy_bridge_policy_updates_total - Total policy updates distributed
policy_bridge_data_updates_total - Total data updates distributed
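cache_hits_total and cache_misses_total are most useful combined into a hit ratio (in PromQL: rate of hits divided by the rate of hits plus misses). A sketch of the computation with illustrative sample counts:

```shell
# Sketch of the hit-ratio computation behind cache_hits_total / cache_misses_total.
# Sample counts are illustrative.
hits=9500; misses=500
ratio=$(( 100 * hits / (hits + misses) ))
echo "cache hit ratio: ${ratio}%"
```

A sustained ratio well below your baseline usually means the cache is undersized or the TTL is too short for the workload.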

Grafana Dashboards

Import Pre-built Dashboard:

# Get dashboard JSON from Control Core
curl -o controlcore-dashboard.json \
  https://downloads.controlcore.io/dashboards/enterprise-v2.json

# Import to Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# Access Grafana at http://localhost:3000
# Import dashboard via UI or API

Logging with ELK/EFK Stack

# Install Elasticsearch
helm repo add elastic https://helm.elastic.co
helm install elasticsearch elastic/elasticsearch \
  --namespace logging \
  --create-namespace \
  --set replicas=3 \
  --set resources.requests.memory=4Gi \
  --set volumeClaimTemplate.resources.requests.storage=100Gi

# Install Kibana
helm install kibana elastic/kibana \
  --namespace logging \
  --set service.type=LoadBalancer

# Install Fluentd (or Fluent Bit for lighter footprint)
helm repo add fluent https://fluent.github.io/helm-charts
helm install fluentd fluent/fluentd \
  --namespace logging \
  --set elasticsearch.host=elasticsearch-master \
  --set elasticsearch.port=9200

Alerting

Prometheus Alert Rules:

# alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: controlcore-alerts
  namespace: control-core
spec:
  groups:
    - name: controlcore.rules
      interval: 30s
      rules:
        # High error rate
        - alert: HighErrorRate
          expr: |
            sum by (instance) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (instance) (rate(http_requests_total[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate detected"
            description: "{{ $labels.instance }} has error rate of {{ $value }}"
        
        # High latency
        - alert: HighLatency
          expr: |
            histogram_quantile(0.95, 
              rate(http_request_duration_seconds_bucket[5m])
            ) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High latency detected"
            description: "P95 latency is {{ $value }}s"
        
        # Policy sync failure
        - alert: PolicySyncFailure
          expr: |
            time() - bouncer_policy_sync_timestamp > 600
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Policy sync failure"
            description: "Bouncer {{ $labels.instance }} hasn't synced in 10 minutes"
        
        # Pod not ready
        - alert: PodNotReady
          expr: |
            kube_pod_status_phase{namespace="control-core",phase=~"Pending|Failed|Unknown"} == 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod not ready"
            description: "Pod {{ $labels.pod }} is not in Running state"
        
        # High memory usage
        - alert: HighMemoryUsage
          expr: |
            container_memory_working_set_bytes{namespace="control-core"}
            / container_spec_memory_limit_bytes{namespace="control-core"} > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High memory usage"
            description: "Container {{ $labels.container }} memory usage is {{ $value }}"

Alert Manager Configuration:

# alertmanager-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'default'
      routes:
        - match:
            severity: critical
          receiver: 'critical'
          continue: true
        - match:
            severity: warning
          receiver: 'warning'
    
    receivers:
      - name: 'default'
        webhook_configs:
          - url: 'http://alertmanager-webhook:5000/alerts'
      
      - name: 'critical'
        email_configs:
          - to: 'ops-critical@yourcompany.com'
            from: 'alertmanager@yourcompany.com'
            smarthost: 'smtp.yourcompany.com:587'
            auth_username: 'alertmanager@yourcompany.com'
            auth_password: 'password'
        pagerduty_configs:
          - service_key: 'your-pagerduty-key'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
            channel: '#ops-critical'
            title: 'Critical Alert'
      
      - name: 'warning'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
            channel: '#ops-warnings'
            title: 'Warning Alert'

📌 Backup and Disaster Recovery

Automated Backups

Velero for Kubernetes Resources:

# Install Velero
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm install velero vmware-tanzu/velero \
  --namespace velero \
  --create-namespace \
  --set configuration.provider=aws \
  --set configuration.backupStorageLocation.bucket=controlcore-backups \
  --set configuration.backupStorageLocation.config.region=us-east-1 \
  --set configuration.volumeSnapshotLocation.config.region=us-east-1 \
  --set initContainers[0].name=velero-plugin-for-aws \
  --set initContainers[0].image=velero/velero-plugin-for-aws:v1.8.0 \
  --set initContainers[0].volumeMounts[0].mountPath=/target \
  --set initContainers[0].volumeMounts[0].name=plugins

# Create backup schedule
velero schedule create control-core-daily \
  --schedule="0 2 * * *" \
  --include-namespaces control-core \
  --ttl 720h0m0s

# Create on-demand backup
velero backup create control-core-backup-$(date +%Y%m%d) \
  --include-namespaces control-core \
  --wait

Database Backups:

# database-backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: control-core
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: postgres:15-alpine
              env:
                - name: PGHOST
                  value: "postgresql"
                - name: PGUSER
                  value: "controlcore"
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: controlcore-secrets
                      key: database-password
                - name: PGDATABASE
                  value: "control_core_db"
                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: aws-credentials
                      key: access-key-id
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: aws-credentials
                      key: secret-access-key
              command:
                - /bin/sh
                - -c
                - |
                  BACKUP_FILE="controlcore-db-$(date +%Y%m%d-%H%M%S).sql.gz"
                  pg_dump | gzip > /tmp/$BACKUP_FILE
                  aws s3 cp /tmp/$BACKUP_FILE s3://controlcore-backups/database/$BACKUP_FILE
                  echo "Backup completed: $BACKUP_FILE"
          restartPolicy: OnFailure

Disaster Recovery Procedures

Recovery Runbook:

# 1. Restore Kubernetes resources
velero restore create --from-backup control-core-backup-20250125

# 2. Restore database
aws s3 cp s3://controlcore-backups/database/controlcore-db-20250125-020000.sql.gz .
gunzip controlcore-db-20250125-020000.sql.gz
kubectl exec -i postgresql-0 -n control-core -- psql -U controlcore -d control_core_db < controlcore-db-20250125-020000.sql

# 3. Verify services
kubectl get pods -n control-core
kubectl get svc -n control-core

# 4. Test health endpoints
curl https://console.controlcore.yourcompany.com/health
curl https://api.controlcore.yourcompany.com/api/v1/health

# 5. Verify policy sync
kubectl logs -n control-core -l app=controlcore-policy-bridge

# 6. Test policy evaluation
curl -X POST https://bouncer.controlcore.yourcompany.com/v1/data/app/authorization/allow \
  -H "Content-Type: application/json" \
  -d '{"input": {"user": {"id": "test"}, "resource": {"id": "test"}, "action": "read"}}'

🚀 Troubleshooting Enterprise Deployments

Common Issues

Pod Scheduling Failures:

# Check node resources
kubectl top nodes

# Check pod status
kubectl describe pod <pod-name> -n control-core

# Check events
kubectl get events -n control-core --sort-by='.lastTimestamp'

# Common solutions:
# 1. Scale cluster (add more nodes)
# 2. Adjust resource requests/limits
# 3. Check PodDisruptionBudget settings

Database Connection Pool Exhaustion:

# Check active connections
kubectl exec -it postgresql-0 -n control-core -- psql -U controlcore -d control_core_db -c \
  "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"

# Solution: Increase pool size in values.yaml
database:
  pool:
    size: 100  # Increase from 50
    max_overflow: 40  # Increase from 20

High Memory Usage:

# Check memory usage
kubectl top pods -n control-core

# Identify memory hogs
kubectl exec -it <pod-name> -n control-core -- top

# Solutions:
# 1. Increase cache eviction rate
# 2. Reduce cache sizes
# 3. Add more memory to pods
# 4. Scale horizontally instead of vertically

📞 Support and Resources

📌 Next Steps


Congratulations! You now have a production-ready, enterprise-scale Control Core deployment with high availability, auto-scaling, and comprehensive monitoring.