🚀 Enterprise Deployment Guide
This guide covers enterprise-scale deployment of Control Core with auto-scaling, high availability, load balancing, and advanced configuration. It is designed for organizations that require maximum performance, reliability, and scalability. The same Helm/Kubernetes approach works on any cloud or on-premises cluster (AWS EKS, Azure AKS, GCP GKE, or your own Kubernetes). DevOps teams: follow the 30-minute runbook, and see also the companion pages on what to deploy, before you start, and where to run it.
🚀 Developer Portal after deploy
In Enterprise, the Developer Portal is served by your self-hosted Control Plane API deployment (control-plane-api) and remains inside your infrastructure:
- URL: https://<your-control-plane-host>/devdocs
- OpenAPI JSON: https://<your-control-plane-host>/openapi.json
Post-deploy checklist:
- Open /devdocs and verify the title reads "Control Core - Developer".
- Use the Swagger onboarding endpoints to generate a token and environment API keys.
- Validate internal platform health with GET /health/ready before developer onboarding.
- Optionally mirror openapi.json into internal API catalogs and SDK generation pipelines.
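The checklist above can be scripted. A minimal sketch in Python, assuming the title shown in this guide and that /health/ready appears in the published spec (both are assumptions to adjust for your deployment):

```python
import json
from urllib.request import urlopen  # used only in the commented live fetch below

EXPECTED_TITLE = "Control Core - Developer"  # title this guide expects on /devdocs

def validate_openapi(spec: dict) -> list:
    """Return a list of problems found in a fetched openapi.json document."""
    problems = []
    title = spec.get("info", {}).get("title", "")
    if title != EXPECTED_TITLE:
        problems.append("unexpected title: %r" % title)
    if "/health/ready" not in spec.get("paths", {}):
        problems.append("readiness endpoint /health/ready not documented")
    return problems

# To run against a live deployment (replace the placeholder host):
# with urlopen("https://<your-control-plane-host>/openapi.json") as resp:
#     print(validate_openapi(json.load(resp)))
```

An empty list means the portal is serving the expected spec; wire this into your post-deploy pipeline alongside the GET /health/ready probe.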
📌 Overview
Enterprise deployment is ideal for:
- Large organizations (100+ users, 1M+ policy evaluations/day)
- High-traffic applications requiring sub-10ms latency
- Multi-region deployments with global reach
- Strict compliance and audit requirements
- Mission-critical applications requiring 99.99% uptime
- Organizations with dedicated DevOps/SRE teams
🏗️ Architecture Patterns
Standard Enterprise Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Load Balancer (Layer 7) │
│ (AWS ALB / NGINX / HAProxy) │
│ SSL Termination │ Health Checks │ Routing │
└────────────────────────────┬────────────────────────────────────┘
│
┌────────────────────┼────────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Console #1 │ │ Console #2 │ │ Console #3 │
│ (React/TS) │ │ (React/TS) │ │ (React/TS) │
│ Port 3000 │ │ Port 3000 │ │ Port 3000 │
└──────────────┘ └──────────────┘ └──────────────┘
│
┌────────────────────┼────────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ API #1 │ │ API #2 │ │ API #3 │
│ (FastAPI) │ │ (FastAPI) │ │ (FastAPI) │
│ Port 8082 │ │ Port 8082 │ │ Port 8082 │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
└───────────────────┼────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ PostgreSQL Primary-Replica │
│ Primary (Write) + 2 Read Replicas │
└──────────────────────────────────────┘
│
┌──────────────────┼──────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐
│ Policy Bridge #1 │  │ Policy Bridge #2 │  │ Policy Bridge #3 │
│     (Leader)     │  │    (Follower)    │  │    (Follower)    │
└──────────────────┘  └──────────────────┘  └──────────────────┘
│ │ │
└──────────────────┼──────────────────┘
│ Policy Distribution
│
┌──────────────────────────┼──────────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────┐
│ Load Balancer (Bouncer/PEP Fleet) │
│ (DNS Round-Robin / AWS NLB / HAProxy) │
└──────────┬──────────────────────────────────────────┘
│
┌──────────┼──────────┬──────────┬──────────┬─────────┐
│ │ │ │ │ │
▼ ▼ ▼ ▼ ▼ ▼
┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌─────┐
│PEP1│ │PEP2│ │PEP3│ │PEP4│ │PEP5│ │PEP-N│
└────┘ └────┘ └────┘ └────┘ └────┘ └─────┘
│ │ │ │ │ │
└────────┴────────┴────────┴────────┴────────┘
│
▼
┌──────────────────┐
│ Protected Apps │
└──────────────────┘
Multi-Region Architecture
Region: US-EAST-1 Region: EU-WEST-1
┌──────────────────────────┐ ┌──────────────────────────┐
│ Control Plane (Primary) │◄───────►│ Control Plane (Replica) │
│ - Console x3 │ Sync │ - Console x3 │
│ - API x5 │ │ - API x5 │
│ - Policy Bridge x3 │ │ - Policy Bridge x3 │
│ - DB Primary + Replica │ │ - DB Read Replicas │
│ - PEP Fleet (10) │ │ - PEP Fleet (10) │
└──────────────────────────┘ └──────────────────────────┘
│ │
│ │
▼ ▼
Protected Apps Protected Apps
(US Users) (EU Users)
Region: ASIA-PACIFIC-1
┌──────────────────────────┐
│ Control Plane (Replica) │
│ - Console x3 │
│ - API x5 │
│ - Policy Bridge x3 │
│ - DB Read Replicas │
│ - PEP Fleet (10) │
└──────────────────────────┘
│
▼
Protected Apps
(APAC Users)
📌 Prerequisites
Infrastructure Requirements
Minimum Production Configuration:
- Kubernetes Cluster: v1.24+
- Nodes: 6 nodes minimum (3 control plane, 3 workers)
- Memory: 16GB RAM per node (96GB total minimum)
- CPU: 4 cores per node (24 cores total minimum)
- Storage: 500GB SSD with high IOPS (3000+ IOPS recommended)
- Network: 10 Gbps between nodes, 1 Gbps external
Recommended Production Configuration:
- Nodes: 12+ nodes (3 control plane, 9+ workers)
- Memory: 32GB RAM per node
- CPU: 8 cores per node
- Storage: 1TB NVMe SSD with 10,000+ IOPS
- Network: 25 Gbps between nodes, 10 Gbps external
Software Requirements
- Kubernetes: 1.24 or higher
- Helm: 3.0 or higher
- kubectl: Matching cluster version
- cert-manager: For SSL certificate management
- Ingress Controller: NGINX, Traefik, or cloud provider (ALB, etc.)
- Metrics Server: For HPA (Horizontal Pod Autoscaler)
- Prometheus: For monitoring (optional but recommended)
Cloud Provider Requirements
AWS:
- EKS cluster or self-managed Kubernetes
- RDS PostgreSQL (db.r6g.xlarge or higher)
- ElastiCache Redis (cache.r6g.large or higher)
- Application Load Balancer (ALB)
- Network Load Balancer (NLB)
- Route 53 for DNS
- S3 for backups
- CloudWatch for logging
Azure:
- AKS cluster
- Azure Database for PostgreSQL (Flexible Server, Standard_D4s_v3+)
- Azure Cache for Redis (Standard C1+)
- Azure Load Balancer
- Azure DNS
- Azure Blob Storage for backups
- Azure Monitor for logging
Google Cloud:
- GKE cluster
- Cloud SQL for PostgreSQL (db-custom-4-16384+)
- Memorystore for Redis (M1 tier+)
- Cloud Load Balancing
- Cloud DNS
- Cloud Storage for backups
- Cloud Logging
📦 Installation
Step 1: Prepare Kubernetes Cluster
Create EKS Cluster (AWS example):
# Install eksctl
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin
# Create cluster
eksctl create cluster \
--name controlcore-production \
--region us-east-1 \
--version 1.28 \
--nodegroup-name standard-workers \
--node-type m5.2xlarge \
--nodes 6 \
--nodes-min 6 \
--nodes-max 20 \
--managed \
--with-oidc \
--ssh-access \
--ssh-public-key ~/.ssh/id_rsa.pub \
--enable-ssm
# Verify cluster
kubectl get nodes
Create AKS Cluster (Azure example):
# Create resource group
az group create --name controlcore-production --location eastus
# Create AKS cluster
az aks create \
--resource-group controlcore-production \
--name controlcore-cluster \
--kubernetes-version 1.28.0 \
--node-count 6 \
--node-vm-size Standard_D8s_v3 \
--enable-managed-identity \
--enable-cluster-autoscaler \
--min-count 6 \
--max-count 20 \
--network-plugin azure \
--load-balancer-sku standard \
--generate-ssh-keys
# Get credentials
az aks get-credentials --resource-group controlcore-production --name controlcore-cluster
# Verify
kubectl get nodes
Create GKE Cluster (GCP example):
# Set project and region
gcloud config set project your-project-id
gcloud config set compute/region us-central1
# Create GKE cluster
gcloud container clusters create controlcore-cluster \
--region us-central1 \
--cluster-version 1.28 \
--machine-type n2-standard-8 \
--num-nodes 2 \
--min-nodes 2 \
--max-nodes 7 \
--enable-autoscaling \
--enable-autorepair \
--enable-autoupgrade \
--disk-type pd-ssd \
--disk-size 100 \
--enable-ip-alias \
--enable-stackdriver-kubernetes \
--addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver \
--workload-pool=your-project-id.svc.id.goog \
--enable-shielded-nodes \
--shielded-secure-boot \
--shielded-integrity-monitoring
# Alternative: GKE Autopilot (fully managed)
gcloud container clusters create-auto controlcore-cluster \
--region us-central1 \
--cluster-version 1.28
# Get credentials
gcloud container clusters get-credentials controlcore-cluster --region us-central1
# Verify
kubectl get nodes
Step 2: Install Prerequisites
Install cert-manager:
# Add Jetstack Helm repository
helm repo add jetstack https://charts.jetstack.io
helm repo update
# Install cert-manager
kubectl create namespace cert-manager
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--version v1.13.0 \
--set installCRDs=true
# Verify installation
kubectl get pods -n cert-manager
Install NGINX Ingress Controller:
# Add NGINX Helm repository
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
# Install NGINX Ingress
helm install nginx-ingress ingress-nginx/ingress-nginx \
--namespace ingress-nginx \
--create-namespace \
--set controller.replicaCount=3 \
--set controller.service.type=LoadBalancer \
--set controller.metrics.enabled=true \
--set-string controller.podAnnotations."prometheus\.io/scrape"="true"
# Get Load Balancer IP
kubectl get svc -n ingress-nginx
Install Metrics Server (for HPA):
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Verify
kubectl get deployment metrics-server -n kube-system
Install Prometheus (optional but recommended):
# Add Prometheus Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install Prometheus
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set prometheus.prometheusSpec.retention=30d \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi \
--set grafana.enabled=true \
--set grafana.adminPassword=ChangeMeSecurePassword
# Access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
Step 3: Configure Storage
Create Storage Class (AWS EBS example):
# storage-class.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: controlcore-fast-ssd
provisioner: ebs.csi.aws.com
parameters:
type: gp3
iops: "3000"
throughput: "125"
encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
kubectl apply -f storage-class.yaml
GCP Persistent Disk Storage Class:
# storage-class-gcp.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: controlcore-fast-ssd
provisioner: pd.csi.storage.gke.io
parameters:
type: pd-ssd
replication-type: regional-pd # For HA across zones
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
kubectl apply -f storage-class-gcp.yaml
Azure Disk Storage Class:
# storage-class-azure.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: controlcore-fast-ssd
provisioner: disk.csi.azure.com
parameters:
skuName: Premium_LRS # Premium SSD
kind: Managed
cachingMode: ReadOnly
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Retain
kubectl apply -f storage-class-azure.yaml
Step 4: Create Namespace and Secrets
# Create namespace
kubectl create namespace control-core
# Create secrets
kubectl create secret generic controlcore-secrets \
--namespace control-core \
--from-literal=database-password='SecureDBPassword123!' \
--from-literal=redis-password='SecureRedisPassword123!' \
--from-literal=jwt-secret='SecureJWTSecret123!' \
--from-literal=admin-password='SecureAdminPassword123!'
# Create TLS secret (if using custom certificate)
kubectl create secret tls controlcore-tls \
--namespace control-core \
--cert=path/to/tls.crt \
--key=path/to/tls.key
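The literal passwords above are placeholders; generate strong values before creating the secret. A sketch using Python's standard secrets module (the key names mirror the kubectl command above):

```python
import secrets
import string

def generate_password(length: int = 32) -> str:
    """Generate a random alphanumeric password suitable for --from-literal values."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))

# Print name=value pairs you can paste into the kubectl create secret command.
for name in ("database-password", "redis-password", "jwt-secret", "admin-password"):
    print(f"{name}={generate_password()}")
```

Alphanumeric output avoids shell-quoting surprises when the values are passed on the command line; store the generated values in your secret manager, not in shell history.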
Step 5: Deploy PostgreSQL (High Availability)
Using Helm (Bitnami PostgreSQL HA):
# Add Bitnami repository
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
# Install PostgreSQL with replication
helm install postgresql bitnami/postgresql-ha \
--namespace control-core \
--set postgresql.replicaCount=3 \
--set postgresql.resources.requests.memory=8Gi \
--set postgresql.resources.requests.cpu=2000m \
--set postgresql.resources.limits.memory=16Gi \
--set postgresql.resources.limits.cpu=4000m \
--set pgpool.replicaCount=3 \
--set pgpool.resources.requests.memory=2Gi \
--set pgpool.resources.requests.cpu=1000m \
--set persistence.size=200Gi \
--set persistence.storageClass=controlcore-fast-ssd \
--set metrics.enabled=true \
--set volumePermissions.enabled=true
# Or use managed database (AWS RDS example)
# Create RDS instance via AWS Console or CLI:
aws rds create-db-instance \
--db-instance-identifier controlcore-db \
--db-instance-class db.r6g.2xlarge \
--engine postgres \
--engine-version 15.3 \
--master-username controlcore \
--master-user-password SecurePassword123! \
--allocated-storage 500 \
--storage-type gp3 \
--iops 12000 \
--multi-az \
--backup-retention-period 30 \
--preferred-backup-window "03:00-04:00" \
--preferred-maintenance-window "mon:04:00-mon:05:00" \
--enable-performance-insights \
--enable-cloudwatch-logs-exports postgresql
# Google Cloud SQL (GCP example)
gcloud sql instances create controlcore-db \
--database-version=POSTGRES_15 \
--tier=db-custom-8-32768 \
--region=us-central1 \
--network=default \
--availability-type=REGIONAL \
--storage-type=SSD \
--storage-size=500GB \
--storage-auto-increase \
--backup-start-time=03:00 \
--maintenance-window-day=MON \
--maintenance-window-hour=04 \
--enable-bin-log \
--retained-backups-count=30 \
--root-password=SecurePassword123!
# Set database flags for performance
gcloud sql instances patch controlcore-db \
--database-flags=shared_buffers=8GB,max_connections=500,effective_cache_size=24GB
# Create database
gcloud sql databases create control_core_db --instance=controlcore-db
# Create user
gcloud sql users create controlcore \
--instance=controlcore-db \
--password=SecurePassword123!
# Azure Database for PostgreSQL (Azure example)
az postgres flexible-server create \
--resource-group controlcore-production \
--name controlcore-db \
--location eastus \
--admin-user controlcore \
--admin-password SecurePassword123! \
--sku-name Standard_D8s_v3 \
--tier GeneralPurpose \
--version 15 \
--storage-size 512 \
--backup-retention 30 \
--geo-redundant-backup Enabled \
--high-availability ZoneRedundant \
--public-access 0.0.0.0-255.255.255.255 # example only: restrict to your VNet/office CIDRs in production
# Create database
az postgres flexible-server db create \
--resource-group controlcore-production \
--server-name controlcore-db \
--database-name control_core_db
# Configure server parameters
az postgres flexible-server parameter set \
--resource-group controlcore-production \
--server-name controlcore-db \
--name shared_buffers \
--value 1048576 # 8GB (shared_buffers is measured in 8kB pages)
az postgres flexible-server parameter set \
--resource-group controlcore-production \
--server-name controlcore-db \
--name max_connections \
--value 500
Database Configuration (cloud-agnostic):
# database-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: database-config
namespace: control-core
data:
# AWS RDS
# host: "controlcore-db.cluster-xxxxx.us-east-1.rds.amazonaws.com"
# GCP Cloud SQL
# host: "10.x.x.x" # Private IP or Cloud SQL Proxy
# Azure Database
# host: "controlcore-db.postgres.database.azure.com"
host: "your-database-host"
port: "5432"
database: "control_core_db"
pool_size: "50"
max_overflow: "20"
pool_timeout: "30"
pool_recycle: "3600"
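The pool settings above bound how many connections each API replica can hold open. A sketch of how they combine (the values mirror the ConfigMap; the URL format assumes a standard PostgreSQL DSN, and the password should come from the mounted secret, never the ConfigMap):

```python
cfg = {  # values from the database-config ConfigMap above
    "host": "your-database-host", "port": "5432", "database": "control_core_db",
    "pool_size": "50", "max_overflow": "20",
}

def dsn(cfg: dict, user: str = "controlcore") -> str:
    """Build a PostgreSQL URL; substitute the real password from the secret."""
    return f"postgresql://{user}:<from-secret>@{cfg['host']}:{cfg['port']}/{cfg['database']}"

def max_connections_per_replica(cfg: dict) -> int:
    """Each replica can open at most pool_size + max_overflow connections."""
    return int(cfg["pool_size"]) + int(cfg["max_overflow"])

print(dsn(cfg))
print(max_connections_per_replica(cfg))  # → 70
```

With five API replicas this allows up to 350 database connections, which fits under the max_connections=500 configured for the managed databases earlier; re-check this ceiling whenever you raise the API maxReplicas.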
Step 6: Deploy Redis (High Availability)
Using Helm (Redis Cluster):
# Install Redis cluster
helm install redis bitnami/redis-cluster \
--namespace control-core \
--set cluster.nodes=6 \
--set cluster.replicas=1 \
--set password=SecureRedisPassword123! \
--set persistence.size=50Gi \
--set persistence.storageClass=controlcore-fast-ssd \
--set resources.requests.memory=4Gi \
--set resources.requests.cpu=1000m \
--set metrics.enabled=true
# Or use managed cache (AWS ElastiCache example)
aws elasticache create-replication-group \
--replication-group-id controlcore-cache \
--replication-group-description "Control Core Redis Cluster" \
--engine redis \
--cache-node-type cache.r6g.xlarge \
--num-cache-clusters 3 \
--automatic-failover-enabled \
--at-rest-encryption-enabled \
--transit-encryption-enabled \
--auth-token SecureRedisPassword123! \
--snapshot-retention-limit 7 \
--snapshot-window "03:00-05:00"
# GCP Memorystore for Redis
gcloud redis instances create controlcore-cache \
--size=5 \
--region=us-central1 \
--tier=standard \
--redis-version=redis_7_0 \
--enable-auth \
--auth-string=SecureRedisPassword123! \
--transit-encryption-mode=SERVER_AUTHENTICATION \
--replica-count=2 \
--read-replicas-mode=READ_REPLICAS_ENABLED \
--persistence-mode=RDB \
--rdb-snapshot-period=12h \
--rdb-snapshot-start-time=03:00
# Get connection info
gcloud redis instances describe controlcore-cache --region=us-central1
# Azure Cache for Redis
az redis create \
--resource-group controlcore-production \
--name controlcore-cache \
--location eastus \
--sku Premium \
--vm-size P2 \
--enable-non-ssl-port false \
--minimum-tls-version 1.2 \
--redis-configuration maxmemory-policy=allkeys-lru \
--replicas-per-primary 2 \
--zones 1 2 3 \
--shard-count 2
# Get connection info
az redis list-keys \
--resource-group controlcore-production \
--name controlcore-cache
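The clustered Redis options above shard keys across 16384 hash slots using CRC16, so related keys only stay on one shard if they share a hash tag. A sketch of the slot calculation for reasoning about key distribution (this mirrors the algorithm in the Redis Cluster specification; it is illustrative, not a client library):

```python
def crc16(data: bytes) -> int:
    """CRC16-CCITT (XModem), the checksum Redis Cluster uses for key slots."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """Map a key to one of 16384 slots; {hash tags} pin related keys together."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:          # only a non-empty tag is used
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384
```

For example, `{user1}.profile` and `{user1}.sessions` land in the same slot, which matters if Control Core components ever need multi-key operations against the cluster.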
Step 7: Install Control Core Helm Chart
Add Control Core Helm Repository:
# Add repository
helm repo add controlcore https://charts.controlcore.io
helm repo update
# Pull chart to customize
helm pull controlcore/control-core --untar
cd control-core
Configure values.yaml:
# values-production.yaml
global:
domain: controlcore.yourcompany.com
environment: production
# Image configuration
imageRegistry: controlcore.io
imagePullSecrets:
- name: controlcore-registry-secret
# Policy Administration Console
console:
enabled: true
replicaCount: 3
image:
repository: controlcore/console
tag: "2.0.0"
pullPolicy: IfNotPresent
resources:
requests:
memory: "1Gi"
cpu: "500m"
limits:
memory: "2Gi"
cpu: "1000m"
autoscaling:
enabled: true
minReplicas: 3
maxReplicas: 10
targetCPUUtilizationPercentage: 70
targetMemoryUtilizationPercentage: 80
ingress:
enabled: true
className: nginx
annotations:
cert-manager.io/cluster-issuer: "letsencrypt-prod"
nginx.ingress.kubernetes.io/ssl-redirect: "true"
nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
hosts:
- host: console.controlcore.yourcompany.com
paths:
- path: /
pathType: Prefix
tls:
- secretName: console-tls
hosts:
- console.controlcore.yourcompany.com
# Policy Administration API
api:
enabled: true
replicaCount: 5
image:
repository: controlcore/api
tag: "2.0.0"
pullPolicy: IfNotPresent
resources:
requests:
memory: "2Gi"
cpu: "1000m"
limits:
memory: "4Gi"
cpu: "2000m"
autoscaling:
enabled: true
minReplicas: 5
maxReplicas: 20
targetCPUUtilizationPercentage: 70
targetMemoryUtilizationPercentage: 80
# Custom metrics for scaling
metrics:
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "1000"
env:
- name: WORKERS
value: "4"
- name: MAX_REQUESTS
value: "10000"
- name: MAX_REQUESTS_JITTER
value: "1000"
- name: TIMEOUT
value: "60"
ingress:
enabled: true
className: nginx
annotations:
cert-manager.io/cluster-issuer: "letsencrypt-prod"
nginx.ingress.kubernetes.io/ssl-redirect: "true"
nginx.ingress.kubernetes.io/rate-limit: "1000"
hosts:
- host: api.controlcore.yourcompany.com
paths:
- path: /
pathType: Prefix
tls:
- secretName: api-tls
hosts:
- api.controlcore.yourcompany.com
# Policy Bridge
policyBridge:
enabled: true
replicaCount: 3
image:
repository: controlcore/policy-bridge
tag: "0.7.0"
pullPolicy: IfNotPresent
resources:
requests:
memory: "1Gi"
cpu: "500m"
limits:
memory: "2Gi"
cpu: "1000m"
# Leader election for HA
leaderElection:
enabled: true
leaseDuration: 15s
renewDeadline: 10s
retryPeriod: 2s
config:
broadcast_uri: "postgres://controlcore:password@postgresql:5432/policy_bridge_db"
data_config_sources:
- uri: "https://api.controlcore.yourcompany.com/api/v1/policy-bridge/config"
config:
headers:
Authorization: "Bearer ${POLICY_SYNC_API_KEY}"
# Policy Enforcement Point (Bouncer/PEP)
bouncer:
enabled: true
replicaCount: 10
image:
repository: controlcore/bouncer
tag: "2.0.0"
pullPolicy: IfNotPresent
resources:
requests:
memory: "1Gi"
cpu: "1000m"
limits:
memory: "2Gi"
cpu: "2000m"
autoscaling:
enabled: true
minReplicas: 10
maxReplicas: 50
targetCPUUtilizationPercentage: 60
targetMemoryUtilizationPercentage: 70
# Scale based on request rate
metrics:
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "500"
# Pod Disruption Budget for high availability
podDisruptionBudget:
enabled: true
minAvailable: 5
# Pod Topology Spread for better distribution
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: controlcore-bouncer
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
app: controlcore-bouncer
config:
cache:
enabled: true
policy_ttl: "5m"
decision_ttl: "1m"
max_size: 50000
performance:
worker_threads: 8
connection_pool_size: 100
max_concurrent_requests: 5000
service:
type: LoadBalancer
annotations:
service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "tcp"
ports:
- name: http
port: 80
targetPort: 8080
protocol: TCP
- name: https
port: 443
targetPort: 8443
protocol: TCP
# Database configuration
database:
# Use external database
external: true
host: "controlcore-db.cluster-xxxxx.us-east-1.rds.amazonaws.com"
port: 5432
database: "control_core_db"
username: "controlcore"
passwordSecret: "controlcore-secrets"
passwordKey: "database-password"
# Connection pool settings
pool:
size: 50
max_overflow: 20
timeout: 30
recycle: 3600
# Redis configuration
redis:
# Use external Redis
external: true
host: "controlcore-cache.xxxxx.cache.amazonaws.com"
port: 6379
passwordSecret: "controlcore-secrets"
passwordKey: "redis-password"
# Cluster mode
cluster:
enabled: true
nodes: 6
# Monitoring & Observability
monitoring:
enabled: true
prometheus:
enabled: true
serviceMonitor:
enabled: true
interval: 30s
grafana:
enabled: true
dashboards:
enabled: true
alerting:
enabled: true
rules:
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate detected"
- alert: HighLatency
expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "High latency detected"
# Backup configuration
backup:
enabled: true
schedule: "0 2 * * *" # Daily at 2 AM
retention: 30 # days
storage:
type: s3
bucket: controlcore-backups
region: us-east-1
prefix: production/
# Security
security:
podSecurityPolicy:
enabled: true
networkPolicy:
enabled: true
rbac:
create: true
serviceAccount:
create: true
annotations:
eks.amazonaws.com/role-arn: "arn:aws:iam::ACCOUNT_ID:role/controlcore-sa-role"
Install Control Core:
# Install with custom values
helm install control-core controlcore/control-core \
--namespace control-core \
--values values-production.yaml \
--timeout 10m \
--wait
# Verify installation
helm list -n control-core
kubectl get pods -n control-core
kubectl get svc -n control-core
kubectl get ingress -n control-core
Step 8: Configure DNS
Get Load Balancer Address:
# Get Bouncer LB address
kubectl get svc -n control-core controlcore-bouncer -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
# Get Ingress LB address
kubectl get svc -n ingress-nginx nginx-ingress-ingress-nginx-controller -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
Configure DNS Records (Route 53 example):
# Console
aws route53 change-resource-record-sets \
--hosted-zone-id Z1234567890ABC \
--change-batch '{
"Changes": [{
"Action": "CREATE",
"ResourceRecordSet": {
"Name": "console.controlcore.yourcompany.com",
"Type": "CNAME",
"TTL": 300,
"ResourceRecords": [{"Value": "INGRESS_LB_HOSTNAME"}]
}
}]
}'
# API
aws route53 change-resource-record-sets \
--hosted-zone-id Z1234567890ABC \
--change-batch '{
"Changes": [{
"Action": "CREATE",
"ResourceRecordSet": {
"Name": "api.controlcore.yourcompany.com",
"Type": "CNAME",
"TTL": 300,
"ResourceRecords": [{"Value": "INGRESS_LB_HOSTNAME"}]
}
}]
}'
# Bouncer (for direct access)
aws route53 change-resource-record-sets \
--hosted-zone-id Z1234567890ABC \
--change-batch '{
"Changes": [{
"Action": "CREATE",
"ResourceRecordSet": {
"Name": "bouncer.controlcore.yourcompany.com",
"Type": "CNAME",
"TTL": 60,
"ResourceRecords": [{"Value": "BOUNCER_LB_HOSTNAME"}]
}
}]
}'
📌 Auto-Scaling Configuration
Horizontal Pod Autoscaler (HPA)
Control Core components auto-scale based on CPU, memory, and custom metrics.
Verify HPA Status:
# Check all HPAs
kubectl get hpa -n control-core
# Expected output:
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS
controlcore-console Deployment/console 45%/70% 3 10 3
controlcore-api Deployment/api 60%/70% 5 20 8
controlcore-bouncer Deployment/bouncer 55%/60% 10 50 15
Custom Metrics for Scaling:
# hpa-custom-metrics.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: controlcore-api-advanced
namespace: control-core
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: controlcore-api
minReplicas: 5
maxReplicas: 50
metrics:
# CPU-based scaling
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
# Memory-based scaling
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
# Request rate scaling
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "1000"
# Policy evaluation time scaling
- type: Pods
pods:
metric:
name: policy_evaluation_duration_seconds
target:
type: AverageValue
averageValue: "0.050" # Scale when avg > 50ms
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 50
periodSeconds: 60
- type: Pods
value: 2
periodSeconds: 60
selectPolicy: Min
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Percent
value: 100
periodSeconds: 15
- type: Pods
value: 4
periodSeconds: 15
selectPolicy: Max
kubectl apply -f hpa-custom-metrics.yaml
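For each metric above, the HPA computes desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric) and clamps to the min/max bounds. A sketch of that core formula:

```python
import math

def desired_replicas(current: int, metric: float, target: float,
                     min_r: int, max_r: int) -> int:
    """HPA core formula: scale proportionally to metric/target, then clamp."""
    desired = math.ceil(current * metric / target)
    return max(min_r, min(max_r, desired))

# 8 API pods at 90% CPU against the 70% target scale out to 11.
print(desired_replicas(8, 90, 70, min_r=5, max_r=50))  # → 11
```

The real controller also applies a tolerance band (10% by default) before acting, takes the largest answer across all configured metrics, and then filters the result through the scaleUp/scaleDown behavior policies shown above.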
Cluster Autoscaler
AWS EKS:
# Create IAM policy
cat > cluster-autoscaler-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"autoscaling:DescribeAutoScalingGroups",
"autoscaling:DescribeAutoScalingInstances",
"autoscaling:DescribeLaunchConfigurations",
"autoscaling:DescribeScalingActivities",
"autoscaling:DescribeTags",
"ec2:DescribeInstanceTypes",
"ec2:DescribeLaunchTemplateVersions"
],
"Resource": ["*"]
},
{
"Effect": "Allow",
"Action": [
"autoscaling:SetDesiredCapacity",
"autoscaling:TerminateInstanceInAutoScalingGroup",
"ec2:DescribeImages",
"ec2:GetInstanceTypesFromInstanceRequirements",
"eks:DescribeNodegroup"
],
"Resource": ["*"]
}
]
}
EOF
aws iam create-policy \
--policy-name ClusterAutoscalerPolicy \
--policy-document file://cluster-autoscaler-policy.json
# Deploy cluster autoscaler
kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml
# Annotate deployment
kubectl -n kube-system annotate deployment.apps/cluster-autoscaler \
cluster-autoscaler.kubernetes.io/safe-to-evict="false"
# Set cluster name
kubectl -n kube-system edit deployment.apps/cluster-autoscaler
# Add: --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/controlcore-production
Azure AKS (already configured if created with --enable-cluster-autoscaler):
# Update autoscaler settings
az aks update \
--resource-group controlcore-production \
--name controlcore-cluster \
--update-cluster-autoscaler \
--min-count 6 \
--max-count 50
Verification:
# Check autoscaler logs
kubectl -n kube-system logs -f deployment/cluster-autoscaler
# Check node status
kubectl get nodes
kubectl top nodes
📌 Load Balancing Configuration
Application Load Balancing (Layer 7)
NGINX Ingress Configuration:
# ingress-advanced.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: controlcore-ingress
namespace: control-core
annotations:
# SSL/TLS
cert-manager.io/cluster-issuer: "letsencrypt-prod"
nginx.ingress.kubernetes.io/ssl-redirect: "true"
nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
# Load balancing (upstream-hash-by takes precedence over load-balance when both are set)
nginx.ingress.kubernetes.io/load-balance: "ewma" # Exponentially weighted moving average
nginx.ingress.kubernetes.io/upstream-hash-by: "$binary_remote_addr" # IP hash for session affinity
# Rate limiting
nginx.ingress.kubernetes.io/limit-rps: "1000"
nginx.ingress.kubernetes.io/limit-burst-multiplier: "5"
# Timeouts
nginx.ingress.kubernetes.io/proxy-connect-timeout: "60"
nginx.ingress.kubernetes.io/proxy-send-timeout: "60"
nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
# Buffering
nginx.ingress.kubernetes.io/proxy-buffering: "on"
nginx.ingress.kubernetes.io/proxy-buffer-size: "8k"
# Connection limits
nginx.ingress.kubernetes.io/limit-connections: "100"
# CORS
nginx.ingress.kubernetes.io/enable-cors: "true"
nginx.ingress.kubernetes.io/cors-allow-origin: "https://yourcompany.com"
# Security headers
nginx.ingress.kubernetes.io/configuration-snippet: |
more_set_headers "X-Frame-Options: DENY";
more_set_headers "X-Content-Type-Options: nosniff";
more_set_headers "X-XSS-Protection: 1; mode=block";
more_set_headers "Strict-Transport-Security: max-age=31536000; includeSubDomains";
# Custom error pages
nginx.ingress.kubernetes.io/custom-http-errors: "404,503"
nginx.ingress.kubernetes.io/default-backend: custom-error-pages
spec:
ingressClassName: nginx
tls:
- hosts:
- console.controlcore.yourcompany.com
- api.controlcore.yourcompany.com
secretName: controlcore-tls
rules:
- host: console.controlcore.yourcompany.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: controlcore-console
port:
number: 3000
- host: api.controlcore.yourcompany.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: controlcore-api
port:
number: 8082
kubectl apply -f ingress-advanced.yaml
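The upstream-hash-by annotation pins each client IP to one backend pod. The effect can be sketched as a deterministic hash over the replica set (the hash function here is illustrative, not NGINX's actual one):

```python
import hashlib

def pick_backend(client_ip: str, backends: list) -> str:
    """Deterministically map a client IP to a backend, as IP-hash affinity does."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(backends)
    return backends[index]

replicas = [f"api-{i}" for i in range(5)]
# The same client always lands on the same replica while the set is stable.
print(pick_backend("203.0.113.7", replicas))
```

Note that with simple modulo hashing, scaling the replica set remaps most clients; NGINX uses consistent hashing internally for upstream-hash-by to reduce that churn.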
Network Load Balancing (Layer 4)
For Bouncer/PEP Fleet:
# bouncer-nlb-service.yaml
apiVersion: v1
kind: Service
metadata:
name: controlcore-bouncer-nlb
namespace: control-core
annotations:
# AWS NLB annotations
service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "tcp"
service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "60"
service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: "http"
service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: "/health"
service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "10"
service.beta.kubernetes.io/aws-load-balancer-healthcheck-timeout: "5"
service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "2"
# Azure Load Balancer annotations
service.beta.kubernetes.io/azure-load-balancer-health-probe-protocol: "http"
service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: "/health"
spec:
type: LoadBalancer
externalTrafficPolicy: Local # Preserve client IP
selector:
app: controlcore-bouncer
ports:
- name: http
port: 80
targetPort: 8080
protocol: TCP
- name: https
port: 443
targetPort: 8443
protocol: TCP
sessionAffinity: ClientIP
sessionAffinityConfig:
clientIP:
timeoutSeconds: 10800 # 3 hours
kubectl apply -f bouncer-nlb-service.yaml
DNS-Based Load Balancing
AWS Route 53 Weighted Routing:
# Create weighted record sets for multi-region
aws route53 change-resource-record-sets \
--hosted-zone-id Z1234567890ABC \
--change-batch '{
"Changes": [
{
"Action": "CREATE",
"ResourceRecordSet": {
"Name": "bouncer.controlcore.yourcompany.com",
"Type": "CNAME",
"SetIdentifier": "us-east-1",
"Weight": 70,
"TTL": 60,
"ResourceRecords": [{"Value": "us-east-1-bouncer-lb.amazonaws.com"}]
}
},
{
"Action": "CREATE",
"ResourceRecordSet": {
"Name": "bouncer.controlcore.yourcompany.com",
"Type": "CNAME",
"SetIdentifier": "eu-west-1",
"Weight": 30,
"TTL": 60,
"ResourceRecords": [{"Value": "eu-west-1-bouncer-lb.amazonaws.com"}]
}
}
]
}'
Geo-Routing for Global Deployment:
# US users to US region
aws route53 change-resource-record-sets \
--hosted-zone-id Z1234567890ABC \
--change-batch '{
"Changes": [{
"Action": "CREATE",
"ResourceRecordSet": {
"Name": "bouncer.controlcore.yourcompany.com",
"Type": "CNAME",
"SetIdentifier": "North America",
"GeoLocation": {
"ContinentCode": "NA"
},
"TTL": 60,
"ResourceRecords": [{"Value": "us-east-1-bouncer-lb.amazonaws.com"}]
}
}]
}'
# EU users to EU region
aws route53 change-resource-record-sets \
--hosted-zone-id Z1234567890ABC \
--change-batch '{
"Changes": [{
"Action": "CREATE",
"ResourceRecordSet": {
"Name": "bouncer.controlcore.yourcompany.com",
"Type": "CNAME",
"SetIdentifier": "Europe",
"GeoLocation": {
"ContinentCode": "EU"
},
"TTL": 60,
"ResourceRecords": [{"Value": "eu-west-1-bouncer-lb.amazonaws.com"}]
}
}]
}'
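Once the record sets propagate, the weighted and geo answers can be spot-checked (hostnames and zone ID as above; weighted answers vary across repeated queries):

```shell
# Confirm the weighted CNAME resolves
dig +short bouncer.controlcore.yourcompany.com CNAME

# Query Route 53 directly to inspect the full record sets, including weights
aws route53 list-resource-record-sets \
  --hosted-zone-id Z1234567890ABC \
  --query "ResourceRecordSets[?Name=='bouncer.controlcore.yourcompany.com.']"
```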
🤖 High Availability Configuration
Multi-AZ Deployment
Node Distribution:
# pod-topology-spread.yaml (in production, place these constraints in the Deployment's pod template)
apiVersion: v1
kind: Pod
metadata:
name: controlcore-api
labels:
app: controlcore-api
spec:
topologySpreadConstraints:
# Spread across zones
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: controlcore-api
# Spread across nodes
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
app: controlcore-api
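After a rollout, the spread can be verified by listing each pod's node and mapping nodes to zones; the zone label below is the standard well-known Kubernetes label:

```shell
# List which node each API pod landed on
kubectl get pods -n control-core -l app=controlcore-api -o wide

# Map nodes to their availability zones to confirm an even spread
kubectl get nodes \
  -o custom-columns='NODE:.metadata.name,ZONE:.metadata.labels.topology\.kubernetes\.io/zone'
```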
Pod Disruption Budgets
# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: controlcore-api-pdb
namespace: control-core
spec:
minAvailable: 3
selector:
matchLabels:
app: controlcore-api
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: controlcore-bouncer-pdb
namespace: control-core
spec:
minAvailable: 5
selector:
matchLabels:
app: controlcore-bouncer
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: controlcore-policy-bridge-pdb
namespace: control-core
spec:
maxUnavailable: 1
selector:
matchLabels:
app: controlcore-policy-bridge
kubectl apply -f pdb.yaml
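kubectl reports how much headroom each budget currently allows, which is worth checking before any node maintenance; a drain blocks rather than violating `minAvailable`:

```shell
# ALLOWED DISRUPTIONS shows how many pods can be evicted right now
kubectl get pdb -n control-core

# Preview a drain without evicting anything (substitute a real node name)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --dry-run=server
```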
Database High Availability
PostgreSQL Replication:
# postgresql-replication.yaml (if self-managed)
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
name: controlcore-db
namespace: control-core
spec:
instances: 3
primaryUpdateStrategy: unsupervised
postgresql:
parameters:
max_connections: "500"
shared_buffers: "8GB"
effective_cache_size: "24GB"
maintenance_work_mem: "2GB"
checkpoint_completion_target: "0.9"
wal_buffers: "16MB"
default_statistics_target: "100"
random_page_cost: "1.1"
effective_io_concurrency: "200"
work_mem: "20MB"
min_wal_size: "1GB"
max_wal_size: "4GB"
max_worker_processes: "8"
max_parallel_workers_per_gather: "4"
max_parallel_workers: "8"
max_parallel_maintenance_workers: "4"
bootstrap:
initdb:
database: control_core_db
owner: controlcore
secret:
name: controlcore-db-secret
storage:
size: 500Gi
storageClass: controlcore-fast-ssd
backup:
barmanObjectStore:
destinationPath: s3://controlcore-backups/postgresql/
s3Credentials:
accessKeyId:
name: aws-credentials
key: access-key-id
secretAccessKey:
name: aws-credentials
key: secret-access-key
wal:
compression: gzip
maxParallel: 8
retentionPolicy: "30d"
monitoring:
enabled: true
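With the CloudNativePG operator installed, cluster health and the current primary can be checked as follows (the pod labels assume current CloudNativePG conventions):

```shell
# Overall cluster phase and number of ready instances
kubectl get cluster controlcore-db -n control-core

# Which instance is currently primary vs. replica
kubectl get pods -n control-core -l cnpg.io/cluster=controlcore-db \
  -L cnpg.io/instanceRole
```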
Redis Sentinel (if self-managed)
# redis-sentinel.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: redis-sentinel-config
namespace: control-core
data:
sentinel.conf: |
sentinel monitor mymaster redis-0.redis 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 10000
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: redis-sentinel
namespace: control-core
spec:
serviceName: redis-sentinel
replicas: 3
selector:
matchLabels:
app: redis-sentinel
template:
metadata:
labels:
app: redis-sentinel
spec:
containers:
- name: sentinel
image: redis:7-alpine
command:
- redis-sentinel
- /etc/redis/sentinel.conf
ports:
- containerPort: 26379
name: sentinel
volumeMounts:
- name: config
mountPath: /etc/redis
volumes:
- name: config
configMap:
name: redis-sentinel-config
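A quick failover sanity check is to ask any sentinel which instance it currently considers the master (this assumes a `redis` StatefulSet running alongside the sentinels, monitored as `mymaster` per the config above):

```shell
# Ask sentinel-0 for the current master address
kubectl exec -n control-core redis-sentinel-0 -- \
  redis-cli -p 26379 sentinel get-master-addr-by-name mymaster

# Verify enough sentinels agree on the master to reach quorum (2, per the config)
kubectl exec -n control-core redis-sentinel-0 -- \
  redis-cli -p 26379 sentinel ckquorum mymaster
```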
📌 SSL/TLS Configuration
Certificate Management with cert-manager
ClusterIssuer for Let's Encrypt:
# letsencrypt-issuer.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: letsencrypt-prod
spec:
acme:
server: https://acme-v02.api.letsencrypt.org/directory
email: admin@yourcompany.com
privateKeySecretRef:
name: letsencrypt-prod-key
solvers:
# HTTP-01 challenge
- http01:
ingress:
class: nginx
# DNS-01 challenge (for wildcard certs)
- dns01:
route53:
region: us-east-1
accessKeyID: <set-aws-access-key-id-from-secret-manager>
secretAccessKeySecretRef:
name: aws-credentials
key: secret-access-key
kubectl apply -f letsencrypt-issuer.yaml
Certificate Resource:
# certificate.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: controlcore-tls
namespace: control-core
spec:
secretName: controlcore-tls
issuerRef:
name: letsencrypt-prod
kind: ClusterIssuer
dnsNames:
- controlcore.yourcompany.com
- console.controlcore.yourcompany.com
- api.controlcore.yourcompany.com
- bouncer.controlcore.yourcompany.com
- "*.controlcore.yourcompany.com"
privateKey:
algorithm: RSA
size: 4096
kubectl apply -f certificate.yaml
# Check certificate status
kubectl get certificate -n control-core
kubectl describe certificate controlcore-tls -n control-core
mTLS Between Services
Service Mesh with Istio (optional):
# Install Istio
curl -L https://istio.io/downloadIstio | sh -
cd istio-*
export PATH=$PWD/bin:$PATH
# Install Istio with mTLS
istioctl install --set profile=default -y  # "default" is the profile recommended for production
# Enable automatic sidecar injection
kubectl label namespace control-core istio-injection=enabled
# Apply strict mTLS policy
kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: control-core
spec:
mtls:
mode: STRICT
EOF
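Existing pods only pick up sidecars on restart, so after labeling the namespace, restart the workloads and confirm the policy is in force:

```shell
# Restart all Deployments in the namespace so the Envoy sidecar is injected
kubectl rollout restart deployment -n control-core

# Every pod should now report 2/2 containers (app + istio-proxy)
kubectl get pods -n control-core

# Confirm the namespace-wide PeerAuthentication is applied
kubectl get peerauthentication -n control-core
```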
📌 Recommended Sync Settings
Policy Bridge Configuration
# policy-bridge-sync-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: policy-bridge-sync-config
namespace: control-core
data:
# Policy sync interval (how often to check for policy updates)
POLICY_REPO_POLLING_INTERVAL: "30" # seconds
# Data source sync intervals (by type)
POLICY_DATA_CONFIG_SYNC_INTERVAL: "60" # seconds
# WebSocket keep-alive
POLICY_SYNC_KEEPALIVE: "30" # seconds
# Statistics reporting interval
POLICY_SYNC_STATS_ENABLED: "true"
POLICY_SYNC_STATS_INTERVAL: "60" # seconds
  # Broadcast channel (in production, source the credentials from a Secret
  # rather than embedding them in this ConfigMap)
  POLICY_SYNC_BROADCAST_URI: "postgres://controlcore:password@postgresql:5432/policy_bridge_db"
  # Client subscriptions (likewise, mount the token from a Secret)
  POLICY_SYNC_CLIENT_TOKEN: "secure-client-token"
POLICY_SYNC_RECONNECT_INTERVAL: "5" # seconds
POLICY_SYNC_MAX_RECONNECT_ATTEMPTS: "10"
Recommendations by Environment:
| Setting | Development | Staging | Production | High-Traffic Production |
|---|---|---|---|---|
| Policy Repo Polling | 60s | 30s | 30s | 30s |
| Data Source Sync | 300s (5m) | 120s (2m) | 60s (1m) | 60s (1m) |
| WebSocket Keep-Alive | 60s | 30s | 30s | 15s |
| Statistics Interval | 300s | 60s | 60s | 30s |
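ConfigMap changes are not picked up by running pods automatically; after applying, restart the Policy Bridge so the new intervals take effect (the deployment name here is assumed to match the labels used elsewhere in this guide):

```shell
kubectl apply -f policy-bridge-sync-config.yaml

# Pods read the ConfigMap at startup, so trigger a rolling restart
kubectl rollout restart deployment/controlcore-policy-bridge -n control-core
kubectl rollout status deployment/controlcore-policy-bridge -n control-core
```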
Cache Settings
# cache-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: cache-config
namespace: control-core
data:
# Policy cache (how long to cache compiled policies)
POLICY_CACHE_TTL: "300" # 5 minutes in seconds
POLICY_CACHE_MAX_SIZE: "50000" # entries
# Decision cache (how long to cache authorization decisions)
DECISION_CACHE_TTL: "60" # 1 minute in seconds
DECISION_CACHE_MAX_SIZE: "100000" # entries
# User context cache
USER_CONTEXT_CACHE_TTL: "300" # 5 minutes
USER_CONTEXT_CACHE_MAX_SIZE: "50000"
# Resource metadata cache
RESOURCE_CACHE_TTL: "600" # 10 minutes
RESOURCE_CACHE_MAX_SIZE: "50000"
# Cache eviction policy
CACHE_EVICTION_POLICY: "lru" # lru, lfu, or ttl
# Cache warming (preload frequently accessed data)
CACHE_WARMING_ENABLED: "true"
CACHE_WARMING_INTERVAL: "3600" # 1 hour
Recommendations by Load:
| Metric | Low Load | Medium Load | High Load | Very High Load |
|---|---|---|---|---|
| Policy Cache TTL | 10m | 5m | 5m | 3m |
| Decision Cache TTL | 5m | 1m | 1m | 30s |
| Policy Cache Size | 10,000 | 50,000 | 100,000 | 500,000 |
| Decision Cache Size | 50,000 | 100,000 | 500,000 | 1,000,000 |
Performance Tuning
# performance-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: performance-config
namespace: control-core
data:
# API workers
API_WORKERS: "8" # per pod
API_WORKER_CLASS: "uvicorn.workers.UvicornWorker"
API_WORKER_CONNECTIONS: "1000"
API_TIMEOUT: "60"
API_KEEPALIVE: "5"
# Bouncer/PEP workers
BOUNCER_WORKER_THREADS: "16" # per pod
BOUNCER_CONNECTION_POOL_SIZE: "200"
BOUNCER_MAX_CONCURRENT_REQUESTS: "10000"
BOUNCER_REQUEST_TIMEOUT: "30s"
# Database connection pool
DB_POOL_SIZE: "50" # per API pod
DB_MAX_OVERFLOW: "20"
DB_POOL_TIMEOUT: "30"
DB_POOL_RECYCLE: "3600"
DB_POOL_PRE_PING: "true"
# Redis connection pool
REDIS_POOL_SIZE: "50" # per pod
REDIS_MAX_CONNECTIONS: "100"
REDIS_SOCKET_KEEPALIVE: "true"
REDIS_SOCKET_KEEPALIVE_OPTIONS: "1,10,3"
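Before raising pool sizes, check the arithmetic against Postgres: each API pod can hold up to `DB_POOL_SIZE + DB_MAX_OVERFLOW` connections, and the fleet-wide peak must stay below the database's `max_connections` (500 in the replication manifest above). A sketch, assuming a hypothetical 8-pod API fleet:

```shell
# Per-pod connection ceiling times pod count vs. Postgres max_connections
PODS=8; POOL=50; OVERFLOW=20; MAX_CONN=500
PEAK=$(( PODS * (POOL + OVERFLOW) ))
echo "peak app connections: ${PEAK} / max_connections: ${MAX_CONN}"
if [ "$PEAK" -ge "$MAX_CONN" ]; then
  echo "over budget: shrink the pools, add a pooler such as PgBouncer, or raise max_connections"
fi
```

With these defaults the peak (560) already exceeds 500, so any scale-out of the API tier should be paired with a pooler or a larger connection budget.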
🔒 Runtime Security
Pod Security Policies
Note: PodSecurityPolicy was removed in Kubernetes 1.25. On v1.25+ clusters, enforce the equivalent restrictions with Pod Security Admission namespace labels instead; the manifest below applies only to clusters still on v1.24 or earlier.
# pod-security-policy.yaml (Kubernetes 1.24 and earlier only)
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
name: controlcore-restricted
spec:
privileged: false
allowPrivilegeEscalation: false
requiredDropCapabilities:
- ALL
volumes:
- 'configMap'
- 'emptyDir'
- 'projected'
- 'secret'
- 'downwardAPI'
- 'persistentVolumeClaim'
hostNetwork: false
hostIPC: false
hostPID: false
runAsUser:
rule: 'MustRunAsNonRoot'
seLinux:
rule: 'RunAsAny'
supplementalGroups:
rule: 'RunAsAny'
fsGroup:
rule: 'RunAsAny'
readOnlyRootFilesystem: false
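On clusters where PodSecurityPolicy is no longer available (Kubernetes 1.25+), the same restrictions are enforced with Pod Security Admission labels on the namespace, with no extra objects required:

```shell
# Enforce the "restricted" Pod Security Standard on the namespace;
# "warn" and "audit" surface violations without blocking during rollout
kubectl label namespace control-core \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/audit=restricted --overwrite
```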
Network Policies
# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: controlcore-network-policy
namespace: control-core
spec:
podSelector:
matchLabels:
app: controlcore
policyTypes:
- Ingress
- Egress
ingress:
# Allow from ingress controller
- from:
- namespaceSelector:
matchLabels:
name: ingress-nginx
ports:
- protocol: TCP
port: 3000 # Console
- protocol: TCP
port: 8082 # API
- protocol: TCP
port: 8080 # Bouncer
# Allow internal communication
- from:
- podSelector:
matchLabels:
app: controlcore
ports:
- protocol: TCP
port: 3000
- protocol: TCP
port: 8082
- protocol: TCP
port: 8080
- protocol: TCP
port: 7000 # Policy Bridge
egress:
# Allow to database
- to:
- podSelector:
matchLabels:
app: postgresql
ports:
- protocol: TCP
port: 5432
# Allow to Redis
- to:
- podSelector:
matchLabels:
app: redis
ports:
- protocol: TCP
port: 6379
# Allow DNS
- to:
- namespaceSelector:
matchLabels:
name: kube-system
- podSelector:
matchLabels:
k8s-app: kube-dns
ports:
- protocol: UDP
port: 53
# Allow HTTPS egress (for external APIs)
- to:
- namespaceSelector: {}
ports:
- protocol: TCP
port: 443
kubectl apply -f network-policy.yaml
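Network policies fail silently, so it is worth probing from a throwaway pod: allowed paths should connect and everything else should time out. The API service name below is an assumption based on the labels used in this guide:

```shell
# From a pod carrying the allowed label, the API port should be reachable...
kubectl run netpol-test --rm -it --restart=Never \
  -n control-core --labels=app=controlcore \
  --image=busybox:1.36 -- \
  wget -qO- --timeout=5 http://controlcore-api:8082/health

# ...while a pod without the allowed labels should be blocked
kubectl run netpol-deny-test --rm -it --restart=Never \
  -n control-core --image=busybox:1.36 -- \
  wget -qO- --timeout=5 http://controlcore-api:8082/health || echo "blocked (expected)"
```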
Secrets Management
Using AWS Secrets Manager:
# external-secrets.yaml
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
name: aws-secretsmanager
namespace: control-core
spec:
provider:
aws:
service: SecretsManager
region: us-east-1
auth:
jwt:
serviceAccountRef:
name: controlcore-sa
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: controlcore-secrets
namespace: control-core
spec:
refreshInterval: 1h
secretStoreRef:
name: aws-secretsmanager
kind: SecretStore
target:
name: controlcore-secrets
creationPolicy: Owner
data:
- secretKey: database-password
remoteRef:
key: controlcore/database
property: password
- secretKey: redis-password
remoteRef:
key: controlcore/redis
property: password
- secretKey: jwt-secret
remoteRef:
key: controlcore/jwt
property: secret
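The operator reports sync state on the resources themselves; a healthy setup shows the ExternalSecret as Ready and the target Secret populated:

```shell
# READY should be True once the AWS secret has been materialised
kubectl get externalsecret controlcore-secrets -n control-core

# The generated Secret should contain the three mapped keys
kubectl get secret controlcore-secrets -n control-core -o jsonpath='{.data}'
```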
Using HashiCorp Vault:
# vault-integration.yaml
apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultAuth
metadata:
name: controlcore-vault-auth
namespace: control-core
spec:
method: kubernetes
mount: kubernetes
kubernetes:
role: controlcore
serviceAccount: controlcore-sa
---
apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultStaticSecret
metadata:
name: controlcore-secrets
namespace: control-core
spec:
type: kv-v2
mount: secret
path: controlcore/production
destination:
name: controlcore-secrets
create: true
refreshAfter: 30s
vaultAuthRef: controlcore-vault-auth
📌 SAML and SSO Configuration
Auth0 Integration
# auth0-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: auth0-config
namespace: control-core
data:
AUTH0_DOMAIN: "yourcompany.auth0.com"
AUTH0_CLIENT_ID: "your-client-id"
AUTH0_AUDIENCE: "https://api.controlcore.yourcompany.com"
AUTH0_SCOPE: "openid profile email"
AUTH0_CALLBACK_URL: "https://console.controlcore.yourcompany.com/callback"
---
apiVersion: v1
kind: Secret
metadata:
name: auth0-secret
namespace: control-core
type: Opaque
stringData:
client_secret: "your-auth0-client-secret"
SAML SSO Configuration
# saml-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: saml-config
namespace: control-core
data:
SAML_ENABLED: "true"
SAML_IDP_ENTITY_ID: "https://idp.yourcompany.com/saml"
SAML_IDP_SSO_URL: "https://idp.yourcompany.com/saml/sso"
SAML_IDP_SLO_URL: "https://idp.yourcompany.com/saml/slo"
SAML_SP_ENTITY_ID: "https://console.controlcore.yourcompany.com"
SAML_SP_ACS_URL: "https://console.controlcore.yourcompany.com/saml/acs"
SAML_SP_SLO_URL: "https://console.controlcore.yourcompany.com/saml/slo"
# Attribute mapping
SAML_ATTR_EMAIL: "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress"
SAML_ATTR_FIRSTNAME: "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/givenname"
SAML_ATTR_LASTNAME: "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/surname"
SAML_ATTR_GROUPS: "http://schemas.xmlsoap.org/claims/Group"
---
apiVersion: v1
kind: Secret
metadata:
name: saml-certificates
namespace: control-core
type: Opaque
data:
idp_cert.pem: <base64-encoded-idp-certificate>
sp_key.pem: <base64-encoded-sp-private-key>
sp_cert.pem: <base64-encoded-sp-certificate>
SAML Providers:
- Okta: Configure SAML 2.0 app
- Azure AD: Enterprise Application with SAML SSO
- OneLogin: SAML SSO application
- Google Workspace: Custom SAML app
- Ping Identity: SAML 2.0 connection
👁️ Monitoring and Observability
Prometheus Metrics
Service Monitors:
# servicemonitors.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: controlcore-api
namespace: control-core
spec:
selector:
matchLabels:
app: controlcore-api
endpoints:
- port: metrics
path: /metrics
interval: 30s
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: controlcore-bouncer
namespace: control-core
spec:
selector:
matchLabels:
app: controlcore-bouncer
endpoints:
- port: metrics
path: /metrics
interval: 30s
Key Metrics to Monitor:
# API Metrics
http_requests_total - Total HTTP requests
http_request_duration_seconds - Request latency
http_requests_in_flight - Current requests being processed
policy_evaluations_total - Total policy evaluations
policy_evaluation_duration_seconds - Policy evaluation time
cache_hits_total - Cache hits
cache_misses_total - Cache misses
# Bouncer Metrics
bouncer_requests_total - Total requests through bouncer
bouncer_allowed_requests - Allowed requests
bouncer_denied_requests - Denied requests
bouncer_policy_sync_timestamp - Last policy sync time
bouncer_target_app_reachable - Target app health (1=healthy, 0=unhealthy)
# Database Metrics
db_connections_active - Active database connections
db_connections_idle - Idle database connections
db_query_duration_seconds - Query execution time
# Policy Bridge Metrics (underscores, not hyphens: Prometheus metric names cannot contain hyphens)
policy_bridge_connected_clients - Number of connected clients
policy_bridge_policy_updates_total - Total policy updates distributed
policy_bridge_data_updates_total - Total data updates distributed
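Raw counters are most useful as derived rates and ratios. For ad-hoc checks, the queries can be run against the Prometheus HTTP API (the in-cluster service name below assumes a standard prometheus-operator install; `policy_evaluation_duration_seconds` is assumed to be exported as a histogram):

```shell
PROM=http://prometheus-operated.monitoring.svc:9090

# Cache hit ratio over the last 5 minutes
curl -sG "${PROM}/api/v1/query" \
  --data-urlencode 'query=sum(rate(cache_hits_total[5m])) / (sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m])))'

# P95 policy evaluation latency
curl -sG "${PROM}/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.95, rate(policy_evaluation_duration_seconds_bucket[5m]))'
```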
Grafana Dashboards
Import Pre-built Dashboard:
# Get dashboard JSON from Control Core
curl -o controlcore-dashboard.json \
https://downloads.controlcore.io/dashboards/enterprise-v2.json
# Import to Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Access Grafana at http://localhost:3000
# Import dashboard via UI or API
Logging with ELK/EFK Stack
# Install Elasticsearch
helm repo add elastic https://helm.elastic.co
helm install elasticsearch elastic/elasticsearch \
--namespace logging \
--create-namespace \
--set replicas=3 \
--set resources.requests.memory=4Gi \
--set volumeClaimTemplate.resources.requests.storage=100Gi
# Install Kibana
helm install kibana elastic/kibana \
--namespace logging \
--set service.type=LoadBalancer
# Install Fluentd (or Fluent Bit for lighter footprint)
helm repo add fluent https://fluent.github.io/helm-charts
helm install fluentd fluent/fluentd \
--namespace logging \
--set elasticsearch.host=elasticsearch-master \
--set elasticsearch.port=9200
Alerting
Prometheus Alert Rules:
# alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: controlcore-alerts
namespace: control-core
spec:
groups:
- name: controlcore.rules
interval: 30s
rules:
# High error rate
- alert: HighErrorRate
      expr: |
        sum by (instance) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (instance) (rate(http_requests_total[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate detected"
description: "{{ $labels.instance }} has error rate of {{ $value }}"
# High latency
- alert: HighLatency
expr: |
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket[5m])
) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "High latency detected"
description: "P95 latency is {{ $value }}s"
# Policy sync failure
- alert: PolicySyncFailure
expr: |
time() - bouncer_policy_sync_timestamp > 600
for: 5m
labels:
severity: critical
annotations:
summary: "Policy sync failure"
description: "Bouncer {{ $labels.instance }} hasn't synced in 10 minutes"
# Pod not ready
- alert: PodNotReady
expr: |
kube_pod_status_phase{namespace="control-core",phase!="Running"} == 1
for: 5m
labels:
severity: warning
annotations:
summary: "Pod not ready"
description: "Pod {{ $labels.pod }} is not in Running state"
# High memory usage
- alert: HighMemoryUsage
expr: |
container_memory_usage_bytes{namespace="control-core"}
/ container_spec_memory_limit_bytes{namespace="control-core"} > 0.9
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage"
description: "Container {{ $labels.container }} memory usage is {{ $value }}"
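Rule files are easy to break with a stray indent, so validate the expressions offline before the operator loads them. promtool checks the classic rules format, i.e. the content of `spec.groups` written to a standalone file:

```shell
# Lint a plain rules file (groups: / - name: / rules: ...)
promtool check rules rules.yaml

# Test an expression against a live Prometheus while tuning thresholds
promtool query instant http://localhost:9090 \
  'rate(http_requests_total{status=~"5.."}[5m])'
```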
Alert Manager Configuration:
# alertmanager-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: alertmanager-config
namespace: monitoring
data:
alertmanager.yml: |
global:
resolve_timeout: 5m
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'default'
routes:
- match:
severity: critical
receiver: 'critical'
continue: true
- match:
severity: warning
receiver: 'warning'
receivers:
- name: 'default'
webhook_configs:
- url: 'http://alertmanager-webhook:5000/alerts'
- name: 'critical'
email_configs:
- to: 'ops-critical@yourcompany.com'
from: 'alertmanager@yourcompany.com'
smarthost: 'smtp.yourcompany.com:587'
auth_username: 'alertmanager@yourcompany.com'
auth_password: 'password'
pagerduty_configs:
- service_key: 'your-pagerduty-key'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#ops-critical'
title: 'Critical Alert'
- name: 'warning'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#ops-warnings'
title: 'Warning Alert'
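Alertmanager ships with amtool, which lints the config and shows how a given label set would route; this is a quick way to verify the critical/warning split before reloading:

```shell
# Validate syntax and referenced receivers
amtool check-config alertmanager.yml

# Show which receiver a critical alert would reach
amtool config routes test --config.file=alertmanager.yml severity=critical
```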
📌 Backup and Disaster Recovery
Automated Backups
Velero for Kubernetes Resources:
# Install Velero
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm install velero vmware-tanzu/velero \
--namespace velero \
--create-namespace \
--set configuration.provider=aws \
--set configuration.backupStorageLocation.bucket=controlcore-backups \
--set configuration.backupStorageLocation.config.region=us-east-1 \
--set configuration.volumeSnapshotLocation.config.region=us-east-1 \
--set initContainers[0].name=velero-plugin-for-aws \
--set initContainers[0].image=velero/velero-plugin-for-aws:v1.8.0 \
--set initContainers[0].volumeMounts[0].mountPath=/target \
--set initContainers[0].volumeMounts[0].name=plugins
# Create backup schedule
velero schedule create control-core-daily \
--schedule="0 2 * * *" \
--include-namespaces control-core \
--ttl 720h0m0s
# Create on-demand backup
velero backup create control-core-backup-$(date +%Y%m%d) \
--include-namespaces control-core \
--wait
Database Backups:
# database-backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: postgres-backup
namespace: control-core
spec:
schedule: "0 2 * * *"
concurrencyPolicy: Forbid
successfulJobsHistoryLimit: 7
failedJobsHistoryLimit: 3
jobTemplate:
spec:
template:
spec:
containers:
- name: backup
image: postgres:15-alpine
env:
- name: PGHOST
value: "postgresql"
- name: PGUSER
value: "controlcore"
- name: PGPASSWORD
valueFrom:
secretKeyRef:
name: controlcore-secrets
key: database-password
- name: PGDATABASE
value: "control_core_db"
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
name: aws-credentials
key: access-key-id
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
name: aws-credentials
key: secret-access-key
command:
- /bin/sh
- -c
- |
BACKUP_FILE="controlcore-db-$(date +%Y%m%d-%H%M%S).sql.gz"
pg_dump | gzip > /tmp/$BACKUP_FILE
aws s3 cp /tmp/$BACKUP_FILE s3://controlcore-backups/database/$BACKUP_FILE
echo "Backup completed: $BACKUP_FILE"
restartPolicy: OnFailure
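Rather than waiting for the 02:00 schedule, the job can be exercised immediately after applying the CronJob:

```shell
kubectl apply -f database-backup-cronjob.yaml

# Fire a one-off run from the CronJob template and watch it complete
kubectl create job --from=cronjob/postgres-backup postgres-backup-manual -n control-core
kubectl logs -f job/postgres-backup-manual -n control-core

# Confirm the dump landed in S3
aws s3 ls s3://controlcore-backups/database/ | tail -5
```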
Disaster Recovery Procedures
Recovery Runbook:
# 1. Restore Kubernetes resources
velero restore create --from-backup control-core-backup-20250125
# 2. Restore database
aws s3 cp s3://controlcore-backups/database/controlcore-db-20250125-020000.sql.gz .
gunzip controlcore-db-20250125-020000.sql.gz
kubectl exec -i postgresql-0 -n control-core -- psql -U controlcore -d control_core_db < controlcore-db-20250125-020000.sql  # -i without -t: a TTY conflicts with stdin redirection
# 3. Verify services
kubectl get pods -n control-core
kubectl get svc -n control-core
# 4. Test health endpoints
curl https://console.controlcore.yourcompany.com/health
curl https://api.controlcore.yourcompany.com/api/v1/health
# 5. Verify policy sync
kubectl logs -n control-core -l app=controlcore-policy-bridge
# 6. Test policy evaluation
curl -X POST https://bouncer.controlcore.yourcompany.com/v1/data/app/authorization/allow \
  -H "Content-Type: application/json" \
  -d '{"input": {"user": {"id": "test"}, "resource": {"id": "test"}, "action": "read"}}'
🚀 Troubleshooting Enterprise Deployments
Common Issues
Pod Scheduling Failures:
# Check node resources
kubectl top nodes
# Check pod status
kubectl describe pod <pod-name> -n control-core
# Check events
kubectl get events -n control-core --sort-by='.lastTimestamp'
# Common solutions:
# 1. Scale cluster (add more nodes)
# 2. Adjust resource requests/limits
# 3. Check PodDisruptionBudget settings
Database Connection Pool Exhaustion:
# Check active connections
kubectl exec -it postgresql-0 -n control-core -- psql -U controlcore -d control_core_db -c \
"SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"
# Solution: Increase pool size in values.yaml
database:
pool:
size: 100 # Increase from 50
max_overflow: 40 # Increase from 20
High Memory Usage:
# Check memory usage
kubectl top pods -n control-core
# Identify memory hogs
kubectl exec -it <pod-name> -n control-core -- top
# Solutions:
# 1. Increase cache eviction rate
# 2. Reduce cache sizes
# 3. Add more memory to pods
# 4. Scale horizontally instead of vertically
📞 Support and Resources
- Administrator Guide: System administration
- Troubleshooting: Common issues
- Security Best Practices: Security hardening
- Enterprise Support: support@controlcore.io (24/7 SLA)
- Professional Services: Available for deployment assistance
📌 Next Steps
- Administrator Guide: Learn day-to-day operations
- User Guide: Create and deploy policies
- Security Best Practices: Harden your deployment
- API Reference: Integrate with APIs
Congratulations! You now have a production-ready, enterprise-scale Control Core deployment with high availability, auto-scaling, and comprehensive monitoring.