📘 DevOps Guide

This guide provides comprehensive instructions for DevOps engineers managing Control Core deployments, CI/CD pipelines, and infrastructure automation.

📌 Overview

Control Core DevOps responsibilities include:

  • Infrastructure Management: Kubernetes, Docker, and cloud infrastructure
  • CI/CD Pipelines: Automated testing, building, and deployment
  • Monitoring & Observability: System monitoring, logging, and alerting
  • Security Operations: Security scanning, compliance, and incident response
  • Performance Optimization: System tuning and scalability
  • Disaster Recovery: Backup, restore, and business continuity

CI/CD and automation guidance

Use this guide for customer-safe deployment automation patterns, release controls, and operational runbooks.

📌 Infrastructure Management

Kubernetes Deployment

Production Kubernetes Setup

  1. Cluster Requirements

    # Minimum cluster specifications
    nodes:
      - type: "master"
        count: 3
        specs:
          cpu: "4 cores"
          memory: "16GB"
          storage: "100GB SSD"
      - type: "worker"
        count: 3
        specs:
          cpu: "8 cores"
          memory: "32GB"
          storage: "200GB SSD"
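As a quick sanity check, the aggregate capacity of the worker pool implied by these specs can be computed directly (a sketch; node counts and per-node sizes are taken from the YAML above):

```python
# Aggregate schedulable capacity of the worker pool: 3 workers, 8 cores / 32 GB each
def pool_capacity(node_count: int, cpu_cores: int, memory_gb: int) -> dict:
    return {
        "cpu_cores": node_count * cpu_cores,
        "memory_gb": node_count * memory_gb,
    }

workers = pool_capacity(node_count=3, cpu_cores=8, memory_gb=32)
print(workers)  # {'cpu_cores': 24, 'memory_gb': 96}
```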
    
  2. Deploy with Helm

    # Add Control Core Helm repository
    helm repo add controlcore https://charts.controlcore.io
    helm repo update
    
    # Install Control Core
    helm install controlcore controlcore/controlcore \
      --namespace controlcore \
      --create-namespace \
      --values values-production.yaml \
      --set global.domain=controlcore.company.com \
      --set global.tls.enabled=true
    

Helm Chart Configuration

  1. Production Values
    # values-production.yaml
    global:
      domain: "controlcore.company.com"
      tls:
        enabled: true
        certManager:
          enabled: true
    
    # Database configuration
    postgresql:
      enabled: true
      auth:
        postgresPassword: "secure_password"
        database: "control_core_db"
      primary:
        persistence:
          size: 100Gi
          storageClass: "fast-ssd"
    
    # Redis configuration
    redis:
      enabled: true
      auth:
        enabled: true
        password: "redis_secure_password"
      master:
        persistence:
          size: 50Gi
          storageClass: "fast-ssd"
    
    # Application configuration
    control-plane-api:
      replicas: 3
      resources:
        requests:
          cpu: 500m
          memory: 1Gi
        limits:
          cpu: 2000m
          memory: 4Gi
      autoscaling:
        enabled: true
        minReplicas: 3
        maxReplicas: 10
        targetCPUUtilizationPercentage: 70
    
    # Monitoring
    monitoring:
      enabled: true
      prometheus:
        enabled: true
      grafana:
        enabled: true
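The autoscaling block above follows the standard Kubernetes HPA rule: desired replicas = ceil(current replicas × current utilization / target utilization), clamped to the configured bounds. A sketch of that calculation:

```python
import math

def hpa_desired_replicas(current: int, current_util: float, target_util: float,
                         min_replicas: int = 3, max_replicas: int = 10) -> int:
    # Standard Kubernetes HPA scaling formula, clamped to min/max replicas
    desired = math.ceil(current * current_util / target_util)
    return max(min_replicas, min(max_replicas, desired))

# At 3 replicas averaging 95% CPU against a 70% target, the HPA scales to 5
print(hpa_desired_replicas(3, 95, 70))  # 5
```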
    

Docker Configuration

Multi-stage Dockerfile

  1. API Service Dockerfile

    # Dockerfile for control-plane-api
    FROM python:3.11-slim as builder
    
    # Install build dependencies
    RUN apt-get update && apt-get install -y \
        build-essential \
        libpq-dev \
        && rm -rf /var/lib/apt/lists/*
    
    # Install Python dependencies
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
    
    # Production stage
    FROM python:3.11-slim
    
    # Install runtime dependencies
    RUN apt-get update && apt-get install -y \
        libpq5 \
        && rm -rf /var/lib/apt/lists/*
    
    # Create non-root user
    RUN useradd --create-home --shell /bin/bash controlcore
    
    # Copy application
    COPY --from=builder /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages
    COPY --from=builder /usr/local/bin /usr/local/bin
    COPY . /app
    
    # Set permissions
    RUN chown -R controlcore:controlcore /app
    USER controlcore
    
    # Expose port
    EXPOSE 8000
    
    # Health check (python:3.11-slim does not ship curl, so use the Python stdlib)
    HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
      CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
    
    # Start application
    CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
    
  2. Frontend Dockerfile

    # Dockerfile for control-plane-ui frontend
    FROM node:18-alpine as builder
    
    # Set working directory
    WORKDIR /app
    
    # Copy package files
    COPY package*.json ./
    
    # Install dependencies (the build step needs devDependencies, so install everything)
    RUN npm ci
    
    # Copy source code
    COPY . .
    
    # Build application
    RUN npm run build
    
    # Production stage
    FROM nginx:alpine
    
    # Copy the built static export (the output directory depends on the framework's build config)
    COPY --from=builder /app/out /usr/share/nginx/html
    
    # Copy nginx configuration
    COPY nginx.conf /etc/nginx/nginx.conf
    
    # Expose port
    EXPOSE 80
    
    # Start nginx
    CMD ["nginx", "-g", "daemon off;"]
    

Infrastructure as Code

Terraform Configuration

  1. AWS Infrastructure

    # main.tf
    terraform {
      required_version = ">= 1.0"
      required_providers {
        aws = {
          source  = "hashicorp/aws"
          version = "~> 5.0"
        }
      }
    }
    
    provider "aws" {
      region = var.aws_region
    }
    
    # EKS Cluster
    resource "aws_eks_cluster" "controlcore" {
      name     = "controlcore-${var.environment}"
      role_arn = aws_iam_role.eks_cluster.arn
      version  = "1.28"
    
      vpc_config {
        subnet_ids              = aws_subnet.private[*].id
        endpoint_private_access = true
        endpoint_public_access  = true
        public_access_cidrs     = ["0.0.0.0/0"] # restrict to trusted CIDRs in production
      }
    
      encryption_config {
        provider {
          key_arn = aws_kms_key.eks.arn
        }
        resources = ["secrets"]
      }
    }
    
    # RDS PostgreSQL
    resource "aws_db_instance" "postgres" {
      identifier = "controlcore-postgres-${var.environment}"
    
      engine         = "postgres"
      engine_version = "15.4"
      instance_class = "db.r5.xlarge"
    
      allocated_storage     = 100
      max_allocated_storage = 1000
      storage_type          = "gp3"
      storage_encrypted     = true
    
      db_name  = "control_core_db"
      username = "postgres"
      password = var.db_password
    
      vpc_security_group_ids = [aws_security_group.rds.id]
      db_subnet_group_name   = aws_db_subnet_group.main.name
    
      backup_retention_period = 7
      backup_window          = "03:00-04:00"
      maintenance_window     = "sun:04:00-sun:05:00"
    
      skip_final_snapshot = false
      final_snapshot_identifier = "controlcore-postgres-final-snapshot-${var.environment}"
    }
    
    # ElastiCache Redis
    resource "aws_elasticache_replication_group" "redis" {
      replication_group_id       = "controlcore-redis-${var.environment}"
      description                = "Control Core Redis cluster"
    
      node_type                  = "cache.r6g.large"
      port                       = 6379
      parameter_group_name       = "default.redis7"
    
      num_cache_clusters         = 2
    
      subnet_group_name          = aws_elasticache_subnet_group.main.name
      security_group_ids         = [aws_security_group.redis.id]
    
      at_rest_encryption_enabled = true
      transit_encryption_enabled = true
      auth_token                 = var.redis_auth_token
    }
    
  2. Variables

    # variables.tf
    variable "environment" {
      description = "Environment name"
      type        = string
      default     = "production"
    }
    
    variable "aws_region" {
      description = "AWS region"
      type        = string
      default     = "us-east-1"
    }
    
    variable "db_password" {
      description = "Database password"
      type        = string
      sensitive   = true
    }
    
    variable "redis_auth_token" {
      description = "Redis auth token"
      type        = string
      sensitive   = true
    }
    

📌 CI/CD Pipelines

GitHub Actions

Main CI/CD Pipeline

  1. Workflow Configuration
    # .github/workflows/ci-cd.yml
    name: CI/CD Pipeline
    
    on:
      push:
        branches: [main, develop]
      pull_request:
        branches: [main]
    
    env:
      REGISTRY: ghcr.io
      IMAGE_NAME: controlcore/control-core
    
    jobs:
      test:
        runs-on: ubuntu-latest
        services:
          postgres:
            image: postgres:15
            env:
              POSTGRES_PASSWORD: postgres
              POSTGRES_DB: test_db
            options: >-
              --health-cmd pg_isready
              --health-interval 10s
              --health-timeout 5s
              --health-retries 5
          
          redis:
            image: redis:7
            options: >-
              --health-cmd "redis-cli ping"
              --health-interval 10s
              --health-timeout 5s
              --health-retries 5
    
        steps:
          - name: Checkout code
            uses: actions/checkout@v4
    
          - name: Set up Python
            uses: actions/setup-python@v4
            with:
              python-version: '3.11'
    
          - name: Set up Node.js
            uses: actions/setup-node@v4
            with:
              node-version: '18'
              cache: 'npm'
              cache-dependency-path: control-plane-ui/package-lock.json
    
          - name: Install Python dependencies
            run: |
              cd control-plane-api
              pip install -r requirements.txt
              pip install -r requirements-dev.txt
    
          - name: Install Node.js dependencies
            run: |
              cd control-plane-ui
              npm ci
    
          - name: Run Python tests
            run: |
              cd control-plane-api
              pytest tests/ -v --cov=app --cov-report=xml
            env:
              DATABASE_URL: postgresql://postgres:postgres@localhost:5432/test_db
              REDIS_URL: redis://localhost:6379
    
          - name: Run Node.js tests
            run: |
              cd control-plane-ui
              npm test
    
          - name: Run security scan
            run: |
              cd control-plane-api
              bandit -r app/
              safety check
    
          - name: Upload coverage reports
            uses: codecov/codecov-action@v3
            with:
              files: ./control-plane-api/coverage.xml
    
      build:
        needs: test
        runs-on: ubuntu-latest
        if: github.ref == 'refs/heads/main'
        
        steps:
          - name: Checkout code
            uses: actions/checkout@v4
    
          - name: Set up Docker Buildx
            uses: docker/setup-buildx-action@v3
    
          - name: Log in to Container Registry
            uses: docker/login-action@v3
            with:
              registry: ${{ env.REGISTRY }}
              username: ${{ github.actor }}
              password: ${{ secrets.GITHUB_TOKEN }}
    
          - name: Extract metadata
            id: meta
            uses: docker/metadata-action@v5
            with:
              images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
              tags: |
                type=ref,event=branch
                type=ref,event=pr
                type=sha,prefix={{branch}}-
                type=raw,value=latest,enable={{is_default_branch}}
    
          - name: Build and push API image
            uses: docker/build-push-action@v5
            with:
              context: ./control-plane-api
              push: true
              # meta outputs already include the full image reference, so compose the tag explicitly
              tags: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}-api:${{ github.sha }}
              labels: ${{ steps.meta.outputs.labels }}
              cache-from: type=gha
              cache-to: type=gha,mode=max
    
          - name: Build and push Frontend image
            uses: docker/build-push-action@v5
            with:
              context: ./control-plane-ui
              push: true
              # meta outputs already include the full image reference, so compose the tag explicitly
              tags: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}-frontend:${{ github.sha }}
              labels: ${{ steps.meta.outputs.labels }}
              cache-from: type=gha
              cache-to: type=gha,mode=max
    
      deploy:
        needs: build
        runs-on: ubuntu-latest
        if: github.ref == 'refs/heads/main'
        environment: production
        
        steps:
          - name: Checkout code
            uses: actions/checkout@v4
    
          - name: Configure AWS credentials
            uses: aws-actions/configure-aws-credentials@v4
            with:
              aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
              aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
              aws-region: us-east-1
    
          - name: Update kubeconfig
            run: aws eks update-kubeconfig --region us-east-1 --name controlcore-production
    
          - name: Deploy to Kubernetes
            run: |
              helm upgrade --install controlcore ./deployment-assets/helm-chart/controlcore \
                --namespace controlcore \
                --create-namespace \
                --values ./deployment-assets/helm-chart/controlcore/values-production.yaml \
                --set image.tag=${{ github.sha }} \
                --set global.domain=controlcore.company.com \
                --wait
    
          - name: Run smoke tests
            run: |
              kubectl wait --for=condition=ready pod -l app=control-plane-api -n controlcore --timeout=300s
              kubectl wait --for=condition=ready pod -l app=control-plane-ui -n controlcore --timeout=300s
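Beyond waiting for pods to become ready, smoke tests usually probe the service itself. A minimal health-response check (the `/health` payload shape here is an assumption, mirroring the Dockerfile health check earlier in this guide):

```python
import json

def is_healthy(body: str, expected_status: str = "ok") -> bool:
    # Healthy only if the response parses as JSON and reports the expected status;
    # error pages and truncated responses both fail the check
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return payload.get("status") == expected_status

print(is_healthy('{"status": "ok"}'))  # True
print(is_healthy("<html>502</html>"))  # False
```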
    

Security Scanning Pipeline

  1. Security Workflow
    # .github/workflows/security.yml
    name: Security Scan
    
    on:
      schedule:
        - cron: '0 2 * * *'  # Daily at 2 AM
      push:
        branches: [main]
    
    jobs:
      security-scan:
        runs-on: ubuntu-latest
        
        steps:
          - name: Checkout code
            uses: actions/checkout@v4
    
          - name: Run Trivy vulnerability scanner
            uses: aquasecurity/trivy-action@master
            with:
              scan-type: 'fs'
              scan-ref: '.'
              format: 'sarif'
              output: 'trivy-results.sarif'
    
          - name: Upload Trivy scan results
            uses: github/codeql-action/upload-sarif@v2
            with:
              sarif_file: 'trivy-results.sarif'
    
          - name: Run Snyk security scan
            uses: snyk/actions/python@master
            env:
              SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
            with:
              args: --severity-threshold=high
    
          - name: Run Bandit security linter
            run: |
              pip install bandit
              bandit -r control-plane-api/app/ -f json -o bandit-report.json
    
          - name: Upload security reports
            uses: actions/upload-artifact@v3
            with:
              name: security-reports
              path: |
                bandit-report.json
                trivy-results.sarif
    

GitLab CI/CD

GitLab Pipeline

  1. GitLab CI Configuration
    # .gitlab-ci.yml
    stages:
      - test
      - build
      - security
      - deploy
    
    variables:
      DOCKER_DRIVER: overlay2
      DOCKER_TLS_CERTDIR: "/certs"
      REGISTRY: registry.gitlab.com
      IMAGE_NAME: $CI_REGISTRY_IMAGE
    
    services:
      - docker:24.0.5-dind
    
    before_script:
      - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
    
    test:api:
      stage: test
      image: python:3.11
      services:
        - postgres:15
        - redis:7
      variables:
        POSTGRES_DB: test_db
        POSTGRES_USER: postgres
        POSTGRES_PASSWORD: postgres
      script:
        - cd control-plane-api
        - pip install -r requirements.txt
        - pip install -r requirements-dev.txt
        - pytest tests/ -v --cov=app --cov-report=xml
      artifacts:
        reports:
          coverage_report:
            coverage_format: cobertura
            path: control-plane-api/coverage.xml
    
    test:frontend:
      stage: test
      image: node:18
      script:
        - cd control-plane-ui
        - npm ci
        - npm test
        - npm run build
      artifacts:
        paths:
          - control-plane-ui/out/
        expire_in: 1 hour
    
    build:api:
      stage: build
      image: docker:24.0.5
      script:
        - docker build -t $IMAGE_NAME-api:$CI_COMMIT_SHA ./control-plane-api
        - docker push $IMAGE_NAME-api:$CI_COMMIT_SHA
        - docker tag $IMAGE_NAME-api:$CI_COMMIT_SHA $IMAGE_NAME-api:latest
        - docker push $IMAGE_NAME-api:latest
      only:
        - main
    
    build:frontend:
      stage: build
      image: docker:24.0.5
      script:
        - docker build -t $IMAGE_NAME-frontend:$CI_COMMIT_SHA ./control-plane-ui
        - docker push $IMAGE_NAME-frontend:$CI_COMMIT_SHA
        - docker tag $IMAGE_NAME-frontend:$CI_COMMIT_SHA $IMAGE_NAME-frontend:latest
        - docker push $IMAGE_NAME-frontend:latest
      only:
        - main
    
    security-scan:
      stage: security
      image: aquasec/trivy:latest
      script:
        - trivy image --exit-code 1 --severity HIGH,CRITICAL $IMAGE_NAME-api:$CI_COMMIT_SHA
        - trivy image --exit-code 1 --severity HIGH,CRITICAL $IMAGE_NAME-frontend:$CI_COMMIT_SHA
      only:
        - main
    
    deploy:production:
      stage: deploy
      image: bitnami/kubectl:latest
      script:
        - kubectl config use-context production
        - >
          helm upgrade --install controlcore ./deployment-assets/helm-chart/controlcore
          --namespace controlcore
          --create-namespace
          --values ./deployment-assets/helm-chart/controlcore/values-production.yaml
          --set image.tag=$CI_COMMIT_SHA
      environment:
        name: production
        url: https://controlcore.company.com
      only:
        - main
      when: manual
    

👁️ Monitoring & Observability

Prometheus & Grafana

Prometheus Configuration

  1. Prometheus Config

    # prometheus.yml
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    
    rule_files:
      - "rules/*.yml"
    
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - alertmanager:9093
    
    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
    
      - job_name: 'controlcore-api'
        static_configs:
          - targets: ['control-plane-api:8000']
        metrics_path: '/metrics'
        scrape_interval: 30s
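The `__address__` relabel rule above rewrites `host[:port];annotation_port` into `host:annotation_port`, so the scrape target uses the port from the `prometheus.io/port` annotation. The same regex can be exercised directly (a sketch mirroring the `regex`/`replacement` pair):

```python
import re

# Same pattern as the relabel_configs entry: capture the host, drop any
# existing port, and take the port supplied by the pod annotation
RELABEL = re.compile(r"([^:]+)(?::\d+)?;(\d+)")

def rewrite_address(joined: str) -> str:
    m = RELABEL.fullmatch(joined)
    return f"{m.group(1)}:{m.group(2)}" if m else joined

print(rewrite_address("10.0.0.5:8080;9102"))  # 10.0.0.5:9102
print(rewrite_address("10.0.0.5;9102"))       # 10.0.0.5:9102
```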
    
  2. Alert Rules

    # rules/controlcore.yml
    groups:
      - name: controlcore
        rules:
          - alert: HighCPUUsage
            expr: cpu_usage_percent > 80
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage detected"
              description: "CPU usage is above 80% for more than 5 minutes"
    
          - alert: DatabaseConnectionFailure
            expr: database_connections_failed > 5
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Database connection failures"
              description: "Multiple database connection failures detected"
    
          - alert: PolicyEvaluationSlow
            expr: policy_evaluation_duration_seconds > 1
            for: 2m
            labels:
              severity: warning
            annotations:
              summary: "Slow policy evaluation"
              description: "Policy evaluation is taking longer than 1 second"
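The `for:` field means an alert only fires after its expression has been continuously true for the whole duration; any false evaluation resets the pending timer. A sketch of that pending-to-firing logic, assuming the 15s `evaluation_interval` from the Prometheus config above:

```python
def alert_fires(samples: list[bool], for_seconds: int, interval_seconds: int = 15) -> bool:
    # Fires once the expression has been true for `for_seconds` worth of
    # consecutive evaluations; a single false sample resets the streak
    needed = for_seconds // interval_seconds
    streak = 0
    for true_now in samples:
        streak = streak + 1 if true_now else 0
        if streak >= needed:
            return True
    return False

# 5m at a 15s interval requires 20 consecutive true evaluations
print(alert_fires([True] * 20, for_seconds=300))  # True
print(alert_fires([True] * 19 + [False], 300))    # False
```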
    

Grafana Dashboards

  1. Control Core Dashboard
    {
      "dashboard": {
        "title": "Control Core Overview",
        "panels": [
          {
            "title": "API Response Time",
            "type": "graph",
            "targets": [
              {
                "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
                "legendFormat": "95th percentile"
              }
            ]
          },
          {
            "title": "Policy Evaluations",
            "type": "graph",
            "targets": [
              {
                "expr": "rate(policy_evaluations_total[5m])",
                "legendFormat": "Evaluations/sec"
              }
            ]
          },
          {
            "title": "Active Users",
            "type": "stat",
            "targets": [
              {
                "expr": "active_users_total",
                "legendFormat": "Active Users"
              }
            ]
          }
        ]
      }
    }
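The `histogram_quantile(0.95, ...)` query in the first panel interpolates within cumulative histogram buckets. A simplified sketch of that calculation (Prometheus's real implementation works on rates and handles more edge cases; the bucket values here are illustrative):

```python
import math

def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    # buckets: (upper_bound, cumulative_count), sorted ascending, last bound = +Inf
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return lower_bound  # clamp to the last finite bound
            # Linear interpolation inside the bucket that contains the rank
            return lower_bound + (bound - lower_bound) * (rank - lower_count) / (count - lower_count)
        lower_bound, lower_count = bound, count
    return lower_bound

buckets = [(0.1, 100.0), (0.5, 150.0), (1.0, 158.0), (math.inf, 160.0)]
print(histogram_quantile(0.95, buckets))  # 0.625
```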
    

Logging

ELK Stack Configuration

  1. Fluentd Configuration

    # fluent.conf
    <source>
      @type tail
      path /var/log/controlcore/*.log
      pos_file /var/log/fluentd/controlcore.log.pos
      tag controlcore.*
      format json
      time_key timestamp
      time_format %Y-%m-%dT%H:%M:%S.%L%z
    </source>
    
    <filter controlcore.**>
      @type parser
      key_name message
      reserve_data true
      <parse>
        @type json
      </parse>
    </filter>
    
    <match controlcore.**>
      @type elasticsearch
      host elasticsearch.logging.svc.cluster.local
      port 9200
      index_name controlcore
      # type_name was removed in Elasticsearch 8; suppress it instead
      suppress_type_name true
      <buffer>
        @type file
        path /var/log/fluentd/buffers/controlcore
        flush_mode interval
        flush_interval 5s
        chunk_limit_size 8MB
        queue_limit_length 32
        retry_max_interval 30
        retry_forever true
      </buffer>
    </match>
    
  2. Log Aggregation

    # Log shipping script
    #!/bin/bash
    
    LOG_DIR="/var/log/controlcore"
    ARCHIVE_DIR="/var/log/controlcore/archive"
    
    # Create archive directory
    mkdir -p $ARCHIVE_DIR
    
    # Archive logs older than 7 days
    find $LOG_DIR -name "*.log" -mtime +7 -exec mv {} $ARCHIVE_DIR/ \;
    
    # Compress archived logs
    find $ARCHIVE_DIR -name "*.log" -mtime +1 -exec gzip {} \;
    
    # Ship logs to ELK via rsyslog (the daemon binary is rsyslogd)
    rsyslogd -f /etc/rsyslog.d/controlcore.conf
    

🔒 Security Operations

Security Scanning

Container Security

  1. Trivy Security Scan

    #!/bin/bash
    
    # Scan images for vulnerabilities
    trivy image --severity HIGH,CRITICAL controlcore/control-plane-api:latest
    trivy image --severity HIGH,CRITICAL controlcore/control-plane-ui:latest
    
    # Generate security report
    trivy image --format json --output security-report.json controlcore/control-plane-api:latest
    
    # Scan filesystem
    trivy fs --severity HIGH,CRITICAL /app
    
  2. Clair Security Scan

    # clair-config.yaml
    api:
      addr: ":6060"
      timeout: "30s"
      rate_limit: "10/s"
    
    updater:
      interval: "2h"
      concurrency: 5
    
    matcher:
      type: "vulnerability"
    
    notifier:
      webhook:
        url: "https://controlcore.company.com/security/webhook"
        timeout: "10s"
    

Network Security

  1. Network Policies
    # network-policy.yaml
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: controlcore-network-policy
      namespace: controlcore
    spec:
      podSelector:
        matchLabels:
          app: control-plane-api
      policyTypes:
        - Ingress
        - Egress
      ingress:
        - from:
            - namespaceSelector:
                matchLabels:
                  name: controlcore
            - podSelector:
                matchLabels:
                  app: control-plane-ui
          ports:
            - protocol: TCP
              port: 8000
      egress:
        - to:
            - podSelector:
                matchLabels:
                  app: postgresql
          ports:
            - protocol: TCP
              port: 5432
        - to:
            - podSelector:
                matchLabels:
                  app: redis
          ports:
            - protocol: TCP
              port: 6379
    

Compliance

SOC 2 Compliance

  1. Security Controls

    #!/bin/bash
    
    # Verify secrets are encrypted at rest (EKS envelope encryption via KMS)
    aws eks describe-cluster --name controlcore-production --query 'cluster.encryptionConfig'
    
    # Verify RBAC configuration
    kubectl auth can-i --list --as=system:serviceaccount:controlcore:control-plane-api
    
    # Check network policies
    kubectl get networkpolicies -n controlcore
    
    # Verify resource limits
    kubectl describe deployment control-plane-api -n controlcore | grep -A 10 "Limits"
    
  2. Audit Logging

    # audit-policy.yaml
    apiVersion: audit.k8s.io/v1
    kind: Policy
    rules:
      - level: Metadata
        namespaces: ["controlcore"]
        verbs: ["create", "update", "patch", "delete"]
        resources:
          - group: ""
            resources: ["secrets", "configmaps"]
          - group: "apps"
            resources: ["deployments", "services"]
    

⚡ Performance Optimization

System Tuning

Database Optimization

  1. PostgreSQL Tuning

    -- postgresql.conf optimizations (starting points; scale with available RAM)
    shared_buffers = 256MB
    effective_cache_size = 1GB
    maintenance_work_mem = 64MB
    checkpoint_completion_target = 0.9
    wal_buffers = 16MB
    default_statistics_target = 100
    random_page_cost = 1.1
    effective_io_concurrency = 200
    work_mem = 4MB
    min_wal_size = 1GB
    max_wal_size = 4GB
    
  2. Redis Optimization

    # redis.conf optimizations
    maxmemory 2gb
    maxmemory-policy allkeys-lru
    tcp-keepalive 60
    timeout 300
    tcp-backlog 511
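`maxmemory-policy allkeys-lru` means any key may be evicted once the memory cap is reached, least recently accessed first. A toy model of that behavior (capacity counted in keys rather than bytes, for illustration):

```python
from collections import OrderedDict

class LRUCache:
    """Toy model of Redis allkeys-lru: evict the least recently used key at capacity."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data: OrderedDict[str, str] = OrderedDict()

    def get(self, key: str):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # reads refresh recency
        return self.data[key]

    def set(self, key: str, value: str) -> None:
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used key

cache = LRUCache(2)
cache.set("a", "1"); cache.set("b", "2")
cache.get("a")           # refresh "a"
cache.set("c", "3")      # evicts "b", the least recently used key
print(list(cache.data))  # ['a', 'c']
```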
    

Application Optimization

  1. API Performance Tuning

    # FastAPI optimizations
    from fastapi import FastAPI
    from fastapi.routing import APIRoute
    
    def custom_generate_unique_id(route: APIRoute) -> str:
        # Stable operation IDs keep generated API clients consistent across builds
        return f"{route.tags[0]}-{route.name}" if route.tags else route.name
    
    app = FastAPI(
        title="Control Core API",
        docs_url="/docs",
        redoc_url="/redoc",
        # Performance optimizations
        generate_unique_id_function=custom_generate_unique_id,
    )
    
    # Connection pooling (DATABASE_URL comes from the application settings)
    from sqlalchemy import create_engine
    from sqlalchemy.pool import QueuePool
    
    engine = create_engine(
        DATABASE_URL,
        poolclass=QueuePool,
        pool_size=20,
        max_overflow=30,
        pool_pre_ping=True,
        pool_recycle=3600
    )
    
  2. Caching Strategy

    # Redis caching
    import json
    from functools import wraps
    from hashlib import sha256
    from redis import Redis
    
    redis_client = Redis(host='redis', port=6379, db=0)
    
    def cache_result(expiration=3600):
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                # hashlib gives stable keys across processes (built-in hash() does not)
                digest = sha256(f"{args!r}:{kwargs!r}".encode()).hexdigest()
                cache_key = f"{func.__name__}:{digest}"
                cached_result = redis_client.get(cache_key)
                
                if cached_result:
                    return json.loads(cached_result)
                
                result = func(*args, **kwargs)
                redis_client.setex(cache_key, expiration, json.dumps(result))
                return result
            return wrapper
        return decorator
    

📌 Disaster Recovery

Backup Strategy

Database Backup

  1. Automated Backup Script

    #!/bin/bash
    
    # backup-database.sh
    BACKUP_DIR="/backups/postgresql"
    DATE=$(date +%Y%m%d_%H%M%S)
    DB_NAME="control_core_db"
    
    # Create backup directory
    mkdir -p $BACKUP_DIR
    
    # Full backup
    pg_dump -h localhost -U postgres -d $DB_NAME | gzip > $BACKUP_DIR/full_backup_$DATE.sql.gz
    
    # Base backup for point-in-time recovery (pg_basebackup; archived WAL files supply the increments)
    pg_basebackup -h localhost -U postgres -D $BACKUP_DIR/incremental_$DATE -Ft -z -P
    
    # Cleanup old backups (keep 30 days)
    find $BACKUP_DIR -name "*.sql.gz" -mtime +30 -delete
    find $BACKUP_DIR -name "incremental_*" -mtime +30 -exec rm -rf {} \;
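The `find ... -mtime +30` cleanup above can be modeled as a pure function, which makes the retention rule easy to test without touching the filesystem (file names and ages here are illustrative):

```python
def backups_to_delete(files: dict[str, float], retention_days: int = 30) -> list[str]:
    # Mirrors `find -mtime +30`: select files strictly older than the retention window
    return sorted(name for name, age_days in files.items() if age_days > retention_days)

ages = {
    "full_backup_20240101.sql.gz": 45.0,
    "full_backup_20240201.sql.gz": 14.0,
    "incremental_20231220": 57.0,
}
print(backups_to_delete(ages))  # ['full_backup_20240101.sql.gz', 'incremental_20231220']
```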
    
  2. Backup Verification

    #!/bin/bash
    
    # verify-backup.sh
    BACKUP_FILE=$1
    
    # Test backup integrity
    gunzip -t $BACKUP_FILE
    
    # Restore to test database
    createdb -h localhost -U postgres test_restore_db
    gunzip -c $BACKUP_FILE | psql -h localhost -U postgres -d test_restore_db
    
    # Verify data integrity
    psql -h localhost -U postgres -d test_restore_db -c "SELECT COUNT(*) FROM users;"
    
    # Cleanup test database
    dropdb -h localhost -U postgres test_restore_db
    

Recovery Procedures

Database Recovery

  1. Point-in-Time Recovery
    #!/bin/bash
    
    # pitr-recovery.sh
    RECOVERY_TIME=$1  # Format: YYYY-MM-DD HH:MM:SS
    
    # Stop application
    kubectl scale deployment control-plane-api --replicas=0 -n controlcore
    
    # Restore the most recent base backup (from pg_basebackup) into the data
    # directory, then configure WAL replay. PostgreSQL 12+ reads recovery
    # settings from postgresql.auto.conf plus a recovery.signal file;
    # recovery.conf is no longer supported.
    echo "restore_command = 'cp /backups/wal/%f %p'" >> /var/lib/postgresql/data/postgresql.auto.conf
    echo "recovery_target_time = '$RECOVERY_TIME'" >> /var/lib/postgresql/data/postgresql.auto.conf
    touch /var/lib/postgresql/data/recovery.signal
    
    # Restart PostgreSQL
    systemctl restart postgresql
    
    # Verify recovery
    psql -h localhost -U postgres -d control_core_db -c "SELECT NOW();"
    
    # Restart application
    kubectl scale deployment control-plane-api --replicas=3 -n controlcore
    

Application Recovery

  1. Blue-Green Deployment
    #!/bin/bash
    
    # blue-green-deployment.sh
    NEW_VERSION=$1
    
    # Deploy to green environment
    helm upgrade controlcore-green ./helm-chart \
      --namespace controlcore-green \
      --create-namespace \
      --set image.tag=$NEW_VERSION
    
    # Wait for green deployment to be ready
    kubectl wait --for=condition=ready pod -l app=control-plane-api -n controlcore-green --timeout=300s
    
    # Run smoke tests against green
    ./smoke-tests.sh https://green.controlcore.company.com
    
    # Switch traffic to green
    kubectl patch service control-plane-api -n controlcore -p '{"spec":{"selector":{"version":"green"}}}'
    
    # Scale down blue environment
    kubectl scale deployment control-plane-api-blue --replicas=0 -n controlcore
    

🛠️ Troubleshooting

  • CI/CD pipeline fails on a policy or API step: verify the API base URL, authentication (e.g. token or API key), and network access from the runner; check the pipeline logs.
  • Deployment or rollout issues: confirm health checks and readiness probes; ensure the database (e.g. Postgres) and storage are available; review the Disaster Recovery section.
  • Monitoring or alerting gaps: validate metrics endpoints and alert rules; ensure Control Plane and bouncer health are monitored.
  • Secrets or config not available at runtime: check the secret manager or vault integration and injection (e.g. Kubernetes secrets, environment variables).

For more, see the Troubleshooting Guide.

📞 Support and Documentation

Getting Help

Contact Information