📘 DevOps Guide

This guide provides comprehensive instructions for DevOps engineers managing Control Core deployments, CI/CD pipelines, and infrastructure automation.

📌 Overview

Control Core DevOps responsibilities include:

  • Infrastructure Management: Kubernetes, Docker, and cloud infrastructure
  • CI/CD Pipelines: Automated testing, building, and deployment
  • Monitoring & Observability: System monitoring, logging, and alerting
  • Security Operations: Security scanning, compliance, and incident response
  • Performance Optimization: System tuning and scalability
  • Disaster Recovery: Backup, restore, and business continuity

CI/CD and automation guidance

Use this guide for customer-safe deployment automation patterns, release controls, and operational runbooks.

📌 Infrastructure Management

Kubernetes Deployment

Production Kubernetes Setup

  1. Cluster Requirements

    # Minimum cluster specifications
    nodes:
      - type: "master"
        count: 3
        specs:
          cpu: "4 cores"
          memory: "16GB"
          storage: "100GB SSD"
      - type: "worker"
        count: 3
        specs:
          cpu: "8 cores"
          memory: "32GB"
          storage: "200GB SSD"
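As a quick sanity check, the aggregate capacity of the worker pool implied by these specs can be computed directly (a sketch; node counts and per-node sizes are taken from the YAML above):

```python
# Aggregate schedulable capacity of the worker pool: 3 workers, 8 cores / 32 GB each
def pool_capacity(node_count: int, cpu_cores: int, memory_gb: int) -> dict:
    return {
        "cpu_cores": node_count * cpu_cores,
        "memory_gb": node_count * memory_gb,
    }

workers = pool_capacity(node_count=3, cpu_cores=8, memory_gb=32)
print(workers)  # {'cpu_cores': 24, 'memory_gb': 96}
```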
    
  2. Deploy with Helm

    # Add Control Core Helm repository
    helm repo add controlcore https://charts.controlcore.io
    helm repo update
    
    # Install Control Core
    helm install controlcore controlcore/controlcore \
      --namespace controlcore \
      --create-namespace \
      --values values-production.yaml \
      --set global.domain=controlcore.company.com \
      --set global.tls.enabled=true
    

Helm Chart Configuration

  1. Production Values
    # values-production.yaml
    global:
      domain: "controlcore.company.com"
      tls:
        enabled: true
        certManager:
          enabled: true
    
    # Database configuration
    postgresql:
      enabled: true
      auth:
        postgresPassword: "secure_password"
        database: "control_core_db"
      primary:
        persistence:
          size: 100Gi
          storageClass: "fast-ssd"
    
    # Redis configuration
    redis:
      enabled: true
      auth:
        enabled: true
        password: "redis_secure_password"
      master:
        persistence:
          size: 50Gi
          storageClass: "fast-ssd"
    
    # Application configuration
    control-plane-api:
      replicas: 3
      resources:
        requests:
          cpu: 500m
          memory: 1Gi
        limits:
          cpu: 2000m
          memory: 4Gi
      autoscaling:
        enabled: true
        minReplicas: 3
        maxReplicas: 10
        targetCPUUtilizationPercentage: 70
    
    # Monitoring
    monitoring:
      enabled: true
      prometheus:
        enabled: true
      grafana:
        enabled: true
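The autoscaling block above follows the standard Kubernetes HPA rule: desired replicas = ceil(current replicas × current utilization / target utilization), clamped to the configured bounds. A sketch of that calculation:

```python
import math

def hpa_desired_replicas(current: int, current_util: float, target_util: float,
                         min_replicas: int = 3, max_replicas: int = 10) -> int:
    # Standard Kubernetes HPA scaling formula, clamped to min/max replicas
    desired = math.ceil(current * current_util / target_util)
    return max(min_replicas, min(max_replicas, desired))

# At 3 replicas averaging 95% CPU against a 70% target, the HPA scales to 5
print(hpa_desired_replicas(3, 95, 70))  # 5
```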
    

Docker Configuration

Multi-stage Dockerfile

  1. API Service Dockerfile

    # Dockerfile for control-plane-api
    FROM python:3.11-slim as builder
    
    # Install build dependencies
    RUN apt-get update && apt-get install -y \
        build-essential \
        libpq-dev \
        && rm -rf /var/lib/apt/lists/*
    
    # Install Python dependencies
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
    
    # Production stage
    FROM python:3.11-slim
    
    # Install runtime dependencies
    RUN apt-get update && apt-get install -y \
        libpq5 \
        && rm -rf /var/lib/apt/lists/*
    
    # Create non-root user
    RUN useradd --create-home --shell /bin/bash controlcore
    
    # Copy application
    COPY --from=builder /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages
    COPY --from=builder /usr/local/bin /usr/local/bin
    COPY . /app
    
    # Set permissions
    RUN chown -R controlcore:controlcore /app
    USER controlcore
    
    # Expose port
    EXPOSE 8000
    
    # Health check (python:3.11-slim does not ship curl, so use the Python stdlib)
    HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
      CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
    
    # Start application
    CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
    
  2. Frontend Dockerfile

    # Dockerfile for control-plane-ui frontend
    FROM node:18-alpine as builder
    
    # Set working directory
    WORKDIR /app
    
    # Copy package files
    COPY package*.json ./
    
    # Install dependencies (the build step needs devDependencies, so install everything)
    RUN npm ci
    
    # Copy source code
    COPY . .
    
    # Build application
    RUN npm run build
    
    # Production stage
    FROM nginx:alpine
    
    # Copy the built static export (the output directory depends on the framework's build config)
    COPY --from=builder /app/out /usr/share/nginx/html
    
    # Copy nginx configuration
    COPY nginx.conf /etc/nginx/nginx.conf
    
    # Expose port
    EXPOSE 80
    
    # Start nginx
    CMD ["nginx", "-g", "daemon off;"]
    

Infrastructure as Code

Terraform Configuration

  1. AWS Infrastructure

    # main.tf
    terraform {
      required_version = ">= 1.0"
      required_providers {
        aws = {
          source  = "hashicorp/aws"
          version = "~> 5.0"
        }
      }
    }
    
    provider "aws" {
      region = var.aws_region
    }
    
    # EKS Cluster
    resource "aws_eks_cluster" "controlcore" {
      name     = "controlcore-${var.environment}"
      role_arn = aws_iam_role.eks_cluster.arn
      version  = "1.28"
    
      vpc_config {
        subnet_ids              = aws_subnet.private[*].id
        endpoint_private_access = true
        endpoint_public_access  = true
        public_access_cidrs     = ["0.0.0.0/0"] # restrict to trusted CIDRs in production
      }
    
      encryption_config {
        provider {
          key_arn = aws_kms_key.eks.arn
        }
        resources = ["secrets"]
      }
    }
    
    # RDS PostgreSQL
    resource "aws_db_instance" "postgres" {
      identifier = "controlcore-postgres-${var.environment}"
    
      engine         = "postgres"
      engine_version = "15.4"
      instance_class = "db.r5.xlarge"
    
      allocated_storage     = 100
      max_allocated_storage = 1000
      storage_type          = "gp3"
      storage_encrypted     = true
    
      db_name  = "control_core_db"
      username = "postgres"
      password = var.db_password
    
      vpc_security_group_ids = [aws_security_group.rds.id]
      db_subnet_group_name   = aws_db_subnet_group.main.name
    
      backup_retention_period = 7
      backup_window          = "03:00-04:00"
      maintenance_window     = "sun:04:00-sun:05:00"
    
      skip_final_snapshot = false
      final_snapshot_identifier = "controlcore-postgres-final-snapshot-${var.environment}"
    }
    
    # ElastiCache Redis
    resource "aws_elasticache_replication_group" "redis" {
      replication_group_id       = "controlcore-redis-${var.environment}"
      description                = "Control Core Redis cluster"
    
      node_type                  = "cache.r6g.large"
      port                       = 6379
      parameter_group_name       = "default.redis7"
    
      num_cache_clusters         = 2
    
      subnet_group_name          = aws_elasticache_subnet_group.main.name
      security_group_ids         = [aws_security_group.redis.id]
    
      at_rest_encryption_enabled = true
      transit_encryption_enabled = true
      auth_token                 = var.redis_auth_token
    }
    
  2. Variables

    # variables.tf
    variable "environment" {
      description = "Environment name"
      type        = string
      default     = "production"
    }
    
    variable "aws_region" {
      description = "AWS region"
      type        = string
      default     = "us-east-1"
    }
    
    variable "db_password" {
      description = "Database password"
      type        = string
      sensitive   = true
    }
    
    variable "redis_auth_token" {
      description = "Redis auth token"
      type        = string
      sensitive   = true
    }
    

📌 CI/CD Pipelines

GitHub Actions

Main CI/CD Pipeline

  1. Workflow Configuration
    # .github/workflows/ci-cd.yml
    name: CI/CD Pipeline
    
    on:
      push:
        branches: [main, develop]
      pull_request:
        branches: [main]
    
    env:
      REGISTRY: ghcr.io
      IMAGE_NAME: controlcore/control-core
    
    jobs:
      test:
        runs-on: ubuntu-latest
        services:
          postgres:
            image: postgres:15
            env:
              POSTGRES_PASSWORD: postgres
              POSTGRES_DB: test_db
            options: >-
              --health-cmd pg_isready
              --health-interval 10s
              --health-timeout 5s
              --health-retries 5
          
          redis:
            image: redis:7
            options: >-
              --health-cmd "redis-cli ping"
              --health-interval 10s
              --health-timeout 5s
              --health-retries 5
    
        steps:
          - name: Checkout code
            uses: actions/checkout@v4
    
          - name: Set up Python
            uses: actions/setup-python@v4
            with:
              python-version: '3.11'
    
          - name: Set up Node.js
            uses: actions/setup-node@v4
            with:
              node-version: '18'
              cache: 'npm'
              cache-dependency-path: control-plane-ui/package-lock.json
    
          - name: Install Python dependencies
            run: |
              cd control-plane-api
              pip install -r requirements.txt
              pip install -r requirements-dev.txt
    
          - name: Install Node.js dependencies
            run: |
              cd control-plane-ui
              npm ci
    
          - name: Run Python tests
            run: |
              cd control-plane-api
              pytest tests/ -v --cov=app --cov-report=xml
            env:
              DATABASE_URL: postgresql://postgres:postgres@localhost:5432/test_db
              REDIS_URL: redis://localhost:6379
    
          - name: Run Node.js tests
            run: |
              cd control-plane-ui
              npm test
    
          - name: Run security scan
            run: |
              cd control-plane-api
              bandit -r app/
              safety check
    
          - name: Upload coverage reports
            uses: codecov/codecov-action@v3
            with:
              files: ./control-plane-api/coverage.xml
    
      build:
        needs: test
        runs-on: ubuntu-latest
        if: github.ref == 'refs/heads/main'
        
        steps:
          - name: Checkout code
            uses: actions/checkout@v4
    
          - name: Set up Docker Buildx
            uses: docker/setup-buildx-action@v3
    
          - name: Log in to Container Registry
            uses: docker/login-action@v3
            with:
              registry: ${{ env.REGISTRY }}
              username: ${{ github.actor }}
              password: ${{ secrets.GITHUB_TOKEN }}
    
          - name: Extract metadata
            id: meta
            uses: docker/metadata-action@v5
            with:
              images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
              tags: |
                type=ref,event=branch
                type=ref,event=pr
                type=sha,prefix={{branch}}-
                type=raw,value=latest,enable={{is_default_branch}}
    
          - name: Build and push API image
            uses: docker/build-push-action@v5
            with:
              context: ./control-plane-api
              push: true
              # meta outputs already include the full image reference, so compose the tag explicitly
              tags: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}-api:${{ github.sha }}
              labels: ${{ steps.meta.outputs.labels }}
              cache-from: type=gha
              cache-to: type=gha,mode=max
    
          - name: Build and push Frontend image
            uses: docker/build-push-action@v5
            with:
              context: ./control-plane-ui
              push: true
              # meta outputs already include the full image reference, so compose the tag explicitly
              tags: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}-frontend:${{ github.sha }}
              labels: ${{ steps.meta.outputs.labels }}
              cache-from: type=gha
              cache-to: type=gha,mode=max
    
      deploy:
        needs: build
        runs-on: ubuntu-latest
        if: github.ref == 'refs/heads/main'
        environment: production
        
        steps:
          - name: Checkout code
            uses: actions/checkout@v4
    
          - name: Configure AWS credentials
            uses: aws-actions/configure-aws-credentials@v4
            with:
              aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
              aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
              aws-region: us-east-1
    
          - name: Update kubeconfig
            run: aws eks update-kubeconfig --region us-east-1 --name controlcore-production
    
          - name: Deploy to Kubernetes
            run: |
              helm upgrade --install controlcore ./deployment-assets/helm-chart/controlcore \
                --namespace controlcore \
                --create-namespace \
                --values ./deployment-assets/helm-chart/controlcore/values-production.yaml \
                --set image.tag=${{ github.sha }} \
                --set global.domain=controlcore.company.com \
                --wait
    
          - name: Run smoke tests
            run: |
              kubectl wait --for=condition=ready pod -l app=control-plane-api -n controlcore --timeout=300s
              kubectl wait --for=condition=ready pod -l app=control-plane-ui -n controlcore --timeout=300s
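Beyond waiting for pods to become ready, smoke tests usually probe the service itself. A minimal health-response check (the `/health` payload shape here is an assumption, mirroring the Dockerfile health check earlier in this guide):

```python
import json

def is_healthy(body: str, expected_status: str = "ok") -> bool:
    # Healthy only if the response parses as JSON and reports the expected status;
    # error pages and truncated responses both fail the check
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return payload.get("status") == expected_status

print(is_healthy('{"status": "ok"}'))  # True
print(is_healthy("<html>502</html>"))  # False
```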
    

Security Scanning Pipeline

  1. Security Workflow
    # .github/workflows/security.yml
    name: Security Scan
    
    on:
      schedule:
        - cron: '0 2 * * *'  # Daily at 2 AM
      push:
        branches: [main]
    
    jobs:
      security-scan:
        runs-on: ubuntu-latest
        
        steps:
          - name: Checkout code
            uses: actions/checkout@v4
    
          - name: Run Trivy vulnerability scanner
            uses: aquasecurity/trivy-action@master
            with:
              scan-type: 'fs'
              scan-ref: '.'
              format: 'sarif'
              output: 'trivy-results.sarif'
    
          - name: Upload Trivy scan results
            uses: github/codeql-action/upload-sarif@v2
            with:
              sarif_file: 'trivy-results.sarif'
    
          - name: Run Snyk security scan
            uses: snyk/actions/python@master
            env:
              SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
            with:
              args: --severity-threshold=high
    
          - name: Run Bandit security linter
            run: |
              pip install bandit
              bandit -r control-plane-api/app/ -f json -o bandit-report.json
    
          - name: Upload security reports
            uses: actions/upload-artifact@v3
            with:
              name: security-reports
              path: |
                bandit-report.json
                trivy-results.sarif
    

GitLab CI/CD

GitLab Pipeline

  1. GitLab CI Configuration
    # .gitlab-ci.yml
    stages:
      - test
      - build
      - security
      - deploy
    
    variables:
      DOCKER_DRIVER: overlay2
      DOCKER_TLS_CERTDIR: "/certs"
      REGISTRY: registry.gitlab.com
      IMAGE_NAME: $CI_REGISTRY_IMAGE
    
    services:
      - docker:24.0.5-dind
    
    before_script:
      - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
    
    test:api:
      stage: test
      image: python:3.11
      services:
        - postgres:15
        - redis:7
      variables:
        POSTGRES_DB: test_db
        POSTGRES_USER: postgres
        POSTGRES_PASSWORD: postgres
      script:
        - cd control-plane-api
        - pip install -r requirements.txt
        - pip install -r requirements-dev.txt
        - pytest tests/ -v --cov=app --cov-report=xml
      artifacts:
        reports:
          coverage_report:
            coverage_format: cobertura
            path: control-plane-api/coverage.xml
    
    test:frontend:
      stage: test
      image: node:18
      script:
        - cd control-plane-ui
        - npm ci
        - npm test
        - npm run build
      artifacts:
        paths:
          - control-plane-ui/out/
        expire_in: 1 hour
    
    build:api:
      stage: build
      image: docker:24.0.5
      script:
        - docker build -t $IMAGE_NAME-api:$CI_COMMIT_SHA ./control-plane-api
        - docker push $IMAGE_NAME-api:$CI_COMMIT_SHA
        - docker tag $IMAGE_NAME-api:$CI_COMMIT_SHA $IMAGE_NAME-api:latest
        - docker push $IMAGE_NAME-api:latest
      only:
        - main
    
    build:frontend:
      stage: build
      image: docker:24.0.5
      script:
        - docker build -t $IMAGE_NAME-frontend:$CI_COMMIT_SHA ./control-plane-ui
        - docker push $IMAGE_NAME-frontend:$CI_COMMIT_SHA
        - docker tag $IMAGE_NAME-frontend:$CI_COMMIT_SHA $IMAGE_NAME-frontend:latest
        - docker push $IMAGE_NAME-frontend:latest
      only:
        - main
    
    security-scan:
      stage: security
      image: aquasec/trivy:latest
      script:
        - trivy image --exit-code 1 --severity HIGH,CRITICAL $IMAGE_NAME-api:$CI_COMMIT_SHA
        - trivy image --exit-code 1 --severity HIGH,CRITICAL $IMAGE_NAME-frontend:$CI_COMMIT_SHA
      only:
        - main
    
    deploy:production:
      stage: deploy
      image: bitnami/kubectl:latest
      script:
        - kubectl config use-context production
        - >
          helm upgrade --install controlcore ./deployment-assets/helm-chart/controlcore
          --namespace controlcore
          --create-namespace
          --values ./deployment-assets/helm-chart/controlcore/values-production.yaml
          --set image.tag=$CI_COMMIT_SHA
      environment:
        name: production
        url: https://controlcore.company.com
      only:
        - main
      when: manual
    

👁️ Monitoring & Observability

Prometheus & Grafana

Prometheus Configuration

  1. Prometheus Config

    # prometheus.yml
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    
    rule_files:
      - "rules/*.yml"
    
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - alertmanager:9093
    
    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
    
      - job_name: 'controlcore-api'
        static_configs:
          - targets: ['control-plane-api:8000']
        metrics_path: '/metrics'
        scrape_interval: 30s
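The `__address__` relabel rule above rewrites `host[:port];annotation_port` into `host:annotation_port`, so the scrape target uses the port from the `prometheus.io/port` annotation. The same regex can be exercised directly (a sketch mirroring the `regex`/`replacement` pair):

```python
import re

# Same pattern as the relabel_configs entry: capture the host, drop any
# existing port, and take the port supplied by the pod annotation
RELABEL = re.compile(r"([^:]+)(?::\d+)?;(\d+)")

def rewrite_address(joined: str) -> str:
    m = RELABEL.fullmatch(joined)
    return f"{m.group(1)}:{m.group(2)}" if m else joined

print(rewrite_address("10.0.0.5:8080;9102"))  # 10.0.0.5:9102
print(rewrite_address("10.0.0.5;9102"))       # 10.0.0.5:9102
```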
    
  2. Alert Rules

    # rules/controlcore.yml
    groups:
      - name: controlcore
        rules:
          - alert: HighCPUUsage
            expr: cpu_usage_percent > 80
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage detected"
              description: "CPU usage is above 80% for more than 5 minutes"
    
          - alert: DatabaseConnectionFailure
            expr: database_connections_failed > 5
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Database connection failures"
              description: "Multiple database connection failures detected"
    
          - alert: PolicyEvaluationSlow
            expr: policy_evaluation_duration_seconds > 1
            for: 2m
            labels:
              severity: warning
            annotations:
              summary: "Slow policy evaluation"
              description: "Policy evaluation is taking longer than 1 second"
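The `for:` field means an alert only fires after its expression has been continuously true for the whole duration; any false evaluation resets the pending timer. A sketch of that pending-to-firing logic, assuming the 15s `evaluation_interval` from the Prometheus config above:

```python
def alert_fires(samples: list[bool], for_seconds: int, interval_seconds: int = 15) -> bool:
    # Fires once the expression has been true for `for_seconds` worth of
    # consecutive evaluations; a single false sample resets the streak
    needed = for_seconds // interval_seconds
    streak = 0
    for true_now in samples:
        streak = streak + 1 if true_now else 0
        if streak >= needed:
            return True
    return False

# 5m at a 15s interval requires 20 consecutive true evaluations
print(alert_fires([True] * 20, for_seconds=300))  # True
print(alert_fires([True] * 19 + [False], 300))    # False
```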
    

Grafana Dashboards

  1. Control Core Dashboard
    {
      "dashboard": {
        "title": "Control Core Overview",
        "panels": [
          {
            "title": "API Response Time",
            "type": "graph",
            "targets": [
              {
                "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
                "legendFormat": "95th percentile"
              }
            ]
          },
          {
            "title": "Policy Evaluations",
            "type": "graph",
            "targets": [
              {
                "expr": "rate(policy_evaluations_total[5m])",
                "legendFormat": "Evaluations/sec"
              }
            ]
          },
          {
            "title": "Active Users",
            "type": "stat",
            "targets": [
              {
                "expr": "active_users_total",
                "legendFormat": "Active Users"
              }
            ]
          }
        ]
      }
    }
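The `histogram_quantile(0.95, ...)` query in the first panel interpolates within cumulative histogram buckets. A simplified sketch of that calculation (Prometheus's real implementation works on rates and handles more edge cases; the bucket values here are illustrative):

```python
import math

def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    # buckets: (upper_bound, cumulative_count), sorted ascending, last bound = +Inf
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return lower_bound  # clamp to the last finite bound
            # Linear interpolation inside the bucket that contains the rank
            return lower_bound + (bound - lower_bound) * (rank - lower_count) / (count - lower_count)
        lower_bound, lower_count = bound, count
    return lower_bound

buckets = [(0.1, 100.0), (0.5, 150.0), (1.0, 158.0), (math.inf, 160.0)]
print(histogram_quantile(0.95, buckets))  # 0.625
```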
    

Logging

ELK Stack Configuration

  1. Fluentd Configuration

    # fluent.conf
    <source>
      @type tail
      path /var/log/controlcore/*.log
      pos_file /var/log/fluentd/controlcore.log.pos
      tag controlcore.*
      format json
      time_key timestamp
      time_format %Y-%m-%dT%H:%M:%S.%L%z
    </source>
    
    <filter controlcore.**>
      @type parser
      key_name message
      reserve_data true
      <parse>
        @type json
      </parse>
    </filter>
    
    <match controlcore.**>
      @type elasticsearch
      host elasticsearch.logging.svc.cluster.local
      port 9200
      index_name controlcore
      # type_name was removed in Elasticsearch 8; suppress it instead
      suppress_type_name true
      <buffer>
        @type file
        path /var/log/fluentd/buffers/controlcore
        flush_mode interval
        flush_interval 5s
        chunk_limit_size 8MB
        queue_limit_length 32
        retry_max_interval 30
        retry_forever true
      </buffer>
    </match>
    
  2. Log Aggregation

    # Log shipping script
    #!/bin/bash
    
    LOG_DIR="/var/log/controlcore"
    ARCHIVE_DIR="/var/log/controlcore/archive"
    
    # Create archive directory
    mkdir -p $ARCHIVE_DIR
    
    # Archive logs older than 7 days
    find $LOG_DIR -name "*.log" -mtime +7 -exec mv {} $ARCHIVE_DIR/ \;
    
    # Compress archived logs
    find $ARCHIVE_DIR -name "*.log" -mtime +1 -exec gzip {} \;
    
    # Ship logs to ELK via rsyslog (the daemon binary is rsyslogd)
    rsyslogd -f /etc/rsyslog.d/controlcore.conf
    

🔒 Security Operations

Security Scanning

Container Security

  1. Trivy Security Scan

    #!/bin/bash
    
    # Scan images for vulnerabilities
    trivy image --severity HIGH,CRITICAL controlcore/control-plane-api:latest
    trivy image --severity HIGH,CRITICAL controlcore/control-plane-ui:latest
    
    # Generate security report
    trivy image --format json --output security-report.json controlcore/control-plane-api:latest
    
    # Scan filesystem
    trivy fs --severity HIGH,CRITICAL /app
    
  2. Clair Security Scan

    # clair-config.yaml
    api:
      addr: ":6060"
      timeout: "30s"
      rate_limit: "10/s"
    
    updater:
      interval: "2h"
      concurrency: 5
    
    matcher:
      type: "vulnerability"
    
    notifier:
      webhook:
        url: "https://controlcore.company.com/security/webhook"
        timeout: "10s"
    

Network Security

  1. Network Policies
    # network-policy.yaml
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: controlcore-network-policy
      namespace: controlcore
    spec:
      podSelector:
        matchLabels:
          app: control-plane-api
      policyTypes:
        - Ingress
        - Egress
      ingress:
        - from:
            - namespaceSelector:
                matchLabels:
                  name: controlcore
            - podSelector:
                matchLabels:
                  app: control-plane-ui
          ports:
            - protocol: TCP
              port: 8000
      egress:
        - to:
            - podSelector:
                matchLabels:
                  app: postgresql
          ports:
            - protocol: TCP
              port: 5432
        - to:
            - podSelector:
                matchLabels:
                  app: redis
          ports:
            - protocol: TCP
              port: 6379
    

Compliance

SOC 2 Compliance

  1. Security Controls

    #!/bin/bash
    
    # Verify secrets are encrypted at rest (EKS envelope encryption via KMS)
    aws eks describe-cluster --name controlcore-production --query 'cluster.encryptionConfig'
    
    # Verify RBAC configuration
    kubectl auth can-i --list --as=system:serviceaccount:controlcore:control-plane-api
    
    # Check network policies
    kubectl get networkpolicies -n controlcore
    
    # Verify resource limits
    kubectl describe deployment control-plane-api -n controlcore | grep -A 10 "Limits"
    
  2. Audit Logging

    # audit-policy.yaml
    apiVersion: audit.k8s.io/v1
    kind: Policy
    rules:
      - level: Metadata
        namespaces: ["controlcore"]
        verbs: ["create", "update", "patch", "delete"]
        resources:
          - group: ""
            resources: ["secrets", "configmaps"]
          - group: "apps"
            resources: ["deployments", "services"]
    

⚡ Performance Optimization

System Tuning

Database Optimization

  1. PostgreSQL Tuning

    -- postgresql.conf optimizations (starting points; scale with available RAM)
    shared_buffers = 256MB
    effective_cache_size = 1GB
    maintenance_work_mem = 64MB
    checkpoint_completion_target = 0.9
    wal_buffers = 16MB
    default_statistics_target = 100
    random_page_cost = 1.1
    effective_io_concurrency = 200
    work_mem = 4MB
    min_wal_size = 1GB
    max_wal_size = 4GB
    
  2. Redis Optimization

    # redis.conf optimizations
    maxmemory 2gb
    maxmemory-policy allkeys-lru
    tcp-keepalive 60
    timeout 300
    tcp-backlog 511
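`maxmemory-policy allkeys-lru` means any key may be evicted once the memory cap is reached, least recently accessed first. A toy model of that behavior (capacity counted in keys rather than bytes, for illustration):

```python
from collections import OrderedDict

class LRUCache:
    """Toy model of Redis allkeys-lru: evict the least recently used key at capacity."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data: OrderedDict[str, str] = OrderedDict()

    def get(self, key: str):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # reads refresh recency
        return self.data[key]

    def set(self, key: str, value: str) -> None:
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used key

cache = LRUCache(2)
cache.set("a", "1"); cache.set("b", "2")
cache.get("a")           # refresh "a"
cache.set("c", "3")      # evicts "b", the least recently used key
print(list(cache.data))  # ['a', 'c']
```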
    

Application Optimization

  1. API Performance Tuning

    # FastAPI optimizations
    from fastapi import FastAPI
    from fastapi.routing import APIRoute
    
    def custom_generate_unique_id(route: APIRoute) -> str:
        # Stable operation IDs keep generated API clients consistent across builds
        return f"{route.tags[0]}-{route.name}" if route.tags else route.name
    
    app = FastAPI(
        title="Control Core API",
        docs_url="/docs",
        redoc_url="/redoc",
        # Performance optimizations
        generate_unique_id_function=custom_generate_unique_id,
    )
    
    # Connection pooling (DATABASE_URL comes from the application settings)
    from sqlalchemy import create_engine
    from sqlalchemy.pool import QueuePool
    
    engine = create_engine(
        DATABASE_URL,
        poolclass=QueuePool,
        pool_size=20,
        max_overflow=30,
        pool_pre_ping=True,
        pool_recycle=3600
    )
    
  2. Caching Strategy

    # Redis caching
    import json
    from functools import wraps
    from hashlib import sha256
    from redis import Redis
    
    redis_client = Redis(host='redis', port=6379, db=0)
    
    def cache_result(expiration=3600):
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                # hashlib gives stable keys across processes (built-in hash() does not)
                digest = sha256(f"{args!r}:{kwargs!r}".encode()).hexdigest()
                cache_key = f"{func.__name__}:{digest}"
                cached_result = redis_client.get(cache_key)
                
                if cached_result:
                    return json.loads(cached_result)
                
                result = func(*args, **kwargs)
                redis_client.setex(cache_key, expiration, json.dumps(result))
                return result
            return wrapper
        return decorator
    

📌 Disaster Recovery

Backup Strategy

Database Backup

  1. Automated Backup Script

    #!/bin/bash
    
    # backup-database.sh
    BACKUP_DIR="/backups/postgresql"
    DATE=$(date +%Y%m%d_%H%M%S)
    DB_NAME="control_core_db"
    
    # Create backup directory
    mkdir -p $BACKUP_DIR
    
    # Full backup
    pg_dump -h localhost -U postgres -d $DB_NAME | gzip > $BACKUP_DIR/full_backup_$DATE.sql.gz
    
    # Base backup for point-in-time recovery (pg_basebackup; archived WAL files supply the increments)
    pg_basebackup -h localhost -U postgres -D $BACKUP_DIR/incremental_$DATE -Ft -z -P
    
    # Cleanup old backups (keep 30 days)
    find $BACKUP_DIR -name "*.sql.gz" -mtime +30 -delete
    find $BACKUP_DIR -name "incremental_*" -mtime +30 -exec rm -rf {} \;
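The `find ... -mtime +30` cleanup above can be modeled as a pure function, which makes the retention rule easy to test without touching the filesystem (file names and ages here are illustrative):

```python
def backups_to_delete(files: dict[str, float], retention_days: int = 30) -> list[str]:
    # Mirrors `find -mtime +30`: select files strictly older than the retention window
    return sorted(name for name, age_days in files.items() if age_days > retention_days)

ages = {
    "full_backup_20240101.sql.gz": 45.0,
    "full_backup_20240201.sql.gz": 14.0,
    "incremental_20231220": 57.0,
}
print(backups_to_delete(ages))  # ['full_backup_20240101.sql.gz', 'incremental_20231220']
```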
    
  2. Backup Verification

    #!/bin/bash
    
    # verify-backup.sh
    BACKUP_FILE=$1
    
    # Test backup integrity
    gunzip -t $BACKUP_FILE
    
    # Restore to test database
    createdb -h localhost -U postgres test_restore_db
    gunzip -c $BACKUP_FILE | psql -h localhost -U postgres -d test_restore_db
    
    # Verify data integrity
    psql -h localhost -U postgres -d test_restore_db -c "SELECT COUNT(*) FROM users;"
    
    # Cleanup test database
    dropdb -h localhost -U postgres test_restore_db
    

Recovery Procedures

Database Recovery

  1. Point-in-Time Recovery
    #!/bin/bash
    
    # pitr-recovery.sh
    RECOVERY_TIME=$1  # Format: YYYY-MM-DD HH:MM:SS
    
    # Stop application
    kubectl scale deployment control-plane-api --replicas=0 -n controlcore
    
    # Restore the most recent base backup (from pg_basebackup) into the data
    # directory, then configure WAL replay. PostgreSQL 12+ reads recovery
    # settings from postgresql.auto.conf plus a recovery.signal file;
    # recovery.conf is no longer supported.
    echo "restore_command = 'cp /backups/wal/%f %p'" >> /var/lib/postgresql/data/postgresql.auto.conf
    echo "recovery_target_time = '$RECOVERY_TIME'" >> /var/lib/postgresql/data/postgresql.auto.conf
    touch /var/lib/postgresql/data/recovery.signal
    
    # Restart PostgreSQL
    systemctl restart postgresql
    
    # Verify recovery
    psql -h localhost -U postgres -d control_core_db -c "SELECT NOW();"
    
    # Restart application
    kubectl scale deployment control-plane-api --replicas=3 -n controlcore
    

Application Recovery

  1. Blue-Green Deployment
    #!/bin/bash
    
    # blue-green-deployment.sh
    NEW_VERSION=$1
    
    # Deploy to green environment
    helm upgrade controlcore-green ./helm-chart \
      --namespace controlcore-green \
      --create-namespace \
      --set image.tag=$NEW_VERSION
    
    # Wait for green deployment to be ready
    kubectl wait --for=condition=ready pod -l app=control-plane-api -n controlcore-green --timeout=300s
    
    # Run smoke tests against green
    ./smoke-tests.sh https://green.controlcore.company.com
    
    # Switch traffic to green
    kubectl patch service control-plane-api -n controlcore -p '{"spec":{"selector":{"version":"green"}}}'
    
    # Scale down blue environment
    kubectl scale deployment control-plane-api-blue --replicas=0 -n controlcore
    

🛠️ Troubleshooting

  • CI/CD pipeline fails on a policy or API step: verify the API base URL, authentication (e.g. token or API key), and network access from the runner; check the pipeline logs.
  • Deployment or rollout issues: confirm health checks and readiness probes; ensure the database (e.g. Postgres) and storage are available; review the Disaster Recovery section.
  • Monitoring or alerting gaps: validate metrics endpoints and alert rules; ensure Control Plane and bouncer health are monitored.
  • Secrets or config not available at runtime: check the secret manager or vault integration and injection (e.g. Kubernetes secrets, environment variables).

For more, see the Troubleshooting Guide.

📞 Support and Documentation

Getting Help

Contact Information