Comprehensive Troubleshooting Guide
This guide helps you diagnose and resolve common issues with Control Core deployments across all components and deployment models.
Developer Portal quick checks (/devdocs)
If developers cannot onboard via API/SDK, verify the Control Plane Developer Portal first:
# Replace with your Control Plane host
curl -I https://<control-plane-host>/devdocs
curl -s https://<control-plane-host>/openapi.json | jq '.info.title, .info.version, .info["x-control-core-contract"]'
curl -s https://<control-plane-host>/health/ready | jq .
Expected outcomes:
- /devdocs returns 200 and shows the "Control Core - Developer" page
- /openapi.json contains x-control-core-contract metadata (api_version, schema_version, deprecation_policy)
- /health/ready returns status ready before handing portal access to developers
If /devdocs returns your frontend app instead of Swagger, fix reverse proxy routing so these paths forward to control-plane-api: /devdocs, /openapi.json, /health/*, /developer-portal/*.
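A quick way to confirm the routing fix is to check the shape of the response: the API's /openapi.json is JSON, while a misrouted frontend returns HTML. A minimal sketch (the host is a placeholder):

```shell
# Sketch: detect frontend-vs-API misrouting by response shape.
CP_HOST="${CP_HOST:-control-plane.example.com}"   # placeholder host

# The API's /openapi.json starts with '{'; a misrouted frontend returns HTML.
looks_like_json() {
  trimmed=$(printf '%s' "$1" | sed 's/^[[:space:]]*//')
  case "$trimmed" in
    \{*) return 0 ;;
    *)   return 1 ;;
  esac
}

body=$(curl -s --max-time 5 "https://$CP_HOST/openapi.json")
if looks_like_json "$body"; then
  echo "/openapi.json routed to the API"
else
  echo "/openapi.json misrouted - check reverse proxy rules"
fi
```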
Quick Diagnostic Tools
Health Check Commands
Run these commands first for quick diagnostics:
# Check all services (Docker Compose)
docker-compose ps
docker-compose logs --tail=50
# Check all pods (Kubernetes)
kubectl get pods -n control-core
kubectl get svc -n control-core
# Health endpoints
curl http://localhost:3000/api/health # Console
curl http://localhost:8082/api/v1/health # API
curl http://localhost:8080/health # Bouncer
curl http://localhost:7000/health # Policy Bridge
# Quick status check script
for service in console:3000 api:8082 bouncer:8080 policy-bridge:7000; do
  name=$(echo $service | cut -d: -f1)
  port=$(echo $service | cut -d: -f2)
  if curl -sf http://localhost:$port/health > /dev/null 2>&1; then
    echo "✅ $name is healthy"
  else
    echo "❌ $name is down"
  fi
done
Service-Specific Issues
Policy Administration Console Issues
Symptom: Console won't load / White screen
Solutions:
# Check if service is running
docker-compose ps | grep console
kubectl get pods -n control-core -l app=controlcore-console
# Check logs for errors
docker-compose logs console | grep -i error
kubectl logs -n control-core -l app=controlcore-console --tail=100
# Common causes:
# 1. API connection failure - check NEXT_PUBLIC_API_URL
# 2. Build errors - check for JavaScript errors in logs
# 3. Port conflict - ensure port 3000 is available
Symptom: Cannot log in
Solutions:
# Verify API is accessible
curl http://localhost:8082/api/v1/health
# Check authentication configuration
docker-compose logs console | grep -i auth
docker-compose logs api | grep -i auth
# Reset admin password (if needed)
docker-compose exec api python reset_admin_password.py
# For Kubernetes
kubectl exec -it -n control-core deployment/controlcore-api -- python reset_admin_password.py
Symptom: Policies not saving
Solutions:
# Check database connection
docker-compose logs api | grep -i database
# Check database status
docker-compose exec db psql -U controlcore -d control_core_db -c "SELECT 1"
# Check disk space
df -h
docker system df
# Check for validation errors in browser console (F12)
Policy Administration API Issues
Symptom: API returns 500 errors
Solutions:
# Check detailed error logs
docker-compose logs api --tail=100 | grep -A 10 "ERROR"
# Check database connection pool
docker-compose logs api | grep -i "connection pool"
# Check for database migrations
docker-compose exec api alembic current
docker-compose exec api alembic upgrade head
# Restart API service
docker-compose restart api
Symptom: Slow API responses
Solutions:
# Check database query performance
docker-compose exec db psql -U controlcore -d control_core_db
# Run in psql:
SELECT
query,
calls,
mean_exec_time,
max_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
# Check for missing indexes
SELECT
schemaname,
tablename,
attname,
n_distinct,
correlation
FROM pg_stats
WHERE schemaname = 'public'
ORDER BY n_distinct DESC;
# Vacuum and analyze
VACUUM ANALYZE;
# Check Redis cache hit rate
docker-compose exec redis redis-cli INFO stats | grep -E 'keyspace_(hits|misses)'
# Increase API workers (in .env)
API_WORKERS=8 # Increase from 4
Symptom: Authentication failures
Solutions:
# Check JWT configuration
echo $JWT_SECRET_KEY
echo $JWT_ALGORITHM
# Verify Auth0 configuration (if using Auth0)
echo $AUTH0_DOMAIN
echo $AUTH0_CLIENT_ID
# Check token expiration settings
grep JWT_ACCESS_TOKEN_EXPIRE_MINUTES .env
# View recent authentication errors
docker-compose logs api | grep -i "authentication failed"
Policy Enforcement Point (Bouncer) Issues
Symptom: Bouncer not forwarding requests
Solutions:
# Check target host configuration
echo $TARGET_HOST
# Test target connectivity from bouncer
docker-compose exec bouncer ping <target-host>
docker-compose exec bouncer curl http://<target-host>:<port>/health
# Check bouncer logs
docker-compose logs bouncer | grep -i "target"
# Verify proxy configuration
curl -v http://localhost:8080/health
Symptom: Policy not being enforced
Solutions:
# Check if policy is loaded
curl http://localhost:8080/api/v1/policies
# Check policy synchronization
curl http://localhost:8080/api/v1/policy-bridge/status
# Force policy sync
curl -X POST http://localhost:8080/api/v1/policy-bridge/sync
# Check OPA data
curl http://localhost:8080/api/v1/opa/data | jq .
# Clear policy cache
curl -X POST http://localhost:8080/api/v1/policies/cache/clear
Symptom: High latency / Slow policy evaluation
Solutions:
# Check cache statistics
curl http://localhost:8080/api/v1/policies/cache
# Check metrics
curl http://localhost:8080/api/v1/metrics
# Increase cache TTL (in .env)
POLICY_CACHE_TTL=10m # Increase from 5m
DECISION_CACHE_TTL=5m # Increase from 1m
# Increase cache size
CACHE_MAX_SIZE=50000 # Increase from 10000
# Add more worker threads
WORKER_THREADS=16 # Increase from 8
# Check for complex policies
# Review policy evaluation times in logs
docker-compose logs bouncer | grep "evaluation_time"
Symptom: Memory leak in Bouncer
Solutions:
# Monitor memory over time
watch -n 5 'docker stats bouncer --no-stream'
# Check cache size
curl http://localhost:8080/api/v1/policies/cache | jq '.cache_size'
# Enable cache eviction
# In .env:
CACHE_EVICTION_POLICY=lru
CACHE_MAX_SIZE=50000
# Restart bouncer
docker-compose restart bouncer
Policy Bridge Synchronization Issues
Symptom: Policies not syncing to Bouncer
Solutions:
# Check Policy Bridge status
curl http://localhost:7000/health
# Check Policy Bridge client connection from bouncer
docker-compose logs bouncer | grep -i "policy bridge"
# Check Policy Bridge broadcast configuration
docker-compose logs policy-bridge | grep -i broadcast
# Check policy data sources
curl http://localhost:7000/data/config
# Manually trigger sync
curl -X POST http://localhost:7000/policy/refresh
# Check WebSocket connection
docker-compose logs bouncer | grep -i websocket
Symptom: /settings/peps shows Policies Loaded: 0
Solutions:
# 1. Verify policies are enabled for current environment
curl -H "Authorization: Bearer <token>" \
  "http://localhost:8082/api/policies?status=enabled&environment=sandbox"
# 2. Verify policy->resource and policy->bouncer bindings
# Policies are linked to a resource (and bouncer) on Policy Builder page 1; Policy Bridge syncs each policy only to bouncers for that resource. Ensure the policy's resource_id matches a resource linked to the bouncer you expect.
docker-compose exec api python - <<'PY'
from app.database import SessionLocal
from app.models import Policy
db = SessionLocal()
rows = db.query(Policy).filter(Policy.status=="enabled").all()
print([(p.id, p.name, p.resource_id, p.bouncer_id, p.environment) for p in rows[:20]])
db.close()
PY
# 3. Check bouncer sync status endpoint
curl -H "Authorization: Bearer <token>" \
http://localhost:8082/api/peps/<pep_id>/sync-status
If policies are enabled but count is still zero, re-open and save the policy in the builder to trigger self-healing bindings and sync metadata refresh.
Symptom: Last Sync shows negative time
Solutions:
# 1. Ensure host and container clocks are synchronized (use your API and Bouncer service names)
date -u
docker-compose exec api date -u
docker-compose exec control-core-bouncer-sb-<resourcename> date -u
# 2. Check timestamp format returned by API
curl -H "Authorization: Bearer <token>" \
http://localhost:8082/api/peps/<pep_id>/sync-status | jq '.last_sync_time'
Use UTC timestamps consistently and avoid local-time assumptions in custom integrations.
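To rule out clock skew explicitly, compute the sync age in UTC on the host. A minimal sketch assuming GNU date and an ISO-8601 last_sync_time (the timestamp below is a placeholder):

```shell
# Sketch: compute how old a last_sync_time value is, entirely in UTC.
# Assumes GNU date (Linux); on macOS install coreutils and use gdate.
sync_age_seconds() {
  last_ts=$(date -u -d "$1" +%s)   # parse the API's ISO-8601 timestamp
  now_ts=$(date -u +%s)            # current time, also in UTC
  echo $(( now_ts - last_ts ))
}

age=$(sync_age_seconds "2024-01-01T00:00:00Z")   # placeholder timestamp
if [ "$age" -lt 0 ]; then
  echo "negative sync age (${age}s) - check for clock skew"
else
  echo "last sync ${age}s ago"
fi
```

A negative result here means the API's clock is ahead of the host's, which is exactly what produces "Last Sync shows negative time".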
Symptom: Resource Enrich action returns 404
Solutions:
# Correct API routes
GET /api/resources/{id}
PUT /api/resources/{id}/enrich
If your reverse proxy rewrites paths, confirm it does not prepend a second /api segment (for example /api/api/...).
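The duplicated segment usually comes from a proxy_pass that carries a URI part. A hypothetical nginx fragment (upstream name and port are placeholders):

```nginx
# Correct: no URI part on proxy_pass, so /api/... is forwarded as-is.
location /api/ {
    proxy_pass http://control-plane-api:8082;
    proxy_set_header Host $host;
}

# Broken: the URI part replaces the matched prefix, so a client request to
# /api/resources/1 reaches the backend as /api/api/resources/1.
# location / {
#     proxy_pass http://control-plane-api:8082/api/;
# }
```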
Symptom: Policy Bridge disconnecting frequently
Solutions:
# Check network stability to the Policy Bridge host
ping -c 100 <policy-bridge-host>
# Increase reconnection settings (policy bridge client)
POLICY_SYNC_CLIENT_RECONNECT_INTERVAL=5s
POLICY_SYNC_MAX_RECONNECT_ATTEMPTS=10
# Check for network policies blocking connections (K8s)
kubectl get networkpolicies -n control-core
# Review Policy Bridge logs for connection errors
docker-compose logs policy-bridge | grep -i "connection"
Health Check Failures
Control Plane API Not Responding
Symptoms:
- Health check shows "Cannot connect to API endpoint"
- Control Plane dashboard is not accessible
Solutions:
1. Check if the service is running:
   # For Docker Compose
   docker-compose ps
   # For Kubernetes
   kubectl get pods -n control-core
2. Check service logs:
   # For Docker Compose
   docker-compose logs control-plane
   # For Kubernetes
   kubectl logs -n control-core deployment/control-plane
3. Verify port configuration:
   - Ensure port 8000 is not blocked by a firewall
   - Check if another service is using port 8000
   - Verify the service is binding to 0.0.0.0:8000
4. Check environment variables:
   # Verify required environment variables are set
   echo $DATABASE_URL
   echo $REDIS_URL
   echo $SECRET_KEY
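The port-binding check can be scripted. A sketch using ss, with port 8000 assumed and a small classifier for the reported bind address:

```shell
# Sketch: confirm the API is listening and bound to all interfaces,
# not just loopback. Classify the bind address reported by ss.
classify_bind() {
  case "$1" in
    "")                      echo "down" ;;      # nothing listening
    127.0.0.1:*|\[::1\]:*)   echo "loopback" ;;  # unreachable from outside
    *)                       echo "ok" ;;
  esac
}

PORT="${PORT:-8000}"
addr=$(ss -ltn 2>/dev/null | awk -v p=":$PORT\$" '$4 ~ p {print $4; exit}')
case "$(classify_bind "$addr")" in
  down)     echo "nothing listening on port $PORT" ;;
  loopback) echo "bound to loopback only - bind the service to 0.0.0.0:$PORT" ;;
  ok)       echo "listening on $addr" ;;
esac
```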
Database Connection Issues
Symptoms:
- Health check shows "Cannot connect to database"
- API returns 500 errors for database operations
Solutions:
1. Verify database is running:
   # For Docker Compose
   docker-compose ps postgres
   # For Kubernetes
   kubectl get pods -n control-core -l app=postgres
2. Check database connectivity:
   # Test connection
   docker exec -it control-core-postgres-1 psql -U controlcore -d controlcore
3. Verify database URL:
   - Format: postgresql://username:password@host:port/database
   - Ensure credentials match the database configuration
   - Check that the database exists and the user has permissions
4. Check database logs:
   docker-compose logs postgres
Policy Bridge Connection Problems
Symptoms:
- Health check shows "Cannot connect to policy synchronization service"
- Policies are not synchronizing
Solutions:
1. Verify the Policy Bridge service:
   # Check policy bridge pod status
   kubectl get pods -n control-core -l app=policy-bridge
2. Check policy bridge configuration:
   - Verify POLICY_SYNC_URL is correct
   - Ensure the service can reach the Control Plane
   - Check policy repository access
3. Test policy bridge connectivity:
   curl http://localhost:7000/health
Bouncer Connection Issues
Symptoms:
- Health check shows "Cannot connect to bouncer monitoring"
- No bouncers appear in the dashboard
Solutions:
1. Verify bouncer deployment:
   # Check bouncer pods
   kubectl get pods -l app=control-core-bouncer
2. Check bouncer configuration:
   - Verify CONTROL_PLANE_URL points to the correct Control Plane
   - Ensure the bouncer can reach the Control Plane on port 8000
   - Check network policies and firewall rules
3. Test bouncer connectivity:
   # From the bouncer pod
   curl http://control-plane:8000/health
Redis Connection Problems
Symptoms:
- Health check shows "Cannot connect to Redis"
- Session management issues
Solutions:
1. Verify Redis service:
   # Check Redis pod
   kubectl get pods -l app=redis
2. Test Redis connectivity:
   # From the Control Plane pod
   redis-cli -h redis ping
3. Check Redis configuration:
   - Verify the REDIS_URL format: redis://host:port
   - Ensure Redis is accessible from the Control Plane
Common Deployment Issues
Helm Chart Installation Failures
Symptoms:
- helm install command fails
- Pods stuck in Pending or CrashLoopBackOff
Solutions:
1. Check resource requirements:
   # Verify the cluster has enough resources
   kubectl describe nodes
2. Check persistent volume claims:
   # Verify the storage class exists
   kubectl get storageclass
3. Review values.yaml:
   - Ensure all required values are set
   - Check for typos in configuration
   - Verify image tags are correct
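The values.yaml review can be partially automated before running helm install. A sketch with a placeholder chart path (./control-core) and illustrative required keys; the grep-based check is crude and only matches leaf key names:

```shell
# Sketch: sanity-check values.yaml, then lint and dry-run the chart.
check_values() {
  file="$1"; keys="$2"; missing=0
  for key in $keys; do
    leaf="${key##*.}"                     # crude: grep for the leaf key only
    if ! grep -Eq "^[[:space:]]*${leaf}:" "$file"; then
      echo "missing value: $key"; missing=1
    fi
  done
  return $missing
}

if [ -f values.yaml ]; then
  check_values values.yaml "image.tag database.host"
  helm lint ./control-core -f values.yaml
  helm template control-core ./control-core -f values.yaml \
    | kubectl apply --dry-run=client -f - >/dev/null && echo "rendered manifests are valid"
fi
```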
Docker Compose Issues
Symptoms:
- Services fail to start
- Port conflicts
Solutions:
1. Check port availability:
   # Check if ports are in use
   netstat -tulpn | grep :8000
2. Verify Docker resources:
   # Check Docker system info
   docker system df
   docker system prune
3. Check environment file:
   - Ensure the .env file exists and is properly formatted
   - Verify all required variables are set
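The .env check can be scripted to fail fast before docker-compose up. A sketch; the required variable list is illustrative and should match your deployment:

```shell
# Sketch: report required .env variables that are missing or empty.
check_env_file() {
  file="$1"; rc=0
  for var in $2; do
    if ! grep -Eq "^${var}=.+" "$file"; then
      echo "missing or empty: $var"; rc=1
    fi
  done
  return $rc
}

if [ -f .env ]; then
  check_env_file .env "DATABASE_URL REDIS_URL SECRET_KEY" && echo ".env looks complete"
fi
```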
Performance Issues
Slow API Responses
Symptom: API responds slowly (>1 second)
Solutions:
# 1. Check database performance
docker-compose exec db psql -U controlcore -d control_core_db
# Run performance analysis
SELECT
query,
calls,
mean_exec_time,
max_exec_time,
stddev_exec_time
FROM pg_stat_statements
WHERE mean_exec_time > 100 -- over 100 ms
ORDER BY mean_exec_time DESC
LIMIT 20;
# Check for missing indexes
SELECT
schemaname,
tablename,
indexname,
idx_scan,
idx_tup_read,
idx_tup_fetch
FROM pg_stat_user_indexes
WHERE idx_scan = 0 -- unused indexes
ORDER BY schemaname, tablename;
# 2. Check connection pool exhaustion
docker-compose logs api | grep -i "connection pool"
# Increase pool size in .env
DATABASE_POOL_SIZE=100 # Increase from 50
DATABASE_MAX_OVERFLOW=50 # Increase from 20
# 3. Check Redis performance
docker-compose exec redis redis-cli --latency
# Check slow commands
docker-compose exec redis redis-cli SLOWLOG GET 10
# 4. Enable query caching
# In API configuration
QUERY_CACHE_ENABLED=true
QUERY_CACHE_TTL=300
# 5. Add database read replicas (for high load)
# Update DATABASE_URL to use read replica for SELECT queries
DATABASE_READ_REPLICA_URL=postgresql://...
Symptom: Policy evaluation taking >100ms
Solutions:
# Check policy complexity
curl http://localhost:8080/api/v1/policies | jq '.policies[].evaluation_avg_ms'
# Optimize policies:
# 1. Reduce nested iterations
# 2. Use early returns
# 3. Avoid expensive computations
# 4. Use helper functions
# Increase policy cache
POLICY_CACHE_TTL=15m
POLICY_CACHE_MAX_SIZE=100000
# Enable decision caching
DECISION_CACHE_ENABLED=true
DECISION_CACHE_TTL=5m
DECISION_CACHE_MAX_SIZE=500000
High Memory Usage
Symptom: Service using excessive memory
Solutions:
# 1. Identify memory hog
docker stats
kubectl top pods -n control-core
# 2. Check for memory leaks
# Monitor over time
watch -n 10 'docker stats --no-stream'
# 3. Check cache sizes
curl http://localhost:8080/api/v1/policies/cache | jq '.memory_usage'
# 4. Reduce cache sizes (if too large)
CACHE_MAX_SIZE=25000 # Reduce
DECISION_CACHE_MAX_SIZE=100000 # Reduce
# 5. Enable cache eviction
CACHE_EVICTION_POLICY=lru
CACHE_CLEANUP_INTERVAL=5m
# 6. Adjust resource limits
# Kubernetes: Update values.yaml
resources:
requests:
memory: "2Gi"
limits:
memory: "4Gi"
# 7. Check for large data structures
docker-compose logs api | grep -i "large"
# 8. Restart service to clear memory
docker-compose restart api
kubectl rollout restart deployment/controlcore-api -n control-core
High CPU Usage
Symptom: Constant high CPU (>80%)
Solutions:
# 1. Check what's causing CPU load
docker stats
kubectl top pods -n control-core
# 2. Profile the application
# Enable profiling in .env
PROFILING_ENABLED=true
PROFILING_PORT=6060
# Access profiling endpoint
curl http://localhost:6060/debug/pprof/profile?seconds=30 > cpu.prof
# 3. Check for infinite loops in policies
# Review policy evaluation counts
curl http://localhost:8080/api/v1/metrics | grep policy_eval
# 4. Check database CPU
docker-compose exec db pg_top
# 5. Scale horizontally instead of vertically
# Add more replicas
kubectl scale deployment controlcore-api --replicas=10 -n control-core
# 6. Optimize policies
# - Avoid complex iterations
# - Use indexed data lookups
# - Cache intermediate results
Scaling Issues
Auto-Scaling Not Triggering
Symptom: HPA not scaling pods despite high load
Solutions:
# 1. Check HPA status
kubectl get hpa -n control-core
kubectl describe hpa controlcore-api -n control-core
# 2. Verify metrics server is running
kubectl get deployment metrics-server -n kube-system
# 3. Check if metrics are available
kubectl top pods -n control-core
# 4. Review HPA conditions
kubectl describe hpa controlcore-api -n control-core | grep -A 10 Conditions
# 5. Check resource requests are set
kubectl get pods -n control-core -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources}{"\n"}{end}'
# 6. Lower scaling thresholds if needed
kubectl patch hpa controlcore-api -n control-core -p '{"spec":{"targetCPUUtilizationPercentage":60}}'
# 7. Check custom metrics (if using)
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .
Symptom: Pods stuck in Pending state
Solutions:
# 1. Check why pods are pending
kubectl describe pod <pod-name> -n control-core
# Common reasons:
# - Insufficient CPU/memory
# - No nodes match selector
# - PVC not bound
# 2. Check node resources
kubectl top nodes
kubectl describe nodes | grep -A 5 "Allocated resources"
# 3. Check PVC status
kubectl get pvc -n control-core
# 4. Scale cluster (add nodes)
# AWS EKS
eksctl scale nodegroup --cluster=controlcore --name=standard-workers --nodes=10
# Azure AKS
az aks scale --resource-group controlcore --name controlcore-cluster --node-count 10
# 5. Adjust resource requests (if too high)
# Edit deployment
kubectl edit deployment controlcore-api -n control-core
# Reduce requests.memory and requests.cpu
Load Balancer Issues
Symptom: Load balancer health checks failing
Solutions:
# 1. Check health endpoint
curl -v http://pod-ip:8080/health
# 2. Verify health check configuration
kubectl get svc controlcore-bouncer -n control-core -o yaml | grep -A 10 health
# 3. Check if pods are ready
kubectl get pods -n control-core -l app=controlcore-bouncer
# 4. Review pod logs for health check errors
kubectl logs -n control-core -l app=controlcore-bouncer | grep health
# 5. Adjust health check thresholds
# For AWS ALB
service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "3"
service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "30"
# 6. Check network policies
kubectl get networkpolicies -n control-core
Symptom: Uneven load distribution
Solutions:
# 1. Check pod distribution
kubectl get pods -n control-core -l app=controlcore-bouncer -o wide
# 2. Verify load balancing algorithm
kubectl get svc controlcore-bouncer -n control-core -o yaml | grep sessionAffinity
# 3. Check if using IP hash (might cause imbalance)
# Consider changing to least_conn or round-robin
# 4. Review topology spread constraints
kubectl get deployment controlcore-bouncer -n control-core -o yaml | grep -A 10 topologySpread
# 5. Check for pod scheduling anti-patterns
kubectl describe pod -n control-core -l app=controlcore-bouncer | grep -A 5 "Topology Spread"
Database Issues
Connection Pool Exhaustion
Symptom: "Too many connections" errors
Solutions:
# 1. Check current connections
docker-compose exec db psql -U controlcore -d control_core_db -c \
"SELECT count(*) FROM pg_stat_activity;"
# 2. Find idle connections
docker-compose exec db psql -U controlcore -d control_core_db -c \
"SELECT pid, usename, state, state_change
FROM pg_stat_activity
WHERE state = 'idle'
ORDER BY state_change;"
# 3. Kill idle connections (if needed)
docker-compose exec db psql -U controlcore -d control_core_db -c \
"SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < now() - interval '1 hour';"
# 4. Increase max connections (PostgreSQL)
docker-compose exec db psql -U controlcore -d control_core_db -c \
"ALTER SYSTEM SET max_connections = 200;" # From 100
# Restart database
docker-compose restart db
# 5. Optimize connection pool (in .env)
DATABASE_POOL_SIZE=50 # Per API pod
DATABASE_POOL_TIMEOUT=30
DATABASE_POOL_RECYCLE=3600 # Recycle connections hourly
Database Performance Degradation
Symptom: Queries getting slower over time
Solutions:
# 1. Check table bloat
docker-compose exec db psql -U controlcore -d control_core_db -c \
"SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size,
n_dead_tup,
n_live_tup,
ROUND(n_dead_tup * 100.0 / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_ratio
FROM pg_stat_user_tables
ORDER BY dead_ratio DESC;"
# 2. Run VACUUM ANALYZE
docker-compose exec db psql -U controlcore -d control_core_db -c "VACUUM ANALYZE;"
# 3. Enable autovacuum (if not enabled)
docker-compose exec db psql -U controlcore -d control_core_db -c \
"ALTER SYSTEM SET autovacuum = on;"
# 4. Add missing indexes
# Analyze query patterns first
docker-compose exec db psql -U controlcore -d control_core_db -c \
"SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
WHERE mean_exec_time > 100
ORDER BY total_exec_time DESC
LIMIT 10;"
# 5. Check for lock contention
docker-compose exec db psql -U controlcore -d control_core_db -c \
"SELECT
locktype,
relation::regclass,
mode,
granted
FROM pg_locks
WHERE NOT granted;"
Kubernetes-Specific Issues
Pod CrashLoopBackOff
Symptom: Pods restarting repeatedly
Solutions:
# 1. Check pod logs
kubectl logs -n control-core <pod-name> --previous
# 2. Describe pod for events
kubectl describe pod <pod-name> -n control-core
# 3. Common causes:
# - Missing environment variables
# - Database connection failure
# - Insufficient memory (OOMKilled)
# - Failed health checks
# 4. Check resource limits
kubectl get pod <pod-name> -n control-core -o yaml | grep -A 10 resources
# 5. Check liveness/readiness probes
kubectl get pod <pod-name> -n control-core -o yaml | grep -A 10 livenessProbe
# 6. Temporarily disable probes to debug
kubectl patch deployment controlcore-api -n control-core -p '{"spec":{"template":{"spec":{"containers":[{"name":"api","livenessProbe":null}]}}}}'
ImagePullBackOff
Symptom: Cannot pull container images
Solutions:
# 1. Check image pull errors
kubectl describe pod <pod-name> -n control-core
# 2. Verify image exists
docker pull controlcore/api:2.0.0
# 3. Check image pull secrets
kubectl get secrets -n control-core | grep registry
# 4. Create image pull secret (if missing)
kubectl create secret docker-registry controlcore-registry \
--docker-server=controlcore.io \
--docker-username=<username> \
--docker-password=<password> \
--namespace=control-core
# 5. Verify secret in deployment
kubectl get deployment controlcore-api -n control-core -o yaml | grep imagePullSecrets
PVC Not Binding
Symptom: PersistentVolumeClaim stuck in Pending
Solutions:
# 1. Check PVC status
kubectl get pvc -n control-core
kubectl describe pvc <pvc-name> -n control-core
# 2. Check if StorageClass exists
kubectl get storageclass
# 3. Check available PVs
kubectl get pv
# 4. Provision storage manually (if needed)
# Or create StorageClass
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: fast-ssd
provisioner: kubernetes.io/aws-ebs
parameters:
type: gp3
iops: "3000"
volumeBindingMode: WaitForFirstConsumer
EOF
Cluster Scaling Issues
Cluster Autoscaler Not Scaling
Symptom: Nodes not being added despite pending pods
Solutions:
# 1. Check Cluster Autoscaler logs
kubectl logs -n kube-system deployment/cluster-autoscaler
# 2. Check autoscaler status
kubectl describe configmap cluster-autoscaler-status -n kube-system
# 3. Verify node groups are tagged correctly
# AWS: Check ASG tags
aws autoscaling describe-auto-scaling-groups \
--query 'AutoScalingGroups[?Tags[?Key==`k8s.io/cluster-autoscaler/enabled`]]'
# 4. Check scaling limits
kubectl get deployment cluster-autoscaler -n kube-system -o yaml | grep -A 5 "command"
# 5. Manually scale if urgent
# AWS
aws autoscaling set-desired-capacity \
--auto-scaling-group-name controlcore-nodes \
--desired-capacity 10
# Azure
az aks nodepool scale \
--resource-group controlcore \
--cluster-name controlcore-cluster \
--name standard \
--node-count 10
HPA Not Scaling Pods
Symptom: HPA not creating more pods
Solutions:
# 1. Check HPA status and conditions
kubectl get hpa -n control-core
kubectl describe hpa controlcore-api -n control-core
# 2. Verify metrics are available
kubectl top pods -n control-core
# 3. Check if at maxReplicas
kubectl get hpa controlcore-api -n control-core -o yaml | grep -E "maxReplicas|currentReplicas"
# 4. Review scaling behavior
kubectl get hpa controlcore-api -n control-core -o yaml | grep -A 20 behavior
# 5. Check custom metrics (if configured)
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .
# 6. Temporarily adjust thresholds
kubectl patch hpa controlcore-api -n control-core --type='json' -p='[{"op": "replace", "path": "/spec/targetCPUUtilizationPercentage", "value":50}]'
# 7. Check for pod disruption budgets blocking scale down
kubectl get pdb -n control-core
Policy Issues
Policy Syntax Errors
Symptom: Policy fails validation
Solutions:
# 1. Check error message in Console UI
# Look for line number and description
# 2. Use OPA CLI to validate locally
opa check policy.rego
# 3. Use OPA playground
# Copy policy to https://play.openpolicyagent.org/
# 4. Common syntax errors:
# - Missing 'import rego.v1'
# - Incorrect indentation
# - Missing braces
# - Typos in keywords (if, every, some)
# - Missing 'default' for default values
# 5. Use policy linter
opa fmt -w policy.rego # Auto-format
opa check --strict policy.rego # Strict checking
Policy Not Applying
Symptom: Policy deployed but not enforcing
Solutions:
# 1. Check deployment status
# In Console: Policies → <your-policy> → Status
# 2. Verify policy is in correct environment
# Sandbox vs Production
# 3. Check Bouncer has received policy
curl http://localhost:8080/api/v1/policies | jq '.policies[] | select(.name=="<policy-name>")'
# 4. Check policy sync status
curl http://localhost:8080/api/v1/policy-bridge/status
# 5. Force policy sync
curl -X POST http://localhost:7000/policy/refresh
curl -X POST http://localhost:8080/api/v1/policy-bridge/sync
# 6. Clear caches
curl -X POST http://localhost:8080/api/v1/policies/cache/clear
# 7. Check for policy conflicts (multiple policies)
# Review policy priority and evaluation order
Unexpected Policy Decisions
Symptom: Policy allowing/denying incorrectly
Solutions:
# 1. Check decision logs
# In Console: Monitoring → Decisions → Filter by policy
# 2. Review actual input data
# Decision logs show the exact input used
# 3. Test policy with actual data
# In Console: Policy → Test → Use decision log input
# 4. Add debugging to policy
package controlcore.policy
import rego.v1
default allow := false
# Debug output
debug_info := {
"user_roles": input.user.roles,
"resource_type": input.resource.type,
"action": input.action.name
}
allow if {
# Your conditions
}
# 5. Use OPA eval for testing
opa eval --data policy.rego --input input.json 'data.controlcore.policy.allow'
# 6. Check for data availability issues
# Ensure PIP data is synchronized
curl http://localhost:8080/api/v1/opa/data | jq '.policy_data'
Authentication & Authorization Issues
SAML SSO Not Working
Symptom: SAML login fails or redirects incorrectly
Solutions:
# 1. Check SAML configuration
echo $SAML_IDP_SSO_URL
echo $SAML_SP_ENTITY_ID
# 2. Verify SAML certificates
openssl x509 -in idp-cert.pem -text -noout
# 3. Check SAML response in browser DevTools
# Network tab → Look for SAMLResponse parameter
# 4. Validate SAML metadata
curl https://idp.yourcompany.com/metadata.xml
# 5. Check attribute mapping
docker-compose logs api | grep -i "saml attribute"
# 6. Enable SAML debug logging
SAML_DEBUG=true
docker-compose restart api
# 7. Test with SAML tracer browser extension
# Chrome/Firefox: SAML-tracer
# 8. Common issues:
# - Clock skew (sync server time with NTP)
# - Expired certificates
# - Incorrect ACS URL
# - Missing required attributes
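Two of the common issues above, clock skew and expired certificates, are easy to check from the command line. A sketch; idp-cert.pem is a placeholder path for your IdP signing certificate:

```shell
# Sketch: check for an expiring IdP certificate and for clock skew.
cert_valid_for_days() {
  # openssl -checkend exits 0 if the cert is still valid N days from now
  openssl x509 -in "$1" -noout -checkend $(( $2 * 86400 ))
}

if [ -f idp-cert.pem ]; then
  if cert_valid_for_days idp-cert.pem 30; then
    echo "IdP certificate valid for at least 30 more days"
  else
    echo "IdP certificate expires within 30 days - rotate it"
  fi
fi

# Clock skew: print local UTC time and NTP sync state (output varies by distro).
date -u
timedatectl show -p NTPSynchronized 2>/dev/null || true
```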
OAuth/Auth0 Issues
Symptom: OAuth login fails
Solutions:
# 1. Check OAuth configuration
echo $AUTH0_DOMAIN
echo $AUTH0_CLIENT_ID
echo $AUTH0_CALLBACK_URL
# 2. Verify callback URL is registered
# In Auth0 Dashboard: Applications → Settings → Allowed Callback URLs
# 3. Check for token expiration
docker-compose logs api | grep -i "token expired"
# 4. Test OAuth flow manually
curl "https://<auth0-domain>/authorize?client_id=<client-id>&response_type=code&redirect_uri=<callback-url>&scope=openid%20profile%20email"
# 5. Check network connectivity to Auth0
curl -v https://<auth0-domain>/.well-known/openid-configuration
# 6. Enable OAuth debug logging
AUTH0_DEBUG=true
Network & Connectivity Issues
DNS Resolution Failures
Symptom: Services cannot resolve hostnames
Solutions:
# 1. Test DNS resolution
nslookup api.controlcore.yourcompany.com
dig api.controlcore.yourcompany.com
# 2. Check Kubernetes DNS
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup controlcore-api.control-core.svc.cluster.local
# 3. Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
# 4. Verify DNS configuration
kubectl get configmap coredns -n kube-system -o yaml
# 5. Restart CoreDNS
kubectl rollout restart deployment/coredns -n kube-system
Network Policy Blocking Traffic
Symptom: Services cannot communicate
Solutions:
# 1. Check network policies
kubectl get networkpolicies -n control-core
# 2. Describe network policy
kubectl describe networkpolicy <policy-name> -n control-core
# 3. Temporarily disable for testing
kubectl delete networkpolicy <policy-name> -n control-core
# 4. Test connectivity
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- bash
# Inside pod:
curl http://controlcore-api:8082/health
nc -zv controlcore-api 8082
# 5. Add necessary rules
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-api-to-db
namespace: control-core
spec:
podSelector:
matchLabels:
app: controlcore-api
egress:
- to:
- podSelector:
matchLabels:
app: postgresql
ports:
- protocol: TCP
port: 5432
EOF
Certificate & SSL Issues
Certificate Expired
Symptom: SSL errors, certificate validation failures
Solutions:
# 1. Check certificate expiration
echo | openssl s_client -connect api.controlcore.yourcompany.com:443 2>/dev/null | \
openssl x509 -noout -dates
# 2. Check cert-manager certificates
kubectl get certificates -n control-core
kubectl describe certificate controlcore-tls -n control-core
# 3. Force certificate renewal
kubectl delete certificate controlcore-tls -n control-core
# cert-manager will automatically recreate
# 4. Check Let's Encrypt rate limits
kubectl logs -n cert-manager deployment/cert-manager | grep -i "rate limit"
# 5. Use staging for testing
kubectl patch clusterissuer letsencrypt-prod -p '{"spec":{"acme":{"server":"https://acme-staging-v02.api.letsencrypt.org/directory"}}}'
mTLS Issues
Symptom: Service-to-service mTLS failing
Solutions:
# 1. Check if Istio/service mesh is configured correctly
kubectl get peerauthentication -n control-core
# 2. Verify certificates
kubectl get secret -n control-core | grep tls
# 3. Check Istio sidecar injection
kubectl get pod <pod-name> -n control-core -o jsonpath='{.spec.containers[*].name}'
# Should show both app container and istio-proxy
# 4. Review mTLS mode
kubectl get destinationrule -n control-core -o yaml
# 5. Disable mTLS temporarily for debugging
kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: control-core
spec:
mtls:
mode: PERMISSIVE # Allow both mTLS and plain text
EOF
Policy and Control Plane Hardening
Control Core is hardened by default: policy/settings changes require the admin role, Bouncer→Control Plane traffic requires API keys (and optionally SPIRE), and the Bouncer plugin is deny-by-default (no policy, or a policy error, means deny). If you see 401/403 errors, sync/heartbeat failures, or unexpected "access denied" at the Bouncer, see the dedicated guide:
- Policy and Control Plane Hardening - What hardening protects, common effects, and step-by-step troubleshooting (API keys, SPIRE, policy existence, logs).
Quick checks:
| Symptom | What to check |
|---|---|
| 401/403 on policy or settings API | User has admin role; token valid. |
| 401/403 on Bouncer → Control Plane | Bouncer API key set and correct; if SPIRE is on, SPIFFE identity valid. |
| All requests denied through Bouncer | Policy exists, is enabled, and applies to path; check bouncer logs for Rego errors. |
| Heartbeat or registration fails | Bouncer API key and (if used) SPIRE; Control Plane logs. |
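The Bouncer-to-Control-Plane row can be spot-checked from the command line. A sketch; the health path and the X-API-Key header name are assumptions, so confirm them against your deployment's hardening guide:

```shell
# Sketch: probe the Control Plane with the bouncer's API key and classify
# the result. Header name and path are assumptions for illustration.
classify_auth() {
  case "$1" in
    200)     echo "ok" ;;
    401|403) echo "rejected" ;;
    *)       echo "unexpected" ;;
  esac
}

CP="${CP:-http://localhost:8082}"
code=$(curl -s --max-time 5 -o /dev/null -w '%{http_code}' \
  -H "X-API-Key: ${BOUNCER_API_KEY:-unset}" "$CP/api/v1/health")
case "$(classify_auth "$code")" in
  ok)         echo "Bouncer credentials accepted" ;;
  rejected)   echo "API key rejected - verify the key (and SPIFFE identity if SPIRE is on)" ;;
  unexpected) echo "unexpected status $code - check Control Plane logs" ;;
esac
```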
Getting Help
Log Collection
Before contacting support, collect the following logs:
# Control Plane logs
kubectl logs -n control-core deployment/control-plane --tail=100
# Database logs
kubectl logs -n control-core deployment/postgres --tail=100
# Policy Bridge logs
kubectl logs -n control-core deployment/policy-bridge --tail=100
# Bouncer logs
kubectl logs -l app=control-core-bouncer --tail=100
Support Resources
- Documentation: Documentation Home
- General Inquiries: info@controlcore.io
- Technical & Customer Support: support@controlcore.io
Issue Reporting
When reporting issues, include:
- Control Core version
- Deployment method (Helm, Docker Compose, etc.)
- Environment details (OS, Kubernetes version, etc.)
- Error messages and logs
- Steps to reproduce the issue
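Most of the items above can be gathered in one pass before opening a ticket. A sketch that bundles version and log output; service and namespace names mirror the log-collection commands and may need adjusting:

```shell
# Sketch: collect version, pod, and log details into a single support bundle.
out="controlcore-support-$(date -u +%Y%m%dT%H%M%SZ)"
mkdir -p "$out"
kubectl version -o yaml                  > "$out/kubectl-version.yaml" 2>&1
helm list -n control-core                > "$out/helm-releases.txt"    2>&1
kubectl get pods -n control-core -o wide > "$out/pods.txt"             2>&1
kubectl logs -n control-core deployment/control-plane --tail=200 \
                                         > "$out/control-plane.log"   2>&1
tar czf "$out.tar.gz" "$out"
echo "attach $out.tar.gz to your support request"
```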