Operational Runbooks

This guide provides step-by-step procedures for common operational tasks and incident response scenarios in ConduitLLM production environments.

Overview

These runbooks cover:

Incident response procedures
Common troubleshooting scenarios
Maintenance operations
Performance optimization
Disaster recovery

Incident Response

High Error Rate

Symptoms:

Error rate above 5% for more than 5 minutes
Multiple user complaints
Alerts from monitoring system

Investigation Steps:

Check System Health

# Check overall health
curl -s http://api.conduit.example.com/health/ready | jq

# Check provider health
curl -s http://admin.conduit.example.com/api/admin/providers/health \
  -H "X-Master-Key: $MASTER_KEY" | jq

Review Recent Logs

# Check API logs for errors
kubectl logs -n conduit -l app=conduit-api --tail=100 | grep ERROR

# Check specific provider errors
kubectl logs -n conduit -l app=conduit-api --tail=1000 | \
  grep -E "(openai|anthropic|googlecloud)" | grep -i error

Check Metrics

# Query Prometheus for error rates
curl -s "http://prometheus:9090/api/v1/query?query=rate(conduit_llm_requests_total{status='error'}[5m])" | jq

# Check provider-specific errors
curl -s "http://prometheus:9090/api/v1/query?query=conduit_provider_health" | jq

Resolution Steps:

Provider Issues

# Disable unhealthy provider
curl -X PATCH http://admin.conduit.example.com/api/admin/providers/openai/disable \
  -H "X-Master-Key: $MASTER_KEY"

# Force health check
curl -X POST http://admin.conduit.example.com/api/admin/providers/health/check \
  -H "X-Master-Key: $MASTER_KEY"

Circuit Breaker Reset

# Reset circuit breaker for provider
curl -X POST http://admin.conduit.example.com/api/admin/providers/openai/reset-circuit \
  -H "X-Master-Key: $MASTER_KEY"

Scale Up if Needed

# Increase replicas
kubectl scale deployment conduit-api -n conduit --replicas=10

# Or update HPA
kubectl patch hpa conduit-api-hpa -n conduit \
  -p '{"spec":{"minReplicas":5,"maxReplicas":30}}'

Database Connection Issues

Symptoms:

"Database connection failed" in health checks
Timeout errors in logs
Slow response times

Investigation Steps:

Check Database Status

# Check PostgreSQL pod status
kubectl get pods -n conduit -l app=postgres

# Check connection pool stats
kubectl exec -n conduit postgres-0 -- psql -U conduit -c \
  "SELECT count(*) as total, state FROM pg_stat_activity GROUP BY state;"

Review Connection Limits

# Check max connections
kubectl exec -n conduit postgres-0 -- psql -U conduit -c \
  "SHOW max_connections;"

# Check current connections
kubectl exec -n conduit postgres-0 -- psql -U conduit -c \
  "SELECT count(*) FROM pg_stat_activity;"

Resolution Steps:

Kill Idle Connections

-- Terminate idle connections older than 5 minutes
SELECT pg_terminate_backend(pid) 
FROM pg_stat_activity 
WHERE state = 'idle' 
AND state_change < NOW() - INTERVAL '5 minutes'
AND datname = 'conduit';

Restart Connection Pool

# Rolling restart of API pods
kubectl rollout restart deployment conduit-api -n conduit

# Monitor rollout
kubectl rollout status deployment conduit-api -n conduit

Increase Connection Limits

# Update PostgreSQL config
kubectl exec -n conduit postgres-0 -- psql -U postgres -c \
  "ALTER SYSTEM SET max_connections = 500;"

# Restart PostgreSQL
kubectl delete pod -n conduit postgres-0

Memory Issues

Symptoms:

OutOfMemoryError in logs
Pods being killed (OOMKilled)
Degraded performance

Investigation Steps:

Check Memory Usage

# Current memory usage
kubectl top pods -n conduit

# Historical memory usage
curl -s "http://prometheus:9090/api/v1/query_range?query=container_memory_usage_bytes{namespace='conduit'}" | jq

Identify Memory Leaks

# Check for increasing memory over time
kubectl exec -n conduit conduit-api-xxxx -- \
  curl -s http://localhost:8080/metrics | grep dotnet_gc

Resolution Steps:

Immediate Relief

# Restart affected pods
kubectl delete pod -n conduit conduit-api-xxxx

# Increase memory limits temporarily
kubectl set resources deployment conduit-api -n conduit \
  --limits=memory=4Gi --requests=memory=2Gi

Cache Cleanup

# Clear Redis cache
kubectl exec -n conduit redis-0 -- redis-cli FLUSHDB

# Clear specific cache patterns
kubectl exec -n conduit redis-0 -- redis-cli --scan --pattern "cache:*" | \
  xargs kubectl exec -n conduit redis-0 -- redis-cli DEL

Maintenance Operations

Rolling Updates

Pre-deployment Checklist:

Backup database
Check current system health
Notify users of maintenance window
Prepare rollback plan

Deployment Steps:

Create Backup

# Trigger manual backup
kubectl create job -n conduit manual-backup-$(date +%Y%m%d-%H%M%S) \
  --from=cronjob/postgres-backup

# Verify backup completed
kubectl logs -n conduit -l job-name=manual-backup-* --tail=100

Deploy New Version

# Update image
kubectl set image deployment/conduit-api -n conduit \
  api=your-registry/conduit:v1.2.3

# Monitor rollout
kubectl rollout status deployment conduit-api -n conduit

# Check pod status
kubectl get pods -n conduit -l app=conduit-api -w

Verify Deployment

# Check health
curl -s http://api.conduit.example.com/health/ready | jq

# Run smoke tests
./scripts/smoke-tests.sh

# Check metrics
curl -s http://api.conduit.example.com/metrics | grep conduit_

Rollback if Needed:

# Rollback to previous version
kubectl rollout undo deployment conduit-api -n conduit

# Or rollback to specific revision
kubectl rollout undo deployment conduit-api -n conduit --to-revision=2

Database Maintenance

Regular Maintenance Tasks:

Vacuum and Analyze

-- Run vacuum analyze
VACUUM ANALYZE;

-- Check table sizes
SELECT 
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size
FROM pg_tables 
WHERE schemaname = 'public' 
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;

Index Maintenance

-- Find unused indexes
SELECT 
  schemaname, 
  tablename, 
  indexname, 
  idx_scan
FROM pg_stat_user_indexes 
WHERE idx_scan = 0 
AND schemaname = 'public';

-- Rebuild bloated indexes
REINDEX TABLE request_logs;

Archive Old Data

-- Archive logs older than 90 days
INSERT INTO request_logs_archive 
SELECT * FROM request_logs 
WHERE timestamp < NOW() - INTERVAL '90 days';

-- Delete archived records
DELETE FROM request_logs 
WHERE timestamp < NOW() - INTERVAL '90 days';

Certificate Renewal

Check Certificate Expiration:

# Check ingress certificates
kubectl get certificate -n conduit

# Check certificate details
kubectl describe certificate conduit-tls -n conduit

# Manual check
echo | openssl s_client -servername api.conduit.example.com \
  -connect api.conduit.example.com:443 2>/dev/null | \
  openssl x509 -noout -dates

Renew Certificates:

# Force renewal with cert-manager
kubectl delete certificate conduit-tls -n conduit

# Monitor renewal
kubectl logs -n cert-manager deployment/cert-manager -f

# Verify new certificate
kubectl get certificate conduit-tls -n conduit

Performance Optimization

Slow Response Times

Investigation:

Identify Slow Queries

-- Find slow queries
SELECT 
  query,
  mean_exec_time,
  calls,
  total_exec_time
FROM pg_stat_statements 
ORDER BY mean_exec_time DESC 
LIMIT 20;

Check Cache Performance

# Redis stats
kubectl exec -n conduit redis-0 -- redis-cli INFO stats

# Cache hit rate
curl -s "http://prometheus:9090/api/v1/query?query=rate(conduit_cache_hits_total[5m])" | jq

Optimization Steps:

Database Tuning

-- Add missing indexes
CREATE INDEX CONCURRENTLY idx_request_logs_timestamp 
ON request_logs(timestamp);

-- Update statistics
ANALYZE request_logs;

Cache Warming

# Warm popular models cache
curl -X POST http://admin.conduit.example.com/api/admin/cache/warm \
  -H "X-Master-Key: $MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"models": ["gpt-4", "claude-3-opus"]}'

High Latency to Providers

Investigation:

# Check provider latencies
curl -s "http://prometheus:9090/api/v1/query?query=conduit_provider_response_time_seconds" | jq

# Test direct connectivity
curl -w "@curl-format.txt" -o /dev/null -s https://api.openai.com/v1/models

Resolution:

Enable Regional Routing

# Configure regional endpoints
curl -X POST http://admin.conduit.example.com/api/admin/providers/config \
  -H "X-Master-Key: $MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "openai",
    "regionalEndpoints": {
      "us-east": "https://api.openai.com",
      "eu-west": "https://eu.api.openai.com"
    }
  }'

Disaster Recovery

Database Recovery

From Backup:

# List available backups
aws s3 ls s3://conduit-backups/ | grep sql.gz

# Download backup
aws s3 cp s3://conduit-backups/conduit-20240115-020000.sql.gz .

# Restore database
gunzip -c conduit-20240115-020000.sql.gz | \
  kubectl exec -i -n conduit postgres-0 -- psql -U conduit

Point-in-Time Recovery:

# Stop application
kubectl scale deployment conduit-api -n conduit --replicas=0

# Restore to specific time
kubectl exec -n conduit postgres-0 -- pg_restore \
  --dbname=conduit \
  --target-time="2024-01-15 10:30:00" \
  /backup/base.tar

# Restart application
kubectl scale deployment conduit-api -n conduit --replicas=3

Split-Brain Recovery

Identify Split-Brain:

# Check Redis cluster status
kubectl exec -n conduit redis-0 -- redis-cli CLUSTER NODES

# Check for multiple masters
kubectl exec -n conduit redis-0 -- redis-cli CLUSTER INFO

Resolution:

# Force failover to single master
kubectl exec -n conduit redis-1 -- redis-cli CLUSTER FAILOVER FORCE

# Reset cluster if needed
kubectl exec -n conduit redis-0 -- redis-cli CLUSTER RESET HARD

Monitoring and Alerting

Alert Response Matrix

Alert	Severity	Response Time	Actions
Provider Down	Critical	5 min	Disable provider, notify on-call
High Error Rate	High	15 min	Check logs, scale up if needed
Database Connection Pool Exhausted	High	10 min	Kill idle connections, restart pods
Memory > 90%	Medium	30 min	Clear cache, increase limits
Disk > 80%	Medium	1 hour	Archive old data, expand volume
Certificate Expiring	Low	1 day	Renew certificate

On-Call Procedures

Initial Response:

Acknowledge alert within 5 minutes
Join incident channel
Run initial diagnostics
Escalate if needed

Communication:

Update status page
Notify stakeholders
Document actions in incident channel
Create post-mortem ticket

Tools and Scripts

Health Check Script

#!/bin/bash
# check-health.sh

echo "=== System Health Check ==="
echo "Timestamp: $(date)"

# API Health
echo -e "\n--- API Health ---"
curl -s http://api.conduit.example.com/health/ready | jq

# Provider Health  
echo -e "\n--- Provider Health ---"
curl -s http://admin.conduit.example.com/api/admin/providers/health \
  -H "X-Master-Key: $MASTER_KEY" | jq

# Database Status
echo -e "\n--- Database Status ---"
kubectl exec -n conduit postgres-0 -- psql -U conduit -c \
  "SELECT count(*) as connections, state FROM pg_stat_activity GROUP BY state;"

# Redis Status
echo -e "\n--- Redis Status ---"
kubectl exec -n conduit redis-0 -- redis-cli INFO server | grep uptime

# Pod Status
echo -e "\n--- Pod Status ---"
kubectl get pods -n conduit

Quick Diagnostics

#!/bin/bash
# diagnose.sh

# Recent errors
echo "=== Recent Errors ==="
kubectl logs -n conduit -l app=conduit-api --tail=1000 | \
  grep ERROR | tail -20

# Current metrics
echo -e "\n=== Current Metrics ==="
curl -s http://api.conduit.example.com/metrics | \
  grep -E "(conduit_llm_requests_total|conduit_llm_active_requests)"

# Resource usage
echo -e "\n=== Resource Usage ==="
kubectl top pods -n conduit

Next Steps

Health Checks - Configure health monitoring
Metrics Monitoring - Set up alerting
Production Deployment - Deployment procedures
Troubleshooting Guide - Common issues

Overview​

Incident Response​

High Error Rate​

Database Connection Issues​

Memory Issues​

Maintenance Operations​

Rolling Updates​

Database Maintenance​

Certificate Renewal​

Performance Optimization​

Slow Response Times​

High Latency to Providers​

Disaster Recovery​

Database Recovery​

Split-Brain Recovery​

Monitoring and Alerting​

Alert Response Matrix​

On-Call Procedures​

Tools and Scripts​

Health Check Script​

Quick Diagnostics​

Next Steps​

Overview

Incident Response

High Error Rate

Database Connection Issues

Memory Issues

Maintenance Operations

Rolling Updates

Database Maintenance

Certificate Renewal

Performance Optimization

Slow Response Times

High Latency to Providers

Disaster Recovery

Database Recovery

Split-Brain Recovery

Monitoring and Alerting

Alert Response Matrix

On-Call Procedures

Tools and Scripts

Health Check Script

Quick Diagnostics

Next Steps