Troubleshooting Guide
Solutions to common issues and debugging strategies
Common Issues
Pods Not Starting
Symptoms:
- Pods stuck in Pending or CrashLoopBackOff
- ImagePullBackOff errors
Diagnosis:
# Check pod status
kubectl get pods -n composio
# Describe problematic pod
kubectl describe pod <pod-name> -n composio
# Check pod logs
kubectl logs <pod-name> -n composio
# Check events
kubectl get events -n composio --sort-by='.lastTimestamp'
Solutions:
- ImagePullBackOff: Verify ECR credentials are correct
# Recreate ECR secret
kubectl delete secret ecr-secret -n composio
helm upgrade composio ./composio -n composio \
--set externalSecrets.ecr.token="$(aws ecr get-login-password --region us-east-1)"
- Insufficient Resources: Check cluster has enough CPU/memory
# Check node resources
kubectl top nodes
kubectl describe nodes
- Secret Issues: Verify all secrets exist
# Check secrets
kubectl get secrets -n composio
kubectl describe secret composio-secrets -n composio
kubectl describe secret external-postgres-secret -n composio
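On a busy namespace it can help to filter the pod list down to the pods that need attention before describing them one by one. The following is a minimal sketch, assuming the default `kubectl get pods` column layout (NAME, READY, STATUS, ...). `problem_pods` is a hypothetical helper name, not something shipped with the chart.

```shell
# Hypothetical helper: print pods whose STATUS is neither Running nor Completed.
# Reads `kubectl get pods --no-headers` output on stdin
# (assumed columns: NAME READY STATUS RESTARTS AGE).
problem_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 "\t" $3 }'
}

# Usage (requires cluster access):
#   kubectl get pods -n composio --no-headers | problem_pods
```

A pod that is Running but not ready (for example 0/1) is not flagged by this filter, so the READY column is still worth a glance.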
Service Not Accessible
Symptoms:
- Cannot connect to Apollo API
- Connection timeouts
Diagnosis:
# Check service status
kubectl get svc -n composio
# Test service connectivity from within cluster
kubectl run test-pod -n composio --rm -i --tty --image=curlimages/curl --restart=Never -- \
curl http://composio-apollo:9900/api/v1/health
# Check endpoints
kubectl get endpoints -n composio
Solutions:
- Verify pods are running and ready
- Check service selectors match pod labels
- Verify network policies aren't blocking traffic
- Use port-forward for debugging:
kubectl port-forward -n composio svc/composio-apollo 8080:9900
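Checking that service selectors match pod labels can be done by feeding the Service's own selector back into a pod query. This is a sketch under the assumption that the Service uses a plain key/value selector. `pods_for_service` is a hypothetical helper name.

```shell
# Hypothetical helper: list the pods matched by a Service's selector.
# Usage: pods_for_service <service> [namespace]   (namespace defaults to composio)
pods_for_service() {
  # Render the selector map as "key=value,key=value," with a Go template.
  selector="$(kubectl get svc "$1" -n "${2:-composio}" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}')"
  selector="${selector%,}"   # drop the trailing comma
  if [ -z "$selector" ]; then
    echo "no selector found for service $1" >&2
    return 1
  fi
  # Pods listed here are the ones the Service can route to.
  kubectl get pods -n "${2:-composio}" -l "$selector" --show-labels
}

# Usage (requires cluster access):
#   pods_for_service composio-apollo
```

If the command returns no pods while the deployment's pods are running, the selector and the pod labels have most likely drifted apart.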
Knative Issues
Knative Components Not Starting
Diagnosis:
# Check Knative installation
kubectl get pods -n knative-serving
# Check Knative setup job
kubectl get jobs -n composio -l app.kubernetes.io/component=knative-setup
kubectl logs -n composio job/knative-setup-<revision>
# Verify CRDs
kubectl get crd | grep knative
Solutions:
- Reinstall Knative components:
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.15.0/serving-crds.yaml
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.15.0/serving-core.yaml
kubectl apply -f https://github.com/knative/net-kourier/releases/download/knative-v1.15.0/kourier.yaml
kubectl patch configmap/config-network \
--namespace knative-serving \
--type merge \
--patch '{"data":{"ingress-class":"kourier.ingress.networking.knative.dev"}}'
- Ensure cluster has sufficient resources
- Check for conflicting ingress controllers
Mercury Service Not Ready
Diagnosis:
# Check Knative service status
kubectl get ksvc -n composio
kubectl describe ksvc composio-mercury -n composio
# Check Mercury pods
kubectl get pods -n composio -l serving.knative.dev/service=composio-mercury
# View Mercury logs
kubectl logs -n composio -l serving.knative.dev/service=composio-mercury
Solutions:
- Check autoscaler annotations are correct
- Verify resource requests don't exceed cluster capacity
- Check Mercury container logs for application errors
- Ensure Knative activator is running
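The Ready condition on the Knative Service usually explains why Mercury is not serving. The sketch below reads that condition directly and prints its message when the service is not ready. `ksvc_ready` is a hypothetical helper name, and the default namespace of composio is taken from the commands in this guide.

```shell
# Hypothetical helper: succeed only if a Knative Service reports Ready=True;
# otherwise print the Ready condition's message to stderr.
# Usage: ksvc_ready <name> [namespace]   (namespace defaults to composio)
ksvc_ready() {
  status="$(kubectl get ksvc "$1" -n "${2:-composio}" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')"
  if [ "$status" = "True" ]; then
    return 0
  fi
  # Not ready: surface the reason reported by Knative.
  kubectl get ksvc "$1" -n "${2:-composio}" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}' >&2
  echo >&2
  return 1
}

# Usage (requires cluster access):
#   ksvc_ready composio-mercury && echo "Mercury is ready"
```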
Database Issues
Database Connection Failures
Symptoms:
- Apollo/Thermos pods failing to start
- Error messages about database connectivity
Diagnosis:
# Check database secret
kubectl get secret external-postgres-secret -n composio -o yaml
# Decode database URL
kubectl get secret external-postgres-secret -n composio \
-o jsonpath='{.data.url}' | base64 -d
# Test connectivity from pod
kubectl run test-db --rm -i --tty --image=postgres:15 --restart=Never -- \
psql "$(kubectl get secret external-postgres-secret -n composio -o jsonpath='{.data.url}' | base64 -d)"
# Check Apollo logs for connection errors
kubectl logs -n composio deployment/composio-apollo | grep -i "database\|postgres"
Solutions:
- Verify POSTGRES_URL format is correct:
postgresql://user:password@host:5432/database?sslmode=require
- Check network connectivity to database
- Verify database firewall rules allow cluster IPs
- Ensure SSL mode matches database configuration
- Re-run secret setup script:
export POSTGRES_URL="your-connection-string"
./secret-setup.sh -r composio -n composio
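Before re-running the secret setup, the shape of the connection string can be sanity-checked locally. This is a rough sketch only: `check_postgres_url` is a hypothetical helper, and the pattern assumes the user:password@host:port/database form shown above, with any special characters in the password already percent-encoded.

```shell
# Hypothetical helper: check that a connection string looks like
#   postgresql://user:password@host:port/database[?params]
# The shorter postgres:// scheme is accepted as well.
check_postgres_url() {
  printf '%s\n' "$1" \
    | grep -Eq '^postgres(ql)?://[^:@/]+:[^@]+@[^:/@]+:[0-9]+/[^/?]+(\?.*)?$'
}

# Usage:
#   check_postgres_url "$POSTGRES_URL" || echo "POSTGRES_URL looks malformed" >&2
```

The check validates the format only. It does not confirm that the host is reachable or that the credentials are accepted.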
Database Initialization Job Failures
Diagnosis:
# Check DB init jobs
kubectl get jobs -n composio | grep db-init
# Check job logs
kubectl logs -n composio job/composio-apollo-db-init
kubectl logs -n composio job/composio-thermos-db-init
Solutions:
- Check database permissions for migrations
- Verify database schema compatibility
- Delete and recreate failed jobs:
kubectl delete job composio-apollo-db-init -n composio
helm upgrade composio ./composio -n composio
GKE-Specific Issues
GKE Autopilot Resource Issues
Symptoms:
- Pods rejected with resource limit errors
- Autopilot adjusting resource requests
Diagnosis:
# Check Autopilot events
kubectl get events -n composio --sort-by='.lastTimestamp' | grep -i autopilot
# Check resource quotas
kubectl describe quota -n composio
# View Autopilot workload policy configuration
gcloud container clusters describe $CLUSTER_NAME --region=$REGION \
--format="value(autopilot.workloadPolicyConfig)"
Solutions:
- Ensure resource requests are within Autopilot limits
- Disable privileged containers and sysctls (Autopilot restriction)
- Use appropriate QoS classes for pods
- Review and adjust resource limits in values.yaml
Cloud SQL Connectivity
Diagnosis:
# Test Cloud SQL connectivity
kubectl run -it --rm debug --image=postgres:15 --restart=Never -- \
psql "postgresql://postgres:your_password@${POSTGRES_IP}:5432/composio?sslmode=require"
# Check Cloud SQL logs
gcloud logging read \
'resource.type="cloudsql_database" AND resource.labels.database_id:"composio-postgres"' \
--limit 50 --format json
# Verify instance status
gcloud sql instances describe composio-postgres
Solutions:
- Ensure Cloud SQL instance allows connections from GKE cluster
- Check VPC network connectivity
- Verify SSL certificates are valid
- Use Cloud SQL Proxy for enhanced security:
# Install Cloud SQL Proxy sidecar
# See: https://cloud.google.com/sql/docs/postgres/connect-kubernetes-engine
Resource Issues
Out of Memory (OOM)
Diagnosis:
# Check for OOM kills
kubectl get events -n composio | grep -i oom
# Check current resource usage
kubectl top pods -n composio
kubectl top nodes
# Describe pod to see resource limits
kubectl describe pod <pod-name> -n composio
Solutions:
- Increase memory limits in values.yaml
- Enable HPA to distribute load
- Check for memory leaks in application logs
- Optimize application configuration
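Events expire after a while, so an OOM kill may no longer appear in the event list. The container status keeps the last termination reason, which the sketch below queries. `list_oom_killed` is a hypothetical helper name, and the composio default namespace is taken from the commands in this guide.

```shell
# Hypothetical helper: print pods that have at least one container whose
# last termination reason was OOMKilled.
# Usage: list_oom_killed [namespace]   (namespace defaults to composio)
list_oom_killed() {
  # One line per pod: "<name><TAB><space-separated last termination reasons>"
  kubectl get pods -n "${1:-composio}" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
    | awk -F'\t' '$2 ~ /OOMKilled/ { print $1 }'
}

# Usage (requires cluster access):
#   list_oom_killed
```

Pods listed here are the first candidates for a higher memory limit in values.yaml.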
Persistent Volume Issues
Diagnosis:
# Check PVC status
kubectl get pvc -n composio
# Check PV status
kubectl get pv
# Describe PVC
kubectl describe pvc <pvc-name> -n composio
Solutions:
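- Multi-Attach Error: the volume is still attached to the old pod while a new pod tries to mount it. Minio uses RollingUpdate strategy with maxUnavailable=1, so wait for the old pod to terminate or delete it if the error persists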
- Verify storage class is available
- Check cluster has storage provisioner
- Delete pod to force remount if stuck
Debugging Tools & Commands
Essential Debug Commands
# Complete cluster overview
kubectl get all -n composio
# Pod logs (follow)
kubectl logs -f <pod-name> -n composio
# Previous pod logs (after crash)
kubectl logs <pod-name> -n composio --previous
# Multi-container pod logs
kubectl logs <pod-name> -c <container-name> -n composio
# Execute commands in pod
kubectl exec -it <pod-name> -n composio -- /bin/sh
# Port forward for local access
kubectl port-forward -n composio svc/composio-apollo 8080:9900
# Watch resource changes
watch kubectl get pods -n composio
# Get pod YAML
kubectl get pod <pod-name> -n composio -o yaml
# Check pod events
kubectl describe pod <pod-name> -n composio
# Network debugging
kubectl run netshoot --rm -i --tty --image=nicolaka/netshoot --restart=Never
# DNS debugging
kubectl run dnsutils --rm -i --tty --image=gcr.io/kubernetes-e2e-test-images/dnsutils:1.3 --restart=Never -- nslookup composio-apollo.composio.svc.cluster.local
Monitoring Commands
# Resource usage
kubectl top pods -n composio
kubectl top nodes
# HPA status
kubectl get hpa -n composio
kubectl describe hpa <hpa-name> -n composio
# Service endpoints
kubectl get endpoints -n composio
# ConfigMaps and Secrets
kubectl get cm -n composio
kubectl get secrets -n composio
# Knative services
kubectl get ksvc -n composio
kubectl get revisions -n composio
Support Bundle Collection
For complex issues, collect a comprehensive support bundle:
# Enable support bundle in values.yaml
supportBundle:
  enabled: true
# Upgrade deployment
helm upgrade composio ./composio -n composio
# Collect support bundle
kubectl support-bundle ./composio/support-bundle-spec.yaml -n composio
# Or manually collect key information
mkdir -p support-bundle
kubectl get all -n composio -o yaml > support-bundle/all-resources.yaml
kubectl get events -n composio --sort-by='.lastTimestamp' > support-bundle/events.txt
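# kubectl logs needs a pod or selector, so loop over every pod in the namespace
for pod in $(kubectl get pods -n composio -o name); do
  kubectl logs -n composio "$pod" --all-containers=true --prefix=true
done > support-bundle/all-logs.txt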
kubectl describe nodes > support-bundle/nodes.txt