Debugging Playbook
This playbook provides step-by-step solutions for common issues in ReptiDex. Each scenario includes symptoms, investigation steps, and resolution procedures.
Table of Contents
- Service Down / Not Responding
- Database Connection Issues
- Slow API Performance
- Authentication Failures
- Memory Leaks / OOM Errors
- High Error Rate
- Data Inconsistency
- Background Job Failures
- Cache Issues
- Deployment Rollback
Service Down / Not Responding
Symptoms
- Health check endpoint returning 503 or timing out
- All requests to service failing
- Service not listed in ECS task list
Investigation
Step 1: Check if service is running
Common Causes
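If the ECS task is up but requests still fail, probing the health endpoint directly helps distinguish "down" (unreachable) from "unhealthy" (answering with errors). A minimal probe sketch — the `/health` path and port are assumptions:

```python
import urllib.request
import urllib.error

def check_health(url: str, timeout: float = 5.0) -> str:
    """Classify a health endpoint: 'healthy', 'unhealthy', or 'down'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return "healthy" if 200 <= resp.status < 300 else "unhealthy"
    except urllib.error.HTTPError:
        return "unhealthy"   # service answered, but with an error status (e.g. 503)
    except (urllib.error.URLError, OSError):
        return "down"        # connection refused, DNS failure, or timeout

# Example (hypothetical host/port):
# check_health("http://localhost:8000/health")
```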
Cause 1: Configuration Error
Symptoms: Service crashes immediately on startup
Solution: Validate configuration and environment variables (a missing or malformed value usually aborts startup), correct them, and redeploy.
Cause 2: Database Migration Failed
Symptoms: Service starts but crashes when accessing database
Logs to check: Migration errors in the startup log (e.g. failed ALTER/CREATE statements); confirm the schema version matches what the deployed code expects.
Cause 3: Port Already in Use
Symptoms: Service fails to start with “address already in use”
Logs to check: The bind error on startup; identify the process holding the port and stop it, or move the service to a free port.
Cause 4: Resource Limits
Symptoms: Service OOM killed or CPU throttled
Logs to check: OOM-killer messages and container exit codes (137 usually means OOM-killed); raise the task’s CPU/memory limits if usage is legitimately higher.
Resolution Checklist
- Service is running in ECS
- No errors in startup logs
- Configuration is correct
- Database migrations succeeded
- No port conflicts
- Sufficient CPU/memory allocated
- Health check endpoint responding
- Service registered with ALB
Database Connection Issues
Symptoms
- `DatabaseConnectionError` / `OperationalError: connection pool exhausted` in logs
- Requests timing out waiting for database
- Slow query performance
Investigation
Step 1: Check connection pool status
- `db_connections_active` (should be < pool_size)
- `db_connections_waiting` (should be 0)
- `db_query_duration_seconds` (p95 should be < 1s)
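The relationship between these gauges can be seen in a toy pool that tracks them explicitly — a sketch for intuition, not the real driver’s pool:

```python
import queue
import threading

class MonitoredPool:
    """Toy connection pool exposing the Step 1 gauges:
    active (db_connections_active) and waiting (db_connections_waiting)."""

    def __init__(self, factory, pool_size: int = 5):
        self.pool_size = pool_size
        self._idle = queue.Queue()
        for _ in range(pool_size):
            self._idle.put(factory())
        self.active = 0
        self.waiting = 0
        self._lock = threading.Lock()

    def acquire(self, timeout: float = 5.0):
        with self._lock:
            self.waiting += 1           # caller is now queued for a connection
        try:
            conn = self._idle.get(timeout=timeout)  # blocks when pool exhausted
        finally:
            with self._lock:
                self.waiting -= 1
        with self._lock:
            self.active += 1
        return conn

    def release(self, conn):
        with self._lock:
            self.active -= 1
        self._idle.put(conn)
```

When `waiting` is persistently nonzero and `active` equals `pool_size`, the pool is exhausted — that is Cause 1 below.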
Common Causes
Cause 1: Connection Pool Exhausted
Symptoms: All connections in use, requests waiting
Logs: pool-exhausted errors, and request paths that acquire a connection without releasing it
Solutions:
- Optimize slow queries
- Move long-running operations out of transaction
- Use async operations
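The most common pool-exhaustion bug is a request path that acquires a connection but never releases it on error. A sketch of the usual fix — a context manager that always returns the connection — shown with sqlite3 so it is runnable:

```python
import sqlite3
from contextlib import contextmanager

@contextmanager
def pooled_connection(db_path: str):
    """Guarantee the connection is committed/rolled back and closed even if
    the request errors — the standard fix for leak-driven pool exhaustion."""
    conn = sqlite3.connect(db_path)
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()   # failed request must not leave an open transaction
        raise
    finally:
        conn.close()      # connection is always returned, success or failure
```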
Cause 2: Database Performance Degradation
Symptoms: Queries that were fast are now slow
Investigation: Compare current query plans and durations against a known-good baseline; look for missing indexes, table bloat, or lock contention.
Cause 3: Database Server Issues
Symptoms: All services affected simultaneously
Check: RDS CPU, memory, IOPS, and connection count in the database dashboard. Solutions:
- Scale up RDS instance
- Add read replicas
- Optimize top queries
- Enable query caching
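When optimizing top queries, confirm the planner actually uses a new index rather than assuming it does. A runnable illustration with sqlite3 — the table and index names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reptiles (id INTEGER PRIMARY KEY, species TEXT)")

def plan(sql: str) -> str:
    """Return the query plan text for a statement (detail is column 3)."""
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Without an index the planner must scan the whole table
before = plan("SELECT * FROM reptiles WHERE species = 'gecko'")

conn.execute("CREATE INDEX idx_species ON reptiles (species)")

# With the index the same query becomes an index search
after = plan("SELECT * FROM reptiles WHERE species = 'gecko'")
```

The same check on PostgreSQL is `EXPLAIN ANALYZE`; the principle — verify the plan before and after — is identical.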
Resolution Checklist
- Connection pool not exhausted
- No connection leaks detected
- Query performance optimized
- Indexes added for slow queries
- Database CPU/memory healthy
- Connection timeout configured appropriately
- Error rate returned to normal
Slow API Performance
Symptoms
- API requests taking >2 seconds
- User complaints about slow page loads
- Timeouts on frontend
Investigation
Step 1: Identify slow endpoints
Common Causes
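Step 1 usually means aggregating request durations per endpoint and ranking by p95. A sketch that does this over `(endpoint, duration)` pairs parsed from access logs, using a nearest-rank approximation for the percentile:

```python
from collections import defaultdict

def p95_by_endpoint(records):
    """records: iterable of (endpoint, duration_seconds) pairs.
    Returns {endpoint: approximate p95 duration} (nearest-rank method)."""
    buckets = defaultdict(list)
    for endpoint, duration in records:
        buckets[endpoint].append(duration)
    out = {}
    for endpoint, durations in buckets.items():
        durations.sort()
        idx = min(len(durations) - 1, int(0.95 * len(durations)))
        out[endpoint] = durations[idx]
    return out
```

Sort the resulting dict by value descending and start investigating from the top.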
Cause 1: N+1 Query Problem
Symptoms: Many database queries for a single API request
Logs: a burst of near-identical SELECTs (one per item) under a single request ID
Cause 2: Missing Cache
Symptoms: Repeated expensive calculations
Solution: Cache the computed result with a TTL and invalidate it when the underlying data changes.
Cause 3: External API Timeout
Symptoms: Slow requests to external services
Logs: outbound call durations; set explicit timeouts so one slow dependency cannot stall the whole request
Cause 4: Large Response Payload
Symptoms: Slow serialization, large network transfer
Solution: Paginate large result sets, return only the fields the client needs, and enable response compression.
Resolution Checklist
- Identified slow endpoint(s)
- Optimized database queries
- Added appropriate indexes
- Implemented caching where needed
- Fixed N+1 query problems
- Added pagination for large result sets
- Set timeouts on external calls
- Response time < 500ms for p95
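The N+1 fix from the checklist, shown end to end: one query per parent row versus a single JOIN. The schema is a hypothetical sqlite3 stand-in:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE owners (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE reptiles (id INTEGER PRIMARY KEY, owner_id INTEGER, species TEXT);
    INSERT INTO owners VALUES (1, 'ana'), (2, 'bo');
    INSERT INTO reptiles VALUES (1, 1, 'gecko'), (2, 1, 'iguana'), (3, 2, 'boa');
""")

def reptiles_n_plus_one():
    # 1 query for owners + 1 query per owner: N+1 queries total
    result = {}
    for owner_id, name in conn.execute("SELECT id, name FROM owners"):
        rows = conn.execute(
            "SELECT species FROM reptiles WHERE owner_id = ?", (owner_id,))
        result[name] = [s for (s,) in rows]
    return result

def reptiles_batched():
    # a single JOIN replaces all the per-owner queries
    result = {}
    rows = conn.execute(
        "SELECT o.name, r.species FROM owners o "
        "JOIN reptiles r ON r.owner_id = o.id")
    for name, species in rows:
        result.setdefault(name, []).append(species)
    return result
```

With an ORM the same fix is eager loading (e.g. a join/`selectinload`-style option) instead of lazy per-row access.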
Authentication Failures
Symptoms
- Users unable to log in
- 401 Unauthorized errors
- “Invalid token” errors
Investigation
Step 1: Check failure rate
Common Causes
Cause 1: Token Expired
Symptoms: Users logged in previously, now getting 401
Solution: Expected behavior once the token’s expiry passes; the client should refresh the token or prompt for re-login.
Cause 2: Auth Service Down
Symptoms: All auth requests failing
Check:
- Check auth service health
- Restart auth service if needed
- Verify database connectivity
Cause 3: Invalid Credentials
Symptoms: User entering wrong password
Logs: repeated failed-login entries for the same account or IP. Solutions:
- User needs to reset password
- Implement account lockout after N failed attempts
- Add CAPTCHA after 3 failed attempts
Cause 4: Token Signature Mismatch
Symptoms: Valid tokens being rejected
Logs: signature-verification failures; confirm every instance uses the same JWT secret (a mismatch usually follows a secret rotation or an environment config drift).
Resolution Checklist
- Auth service is healthy
- Token expiry is appropriate
- JWT secret key is correct
- No brute force attacks detected
- Users can successfully authenticate
- Token refresh working (if implemented)
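Causes 1 and 4 can be told apart mechanically: an expired token still carries a valid signature, while a secret mismatch does not. A self-contained sketch using an HMAC-signed token — not the production JWT library, and the secret is a placeholder:

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-secret"  # placeholder; the real key lives in the secret store

def sign(payload: dict) -> str:
    """Produce a body.signature token (JWT-like, simplified for illustration)."""
    body = base64.urlsafe_b64encode(json.dumps(payload).encode())
    mac = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return body.decode() + "." + mac

def verify(token: str) -> str:
    """Return 'ok', 'expired' (Cause 1), or 'bad_signature' (Cause 4)."""
    body, _, mac = token.rpartition(".")
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(mac, expected):
        return "bad_signature"   # wrong secret or tampering
    payload = json.loads(base64.urlsafe_b64decode(body.encode()))
    if payload.get("exp", 0) < time.time():
        return "expired"         # signature fine, token past its expiry
    return "ok"
```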
Memory Leaks / OOM Errors
Symptoms
- Service killed with OOM error
- Memory usage steadily increasing
- Slow performance over time
- Service crashes after running for hours/days
Investigation
Step 1: Check memory warnings
Common Causes
Cause 1: Unclosed Database Connections
Symptoms: Memory increases with request count
Solution: Close connections deterministically (context managers / finally blocks) so each request releases what it acquired.
Cause 2: Large Object Caching
Symptoms: Memory grows with cache size
Solution: Bound the cache (max entries or LRU eviction) and set TTLs instead of caching indefinitely.
Cause 3: Memory Leak in Library
Symptoms: Memory increases even without load
Investigation: Bisect recent dependency upgrades and profile allocations. Solutions:
- Update library to latest version
- Find alternative library
- Report bug to library maintainers
Cause 4: Large Response Buffering
Symptoms: Memory spike when serving large files/responses
Solution: Stream large responses in chunks instead of buffering the entire payload in memory.
Resolution Checklist
- Identified memory leak source
- Fixed connection leaks
- Implemented proper cleanup
- Added cache size limits
- Memory usage stable over time
- No OOM errors in last 24h
- Service uptime > 7 days without restart
High Error Rate
Symptoms
- Error rate suddenly increased
- Multiple different error types
- May affect multiple services
Investigation
Step 1: Identify when it started
Common Causes
Cause 1: Bad Deployment
Symptoms: Errors started right after deployment
Solution: Roll back to the previous version (see Deployment Rollback below), then diagnose the bad change offline.
Cause 2: Dependency Failure
Symptoms: Multiple services affected
Investigation:
- Identify failing dependency (database, Redis, external API)
- Fix or restart dependency
- Implement circuit breaker for external dependencies
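A minimal sketch of the circuit breaker suggested above: trip after consecutive failures, fail fast while open, and allow a trial call after a cool-down. Thresholds are illustrative:

```python
import time

class CircuitBreaker:
    """Open the circuit after `max_failures` consecutive errors, reject calls
    fast while open, and allow one trial call after `reset_after` seconds."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: dependency still failing")
            self.opened_at = None     # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0             # success resets the failure count
        return result
```

While open, callers get an immediate error instead of piling up timed-out requests against the sick dependency.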
Cause 3: Traffic Spike
Symptoms: Errors correlate with high traffic
Check: request rate against instance count and CPU. Solutions:
- Scale up service instances
- Implement rate limiting
- Add caching layer
- Enable auto-scaling
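The rate limiting suggested above is commonly a token bucket: steady refill with a burst allowance. A single-process sketch — a real deployment would keep one bucket per client, e.g. in Redis:

```python
import time

class TokenBucket:
    """Allow up to `capacity` burst requests, refilled at `rate` tokens/second.
    A request is admitted iff a whole token is available."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Rejected requests should get 429 with a Retry-After header rather than queueing.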
Cause 4: Database Migration Issue
Symptoms: Errors related to schema/columns
Logs: “column does not exist” / “relation does not exist” errors; verify the migration actually ran in this environment and matches the deployed code.
Resolution Checklist
- Identified root cause
- Error rate decreased
- No ongoing issues
- Deployment rolled back if needed
- Dependencies healthy
- Capacity sufficient for traffic
- Post-mortem documented
Data Inconsistency
Symptoms
- Data doesn’t match between services
- Users seeing outdated data
- Cache showing different values than database
Investigation
Step 1: Identify inconsistency
- Which data is inconsistent?
- Between which systems? (DB, cache, search index)
- When did it start?
Common Causes & Solutions
Cause 1: Stale Cache
Solution: Delete or invalidate the stale keys, shorten the TTL, and invalidate on every write path that touches the cached data.
Cause 2: Transaction Rollback
Symptoms: Write appeared to succeed but data not in DB
Logs: rollback entries near the write; check whether a later failure in the same transaction undid the apparent success.
Cause 3: Event/Message Loss
Symptoms: Event-driven updates didn’t propagate
Solution:
- Check message queue (SQS, SNS)
- Verify event consumers are running
- Implement retry mechanism
- Add dead letter queue
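The retry mechanism and dead letter queue from the list above can be combined in the consumer: retry a few times, then park the message for inspection instead of dropping it. A sketch — the dead-letter list stands in for SQS’s DLQ:

```python
def consume(message, handler, dead_letters, max_attempts: int = 3):
    """Apply handler to message, retrying on failure.
    After max_attempts the message goes to dead_letters instead of being lost."""
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(message)
        except Exception as exc:
            if attempt == max_attempts:
                # park the poison message with its last error for manual triage
                dead_letters.append((message, repr(exc)))
                return None
```

With SQS this is the `maxReceiveCount` redrive policy; the sketch shows the same logic in-process.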
Resolution Checklist
- Identified inconsistent data
- Root cause determined
- Data manually reconciled if needed
- Cache cleared/invalidated
- Write operations succeed
- Consistency checks pass
- Monitoring added to detect future issues
Background Job Failures
Symptoms
- Scheduled jobs not running
- Async tasks failing
- Workers crashing
Investigation
Step 1: Check job status
Common Causes & Solutions
Cause 1: Worker Died
Check: worker process/container status and last heartbeat; restart dead workers and read their crash logs.
Cause 2: Job Timeout
Logs: tasks killed at the timeout limit; either raise the limit or split the job into smaller units.
Cause 3: Missing Dependencies
Solution: Ensure workers run with the same dependencies and environment as the main service (broker reachable, packages installed, config present).
Resolution Checklist
- Workers running
- Jobs executing successfully
- No timeout issues
- Dependencies available
- Error handling implemented
- Retry logic working
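For Cause 2, a hard wall-clock timeout around each job keeps a hung task from blocking the worker forever. A sketch using concurrent.futures — note the hung thread is abandoned, not killed; truly killing it requires a process pool:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as JobTimeout

def run_job(fn, timeout_s: float, *args):
    """Run fn(*args) but give up after timeout_s seconds of wall-clock time."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except JobTimeout:
        raise TimeoutError(f"job exceeded {timeout_s}s")
    finally:
        # don't block on the (possibly hung) worker thread
        pool.shutdown(wait=False)
```

Task frameworks expose the same idea as a config knob (e.g. a per-task time limit); prefer that over hand-rolling when available.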
Cache Issues
Symptoms
- Slow performance after cache miss
- Users seeing stale data
- Cache hit rate low
Investigation
Step 1: Check cache metrics
Common Causes & Solutions
Cause 1: Cache Eviction
Solution: Increase cache memory or reduce what is cached; prefer an LRU eviction policy so hot keys survive.
Cause 2: Cache Stampede
Symptoms: Many requests for same expired key
Solution: Recompute behind a per-key lock (or serve slightly stale data while one worker refreshes) so only one request pays the recompute cost.
Resolution Checklist
- Cache hit rate > 80%
- No cache errors
- TTL configured appropriately
- Cache warming implemented
- Stampede protection in place
Deployment Rollback
When to Rollback
- High error rate after deployment
- Critical functionality broken
- Data corruption risk
- Cannot quickly fix forward
Rollback Procedure
Step 1: Identify previous version
Database Rollback
If database migration was applied: roll it back only if the migration is reversible and no new data depends on the new schema; otherwise fix forward.
Post-Rollback
- Error rate normal
- All services healthy
- Users can access features
- Database consistent
- Incident documented
- Fix planned for next deployment
Getting Help
If you can’t resolve the issue:
- Check this playbook for similar scenarios
- Search Slack #incidents for recent similar issues
- Escalate to on-call engineer if critical
- Post in #engineering with:
- Service affected
- Symptoms
- What you’ve tried
- Relevant logs/dashboards
Next Steps
- Error Investigation Workflow - Systematic investigation process
- Saved LogQL Queries - Query templates
- Log Correlation - Correlating logs with metrics

