
Debugging Playbook

This playbook provides step-by-step solutions for common issues in ReptiDex. Each scenario includes symptoms, investigation steps, and resolution procedures.

Table of Contents

  1. Service Down / Not Responding
  2. Database Connection Issues
  3. Slow API Performance
  4. Authentication Failures
  5. Memory Leaks / OOM Errors
  6. High Error Rate
  7. Data Inconsistency
  8. Background Job Failures
  9. Cache Issues
  10. Deployment Rollback

Service Down / Not Responding

Symptoms

  • Health check endpoint returning 503 or timing out
  • All requests to service failing
  • Service not listed in ECS task list

Investigation

Step 1: Check if service is running
# Check ECS service status
aws ecs describe-services \
  --cluster dev-reptidex-cluster \
  --services dev-reptidex-core \
  --query 'services[0].{Status:status,DesiredCount:desiredCount,RunningCount:runningCount}'

# Check running tasks
aws ecs list-tasks \
  --cluster dev-reptidex-cluster \
  --service-name dev-reptidex-core \
  --desired-status RUNNING
Step 2: Check service logs
{service="repti-core"}
| json
| message =~ "(?i)starting|started|stopping|stopped|fatal|panic"
| line_format "{{.timestamp}} [{{.level}}] {{.message}}"
Step 3: Check for errors at startup
{service="repti-core"}
| json
| level =~ "ERROR|CRITICAL"
| line_format "{{.timestamp}} {{.error_type}}: {{.message}}\n{{.stack_trace}}"

Common Causes

Cause 1: Configuration Error

Symptoms: Service crashes immediately on startup
Solution:
# Check environment variables
aws ecs describe-task-definition \
  --task-definition repti-core \
  --query 'taskDefinition.containerDefinitions[0].environment'

# Fix configuration in .env or ECS task definition
# Then redeploy

Cause 2: Database Migration Failed

Symptoms: Service starts but crashes when accessing the database
Logs to check:
{service="repti-core"}
| json
| message =~ "(?i)migration|alembic|database"
| level="ERROR"
Solution:
# Run migrations manually
docker exec -it <container_id> alembic upgrade head

# Or rollback if migration is broken
docker exec -it <container_id> alembic downgrade -1

Cause 3: Port Already in Use

Symptoms: Service fails to start with “address already in use”
Logs to check:
{service="repti-core"}
| json
| message =~ "(?i)address already in use|port.*in use"
Solution:
# Find process using the port
lsof -i :8000

# Kill the process
kill -9 <PID>

# Restart service

Cause 4: Resource Limits

Symptoms: Service OOM killed or CPU throttled
Logs to check:
{service="repti-core"}
| json
| message =~ "(?i)oom|out of memory|killed"
Solution:
# Increase memory/CPU in the ECS task definition.
# register-task-definition needs the full definition, so export the
# current revision, bump "memory" (e.g. 512 -> 1024), strip the
# read-only fields, and re-register:
aws ecs describe-task-definition \
  --task-definition repti-core \
  --query 'taskDefinition' > taskdef.json
# edit taskdef.json, then:
aws ecs register-task-definition --cli-input-json file://taskdef.json

# Force new deployment
aws ecs update-service \
  --cluster dev-reptidex-cluster \
  --service dev-reptidex-core \
  --force-new-deployment

Resolution Checklist

  • Service is running in ECS
  • No errors in startup logs
  • Configuration is correct
  • Database migrations succeeded
  • No port conflicts
  • Sufficient CPU/memory allocated
  • Health check endpoint responding
  • Service registered with ALB (see the checks below)
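Quick commands for the last two items (the target group ARN is a placeholder):

# Health check endpoint
curl -i https://dev.reptidex.com/api/v1/health

# ALB target registration and health
aws elbv2 describe-target-health \
  --target-group-arn <target_group_arn> \
  --query 'TargetHealthDescriptions[].TargetHealth.State'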

Database Connection Issues

Symptoms

  • DatabaseConnectionError
  • OperationalError: connection pool exhausted
  • Requests timing out waiting for database
  • Slow query performance

Investigation

Step 1: Check connection pool status
{service=~"repti-.*"}
| json
| error_type =~ ".*Connection.*"
| details_pool_size != ""
| line_format "Pool: {{.details_pool_size}}, Active: {{.details_active_connections}}, Waiting: {{.details_waiting_connections}}"
Step 2: Check for slow queries
{service=~"repti-.*"}
| json
| message =~ "(?i)query|select|insert|update"
| duration_ms > 1000
| line_format "{{.duration_ms}}ms - {{.message}}"
Step 3: Check database metrics
Go to Prometheus and check the following (example queries follow the list):
  • db_connections_active (should be < pool_size)
  • db_connections_waiting (should be 0)
  • db_query_duration_seconds (p95 should be < 1s)
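Example queries for these checks (metric names as listed above; the service label and histogram buckets are assumptions about how they are exported):

# Waiting connections should stay at 0
max by (service) (db_connections_waiting)

# p95 query latency, assuming db_query_duration_seconds is a histogram
histogram_quantile(0.95,
  sum by (le) (rate(db_query_duration_seconds_bucket[5m]))
)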

Common Causes

Cause 1: Connection Pool Exhausted

Symptoms: All connections in use, requests waiting
Logs:
{service=~"repti-.*"}
| json
| error_type="DatabaseConnectionError"
| details_waiting_connections > 0
Solutions:
Option A: Increase pool size
# In app/core/database.py
engine = create_async_engine(
    settings.database_url,
    pool_size=20,  # Increase from 10
    max_overflow=10,  # Increase from 5
)
Option B: Fix connection leaks
# Ensure all connections are properly closed
# Use context managers
async with get_db() as session:
    # Your code here
    pass  # Connection automatically closed
Option C: Reduce connection hold time
  • Optimize slow queries
  • Move long-running operations out of the transaction (see the sketch below)
  • Use async operations
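A minimal sketch for moving slow work out of the transaction (Report and generate_report are hypothetical; the point is that the transaction stays open only for the write):

# Before: slow work runs inside the transaction
async with session.begin():
    report = await generate_report(animal_id)  # slow
    session.add(Report(animal_id=animal_id, data=report))

# After: do the slow work first, keep the transaction short
report = await generate_report(animal_id)
async with session.begin():
    session.add(Report(animal_id=animal_id, data=report))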

Cause 2: Database Performance Degradation

Symptoms: Queries that were fast are now slow
Investigation:
# Find slow queries
{service=~"repti-.*"}
| json
| message =~ "(?i)query"
| duration_ms > 1000
| line_format "{{.duration_ms}}ms - {{.message}}"
Solutions:
Check for missing indexes:
-- Find queries without indexes
SELECT * FROM pg_stat_statements
WHERE calls > 1000
  AND mean_exec_time > 1000
ORDER BY mean_exec_time DESC;
Add indexes:
# In alembic migration
def upgrade():
    op.create_index(
        'ix_animals_species_id',
        'animals',
        ['species_id']
    )

Cause 3: Database Server Issues

Symptoms: All services affected simultaneously
Check:
# describe-db-instances does not report CPU or memory usage;
# check instance status/class here, then use the CloudWatch metrics
# CPUUtilization and FreeableMemory (AWS/RDS namespace) for utilization
aws rds describe-db-instances \
  --db-instance-identifier reptidex-dev \
  --query 'DBInstances[0].{Status:DBInstanceStatus,Class:DBInstanceClass}'
Solutions:
  • Scale up RDS instance (example below)
  • Add read replicas
  • Optimize top queries
  • Enable query caching
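For the first option, a hedged example of changing the instance class in place (the target class is an assumption; expect a brief interruption while the change applies):

aws rds modify-db-instance \
  --db-instance-identifier reptidex-dev \
  --db-instance-class db.t3.large \
  --apply-immediately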

Resolution Checklist

  • Connection pool not exhausted
  • No connection leaks detected
  • Query performance optimized
  • Indexes added for slow queries
  • Database CPU/memory healthy
  • Connection timeout configured appropriately
  • Error rate returned to normal

Slow API Performance

Symptoms

  • API requests taking >2 seconds
  • User complaints about slow page loads
  • Timeouts on frontend

Investigation

Step 1: Identify slow endpoints
{service=~"repti-.*"}
| json
| duration_ms > 2000
| line_format "{{.duration_ms}}ms - {{.method}} {{.endpoint}}"
Step 2: Find slowest operations
topk(10,
  avg by (endpoint) (
    avg_over_time(
      {service=~"repti-.*"}
      | json
      | unwrap duration_ms [1h]
    )
  )
)
Step 3: Check database query time
{service=~"repti-.*"}
| json
| message =~ "(?i)query|database"
| duration_ms > 500

Common Causes

Cause 1: N+1 Query Problem

Symptoms: Many database queries for a single API request
Logs:
{service="repti-core"}
| json
| request_id="<REQUEST_ID>"
| message =~ "(?i)query|select"
| line_format "{{.timestamp}} - {{.message}}"
Solution:
# Before (N+1 problem)
animals = session.query(Animal).all()
for animal in animals:
    print(animal.species.name)  # Separate query for each!

# After (eager loading)
from sqlalchemy.orm import joinedload

animals = session.query(Animal)\
    .options(joinedload(Animal.species))\
    .all()
for animal in animals:
    print(animal.species.name)  # No extra queries

Cause 2: Missing Cache

Symptoms: Repeated expensive calculations
Solution:
from functools import lru_cache

@lru_cache(maxsize=100)
def calculate_lineage(animal_id: str) -> dict:
    # Expensive operation cached
    pass

# Or use a Redis cache (assumes an async calculate_lineage and a redis client)
import json

async def get_animal_lineage(animal_id: str):
    cache_key = f"lineage:{animal_id}"
    cached = await redis.get(cache_key)
    if cached:
        return json.loads(cached)

    # Calculate and cache for an hour
    result = await calculate_lineage(animal_id)
    await redis.setex(cache_key, 3600, json.dumps(result))
    return result

Cause 3: External API Timeout

Symptoms: Slow requests to external services
Logs:
{service=~"repti-.*"}
| json
| message =~ "(?i)external|api call|http request"
| duration_ms > 1000
Solution:
# Add timeout to external calls
import httpx

async with httpx.AsyncClient(timeout=5.0) as client:
    response = await client.get("https://external-api.com/data")

# Use circuit breaker
from circuitbreaker import circuit

@circuit(failure_threshold=5, recovery_timeout=60)
async def call_external_api():
    # Will stop calling if too many failures
    pass

Cause 4: Large Response Payload

Symptoms: Slow serialization, large network transfer
Solution:
# Add pagination
@router.get("/animals")
async def list_animals(
    skip: int = 0,
    limit: int = 20  # Limit results
):
    animals = await repo.get_multi(skip=skip, limit=limit)
    return animals

# Use field selection (fetch the animals as above, then trim the payload)
@router.get("/animals")
async def list_animals(fields: str | None = None):
    animals = await repo.get_multi()
    if fields:
        # Return only the requested fields
        selected = fields.split(",")
        return [
            {k: v for k, v in animal.dict().items() if k in selected}
            for animal in animals
        ]
    return animals

Resolution Checklist

  • Identified slow endpoint(s)
  • Optimized database queries
  • Added appropriate indexes
  • Implemented caching where needed
  • Fixed N+1 query problems
  • Added pagination for large result sets
  • Set timeouts on external calls
  • Response time < 500ms for p95

Authentication Failures

Symptoms

  • Users unable to log in
  • 401 Unauthorized errors
  • “Invalid token” errors

Investigation

Step 1: Check failure rate
rate(
  {service="repti-auth"}
  | json
  | message =~ "(?i)authentication failed" [5m]
)
Step 2: Identify affected users
{service="repti-auth"}
| json
| message =~ "(?i)authentication failed"
| line_format "{{.timestamp}} - User: {{.user_id}} - Reason: {{.details.reason}}"
Step 3: Check for suspicious patterns
# Check for brute force
sum by (source_ip) (
  count_over_time(
    {service="repti-auth"}
    | json
    | message =~ "(?i)authentication failed" [5m]
  )
) > 10

Common Causes

Cause 1: Token Expired

Symptoms: Users logged in previously, now getting 401
Solution:
# Check token expiry settings
# In app/core/security.py
ACCESS_TOKEN_EXPIRE_MINUTES = 60  # 1 hour

# Or implement refresh tokens
from datetime import datetime, timedelta

def create_refresh_token(user_id: str):
    expires = datetime.utcnow() + timedelta(days=30)
    return create_token({"sub": user_id}, expires)

Cause 2: Auth Service Down

Symptoms: All auth requests failing
Check:
{service="repti-auth"}
| json
| level =~ "ERROR|CRITICAL"
Solution:
  • Check auth service health
  • Restart auth service if needed
  • Verify database connectivity

Cause 3: Invalid Credentials

Symptoms: User entering wrong password
Logs:
{service="repti-auth"}
| json
| message =~ "(?i)invalid credentials|wrong password"
| user_id != ""
Solution:
  • User needs to reset password
  • Implement account lockout after N failed attempts (sketch below)
  • Add CAPTCHA after 3 failed attempts
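A minimal lockout sketch for the second item, assuming an async Redis client is available (key name and thresholds are illustrative):

LOCKOUT_THRESHOLD = 5
LOCKOUT_SECONDS = 900  # 15 minutes

async def register_failed_login(user_id: str) -> bool:
    # Returns True if the account is now locked
    key = f"failed_logins:{user_id}"
    failures = await redis.incr(key)
    if failures == 1:
        await redis.expire(key, LOCKOUT_SECONDS)  # start the lockout window
    return failures >= LOCKOUT_THRESHOLD

async def is_locked_out(user_id: str) -> bool:
    failures = await redis.get(f"failed_logins:{user_id}")
    return failures is not None and int(failures) >= LOCKOUT_THRESHOLD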

Cause 4: Token Signature Mismatch

Symptoms: Valid tokens being rejected
Logs:
{service="repti-auth"}
| json
| message =~ "(?i)signature|invalid token"
Solution:
# Verify JWT secret key is consistent
# Check SECRET_KEY environment variable
# Ensure it hasn't changed after tokens were issued

# If secret changed, invalidate all tokens:
# 1. Update secret
# 2. Force users to re-authenticate
# 3. Clear Redis token cache
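To confirm whether the configured secret still verifies a rejected token, a small local check (assuming python-jose and HS256; the SECRET_KEY import location is an assumption):

from jose import JWTError, jwt

from app.core.security import SECRET_KEY  # assumed location of the secret

def check_token(token: str) -> None:
    try:
        payload = jwt.decode(token, SECRET_KEY, algorithms=["HS256"])
        print("Signature OK, subject:", payload.get("sub"))
    except JWTError as exc:
        # Signature or expiry problem: secret mismatch or stale token
        print("Verification failed:", exc)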

Resolution Checklist

  • Auth service is healthy
  • Token expiry is appropriate
  • JWT secret key is correct
  • No brute force attacks detected
  • Users can successfully authenticate
  • Token refresh working (if implemented)

Memory Leaks / OOM Errors

Symptoms

  • Service killed with OOM error
  • Memory usage steadily increasing
  • Slow performance over time
  • Service crashes after running for hours/days

Investigation

Step 1: Check memory warnings
{service="repti-core"}
| json
| message =~ "(?i)memory|oom|out of memory"
| line_format "{{.timestamp}} [{{.level}}] {{.message}}"
Step 2: Check service restarts
{service="repti-core"}
| json
| message =~ "(?i)starting|started"
| line_format "{{.timestamp}} - {{.message}}"
Step 3: Monitor memory trend
Check Prometheus:
# Memory usage over time
container_memory_usage_bytes{service="repti-core"}

# Memory usage percentage
(container_memory_usage_bytes / container_memory_limit_bytes) * 100

Common Causes

Cause 1: Unclosed Database Connections

Symptoms: Memory increases with request count
Solution:
# Always use context managers
from sqlalchemy import select

async def get_animals():
    async with get_db() as session:
        result = await session.execute(select(Animal))
        return result.scalars().all()
    # Connection automatically closed when the block exits

# Or use dependency injection
@router.get("/animals")
async def list_animals(db: AsyncSession = Depends(get_db)):
    result = await db.execute(select(Animal))
    return result.scalars().all()
    # FastAPI handles cleanup via the dependency

Cause 2: Large Object Caching

Symptoms: Memory grows with cache size
Solution:
# Use LRU cache with size limit
from functools import lru_cache

@lru_cache(maxsize=100)  # Limit cache size
def expensive_calculation(param):
    pass

# Or use Redis instead of in-memory cache
# Move large objects to external cache

Cause 3: Memory Leak in Library

Symptoms: Memory increases even without load
Investigation:
# Profile memory usage
pip install memory-profiler

# Add to code
from memory_profiler import profile

@profile
def my_function():
    pass

# Run and check output
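If installing a profiler isn't convenient, the standard-library tracemalloc module gives a similar top-allocations view (minimal sketch):

import tracemalloc

tracemalloc.start()

# ... exercise the suspected code path ...

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)  # file:line with allocated size and count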
Solution:
  • Update library to latest version
  • Find alternative library
  • Report bug to library maintainers

Cause 4: Large Response Buffering

Symptoms: Memory spike when serving large files/responses
Solution:
# Use streaming responses
from fastapi.responses import StreamingResponse

@router.get("/large-file")
async def download_file():
    async def iterfile():
        with open("large_file.csv", "rb") as f:
            while chunk := f.read(8192):
                yield chunk

    return StreamingResponse(
        iterfile(),
        media_type="text/csv"
    )

Resolution Checklist

  • Identified memory leak source
  • Fixed connection leaks
  • Implemented proper cleanup
  • Added cache size limits
  • Memory usage stable over time
  • No OOM errors in last 24h
  • Service uptime > 7 days without restart

High Error Rate

Symptoms

  • Error rate suddenly increased
  • Multiple different error types
  • May affect multiple services

Investigation

Step 1: Identify when it started
rate(
  {service=~"repti-.*"}
  | json
  | level="ERROR" [5m]
)
Step 2: Find most common errors
topk(10,
  sum by (error_type) (
    count_over_time(
      {service=~"repti-.*"}
      | json
      | level="ERROR" [1h]
    )
  )
)
Step 3: Check for recent changes
{service=~"repti-.*"}
| json
| message =~ "(?i)deployment|version|started"
| line_format "{{.timestamp}} [{{.service}}] {{.version}}"

Common Causes

Cause 1: Bad Deployment

Symptoms: Errors started right after deployment
Solution:
# Rollback to previous version
aws ecs update-service \
  --cluster dev-reptidex-cluster \
  --service dev-reptidex-core \
  --task-definition repti-core:123  # Previous revision

# Or use blue/green deployment
# Switch traffic back to old version

Cause 2: Dependency Failure

Symptoms: Multiple services affected
Investigation:
# Check all services at same time
{service=~"repti-.*"}
| json
| level="ERROR"
Solution:
  • Identify failing dependency (database, Redis, external API)
  • Fix or restart dependency
  • Implement circuit breaker for external dependencies

Cause 3: Traffic Spike

Symptoms: Errors correlate with high traffic
Check:
# Request rate
rate(http_requests_total[5m])
Solution:
  • Scale up service instances
  • Implement rate limiting (sketch below)
  • Add caching layer
  • Enable auto-scaling
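A hedged sketch of the rate-limiting option using the slowapi package (an assumption; not necessarily part of the ReptiDex stack):

from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

app = FastAPI()
limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.get("/api/v1/animals")
@limiter.limit("100/minute")  # per-client limit; tune to real traffic
async def list_animals(request: Request):
    ...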

Cause 4: Database Migration Issue

Symptoms: Errors related to schema/columns
Logs:
{service=~"repti-.*"}
| json
| error_type =~ ".*Column.*|.*Table.*|.*Schema.*"
Solution:
# Rollback migration
docker exec -it <container> alembic downgrade -1

# Or apply the pending migration that adds the missing column
docker exec -it <container> alembic upgrade head

Resolution Checklist

  • Identified root cause
  • Error rate decreased
  • No ongoing issues
  • Deployment rolled back if needed
  • Dependencies healthy
  • Capacity sufficient for traffic
  • Post-mortem documented

Data Inconsistency

Symptoms

  • Data doesn’t match between services
  • Users seeing outdated data
  • Cache showing different values than database

Investigation

Step 1: Identify inconsistency
  • Which data is inconsistent?
  • Between which systems? (DB, cache, search index)
  • When did it start?
Step 2: Check for failed writes
{service=~"repti-.*"}
| json
| method =~ "POST|PUT|PATCH|DELETE"
| status_code >= 500
Step 3: Check cache invalidation
{service=~"repti-.*"}
| json
| message =~ "(?i)cache invalidate|cache clear"

Common Causes & Solutions

Cause 1: Stale Cache

Solution:
# Clear specific cache key
await redis.delete(f"animal:{animal_id}")

# Or clear all cache
await redis.flushdb()

# Implement cache invalidation on update
async def update_animal(animal_id: str, data: dict):
    animal = await repo.update(animal_id, data)
    await redis.delete(f"animal:{animal_id}")  # Clear cache
    return animal

Cause 2: Transaction Rollback

Symptoms: Write appeared to succeed but data not in DB
Logs:
{service=~"repti-.*"}
| json
| message =~ "(?i)rollback|transaction failed"
Solution:
# Ensure proper transaction handling
async with session.begin():
    await session.execute(stmt)
    # Automatically commits on success
    # Automatically rolls back on exception

Cause 3: Event/Message Loss

Symptoms: Event-driven updates didn’t propagate
Solution:
  • Check message queue backlog (SQS, SNS); see the check below
  • Verify event consumers are running
  • Implement retry mechanism
  • Add dead letter queue
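For the first two points, a quick way to see whether messages are piling up or landing in the dead letter queue (queue URLs are placeholders):

# Backlog on the main queue
aws sqs get-queue-attributes \
  --queue-url <queue_url> \
  --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible

# Messages that ended up in the DLQ
aws sqs get-queue-attributes \
  --queue-url <dlq_url> \
  --attribute-names ApproximateNumberOfMessages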

Resolution Checklist

  • Identified inconsistent data
  • Root cause determined
  • Data manually reconciled if needed
  • Cache cleared/invalidated
  • Write operations succeed
  • Consistency checks pass
  • Monitoring added to detect future issues

Background Job Failures

Symptoms

  • Scheduled jobs not running
  • Async tasks failing
  • Workers crashing

Investigation

Step 1: Check job status
{service=~"repti-.*"}
| json
| message =~ "(?i)job|task|worker"
| level="ERROR"
Step 2: Find failed jobs
{service=~"repti-.*"}
| json
| message =~ "(?i)job failed|task failed"
| line_format "{{.timestamp}} - {{.details.job_name}}: {{.message}}"

Common Causes & Solutions

Cause 1: Worker Died

Check:
# Check worker processes
ps aux | grep celery
Solution:
# Restart workers
systemctl restart celery-worker

Cause 2: Job Timeout

Logs:
{service=~"repti-.*"}
| json
| message =~ "(?i)timeout|timed out"
Solution:
# Increase timeout
@app.task(time_limit=600)  # 10 minutes
def long_running_task():
    pass

Cause 3: Missing Dependencies

Solution:
# Check if resource exists before processing
async def process_animal(animal_id: str):
    animal = await repo.get(animal_id)
    if not animal:
        logger.error(f"Animal {animal_id} not found")
        return  # Don't retry
    # Process...
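For transient failures (unlike the missing-resource case above, which should not be retried), Celery can retry automatically. A sketch assuming Celery is the worker shown earlier; TransientError is a hypothetical exception type:

@app.task(
    bind=True,
    autoretry_for=(TransientError,),   # only retry errors worth retrying
    retry_backoff=True,                # exponential backoff between attempts
    retry_kwargs={"max_retries": 3},
)
def process_animal_task(self, animal_id: str):
    ...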

Resolution Checklist

  • Workers running
  • Jobs executing successfully
  • No timeout issues
  • Dependencies available
  • Error handling implemented
  • Retry logic working

Cache Issues

Symptoms

  • Slow performance after cache miss
  • Users seeing stale data
  • Cache hit rate low

Investigation

Step 1: Check cache metrics
# Cache hit rate
rate(cache_hits_total[5m]) / rate(cache_requests_total[5m])
Step 2: Check for cache errors
{service=~"repti-.*"}
| json
| message =~ "(?i)cache|redis"
| level="ERROR"

Common Causes & Solutions

Cause 1: Cache Eviction

Solution:
# Increase Redis memory
# In redis.conf
maxmemory 2gb
maxmemory-policy allkeys-lru
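To confirm evictions are actually happening before raising limits (assuming redis-cli access to the cache instance):

# Non-zero and growing evicted_keys means the cache is under memory pressure
redis-cli info stats | grep evicted_keys

# Current memory usage and configured limit/policy
redis-cli info memory | grep -E 'used_memory_human|maxmemory'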

Cause 2: Cache Stampede

Symptoms: Many requests for same expired key
Solution:
# Use a short-lived Redis lock to prevent a stampede
import asyncio

async def get_from_cache_or_db(key: str):
    value = await redis.get(key)
    if value:
        return value

    # Acquire lock
    lock_key = f"lock:{key}"
    if await redis.set(lock_key, "1", nx=True, ex=10):
        # We got the lock, fetch from DB
        value = await fetch_from_db(key)
        await redis.setex(key, 3600, value)
        await redis.delete(lock_key)
        return value
    else:
        # Someone else is fetching, wait and retry
        await asyncio.sleep(0.1)
        return await get_from_cache_or_db(key)

Resolution Checklist

  • Cache hit rate > 80%
  • No cache errors
  • TTL configured appropriately
  • Cache warming implemented
  • Stampede protection in place

Deployment Rollback

When to Rollback

  • High error rate after deployment
  • Critical functionality broken
  • Data corruption risk
  • Cannot quickly fix forward

Rollback Procedure

Step 1: Identify previous version
# List task definitions
aws ecs list-task-definitions \
  --family-prefix repti-core \
  --sort DESC \
  --max-items 5
Step 2: Rollback ECS service
# Update to previous task definition
aws ecs update-service \
  --cluster dev-reptidex-cluster \
  --service dev-reptidex-core \
  --task-definition repti-core:122  # Previous revision
Step 3: Wait for deployment
# Monitor deployment
aws ecs wait services-stable \
  --cluster dev-reptidex-cluster \
  --services dev-reptidex-core
Step 4: Verify health
# Check health endpoint
curl https://dev.reptidex.com/api/v1/health

# Check error rate
# (Use Grafana dashboard)

Database Rollback

If database migration was applied:
# Rollback migration
docker exec -it <container> alembic downgrade -1

# Verify schema
docker exec -it <container> alembic current

Post-Rollback

  • Error rate normal
  • All services healthy
  • Users can access features
  • Database consistent
  • Incident documented
  • Fix planned for next deployment

Getting Help

If you can’t resolve the issue:
  1. Check this playbook for similar scenarios
  2. Search Slack #incidents for recent similar issues
  3. Escalate to on-call engineer if critical
  4. Post in #engineering with:
    • Service affected
    • Symptoms
    • What you’ve tried
    • Relevant logs/dashboards

Next Steps