
Debugging Playbook

This playbook provides step-by-step solutions for common issues in ReptiDex. Each scenario includes symptoms, investigation steps, and resolution procedures.

Table of Contents

  1. Service Down / Not Responding
  2. Database Connection Issues
  3. Slow API Performance
  4. Authentication Failures
  5. Memory Leaks / OOM Errors
  6. High Error Rate
  7. Data Inconsistency
  8. Background Job Failures
  9. Cache Issues
  10. Deployment Rollback

Service Down / Not Responding

Symptoms

  • Health check endpoint returning 503 or timing out
  • All requests to service failing
  • Service not listed in ECS task list

Investigation

Step 1: Check if service is running
# Check ECS service status
aws ecs describe-services \
  --cluster dev-reptidex-cluster \
  --services dev-reptidex-core \
  --query 'services[0].{Status:status,DesiredCount:desiredCount,RunningCount:runningCount}'

# Check running tasks
aws ecs list-tasks \
  --cluster dev-reptidex-cluster \
  --service-name dev-reptidex-core \
  --desired-status RUNNING
Step 2: Check service logs
{service="repti-core"}
| json
| message =~ "(?i)starting|started|stopping|stopped|fatal|panic"
| line_format "{{.timestamp}} [{{.level}}] {{.message}}"
Step 3: Check for errors at startup
{service="repti-core"}
| json
| level =~ "ERROR|CRITICAL"
| line_format "{{.timestamp}} {{.error_type}}: {{.message}}\n{{.stack_trace}}"

Common Causes

Cause 1: Configuration Error

Symptoms: Service crashes immediately on startup
Solution:
# Check environment variables
aws ecs describe-task-definition \
  --task-definition repti-core \
  --query 'taskDefinition.containerDefinitions[0].environment'

# Fix configuration in .env or ECS task definition
# Then redeploy

Cause 2: Database Migration Failed

Symptoms: Service starts but crashes when accessing the database
Logs to check:
{service="repti-core"}
| json
| message =~ "(?i)migration|alembic|database"
| level="ERROR"
Solution:
# Run migrations manually
docker exec -it <container_id> alembic upgrade head

# Or rollback if migration is broken
docker exec -it <container_id> alembic downgrade -1

Cause 3: Port Already in Use

Symptoms: Service fails to start with “address already in use”
Logs to check:
{service="repti-core"}
| json
| message =~ "(?i)address already in use|port.*in use"
Solution:
# Find process using the port
lsof -i :8000

# Kill the process
kill -9 <PID>

# Restart service

Cause 4: Resource Limits

Symptoms: Service OOM killed or CPU throttled
Logs to check:
{service="repti-core"}
| json
| message =~ "(?i)oom|out of memory|killed"
Solution:
# Increase memory/CPU in the ECS task definition.
# register-task-definition needs the full definition, so export the
# current revision, bump "memory" (e.g. 512 -> 1024), strip the
# read-only fields, and re-register:
aws ecs describe-task-definition \
  --task-definition repti-core \
  --query 'taskDefinition' > taskdef.json
# edit taskdef.json, then:
aws ecs register-task-definition --cli-input-json file://taskdef.json

# Force new deployment
aws ecs update-service \
  --cluster dev-reptidex-cluster \
  --service dev-reptidex-core \
  --force-new-deployment

Resolution Checklist

  • Service is running in ECS
  • No errors in startup logs
  • Configuration is correct
  • Database migrations succeeded
  • No port conflicts
  • Sufficient CPU/memory allocated
  • Health check endpoint responding
  • Service registered with ALB (see the checks below)
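Quick commands for the last two items (the target group ARN is a placeholder):

# Health check endpoint
curl -i https://dev.reptidex.com/api/v1/health

# ALB target registration and health
aws elbv2 describe-target-health \
  --target-group-arn <target_group_arn> \
  --query 'TargetHealthDescriptions[].TargetHealth.State'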

Database Connection Issues

Symptoms

  • DatabaseConnectionError
  • OperationalError: connection pool exhausted
  • Requests timing out waiting for database
  • Slow query performance

Investigation

Step 1: Check connection pool status
{service=~"repti-.*"}
| json
| error_type =~ ".*Connection.*"
| details_pool_size != ""
| line_format "Pool: {{.details_pool_size}}, Active: {{.details_active_connections}}, Waiting: {{.details_waiting_connections}}"
Step 2: Check for slow queries
{service=~"repti-.*"}
| json
| message =~ "(?i)query|select|insert|update"
| duration_ms > 1000
| line_format "{{.duration_ms}}ms - {{.message}}"
Step 3: Check database metrics
Go to Prometheus and check the following (example queries follow the list):
  • db_connections_active (should be < pool_size)
  • db_connections_waiting (should be 0)
  • db_query_duration_seconds (p95 should be < 1s)
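Example queries for these checks (metric names as listed above; the service label and histogram buckets are assumptions about how they are exported):

# Waiting connections should stay at 0
max by (service) (db_connections_waiting)

# p95 query latency, assuming db_query_duration_seconds is a histogram
histogram_quantile(0.95,
  sum by (le) (rate(db_query_duration_seconds_bucket[5m]))
)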

Common Causes

Cause 1: Connection Pool Exhausted

Symptoms: All connections in use, requests waiting
Logs:
{service=~"repti-.*"}
| json
| error_type="DatabaseConnectionError"
| details_waiting_connections > 0
Solutions:
Option A: Increase pool size
# In app/core/database.py
engine = create_async_engine(
    settings.database_url,
    pool_size=20,  # Increase from 10
    max_overflow=10,  # Increase from 5
)
Option B: Fix connection leaks
# Ensure all connections are properly closed
# Use context managers
async with get_db() as session:
    # Your code here
    pass  # Connection automatically closed
Option C: Reduce connection hold time
  • Optimize slow queries
  • Move long-running operations out of the transaction (see the sketch below)
  • Use async operations
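A minimal sketch for moving slow work out of the transaction (Report and generate_report are hypothetical; the point is that the transaction stays open only for the write):

# Before: slow work runs inside the transaction
async with session.begin():
    report = await generate_report(animal_id)  # slow
    session.add(Report(animal_id=animal_id, data=report))

# After: do the slow work first, keep the transaction short
report = await generate_report(animal_id)
async with session.begin():
    session.add(Report(animal_id=animal_id, data=report))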

Cause 2: Database Performance Degradation

Symptoms: Queries that were fast are now slow
Investigation:
# Find slow queries
{service=~"repti-.*"}
| json
| message =~ "(?i)query"
| duration_ms > 1000
| line_format "{{.duration_ms}}ms - {{.message}}"
Solutions:
Check for missing indexes:
-- Find queries without indexes
SELECT * FROM pg_stat_statements
WHERE calls > 1000
  AND mean_exec_time > 1000
ORDER BY mean_exec_time DESC;
Add indexes:
# In alembic migration
def upgrade():
    op.create_index(
        'ix_animals_species_id',
        'animals',
        ['species_id']
    )

Cause 3: Database Server Issues

Symptoms: All services affected simultaneously
Check:
# describe-db-instances does not report CPU or memory usage;
# check instance status/class here, then use the CloudWatch metrics
# CPUUtilization and FreeableMemory (AWS/RDS namespace) for utilization
aws rds describe-db-instances \
  --db-instance-identifier reptidex-dev \
  --query 'DBInstances[0].{Status:DBInstanceStatus,Class:DBInstanceClass}'
Solutions:
  • Scale up RDS instance (example below)
  • Add read replicas
  • Optimize top queries
  • Enable query caching
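For the first option, a hedged example of changing the instance class in place (the target class is an assumption; expect a brief interruption while the change applies):

aws rds modify-db-instance \
  --db-instance-identifier reptidex-dev \
  --db-instance-class db.t3.large \
  --apply-immediately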

Resolution Checklist

  • Connection pool not exhausted
  • No connection leaks detected
  • Query performance optimized
  • Indexes added for slow queries
  • Database CPU/memory healthy
  • Connection timeout configured appropriately
  • Error rate returned to normal

Slow API Performance

Symptoms

  • API requests taking >2 seconds
  • User complaints about slow page loads
  • Timeouts on frontend

Investigation

Step 1: Identify slow endpoints
{service=~"repti-.*"}
| json
| duration_ms > 2000
| line_format "{{.duration_ms}}ms - {{.method}} {{.endpoint}}"
Step 2: Find slowest operations
topk(10,
  avg by (endpoint) (
    avg_over_time(
      {service=~"repti-.*"}
      | json
      | unwrap duration_ms [1h]
    )
  )
)
Step 3: Check database query time
{service=~"repti-.*"}
| json
| message =~ "(?i)query|database"
| duration_ms > 500

Common Causes

Cause 1: N+1 Query Problem

Symptoms: Many database queries for a single API request
Logs:
{service="repti-core"}
| json
| request_id="<REQUEST_ID>"
| message =~ "(?i)query|select"
| line_format "{{.timestamp}} - {{.message}}"
Solution:
# Before (N+1 problem)
animals = session.query(Animal).all()
for animal in animals:
    print(animal.species.name)  # Separate query for each!

# After (eager loading)
from sqlalchemy.orm import joinedload

animals = session.query(Animal)\
    .options(joinedload(Animal.species))\
    .all()
for animal in animals:
    print(animal.species.name)  # No extra queries

Cause 2: Missing Cache

Symptoms: Repeated expensive calculations
Solution:
from functools import lru_cache

@lru_cache(maxsize=100)
def calculate_lineage(animal_id: str) -> dict:
    # Expensive operation cached
    pass

# Or use a Redis cache (assumes an async calculate_lineage and a redis client)
import json

async def get_animal_lineage(animal_id: str):
    cache_key = f"lineage:{animal_id}"
    cached = await redis.get(cache_key)
    if cached:
        return json.loads(cached)

    # Calculate and cache for an hour
    result = await calculate_lineage(animal_id)
    await redis.setex(cache_key, 3600, json.dumps(result))
    return result

Cause 3: External API Timeout

Symptoms: Slow requests to external services
Logs:
{service=~"repti-.*"}
| json
| message =~ "(?i)external|api call|http request"
| duration_ms > 1000
Solution:
# Add timeout to external calls
import httpx

async with httpx.AsyncClient(timeout=5.0) as client:
    response = await client.get("https://external-api.com/data")

# Use circuit breaker
from circuitbreaker import circuit

@circuit(failure_threshold=5, recovery_timeout=60)
async def call_external_api():
    # Will stop calling if too many failures
    pass

Cause 4: Large Response Payload

Symptoms: Slow serialization, large network transfer
Solution:
# Add pagination
@router.get("/animals")
async def list_animals(
    skip: int = 0,
    limit: int = 20  # Limit results
):
    animals = await repo.get_multi(skip=skip, limit=limit)
    return animals

# Use field selection (fetch the animals as above, then trim the payload)
@router.get("/animals")
async def list_animals(fields: str | None = None):
    animals = await repo.get_multi()
    if fields:
        # Return only the requested fields
        selected = fields.split(",")
        return [
            {k: v for k, v in animal.dict().items() if k in selected}
            for animal in animals
        ]
    return animals

Resolution Checklist

  • Identified slow endpoint(s)
  • Optimized database queries
  • Added appropriate indexes
  • Implemented caching where needed
  • Fixed N+1 query problems
  • Added pagination for large result sets
  • Set timeouts on external calls
  • Response time < 500ms for p95

Authentication Failures

Symptoms

  • Users unable to log in
  • 401 Unauthorized errors
  • “Invalid token” errors

Investigation

Step 1: Check failure rate
rate(
  {service="repti-auth"}
  | json
  | message =~ "(?i)authentication failed" [5m]
)
Step 2: Identify affected users
{service="repti-auth"}
| json
| message =~ "(?i)authentication failed"
| line_format "{{.timestamp}} - User: {{.user_id}} - Reason: {{.details.reason}}"
Step 3: Check for suspicious patterns
# Check for brute force
sum by (source_ip) (
  count_over_time(
    {service="repti-auth"}
    | json
    | message =~ "(?i)authentication failed" [5m]
  )
) > 10

Common Causes

Cause 1: Token Expired

Symptoms: Users logged in previously, now getting 401
Solution:
# Check token expiry settings
# In app/core/security.py
ACCESS_TOKEN_EXPIRE_MINUTES = 60  # 1 hour

# Or implement refresh tokens
from datetime import datetime, timedelta

def create_refresh_token(user_id: str):
    expires = datetime.utcnow() + timedelta(days=30)
    return create_token({"sub": user_id}, expires)

Cause 2: Auth Service Down

Symptoms: All auth requests failing
Check:
{service="repti-auth"}
| json
| level =~ "ERROR|CRITICAL"
Solution:
  • Check auth service health
  • Restart auth service if needed
  • Verify database connectivity

Cause 3: Invalid Credentials

Symptoms: User entering wrong password
Logs:
{service="repti-auth"}
| json
| message =~ "(?i)invalid credentials|wrong password"
| user_id != ""
Solution:
  • User needs to reset password
  • Implement account lockout after N failed attempts (sketch below)
  • Add CAPTCHA after 3 failed attempts
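A minimal lockout sketch for the second item, assuming an async Redis client is available (key name and thresholds are illustrative):

LOCKOUT_THRESHOLD = 5
LOCKOUT_SECONDS = 900  # 15 minutes

async def register_failed_login(user_id: str) -> bool:
    # Returns True if the account is now locked
    key = f"failed_logins:{user_id}"
    failures = await redis.incr(key)
    if failures == 1:
        await redis.expire(key, LOCKOUT_SECONDS)  # start the lockout window
    return failures >= LOCKOUT_THRESHOLD

async def is_locked_out(user_id: str) -> bool:
    failures = await redis.get(f"failed_logins:{user_id}")
    return failures is not None and int(failures) >= LOCKOUT_THRESHOLD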

Cause 4: Token Signature Mismatch

Symptoms: Valid tokens being rejected
Logs:
{service="repti-auth"}
| json
| message =~ "(?i)signature|invalid token"
Solution:
# Verify JWT secret key is consistent
# Check SECRET_KEY environment variable
# Ensure it hasn't changed after tokens were issued

# If secret changed, invalidate all tokens:
# 1. Update secret
# 2. Force users to re-authenticate
# 3. Clear Redis token cache
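To confirm whether the configured secret still verifies a rejected token, a small local check (assuming python-jose and HS256; the SECRET_KEY import location is an assumption):

from jose import JWTError, jwt

from app.core.security import SECRET_KEY  # assumed location of the secret

def check_token(token: str) -> None:
    try:
        payload = jwt.decode(token, SECRET_KEY, algorithms=["HS256"])
        print("Signature OK, subject:", payload.get("sub"))
    except JWTError as exc:
        # Signature or expiry problem: secret mismatch or stale token
        print("Verification failed:", exc)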

Resolution Checklist

  • Auth service is healthy
  • Token expiry is appropriate
  • JWT secret key is correct
  • No brute force attacks detected
  • Users can successfully authenticate
  • Token refresh working (if implemented)

Memory Leaks / OOM Errors

Symptoms

  • Service killed with OOM error
  • Memory usage steadily increasing
  • Slow performance over time
  • Service crashes after running for hours/days

Investigation

Step 1: Check memory warnings
{service="repti-core"}
| json
| message =~ "(?i)memory|oom|out of memory"
| line_format "{{.timestamp}} [{{.level}}] {{.message}}"
Step 2: Check service restarts
{service="repti-core"}
| json
| message =~ "(?i)starting|started"
| line_format "{{.timestamp}} - {{.message}}"
Step 3: Monitor memory trend
Check Prometheus:
# Memory usage over time
container_memory_usage_bytes{service="repti-core"}

# Memory usage percentage
(container_memory_usage_bytes / container_memory_limit_bytes) * 100

Common Causes

Cause 1: Unclosed Database Connections

Symptoms: Memory increases with request count
Solution:
# Always use context managers
from sqlalchemy import select

async def get_animals():
    async with get_db() as session:
        result = await session.execute(select(Animal))
        return result.scalars().all()
    # Connection automatically closed when the block exits

# Or use dependency injection
@router.get("/animals")
async def list_animals(db: AsyncSession = Depends(get_db)):
    result = await db.execute(select(Animal))
    return result.scalars().all()
    # FastAPI handles cleanup via the dependency

Cause 2: Large Object Caching

Symptoms: Memory grows with cache size
Solution:
# Use LRU cache with size limit
from functools import lru_cache

@lru_cache(maxsize=100)  # Limit cache size
def expensive_calculation(param):
    pass

# Or use Redis instead of in-memory cache
# Move large objects to external cache

Cause 3: Memory Leak in Library

Symptoms: Memory increases even without load
Investigation:
# Profile memory usage
pip install memory-profiler

# Add to code
from memory_profiler import profile

@profile
def my_function():
    pass

# Run and check output
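If installing a profiler isn't convenient, the standard-library tracemalloc module gives a similar top-allocations view (minimal sketch):

import tracemalloc

tracemalloc.start()

# ... exercise the suspected code path ...

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)  # file:line with allocated size and count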
Solution:
  • Update library to latest version
  • Find alternative library
  • Report bug to library maintainers

Cause 4: Large Response Buffering

Symptoms: Memory spike when serving large files/responses
Solution:
# Use streaming responses
from fastapi.responses import StreamingResponse

@router.get("/large-file")
async def download_file():
    async def iterfile():
        with open("large_file.csv", "rb") as f:
            while chunk := f.read(8192):
                yield chunk

    return StreamingResponse(
        iterfile(),
        media_type="text/csv"
    )

Resolution Checklist

  • Identified memory leak source
  • Fixed connection leaks
  • Implemented proper cleanup
  • Added cache size limits
  • Memory usage stable over time
  • No OOM errors in last 24h
  • Service uptime > 7 days without restart

High Error Rate

Symptoms

  • Error rate suddenly increased
  • Multiple different error types
  • May affect multiple services

Investigation

Step 1: Identify when it started
rate(
  {service=~"repti-.*"}
  | json
  | level="ERROR" [5m]
)
Step 2: Find most common errors
topk(10,
  sum by (error_type) (
    count_over_time(
      {service=~"repti-.*"}
      | json
      | level="ERROR" [1h]
    )
  )
)
Step 3: Check for recent changes
{service=~"repti-.*"}
| json
| message =~ "(?i)deployment|version|started"
| line_format "{{.timestamp}} [{{.service}}] {{.version}}"

Common Causes

Cause 1: Bad Deployment

Symptoms: Errors started right after deployment
Solution:
# Rollback to previous version
aws ecs update-service \
  --cluster dev-reptidex-cluster \
  --service dev-reptidex-core \
  --task-definition repti-core:123  # Previous revision

# Or use blue/green deployment
# Switch traffic back to old version

Cause 2: Dependency Failure

Symptoms: Multiple services affected
Investigation:
# Check all services at same time
{service=~"repti-.*"}
| json
| level="ERROR"
Solution:
  • Identify failing dependency (database, Redis, external API)
  • Fix or restart dependency
  • Implement circuit breaker for external dependencies

Cause 3: Traffic Spike

Symptoms: Errors correlate with high traffic
Check:
# Request rate
rate(http_requests_total[5m])
Solution:
  • Scale up service instances
  • Implement rate limiting (sketch below)
  • Add caching layer
  • Enable auto-scaling
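A hedged sketch of the rate-limiting option using the slowapi package (an assumption; not necessarily part of the ReptiDex stack):

from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

app = FastAPI()
limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.get("/api/v1/animals")
@limiter.limit("100/minute")  # per-client limit; tune to real traffic
async def list_animals(request: Request):
    ...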

Cause 4: Database Migration Issue

Symptoms: Errors related to schema/columns
Logs:
{service=~"repti-.*"}
| json
| error_type =~ ".*Column.*|.*Table.*|.*Schema.*"
Solution:
# Rollback migration
docker exec -it <container> alembic downgrade -1

# Or apply the pending migration that adds the missing column
docker exec -it <container> alembic upgrade head

Resolution Checklist

  • Identified root cause
  • Error rate decreased
  • No ongoing issues
  • Deployment rolled back if needed
  • Dependencies healthy
  • Capacity sufficient for traffic
  • Post-mortem documented

Data Inconsistency

Symptoms

  • Data doesn’t match between services
  • Users seeing outdated data
  • Cache showing different values than database

Investigation

Step 1: Identify inconsistency
  • Which data is inconsistent?
  • Between which systems? (DB, cache, search index)
  • When did it start?
Step 2: Check for failed writes
{service=~"repti-.*"}
| json
| method =~ "POST|PUT|PATCH|DELETE"
| status_code >= 500
Step 3: Check cache invalidation
{service=~"repti-.*"}
| json
| message =~ "(?i)cache invalidate|cache clear"

Common Causes & Solutions

Cause 1: Stale Cache

Solution:
# Clear specific cache key
await redis.delete(f"animal:{animal_id}")

# Or clear all cache
await redis.flushdb()

# Implement cache invalidation on update
async def update_animal(animal_id: str, data: dict):
    animal = await repo.update(animal_id, data)
    await redis.delete(f"animal:{animal_id}")  # Clear cache
    return animal

Cause 2: Transaction Rollback

Symptoms: Write appeared to succeed but data not in DB
Logs:
{service=~"repti-.*"}
| json
| message =~ "(?i)rollback|transaction failed"
Solution:
# Ensure proper transaction handling
async with session.begin():
    await session.execute(stmt)
    # Automatically commits on success
    # Automatically rolls back on exception

Cause 3: Event/Message Loss

Symptoms: Event-driven updates didn’t propagate
Solution:
  • Check message queue backlog (SQS, SNS); see the check below
  • Verify event consumers are running
  • Implement retry mechanism
  • Add dead letter queue
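For the first two points, a quick way to see whether messages are piling up or landing in the dead letter queue (queue URLs are placeholders):

# Backlog on the main queue
aws sqs get-queue-attributes \
  --queue-url <queue_url> \
  --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible

# Messages that ended up in the DLQ
aws sqs get-queue-attributes \
  --queue-url <dlq_url> \
  --attribute-names ApproximateNumberOfMessages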

Resolution Checklist

  • Identified inconsistent data
  • Root cause determined
  • Data manually reconciled if needed
  • Cache cleared/invalidated
  • Write operations succeed
  • Consistency checks pass
  • Monitoring added to detect future issues

Background Job Failures

Symptoms

  • Scheduled jobs not running
  • Async tasks failing
  • Workers crashing

Investigation

Step 1: Check job status
{service=~"repti-.*"}
| json
| message =~ "(?i)job|task|worker"
| level="ERROR"
Step 2: Find failed jobs
{service=~"repti-.*"}
| json
| message =~ "(?i)job failed|task failed"
| line_format "{{.timestamp}} - {{.details.job_name}}: {{.message}}"

Common Causes & Solutions

Cause 1: Worker Died

Check:
# Check worker processes
ps aux | grep celery
Solution:
# Restart workers
systemctl restart celery-worker

Cause 2: Job Timeout

Logs:
{service=~"repti-.*"}
| json
| message =~ "(?i)timeout|timed out"
Solution:
# Increase timeout
@app.task(time_limit=600)  # 10 minutes
def long_running_task():
    pass

Cause 3: Missing Dependencies

Solution:
# Check if resource exists before processing
async def process_animal(animal_id: str):
    animal = await repo.get(animal_id)
    if not animal:
        logger.error(f"Animal {animal_id} not found")
        return  # Don't retry
    # Process...
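For transient failures (unlike the missing-resource case above, which should not be retried), Celery can retry automatically. A sketch assuming Celery is the worker shown earlier; TransientError is a hypothetical exception type:

@app.task(
    bind=True,
    autoretry_for=(TransientError,),   # only retry errors worth retrying
    retry_backoff=True,                # exponential backoff between attempts
    retry_kwargs={"max_retries": 3},
)
def process_animal_task(self, animal_id: str):
    ...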

Resolution Checklist

  • Workers running
  • Jobs executing successfully
  • No timeout issues
  • Dependencies available
  • Error handling implemented
  • Retry logic working

Cache Issues

Symptoms

  • Slow performance after cache miss
  • Users seeing stale data
  • Cache hit rate low

Investigation

Step 1: Check cache metrics
# Cache hit rate
rate(cache_hits_total[5m]) / rate(cache_requests_total[5m])
Step 2: Check for cache errors
{service=~"repti-.*"}
| json
| message =~ "(?i)cache|redis"
| level="ERROR"

Common Causes & Solutions

Cause 1: Cache Eviction

Solution:
# Increase Redis memory
# In redis.conf
maxmemory 2gb
maxmemory-policy allkeys-lru
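To confirm evictions are actually happening before raising limits (assuming redis-cli access to the cache instance):

# Non-zero and growing evicted_keys means the cache is under memory pressure
redis-cli info stats | grep evicted_keys

# Current memory usage and configured limit/policy
redis-cli info memory | grep -E 'used_memory_human|maxmemory'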

Cause 2: Cache Stampede

Symptoms: Many requests for same expired key
Solution:
# Use a short-lived Redis lock to prevent a stampede
import asyncio

async def get_from_cache_or_db(key: str):
    value = await redis.get(key)
    if value:
        return value

    # Acquire lock
    lock_key = f"lock:{key}"
    if await redis.set(lock_key, "1", nx=True, ex=10):
        # We got the lock, fetch from DB
        value = await fetch_from_db(key)
        await redis.setex(key, 3600, value)
        await redis.delete(lock_key)
        return value
    else:
        # Someone else is fetching, wait and retry
        await asyncio.sleep(0.1)
        return await get_from_cache_or_db(key)

Resolution Checklist

  • Cache hit rate > 80%
  • No cache errors
  • TTL configured appropriately
  • Cache warming implemented
  • Stampede protection in place

Deployment Rollback

When to Rollback

  • High error rate after deployment
  • Critical functionality broken
  • Data corruption risk
  • Cannot quickly fix forward

Rollback Procedure

Step 1: Identify previous version
# List task definitions
aws ecs list-task-definitions \
  --family-prefix repti-core \
  --sort DESC \
  --max-items 5
Step 2: Rollback ECS service
# Update to previous task definition
aws ecs update-service \
  --cluster dev-reptidex-cluster \
  --service dev-reptidex-core \
  --task-definition repti-core:122  # Previous revision
Step 3: Wait for deployment
# Monitor deployment
aws ecs wait services-stable \
  --cluster dev-reptidex-cluster \
  --services dev-reptidex-core
Step 4: Verify health
# Check health endpoint
curl https://dev.reptidex.com/api/v1/health

# Check error rate
# (Use Grafana dashboard)

Database Rollback

If database migration was applied:
# Rollback migration
docker exec -it <container> alembic downgrade -1

# Verify schema
docker exec -it <container> alembic current

Post-Rollback

  • Error rate normal
  • All services healthy
  • Users can access features
  • Database consistent
  • Incident documented
  • Fix planned for next deployment

Getting Help

If you can’t resolve the issue:
  1. Check this playbook for similar scenarios
  2. Search Slack #incidents for recent similar issues
  3. Escalate to on-call engineer if critical
  4. Post in #engineering with:
    • Service affected
    • Symptoms
    • What you’ve tried
    • Relevant logs/dashboards

Next Steps