Error Investigation Workflow
This guide provides a systematic approach to investigating errors in ReptiDex using Loki logs and Grafana dashboards.
Quick Start
When you discover an error (via alert, dashboard, or user report), follow this workflow:
Step 1: Identify the Error
From Alert
If the error came from an alert:
- Note the alert name and severity
- Check alert labels for service, environment, error_type
- Note the timestamp when alert fired
From Dashboard
If you see an error spike in a dashboard:
- Note the service name
- Identify the time range when errors spiked
- Check the error type if available
- Look for correlated metric changes (CPU, memory, request rate)
From User Report
If a user reported an issue:
- Gather user ID or session ID
- Get approximate timestamp of the issue
- Identify which feature/endpoint was affected
- Note any error messages the user saw
Step 2: Gather Context
Query Initial Error Logs
Start with a broad query to find the error:
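A broad starting point might look like the sketch below. The service and environment label names follow the alert labels mentioned above, but the repti-core and production values are assumptions taken from this guide's examples; adjust them to your actual Loki stream labels.

```logql
# Broad sweep: every log line mentioning "error" for the affected service
{service="repti-core", environment="production"} |= "error"
```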
Narrow Down with Filters
Add more specific filters:
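A narrowed version, assuming JSON-structured logs with level, error_type, and endpoint fields (the field names are assumptions about ReptiDex's log schema), might be:

```logql
# Narrowed: parse the JSON payload and keep one error type on one endpoint
{service="repti-core", environment="production"}
  | json
  | level = "error"
  | error_type = "DatabaseConnectionError"
  | endpoint = "/api/v1/animals"
```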
Get Error Details
Examine a specific error to understand:
- Error type: What kind of error occurred?
- Error message: What does it say?
- Stack trace: Where in the code did it fail?
- Request context: What was the user trying to do?
Example Error Log
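An illustrative entry for the pool-exhaustion scenario used throughout this guide might look like the following; the field names and values are assumptions, not ReptiDex's actual log schema, and the request_id and user_id values are hypothetical.

```json
{
  "timestamp": "2025-10-13T14:23:11Z",
  "level": "error",
  "service": "repti-core",
  "error_type": "DatabaseConnectionError",
  "message": "Could not acquire connection: pool exhausted (10/10 active), 5 waiting, gave up after 3 retries",
  "endpoint": "/api/v1/animals",
  "request_id": "req-abc123",
  "user_id": "user-456"
}
```

The details worth pulling out of an entry like this: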
- Pool is exhausted (10/10 active connections)
- 5 connections waiting
- Failed after 3 retries
View Surrounding Logs
Use context view to see what happened before/after:
- Click on the error log in Grafana
- Select “Show context”
- Review logs 5 minutes before and after
Trace the Request
Follow the request across services using request_id (a query sketch follows the list below):
- Gateway received request
- Auth service validated token
- Core service processed request
- Database connection failed
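To pull that whole trace in one query, a sketch like the one below can work. It assumes all ReptiDex services share a repti- prefix and log request_id as a JSON field; the request_id value is hypothetical.

```logql
# Every log line for a single request, across all services
{service=~"repti-.*", environment="production"}
  | json
  | request_id = "req-abc123"
```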
Step 3: Find Root Cause
Common Error Patterns
Database Connection Exhaustion
Symptoms:
- Error type: DatabaseConnectionError
- Pool size at maximum
- Waiting connections queue growing
Common Causes:
- Connection leak (not closing connections)
- Slow queries holding connections
- Traffic spike overwhelming pool
- Database performance degradation
Investigation:
- Check slow query logs
- Review database metrics (CPU, connections, query time)
- Identify if a specific endpoint is causing the issue
- Check for recent code changes
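To see how widespread the pattern is, a metric-style LogQL sketch such as the following can chart occurrences per service (label names assumed as above):

```logql
# DatabaseConnectionError occurrences per service, in 5-minute windows
sum by (service) (
  count_over_time({environment="production"} |= "DatabaseConnectionError" [5m])
)
```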
API Timeout
Symptoms:
- Error type: TimeoutError, ReadTimeout
- Duration exceeds threshold
- May affect specific downstream services
Common Causes:
- Downstream service slow/unavailable
- Network issues
- Database query performance
- CPU/memory exhaustion
Investigation:
- Check if the downstream service is healthy
- Review performance metrics for the service
- Check for slow database queries
- Look at network metrics
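A sketch for isolating slow calls, assuming the logs expose a duration field in a Go-style format (e.g. 2.5s); the field name and threshold are assumptions to adapt to your schema:

```logql
# Timeout errors where the recorded call duration exceeded 5 seconds
{service="repti-core", environment="production"}
  |~ "TimeoutError|ReadTimeout"
  | json
  | duration > 5s
```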
Authentication Failure
Symptoms:
- Status code: 401, 403
- Message contains “authentication failed”, “invalid token”
- May be isolated to one user or widespread
Common Causes:
- Expired tokens
- Token service unavailable
- Invalid credentials
- Brute force attack (if widespread)
Investigation:
- Check if isolated to one user
- Verify token service health
- Check for suspicious IP patterns
- Review recent auth changes
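To tell an isolated problem from a widespread one, a per-user breakdown can help. The repti-auth service name and the user_id field are assumptions:

```logql
# Failed auth attempts per user over 15 minutes:
# one dominant user_id suggests an isolated issue, many users suggests a systemic one
sum by (user_id) (
  count_over_time(
    {service="repti-auth", environment="production"}
      |= "authentication failed"
      | json [15m]
  )
)
```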
Memory/Resource Issues
Symptoms:
- Error type: MemoryError, OutOfMemory
- Warnings about high memory usage before the error
- Service may restart after the error
Common Causes:
- Memory leak
- Large data processing
- Inefficient caching
- Too many concurrent requests
Investigation:
- Check service memory metrics in Prometheus
- Look at the memory trend over time
- Identify if a specific endpoint causes the issue
- Review recent code changes with large data structures
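The warnings that precede the crash are often more useful than the crash itself. A sketch, assuming warning-level JSON logs with a level field set to "warning":

```logql
# Warning-level lines mentioning memory in the window before the crash:
# a steady climb suggests a leak, a sudden jump suggests one oversized request
{service="repti-core", environment="production"}
  |= "memory"
  | json
  | level = "warning"
```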
Correlation Analysis
Check if the error correlates with other events:
Traffic Spike
Check whether errors coincide with a traffic spike (a comparison query is sketched after this list). If they do, common contributing factors are:
- Insufficient scaling
- Database connection pool too small
- Rate limiting not effective
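One way to check the correlation is to chart total log volume against error volume for the service in the same Grafana panel; if both climb together, the errors are likely load-driven. These are two separate queries, with label names assumed as above:

```logql
# Query A: total log volume as a rough proxy for request rate
sum(rate({service="repti-core", environment="production"}[5m]))

# Query B: error volume over the same window, plotted alongside Query A
sum(rate({service="repti-core", environment="production"} |= "error" [5m]))
```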
Deployment
Check if error started after a deployment:
- Note error start time
- Check deployment logs for that time
- Review recent code changes
Dependency Failure
Check if downstream service failed:
- Database
- Redis
- Auth service
- External API
Step 4: Assess Impact
User Impact
How many users were affected? A sketch query for estimating this is shown below.
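A rough estimate can come from counting distinct user_id values on the error; user_id as a JSON field is an assumption about the log schema.

```logql
# Approximate number of distinct users who hit the error in the last hour
count(
  sum by (user_id) (
    count_over_time(
      {service="repti-core", environment="production"}
        |= "DatabaseConnectionError"
        | json [1h]
    )
  )
)
```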
Business Impact
- Critical: Core user flows broken (auth, payment, data loss)
- High: Major features unavailable (create, update operations)
- Medium: Minor features degraded (slow performance, non-critical errors)
- Low: Background jobs, analytics, non-user-facing
Error Frequency
Is the error ongoing? Compare error counts across time windows (see the sketch after this list):
- Last 5 minutes vs last hour
- Current hour vs previous hour
- Today vs yesterday
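A quick way to compare windows is to run the same count over different ranges; a near-zero result in the short window means the issue has likely stopped. Labels are assumed as above, and the two queries are run separately:

```logql
# Query A: errors in the last 5 minutes
sum(count_over_time({service="repti-core", environment="production"} |= "error" [5m]))

# Query B: errors in the last hour, for comparison
sum(count_over_time({service="repti-core", environment="production"} |= "error" [1h]))
```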
Data Integrity
Were any writes attempted? Check for:
- Transaction failed after partial write
- Async job failed midway
- Cache/database out of sync
Step 5: Document Findings
Create Incident Report
Document in your incident management system:
Title: [SERVICE] Error Type - Brief Description
Example: [repti-core] DatabaseConnectionError - Pool Exhaustion on /api/v1/animals
Severity: Critical / High / Medium / Low
Timeline:
- First detected: 2025-10-13 14:20 UTC
- Last occurrence: 2025-10-13 14:45 UTC
- Duration: 25 minutes
Impact:
- Users affected: 127 unique users
- Requests failed: 1,543 requests
- Features affected: Animal creation endpoint
Links:
- Grafana logs: [Link to query]
- Error dashboard: [Link to dashboard]
- Prometheus metrics: [Link to metrics]
- Incident ticket: [Link to Jira/Linear]
Update Runbook
If this is a recurring issue, add to the debugging playbook:
- Error signature
- Common causes
- Investigation steps
- Resolution steps
- Prevention measures
Investigation Checklist
Use this checklist for every investigation:
- Identified error type and message
- Determined time range of issue
- Found affected service(s)
- Retrieved stack trace
- Examined surrounding logs (context)
- Traced request across services (request_id)
- Checked for correlation with deployments
- Checked for correlation with traffic spikes
- Reviewed dependent service health
- Checked database/cache metrics
- Identified affected users
- Identified affected endpoints
- Determined business impact
- Assessed data integrity risk
- Documented root cause
- Documented resolution
- Documented prevention measures
- Updated runbook if needed
- Created follow-up tasks
Tips for Faster Investigation
- Use saved queries: Start with saved queries from the Saved Queries doc
- Filter early: Add label filters before text search (see the sketch after this list)
- Limit time range: Start with a narrow range (5m), then expand
- Use dashboards: Jump to pre-built dashboards for common scenarios
  - Error tracking dashboard
  - Performance analysis dashboard
  - Service health dashboard
- Correlate with metrics: Check Prometheus for related metric changes
  - CPU/memory usage
  - Request rate
  - Error rate
  - Database connections
- Copy request_id early: First thing - grab the request_id to trace the full flow
- Check #incidents Slack: See if others are investigating the same issue
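As an illustration of the "filter early" tip, the two sketches below return similar lines, but the first lets Loki discard whole streams by label before any text matching, while the second forces a scan across effectively every stream (labels assumed as above):

```logql
# Narrow by labels first, then match text
{service="repti-core", environment="production"} |= "DatabaseConnectionError"

# Avoid: a regex match over all streams is far more expensive
{environment=~".+"} |~ "DatabaseConnectionError"
```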
Common Mistakes to Avoid
- Too broad query: Don’t query all services for all time
- Ignoring context: Always check logs before/after the error
- Assuming single cause: Complex issues often have multiple contributing factors
- Skipping impact assessment: Always quantify user/business impact
- Not documenting: Future you will thank present you for good docs
- Fixing symptoms: Find and fix root cause, not just symptoms
- Solo investigation: Involve team members for complex issues
Next Steps
- Debugging Playbook - Specific scenarios and solutions
- Saved LogQL Queries - Query templates
- Log Correlation - Correlating logs with metrics

