
Error Investigation Workflow

This guide provides a systematic approach to investigating errors in ReptiDex using Loki logs and Grafana dashboards.

Quick Start

When you discover an error (via alert, dashboard, or user report), follow this workflow:
  1. Identify the error
  2. Gather context
  3. Find root cause
  4. Assess impact
  5. Document findings

Step 1: Identify the Error

From Alert

If the error came from an alert:
  1. Note the alert name and severity
  2. Check alert labels for service, environment, error_type
  3. Note the timestamp when alert fired
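With those labels in hand, you can usually drop them straight into a first query. A minimal sketch, assuming the alert's service and error_type labels match the log fields (names may differ in your setup):
# Start from the alert's labels; set the time range to a few minutes before the alert fired
{service="repti-core"}
| json
| level="ERROR"
| error_type="DatabaseConnectionError"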

From Dashboard

If you see an error spike in a dashboard:
  1. Note the service name
  2. Identify the time range when errors spiked
  3. Check the error type if available
  4. Look for correlated metric changes (CPU, memory, request rate)
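A quick breakdown by error type over the spike window can confirm what the dashboard panel is showing. A sketch (adjust the service and window to match the spike):
# Error counts by type during the spike window
sum by (error_type) (
  count_over_time(
    {service="repti-core"}
    | json
    | level="ERROR" [15m]
  )
)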

From User Report

If a user reported an issue:
  1. Gather user ID or session ID
  2. Get approximate timestamp of the issue
  3. Identify which feature/endpoint was affected
  4. Note any error messages the user saw
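With a user ID in hand, a broad query scoped to that user around the reported time is a reasonable first pass. A sketch, assuming requests are logged with a user_id field and the reported ID is usr_abc123:
# Warnings and errors for the reporting user; set the time range around the report
{service=~"repti-.*"}
| json
| user_id="usr_abc123"
| level=~"WARN|ERROR"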

Step 2: Gather Context

Query Initial Error Logs

Start with a broad query to find the error:
# By time range and service
{service="repti-core"}
| json
| level="ERROR"

Narrow Down with Filters

Add more specific filters:
{service="repti-core"}
| json
| level="ERROR"
| endpoint="/api/v1/animals"
| user_id="usr_abc123"

Get Error Details

Examine a specific error to understand:
  • Error type: What kind of error occurred?
  • Error message: What does it say?
  • Stack trace: Where in the code did it fail?
  • Request context: What was the user trying to do?

Example Error Log

{
  "timestamp": "2025-10-13T14:23:45.123Z",
  "level": "ERROR",
  "service": "repti-core",
  "request_id": "req_abc123",
  "user_id": "usr_xyz789",
  "endpoint": "/api/v1/animals",
  "method": "POST",
  "status_code": 500,
  "duration_ms": 1234,
  "error_type": "DatabaseConnectionError",
  "message": "Failed to acquire database connection from pool",
  "stack_trace": "Traceback (most recent call last):\n  File ...",
  "details": {
    "pool_size": 10,
    "active_connections": 10,
    "waiting_connections": 5,
    "retry_count": 3
  }
}
Key observations:
  • Pool is exhausted (10/10 active connections)
  • 5 connections waiting
  • Failed after 3 retries
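Because | json flattens nested objects (details.retry_count becomes the label details_retry_count), the structured fields can be filtered on directly. A sketch for finding other errors that exhausted the pool:
# Connection errors that ran out of retries
{service="repti-core"}
| json
| error_type="DatabaseConnectionError"
| details_retry_count >= 3
| line_format "{{.timestamp}} {{.endpoint}} active={{.details_active_connections}}/{{.details_pool_size}}"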

View Surrounding Logs

Use context view to see what happened before/after:
  1. Click on the error log in Grafana
  2. Select “Show context”
  3. Review logs 5 minutes before and after
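If the context view is unavailable, an equivalent approach is to rerun the search without the level filter over a narrow window around the error:
# All log levels for the service; set the time range to ~5 minutes around the error
{service="repti-core"}
| json
| line_format "{{.timestamp}} [{{.level}}] {{.message}}"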

Trace the Request

Follow the request across services using request_id:
{service=~"repti-.*"}
| json
| request_id="req_abc123"
| line_format "{{.timestamp}} [{{.service}}] {{.level}} - {{.message}}"
This shows the full request flow:
  1. Gateway received request
  2. Auth service validated token
  3. Core service processed request
  4. Database connection failed

Step 3: Find Root Cause

Common Error Patterns

Database Connection Exhaustion

Symptoms:
  • Error type: DatabaseConnectionError
  • Pool size at maximum
  • Waiting connections queue growing
Investigation:
# Check if other services are also affected
{service=~"repti-.*"}
| json
| error_type =~ ".*Connection.*"
| line_format "{{.timestamp}} [{{.service}}] {{.message}}"
Root causes:
  • Connection leak (not closing connections)
  • Slow queries holding connections
  • Traffic spike overwhelming pool
  • Database performance degradation
Next steps:
  1. Check slow query logs
  2. Review database metrics (CPU, connections, query time)
  3. Identify whether a specific endpoint is causing the issue
  4. Check for recent code changes
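For step 1, if your services log per-request durations (duration_ms in the example log above), a rough check for slow requests that may be holding connections could look like this sketch:
# Requests slower than 1s that may be holding connections
{service="repti-core"}
| json
| duration_ms > 1000
| line_format "{{.duration_ms}}ms {{.method}} {{.endpoint}}"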

API Timeout

Symptoms:
  • Error type: TimeoutError, ReadTimeout
  • Duration exceeds threshold
  • May affect specific downstream services
Investigation:
# Find all timeout errors
{service=~"repti-.*"}
| json
| error_type =~ ".*Timeout.*"
| duration_ms > 5000
| line_format "{{.duration_ms}}ms - {{.service}} → {{.endpoint}}"
Root causes:
  • Downstream service slow/unavailable
  • Network issues
  • Database query performance
  • CPU/memory exhaustion
Next steps:
  1. Check if downstream service is healthy
  2. Review performance metrics for the service
  3. Check for slow database queries
  4. Look at network metrics
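A breakdown of timeouts by service and endpoint helps show whether a single downstream dependency is responsible. A sketch, assuming all services follow the repti-* naming:
# Timeout counts per service and endpoint
sum by (service, endpoint) (
  count_over_time(
    {service=~"repti-.*"}
    | json
    | error_type =~ ".*Timeout.*" [15m]
  )
)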

Authentication Failure

Symptoms:
  • Status code: 401, 403
  • Message contains “authentication failed”, “invalid token”
  • May be isolated to one user or widespread
Investigation:
# Count failures by user
sum by (user_id) (
  count_over_time(
    {service="repti-auth"}
    | json
    | message =~ "(?i)authentication failed" [1h]
  )
)
Root causes:
  • Expired tokens
  • Token service unavailable
  • Invalid credentials
  • Brute force attack (if widespread)
Next steps:
  1. Check if isolated to one user
  2. Verify token service health
  3. Check for suspicious IP patterns
  4. Review recent auth changes
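For step 3, grouping failures by client IP can surface brute-force patterns. A sketch, assuming the auth service logs the caller's address (client_ip here is a hypothetical field name):
# Authentication failures per client IP over the last hour
sum by (client_ip) (
  count_over_time(
    {service="repti-auth"}
    | json
    | message =~ "(?i)authentication failed" [1h]
  )
)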

Memory/Resource Issues

Symptoms:
  • Error type: MemoryError, OutOfMemory
  • Warnings about high memory usage before error
  • Service may restart after error
Investigation:
# Find memory warnings leading up to error
{service="repti-core"}
| json
| message =~ "(?i)memory|oom"
| level =~ "WARN|ERROR"
| line_format "{{.timestamp}} [{{.level}}] {{.message}}"
Root causes:
  • Memory leak
  • Large data processing
  • Inefficient caching
  • Too many concurrent requests
Next steps:
  1. Check service memory metrics in Prometheus
  2. Look for memory trend over time
  3. Identify whether a specific endpoint causes the issue
  4. Review recent code changes with large data structures
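Graphing the memory warnings themselves over time (alongside the Prometheus memory metrics from step 1) shows whether pressure built gradually or spiked suddenly. A sketch:
# Trend of memory-related warnings/errors; graph this query
sum(
  count_over_time(
    {service="repti-core"}
    | json
    | message =~ "(?i)memory|oom"
    | level =~ "WARN|ERROR" [10m]
  )
)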

Correlation Analysis

Check if error correlates with other events:

Traffic Spike

# Compare error rate with request rate
rate({service="repti-core"} [5m])
If errors spike with traffic, possible causes:
  • Insufficient scaling
  • Database connection pool too small
  • Rate limiting not effective
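To put errors and traffic on one graph, an error-ratio query is useful (a sketch):
# Fraction of log lines that are errors (compare its shape against the raw request rate above)
sum(rate({service="repti-core"} | json | level="ERROR" [5m]))
  /
sum(rate({service="repti-core"} [5m]))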

Deployment

Check if error started after a deployment:
  1. Note error start time
  2. Check deployment logs for that time
  3. Review recent code changes
{service="repti-core"}
| json
| message =~ "(?i)deployment|starting|started"
| line_format "{{.timestamp}} - {{.message}}"

Dependency Failure

Check if downstream service failed:
# Check all services for errors at same time
{service=~"repti-.*"}
| json
| level="ERROR"
If multiple services have errors, check common dependency:
  • Database
  • Redis
  • Auth service
  • External API
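A per-service error count makes it easy to see whether failures cluster around a shared dependency (a sketch):
# Error counts per service over the incident window
sum by (service) (
  count_over_time(
    {service=~"repti-.*"}
    | json
    | level="ERROR" [15m]
  )
)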

Step 4: Assess Impact

User Impact

How many users affected?
count(
  sum by (user_id) (
    count_over_time(
      {service="repti-core"}
      | json
      | level="ERROR"
      | error_type="DatabaseConnectionError" [1h]
    )
  )
)
Which features affected?
sum by (endpoint) (
  count_over_time(
    {service="repti-core"}
    | json
    | level="ERROR" [1h]
  )
)

Business Impact

  • Critical: Core user flows broken (auth, payment, data loss)
  • High: Major features unavailable (create, update operations)
  • Medium: Minor features degraded (slow performance, non-critical errors)
  • Low: Background jobs, analytics, non-user-facing

Error Frequency

Is it ongoing?
# Error rate in 5-minute windows (set the time range to the last hour in the UI)
rate(
  {service="repti-core"}
  | json
  | level="ERROR" [5m]
)
Is it increasing? Compare error rate:
  • Last 5 minutes vs last hour
  • Current hour vs previous hour
  • Today vs yesterday
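Recent Loki versions support an offset modifier in metric queries, which lets you compare the current rate against an earlier window directly. A sketch comparing now with one hour ago:
# Current error rate relative to the same window one hour earlier (>1 means it is increasing)
sum(rate({service="repti-core"} | json | level="ERROR" [5m]))
  /
sum(rate({service="repti-core"} | json | level="ERROR" [5m] offset 1h))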

Data Integrity

Were any writes attempted?
{service="repti-core"}
| json
| method =~ "POST|PUT|PATCH|DELETE"
| status_code >= 500
| line_format "{{.method}} {{.endpoint}} - User: {{.user_id}}"
Potential data loss scenarios:
  • Transaction failed after partial write
  • Async job failed midway
  • Cache/database out of sync

Step 5: Document Findings

Create Incident Report

Document in your incident management system:

Title: [SERVICE] Error Type - Brief Description
Example: [repti-core] DatabaseConnectionError - Pool Exhaustion on /api/v1/animals

Severity: Critical / High / Medium / Low

Timeline:
  • First detected: 2025-10-13 14:20 UTC
  • Last occurrence: 2025-10-13 14:45 UTC
  • Duration: 25 minutes
Impact:
  • Users affected: 127 unique users
  • Requests failed: 1,543 requests
  • Features affected: Animal creation endpoint
Root Cause:
Database connection pool exhausted due to slow queries
on animal_lineage table. Query performance degraded after
adding 100k+ new records without proper indexing.
Resolution:
1. Added index on animal_lineage.parent_id (14:35 UTC)
2. Increased connection pool size from 10 to 20 (14:40 UTC)
3. Error rate returned to normal (14:45 UTC)
Prevention:
- Add database query performance tests to CI/CD
- Monitor query execution time in staging
- Add alerts for slow queries (>1s)
- Implement connection pool monitoring
Related Links:
  • Grafana logs: [Link to query]
  • Error dashboard: [Link to dashboard]
  • Prometheus metrics: [Link to metrics]
  • Incident ticket: [Link to Jira/Linear]

Update Runbook

If this is a recurring issue, add to the debugging playbook:
  • Error signature
  • Common causes
  • Investigation steps
  • Resolution steps
  • Prevention measures

Investigation Checklist

Use this checklist for every investigation:
  • Identified error type and message
  • Determined time range of issue
  • Found affected service(s)
  • Retrieved stack trace
  • Examined surrounding logs (context)
  • Traced request across services (request_id)
  • Checked for correlation with deployments
  • Checked for correlation with traffic spikes
  • Reviewed dependent service health
  • Checked database/cache metrics
  • Identified affected users
  • Identified affected endpoints
  • Determined business impact
  • Assessed data integrity risk
  • Documented root cause
  • Documented resolution
  • Documented prevention measures
  • Updated runbook if needed
  • Created follow-up tasks

Tips for Faster Investigation

  1. Use saved queries: Start with saved queries from the Saved Queries doc
  2. Filter early: Add label filters before text search
    {service="repti-core"} | json | level="ERROR"
    # Not: {service="repti-core"} |= "error"
    
  3. Limit time range: Start with narrow range (5m) then expand
    {service="repti-core"} | json | level="ERROR"
    # Set time range to last 5 minutes in UI
    
  4. Use dashboards: Jump to pre-built dashboards for common scenarios
    • Error tracking dashboard
    • Performance analysis dashboard
    • Service health dashboard
  5. Correlate with metrics: Check Prometheus for related metric changes
    • CPU/memory usage
    • Request rate
    • Error rate
    • Database connections
  6. Copy request_id early: Grab the request_id first so you can trace the full request flow
  7. Check #incidents Slack: See if others are investigating same issue

Common Mistakes to Avoid

  1. Querying too broadly: Don’t query all services over all time
  2. Ignoring context: Always check logs before/after the error
  3. Assuming single cause: Complex issues often have multiple contributing factors
  4. Skipping impact assessment: Always quantify user/business impact
  5. Not documenting: Future you will thank present you for good docs
  6. Fixing symptoms: Find and fix root cause, not just symptoms
  7. Solo investigation: Involve team members for complex issues

Next Steps