
Error Investigation Workflow

This guide provides a systematic approach to investigating errors in ReptiDex using Loki logs and Grafana dashboards.

Quick Start

When you discover an error (via alert, dashboard, or user report), follow this workflow:
  1. Identify the error
  2. Gather context
  3. Find root cause
  4. Assess impact
  5. Document findings

Step 1: Identify the Error

From Alert

If the error came from an alert:
  1. Note the alert name and severity
  2. Check alert labels for service, environment, error_type
  3. Note the timestamp when alert fired
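With those labels in hand, you can usually drop them straight into a first query. A minimal sketch, assuming the alert's service and error_type labels match the log fields (names may differ in your setup):
# Start from the alert's labels; set the time range to a few minutes before the alert fired
{service="repti-core"}
| json
| level="ERROR"
| error_type="DatabaseConnectionError"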

From Dashboard

If you see an error spike in a dashboard:
  1. Note the service name
  2. Identify the time range when errors spiked
  3. Check the error type if available
  4. Look for correlated metric changes (CPU, memory, request rate)
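A quick breakdown by error type over the spike window can confirm what the dashboard panel is showing. A sketch (adjust the service and window to match the spike):
# Error counts by type during the spike window
sum by (error_type) (
  count_over_time(
    {service="repti-core"}
    | json
    | level="ERROR" [15m]
  )
)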

From User Report

If a user reported an issue:
  1. Gather user ID or session ID
  2. Get approximate timestamp of the issue
  3. Identify which feature/endpoint was affected
  4. Note any error messages the user saw
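With a user ID in hand, a broad query scoped to that user around the reported time is a reasonable first pass. A sketch, assuming requests are logged with a user_id field and the reported ID is usr_abc123:
# Warnings and errors for the reporting user; set the time range around the report
{service=~"repti-.*"}
| json
| user_id="usr_abc123"
| level=~"WARN|ERROR"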

Step 2: Gather Context

Query Initial Error Logs

Start with a broad query to find the error:
# By time range and service
{service="repti-core"}
| json
| level="ERROR"

Narrow Down with Filters

Add more specific filters:
{service="repti-core"}
| json
| level="ERROR"
| endpoint="/api/v1/animals"
| user_id="usr_abc123"

Get Error Details

Examine a specific error to understand:
  • Error type: What kind of error occurred?
  • Error message: What does it say?
  • Stack trace: Where in the code did it fail?
  • Request context: What was the user trying to do?

Example Error Log

{
  "timestamp": "2025-10-13T14:23:45.123Z",
  "level": "ERROR",
  "service": "repti-core",
  "request_id": "req_abc123",
  "user_id": "usr_xyz789",
  "endpoint": "/api/v1/animals",
  "method": "POST",
  "status_code": 500,
  "duration_ms": 1234,
  "error_type": "DatabaseConnectionError",
  "message": "Failed to acquire database connection from pool",
  "stack_trace": "Traceback (most recent call last):\n  File ...",
  "details": {
    "pool_size": 10,
    "active_connections": 10,
    "waiting_connections": 5,
    "retry_count": 3
  }
}
Key observations:
  • Pool is exhausted (10/10 active connections)
  • 5 connections waiting
  • Failed after 3 retries
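Because | json flattens nested objects (details.retry_count becomes the label details_retry_count), the structured fields can be filtered on directly. A sketch for finding other errors that exhausted the pool:
# Connection errors that ran out of retries
{service="repti-core"}
| json
| error_type="DatabaseConnectionError"
| details_retry_count >= 3
| line_format "{{.timestamp}} {{.endpoint}} active={{.details_active_connections}}/{{.details_pool_size}}"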

View Surrounding Logs

Use context view to see what happened before/after:
  1. Click on the error log in Grafana
  2. Select “Show context”
  3. Review logs 5 minutes before and after
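If the context view is unavailable, an equivalent approach is to rerun the search without the level filter over a narrow window around the error:
# All log levels for the service; set the time range to ~5 minutes around the error
{service="repti-core"}
| json
| line_format "{{.timestamp}} [{{.level}}] {{.message}}"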

Trace the Request

Follow the request across services using request_id:
{service=~"repti-.*"}
| json
| request_id="req_abc123"
| line_format "{{.timestamp}} [{{.service}}] {{.level}} - {{.message}}"
This shows the full request flow:
  1. Gateway received request
  2. Auth service validated token
  3. Core service processed request
  4. Database connection failed

Step 3: Find Root Cause

Common Error Patterns

Database Connection Exhaustion

Symptoms:
  • Error type: DatabaseConnectionError
  • Pool size at maximum
  • Waiting connections queue growing
Investigation:
# Check if other services are also affected
{service=~"repti-.*"}
| json
| error_type =~ ".*Connection.*"
| line_format "{{.timestamp}} [{{.service}}] {{.message}}"
Root causes:
  • Connection leak (not closing connections)
  • Slow queries holding connections
  • Traffic spike overwhelming pool
  • Database performance degradation
Next steps:
  1. Check slow query logs
  2. Review database metrics (CPU, connections, query time)
  3. Identify whether a specific endpoint is causing the issue
  4. Check for recent code changes
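For step 1, if your services log per-request durations (duration_ms in the example log above), a rough check for slow requests that may be holding connections could look like this sketch:
# Requests slower than 1s that may be holding connections
{service="repti-core"}
| json
| duration_ms > 1000
| line_format "{{.duration_ms}}ms {{.method}} {{.endpoint}}"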

API Timeout

Symptoms:
  • Error type: TimeoutError, ReadTimeout
  • Duration exceeds threshold
  • May affect specific downstream services
Investigation:
# Find all timeout errors
{service=~"repti-.*"}
| json
| error_type =~ ".*Timeout.*"
| duration_ms > 5000
| line_format "{{.duration_ms}}ms - {{.service}} → {{.endpoint}}"
Root causes:
  • Downstream service slow/unavailable
  • Network issues
  • Database query performance
  • CPU/memory exhaustion
Next steps:
  1. Check if downstream service is healthy
  2. Review performance metrics for the service
  3. Check for slow database queries
  4. Look at network metrics
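A breakdown of timeouts by service and endpoint helps show whether a single downstream dependency is responsible. A sketch, assuming all services follow the repti-* naming:
# Timeout counts per service and endpoint
sum by (service, endpoint) (
  count_over_time(
    {service=~"repti-.*"}
    | json
    | error_type =~ ".*Timeout.*" [15m]
  )
)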

Authentication Failure

Symptoms:
  • Status code: 401, 403
  • Message contains “authentication failed”, “invalid token”
  • May be isolated to one user or widespread
Investigation:
# Count failures by user
sum by (user_id) (
  count_over_time(
    {service="repti-auth"}
    | json
    | message =~ "(?i)authentication failed" [1h]
  )
)
Root causes:
  • Expired tokens
  • Token service unavailable
  • Invalid credentials
  • Brute force attack (if widespread)
Next steps:
  1. Check if isolated to one user
  2. Verify token service health
  3. Check for suspicious IP patterns
  4. Review recent auth changes
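For step 3, grouping failures by client IP can surface brute-force patterns. A sketch, assuming the auth service logs the caller's address (client_ip here is a hypothetical field name):
# Authentication failures per client IP over the last hour
sum by (client_ip) (
  count_over_time(
    {service="repti-auth"}
    | json
    | message =~ "(?i)authentication failed" [1h]
  )
)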

Memory/Resource Issues

Symptoms:
  • Error type: MemoryError, OutOfMemory
  • Warnings about high memory usage before error
  • Service may restart after error
Investigation:
# Find memory warnings leading up to error
{service="repti-core"}
| json
| message =~ "(?i)memory|oom"
| level =~ "WARN|ERROR"
| line_format "{{.timestamp}} [{{.level}}] {{.message}}"
Root causes:
  • Memory leak
  • Large data processing
  • Inefficient caching
  • Too many concurrent requests
Next steps:
  1. Check service memory metrics in Prometheus
  2. Look for memory trend over time
  3. Identify whether a specific endpoint causes the issue
  4. Review recent code changes with large data structures
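Graphing the memory warnings themselves over time (alongside the Prometheus memory metrics from step 1) shows whether pressure built gradually or spiked suddenly. A sketch:
# Trend of memory-related warnings/errors; graph this query
sum(
  count_over_time(
    {service="repti-core"}
    | json
    | message =~ "(?i)memory|oom"
    | level =~ "WARN|ERROR" [10m]
  )
)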

Correlation Analysis

Check if error correlates with other events:

Traffic Spike

# Compare error rate with request rate
rate({service="repti-core"} [5m])
If errors spike with traffic, possible causes:
  • Insufficient scaling
  • Database connection pool too small
  • Rate limiting not effective
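To put errors and traffic on one graph, an error-ratio query is useful (a sketch):
# Fraction of log lines that are errors (compare its shape against the raw request rate above)
sum(rate({service="repti-core"} | json | level="ERROR" [5m]))
  /
sum(rate({service="repti-core"} [5m]))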

Deployment

Check if error started after a deployment:
  1. Note error start time
  2. Check deployment logs for that time
  3. Review recent code changes
{service="repti-core"}
| json
| message =~ "(?i)deployment|starting|started"
| line_format "{{.timestamp}} - {{.message}}"

Dependency Failure

Check if downstream service failed:
# Check all services for errors at same time
{service=~"repti-.*"}
| json
| level="ERROR"
If multiple services have errors, check common dependency:
  • Database
  • Redis
  • Auth service
  • External API
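A per-service error count makes it easy to see whether failures cluster around a shared dependency (a sketch):
# Error counts per service over the incident window
sum by (service) (
  count_over_time(
    {service=~"repti-.*"}
    | json
    | level="ERROR" [15m]
  )
)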

Step 4: Assess Impact

User Impact

How many users affected?
count(
  sum by (user_id) (
    count_over_time(
      {service="repti-core"}
      | json
      | level="ERROR"
      | error_type="DatabaseConnectionError" [1h]
    )
  )
)
Which features affected?
sum by (endpoint) (
  count_over_time(
    {service="repti-core"}
    | json
    | level="ERROR" [1h]
  )
)

Business Impact

  • Critical: Core user flows broken (auth, payment, data loss)
  • High: Major features unavailable (create, update operations)
  • Medium: Minor features degraded (slow performance, non-critical errors)
  • Low: Background jobs, analytics, non-user-facing

Error Frequency

Is it ongoing?
# Error rate in 5-minute windows (set the time range to the last hour in the UI)
rate(
  {service="repti-core"}
  | json
  | level="ERROR" [5m]
)
Is it increasing? Compare error rate:
  • Last 5 minutes vs last hour
  • Current hour vs previous hour
  • Today vs yesterday
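Recent Loki versions support an offset modifier in metric queries, which lets you compare the current rate against an earlier window directly. A sketch comparing now with one hour ago:
# Current error rate relative to the same window one hour earlier (>1 means it is increasing)
sum(rate({service="repti-core"} | json | level="ERROR" [5m]))
  /
sum(rate({service="repti-core"} | json | level="ERROR" [5m] offset 1h))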

Data Integrity

Were any writes attempted?
{service="repti-core"}
| json
| method =~ "POST|PUT|PATCH|DELETE"
| status_code >= 500
| line_format "{{.method}} {{.endpoint}} - User: {{.user_id}}"
Potential data loss scenarios:
  • Transaction failed after partial write
  • Async job failed midway
  • Cache/database out of sync

Step 5: Document Findings

Create Incident Report

Document in your incident management system:

Title: [SERVICE] Error Type - Brief Description
Example: [repti-core] DatabaseConnectionError - Pool Exhaustion on /api/v1/animals

Severity: Critical / High / Medium / Low

Timeline:
  • First detected: 2025-10-13 14:20 UTC
  • Last occurrence: 2025-10-13 14:45 UTC
  • Duration: 25 minutes
Impact:
  • Users affected: 127 unique users
  • Requests failed: 1,543 requests
  • Features affected: Animal creation endpoint
Root Cause:
Database connection pool exhausted due to slow queries
on animal_lineage table. Query performance degraded after
adding 100k+ new records without proper indexing.
Resolution:
1. Added index on animal_lineage.parent_id (14:35 UTC)
2. Increased connection pool size from 10 to 20 (14:40 UTC)
3. Error rate returned to normal (14:45 UTC)
Prevention:
- Add database query performance tests to CI/CD
- Monitor query execution time in staging
- Add alerts for slow queries (>1s)
- Implement connection pool monitoring
Related Links:
  • Grafana logs: [Link to query]
  • Error dashboard: [Link to dashboard]
  • Prometheus metrics: [Link to metrics]
  • Incident ticket: [Link to Jira/Linear]

Update Runbook

If this is a recurring issue, add to the debugging playbook:
  • Error signature
  • Common causes
  • Investigation steps
  • Resolution steps
  • Prevention measures

Investigation Checklist

Use this checklist for every investigation:
  • Identified error type and message
  • Determined time range of issue
  • Found affected service(s)
  • Retrieved stack trace
  • Examined surrounding logs (context)
  • Traced request across services (request_id)
  • Checked for correlation with deployments
  • Checked for correlation with traffic spikes
  • Reviewed dependent service health
  • Checked database/cache metrics
  • Identified affected users
  • Identified affected endpoints
  • Determined business impact
  • Assessed data integrity risk
  • Documented root cause
  • Documented resolution
  • Documented prevention measures
  • Updated runbook if needed
  • Created follow-up tasks

Tips for Faster Investigation

  1. Use saved queries: Start with saved queries from the Saved Queries doc
  2. Filter early: Add label filters before text search
    {service="repti-core"} | json | level="ERROR"
    # Not: {service="repti-core"} |= "error"
    
  3. Limit time range: Start with narrow range (5m) then expand
    {service="repti-core"} | json | level="ERROR"
    # Set time range to last 5 minutes in UI
    
  4. Use dashboards: Jump to pre-built dashboards for common scenarios
    • Error tracking dashboard
    • Performance analysis dashboard
    • Service health dashboard
  5. Correlate with metrics: Check Prometheus for related metric changes
    • CPU/memory usage
    • Request rate
    • Error rate
    • Database connections
  6. Copy request_id early: Grab the request_id first so you can trace the full request flow
  7. Check #incidents Slack: See if others are investigating same issue

Common Mistakes to Avoid

  1. Querying too broadly: Don’t query all services over all time
  2. Ignoring context: Always check logs before/after the error
  3. Assuming single cause: Complex issues often have multiple contributing factors
  4. Skipping impact assessment: Always quantify user/business impact
  5. Not documenting: Future you will thank present you for good docs
  6. Fixing symptoms: Find and fix root cause, not just symptoms
  7. Solo investigation: Involve team members for complex issues

Next Steps