Saved LogQL Queries

This document contains commonly used LogQL queries for ReptiDex services. These queries are also available as saved queries in Grafana.

Error Analysis

All Errors from a Service

{service="repti-core"} | json | level="ERROR"

Errors with Grouping by Type

{service="repti-core"}
| json
| level="ERROR"
| line_format "{{.error_type}}: {{.message}}"

Top 10 Error Types

topk(10,
  sum by (error_type) (
    count_over_time(
      {service="repti-core"}
      | json
      | level="ERROR" [24h]
    )
  )
)

Error Rate Over Time

sum by (service) (
  rate(
    {service="repti-core"}
    | json
    | level="ERROR" [5m]
  )
)

Errors by Endpoint

{service=~"repti-.*"}
| json
| level="ERROR"
| endpoint != ""
| line_format "{{.endpoint}}: {{.error_type}}"

New Error Types (Last 24h)

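# Error types seen in the last 24h but not in the 24h before
sum by (error_type) (
  count_over_time({service=~"repti-.*"} | json | level="ERROR" | __error__="" [24h])
)
unless
sum by (error_type) (
  count_over_time({service=~"repti-.*"} | json | level="ERROR" | __error__="" [24h] offset 24h)
)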

Performance Analysis

Slow Requests (>1 second)

{service=~"repti-.*"}
| json
| duration_ms > 1000
| line_format "{{.duration_ms}}ms - {{.method}} {{.endpoint}} ({{.status_code}})"

Top 10 Slowest Endpoints

topk(10,
  avg by (endpoint) (
    avg_over_time(
      {service=~"repti-.*"}
      | json
      | unwrap duration_ms [1h]
    )
  )
)

Request Duration Percentiles

# P50
quantile_over_time(0.50,
  {service="repti-core"}
  | json
  | unwrap duration_ms [5m]
) by (service)

# P95
quantile_over_time(0.95,
  {service="repti-core"}
  | json
  | unwrap duration_ms [5m]
) by (service)

# P99
quantile_over_time(0.99,
  {service="repti-core"}
  | json
  | unwrap duration_ms [5m]
) by (service)

Slow Database Queries

{service=~"repti-.*"}
| json
| message =~ "(?i).*(database|postgres|query).*"
| duration_ms > 1000
| line_format "{{.duration_ms}}ms - {{.message}}"

Database Connection Pool Exhaustion

{service=~"repti-.*"}
| json
| message =~ "(?i).*(connection pool|max connections|pool exhausted).*"
| level =~ "WARN|ERROR"

Cache Miss Patterns

{service=~"repti-.*"}
| json
| message =~ "(?i).*(cache miss|cache not found).*"
| line_format "{{.timestamp}} [{{.service}}] {{.endpoint}}"

Memory Warnings

{service=~"repti-.*"}
| json
| message =~ "(?i).*(memory|oom|out of memory).*"
| level =~ "WARN|ERROR"

Security & Authentication

Failed Authentication Attempts

{service="repti-auth"}
| json
| message =~ "(?i).*(authentication failed|login failed|invalid credentials).*"
| line_format "{{.timestamp}} - {{.user_id}} - {{.source_ip}}"

Authentication Failure Rate

sum by (service) (
  rate(
    {service="repti-auth"}
    | json
    | message =~ "(?i).*authentication failed.*" [5m]
  )
)

Unauthorized Access Attempts (401/403)

{service=~"repti-.*"}
| json
| status_code =~ "401|403"
| line_format "{{.timestamp}} [{{.service}}] {{.method}} {{.endpoint}} - {{.user_id}} - {{.source_ip}}"

Top IPs with Auth Failures

topk(10,
  sum by (source_ip) (
    count_over_time(
      {service="repti-auth"}
      | json
      | message =~ "(?i).*authentication failed.*" [1h]
    )
  )
)

Suspicious Activity (Multiple Failed Attempts)

sum by (source_ip, user_id) (
  count_over_time(
    {service="repti-auth"}
    | json
    | message =~ "(?i).*authentication failed.*" [5m]
  )
) > 5

API Rate Limits Hit

{service="repti-gateway"}
| json
| status_code="429"
| line_format "{{.timestamp}} - {{.endpoint}} - {{.user_id}} - {{.source_ip}}"

Rate Limit Hits by Endpoint

sum by (endpoint) (
  count_over_time(
    {service="repti-gateway"}
    | json
    | status_code="429" [1h]
  )
)

User Activity

User Activity Timeline

{service=~"repti-.*"}
| json
| user_id="<USER_ID_HERE>"
| line_format "{{.timestamp}} [{{.service}}] {{.method}} {{.endpoint}} ({{.status_code}})"

User Actions by Service

sum by (service) (
  count_over_time(
    {service=~"repti-.*"}
    | json
    | user_id="<USER_ID_HERE>" [24h]
  )
)

Active Users in Last Hour

count(
  sum by (user_id) (
    count_over_time(
      {service=~"repti-.*"}
      | json
      | user_id != "" [1h]
    )
  )
)

User Session Analysis

{service=~"repti-.*"}
| json
| session_id="<SESSION_ID_HERE>"
| line_format "{{.timestamp}} [{{.service}}] {{.endpoint}} - Duration: {{.duration_ms}}ms"

Database Operations

Database Errors

{service=~"repti-.*"}
| json
| message =~ "(?i).*(database|postgres|connection|query error).*"
| level="ERROR"
| line_format "{{.timestamp}} [{{.service}}] {{.error_type}}: {{.message}}"

Database Connection Errors

{service=~"repti-.*"}
| json
| error_type =~ ".*Connection.*|.*Database.*"
| level="ERROR"

SQL Injection Attempts

{service=~"repti-.*"}
| json
| message =~ "(?i).*(sql injection|drop table|union select|exec|script).*"
| level="WARN"

Transaction Rollbacks

{service=~"repti-.*"}
| json
| message =~ "(?i).*(rollback|transaction failed).*"
| line_format "{{.timestamp}} [{{.service}}] {{.endpoint}}: {{.message}}"

API Monitoring

5xx Server Errors

{service=~"repti-.*"}
| json
| status_code >= 500
| line_format "{{.timestamp}} [{{.service}}] {{.method}} {{.endpoint}} ({{.status_code}})"

4xx Client Errors

{service=~"repti-.*"}
| json
| status_code >= 400 and status_code < 500
| line_format "{{.timestamp}} [{{.service}}] {{.method}} {{.endpoint}} ({{.status_code}})"

Error Rate by Status Code

sum by (status_code) (
  rate(
    {service=~"repti-.*"}
    | json
    | status_code >= 400 [5m]
  )
)

Top Error Endpoints

topk(10,
  sum by (endpoint) (
    count_over_time(
      {service=~"repti-.*"}
      | json
      | status_code >= 500 [1h]
    )
  )
)

Request Volume by Service

sum by (service) (
  rate({service=~"repti-.*"} [5m])
)

Service Health

Service Startup/Shutdown Events

{service=~"repti-.*"}
| json
| message =~ "(?i).*(starting|started|stopping|stopped|shutting down).*"
| line_format "{{.timestamp}} [{{.service}}] {{.message}}"

Unhealthy Service Checks

{service=~"repti-.*"}
| json
| message =~ "(?i).*(health check failed|unhealthy).*"
| level =~ "WARN|ERROR"

Background Job Failures

{service=~"repti-.*"}
| json
| message =~ "(?i).*(job failed|task failed|worker error).*"
| level="ERROR"

Resource Warnings

{service=~"repti-.*"}
| json
| message =~ "(?i).*(high cpu|high memory|disk full|low disk space).*"
| level="WARN"

Debugging Queries

Context View (Logs Around a Specific Time)

{service="repti-core"}
| json
| line_format "{{.timestamp}} [{{.level}}] {{.message}}"

Then in Grafana, click on a log line and select “Show context” to see logs before and after.

Live Tail (Real-time Logs)

{service="repti-core"} | json

Click the “Live” button in Grafana Explore to stream logs in real-time.

Full Request/Response Debug

{service="repti-core"}
| json
| request_id="<REQUEST_ID>"
| line_format "{{.timestamp}} [{{.level}}] {{.message}}\nDetails: {{.details}}"

Unique Error Messages

sum by (message) (
  count_over_time(
    {service="repti-core"}
    | json
    | level="ERROR" [24h]
  )
)

Correlation Queries

Errors with Slow Requests

{service="repti-core"}
| json
| duration_ms > 2000 or level="ERROR"
| line_format "{{.duration_ms}}ms [{{.level}}] {{.method}} {{.endpoint}} - {{.message}}"

Failed Requests with User Context

{service=~"repti-.*"}
| json
| status_code >= 400
| user_id != ""
| line_format "User: {{.user_id}} - {{.method}} {{.endpoint}} ({{.status_code}}) - {{.message}}"

Variables for Dashboards

When using these queries in Grafana dashboards, use these variables:

$service     # Service name (e.g., repti-core)
$environment # Environment (dev, staging, prod)
$user_id     # User ID for filtering
$endpoint    # API endpoint
$level       # Log level (DEBUG, INFO, WARN, ERROR)
$timerange   # Time range (e.g., [5m], [1h], [24h])

Example with variables:

{service="$service", environment="$environment"}
| json
| level="$level"
| user_id="$user_id"
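
The same variables can drive a metric query. The following is a minimal sketch under two assumptions that are not stated in this document: `$service` is a multi-value variable (hence `=~`), and Grafana's built-in `$__interval` supplies the range instead of `$timerange`. Adjust both to match how the dashboard actually defines its variables.

```logql
# Log volume per level for the selected services and endpoint
sum by (level) (
  count_over_time(
    {service=~"$service", environment="$environment"}
    | json
    | __error__=""
    | endpoint="$endpoint" [$__interval]
  )
)
```

Using `$__interval` keeps the range aligned with the panel resolution, so zooming the dashboard in or out does not require editing the query.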

Query Tips

Start broad, filter narrow: Begin with {service="repti-core"} then add filters
Use json parsing early: | json extracts fields for filtering
Format output for readability: | line_format "{{.field1}}: {{.field2}}"
Limit results: Set the line limit in the Grafana query options for large result sets (LogQL has no | limit stage)
Use aggregation functions: count_over_time(), rate(), sum(), avg()
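Regex tips:
- =~ for regex match
- !~ for regex not match
- (?i) for case-insensitive
- Label filters such as | message =~ "..." must match the whole value, so wrap a substring in .* on both sides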
Time ranges: Metric queries such as rate() and count_over_time() require a range like [5m] or [1h]; log queries use the time picker instead
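
As a rough illustration of how these tips combine, the sketch below starts from a single stream selector and narrows step by step. The `/api/v1/` endpoint prefix is a hypothetical value chosen for the example, not something taken from the ReptiDex services.

```logql
# Start broad: one service
{service="repti-core"}
# Parse JSON early and drop lines that failed to parse
| json
| __error__=""
# Filter narrow: errors on a hypothetical endpoint prefix
| level="ERROR"
| endpoint =~ "/api/v1/.*"
# Format output for readability
| line_format "{{.endpoint}}: {{.message}}"
```

Putting the cheap exact-match filter (`level="ERROR"`) before the regex filter is a reasonable default, since it discards most lines before the more expensive match runs.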

Next Steps

Error Investigation Workflow - How to investigate errors
Debugging Playbook - Common debugging scenarios
Log Correlation - Correlating logs with metrics
