Centralized Logging with Loki

ReptiDex uses Grafana Loki for centralized log aggregation, providing structured logging, error tracking, and performance analysis across all microservices.

Overview

Architecture

┌─────────────────┐
│   FastAPI App   │──┐
│  (repti-core)   │  │
│   [Container]   │  │
└─────────────────┘  │
        │            │    ┌──────────────┐      ┌──────────┐
        └────────────┼───▶│  Fluent Bit  │─────▶│   Loki   │
                     │    │  (FireLens   │      │ (ECS)    │
┌─────────────────┐  │    │   Sidecar)   │      └──────────┘
│   FastAPI App   │──┤    └──────────────┘           │
│ (repti-animal)  │  │                               │
│   [Container]   │  │                               ▼
└─────────────────┘  │                          ┌─────────┐
        │            │                          │ Grafana │
        └────────────┤                          │ Explore │
                     │                          └─────────┘
┌─────────────────┐  │
│   Next.js App   │──┘
│  (Frontend)     │
│   [Container]   │
└─────────────────┘

Components

  • Loki: Log aggregation system (ECS Fargate service with S3 storage)
  • Fluent Bit: Log collector using AWS FireLens (sidecar containers)
  • Grafana: Query interface and dashboard platform
  • Storage: S3 for log chunks (with lifecycle policies for retention)

Log Flow

  1. Application writes logs: Services write structured JSON logs to stdout/stderr (see the sketch after this list)
  2. Fluent Bit captures logs: FireLens sidecar container intercepts container logs
  3. Fluent Bit filters logs:
    • Adds metadata (cluster, environment, service)
    • Filters sensitive data (passwords, API keys, tokens)
    • Formats logs as JSON
  4. Fluent Bit forwards to Loki: Logs are sent to Loki via HTTP API
  5. Loki stores logs: Logs are indexed and stored in S3
  6. Query via Grafana: Users query logs using LogQL in Grafana Explore
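
A minimal sketch of step 1, assuming Python services using the standard library's logging module (the actual repti_telemetry setup may differ):

import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line so Fluent Bit can parse it."""
    def format(self, record):
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "repti-core",
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)  # stdout is what FireLens captures
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("repti-core")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("service started")  # -> {"timestamp": "...", "level": "INFO", ...}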

FireLens Configuration

Each ECS task definition includes a Fluent Bit sidecar container:
# Application container
- Name: repti-core
  Image: repti-core:latest
  LogConfiguration:
    LogDriver: awsfirelens
    Options:
      Name: "null"
  DependsOn:
    - ContainerName: log_router
      Condition: START

# Fluent Bit sidecar
- Name: log_router
  Image: fluent-bit:latest
  FirelensConfiguration:
    Type: fluentbit
Fluent Bit filters applied (sketched below):
  • Add cluster and environment labels
  • Exclude logs containing sensitive keywords (password, api_key, secret, token)
  • Format logs as JSON for Loki
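
A minimal sketch of these filters using the stock record_modifier and grep plugins; the deployed custom config may differ:

[FILTER]
    # Attach cluster/environment metadata to every record
    Name     record_modifier
    Match    *
    Record   cluster ${ECS_CLUSTER}
    Record   environment ${ENVIRONMENT}

[FILTER]
    # Drop lines whose log field mentions sensitive keywords
    Name     grep
    Match    *
    Exclude  log password|api_key|secret|token
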
Fluent Bit output configuration:
[OUTPUT]
    Name                loki
    Host                loki.dev-reptidex-monitoring
    Port                3100
    labels              job=ecs,cluster=${ECS_CLUSTER},environment=${ENVIRONMENT}
    label_keys          $container_name
    line_format         json

Accessing Logs

Grafana Explore

Access Loki logs through Grafana Explore at https://grafana-dev.reptidex.com/explore:
  1. Select Loki as the data source
  2. Use the query builder or write LogQL directly
  3. Select time range (last 5m, 1h, 24h, 7d, etc.)
  4. Apply filters and run query

Quick Start Examples

# All logs from a specific service
{service="repti-core"}

# Error logs only
{service="repti-core"} | json | level="ERROR"

# Search for specific text
{service="repti-core"} |= "database connection"

# Filter by user
{service=~"repti-.*"} | json | user_id="abc-123"

Structured Logging Standards

Log Format

All services use structured JSON logging with consistent fields:
{
  "timestamp": "2025-10-13T12:34:56.789Z",
  "level": "ERROR",
  "service": "repti-core",
  "version": "1.0.0",
  "environment": "production",
  "request_id": "req_abc123",
  "user_id": "usr_xyz789",
  "session_id": "ses_def456",
  "endpoint": "/api/v1/animals",
  "method": "POST",
  "status_code": 500,
  "duration_ms": 1234,
  "error_type": "DatabaseConnectionError",
  "message": "Failed to connect to database",
  "stack_trace": "Traceback (most recent call last)...",
  "details": {
    "database": "postgres",
    "retry_count": 3
  }
}

Standard Fields

Field         Type       Required      Description
timestamp     ISO 8601   Yes           When the log was generated
level         String     Yes           DEBUG, INFO, WARN, ERROR, CRITICAL
service       String     Yes           Service name (e.g., repti-core)
version       String     Yes           Service version
environment   String     Yes           dev, staging, production
request_id    String     Conditional   Unique request identifier
user_id       String     Optional      Authenticated user ID
session_id    String     Optional      User session ID
endpoint      String     Conditional   API endpoint path
method        String     Conditional   HTTP method
status_code   Integer    Conditional   HTTP status code
duration_ms   Float      Optional      Request/operation duration
error_type    String     Conditional   Error class/type
message       String     Yes           Log message
stack_trace   String     Optional      Full error stack trace

Log Levels by Environment

Development:
  • All levels enabled (DEBUG, INFO, WARN, ERROR, CRITICAL)
  • Sample rate: 100%
Staging:
  • INFO and above
  • Sample rate: 100%
Production:
  • INFO and above by default
  • DEBUG logs sampled at 10%; all other levels logged at 100% (see the sampling sketch below)
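
One way to implement the 10% DEBUG sampling, shown as a hypothetical stdlib logging filter (the mechanism actually used by the services may differ):

import logging
import random

class DebugSampleFilter(logging.Filter):
    """Pass ~10% of DEBUG records; pass every other level through."""
    def __init__(self, sample_rate: float = 0.1):
        super().__init__()
        self.sample_rate = sample_rate

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno == logging.DEBUG:
            return random.random() < self.sample_rate
        return True

logging.getLogger("repti-core").addFilter(DebugSampleFilter(0.1))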

Correlation IDs

Request ID Propagation

Request IDs are automatically generated and propagated across service boundaries:
# Incoming request
X-Request-ID: req_abc123def456

# Service A logs
{"request_id": "req_abc123def456", "service": "repti-core", ...}

# Service B logs (called by Service A)
{"request_id": "req_abc123def456", "service": "repti-animal", ...}

Tracing Requests

To trace a request across all services:
{service=~"repti-.*"} | json | request_id="req_abc123def456"

PII Filtering & Security

Automatic Redaction

All services automatically filter PII and sensitive data. Filtered fields:
  • Passwords (any field containing "password", "passwd", "pwd")
  • API keys (any field containing "api_key", "apikey", "token")
  • Email addresses (regex pattern match)
  • Phone numbers (regex pattern match)
  • Credit card numbers (regex pattern match)
  • SSN/Tax IDs (regex pattern match)
Example:
# Before filtering
{
  "email": "[email protected]",
  "password": "secret123",
  "api_key": "sk_live_abc123"
}

# After filtering
{
  "email": "[REDACTED]",
  "password": "[REDACTED]",
  "api_key": "[REDACTED]"
}
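
A sketch of the key- and pattern-based redaction shown above (the production filter also covers the phone, credit card, and SSN patterns listed earlier):

import re

SENSITIVE_KEYS = re.compile(r"password|passwd|pwd|api_key|apikey|token", re.IGNORECASE)
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact(event: dict) -> dict:
    """Replace sensitive values with [REDACTED] before the event is emitted."""
    clean = {}
    for key, value in event.items():
        if SENSITIVE_KEYS.search(key):
            clean[key] = "[REDACTED]"
        elif isinstance(value, str) and EMAIL_PATTERN.search(value):
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)  # recurse into nested objects
        else:
            clean[key] = value
    return clean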

Verification

Test that sensitive data is filtered:
# Should return 0 results
{service=~"repti-.*"} |~ "password|api_key|credit_card"

Log Retention Policies

Log Level   Retention Period   Storage
DEBUG       30 days            S3
INFO        90 days            S3
WARN        1 year             S3
ERROR       1 year             S3
CRITICAL    1 year             S3
Compaction:
  • Logs are compacted daily by Loki compactor
  • Old chunks are moved to long-term S3 storage
  • DynamoDB index is pruned according to retention policy
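
In Loki, per-level retention like this is typically expressed as per-stream retention rules; a sketch, assuming level is attached as a stream label (the deployed config may differ):

compactor:
  retention_enabled: true

limits_config:
  retention_period: 8760h        # 1 year default (WARN, ERROR, CRITICAL)
  retention_stream:
    - selector: '{level="DEBUG"}'
      priority: 1
      period: 720h               # 30 days
    - selector: '{level="INFO"}'
      priority: 1
      period: 2160h              # 90 days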

Multi-Tenancy

Loki is configured with multi-tenancy to separate logs by environment:
X-Scope-OrgID: dev       # Development environment
X-Scope-OrgID: staging   # Staging environment
X-Scope-OrgID: prod      # Production environment
Grafana automatically sets the appropriate tenant based on your environment selection.
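
When calling the Loki HTTP API directly (outside Grafana), pass the tenant header yourself, e.g. against the standard query_range endpoint:

curl -G -H "X-Scope-OrgID: dev" \
  "http://loki.dev-reptidex-monitoring:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={service="repti-core"}' \
  --data-urlencode 'limit=100'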

Performance Considerations

Query Limits

  • Max query range: 30 days
  • Max query lookback: 7 days (without specifying end time)
  • Max entries per query: 5000
  • Query timeout: 30 seconds

Caching

Loki maintains several caches:
  • Chunk cache: 1 hour
  • Results cache: 10 minutes
  • Index cache: 5 minutes

Best Practices

  1. Always use time ranges: Queries without time bounds are slow
  2. Use label filters first: {service="repti-core"} before text search
  3. Limit result sets: Lower the query line limit (e.g., 100 lines in Grafana) and use | line_format to trim output
  4. Avoid regex in labels: Use exact matches when possible
  5. Use metric queries for aggregation: rate() and count_over_time() instead of fetching and counting raw lines (example below)
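
For example, a metric query that charts error counts per service without fetching raw lines:

sum by (service) (
  count_over_time({service=~"repti-.*"} | json | level="ERROR" [5m])
)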

Common Issues

Logs Not Appearing

  1. Check Fluent Bit sidecar is running:
    aws ecs describe-tasks --cluster <cluster-name> --tasks <task-arn> --query 'tasks[0].containers[?name==`log_router`].lastStatus'
    
  2. Verify the application container is running:
    aws ecs describe-tasks --cluster <cluster-name> --tasks <task-arn> --query 'tasks[0].containers[0].lastStatus'
    
  3. Check Fluent Bit logs in CloudWatch:
    aws logs tail /ecs/<environment>-reptidex --follow --filter-pattern "log_router"
    
  4. Verify Loki is reachable:
    curl http://loki.<environment>-reptidex-monitoring:3100/ready
    

Slow Queries

  1. Reduce time range
  2. Add more specific label filters
  3. Apply line filters (|= "text") before parser stages such as | json
  4. Lower the query line limit (e.g., 100 lines)

Missing Fields

  1. Verify structured logging is enabled in service
  2. Check log format is valid JSON (see the query below)
  3. Ensure repti_telemetry package is up to date
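
If lines arrive but fields are missing, the __error__ label set by the json parser helps spot records Loki could not parse:

# Lines where the json stage failed to parse
{service="repti-core"} | json | __error__ = "JSONParserErr"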

Next Steps