Centralized Logging with Loki
ReptiDex uses Grafana Loki for centralized log aggregation, providing structured logging, error tracking, and performance analysis across all microservices.

Overview
Architecture
Components
- Loki: Log aggregation system (ECS Fargate service with S3 storage)
- Fluent Bit: Log collector using AWS FireLens (sidecar containers)
- Grafana: Query interface and dashboard platform
- Storage: S3 for log chunks (with lifecycle policies for retention)
Log Flow
- Application writes logs: Services write structured JSON logs to stdout/stderr
- Fluent Bit captures logs: FireLens sidecar container intercepts container logs
- Fluent Bit filters logs:
- Adds metadata (cluster, environment, service)
- Filters sensitive data (passwords, API keys, tokens)
- Formats logs as JSON
- Fluent Bit forwards to Loki: Logs are sent to Loki via HTTP API
- Loki stores logs: Logs are indexed and stored in S3
- Query via Grafana: Users query logs using LogQL in Grafana Explore
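To make the flow concrete, a single stored log line might look like the following (all values are illustrative, not taken from a real ReptiDex deployment):

```json
{
  "timestamp": "2024-05-01T12:00:00.000Z",
  "level": "INFO",
  "service": "repti-core",
  "version": "1.4.2",
  "environment": "dev",
  "request_id": "8f14e45f-ceea-4672-9a1b-0c5f1d2e3a4b",
  "endpoint": "/api/v1/reptiles",
  "method": "GET",
  "status_code": 200,
  "duration_ms": 12.5,
  "message": "request completed"
}
```

Fluent Bit attaches the cluster/environment labels on top of this; the JSON body itself is what the service writes to stdout.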
FireLens Configuration
Each ECS task definition includes a Fluent Bit sidecar container. The Fluent Bit configuration is set up to:
- Add cluster and environment labels
- Exclude logs containing sensitive keywords (password, api_key, secret, token)
- Format logs as JSON for Loki
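A sketch of what that Fluent Bit configuration could look like, using the stock `record_modifier` and `grep` filters and the `loki` output plugin. All names and values here (cluster name, Loki host, label keys) are illustrative assumptions, not the deployed ReptiDex config:

```ini
[FILTER]
    Name        record_modifier
    Match       *
    Record      cluster     reptidex-dev
    Record      environment dev

[FILTER]
    Name        grep
    Match       *
    # Drop any line whose log field mentions a sensitive keyword
    Exclude     log (password|api_key|secret|token)

[OUTPUT]
    Name        loki
    Match       *
    host        loki.internal
    port        3100
    labels      job=firelens, cluster=reptidex-dev, environment=dev
```

With FireLens, a fragment like this is typically mounted via the task definition's `firelensConfiguration` options.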
Accessing Logs
Grafana Explore
Access Loki logs through Grafana Explore at https://grafana-dev.reptidex.com/explore:
- Select Loki as the data source
- Use the query builder or write LogQL directly
- Select time range (last 5m, 1h, 24h, 7d, etc.)
- Apply filters and run query
Quick Start Examples
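A few starter LogQL queries, assuming the `service` and `environment` labels and JSON fields described in this document are attached:

```logql
# All logs from repti-core in the selected time range
{service="repti-core"}

# Errors only
{service="repti-core"} | json | level="ERROR"

# Trace one request across every service in an environment
{environment="dev"} | json | request_id="<request-id>"

# Per-second rate of error lines over 5-minute windows
rate({service="repti-core"} | json | level="ERROR" [5m])
```

Replace `<request-id>` with the actual request identifier you are tracing.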
Structured Logging Standards
Log Format
All services use structured JSON logging with consistent fields.

Standard Fields
| Field | Type | Required | Description |
|---|---|---|---|
| timestamp | ISO 8601 | Yes | When the log was generated |
| level | String | Yes | DEBUG, INFO, WARN, ERROR, CRITICAL |
| service | String | Yes | Service name (e.g., repti-core) |
| version | String | Yes | Service version |
| environment | String | Yes | dev, staging, production |
| request_id | String | Conditional | Unique request identifier |
| user_id | String | Optional | Authenticated user ID |
| session_id | String | Optional | User session ID |
| endpoint | String | Conditional | API endpoint path |
| method | String | Conditional | HTTP method |
| status_code | Integer | Conditional | HTTP status code |
| duration_ms | Float | Optional | Request/operation duration |
| error_type | String | Conditional | Error class/type |
| message | String | Yes | Log message |
| stack_trace | String | Optional | Full error stack trace |
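As an illustration of this schema (a sketch, not the actual `repti_telemetry` implementation), a record with the standard fields could be assembled like this:

```python
import json
from datetime import datetime, timezone

LEVELS = {"DEBUG", "INFO", "WARN", "ERROR", "CRITICAL"}

def make_record(level, service, version, environment, message, **optional):
    """Assemble one structured log record and return it as a JSON line.

    Required fields are positional; conditional/optional fields
    (request_id, status_code, duration_ms, ...) go in **optional.
    """
    if level not in LEVELS:
        raise ValueError(f"unknown log level: {level}")
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "service": service,
        "version": version,
        "environment": environment,
        "message": message,
        **optional,
    }
    return json.dumps(record)
```

Writing the returned line to stdout is all a service needs to do; Fluent Bit picks it up from there.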
Log Levels by Environment
Development:
- All levels enabled (DEBUG, INFO, WARN, ERROR, CRITICAL)
- Sample rate: 100%

Staging:
- INFO and above
- Sample rate: 100%

Production:
- INFO and above
- DEBUG logs sampled at 10%
- Sample rate: 10% for DEBUG, 100% for others
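One way the DEBUG sampling could be implemented (a sketch, not the actual `repti_telemetry` code) is a standard-library logging filter that drops a fraction of DEBUG records while passing everything else:

```python
import logging
import random

class DebugSampleFilter(logging.Filter):
    """Pass all records above DEBUG; pass DEBUG only sample_rate of the time."""

    def __init__(self, sample_rate: float = 0.1):
        super().__init__()
        self.sample_rate = sample_rate

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno > logging.DEBUG:
            return True  # INFO and above always pass
        return random.random() < self.sample_rate
```

Attaching this filter to the root handler with `sample_rate=0.1` gives the 10% DEBUG sampling described above.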
Correlation IDs
Request ID Propagation
Request IDs are automatically generated and propagated across service boundaries.

Tracing Requests

To trace a request across all services, filter logs on its request_id field in Grafana Explore.

PII Filtering & Security
Automatic Redaction
All services automatically filter PII and sensitive data.

Filtered Fields:
- Passwords (any field containing "password", "passwd", "pwd")
- API keys (any field containing “api_key”, “apikey”, “token”)
- Email addresses (regex pattern match)
- Phone numbers (regex pattern match)
- Credit card numbers (regex pattern match)
- SSN/Tax IDs (regex pattern match)
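The rules above can be sketched as a small redaction function. This is an illustration of the technique only; the actual filter lists and regexes live in the Fluent Bit / `repti_telemetry` configuration:

```python
import re

# Key names that trigger full redaction of the value.
SENSITIVE_KEYS = re.compile(r"password|passwd|pwd|api_key|apikey|token|secret", re.I)

# Value patterns redacted wherever they appear in string fields.
PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),   # email addresses
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),    # credit-card-like digit runs
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),     # US SSN format
]

def redact(record: dict) -> dict:
    """Return a copy of a log record with sensitive keys and values masked."""
    clean = {}
    for key, value in record.items():
        if SENSITIVE_KEYS.search(key):
            clean[key] = "[REDACTED]"
            continue
        if isinstance(value, str):
            for pattern in PATTERNS:
                value = pattern.sub("[REDACTED]", value)
        clean[key] = value
    return clean
```

Note the difference in the two mechanisms: sensitive key names mask the whole value, while regex patterns mask only the matching substring.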
Verification
Test that sensitive data is actually filtered before it reaches Loki, rather than assuming the redaction rules work.

Log Retention Policies
| Log Level | Retention Period | Storage |
|---|---|---|
| DEBUG | 30 days | S3 |
| INFO | 90 days | S3 |
| WARN | 1 year | S3 |
| ERROR | 1 year | S3 |
| CRITICAL | 1 year | S3 |
- Logs are compacted daily by Loki compactor
- Old chunks are moved to long-term S3 storage
- DynamoDB index is pruned according to retention policy
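Per-level retention like the table above maps onto Loki's `retention_stream` rules. The snippet below is a sketch against Loki's configuration schema, assuming `level` is attached as a stream label; it is not the deployed ReptiDex configuration:

```yaml
compactor:
  retention_enabled: true

limits_config:
  retention_period: 365d        # default: WARN/ERROR/CRITICAL keep 1 year
  retention_stream:
    - selector: '{level="DEBUG"}'
      priority: 1
      period: 30d
    - selector: '{level="INFO"}'
      priority: 1
      period: 90d
```

Streams that match no `retention_stream` selector fall back to the global `retention_period`.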
Multi-Tenancy
Loki is configured with multi-tenancy to separate logs by environment; each environment maps to its own tenant, selected with the X-Scope-OrgID header on ingest and query requests.

Performance Considerations
Query Limits
- Max query range: 30 days
- Max query lookback: 7 days (without specifying end time)
- Max entries per query: 5000
- Query timeout: 30 seconds
Caching
Loki caches at several layers:
- Chunk cache: 1 hour
- Results cache: 10 minutes
- Index cache: 5 minutes
Best Practices
- Always use time ranges: Queries without time bounds are slow
- Use label filters first: Apply `{service="repti-core"}` before text search
- Limit result sets: Use `| line_format` and `| limit 100`
- Avoid regex in labels: Use exact matches when possible
- Use metric queries for aggregation: `rate()`, `count_over_time()` instead of `count`
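Putting those practices together, an aggregation is better expressed as a metric query than by pulling raw lines (service and environment values here are illustrative):

```logql
# Slow: broad selector plus full-text search
{environment="production"} |= "error"

# Faster: label match, parse once, aggregate server-side
sum by (service) (
  count_over_time({environment="production"} | json | level="ERROR" [5m])
)
```

The metric form returns a small time series per service instead of streaming every matching line to the browser.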
Common Issues
Logs Not Appearing
- Check Fluent Bit sidecar is running:
- Verify service is logging:
- Check Fluent Bit logs in CloudWatch:
- Verify Loki is reachable:
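Hedged examples of those four checks with the AWS CLI and curl. The cluster, service, log-group, and host names below are placeholders, not the real ReptiDex values:

```shell
# 1. Check the Fluent Bit sidecar is running: list the task's containers.
aws ecs describe-tasks --cluster reptidex-dev \
  --tasks "$(aws ecs list-tasks --cluster reptidex-dev \
      --service-name repti-core --query 'taskArns[0]' --output text)" \
  --query 'tasks[0].containers[].{name:name,status:lastStatus}'

# 2. Verify the service is logging: tail its recent output.
aws logs tail /ecs/repti-core --since 5m

# 3. Check Fluent Bit's own logs in CloudWatch.
aws logs tail /ecs/repti-core-firelens --since 5m

# 4. Verify Loki is reachable from inside the VPC.
curl -fsS http://loki.internal:3100/ready
```

`/ready` is Loki's readiness endpoint; it returns `ready` when the instance can accept traffic.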
Slow Queries
- Reduce time range
- Add more specific label filters
- Use `| json` early in the pipeline
- Limit results with `| limit 100`
Missing Fields
- Verify structured logging is enabled in service
- Check log format is valid JSON
- Ensure the `repti_telemetry` package is up to date
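A quick way to check both "valid JSON" and "required fields present" for a captured log line (a sketch; the field list comes from the standard fields table above):

```python
import json

REQUIRED_FIELDS = {"timestamp", "level", "service", "version", "environment", "message"}

def check_log_line(line: str) -> list:
    """Return a list of problems with one captured log line (empty list = OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(record, dict):
        return ["JSON is not an object"]
    return [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - record.keys())]
```

Run it over a few lines tailed from the service's stdout to see which requirement is being violated.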
Next Steps
- Saved LogQL Queries - Common query templates
- Error Investigation Workflow - Debugging process
- Debugging Playbook - Common scenarios
- Log Correlation - Correlating logs with metrics

