Performance Metrics
This document defines the performance baselines and key metrics for monitoring the reptidex microservices platform.
Overview
All reptidex services expose Prometheus metrics at the /health/metrics endpoint. Prometheus collects these metrics, Grafana visualizes them, and Loki aggregates the logs.
Performance Baselines
HTTP Request Latency
Target latency percentiles for all HTTP endpoints:

| Percentile | Target | Alert Threshold |
|---|---|---|
| p50 (median) | < 100ms | > 200ms |
| p95 | < 250ms | > 500ms |
| p99 | < 500ms | > 1000ms |

Metric: http_request_duration_seconds
Labels: method, endpoint
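
As a rough illustration of how these thresholds might be applied, the sketch below computes a nearest-rank percentile from raw latency samples and compares it with the table. The function names, the sample data, and the "degraded" label for values between the target and the alert threshold are assumptions made for this example and are not part of reptidex. In a live setup the percentiles would typically come from Prometheus rather than application code.

```python
import math

# Target and alert threshold in milliseconds, copied from the table above.
LATENCY_BASELINES_MS = {
    "p50": (100, 200),
    "p95": (250, 500),
    "p99": (500, 1000),
}


def percentile(samples_ms, pct):
    """Nearest-rank percentile of a non-empty list of latencies."""
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def classify_latency(name, observed_ms):
    """Return 'ok', 'degraded' (hypothetical label), or 'alert'."""
    target, alert = LATENCY_BASELINES_MS[name]
    if observed_ms > alert:
        return "alert"
    if observed_ms < target:
        return "ok"
    return "degraded"


samples = [40, 55, 60, 72, 80, 95, 110, 130, 240, 620]  # made-up data
p95 = percentile(samples, 95)
print(p95, classify_latency("p95", p95))
```

With only ten samples the p95 is simply the slowest request, which is why a single outlier is enough to trip the alert here.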
Database Query Performance
Target latency for database operations:

| Query Type | p50 Target | p95 Target | Alert Threshold (p95) |
|---|---|---|---|
| Simple SELECT | < 10ms | < 50ms | > 100ms |
| Complex JOIN | < 50ms | < 200ms | > 500ms |
| INSERT/UPDATE | < 25ms | < 100ms | > 250ms |

Metric: database_query_duration_seconds
Labels: query_type
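
One way to collect per-query durations is to wrap each database call in a timer keyed by query type. The sketch below uses only the standard library and an in-memory dictionary. The `timed_query` helper, the `durations` sink, and the `simple_select` label value are hypothetical; a real service would presumably record into its Prometheus client instead.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory sink: query_type -> list of durations in seconds.
durations = defaultdict(list)


@contextmanager
def timed_query(query_type):
    """Time the wrapped block and record the duration under query_type."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the query raises, so failures still show up.
        durations[query_type].append(time.perf_counter() - start)


with timed_query("simple_select"):
    sum(range(1000))  # stand-in for a real database call

print(durations["simple_select"])
```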
Health Check Performance
Health check endpoints should respond quickly to ensure accurate load balancer routing:

| Check Type | Target | Alert Threshold |
|---|---|---|
| Basic health (/health) | < 5ms | > 50ms |
| Readiness (/health/ready) | < 100ms | > 500ms |
| Deep health (/health/deep) | < 500ms | > 2000ms |

Metric: health_check_duration_seconds
Labels: check_type, dependency
Resource Utilization
Database Connections
Monitor connection pool usage to prevent exhaustion:

| Metric | Normal Range | Warning | Critical |
|---|---|---|---|
| Active connections | < 50% of pool | > 70% | > 90% |
| Connection wait time | < 10ms | > 50ms | > 100ms |

Metrics: database_connections_total, database_connections_in_use
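
The pool thresholds can be read as a simple utilization ratio. The sketch below assumes that database_connections_total reports the pool size and database_connections_in_use the active connections, which this document does not state explicitly. The "elevated" label is a placeholder for the 50% to 70% band that the table leaves undefined.

```python
def pool_status(in_use, total):
    """Classify connection pool utilization against the table above."""
    if total <= 0:
        raise ValueError("pool size must be positive")
    utilization = in_use / total
    if utilization > 0.90:
        return "critical"
    if utilization > 0.70:
        return "warning"
    if utilization < 0.50:
        return "normal"
    return "elevated"  # hypothetical label for the gap between 50% and 70%


for in_use in (4, 12, 16, 19):
    print(in_use, pool_status(in_use, 20))
```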
Concurrent Requests
Track concurrent request processing to identify bottlenecks:

| Service | Normal Range | Warning | Critical |
|---|---|---|---|
| repti-core | < 100 | > 200 | > 500 |
| All other services | < 50 | > 100 | > 250 |

Metric: http_requests_in_progress
Labels: method, endpoint
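
An in-progress metric is a gauge that rises when a request starts and falls when it ends. The following is a minimal thread-safe sketch of that idea; `InProgressGauge` is a hypothetical stand-in and not the class reptidex uses.

```python
import threading
from contextlib import contextmanager


class InProgressGauge:
    """Counts how many tracked blocks are currently executing."""

    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0

    @contextmanager
    def track(self):
        with self._lock:
            self.value += 1
        try:
            yield
        finally:
            # Decrement even when the handler raises, so the gauge cannot drift upward.
            with self._lock:
                self.value -= 1


gauge = InProgressGauge()
with gauge.track():
    print("during request:", gauge.value)
print("after request:", gauge.value)
```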
Error Rates
HTTP Error Rates
Target error rates for production services:

| Error Type | Target Rate | Warning | Critical |
|---|---|---|---|
| 4xx (Client errors) | < 1% | > 5% | > 10% |
| 5xx (Server errors) | < 0.1% | > 1% | > 5% |

Metrics:
- http_requests_total (label: status_code)
- errors_total (labels: error_type, endpoint)
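
To make the percentages concrete, the sketch below derives 4xx and 5xx rates from request counts grouped by status code, which is roughly what a query over http_requests_total by status_code would yield. The function name and the sample counts are invented for illustration.

```python
def error_rates(counts_by_status):
    """Return the share of 4xx and 5xx responses among all requests."""
    total = sum(counts_by_status.values())
    if total == 0:
        return {"4xx": 0.0, "5xx": 0.0}
    client = sum(n for code, n in counts_by_status.items() if 400 <= code < 500)
    server = sum(n for code, n in counts_by_status.items() if 500 <= code < 600)
    return {"4xx": client / total, "5xx": server / total}


counts = {200: 970, 404: 20, 500: 10}  # made-up request counts
rates = error_rates(counts)
print(rates)
```

In this example the 4xx rate is 2% and the 5xx rate is 1%. The 5xx rate misses the 0.1% target but does not exceed the 1% warning threshold, since the warning applies only above 1%.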
Health Check Failures
Services should maintain high availability:

| Metric | Target | Warning |
|---|---|---|
| Health check success rate | > 99.9% | < 99% |
| Consecutive failures | 0 | > 3 |

Metrics:
- health_check_status (1 = healthy, 0 = unhealthy)
- health_check_failures_total (labels: dependency, error_type)
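
The "consecutive failures" warning depends on the order of results, not just their count. A small sketch of that logic, assuming a stream of health_check_status values (1 or 0), might look like this. `FailureStreak` is a hypothetical helper.

```python
class FailureStreak:
    """Tracks consecutive unhealthy results and flags a warning past a limit."""

    def __init__(self, warn_after=3):
        self.warn_after = warn_after
        self.streak = 0

    def record(self, status):
        """Record 1 (healthy) or 0 (unhealthy); return True if warning applies."""
        # Any healthy result resets the streak.
        self.streak = 0 if status == 1 else self.streak + 1
        return self.streak > self.warn_after


tracker = FailureStreak()
results = [tracker.record(s) for s in [0, 0, 0, 1, 0, 0, 0, 0]]
print(results)
```

Only the final result triggers the warning: the first three failures are interrupted by a healthy check, and the threshold is "more than 3" in a row.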
Business Metrics
User Activity
Track user engagement and system usage:

| Metric | Description | Labels |
|---|---|---|
| users_registered_total | Total users registered | - |
| active_users | Currently active users | - |
| vivariums_created_total | Total vivariums created | - |
Request Throughput
Monitor request volume to understand system load:

| Service | Normal RPS | Expected Peak | Alert Threshold |
|---|---|---|---|
| repti-core | 50-200 | 500 | > 1000 |
| All other services | 20-100 | 200 | > 500 |

Metric: http_requests_total (queried with the rate function)
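
Because http_requests_total is a counter, requests per second come from the change between samples, not from the raw value. The sketch below shows the core idea, including handling of a counter reset after a restart. It is a simplification: Prometheus's rate function additionally extrapolates to the edges of the query window. The function name and sample values are assumptions.

```python
def per_second_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset, so the new value counts from zero.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


scrapes = [(0, 100), (15, 1600), (30, 3100)]  # made-up scrape data
print(per_second_rate(scrapes))
```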
Service-Specific Baselines
repti-core Service
The core service handles authentication, configuration, and system-wide operations.

Endpoints:

| Endpoint | p50 | p95 | p99 |
|---|---|---|---|
| GET /health | < 5ms | < 10ms | < 25ms |
| GET /health/ready | < 50ms | < 150ms | < 300ms |
| GET /health/metrics | < 25ms | < 100ms | < 200ms |
Monitoring Best Practices
1. Set Up Alerts
Configure Prometheus alerts for:

- High latency (p95 > threshold for 5 minutes)
- High error rates (> 5% for 2 minutes)
- Database connection exhaustion (> 90% for 1 minute)
- Health check failures (> 3 consecutive failures)
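
Each alert above pairs a condition with a duration, so a brief spike does not page anyone. Prometheus alerting rules express this with a `for` duration. The sketch below models the same behavior in plain code to show what "for 5 minutes" means. `SustainedCondition` and the timestamps are hypothetical.

```python
class SustainedCondition:
    """Fires only after a condition has held continuously for hold_seconds."""

    def __init__(self, hold_seconds):
        self.hold_seconds = hold_seconds
        self.breach_started = None

    def update(self, now, breached):
        """Feed one evaluation; return True when the alert should fire."""
        if not breached:
            self.breach_started = None  # any recovery restarts the clock
            return False
        if self.breach_started is None:
            self.breach_started = now
        return now - self.breach_started >= self.hold_seconds


high_latency = SustainedCondition(hold_seconds=300)  # 5 minutes
for t, breached in [(0, True), (200, True), (300, True), (310, False), (320, True)]:
    print(t, high_latency.update(t, breached))
```

The alert fires at t=300 once the breach has lasted five minutes, clears at t=310, and starts counting again from t=320.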
2. Dashboard Organization
Organize Grafana dashboards by:

- Overview: Key metrics across all services
- Service-specific: Detailed metrics per microservice
- Infrastructure: Database, Redis, system resources
- Business: User activity, feature usage
3. Log Correlation
Use structured logging with consistent fields:

- request_id: Trace requests across services
- user_id: Track user-specific operations
- vivarium_id: Monitor multi-tenant isolation
- session_id: Understand user sessions
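
A minimal sketch of such structured logging, using only Python's standard logging module, is shown below. It emits one JSON object per log line and copies the four correlation fields when they are supplied through `extra`. The formatter class, logger name, and field values are illustrative assumptions and do not describe the reptidex logging setup.

```python
import json
import logging

CORRELATION_FIELDS = ("request_id", "user_id", "vivarium_id", "session_id")


class JsonFormatter(logging.Formatter):
    """Render each record as a JSON object with any correlation fields present."""

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        for field in CORRELATION_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("reptidex.example")  # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Fields passed via `extra` become attributes on the log record.
logger.info("vivarium created", extra={"request_id": "req-123", "vivarium_id": "viv-42"})
```

Keeping the field names identical across services is what makes it possible to filter logs in Loki by a single request_id.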
4. Baseline Updates
Review and update baselines:

- Weekly: Check for performance degradation trends
- Monthly: Update baselines based on actual usage patterns
- Quarterly: Reassess thresholds and alert configurations
Metric Labels
Standard Labels
All metrics should include standard labels where applicable:
HTTP Metrics Labels
Database Metrics Labels
Error Metrics Labels
Querying Metrics
Common Prometheus Queries
Request rate (requests per second):
Troubleshooting Performance Issues
High Latency
- Check database query performance
- Review concurrent request count
- Examine database connection pool
- Check for external service dependencies
- Review application logs for errors
High Error Rate
- Check application logs with error filters
- Review recent deployments
- Verify database connectivity
- Check dependency health
- Review authentication/authorization issues
Resource Exhaustion
- Monitor database connection pool
- Check memory usage
- Review request queue depth
- Examine slow query logs
- Check for connection leaks

