
Performance Metrics

This document defines the performance baselines and key metrics for monitoring the reptidex microservices platform.

Overview

All reptidex services expose Prometheus metrics at the /health/metrics endpoint. Prometheus scrapes these metrics, Grafana visualizes them, and Loki aggregates the service logs.

Performance Baselines

HTTP Request Latency

Target latency percentiles for all HTTP endpoints:
| Percentile | Target | Alert Threshold |
|---|---|---|
| p50 (median) | < 100ms | > 200ms |
| p95 | < 250ms | > 500ms |
| p99 | < 500ms | > 1000ms |

Metric: http_request_duration_seconds
Labels: method, endpoint
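
As a rough illustration of how the alert thresholds above could be checked against raw latency samples, the following Python sketch computes nearest-rank percentiles and reports which ones breach their threshold. The function names and the sample data are hypothetical. Prometheus itself estimates quantiles from histogram buckets, so its results will differ slightly from this exact calculation.

```python
import math

# Alert thresholds in seconds, taken from the latency table above.
ALERT_THRESHOLDS_S = {50: 0.200, 95: 0.500, 99: 1.000}


def percentile(samples, pct):
    """Return the nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank method: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def breached_percentiles(samples):
    """List the percentiles whose observed latency exceeds its alert threshold."""
    return [
        pct
        for pct, limit in ALERT_THRESHOLDS_S.items()
        if percentile(samples, pct) > limit
    ]


# Hypothetical sample: 90 fast requests (80 ms) and 10 slow ones (700 ms).
durations = [0.080] * 90 + [0.700] * 10
print(breached_percentiles(durations))
```

With this sample only the p95 threshold is breached: the median stays at 80 ms, and the p99 value of 700 ms is still below the 1000 ms limit.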

Database Query Performance

Target latency for database operations:
| Query Type | p50 Target | p95 Target | Alert Threshold (p95) |
|---|---|---|---|
| Simple SELECT | < 10ms | < 50ms | > 100ms |
| Complex JOIN | < 50ms | < 200ms | > 500ms |
| INSERT/UPDATE | < 25ms | < 100ms | > 250ms |

Metric: database_query_duration_seconds
Labels: query_type

Health Check Performance

Health check endpoints should respond quickly to ensure accurate load balancer routing:
| Check Type | Target | Alert Threshold |
|---|---|---|
| Basic health (/health) | < 5ms | > 50ms |
| Readiness (/health/ready) | < 100ms | > 500ms |
| Deep health (/health/deep) | < 500ms | > 2000ms |

Metric: health_check_duration_seconds
Labels: check_type, dependency

Resource Utilization

Database Connections

Monitor connection pool usage to prevent exhaustion:
| Metric | Normal Range | Warning | Critical |
|---|---|---|---|
| Active connections | < 50% of pool | > 70% | > 90% |
| Connection wait time | < 10ms | > 50ms | > 100ms |

Metrics:
  • database_connections_total
  • database_connections_in_use
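
A minimal sketch of how the two gauges could be combined into a severity level, using the active-connection thresholds above. The function names are hypothetical, and the "elevated" label for the 50% to 70% band is an assumption, since the table does not name that range.

```python
def pool_utilization(in_use, total):
    """Return pool utilization as a percentage of the pool size."""
    if total <= 0:
        raise ValueError("total must be positive")
    return in_use * 100 / total


def classify_pool(in_use, total):
    """Map utilization onto the thresholds from the table above."""
    pct = pool_utilization(in_use, total)
    if pct > 90:
        return "critical"
    if pct > 70:
        return "warning"
    if pct < 50:
        return "normal"
    # Assumption: the table leaves 50-70% unnamed, so it is reported separately here.
    return "elevated"


# in_use mirrors database_connections_in_use, total mirrors database_connections_total.
print(classify_pool(in_use=15, total=20))
```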

Concurrent Requests

Track concurrent request processing to identify bottlenecks:
| Service | Normal Range | Warning | Critical |
|---|---|---|---|
| repti-core | < 100 | > 200 | > 500 |
| All services | < 50 | > 100 | > 250 |

Metric: http_requests_in_progress
Labels: method, endpoint

Error Rates

HTTP Error Rates

Target error rates for production services:
| Error Type | Target Rate | Warning | Critical |
|---|---|---|---|
| 4xx (Client errors) | < 1% | > 5% | > 10% |
| 5xx (Server errors) | < 0.1% | > 1% | > 5% |

Metrics:
  • http_requests_total (label: status_code)
  • errors_total (labels: error_type, endpoint)

Health Check Failures

Services should maintain high availability:
| Metric | Target | Warning |
|---|---|---|
| Health check success rate | > 99.9% | < 99% |
| Consecutive failures | 0 | > 3 |

Metrics:
  • health_check_status (1 = healthy, 0 = unhealthy)
  • health_check_failures_total (labels: dependency, error_type)
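
The consecutive-failure rule can be sketched as a small counter that resets on every successful check. The class below is a hypothetical illustration, not the services' actual implementation. It treats each recorded result like one sample of health_check_status.

```python
class HealthCheckTracker:
    """Tracks health check outcomes for a single service instance."""

    def __init__(self, max_consecutive_failures=3):
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0
        self.successes = 0
        self.total = 0

    def record(self, healthy):
        """Record one check result (True = healthy, False = unhealthy)."""
        self.total += 1
        if healthy:
            self.successes += 1
            self.consecutive_failures = 0  # any success breaks the streak
        else:
            self.consecutive_failures += 1

    def success_rate(self):
        """Success rate in percent; 100.0 before any check has run."""
        if self.total == 0:
            return 100.0
        return self.successes * 100 / self.total

    def should_warn(self):
        """True once the failure streak exceeds the limit (> 3 by default)."""
        return self.consecutive_failures > self.max_consecutive_failures


tracker = HealthCheckTracker()
for result in [True, False, False, False, False]:
    tracker.record(result)
print(tracker.consecutive_failures, tracker.should_warn(), tracker.success_rate())
```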

Business Metrics

User Activity

Track user engagement and system usage:
| Metric | Description | Labels |
|---|---|---|
| users_registered_total | Total users registered | - |
| active_users | Currently active users | - |
| vivariums_created_total | Total vivariums created | - |

Request Throughput

Monitor request volume to understand system load:
| Service | Normal RPS | Expected Peak | Alert Threshold |
|---|---|---|---|
| repti-core | 50-200 | 500 | > 1000 |
| All services | 20-100 | 200 | > 500 |

Metric: http_requests_total (rate function)
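
Because http_requests_total is a counter, requests per second come from how fast it increases, not from its absolute value. The sketch below shows the idea on raw samples and is a simplification: the function name and sample values are hypothetical, and the real Prometheus rate() function additionally extrapolates to the edges of the time window.

```python
def per_second_rate(samples):
    """Average per-second increase of a counter.

    samples is a list of (timestamp_seconds, counter_value) pairs, oldest first.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # A drop means the counter was reset (for example by a restart),
            # so everything counted since the reset is new traffic.
            increase += current
    return increase / elapsed


# Hypothetical scrapes one minute apart: 6,000 requests in 60 seconds.
print(per_second_rate([(0, 1000), (60, 7000)]))
```

A result of 100 requests per second would sit inside the normal range listed for repti-core.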

Service-Specific Baselines

repti-core Service

The core service handles authentication, configuration, and system-wide operations.

Endpoints:
| Endpoint | p50 | p95 | p99 |
|---|---|---|---|
| GET /health | < 5ms | < 10ms | < 25ms |
| GET /health/ready | < 50ms | < 150ms | < 300ms |
| GET /health/metrics | < 25ms | < 100ms | < 200ms |

Monitoring Best Practices

1. Set Up Alerts

Configure Prometheus alerts for:
  • High latency (p95 > threshold for 5 minutes)
  • High error rates (> 5% for 2 minutes)
  • Database connection exhaustion (> 90% for 1 minute)
  • Health check failures (> 3 consecutive failures)
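
Each "for N minutes" condition above means the alert should fire only when the threshold has been exceeded continuously for that long, which filters out short spikes. In Prometheus this is normally expressed with the for clause of an alerting rule. The Python class below is only a hypothetical sketch of those semantics, not a replacement for rule configuration.

```python
class SustainedCondition:
    """Fires only after a condition has held continuously for hold_seconds."""

    def __init__(self, hold_seconds):
        self.hold_seconds = hold_seconds
        self._breached_since = None  # timestamp of the first breach in the current streak

    def update(self, breached, now):
        """Feed one evaluation result; return True if the alert should fire."""
        if not breached:
            self._breached_since = None  # any healthy evaluation resets the timer
            return False
        if self._breached_since is None:
            self._breached_since = now
        return now - self._breached_since >= self.hold_seconds


# Example: error rate above 5% must persist for 2 minutes (120 seconds).
alert = SustainedCondition(hold_seconds=120)
for timestamp, error_rate_pct in [(0, 6.0), (60, 7.5), (120, 8.0)]:
    firing = alert.update(error_rate_pct > 5.0, now=timestamp)
    print(timestamp, firing)
```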

2. Dashboard Organization

Organize Grafana dashboards by:
  • Overview: Key metrics across all services
  • Service-specific: Detailed metrics per microservice
  • Infrastructure: Database, Redis, system resources
  • Business: User activity, feature usage

3. Log Correlation

Use structured logging with consistent fields:
  • request_id: Trace requests across services
  • user_id: Track user-specific operations
  • vivarium_id: Monitor multi-tenant isolation
  • session_id: Understand user sessions
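
One possible way to emit these fields with Python's standard logging module is a formatter that serializes each record as JSON and copies the correlation fields when they are present. The formatter class, logger name, and example IDs below are hypothetical. The four field names come from the list above.

```python
import json
import logging
import sys

# Correlation fields from the list above.
CORRELATION_FIELDS = ("request_id", "user_id", "vivarium_id", "session_id")


class JsonFormatter(logging.Formatter):
    """Formats each log record as a single JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for field in CORRELATION_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        return json.dumps(payload)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("repti-core.example")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info(
    "vivarium updated",
    extra={"request_id": "req-123", "vivarium_id": "viv-42"},
)
```

Keeping the field names identical across services is what makes it possible to filter on a single request_id in Loki and follow one request end to end.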

4. Baseline Updates

Review and update baselines:
  • Weekly: Check for performance degradation trends
  • Monthly: Update baselines based on actual usage patterns
  • Quarterly: Reassess thresholds and alert configurations

Metric Labels

Standard Labels

All metrics should include standard labels where applicable:
service="repti-core"
environment="production"
version="1.0.0"

HTTP Metrics Labels

method="GET|POST|PUT|DELETE"
endpoint="/api/v1/resource/{id}"
status_code="200|400|500"

Database Metrics Labels

query_type="select|insert|update|delete"
table="table_name"

Error Metrics Labels

error_type="ValidationError|DatabaseError|AuthenticationError"
endpoint="/api/v1/resource"

Querying Metrics

Common Prometheus Queries

Request rate (requests per second):
rate(http_requests_total[5m])
p95 latency:
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
Error rate percentage (sum() aggregates across label sets so that 5xx traffic is divided by all traffic):
(sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
Database connection utilization:
(database_connections_in_use / database_connections_total) * 100
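
These queries can also be run outside Grafana through the Prometheus HTTP API, which serves instant queries at /api/v1/query. The sketch below uses only the Python standard library. The base URL is an assumption (a local Prometheus on its default port), and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumption: adjust to your environment


def build_query_url(base_url, promql):
    """Build the instant-query URL for a PromQL expression."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def instant_query(base_url, promql, timeout=5.0):
    """Run an instant query and return the list of result series."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {body.get('error')}")
    return body["data"]["result"]


# Only the URL is printed here; call instant_query() against a reachable server.
print(build_query_url(PROMETHEUS_URL, "rate(http_requests_total[5m])"))
```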

Troubleshooting Performance Issues

High Latency

  1. Check database query performance
  2. Review concurrent request count
  3. Examine database connection pool
  4. Check for external service dependencies
  5. Review application logs for errors

High Error Rate

  1. Check application logs with error filters
  2. Review recent deployments
  3. Verify database connectivity
  4. Check dependency health
  5. Review authentication/authorization issues

Resource Exhaustion

  1. Monitor database connection pool
  2. Check memory usage
  3. Review request queue depth
  4. Examine slow query logs
  5. Check for connection leaks