Performance Metrics

This document defines the performance baselines and key metrics for monitoring the reptidex microservices platform.

Overview

All reptidex services expose Prometheus metrics at the /health/metrics endpoint. Prometheus scrapes these metrics and Grafana visualizes them; logs are aggregated separately in Loki.

Performance Baselines

HTTP Request Latency

Target latency percentiles for all HTTP endpoints:

Percentile	Target	Alert Threshold
p50 (median)	< 100ms	> 200ms
p95	< 250ms	> 500ms
p99	< 500ms	> 1000ms

Metric: http_request_duration_seconds
Labels: method, endpoint
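As a rough illustration of how the latency table could be applied, the sketch below computes p50, p95, and p99 from raw latency samples and compares each against its target and alert threshold. The names `LATENCY_SLO` and `classify_latency`, and the "degraded" state for values between target and alert threshold, are assumptions made for this example; in production the percentiles would normally come from the http_request_duration_seconds histogram rather than from raw samples.

```python
from statistics import quantiles

# (target, alert threshold) in seconds, copied from the latency table.
LATENCY_SLO = {
    "p50": (0.100, 0.200),
    "p95": (0.250, 0.500),
    "p99": (0.500, 1.000),
}


def percentiles(samples):
    """Return p50, p95 and p99 of raw latency samples given in seconds."""
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def classify_latency(samples):
    """Map each percentile to 'ok', 'degraded' or 'alert'."""
    result = {}
    for name, value in percentiles(samples).items():
        target, alert = LATENCY_SLO[name]
        if value > alert:
            result[name] = "alert"
        elif value < target:
            result[name] = "ok"
        else:
            # Hypothetical middle state: above target, below alert threshold.
            result[name] = "degraded"
    return result
```

A service whose samples are all around 50ms would classify as "ok" for every percentile, while a slow tail (for example 10% of requests taking 2s) would leave p50 "ok" but push p95 and p99 to "alert".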

Database Query Performance

Target latency for database operations:

Query Type	p50 Target	p95 Target	Alert Threshold (p95)
Simple SELECT	< 10ms	< 50ms	> 100ms
Complex JOIN	< 50ms	< 200ms	> 500ms
INSERT/UPDATE	< 25ms	< 100ms	> 250ms

Metric: database_query_duration_seconds
Labels: query_type

Health Check Performance

Health check endpoints should respond quickly to ensure accurate load balancer routing:

Check Type	Target	Alert Threshold
Basic health (`/health`)	< 5ms	> 50ms
Readiness (`/health/ready`)	< 100ms	> 500ms
Deep health (`/health/deep`)	< 500ms	> 2000ms

Metric: health_check_duration_seconds
Labels: check_type, dependency

Resource Utilization

Database Connections

Monitor connection pool usage to prevent exhaustion:

Metric	Normal Range	Warning	Critical
Active connections	< 50% of pool	> 70%	> 90%
Connection wait time	< 10ms	> 50ms	> 100ms

Metrics:

database_connections_total
database_connections_in_use
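The connection thresholds can be expressed as a small classifier over the two gauges listed above. This is a minimal sketch: `pool_severity` is a hypothetical helper, and because the table does not say how to treat utilization between 50% and 70%, the example assumes anything at or below the warning threshold is simply "ok".

```python
def pool_utilization(in_use, total):
    """Percentage of the connection pool currently in use."""
    if total <= 0:
        raise ValueError("pool size must be positive")
    return 100.0 * in_use / total


def pool_severity(in_use, total):
    """Classify pool usage with the warning (> 70%) and critical (> 90%) thresholds."""
    pct = pool_utilization(in_use, total)
    if pct > 90:
        return "critical"
    if pct > 70:
        return "warning"
    return "ok"  # assumption: everything up to 70% is treated as normal
```

Here `in_use` and `total` correspond to the current values of database_connections_in_use and database_connections_total.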

Concurrent Requests

Track concurrent request processing to identify bottlenecks:

Service	Normal Range	Warning	Critical
repti-core	< 100	> 200	> 500
All other services	< 50	> 100	> 250

Metric: http_requests_in_progress
Labels: method, endpoint

Error Rates

HTTP Error Rates

Target error rates for production services:

Error Type	Target Rate	Warning	Critical
4xx (Client errors)	< 1%	> 5%	> 10%
5xx (Server errors)	< 0.1%	> 1%	> 5%

Metrics:

http_requests_total (label: status_code)
errors_total (labels: error_type, endpoint)
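To make the error-rate definitions concrete, the following sketch derives 4xx and 5xx percentages from per-status request counts over some window. The input shape (a mapping of integer status code to count) and the function name `error_rates` are assumptions for illustration; in practice the counts would be derived from http_requests_total grouped by status_code.

```python
def error_rates(counts_by_status):
    """Return 4xx and 5xx error rates as percentages of all requests.

    counts_by_status maps an integer HTTP status code to the number of
    requests observed with that status during the window.
    """
    total = sum(counts_by_status.values())
    if total == 0:
        # No traffic: report zero rather than dividing by zero.
        return {"4xx": 0.0, "5xx": 0.0}
    client = sum(n for code, n in counts_by_status.items() if 400 <= code < 500)
    server = sum(n for code, n in counts_by_status.items() if 500 <= code < 600)
    return {"4xx": 100.0 * client / total, "5xx": 100.0 * server / total}
```

Note that the denominator is all requests, not only failed ones, which is what makes the result comparable with the percentage thresholds in the table.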

Health Check Failures

Services should maintain high availability:

Metric	Target	Warning
Health check success rate	> 99.9%	< 99%
Consecutive failures	0	> 3

Metrics:

health_check_status (1 = healthy, 0 = unhealthy)
health_check_failures_total (labels: dependency, error_type)
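The two health check targets (success rate and consecutive failures) can be tracked together. The class below is only a sketch of that bookkeeping; `HealthTracker` and its method names are hypothetical and not part of any reptidex service.

```python
class HealthTracker:
    """Tracks health check outcomes for a single service."""

    def __init__(self, max_consecutive_failures=3, min_success_rate=99.0):
        self.max_consecutive_failures = max_consecutive_failures
        self.min_success_rate = min_success_rate
        self.total = 0
        self.successes = 0
        self.consecutive_failures = 0

    def record(self, healthy):
        """Record one check result (True = healthy, False = unhealthy)."""
        self.total += 1
        if healthy:
            self.successes += 1
            self.consecutive_failures = 0  # any success resets the streak
        else:
            self.consecutive_failures += 1

    def success_rate(self):
        """Success rate as a percentage; 100 when nothing was recorded yet."""
        return 100.0 * self.successes / self.total if self.total else 100.0

    def should_warn(self):
        """True when either warning condition from the table is met."""
        return (
            self.consecutive_failures > self.max_consecutive_failures
            or self.success_rate() < self.min_success_rate
        )
```

With the default thresholds, three failures in a row do not trigger a warning on their own, but a fourth does, matching the "> 3" condition.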

Business Metrics

User Activity

Track user engagement and system usage:

Metric	Description	Labels
`users_registered_total`	Total users registered	-
`active_users`	Currently active users	-
`vivariums_created_total`	Total vivariums created	-

Request Throughput

Monitor request volume to understand system load:

Service	Normal RPS	Expected Peak	Alert Threshold
repti-core	50-200	500	> 1000
All other services	20-100	200	> 500

Metric: http_requests_total (rate function)
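For readers unfamiliar with how a per-second rate is derived from a monotonically increasing counter, here is a simplified sketch. It is not Prometheus's exact `rate()` algorithm (which also extrapolates to the window boundaries); it only shows the core idea of summing increases and treating a drop in value as a counter reset. The function name `per_second_rate` is an assumption.

```python
def per_second_rate(samples):
    """Average per-second increase of a counter.

    samples is a time-ordered list of (timestamp_seconds, counter_value).
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # The counter went backwards, so assume a restart reset it to zero.
            increase += value
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0
```

The resulting requests-per-second figure is what the Normal RPS and Alert Threshold columns refer to.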

Service-Specific Baselines

repti-core Service

The core service handles authentication, configuration, and system-wide operations. Target latencies for its endpoints:

Endpoint	p50	p95	p99
`GET /health`	< 5ms	< 10ms	< 25ms
`GET /health/ready`	< 50ms	< 150ms	< 300ms
`GET /health/metrics`	< 25ms	< 100ms	< 200ms

Monitoring Best Practices

1. Set Up Alerts

Configure Prometheus alerts for:

High latency (p95 > threshold for 5 minutes)
High error rates (> 5% for 2 minutes)
Database connection exhaustion (> 90% for 1 minute)
Health check failures (> 3 consecutive failures)
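Each condition above pairs a threshold with a duration: the alert should fire only if the breach persists for the whole period. The sketch below illustrates that "sustained for N minutes" logic on a list of timestamped samples. It is a simplified model in the spirit of the `for` clause in Prometheus alerting rules, not a reimplementation of it, and `is_firing` is a hypothetical name.

```python
def is_firing(samples, threshold, hold_seconds):
    """Return True if the value has stayed above threshold long enough.

    samples is a time-ordered list of (timestamp_seconds, value). The alert
    fires when every sample since some point in time exceeds the threshold
    and that unbroken streak spans at least hold_seconds.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # streak begins here
        else:
            breach_start = None  # any recovery resets the streak
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= hold_seconds
```

For the error-rate alert, this would be called with `threshold=5` and `hold_seconds=120`; a single sample back under 5% resets the timer, which keeps short spikes from paging anyone.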

2. Dashboard Organization

Organize Grafana dashboards by:

Overview: Key metrics across all services
Service-specific: Detailed metrics per microservice
Infrastructure: Database, Redis, system resources
Business: User activity, feature usage

3. Log Correlation

Use structured logging with consistent fields:

request_id: Trace requests across services
user_id: Track user-specific operations
vivarium_id: Monitor multi-tenant isolation
session_id: Understand user sessions
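One way to emit these fields consistently is a JSON formatter for Python's standard `logging` module, sketched below. The formatter class and the exact set of top-level keys are assumptions for illustration; the services' actual logging setup is not described in this document.

```python
import json
import logging

# Correlation fields named in the list above.
CORRELATION_FIELDS = ("request_id", "user_id", "vivarium_id", "session_id")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Always emit every correlation field so log queries can rely on
        # the keys being present; missing values are logged as null.
        for field in CORRELATION_FIELDS:
            payload[field] = getattr(record, field, None)
        return json.dumps(payload)
```

Fields are attached per call through the standard `extra` argument, for example `logger.info("vivarium created", extra={"request_id": "req-1", "vivarium_id": "viv-9"})`, after installing the formatter on a handler with `handler.setFormatter(JsonFormatter())`.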

4. Baseline Updates

Review and update baselines:

Weekly: Check for performance degradation trends
Monthly: Update baselines based on actual usage patterns
Quarterly: Reassess thresholds and alert configurations

Metric Labels

Standard Labels

All metrics should include standard labels where applicable:

service="repti-core"
environment="production"
version="1.0.0"

HTTP Metrics Labels

method="GET|POST|PUT|DELETE"
endpoint="/api/v1/resource/{id}"
status_code="200|400|500"
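The endpoint label uses a route template ("/api/v1/resource/{id}") rather than the raw path, which keeps the number of distinct label values bounded. If a service only has the raw request path available, a normalization step along these lines could produce the template. This is a sketch under the assumption that identifiers are either numeric or UUIDs; `endpoint_label` is a hypothetical helper, and frameworks that expose the matched route template make it unnecessary.

```python
import re

# Matches a canonical 8-4-4-4-12 hexadecimal UUID.
_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def endpoint_label(path):
    """Replace identifier-like path segments with the placeholder {id}."""
    path = path.split("?", 1)[0]  # drop any query string
    segments = []
    for segment in path.split("/"):
        if segment.isdigit() or _UUID.match(segment):
            segments.append("{id}")
        else:
            segments.append(segment)
    return "/".join(segments)
```

Without such normalization, every distinct resource ID would create a new time series for each HTTP metric.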

Database Metrics Labels

query_type="select|insert|update|delete"
table="table_name"

Error Metrics Labels

error_type="ValidationError|DatabaseError|AuthenticationError"
endpoint="/api/v1/resource"

Querying Metrics

Common Prometheus Queries

Request rate (requests per second):

rate(http_requests_total[5m])

p95 latency:

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

Error rate percentage:

(sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100

Database connection utilization:

(database_connections_in_use / database_connections_total) * 100
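To clarify what the p95 query computes, the sketch below reproduces the basic idea behind `histogram_quantile`: find the cumulative bucket that contains the requested rank, then interpolate linearly inside it. It assumes non-negative observations (so the first bucket starts at 0) and classic cumulative buckets that include a +Inf bucket. It is an illustration of the technique, not Prometheus's full implementation, and the bucket boundaries in the usage note are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs and must
    include the open-ended bucket (math.inf, total_count).
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must include a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # The quantile lies beyond the last finite bound, so the
                # best available answer is that highest finite bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan
```

For example, with cumulative buckets `[(0.1, 50), (0.25, 90), (0.5, 98), (1.0, 100), (math.inf, 100)]`, the p95 rank of 95 falls in the 0.25s to 0.5s bucket and interpolates to roughly 0.406s. The estimate can only be as precise as the bucket boundaries, so buckets should be placed near the latency targets being monitored.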

Troubleshooting Performance Issues

High Latency

Check database query performance
Review concurrent request count
Examine database connection pool
Check for external service dependencies
Review application logs for errors

High Error Rate

Check application logs with error filters
Review recent deployments
Verify database connectivity
Check dependency health
Review authentication/authorization issues

Resource Exhaustion

Monitor database connection pool
Check memory usage
Review request queue depth
Examine slow query logs
Check for connection leaks
