Monitoring & Observability
Complete monitoring and observability strategy for ReptiDex, covering
infrastructure metrics, application performance monitoring, logging,
alerting, and business metrics across all environments.
Monitoring Stack Overview
Three-Tier Monitoring Strategy
Infrastructure Layer : AWS CloudWatch for infrastructure metrics, ECS Fargate, RDS, ElastiCache, and ALB monitoring with Container Insights and custom dashboards
Application Layer : CloudWatch Container Insights for container-level metrics, distributed tracing, and real-time error tracking via CloudWatch Logs
Business Layer : Custom Metrics via CloudWatch for user engagement, feature usage, and business KPIs with automated reporting
ECS Fargate Monitoring : All services run on serverless Fargate containers
with ARM64 Graviton2 processors. Container Insights provides task-level CPU,
memory, network, and storage metrics without agent installation.
Category | Tool | Purpose | Cost
Infrastructure | AWS CloudWatch | Infrastructure metrics, logs, alarms, dashboards | $0-15/month
APM | Grafana | Application performance, error tracking, distributed tracing | Cost of ECS Service
Error Tracking | Loki | Real-time error monitoring, performance issues | Cost of ECS Service
Uptime | UptimeRobot | External uptime monitoring, status pages | $7-58/month
Infrastructure Monitoring
CloudWatch Metrics Configuration
ECS Fargate Metrics
Database Metrics
Network & Load Balancer
Container and Task Monitoring
ECS Task Metrics (1-minute intervals via Container Insights):
Task CPU Utilization : CPU usage per task (% of allocated vCPU)
Task Memory Utilization : Memory usage per task (% of allocated memory)
Network Bytes : Network I/O per task (bytes in/out)
Task Count : Running tasks per service, desired vs actual
Container-Level Metrics :
Container CPU : Per-container CPU usage within tasks
Container Memory : Memory usage, limits, and OOM events
Container Restarts : Container restart count and reasons
Container Health : Health check status (healthy/unhealthy)
ECS Service Metrics :
Service CPU : Aggregate CPU across all tasks in service
Service Memory : Aggregate memory usage per service
Deployment Status : Running/pending/desired task counts
Service Events : Task start/stop events, health changes
ECS Cluster Metrics :
Cluster CPU Reservation : Total vCPU reserved by tasks
Cluster Memory Reservation : Total memory reserved by tasks
Task Placement : Distribution across availability zones
Cluster Scaling : Auto-scaling activity and trends
Critical Thresholds :
Task CPU : Warning >70%, Critical >90%
Task Memory : Warning >80%, Critical >95%
Task Restarts : Warning >3/hour, Critical >10/hour
Unhealthy Tasks : Critical if >50% tasks unhealthy
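The thresholds above can be expressed as a small lookup table. The following is a minimal sketch, not the project's actual alerting code: the metric keys and function names are hypothetical, and it assumes that a reading exactly at a threshold does not trigger it, since the list uses ">".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float


# Values copied from the Critical Thresholds list; the key names are hypothetical.
THRESHOLDS = {
    "task_cpu_percent": Threshold(warning=70, critical=90),
    "task_memory_percent": Threshold(warning=80, critical=95),
    "task_restarts_per_hour": Threshold(warning=3, critical=10),
}


def classify(metric: str, value: float) -> str:
    """Return "ok", "warning", or "critical" for a single metric reading."""
    threshold = THRESHOLDS[metric]
    if value > threshold.critical:
        return "critical"
    if value > threshold.warning:
        return "warning"
    return "ok"


def unhealthy_tasks_critical(unhealthy: int, total: int) -> bool:
    """Critical when more than 50% of a service's tasks are unhealthy."""
    return total > 0 and unhealthy / total > 0.5
```

For example, classify("task_cpu_percent", 75) falls in the warning band, while a service with 3 of 5 tasks unhealthy is treated as critical.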
Infrastructure Dashboards
CloudWatch Dashboard Structure
System Health : Overall status indicators
Performance KPIs : Response time, uptime, error rates
Cost Metrics : Monthly spend, resource utilization
Business Metrics : Active users, feature usage
Containers : ECS Fargate tasks, service health, cluster metrics
Database : RDS, ElastiCache detailed metrics
Network : ALB, CloudFront, VPC metrics
Security : Failed login attempts, blocked requests
Grafana APM Integration
Application Metrics
Service Monitoring
Business Metrics
Core Application Monitoring
Response Time Tracking :
Web Transactions : Response time by endpoint
Database Queries : Query execution time, N+1 queries
External Services : API call response times
Background Jobs : Job processing duration
Throughput Metrics :
Requests per Minute : Traffic volume by endpoint
Database Operations : Queries per second by type
Cache Operations : Redis get/set operations per second
File Operations : S3 upload/download rates
Error Tracking :
Exception Rate : Errors per minute/hour
Error Types : Categorized error analysis
Error Trends : Error rate over time
User Impact : Users affected by errors
Resource Utilization :
Memory Usage : Heap usage, garbage collection
CPU Usage : Application CPU consumption
Thread Pool : Thread utilization and blocking
Connection Pools : Database connection usage
Real-time Error Monitoring
Sentry Error Tracking Setup
JavaScript Errors : Unhandled exceptions and Promise rejections
React Component Errors : Component render and lifecycle errors
Network Errors : Failed API calls and timeout errors
User Context : User ID, session info, browser details
Python Exceptions : Unhandled application errors
Database Errors : Connection failures, query errors
API Errors : Validation errors, authentication failures
Background Job Failures : Celery task failures
Centralized Logging
CloudWatch Logs Configuration
Application Logs
Infrastructure Logs
Log Analysis
Structured Application Logging
Log Levels and Routing :
DEBUG : Development environment only, detailed execution flow
INFO : General application events, user actions, system events
WARNING : Non-critical issues, deprecated features usage
ERROR : Application errors, failed operations, retryable failures
CRITICAL : System failures, security incidents, data corruption
Service-Specific Log Groups :
/ecs/dev-reptidex-core
/ecs/dev-reptidex-animal
/ecs/dev-reptidex-commerce
/ecs/dev-reptidex-community
/ecs/dev-reptidex-media
/ecs/dev-reptidex-ops
/ecs/dev-reptidex-public
/ecs/dev-reptidex-admin
/ecs/dev-reptidex-breeder
/ecs/dev-reptidex-embed
Structured Log Format :
{
  "timestamp": "2025-01-15T10:30:00Z",
  "level": "INFO",
  "service": "repti-animal",
  "user_id": "user-12345",
  "session_id": "sess-67890",
  "request_id": "req-abcdef",
  "message": "Animal record created",
  "context": {
    "animal_id": "animal-xyz",
    "species": "ball_python",
    "action": "create_record"
  }
}
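One way to emit records in this shape from a Python service is a custom formatter for the standard logging module. This is a sketch under assumptions: the class name JsonFormatter is hypothetical, and it assumes callers pass user_id, session_id, request_id, and context through the logger's extra argument.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": self.service,
            # Fields passed via `extra=` become attributes on the record.
            "user_id": getattr(record, "user_id", None),
            "session_id": getattr(record, "session_id", None),
            "request_id": getattr(record, "request_id", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(payload)


def build_logger(service: str) -> logging.Logger:
    """Attach a stream handler that writes JSON lines to stderr."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as build_logger("repti-animal").info("Animal record created", extra={"user_id": "user-12345", "context": {"animal_id": "animal-xyz"}}) would then write one JSON line that a log pipeline can parse by field.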
Alerting Strategy
Alert Severity Levels
Critical Alerts : Immediate Response Required
Application completely down
Database unavailable
Security breach detected
Data corruption identified
Warning Alerts : Business Hours Response
Performance degradation
High error rates
Resource utilization spikes
Failed background jobs
Info Alerts : Informational Only
Scheduled maintenance
Deployment notifications
Usage milestones reached
Resource scaling events
Alert Configuration Matrix
Metric | Warning | Critical | Duration | Notification
CPU Utilization | >70% | >90% | 5 minutes | Email/SMS
Memory Usage | >80% | >95% | 3 minutes | Email/SMS
Response Time | >3 seconds | >10 seconds | 10 minutes | Email/Slack
Error Rate | >5% | >15% | 5 minutes | Email/SMS
Database Connections | >80 connections | >95 connections | 2 minutes | Email/SMS
Disk Space | <20% free | <10% free | 1 minute | Email/SMS
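As an illustration of how the CPU and memory rows might map to CloudWatch alarms, the sketch below builds parameter dictionaries shaped like the inputs of the CloudWatch PutMetricAlarm operation. It only constructs the dictionaries and does not call AWS. The alarm naming scheme, the 60-second period, and the cluster and service arguments are assumptions, not values from this document.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    metric_name: str
    warning: float
    critical: float
    duration_minutes: int


# CPU and memory rows from the matrix, using the AWS/ECS metric names.
RULES = [
    Rule("CPUUtilization", 70, 90, 5),
    Rule("MemoryUtilization", 80, 95, 3),
]


def alarm_params(rule: Rule, severity: str, cluster: str, service: str,
                 period_seconds: int = 60) -> dict:
    """Build PutMetricAlarm-style parameters for one rule and severity."""
    if severity not in ("warning", "critical"):
        raise ValueError(f"unknown severity: {severity}")
    threshold = rule.critical if severity == "critical" else rule.warning
    return {
        # Hypothetical naming scheme: <service>-<metric>-<severity>.
        "AlarmName": f"{service}-{rule.metric_name}-{severity}",
        "Namespace": "AWS/ECS",
        "MetricName": rule.metric_name,
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "ServiceName", "Value": service},
        ],
        "Statistic": "Average",
        "Period": period_seconds,
        # The matrix's "Duration" becomes consecutive breaching periods.
        "EvaluationPeriods": rule.duration_minutes * 60 // period_seconds,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }
```

With boto3 installed and credentials configured, a dictionary like this could be passed as keyword arguments to a CloudWatch client's put_metric_alarm call, typically with an AlarmActions entry pointing at the notification topic.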
Notification Channels
Primary Channels
Escalation Matrix
On-Call Schedule
Email Notifications :
Primary Contact : Immediate alerts for all critical issues
Distribution List : Team notifications for warnings
Escalation : Manager notifications for unresolved criticals
Formatting : Rich HTML with graphs and context links
SMS/Text Alerts :
Critical Only : Production-down scenarios
After Hours : Critical alerts outside business hours
Rate Limiting : Max 3 SMS per hour to prevent spam
Acknowledgment : SMS reply to acknowledge receipt
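The "Max 3 SMS per hour" rule can be enforced with a sliding-window limiter. This is a minimal sketch, assuming a single process and a rolling one-hour window (the document does not say whether the hour is rolling or fixed); the class name SmsRateLimiter is hypothetical.

```python
import time
from collections import deque


class SmsRateLimiter:
    """Allow at most `max_messages` sends within any `window_seconds` span."""

    def __init__(self, max_messages: int = 3, window_seconds: float = 3600,
                 clock=time.monotonic):
        self.max_messages = max_messages
        self.window_seconds = window_seconds
        self.clock = clock
        self._sent = deque()  # timestamps of sends still inside the window

    def allow(self) -> bool:
        """Return True and record the send if the limit permits it."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) >= self.max_messages:
            return False
        self._sent.append(now)
        return True
```

A sender would call allow() before dispatching each SMS and fall back to another channel, such as email or Slack, when it returns False.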
Slack Integration :
#alerts Channel : All alerts with severity indicators
#dev-team Channel : Development and staging alerts
#management Channel : Business impact alerts
Bot Commands : /acknowledge, /resolve, /escalate
Mobile Push Notifications :
PagerDuty App : On-call rotation management
Custom App : Team member direct notifications
Escalation Path : Auto-escalate unacknowledged alerts
Service Level Objectives
ReptiDex SLA Targets
Production Uptime : 99.9% (8.77 hours downtime/year)
API Availability : 99.95% (4.38 hours downtime/year)
Database Uptime : 99.99% (52.6 minutes downtime/year)
CDN Availability : 99.9% (geographic redundancy)
Page Load Time : < 3 seconds (95th percentile)
API Response : < 500ms (average), < 2s (95th percentile)
Database Queries : < 100ms (average), < 1s (99th percentile)
Image Upload : < 10s for 5MB files
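The downtime figures attached to the uptime targets follow from simple arithmetic. The sketch below reproduces them, assuming a year of 365.25 days (8,766 hours), which is the year length that matches the quoted numbers; the function name is illustrative.

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766 hours; assumed year length


def allowed_downtime_hours(availability_percent: float) -> float:
    """Hours of downtime per year permitted by an availability target."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for target in (99.9, 99.95, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target}% -> {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

Rounded, this gives 8.77 hours for 99.9%, 4.38 hours for 99.95%, and 52.6 minutes for 99.99%.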
Response Times
Throughput Targets
Error Rate Thresholds
Web Application Response Times :
Homepage : < 1.5s load time, < 500ms API calls
Animal Records : < 2s load time, < 300ms search queries
Image Gallery : < 3s initial load, < 1s subsequent images
User Dashboard : < 2s load time, < 200ms status updates
API Endpoint Performance :
Authentication : < 200ms login, < 100ms token validation
CRUD Operations : < 300ms creates, < 100ms reads
Search Queries : < 500ms simple search, < 2s complex filters
File Uploads : < 5s for images, < 15s for videos
Database Query Performance :
Simple Selects : < 50ms average response time
Complex Joins : < 200ms for multi-table queries
Aggregations : < 500ms for reporting queries
Bulk Operations : < 5s for batch inserts/updates
External Service Integration :
Payment Processing : < 3s for transaction completion
Email Delivery : < 10s for transactional emails
Image Processing : < 30s for resize and optimization
Backup Operations : < 2 hours for full database backup
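Several targets in this section are stated as percentiles, for example API Response < 2s at the 95th percentile. A minimal sketch of checking collected latency samples against such a target follows; it uses the nearest-rank percentile definition, which is an assumption, since monitoring tools may interpolate differently.

```python
import math


def percentile(samples, pct: float) -> float:
    """Nearest-rank percentile: smallest value covering pct% of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_target(samples_ms, pct: float, target_ms: float) -> bool:
    """True when the pct-th percentile latency is below the target."""
    return percentile(samples_ms, pct) < target_ms
```

For instance, 20 API samples with 19 at 120 ms and one at 2,500 ms still meet the 95th-percentile target of 2,000 ms, because the single outlier sits above the 95th percentile.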
Monitoring Automation
Automated Response Actions
Auto-Remediation Workflows
Trigger ECS Service Auto Scaling → Increase task count → Clear application cache → Alert team
Kill idle connections → Restart affected tasks → Scale RDS if needed → Alert database team
Force task restart → Update task definition with more memory → Scale out tasks → Alert ops team
ALB removes task from rotation → ECS starts replacement task → Deployment circuit breaker triggers → Escalate immediately
Monitoring Cost Optimization
Cost Management
ROI Tracking
Monitoring Budget Control
CloudWatch Cost Optimization :
Log Retention : 30 days for debug logs, 1 year for errors
Metric Resolution : 5-minute standard, 1-minute for critical only
Dashboard Optimization : Minimize widget count and refresh rates
Alert Tuning : Regular review to eliminate false positives
Third-party Tool Budgets :
Grafana Cloud : Start with Free tier, upgrade to Pro ($49/month)
Prometheus : Self-hosted (infrastructure costs only)
Loki : Self-hosted (infrastructure costs only)
Scaling Cost Strategy :
Volume Discounts : Negotiate annual contracts at scale
Feature Utilization : Regular audit of unused features
Data Retention : Automated archival of old monitoring data
Regional Optimization : Use cost-effective AWS regions
External Uptime Monitoring
UptimeRobot Configuration
Why External Monitoring?
While CloudWatch provides comprehensive internal monitoring, external uptime monitoring with UptimeRobot adds a critical layer of observability from outside your infrastructure. This catches issues that internal monitoring might miss, such as:
DNS resolution failures
SSL/TLS certificate expiration
CDN or edge location issues
Complete AWS region outages
Network routing problems
Load balancer misconfigurations
Monitor Configuration
Production Services
Health Check Types
Alert Configuration
Status Page
All Production Endpoints Monitored
Backend API Services (6 microservices):
Core API : https://api.reptidex.com/api/v1/health
Animal API : https://animal-api.reptidex.com/api/v1/health
Commerce API : https://commerce-api.reptidex.com/api/v1/health
Media API : https://media-api.reptidex.com/api/v1/health
Community API : https://community-api.reptidex.com/api/v1/health
Ops API : https://ops-api.reptidex.com/api/v1/health
Frontend Applications (4 web apps):
Public Website : https://reptidex.com
Breeder Dashboard : https://app.reptidex.com
Admin Portal : https://admin.reptidex.com
Embeddable Widgets : https://embed.reptidex.com
Check Frequency : Every 5 minutes
Alert Threshold : Down for >2 minutes (1 failed check)
Multi-location : US East, US West, EU West, Asia Pacific
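For a quick manual check of the same endpoints, independent of UptimeRobot, something like the following can be run from a workstation. It is a sketch only: it assumes a 2xx status means healthy, lists just two of the endpoints, and the function names are hypothetical.

```python
import urllib.error
import urllib.request

# Two of the production health endpoints listed above.
ENDPOINTS = {
    "Core API": "https://api.reptidex.com/api/v1/health",
    "Animal API": "https://animal-api.reptidex.com/api/v1/health",
}


def http_status(url: str, timeout: float = 10.0):
    """Return the HTTP status code, or None if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered with a 4xx/5xx status
    except OSError:
        return None  # DNS failure, refused connection, timeout, TLS error


def is_up(url: str, fetch=http_status) -> bool:
    """Treat any 2xx response as healthy (an assumption of this sketch)."""
    status = fetch(url)
    return status is not None and 200 <= status < 300


def down_services(endpoints: dict, fetch=http_status) -> list:
    """Names of services whose health endpoint is not responding with 2xx."""
    return [name for name, url in endpoints.items() if not is_up(url, fetch)]
```

The fetch parameter exists so the logic can be exercised without network access; calling down_services(ENDPOINTS) with the default performs real requests.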
Setup Instructions
Quick Setup Guide
Step 1: Create UptimeRobot Account
# Sign up at https://uptimerobot.com
# Recommended plan: Pro ($58/month) for 1-minute intervals and SMS alerts
# Generate API key from My Settings > API Settings
Step 2: Run Automated Setup Script
# Navigate to monitoring scripts directory
cd infrastructure/monitoring/scripts
# Set API key environment variable (no spaces around "=")
export UPTIMEROBOT_API_KEY="your-api-key-here"
# Run setup script for all environments
./setup-uptime-monitors.sh
# Or set up specific environment
./setup-uptime-monitors.sh --environment prod
Step 3: Configure Notification Channels
Add email contacts in UptimeRobot dashboard
Configure SMS numbers for critical alerts
Set up Slack webhook integration
Test notification delivery
Step 4: Set Up Public Status Page
Go to Public Status Pages in dashboard
Create new status page
Add custom domain: status.reptidex.com
Configure DNS CNAME record
Customize branding and colors
Step 5: Configure Maintenance Windows
Schedule weekly database backups (Sunday 2 AM EST)
Set monthly system updates (15th, 3 AM EST)
Pause relevant monitors during maintenance
Step 6: Review and Test
Verify all monitors are active
Test alert delivery (use pause/unpause feature)
Review response time baselines
Adjust thresholds as needed
Monitor Management
Configuration as Code
Daily Operations
Incident Response
SLA Tracking
Infrastructure as Code Approach
Configuration File : infrastructure/monitoring/uptime-robot-config.yaml
This YAML file defines all monitors, alert contacts, and status page configuration. Benefits:
Version Control : Track changes to monitoring configuration
Reproducibility : Easily recreate monitors in new accounts
Documentation : Self-documenting infrastructure
Automation : Scripted setup and updates
Key Sections :
config:
  account:              # Account settings
  status_page:          # Public status page config
  notifications:        # Alert contact configuration
  monitors:             # All monitor definitions
    - name: "[PROD] ReptiDex Core API"
      url: "https://api.reptidex.com/api/v1/health"
      interval: 5
      # ... additional settings
  maintenance_windows:  # Scheduled downtime
  sla_targets:          # SLA compliance tracking
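Keeping monitors in a file makes it possible to check them before they are applied. The sketch below validates monitor entries after the YAML has been parsed into Python dictionaries by whatever loader the setup script uses. The field names name, url, and interval come from the excerpt above; the validation rules themselves are assumptions and not part of the documented setup script.

```python
from urllib.parse import urlparse


def validate_monitor(monitor: dict) -> list[str]:
    """Return a list of problems found in one monitor definition."""
    errors = []

    name = monitor.get("name")
    if not isinstance(name, str) or not name.strip():
        errors.append("name must be a non-empty string")

    url = monitor.get("url")
    parsed = urlparse(url if isinstance(url, str) else "")
    if parsed.scheme != "https" or not parsed.netloc:
        errors.append("url must be an absolute https URL")

    interval = monitor.get("interval")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(interval, bool) or not isinstance(interval, int) or interval < 1:
        errors.append("interval must be a positive integer")

    return errors


def validate_monitors(monitors: list) -> dict:
    """Map each invalid monitor's name (or list index) to its problems."""
    problems = {}
    for index, monitor in enumerate(monitors):
        errors = validate_monitor(monitor)
        if errors:
            problems[monitor.get("name") or f"monitor[{index}]"] = errors
    return problems
```

Running a check like this in CI before the setup script would catch a mistyped URL or missing interval before it reaches the UptimeRobot account.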
Cost Optimization
Plan | Monitors | Interval | Features | Cost
Free | 50 | 5 min | Email alerts, 2-month logs | $0
Pro | 200 | 1 min | SMS, phone, 3-month logs | $58/mo
Business | 500 | 1 min | All features, 1-year logs | $118/mo
Recommended Plan : Start with Free for development/staging, upgrade to Pro for production when launching.
Cost Savings Tips :
Use 5-minute intervals for non-critical services
Consolidate dev/staging monitors
Leverage maintenance windows to prevent false alerts
Use keyword monitors instead of multiple HTTP checks
Review and remove unused monitors monthly
This comprehensive monitoring strategy provides ReptiDex with enterprise-grade
observability while remaining cost-effective for a startup. The three-tier
approach ensures complete visibility from infrastructure through business
metrics, with intelligent alerting that minimizes false positives while
guaranteeing rapid response to critical issues. External uptime monitoring with
UptimeRobot adds a critical safety net, ensuring service availability is
verified from outside your infrastructure.