
Monitoring & Observability

Complete monitoring and observability strategy for ReptiDex, covering infrastructure metrics, application performance monitoring, logging, alerting, and business metrics across all environments.

Monitoring Stack Overview

Three-Tier Monitoring Strategy

Infrastructure Layer

AWS CloudWatch for infrastructure metrics, ECS Fargate, RDS, ElastiCache, and ALB monitoring with Container Insights and custom dashboards

Application Layer

CloudWatch Container Insights for container-level metrics, distributed tracing, and real-time error tracking via CloudWatch Logs

Business Layer

Custom Metrics via CloudWatch for user engagement, feature usage, and business KPIs with automated reporting

ECS Fargate Monitoring: All services run on serverless Fargate containers with ARM64 Graviton2 processors. Container Insights provides task-level CPU, memory, network, and storage metrics without agent installation.

Monitoring Tools Matrix

Category | Tool | Purpose | Cost
Infrastructure | AWS CloudWatch | Infrastructure metrics, logs, alarms, dashboards | $0-15/month
APM | Grafana | Application performance, error tracking, distributed tracing | Cost of ECS Service
Error Tracking | Loki | Real-time error monitoring, performance issues | Cost of ECS Service
Uptime | UptimeRobot | External uptime monitoring, status pages | $7-58/month

Infrastructure Monitoring

CloudWatch Metrics Configuration

  • ECS Fargate Metrics
  • Database Metrics
  • Network & Load Balancer

Container and Task Monitoring

ECS Task Metrics (1-minute intervals via Container Insights):
  • Task CPU Utilization: CPU usage per task (% of allocated vCPU)
  • Task Memory Utilization: Memory usage per task (% of allocated memory)
  • Network Bytes: Network I/O per task (bytes in/out)
  • Task Count: Running tasks per service, desired vs actual
Container-Level Metrics:
  • Container CPU: Per-container CPU usage within tasks
  • Container Memory: Memory usage, limits, and OOM events
  • Container Restarts: Container restart count and reasons
  • Container Health: Health check status (healthy/unhealthy)
ECS Service Metrics:
  • Service CPU: Aggregate CPU across all tasks in service
  • Service Memory: Aggregate memory usage per service
  • Deployment Status: Running/pending/desired task counts
  • Service Events: Task start/stop events, health changes
ECS Cluster Metrics:
  • Cluster CPU Reservation: Total vCPU reserved by tasks
  • Cluster Memory Reservation: Total memory reserved by tasks
  • Task Placement: Distribution across availability zones
  • Cluster Scaling: Auto-scaling activity and trends
Critical Thresholds:
  • Task CPU: Warning >70%, Critical >90%
  • Task Memory: Warning >80%, Critical >95%
  • Task Restarts: Warning >3/hour, Critical >10/hour
  • Unhealthy Tasks: Critical if >50% tasks unhealthy
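
The thresholds above can be expressed as a small lookup, which is handy when the same limits need to be shared between dashboards, alarms, and tests. The following is a minimal sketch, not ReptiDex's actual implementation; the metric keys (`task_cpu_percent` and so on) and the function names are hypothetical, while the numeric limits are copied from the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float


# Limits copied from the Critical Thresholds list; the keys are hypothetical names.
THRESHOLDS = {
    "task_cpu_percent": Threshold(warning=70, critical=90),
    "task_memory_percent": Threshold(warning=80, critical=95),
    "task_restarts_per_hour": Threshold(warning=3, critical=10),
}


def classify(metric: str, value: float) -> str:
    """Return "ok", "warning", or "critical" for a metric reading.

    Comparisons are strict, matching the ">" in the thresholds list.
    """
    threshold = THRESHOLDS[metric]
    if value > threshold.critical:
        return "critical"
    if value > threshold.warning:
        return "warning"
    return "ok"


def unhealthy_tasks_critical(unhealthy: int, total: int) -> bool:
    """Critical when more than 50% of a service's tasks are unhealthy."""
    return total > 0 and unhealthy / total > 0.5
```

Because the comparisons are strict, a reading of exactly 90% CPU is still a warning; only values above 90% are critical.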

Infrastructure Dashboards

CloudWatch Dashboard Structure

  • System Health: Overall status indicators
  • Performance KPIs: Response time, uptime, error rates
  • Cost Metrics: Monthly spend, resource utilization
  • Business Metrics: Active users, feature usage
  • Containers: ECS Fargate tasks, service health, cluster metrics
  • Database: RDS, ElastiCache detailed metrics
  • Network: ALB, CloudFront, VPC metrics
  • Security: Failed login attempts, blocked requests

Application Performance Monitoring

Grafana APM Integration

  • Application Metrics
  • Service Monitoring
  • Business Metrics

Core Application Monitoring

Response Time Tracking:
  • Web Transactions: Response time by endpoint
  • Database Queries: Query execution time, N+1 queries
  • External Services: API call response times
  • Background Jobs: Job processing duration
Throughput Metrics:
  • Requests per Minute: Traffic volume by endpoint
  • Database Operations: Queries per second by type
  • Cache Operations: Redis get/set operations per second
  • File Operations: S3 upload/download rates
Error Tracking:
  • Exception Rate: Errors per minute/hour
  • Error Types: Categorized error analysis
  • Error Trends: Error rate over time
  • User Impact: Users affected by errors
Resource Utilization:
  • Memory Usage: Heap usage, garbage collection
  • CPU Usage: Application CPU consumption
  • Thread Pool: Thread utilization and blocking
  • Connection Pools: Database connection usage

Real-time Error Monitoring

Sentry Error Tracking Setup

  • JavaScript Errors: Unhandled exceptions and Promise rejections
  • React Component Errors: Component render and lifecycle errors
  • Network Errors: Failed API calls and timeout errors
  • User Context: User ID, session info, browser details
  • Python Exceptions: Unhandled application errors
  • Database Errors: Connection failures, query errors
  • API Errors: Validation errors, authentication failures
  • Background Job Failures: Celery task failures

Centralized Logging

CloudWatch Logs Configuration

  • Application Logs
  • Infrastructure Logs
  • Log Analysis

Structured Application Logging

Log Levels and Routing:
  • DEBUG: Development environment only, detailed execution flow
  • INFO: General application events, user actions, system events
  • WARNING: Non-critical issues, deprecated features usage
  • ERROR: Application errors, failed operations, retryable failures
  • CRITICAL: System failures, security incidents, data corruption
Service-Specific Log Groups:
/ecs/dev-reptidex-core
/ecs/dev-reptidex-animal
/ecs/dev-reptidex-commerce
/ecs/dev-reptidex-community
/ecs/dev-reptidex-media
/ecs/dev-reptidex-ops
/ecs/dev-reptidex-public
/ecs/dev-reptidex-admin
/ecs/dev-reptidex-breeder
/ecs/dev-reptidex-embed
Structured Log Format:
{
  "timestamp": "2025-01-15T10:30:00Z",
  "level": "INFO",
  "service": "repti-animal",
  "user_id": "user-12345",
  "session_id": "sess-67890",
  "request_id": "req-abcdef",
  "message": "Animal record created",
  "context": {
    "animal_id": "animal-xyz",
    "species": "ball_python",
    "action": "create_record"
  }
}
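
One way to emit entries in this shape from a Python service is a custom `logging.Formatter`. The sketch below uses only the standard library and assumes the per-request fields (`user_id`, `session_id`, `request_id`, `context`) are passed through the logger's `extra` argument; the class and function names are illustrative, not taken from the ReptiDex codebase.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": self.service,
            # Fields supplied via `extra=` become attributes on the record.
            "user_id": getattr(record, "user_id", None),
            "session_id": getattr(record, "session_id", None),
            "request_id": getattr(record, "request_id", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry)


def build_logger(service: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("repti-animal")
    log.info(
        "Animal record created",
        extra={
            "user_id": "user-12345",
            "request_id": "req-abcdef",
            "context": {"animal_id": "animal-xyz", "action": "create_record"},
        },
    )
```

Writing one JSON object per line to stdout or stderr lets the ECS log driver forward entries to the service's CloudWatch log group without further parsing configuration.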

Alerting Strategy

Alert Severity Levels

Critical Alerts

Immediate Response Required
  • Application completely down
  • Database unavailable
  • Security breach detected
  • Data corruption identified

Warning Alerts

Business Hours Response
  • Performance degradation
  • High error rates
  • Resource utilization spikes
  • Failed background jobs

Info Alerts

Informational Only
  • Scheduled maintenance
  • Deployment notifications
  • Usage milestones reached
  • Resource scaling events

Alert Configuration Matrix

Metric | Warning | Critical | Duration | Notification
CPU Utilization | >70% | >90% | 5 minutes | Email/SMS
Memory Usage | >80% | >95% | 3 minutes | Email/SMS
Response Time | >3 seconds | >10 seconds | 10 minutes | Email/Slack
Error Rate | >5% | >15% | 5 minutes | Email/SMS
Database Connections | >80 connections | >95 connections | 2 minutes | Email/SMS
Disk Space | < 20% free | < 10% free | 1 minute | Email/SMS
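
As an illustration of how a matrix row might translate into a CloudWatch alarm, the sketch below builds the keyword arguments for the CloudWatch `put_metric_alarm` API for the CPU and memory rows. It assumes the alarms target the standard `AWS/ECS` service metrics (`CPUUtilization`, `MemoryUtilization`) and that "Duration" maps to consecutive 1-minute evaluation periods. The cluster name, service name, and SNS topic ARN in the usage example are placeholders, and the function name is hypothetical.

```python
# Rows copied from the Alert Configuration Matrix. The "metric" values are the
# standard AWS/ECS service metric names; mapping them this way is an assumption.
MATRIX = {
    "CPU Utilization": {"metric": "CPUUtilization", "warning": 70.0, "critical": 90.0, "minutes": 5},
    "Memory Usage": {"metric": "MemoryUtilization", "warning": 80.0, "critical": 95.0, "minutes": 3},
}


def alarm_params(row: str, severity: str, cluster: str, service: str, topic_arn: str) -> dict:
    """Build put_metric_alarm keyword arguments for one matrix row and severity."""
    spec = MATRIX[row]
    return {
        "AlarmName": f"{service}-{spec['metric']}-{severity}",
        "Namespace": "AWS/ECS",
        "MetricName": spec["metric"],
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "ServiceName", "Value": service},
        ],
        "Statistic": "Average",
        "Period": 60,  # seconds per datapoint
        "EvaluationPeriods": spec["minutes"],  # consecutive breaching minutes
        "Threshold": spec[severity],
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],  # SNS topic that fans out to email/SMS
    }


if __name__ == "__main__":
    params = alarm_params(
        "CPU Utilization",
        "critical",
        cluster="example-cluster",  # placeholder
        service="dev-reptidex-core",
        topic_arn="arn:aws:sns:us-east-1:123456789012:example-alerts",  # placeholder
    )
    print(params)
```

With boto3 installed and credentials configured, the dictionary can be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`. Keeping the matrix as data means the warning and critical alarms for a service come from the same row.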

Notification Channels

  • Primary Channels
  • Escalation Matrix
  • On-Call Schedule

Immediate Notification Setup

Email Notifications:
  • Primary Contact: Immediate alerts for all critical issues
  • Distribution List: Team notifications for warnings
  • Escalation: Manager notifications for unresolved criticals
  • Formatting: Rich HTML with graphs and context links
SMS/Text Alerts:
  • Critical Only: Production-down scenarios
  • After Hours: Critical alerts outside business hours
  • Rate Limiting: Max 3 SMS per hour to prevent spam
  • Acknowledgment: SMS reply to acknowledge receipt
Slack Integration:
  • #alerts Channel: All alerts with severity indicators
  • #dev-team Channel: Development and staging alerts
  • #management Channel: Business impact alerts
  • Bot Commands: /acknowledge, /resolve, /escalate
Mobile Push Notifications:
  • PagerDuty App: On-call rotation management
  • Custom App: Team member direct notifications
  • Escalation Path: Auto-escalate unacknowledged alerts

Performance Baselines & SLAs

Service Level Objectives

ReptiDex SLA Targets

  • Production Uptime: 99.9% (8.77 hours downtime/year)
  • API Availability: 99.95% (4.38 hours downtime/year)
  • Database Uptime: 99.99% (52.6 minutes downtime/year)
  • CDN Availability: 99.9% (geographic redundancy)
  • Page Load Time: < 3 seconds (95th percentile)
  • API Response: < 500ms (average), < 2s (95th percentile)
  • Database Queries: < 100ms (average), < 1s (99th percentile)
  • Image Upload: < 10s for 5MB files

Baseline Performance Metrics

  • Response Times
  • Throughput Targets
  • Error Rate Thresholds

Expected Performance Baselines

Web Application Response Times:
  • Homepage: < 1.5s load time, < 500ms API calls
  • Animal Records: < 2s load time, < 300ms search queries
  • Image Gallery: < 3s initial load, < 1s subsequent images
  • User Dashboard: < 2s load time, < 200ms status updates
API Endpoint Performance:
  • Authentication: < 200ms login, < 100ms token validation
  • CRUD Operations: < 300ms creates, < 100ms reads
  • Search Queries: < 500ms simple search, < 2s complex filters
  • File Uploads: < 5s for images, < 15s for videos
Database Query Performance:
  • Simple Selects: < 50ms average response time
  • Complex Joins: < 200ms for multi-table queries
  • Aggregations: < 500ms for reporting queries
  • Bulk Operations: < 5s for batch inserts/updates
External Service Integration:
  • Payment Processing: < 3s for transaction completion
  • Email Delivery: < 10s for transactional emails
  • Image Processing: < 30s for resize and optimization
  • Backup Operations: < 2 hours for full database backup

Monitoring Automation

Automated Response Actions

Auto-Remediation Workflows

Trigger ECS Service Auto Scaling → Increase task count → Clear application cache → Alert team

Kill idle connections → Restart affected tasks → Scale RDS if needed → Alert database team

Force task restart → Update task definition with more memory → Scale out tasks → Alert ops team

ALB removes task from rotation → ECS starts replacement task → Deployment circuit breaker triggers → Escalate immediately

Monitoring Cost Optimization

  • Cost Management
  • ROI Tracking

Monitoring Budget Control

CloudWatch Cost Optimization:
  • Log Retention: 30 days for debug logs, 1 year for errors
  • Metric Resolution: 5-minute standard, 1-minute for critical only
  • Dashboard Optimization: Minimize widget count and refresh rates
  • Alert Tuning: Regular review to eliminate false positives
Third-party Tool Budgets:
  • Grafana Cloud: Start with Free tier, upgrade to Pro ($49/month)
  • Prometheus: Self-hosted (infrastructure costs only)
  • Loki: Self-hosted (infrastructure costs only)
Scaling Cost Strategy:
  • Volume Discounts: Negotiate annual contracts at scale
  • Feature Utilization: Regular audit of unused features
  • Data Retention: Automated archival of old monitoring data
  • Regional Optimization: Use cost-effective AWS regions

External Uptime Monitoring

UptimeRobot Configuration

Why External Monitoring?

While CloudWatch provides comprehensive internal monitoring, external uptime monitoring with UptimeRobot adds a critical layer of observability from outside your infrastructure. This catches issues that internal monitoring might miss, such as:

  • DNS resolution failures
  • SSL/TLS certificate expiration
  • CDN or edge location issues
  • Complete AWS region outages
  • Network routing problems
  • Load balancer misconfigurations

Monitor Configuration

  • Production Services
  • Health Check Types
  • Alert Configuration
  • Status Page

All Production Endpoints Monitored

Backend API Services (6 microservices):
  • Core API: https://api.reptidex.com/api/v1/health
  • Animal API: https://animal-api.reptidex.com/api/v1/health
  • Commerce API: https://commerce-api.reptidex.com/api/v1/health
  • Media API: https://media-api.reptidex.com/api/v1/health
  • Community API: https://community-api.reptidex.com/api/v1/health
  • Ops API: https://ops-api.reptidex.com/api/v1/health
Frontend Applications (4 web apps):
  • Public Website: https://reptidex.com
  • Breeder Dashboard: https://app.reptidex.com
  • Admin Portal: https://admin.reptidex.com
  • Embeddable Widgets: https://embed.reptidex.com
Check Frequency: Every 5 minutes
Alert Threshold: Down for >2 minutes (1 failed check)
Multi-location: US East, US West, EU West, Asia Pacific
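
For ad hoc verification, the same kind of HTTP probe that an external monitor performs can be approximated with the Python standard library. This is a rough sketch for manual spot checks, not a replacement for UptimeRobot: it runs from a single location, and it assumes a healthy endpoint answers with a 2xx status. The function names and the injectable `opener` parameter are illustrative; the URLs are the health endpoints listed above.

```python
import urllib.error
import urllib.request

API_HEALTH_ENDPOINTS = {
    "Core API": "https://api.reptidex.com/api/v1/health",
    "Animal API": "https://animal-api.reptidex.com/api/v1/health",
    "Commerce API": "https://commerce-api.reptidex.com/api/v1/health",
    "Media API": "https://media-api.reptidex.com/api/v1/health",
    "Community API": "https://community-api.reptidex.com/api/v1/health",
    "Ops API": "https://ops-api.reptidex.com/api/v1/health",
}


def is_healthy(url: str, timeout: float = 10.0, opener=urllib.request.urlopen) -> bool:
    """Return True if the endpoint answers with a 2xx status within the timeout."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError (4xx/5xx), and socket timeouts all derive from OSError.
        return False


def check_all(endpoints: dict, opener=urllib.request.urlopen) -> dict:
    """Probe every endpoint and map its name to a healthy/unhealthy flag."""
    return {name: is_healthy(url, opener=opener) for name, url in endpoints.items()}


if __name__ == "__main__":
    for name, healthy in check_all(API_HEALTH_ENDPOINTS).items():
        print(f"{name}: {'UP' if healthy else 'DOWN'}")
```

DNS failures, expired certificates, and connection timeouts all surface here as an unhealthy result, which is the class of problem external monitoring is meant to catch.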

Setup Instructions

Quick Setup Guide

Step 1: Create UptimeRobot Account
# Sign up at https://uptimerobot.com
# Recommended plan: Pro ($58/month) for 1-minute intervals and SMS alerts
# Generate API key from My Settings > API Settings
Step 2: Run Automated Setup Script
# Navigate to monitoring scripts directory
cd infrastructure/monitoring/scripts

# Set API key environment variable
export UPTIMEROBOT_API_KEY="your-api-key-here"

# Run setup script for all environments
./setup-uptime-monitors.sh

# Or set up specific environment
./setup-uptime-monitors.sh --environment prod
Step 3: Configure Notification Channels
  • Add email contacts in UptimeRobot dashboard
  • Configure SMS numbers for critical alerts
  • Set up Slack webhook integration
  • Test notification delivery
Step 4: Set Up Public Status Page
  • Go to Public Status Pages in dashboard
  • Create new status page
  • Add custom domain: status.reptidex.com
  • Configure DNS CNAME record
  • Customize branding and colors
Step 5: Configure Maintenance Windows
  • Schedule weekly database backups (Sunday 2 AM EST)
  • Set monthly system updates (15th, 3 AM EST)
  • Pause relevant monitors during maintenance
Step 6: Review and Test
  • Verify all monitors are active
  • Test alert delivery (use pause/unpause feature)
  • Review response time baselines
  • Adjust thresholds as needed

Monitor Management

  • Configuration as Code
  • Daily Operations
  • Incident Response
  • SLA Tracking

Infrastructure as Code Approach

Configuration File: infrastructure/monitoring/uptime-robot-config.yaml
This YAML file defines all monitors, alert contacts, and status page configuration. Benefits:
  • Version Control: Track changes to monitoring configuration
  • Reproducibility: Easily recreate monitors in new accounts
  • Documentation: Self-documenting infrastructure
  • Automation: Scripted setup and updates
Key Sections:
config:
  account: # Account settings
  status_page: # Public status page config
  notifications: # Alert contact configuration

monitors: # All monitor definitions
  - name: "[PROD] ReptiDex Core API"
    url: "https://api.reptidex.com/api/v1/health"
    interval: 5
    # ... additional settings

maintenance_windows: # Scheduled downtime
sla_targets: # SLA compliance tracking

Cost Optimization

Plan | Monitors | Interval | Features | Cost
Free | 50 | 5 min | Email alerts, 2-month logs | $0
Pro | 200 | 1 min | SMS, phone, 3-month logs | $58/mo
Business | 500 | 1 min | All features, 1-year logs | $118/mo
Recommended Plan: Start with Free for development/staging, upgrade to Pro for production when launching.
Cost Savings Tips:
  • Use 5-minute intervals for non-critical services
  • Consolidate dev/staging monitors
  • Leverage maintenance windows to prevent false alerts
  • Use keyword monitors instead of multiple HTTP checks
  • Review and remove unused monitors monthly

This monitoring strategy gives ReptiDex visibility from infrastructure through business metrics at a cost suited to a startup. Alerting is tuned to limit false positives while keeping response to critical issues fast, and external uptime monitoring with UptimeRobot verifies service availability from outside the infrastructure.