Monitoring & Observability

Complete monitoring and observability strategy for ReptiDex, covering infrastructure metrics, application performance monitoring, logging, alerting, and business metrics across all environments.

Monitoring Stack Overview

Three-Tier Monitoring Strategy

Infrastructure Layer

AWS CloudWatch monitors ECS Fargate, RDS, ElastiCache, and the ALB, using Container Insights and custom dashboards for infrastructure metrics

Application Layer

CloudWatch Container Insights for container-level metrics, distributed tracing, and real-time error tracking via CloudWatch Logs

Business Layer

Custom Metrics via CloudWatch for user engagement, feature usage, and business KPIs with automated reporting
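
One way to publish a custom business metric from a container is the CloudWatch Embedded Metric Format (EMF), where a structured JSON log line carries the metric. The sketch below only builds and prints such a record. The namespace `ReptiDex/Business`, the metric name `AnimalRecordsCreated`, and the `Service` dimension are illustrative assumptions, not names taken from the ReptiDex codebase, and the sketch assumes the task's stdout is shipped to CloudWatch Logs and processed as EMF.

```python
import json
import time


def emf_record(namespace, metric, value, unit="Count", dimensions=None, now_ms=None):
    """Build one CloudWatch Embedded Metric Format (EMF) log record."""
    dimensions = dimensions or {}
    record = {
        "_aws": {
            # EMF timestamps are milliseconds since the Unix epoch.
            "Timestamp": now_ms if now_ms is not None else int(time.time() * 1000),
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    "Dimensions": [list(dimensions.keys())],
                    "Metrics": [{"Name": metric, "Unit": unit}],
                }
            ],
        },
        # The metric value lives at the top level, keyed by the metric name.
        metric: value,
    }
    # Dimension values are also top-level members of the record.
    record.update(dimensions)
    return record


if __name__ == "__main__":
    # Hypothetical metric and dimension names, for illustration only.
    line = emf_record(
        "ReptiDex/Business",
        "AnimalRecordsCreated",
        1,
        dimensions={"Service": "reptidex-animal"},
    )
    print(json.dumps(line))
```

Emitting metrics through log lines avoids a synchronous API call on the request path, at the cost of depending on log delivery for metric freshness.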

ECS Fargate Monitoring: All services run on serverless Fargate containers with ARM64 Graviton2 processors. Container Insights provides task-level CPU, memory, network, and storage metrics without agent installation.

Monitoring Tools Matrix

Category	Tool	Purpose	Cost
Infrastructure	AWS CloudWatch	Infrastructure metrics, logs, alarms, dashboards	$0-15/month
APM	Grafana	Application performance, error tracking, distributed tracing	Cost of ECS Service
Error Tracking	Loki	Real-time error monitoring, performance issues	Cost of ECS Service
Uptime	UptimeRobot	External uptime monitoring, status pages	$7-58/month

Infrastructure Monitoring

CloudWatch Metrics Configuration

Container and Task Monitoring

ECS Task Metrics (1-minute intervals via Container Insights):

Task CPU Utilization: CPU usage per task (% of allocated vCPU)
Task Memory Utilization: Memory usage per task (% of allocated memory)
Network Bytes: Network I/O per task (bytes in/out)
Task Count: Running tasks per service, desired vs actual

Container-Level Metrics:

Container CPU: Per-container CPU usage within tasks
Container Memory: Memory usage, limits, and OOM events
Container Restarts: Container restart count and reasons
Container Health: Health check status (healthy/unhealthy)

ECS Service Metrics:

Service CPU: Aggregate CPU across all tasks in service
Service Memory: Aggregate memory usage per service
Deployment Status: Running/pending/desired task counts
Service Events: Task start/stop events, health changes

ECS Cluster Metrics:

Cluster CPU Reservation: Total vCPU reserved by tasks
Cluster Memory Reservation: Total memory reserved by tasks
Task Placement: Distribution across availability zones
Cluster Scaling: Auto-scaling activity and trends

Critical Thresholds:

Task CPU: Warning >70%, Critical >90%
Task Memory: Warning >80%, Critical >95%
Task Restarts: Warning >3/hour, Critical >10/hour
Unhealthy Tasks: Critical if >50% tasks unhealthy
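
The thresholds above can be expressed as a small lookup so that dashboards, alarms, and tests share one definition. This is a minimal sketch: the metric keys (`task_cpu_percent` and so on) are hypothetical names chosen for illustration, and only the numeric limits come from the list above.

```python
# Thresholds copied from the list above: metric -> (warning, critical).
# The metric key names are illustrative, not taken from the ReptiDex code.
THRESHOLDS = {
    "task_cpu_percent": (70.0, 90.0),
    "task_memory_percent": (80.0, 95.0),
    "task_restarts_per_hour": (3.0, 10.0),
}


def classify(metric, value):
    """Return 'ok', 'warning', or 'critical' for a metric reading."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


def unhealthy_tasks_critical(unhealthy, total):
    """Critical when more than half of a service's tasks are unhealthy."""
    return total > 0 and unhealthy / total > 0.5


if __name__ == "__main__":
    print(classify("task_cpu_percent", 85.0))
    print(unhealthy_tasks_critical(unhealthy=3, total=4))
```

Both comparisons are strict, matching the ">" wording of the thresholds, so a reading of exactly 70% CPU still counts as healthy.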

Infrastructure Dashboards

CloudWatch Dashboard Structure

• System Health: Overall status indicators
• Performance KPIs: Response time, uptime, error rates
• Cost Metrics: Monthly spend, resource utilization
• Business Metrics: Active users, feature usage

• Containers: ECS Fargate tasks, service health, cluster metrics
• Database: RDS, ElastiCache detailed metrics
• Network: ALB, CloudFront, VPC metrics
• Security: Failed login attempts, blocked requests

Application Performance Monitoring

Grafana APM Integration

Core Application Monitoring

Response Time Tracking:

Web Transactions: Response time by endpoint
Database Queries: Query execution time, N+1 queries
External Services: API call response times
Background Jobs: Job processing duration

Throughput Metrics:

Requests per Minute: Traffic volume by endpoint
Database Operations: Queries per second by type
Cache Operations: Redis get/set operations per second
File Operations: S3 upload/download rates

Error Tracking:

Exception Rate: Errors per minute/hour
Error Types: Categorized error analysis
Error Trends: Error rate over time
User Impact: Users affected by errors

Resource Utilization:

Memory Usage: Heap usage, garbage collection
CPU Usage: Application CPU consumption
Thread Pool: Thread utilization and blocking
Connection Pools: Database connection usage

Real-time Error Monitoring

Sentry Error Tracking Setup

• JavaScript Errors: Unhandled exceptions and Promise rejections
• React Component Errors: Component render and lifecycle errors
• Network Errors: Failed API calls and timeout errors
• User Context: User ID, session info, browser details

• Python Exceptions: Unhandled application errors
• Database Errors: Connection failures, query errors
• API Errors: Validation errors, authentication failures
• Background Job Failures: Celery task failures

Centralized Logging

CloudWatch Logs Configuration

Structured Application Logging

Log Levels and Routing:

DEBUG: Development environment only, detailed execution flow
INFO: General application events, user actions, system events
WARNING: Non-critical issues, deprecated features usage
ERROR: Application errors, failed operations, retryable failures
CRITICAL: System failures, security incidents, data corruption

Service-Specific Log Groups:

/ecs/dev-reptidex-core
/ecs/dev-reptidex-animal
/ecs/dev-reptidex-commerce
/ecs/dev-reptidex-community
/ecs/dev-reptidex-media
/ecs/dev-reptidex-ops
/ecs/dev-reptidex-public
/ecs/dev-reptidex-admin
/ecs/dev-reptidex-breeder
/ecs/dev-reptidex-embed

Structured Log Format:

{
  "timestamp": "2025-01-15T10:30:00Z",
  "level": "INFO",
  "service": "repti-animal",
  "user_id": "user-12345",
  "session_id": "sess-67890",
  "request_id": "req-abcdef",
  "message": "Animal record created",
  "context": {
    "animal_id": "animal-xyz",
    "species": "ball_python",
    "action": "create_record"
  }
}
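
A log line in this shape can be produced with Python's standard `logging` module and a custom formatter. The sketch below is one possible approach, not the project's actual logging setup: the class name `JsonFormatter` and the helper `build_logger` are hypothetical, and the per-request fields are passed through `extra`.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render log records as single-line JSON in the format shown above."""

    # Optional per-request fields, supplied via logger.info(..., extra={...}).
    FIELDS = ("user_id", "session_id", "request_id", "context")

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Copy only the fields that the caller actually provided.
        for field in self.FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


def build_logger(service, stream=None):
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("repti-animal")
    log.info(
        "Animal record created",
        extra={
            "user_id": "user-12345",
            "request_id": "req-abcdef",
            "context": {"animal_id": "animal-xyz", "action": "create_record"},
        },
    )
```

Keeping each record on one line lets CloudWatch Logs Insights parse the JSON fields without extra configuration.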

Alerting Strategy

Alert Severity Levels

Critical Alerts

Immediate Response Required

• Application completely down
• Database unavailable
• Security breach detected
• Data corruption identified

Warning Alerts

Business Hours Response

• Performance degradation
• High error rates
• Resource utilization spikes
• Failed background jobs

Info Alerts

Informational Only

• Scheduled maintenance
• Deployment notifications
• Usage milestones reached
• Resource scaling events

Alert Configuration Matrix

Metric	Warning	Critical	Duration	Notification
CPU Utilization	>70%	>90%	5 minutes	Email/SMS
Memory Usage	>80%	>95%	3 minutes	Email/SMS
Response Time	>3 seconds	>10 seconds	10 minutes	Email/Slack
Error Rate	>5%	>15%	5 minutes	Email/SMS
Database Connections	>80 connections	>95 connections	2 minutes	Email/SMS
Disk Space	< 20% free	< 10% free	1 minute	Email/SMS
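
A row of this matrix maps fairly directly onto the parameters of CloudWatch's PutMetricAlarm API: the threshold becomes `Threshold`, and the duration becomes a number of consecutive one-minute evaluation periods. The sketch below only builds the argument dictionary. The cluster and service names are hypothetical, and the choice of one-minute periods with the `Average` statistic is an assumption rather than something the matrix specifies.

```python
def alarm_kwargs(name, namespace, metric, threshold, minutes, dimensions,
                 operator="GreaterThanThreshold"):
    """Build keyword arguments for CloudWatch's PutMetricAlarm API."""
    return {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Statistic": "Average",
        "Period": 60,                  # one-minute datapoints (assumption)
        "EvaluationPeriods": minutes,  # breach must persist this many minutes
        "Threshold": float(threshold),
        "ComparisonOperator": operator,
    }


if __name__ == "__main__":
    # "CPU Utilization, Critical >90%, 5 minutes" from the matrix above.
    # Cluster and service names are placeholders.
    kwargs = alarm_kwargs(
        name="prod-core-cpu-critical",
        namespace="AWS/ECS",
        metric="CPUUtilization",
        threshold=90,
        minutes=5,
        dimensions={"ClusterName": "example-cluster", "ServiceName": "example-service"},
    )
    print(kwargs)
    # With boto3 installed and credentials configured, the alarm could be created with:
    #   boto3.client("cloudwatch").put_metric_alarm(**kwargs)
```

Building the arguments separately from the API call keeps the threshold logic testable without AWS credentials.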

Notification Channels

Immediate Notification Setup

Email Notifications:

Primary Contact: Immediate alerts for all critical issues
Distribution List: Team notifications for warnings
Escalation: Manager notifications for unresolved criticals
Formatting: Rich HTML with graphs and context links

SMS/Text Alerts:

Critical Only: Production-down scenarios
After Hours: Critical alerts outside business hours
Rate Limiting: Max 3 SMS per hour to prevent spam
Acknowledgment: SMS reply to acknowledge receipt
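
The "max 3 SMS per hour" rule can be enforced with a sliding-window limiter in front of whatever sends the message. This is a minimal sketch; the class name `SlidingWindowLimiter` is hypothetical, and the sending step itself is left out.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_events` events in any `window_seconds` window."""

    def __init__(self, max_events=3, window_seconds=3600):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._sent = deque()  # timestamps of allowed events, oldest first

    def allow(self, now=None):
        """Return True and record the event if it fits within the limit."""
        now = time.monotonic() if now is None else now
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) >= self.max_events:
            return False
        self._sent.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_events=3, window_seconds=3600)
    for attempt in range(5):
        print(attempt, limiter.allow())
```

A suppressed SMS should still reach the email and Slack channels, so the limiter only gates the SMS path rather than the alert itself.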

Slack Integration:

#alerts Channel: All alerts with severity indicators
#dev-team Channel: Development and staging alerts
#management Channel: Business impact alerts
Bot Commands: /acknowledge, /resolve, /escalate

Mobile Push Notifications:

PagerDuty App: On-call rotation management
Custom App: Team member direct notifications
Escalation Path: Auto-escalate unacknowledged alerts

Performance Baselines & SLAs

Service Level Objectives

ReptiDex SLA Targets

• Production Uptime: 99.9% (8.77 hours downtime/year)
• API Availability: 99.95% (4.38 hours downtime/year)
• Database Uptime: 99.99% (52.6 minutes downtime/year)
• CDN Availability: 99.9% (geographic redundancy)
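
The downtime figures in parentheses follow from the availability percentage and the length of a year. The short calculation below reproduces them, assuming a 365.25-day year (8,766 hours); the function name is illustrative.

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766 hours, averaging in leap years


def downtime_hours_per_year(availability_percent):
    """Maximum yearly downtime, in hours, allowed by an availability target."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


if __name__ == "__main__":
    for target in (99.9, 99.95, 99.99):
        hours = downtime_hours_per_year(target)
        print(f"{target}% -> {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

Dividing the yearly budget by 12 gives a rough monthly allowance, which is usually the more practical number for incident reviews.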

• Page Load Time: < 3 seconds (95th percentile)
• API Response: < 500ms (average), < 2s (95th percentile)
• Database Queries: < 100ms (average), < 1s (99th percentile)
• Image Upload: < 10s for 5MB files
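
Several of these targets are stated as percentiles rather than averages. The sketch below checks a batch of API latencies against the "< 500ms (average), < 2s (95th percentile)" target using the nearest-rank percentile definition. That definition is an assumption; CloudWatch and Grafana may interpolate differently and give slightly different values.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def meets_api_target(latencies_ms):
    """True when average < 500 ms and 95th percentile < 2000 ms."""
    average = sum(latencies_ms) / len(latencies_ms)
    return average < 500 and percentile(latencies_ms, 95) < 2000


if __name__ == "__main__":
    sample = [120, 180, 150, 240, 900, 130, 160, 175, 210, 140]
    print(percentile(sample, 95), meets_api_target(sample))
```

Checking both conditions matters because a handful of very slow requests can leave the average healthy while the tail breaches the target.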

Baseline Performance Metrics

Expected Performance Baselines

Web Application Response Times:

Homepage: < 1.5s load time, < 500ms API calls
Animal Records: < 2s load time, < 300ms search queries
Image Gallery: < 3s initial load, < 1s subsequent images
User Dashboard: < 2s load time, < 200ms status updates

API Endpoint Performance:

Authentication: < 200ms login, < 100ms token validation
CRUD Operations: < 300ms creates, < 100ms reads
Search Queries: < 500ms simple search, < 2s complex filters
File Uploads: < 5s for images, < 15s for videos

Database Query Performance:

Simple Selects: < 50ms average response time
Complex Joins: < 200ms for multi-table queries
Aggregations: < 500ms for reporting queries
Bulk Operations: < 5s for batch inserts/updates

External Service Integration:

Payment Processing: < 3s for transaction completion
Email Delivery: < 10s for transactional emails
Image Processing: < 30s for resize and optimization
Backup Operations: < 2 hours for full database backup

Monitoring Automation

Automated Response Actions

Auto-Remediation Workflows

Trigger ECS Service Auto Scaling → Increase task count → Clear application cache → Alert team

Kill idle connections → Restart affected tasks → Scale RDS if needed → Alert database team

Force task restart → Update task definition with more memory → Scale out tasks → Alert ops team

ALB removes task from rotation → ECS starts replacement task → Deployment circuit breaker triggers → Escalate immediately

Monitoring Cost Optimization

Monitoring Budget Control

CloudWatch Cost Optimization:

Log Retention: 30 days for debug logs, 1 year for errors
Metric Resolution: 5-minute standard, 1-minute for critical only
Dashboard Optimization: Minimize widget count and refresh rates
Alert Tuning: Regular review to eliminate false positives

Third-party Tool Budgets:

Grafana Cloud: Start with Free tier, upgrade to Pro ($49/month)
Prometheus: Self-hosted (infrastructure costs only)
Loki: Self-hosted (infrastructure costs only)

Scaling Cost Strategy:

Volume Discounts: Negotiate annual contracts at scale
Feature Utilization: Regular audit of unused features
Data Retention: Automated archival of old monitoring data
Regional Optimization: Use cost-effective AWS regions

External Uptime Monitoring

UptimeRobot Configuration

Why External Monitoring?

While CloudWatch provides comprehensive internal monitoring, external uptime monitoring with UptimeRobot adds a critical layer of observability from outside your infrastructure. This catches issues that internal monitoring might miss, such as:

• DNS resolution failures
• SSL/TLS certificate expiration
• CDN or edge location issues
• Complete AWS region outages
• Network routing problems
• Load balancer misconfigurations

Monitor Configuration

All Production Endpoints Monitored

Backend API Services (6 microservices):

Core API: https://api.reptidex.com/api/v1/health
Animal API: https://animal-api.reptidex.com/api/v1/health
Commerce API: https://commerce-api.reptidex.com/api/v1/health
Media API: https://media-api.reptidex.com/api/v1/health
Community API: https://community-api.reptidex.com/api/v1/health
Ops API: https://ops-api.reptidex.com/api/v1/health

Frontend Applications (4 web apps):

Public Website: https://reptidex.com
Breeder Dashboard: https://app.reptidex.com
Admin Portal: https://admin.reptidex.com
Embeddable Widgets: https://embed.reptidex.com

Check Frequency: Every 5 minutes
Alert Threshold: Down for >2 minutes (1 failed check)
Multi-location: US East, US West, EU West, Asia Pacific

Setup Instructions

Quick Setup Guide

Step 1: Create UptimeRobot Account

# Sign up at https://uptimerobot.com
# Recommended plan: Pro ($58/month) for 1-minute intervals and SMS alerts
# Generate API key from My Settings > API Settings

Step 2: Run Automated Setup Script

# Navigate to monitoring scripts directory
cd infrastructure/monitoring/scripts

# Set API key environment variable
export UPTIMEROBOT_API_KEY="your-api-key-here"

# Run setup script for all environments
./setup-uptime-monitors.sh

# Or set up specific environment
./setup-uptime-monitors.sh --environment prod

Step 3: Configure Notification Channels

Add email contacts in UptimeRobot dashboard
Configure SMS numbers for critical alerts
Set up Slack webhook integration
Test notification delivery

Step 4: Set Up Public Status Page

Go to Public Status Pages in dashboard
Create new status page
Add custom domain: status.reptidex.com
Configure DNS CNAME record
Customize branding and colors

Step 5: Configure Maintenance Windows

Schedule weekly database backups (Sunday 2 AM EST)
Set monthly system updates (15th, 3 AM EST)
Pause relevant monitors during maintenance

Step 6: Review and Test

Verify all monitors are active
Test alert delivery (use pause/unpause feature)
Review response time baselines
Adjust thresholds as needed

Monitor Management

Infrastructure as Code Approach

Configuration File: infrastructure/monitoring/uptime-robot-config.yaml

This YAML file defines all monitors, alert contacts, and status page configuration. Benefits:

Version Control: Track changes to monitoring configuration
Reproducibility: Easily recreate monitors in new accounts
Documentation: Self-documenting infrastructure
Automation: Scripted setup and updates

Key Sections:

config:
  account: # Account settings
  status_page: # Public status page config
  notifications: # Alert contact configuration

monitors: # All monitor definitions
  - name: "[PROD] ReptiDex Core API"
    url: "https://api.reptidex.com/api/v1/health"
    interval: 5
    # ... additional settings

maintenance_windows: # Scheduled downtime
sla_targets: # SLA compliance tracking
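
A setup script could turn each entry under `monitors` into a request against the UptimeRobot API. The sketch below is not the contents of `setup-uptime-monitors.sh`. It assumes the v2 `newMonitor` endpoint, that monitor type `1` means HTTP(s), and that the API expects `interval` in seconds while the YAML value is in minutes; verify all three against the UptimeRobot API documentation before relying on them. The request is only constructed, not sent.

```python
import os
import urllib.parse
import urllib.request

# Assumed endpoint and type code for the UptimeRobot v2 API.
API_URL = "https://api.uptimerobot.com/v2/newMonitor"
HTTP_MONITOR = 1


def monitor_payload(monitor, api_key):
    """Translate one `monitors` entry from the YAML config into form fields."""
    return {
        "api_key": api_key,
        "format": "json",
        "type": HTTP_MONITOR,
        "friendly_name": monitor["name"],
        "url": monitor["url"],
        # Assumption: config interval is minutes, API interval is seconds.
        "interval": monitor["interval"] * 60,
    }


def build_request(monitor, api_key):
    """Build (but do not send) the POST request that would create the monitor."""
    body = urllib.parse.urlencode(monitor_payload(monitor, api_key)).encode()
    return urllib.request.Request(API_URL, data=body, method="POST")


if __name__ == "__main__":
    monitor = {
        "name": "[PROD] ReptiDex Core API",
        "url": "https://api.reptidex.com/api/v1/health",
        "interval": 5,
    }
    request = build_request(monitor, os.environ.get("UPTIMEROBOT_API_KEY", "your-api-key-here"))
    print(request.get_method(), request.full_url)
    # To actually create the monitor: urllib.request.urlopen(request)
```

Reading the API key from `UPTIMEROBOT_API_KEY` matches the environment variable used in the setup steps and keeps the key out of version control.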

Cost Optimization

Plan	Monitors	Interval	Features	Cost
Free	50	5 min	Email alerts, 2-month logs	$0
Pro	200	1 min	SMS, phone, 3-month logs	$58/mo
Business	500	1 min	All features, 1-year logs	$118/mo

Recommended Plan: Start with Free for development/staging, upgrade to Pro for production when launching.

Cost Savings Tips:

Use 5-minute intervals for non-critical services
Consolidate dev/staging monitors
Leverage maintenance windows to prevent false alerts
Use keyword monitors instead of multiple HTTP checks
Review and remove unused monitors monthly

This monitoring strategy gives ReptiDex visibility from infrastructure through business metrics at a cost suited to a startup. Alert thresholds and durations are tuned to limit false positives while keeping response to critical issues fast. External uptime monitoring with UptimeRobot adds a safety net by verifying service availability from outside the infrastructure.
