
ReptiDex AWS Deployment Architecture (Current Implementation)

Table of Contents

  1. Overview
  2. Architecture Diagram
  3. AWS Resources
  4. Deployment Flow
  5. Service Communication
  6. Domain & DNS Structure
  7. Security & Secrets Management
  8. Troubleshooting
  9. Future Improvements

Overview

ReptiDex uses a CloudFormation-managed ECS Fargate microservices architecture deployed on AWS with automated CI/CD pipelines via GitHub Actions. This document reflects the actual current state of our infrastructure as of October 2025.

Current Environment: Development (reptidex-dev)

AWS Account: 974061962050
Region: us-east-1 (US East - N. Virginia)
Profile Name: reptidex-dev

Architecture Summary

  • 6 Backend Microservices: FastAPI/Python services running on ECS Fargate (ARM64)
  • 4 Frontend Applications: React/TypeScript SPAs built with Vite, served via nginx on ECS Fargate (ARM64)
  • Application Load Balancer: Routes traffic based on subdomains and paths
  • RDS PostgreSQL: Managed database (single instance, no Multi-AZ in dev)
  • ElastiCache Redis: Managed Redis cluster for caching
  • ECR: Docker image registry for all services
  • Secrets Manager: Centralized secrets and database credentials
  • CloudFormation: Infrastructure as Code for all AWS resources
  • GitHub Actions: CI/CD automation for build, test, and deploy

Infrastructure Management

All infrastructure is managed via CloudFormation templates located in /infrastructure/templates/:
  1. 01-vpc.yaml: VPC, subnets, NAT gateways, internet gateway
  2. 02-security.yaml: Security groups, IAM roles, instance profiles
  3. 03-database.yaml: RDS PostgreSQL, ElastiCache Redis
  4. 04-compute.yaml: ALB, target groups, listener rules, DNS records
  5. 05-ecs.yaml: ECS cluster, task definitions, services

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────┐
│                         INTERNET (Users)                            │
└────────────────────────┬────────────────────────────────────────────┘


              ┌──────────────────────┐
              │   Route 53           │
              │  reptidex.com        │
              │  DNS Records:        │
              │  • dev.reptidex.com  │
              │  • api-dev.*         │
              │  • admin-dev.*       │
              └──────────┬───────────┘


┌────────────────────────────────────────────────────────────────────┐
│                    AWS VPC (10.1.0.0/16)                           │
│                    dev-reptidex-vpc                                │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐ │
│  │          Application Load Balancer (ALB)                     │ │
│  │          Public Subnets (10.1.1.0/24, 10.1.2.0/24)          │ │
│  │                                                               │ │
│  │  HTTPS Listener (443) - Host-based Routing:                 │ │
│  │  • api-dev.reptidex.com → Core service                      │ │
│  │  • animal-api-dev.reptidex.com → Animal service             │ │
│  │  • commerce-api-dev.reptidex.com → Commerce service         │ │
│  │  • community-api-dev.reptidex.com → Community service       │ │
│  │  • media-api-dev.reptidex.com → Media service               │ │
│  │  • ops-api-dev.reptidex.com → Ops service                   │ │
│  │  • dev.reptidex.com → Public frontend (default)             │ │
│  │  • admin-dev.reptidex.com → Admin frontend                  │ │
│  │  • breeder-dev.reptidex.com → Breeder frontend              │ │
│  │  • embed-dev.reptidex.com → Embed frontend                  │ │
│  └──────────────────────────────────────────────────────────────┘ │
│                         │                                           │
│                         ▼                                           │
│  ┌──────────────────────────────────────────────────────────────┐ │
│  │              ECS Fargate Cluster                             │ │
│  │              Private Subnets (10.1.10.0/24, 10.1.11.0/24)   │ │
│  │                                                               │ │
│  │  Backend Services (FastAPI):                                 │ │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │ │
│  │  │ repti-core   │  │ repti-animal │  │repti-commerce│      │ │
│  │  │   :8000      │  │   :8001      │  │   :8002      │      │ │
│  │  │ CPU: 256     │  │ CPU: 256     │  │ CPU: 256     │      │ │
│  │  │ Mem: 512     │  │ Mem: 512     │  │ Mem: 512     │      │ │
│  │  └──────────────┘  └──────────────┘  └──────────────┘      │ │
│  │                                                               │ │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │ │
│  │  │ repti-media  │  │repti-community│ │  repti-ops   │      │ │
│  │  │   :8003      │  │   :8004      │  │   :8005      │      │ │
│  │  │ CPU: 256     │  │ CPU: 256     │  │ CPU: 256     │      │ │
│  │  │ Mem: 512     │  │ Mem: 512     │  │ Mem: 512     │      │ │
│  │  └──────────────┘  └──────────────┘  └──────────────┘      │ │
│  │                                                               │ │
│  │  Frontend Services (nginx):                                  │ │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │ │
│  │  │ web-public   │  │  web-admin   │  │ web-breeder  │      │ │
│  │  │   :80        │  │   :80        │  │   :80        │      │ │
│  │  │ CPU: 256     │  │ CPU: 256     │  │ CPU: 256     │      │ │
│  │  │ Mem: 512     │  │ Mem: 512     │  │ Mem: 512     │      │ │
│  │  └──────────────┘  └──────────────┘  └──────────────┘      │ │
│  │                                                               │ │
│  │  ┌──────────────┐                                            │ │
│  │  │  web-embed   │                                            │ │
│  │  │   :80        │  All tasks share:                          │ │
│  │  │ CPU: 256     │  • Secrets from Secrets Manager            │ │
│  │  │ Mem: 512     │  • Database/Redis access via SG            │ │
│  │  └──────────────┘  • CloudWatch Logs                         │ │
│  └──────────────────────────────────────────────────────────────┘ │
│                                                                     │
│                         │                                           │
│                         ▼                                           │
│  ┌──────────────────────────────────────────────────────────────┐ │
│  │         RDS PostgreSQL Cluster                               │ │
│  │         Private Subnets (10.1.20.0/24, 10.1.21.0/24)        │ │
│  │                                                               │ │
│  │         Engine: PostgreSQL 15.10                             │ │
│  │         Instance: db.t4g.micro (ARM)                         │ │
│  │         Storage: 20GB GP3 (encrypted)                        │ │
│  │         Multi-AZ: No (dev environment)                       │ │
│  │         Database: postgres                                   │ │
│  └──────────────────────────────────────────────────────────────┘ │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐ │
│  │         ElastiCache Redis Cluster                            │ │
│  │         Private Subnets (10.1.20.0/24, 10.1.21.0/24)        │ │
│  │                                                               │ │
│  │         Engine: Redis 7.1                                    │ │
│  │         Node Type: cache.t4g.micro (ARM)                     │ │
│  │         Nodes: 1 (dev environment)                           │ │
│  └──────────────────────────────────────────────────────────────┘ │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                    AWS Services (Regional)                          │
│                                                                     │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐ │
│  │  ECR Repositories│  │ Secrets Manager  │  │  CloudWatch      │ │
│  │                  │  │                  │  │                  │ │
│  │ • repti-core     │  │ • dev-reptidex-  │  │ • Log Groups     │ │
│  │ • repti-animal   │  │   database-url   │  │ • Metrics        │ │
│  │ • repti-commerce │  │ • dev-reptidex-  │  │ • Alarms         │ │
│  │ • repti-media    │  │   db-connection  │  │                  │ │
│  │ • repti-community│  │                  │  │                  │ │
│  │ • repti-ops      │  │ Contains:        │  │                  │ │
│  │ • reptidex-      │  │ - host, port     │  │                  │ │
│  │   web-public     │  │ - dbname         │  │                  │ │
│  │ • reptidex-      │  │ - username       │  │                  │ │
│  │   web-admin      │  │ - password       │  │                  │ │
│  │ • reptidex-      │  │ - connection url │  │                  │ │
│  │   web-breeder    │  │                  │  │                  │ │
│  │ • reptidex-      │  │                  │  │                  │ │
│  │   web-embed      │  │                  │  │                  │ │
│  └──────────────────┘  └──────────────────┘  └──────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                         GitHub Actions                              │
│                                                                     │
│  Push to staging → Build Docker Image (ARM64) → Push to ECR →     │
│  Update ECS Service → Rolling Deployment → Health Check →          │
│  Success / Rollback                                                 │
└─────────────────────────────────────────────────────────────────────┘

AWS Resources

1. VPC & Networking

| Resource | Details |
|---|---|
| VPC CIDR | 10.1.0.0/16 |
| Public Subnets | 10.1.1.0/24 (us-east-1a), 10.1.2.0/24 (us-east-1b) |
| Private Subnets | 10.1.10.0/24, 10.1.11.0/24 (ECS tasks) |
| Database Subnets | 10.1.20.0/24, 10.1.21.0/24 (RDS/Redis) |
| Internet Gateway | For public subnet internet access |
| NAT Gateways | 2 (one per AZ) for private subnet outbound traffic |
| VPC Endpoints | ECR API, ECR DKR, S3, Secrets Manager, CloudWatch Logs |
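
The address plan above can be sanity-checked with Python's standard ipaddress module. This is a minimal sketch and not part of the repository; the subnet labels (public-1, private-1, and so on) are illustrative names, while the CIDR blocks are copied from the table.

```python
import ipaddress
from itertools import combinations

# CIDR blocks copied from the VPC & Networking table; labels are hypothetical.
VPC_CIDR = ipaddress.ip_network("10.1.0.0/16")
SUBNETS = {
    "public-1": "10.1.1.0/24",
    "public-2": "10.1.2.0/24",
    "private-1": "10.1.10.0/24",
    "private-2": "10.1.11.0/24",
    "database-1": "10.1.20.0/24",
    "database-2": "10.1.21.0/24",
}


def validate_subnets(vpc, subnets):
    """Return a list of problems: subnets outside the VPC or overlapping each other."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    problems = [
        f"{name} is outside {vpc}"
        for name, net in nets.items()
        if not net.subnet_of(vpc)
    ]
    problems += [
        f"{a} overlaps {b}"
        for (a, net_a), (b, net_b) in combinations(nets.items(), 2)
        if net_a.overlaps(net_b)
    ]
    return problems


print(validate_subnets(VPC_CIDR, SUBNETS))  # [] means the plan is consistent
```

An empty list means every subnet sits inside 10.1.0.0/16 and no two subnets overlap.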

2. Application Load Balancer (ALB)

| Resource | Details |
|---|---|
| Scheme | Internet-facing |
| Subnets | Public subnets (us-east-1a, us-east-1b) |
| Security Group | Allows 80/443 from 0.0.0.0/0 |
| Listeners | HTTP (80) redirects to HTTPS, HTTPS (443) |
| SSL Certificate | ACM certificate for *.reptidex.com |
| Target Groups | 10 total (6 backend + 4 frontend) |

Target Groups

| Target Group | Port | Health Check Path |
|---|---|---|
| dev-reptidex-core-tg | 8000 | /api/v1/health |
| dev-reptidex-animal-tg | 8001 | /api/v1/health |
| dev-reptidex-commerce-tg | 8002 | /api/v1/health |
| dev-reptidex-media-tg | 8003 | /api/v1/health |
| dev-reptidex-community-tg | 8004 | /api/v1/health |
| dev-reptidex-ops-tg | 8005 | /api/v1/health |
| dev-reptidex-public-tg | 80 | /health |
| dev-reptidex-admin-tg | 80 | /health |
| dev-reptidex-breeder-tg | 80 | /health |
| dev-reptidex-embed-tg | 80 | /health |

3. ECS Fargate

| Resource | Details |
|---|---|
| Cluster Name | dev-reptidex-cluster |
| Launch Type | Fargate |
| Platform | ARM64 (Graviton2) |
| Network Mode | awsvpc |
| Task CPU | 256 (.25 vCPU) per task |
| Task Memory | 512 MB per task |
| Desired Count | 1 per service (dev environment) |

ECS Services

| Service | Task Definition | Port | Image |
|---|---|---|---|
| dev-reptidex-core | dev-reptidex-core:* | 8000 | repti-core:staging |
| dev-reptidex-animal | dev-reptidex-animal:* | 8001 | repti-animal:staging |
| dev-reptidex-commerce | dev-reptidex-commerce:* | 8002 | repti-commerce:staging |
| dev-reptidex-media | dev-reptidex-media:* | 8003 | repti-media:staging |
| dev-reptidex-community | dev-reptidex-community:* | 8004 | repti-community:staging |
| dev-reptidex-ops | dev-reptidex-ops:* | 8005 | repti-ops:staging |
| dev-reptidex-public | dev-reptidex-public:* | 80 | reptidex-web-public:staging |
| dev-reptidex-admin | dev-reptidex-admin:* | 80 | reptidex-web-admin:staging |
| dev-reptidex-breeder | dev-reptidex-breeder:* | 80 | reptidex-web-breeder:staging |
| dev-reptidex-embed | dev-reptidex-embed:* | 80 | reptidex-web-embed:staging |
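
For capacity planning, the per-task sizing and service count above can be combined into a steady-state footprint for the cluster. The following is a rough sketch; the function name is illustrative, and it ignores the extra tasks that run briefly during a rolling deployment.

```python
# Sizing values copied from the ECS Fargate section.
TASK_CPU_UNITS = 256   # ECS CPU units; 1024 units = 1 vCPU
TASK_MEMORY_MB = 512
DESIRED_COUNT = 1      # tasks per service in dev
SERVICE_COUNT = 10     # 6 backend + 4 frontend


def cluster_footprint(services, desired, cpu_units, memory_mb):
    """Return the steady-state task count, vCPU total, and memory total."""
    tasks = services * desired
    return {
        "tasks": tasks,
        "vcpu": tasks * cpu_units / 1024,
        "memory_mb": tasks * memory_mb,
    }


print(cluster_footprint(SERVICE_COUNT, DESIRED_COUNT, TASK_CPU_UNITS, TASK_MEMORY_MB))
# {'tasks': 10, 'vcpu': 2.5, 'memory_mb': 5120}
```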

4. Database (RDS)

| Resource | Details |
|---|---|
| DB Identifier | dev-reptidex-postgres |
| Instance Class | db.t4g.micro (2 vCPU, 1GB RAM, ARM) |
| Engine | PostgreSQL 15.10 |
| Port | 5432 |
| Database Name | postgres |
| Storage | 20GB GP3 (encrypted) |
| Multi-AZ | No (dev environment) |
| Backup Retention | 7 days |
| Encryption | AWS managed KMS key |

5. Cache (ElastiCache Redis)

| Resource | Details |
|---|---|
| Replication Group | dev-reptidex-redis |
| Node Type | cache.t4g.micro (ARM) |
| Engine | Redis 7.1 |
| Port | 6379 |
| Number of Nodes | 1 (dev environment) |
| Encryption | At-rest and in-transit |

6. Container Registry (ECR)

All repositories in region us-east-1:
| Repository Name | Image Architecture | Image Tags |
|---|---|---|
| repti-core | ARM64 | staging, main, staging-{sha} |
| repti-animal | ARM64 | staging, main, staging-{sha} |
| repti-commerce | ARM64 | staging, main, staging-{sha} |
| repti-media | ARM64 | staging, main, staging-{sha} |
| repti-community | ARM64 | staging, main, staging-{sha} |
| repti-ops | ARM64 | staging, main, staging-{sha} |
| reptidex-web-public | ARM64 | staging, main, staging-{sha} |
| reptidex-web-admin | ARM64 | staging, main, staging-{sha} |
| reptidex-web-breeder | ARM64 | staging, main, staging-{sha} |
| reptidex-web-embed | ARM64 | staging, main, staging-{sha} |
Registry URL: 974061962050.dkr.ecr.us-east-1.amazonaws.com

7. Secrets Manager

| Secret Name | Purpose |
|---|---|
| dev-reptidex-database-url | PostgreSQL connection string |
| dev-reptidex-db-connection | Database connection details (JSON) |

8. Security Groups

| Security Group | Purpose | Inbound Rules |
|---|---|---|
| dev-reptidex-alb-sg | ALB security group | 80, 443 from 0.0.0.0/0 |
| dev-reptidex-ecs-sg | ECS tasks | 8000-8005, 80 from ALB SG |
| dev-reptidex-rds-sg | RDS database | 5432 from ECS SG |
| dev-reptidex-cache-sg | ElastiCache | 6379 from ECS SG |

9. IAM Roles

| Role | Purpose | Managed Policies |
|---|---|---|
| dev-reptidex-ecs-task-execution-role | ECS task execution | ECR pull, CloudWatch Logs, Secrets Manager |
| dev-reptidex-ecs-task-role | ECS task runtime | Application-specific permissions |

Deployment Flow

CloudFormation Stack Deployment

Infrastructure is deployed using the infrastructure/scripts/deploy.sh script:
# Deploy all stacks in order
AWS_PROFILE=reptidex-dev ./infrastructure/scripts/deploy.sh dev all

# Deploy individual stacks
AWS_PROFILE=reptidex-dev ./infrastructure/scripts/deploy.sh dev vpc
AWS_PROFILE=reptidex-dev ./infrastructure/scripts/deploy.sh dev security
AWS_PROFILE=reptidex-dev ./infrastructure/scripts/deploy.sh dev database
AWS_PROFILE=reptidex-dev ./infrastructure/scripts/deploy.sh dev compute
AWS_PROFILE=reptidex-dev ./infrastructure/scripts/deploy.sh dev ecs
Stack deployment order is critical:
  1. VPC (creates network foundation)
  2. Security (creates security groups and IAM roles)
  3. Database (creates RDS and Redis, depends on VPC and Security)
  4. Compute (creates ALB and DNS records, depends on VPC and Security)
  5. ECS (creates cluster and services, depends on all previous stacks)
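
The ordering constraint above is a dependency graph, so a valid deployment order can be derived with a topological sort. This is a sketch using Python's standard graphlib module, not the logic of deploy.sh. The database, compute, and ecs dependencies follow the list above; the security stack depending on vpc is an assumption.

```python
from graphlib import TopologicalSorter

# Stack -> stacks it depends on.
# "security" depending on "vpc" is an assumption; the rest follows the list above.
STACK_DEPENDENCIES = {
    "vpc": set(),
    "security": {"vpc"},
    "database": {"vpc", "security"},
    "compute": {"vpc", "security"},
    "ecs": {"vpc", "security", "database", "compute"},
}


def deployment_order(dependencies):
    """Return one valid deployment order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(dependencies).static_order())


order = deployment_order(STACK_DEPENDENCIES)
print(order)
```

The database and compute stacks do not depend on each other, so either may come first; every valid order starts with vpc and security and ends with ecs.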

GitHub Actions CI/CD Pipeline

Each service repository has a .github/workflows/cicd.yml file that automates:

Trigger Events

  • Push to staging branch: Deploy to development environment
  • Push to main branch: Build and tag only (production deployment TBD)
  • Pull Request: Run tests only (no deployment)
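
The trigger rules above amount to a small decision table. The sketch below expresses it as a function; the event names follow GitHub's push and pull_request events, while the returned labels are hypothetical and do not come from cicd.yml.

```python
def pipeline_action(event: str, branch: str) -> str:
    """Map a GitHub event and branch to the pipeline behaviour described above."""
    if event == "pull_request":
        return "test-only"
    if event == "push" and branch == "staging":
        return "build-and-deploy-dev"
    if event == "push" and branch == "main":
        return "build-and-tag"
    return "skip"  # any other branch or event is not covered by the rules above


print(pipeline_action("push", "staging"))       # build-and-deploy-dev
print(pipeline_action("pull_request", "main"))  # test-only
```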

Backend Service Workflow

Frontend Service Workflow

Key Deployment Steps

Backend Services:
# Build for ARM64
docker buildx build \
  --platform linux/arm64 \
  --push \
  --tag 974061962050.dkr.ecr.us-east-1.amazonaws.com/repti-animal:staging \
  --tag 974061962050.dkr.ecr.us-east-1.amazonaws.com/repti-animal:staging-abc1234 \
  .

# Update ECS service (triggers rolling deployment)
aws ecs update-service \
  --cluster dev-reptidex-cluster \
  --service dev-reptidex-animal \
  --force-new-deployment
Frontend Services:
# Install dependencies including workspace packages from GitHub Packages
npm install

# Build frontend
npm run build

# Build Docker image with nginx
docker buildx build \
  --platform linux/arm64 \
  --push \
  --tag 974061962050.dkr.ecr.us-east-1.amazonaws.com/reptidex-web-public:staging \
  .

# Update ECS service
aws ecs update-service \
  --cluster dev-reptidex-cluster \
  --service dev-reptidex-public \
  --force-new-deployment
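
The image references used in the build commands above follow a registry/repository:tag pattern with a moving branch tag plus a commit-specific tag. A minimal sketch of that naming scheme follows. Truncating the commit SHA to 7 characters is an assumption based on the staging-abc1234 example, and whether main builds also receive a SHA-suffixed tag is not stated in this document.

```python
REGISTRY = "974061962050.dkr.ecr.us-east-1.amazonaws.com"


def image_tags(repository: str, branch: str, commit_sha: str, short_len: int = 7) -> list[str]:
    """Return the moving branch tag and the immutable branch-sha tag for an image."""
    base = f"{REGISTRY}/{repository}"
    return [
        f"{base}:{branch}",
        f"{base}:{branch}-{commit_sha[:short_len]}",  # short SHA length is an assumption
    ]


for tag in image_tags("repti-animal", "staging", "abc1234def5678"):
    print(tag)
```

The branch tag is what the ECS task definitions reference, so a forced redeployment picks up the newest push, while the SHA-suffixed tag identifies one specific build.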

ECS Rolling Deployment

ECS performs rolling deployments automatically:
  1. New Task Start: ECS starts new tasks with updated image
  2. Health Check: ALB performs health checks on new tasks
  3. Traffic Shift: Once healthy, ALB routes traffic to new tasks
  4. Old Task Drain: Old tasks are drained and stopped
  5. Cleanup: Old task definitions remain for rollback
Deployment Settings:
  • Maximum percent: 200% (up to double the desired count may run during a deployment)
  • Minimum healthy percent: 100% (at least the desired count stays running during a deployment)
  • Circuit Breaker: Disabled (failed deployments must be rolled back manually)
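
The Maximum and Minimum percentages above translate into concrete task-count bounds for each service. The sketch below applies the ECS rounding rules (the minimum is rounded up, the maximum is rounded down); the function name is illustrative.

```python
def rolling_deployment_bounds(desired_count: int,
                              minimum_healthy_percent: int = 100,
                              maximum_percent: int = 200) -> tuple[int, int]:
    """Return (min running tasks, max running tasks) allowed during a deployment."""
    # Integer arithmetic avoids floating-point surprises.
    min_running = -((-desired_count * minimum_healthy_percent) // 100)  # ceiling
    max_running = (desired_count * maximum_percent) // 100              # floor
    return min_running, max_running


print(rolling_deployment_bounds(1))  # (1, 2)
```

With a desired count of 1, the old task keeps running until the new one is healthy, so a deployment briefly runs two tasks per service.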

Service Communication

Backend Service Architecture

All 6 backend services are independent FastAPI applications that:
  1. Run in separate ECS Fargate tasks
  2. Share the same database (RDS PostgreSQL)
  3. Share the same cache (ElastiCache Redis)
  4. Have their own subdomain for API access
  5. Use /api/v1/* for all endpoints (consistent across services)

Service Endpoints

| Service | Subdomain | API Base Path | Health Check |
|---|---|---|---|
| Core | api-dev.reptidex.com | /api/v1/ | /api/v1/health |
| Animal | animal-api-dev.reptidex.com | /api/v1/ | /api/v1/health |
| Commerce | commerce-api-dev.reptidex.com | /api/v1/ | /api/v1/health |
| Media | media-api-dev.reptidex.com | /api/v1/ | /api/v1/health |
| Community | community-api-dev.reptidex.com | /api/v1/ | /api/v1/health |
| Ops | ops-api-dev.reptidex.com | /api/v1/ | /api/v1/health |

Documentation Endpoints

Each service has Swagger UI and ReDoc documentation:
  • Swagger UI: https://{service-subdomain}/docs
  • ReDoc: https://{service-subdomain}/redoc
  • OpenAPI Spec: https://{service-subdomain}/openapi.json
Examples:
  • https://api-dev.reptidex.com/docs - Core service docs
  • https://animal-api-dev.reptidex.com/docs - Animal service docs
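
Because every backend service follows the same path conventions, the API, health, and documentation URLs can be derived from the subdomain alone. A small sketch follows; the dictionary and function names are illustrative, and the subdomains and paths are taken from the sections above.

```python
# Service name -> subdomain, copied from the Service Endpoints table.
API_SUBDOMAINS = {
    "core": "api-dev.reptidex.com",
    "animal": "animal-api-dev.reptidex.com",
    "commerce": "commerce-api-dev.reptidex.com",
    "media": "media-api-dev.reptidex.com",
    "community": "community-api-dev.reptidex.com",
    "ops": "ops-api-dev.reptidex.com",
}


def service_urls(service: str) -> dict[str, str]:
    """Build the standard URLs for one backend service; raises KeyError if unknown."""
    base = f"https://{API_SUBDOMAINS[service]}"
    return {
        "api_base": f"{base}/api/v1/",
        "health": f"{base}/api/v1/health",
        "swagger": f"{base}/docs",
        "redoc": f"{base}/redoc",
        "openapi": f"{base}/openapi.json",
    }


print(service_urls("animal")["health"])
# https://animal-api-dev.reptidex.com/api/v1/health
```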

Inter-Service Communication

Current Approach: Services communicate via their public ALB endpoints
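import httpx

# Example: Commerce service calling Animal service over its public ALB endpoint
async def fetch_animal(animal_id: int) -> dict:
    async with httpx.AsyncClient(timeout=10.0) as client:
        response = await client.get(
            f"https://animal-api-dev.reptidex.com/api/v1/animals/{animal_id}"
        )
        response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body
        return response.json()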
Future Improvement: Use AWS Cloud Map for service discovery and internal DNS

Database Access

All services connect to the same RDS PostgreSQL instance:
  • Connection String: From DATABASE_URL environment variable (Secrets Manager)
  • Driver: asyncpg (async PostgreSQL driver for Python)
  • ORM: SQLAlchemy 2.0 with async support
  • Schema Management: Each service manages its own tables
  • Migrations: Alembic (run separately per service)

Cache Access

All services can access the shared ElastiCache Redis cluster:
  • Connection String: From environment variable
  • Driver: redis-py with async support
  • Usage: Session storage, rate limiting, caching

Domain & DNS Structure

DNS Records (Route 53)

All DNS records point to the Application Load Balancer.
Frontend Applications:
  • dev.reptidex.com → ALB → web-public (default)
  • admin-dev.reptidex.com → ALB → web-admin
  • breeder-dev.reptidex.com → ALB → web-breeder
  • embed-dev.reptidex.com → ALB → web-embed
Backend APIs:
  • api-dev.reptidex.com → ALB → Core service
  • animal-api-dev.reptidex.com → ALB → Animal service
  • commerce-api-dev.reptidex.com → ALB → Commerce service
  • community-api-dev.reptidex.com → ALB → Community service
  • media-api-dev.reptidex.com → ALB → Media service
  • ops-api-dev.reptidex.com → ALB → Ops service

ALB Listener Rules

The ALB uses host-based routing to direct traffic:
  Priority 1: Core service (host: api-dev.reptidex.com)
  Priority 2: Animal service (host: animal-api-dev.reptidex.com)
  Priority 3: Community service (host: community-api-dev.reptidex.com)
  Priority 4: Media service (host: media-api-dev.reptidex.com)
  Priority 5: Ops service (host: ops-api-dev.reptidex.com)
  Priority 6: Admin frontend (host: admin-dev.reptidex.com)
  Priority 7: Breeder frontend (host: breeder-dev.reptidex.com)
  Priority 8: Embed frontend (host: embed-dev.reptidex.com)
  Priority 11: Commerce service (host: commerce-api-dev.reptidex.com)
  Default: Public frontend (host: dev.reptidex.com)
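
An ALB evaluates listener rules in ascending priority order and falls back to the default action when no rule matches. The sketch below models that behaviour for the rules listed above. It only handles exact host matches (real ALB host conditions also support wildcards), and the target labels are shorthand rather than actual target group names.

```python
# (priority, host header, target) copied from the listener rules above.
RULES = [
    (1, "api-dev.reptidex.com", "core"),
    (2, "animal-api-dev.reptidex.com", "animal"),
    (3, "community-api-dev.reptidex.com", "community"),
    (4, "media-api-dev.reptidex.com", "media"),
    (5, "ops-api-dev.reptidex.com", "ops"),
    (6, "admin-dev.reptidex.com", "admin"),
    (7, "breeder-dev.reptidex.com", "breeder"),
    (8, "embed-dev.reptidex.com", "embed"),
    (11, "commerce-api-dev.reptidex.com", "commerce"),
]
DEFAULT_TARGET = "public"


def route(host_header: str) -> str:
    """Return the target for a Host header, checking rules in priority order."""
    host = host_header.lower().split(":")[0]  # ignore case and any port suffix
    for _priority, rule_host, target in sorted(RULES):
        if host == rule_host:
            return target
    return DEFAULT_TARGET  # default action: public frontend


print(route("commerce-api-dev.reptidex.com"))  # commerce
print(route("dev.reptidex.com"))               # public
```

Any hostname that resolves to the ALB but matches no rule is served by the public frontend, which is worth keeping in mind when a new subdomain appears to "work" before its rule exists.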

SSL/TLS

  • Certificate: ACM certificate for *.reptidex.com (wildcard)
  • Certificate ARN: arn:aws:acm:us-east-1:974061962050:certificate/f38a801d-5873-42cd-be09-232a396590fb
  • Protocol: TLS 1.2+
  • Termination: At ALB (traffic to ECS tasks is HTTP within VPC)

Security & Secrets Management

AWS Secrets Manager

Database credentials are stored in AWS Secrets Manager.
Secret: dev-reptidex-db-connection
{
  "host": "dev-reptidex-postgres.cqjoc0ikql0f.us-east-1.rds.amazonaws.com",
  "port": "5432",
  "dbname": "postgres",
  "username": "reptidex_dev",
  "password": "<auto-generated>",
  "url": "postgresql+asyncpg://reptidex_dev:<password>@<host>:5432/postgres"
}
ECS tasks access secrets via IAM role permissions:
  • Task execution role has secretsmanager:GetSecretValue permission
  • Secrets are injected as environment variables at task startup
  • Secrets are never stored in task definitions or CloudFormation templates
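
The fields of the dev-reptidex-db-connection secret relate to its url field in a straightforward way. The sketch below rebuilds a connection URL from the individual fields. How the stored url is actually generated is not described here, so treat this as an illustration; the percent-encoding step matters because auto-generated passwords can contain characters such as @ or / that would otherwise break URL parsing. The host and password in the example are made up.

```python
import json
from urllib.parse import quote


def database_url_from_secret(secret_string: str) -> str:
    """Build a SQLAlchemy asyncpg URL from the secret's JSON fields."""
    secret = json.loads(secret_string)
    user = quote(secret["username"], safe="")
    password = quote(secret["password"], safe="")  # encode @, /, : and similar
    port = int(secret["port"])                     # stored as a string in the secret
    return (
        f"postgresql+asyncpg://{user}:{password}"
        f"@{secret['host']}:{port}/{secret['dbname']}"
    )


# Hypothetical values for illustration only.
example = json.dumps({
    "host": "db.example.internal",
    "port": "5432",
    "dbname": "postgres",
    "username": "reptidex_dev",
    "password": "p@ss/word",
})
print(database_url_from_secret(example))
# postgresql+asyncpg://reptidex_dev:p%40ss%2Fword@db.example.internal:5432/postgres
```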

Network Security

Defense in Depth:
  1. VPC Isolation: Private subnets for ECS tasks and databases
  2. Security Groups: Restrict traffic to minimum required
  3. NAT Gateways: Outbound internet access for private subnets (ECR pulls, etc.)
  4. VPC Endpoints: Private connections to AWS services (no internet required)
  5. SSL/TLS: Encrypted traffic from users to ALB
Security Group Rules:
ALB Security Group:
  Inbound: 80, 443 from 0.0.0.0/0
  Outbound: All to ECS security group

ECS Security Group:
  Inbound: 8000-8005, 80 from ALB security group
  Outbound: All (for database, ECR, Secrets Manager access)

RDS Security Group:
  Inbound: 5432 from ECS security group
  Outbound: None

Redis Security Group:
  Inbound: 6379 from ECS security group
  Outbound: None
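
The inbound rules above form a chain (internet to ALB, ALB to ECS, ECS to RDS and Redis) that can be modelled to reason about which connections are possible. This is a simplified sketch that considers inbound rules only and uses short labels in place of the real security group IDs.

```python
# Destination group -> list of (allowed ports, allowed source), from the rules above.
INBOUND = {
    "alb": [({80, 443}, "0.0.0.0/0")],
    "ecs": [(set(range(8000, 8006)) | {80}, "alb")],
    "rds": [({5432}, "ecs")],
    "redis": [({6379}, "ecs")],
}


def is_allowed(source: str, destination: str, port: int) -> bool:
    """Return True if an inbound rule on the destination admits this source and port."""
    return any(
        port in ports and allowed_source == source
        for ports, allowed_source in INBOUND.get(destination, [])
    )


print(is_allowed("ecs", "rds", 5432))  # True
print(is_allowed("alb", "rds", 5432))  # False: only ECS tasks may reach the database
```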

IAM Roles & Policies

ECS Task Execution Role (dev-reptidex-ecs-task-execution-role):
  • Pull images from ECR
  • Write logs to CloudWatch
  • Read secrets from Secrets Manager
ECS Task Role (dev-reptidex-ecs-task-role):
  • Application-specific AWS service access (if needed)
  • Currently minimal permissions

Authentication & Authorization (Future)

Planned Implementation:
  • JWT-based authentication via Core service
  • OAuth2/OIDC integration
  • API key authentication for service-to-service
  • Rate limiting via API Gateway

Troubleshooting

Common Issues

1. Service Not Healthy

Symptoms: ECS service showing unhealthy tasks, 503 errors
Diagnosis:
# Check ECS service status
AWS_PROFILE=reptidex-dev aws ecs describe-services \
  --cluster dev-reptidex-cluster \
  --services dev-reptidex-animal \
  --region us-east-1

# Check task status
AWS_PROFILE=reptidex-dev aws ecs list-tasks \
  --cluster dev-reptidex-cluster \
  --service-name dev-reptidex-animal \
  --region us-east-1

# View task logs
AWS_PROFILE=reptidex-dev aws logs tail \
  /ecs/dev-reptidex \
  --follow \
  --filter-pattern "animal"
Common Causes:
  • Container crashes on startup (check logs)
  • Health check endpoint returning non-200 status
  • Database connection failures
  • Missing or incorrect environment variables

2. Database Connection Errors

Symptoms: asyncpg.exceptions.InvalidPasswordError, connection timeout
Diagnosis:
# Check RDS status
AWS_PROFILE=reptidex-dev aws rds describe-db-instances \
  --db-instance-identifier dev-reptidex-postgres

# Check security groups
AWS_PROFILE=reptidex-dev aws ec2 describe-security-groups \
  --group-ids sg-xxx

# Verify secret
AWS_PROFILE=reptidex-dev aws secretsmanager get-secret-value \
  --secret-id dev-reptidex-db-connection \
  --query SecretString \
  --output text
Common Causes:
  • Security group not allowing ECS → RDS traffic
  • Incorrect password in Secrets Manager
  • Database not running
  • Wrong subnet configuration

3. Image Pull Errors

Symptoms: CannotPullContainerError, task fails to start
Diagnosis:
# List ECR images
AWS_PROFILE=reptidex-dev aws ecr describe-images \
  --repository-name repti-animal \
  --region us-east-1

# Check task definition
AWS_PROFILE=reptidex-dev aws ecs describe-task-definition \
  --task-definition dev-reptidex-animal \
  --region us-east-1
Common Causes:
  • Image doesn’t exist in ECR (check GitHub Actions logs)
  • Wrong image tag in task definition
  • IAM role missing ECR permissions
  • ARM64 image requested but amd64 built (or vice versa)

4. ALB Target Unhealthy

Symptoms: Targets showing unhealthy in ALB target group
Diagnosis:
# Check target health
AWS_PROFILE=reptidex-dev aws elbv2 describe-target-health \
  --target-group-arn <target-group-arn> \
  --region us-east-1

# Check target group configuration
AWS_PROFILE=reptidex-dev aws elbv2 describe-target-groups \
  --target-group-arns <target-group-arn> \
  --region us-east-1
Common Causes:
  • Health check path incorrect (/api/v1/health vs /health)
  • Security group not allowing ALB → ECS traffic
  • Service not listening on expected port
  • Health check timeout too short

Useful Commands

# Deploy infrastructure
AWS_PROFILE=reptidex-dev ./infrastructure/scripts/deploy.sh dev all

# Check CloudFormation stack status
AWS_PROFILE=reptidex-dev aws cloudformation describe-stacks \
  --stack-name reptidex-dev-05-ecs \
  --region us-east-1

# View CloudFormation events
AWS_PROFILE=reptidex-dev aws cloudformation describe-stack-events \
  --stack-name reptidex-dev-05-ecs \
  --region us-east-1

# Update ECS service (force new deployment)
AWS_PROFILE=reptidex-dev aws ecs update-service \
  --cluster dev-reptidex-cluster \
  --service dev-reptidex-animal \
  --force-new-deployment

# View ECS service logs
AWS_PROFILE=reptidex-dev aws logs tail /ecs/dev-reptidex --follow

# Check ALB listener rules
AWS_PROFILE=reptidex-dev aws elbv2 describe-rules \
  --listener-arn <listener-arn>

# Check DNS resolution
dig api-dev.reptidex.com

Future Improvements

Short Term (Next 3 Months)

  1. Auto Scaling
    • Configure ECS Service Auto Scaling based on CPU/memory
    • Target tracking scaling policies
    • Scale in/out based on traffic patterns
  2. Enhanced Monitoring
    • CloudWatch dashboards for each service
    • Alarms for high error rates, latency, task failures
    • X-Ray integration for distributed tracing
  3. CI/CD Improvements
    • Automated database migrations in deployment pipeline
    • Automated rollback on failed health checks
    • Canary deployments for safer releases
  4. Cost Optimization
    • Review and right-size ECS task resources
    • Implement CloudWatch Logs retention policies
    • Use Savings Plans for Fargate compute

Medium Term (3-6 Months)

  1. Production Environment
    • Separate production AWS account
    • Multi-AZ RDS with read replicas
    • Increased ECS task counts for high availability
    • Redis cluster mode for better performance
  2. API Gateway
    • Centralized API Gateway for all backend services
    • Rate limiting and throttling
    • Request/response validation
    • Unified API documentation
  3. Service Mesh
    • AWS App Mesh for service-to-service communication
    • mTLS between services
    • Advanced traffic routing (retries, circuit breakers)
    • Better observability
  4. Improved Security
    • AWS WAF for ALB
    • GuardDuty for threat detection
    • Security Hub for centralized security monitoring
    • Automated security scanning in CI/CD

Long Term (6+ Months)

  1. Multi-Region Deployment
    • Deploy to multiple AWS regions
    • Route 53 geo-routing
    • Cross-region database replication
    • DynamoDB global tables
  2. Event-Driven Architecture
    • EventBridge for event bus
    • Lambda functions for background jobs
    • SQS/SNS for async messaging
    • Step Functions for workflows
  3. Advanced Caching
    • CloudFront CDN for static assets
    • API response caching at ALB/API Gateway
    • Redis caching strategies per service
    • Database query result caching
  4. Observability Platform
    • Centralized logging (Grafana)
    • Application Performance Monitoring (APM)
    • Distributed tracing across all services
    • Uptime monitoring (UptimeRobot)
    • Real-user monitoring (RUM) for frontend

Document Maintenance

Last Updated: October 7, 2025
Updated By: Engineering Team
Next Review: When architecture changes significantly
Change Log:
  • 2025-10-07: Updated to reflect ECS Fargate + ALB architecture with subdomain-based routing
  • 2025-10-05: Initial EC2-based Docker Compose documentation

Quick Reference

Infrastructure Deployment

# Deploy all stacks
AWS_PROFILE=reptidex-dev ./infrastructure/scripts/deploy.sh dev all

# Deploy specific stack
AWS_PROFILE=reptidex-dev ./infrastructure/scripts/deploy.sh dev ecs

Service URLs

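Frontend:
  • https://dev.reptidex.com (public)
  • https://admin-dev.reptidex.com (admin)
  • https://breeder-dev.reptidex.com (breeder)
  • https://embed-dev.reptidex.com (embed)
Backend APIs:
  • https://api-dev.reptidex.com (Core)
  • https://animal-api-dev.reptidex.com (Animal)
  • https://commerce-api-dev.reptidex.com (Commerce)
  • https://community-api-dev.reptidex.com (Community)
  • https://media-api-dev.reptidex.com (Media)
  • https://ops-api-dev.reptidex.com (Ops)
Documentation:
  • https://{service-subdomain}/docs (Swagger UI)
  • https://{service-subdomain}/redoc (ReDoc)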

GitHub Actions Secrets

Required secrets in each repository:
| Secret Name | Description |
|---|---|
| AWS_ACCESS_KEY_ID | AWS access key with ECR/ECS permissions |
| AWS_SECRET_ACCESS_KEY | AWS secret access key |
| GH_PACKAGE_TOKEN | GitHub token for accessing @reptidex-app packages |

Contact & Support

For questions about this infrastructure:
  1. Review this document thoroughly
  2. Check troubleshooting section
  3. Review AWS CloudFormation stack events
  4. Check GitHub Actions workflow logs
  5. Review ECS task logs in CloudWatch
Tip for new engineers: Start by reviewing the CloudFormation templates in /infrastructure/templates/ to understand how resources are defined and connected.