ReptiDex AWS Deployment Architecture (Current Implementation)
Table of Contents
- Overview
- Architecture Diagram
- AWS Resources
- Deployment Flow
- Service Communication
- Domain & DNS Structure
- Security & Secrets Management
- Troubleshooting
- Future Improvements
Overview
ReptiDex uses a CloudFormation-managed ECS Fargate microservices architecture deployed on AWS, with automated CI/CD pipelines via GitHub Actions. This document reflects the actual current state of our infrastructure as of October 2025.
Current Environment: Development (reptidex-dev)
AWS Account: 974061962050
Region: us-east-1 (US East - N. Virginia)
Profile Name: reptidex-dev
Architecture Summary
- 6 Backend Microservices: FastAPI/Python services running on ECS Fargate (ARM64)
- 4 Frontend Applications: React/TypeScript SPAs built with Vite, served via nginx on ECS Fargate (ARM64)
- Application Load Balancer: Routes traffic based on subdomains and paths
- RDS PostgreSQL: Managed database cluster
- ElastiCache Redis: Managed Redis cluster for caching
- ECR: Docker image registry for all services
- Secrets Manager: Centralized secrets and database credentials
- CloudFormation: Infrastructure as Code for all AWS resources
- GitHub Actions: CI/CD automation for build, test, and deploy
Infrastructure Management
All infrastructure is managed via CloudFormation templates located in /infrastructure/templates/:
- 01-vpc.yaml: VPC, subnets, NAT gateways, internet gateway
- 02-security.yaml: Security groups, IAM roles, instance profiles
- 03-database.yaml: RDS PostgreSQL, ElastiCache Redis
- 04-compute.yaml: ALB, target groups, listener rules, DNS records
- 05-ecs.yaml: ECS cluster, task definitions, services
Architecture Diagram
AWS Resources
1. VPC & Networking
| Resource | Details |
|---|---|
| VPC CIDR | 10.1.0.0/16 |
| Public Subnets | 10.1.1.0/24 (us-east-1a), 10.1.2.0/24 (us-east-1b) |
| Private Subnets | 10.1.10.0/24, 10.1.11.0/24 (ECS tasks) |
| Database Subnets | 10.1.20.0/24, 10.1.21.0/24 (RDS/Redis) |
| Internet Gateway | For public subnet internet access |
| NAT Gateways | 2 (one per AZ) for private subnet outbound traffic |
| VPC Endpoints | ECR API, ECR DKR, S3, Secrets Manager, CloudWatch Logs |
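As a quick sanity check on the address plan above, the following sketch uses Python's standard ipaddress module to confirm that every subnet sits inside the VPC CIDR and that no two subnets overlap. The CIDRs are copied from the table; the dictionary keys are illustrative labels, not resource names from the templates.

```python
import ipaddress
from itertools import combinations

VPC_CIDR = ipaddress.ip_network("10.1.0.0/16")

# Keys are illustrative labels (assumption); CIDRs come from the table above.
SUBNETS = {
    "public-1": "10.1.1.0/24",
    "public-2": "10.1.2.0/24",
    "private-1": "10.1.10.0/24",
    "private-2": "10.1.11.0/24",
    "database-1": "10.1.20.0/24",
    "database-2": "10.1.21.0/24",
}


def validate_subnets(vpc, subnets):
    """Return a list of problems; an empty list means the plan is consistent."""
    problems = []
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    # Every subnet must be contained in the VPC range.
    for name, net in nets.items():
        if not net.subnet_of(vpc):
            problems.append(f"{name} ({net}) is outside {vpc}")
    # No pair of subnets may share addresses.
    for (name_a, net_a), (name_b, net_b) in combinations(nets.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{name_a} overlaps {name_b}")
    return problems
```

Calling validate_subnets(VPC_CIDR, SUBNETS) returns an empty list for the current plan, so the same check could be rerun whenever a subnet is added.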
2. Application Load Balancer (ALB)
| Resource | Details |
|---|---|
| Scheme | Internet-facing |
| Subnets | Public subnets (us-east-1a, us-east-1b) |
| Security Group | Allows 80/443 from 0.0.0.0/0 |
| Listeners | HTTP (80) redirects to HTTPS, HTTPS (443) |
| SSL Certificate | ACM certificate for *.reptidex.com |
| Target Groups | 10 total (6 backend + 4 frontend) |
Target Groups
| Target Group | Port | Health Check Path |
|---|---|---|
| dev-reptidex-core-tg | 8000 | /api/v1/health |
| dev-reptidex-animal-tg | 8001 | /api/v1/health |
| dev-reptidex-commerce-tg | 8002 | /api/v1/health |
| dev-reptidex-media-tg | 8003 | /api/v1/health |
| dev-reptidex-community-tg | 8004 | /api/v1/health |
| dev-reptidex-ops-tg | 8005 | /api/v1/health |
| dev-reptidex-public-tg | 80 | /health |
| dev-reptidex-admin-tg | 80 | /health |
| dev-reptidex-breeder-tg | 80 | /health |
| dev-reptidex-embed-tg | 80 | /health |
3. ECS Fargate
| Resource | Details |
|---|---|
| Cluster Name | dev-reptidex-cluster |
| Launch Type | Fargate |
| Platform | ARM64 (Graviton2) |
| Network Mode | awsvpc |
| Task CPU | 256 (.25 vCPU) per task |
| Task Memory | 512 MB per task |
| Desired Count | 1 per service (dev environment) |
ECS Services
| Service | Task Definition | Port | Image |
|---|---|---|---|
| dev-reptidex-core | dev-reptidex-core:* | 8000 | repti-core:staging |
| dev-reptidex-animal | dev-reptidex-animal:* | 8001 | repti-animal:staging |
| dev-reptidex-commerce | dev-reptidex-commerce:* | 8002 | repti-commerce:staging |
| dev-reptidex-media | dev-reptidex-media:* | 8003 | repti-media:staging |
| dev-reptidex-community | dev-reptidex-community:* | 8004 | repti-community:staging |
| dev-reptidex-ops | dev-reptidex-ops:* | 8005 | repti-ops:staging |
| dev-reptidex-public | dev-reptidex-public:* | 80 | reptidex-web-public:staging |
| dev-reptidex-admin | dev-reptidex-admin:* | 80 | reptidex-web-admin:staging |
| dev-reptidex-breeder | dev-reptidex-breeder:* | 80 | reptidex-web-breeder:staging |
| dev-reptidex-embed | dev-reptidex-embed:* | 80 | reptidex-web-embed:staging |
4. Database (RDS)
| Resource | Details |
|---|---|
| DB Identifier | dev-reptidex-postgres |
| Instance Class | db.t4g.micro (2 vCPU, 1GB RAM, ARM) |
| Engine | PostgreSQL 15.10 |
| Port | 5432 |
| Database Name | postgres |
| Storage | 20GB GP3 (encrypted) |
| Multi-AZ | No (dev environment) |
| Backup Retention | 7 days |
| Encryption | AWS managed KMS key |
5. Cache (ElastiCache Redis)
| Resource | Details |
|---|---|
| Replication Group | dev-reptidex-redis |
| Node Type | cache.t4g.micro (ARM) |
| Engine | Redis 7.1 |
| Port | 6379 |
| Number of Nodes | 1 (dev environment) |
| Encryption | At-rest and in-transit |
6. Container Registry (ECR)
All repositories in region us-east-1:
| Repository Name | Image Architecture | Image Tags |
|---|---|---|
| repti-core | ARM64 | staging, main, staging-{sha} |
| repti-animal | ARM64 | staging, main, staging-{sha} |
| repti-commerce | ARM64 | staging, main, staging-{sha} |
| repti-media | ARM64 | staging, main, staging-{sha} |
| repti-community | ARM64 | staging, main, staging-{sha} |
| repti-ops | ARM64 | staging, main, staging-{sha} |
| reptidex-web-public | ARM64 | staging, main, staging-{sha} |
| reptidex-web-admin | ARM64 | staging, main, staging-{sha} |
| reptidex-web-breeder | ARM64 | staging, main, staging-{sha} |
| reptidex-web-embed | ARM64 | staging, main, staging-{sha} |
Registry URL: 974061962050.dkr.ecr.us-east-1.amazonaws.com
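The registry host and the repository names above combine into full image URIs. A minimal sketch of that composition follows; the helper names are hypothetical, and the 7-character short SHA used for the staging-{sha} tag is an assumption, since the source does not say how long the SHA suffix is.

```python
ACCOUNT_ID = "974061962050"
REGION = "us-east-1"
REGISTRY = f"{ACCOUNT_ID}.dkr.ecr.{REGION}.amazonaws.com"


def image_uri(repository: str, tag: str = "staging") -> str:
    """Build the full ECR image URI for a repository and tag."""
    return f"{REGISTRY}/{repository}:{tag}"


def sha_tag(branch: str, commit_sha: str, length: int = 7) -> str:
    """Build a branch-{sha} tag. The default SHA length is an assumption."""
    return f"{branch}-{commit_sha[:length]}"
```

For example, image_uri("repti-core") yields the URI the dev-reptidex-core task definition would reference for the staging tag.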
7. Secrets Manager
| Secret Name | Purpose |
|---|---|
| dev-reptidex-database-url | PostgreSQL connection string |
| dev-reptidex-db-connection | Database connection details (JSON) |
8. Security Groups
| Security Group | Purpose | Inbound Rules |
|---|---|---|
| dev-reptidex-alb-sg | ALB security group | 80, 443 from 0.0.0.0/0 |
| dev-reptidex-ecs-sg | ECS tasks | 8000-8005, 80 from ALB SG |
| dev-reptidex-rds-sg | RDS database | 5432 from ECS SG |
| dev-reptidex-cache-sg | ElastiCache | 6379 from ECS SG |
9. IAM Roles
| Role | Purpose | Managed Policies |
|---|---|---|
| dev-reptidex-ecs-task-execution-role | ECS task execution | ECR pull, CloudWatch Logs, Secrets Manager |
| dev-reptidex-ecs-task-role | ECS task runtime | Application-specific permissions |
Deployment Flow
CloudFormation Stack Deployment
Infrastructure is deployed using the infrastructure/scripts/deploy.sh script:
- VPC (creates network foundation)
- Security (creates security groups and IAM roles)
- Database (creates RDS and Redis, depends on VPC and Security)
- Compute (creates ALB and DNS records, depends on VPC and Security)
- ECS (creates cluster and services, depends on all previous stacks)
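The dependencies listed above determine a valid deployment order. The sketch below derives one with the standard library's graphlib; it is an illustration of the ordering, not the contents of deploy.sh. The stack keys are shorthand labels, and the security-to-VPC edge is an assumption (the list does not state it, but security groups normally need a VPC ID).

```python
from graphlib import TopologicalSorter

# Each stack maps to the stacks it depends on, per the list above.
# Assumption: "security" depends on "vpc"; the source does not say so.
STACK_DEPENDENCIES = {
    "vpc": set(),
    "security": {"vpc"},
    "database": {"vpc", "security"},
    "compute": {"vpc", "security"},
    "ecs": {"vpc", "security", "database", "compute"},
}


def deployment_order(dependencies):
    """Return the stacks in an order where dependencies always come first."""
    return list(TopologicalSorter(dependencies).static_order())
```

Database and Compute have no dependency on each other, so either may come first; only their position after Security and before ECS is fixed.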
GitHub Actions CI/CD Pipeline
Each service repository has a .github/workflows/cicd.yml file that automates:
Trigger Events
- Push to staging branch: Deploy to development environment
- Push to main branch: Build and tag only (production deployment TBD)
- Pull Request: Run tests only (no deployment)
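The trigger rules above amount to a small decision table. The sketch below expresses it as a function; the Action names and the function itself are hypothetical, and the event strings follow GitHub's own event names ("push", "pull_request"). Cases the rules do not cover return None rather than guessing.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    TEST_ONLY = "test-only"
    BUILD_AND_TAG = "build-and-tag"
    DEPLOY_DEV = "deploy-dev"


def pipeline_action(event: str, branch: str) -> Optional[Action]:
    """Map a GitHub event and branch to the pipeline behaviour described above."""
    if event == "pull_request":
        return Action.TEST_ONLY
    if event == "push" and branch == "staging":
        return Action.DEPLOY_DEV
    if event == "push" and branch == "main":
        return Action.BUILD_AND_TAG
    # Not covered by the documented trigger rules.
    return None
```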
Backend Service Workflow
Frontend Service Workflow
Key Deployment Steps
Backend Services:
ECS Rolling Deployment
ECS performs rolling deployments automatically:
- New Task Start: ECS starts new tasks with updated image
- Health Check: ALB performs health checks on new tasks
- Traffic Shift: Once healthy, ALB routes traffic to new tasks
- Old Task Drain: Old tasks are drained and stopped
- Cleanup: Old task definitions remain for rollback
Deployment configuration:
- Maximum percent: 200% (allows double the desired count during deployment)
- Minimum healthy percent: 100% (maintains at least the desired count during deployment)
- Circuit Breaker: Disabled (manual rollback if needed)
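To make the 200% and 100% settings concrete, the following sketch computes how many tasks may run during a rolling deployment. It assumes the rounding ECS documents for these limits (the minimum is rounded up, the maximum is rounded down); the function name is hypothetical.

```python
def rolling_bounds(desired_count: int,
                   minimum_healthy_percent: int = 100,
                   maximum_percent: int = 200):
    """Return (min_running, max_running) task counts during a deployment."""
    # Integer arithmetic avoids floating-point rounding surprises.
    min_running = -(-desired_count * minimum_healthy_percent // 100)  # ceiling
    max_running = desired_count * maximum_percent // 100              # floor
    return min_running, max_running
```

With the dev desired count of 1, rolling_bounds(1) gives (1, 2): the old task keeps serving while one new task starts, and the old one is stopped only after the new one is healthy.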
Service Communication
Backend Service Architecture
All 6 backend services are independent FastAPI applications that:
- Run in separate ECS Fargate tasks
- Share the same database (RDS PostgreSQL)
- Share the same cache (ElastiCache Redis)
- Have their own subdomain for API access
- Use /api/v1/* for all endpoints (consistent across services)
Service Endpoints
| Service | Subdomain | API Base Path | Health Check |
|---|---|---|---|
| Core | api-dev.reptidex.com | /api/v1/ | /api/v1/health |
| Animal | animal-api-dev.reptidex.com | /api/v1/ | /api/v1/health |
| Commerce | commerce-api-dev.reptidex.com | /api/v1/ | /api/v1/health |
| Media | media-api-dev.reptidex.com | /api/v1/ | /api/v1/health |
| Community | community-api-dev.reptidex.com | /api/v1/ | /api/v1/health |
| Ops | ops-api-dev.reptidex.com | /api/v1/ | /api/v1/health |
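Because every backend shares the same base path and health check, the table above can be turned into a small health sweep. The sketch below uses only the standard library; the function names are hypothetical, and fetch_status is injectable so the sweep can be exercised without network access.

```python
import urllib.error
import urllib.request

# Hostnames copied from the Service Endpoints table.
API_HOSTS = {
    "core": "api-dev.reptidex.com",
    "animal": "animal-api-dev.reptidex.com",
    "commerce": "commerce-api-dev.reptidex.com",
    "media": "media-api-dev.reptidex.com",
    "community": "community-api-dev.reptidex.com",
    "ops": "ops-api-dev.reptidex.com",
}
HEALTH_PATH = "/api/v1/health"


def health_url(service: str) -> str:
    """Build the HTTPS health-check URL for a backend service."""
    return f"https://{API_HOSTS[service]}{HEALTH_PATH}"


def fetch_status(url: str, timeout: float = 5.0):
    """Return the HTTP status code, or None if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout, etc.


def check_all(fetch=fetch_status):
    """Return {service: True/False} where True means the check returned 200."""
    return {name: fetch(health_url(name)) == 200 for name in API_HOSTS}
```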
Documentation Endpoints
Each service has Swagger UI and ReDoc documentation:
- Swagger UI: https://{service-subdomain}/docs
- ReDoc: https://{service-subdomain}/redoc
- OpenAPI Spec: https://{service-subdomain}/openapi.json
Examples:
- https://api-dev.reptidex.com/docs - Core service docs
- https://animal-api-dev.reptidex.com/docs - Animal service docs
Inter-Service Communication
Current Approach: Services communicate via their public ALB endpoints
Database Access
All services connect to the same RDS PostgreSQL instance:
- Connection String: From DATABASE_URL environment variable (Secrets Manager)
- Driver: asyncpg (async PostgreSQL driver for Python)
- ORM: SQLAlchemy 2.0 with async support
- Schema Management: Each service manages its own tables
- Migrations: Alembic (run separately per service)
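SQLAlchemy's async engine selects its driver from the URL scheme, so a plain postgresql:// connection string has to become postgresql+asyncpg:// before use. Whether the stored DATABASE_URL already carries the driver suffix is not stated here, so the sketch below handles both cases. The helper name is hypothetical.

```python
def to_asyncpg_url(database_url: str) -> str:
    """Normalise a PostgreSQL URL so SQLAlchemy uses the asyncpg driver."""
    scheme, separator, rest = database_url.partition("://")
    if not separator:
        raise ValueError("DATABASE_URL is not a URL")
    if scheme in ("postgres", "postgresql"):
        return f"postgresql+asyncpg://{rest}"
    if scheme == "postgresql+asyncpg":
        return database_url  # already in the expected form
    raise ValueError(f"unsupported scheme: {scheme}")
```

A service would typically read os.environ["DATABASE_URL"], pass it through this helper, and hand the result to sqlalchemy.ext.asyncio.create_async_engine.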
Cache Access
All services can access the shared ElastiCache Redis cluster:
- Connection String: From environment variable
- Driver: redis-py with async support
- Usage: Session storage, rate limiting, caching
Domain & DNS Structure
DNS Records (Route 53)
All DNS records point to the Application Load Balancer:
Frontend Applications:
- dev.reptidex.com → ALB → web-public (default)
- admin-dev.reptidex.com → ALB → web-admin
- breeder-dev.reptidex.com → ALB → web-breeder
- embed-dev.reptidex.com → ALB → web-embed
Backend Services:
- api-dev.reptidex.com → ALB → Core service
- animal-api-dev.reptidex.com → ALB → Animal service
- commerce-api-dev.reptidex.com → ALB → Commerce service
- community-api-dev.reptidex.com → ALB → Community service
- media-api-dev.reptidex.com → ALB → Media service
- ops-api-dev.reptidex.com → ALB → Ops service
ALB Listener Rules
The ALB uses host-based routing to direct traffic:
Priority 1: Core service (host: api-dev.reptidex.com)
Priority 2: Animal service (host: animal-api-dev.reptidex.com)
Priority 3: Community service (host: community-api-dev.reptidex.com)
Priority 4: Media service (host: media-api-dev.reptidex.com)
Priority 5: Ops service (host: ops-api-dev.reptidex.com)
Priority 6: Admin frontend (host: admin-dev.reptidex.com)
Priority 7: Breeder frontend (host: breeder-dev.reptidex.com)
Priority 8: Embed frontend (host: embed-dev.reptidex.com)
Priority 11: Commerce service (host: commerce-api-dev.reptidex.com)
Default: Public frontend (host: dev.reptidex.com)
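An ALB evaluates listener rules from the lowest priority number to the highest and falls back to the default action when nothing matches. The sketch below models that evaluation for the rules listed above, returning the target group name each host would reach. It is a simplified model (exact host match only), not a reproduction of the CloudFormation listener rules.

```python
# (priority, host header, service) copied from the listener rules above.
RULES = [
    (1, "api-dev.reptidex.com", "core"),
    (2, "animal-api-dev.reptidex.com", "animal"),
    (3, "community-api-dev.reptidex.com", "community"),
    (4, "media-api-dev.reptidex.com", "media"),
    (5, "ops-api-dev.reptidex.com", "ops"),
    (6, "admin-dev.reptidex.com", "admin"),
    (7, "breeder-dev.reptidex.com", "breeder"),
    (8, "embed-dev.reptidex.com", "embed"),
    (11, "commerce-api-dev.reptidex.com", "commerce"),
]
DEFAULT_SERVICE = "public"


def route(host_header: str) -> str:
    """Return the target group a request with this Host header would reach."""
    # Host matching is case-insensitive; drop any :port suffix.
    host = host_header.split(":")[0].lower()
    for _priority, rule_host, service in sorted(RULES):
        if host == rule_host:
            return f"dev-reptidex-{service}-tg"
    return f"dev-reptidex-{DEFAULT_SERVICE}-tg"
```

Because each rule matches a distinct hostname, the gap between priorities 8 and 11 does not change the outcome; priority only matters when more than one rule could match the same request.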
SSL/TLS
- Certificate: ACM certificate for *.reptidex.com (wildcard)
- Certificate ARN: arn:aws:acm:us-east-1:974061962050:certificate/f38a801d-5873-42cd-be09-232a396590fb
- Protocol: TLS 1.2+
- Termination: At ALB (traffic to ECS tasks is HTTP within VPC)
Security & Secrets Management
AWS Secrets Manager
Database credentials are stored in AWS Secrets Manager:
Secret: dev-reptidex-db-connection
- Task execution role has secretsmanager:GetSecretValue permission
- Secrets are injected as environment variables at task startup
- Secrets are never stored in task definitions or CloudFormation templates
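Inside ECS the secret arrives as an environment variable, so services need no retrieval code. For local debugging it can help to read the JSON secret directly; the sketch below does that through a Secrets Manager client passed in by the caller. The helper name is hypothetical, and the JSON field names are not documented here, so the function returns the parsed dictionary as-is.

```python
import json


def load_json_secret(client, secret_id: str = "dev-reptidex-db-connection") -> dict:
    """Fetch a secret and parse its SecretString as JSON.

    `client` is expected to be a boto3 Secrets Manager client, e.g.
    boto3.client("secretsmanager", region_name="us-east-1").
    """
    response = client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])
```

Taking the client as a parameter keeps the helper free of AWS configuration and lets it be tested with a stub. When inspecting the result, printing only the keys avoids leaking credentials into a terminal history.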
Network Security
Defense in Depth:
- VPC Isolation: Private subnets for ECS tasks and databases
- Security Groups: Restrict traffic to minimum required
- NAT Gateways: Outbound internet access for private subnets (ECR pulls, etc.)
- VPC Endpoints: Private connections to AWS services (no internet required)
- SSL/TLS: Encrypted traffic from users to ALB
IAM Roles & Policies
ECS Task Execution Role (dev-reptidex-ecs-task-execution-role):
- Pull images from ECR
- Write logs to CloudWatch
- Read secrets from Secrets Manager
ECS Task Role (dev-reptidex-ecs-task-role):
- Application-specific AWS service access (if needed)
- Currently minimal permissions
Authentication & Authorization (Future)
Planned Implementation:
- JWT-based authentication via Core service
- OAuth2/OIDC integration
- API key authentication for service-to-service
- Rate limiting via API Gateway
Troubleshooting
Common Issues
1. Service Not Healthy
Symptoms: ECS service showing unhealthy tasks, 503 errors
Diagnosis:
- Container crashes on startup (check logs)
- Health check endpoint returning non-200 status
- Database connection failures
- Missing or incorrect environment variables
2. Database Connection Errors
Symptoms: asyncpg.exceptions.InvalidPasswordError, connection timeout
Diagnosis:
- Security group not allowing ECS → RDS traffic
- Incorrect password in Secrets Manager
- Database not running
- Wrong subnet configuration
3. Image Pull Errors
Symptoms: CannotPullContainerError, task fails to start
Diagnosis:
- Image doesn’t exist in ECR (check GitHub Actions logs)
- Wrong image tag in task definition
- IAM role missing ECR permissions
- ARM64 image requested but amd64 built (or vice versa)
4. ALB Target Unhealthy
Symptoms: Targets showing unhealthy in ALB target group
Diagnosis:
- Health check path incorrect (/api/v1/health vs /health)
- Security group not allowing ALB → ECS traffic
- Service not listening on expected port
- Health check timeout too short
Useful Commands
Future Improvements
Short Term (Next 3 Months)
- Auto Scaling
  - Configure ECS Service Auto Scaling based on CPU/memory
  - Target tracking scaling policies
  - Scale in/out based on traffic patterns
- Enhanced Monitoring
  - CloudWatch dashboards for each service
  - Alarms for high error rates, latency, task failures
  - X-Ray integration for distributed tracing
- CI/CD Improvements
  - Automated database migrations in deployment pipeline
  - Automated rollback on failed health checks
  - Canary deployments for safer releases
- Cost Optimization
  - Review and right-size ECS task resources
  - Implement CloudWatch Logs retention policies
  - Use Savings Plans for Fargate compute
Medium Term (3-6 Months)
- Production Environment
  - Separate production AWS account
  - Multi-AZ RDS with read replicas
  - Increased ECS task counts for high availability
  - Redis cluster mode for better performance
- API Gateway
  - Centralized API Gateway for all backend services
  - Rate limiting and throttling
  - Request/response validation
  - Unified API documentation
- Service Mesh
  - AWS App Mesh for service-to-service communication
  - mTLS between services
  - Advanced traffic routing (retries, circuit breakers)
  - Better observability
- Improved Security
  - AWS WAF for ALB
  - GuardDuty for threat detection
  - Security Hub for centralized security monitoring
  - Automated security scanning in CI/CD
Long Term (6+ Months)
- Multi-Region Deployment
  - Deploy to multiple AWS regions
  - Route 53 geo-routing
  - Cross-region database replication
  - DynamoDB global tables
- Event-Driven Architecture
  - EventBridge for event bus
  - Lambda functions for background jobs
  - SQS/SNS for async messaging
  - Step Functions for workflows
- Advanced Caching
  - CloudFront CDN for static assets
  - API response caching at ALB/API Gateway
  - Redis caching strategies per service
  - Database query result caching
- Observability Platform
  - Centralized logging (Grafana)
  - Application Performance Monitoring (APM)
  - Distributed tracing across all services
  - Uptime monitoring (UptimeRobot)
  - Real-user monitoring (RUM) for frontend
Document Maintenance
Last Updated: October 7, 2025
Updated By: Engineering Team
Next Review: When architecture changes significantly
Change Log:
- 2025-10-07: Updated to reflect ECS Fargate + ALB architecture with subdomain-based routing
- 2025-10-05: Initial EC2-based Docker Compose documentation
Quick Reference
Infrastructure Deployment
Service URLs
Frontend:
- Public: https://dev.reptidex.com
- Admin: https://admin-dev.reptidex.com
- Breeder: https://breeder-dev.reptidex.com
- Embed: https://embed-dev.reptidex.com
Backend APIs:
- Core: https://api-dev.reptidex.com/api/v1/
- Animal: https://animal-api-dev.reptidex.com/api/v1/
- Commerce: https://commerce-api-dev.reptidex.com/api/v1/
- Community: https://community-api-dev.reptidex.com/api/v1/
- Media: https://media-api-dev.reptidex.com/api/v1/
- Ops: https://ops-api-dev.reptidex.com/api/v1/
API Documentation:
- Core Docs: https://api-dev.reptidex.com/docs
- Animal Docs: https://animal-api-dev.reptidex.com/docs
- (etc. for each service)
GitHub Actions Secrets
Required secrets in each repository:
| Secret Name | Description |
|---|---|
| AWS_ACCESS_KEY_ID | AWS access key with ECR/ECS permissions |
| AWS_SECRET_ACCESS_KEY | AWS secret access key |
| GH_PACKAGE_TOKEN | GitHub token for accessing @reptidex-app packages |
Contact & Support
For questions about this infrastructure:
- Review this document thoroughly
- Check troubleshooting section
- Review AWS CloudFormation stack events
- Check GitHub Actions workflow logs
- Review ECS task logs in CloudWatch
Review the CloudFormation templates in /infrastructure/templates/ to understand how resources are defined and connected.
