Deep Technical Report & Architecture Documentation
Prepared for: Internship Reference
Date: May 2026
Codebase: Amazon Internal, Brazil Workspace Package
Table of Contents
- What is EPIC?
- High-Level Architecture
- System Components (5 Packages)
- Data Models & Domain Objects
- Database Architecture
- REST API Reference
- Event Readiness Workflow (Milestones)
- Trigger System (Java Lambdas)
- Notification & Messaging System
- Infrastructure (CDK Stacks)
- Frontend Architecture (React)
- Traffic & Throttling System
- Deployment Pipeline
- Key Business Concepts Glossary
- Developer Setup Cheatsheet
1. What is EPIC?
EPIC (Everyday Peak In Charge) is an Amazon-internal tool that helps engineering teams plan, manage, and execute capacity scaling for peak traffic events such as Prime Day, Black Friday, and the holiday season, as well as BAU (Business As Usual) scaling.
Core Problem It Solves
Amazon services need to handle massive traffic spikes during events. Without coordination:
- Services under-order hardware → crash during peak
- Services over-order hardware → wasteful costs
- Upstream/downstream service teams don't communicate TPM (traffic) needs
- No single view of readiness across hundreds of services
What EPIC Does
| Without EPIC | With EPIC |
|---|---|
| Manual spreadsheets | Central database of all fleets |
| Email chains for TPM numbers | Automated gather/communicate TPM |
| No hardware order tracking | Milestone tracking with deadlines |
| Manual throttling updates | Automated throttle config push |
| No readiness dashboard | Leadership dashboards + HOTW |
| Services forget about descaling | Descale milestones & automation |
Key Events EPIC Manages (Examples from Code)
| Event ID | Event Name | Type |
|---|---|---|
| PrimeDay21 | Prime Day 2021 | Peak |
| NewYearSale2025 | New Year Sale 2025 | Peak |
| NewYearSale2026 | New Year Sale 2026 | Peak |
| SPRINGSALE26 | Spring Sale 2026 | Peak |
| EUSPRINGSALE24 | EU Spring Sale 2024 | Peak |
| BAU | Business As Usual | BAU |
2. High-Level Architecture
┌────────────────────────────────────────────────────────────────────────┐
│                        EPIC SYSTEM ARCHITECTURE                        │
└────────────────────────────────────────────────────────────────────────┘

┌───────────────────────────────────┐
│       SERVICE OWNER / USER        │  (Browser)
└─────────────────┬─────────────────┘
                  │ HTTPS
                  ▼
┌───────────────────────────────────┐
│           EPICFrontend            │  React.js + AWS CloudScape UI
│     Hosted on Amazon Harmony      │  (Beta/Gamma/Prod via CodePipeline)
│     https://console.harmony.      │
│           a2z.com/epic/           │
└─────────────────┬─────────────────┘
                  │ REST API (IAM Auth via Harmony)
                  ▼
┌───────────────────────────────────┐
│          AWS API Gateway          │  REST API
│   (EPICApiStack, CDK deployed)    │  ~60+ routes across 12 domains
└─────────────────┬─────────────────┘
                  │ Lambda Proxy Integration
                  ▼
┌────────────────────────────────────────────────────────────────────────┐
│                 EPICBackend: Node.js Lambda Functions                  │
│                                                                        │
│  Lambdas:  Fleet      Service     Event     EventPlan   Projection     │
│            Throttle   Exception   HOTW      Ticket      BulkJobs       │
└─────────────────┬──────────────────────────────────────────────────────┘
                  │
                  ▼
┌──────────────────────────────────────────────────┐
│               AWS DynamoDB Tables                │
│  FleetTable       ServiceTable       EventTable  │
│  EventPlanTable   ProjectionsTable   SchemaTable │
│  ExceptionTable   ThrottlingTable    HOTWTable   │
└─────────────────┬────────────────────────────────┘
                  │ DynamoDB Streams
                  ▼
┌────────────────────────────────────────────────────────────────────────┐
│               EPICBackendTriggers: Java Lambda Functions               │
│                                                                        │
│  ApolloHandler  AxonHandler  ThrottlingExecutor  BAUScalingHandler     │
│  FloTriggerHandler  ConsensusHandler  MilestoneWorkflowHandler         │
│  ScalingPlannerHandler  PmetHandler  VarianceExceededHandler           │
└───────┬─────────────────┬─────────────────┬─────────────────┬──────────┘
        │                 │                 │                 │
        ▼                 ▼                 ▼                 ▼
┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│    Apollo    │  │     Axon     │  │ SDC / Gizmo  │  │  SIM / FLO   │
│(config push) │  │  (traffic)   │  │ (throttling) │  │ (ticketing)  │
└──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘

┌────────────────────────────────────────────────────────────────────────┐
│                       MESSAGING & NOTIFICATIONS                        │
│                                                                        │
│  SNS (NotificationSNS) ──► SQS (emailQueue) ──► Email Lambda           │
│  SQS (EventFleetCreation) ──► Fleet trigger Lambda                     │
│  SQS (EventTicketCreation) ──► Ticket creation Lambda                  │
└────────────────────────────────────────────────────────────────────────┘

┌────────────────────────────┐
│   AWS RDS (MySQL/Aurora)   │  ← Ticketing, SQL analytics, HOTW data
└────────────────────────────┘
3. System Components (5 Packages)
Package Map
EPIC/ (Brazil Workspace)
├── EPICFrontend/                         ← React.js web application
│   ├── src/pages/                        ← 25+ page views
│   ├── src/components/                   ← Reusable UI components
│   ├── src/client/                       ← API Gateway client
│   └── src/store/                        ← Redux state management
│
├── EPICBackend/                          ← Node.js Lambda business logic
│   ├── src/epiclambda/api/               ← 20+ domain API handlers
│   ├── src/epiclambda/operations/        ← DB operations layer
│   ├── src/epiclambda/common/            ← Constants & utilities
│   ├── src/epiclambda/notification/      ← SNS notifications
│   └── src/epiclambda/sqs/               ← SQS message sending
│
├── EPICBackendCDK/                       ← TypeScript CDK infrastructure
│   ├── lib/apiStack.ts                   ← DynamoDB tables, SQS, SNS
│   ├── lib/EPICApiStack.ts               ← API Gateway routes
│   ├── lib/Fleet/                        ← Fleet Lambda stack
│   ├── lib/Event/                        ← Event Lambda stack
│   ├── lib/Service/                      ← Service Lambda stack
│   ├── lib/HOTW/                         ← HOTW Lambda stack
│   └── lib/rdsStack.ts                   ← RDS MySQL cluster
│
├── EPICBackendTriggers/                  ← Java Lambda event processors
│   └── src/com/amazon/epicbackendtriggers/lambda/
│       ├── handler/                      ← 30+ trigger handlers
│       ├── apollo/                       ← Apollo config integration
│       ├── throttling/                   ← SDC/Gizmo throttle updates
│       ├── bau/                          ← BAU scaling automation
│       ├── milestone/                    ← Milestone workflows
│       └── scalingplanner/               ← Auto scaling planner
│
└── EPICBackendTriggersIntegrationTests/  ← Integration test suite
3.1 EPICFrontend: React Application
| Attribute | Value |
|---|---|
| Framework | React.js |
| UI Library | AWS CloudScape (@amzn/awsui-components-react) |
| State Management | Redux Toolkit (@reduxjs/toolkit) |
| Hosting | Amazon Harmony |
| Node version | v16.0.0 |
| API Auth | Harmony IAM Role (HarmonyAPIGatewayAccessRole) |
| Env configs | .env.development / .env.production |
Pages Overview:
| Page File | Description |
|---|---|
| events.jsx | Browse all peak events |
| createEvent.jsx | Create new peak/BAU event |
| fleetConfigurations.jsx | View & configure fleet scaling |
| serviceDetails.jsx | Per-service configuration detail |
| service.jsx | List all services |
| serviceOnboarding.jsx | Onboard a new service to EPIC |
| serviceReadinessDashboard.jsx | Readiness status per event |
| serviceOwnerDashboardDetail.jsx | Service owner's view of milestones |
| hotwDashboard.jsx | Head of the Week run dashboard |
| hotwRunHistory.jsx | Historical HOTW run data |
| hotwAsgRunDetails.jsx | ASG (Auto Scaling Group) details |
| actionDashboard.jsx | All outstanding action items |
| createException.jsx | Submit a capacity exception |
| approveException.jsx | Review & approve exceptions |
| serviceThrottling.jsx | Throttle config per service |
| descaleFleetConfigurations.jsx | Post-event descale config |
| descaleServiceThrottling.jsx | Post-event descale throttling |
| dimensionView.jsx | Traffic dimension metrics view |
| configureDimension.jsx | Configure metric dimensions |
| bauServiceOwnerDashboard.jsx | BAU scaling dashboard |
| hostMigration.jsx | Host migration tracking |
| syncSettings.jsx | Configuration sync settings |
| upstreamDetails.jsx | Upstream service dependencies |
| serviceDescaleReadinessDashboard.jsx | Descale readiness view |
| onboardingChecklist.jsx | Service onboarding checklist |
3.2 EPICBackend: Node.js Lambda Handlers
Each API handler is a class with static async methods:
| Handler File | Responsibility |
|---|---|
| Fleet.js | CRUD for fleet objects, TPM updates, traffic config, approvals |
| Service.js | CRUD for services, upstream/downstream links, notifications |
| Event.js | CRUD for peak/BAU events, leadership dashboards |
| EventPlan.js | Milestone list management per fleet per event |
| Projection.js | Traffic projections for capacity planning |
| Schema.js | Fleet downstream schema (how traffic is counted) |
| Throttling.js | SDC/Gizmo throttle config data management |
| Exception.js | Capacity exception creation, approval, propagation |
| Ticket.js | Upstream-to-downstream coordination tickets (MySQL) |
| HOTW.js | Head of the Week run & execution details |
| BulkJobs.js | Async bulk job processing (bulk PMET upload etc.) |
| Calendar.js | Excluded dates management for events |
| EventProfile.js | Event profiles for configuration templates |
| Organization.js | Org-level grouping of services |
| Philosophy.js | Scaling philosophy rules per service |
| CustomInputSF.js | Custom input scaling factors |
| CustomFormula.js | Custom TPM computation formulas |
| Employee.js | Employee/user lookup for ownership |
| Dimension.js | Traffic dimension configurations |
| Traffic.js | Input/output traffic management |
| BAUHostJob.js | BAU host ordering job management |
| Pmet.js | PMET (Peak Metric) link management |
| SIM.js | SIM (Amazon ticketing) integration |
| VarianceExceeded.js | Variance detection & alerts |
3.3 EPICBackendCDK: AWS Infrastructure
Written in TypeScript and deployed with AWS CDK through the Brazil build system.
Build commands:
brazil-build # Compile TypeScript
brazil-build release # Build for deployment
brazil-build cdk list # List available stacks
brazil-build cdk deploy <StackName>
3.4 EPICBackendTriggers: Java Lambda Handlers
These handlers are event-driven: each is triggered by a DynamoDB Stream, an SQS message, a schedule or CloudWatch event, or another source listed in the Trigger column below.
Key Handlers:
| Handler | Trigger | What it does |
|---|---|---|
| ApolloHandler.java | Schedule/DDB Stream | Pushes capacity configs to Apollo (Amazon config system) |
| ApolloTriggerHandler.java | SQS | Executes Apollo config push for a fleet |
| BAUScalingHandler.java | Schedule | Runs BAU scaling recommendations |
| ThrottlingExecutor.java | DDB Stream | Pushes throttle changes to SDC/Gizmo systems |
| FloTriggerHandler.java | SQS | Runs FLO (Fleet Light Operations) one-box scaling |
| FloExecutionHandler.java | Schedule | Executes FLO scaling decisions |
| MilestoneWorkflowHandler.java (WorkflowHandler) | API/SQS | Updates milestone completion statuses |
| ScalingPlannerHandler.java | Schedule | Generates scaling plan recommendations |
| ConsensusHandler.java | Schedule | Runs consensus algorithm for host counts |
| AxonHandler.java | Schedule/Event | Integrates with Axon traffic management |
| GatherEmailTriggerHandler.java | SQS | Sends TPM gather request emails |
| PmetHandler.java | Schedule | Refreshes PMET (Peak Metric) links |
| VarianceExceededHandler.java | CloudWatch | Detects TPM variance and alerts |
| HotwHandler.java | Schedule | Runs HOTW automation (ASG management) |
| OnboardingHandler.java | DDB Stream | Processes new service onboarding steps |
| DescaleHostsHandler.java | Schedule | Automates post-event descaling |
| EAPDetailsHandler.java | Event | Updates EAP (Emergency Adjustment Process) details |
| TicketServiceReadinessTriggerHandler.java | DDB Stream | Creates tickets for service readiness |
| TotalPeakProjectionHandler.java | Schedule | Calculates total peak projection across services |
| ValidateUserHandler.java | API | Validates user permissions |
| FmbiHandler.java | S3 | Processes FMBI (Fleet Management Business Intelligence) data |
| CapacityInventoryHandler.java | Schedule | Tracks hardware capacity inventory |
| UpdateFleetTrafficHandler.java | DDB Stream | Cascades traffic updates to downstream fleets |
4. Data Models & Domain Objects
4.1 Service Object
{
"ServiceId": "FORTRESSService",
"ServiceIndexId": 42,
"VersionId": 3,
"Email": "fortress-dev@amazon.com",
"Ldap": "fortress-dev",
"Owner": "johndoe",
"PointOfContact": "janedoe",
"OrganizationId": 1,
"ServiceType": "Registered",
"Api": [{ "Name": "EvaluateInternalTransaction", "UsedForScaling": true }],
"CTI": { "Category": "...", "Type": "...", "Item": "..." },
"Upstreams": ["VCS-NA", "ARMService-NA"],
"DownStreams": ["OrderService-NA"],
"Fleet": ["FORTRESSService-X1-NA", "FORTRESSService-X2-EU"],
"OnboardingStatus": {
"FinalStatusComplete": false,
"CustomerChecklist": {
"ServiceDetailsVerified": false,
"UpstreamsAudited": false,
"DownstreamsAudited": false,
"PermissionsGiven": false,
"HostThroughputTPMUpdated": false,
"PMETLinksGiven": false,
"CloudTuneDriver": null,
"CustomerChecklistSignOff": false
}
},
"AuditMetadata": { "User": "johndoe", "Timestamp": "06/01/2024 12:00:00", "Message": "..." }
}
4.2 Fleet Object
{
"FleetId": "FORTRESSService-X1-NA",
"ServiceId": "FORTRESSService",
"EventId": "PrimeDay26",
"FleetIndexId": 123,
"VersionId": 2,
"FleetType": "Registered",
"FleetConfiguration": {
"ApolloName": "FORTRESSService/NA/X1/Prod",
"Region": "us-east-1",
"AzFactor": 1.125,
"MaxHostCount": 500,
"HostThroughputTPM": 570,
"IsFLORunAutomated": true,
"ApolloNameForFLO": "FORTRESSService/NA/X1/OneBox"
},
"InputTraffic": [
{
"Type": "Self",
"ApiName": "EvaluateInternalTransaction",
"FleetId": "FortressSILService-NA",
"InputTPM": 100
},
{
"Type": "Upstream",
"ApiName": "ComputeRiskProfile",
"FleetId": "VCS-NA",
"ScalingFactor": 1.2,
"ScalingProperties": { "BufferFactor": 0.3 }
},
{
"Type": "CloudTune",
"ApiName": "EvaluateInternalTransaction",
"FleetId": "ARMService-NA",
"CloudTuneProjection": { "ProjectionId": "Physical-Order-Rate-NA", "VersionId": 1 }
}
],
"OutputTraffic": [
{ "Type": "Auto", "ApiName": "ComputeRiskProfile", "FleetId": "VCS-NA", "ScalingFactor": 1.2 }
],
"HostOrderStatuses": {
"HostOrdersNeeded": 200,
"HostsPendingDelivery": 50,
"HostsPendingApproval": 10
},
"ScalingStatus": "Completed",
"BauMetadata": { ... },
"AuditMetadata": { "User": "johndoe", "Timestamp": "...", "Message": "..." }
}
4.3 Event Object
{
"EventId": "PrimeDay26",
"EventName": "Prime Day 2026",
"EventType": "Peak",
"VersionId": 1,
"LatestVersionId": 1,
"RegionList": ["NA", "FE", "EU", "CN"],
"EventStartDate": "07/14/2026 00:00:00",
"EventEndDate": "07/15/2026 23:59:59",
"EventInitialHardwareOrderDate": "02/01/2026 12:00:00",
"EventHardwareReadinessDate": "06/01/2026 12:00:00",
"BAUMonth": "06/2026",
"CloudtunePeakFactor": { "NA": 2.1, "EU": 1.8, "FE": 1.6, "CN": 1.4 },
"SPCOEventDatesByRegion": {
"NA": { "SPCOEventStartDate": "07/01/2026 00:00:00", "SPCOEventEndDate": "07/31/2026 00:00:00" }
},
"AuditMetadata": { "User": "johndoe", "Timestamp": "...", "Message": "Creating PrimeDay26" }
}
4.4 EventPlan (Milestone Tracking)
{
"EventPlanId": "PrimeDay26#FORTRESSService-X1-NA",
"EventId": "PrimeDay26",
"FleetId": "FORTRESSService-X1-NA",
"ServiceId": "FORTRESSService",
"VersionId": 3,
"EventReadinessStatus": false,
"EventMilestone": [
{
"MilestoneId": "GatherProjectionFromUpstream",
"MilestoneCompletionStatus": "Completed",
"ETA": "03/01/2026",
"MilestoneMessage": "Projections gathered"
},
{
"MilestoneId": "HardwareOrder",
"MilestoneCompletionStatus": "Pending",
"SubMilestones": [
{ "MilestoneId": "PlaceHardwareOrder", "MilestoneCompletionStatus": "Completed" },
{ "MilestoneId": "HardwareOrderApproval", "MilestoneCompletionStatus": "Pending" }
]
},
{ "MilestoneId": "HardwareFulfillment", "MilestoneCompletionStatus": "NotStarted" },
{ "MilestoneId": "CommunicateTPMToDownstream", "MilestoneCompletionStatus": "NotStarted" },
{ "MilestoneId": "ThrottlingUpdateBeforeEvent", "MilestoneCompletionStatus": "NotStarted" }
],
"EventDescaleMilestone": [
{ "MilestoneId": "DescaleCompletionMilestone" },
{ "MilestoneId": "GatherDescaleProjectionFromUpstream" },
{ "MilestoneId": "CommunicateDescaleTPMToDownstream" },
{ "MilestoneId": "DescaleThrottlingUpdate" }
]
}
4.5 Throttling Object
{
"RecordId": "FORTRESSService-X1-NA#PrimeDay26",
"FleetIndexId": 123,
"ServiceIndexId": 42,
"EventIndexId": 7,
"Region": "us-east-1",
"EPICUpstream": "VCS-NA",
"Operation": "EvaluateInternalTransaction",
"CurrentLimit": 5000,
"UpscalingLimit": 10000,
"DescalingLimit": 2000,
"IsDisabled": false
}
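The EventPlan and Throttling examples above both use `#`-delimited composite IDs, but in opposite orders (`EventId#FleetId` versus `FleetId#EventId`). The following is a minimal Python sketch of building and splitting them. The function names are hypothetical and are not taken from the EPIC codebase, and the sketch assumes that neither part contains `#`.

```python
from typing import Tuple


def event_plan_id(event_id: str, fleet_id: str) -> str:
    """EventPlanId as shown in the EventPlan example: '<EventId>#<FleetId>'."""
    return f"{event_id}#{fleet_id}"


def throttling_record_id(fleet_id: str, event_id: str) -> str:
    """RecordId as shown in the Throttling example: '<FleetId>#<EventId>'."""
    return f"{fleet_id}#{event_id}"


def split_composite_id(composite_id: str) -> Tuple[str, str]:
    """Split a two-part ID on '#'. Assumes neither part contains '#'."""
    left, sep, right = composite_id.partition("#")
    if not sep or not left or not right or "#" in right:
        raise ValueError(f"not a two-part composite ID: {composite_id!r}")
    return left, right


print(event_plan_id("PrimeDay26", "FORTRESSService-X1-NA"))
print(throttling_record_id("FORTRESSService-X1-NA", "PrimeDay26"))
```

Because the two ID types order their parts differently, a caller has to know which kind of ID it is splitting before interpreting the result.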
5. Database Architecture
DynamoDB Tables
| Table Name | Primary Key | Purpose |
|---|---|---|
| FleetTable | FleetId + EventId | All fleet scaling data |
| FleetIndexTable | FleetIndexId | Auto-increment fleet IDs |
| FleetLockTable | FleetId | Optimistic locking |
| ServiceTable | ServiceId | Service configurations |
| ServiceIndexTable | ServiceIndexId | Auto-increment service IDs |
| EventTable | EventId + VersionId | Peak event metadata |
| EventIndexTable | EventIndexId | Auto-increment event IDs |
| EventPlanTable | EventPlanId + VersionId | Milestone tracking |
| ProjectionsTable | ProjectionId | Traffic projections |
| SchemaTable | FleetId | Fleet downstream schemas |
| EventProfileTable | EventProfileId | Event profile templates |
| ExceptionTable | ExceptionId | Capacity exceptions |
| JobDetailsTable | JobId | Async job status |
| ThrottlingTable | RecordId | Throttle data per fleet |
| ThrottlingConfigTable | ConfigId | Throttle config templates |
| BAUServiceDashboard | ServiceId | BAU scaling dashboard |
DynamoDB Key Design Pattern (Versioning)
EPIC uses a versioning pattern for most entities that pairs a main table with a latest-version index:
- Main table (FleetTable): stores all versions. PK: FleetId + VersionId
- Latest-version index (GSI): quick lookup for current data. GSI: FleetId-LatestVersionId-index
This allows:
- Complete history of every change
- Fast latest-version reads
- Audit trail with AuditMetadata on every record
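The versioning behaviour described above can be illustrated with a small in-memory Python sketch. It is not the EPIC implementation: `VersionedStore` and its methods are hypothetical names, a dictionary stands in for the main table, and a second dictionary stands in for the latest-version index.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional, Tuple


class VersionedStore:
    """In-memory stand-in for a versioned table plus a latest-version lookup."""

    def __init__(self) -> None:
        # (FleetId, VersionId) -> record; plays the role of the main table.
        self._versions: Dict[Tuple[str, int], Dict[str, Any]] = {}
        # FleetId -> latest VersionId; plays the role of the index.
        self._latest: Dict[str, int] = {}

    def put(self, fleet_id: str, data: Dict[str, Any], user: str,
            message: str = "") -> Dict[str, Any]:
        """Write a new version; earlier versions are never overwritten."""
        version_id = self._latest.get(fleet_id, 0) + 1
        record = {
            **data,
            "FleetId": fleet_id,
            "VersionId": version_id,
            "AuditMetadata": {
                "User": user,
                "Timestamp": datetime.now(timezone.utc).strftime("%m/%d/%Y %H:%M:%S"),
                "Message": message,
            },
        }
        self._versions[(fleet_id, version_id)] = record
        self._latest[fleet_id] = version_id
        return record

    def get_latest(self, fleet_id: str) -> Optional[Dict[str, Any]]:
        version_id = self._latest.get(fleet_id)
        return None if version_id is None else self._versions[(fleet_id, version_id)]

    def get_version(self, fleet_id: str, version_id: int) -> Optional[Dict[str, Any]]:
        return self._versions.get((fleet_id, version_id))

    def history(self, fleet_id: str) -> List[Dict[str, Any]]:
        last = self._latest.get(fleet_id, 0)
        return [self._versions[(fleet_id, v)] for v in range(1, last + 1)]


store = VersionedStore()
store.put("FORTRESSService-X1-NA", {"MaxHostCount": 400}, user="johndoe", message="create")
store.put("FORTRESSService-X1-NA", {"MaxHostCount": 500}, user="janedoe", message="raise cap")
print(store.get_latest("FORTRESSService-X1-NA")["VersionId"])
```

Every write appends a record rather than mutating one, which is what makes the full history and per-record audit trail possible.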
SQS Queues
| Queue | Purpose | DLQ |
|---|---|---|
| emailQueue | Email notifications via SES | emailDLQ |
| EventFleetCreationQueue | Async fleet creation on event | EventFleetCreationDLQ |
| EventTicketCreationQueue | Auto-create coordination tickets | EventTicketCreationDLQ |
| DetectFleetInconsistenciesQueue | Background fleet validation | DetectFleetInconsistenciesDLQ |
| PmetLinksRfrshQueue | Refresh PMET links periodically | PmetLinksRfrshDLQ |
| CustomFormulaRefreshQueue | Refresh formula calculations | CustomFormulaRefreshDLQ |
All DLQs retain messages for 14 days.
RDS (MySQL)
Used for:
- Ticketing data (relational upstream-to-downstream tickets)
- SQL analytics (ExecuteSQL endpoint)
- HOTW execution data (ASG run history)
- Host ordering details (OrderDetails)
6. REST API Reference
Base URL: https://<api-gateway-id>.execute-api.<region>.amazonaws.com/Prod
Fleet APIs
| Method | Path | Description |
|---|---|---|
| POST | /fleet | Create a new fleet |
| GET | /fleet/{FleetId}/{EventId} | Get fleet data |
| PUT | /fleet/{FleetId}/{EventId} | Update fleet |
| GET | /fleet/{FleetId}/{EventId}/configuration | Get fleet config |
| PUT | /fleet/{FleetId}/{EventId}/configuration/host_throughput | Update HostTPM |
| PUT | /fleet/{FleetId}/{EventId}/configuration/apollo_properties | Update Apollo config |
| PUT | /fleet/{FleetId}/{EventId}/configuration/AZ_Factor | Update AZ Factor |
| PUT | /fleet/{FleetId}/{EventId}/configuration/custom_thresholds | Update thresholds |
| PUT | /fleet/{FleetId}/{EventId}/configuration/region | Update region |
| GET | /fleet/{FleetId}/{EventId}/Traffic | Get traffic data |
| PUT | /fleet/{FleetId}/{EventId}/Traffic | Update traffic |
| PUT | /fleet/{FleetId}/{EventId}/Traffic/disable | Disable fleet traffic |
| PUT | /fleet/{FleetId}/{EventId}/updateFleetTrigger | Trigger scaling |
| PUT | /fleet/{FleetId}/{EventId}/overrideTotalInputTPM | Override input TPM |
| PUT | /fleet/{FleetId}/{EventId}/updateOutputTpm | Update output TPM |
| PUT | /fleet/{FleetId}/{EventId}/updateBAUTPM | Update BAU TPM |
| PUT | /fleet/{FleetId}/{EventId}/updateDescaleTPM | Update descale TPM |
| PUT | /fleet/{FleetId}/{EventId}/hostOrderStatuses | Update host orders |
| PUT | /fleet/{FleetId}/{EventId}/approval | Submit approval |
| GET | /fleet/{FleetId}/{EventId}/fleetVersionList | Get version history |
| GET | /fleet/{FleetId}/{EventId}/inputTrafficSnapshot | Input traffic snapshot |
| GET | /fleet/{FleetId}/Version/{VersionId} | Get specific version |
| GET | /fleet/batch | Get batch of fleet IDs |
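As an illustration of how the fleet route templates above are filled in, here is a minimal Python sketch that builds a request path from a fleet ID, an event ID, and optional sub-resource segments. The helper name `fleet_path` is hypothetical. The sketch covers path construction only, leaves out the IAM request signing that the API requires, and assumes path segments should be percent-encoded.

```python
from urllib.parse import quote


def fleet_path(fleet_id: str, event_id: str, *suffix: str) -> str:
    """Build '/fleet/{FleetId}/{EventId}/...' with each segment percent-encoded."""
    segments = ["fleet", fleet_id, event_id, *suffix]
    # safe="" also encodes '/', so an ID can never inject an extra path segment.
    return "/" + "/".join(quote(segment, safe="") for segment in segments)


# Path for the "Update HostTPM" route listed in the table above.
print(fleet_path("FORTRESSService-X1-NA", "PrimeDay26", "configuration", "host_throughput"))
```

The resulting path is appended to the base URL for the stage in use.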
Service APIs
| Method | Path | Description |
|---|---|---|
| POST | /service | Create service |
| GET | /service | List all services |
| GET | /service/{ServiceId} | Get service |
| PUT | /service/{ServiceId} | Update service |
| GET | /service/{ServiceId}/upstreams/{EventId} | Get upstream services |
| GET | /service/{ServiceId}/downstreams/{EventId} | Get downstream readiness |
| PUT | /service/{ServiceId}/upstreams/{EventId} | Send gather TPM email |
| PUT | /service/{ServiceId}/throttling/config | Update SDC throttle config |
| PUT | /service/{ServiceId}/throttling/gizmo | Update Gizmo config |
| PUT | /service/{ServiceId}/throttling/fleetStatus | Update fleet throttle status |
| GET | /service/{ServiceId}/throttling/{EventId} | Check throttle readiness |
| PUT | /service/{ServiceId}/onboarding | Update onboarding status |
| GET | /service/{ServiceId}/preference | Get service preferences |
| PUT | /service/{ServiceId}/preference | Update service preferences |
| GET | /service/dashboard | BAU service dashboard |
Event APIs
| Method | Path | Description |
|---|---|---|
| GET | /event | List all events |
| POST | /event | Create event |
| GET | /event/{EventId} | Get event |
| PUT | /event/{EventId} | Update event |
| GET | /event/{EventId}/fleets | Get all fleets for event |
| PUT | /event/{EventId}/dashboard | Leadership dashboard data |
| PUT | /event/{EventId}/dashboard/descaling | Descale dashboard data |
| GET | /event/{EventId}/automatedMetricPercentage | Metric automation % |
EventPlan (Milestones) APIs
| Method | Path | Description |
|---|---|---|
| POST | /eventPlan | Create event plan |
| GET | /eventPlan/{EventId}/{FleetId} | Get event plan |
| PUT | /eventPlan/{EventId}/{FleetId}/eventMilestoneList | Bulk update milestones |
| PUT | /eventPlan/{EventId}/{FleetId}/eventMilestoneDetail | Update milestone detail |
| PUT | /eventPlan/{EventId}/{FleetId}/milestoneStatusUpdate | Update milestone status |
| GET | /eventPlan/{EventId}/{FleetId}/version/{VersionId} | Get versioned plan |
Other Domain APIs
| Domain | Routes include |
|---|---|
| Projection | GET/POST /projection, GET/PUT /projection/{ProjectionId} |
| Schema | PUT /fleet/{FleetId}/{EventId}/schema/downstream, GET /fleet/{FleetId}/schema |
| Exception | POST/PUT /exception, GET /exception/{ExceptionId} |
| Ticket | POST /ticket, GET/PUT /ticket/... |
| HOTW | POST/PUT /hotw/run, POST /hotw/execution, POST/GET /hotw/dashboard |
| Calendar | GET/PUT /calendar |
| BulkJobs | POST/PUT/GET /jobs, GET /jobs/{JobId} |
7. Event Readiness Workflow (Milestones)
This is the core operational workflow that EPIC manages for each fleet per peak event.
PEAK EVENT READINESS WORKFLOW (Per Fleet)

            EVENT CREATED
                  │
                  ▼
┌──────────────────────────────────────────────┐
│ MILESTONE 1: Gather Projection From Upstream │
│                                              │
│ • Send email to all upstream services        │
│ • Upstreams provide expected TPM             │
│ • EPIC auto-calculates required hosts        │
│ Status: NotStarted → Pending → Completed     │
└─────────────────┬────────────────────────────┘
                  │
                  ▼
┌──────────────────────────────────────────────┐
│ MILESTONE 2: Hardware Order                  │
│                                              │
│   Sub-milestone 2a: Place Hardware Order     │
│     (SPCO override submitted)                │
│   Sub-milestone 2b: Hardware Order Approval  │
│     (Business/Regional/Financial)            │
└─────────────────┬────────────────────────────┘
                  │
                  ▼
┌──────────────────────────────────────────────┐
│ MILESTONE 3: Hardware Fulfillment            │
│                                              │
│ • Hardware physically delivered              │
│ • Hosts come online in datacenter            │
│ • Fleet host count verified                  │
└─────────────────┬────────────────────────────┘
                  │
                  ▼
┌──────────────────────────────────────────────┐
│ MILESTONE 4: Communicate TPM To Downstream   │
│                                              │
│ • Send peak TPM numbers to downstream        │
│ • Downstream updates their scaling too       │
│ • Tickets created for coordination           │
└─────────────────┬────────────────────────────┘
                  │
                  ▼
┌──────────────────────────────────────────────┐
│ MILESTONE 5: Throttling Update Before Event  │
│                                              │
│ • SDC/Gizmo throttle limits pushed           │
│ • Limits set to peak capacity                │
│ • Throttling marked ready                    │
└─────────────────┬────────────────────────────┘
                  │
                  ▼
    EVENT READINESS STATUS = TRUE
                  │
        ════════════════════
          PEAK EVENT RUNS
        ════════════════════
                  │
                  ▼
┌──────────────────────────────────────────────┐
│ DESCALE MILESTONE 1: Descale Completion      │
│ DESCALE MILESTONE 2: Gather Descale TPM      │
│ DESCALE MILESTONE 3: Communicate Descale     │
│                      TPM To Downstream       │
│ DESCALE MILESTONE 4: Descale Throttling      │
└──────────────────────────────────────────────┘
Milestone Statuses: NotStarted → Pending → Completed
(Also: NotApplicable, NotAvailable)
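One way to derive an overall readiness flag from a milestone list like the one in section 4.4 is sketched below in Python. The rule used here is an assumption made for illustration, not something the source confirms: a milestone counts as done when its status is `Completed` or `NotApplicable`, and a milestone with `SubMilestones` is done only when all of its sub-milestones are done. The function names are hypothetical.

```python
from typing import Any, Dict

# Assumption: these two statuses count as "done"; EPIC's actual rule may differ.
DONE_STATUSES = {"Completed", "NotApplicable"}


def milestone_done(milestone: Dict[str, Any]) -> bool:
    """A milestone with sub-milestones is done only when all of them are done."""
    subs = milestone.get("SubMilestones")
    if subs:
        return all(milestone_done(sub) for sub in subs)
    return milestone.get("MilestoneCompletionStatus") in DONE_STATUSES


def event_readiness(plan: Dict[str, Any]) -> bool:
    """True when the plan has milestones and every one of them is done."""
    milestones = plan.get("EventMilestone", [])
    return bool(milestones) and all(milestone_done(m) for m in milestones)


plan = {
    "EventMilestone": [
        {"MilestoneId": "GatherProjectionFromUpstream", "MilestoneCompletionStatus": "Completed"},
        {
            "MilestoneId": "HardwareOrder",
            "MilestoneCompletionStatus": "Pending",
            "SubMilestones": [
                {"MilestoneId": "PlaceHardwareOrder", "MilestoneCompletionStatus": "Completed"},
                {"MilestoneId": "HardwareOrderApproval", "MilestoneCompletionStatus": "Pending"},
            ],
        },
    ]
}
print(event_readiness(plan))  # False: HardwareOrderApproval is still Pending
```

Note that under this rule a parent milestone's own status field is ignored whenever sub-milestones are present.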
8. Trigger System (Java Lambdas)
How Triggers Work
DynamoDB FleetTable
                  │ (Stream)
                  ▼
┌───────────────────────────────────┐
│     DynamoDB Stream Processor     │
│       (EPICBackendTriggers)       │
│                                   │
│    INSERT/MODIFY/REMOVE event     │
│   ──► Route to correct handler    │
└─────────────────┬─────────────────┘
                  │
        ┌─────────┴────────┬──────────────────┬──────────────────┐
        ▼                  ▼                  ▼                  ▼
┌───────────────┐  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐
│ Apollo        │  │ Throttling    │  │ Milestone     │  │ Traffic       │
│ Handler       │  │ Executor      │  │ Workflow      │  │ Update        │
│               │  │               │  │ Handler       │  │ Handler       │
│               │  │               │  │               │  │               │
│ Pushes config │  │ SDC + Gizmo   │  │ Auto-complete │  │ Cascades TPM  │
│ to Apollo     │  │ throttle limit│  │ milestones    │  │ changes to    │
│               │  │ update        │  │               │  │ downstream    │
└───────────────┘  └───────────────┘  └───────────────┘  └───────────────┘
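The routing step in the diagram above can be sketched as a small dispatcher. The real trigger handlers are Java Lambdas; this Python version only illustrates the idea. `StreamRouter` is a hypothetical name, and routing purely on the record's `eventName` is an assumption, since the source does not describe the actual routing rules. The `Records`, `eventName`, and `dynamodb.Keys` fields follow the standard DynamoDB Streams event shape that Lambda receives.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Handler = Callable[[Record], None]


class StreamRouter:
    """Route DynamoDB Stream records to handlers registered per event name."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)

    def register(self, event_name: str, handler: Handler) -> None:
        # event_name is one of "INSERT", "MODIFY", "REMOVE".
        self._handlers[event_name].append(handler)

    def dispatch(self, event: Dict[str, Any]) -> int:
        """Call every matching handler; return the number of handler calls."""
        calls = 0
        for record in event.get("Records", []):
            for handler in self._handlers.get(record.get("eventName", ""), []):
                handler(record)
                calls += 1
        return calls


modified_fleets: List[str] = []
router = StreamRouter()
router.register("MODIFY", lambda r: modified_fleets.append(r["dynamodb"]["Keys"]["FleetId"]["S"]))

event = {
    "Records": [
        {"eventName": "MODIFY", "dynamodb": {"Keys": {"FleetId": {"S": "FORTRESSService-X1-NA"}}}},
        {"eventName": "REMOVE", "dynamodb": {"Keys": {"FleetId": {"S": "VCS-NA"}}}},
    ]
}
router.dispatch(event)
print(modified_fleets)
```

Records with no registered handler, such as the `REMOVE` record here, are skipped rather than treated as errors.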
Amazon Internal Systems Integrated
| System | What it is | EPIC's Integration |
|---|---|---|
| Apollo | Amazon's internal configuration deployment system | Pushes fleet capacity configs (SPCO overrides) |
| Axon | Amazon's traffic management system | Reads/writes traffic shaping rules |
| SDC | Service Dependency Control (throttling) | Updates max TPM throttle limits |
| Gizmo | Another throttling framework | Alternative throttle config push |
| FLO | Fleet Light Operations (one-box scaling) | Automated one-box scale tests |
| SIM | Amazon internal ticketing system | Creates SIM tickets for fleet actions |
| CloudTune | Amazon's ML-based capacity recommendation | Source of scaling factor projections |
| Conduit | Amazon's credential management | AWS credential provisioning |
| Harmony | Amazon's frontend app hosting | Hosts the EPIC web UI |
| Brazil | Amazon's build/package management | Used to build and deploy all packages |
| PMET | Peak Metric tracking system | Links to metrics for each fleet |
| HOTW | Head of the Week | Weekly operational automation |
| FMBI | Fleet Management Business Intelligence | Fleet analytics data source |
| Superstar | Amazon's CDK pipeline framework | Used for deploying CDK stacks |
9. Notification & Messaging System
Any Lambda (Fleet update, Service create, etc.)
           │
           │ publish(TopicArn: SNS_TOPIC_ARN)
           ▼
┌──────────────────────┐
│      SNS Topic       │
│  (notificationSNS)   │
└──────────┬───────────┘
           │ SqsSubscription
           ▼
┌──────────────────────┐       ┌──────────────────────┐
│   emailQueue (SQS)   │──────▶│     Email Lambda     │──▶ Amazon SES → Email
└──────────┬───────────┘       └──────────────────────┘
           │
           │ maxReceiveCount: 2
           ▼
┌──────────────────────┐
│       emailDLQ       │  (14-day retention)
└──────────────────────┘
Message Types (MessageAttributes → NotificationName):
- UPDATE: Fleet/Service object updated
- CREATE_SERVICE: New service created
- GATHER_EMAIL: Request TPM from upstream
- PEAK_READINESS: Readiness status change
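The effect of `maxReceiveCount: 2` on the email queue can be illustrated with a small in-memory simulation. This Python sketch does not call SQS; `process_with_redrive` is a hypothetical helper that mimics the redrive behaviour, in which a message whose processing has failed `max_receive_count` times is moved to the dead-letter list instead of being retried again.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Tuple


def process_with_redrive(
    messages: Iterable[str],
    handler: Callable[[str], None],
    max_receive_count: int = 2,
) -> Tuple[List[str], List[str]]:
    """Return (delivered, dead_lettered) after simulating queue redrive."""
    queue: Deque[Tuple[str, int]] = deque((m, 0) for m in messages)
    delivered: List[str] = []
    dead_lettered: List[str] = []
    while queue:
        message, receives = queue.popleft()
        if receives >= max_receive_count:
            # Receive limit reached: move to the DLQ instead of retrying.
            dead_lettered.append(message)
            continue
        try:
            handler(message)
            delivered.append(message)
        except Exception:
            # Processing failed: the message becomes visible again.
            queue.append((message, receives + 1))
    return delivered, dead_lettered


def send_email(message: str) -> None:
    if message == "malformed":
        raise ValueError("cannot render email")


print(process_with_redrive(["GATHER_EMAIL", "malformed"], send_email))
```

With a limit of 2, a message that always fails is attempted twice and then parked in the DLQ, where the 14-day retention gives operators time to inspect it.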
10. Infrastructure (CDK Stacks)
Stack Dependency Tree
SuperStarPersonalBootstrap
├── VpcStack (Virtual Network)
├── ApiStack (DynamoDB + SQS + SNS + SecurityGroup)
├── FleetLambdaStack
├── ServiceLambdaStack
├── EventLambdaStack
├── EventPlanLambdaStack
├── ProjectionsLambdaStack
├── SchemaLambdaStack
├── EventProfileLambdaStack
├── ExceptionLambdaStack (uses ExceptionTable from ApiStack)
├── BulkJobsLambdaStack
├── TicketLambdaStack
├── CalendarLambdaStack
├── HOTWLambdaStack
├── ThrottlingLambdaStack
├── CommonStack
├── OrganizationStack
├── PhilosophyStack
├── CustomInputSFStack
├── CustomFormulaStack
├── ExceptionStack
├── RdsStack (MySQL + Lambda integrations)
├── EPICApiStack (API Gateway, all routes)
└── TriggersStack (CloudWatch / SNS triggers)
Optional:
├── SIMStack
├── MilestoneWorkflowStack
├── AxonStack
├── ApprovalStack
├── ScalingPlannerStack
├── PmetStack
└── TicketingStack
Deployment Stages (CI/CD)
Commit to mainline
        │
        ▼
┌────────────────────────────────────────────┐
│    AWS CodePipeline (EPIC-Prod account)    │
│              us-west-2 region              │
└───────┬────────────────────────────────────┘
        │
┌───────▼────────┐
│   Beta Stage   │ ←── First deployment, automated tests
└───────┬────────┘
        │
┌───────▼────────┐
│  Gamma Stage   │ ←── Staging environment
└───────┬────────┘
        │
┌───────▼────────┐
│   Prod Stage   │ ←── Live at console.harmony.a2z.com/epic/
└────────────────┘
11. Frontend Architecture (React)
src/
├── Epic.js                       ← Root component, routing setup
├── index.js                      ← Entry point, Redux store setup
├── store/                        ← Redux Toolkit state slices
├── client/
│   └── getApigClient.js          ← API Gateway client factory
│                                   (uses aws-api-gateway-client)
├── pages/                        ← 25+ page components (see table above)
├── components/
│   ├── cards/                    ← Card layout components
│   ├── tables/                   ← Data table components
│   ├── Tabs/                     ← Tab navigation
│   ├── Headers/                  ← Page header components
│   ├── Flashbar/                 ← Alert/notification bar
│   ├── Container/                ← Layout containers
│   ├── crumbs/                   ← Breadcrumb navigation
│   ├── VersionHistory/           ← Version history viewer
│   ├── sideNav.jsx               ← Left navigation sidebar
│   ├── topNavigation.jsx         ← Top navigation bar
│   └── withEventTypeRoutes.jsx   ← HOC for event routing
├── common/                       ← Shared utilities
├── configuration/                ← App configuration
├── featureutils/                 ← Feature flag utilities
├── styles/                       ← Global CSS
└── tutorials/                    ← Onboarding tutorials
Key Libraries:
- @amzn/awsui-components-react → AWS CloudScape Design System
- @reduxjs/toolkit → State management
- aws-api-gateway-client → API calls to backend
- moment → Date formatting
- lodash → Utility functions
- csv-string / json2csv → Export to CSV
- react-scripts → Build tooling (CRA)
12. Traffic & Throttling System
TPM Flow: How Traffic Numbers Flow
CloudTune (ML predictions)
                       │ CloudTune Peak Factor
                       ▼
┌────────────────────────────────────────────────────┐
│ EPIC Projection                                    │
│   ProjectionId: "Physical-Order-Rate-NA"           │
│   BAU TPM × CloudtunePeakFactor → Peak TPM         │
└──────────────────────┬─────────────────────────────┘
                       │
                       ▼
┌────────────────────────────────────────────────────┐
│ Fleet InputTraffic                                 │
│                                                    │
│   Type: "Self"      → Direct traffic measurement   │
│   Type: "Upstream"  → Driven by upstream fleet TPM │
│   Type: "CloudTune" → ML model driven traffic      │
│                                                    │
│   Total InputTPM = Σ(InputTraffic sources)         │
│   Required Hosts = InputTPM / HostThroughputTPM    │
│                    × AzFactor (AZ redundancy)      │
└──────────────────────┬─────────────────────────────┘
                       │
                       ▼
┌────────────────────────────────────────────────────┐
│ Throttling Update                                  │
│                                                    │
│   SDC Throttle Config:                             │
│     CurrentLimit:   BAU TPM                        │
│     UpscalingLimit: Peak TPM                       │
│     DescalingLimit: Post-peak TPM                  │
│                                                    │
│   Gizmo Throttle Config:                           │
│     Alternative throttle system with revisions     │
└────────────────────────────────────────────────────┘
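The projection and throttling steps in the flow above can be sketched in TypeScript. This is a minimal illustration under stated assumptions, not EPIC's actual code: the interface shape, the function names, and the example numbers are hypothetical. Only the field names (CurrentLimit, UpscalingLimit, DescalingLimit) and the relation BAU TPM × CloudtunePeakFactor → Peak TPM come from the diagram.

```typescript
// Hypothetical shape for an SDC throttle config; field names follow the diagram.
interface SdcThrottleConfig {
  CurrentLimit: number;   // BAU TPM
  UpscalingLimit: number; // Peak TPM
  DescalingLimit: number; // Post-peak TPM
}

// Peak TPM = BAU TPM × CloudtunePeakFactor (relation taken from the diagram).
function peakTpm(bauTpm: number, cloudtunePeakFactor: number): number {
  if (bauTpm < 0 || cloudtunePeakFactor <= 0) {
    throw new Error("bauTpm must be >= 0 and cloudtunePeakFactor must be > 0");
  }
  return bauTpm * cloudtunePeakFactor;
}

// Assemble the three limits. postPeakTpm is passed in by the caller because
// the diagram does not say how the post-peak value is derived.
function buildSdcThrottleConfig(
  bauTpm: number,
  cloudtunePeakFactor: number,
  postPeakTpm: number
): SdcThrottleConfig {
  return {
    CurrentLimit: bauTpm,
    UpscalingLimit: peakTpm(bauTpm, cloudtunePeakFactor),
    DescalingLimit: postPeakTpm,
  };
}

// Example with made-up numbers: 12,000 BAU TPM and a peak factor of 2.5.
const exampleThrottle = buildSdcThrottleConfig(12000, 2.5, 15000);
// exampleThrottle.UpscalingLimit is 30000 (12000 × 2.5)
```

Keeping the peak calculation in its own function means the same projected value could feed both the host-count math and the throttle limits.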
AZ Factor Values (by region):
┌──────────────────────────────┐
│ us-east-1 (NA)   → 1.125     │ (need 12.5% extra for AZ redundancy)
│ eu-west-1 (EU)   → 1.35      │
│ eu-south-2       → 1.35      │
│ us-west-2 (FE)   → 1.35      │
│ cn-north-1 (CN)  → 2.0       │
│ eu-central-1     → 1.35      │
└──────────────────────────────┘
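As a worked example of the host-count formula (Required Hosts = InputTPM / HostThroughputTPM × AzFactor), the sketch below sums the input traffic sources and applies the regional AZ factor. The type and function names are hypothetical, and rounding up to a whole host is an assumption the source does not state. The AZ factor values are copied from the table above, and 570 TPM/host is the example figure used in the glossary.

```typescript
// Hypothetical types; only the formula and the factor values come from the report.
type InputTrafficType = "Self" | "Upstream" | "CloudTune";

interface InputTraffic {
  type: InputTrafficType;
  tpm: number; // transactions per minute contributed by this source
}

// AZ factor values copied from the table above.
const AZ_FACTORS: Record<string, number> = {
  "us-east-1": 1.125,
  "eu-west-1": 1.35,
  "eu-south-2": 1.35,
  "us-west-2": 1.35,
  "cn-north-1": 2.0,
  "eu-central-1": 1.35,
};

// Total InputTPM = Σ(InputTraffic sources)
function totalInputTpm(sources: InputTraffic[]): number {
  return sources.reduce((sum, source) => sum + source.tpm, 0);
}

// Required Hosts = InputTPM / HostThroughputTPM × AzFactor
function requiredHosts(
  sources: InputTraffic[],
  hostThroughputTpm: number,
  region: string
): number {
  const azFactor = AZ_FACTORS[region];
  if (azFactor === undefined) {
    throw new Error(`No AZ factor known for region ${region}`);
  }
  if (hostThroughputTpm <= 0) {
    throw new Error("hostThroughputTpm must be positive");
  }
  // Assumption: round up, since a fractional host cannot be provisioned.
  return Math.ceil((totalInputTpm(sources) / hostThroughputTpm) * azFactor);
}

// Example with made-up traffic: 100,000 total TPM at 570 TPM/host in us-east-1.
const exampleHosts = requiredHosts(
  [
    { type: "Self", tpm: 60000 },
    { type: "Upstream", tpm: 40000 },
  ],
  570,
  "us-east-1"
);
// 100000 / 570 × 1.125 ≈ 197.37, so exampleHosts is 198
```

The same inputs in cn-north-1 (factor 2.0) would need 351 hosts, which shows how strongly the redundancy multiplier drives hardware orders.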
13. Deployment Pipeline
Frontend Deployment (EPICFrontend)
# Local Development
npm install
npm start # → localhost:3000
# Testing
npm run build_test # build + test combined
npm test # Jest tests only
# Deploy to Harmony
git push origin mainline # remote name "origin" assumed
# → CodePipeline auto-deploys: Beta → Gamma → Prod
Backend Deployment (EPICBackend + CDK)
# Step 1: Build backend packages (run from the workspace src/ directory)
cd EPICBackend && brazil-build release
cd ../EPICBackendTriggers && brazil-build release
cd ../EPICBackendCDK && brazil-build release
# Step 2: Bootstrap CDK (still inside EPICBackendCDK)
brazil-build bootstrap
# Step 3: Deploy stacks
brazil-build cdk deploy Personal-ApiStack # DynamoDB, SQS, SNS
brazil-build cdk deploy Personal-VpcStack # VPC networking
brazil-build cdk deploy Personal-FleetLambdaStack # Fleet Lambdas
brazil-build cdk deploy Personal-EPICApiStack # API Gateway + all routes
# Quick Lambda code update (no full CloudFormation deploy)
bb cdk deploy --hotswap Personal-FleetLambdaStack
14. Key Business Concepts Glossary
| Term | Definition |
|---|---|
| Fleet | A group of Amazon servers (hosts) running a specific service in a region (e.g., FORTRESSService-X1-NA) |
| TPM | Transactions Per Minute: the traffic volume metric used for all scaling decisions |
| HostThroughputTPM | How many TPM a single host can handle (e.g., 570 TPM/host) |
| Peak Event | A scheduled high-traffic period requiring extra capacity (Prime Day, Black Friday, etc.) |
| BAU | Business As Usual: normal (non-peak) operations, also managed in EPIC |
| EventPlan | The per-fleet milestone tracking plan for a peak event |
| Milestone | A specific readiness gate each fleet must pass before peak (hardware order, throttling, etc.) |
| AZ Factor | Availability Zone redundancy multiplier: how much extra capacity to add for multi-AZ redundancy |
| SPCO Override | Service Provider Capacity Override: a request to AWS to provision extra hardware |
| Throttling | Rate-limiting traffic to protect a service from overload |
| SDC | Service Dependency Control: Amazon's internal throttling system |
| Gizmo | Another Amazon throttling framework |
| Apollo | Amazon's internal configuration deployment system |
| CloudTune | Amazon's ML-based capacity recommendation system |
| HOTW | Head of the Week: weekly operational run that automates scaling decisions |
| FLO | Fleet Light Operations: one-box (single host) automated scaling tests |
| Axon | Amazon's traffic management/routing system |
| SIM | Amazon's internal ticketing system (like Jira) |
| Upstream | A service that sends traffic TO the current service |
| Downstream | A service that receives traffic FROM the current service |
| Projection | An estimated traffic forecast for a future event |
| Exception | A request for capacity outside the normal scaling plan |
| Harmony | Amazon's internal frontend app hosting platform |
| Brazil | Amazon's internal build, package, and dependency management system |
| Bindle | Amazon's resource ownership/permission tracking system |
| PMET | Peak Metric: a pre-defined performance metric used to validate scaling |
| FMBI | Fleet Management Business Intelligence: data source for fleet analytics |
15. Developer Setup Cheatsheet
Prerequisites
- Amazon dev-desk (dev environment)
- Brazil CLI installed
- AWS credentials (mwinit)
- Node.js v16.0.0 + npm 8
- Java 11 (JDK)
- BATS CLI: toolbox install batscli
- Harmony CLI installed
Workspace Setup
# Create workspace and pull all packages
brazil ws create --name EPIC
cd EPIC
brazil ws use --versionset EPICBackend/development
brazil ws use --package EPICBackendCDK
brazil ws use --package EPICBackend
brazil ws use --package EPICBackendTriggers
brazil ws use --package EPICFrontend
Running Frontend Locally
cd src/EPICFrontend
npm install
npm start # → localhost:3000 (connected to EPIC-Devo backend)
Running Backend Tests
cd src/EPICBackend
npm test # Runs Jest tests with ≥70% coverage requirement
Common Troubleshooting
| Error | Fix |
|---|---|
| `sh: react-scripts: command not found` | `rm -rf node_modules && npm install` |
| `harmony` command not found | Run `harmony npm` to point to Amazon npm registry |
| `Error: Integrity check failed` | `rm package-lock.json && harmony npm && npm install` |
| `NOT Found - GET https://registry.npmjs.org/@amzn` | Run `harmony npm` first |
| CDK token expired | Run `mwinit -o` or `ada credentials update ...` |
| `npm ERR! ERR_STRING_TOO_LONG` | `rm -rf aws_lambda.bundle.primary.*` then re-run |
| `JAVA_HOME` not found | `echo $JAVA_HOME` → install JDK and set PATH |
Summary
┌─────────────────────────────────────────────────────────────────────┐
│                          EPIC AT A GLANCE                           │
├─────────────────────┬───────────────────────────────────────────────┤
│ Purpose             │ Amazon peak capacity planning & execution     │
│ Users               │ Service owners, EPIC team, leadership         │
│ Scale               │ Hundreds of services, thousands of fleets     │
│ Events Managed      │ Prime Day, Black Friday, Holiday, BAU         │
│ Frontend            │ React + Cloudscape UI on Amazon Harmony       │
│ Backend             │ Node.js Lambda functions (~20 domains)        │
│ Infrastructure      │ AWS CDK TypeScript (~25 stacks)               │
│ Triggers            │ Java Lambdas (~30 handlers)                   │
│ Primary Database    │ DynamoDB (15+ tables with versioning)         │
│ Secondary Database  │ RDS MySQL (tickets, analytics)                │
│ Messaging           │ SNS → SQS (6 queues + DLQs)                   │
│ Auth                │ AWS IAM via Harmony proxy                     │
│ CI/CD               │ AWS CodePipeline: Beta → Gamma → Prod         │
│ Regions             │ NA (us-east-1), EU, FE (us-west-2), CN        │
│ Build System        │ Amazon Brazil                                 │
└─────────────────────┴───────────────────────────────────────────────┘
Report generated from deep codebase analysis of EPIC/EPIC workspace.
Internal Amazon project; not for external distribution.