What you'll learn: The complete journey of a peak event from creation to post-peak descaling, including every system involved, every code path triggered, and every database write.
Section 1 - The Big Picture
A "Peak Event" in EPIC's lifecycle has three phases:
╔════════════════════════════════════════════════════════════╗
║ PHASE 1: PRE-PEAK (Weeks before the event)                 ║
║ Create Event → Gather Projections → Order Hardware →       ║
║ Confirm Delivery → Communicate TPM → Update Throttling     ║
╠════════════════════════════════════════════════════════════╣
║ PHASE 2: PEAK DAY (During the event)                       ║
║ Monitor traffic → Handle emergent orders → Scale if needed ║
╠════════════════════════════════════════════════════════════╣
║ PHASE 3: POST-PEAK (After the event)                       ║
║ Gather Descale Projections → Place Descale Orders →        ║
║ Communicate Descale TPM → Update Descale Throttling →      ║
║ Final Sign-off                                             ║
╚════════════════════════════════════════════════════════════╝
Timeline example for Prime Day 2024:
- T-8 weeks: Event created in EPIC
- T-6 weeks: Services submit traffic projections
- T-5 weeks: HOTW runs, places hardware orders (SPCO)
- T-4 weeks: Hardware delivered to data centers (FMC fulfillment)
- T-2 weeks: TPM communicated to downstream services
- T-1 week: Throttling updated in Gizmo/SDC
- T-0: Prime Day starts!
- T+1 week: Descale projections gathered
- T+2 weeks: Descale hardware orders placed
- T+3 weeks: Hosts returned to pool
Section 2 - Step 1: Event Creation
Who does it: EPIC team or service teams
Where: createEvent.jsx → backend API
User fills form in createEvent.jsx:
├── EventId: "PD2024"
├── EventName: "Prime Day 2024"
├── EventType: "Peak"
├── RegionList: ["us-east-1", "eu-west-1", "us-west-2"]
└── SPCOEventDates per region:
    ├── us-east-1: { spcoStart: "2024-05-15", spcoEnd: "2024-07-20" }
    └── eu-west-1: { spcoStart: "2024-05-15", spcoEnd: "2024-07-20" }
↓
POST /event
↓
Event.createEvent() in Node.js backend
↓
DynamoDB: Creates item in EventTable
{
    EventId: "PD2024",
    EventName: "Prime Day 2024",
    VersionId: 1,
    LatestVersionId: 1,
    RegionList: [...],
    // ...
}
↓
SNS publishes "EventCreated" notification
↓
SQS EventFleetCreationQueue receives message
↓
FleetReceiver.js processes:
→ Gets all registered services
→ For each service × fleet: creates EventPlan in DynamoDB
EventPlan = {
    EventPlanId: "PD2024#RIPE-NA",
    EventMilestone: [
        { EventMilestoneId: "GatherProjectionFromUpstream", Status: "NotStarted" },
        { EventMilestoneId: "HardwareOrder", Status: "NotStarted" },
        { EventMilestoneId: "HardwareFulfillment", Status: "NotStarted" },
        { EventMilestoneId: "CommunicateTPMToDownstream", Status: "NotStarted" },
        { EventMilestoneId: "UpdateThrottlingInSDC", Status: "NotStarted" }
    ]
}
↓
SQS EventTicketCreationQueue receives message
↓
EventTicketReceiver.js:
→ Creates SIM ticket for the event
→ Links ticket to EPIC dashboard
Section 3 - Step 2: Gather Projections
Who does it: Service owners (manually + via Axon/PMET automation)
Where: serviceDetails.jsx → Host Projections tab
Service owner opens serviceDetails.jsx for fleet "RIPE-NA" event "PD2024"
↓
Frontend calls:
    backend_api.getFleetDetails("RIPE-NA", "PD2024")
    apollo_api.getApolloData("RIPE_NA_PROD")
    axon_traffic_api.getTrafficData("RIPE-NA", "PD2024")
↓
Page shows:
- Current hosts in Apollo: 150
- Current TPM (from Axon): 35,000
- CloudTune projection: 50,000 TPM at peak
↓
Service owner can:
1. Accept auto-calculated projection
2. Override Peak TPM (submitModalForOverridingInputTPM.jsx)
3. Override specific ASG projections
↓
PUT /projection/{fleetId}/{eventId}
↓
Projection.updateProjection() saves to ProjectionsTable
↓
MilestoneWorkflow triggered:
→ GatherProjectionsMilestoneHandler checks all projections
→ If upstream projections verified → update status to "Completed"
→ Publish to MilestoneSNSTopic: { milestone: "HardwareOrder", fleetId, eventId }
What happens with upstream projections:
Service A sends traffic TO Service B
↓
Service A submits "downstream projection" for Service B
↓
Service B receives "upstream projection" in their upstreamProjectionsTab
↓
Service B verifies it (checks if the number makes sense)
↓
Once all upstream projections verified → GatherProjections milestone = Completed
Section 4 - Step 3: HOTW Hardware Ordering (The Core)
Who does it: Automated (HOTW system) + occasional manual trigger
When: Weekly cron job + triggered by Milestone SNS
═══════════════════════════════════════════
OPTION A: Scheduled Weekly Run
═══════════════════════════════════════════
Cron fires every week
↓
HotwHandler.handleSQSRequestForUpdateSpco()
↓
For each active event:
    For each registered service:
        For each fleet in correct region:
            → Send to validateAndUpdateSPCOSQSQueue
              Message: { FleetId, EventId, RunId }
↓
HotwHandler.validateAndUpdateSpco() (SQS consumer)
↓
For each message: calls HotwUpscalingHelper.handle(fleetId, eventId, runId)
↓
[The HOTW execution begins; see Section 5]
═══════════════════════════════════════════
OPTION B: Manual Atomic Run (Single Fleet)
═══════════════════════════════════════════
Service owner opens hotwDashboard.jsx
→ Sees their fleet is "behind"
→ Clicks "Run HOTW" button
↓
atomicHotwModal.jsx opens
→ Shows previous vs current version
→ User confirms
↓
POST /hotw/atomic/{fleetId}/{eventId}
Body: { PreviousVersionId: "3", CurrentVersionId: "4" }
↓
HotwHandler.atomicHotwTrigger()
→ Creates RunId
→ Sends to atomicHOTWSQSQueue
↓
HotwHandler.atomicHotwExecutor() (SQS consumer)
→ Calls HotwUpscalingHelper.handleAtomic(fleetId, eventId, runId, prevVer, currVer)
↓
[The HOTW execution begins; see Section 5]
Section 5 - HOTW Execution Details
This is HotwUpscalingHelper.getResult() step by step:
HotwUpscalingHelper.getResult(fleetId, eventId, runId, vId)
↓
Step 1 (validateFleet): Validate fleet
    → throws if fleet invalid
↓
Step 2 (getFleet()): Get fleet data from EPIC Backend
    → Fleet: { apolloName, hostThroughputTPM, ctFactor... }
↓
Step 3 (getEvent()): Get event data
    → Event: { peakDate, regionList... }
↓
Step 4 (apolloHelper): Get Apollo current host count
    → apolloMaxHosts = 150
↓
Step 5 (fmcHelper): Get FMC pending orders
    → pendingEpicFmcHostOrder = 20
↓
Step 6 (eapHelper): Get ASG details from EAP/ScalingPlanner
    → For each ASG: EAP status, capacity override
↓
Step 7: CALCULATE HOSTS NEEDED
    Input TPM = 50,000 (from CloudTune/manual)
    CT Peak Factor = 1.5
    Buffer Factor = 1.1
    Host Throughput = 250 TPM/host

    BAU TPM = 50000 / (1.5 × 1.1) = 30,303
    Required Hosts = 50000 / 250 = 200
        (× AZ Factor if multi-AZ)

    Hosts Needed = 200 - 150 - 20 = 30
        (required - apollo - pending)
↓
Step 8: IS ORDER NEEDED?
    hostsNeeded > 0?
    YES → place order
    NO → no order needed (log reason)
↓
Step 9: IS THIS EMERGENT?
    Check: is today past emergent start date?
    YES → create Sev2 SIM ticket
    NO → standard SPCO order
↓
hostsNeeded > 0?
    YES → Step 10a: fetchArnUpdateSPCOAndReturnSIMLink()
        → For each ASG:
            - Get ASG ARN
            - Enroll in EAP if not enrolled
            - Place SPCO order
            - Open SIM ticket
    NO → Step 10b: Log "no order needed"
        Update execution details with reason
↓
Step 11: UPDATE HOTW EXECUTION DETAILS
    POST /hotw/executionDetail
    Body: {
        FleetId, EventId, RunId,
        Status: "Success"/"Fail"/"PartialSuccess",
        HotwExecutionDetails: { peakTPM, bauTPM, ... },
        CapacityOverrideDetails: [...],
        FulfillmentDetails: [...],
        ASGDetails: [...]
    }
↓
Step 12: PUBLISH TO SNS
    publishHardwareDetailsToSNS()
    → SNS: hardware order summary
    → triggers email to team
↓
Step 13: FINALLY BLOCK (ALWAYS RUNS)
    1. Send to ApolloSQSQueue
       → refreshes Apollo data
    2. Publish to MilestoneSNS
       → triggers milestone status update
Section 6 - Step 4: Hardware Delivery Tracking
After orders are placed, FMC tracks delivery:
SPCO order placed (ASG ARN submitted)
↓
AWS hardware team processes order
↓
FMC order status: "Pending" → "InReview" → "Approved" → "Fulfilled"
↓
FmcTrigger cron fires (every few hours)
↓
FmcHandler fetches latest status from FMC API
↓
FmcHandler updates MySQL fulfillment_details table
↓
HardwareFulfillmentMilestoneHandler runs
↓
Checks: are all orders fulfilled?
    YES → Update HardwareFulfillment milestone to "Completed"
    NO → Update with pending count
↓
Once all fulfilled:
→ Publish to MilestoneSNSTopic: "CommunicateTPM"
Frontend shows this as:
serviceDetails.jsx → Fleet Milestone Readiness tab:
Milestone              Status           Last Updated
------------------------------------------------------
Gather Projections     ✅ Completed     2024-06-01
Hardware Order         ✅ Completed     2024-06-15
Hardware Fulfillment   🟡 In Progress   2024-06-20
    - 3 orders fulfilled
    - 1 order pending approval
Communicate TPM        ⚪ Not Started
Update Throttling      ⚪ Not Started
Section 7 - Step 5: TPM Communication
After hardware arrives:
HardwareFulfillment milestone completes
↓
CommunicateTPMMilestoneHandler triggered
↓
For each downstream service:
→ Creates/updates projection: "RIPE-NA will send X TPM to ServiceB"
→ ServiceB sees this in their upstreamProjectionsTab
→ ServiceB needs to verify/acknowledge
↓
Once all downstreams acknowledge:
→ CommunicateTPM milestone = "Completed"
→ Publish to MilestoneSNSTopic: "UpdateThrottling"
Section 8 - Step 6: Throttling Update
UpdateThrottlingMilestoneHandler triggered
↓
Reads throttling config from ThrottlingTable
↓
Sends to GizmoThrottlingUpdateQueue
↓
ThrottlingExecutor processes:
→ Calls Gizmo/SDC API with new throttling limits
→ Sets max TPS for this service during peak
↓
ThrottlingExecutor confirms update successful
↓
UpdateThrottlingInSDC milestone = "Completed"
↓
ALL 5 MILESTONES COMPLETE!
↓
Service shows 🟢 PEAK READY in serviceReadinessDashboard.jsx
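The five pre-peak milestones form a strict chain: completing one triggers the next, and a fleet is peak ready only when all five are "Completed". The sketch below illustrates that ordering rule. The milestone IDs come from the EventPlan in Section 2, but the class and method names (MilestoneChainSketch, nextMilestone, isPeakReady) are hypothetical and are not taken from the EPIC codebase.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: illustrates milestone ordering only, not EPIC's handler code. */
final class MilestoneChainSketch {
    // Milestone IDs in the order they appear in the EventPlan (Section 2).
    static final List<String> ORDER = List.of(
            "GatherProjectionFromUpstream",
            "HardwareOrder",
            "HardwareFulfillment",
            "CommunicateTPMToDownstream",
            "UpdateThrottlingInSDC");

    /** Returns the milestone to trigger after `completed`, or empty if it was the last one. */
    static Optional<String> nextMilestone(String completed) {
        int index = ORDER.indexOf(completed);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown milestone: " + completed);
        }
        return index + 1 < ORDER.size() ? Optional.of(ORDER.get(index + 1)) : Optional.empty();
    }

    /** A fleet counts as peak ready only when every milestone status is "Completed". */
    static boolean isPeakReady(Map<String, String> statusByMilestone) {
        return ORDER.stream().allMatch(id -> "Completed".equals(statusByMilestone.get(id)));
    }

    public static void main(String[] args) {
        System.out.println(nextMilestone("HardwareOrder"));
        System.out.println(isPeakReady(Map.of("HardwareOrder", "Completed")));
    }
}
```

A missing status is treated the same as "NotStarted" here, which is an assumption made to keep the sketch short.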
Section 9 - Step 7: Peak Day
During the actual peak event:
Traffic spikes as customers start shopping
↓
Axon tracks real-time TPM for all services
↓
EPIC monitors:
- Are actual TPMs within projected range?
- Are there enough hosts?
- Any throttling issues?
↓
If emergency (actual > projected by variance threshold):
→ VarianceExceededHandler fires
→ Alerts team via SIM ticket
→ EPIC operators can manually trigger emergent HOTW
↓
Emergent HOTW (if needed):
HotwUpscalingHelper detects emergent=true
→ Creates Sev2 SIM ticket
→ Places urgent SPCO order
→ FLO scales up hosts as fast as possible
EmergentAnnouncementFlashbar.jsx shows red banner:
🚨 EMERGENT SCALING IN PROGRESS
Fleet RIPE-NA requires 50 additional hosts urgently.
Order placed: SIM-123456 | FMC Order: SPCO-789012
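The two decisions in this flow, "has actual traffic exceeded the projection by the variance threshold?" and "is today past the emergent start date?", can be sketched as below. Everything here is an assumption made for illustration: the class and method names are hypothetical, the threshold is expressed as a fraction of projected TPM, and the date check treats the start date itself as not yet emergent. The real VarianceExceededHandler and HotwUpscalingHelper may define these differently.

```java
import java.time.LocalDate;

/** Hypothetical sketch of the emergent checks; not the actual EPIC implementation. */
final class EmergentCheckSketch {

    /**
     * True when actual TPM exceeds projected TPM by more than `threshold`,
     * where threshold is a fraction of the projection (0.10 means 10%; assumed convention).
     */
    static boolean varianceExceeded(double actualTpm, double projectedTpm, double threshold) {
        if (projectedTpm <= 0) {
            throw new IllegalArgumentException("projectedTpm must be positive");
        }
        return (actualTpm - projectedTpm) / projectedTpm > threshold;
    }

    /** True when `today` is strictly after the emergent start date (boundary is an assumption). */
    static boolean isEmergent(LocalDate today, LocalDate emergentStart) {
        return today.isAfter(emergentStart);
    }

    public static void main(String[] args) {
        // 60,000 actual vs 50,000 projected is 20% over, above a 10% threshold.
        System.out.println(varianceExceeded(60_000, 50_000, 0.10));
        System.out.println(isEmergent(LocalDate.of(2024, 7, 10), LocalDate.of(2024, 7, 1)));
    }
}
```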
Section 10 - Step 8: Descaling (Post-Peak)
After the peak ends, hosts need to be returned:
Event end date passes
↓
Descale milestones become active:
1. GatherDescaleProjections
2. DescaleHardwareOrder
3. CommunicateDescaleTPM
4. UpdateDescaleThrottling
5. DescaleCompletion
↓
Service owners open descaleFleetConfigurations.jsx
→ Review descale recommendation from DescaleHostRecommendationHelper
→ Recommendation: "You can safely return 50 hosts from AZ 1a"
↓
Owner confirms descale
↓
DescaleRecommendationHandler.handle(fleetId, eventId)
→ Greedy algorithm determines which ASGs to remove from
→ For each ASG to descale:
    → Place SPCO descale order
    → FMC tracks fulfillment
↓
BAUScalingHandler takes over:
→ Computes BAU capacity needed
→ Returns any hosts above BAU level
↓
DescaleCompletion milestone = "Completed"
↓
Event lifecycle complete! ✅
Descale recommendation algorithm (adapted from DescaleHostRecommendationHelper.java):
import java.util.HashMap;
import java.util.Map;

// Greedy approach: always remove from the ASG with the MOST hosts first.
// Goal: reach the target host count while keeping the ASGs as evenly sized as possible.
private Map<String, Integer> descaleHostRecommendation(Fleet fleet, int hostsToRemove) {
    Map<String, Integer> asgCurrentHosts = getCurrentHostsPerASG(fleet);
    Map<String, Integer> toRemovePerASG = new HashMap<>();
    while (hostsToRemove > 0) {
        // Find the ASG with the most hosts
        Map.Entry<String, Integer> max = asgCurrentHosts.entrySet().stream()
            .max(Map.Entry.comparingByValue())
            .orElse(null);
        // Stop if the fleet has no ASGs or no hosts left, instead of throwing or going negative
        if (max == null || max.getValue() <= 0) {
            break;
        }
        String maxAsg = max.getKey();
        // Remove one host from that ASG
        asgCurrentHosts.merge(maxAsg, -1, Integer::sum);
        toRemovePerASG.merge(maxAsg, 1, Integer::sum);
        hostsToRemove--;
    }
    return toRemovePerASG;
}
// Example: ASGs currently holding [100, 80, 70] hosts → hosts are removed from the 100-host ASG first
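To see the greedy behaviour on the [100, 80, 70] example, here is a self-contained variant that can be run on its own. It takes the per-ASG host counts as a plain map instead of a Fleet object; the class name, method name, and ASG names are hypothetical and exist only for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical standalone variant of the greedy descale recommendation. */
final class DescaleSketch {

    /** Returns how many hosts to remove from each ASG, always taking from the largest ASG. */
    static Map<String, Integer> recommend(Map<String, Integer> hostsPerAsg, int hostsToRemove) {
        Map<String, Integer> remaining = new HashMap<>(hostsPerAsg);
        Map<String, Integer> toRemove = new HashMap<>();
        while (hostsToRemove > 0) {
            // Pick the ASG with the most hosts that still has at least one host.
            String maxAsg = null;
            for (Map.Entry<String, Integer> entry : remaining.entrySet()) {
                boolean hasHosts = entry.getValue() > 0;
                if (hasHosts && (maxAsg == null || entry.getValue() > remaining.get(maxAsg))) {
                    maxAsg = entry.getKey();
                }
            }
            if (maxAsg == null) {
                break; // nothing left to remove
            }
            remaining.merge(maxAsg, -1, Integer::sum);
            toRemove.merge(maxAsg, 1, Integer::sum);
            hostsToRemove--;
        }
        return toRemove;
    }

    public static void main(String[] args) {
        Map<String, Integer> fleet = Map.of("asg-a", 100, "asg-b", 80, "asg-c", 70);
        System.out.println(recommend(fleet, 50));
    }
}
```

Removing 50 hosts from this fleet leaves 200 hosts spread across three ASGs, so each ASG ends at 66 or 67 hosts. Which ASG ends up with 66 depends on tie-breaking, which this sketch leaves unspecified.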
Section 11 - Exception Flow (When Normal Rules Don't Apply)
Sometimes a service needs more buffer than standard:
Standard buffer factor = 1.1 (10% safety margin)
But Service XYZ says: "Our traffic is unpredictable, we need 1.3"
↓
Service owner opens createException.jsx:
→ Fills exception request:
    ExceptionType: "BufferFactor"
    RequestedValue: 1.3
    BusinessReason: "Our traffic has 30% variance due to external APIs"
    BusinessDrivers: ["ThirdPartyDependency", "SeasonalVariance"]
→ Submits
↓
POST /exception
ExceptionOperations stores in ExceptionTable
↓
EPIC team sees in approveException.jsx
→ Reviews business justification
→ Approves or rejects
↓
ApprovalsHandler.approveOrReject()
→ If approved: updates fleet's bufferFactor field to 1.3
→ Sends email notification
→ HOTW will use 1.3 as buffer factor for this fleet going forward
Section 12 - BAU (Business As Usual) Flow
BAU is different from peak: it's ongoing, not event-specific.
BAU capacity = what you need for normal (non-peak) operations
Every week, BAUScalingHandler runs:
1. Gets BAU TPM from Axon (last week's average)
2. Applies BAU buffer factor
3. Calculates BAU hosts needed = (BAU TPM / Host Throughput) × AZ Factor
4. Compares with current hosts in Apollo
5. If needed > current: places BAU SPCO order
6. If needed < current (excess hosts): logs for potential release
↓
Service owners see in bauServiceOwnerDashboard.jsx:
┌────────────────────────────────────────────────────────┐
│ Service: RIPE                                          │
│ Fleet: RIPE-NA   BAU TPM: 30,000   Required Hosts: 120 │
│ Current Hosts: 115   Status: ORDERING 5 HOSTS          │
└────────────────────────────────────────────────────────┘
Section 13 - Onboarding Flow (New Service)
When a new service joins EPIC:
Service owner opens serviceOnboarding.jsx
→ Fills in service details:
    ServiceId: "MyNewService"
    Owner: "owner@amazon.com"
    ServiceType: "Registered"
    Fleets: [{ region: "us-east-1", fleetIds: ["MyNewService-NA"] }]
↓
POST /service
Service.createService() creates in ServiceTable
↓
SNS notification sent → Email to EPIC team
↓
OnboardingChecklist shows steps:
✅ Service registered
✅ Fleet created
⚪ Connect Axon metrics
⚪ Configure PMET links
⚪ Set BAU host throughput
⚪ Complete onboarding
↓
Each step has specific UI + backend integration:
- Axon: configure which metric tracks TPM
- PMET: link CloudWatch metric to fleet
- Host Throughput: how many TPM per host type
↓
Once all steps complete:
→ Service is onboarded
→ HOTW will start managing it
→ Shows in readiness dashboards
→ Gets included in weekly HOTW runs
Section 14 - Key Formulas Reference
| Formula | What It Calculates | Code Location |
|---|---|---|
| BAU TPM = Peak TPM ÷ (CT Factor × Buffer Factor) | TPM during normal operations | HardwareOrdersUtil.calculateBauTpm() |
| Required Hosts = Peak TPM ÷ Host Throughput TPM | Raw host count for peak | HotwUpscalingHelper.getResult() |
| Required Hosts (multi-AZ) = Req Hosts × AZ Factor | Adjusted for AZ redundancy | Same, AZ factor from OtherConstants |
| Hosts Needed = Required - Apollo - Pending FMC | Additional hosts to order | checkIfDeltaHostsPositive() |
| Hosts Needed (emergent) = Required - Apollo | Urgent order, ignore pending | Same, emergent path |
AZ Factors (from OtherConstants.js):
| Availability Zones | Factor |
|---|---|
| 2 AZs | 1.5 |
| 3 AZs | 1.33 |
| 4 AZs | 1.25 |
The AZ factor adds redundancy headroom so that the fleet can keep serving traffic if one AZ fails.
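The formulas above can be checked against the worked numbers from Section 5 (50,000 peak TPM, 250 TPM per host, 150 Apollo hosts, 20 pending) with a small sketch. The class and method names are hypothetical and do not mirror HardwareOrdersUtil or HotwUpscalingHelper. Rounding the host count up and clamping a negative delta to zero are assumptions of this sketch; the source does not state how either case is handled.

```java
import java.util.Map;

/** Hypothetical sketch of the Section 14 formulas; not the actual EPIC code. */
final class HostCalculationSketch {
    // AZ factors copied from the table above (which cites OtherConstants.js).
    static final Map<Integer, Double> AZ_FACTORS = Map.of(2, 1.5, 3, 1.33, 4, 1.25);

    /** BAU TPM = Peak TPM / (CT Factor * Buffer Factor). */
    static double bauTpm(double peakTpm, double ctFactor, double bufferFactor) {
        return peakTpm / (ctFactor * bufferFactor);
    }

    /** Required Hosts = (Peak TPM / Host Throughput) * AZ Factor, rounded up (assumption). */
    static int requiredHosts(double peakTpm, double hostThroughputTpm, double azFactor) {
        return (int) Math.ceil(peakTpm / hostThroughputTpm * azFactor);
    }

    /** Hosts Needed = Required - Apollo - Pending FMC; the emergent path ignores pending orders. */
    static int hostsNeeded(int required, int apolloHosts, int pendingFmc, boolean emergent) {
        int delta = emergent ? required - apolloHosts : required - apolloHosts - pendingFmc;
        return Math.max(delta, 0); // clamping to zero is an assumption of this sketch
    }

    public static void main(String[] args) {
        int required = requiredHosts(50_000, 250, 1.0); // single-AZ case from Section 5
        System.out.println("BAU TPM: " + Math.round(bauTpm(50_000, 1.5, 1.1)));
        System.out.println("Required hosts: " + required);
        System.out.println("Hosts needed: " + hostsNeeded(required, 150, 20, false));
        System.out.println("Hosts needed (emergent): " + hostsNeeded(required, 150, 20, true));
    }
}
```

With the Section 5 inputs this gives a BAU TPM of about 30,303, 200 required hosts, and 30 hosts to order on the standard path. The emergent path ignores the 20 pending hosts and therefore asks for 50.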
Section 15 - Notification Chain
When HOTW completes for a fleet, this notification chain fires:
1. HotwUpscalingHelper publishes to SNS (hardware order details)
   → Email sent to: service owner, fleet owner, EPIC team
   Template: hardwareOrderDetails.html.ts
2. Published to MilestoneSNSTopic
   → HotwHandler.handleRequest() receives it
   → Updates HardwareOrder milestone status
3. ApolloSQSQueue receives message
   → ApolloHandler refreshes Apollo data for this fleet
   → If Apollo now shows enough hosts → updates dashboard
4. hotwDashboard.jsx polls every 30 seconds
   → Shows updated status for this fleet
Section 16 - State Machine: Milestone Status Transitions
┌──────────────┐
│  NotStarted  │
└──────┬───────┘
       │ Previous milestone completed
       ▼
┌──────────────┐
│  InProgress  │◄──────────── External system updates
└──────┬───────┘
       │ All checks pass
       ▼
┌──────────────┐
│  Completed   │
└──────────────┘
Also possible:
┌───────────────┐
│ PendingOnEPIC │ ← EPIC automation needs to act
└───────────────┘
┌───────────────────────┐
│ PendingOnServiceOwner │ ← Human needs to act
└───────────────────────┘
┌─────────┐
│ Blocked │ ← Something went wrong
└─────────┘
Section 17 - Data Flow Diagram
EPIC Frontend (React, Redux)
    │ REST
    ▼
API Gateway
    │
    ▼
EPICBackend (Node.js): Fleet, Service, Event, EventPlan, HOTW
    │
    ├── MySQL (Aurora) tables: hotw_run, hotw_exec, asg_details, capacity_override
    │
    └── DynamoDB tables: Fleet, Service, Event, EventPlan, Projections, Schema
            │ DynamoDB Streams
            ▼
        EPICBackendTriggers (Java Lambda)
            SQS Queues (40+ queues)
                │
                ▼
            Handlers: HOTW / Apollo / FMC / Milestone / BAU / Throttling
                │
                ▼
        Amazon Internal Systems
            Apollo (config), FMC (order), SIM (ticket), Gizmo (throttle)
            ScalingPlanner (EAP), CloudTune (ML proj), Axon (traffic)