What you’ll learn: The complete journey of a peak event from creation to post-peak descaling, covering every system involved, every code path triggered, and every database write.


Section 1 - The Big Picture

A “Peak Event” in EPIC’s lifecycle has three phases:

╔══════════════════════════════════════════════════════════════════╗
║  PHASE 1: PRE-PEAK (Weeks before the event)                      ║
║  ─────────────────────────────────────────────────────────────   ║
║  Create Event → Gather Projections → Order Hardware →            ║
║  Confirm Delivery → Communicate TPM → Update Throttling          ║
╠══════════════════════════════════════════════════════════════════╣
║  PHASE 2: PEAK DAY (During the event)                            ║
║  ─────────────────────────────────────────────────────────────   ║
║  Monitor traffic → Handle emergent orders → Scale if needed      ║
╠══════════════════════════════════════════════════════════════════╣
║  PHASE 3: POST-PEAK (After the event)                            ║
║  ─────────────────────────────────────────────────────────────   ║
║  Gather Descale Projections → Place Descale Orders →             ║
║  Communicate Descale TPM → Update Descale Throttling →           ║
║  Final Sign-off                                                  ║
╚══════════════════════════════════════════════════════════════════╝

Timeline example for Prime Day 2024:

  • T-8 weeks: Event created in EPIC
  • T-6 weeks: Services submit traffic projections
  • T-5 weeks: HOTW runs, places hardware orders (SPCO)
  • T-4 weeks: Hardware delivered to data centers (FMC fulfillment)
  • T-2 weeks: TPM communicated to downstream services
  • T-1 week: Throttling updated in Gizmo/SDC
  • T-0: Prime Day starts!
  • T+1 week: Descale projections gathered
  • T+2 weeks: Descale hardware orders placed
  • T+3 weeks: Hosts returned to pool

Section 2 - Step 1: Event Creation

Who does it: EPIC team or service teams
Where: createEvent.jsx → backend API

User fills form in createEvent.jsx:
├── EventId: "PD2024"
├── EventName: "Prime Day 2024"
├── EventType: "Peak"
├── RegionList: ["us-east-1", "eu-west-1", "us-west-2"]
└── SPCOEventDates per region:
    ├── us-east-1: { spcoStart: "2024-05-15", spcoEnd: "2024-07-20" }
    └── eu-west-1: { spcoStart: "2024-05-15", spcoEnd: "2024-07-20" }
         ↓
POST /event
         ↓
Event.createEvent() in Node.js backend
         ↓
DynamoDB: Creates item in EventTable
{
    EventId: "PD2024",
    EventName: "Prime Day 2024",
    VersionId: 1,
    LatestVersionId: 1,
    RegionList: [...],
    // ...
}
         ↓
SNS publishes "EventCreated" notification
         ↓
SQS EventFleetCreationQueue receives message
         ↓
FleetReceiver.js processes:
    → Gets all registered services
    → For each service × fleet: creates EventPlan in DynamoDB
    EventPlan = {
        EventPlanId: "PD2024#RIPE-NA",
        EventMilestone: [
            { EventMilestoneId: "GatherProjectionFromUpstream", Status: "NotStarted" },
            { EventMilestoneId: "HardwareOrder", Status: "NotStarted" },
            { EventMilestoneId: "HardwareFulfillment", Status: "NotStarted" },
            { EventMilestoneId: "CommunicateTPMToDownstream", Status: "NotStarted" },
            { EventMilestoneId: "UpdateThrottlingInSDC", Status: "NotStarted" }
        ]
    }
         ↓
SQS EventTicketCreationQueue receives message
         ↓
EventTicketReceiver.js:
    → Creates SIM ticket for the event
    → Links ticket to EPIC dashboard
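
The service × fleet fan-out performed by FleetReceiver.js can be sketched as below. This is a minimal illustration, not the actual implementation: the function name buildEventPlans and the shape of the services argument are assumptions. The milestone IDs, the "NotStarted" status, and the EventPlanId format (EventId#FleetId) come from the example above.

```javascript
// Milestone IDs as listed in the EventPlan example above.
const PEAK_MILESTONES = [
  "GatherProjectionFromUpstream",
  "HardwareOrder",
  "HardwareFulfillment",
  "CommunicateTPMToDownstream",
  "UpdateThrottlingInSDC",
];

// Hypothetical helper: builds one EventPlan per service x fleet.
// `services` uses an assumed shape: [{ serviceId, fleetIds: [...] }].
function buildEventPlans(eventId, services) {
  const plans = [];
  for (const service of services) {
    for (const fleetId of service.fleetIds) {
      plans.push({
        EventPlanId: `${eventId}#${fleetId}`,
        // Every milestone starts as NotStarted.
        EventMilestone: PEAK_MILESTONES.map((id) => ({
          EventMilestoneId: id,
          Status: "NotStarted",
        })),
      });
    }
  }
  return plans;
}

const examplePlans = buildEventPlans("PD2024", [
  { serviceId: "RIPE", fleetIds: ["RIPE-NA"] },
]);
console.log(examplePlans[0].EventPlanId); // PD2024#RIPE-NA
```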

Section 3 - Step 2: Gather Projections

Who does it: Service owners (manually + via Axon/PMET automation)
Where: serviceDetails.jsx → Host Projections tab

Service owner opens serviceDetails.jsx for fleet "RIPE-NA" event "PD2024"
         ↓
Frontend calls:
    backend_api.getFleetDetails("RIPE-NA", "PD2024")
    apollo_api.getApolloData("RIPE_NA_PROD")
    axon_traffic_api.getTrafficData("RIPE-NA", "PD2024")
         ↓
Page shows:
    - Current hosts in Apollo: 150
    - Current TPM (from Axon): 35,000
    - CloudTune projection: 50,000 TPM at peak
         ↓
Service owner can:
    1. Accept auto-calculated projection
    2. Override Peak TPM (submitModalForOverridingInputTPM.jsx)
    3. Override specific ASG projections
         ↓
PUT /projection/{fleetId}/{eventId}
         ↓
Projection.updateProjection() saves to ProjectionsTable
         ↓
MilestoneWorkflow triggered:
    → GatherProjectionsMilestoneHandler checks all projections
    → If upstream projections verified → update status to "Completed"
    → Publish to MilestoneSNSTopic: { milestone: "HardwareOrder", fleetId, eventId }

What happens with upstream projections:

Service A sends traffic TO Service B
    ↓
Service A submits "downstream projection" for Service B
    ↓
Service B receives "upstream projection" in their upstreamProjectionsTab
    ↓
Service B verifies it (checks if the number makes sense)
    ↓
Once all upstream projections are verified → GatherProjections milestone = Completed
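
The completion rule above can be sketched as follows. This is an assumption-laden illustration: the function name, the `verified` and `fromService` fields, and the "InProgress" fallback status are hypothetical. Only the "Completed" status and the shape of the published message are taken from the text.

```javascript
// Hypothetical check: the milestone completes only when every upstream
// projection has been verified by the receiving service.
function evaluateGatherProjections(fleetId, eventId, upstreamProjections) {
  const pending = upstreamProjections.filter((p) => !p.verified);
  if (pending.length > 0) {
    return {
      status: "InProgress",
      pendingFrom: pending.map((p) => p.fromService),
      nextMessage: null,
    };
  }
  // All verified: mark complete and hand off to the HardwareOrder milestone.
  return {
    status: "Completed",
    pendingFrom: [],
    nextMessage: { milestone: "HardwareOrder", fleetId, eventId },
  };
}

const check = evaluateGatherProjections("RIPE-NA", "PD2024", [
  { fromService: "ServiceA", verified: true },
  { fromService: "ServiceC", verified: false },
]);
console.log(check.status, check.pendingFrom); // InProgress [ 'ServiceC' ]
```

Note that an empty projection list counts as "all verified" in this sketch; whether the real handler treats it that way is not stated in the source.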

Section 4 - Step 3: HOTW Hardware Ordering (The Core)

Who does it: Automated (HOTW system) + occasional manual trigger
When: Weekly cron job + triggered by Milestone SNS

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
OPTION A: Scheduled Weekly Run
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Cron fires every week
         ↓
HotwHandler.handleSQSRequestForUpdateSpco()
         ↓
For each active event:
    For each registered service:
        For each fleet in correct region:
            → Send to validateAndUpdateSPCOSQSQueue
            Message: { FleetId, EventId, RunId }
         ↓
HotwHandler.validateAndUpdateSpco() (SQS consumer)
         ↓
For each message: calls HotwUpscalingHelper.handle(fleetId, eventId, runId)
         ↓
[The HOTW execution begins; see Section 5]
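
The nested fan-out in Option A can be sketched as below. The input shapes and the function name are hypothetical, and "fleet in correct region" is interpreted here as "the fleet's region appears in the event's RegionList", which is an assumption. Only the message fields (FleetId, EventId, RunId) come from the flow above.

```javascript
// Hypothetical input shapes, for illustration only:
//   events:   [{ EventId, RegionList: [...] }]
//   services: [{ serviceId, fleets: [{ fleetId, region }] }]
function buildSpcoUpdateMessages(events, services, runId) {
  const messages = [];
  for (const event of events) {
    for (const service of services) {
      for (const fleet of service.fleets) {
        // Assumed meaning of "correct region": fleet region is in RegionList.
        if (event.RegionList.includes(fleet.region)) {
          messages.push({
            FleetId: fleet.fleetId,
            EventId: event.EventId,
            RunId: runId,
          });
        }
      }
    }
  }
  return messages;
}

const queued = buildSpcoUpdateMessages(
  [{ EventId: "PD2024", RegionList: ["us-east-1"] }],
  [{ serviceId: "RIPE", fleets: [
    { fleetId: "RIPE-NA", region: "us-east-1" },
    { fleetId: "RIPE-FE", region: "ap-northeast-1" },
  ] }],
  "run-001"
);
console.log(queued.length); // 1
```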
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
OPTION B: Manual Atomic Run (Single Fleet)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Service owner opens hotwDashboard.jsx
    → Sees their fleet is "behind"
    → Clicks "Run HOTW" button
         ↓
atomicHotwModal.jsx opens
    → Shows previous vs current version
    → User confirms
         ↓
POST /hotw/atomic/{fleetId}/{eventId}
Body: { PreviousVersionId: "3", CurrentVersionId: "4" }
         ↓
HotwHandler.atomicHotwTrigger()
    → Creates RunId
    → Sends to atomicHOTWSQSQueue
         ↓
HotwHandler.atomicHotwExecutor() (SQS consumer)
    → Calls HotwUpscalingHelper.handleAtomic(fleetId, eventId, runId, prevVer, currVer)
         ↓
[The HOTW execution begins; see Section 5]

Section 5 - HOTW Execution Details

This is HotwUpscalingHelper.getResult() step by step:

┌─────────────────────────────────────────────────────────────────┐
│  HotwUpscalingHelper.getResult(fleetId, eventId, runId, vId)    │
└─────────────────────────┬───────────────────────────────────────┘
                          │
                    ┌─────▼──────┐
                    │  Step 1    │  Validate fleet
                    │validateFleet│ ← throws if fleet invalid
                    └─────┬──────┘
                          │
                    ┌─────▼──────┐
                    │  Step 2    │  Get fleet data from EPIC Backend
                    │ getFleet() │ → Fleet: { apolloName, hostThroughputTPM, ctFactor... }
                    └─────┬──────┘
                          │
                    ┌─────▼──────┐
                    │  Step 3    │  Get event data
                    │ getEvent() │ → Event: { peakDate, regionList... }
                    └─────┬──────┘
                          │
                    ┌─────▼──────┐
                    │  Step 4    │  Get Apollo current host count
                    │apolloHelper│ → apolloMaxHosts = 150
                    └─────┬──────┘
                          │
                    ┌─────▼──────┐
                    │  Step 5    │  Get FMC pending orders
                    │ fmcHelper  │ → pendingEpicFmcHostOrder = 20
                    └─────┬──────┘
                          │
                    ┌─────▼──────┐
                    │  Step 6    │  Get ASG details from EAP/ScalingPlanner
                    │ eapHelper  │ → For each ASG: EAP status, capacity override
                    └─────┬──────┘
                          │
                    ┌─────▼───────────────────────────────────────┐
                    │  Step 7  CALCULATE HOSTS NEEDED             │
                    │                                             │
                    │  Input TPM = 50,000 (from CloudTune/manual) │
                    │  CT Peak Factor = 1.5                       │
                    │  Buffer Factor = 1.1                        │
                    │  Host Throughput = 250 TPM/host             │
                    │                                             │
                    │  BAU TPM = 50000 / (1.5 × 1.1) = 30,303     │
                    │  Required Hosts = 50000 / 250 = 200         │
                    │  (× AZ Factor if multi-AZ)                  │
                    │                                             │
                    │  Hosts Needed = 200 - 150 - 20 = 30         │
                    │  (required - apollo - pending)              │
                    └─────┬───────────────────────────────────────┘
                          │
                    ┌─────▼───────────────────────────────────────┐
                    │  Step 8  IS ORDER NEEDED?                   │
                    │                                             │
                    │  hostsNeeded > 0?                           │
                    │  YES → place order                          │
                    │  NO → no order needed (log reason)          │
                    └─────┬───────────────────────────────────────┘
                          │
                    ┌─────▼───────────────────────────────────────┐
                    │  Step 9  IS THIS EMERGENT?                  │
                    │                                             │
                    │  Check: is today past emergent start date?  │
                    │  YES → create Sev2 SIM ticket               │
                    │  NO → standard SPCO order                   │
                    └─────┬───────────────────────────────────────┘
                          │
               ┌──────────▼──────────┐
               │ hostsNeeded > 0?    │
               └──────────┬──────────┘
                    ┌─────┴───────────────────────────────┐
                    │ YES                                 │ NO
                    ▼                                     ▼
          ┌──────────────────────┐             ┌──────────────────────┐
          │ Step 10a             │             │ Step 10b             │
          │ fetchArnUpdateSPCO   │             │ Log "no order needed"│
          │  AndReturnSIMLink()  │             │ Update execution     │
          │                      │             │ details with reason  │
          │ → For each ASG:      │             └──────────────────────┘
          │   - Get ASG ARN      │
          │   - Enroll in EAP    │
          │     if not enrolled  │
          │   - Place SPCO order │
          │   - Open SIM ticket  │
          └─────────┬────────────┘
                    │
                    ▼
          ┌──────────────────────────────────────────────────────┐
          │ Step 11  UPDATE HOTW EXECUTION DETAILS               │
          │                                                      │
          │ POST /hotw/executionDetail                           │
          │ Body: {                                              │
          │   FleetId, EventId, RunId,                           │
          │   Status: "Success"/"Fail"/"PartialSuccess",         │
          │   HotwExecutionDetails: { peakTPM, bauTPM, ... },    │
          │   CapacityOverrideDetails: [...],                    │
          │   FulfillmentDetails: [...],                         │
          │   ASGDetails: [...]                                  │
          │ }                                                    │
          └─────────┬────────────────────────────────────────────┘
                    │
                    ▼
          ┌──────────────────────────────┐
          │ Step 12  PUBLISH TO SNS      │
          │                              │
          │ publishHardwareDetailsToSNS()│
          │ → SNS: hardware order summary│
          │ → triggers email to team     │
          └─────────┬────────────────────┘
                    │
           ┌────────▼────────────────────┐
           │ Step 13  FINALLY BLOCK      │
           │ (ALWAYS RUNS)               │
           │                             │
           │ 1. Send to ApolloSQSQueue   │
           │    → refreshes Apollo data  │
           │                             │
           │ 2. Publish to MilestoneSNS  │
           │    → triggers milestone     │
           │      status update          │
           └─────────────────────────────┘
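
The arithmetic in Steps 7 and 8 can be reproduced with a short sketch. The function and parameter names are illustrative rather than the real HotwUpscalingHelper signature, and rounding required hosts up to a whole host is an assumption. The emergent branch follows the "Required - Apollo" formula listed in Section 14.

```javascript
// Illustrative only: mirrors the Step 7 numbers, not the production code.
function calculateHostsNeeded({
  peakTpm,
  ctPeakFactor,
  bufferFactor,
  hostThroughputTpm,
  azFactor = 1,        // 1 means no multi-AZ adjustment
  apolloHosts,
  pendingFmcHosts,
  emergent = false,
}) {
  const bauTpm = peakTpm / (ctPeakFactor * bufferFactor);
  // Assumption: fractional hosts are rounded up.
  const requiredHosts = Math.ceil((peakTpm / hostThroughputTpm) * azFactor);
  // The emergent path ignores pending FMC orders.
  const pending = emergent ? 0 : pendingFmcHosts;
  const hostsNeeded = requiredHosts - apolloHosts - pending;
  return {
    bauTpm: Math.round(bauTpm),
    requiredHosts,
    hostsNeeded,
    orderNeeded: hostsNeeded > 0, // Step 8 decision
  };
}

const result = calculateHostsNeeded({
  peakTpm: 50000,
  ctPeakFactor: 1.5,
  bufferFactor: 1.1,
  hostThroughputTpm: 250,
  apolloHosts: 150,
  pendingFmcHosts: 20,
});
console.log(result.bauTpm, result.requiredHosts, result.hostsNeeded); // 30303 200 30
```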

Section 6 - Step 4: Hardware Delivery Tracking

After orders are placed, FMC tracks delivery:

SPCO order placed (ASG ARN submitted)
         ↓
AWS hardware team processes order
         ↓
FMC order status: "Pending" → "InReview" → "Approved" → "Fulfilled"
         ↓
FmcTrigger cron fires (every few hours)
         ↓
FmcHandler fetches latest status from FMC API
         ↓
FmcHandler updates MySQL fulfillment_details table
         ↓
HardwareFulfillmentMilestoneHandler runs
         ↓
Checks: are all orders fulfilled?
    YES → Update HardwareFulfillment milestone to "Completed"
    NO  → Update with pending count
         ↓
Once all fulfilled:
    → Publish to MilestoneSNSTopic: "CommunicateTPM"

Frontend shows this as:

serviceDetails.jsx → Fleet Milestone Readiness tab:
┌─────────────────────────────────────────────────────────┐
│ Milestone                   Status        Last Updated  │
├─────────────────────────────────────────────────────────┤
│ Gather Projections          ✅ Completed  2024-06-01   │
│ Hardware Order              ✅ Completed  2024-06-15   │
│ Hardware Fulfillment        🟡 In Progress 2024-06-20  │
│   - 3 orders fulfilled                                  │
│   - 1 order pending approval                            │
│ Communicate TPM             ⚪ Not Started              │
│ Update Throttling           ⚪ Not Started              │
└─────────────────────────────────────────────────────────┘

Section 7 - Step 5: TPM Communication

After hardware arrives:

HardwareFulfillment milestone completes
         ↓
CommunicateTPMMilestoneHandler triggered
         ↓
For each downstream service:
    → Creates/updates projection: "RIPE-NA will send X TPM to ServiceB"
    → ServiceB sees this in their upstreamProjectionsTab
    → ServiceB needs to verify/acknowledge
         ↓
Once all downstreams acknowledge:
    → CommunicateTPM milestone = "Completed"
    → Publish to MilestoneSNSTopic: "UpdateThrottling"

Section 8 - Step 6: Throttling Update

UpdateThrottlingMilestoneHandler triggered
         ↓
Reads throttling config from ThrottlingTable
         ↓
Sends to GizmoThrottlingUpdateQueue
         ↓
ThrottlingExecutor processes:
    → Calls Gizmo/SDC API with new throttling limits
    → Sets max TPS for this service during peak
         ↓
ThrottlingExecutor confirms update successful
         ↓
UpdateThrottlingInSDC milestone = "Completed"
         ↓
ALL 5 MILESTONES COMPLETE!
         ↓
Service shows 🟢 PEAK READY in serviceReadinessDashboard.jsx

Section 9 - Step 7: Peak Day

During the actual peak event:

Traffic spikes as customers start shopping
         ↓
Axon tracks real-time TPM for all services
         ↓
EPIC monitors:
    - Are actual TPMs within projected range?
    - Are there enough hosts?
    - Any throttling issues?
         ↓
If emergency (actual > projected by variance threshold):
    → VarianceExceededHandler fires
    → Alerts team via SIM ticket
    → EPIC operators can manually trigger emergent HOTW
         ↓
Emergent HOTW (if needed):
    HotwUpscalingHelper detects emergent=true
    → Creates Sev2 SIM ticket
    → Places urgent SPCO order
    → FLO scales up hosts as fast as possible
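
The emergency condition above can be sketched as a simple comparison. The source does not say how the variance threshold is defined, so this sketch assumes it is a relative margin over the projection (for example 0.1 for 10%); the function and parameter names are hypothetical.

```javascript
// Assumption: the threshold is a fraction of projected TPM.
function varianceExceeded(actualTpm, projectedTpm, thresholdFraction) {
  return actualTpm > projectedTpm * (1 + thresholdFraction);
}

console.log(varianceExceeded(56000, 50000, 0.1)); // true
console.log(varianceExceeded(54000, 50000, 0.1)); // false
```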

EmergentAnnouncementFlashbar.jsx shows red banner:

🚨 EMERGENT SCALING IN PROGRESS
Fleet RIPE-NA requires 50 additional hosts urgently.
Order placed: SIM-123456 | FMC Order: SPCO-789012

Section 10 - Step 8: Descaling (Post-Peak)

After the peak ends, hosts need to be returned:

Event end date passes
         ↓
Descale milestones become active:
    1. GatherDescaleProjections
    2. DescaleHardwareOrder
    3. CommunicateDescaleTPM
    4. UpdateDescaleThrottling
    5. DescaleCompletion
         ↓
Service owners open descaleFleetConfigurations.jsx
    → Review descale recommendation from DescaleHostRecommendationHelper
    → Recommendation: "You can safely return 50 hosts from AZ 1a"
         ↓
Owner confirms descale
         ↓
DescaleRecommendationHandler.handle(fleetId, eventId)
    → Greedy algorithm determines which ASGs to remove from
    → For each ASG to descale:
        → Place SPCO descale order
        → FMC tracks fulfillment
         ↓
BAUScalingHandler takes over:
    → Computes BAU capacity needed
    → Returns any hosts above BAU level
         ↓
DescaleCompletion milestone = "Completed"
         ↓
Event lifecycle complete! ✅

Descale recommendation algorithm (from DescaleHostRecommendationHelper.java):

// Greedy approach: always remove from ASG with MOST hosts first
// Goal: get to target host count as evenly as possible

private Map<String, Integer> descaleHostRecommendation(Fleet fleet, int hostsToRemove) {
    Map<String, Integer> asgCurrentHosts = getCurrentHostsPerASG(fleet);
    Map<String, Integer> toRemovePerASG = new HashMap<>();
    
    while (hostsToRemove > 0) {
        // Find ASG with most hosts
        String maxAsg = asgCurrentHosts.entrySet().stream()
            .max(Map.Entry.comparingByValue())
            .get().getKey();
        
        // Remove one host from that ASG
        asgCurrentHosts.merge(maxAsg, -1, Integer::sum);
        toRemovePerASG.merge(maxAsg, 1, Integer::sum);
        hostsToRemove--;
    }
    
    return toRemovePerASG;
}
// Example: ASGs currently holding [100, 80, 70] hosts → hosts are removed from the 100-host ASG first
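
To see what the greedy loop produces on the example above, here is a small JavaScript restatement of the same idea. It is a sketch, not a port of DescaleHostRecommendationHelper.java: the function name and ASG names are made up, ties go to the first ASG seen, and the guard that stops when no hosts remain is an addition of this sketch.

```javascript
// Greedy descale: repeatedly take one host from the ASG with the most hosts.
function recommendDescale(hostsPerAsg, hostsToRemove) {
  const current = { ...hostsPerAsg }; // work on a copy
  const toRemove = {};
  let remaining = hostsToRemove;
  while (remaining > 0) {
    let maxAsg = null;
    for (const [asg, hosts] of Object.entries(current)) {
      if (hosts > 0 && (maxAsg === null || hosts > current[maxAsg])) {
        maxAsg = asg;
      }
    }
    if (maxAsg === null) break; // guard added here: nothing left to remove
    current[maxAsg] -= 1;
    toRemove[maxAsg] = (toRemove[maxAsg] || 0) + 1;
    remaining -= 1;
  }
  return toRemove;
}

console.log(recommendDescale({ "asg-a": 100, "asg-b": 80, "asg-c": 70 }, 30));
// { 'asg-a': 25, 'asg-b': 5 }
```

The first 20 hosts come from the 100-host ASG; once it is level with the 80-host ASG, the remaining 10 alternate between the two, leaving 75, 75, and 70 hosts.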

Section 11 - Exception Flow (When Normal Rules Don’t Apply)

Sometimes a service needs more buffer than standard:

Standard buffer factor = 1.1 (10% safety margin)
But Service XYZ says: "Our traffic is unpredictable, we need 1.3"
         ↓
Service owner opens createException.jsx:
    → Fills exception request:
        ExceptionType: "BufferFactor"
        RequestedValue: 1.3
        BusinessReason: "Our traffic has 30% variance due to external APIs"
        BusinessDrivers: ["ThirdPartyDependency", "SeasonalVariance"]
    → Submits
         ↓
POST /exception
ExceptionOperations stores in ExceptionTable
         ↓
EPIC team sees in approveException.jsx
    → Reviews business justification
    → Approves or rejects
         ↓
ApprovalsHandler.approveOrReject()
    → If approved: updates fleet's bufferFactor field to 1.3
    → Sends email notification
    → HOTW will use 1.3 as buffer factor for this fleet going forward

Section 12 - BAU (Business As Usual) Flow

Unlike peak, BAU is ongoing rather than event-specific:

BAU capacity = what you need for normal (non-peak) operations

Every week, BAUScalingHandler runs:
    1. Gets BAU TPM from Axon (last week's average)
    2. Applies BAU buffer factor
    3. Calculates BAU hosts needed = (BAU TPM / Host Throughput) × AZ Factor
    4. Compares with current hosts in Apollo
    5. If needed > current: places BAU SPCO order
    6. If needed < current (excess hosts): logs for potential release
         ↓
Service owners see in bauServiceOwnerDashboard.jsx:
    ┌──────────────────────────────────────────────────────────────┐
    │ Service: RIPE                                                │
    │ Fleet: RIPE-NA   BAU TPM: 30,000   Required Hosts: 120       │
    │ Current Hosts: 115   Status: ORDERING 5 HOSTS                │
    └──────────────────────────────────────────────────────────────┘
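
The comparison in steps 4 to 6 of the weekly run can be sketched as below, using the dashboard numbers (120 required, 115 current). The function name and the action labels are hypothetical; only the three outcomes (order, log excess, nothing to do) follow from the steps above.

```javascript
// Hypothetical decision helper for the weekly BAU comparison.
function bauDecision(requiredHosts, currentHosts) {
  const delta = requiredHosts - currentHosts;
  if (delta > 0) return { action: "PLACE_BAU_ORDER", hosts: delta };
  if (delta < 0) return { action: "LOG_EXCESS", hosts: -delta };
  return { action: "NONE", hosts: 0 };
}

console.log(bauDecision(120, 115)); // { action: 'PLACE_BAU_ORDER', hosts: 5 }
```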

Section 13 - Onboarding Flow (New Service)

When a new service joins EPIC:

Service owner opens serviceOnboarding.jsx
    → Fills in service details:
        ServiceId: "MyNewService"
        Owner: "owner@amazon.com"
        ServiceType: "Registered"
        Fleets: [{ region: "us-east-1", fleetIds: ["MyNewService-NA"] }]
         ↓
POST /service
Service.createService() creates in ServiceTable
         ↓
SNS notification sent → Email to EPIC team
         ↓
OnboardingChecklist shows steps:
    ✅ Service registered
    ✅ Fleet created
    ⚪ Connect Axon metrics
    ⚪ Configure PMET links
    ⚪ Set BAU host throughput
    ⚪ Complete onboarding
         ↓
Each step has specific UI + backend integration:
    - Axon: configure which metric tracks TPM
    - PMET: link CloudWatch metric to fleet
    - Host Throughput: how many TPM per host type
         ↓
Once all steps complete:
    → Service is onboarded
    → HOTW will start managing it
    → Shows in readiness dashboards
    → Gets included in weekly HOTW runs

Section 14 - Key Formulas Reference

Formula                                           | What It Calculates           | Code Location
BAU TPM = Peak TPM ÷ (CT Factor × Buffer Factor)  | TPM during normal operations | HardwareOrdersUtil.calculateBauTpm()
Required Hosts = Peak TPM ÷ Host Throughput TPM   | Raw host count for peak      | HotwUpscalingHelper.getResult()
Required Hosts (multi-AZ) = Req Hosts × AZ Factor | Adjusted for AZ redundancy   | Same, AZ factor from OtherConstants
Hosts Needed = Required - Apollo - Pending FMC    | Additional hosts to order    | checkIfDeltaHostsPositive()
Hosts Needed (emergent) = Required - Apollo       | Urgent order, ignore pending | Same, emergent path

AZ Factors (from OtherConstants.js):

Availability Zones | Factor
2 AZs              | 1.5
3 AZs              | 1.33
4 AZs              | 1.25

AZ Factor ensures each AZ has enough hosts even if one AZ fails.
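
A lookup using the factors from the table might look like the following sketch. The constant and function names are hypothetical (the real values live in OtherConstants.js, whose structure is not shown here), and falling back to a factor of 1 for other AZ counts and rounding up are both assumptions.

```javascript
// AZ factors copied from the table above; the object shape is assumed.
const AZ_FACTORS = { 2: 1.5, 3: 1.33, 4: 1.25 };

function requiredHostsMultiAz(peakTpm, hostThroughputTpm, azCount) {
  // Assumption: AZ counts not in the table get no adjustment.
  const factor = AZ_FACTORS[azCount] ?? 1;
  // Assumption: fractional hosts are rounded up.
  return Math.ceil((peakTpm / hostThroughputTpm) * factor);
}

console.log(requiredHostsMultiAz(50000, 250, 2)); // 300
console.log(requiredHostsMultiAz(50000, 250, 4)); // 250
```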


Section 15 - Notification Chain

When HOTW completes for a fleet, this notification chain fires:

1. HotwUpscalingHelper publishes to SNS (hardware order details)
   → Email sent to: service owner, fleet owner, EPIC team
   Template: hardwareOrderDetails.html.ts

2. Published to MilestoneSNSTopic
   → HotwHandler.handleRequest() receives it
   → Updates HardwareOrder milestone status

3. ApolloSQSQueue receives message
   → ApolloHandler refreshes Apollo data for this fleet
   → If Apollo now shows enough hosts → updates dashboard

4. hotwDashboard.jsx polls every 30 seconds
   → Shows updated status for this fleet

Section 16 - State Machine: Milestone Status Transitions

          ┌──────────────┐
          │  NotStarted  │
          └──────┬───────┘
                 │ Previous milestone completed
                 ▼
          ┌──────────────┐
          │  InProgress  │◄──────────── External system updates
          └──────┬───────┘
                 │ All checks pass
                 ▼
          ┌──────────────┐
          │   Completed  │
          └──────────────┘

          Also possible:
          ┌──────────────────┐
          │  PendingOnEPIC   │ ← EPIC automation needs to act
          └──────────────────┘
          ┌──────────────────────┐
          │ PendingOnServiceOwner│ ← Human needs to act
          └──────────────────────┘
          ┌──────────────┐
          │    Blocked   │ ← Something went wrong
          └──────────────┘
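
The transitions that the diagram actually draws can be captured in a small lookup, as sketched below. The trigger names are hypothetical labels for the diagram's edge captions. The diagram does not say how a milestone enters or leaves PendingOnEPIC, PendingOnServiceOwner, or Blocked, so this sketch lists them as valid statuses without modelling their transitions.

```javascript
// All statuses named in the diagram.
const MILESTONE_STATUSES = [
  "NotStarted", "InProgress", "Completed",
  "PendingOnEPIC", "PendingOnServiceOwner", "Blocked",
];

// Only the edges drawn in the diagram; trigger names are made up.
const DRAWN_TRANSITIONS = {
  NotStarted: { previousMilestoneCompleted: "InProgress" },
  InProgress: {
    externalSystemUpdate: "InProgress", // stays InProgress while updates arrive
    allChecksPass: "Completed",
  },
};

function nextStatus(current, trigger) {
  const next = DRAWN_TRANSITIONS[current]?.[trigger];
  if (!next) {
    throw new Error(`No drawn transition from ${current} on ${trigger}`);
  }
  return next;
}

console.log(nextStatus("NotStarted", "previousMilestoneCompleted")); // InProgress
console.log(nextStatus("InProgress", "allChecksPass")); // Completed
```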

Section 17 - Data Flow Diagram

                         EPIC Frontend
                         ┌──────────┐
                         │  React   │
                         │  Redux   │
                         └────┬─────┘
                              │ REST
                    ┌─────────▼──────────┐
                    │   API Gateway      │
                    └─────────┬──────────┘
                              │
                ┌─────────────▼───────────────────────────┐
                │          EPICBackend (Node.js)          │
                │  Fleet  Service  Event  EventPlan  HOTW │
                └─────────────┬───────────────────────────┘
                              │
                    ┌─────────┴─────────┐
                    │                   │
               ┌────▼────┐        ┌─────▼──────┐
               │DynamoDB │        │   MySQL    │
               │         │        │  (Aurora)  │
               │Fleet    │        │            │
               │Service  │        │hotw_run    │
               │Event    │        │hotw_exec   │
               │EventPlan│        │asg_details │
               │Projections│      │capacity_   │
               │Schema   │        │  override  │
               └────┬────┘        └────────────┘
                    │
           DynamoDB Streams
                    │
                    ▼
          ┌──────────────────────┐
          │ EPICBackendTriggers  │
          │    (Java Lambda)     │
          │                      │
          │  ┌────────────────┐  │
          │  │  SQS Queues    │  │
          │  │  40+ queues    │  │
          │  └────────┬───────┘  │
          │           │          │
          │  ┌────────▼───────┐  │
          │  │   Handlers     │  │
          │  │  HOTW/Apollo/  │  │
          │  │  FMC/Milestone/│  │
          │  │  BAU/Throttling│  │
          │  └────────┬───────┘  │
          └───────────┼──────────┘
                      │
          ┌───────────▼────────────────────────────┐
          │      Amazon Internal Systems           │
          │                                        │
          │ Apollo    FMC     SIM     Gizmo        │
          │ (config)  (order) (ticket) (throttle)  │
          │                                        │
          │ ScalingPlanner  CloudTune  Axon        │
          │ (EAP)           (ML proj)  (traffic)   │
          └────────────────────────────────────────┘