Dear Valued PEAK Customer,
I would like to take a moment to describe the sequence of events that led up to the complete data center outage, which occurred on Saturday, October 9th at 3:15 AM. This outage affected all PEAK customer services.
During planned maintenance to the Uninterruptible Power Supply (UPS) we experienced an unplanned electrical interruption to the PEAK data center (DC). For more information regarding the project, see the PEAK blog at:
Part of our plan was to ensure continuous operation during the upgrade. To accomplish this, we first started our back-up generator to supply continuous power to the DC in the event of an interruption of city power. Second, we installed a temporary back-feed electrical circuit between the electrical panel fed by the generator and the electrical panel normally fed by the UPS, which feeds all equipment in the DC.
The electrical system of the DC is a three-phase system, which runs at 208 volts. During the weeks leading up to the upgrade, engineers moved workloads to bring the power load at the time of the upgrade to less than 80 amps per phase, within the safe tolerance of the circuit breakers. At the time of the upgrade, the loads per phase were A: 78 amps, B: 80 amps, C: 79 amps.
Electricians installed a circuit capable of supplying 100 amps of load between the generator panel and UPS panel. We used 100 amp three-phase breakers on both sides of the temporary back-feed circuit.
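For readers who want to check the arithmetic behind these figures, the following is a minimal sketch and not part of our actual procedures. It compares the measured per-phase loads against the 100-amp breaker rating and estimates the total apparent power on the 208-volt three-phase system. The 80% continuous-load threshold is an assumption inferred from the 80-amp target above, and the function names are illustrative only.

```python
import math

BREAKER_RATING_AMPS = 100.0  # from the letter: 100-amp three-phase breakers
CONTINUOUS_LIMIT = 0.80      # assumption: 80% guideline implied by the 80-amp target
LINE_VOLTAGE = 208.0         # from the letter: 208-volt three-phase system


def phase_headroom(loads_amps, rating=BREAKER_RATING_AMPS, limit=CONTINUOUS_LIMIT):
    """Return {phase: (fraction of breaker rating, within assumed limit)}."""
    ceiling = rating * limit
    return {
        phase: (amps / rating, amps <= ceiling)
        for phase, amps in loads_amps.items()
    }


def apparent_power_kva(loads_amps, line_voltage=LINE_VOLTAGE):
    """Estimate three-phase apparent power as the sum of per-phase V * I."""
    # Line-to-neutral voltage is the line-to-line voltage divided by sqrt(3).
    phase_voltage = line_voltage / math.sqrt(3)
    return sum(phase_voltage * amps for amps in loads_amps.values()) / 1000.0


# Per-phase loads reported at the time of the upgrade.
measured = {"A": 78.0, "B": 80.0, "C": 79.0}

report = phase_headroom(measured)
total_kva = apparent_power_kva(measured)

for phase, (utilization, ok) in report.items():
    print(f"Phase {phase}: {utilization:.0%} of breaker rating, within limit: {ok}")
print(f"Estimated apparent power: {total_kva:.1f} kVA")
```

Under these assumptions, phase B sits exactly at the assumed limit, which illustrates how little margin the temporary circuit had.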
At 2:30 AM, the stand-by generator was started and electrical load was transferred from CITY to GENERATOR power.
At 3:00 AM, we initiated procedures to shut down the old Liebert UPS and put the UPS in maintenance-bypass mode. This allowed electrical power to flow from the generator panel through the UPS to the UPS sub-panel. At this time, we closed (turned on) the temporary back-feed breakers, which put the UPS in parallel with the temporary back-feed circuit. Next, we completely shut down the UPS. This transferred the load in the DC to the temporary back-feed circuit. We amp-probed the back-feed circuit and confirmed loads of A: 78 amps, B: 80 amps, C: 79 amps. The electricians felt comfortable that this circuit would hold the load properly.
PEAK engineers, equipment movers and electricians began the process of safely disconnecting and removing the existing UPS from the DC, while the generator supplied power.
Around 3:15 AM, for an unknown reason, the 100-amp breaker on the generator side of the back-feed circuit tripped, causing a complete loss of electrical power to the DC. At this time, the decision was made to shut down the electrical sub-panel supplying power to the DC co-location area, which would remove around 50 amps from our workload.
Around 3:20 AM, we re-closed the tripped breaker and power was restored to the DC.
The sudden, unplanned loss of electrical power to the DC caused cascading failures. Our infrastructure relies on hundreds of devices, which all work together to provide Internet and information technology services to our customers. The procedures to re-start these systems are tedious and time-consuming, and must be done in a specific order. Engineers began this re-start procedure immediately.
The most significant failure during the re-start was our network switching/routing core, which runs on Juniper EX4200 switches. Two of the four switches did not re-start properly and required a re-load of their operating system.
In addition, a Network Appliance Filer for data storage did not re-start. The controller for this device completely failed and a replacement has been procured, which will arrive on Tuesday. Most of the data on the filer was moved to alternate servers, but a few services still rely on this filer. The most significant of these is customer personal web space.
At 5:00 AM, the existing UPS was completely removed from the DC and the process of bringing in the new UPS began.
Around 6:00 AM, the factory service technician from APC arrived on-site. The new UPS is a modular three-cabinet system consisting of an inverter cabinet, a battery cabinet and a distribution cabinet. Electricians and APC started the process of cabling the cabinets and connecting incoming and outgoing power to the distribution cabinet.
At 9:00 AM, after APC completed start-up checklist procedures, power was supplied to the UPS to confirm proper operation and cabling.
At 9:15 AM, the new UPS was put in maintenance bypass mode, which put the new UPS in parallel with the temporary back-feed circuit. At this time, the breakers to the temporary back-feed circuit were opened (turned off) and the new UPS carried the full DC load. After a few minutes, the UPS was taken out of maintenance bypass mode and began normal operation, protecting electrical loads in the DC.
At 9:20 AM, the stand-by generator was shut down and incoming electrical load was transferred from the GENERATOR to CITY power.
At 9:35 AM, power to the co-location area was restored.
This marked the completion of the UPS upgrade project. Although our customers experienced a significant, unplanned service interruption, the new UPS will provide increased reliability and capacity.
The new UPS is an APC PX80 Symmetra Modular UPS. Some of the features and benefits of this system are:
- Modular design: Provides fast serviceability and reduced maintenance requirements via self-diagnosing, field-replaceable modules.
- Configurable for N+1 internal redundancy: Provides high availability through redundancy by allowing configuration with one more Power Module than is necessary to support the connected load.
- Redundant Intelligence Modules: Provides higher availability to the UPS connected loads by giving redundant communication paths to critical UPS functions.
- Hot-swappable intelligence modules: Ensures clean, uninterrupted power to protected equipment during Intelligence Module replacement.
- Hot-swappable power modules: Ensures clean, uninterrupted power to protected equipment during Power Module replacement.
- Hot-swappable batteries: Ensures clean, uninterrupted power to protected equipment while batteries are being replaced.
- Power Modules connected in parallel: Enhances availability by allowing immediate, seamless recovery from isolated module failures.
- Battery modules connected in parallel: Delivers higher availability through redundant batteries.
- Automatic internal bypass: Supplies utility power to the connected loads in the event of a UPS overload condition or fault.
- Automatic restart of loads after UPS shutdown: Automatically starts up the connected equipment upon the return of utility power.
- Power conditioning: Protects connected loads from surges, spikes, lightning, and other power disturbances.
In our configuration, the UPS is operating at 45% capacity, provides 15 minutes of battery operation (enough time for the stand-by generator to start) and provides N+3 power module redundancy.
We sincerely apologize for the interruption of service this outage caused you. We take great pride in the reliability of our infrastructure. You can be confident that we will continue to work on improving our infrastructure and procedures to ensure highly available service delivery to our valued customers.
If you have specific questions or concerns, please do not hesitate to contact me directly.
Chief Technology Officer