There have been teething issues over the past week. I’m still working out a lot of the kinks, but there was a relatively big incident last Friday. I’ll let my hosting provider give the overview of what happened, their analysis, and their corrective actions.
Earlier today, we had to perform emergency maintenance on a critical piece of power infrastructure. Our customers’ uptime is of critical importance to us and communication during these events is paramount. At this time, power has been restored and servers are back online. Listed below is a timeline of events, record of ongoing communications, SLA compensation information and a detailed outline of the steps we’re taking to prevent these issues in the future. If at any time you have any questions, please do not hesitate to call, email or chat.
Timeline of Events:
- 11:00 – During a routine check of the data center by our Maintenance staff, the slight odor of smoke was detected. We immediately began a complete investigation and located the source of the smell: a power distribution unit in Liquid Web DC3, Zone B, Section 8 covering rows 10 & 11.
- 11:05 – We discovered a manufacturer defect in the Power Distribution Unit (PDU). This defect resulted in a high resistance connection which heated up to critical levels, and threatened to seriously damage itself and surrounding equipment. This bad connection fed an electrical distribution panel which powers one row (Lansing Region, Zone B, Section 8, Row 11) of servers which is part of our Storm platform. We immediately tried to resolve the issue by tightening the connection while the equipment was still on, but it wasn’t possible. To properly resolve the situation and repair the equipment, we needed to de-energize the PDU to replace an electrical circuit breaker.
- 11:15 – To avert any additional damage, we were forced to turn off the breaker which powered servers in Lansing Region, Zone B, Section 8, Row 11. All servers were shut down at this time.
- 11:48 – Servers in Lansing Region, Zone B, Section 8, Row 10 began to be shut down.
- 11:49 – Once it was safe to begin the work, we immediately removed the failed components and replaced them with spares. We discovered that the failed connection was due to a cross-threaded screw installed at the time of manufacture. This cross-threaded screw meant the connection wasn’t tightened fully, and resulted in a loose, high-resistance connection which heated far beyond normal levels. Upon replacing the breaker, we re-energized the PDU and customer servers. Our networking and system restore teams have been working to ensure every customer comes back online as soon as possible.
- 12:52 – Power was restored and servers began to be powered back on.
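As a quick sanity check on the timeline above, the elapsed time between each shutdown and the start of power restoration can be worked out directly. This is only a sketch: the helper name `minutes_between` is made up for illustration, and because servers only *began* to power back on at 12:52, the figures are lower bounds on actual downtime.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Times taken from the timeline: Row 11 off at 11:15, Row 10 off at 11:48,
# power restored at 12:52.
row11_down = minutes_between("11:15", "12:52")
row10_down = minutes_between("11:48", "12:52")

print(f"Row 11: at least {row11_down} minutes without power")  # 97
print(f"Row 10: at least {row10_down} minutes without power")  # 64
```

Both figures fall inside the "between 1 hour and 2 hours" range the provider reports further down.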
Communication During Event
We know that in the event of an outage, communication is of critical importance. As soon as the issues were identified we provided the following updates on the Support Page and an “Event” which emails the customer as well as provides an alert within the manage.liquidweb.com interface.
Event Notice on Support Page:
“We are currently undergoing emergency maintenance on critical power infrastructure affecting a small number of Storm servers in Zone B. Work is expected to take approximately 2 hours. During this event affected instances will be powered down. We apologize for the inconvenience this will cause. An update will be provided upon completion.”
Event Notice Emailed to Customers:
“We are currently undergoing emergency maintenance on critical power infrastructure affecting 1 or more of your Storm instances. Work is expected to take approximately 2 hours. During this event affected instances will be powered down. We apologize for the inconvenience this will cause. An update will be provided upon completion.”
Liquid Web’s Service Level Agreement (SLA) provides customers the guarantee that in the event of an outage the customer will receive a credit for 10 times (1,000%) the actual amount of downtime. From our initial research into this event it appears as though most customers experienced between 1 hour and 2 hours of downtime. However, due to the disruptive nature of this event we are providing a minimum of 1 full day of SLA coverage for any servers that were affected by this event. Please contact support if you have any additional information regarding this event or if you would like to check on the status of your SLA request.
Liquid Web TOS Network SLA
Network SLA Remedy
In the event that Liquid Web does not meet this SLA, Dedicated Hosting clients will become eligible to request compensation for downtime reported by service monitoring logs. Whether or not Liquid Web is directly responsible for causing the downtime, the customer will receive a credit for 10 times (1,000%) the actual amount of downtime. This means that if your server is unreachable for 1 hour (beyond the 0.0% allowed), you will receive 10 hours of credit.
All requests for compensation must be received within 5 business days of the incident in question. The amount of compensation may not exceed the customer’s monthly recurring charge. This SLA does not apply for any month that the customer has been in breach of Liquid Web Terms of Service or if the account is in default of payment.
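The credit rules quoted above boil down to a little arithmetic, sketched below. The 10x multiplier, the one-day minimum for this event, and the cap at the monthly recurring charge come from the provider's text. Everything else is an assumption for illustration: the function names, the reading of "1 full day of SLA coverage" as a 24-hour credit floor, the 30-day month used to prorate, and the $50 monthly charge. The provider's actual billing calculation may well differ.

```python
def sla_credit_hours(downtime_hours: float,
                     multiplier: float = 10.0,
                     minimum_credit_hours: float = 0.0) -> float:
    """Credit in hours: 10x the downtime, with an optional floor."""
    return max(downtime_hours * multiplier, minimum_credit_hours)


def sla_credit_amount(credit_hours: float,
                      monthly_charge: float,
                      hours_in_month: float = 30 * 24) -> float:
    """Prorated credit, capped at the monthly recurring charge.

    ASSUMPTION: credit is prorated over a 30-day month; the SLA text
    does not say how hours are converted into money.
    """
    hourly_rate = monthly_charge / hours_in_month
    return min(credit_hours * hourly_rate, monthly_charge)


# 1.5 hours down under the standard SLA: 15 hours of credit.
standard = sla_credit_hours(1.5)

# Same outage with this event's one-day minimum (assumed to mean 24 hours).
this_event = sla_credit_hours(1.5, minimum_credit_hours=24.0)

# Hypothetical $50/month server.
print(standard, this_event, round(sla_credit_amount(this_event, 50.0), 2))
```

Under these assumptions the one-day minimum beats the standard 10x credit for any outage shorter than 2.4 hours, which covers the downtime most customers saw here.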
All PDUs will be inspected for the same issue, on all panels and all main breakers.
In this case, this PDU was just recently put into service. When we purchase critical power equipment, the manufacturer performs an onsite startup procedure. This equipment check includes a physical inspection, phase rotation, voltage checks, alarm checks and many more. This particular manufacturer defect didn’t reveal itself until the PDU was under a significant amount of load. Once the manufacturer defect began, the screw at the bus finger began to overheat. Once this overheating began, the resistance increased, causing a serious risk of catastrophic failure.
Going forward, Liquid Web will perform additional tests, above and beyond our manufacturer startup procedures, on new equipment to look for manufacturer related defects and issues. We will now perform testing at full load by utilizing a Power Load Banking System. This testing procedure was already in place for the vast majority of our power equipment but now will also include PDU specific testing.
Liquid Web performs preventative maintenance (PM) on all PDUs. This PM is an inspection that consists of current draw recording on all branch circuit breakers, infrared imaging of main connection points and on the transformers and a general inspection. This is typically a quarterly inspection.
Yeah, I can’t argue with a company that honest. Plus they go out of their way to help solve problems which technically may not even be their problem or responsibility.
Oh, and I²R losses, as always, are a pain in the ass.
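For anyone who wants numbers behind that: the heat dissipated in a connection is P = I²R, so at a fixed load current a loose joint with 100 times the resistance dumps 100 times the heat into the same small piece of metal. Here is a rough sketch. The current and resistance values are invented for illustration (the report gives no figures), and the function name is my own.

```python
def joule_heating_watts(current_amps: float, resistance_ohms: float) -> float:
    """Power dissipated as heat in a resistance: P = I^2 * R."""
    return current_amps ** 2 * resistance_ohms


# HYPOTHETICAL values, not taken from the incident report.
load_current = 100.0    # amps through the connection
tight_joint = 0.0001    # ohms, a properly torqued connection
loose_joint = 0.01      # ohms, a cross-threaded, loose connection

tight_heat = joule_heating_watts(load_current, tight_joint)
loose_heat = joule_heating_watts(load_current, loose_joint)

print(f"tight joint: {tight_heat:.1f} W")
print(f"loose joint: {loose_heat:.1f} W")
print(f"ratio: {loose_heat / tight_heat:.0f}x")
```

It also shows why the defect stayed hidden until the PDU was loaded up: heating grows with the square of the current, so a lightly loaded unit barely warms the bad joint. And since the report says the resistance rose as the joint heated, the problem feeds itself.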