Cooling System Incident at Data Centre
Incident Report for Panthur Hosting
Postmortem

Introduction

On Friday, November 16, 2018, controlled emergency shutdown procedures were executed when temperatures at the Pyrmont Data Centre exceeded known acceptable thresholds due to a cooling system failure.

No customer data was lost and all hardware was protected from thermal failure thanks to this timely response. Once the cooling system was restored, we were able to bring all customer services back online.

We have received a full post-incident review (PIR) from the data centre, including remedial action and confirmation of upgrades and testing to mitigate the risk of a similar incident recurring.

Summary

At 12:32pm, our monitoring systems indicated that temperatures in the Pyrmont data centre had started to increase. Our system administrators conducted an immediate audit of all servers in the facility to determine if this was isolated to some areas, or was a facility-wide event. It was clear that this was a facility-related event and we immediately contacted our data centre provider for more information.

At 12:44pm we received confirmation that the facility had seamlessly moved to UPS power after an area-wide Ausgrid power outage at 12:22pm; however, there was a cooling issue that they were actively working on.

As temperatures continued to rise, the decision was made at 1:02pm to execute emergency shutdown procedures to ensure data integrity across our platforms. This involved a graceful shutdown of all services.

At 1:31pm our engineers on-site noted that cooling systems were once again functioning.

At 1:42pm our monitoring systems indicated that temperatures in the data centre facility were approaching normal levels. Our engineers monitored the situation closely to ensure thermal stability prior to powering servers on.

At 1:45pm our engineers started restoring power to servers. Customer services started coming back online.

The majority of services were restored by 2:15pm. All remaining services were restored by 3:14pm.

Root Cause

The root cause of the incident was an extended cooling system interruption at the data centre in Pyrmont.

Corrective and Preventative Measures

We have followed up with the data centre and they have advised that independent engineers have determined the remedial action: replacing some components in the cooling system and installing an override facility on the auxiliary emergency heat extraction system. The faulty components have been replaced and a number of full failover tests have been successfully completed without incident.

Posted Dec 03, 2018 - 11:37 AEDT

Resolved
All services have remained stable.
Our engineers will continue to monitor the situation closely.
A post-incident review will be conducted and a postmortem will be provided.
Posted Nov 19, 2018 - 11:37 AEDT
Monitoring
All services have been restored.
Our engineers will continue to monitor the situation closely.
A post-incident review will be conducted and a postmortem will be provided.
Posted Nov 16, 2018 - 15:14 AEDT
Update
Our engineers are proceeding to restore services and are continuing to monitor the thermal stability of the environment. An update on the situation will be provided in 30 minutes or if there is a significant change in the situation.
Posted Nov 16, 2018 - 14:00 AEDT
Update
The thermal issue at the data centre has been rectified. Our engineers are monitoring to ensure thermal stability before they begin to restore services. An update on the situation will be provided in 10 minutes.
Posted Nov 16, 2018 - 13:44 AEDT
Identified
Our Pyrmont Data Centre is currently experiencing a cooling system incident. Some customer services are being shut down as a precaution to avoid data loss.
Our teams are working to restore all services as quickly as possible. Your patience is appreciated.

As further information becomes available, we will provide an update here.
Posted Nov 16, 2018 - 13:07 AEDT
This incident affected: Economy & Business Web Hosting, Reseller Hosting, Stealth Web Hosting, Technical Support (Tickets), and Technical Support (Phone).