On Friday, November 16, 2018, controlled emergency shutdown procedures were executed when temperatures at the Pyrmont Data Centre exceeded known acceptable thresholds due to a cooling system failure.
No customer data was lost and all hardware was protected from thermal failure due to this timely response. Once the cooling system was restored, we were able to bring all customer services back online.
We have received a full PIR (post incident review) from the data centre, including remedial action and confirmation of upgrades and testing, to ensure that the risk of a similar incident recurring has been satisfactorily mitigated.
At 12:32pm, our monitoring systems indicated that temperatures in the Pyrmont data centre had started to increase. Our system administrators conducted an immediate audit of all servers in the facility to determine whether this was isolated to some areas or was a facility-wide event. It was clear that this was a facility-related event and we immediately contacted our data centre provider for more information.
At 12:44pm we received confirmation that the facility had seamlessly moved to UPS power after an area-wide Ausgrid power outage at 12:22pm; however, there was a cooling issue that they were actively working on.
As temperatures continued to rise, the decision was made at 1:02pm to execute emergency shutdown procedures to ensure data integrity across our platforms. This involved a graceful shutdown of all services.
At 1:31pm our engineers on-site noted that cooling systems were once again functioning.
At 1:42pm our monitoring systems indicated that temperatures in the data centre facility were approaching normal levels. Our engineers monitored the situation closely to ensure thermal stability prior to powering servers on.
At 1:45pm our engineers started restoring power to servers. Customer services started coming back online.
The majority of services were restored by 2:15pm. All remaining services were restored by 3:14pm.
The root cause of the incident was an extended cooling system interruption at the data centre in Pyrmont.
We have followed up with the data centre and they have advised that independent engineers have determined the remedial action required: replacing some components in the cooling system and installing an override facility on the auxiliary emergency heat extraction system. The faulty components have been replaced and a number of full failover tests have been successfully completed without incident.