Data center power outage
Incident Report for Exposure
Postmortem

Data center power outage postmortem

On Thursday, May 24th 2018, at around 6:00 a.m. ET, an unexpected power failure occurred at the Hetzner data center in Falkenstein, Germany. This caused the Exposure platform to go offline for about 28 hours, until service resumed on Friday, May 25th at around 10 a.m. ET. All stories and platform features became available at that time.

No data was lost during this outage, as Exposure is continuously backed up to a secure location outside of the data center.

This outage resulted from a severe voltage drop in the local power supply during a heavy storm, which caused the data center's main and backup power supplies to fail. Hetzner technicians worked to replace the necessary power supplies and, in time, were able to bring the data center back online.

During this outage, we were continually updating our status at http://status.exposure.co/ and https://twitter.com/exposure. We sent a total of three emails to active members to inform them of the status of the issue.

What steps are we taking to prevent this in the future?

It is extremely rare for such a large data center to experience an issue of this degree. We understand the severity of this outage and apologize that we were unable to rectify the issue more quickly. We assure you that we are taking steps to better protect your experience moving forward.

In the coming week you will receive information about a scheduled maintenance period. During this time, we will be moving Exposure to a cloud-based server on Amazon EC2. This move will provide greater redundancy and make rare outages like this one even less likely.

This move will be seamless for you and your stories. No updates or action are required on your part.

*How we are updating our outage protocol*

Once the server move is complete, our outage protocol will act much faster to bring new servers online in the event of a power failure. Better communication workflows have already been implemented in the event of future outages. Please subscribe to our status page to receive updates in real time: http://status.exposure.co/.


*Let us know if you have any questions*

Don’t hesitate to email support@exposure.co with any questions, and again, we apologize for the outage.

Luke Beard
CEO & Founder of Exposure

Posted May 27, 2018 - 16:37 EDT

Resolved
Exposure is now back online and fully operational! A full report on the cause and the steps we are taking to avoid this in the future will be published in the coming few days. Thanks for sticking with us.

Please email support@exposure.co with any questions.

Happy storytelling!
Posted May 25, 2018 - 11:08 EDT
Update
We are currently in a queue to have our hardware restarted now that the power has been restored. Please stand by.
Posted May 25, 2018 - 09:38 EDT
Update
Good news! Technicians have managed to restore secure operation in DC12. The UPS is fully operational and Exposure should be coming back online very shortly.
Posted May 25, 2018 - 07:17 EDT
Update
Good morning everyone. Communication with the Data Center provider (Hetzner) has been frustratingly slow and we don't have an ETA to resume service right now. This has been a major failure on their part.

A rebuild of our server infrastructure is very much in progress but that does take time due to the complexity of our content. More news on that as we have it. Please email support@exposure.co with any questions.

We are doing everything we can to get service back online.
Posted May 25, 2018 - 05:18 EDT
Update
The data center is still having major power issues despite progress made by Hetzner. We wish we had more information at this time. Please email any questions to support@exposure.co.
Posted May 24, 2018 - 20:03 EDT
Update
Still monitoring the situation. Updates will be shared here as we get them.
Posted May 24, 2018 - 18:36 EDT
Update
We are continuing to monitor the outage. For more detailed info, please visit the status page of our Data Center provider, Hetzner: https://www.hetzner-status.de/en.html.

In the coming weeks Exposure will be moved to a more reliable setup on Amazon EC2 to prevent future outages.

We will share new information as we get it.
Posted May 24, 2018 - 17:07 EDT
Update
Bad news for the time being. The power issue is more severe than the Data Center technicians thought, so we are waiting to hear on next steps.
Posted May 24, 2018 - 15:25 EDT
Update
Still working on the hardware issue! We are trying to mitigate traffic to a temp version of published stories. More info on that shortly. Apologies for the length of this outage.
Posted May 24, 2018 - 14:19 EDT
Update
Sorry for the delay. Still working through the hardware issues. Next update in one hour.
Posted May 24, 2018 - 12:35 EDT
Update
Exposure's hardware is currently in the Data Center technicians' queue to be restarted. This is good news for a full service restore.
Posted May 24, 2018 - 11:18 EDT
Update
The backup power supply for the Data Center is currently being replaced to compensate for the issue. More updates as we get them.
Posted May 24, 2018 - 10:49 EDT
Update
Still working through the issues with our hardware. Please drop support@exposure.co an email with any questions and stand by for more updates.
Posted May 24, 2018 - 10:20 EDT
Update
Still working on this with our Data Center team. Service is returning slowly at this point. Apologies for the interruption.
Posted May 24, 2018 - 09:09 EDT
Update
Still working on this. More updates as soon as we get them.
Posted May 24, 2018 - 08:26 EDT
Monitoring
The data center that serves Exposure experienced a power outage around 6:30am ET that made the site inaccessible and is currently taking some time to reboot. We are monitoring the situation closely and doing everything we can to get Exposure back online.
Posted May 24, 2018 - 08:16 EDT
This incident affected: Exposure Member Sites and Posts Platform, Story Email Notifications, Member Custom Domains, Member Statistics, Content delivery network (CDN), Content storage, Image processing API service, Image processing content delivery network (CDN), Image processing rendering infrastructure, Billing API, and Billing webhooks.