HEY is down
Incident Report for 37signals
Postmortem

What

The HEY app was unavailable because AWS was returning incorrect DNS records. Mail was unaffected.

Notified By

Pingdom, Nagios

Impact

The app at app.hey.com was unavailable from 21:26 until 21:39 UTC on November 2, 2022. Some users reported lingering DNS issues for about 10 minutes after most traffic was restored. Mail was unaffected.

Remediation

We use failover records to serve our friendly error pages from IP addresses in the datacenter when the HEY app is down. At the beginning of this incident, AWS was erroneously returning these failover records even though the app was up. We removed the failover record and manually set DNS to use an A record with the correct IP addresses for the app. However, for a brief period between those two changes there was no record at all, which caused AWS to return an empty reply for app.hey.com. Resolvers should have cached that empty reply only until the negative TTL expired. The negative TTL is set to 60 seconds, but some users saw empty responses for longer than that.
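To illustrate why a short gap with no record can outlast the gap itself, here is a minimal sketch of negative caching. It assumes a resolver that honors TTLs exactly, following the RFC 2308 rule that an empty answer is cached for the smaller of the SOA record's TTL and its MINIMUM field. The class name, the 900-second SOA TTL, the 300-second positive TTL, and the 203.0.113.10 address are hypothetical; only the 60-second negative TTL comes from this report.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


def negative_ttl(soa_ttl: int, soa_minimum: int) -> int:
    """RFC 2308: cache negative answers for min(SOA TTL, SOA MINIMUM)."""
    return min(soa_ttl, soa_minimum)


@dataclass
class CachingResolver:
    """Toy resolver (hypothetical) that caches positive and empty answers."""
    neg_ttl: int
    pos_ttl: int
    # name -> (expiry time in seconds, cached answer)
    _cache: Dict[str, Tuple[float, List[str]]] = field(default_factory=dict)

    def resolve(self, name: str, now: float,
                authoritative: Dict[str, List[str]]) -> List[str]:
        cached = self._cache.get(name)
        if cached is not None and now < cached[0]:
            return cached[1]  # still fresh, even if the zone has changed
        answer = authoritative.get(name, [])
        ttl = self.pos_ttl if answer else self.neg_ttl
        self._cache[name] = (now + ttl, answer)
        return answer


# Hypothetical timeline: the record is missing at t=0 and recreated at t=10.
zone: Dict[str, List[str]] = {}
resolver = CachingResolver(neg_ttl=negative_ttl(900, 60), pos_ttl=300)

during_gap = resolver.resolve("app.hey.com", now=5, authoritative=zone)
zone["app.hey.com"] = ["203.0.113.10"]  # record restored at t=10
after_fix = resolver.resolve("app.hey.com", now=30, authoritative=zone)
after_expiry = resolver.resolve("app.hey.com", now=65, authoritative=zone)
```

In this sketch, a client that asked during the gap keeps getting the empty answer at t=30, twenty seconds after the record came back, and only sees the address once the 60-second negative cache entry expires. The sketch models only a resolver that honors the TTL exactly; it does not attempt to explain the longer-lived empty responses some users saw.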

Follow-up Actions

  • We are going to work with our AWS Account Team to understand how and why the failover record was tripped erroneously.
  • We are going to research why the negative TTL apparently did not function as configured.
  • We are still pointed directly at AWS' IP addresses and will need to restore the failover records.
  • We're going to work to improve our public communication response when Hey is down.
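Restoring the failover records means swapping record sets again, which is where the window with no record opened during this incident. As an illustration only, and not a description of the change that was made or is planned, the sketch below builds a single change batch that deletes one record set and creates its replacement together. It assumes the zone is hosted in Amazon Route 53, whose API documents change batches as transactional (all changes in a batch are applied or none are). The helper names, IP addresses, TTLs, and health check ID are hypothetical, and a real DELETE must match the existing record set exactly.

```python
from typing import Any, Dict, List


def simple_a_record(name: str, ips: List[str], ttl: int) -> Dict[str, Any]:
    """Plain A record set in the Route 53 ResourceRecordSet shape."""
    return {
        "Name": name,
        "Type": "A",
        "TTL": ttl,
        "ResourceRecords": [{"Value": ip} for ip in ips],
    }


def failover_a_record(name: str, ips: List[str], ttl: int, role: str,
                      set_id: str, health_check_id: str = "") -> Dict[str, Any]:
    """Failover A record set; role is "PRIMARY" or "SECONDARY"."""
    record = simple_a_record(name, ips, ttl)
    record["SetIdentifier"] = set_id
    record["Failover"] = role
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return record


def swap_batch(delete_sets: List[Dict[str, Any]],
               create_sets: List[Dict[str, Any]]) -> Dict[str, Any]:
    """One batch holding every DELETE and CREATE, so they apply together."""
    changes = [{"Action": "DELETE", "ResourceRecordSet": rs} for rs in delete_sets]
    changes += [{"Action": "CREATE", "ResourceRecordSet": rs} for rs in create_sets]
    return {"Comment": "Swap record sets in one request", "Changes": changes}


# Hypothetical values throughout (documentation-range IP addresses).
current = simple_a_record("app.hey.com.", ["203.0.113.10"], ttl=60)
primary = failover_a_record("app.hey.com.", ["203.0.113.10"], 60,
                            "PRIMARY", "app-primary", "hypothetical-check-id")
secondary = failover_a_record("app.hey.com.", ["192.0.2.20"], 60,
                              "SECONDARY", "error-pages")
batch = swap_batch([current], [primary, secondary])
```

With boto3, a batch like this would be passed as the `ChangeBatch` argument of the Route 53 client's `change_resource_record_sets` call, alongside the `HostedZoneId`. Submitting the delete and the create as two separate requests is what leaves a moment in which the name has no record.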

Once again, sorry for the brief hiccup! Our uptime over the last three months is 99.99%, but even a few minutes of downtime is a big deal when it comes to email.

Posted Nov 03, 2022 - 14:36 UTC

Resolved
HEY is back up! We'll continue to monitor it to make sure it's stable. So sorry for the interruption today, folks.
Posted Nov 02, 2022 - 21:38 UTC
Investigating
HEY is currently down. We're investigating the cause and will update you as soon as we can. Sorry for the interruption, everyone.
Posted Nov 02, 2022 - 21:37 UTC
This incident affected: HEY.