We believe the outage was caused by the load balancer incorrectly marking the service as unhealthy due to overly aggressive health-check timeout and failure thresholds, which resulted in a brief period of downtime. We are addressing this by making the load balancer's failure thresholds more lenient.
EDIT: We have since determined that the service was likely first marked unhealthy because of a live migration that our hosting provider, Google Cloud, performed on a number of our servers at the time.
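As a rough illustration of why this matters (this is not our actual load balancer configuration; the function, probe sequence, and threshold values below are hypothetical), a health checker typically marks a backend unhealthy after a configured number of consecutive failed probes, so a short pause such as a live migration can trip an aggressive threshold while a more lenient one rides it out:

```python
# Illustrative sketch only, not our real health-check implementation.
# Shows how consecutive probe failures trip an "unhealthy" threshold,
# and how a more lenient threshold tolerates a brief pause.

def mark_unhealthy_at(probe_results, unhealthy_threshold):
    """Return the index of the probe at which the backend is marked
    unhealthy, or None if the consecutive-failure threshold is never hit."""
    consecutive_failures = 0
    for i, ok in enumerate(probe_results):
        consecutive_failures = 0 if ok else consecutive_failures + 1
        if consecutive_failures >= unhealthy_threshold:
            return i
    return None

# A backend that pauses briefly (three failed probes in a row), e.g. during
# a live migration, then recovers. Values here are purely hypothetical.
probes = [True, True, False, False, False, True, True, True]

print(mark_unhealthy_at(probes, unhealthy_threshold=2))  # 3 -> marked unhealthy
print(mark_unhealthy_at(probes, unhealthy_threshold=4))  # None -> rides out the pause
```

Relaxing the probe timeout has a similar effect, since slow responses during the pause are less likely to be counted as failures in the first place.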
Posted Sep 11, 2020 - 16:14 UTC
Monitoring
The email service appears to be back up, as all alerts have resolved. We believe the likely cause was in the networking and load-balancing layers, since the servers themselves remained operational throughout the incident. However, we are continuing to investigate what happened.
Posted Sep 11, 2020 - 15:51 UTC
Investigating
We are investigating issues with the email service that powers on-open email recommendations, following alerts we received at 8:44 AM Pacific Time (15:44 UTC). The issues may affect recommendations for newly opened emails. We will post more details once we have them.