We have verified that all services are operating normally, and we have also updated our internal documentation to streamline the recovery process if a similar issue occurs in the future. We are marking this incident as resolved.
Posted Jul 17, 2018 - 22:10 UTC
Monitoring
We have verified that services are now working normally; a few users with cached DNS records may continue to see issues until 22:00 UTC on July 17, but everything should be back to normal after that.
Some additional issues arose from edge cases in the fallback methods we used and from manually switching routes, but all of them have been resolved.
We will be assembling internal documentation on the route-switching and recovery process so that we can avoid these complications and respond more quickly in the future.
Posted Jul 17, 2018 - 21:25 UTC
Update
After Google reported that they had resolved their problem at https://status.cloud.google.com/incident/cloud-networking/18012, we switched back to the Google Cloud public load balancer. We are seeing better performance in most regions but are still seeing some issues in Australia, which we are investigating.
Posted Jul 17, 2018 - 20:58 UTC
Update
We have noticed a resurgence of high client timeout rates in Europe, but everything seems normal elsewhere. We are continuing to investigate.
Posted Jul 17, 2018 - 20:20 UTC
Update
Here are some more details on the problem and the fixes we are making.
The problem: Google Cloud's public load balancer is having networking issues. We use the public load balancer for query.petametrics.com, spi.petametrics.com, console.liftigniter.com, and our other services.
Our fixes:
(1) We have updated query.petametrics.com to point directly to our Nginx servers in various regions via Route 53. However, the old record has a 3-hour DNS cache, so users with a cached record may continue to see issues; we recommend that those users bust their DNS cache.
(2) We have also updated query1.petametrics.com to point directly to our Nginx servers in various regions via Route 53. This record has a 1-minute TTL, so the change should be effective almost immediately. As a result, even for users who have a bad cached record for query.petametrics.com, JavaScript model queries will automatically retry with query1.petametrics.com.
(3) We are pushing an update to our browser client (our JavaScript) to use query1.petametrics.com as the primary query and activity server, which gets around the 3-hour DNS caching limit (the JavaScript cache-busts at the turn of each hour).
With the three fixes in place, impact on JavaScript customers should be effectively nullified.
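For illustration, here is a minimal sketch of the client-side behavior described in fixes (2) and (3): try the primary host, fall back to query1.petametrics.com on failure, and append a value that changes at the turn of each hour. The /v1/query path, request shape, and helper names are placeholders, not our actual client code.

```javascript
// Hypothetical sketch of the fallback behavior in fixes (2) and (3).
const PRIMARY_HOST = "query.petametrics.com";
const FALLBACK_HOST = "query1.petametrics.com";

// Cache-busting value that changes at the turn of each hour.
function hourlyCacheBuster() {
  return Math.floor(Date.now() / 3600000); // hours since the Unix epoch
}

async function sendQuery(body) {
  for (const host of [PRIMARY_HOST, FALLBACK_HOST]) {
    try {
      const response = await fetch(
        `https://${host}/v1/query?cb=${hourlyCacheBuster()}`,
        {
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify(body),
        }
      );
      if (response.ok) return response.json();
    } catch (err) {
      // DNS or network failure on the primary host: fall through to the fallback.
    }
  }
  throw new Error("Both query hosts are unreachable");
}
```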
For API customers:
(a) To make query.petametrics.com work, you may need to force DNS cache busting on your end.
(b) api.petametrics.com might still have issues; unfortunately, we do not have a good setup to point it directly to our servers.
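For API customers who want to confirm whether their resolver still holds a stale record, here is a minimal Node.js sketch that compares the locally cached answer with a fresh lookup; the use of 8.8.8.8 as the external resolver is only an example.

```javascript
// Hypothetical Node.js sketch for checking whether your DNS cache still holds
// the old load-balancer record for query.petametrics.com.
const dns = require("node:dns").promises;

async function compareDnsAnswers(hostname) {
  // What your OS / local resolver currently returns (possibly a cached record).
  const cached = await dns.lookup(hostname, { all: true });

  // What an external resolver returns right now (the updated Route 53 record).
  const resolver = new dns.Resolver();
  resolver.setServers(["8.8.8.8"]);
  const fresh = await resolver.resolve4(hostname);

  console.log("local resolver:", cached.map((a) => a.address));
  console.log("fresh lookup:  ", fresh);
  // If the two sets differ, flush your DNS cache (or wait for the TTL to expire).
}

compareDnsAnswers("query.petametrics.com").catch(console.error);
```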
Posted Jul 17, 2018 - 19:54 UTC
Update
We are continuing to work on a fix for this issue.
In the meantime, we are doing some rapid rerouting to minimize the impact of this issue. Unfortunately, not everything will fully return to normal in the process due to DNS caching, but the majority of users should still be able to access recommendations and send activities.
Posted Jul 17, 2018 - 19:41 UTC
Identified
We have identified that our model-server, api-fe, and email services were returning 502 errors in US West. We believe this is a problem at our cloud provider's level, because our dedicated regional endpoints are working.
Initially, both the plain HTTP and HTTPS endpoints were down; the plain HTTP endpoint is back up, but the HTTPS endpoint continues to be down.
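For reference, here is a minimal sketch of the kind of check that distinguishes the two cases; the /healthz path is a placeholder, not a documented endpoint.

```javascript
// Hypothetical sketch comparing the plain HTTP and HTTPS endpoints.
const http = require("node:http");
const https = require("node:https");

function checkStatus(client, url) {
  return new Promise((resolve) => {
    const req = client.get(url, (res) => {
      res.resume(); // discard the body; only the status code matters here
      resolve(`${url} -> ${res.statusCode}`);
    });
    req.on("error", (err) => resolve(`${url} -> ${err.code || err.message}`));
    req.setTimeout(5000, () => {
      req.destroy();
      resolve(`${url} -> timeout`);
    });
  });
}

(async () => {
  console.log(await checkStatus(http, "http://query.petametrics.com/healthz"));
  console.log(await checkStatus(https, "https://query.petametrics.com/healthz"));
})();
```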