Google Cloud load balancer issues causing errors for end users; workarounds in progress for core services

Incident Report for LiftIgniter

Resolved

Google has confirmed full resolution at https://status.cloud.google.com/incidents/6PM5mNd43NbMqjCZ5REh and we have reverted to our normal settings. All metrics have returned to normal ranges.
Posted Nov 16, 2021 - 22:30 UTC

Monitoring

Around 18:10 UTC (10:10 AM Pacific Time), we started seeing the endpoints working again. We paused our mitigation steps but are continuing to monitor before reverting them.

Google Cloud has posted an incident report at https://status.cloud.google.com/incidents/6PM5mNd43NbMqjCZ5REh, but it does not yet confirm resolution.

The incident has also been reported in the media; see https://www.theverge.com/2021/11/16/22785599/google-cloud-outage-spotify-discord-snapchat-google-cloud for instance.
Posted Nov 16, 2021 - 18:21 UTC

Identified

All our domains that route through Google Cloud's global public load balancer are returning 404 errors. This appears to be due to issues with Google Cloud; as of this writing, they do not have an incident page but are reporting degraded service on their status page (https://status.cloud.google.com/).

We have implemented the mitigation procedures for our recommendation engine endpoints that we originally wrote after a 2018 incident (http://status.liftigniter.com/incidents/7kz9f7w1z8jg), and we expect them to mitigate the bulk of the impact even if Google Cloud takes time to resolve its issues. However, some of our other services, including our inventory API, user API, email recommendations, and console, do not have a similar mitigation process in place, so recovery for them must wait until Google Cloud fixes the issue.
Posted Nov 16, 2021 - 18:08 UTC