Issues with services in US East due to capacity issues with cloud provider

Incident Report for LiftIgniter

Resolved

Capacity is back to normal and all configurations have been returned to their defaults.
Posted Apr 28, 2022 - 21:33 UTC

Monitoring

Because of capacity issues at our cloud provider (Google Cloud) in US East, some of our services have been experiencing issues.

Our query endpoint (query.petametrics.com), which serves recommendations, saw its error rate (503 status) rise to about 1%. Error rates were nonzero between 18:00 and 18:04 UTC. We had already started provisioning alternate capacity before the error rate increased, but some errors still occurred because provisioning took a few minutes. Successful requests also saw increased latency from 17:51 to 18:11 UTC.

We also provisioned alternate capacity for a few other affected services; these services had a few minutes of downtime while the alternate capacity was coming online.

We benefited significantly from the preparation we did after the previous incident: http://status.liftigniter.com/incidents/1522vrjxbmcp
Posted Apr 28, 2022 - 18:21 UTC