Capacity has returned to normal and all services are operating normally. We have identified improvements that will make our systems more robust to similar issues.
Posted Apr 09, 2022 - 00:59 UTC
Monitoring
All of our services are working normally again. We are still waiting for the underlying capacity issues to be fixed, and we will review our setup to see how we can reduce the impact of such incidents in the future.
Posted Apr 08, 2022 - 15:35 UTC
Update
As of 15:04 UTC, our email-rendering services are back online, so all of our front-facing services are now working properly.
We have identified that the capacity issue is affecting one of our backend services, which manages user histories, and we are continuing to investigate.
Posted Apr 08, 2022 - 15:15 UTC
Identified
Due to capacity issues at our cloud provider (Google Cloud) in US East, some of our services have been experiencing problems.
Our query endpoint (query.petametrics.com), which serves recommendations, saw its error rate (HTTP 503 responses) rise above 1%, briefly peaking at 2.6%. The error rate was nonzero between 14:23 and 14:38 UTC and returned to zero after we provisioned alternate capacity. Successful requests also saw increased latency during this period.
We are currently investigating the impact on some of our other services, including a service used for rendering emails.