We have deployed code updates that turn off the parts of the codebase that were causing trouble. We have also added several code robustness improvements, better alerting, and better playbooks for alert response. We expect that incidents with the same cause will no longer occur, and that other incidents with similar symptoms will be mitigated much more quickly.
We are still diagnosing the exact mechanism by which the problem occurred (so that we can reactivate the parts of the codebase we've turned off). We're also preparing an internal postmortem. We will share further details regarding timeline and impact once the investigation and internal postmortem are completed.
Posted Aug 23, 2019 - 17:40 UTC
Monitoring
The inventory API servers had issues between August 21, 2019, 10:45 PM PDT and 11:45 PM PDT (August 22, 2019, 5:45 UTC to 6:45 UTC). A large fraction of inventory insertion, GET, and DELETE requests timed out or returned error codes during this period. We restored normal operation by scaling up capacity, and the servers have been stable since 11:45 PM PDT.
Customers who received failures or timeouts on inventory API operations during this period would see their requests succeed if they retried after 11:45 PM PDT. We are still reviewing what happened and will update with more details later.
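For clients that want to automate such retries, the following is a minimal sketch of retrying with exponential backoff. It is an illustration under stated assumptions, not part of our API or client libraries: `call_with_retry` and `TransientError` are hypothetical names, and the caller is assumed to wrap its own inventory API request in `operation` and raise `TransientError` on a timeout or error code. Whether retrying an insertion is safe depends on how duplicates are handled, which is not covered here.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for a timeout or error code worth retrying."""


def call_with_retry(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on TransientError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last failure
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between tries.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The `sleep` parameter exists only so the wait can be replaced in tests; in normal use the default `time.sleep` applies, and `max_attempts` and `base_delay` should be tuned to the caller's own latency budget.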
NOTE: This degraded performance only affected inventory insertion, GET, and DELETE operations attempted via the API during this time period. Affected requests either timed out or received an error code. Customers who do not use the inventory API, or who were not using it during this time period, are unaffected.
Posted Aug 22, 2019 - 13:10 UTC
Investigating
We are experiencing degraded performance on the servers that handle inventory API operations on the api.petametrics.com domain.
The problems appear to have started a little before 11 PM PDT.
We will post more details as we learn them.
NOTE: This degraded performance only affects inventory insertion, GET, and DELETE operations performed via the API. It has no direct effect on the model-servers that serve queries. It therefore only affects users who use our inventory API, and specifically those who used it during that time period.