Backend datastore node having issues, cold-restarting. Affects inventory API server and recommendation quality

Incident Report for LiftIgniter

Resolved

All nodes have been moved to the new, safer, more minimal instance template, and data has been restored. We have confirmed that the specific issue that caused the node crashes is no longer occurring. We have also put together an improved set of best practices for both peacetime capacity changes and emergency responses, to minimize data loss.

Our cloud provider has also opened a public issue tracker entry for the underlying issue: https://issuetracker.google.com/issues/111753610

Posted Jul 21, 2018 - 22:37 UTC

Update

We are continuing to monitor for any further issues.

Posted Jul 14, 2018 - 01:56 UTC

Monitoring

We have confirmed that there is no data loss, and the services are running fine; however, we lack our usual storage capacity buffer right now.

Our cloud provider has confirmed that ongoing issues on their side caused our problems and that an investigation is underway. In the meantime, they have given us guidance on working around the issue so that we can reprovision capacity. We are working on that reprovisioning and will mark the issue resolved when it is complete.

Posted Jul 14, 2018 - 01:56 UTC

Update

The problems turned out to be more serious than expected, with additional nodes affected; we are contacting our cloud provider for further diagnosis and resolution.

For now, the bad nodes have been removed; because the data is replicated, we expect data loss to be minimal and expect to recover the data through our standard recovery procedure.

Posted Jul 13, 2018 - 21:02 UTC

Identified

Starting at 10:20 AM PDT (17:20 UTC) on Friday, July 13, one of our datastore nodes in one region started misbehaving. We were notified of the issue and began addressing it within ten minutes. The node is currently doing a cold restart, and we expect it to be back up by 11:45 AM PDT (18:45 UTC).

No data has been lost; however, until the node comes fully back up, some data will appear to be missing or unavailable. This affects the following services:

- Inventory API errors: Customers using the GET, POST, and DELETE endpoints of our inventory API will see elevated error rates, and their intended actions may not complete.
- Degraded recommendation quality: We'll continue to return recommendations; there is no effect on the overall error rates of our model servers. However, the quality of the recommendations will be degraded and latency will be higher because of the difficulty of retrieving all the data needed to make a great recommendation.

Posted Jul 13, 2018 - 17:57 UTC