Datastore issues in US West region causing degraded recommendation quality

Incident Report for LiftIgniter

Resolved

Data restoration is now complete and all metrics are back to normal. We will continue to keep an eye out for any data inconsistencies created by the restore process, but nothing seems off as of now.

Posted Nov 05, 2020 - 13:50 UTC

Identified

This is similar to http://status.liftigniter.com/incidents/1z9fqwpckkyk

We experienced a hardware issue affecting multiple nodes of our datastore in US West [EDIT: Our hosting provider, Google Cloud, believes that this was actually a software issue with their virtualization software, and not a real hardware issue]. Capacity has been restored (it was limited between 8:30 UTC and 9:41 UTC). During the period of limited capacity, we experienced increased latency and substantially degraded recommendation quality.

Now that capacity has been restored, the ongoing challenge is that, due to the large number of node failures, the datastore system in US West does not have all the data that it should have, causing degraded recommendation quality in some cases. We are working to restore from backups.

Posted Nov 04, 2020 - 09:55 UTC