Datastore issues in US East region causing degraded recommendation quality

Incident Report for LiftIgniter

Resolved

Data restoration is now complete, and all metrics are back to normal. The backup we restored from predates the point at which data was lost, so some data written in between may be missing. We are closely monitoring performance for any data inconsistencies.

Posted Nov 05, 2020 - 13:51 UTC

Update

Capacity has been restored in the US East region, and latency and other performance metrics are close to normal. However, the data available to the recommendation engine for queries in the US East region is still less complete than it should be, which degrades recommendation quality in some cases. We are working to restore the missing data from backups.

Posted Nov 03, 2020 - 19:17 UTC

Identified

We are currently experiencing hardware issues that have affected multiple nodes of our datastore in the US East region. These issues are making some data unavailable to our recommendation engine when it calculates recommendations, which degrades recommendation quality, and may also increase latency when returning recommendations.

We are working to restore capacity and will then restore lost data from backups.

Our system in the US West region is not affected at this time.

We will share more details as they become available.

Posted Nov 03, 2020 - 18:57 UTC