A one-off backend job overloaded some of our backend data servers, causing the associated processes to crash. We promptly restarted the affected processes and they came back online. While these backend data servers were down, our front-end servers continued to serve requests from intermediate layers of caching, and most end users continued to receive recommendation results. After bringing the data servers back online, we resumed the one-off job with updated settings that did not put production infrastructure at risk.
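To illustrate the read path described above, here is a minimal sketch, with hypothetical names, of how a front-end server can keep answering from the intermediate cache while the backend data servers are unavailable; it is not our actual implementation.

```python
from typing import Optional

def fetch_recommendations(query: str, cache, data_server) -> Optional[list]:
    """Serve from the intermediate cache first, falling back to the data servers."""
    cached = cache.get(query)
    if cached is not None:
        # Cache hit: still served while the data servers are down,
        # at the cost of missing very recent catalog/ML updates.
        return cached
    try:
        results = data_server.lookup(query)
    except ConnectionError:
        # Data server unavailable and nothing cached: infrequently accessed
        # data cannot be looked up, so fewer (or lower-quality) results.
        return None
    cache.set(query, results)
    return results
```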
While the affected processes were down, the following issues may have been observed:
* Recent updates (to the catalog and to ML data) would not be pulled into recommendations.
* Infrequently accessed data that was not cached could not be looked up, which could cause issues such as not returning enough results (particularly for queries with restrictive rules) or not returning the best results.
We had two periods where users may have noticed issues:
* US East from 2025-02-11 10:59 PM PT (2025-02-12 06:59 UTC) to 2025-02-11 11:26 PM PT (2025-02-12 07:26 UTC), mostly affecting traffic from the Eastern United States and Canada, Europe, and South America
* US West from 2025-02-11 11:38 PM PT (2025-02-12 07:38 UTC) to 2025-02-11 11:59 PM PT (2025-02-12 07:59 UTC), mostly affecting traffic from the Western United States and Canada, Asia, and Australia
Our redundant architecture limited the impact visible to end users. We are reviewing the settings, configurations, and safeguards used for one-off backend jobs to reduce the risk of similar incidents occurring in the future.
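As a sketch of the kind of safeguard under consideration (the names and limits below are illustrative, not a description of our actual tooling), a one-off job can be throttled so it cannot overwhelm the data servers, regardless of how large the job is:

```python
import time

MAX_REQUESTS_PER_SECOND = 50  # illustrative cap, set well below production capacity

def run_one_off_job(items, process_item):
    """Process items at a bounded rate instead of as fast as possible."""
    min_interval = 1.0 / MAX_REQUESTS_PER_SECOND
    for item in items:
        start = time.monotonic()
        process_item(item)
        elapsed = time.monotonic() - start
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)  # pause to stay under the cap
```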