Significant increase in 500 internal server errors in recommendations

Incident Report for LiftIgniter

Postmortem

On Thursday, August 6, 2020, at 2:15 PM Pacific Time (21:15 UTC) we deployed an update to one of our jobs responsible for making real-time machine learning updates. The job had a bug that caused a (reversible) corruption in machine learning data. The data is used by the servers serving LiftIgniter's recommendations (known as the model-servers). The corruption in the data resulted in internal server errors (status code 500) being returned by the model-servers on some queries.

Only queries that used the corrupted data in a particular way were affected.

The errors were discovered by our engineers through an alert. After determining that the bad code deployment to the job responsible for real-time machine learning updates was responsible, we reverted the deployment for that job. The errors died out within a few minutes of the revert.

Event timeline

  1. On Thursday, August 6, 2020, at 2:11 PM (21:11 UTC) we began a deployment to one of our jobs responsible for real-time machine learning. The deployment finished rollout at 2:15 PM (21:15 UTC).
  2. Starting 2:19 PM (21:19 UTC), corruption in the machine learning data started causing the model-servers to return internal server errors with status code 500. The initial error rate was 0.023%, and in the next few minutes (till about 2:33 PM) the error rate rose to 0.5%.
  3. At 2:26 PM (21:26 UTC), after a little over 6 minutes of 500 internal server errors, an alert triggered, notifying our engineers. Since the error was on the model-servers, and the deployment had been to a different component, we were not initially sure that the deployment was the cause, but it was a leading hypothesis.
  4. Starting 2:34 PM (21:34 UTC), the error rate started to increase more sharply. Between 2:41 PM and 2:49 PM, the error rate was in the range of 3% to 3.5%. Error rates varied across customers; at the high end, the error rate for one customer reached 17%.
  5. By 2:50 PM (21:50 UTC), the revert of the bad deploy had been completed, and the error rate started going down. By 2:54 PM, the error rate was down to zero.
  6. Following the immediate mitigation, we continued investigating the mechanism of the problem and began working on robustification of the model-server. After extensive testing and peer review, the robustifications were rolled out to production on Friday, August 7, 2020, starting 6:26 PM Pacific Time.
  7. On Monday, August 10, 2020, we resumed making updates to the jobs involved in building machine learning configurations. Thanks to the robustifications pushed out the previous Friday, we were able to make the changes safely with minimal risk to the serving.

Cause and impact

The underlying cause of the problem was a bug in the bad deploy of the machine learning update job: at each update, some parts of the record being updated were set to null. When the model-server read such a record, it would encounter the null and throw an error. The error was caught in the model-server and returned as a 500 internal server error to the end user.
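As an illustration only, the following minimal sketch (with hypothetical names such as feature_weights, score_item, and handle_query; LiftIgniter's actual model-server code may look quite different) shows how a null value left by the update job can surface as a 500 when a record is read at query time:

```python
# Minimal sketch (hypothetical field and function names) of how a null value in
# an updated machine learning record can surface as a 500 at the model-server.

class RecordCorruptedError(Exception):
    """Raised when a machine learning record contains unexpected null fields."""


def score_item(record: dict) -> float:
    # Hypothetical scoring step: the bad deploy left "feature_weights" set to
    # null (None), so any arithmetic on it fails.
    weights = record.get("feature_weights")
    if weights is None:
        raise RecordCorruptedError(
            "feature_weights is null for record %s" % record.get("id")
        )
    return sum(weights)


def handle_query(records: list) -> dict:
    try:
        scores = [score_item(r) for r in records]
        return {"status": 200, "scores": scores}
    except RecordCorruptedError:
        # Before the robustification, this class of error was caught at the top
        # level and returned to the end user as a 500 internal server error.
        return {"status": 500, "error": "internal server error"}
```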

The impact of the problem was therefore limited to queries where one or more of the records looked up in the query had been updated since the bad deploy. As a result, the percentage of queries affected increased gradually after the bad deploy. There was a further lag in the increase of errors due to caching on the model-server side: the sharp increase in error rate at 2:34 PM Pacific Time, 15 minutes after the start of errors, was partly due to the maximum cache duration being about 11 minutes.
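The sketch below illustrates this lag under stated assumptions (it is not LiftIgniter's actual cache code, and names like ModelServerCache are hypothetical): queries keep being served from still-valid cached copies of a record for up to the maximum cache duration, and only hit the corrupted backing data once the cached copy expires. It also reflects the detail that only valid data is cached.

```python
import time

# Illustrative sketch (hypothetical names) of why caching delayed the error
# spike: a record cached before the bad deploy keeps serving valid data for up
# to the maximum cache duration (~11 minutes); only after expiry does the
# model-server read the (possibly corrupted) record from the backing store.

MAX_CACHE_SECONDS = 11 * 60  # approximate maximum cache duration


def record_is_valid(record: dict) -> bool:
    # Hypothetical validity check: no null fields.
    return all(value is not None for value in record.values())


class ModelServerCache:
    def __init__(self):
        self._entries = {}  # record_id -> (record, inserted_at)

    def get(self, record_id, load_from_store):
        entry = self._entries.get(record_id)
        if entry is not None:
            record, inserted_at = entry
            if time.time() - inserted_at < MAX_CACHE_SECONDS:
                return record  # still serving the pre-corruption copy
        # Cache miss or expiry: read the record from the backing store.
        record = load_from_store(record_id)
        if record_is_valid(record):
            # Only valid data is cached, which is why recovery after the revert
            # was rapid: corrupt copies never lingered in the cache.
            self._entries[record_id] = (record, time.time())
        return record
```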

Similarly, after the bad deploy was reverted, the error rate started dropping as the affected records were cleaned up. This reduction was more rapid than the earlier ramp-up because the corrupt data had not been cached in our model-servers (only valid data is cached).

Learnings and improvements for the future

The following are some of the improvements that we have made or are planning to make based on this experience:

  1. Robustification of model-server against corrupt data (done): As we work on adding more algorithms and strategies in the model-server, and modifying the logic of building machine learning data, we want to make sure that the model-server is robust against corrupt data. That way, a bad deploy such as this one will not cause 500 internal server errors. Rather, the bad data will be reported by the model-server through a more granular exception-handling mechanism, so that we get alerted but end users do not get errors (end users are still affected, but in a milder form, through somewhat degraded recommendation quality). A sketch of this behavior appears after this list.
  2. More proactive monitoring of model-servers, with a predetermined monitoring plan, for deploys to jobs that involve real-time machine learning updates (instituted as a process update): One of the challenges with deploying build jobs is that their impact on serving can be fully tested only after a full deployment. By thinking through metrics that may be affected, and proactively monitoring them, we could more quickly catch and revert bad deploys. While this would not have prevented this incident, it could have reduced the duration before we diagnosed and reverted the deploy.
  3. Some improvements to pre-deploy testing procedures for jobs that involve real-time machine learning updates (still under consideration): We are experimenting with various ways of improving pre-deploy testing procedures for such jobs, so that subtle errors in them can be caught before production deployment. One idea that we plan to try out is to switch some non-customer organizations to the new job and check the impact on the model-server for those organizations before rolling out more widely. The list of non-customer organizations includes organizations with synthetic data, as well as real websites that are being tracked through LiftIgniter but aren't showing recommendations.
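As referenced in item 1 above, the following sketch (again with hypothetical names, not LiftIgniter's production code) shows the robustified behavior in broad strokes: corrupt records are skipped and reported through a granular exception path, and the query still returns a possibly degraded set of recommendations instead of a 500.

```python
import logging

logger = logging.getLogger("model_server")

# Illustrative sketch (hypothetical names) of the robustified model-server
# behavior: corrupt records are skipped and reported internally, and the query
# returns a degraded but valid response rather than a 500.


class RecordCorruptedError(Exception):
    pass


def score_item(record: dict) -> float:
    weights = record.get("feature_weights")
    if weights is None:
        raise RecordCorruptedError(record.get("id"))
    return sum(weights)


def handle_query_robust(records: list) -> dict:
    scores = {}
    corrupted = []
    for record in records:
        try:
            scores[record["id"]] = score_item(record)
        except RecordCorruptedError:
            corrupted.append(record.get("id"))
    if corrupted:
        # Reported through an internal exception/alerting channel so engineers
        # are notified, rather than surfacing a 500 to the end user.
        logger.warning("skipped %d corrupted records: %s", len(corrupted), corrupted)
    # Degraded but valid response: recommendations built only from the records
    # that could be scored.
    return {"status": 200, "scores": scores}
```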
Posted Aug 11, 2020 - 14:53 UTC

Resolved

We have pushed robustifications to our serving architecture. We expect that with these robustifications, problems similar to the ones that triggered this incident would not cause internal server errors. Rather, such problems would trigger an alert in our system for exceptions, while still returning a possibly degraded response to the queries.

The robustifications will be tested next week as we resume the work of pushing updates to our machine learning building configurations.
Posted Aug 08, 2020 - 05:25 UTC

Monitoring

We have successfully reverted the faulty code push and the internal server errors have stopped completely.

We are working on robustifications on the serving side that will prevent it from throwing errors for similar corruptions on the building side.
Posted Aug 06, 2020 - 22:05 UTC

Identified

LiftIgniter's recommendation servers are experiencing a significant increase in internal server errors and returning status code 500. We identified a recent code push to our machine learning building configuration that is responsible, and are working on reverting. We will update with more details as we learn more.
Posted Aug 06, 2020 - 21:40 UTC