Brief period of increased error rates in US West model-servers due to a mass killing of servers by our hosting provider

Incident Report for LiftIgniter

Postmortem

On 2020-08-13, between 1:11 AM and 1:25 AM Pacific Time (8:11 to 8:25 UTC), LiftIgniter's model-server system in the US West region, which serves traffic in the Western Americas and Asia-Pacific, saw degraded performance. The degradation was caused by Google Cloud mass-killing a large fraction of our "preemptible" model-servers. Our system automatically scaled out "normal" model-servers in response, and performance stabilized rapidly. Our engineers were notified through the alerting system, but no human intervention was needed to stabilize the system.

Background

LiftIgniter's model-server system, which serves recommendations to end users, comprises a mix of "normal" model-servers and "preemptible" model-servers, hosted on Google Cloud. The system exists in two Google Cloud regions (US East and US West). Normal model-servers cannot be arbitrarily preempted by Google; preemptible model-servers can be preempted by Google at any time with a 30-second notice.
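
For context on how a preemptible instance learns of its impending termination: a Compute Engine instance can poll the instance metadata server, whose "preempted" key flips to TRUE when the 30-second notice arrives. The sketch below is a minimal illustration of that pattern, not LiftIgniter's actual implementation; the drain() callback is a hypothetical stand-in for whatever cleanup (deregistering from the load balancer, finishing in-flight requests) a model-server performs.

```python
import time
import requests

# GCE metadata endpoint that returns "TRUE" once the instance has
# received a preemption notice (roughly 30 seconds before shutdown).
PREEMPTION_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/preempted"
)
METADATA_HEADERS = {"Metadata-Flavor": "Google"}


def preempted() -> bool:
    """Return True once Google has signaled that this instance will be preempted."""
    try:
        resp = requests.get(PREEMPTION_URL, headers=METADATA_HEADERS, timeout=2)
        return resp.ok and resp.text.strip() == "TRUE"
    except requests.RequestException:
        return False


def watch_for_preemption(drain):
    """Poll the metadata server and invoke drain() (hypothetical cleanup
    callback) as soon as the 30-second preemption notice arrives."""
    while not preempted():
        time.sleep(1)
    drain()
```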

LiftIgniter has a robust process for handling preemptions with no interruption of service to end users. If Google Cloud terminates many preemptible servers within a short window of time and does not provide replacement preemptible capacity, LiftIgniter's normal model-server pool scales out to handle the load. This scaling out can take a few minutes and, depending on the proportion of preemptible servers killed, can result in intermittent errors and connectivity issues.
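
A simplified sketch of the scale-out rule described above, under assumed names: count_healthy() and resize() are hypothetical helpers standing in for the monitoring and instance-group resize calls, and the fleet sizes and one-for-one replacement ratio are illustrative, not our production values.

```python
# Illustrative scale-out rule: when preemptible capacity drops, grow the
# pool of normal model-servers enough to cover the shortfall.
# count_healthy() and resize() are hypothetical helpers; all numbers here
# are assumptions for the sake of the example.

EXPECTED_PREEMPTIBLE = 60   # assumed steady-state preemptible fleet size
BASE_NORMAL = 30            # assumed steady-state normal fleet size


def rebalance(count_healthy, resize):
    """Grow the normal pool to absorb traffic from lost preemptible servers."""
    lost = max(0, EXPECTED_PREEMPTIBLE - count_healthy("preemptible"))
    if lost > 0:
        # Scaling out takes a few minutes; intermittent errors are
        # possible while the new normal servers come online.
        resize("normal", BASE_NORMAL + lost)
    else:
        resize("normal", BASE_NORMAL)
```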

Event timeline

  1. 1:11 AM to 1:17 AM Pacific Time: Over this time period, Google Cloud killed about 2/3 of the preemptible model-servers in the US West region.
  2. 1:12 AM to 1:23 AM Pacific Time: Over this time period, we saw an increase in unresponsiveness of the status endpoints on the model-servers. Unresponsiveness was highest from 1:18 AM to 1:21 AM.
  3. 1:17 AM to 1:24 AM Pacific Time: Over this time period, we saw errors with status codes 429, 503, and 504 on the recommendations endpoint, with 429 and 503 dominating. These are the expected status codes when servers are overloaded; the predominance of 429 and 503 over 504 indicates a relatively smart response to overload, in that the system discarded excess load up front rather than starting to compute on it and then timing out (see the sketch after this timeline). The peak proportion of request traffic returning an error code (summed across error codes) was 1% in the US West region. For individual customers, the peak proportion was 3%; these peaks lasted only a few seconds.
  4. 1:18 AM to 1:27 AM Pacific Time: Over this time period, the normal model-servers scaled out, making up for the lost capacity on preemptible model-servers.
  5. 1:27 AM to 1:36 AM Pacific Time: Over this time period, capacity recovered for the preemptible model-servers.
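
To illustrate the distinction drawn in item 3: returning 429 or 503 means the server rejects excess requests immediately, whereas 504 means it accepted work and then timed out after spending resources on it. Below is a minimal, framework-agnostic load-shedding sketch; the concurrency limit and the compute_recommendations() callable are assumptions, not details of our production servers.

```python
import threading

MAX_IN_FLIGHT = 200          # assumed per-server concurrency limit
_in_flight = 0
_lock = threading.Lock()


def handle_recommendation_request(compute_recommendations):
    """Shed load up front instead of timing out mid-computation.

    compute_recommendations is a hypothetical callable that performs the
    actual model inference for one request."""
    global _in_flight
    with _lock:
        if _in_flight >= MAX_IN_FLIGHT:
            # Reject immediately: the client sees a fast 429/503 rather
            # than a slow 504 after the server has burned CPU on the request.
            return 429, "Too Many Requests"
        _in_flight += 1
    try:
        return 200, compute_recommendations()
    finally:
        with _lock:
            _in_flight -= 1
```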

Learnings and improvements for the future

The rapid automatic stabilization of the system was satisfactory to us, given the rarity of such mass killings of servers. However, if the frequency of such incidents increases, we will tweak our settings around the relative proportion of preemptible and normal model-servers.

We are also generally satisfied with our alerting: our engineers were alerted and monitored the situation, but the automatic response was good enough that no manual intervention was needed. Had the automatic response not been sufficient, our engineers were on hand to make adjustments.

Posted Aug 13, 2020 - 20:22 UTC

Resolved

On August 13, 2020, between 1:11 AM and 1:25 AM Pacific Time (PDT), we saw increased error rates and unresponsiveness on our model-server system in US West due to a mass killing of servers by Google Cloud, our hosting provider. The automatic response was sufficient to stabilize the system, and no manual intervention was needed, though our engineers were alerted and monitored the situation.
Posted Aug 13, 2020 - 08:30 UTC