On 2020-08-13, between 1:11 AM and 1:25 AM Pacific Time (8:11 to 8:25 UTC), LiftIgniter's model-server system in the US West region, which serves traffic in the Western Americas and Asia-Pacific, saw degraded performance. The degradation was caused by Google Cloud terminating a large fraction of our "preemptible" model-servers at once. Our system automatically scaled out "normal" model-servers in response, and performance stabilized rapidly. Our engineers were notified through the alerting system, but no human intervention was needed to stabilize the system.
LiftIgniter's model-server system, which serves recommendations to end users, comprises a mix of "normal" model-servers and "preemptible" model-servers. The system is hosted on Google Cloud and runs in two Google Cloud regions (US East and US West). Google cannot arbitrarily preempt the normal model-servers; it can preempt the preemptible model-servers at any time with a 30-second notice.
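To illustrate what the 30-second notice means in practice, here is a minimal Python sketch of how a server might detect preemption and budget its shutdown. It is not a description of LiftIgniter's actual process. It assumes the server polls the Compute Engine metadata endpoint for the `preempted` flag; the function names, the `fetch` parameter, and the 5-second safety margin are hypothetical and chosen only for this example.

```python
import urllib.request

# Compute Engine metadata endpoint that reports "TRUE" once the
# instance has been preempted, and "FALSE" otherwise.
PREEMPTED_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/preempted"
)


def fetch_preempted_flag(timeout_s=2.0):
    """Read the preempted flag from the metadata server (needs to run on GCE)."""
    request = urllib.request.Request(
        PREEMPTED_URL, headers={"Metadata-Flavor": "Google"}
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.read().decode().strip()


def is_preempted(fetch=fetch_preempted_flag):
    """Return True if the instance has been preempted.

    `fetch` is injectable (hypothetical design choice) so the logic can be
    exercised without access to a real metadata server.
    """
    return fetch() == "TRUE"


def drain_budget_s(notice_s=30.0, safety_margin_s=5.0):
    """Seconds available to finish in-flight requests after the notice.

    The 30-second notice comes from the text above; the safety margin
    is an assumed value for illustration.
    """
    return max(0.0, notice_s - safety_margin_s)
```

With a stubbed fetcher, `is_preempted(lambda: "TRUE")` returns `True`, and with the assumed margin the server would have 25 seconds to stop accepting new requests and finish in-flight ones.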
LiftIgniter has a robust process for handling routine preemptible terminations with no interruption of service to end users. However, if Google Cloud terminates many preemptible servers within a short period and does not provide preemptible capacity to replace them, LiftIgniter's normal model-servers scale out to handle the load. This scale-out can take a few minutes and, depending on the proportion of preemptible servers terminated, can result in intermittent errors and connectivity issues in the meantime.
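The dependence on the proportion of preemptible servers terminated can be made concrete with a small capacity model. The Python sketch below is illustrative only: the `Fleet` type, the per-server throughput, and all numbers are hypothetical and are not LiftIgniter's real figures or autoscaling logic.

```python
import math
from dataclasses import dataclass


@dataclass
class Fleet:
    """Hypothetical fleet description; all fields are assumptions."""
    normal: int            # number of normal model-servers
    preemptible: int       # number of preemptible model-servers
    per_server_qps: float  # requests per second one server can handle


def capacity_after_preemption(fleet, fraction_killed):
    """Serving capacity (QPS) right after a fraction of preemptibles is lost."""
    surviving = fleet.preemptible * (1.0 - fraction_killed)
    return (fleet.normal + surviving) * fleet.per_server_qps


def shortfall_qps(fleet, fraction_killed, demand_qps):
    """Unserved demand until replacement capacity comes online."""
    remaining = capacity_after_preemption(fleet, fraction_killed)
    return max(0.0, demand_qps - remaining)


def normal_servers_needed(fleet, fraction_killed, demand_qps):
    """Extra normal servers required to close the shortfall."""
    gap = shortfall_qps(fleet, fraction_killed, demand_qps)
    return math.ceil(gap / fleet.per_server_qps)


# Example with made-up numbers: 10 normal and 30 preemptible servers,
# 100 QPS each, 3000 QPS of demand, half the preemptibles terminated.
fleet = Fleet(normal=10, preemptible=30, per_server_qps=100.0)
extra = normal_servers_needed(fleet, fraction_killed=0.5, demand_qps=3000.0)
```

In this made-up scenario the fleet is 500 QPS short until 5 additional normal servers are running. A fleet with a higher share of normal servers would see a smaller shortfall for the same fraction terminated, at the cost of running fewer preemptible servers.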
The rapid automatic stabilization of the system was satisfactory to us, given the rarity of such mass terminations of servers. However, if such incidents become more frequent, we will adjust the relative proportion of preemptible and normal model-servers.
We are also generally satisfied with our alerting: our engineers were alerted and were monitoring the situation, but the automatic response was good enough that no manual intervention was needed. Had the automatic response not been sufficient, our engineers would have been on hand to make adjustments.