On Monday, June 29, 2020, 2:26 PM Pacific Daylight Time (PDT) (21:26 UTC), our server that inserts catalog data (inventory) into our Aerospike-based key-value store in the US West region stopped inserting successfully. The reason for the failure was disk corruption on one of the Aerospike cluster nodes in US West. We addressed the backlog completely by the morning of Tuesday, June 30. Over the next few days, we implemented improvements to our metrics, monitoring, and alerts to be able to deal with similar situations better in the future.
LiftIgniter stores a copy of its catalog in separate Aerospike clusters in both its regions of operation (US West and US East), though the cluster is not the canonical location of the catalog data. Copying of catalog updates to the Aerospike cluster happens via a job. The copy of the catalog in the Aerospike cluster in each region is used by LiftIgniter's recommendation servers for serving recommendations, as well as for real-time machine learning model updates.
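To make the data flow concrete, here is a minimal sketch of the copy step for one region, assuming a Python job using the standard `aerospike` client; the hostname, namespace, set, and bin layout are illustrative, not our actual schema.

```python
import aerospike

# Connect to the regional Aerospike cluster (hostname is hypothetical).
config = {"hosts": [("aerospike-us-west.internal", 3000)]}
client = aerospike.client(config).connect()

def copy_catalog_update(item_id, metadata):
    """Write one catalog item's metadata into the regional key-value store."""
    key = ("catalog", "items", item_id)  # (namespace, set, primary key)
    client.put(key, metadata)            # bins hold the item's catalog fields
```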
The root cause and trigger of the problem was disk corruption on one of the nodes in the Aerospike cluster in US West. The corruption caused some writes to that node to fail. The job that updates the catalog in Aerospike applies updates in order, so once it encountered the error, it did not proceed further. This issue, which started at 2:26 PM PDT, led to the backlog that was discovered several hours later.
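The "apply in order, stop on error" behavior is roughly the following; this is a hedged sketch in which the update queue and its peek/pop API are assumptions, not our actual pipeline code.

```python
import time
from aerospike import exception as aerospike_exception

def apply_updates_in_order(update_queue, client):
    """Apply catalog updates strictly in order, never skipping ahead.

    A single failing write blocks everything behind it, which is how a
    disk problem on one node turns into a growing backlog.
    """
    while True:
        update = update_queue.peek()        # hypothetical queue API
        if update is None:
            time.sleep(1)                   # nothing to do yet
            continue
        key = ("catalog", "items", update.item_id)
        try:
            client.put(key, update.metadata)
        except aerospike_exception.AerospikeError:
            # The write to the node owning this key failed (e.g., disk
            # corruption). We retry the same update rather than moving on,
            # so updates behind it accumulate.
            time.sleep(5)
            continue
        update_queue.pop()                  # advance only after a successful write
```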
For the majority of customers, the impact was as follows: their end users whose requests were routed to the US West region were served using a somewhat outdated catalog (frozen as of the start time of the problem). This means that recently published items would not be shown in recommendations, and would not be recognized when users visited them. Also, updates to catalog metadata for existing items would not be reflected in recommendations. Other than that, however, recommendations were still served normally.
The impact could thus be described as a slight degradation in the quality of recommendations, along with slight inaccuracies in real-time learning. In most cases, end users would not notice this impact, and the overall effect on metrics was also small. [NOTE: If the problem had continued for a much longer period, the impact on recommendation quality would have been more severe.]
Two customer accounts were affected to a much greater extent. Both had high time-sensitivity to catalog updates, with old content either being expired from the catalog or filtered out through rules. For these accounts, LiftIgniter ended up returning empty results for a while. The problem with one of these accounts was what triggered the alert described in the incident timeline and led to our discovery of the issue; we worked around the problem for that account by directing its traffic to the unaffected US East region.
The problem with the other account occurred for a few hours while we were catching up on the backlog, and resolved on its own as our system worked through it.
We believe we could do better on two fronts: alerting, and catching up on backlogs. We're making improvements on both fronts:
Alerting: We have improved our alerting so that we are notified if the job stops working in the way it did this time. We already had two alerts, one for the job not reporting any metrics and one for the job reporting a backlog, but the specific failure mode here did not match either of the alert definitions. To elaborate: the job was reporting one metric but not the other, and it was the unreported metric that was used to calculate the backlog, so the backlog alert never fired. We modified one of the alert definitions to cover this case, as sketched below.
With the new alerting, we would have discovered the problem around 10 minutes after it started, rather than 5 hours later, and would have been able to fix it before any visible customer impact. We would have not only saved on the 5 hours of discovery time, but also cut down on the investigation time (since the more specific alert would have led us to the problem more quickly) and reduced the time taken to catch up with the backlog (because there would have been less of a backlog to catch up on).
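To make the gap in the old alert definitions concrete, here is a hedged sketch of the evaluation logic before and after the change. The metric names (`updates_processed`, `queue_head_timestamp`) and the threshold are hypothetical; they stand in for "the metric the job kept reporting" and "the metric the backlog calculation needed".

```python
import time

BACKLOG_THRESHOLD_SECONDS = 600  # illustrative threshold

def alerts_fired(metrics):
    """metrics maps metric name -> latest value, or None if the metric
    was not reported during the evaluation window."""
    fired = []

    # Old alert 1: the job is not reporting any metrics at all.
    if all(value is None for value in metrics.values()):
        fired.append("job-not-reporting")

    # Old alert 2: the reported backlog is too large. The backlog was
    # derived from queue_head_timestamp, which the stuck job stopped
    # reporting, so this condition could never evaluate to true.
    if metrics.get("queue_head_timestamp") is not None:
        backlog_seconds = time.time() - metrics["queue_head_timestamp"]
        if backlog_seconds > BACKLOG_THRESHOLD_SECONDS:
            fired.append("catalog-backlog")

    # New condition: some metrics are being reported, but the one needed
    # to compute the backlog is missing -- the failure mode we saw here.
    if (metrics.get("updates_processed") is not None
            and metrics.get("queue_head_timestamp") is None):
        fired.append("catalog-backlog")

    return fired
```

With input like `{"updates_processed": 1200, "queue_head_timestamp": None}`, the old definitions fire nothing, while the new condition correctly raises the backlog alert.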
Catching up on backlogs: We have added latency metrics so that we can debug backlogs better. These may also point the way to future improvements that let us catch up on backlogs faster.
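As an illustration, here is a minimal sketch of the kind of latency metric we mean, assuming a statsd-style client; the client, prefix, and metric names are illustrative.

```python
import time
from statsd import StatsClient  # assumption: a statsd-style metrics client

statsd = StatsClient("localhost", 8125, prefix="catalog_copy")

def write_with_latency_metric(client, update):
    """Write the update, then report how long it sat in the pipeline:
    the gap between when the catalog change happened and when it landed
    in Aerospike. Watching this lag per update makes it much easier to
    see where a backlog is and how quickly it is draining."""
    key = ("catalog", "items", update.item_id)
    client.put(key, update.metadata)
    lag_ms = (time.time() - update.updated_at) * 1000  # updated_at: unix seconds
    statsd.timing("write_lag", lag_ms)
```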