On Monday, June 29, 2020, 2:26 PM Pacific Daylight Time (PDT) (21:26 UTC), our server that inserts catalog data (inventory) into our Aerospike-based key-value store in the US West region stopped inserting successfully. The reason for the failure was disk corruption on one of the Aerospike cluster nodes in US West. We addressed the backlog completely by the morning of Tuesday, June 30. Over the next few days, we implemented improvements to our metrics, monitoring, and alerts to be able to deal with similar situations better in the future.
LiftIgniter stores a copy of its catalog in separate Aerospike clusters in both its regions of operation (US West and US East), though the cluster is not the canonical location of the catalog data. Copying of catalog updates to the Aerospike cluster happens via a job. The copy of the catalog in the Aerospike cluster in each region is used by LiftIgniter's recommendation servers for serving recommendations, as well as for real-time machine learning model updates.
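To make the data flow concrete, here is a minimal sketch of the copy step for one region, assuming a Python job using the standard `aerospike` client; the hostname, namespace, set, and bin layout are illustrative, not our actual schema.

```python
import aerospike

# Connect to the regional Aerospike cluster (hostname is hypothetical).
config = {"hosts": [("aerospike-us-west.internal", 3000)]}
client = aerospike.client(config).connect()

def copy_catalog_update(item_id, metadata):
    """Write one catalog item's metadata into the regional key-value store."""
    key = ("catalog", "items", item_id)  # (namespace, set, primary key)
    client.put(key, metadata)            # bins hold the item's catalog fields
```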
The root cause and trigger of the problem was disk corruption on one of the nodes in the Aerospike cluster in US West. The corruption caused some writes to that node to fail. The job that updates the catalog in Aerospike applies updates in order, so once it encountered the error, it did not proceed further. This issue, which started at 2:26 PM PDT, led to the backlog that was discovered several hours later.
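The "apply in order, stop on error" behavior is roughly the following; this is a hedged sketch in which the update queue and its peek/pop API are assumptions, not our actual pipeline code.

```python
import time
from aerospike import exception as aerospike_exception

def apply_updates_in_order(update_queue, client):
    """Apply catalog updates strictly in order, never skipping ahead.

    A single failing write blocks everything behind it, which is how a
    disk problem on one node turns into a growing backlog.
    """
    while True:
        update = update_queue.peek()        # hypothetical queue API
        if update is None:
            time.sleep(1)                   # nothing to do yet
            continue
        key = ("catalog", "items", update.item_id)
        try:
            client.put(key, update.metadata)
        except aerospike_exception.AerospikeError:
            # The write to the node owning this key failed (e.g., disk
            # corruption). We retry the same update rather than moving on,
            # so updates behind it accumulate.
            time.sleep(5)
            continue
        update_queue.pop()                  # advance only after a successful write
```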
For the majority of customers, the impact was as follows: their end users whose requests were routed to the US West region were served using a somewhat outdated catalog (frozen as of the start time of the problem). This means that recently published items would not be shown in recommendations, and would not be recognized when users visited them. Also, updates to catalog metadata for existing items would not be reflected in recommendations. Other than that, however, recommendations were still served normally.
The impact could thus be described as a slight degradation in the quality of recommendations, along with slight inaccuracies in real-time learning. In most cases, end users would not notice this impact, and the overall effect on metrics was also small. [NOTE: If the problem had continued for a much longer period, the impact on recommendation quality would have been more severe.]
Two customer accounts were affected to a much greater extent. Both had high time-sensitivity to catalog updates, with old content either being expired from the catalog or filtered out through rules. For these accounts, LiftIgniter ended up returning empty results for a while. The problem with one of these accounts was what triggered the alert described in the incident timeline and led to our discovery of the issue; we worked around the problem for that account by directing its traffic to the unaffected US East region.
The problem with the other account occurred for a few hours while we were catching up on the backlog, and resolved on its own as our system worked through it.
We believe we could do better on two fronts: alerting, and catching up on backlogs. We're making improvements on both fronts:
Alerting: We have improved our alerting so that we are notified if the job stops working in the way it did this time. We already had two alerts, one for the job not reporting any metrics and one for the job reporting a backlog, but the specific failure mode here did not match either of the alert definitions. To elaborate: the job was reporting one metric but not the other, and it was the unreported metric that was used to calculate the backlog, so the backlog alert never fired. We modified one of the alert definitions to cover this case, as sketched below.
With the new alerting, we would have discovered the problem around 10 minutes after it started, rather than 5 hours later, and would have been able to fix it before any visible customer impact. We would have not only saved on the 5 hours of discovery time, but also cut down on the investigation time (since the more specific alert would have led us to the problem more quickly) and reduced the time taken to catch up with the backlog (because there would have been less of a backlog to catch up on).
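To make the gap in the old alert definitions concrete, here is a hedged sketch of the evaluation logic before and after the change. The metric names (`updates_processed`, `queue_head_timestamp`) and the threshold are hypothetical; they stand in for "the metric the job kept reporting" and "the metric the backlog calculation needed".

```python
import time

BACKLOG_THRESHOLD_SECONDS = 600  # illustrative threshold

def alerts_fired(metrics):
    """metrics maps metric name -> latest value, or None if the metric
    was not reported during the evaluation window."""
    fired = []

    # Old alert 1: the job is not reporting any metrics at all.
    if all(value is None for value in metrics.values()):
        fired.append("job-not-reporting")

    # Old alert 2: the reported backlog is too large. The backlog was
    # derived from queue_head_timestamp, which the stuck job stopped
    # reporting, so this condition could never evaluate to true.
    if metrics.get("queue_head_timestamp") is not None:
        backlog_seconds = time.time() - metrics["queue_head_timestamp"]
        if backlog_seconds > BACKLOG_THRESHOLD_SECONDS:
            fired.append("catalog-backlog")

    # New condition: some metrics are being reported, but the one needed
    # to compute the backlog is missing -- the failure mode we saw here.
    if (metrics.get("updates_processed") is not None
            and metrics.get("queue_head_timestamp") is None):
        fired.append("catalog-backlog")

    return fired
```

With input like `{"updates_processed": 1200, "queue_head_timestamp": None}`, the old definitions fire nothing, while the new condition correctly raises the backlog alert.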
Catching up on backlogs: We have added latency metrics so that we can debug backlogs better. These may also point the way to future improvements that let us catch up on backlogs faster.
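As an illustration, here is a minimal sketch of the kind of latency metric we mean, assuming a statsd-style client; the client, prefix, and metric names are illustrative.

```python
import time
from statsd import StatsClient  # assumption: a statsd-style metrics client

statsd = StatsClient("localhost", 8125, prefix="catalog_copy")

def write_with_latency_metric(client, update):
    """Write the update, then report how long it sat in the pipeline:
    the gap between when the catalog change happened and when it landed
    in Aerospike. Watching this lag per update makes it much easier to
    see where a backlog is and how quickly it is draining."""
    key = ("catalog", "items", update.item_id)
    client.put(key, update.metadata)
    lag_ms = (time.time() - update.updated_at) * 1000  # updated_at: unix seconds
    statsd.timing("write_lag", lag_ms)
```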