Issue with honoring extremely short TTLs for items in the catalog

Incident Report for LiftIgniter

Postmortem

In May and June 2020, a customer notified LiftIgniter that it had been trying to delete items from the catalog by reinserting them with a ttl of 1 second rather than using the DELETE API. These items were not deleted; instead, they ended up with a ttl of 30 days. Upon investigation, LiftIgniter found and fixed two issues that were causing this problem.

Event timeline

  1. Between August and December 2019, LiftIgniter moved to a new system for managing catalog insertion. Prior to this, the entire catalog was stored in our Aerospike clusters, which were the source of truth for the catalog. Aerospike is a key-value data store that allows for rapid lookups. After the change, the canonical source of truth became a SQL database, and we kept the Aerospike clusters for rapid serving.
  2. In May and June 2020, a customer told us that they had been trying to delete items by inserting them with a time to live (ttl) of 1 second. We discovered that the items were not being deleted, and were instead being recorded with a ttl of 30 days (the fallback default ttl in Aerospike).
  3. In June 2020, we discovered one cause of the problem: the SQL database that we were using as our canonical source of truth was not expiring items correctly after their ttl. This was because a stored procedure on the SQL database was not running at all between February and June 2020. We addressed this problem and pushed the code fix on July 6, 2020, so that the items would get expired from SQL and the expiration would propagate to Aerospike. We also ran a job to backfill all the past deletions that needed to be done.
  4. In July 2020, we discovered a second cause of the problem: when the ttl was short, the remaining ttl had sometimes shrunk so close to 0 seconds by the time the insertion was propagated to Aerospike that it was truncated to 0 seconds. Aerospike does not accept a ttl of 0 seconds; a value of 0 gets replaced by Aerospike's default ttl. That explains the ttl of 30 days that we had been seeing. We pushed the code fix on July 15, 2020.
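
The SQL-side expiration described in step 3 can be sketched roughly as follows. This is an illustration only, not our stored procedure: the `catalog` table, its columns, and the `expire_items` function are hypothetical names, and SQLite stands in for the production database. The point is that a periodic sweep removes rows whose ttl has elapsed and returns their ids so that the deletions can be propagated to the serving store.

```python
import sqlite3


def expire_items(conn, now):
    """Delete rows whose ttl has elapsed; return the ids that were removed.

    Hypothetical schema: catalog(item_id, inserted_at, ttl_seconds),
    with times expressed in epoch seconds.
    """
    cur = conn.execute(
        "SELECT item_id FROM catalog "
        "WHERE inserted_at + ttl_seconds <= ? ORDER BY item_id",
        (now,),
    )
    expired = [row[0] for row in cur.fetchall()]
    conn.executemany(
        "DELETE FROM catalog WHERE item_id = ?",
        [(item_id,) for item_id in expired],
    )
    conn.commit()
    # The caller would forward these ids to the serving store (assumption).
    return expired


def make_demo_db():
    """Build an in-memory table with one short-lived and one long-lived item."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE catalog ("
        "item_id TEXT PRIMARY KEY, inserted_at INTEGER, ttl_seconds INTEGER)"
    )
    conn.executemany(
        "INSERT INTO catalog VALUES (?, ?, ?)",
        [("short-lived", 1000, 1), ("long-lived", 1000, 3600)],
    )
    conn.commit()
    return conn
```

If a sweep like this silently stops running, expired rows simply stay in the table, which matches the symptom in step 3.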

NOTE: Prior to step 1, our insertion logic had respected ttls, because it had been a single-step process. The bug introduced by step 1 was specifically that we added logic to adjust ttl based on a lag between first insertion and the job that copies from the SQL database to Aerospike. It was this adjustment for the lag that created the case of a ttl of 0 seconds, even when the ttl at insertion was more than 0 seconds.
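
The lag adjustment and the truncation from step 4 can be illustrated with a small sketch. The function names and the one-second floor are assumptions made for this example; this is not the code we shipped, and rounding up with a floor is only one possible guard. The sketch relies solely on the behavior described above: a ttl of 0 sent to Aerospike is replaced by the default ttl.

```python
import math


def remaining_ttl_truncating(ttl_at_insert, lag_seconds):
    """Buggy variant: subtract the propagation lag, then truncate to an int.

    A short ttl can become 0 here, which Aerospike replaces with its
    default ttl instead of expiring the item.
    """
    return int(max(ttl_at_insert - lag_seconds, 0))


def remaining_ttl_safe(ttl_at_insert, lag_seconds, min_ttl=1):
    """Guarded variant (hypothetical): round up and never go below min_ttl.

    An item whose ttl has already elapsed gets min_ttl, so it expires
    almost immediately rather than picking up the default ttl.
    """
    remaining = ttl_at_insert - lag_seconds
    return max(math.ceil(remaining), min_ttl)


if __name__ == "__main__":
    # ttl of 1 second at insertion, 0.6 seconds of lag before the copy job runs
    print(remaining_ttl_truncating(1, 0.6))
    print(remaining_ttl_safe(1, 0.6))
```

With a ttl of 1 second and a lag of 0.6 seconds, the truncating variant yields 0 while the guarded variant yields 1, so the item still expires shortly after it reaches the serving store.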

Learnings and improvements for the future

We have three learnings from this experience:

  • Improving communication with customers around expected uses of ttl functionality: Our ttl functionality is not intended as a way to immediately delete items; we encourage customers to use the DELETE API for that purpose. Because extremely short ttls were not a typical use case for us, we had not tested them.
  • Better monitoring of SQL permission errors: We could have detected problems on the SQL side more quickly once they happened if SQL error logs automatically triggered alerts in our alerting system. We are working on instituting this monitoring for the future.
  • Improving our speed of diagnosis of the issue once it was reported to us: Since we didn't have good logging of the exact body of the original API insertion request, it took us some time to narrow the problem down to our system not honoring the ttl. We could have diagnosed the problem faster by testing a wider range of alternatives once the problem was reported to us.
Posted Jul 23, 2020 - 20:16 UTC

Resolved

This incident has been resolved.
Posted Jul 23, 2020 - 20:15 UTC