CDN issues in Japan

Incident Report for LiftIgniter

Postmortem

Starting Wednesday, 4 July, 4 AM Japan Time (Tuesday, 3 July, noon Pacific Time), our CDN provider, MaxCDN, had problems with their Tokyo point-of-presence, causing Japan traffic to be routed to Hong Kong, which was unable to serve the majority of that traffic. The CDN serves the JavaScript files loaded by LiftIgniter through the snippet customers put on their websites (all under the domain cdn.petametrics.com). As a result, for customers using the JavaScript integration, most end users in Japan were unable to load LiftIgniter's JavaScript. Our query endpoint (query.petametrics.com) and inventory API endpoint (api.petametrics.com) maintained their usual availability.

We estimate this affected 95-98% of impressions in Japan until we resolved the problem on Wednesday, 4 July, 12:15 PM Japan Time (Tuesday, 3 July, 8:15 PM Pacific Time).

The impact on end users in Japan of customers using our JavaScript integrations was as follows:

  1. No recommendation requests were being made to LiftIgniter, so LiftIgniter-powered recommendations were not being shown to these users. Customers requesting recommendations via the API were unaffected.
  2. No activities were being sent to LiftIgniter. Customers sending activities to LiftIgniter via the API were unaffected.
  3. Inventory information for these users was not being sent. However, the impact on the overall inventory was minimal, since it only affected content newly published or updated during this timeframe, and those items would still be updated if LiftIgniter received events from users outside Japan. Customers sending inventory via the API were unaffected.

Event timeline

4:30 AM Japan Time (12:30 PM Pacific Time): We noticed reduced traffic for one of our Japanese customers. We verified that the JavaScript file was loading and events were firing correctly for us locally, and also saw that the site had been under maintenance overnight, so we incorrectly diagnosed the ongoing site maintenance as the main reason for the reduced traffic. When traffic levels as seen by us failed to pick up by 7 AM Japan Time, we got in touch with the affected customer, but they felt that the scheduled maintenance was the likely reason. Neither side pinpointed CDN failure in Japan.

10:30 AM Japan Time (6:30 PM Pacific Time): We received reports from two other Japanese customers about the JavaScript file not loading, and identified CDN failure in Japan as a likely cause.

10:45 AM Japan Time (6:45 PM Pacific Time): Our engineer in California and our support representative in Japan began interactive debugging. Within 5-10 minutes, we obtained diagnostic information that made it clear that the CDN service was to blame, and opened a ticket with MaxCDN. We sent debugging and diagnostic information to the MaxCDN support representative.

12:00 PM Japan Time (8:00 PM Pacific Time): MaxCDN's network engineering team identified the problem as a failure of the Tokyo point-of-presence causing traffic to get routed to the Hong Kong point-of-presence, which was getting overloaded. The MaxCDN support representative suggested that LiftIgniter disable the use of additional points of presence, so that requests would be routed to MaxCDN's core network. LiftIgniter made the change, and traffic levels were back to normal by 12:15 PM Japan Time. Our support representative in Japan and our customers also confirmed that things were working normally.

2:00 PM Japan Time (10:00 PM Pacific Time): MaxCDN posted about the outage as a Status Page incident. (By this time, LiftIgniter's customers were no longer affected because of the setting change made at 12:15 PM Japan Time.)

2:30 PM Japan Time (10:30 PM Pacific Time): MaxCDN noted that a fix had been made.

3:00 PM Japan Time (11:00 PM Pacific Time): MaxCDN reported the incident as being resolved.

System improvements to reduce incidence and minimize impact

The scale of impact of the CDN outage in Japan has led us to revisit our CDN relationship as well as our alerting and monitoring framework.

CDN redundancy and vetting

Historically, LiftIgniter has not paid close attention to monitoring the uptime of our CDN service. The CDN service we've used has generally been reliable -- we have had a couple of other outages in the last four years, but they were resolved within minutes. However, this incident highlights how critical CDN uptime is to LiftIgniter's customers and end users, so we are going to invest in a more redundant set of CDN solutions.

LiftIgniter is moving to a multi-CDN architecture, where we have at least two CDN providers. All CDN providers will be reviewed thoroughly for uptime, latency, quality of internal monitoring, and speed of incident resolution for serving end users around the world. We will pay particular attention to the reliability of the CDN in regions with a large number of our customers and end users, in particular Japan.

LiftIgniter's own monitoring of CDN uptime

LiftIgniter has an in-house service called upcheck that monitors uptime and latency for LiftIgniter's APIs, by sending requests to these APIs from servers in three different regions. We are working to expand upcheck in two ways:

  1. upcheck will now also query the CDN JavaScript files, to make sure these files are accessible.
  2. upcheck will run from a wider range of geographic locations, in particular more locations in Asia-Pacific.

Metrics from the expanded upcheck will be periodically reviewed, and high latencies or error or timeout rates will trigger alerts for our 24/7 on-call rotation.
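The kind of probe the expanded upcheck will run can be sketched as follows. This is a minimal illustration, not upcheck's actual implementation: the CDN file path, thresholds, and function names are assumptions for the example.

```python
import time
import urllib.request
import urllib.error

# Endpoints to probe from each region. The CDN JavaScript path is a
# hypothetical example; the two API hostnames are from the postmortem.
ENDPOINTS = [
    "https://cdn.petametrics.com/p13n.js",  # hypothetical CDN file path
    "https://query.petametrics.com/",       # query endpoint
    "https://api.petametrics.com/",         # inventory API endpoint
]

LATENCY_THRESHOLD_S = 2.0  # flag probes slower than this (illustrative value)
TIMEOUT_S = 5.0            # treat anything slower as a failure


def probe(url, timeout=TIMEOUT_S):
    """Fetch a URL once; return (ok, latency_seconds, detail)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            latency = time.monotonic() - start
            return resp.status == 200, latency, "HTTP %d" % resp.status
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, connection refused, timeout, or non-2xx response
        return False, time.monotonic() - start, str(exc)


def check_all(urls):
    """Probe every endpoint; return (url, detail) pairs that should alert."""
    alerts = []
    for url in urls:
        ok, latency, detail = probe(url)
        if not ok or latency > LATENCY_THRESHOLD_S:
            alerts.append((url, detail))
    return alerts
```

In a multi-region deployment, a runner like this would execute on a schedule in each location and ship its results to the metrics pipeline, so that a region-local CDN failure (as in this incident) shows up even when the files load fine elsewhere.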

Expected time of completion: We expect to have expanded upcheck to include the new metrics by Thursday, 5 July, and to have the alerts in place by Friday, 6 July.

Impact if these had been present prior to the outage: If both fixes 1. and 2. had been in place, we would have caught the problem within minutes of its onset, and been in touch with our CDN provider within about 15 minutes of the problem starting (so around 4:15 AM Japan Time).

Better traffic level monitoring

LiftIgniter already had some alerting around decline in traffic levels, but the alerts in place would only catch global declines in traffic rather than declines specific to one region. In light of this incident, we have improved our monitoring:

  1. Rather than use absolute traffic level thresholds, we have switched to comparing traffic levels with traffic levels at the same time a day ago. This allows us to control for the daily cycle in traffic levels.
  2. In addition to an alert on global traffic level, we have added alerts for traffic decline (relative to the same time a day ago) at the level of individual customers. We already monitored per-customer traffic levels through daily generation of traffic anomaly reports, but that kind of monitoring is too slow to catch urgent issues, hence the need for the new alerts.
  3. We have also started sending metrics on traffic levels by country to our internal metrics tool. After getting at least 24 hours of data, we will be adding alerts for low traffic level by country (relative to the same time a day ago) and also for high load time by country.

Expected time to completion: 1. and 2. are already completed; we expect to finish 3. on Thursday, 5 July Pacific Time (after getting at least 24 hours of data into our metrics tool).

Impact if these had been in place prior to the outage:

  • We verified that the alerts we set up for 2. would have triggered as a result of the outage. We would have immediately identified a list of the affected customers, narrowed down the problem to Japan, and been in touch with our CDN provider before 5:30 AM Japan Time.
  • With the alerts planned for 3., we expect to cut the alert and response time even further: we estimate we would have been in touch with our CDN provider by 4:30 AM Japan Time.
Posted Jul 04, 2018 - 23:12 UTC

Resolved

We have received confirmation from affected customers that things are working normally for them and their end users, and also verified that traffic levels continue to be similar to typical levels for this time of day. We are marking the issue resolved. We plan to publish a postmortem later providing more information on the cause of the issue and additional safeguards we are putting in place to prevent a recurrence.

Our CDN provider has posted some information to their own status page: https://status.maxcdn.com/incidents/7zwpqc1f581r
Posted Jul 04, 2018 - 03:42 UTC

Monitoring

At the suggestion of our CDN provider, we have disabled the problematic edge location, and are seeing a traffic increase to levels similar to those generally seen at this time of day. We also verified with our Japanese team that they are able to load the JavaScript files and events are firing normally.
Posted Jul 04, 2018 - 03:25 UTC

Update

Our CDN provider has confirmed that the issue is related to the Hong Kong location being unreachable for most clients, and their networking team is working to address the issue.
Posted Jul 04, 2018 - 03:07 UTC

Investigating

Our CDN provider is having issues serving our JavaScript files in Japan. As a result, we are successfully serving traffic for less than 5% of end users in Japan. The problem has been ongoing since 4 AM Japan Time on July 4, or 19:00 UTC on July 3. We are working actively with our CDN provider to resolve the issue.
Posted Jul 04, 2018 - 02:54 UTC