Starting Wednesday, 4 July, 4 AM Japan Time (Tuesday, 3 July, noon Pacific Time) our CDN provider, MaxCDN, had problems with their Tokyo point-of-presence, causing Japan traffic to be routed to Hong Kong, which was unable to serve the majority of traffic. The CDN serves the JavaScript files loaded by LiftIgniter through the snippet customers put on their website (all under the domain cdn.petametrics.com). Thus, most end users in Japan of customers using LiftIgniter via the JavaScript integration were unable to load LiftIgniter's JavaScript. Our query endpoint (query.petametrics.com) and inventory API endpoint (api.petametrics.com) maintained their usual availability.
We estimate this affected 95-98% of impressions in Japan until we resolved the problem on Wednesday, 4 July, 12:15 PM Japan Time (Tuesday, 3 July, 8:15 PM Pacific Time).
The impact on end users in Japan of customers using our JavaScript integrations was as follows:
4:30 AM Japan Time (12:30 PM Pacific Time): We noticed reduced traffic for one of our Japanese customers. We verified that the JavaScript file was loading and events were firing correctly for us locally, and also saw that the site had been under maintenance overnight, so we incorrectly attributed the reduced traffic to the ongoing site maintenance. When the traffic levels we observed had not recovered by 7 AM Japan Time, we got in touch with the affected customer, but they also felt that the scheduled maintenance was the likely reason. Neither side identified the CDN failure in Japan.
10:30 AM Japan Time (6:30 PM Pacific Time): We received reports from two other Japanese customers about the JavaScript file not loading, and identified CDN failure in Japan as a likely cause.
10:45 AM Japan Time (6:45 PM Pacific Time): Our engineer in California and our support representative in Japan began interactive debugging. Within 5-10 minutes, we obtained diagnostic information that made it clear that the CDN service was to blame, and opened a ticket with MaxCDN. We sent debugging and diagnostic information to the MaxCDN support representative.
12:00 PM Japan Time (8:00 PM Pacific Time): MaxCDN's network engineering team identified the problem as a failure of the Tokyo point-of-presence causing traffic to get routed to the Hong Kong point-of-presence, which was getting overloaded. The MaxCDN support representative suggested that LiftIgniter disable the use of additional points of presence, so that requests would be routed to MaxCDN's core network. LiftIgniter made the change, and traffic levels were back to normal by 12:15 PM Japan Time. Our support representative in Japan and our customers also confirmed that things were working normally.
2:00 PM Japan Time (10:00 PM Pacific Time): MaxCDN posted about the outage as a Status Page incident. (By this time, LiftIgniter's customers were no longer affected because of the setting change made at 12:15.)
2:30 PM Japan Time (10:30 PM Pacific Time): MaxCDN noted that a fix had been made.
3:00 PM Japan Time (11:00 PM Pacific Time): MaxCDN reported the incident as being resolved.
The scale of impact of the CDN outage in Japan has led us to revisit our CDN relationship as well as our alerting and monitoring framework.
Historically, LiftIgniter has not paid close attention to monitoring the uptime of our CDN service. The CDN service we've used has generally been reliable -- we have had a couple of other outages in the last four years, but they were resolved within minutes. However, this incident highlights how critical CDN uptime is to LiftIgniter's customers and end users, so we are going to invest in a more redundant set of CDN solutions.
LiftIgniter is moving to a multi-CDN architecture, where we have at least two CDN providers. All CDN providers will be reviewed thoroughly for uptime, latency, quality of internal monitoring, and speed of incident resolution for serving end users around the world. We will pay particular attention to the reliability of the CDN in regions with a large number of our customers and end users, in particular Japan.
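One way to picture the multi-CDN idea is a simple ordered failover: try each provider in turn and use the first one that passes a health check. The sketch below is only an illustration under stated assumptions. The function name `pick_cdn`, the health-check callback, and the `example.com` hostnames are hypothetical, and nothing here describes how LiftIgniter actually selects between providers.

```python
from typing import Callable, Optional, Sequence


def pick_cdn(base_urls: Sequence[str],
             is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first CDN base URL that passes the health check.

    base_urls is ordered by preference. is_healthy is any callable that
    reports whether a provider is currently serving correctly (hypothetical;
    in practice it could be backed by uptime probes).
    """
    for base_url in base_urls:
        if is_healthy(base_url):
            return base_url
    return None  # every provider failed its health check


# Hypothetical providers; the second one is used when the first is down.
providers = ["https://cdn-a.example.com", "https://cdn-b.example.com"]
down = {"https://cdn-a.example.com"}
print(pick_cdn(providers, lambda url: url not in down))
```

With a single provider, an outage like the one described above leaves nothing to fall back to; with two, the health check decides which one serves traffic.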
LiftIgniter has an in-house service called upcheck that monitors uptime and latency for LiftIgniter's APIs, by sending requests to these APIs from servers in three different regions. We are working to expand upcheck in two ways:
Metrics from the expanded upcheck will be periodically reviewed, and high latencies, error rates, or timeout rates will trigger alerts for our 24/7 on-call rotation.
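The kind of check described above (probing from several regions, then alerting on failures or high latency) can be sketched roughly as follows. This is not upcheck's code. The names `ProbeResult`, `probe`, and `regions_to_alert`, the region labels, and the thresholds are all assumptions chosen for illustration.

```python
import statistics
import time
import urllib.request
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class ProbeResult:
    region: str                  # where the probe ran from (hypothetical label)
    ok: bool                     # False on any error or timeout
    latency_ms: Optional[float]  # None when the request failed


def http_fetch(url: str, timeout_s: float) -> None:
    """Download the URL once; raises on HTTP errors and timeouts."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        resp.read()


def probe(url: str, region: str, timeout_s: float = 5.0,
          fetch: Callable[[str, float], None] = http_fetch) -> ProbeResult:
    """Time a single fetch and record whether it succeeded."""
    start = time.monotonic()
    try:
        fetch(url, timeout_s)
    except Exception:
        return ProbeResult(region, False, None)
    return ProbeResult(region, True, (time.monotonic() - start) * 1000.0)


def regions_to_alert(results: Iterable[ProbeResult],
                     max_failure_rate: float = 0.5,
                     max_median_latency_ms: float = 2000.0) -> List[str]:
    """Return regions whose failure rate or median latency is too high.

    Both thresholds are illustrative assumptions, not production values.
    """
    by_region: Dict[str, List[ProbeResult]] = defaultdict(list)
    for result in results:
        by_region[result.region].append(result)

    flagged = []
    for region, rs in by_region.items():
        failure_rate = sum(1 for r in rs if not r.ok) / len(rs)
        latencies = [r.latency_ms for r in rs if r.ok and r.latency_ms is not None]
        too_slow = bool(latencies) and statistics.median(latencies) > max_median_latency_ms
        if failure_rate > max_failure_rate or too_slow:
            flagged.append(region)
    return sorted(flagged)
```

Evaluating each region separately matters here: a probe set that only averaged across all regions could look healthy while one region is failing almost entirely.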
Expected time of completion: We expect to have expanded upcheck to include the new metrics by Thursday, 5 July, and to have the alerts in place by Friday, 6 July.
Impact if these had been present prior to the outage: If both fixes 1. and 2. had been in place, we would have been able to catch the problem within minutes of it occurring, and would have been in touch with our CDN provider within about 15 minutes of the problem starting (so around 4:15 AM Japan Time).
LiftIgniter already had some alerting around decline in traffic levels, but the alerts in place would only catch global declines in traffic rather than declines specific to one region. In light of this incident, we have improved our monitoring:
Expected time to completion: 1. and 2. are already completed; we expect to finish 3. on Thursday, 5 July Pacific Time (after getting at least 24 hours of data into our metrics tool).
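To illustrate why a global traffic alert can miss a regional outage, here is a minimal sketch comparing a global check with a per-region check. The function names, the thresholds, and the traffic counts are made up for illustration and do not reflect LiftIgniter's actual metrics or alert rules.

```python
from typing import Dict, List


def global_decline(current: Dict[str, int], baseline: Dict[str, int],
                   min_ratio: float = 0.5) -> bool:
    """True when total traffic falls below min_ratio of the total baseline."""
    total_baseline = sum(baseline.values())
    if total_baseline == 0:
        return False
    return sum(current.values()) < min_ratio * total_baseline


def declining_regions(current: Dict[str, int], baseline: Dict[str, int],
                      min_ratio: float = 0.5,
                      min_baseline: int = 100) -> List[str]:
    """Return regions whose traffic fell below min_ratio of their own baseline.

    Regions with a baseline under min_baseline are skipped so that noise in
    very small regions does not trigger alerts (an illustrative choice).
    """
    flagged = []
    for region, expected in baseline.items():
        if expected < min_baseline:
            continue
        if current.get(region, 0) < min_ratio * expected:
            flagged.append(region)
    return sorted(flagged)


# Illustrative counts only: one region loses about 97% of its traffic.
baseline = {"JP": 1000, "US": 9000}
current = {"JP": 30, "US": 9100}
print(global_decline(current, baseline))     # False
print(declining_regions(current, baseline))  # ['JP']
```

In this example the total barely moves, so the global check stays quiet, while the per-region comparison flags the affected region immediately.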
Impact if these had been in place prior to the outage: