Google Cloud networking incident in US East: Minimal impact on LiftIgniter service other than slight latency increases

Incident Report for LiftIgniter

Resolved

We are marking this incident as resolved after verifying that traffic is being distributed between US West and US East in the normal manner.

We believe that our system's response to the networking issues was graceful and resulted in minimal end-user impact from the outage. We will watch for further details from Google about the incident (https://status.cloud.google.com/incident/cloud-networking/19015) to learn more.
Posted Jul 03, 2019 - 16:14 UTC

Monitoring

The Google Cloud load balancer has now resumed sending traffic to US East. Everything seems to be working as expected, but we are closely monitoring metrics and will resolve this incident once all of them have looked healthy for a sustained period.
Posted Jul 03, 2019 - 13:51 UTC

Update

Although Google posted at https://status.cloud.google.com/incident/cloud-networking/19015 at 9:12 AM PDT (16:12 UTC) that the problem is fully resolved, we continue to see all of our traffic being sent to US West. We have opened a case with Google to check on the status.

We continue to believe that none of our services are affected (except possibly for slight latency increases). We will post further updates once we hear back from Google or see that the load balancer is directing traffic to US East.
Posted Jul 02, 2019 - 18:34 UTC

Identified

On Tuesday, July 2, 2019, between 08:22 and 08:24 AM Pacific Time, which is 15:22 to 15:24 UTC, we saw a dramatic reduction in the traffic going to our US East datacenter and a corresponding increase in the traffic going to our US West datacenter. The change in traffic appears to be due to the Google Load Balancer, a global public load balancer provided by our cloud provider Google Cloud Platform, deciding to no longer direct traffic to US East. Our autoscaling was able to handle the approximate doubling of traffic to US West fairly gracefully, with capacity roughly doubling within minutes.

We believe that Google Cloud's decision to redirect traffic is driven by networking issues with US East, as described at https://status.cloud.google.com/incident/cloud-networking/19015. According to Google's update at 08:50 AM PDT (15:50 UTC): "The Cloud Networking service (Standard Tier) has lost multiple independent fiber links within us-east1 zone. Vendor has been notified and are currently investigating the issue. In order to restore service, we have reduced our network usage and prioritised customer workloads. We will provide another status update by Tuesday, 2019-07-02 09:38 US/Pacific with current details." We will await further updates from Google.

As far as we can tell, there is no impact on the availability of LiftIgniter's services. Even the regional endpoint for US East (query-us-east1.petametrics.com) appears to be working correctly. However, there may be a small end-user latency impact, both for customers who have hardcoded query-us-east1.petametrics.com as the endpoint (due to the networking issues in US East) and for customers whose end users would normally be routed to US East but are now being redirected to the somewhat more distant US West.
Posted Jul 02, 2019 - 16:05 UTC