Networking issues between our servers causing downtime for end users

Incident Report for LiftIgniter

Resolved

Google has confirmed at https://status.cloud.google.com/incident/cloud-networking/19009 that the networking issue is resolved and that they will post a detailed report. Since all our alerts have also resolved and our systems have been stable for the past few hours, we are marking the incident resolved as well.

Posted Jun 03, 2019 - 00:39 UTC

Monitoring

Our systems have been stable for 40 minutes now, but we are still waiting for Google to confirm at https://status.cloud.google.com/incident/cloud-networking/19009 that they have fixed the issue on their end before we consider this issue resolved.

Posted Jun 02, 2019 - 21:05 UTC

Update

The services have returned again to a fully functional state in both regions. However, we are still waiting for more details from Google Cloud regarding the networking issue at https://status.cloud.google.com/incident/cloud-networking/19009

Posted Jun 02, 2019 - 20:17 UTC

Identified

We noticed a recurrence of the problem in US West (previously, the problem had been more severe in US East) and are applying the same emergency fix on US West. We expect to return to fully functional status in 10 to 15 minutes.

Also, Google Cloud has clarified at https://status.cloud.google.com/incident/compute/19003 that the issue is related to a larger networking issue (which matches the evidence we originally saw). Their status page on the networking issue is at https://status.cloud.google.com/incident/cloud-networking/19009

Posted Jun 02, 2019 - 20:06 UTC

Monitoring

All our services appear to be fully functional again. However, we are still waiting for Google to share more details of the underlying issue at https://status.cloud.google.com/incident/compute/19003 so we can evaluate how much longer to closely monitor our systems and whether there may be any other impact missed by our alerts.

Posted Jun 02, 2019 - 19:59 UTC

Identified

Google Cloud has reported the issue with Google Compute Engine at https://status.cloud.google.com/incident/compute/19003

They appear to have recovered enough that we should be able to get our services to a fully functional state soon. However, because they continue to have degraded performance, we will keep an eye on the impact on our services.

Posted Jun 02, 2019 - 19:38 UTC

Investigating

We received alerts suggesting that our services in various regions are having trouble talking to each other as well as to external services. This is affecting the volume of traffic that is being successfully processed by all our endpoints under query.petametrics.com and api.petametrics.com and is also affecting the accessibility of the LiftIgniter Console.

These networking issues may be due to our cloud provider. We are continuing to investigate so that we can mitigate the situation and assess the impact.

EDIT: After more investigation, we are more confident that the issues are due to our cloud provider, Google Cloud, but we are still waiting for them to report the issues on their status page at https://status.cloud.google.com. It looks like others have also noticed the same issues with Google Cloud; see for instance https://twitter.com/GossiTheDog/status/1135260263316381696, https://twitter.com/phineyes/status/1135259372895031297, and https://twitter.com/dripstatstatus/status/1135261993055600640

Posted Jun 02, 2019 - 19:02 UTC