Google has confirmed at https://status.cloud.google.com/incident/cloud-networking/19009 that the networking issue is resolved and that they will post a detailed report. Since all our alerts have also resolved and our systems have been stable for the past few hours, we are marking the incident resolved as well.
We noticed a recurrence of the problem in US West (previously, the problem had been more severe in US East) and are applying the same emergency fix on US West. We expect to return to fully functional status in 10 to 15 minutes.
All our services appear to be fully functional again. However, we are still waiting for Google to share more details of the underlying issue at https://status.cloud.google.com/incident/compute/19003 so we can evaluate how much longer to closely monitor our systems and whether there may be any other impact missed by our alerts.
Our cloud provider appears to have recovered enough that we should be able to get our services to a fully functional state soon. However, because the provider continues to have degraded performance, we will keep monitoring the impact on our services.
Posted Jun 02, 2019 - 19:38 UTC
Investigating
We received alerts suggesting that our services in various regions are having trouble communicating with each other and with external services. This is affecting the volume of traffic successfully processed by all our endpoints under query.petametrics.com and api.petametrics.com, and is also affecting the accessibility of the LiftIgniter Console.
These networking issues may be due to our cloud provider. We are continuing to investigate so that we can mitigate the situation and assess the impact.