US Production is Currently Experiencing an Outage
Incident Report for Cloud Elements

We want to provide more details on what went wrong today with the Cloud Elements platform. During the outage, only direct API traffic to the platform was affected; all other transactions recovered automatically. The downtime lasted approximately 33 minutes and was caused by an internal service outage, which is why many customers will have seen HTTP 502 and HTTP 503 response codes.

Here’s what happened:

At 2.07pm (14.07 MDT), a server approached its memory limit, which triggered an alert to the Cloud Elements DevOps team.

Because this server ran out of memory, the platform suffered a service failure and immediately executed the automated recovery mechanism designed to ensure zero downtime. Unfortunately, this failover mechanism was unable to recover the failed service gracefully.

At 2.08pm, our DevOps team intervened to understand why the failover didn’t function as expected, and began troubleshooting the problem.

At 2.16pm, after finding that the OS kernel had failed to kill the offending service, the team manually restarted it.

At 2.37pm, the platform began accepting new traffic again.

At 2.40pm, the team verified that all platform services were up and running, with full capacity and functionality restored.

We apologize for today’s unplanned downtime. We’re disappointed with the sequence of events, and we will strive to do better for you.

We’ll continue to investigate why the failover mechanism didn’t work. Right now it appears that the failed process couldn’t be killed, and that the presence of this zombie process prevented the failover mechanism from completing.
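
To illustrate the failure mode described above, here is a minimal Python sketch of a stop routine that escalates from a termination request to a forced kill and then checks whether the process actually exited. It is not the Cloud Elements failover mechanism; the function name `stop_process`, the return labels, and the timeout values are assumptions made for this example.

```python
import subprocess


def stop_process(proc, term_timeout=5.0, kill_timeout=5.0):
    """Stop a subprocess.Popen child and report how it ended.

    Hypothetical helper for illustration only. Returns "terminated",
    "killed", or "stuck" if the process is still present after both attempts.
    """
    # First ask the process to exit (SIGTERM on POSIX).
    proc.terminate()
    try:
        proc.wait(timeout=term_timeout)
        return "terminated"
    except subprocess.TimeoutExpired:
        pass

    # Escalate to a forced kill (SIGKILL on POSIX).
    proc.kill()
    try:
        proc.wait(timeout=kill_timeout)
        return "killed"
    except subprocess.TimeoutExpired:
        # The kill did not take effect in time; a supervisor should treat
        # this as a distinct state instead of assuming the process is gone.
        return "stuck"
```

A recovery routine that handles the "stuck" case explicitly (for example, by alerting an operator or failing over to another host) avoids waiting indefinitely on a process that cannot be removed.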

Despite the failure, most of our processes operated as expected:

  • While direct API traffic was failing, we were sending appropriate failure responses to your applications.
  • All events and associated formula executions were paused and queued, and the platform started processing these again without any data loss.
  • Bulk API operations were paused and automatically restarted when the platform recovered.
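
For client applications that received the HTTP 502 and 503 responses mentioned above, one common way to handle them is to retry with exponential backoff. The following is a minimal Python sketch under stated assumptions: `send` is a hypothetical callable that performs one request and returns a `(status, body)` tuple, and the attempt count and delays are illustrative values, not Cloud Elements recommendations.

```python
import time

# Status codes treated as temporary failures in this sketch.
RETRYABLE_STATUSES = {502, 503}


def call_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    send: hypothetical callable returning (status, body) for one request.
    sleep: injectable so the backoff can be observed without real waiting.
    """
    status, body = None, None
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE_STATUSES:
            return status, body
        # Wait base_delay, 2*base_delay, 4*base_delay, ... between attempts,
        # but do not wait after the final attempt.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    # All attempts returned a retryable status; hand back the last response.
    return status, body
```

Requests that are not safe to repeat (for example, non-idempotent writes) need extra care before being retried this way.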
Posted Aug 09, 2017 - 17:15 MDT

Resolved
This incident has been resolved.
Posted Aug 09, 2017 - 17:14 MDT
Monitoring
We have resolved the issue and all services should be back to normal. We will continue to monitor the situation.
Posted Aug 09, 2017 - 14:48 MDT
Investigating
There is currently an issue in US Production. The Engineering team is currently investigating.
Posted Aug 09, 2017 - 14:14 MDT