We wanted to provide more details on what went wrong today with the Cloud Elements platform. During the outage, only direct API traffic to the platform was affected; all other transactions recovered automatically. The downtime lasted approximately 33 minutes and was caused by an internal service outage, which is why many customers saw HTTP 502 and HTTP 503 response codes.
Here’s what happened:
At 2:07pm (14:07 MDT), a server reached its memory limit and alerted the Cloud Elements DevOps team. When the server ran out of memory, the platform suffered a service failure and immediately executed the automated recovery mechanism designed to ensure zero downtime. Unfortunately, this failover mechanism was unable to recover the failed service gracefully. At 2:08pm, our DevOps team intervened to understand why the failover didn't function as expected and began troubleshooting the problem. At 2:16pm, after finding that the OS kernel had failed to kill the offending service, the team restarted it manually. At 2:37pm, the platform began accepting new traffic again. By 2:40pm, the team had verified that all platform services were up and running, with full capacity and functionality restored.
We apologize for today's unplanned downtime. We're disappointed with this sequence of events and will strive to do better for you.
We'll continue to investigate why the failover mechanism didn't work. Our current understanding is that the failed process could not be killed, and the presence of this zombie process prevented the failover mechanism from completing.
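For readers curious about what a "zombie" process is: it's a process that has already exited but whose entry lingers in the process table because its parent never reaped it. Signals, including SIGKILL, have no effect on it, since there is nothing left to kill — which is consistent with a kernel being unable to clean up a failed service. The minimal Linux sketch below (illustrative only, not Cloud Elements' actual service code) shows how one arises:

```python
import os
import time

# Illustrative sketch (Linux-specific, via /proc): a child that exits
# before its parent calls wait() becomes a zombie.
pid = os.fork()
if pid == 0:
    os._exit(0)           # child exits immediately
else:
    time.sleep(0.5)       # parent deliberately does not reap the child yet

    # Sending SIGKILL to a zombie succeeds but changes nothing --
    # the process is already dead; only reaping removes it.
    os.kill(pid, 9)

    # Field 3 of /proc/<pid>/stat is the process state; 'Z' means zombie.
    with open(f"/proc/{pid}/stat") as f:
        state = f.read().split()[2]
    print(state)          # 'Z'

    os.waitpid(pid, 0)    # reaping finally removes the table entry
```

A supervisor that waits for a killed process to disappear from the process table can hang on exactly this state, which is one plausible way a failover mechanism fails to complete.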
Despite the failure, most of our processes operated as expected: