Slack has blamed a major January 4 outage on a "routing problem between network boundaries on the network of our cloud provider" (AWS), which appears to have failed to scale fast enough to handle a surge in demand as Slack users returned to work after the New Year.
The resulting packet loss drove up error rates on Slack's backend servers, setting off a domino effect that ultimately pushed the company's load balancers into "an emergency routing mode where they routed traffic to healthy and unhealthy hosts alike", Slack said.
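Slack's note does not name the load-balancing software involved, but the behaviour it describes is what some proxies (Envoy, for one) call a "panic" or fail-open mode: once the share of hosts passing health checks drops below a threshold, the balancer stops filtering and spreads traffic across every host, on the theory that an overloaded backend is better than no backend at all. A minimal Python sketch of that logic, with the threshold and host data invented purely for illustration:

```python
import random

def pick_host(hosts, panic_threshold=0.5):
    """Route to a healthy host, unless so few hosts pass health checks
    that excluding the unhealthy ones would overload the survivors."""
    healthy = [h for h in hosts if h["healthy"]]
    if len(healthy) / len(hosts) < panic_threshold:
        # Emergency routing mode: send traffic to healthy and
        # unhealthy hosts alike.
        return random.choice(hosts)
    return random.choice(healthy)

# Two-thirds of this (made-up) fleet failing health checks trips panic mode.
fleet = [{"name": f"backend-{i}", "healthy": i % 3 == 0} for i in range(9)]
print(pick_host(fleet))
```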
"Our cloud provider has increased the capacity of their cross-boundary network traffic systems, as well as moving us from a shared to a dedicated system"
The root cause analysis (RCA) suggests Slack itself was not equipped to scale smoothly and the company has now promised a wide range of changes to how it handles its infrastructure. Interestingly, given the size of the company (recently sold to Salesforce for $27.7 billion), the RCA reveals that Slack was using shared rather than dedicated cloud resources. It's now using a dedicated system and said AWS has "increased the API rate limit on the cloud service APIs we call as part of the provisioning process."
(Slack has published a highly truncated note about the outage on its status page but only shares the RCA on request. We requested and promptly got it. Thanks! For those interested, it's here. Protocol -- which first reported on the RCA -- attributes the initial issue to the AWS Transit Gateway not scaling fast enough. Slack does not explicitly name AWS anywhere.)
Slack outage: company promises improved load testing
Slack says it has now created an alert for packet rate limits between network boundaries on its cloud provider's network. It will also "increase the number of provisioning service workers to improve our capacity to provision servers quickly", improve observability of its provisioning service and, more substantially, revisit that service's design, load test it more thoroughly and "revisit our backend server scaling automation, to ensure we have the right settings for predictive scaling, rate of scaling, and metrics that we use to scale." The last of these is due by April 13.
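The RCA does not describe how the new packet-rate alert is built. For a sense of what such an alert looks like in practice, a CloudWatch alarm can be created with a few lines of boto3; the metric, threshold, dimension and SNS topic below are placeholders rather than Slack's actual configuration:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Illustrative only: alarm when per-instance packet throughput for a
# fleet approaches a known limit. All values here are placeholders.
cloudwatch.put_metric_alarm(
    AlarmName="fleet-packet-rate-near-limit",
    Namespace="AWS/EC2",
    MetricName="NetworkPacketsOut",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "web-fleet"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=400_000,  # packets per period, chosen arbitrarily
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:network-alerts"],
)
```

Setting TreatMissingData to "breaching" errs on the side of paging when the metric stops reporting, which is usually the safer default for an alarm that guards a capacity limit.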
The incident comes after AWS itself promised a significant shakeup of its architecture following a major outage in its US-EAST-1 region in November. That November 25 outage began after AWS added new capacity to the front end of its Kinesis data streaming service. This led “all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration,” AWS admitted.
A domino effect left numerous associated AWS services facing sustained issues, including Cognito, which uses Kinesis to collect and analyse API access patterns, and CloudWatch, AWS's widely used application and infrastructure monitoring service.
AWS was also left “unable to update the [customer] Service Health Dashboard because the tool we use to post these updates itself uses Cognito, which was impacted by this event.” (Yes, this perennial bugbear of online service providers is still a thing in 2020.)
The company has now promised numerous changes to its architecture and stronger safeguards to prevent a recurrence, as well as the decoupling of related services that failed as a result. In the short term that includes moving to servers with more CPU and memory, reducing the total number of servers in the fleet and, hence, the number of threads each server needs to communicate across it.
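AWS's explanation rests on a simple piece of arithmetic: each front-end server keeps a thread open to every other server in the fleet, so the per-server thread count grows with fleet size, and the short-term fix of fewer, larger servers shrinks it again. A rough Python illustration, with every figure invented:

```python
OS_THREAD_LIMIT = 4096  # hypothetical per-process/OS thread ceiling

def threads_per_server(fleet_size, baseline=300):
    # One thread per peer in the fleet, plus a baseline for request
    # handling and housekeeping. All numbers are illustrative.
    return baseline + (fleet_size - 1)

# Adding capacity grows the fleet and pushes every server over the limit;
# consolidating onto fewer, larger servers brings it back under.
for fleet_size in (3500, 4000, 2500):
    t = threads_per_server(fleet_size)
    verdict = "ok" if t <= OS_THREAD_LIMIT else "exceeds OS limit"
    print(f"fleet of {fleet_size}: {t} threads per server ({verdict})")
```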