Updated March 3 with cause.
A Slack outage has lasted nearly half a day for millions of users.
The incident was first acknowledged by Slack at 15:27 GMT.
Six hours later, the Salesforce-owned enterprise collaboration platform had still not revealed a precise cause of the outage.
Is Slack down or did I get fired
— Jarvis (@jarvis_best) February 26, 2025
Slack’s status page said that it was “continuing to diligently work on database shard repair and have made progress on restoring affected replicas.”
“Users may still be experiencing issues loading Slack, using workflows, sending messages or loading threads. API-related actions may also be degraded in speed,” it added in a separate update at 21:04 GMT.
It later attributed the outage to “a maintenance action in one of our database systems, which, combined with a latency defect in our caching system, caused an overload of heavy traffic to the database. As a result, approximately 50% of instances relying on this database became unavailable.
“We took several actions to reduce the heavy load on the database system and implemented a fix for the source of the load,” Slack added, without giving further details.
Slack outage: Lessons learned from an internal failure?
Hopefully its recovery efforts have improved since January 2024, when some of Slack’s internal custom dashboards and visualizations of critical application performance data couldn’t be recovered after an outage.
As its engineers admitted: “We were shocked and disappointed to discover our most recent backup was almost two years old… The backup and restore method hadn’t gotten a lot of love after its first configuration… on top of that, our runbook was out of date, and the old backup failed when we tried to restore from it. We lost our internal employees’ links and visualizations, we were forced to recreate indexes and index patterns by hand,” they wrote in a candid blog in December.
(That outage, which, to reiterate, was not of Slack’s customer-facing services, happened after it ran out of disk space. As Slack put it: “[Our] Kibana cluster was configured to use an Elasticsearch instance on the same hosts as the Kibana application. This tied the storage and the application together on the same nodes, and those nodes were now failing…”)
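For a sense of what the neglected path looks like in practice, here is a generic, hedged example using Elasticsearch’s standard _snapshot REST API (the endpoint, repository name and backup location are placeholders, and this is not a description of Slack’s actual setup): register a snapshot repository, take a dated snapshot of the .kibana indices that hold dashboards and saved objects, and, the part that actually needs rehearsing, restore from it.

```python
from datetime import date

import requests

ES = "http://localhost:9200"   # assumed Elasticsearch endpoint (placeholder)
REPO = "kibana-backups"        # hypothetical snapshot repository name

# One-time: register a shared-filesystem snapshot repository.
requests.put(
    f"{ES}/_snapshot/{REPO}",
    json={"type": "fs", "settings": {"location": "/mnt/backups/kibana"}},
    timeout=30,
).raise_for_status()

# Routine: take a dated snapshot of the .kibana indices, which hold Kibana's
# dashboards, visualizations, saved objects and index patterns.
snapshot = f"kibana-{date.today().isoformat()}"
requests.put(
    f"{ES}/_snapshot/{REPO}/{snapshot}",
    params={"wait_for_completion": "true"},
    json={"indices": ".kibana*", "include_global_state": False},
    timeout=600,
).raise_for_status()

# The restore is the half that goes stale if it is never exercised; an
# untested restore is how a two-year-old backup turns out to be unusable.
# requests.post(f"{ES}/_snapshot/{REPO}/{snapshot}/_restore",
#               json={"indices": ".kibana*"}, timeout=600).raise_for_status()
```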
Of today’s (February 26) Slack outage, it said that “remediation work involves repairing affected database shards, which are causing feature degradation issues. This has become a diligent process to ensure we’re prioritizing the database replicas with the most impact…”
"Tens of thousands of EC2 instances..."
In September 2024, Slack noted that it manages “tens of thousands of EC2 instances that host a variety of services, including our Vitess databases, Kubernetes workers, and various components of the Slack application.
“The majority of these instances run on some version of Ubuntu, while a portion operates on Amazon Linux. With such a vast infrastructure, the critical question arises: how do we efficiently provision these instances and deploy changes across them? The solution lies in a combination of internally-developed services, with Chef playing a central role…”
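Chef recipes are written in a Ruby DSL, but the idea it brings to a fleet that size is desired-state convergence: describe what a node should look like, change things only when reality diverges, and make the run safe to repeat across tens of thousands of hosts. The Python sketch below is a rough analogue of that pattern, not Slack’s tooling and not Chef itself; the file path, contents and service name are hypothetical.

```python
import subprocess
from pathlib import Path

MANAGED_FILE = Path("/etc/myapp/app.conf")   # hypothetical managed config file
DESIRED_CONTENT = "log_level = info\n"       # hypothetical desired state
SERVICE = "myapp"                            # hypothetical systemd unit


def converge_config() -> bool:
    """Write the config only if it differs from the desired state.

    Returns True when a change was made, mirroring Chef's notion of a
    resource being "updated".
    """
    current = MANAGED_FILE.read_text() if MANAGED_FILE.exists() else None
    if current == DESIRED_CONTENT:
        return False  # already converged: a repeat run is a no-op
    MANAGED_FILE.parent.mkdir(parents=True, exist_ok=True)
    MANAGED_FILE.write_text(DESIRED_CONTENT)
    return True


if __name__ == "__main__":
    if converge_config():
        # Mirror Chef's "restart the service only when its config changed" pattern.
        subprocess.run(["systemctl", "restart", SERVICE], check=True)
```

Because every step checks before it acts, the same run can be pushed to an entire fleet without worrying about which hosts already have the change.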
Quite what has gone wrong, and precisely where, has not been fully answered yet, but we’ll keep an eye out for updates and/or a post mortem. Stay tuned.
Hugops to those fixing.
See also: AWS outage saw “cell management” system get flustered by big shards