Hundreds of Atlassian Jira and Confluence users are facing up to two more weeks of downtime, following the Atlassian outage which started a week ago. The incident began when a script the company ran to delete legacy data instead “erroneously deleted sites, and all associated products for those sites including connected products, users, and third-party applications”, in the words of the software company this week.
Customers affected by the Atlassian outage received this message on April 11: “We were unable to confirm a more firm ETA until now due to the complexity of the rebuild process for your site. While we are beginning to bring some customers back online, we estimate the rebuilding effort to last for up to 2 more weeks.”
According to Atlassian, only around 400 customers in total (out of a customer base of around 226,000) have been affected by the Jira and Confluence outage, and 35% (or 40% according to the status update pages) have had their data restored. The company is at pains to make clear this was not a cyberattack, but a maintenance script gone wrong. But close watchers of the outage like Gergely Orosz say that “anywhere from 50,000 to 400,000 users had no access to JIRA, Confluence, OpsGenie, JIRA Status page, and other Atlassian Cloud services”.
The almost unprecedented length of the outage for a company of this scale (Atlassian generates over $2 billion in annual revenues) has raised real concerns among users about how well tested its recovery systems are. As Orosz notes: “On their ‘How Atlassian Does Resilience’ page Atlassian confirms they can restore data deleted in a matter of hours… There is a problem, though:
- Atlassian can, indeed, restore all data to a checkpoint in a matter of hours.
- However, if they did this, while the impacted ~400 companies would get back all their data, everyone else would lose all data committed since that point.
- So now each customer’s data needs to be selectively restored. Atlassian has no tools to do this in bulk.”
(Atlassian has confirmed this: “What we have not (yet) automated is restoring a large subset of customers into our existing (and currently in use) environment without affecting any of our other customers.”)
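The tension Orosz describes can be sketched in miniature. A bulk restore rolls the whole datastore back to the checkpoint, which resurrects the wiped customers but also discards everything other tenants wrote afterwards; a selective restore copies back only the impacted tenants. The sketch below assumes one directory per tenant purely for illustration — this is not Atlassian’s actual architecture, and the tenant names are invented:

```shell
#!/usr/bin/env bash
set -u
# Minimal sketch, NOT Atlassian's architecture: one directory per tenant.
backup="$(mktemp -d)"   # last good checkpoint
live="$(mktemp -d)"     # current, partly wiped datastore

mkdir -p "$backup/acme" "$backup/globex" "$live/globex"
echo v1 > "$backup/acme/doc"; echo v1 > "$backup/globex/doc"
echo v2 > "$live/globex/doc"   # globex kept writing after the checkpoint
                               # acme was wiped entirely

# Bulk-restoring $backup over $live would bring acme back, but would
# also roll globex back to v1. Selective restore touches only the
# impacted tenants:
for tenant in acme; do
  cp -R "$backup/$tenant" "$live/"
done

cat "$live/acme/doc"     # v1: impacted tenant recovered
cat "$live/globex/doc"   # v2: unaffected tenant keeps its newer data
```

Doing this safely for hundreds of tenants inside a live, shared environment — rather than per-directory as above — is exactly the step Atlassian says it has not automated.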
Atlassian outage cause: Was the Insight asset management plugin purged?
One affected customer told The Stack: “In the beginning of the outage, Atlassian slipped a message that this outage is in reference to Insight Asset management. We have been using Insight for more than a year now.
“It originally started as a Plug-In to our Cloud deployment. Relatively shortly after we started using it in production, we got the notification that Insight would be integrated into Jira Service Management (Premium). EOL for the Plug-In was communicated to be 31st March 2022. We successfully migrated to the ‘new Insight’ earlier this year, but had some DBs left behind for them to remove in the Plug-In (we couldn’t delete them for some reason).”
They added by email: “Given that we were hit as part of such a limited group of companies, the reference in the original incident report (via Atlassian support ticket), many confirmations on Reddit, and the perfectly matching timeframe, I’m led to believe that this outage is a result of purging the original Insight Plug-In & databases.”
UPDATED April 12: Atlassian’s CTO has confirmed that this was the case.
“Our team ran a script to delete legacy data…”
Atlassian said in a canned comment to media this week: “As part of scheduled maintenance on selected cloud products, our team ran a script to delete legacy data. This data was from a deprecated service that had been moved into the core datastore of our products. Instead of deleting the legacy data, the script erroneously deleted sites, and all associated products for those sites including connected products, users, and third-party applications.”
The software company added: “We maintain extensive backup and recovery systems, and there has been no data loss for customers that have been restored to date. This incident was not the result of a cyberattack and there has been no unauthorized access to customer data… We know this outage is unacceptable and we are fully committed to resolving this. Our global engineering teams are working around the clock to achieve full and safe restoration for our approximately 400 impacted customers and they are continuing to make progress on this incident.
“At this time, we have rebuilt functionality for over 35% of the users who are impacted by the service outage.”
An Atlassian spokesperson added by email: “We know we are letting our customers down right now and we are doing everything in our power to prevent future reoccurrence.”
Atlassian outage cause details likely in post-mortem. It could have been worse…
Bad though the incident may be, it pales in comparison to the 77TB of research data permanently wiped from a Japanese university’s supercomputer in December 2021, in an incident caused by a software update pushed by Hewlett Packard Enterprise (HPE) that made a script go rogue and delete backups.
HPE said at the time: “The backup script includes a find command to delete log files older than 10 days. In addition to functional improvement of the script, the variable name passed to the find command for deletion was changed to improve visibility and readability.” (Google and DeepL translate, with a light edit by The Stack.)
The company added: “However, there was a lack of consideration in the release procedure of this modified script. We were not aware of the side effects of this behavior and released the [updated] script, overwriting [a bash script] while it was still running. This resulted in the reloading of the modified shell script in the middle of the execution, resulting in undefined variables. As a result, the original log files in /LARGE0 [backup disc storage] were deleted instead of the original process of deleting files saved in the log directory.”
Got any more information on what’s behind the Atlassian outage? Get in touch.