What does the Atlassian cloud outage teach us?

2022-06-06 11:08:12
Dongwei Xu
Summary: In the wake of the Atlassian cloud service outage, the ZenTao team believes that no one wants to see a cloud service fail, and that a vendor should not be written off over a single mistake. We hope the vendor recovers the data as soon as possible to minimize the damage on both sides. For ZenTao, the lesson is to keep the product lightweight enough, encourage users to favor private deployment, and keep data secure.

This article contains about 1600 words and takes about 8 minutes to read.


Atlassian's April 5 outage of some cloud services caused a stir both in China and abroad. The outage involved Jira Work Management, Jira Service Management, Confluence, Jira Software, Atlassian Access, and Opsgenie. About 400 customers were affected, and 55% of them had been restored by 20:37 on April 15.


The scope of the incident was limited, but it took so long to resolve that Atlassian failed to meet its SLA (99.9% monthly uptime for Premium cloud products, 99.95% for Enterprise cloud products). Whether end users' data would come back intact became one of the focal points of discussion, and Atlassian's credibility took a hit.


No one wants to see something like this happen. We hope Atlassian fixes the problem as quickly as possible, minimizes the impact, and helps everyone move past the emotions of the incident. In the meantime, we can walk through the whole process and see what can be learned.

1. What actually happened?

According to the official explanation, a plug-in for Jira Service Management and Jira Software called "Insight - Asset Management" (Reference 1) has been fully integrated into the products as a native Atlassian feature. After February 3, 2023, Jira Service Management 4.14, Jira Software 8.14, and earlier versions will no longer be able to install Insight as a plug-in. To prepare for this, Atlassian needed to deactivate the standalone Insight app on customer sites. Atlassian's engineering team planned to use an existing script to do so. That is where the trouble began.

2. What can we learn from the incident?

Officially, the incident had two root causes (Reference 2):


1) Communication problems. There was a communication gap between the team requesting the deactivation and the team executing it. Instead of providing the ID of the app to be deactivated, the requesting team provided the ID of the entire cloud site.

This communication problem is very typical:

  • First, this was not a cross-functional team. The team that requests the deactivation and the team that executes it are two different teams, so work has to be passed and handed over between groups, leaving room for losses in efficiency and accuracy.
  • Second, there was no face-to-face communication. The executing side did not understand why the operation was being carried out; it simply followed instructions mechanically, to the point that even the ID had to be provided by the requester. Imagine if the requester had only told the executor the name of the app to be deactivated and the reason for deactivating it, and the executor had done everything else (including looking up the ID). The probability of such an incident would have been much lower.
  • Third, the verification process appears to have been missing or ineffective. The app ID was written down as the ID of the entire cloud site, and no one on the requesting side caught it. At the same time, the executor ran the operation mechanically, without confirming the reason for the request or double-checking the ID.
  • Finally, the degree of automation in management needs to improve. With sound configuration management and operations tooling behind the scenes, an app ID is just another backend configuration item: the whole deactivation process should be a one-click operation in the foreground, so there is no need for the requester to look up and hand over the app ID, and no chance of handing over the wrong one (see the sketch after this list).
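
As a thought experiment, the sketch below shows what such a name-based, request-driven flow could look like: the requester supplies only the app name and a reason, and the tooling resolves the ID itself, so a site ID can never be passed along by mistake. This is a minimal illustration under assumed conventions, not Atlassian's actual tooling; the APP_REGISTRY lookup table and the "site-" ID prefix are invented for the example.

```python
# Hypothetical sketch: the requester never handles raw IDs.

# Assumed backend configuration: app name -> app ID (invented data).
APP_REGISTRY = {
    "Insight - Asset Management": "app-8f3c1d",
}

SITE_ID_PREFIX = "site-"  # assumed naming convention for cloud-site IDs


def resolve_app_id(app_name: str) -> str:
    """Look up the app ID from the registry instead of accepting one from the requester."""
    try:
        app_id = APP_REGISTRY[app_name]
    except KeyError:
        raise ValueError(f"Unknown app name: {app_name!r}; refusing to guess an ID.")
    if app_id.startswith(SITE_ID_PREFIX):
        # Defensive check: a registry entry that looks like a site ID is a configuration error.
        raise ValueError(f"{app_id!r} looks like a cloud-site ID, not an app ID.")
    return app_id


def request_deactivation(app_name: str, reason: str) -> dict:
    """Build a deactivation request from a name and a reason only."""
    return {
        "app_id": resolve_app_id(app_name),
        "app_name": app_name,
        "reason": reason,
    }


if __name__ == "__main__":
    req = request_deactivation(
        "Insight - Asset Management",
        "App is now a native feature; the standalone plug-in is being retired.",
    )
    print(req)
```

In a flow like this, the handover between the two teams carries only human-readable intent (name and reason), and the error-prone translation into IDs stays inside the tooling.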

2) Problems with how the script was used. The script has two capabilities: "Tag deletion", used for everyday operations (where data can be restored if needed), and "Permanent delete", used when data must be permanently removed for compliance reasons. When running the script, the team used the wrong execution mode and the wrong list of IDs. As a result, the sites of about 400 customers were improperly deleted.


A script was used here, yet both the execution mode and the ID list were wrong. Two questions come to mind:

  • Were the execution mode and the ID list expressed as raw numbers or letter codes? If so, an operator can easily cause an incident through a lapse of memory or a manual slip. There would be far fewer such incidents if these values were presented to the operator in friendlier natural language, with the cryptic numbers and codes hidden in the backend through configuration management.
  • And, of course, a missing or ineffective verification step further increases the probability of problems.

In addition, when data is about to be deleted, are there sufficient warnings so the operator can reconfirm the likely impact and decide whether to proceed?


And is it worth splitting the script's two capabilities apart, so that a confusing combination of execution options can never again trigger the wrong kind of deletion? The sketch below shows one way that separation might look.
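
The following is a minimal, hypothetical sketch of a safer deletion tool along those lines: the irreversible mode lives behind its own subcommand rather than an ambiguous flag, the tool spells out the blast radius in plain language, defaults to a dry run, and demands a typed confirmation before doing anything permanent. None of this reflects Atlassian's real script; the commands, flags, and function names are invented for the example.

```python
"""Hypothetical sketch of a safer deletion CLI (not Atlassian's actual tooling)."""
import argparse


def mark_for_deletion(app_ids: list[str]) -> None:
    # Recoverable, day-to-day operation (placeholder implementation).
    for app_id in app_ids:
        print(f"[mark] {app_id} tagged for deletion (recoverable).")


def permanently_delete(app_ids: list[str]) -> None:
    # Irreversible compliance operation (placeholder implementation).
    for app_id in app_ids:
        print(f"[purge] {app_id} permanently deleted.")


def main() -> None:
    parser = argparse.ArgumentParser(description="Deactivate or purge apps.")
    # Separate subcommands instead of a single, easily confused mode flag.
    sub = parser.add_subparsers(dest="command", required=True)
    for name in ("mark", "purge"):
        p = sub.add_parser(name)
        p.add_argument("app_ids", nargs="+", help="App IDs to act on.")
        p.add_argument("--execute", action="store_true",
                       help="Actually perform the operation; the default is a dry run.")
    args = parser.parse_args()

    # Spell out the impact in plain language before doing anything.
    print(f"About to run '{args.command}' on {len(args.app_ids)} app(s): {args.app_ids}")
    if not args.execute:
        print("Dry run only; re-run with --execute to apply.")
        return

    if args.command == "purge":
        # The irreversible path requires a typed confirmation, not just a flag.
        answer = input("Type 'PERMANENTLY DELETE' to confirm: ")
        if answer != "PERMANENTLY DELETE":
            print("Confirmation did not match; aborting.")
            return
        permanently_delete(args.app_ids)
    else:
        mark_for_deletion(args.app_ids)


if __name__ == "__main__":
    main()
```

Separating the recoverable and irreversible paths this way means an operator would have to make several deliberate, explicit choices before anything is permanently destroyed.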

3. What can we learn from the recovery?

As for the speed of data recovery, progress has been relatively slow, and Atlassian has offered an official explanation. Atlassian maintains synchronous standby replicas across multiple AWS availability zones (AZs); AZ failover is automated and usually takes 60-120 seconds. It also keeps secure backups that can restore data to any point in time within the last 30 days.


However, Atlassian's backup recovery is not sufficiently automated. At present, those backups are used to roll back individual customers, or small numbers of customers, who have accidentally deleted their own data. If needed, Atlassian can also immediately restore all customers into a new environment. What it cannot do is restore a large subset of customers into their existing, in-use environment without affecting any other customers, because that would require far more powerful automated recovery capabilities.


In Atlassian's cloud environment, each data store holds data for multiple customers. Because the data deleted in this incident is only a slice of data stores that other customers are still actively using, Atlassian can only manually extract the affected customers' data from backups and restore it. Each site recovery is a long and complex process that requires both internal verification and verification by the end customer once the site is back. Ten days on, recovery work is still continuing. The toy sketch below illustrates why restoring a single tenant out of a shared store is so much harder than restoring the whole store.
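
To make the multi-tenancy point concrete, here is a toy illustration of selective, per-tenant restoration from a shared backup. The record layout and the tenant_id field are invented for the example and say nothing about Atlassian's real schema; the point is only that a wholesale restore would clobber live data belonging to unaffected tenants, so only the affected tenant's records can be copied back.

```python
# Toy illustration: one backup snapshot covers data from multiple customers.

backup_snapshot = [  # snapshot taken before the deletion (invented data)
    {"tenant_id": "cust-001", "key": "ISSUE-1", "value": "Fix login bug"},
    {"tenant_id": "cust-002", "key": "ISSUE-7", "value": "Upgrade database"},
    {"tenant_id": "cust-001", "key": "ISSUE-2", "value": "Write release notes"},
]

live_store = [  # current state: cust-001 was wrongly deleted, cust-002 kept working
    {"tenant_id": "cust-002", "key": "ISSUE-7", "value": "Upgrade database (done)"},
    {"tenant_id": "cust-002", "key": "ISSUE-9", "value": "New work since the backup"},
]


def restore_tenant(snapshot, store, tenant_id):
    """Copy back only the affected tenant's records, leaving other tenants untouched."""
    restored = [row for row in snapshot if row["tenant_id"] == tenant_id]
    store.extend(restored)
    return len(restored)


count = restore_tenant(backup_snapshot, live_store, "cust-001")
print(f"Restored {count} records for cust-001 without touching cust-002's newer data.")
```

Restoring the whole snapshot would be a single bulk operation, but it would roll back cust-002's newer work; the per-tenant path is the one that preserves other customers, and it is exactly the path that is slow unless it has been automated in advance.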


The degree of automation matters enormously, especially for companies that serve many customers: time is reputation, and sometimes survival. We do not know whether Atlassian was aware of this automation gap before the incident, but we have seen that, since it happened, Atlassian has sped up recovery by adding automation and has won back a lot of time. The incident is a wake-up call for the rest of us to examine our own disaster recovery capabilities and ask whether we can climb the automation ladder sooner and faster.

4. Summary and Reflection

All in all, Atlassian's incident is worth reflecting on. Companies exist to meet the needs of their users, and in most cases the ability to keep providing services continuously and safely matters more than shipping a few new features. How quickly service is restored after an interruption also weighs heavily in users' minds.


Yet many companies grossly underinvest in these areas, whether in time, people, or money. They pour resources into new features, while low-probability events are ignored or dismissed as not worth the investment, and the bill comes due at the critical moment.


At the same time, Atlassian's incident arose from both technical and management problems. The analysis above shows plenty of room for improvement in the soundness and completeness of processes, in planning for a rainy day, and in the resources invested across the board. And when an emergency does hit, how well the organization can mobilize its technical resources, its people, and its budget is itself a measure of flexibility, in other words, of agility.


An accident may strike at only one point, or a few points, in time, but behind it lies the capability of the whole enterprise. Necessity often hides inside chance. Business agility, as we often say, means staying user-centered and building organizational flexibility, so that we can deliver high-quality, safe services while continuously responding to users.


Since the early days of the incident, Atlassian has been tracking progress on the status page of its website (Reference 3). Updates came every 2-3 hours while the situation was most difficult; as things became clearer, the interval stretched to 6 hours and then to a day. Frequent, transparent updates reduce anxiety and manage stakeholder expectations, and they also reflect Atlassian's matter-of-fact attitude and its focus on putting people and energy where its customers need them. We wish Atlassian a swift and complete data recovery that minimizes the losses on both sides, and we hope the whole industry can learn from this and work together to serve customers better.

 

References:

  1. Insight - Asset Management
  2. Update on the Atlassian outage affecting some customers
  3. Past Incidents