From August 9, 2024, 14:49 UTC until August 10, 2024, 00:55 UTC, Atlassian customers using Jira and Jira Service Management products could not use JSM Assets objects in their workflows. The event was triggered by an out-of-cycle deployment of our services. Although the deployment contained no functional changes, it impacted multiple customers across Europe, North America, and Asia Pacific. The incident was detected within 82 minutes by staff via customer reports and was mitigated by restarting the JSM Assets service, which returned Atlassian systems to a known good state. The total time to resolution was about 4 hours for most customers; one customer experienced a prolonged outage of about 10 hours.
The overall impact occurred between August 9, 2024, 14:49 UTC and August 10, 2024, 00:55 UTC on the Jira and Jira Service Management products. The incident disrupted service only for customers in Europe, North America, and Asia Pacific, who were unable to use JSM Assets objects in their workflows.
Jira users faced disruption when looking to:
Jira Service Management users faced disruption when looking to:
The issue was caused by a race condition in refreshing authorization tokens. As a result, the products above could not retrieve the access tokens and resource identifiers needed to support customer features, and users received HTTP 500 errors. More specifically, our out-of-cycle deployment triggered an authorization token refresh for a downstream service that was serving customer traffic at the time. While processing that traffic, our service attempted to update authorization tokens, and in some cases the tokens were only partially persisted within the customer context. Subsequent calls for the affected customers then failed due to a mismatch of authorization tokens.
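The report does not include the service's code, so the following is only a minimal, hypothetical Python sketch (all class and field names are assumptions, not Atlassian's) of the failure mode described above: when two related credential fields are persisted as separate writes, a request served between the writes observes a mismatched pair, whereas publishing both fields as one immutable object and swapping a single reference keeps every read consistent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Credentials:
    """An access token and its matching resource identifier (hypothetical)."""
    access_token: str
    resource_id: str


class UnsafeStore:
    """Persists the two fields with separate writes: a request handled
    between the writes sees a token paired with the wrong resource id."""

    def __init__(self, creds: Credentials) -> None:
        self.access_token = creds.access_token
        self.resource_id = creds.resource_id

    def refresh_steps(self, new: Credentials):
        # Write 1: the token is persisted first.
        self.access_token = new.access_token
        yield  # a concurrent customer request may be served at this point
        # Write 2: the resource id is persisted afterwards.
        self.resource_id = new.resource_id


class SafeStore:
    """Publishes both fields as one immutable object; replacing the single
    reference is atomic, so readers always see a matching pair."""

    def __init__(self, creds: Credentials) -> None:
        self._creds = creds

    def refresh(self, new: Credentials) -> None:
        self._creds = new  # one atomic reference swap

    def read(self) -> Credentials:
        return self._creds


old = Credentials("token-v1", "resource-v1")
new = Credentials("token-v2", "resource-v2")

# Simulate a refresh that is interrupted mid-flight by an incoming request.
unsafe = UnsafeStore(old)
step = unsafe.refresh_steps(new)
next(step)  # only the first write has happened
mismatch = (unsafe.access_token, unsafe.resource_id)
print(mismatch)  # ('token-v2', 'resource-v1') -- inconsistent pair

# The atomic variant can never expose a half-updated pair.
safe = SafeStore(old)
safe.refresh(new)
creds = safe.read()
print((creds.access_token, creds.resource_id))  # ('token-v2', 'resource-v2')
```

The key design point is that consistency comes from making the unit of publication match the unit of consistency: readers dereference one pointer, so they can never observe an in-between state.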
We know that outages impact your productivity. While we have several testing and preventative processes in place, this specific issue was not identified because the change involved a particular kind of legacy case that was not covered by our automated continuous deployment suites or our manual test scripts.
We are prioritizing the following improvement actions to avoid repeating this type of incident:
Furthermore, we deploy our changes progressively (by cloud region) to avoid broad impact. In this case, our detection instrumentation could have worked better. To minimize the impact of breaking changes to our environments, we will implement additional preventative measures such as:
We apologize to customers whose services were impacted by this incident and are taking immediate steps to improve the platform’s performance and availability.
Thanks,
Atlassian Customer Support