Jira Service Management Assets not loading for customers in specific regions

Incident Report for Jira Service Management

Postmortem

Summary

From August 9, 2024, 14:49 UTC until August 10, 2024, 00:55 UTC, Atlassian customers using Jira and Jira Service Management products could not use JSM Assets objects in their workflows. The event was triggered by an out-of-cycle deployment of our services. No functional changes were included in the deployment; however, it impacted multiple customers across Europe, North America, and Asia Pacific. The incident was detected within 82 minutes by staff, through customer reports, and mitigated by restarting the JSM Assets service, which put Atlassian systems into a known good state. The total time to resolution was about 4 hours for most customers, with one customer experiencing a prolonged outage of about 10 hours.

IMPACT

The overall impact was between August 9, 2024, 14:49 UTC and August 10, 2024, 00:55 UTC on Jira and Jira Service Management products. The incident caused service disruption for customers in Europe, North America, and Asia Pacific, and only where they used JSM Assets objects in their workflows.

Jira users faced disruption when looking to:

View Assets objects associated with issues after loading their issues, lists of issues, or boards in browser
View Gadget results which relied on AQL in their JQL
Interact with JQL+AQL via API
Transition issues which required Assets object validation

Jira Service Management users faced disruption when looking to:

Create issues in JSM Customer Portal
View Assets objects on Requests in JSM Customer Portal
Fill JSM Form relying on Assets
Configure Asset fields and JSM Forms with Assets
Refresh queues based on AQL

ROOT CAUSE

The issue was caused by a race condition in refreshing authorization tokens. As a result, the products above could not retrieve access tokens and resource identifiers to support customer features, and the users received HTTP 500 errors. More specifically, our out-of-cycle deployment triggered an authorization token refresh for a downstream service serving customer traffic at the time. As our service was processing traffic, it sought to update authorization tokens, and in some cases, the tokens partially persisted within the customer context. Subsequent calls for the affected customer failed due to a mismatch of authorization tokens.
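
To illustrate the class of failure described above, here is a minimal Python sketch of one way to keep an access token and its resource identifier consistent during a refresh. It is not Atlassian's implementation: the names `Credentials`, `CredentialCache`, and `fetch` are hypothetical, and the sketch assumes the two values can be fetched together and published as a single immutable pair, so that a request in flight never sees a new token alongside an old identifier.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Credentials:
    """Immutable pair that must always be read and replaced together (hypothetical)."""
    access_token: str
    resource_id: str


class CredentialCache:
    """Per-customer cache that publishes the whole pair in one step."""

    def __init__(self, initial: Credentials) -> None:
        # Serializes refreshes so two refreshers cannot interleave their writes.
        self._refresh_lock = threading.Lock()
        self._current = initial

    def get(self) -> Credentials:
        # A single reference read returns a complete, matched pair.
        return self._current

    def refresh(self, fetch) -> Credentials:
        # `fetch` is an assumed callable returning a fully built Credentials.
        with self._refresh_lock:
            new = fetch()
            # Publish only after both fields exist; never update them one by one.
            self._current = new
            return new
```

The design choice here is that readers never observe a half-updated state: the pair is built first and swapped in afterwards, instead of writing the token and the identifier into shared state as two separate steps.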

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity. While we have several testing and preventative processes in place, this specific issue wasn’t identified because the change was related to a particular kind of legacy case that was not picked up by our automated continuous deployment suites and manual test scripts.

We are prioritizing the following improvement actions to avoid repeating this type of incident:

Removing the need to cache authorization tokens during service runtime.

Furthermore, we deploy our changes progressively (by cloud region) to avoid broad impact. In this case, our detection instrumentation could have worked better. To minimize the impact of breaking changes to our environments, we will implement additional preventative measures such as:

Alerting on high error rates over short spans of time.
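
As a rough illustration of what alerting on error rates over a short time span can look like, the following Python sketch keeps a sliding window of recent responses and signals when the share of HTTP 5xx responses crosses a threshold. The class name `ErrorRateAlert` and the values for window length, threshold, and minimum request count are assumptions chosen for the example, not Atlassian's actual alerting configuration.

```python
from collections import deque


class ErrorRateAlert:
    """Sliding-window error-rate check (illustrative thresholds, not real config)."""

    def __init__(self, window_seconds: float = 60.0,
                 threshold: float = 0.05, min_requests: int = 20) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_requests = min_requests
        self._events = deque()  # (timestamp, is_error) in arrival order

    def record(self, timestamp: float, status_code: int) -> bool:
        """Record one response; return True if the alert should fire."""
        self._events.append((timestamp, status_code >= 500))

        # Drop responses that have fallen out of the time window.
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

        total = len(self._events)
        if total < self.min_requests:
            # Too little traffic to judge a rate reliably.
            return False

        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / total > self.threshold
```

The minimum request count is there so that a single failed call on a quiet service does not fire the alert, while a burst of HTTP 500 responses like the one in this incident would cross the threshold within the window.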

We apologize to customers whose services were impacted by this incident and are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Aug 26, 2024 - 09:40 UTC

Resolved

We are resolving this incident as we have mitigated the root cause for all customers.

Posted Aug 10, 2024 - 09:05 UTC

Monitoring

The issue has been mitigated for all the impacted customers. We are continuing to investigate the root cause.

Posted Aug 09, 2024 - 22:32 UTC

Update

The issue has been mitigated for all the impacted customers. We are continuing to investigate the root cause.

Posted Aug 09, 2024 - 22:19 UTC

Update

The issue has been mitigated for all the impacted customers. We are continuing to investigate the root cause.

Posted Aug 09, 2024 - 19:17 UTC

Update

We have identified the root cause and have started deploying the mitigation steps.

Posted Aug 09, 2024 - 16:55 UTC

Identified

We are investigating an issue with Jira Service Management where some customers are unable to access their Assets. The impact started around 15:00 UTC. The team is investigating the root cause, and we are in the process of mitigation.

Posted Aug 09, 2024 - 16:35 UTC

This incident affected: Jira Service Management Web.