From August 9, 2024, 14:49 UTC until August 10, 2024, 00:55 UTC, Atlassian customers using Jira and Jira Service Management products could not use JSM Assets objects in their workflows. The event was triggered by an out-of-cycle deployment of our services. Although the deployment contained no functional changes, it impacted multiple customers across Europe, North America, and Asia Pacific. The incident was detected within 82 minutes by staff via customer reports and was mitigated by restarting the JSM Assets service, which returned Atlassian systems to a known good state. The total time to resolution was about 4 hours for most customers; one customer experienced a prolonged outage of about 10 hours.
The overall impact occurred between August 9, 2024, 14:49 UTC and August 10, 2024, 00:55 UTC on the Jira and Jira Service Management products. The incident disrupted service only for customers in Europe, North America, and Asia Pacific, who were unable to use JSM Assets objects in their workflows.
Jira users faced disruption when looking to:
Jira Service Management users faced disruption when looking to:
The issue was caused by a race condition in refreshing authorization tokens. As a result, the products above could not retrieve the access tokens and resource identifiers needed to support customer features, and users received HTTP 500 errors. More specifically, our out-of-cycle deployment triggered an authorization token refresh for a downstream service that was serving customer traffic at the time. While processing that traffic, our service attempted to update authorization tokens, and in some cases the tokens were only partially persisted within the customer context. Subsequent calls for the affected customers then failed due to a mismatch of authorization tokens.
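The report does not include the service's code, so the following is only a minimal, hypothetical Python sketch (all class and field names are assumptions, not Atlassian's) of the failure mode described above: when two related credential fields are persisted as separate writes, a request served between the writes observes a mismatched pair, whereas publishing both fields as one immutable object and swapping a single reference keeps every read consistent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Credentials:
    """An access token and its matching resource identifier (hypothetical)."""
    access_token: str
    resource_id: str


class UnsafeStore:
    """Persists the two fields with separate writes: a request handled
    between the writes sees a token paired with the wrong resource id."""

    def __init__(self, creds: Credentials) -> None:
        self.access_token = creds.access_token
        self.resource_id = creds.resource_id

    def refresh_steps(self, new: Credentials):
        # Write 1: the token is persisted first.
        self.access_token = new.access_token
        yield  # a concurrent customer request may be served at this point
        # Write 2: the resource id is persisted afterwards.
        self.resource_id = new.resource_id


class SafeStore:
    """Publishes both fields as one immutable object; replacing the single
    reference is atomic, so readers always see a matching pair."""

    def __init__(self, creds: Credentials) -> None:
        self._creds = creds

    def refresh(self, new: Credentials) -> None:
        self._creds = new  # one atomic reference swap

    def read(self) -> Credentials:
        return self._creds


old = Credentials("token-v1", "resource-v1")
new = Credentials("token-v2", "resource-v2")

# Simulate a refresh that is interrupted mid-flight by an incoming request.
unsafe = UnsafeStore(old)
step = unsafe.refresh_steps(new)
next(step)  # only the first write has happened
mismatch = (unsafe.access_token, unsafe.resource_id)
print(mismatch)  # ('token-v2', 'resource-v1') -- inconsistent pair

# The atomic variant can never expose a half-updated pair.
safe = SafeStore(old)
safe.refresh(new)
creds = safe.read()
print((creds.access_token, creds.resource_id))  # ('token-v2', 'resource-v2')
```

The key design point is that consistency comes from making the unit of publication match the unit of consistency: readers dereference one pointer, so they can never observe an in-between state.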
We know that outages impact your productivity. While we have several testing and preventative processes in place, this specific issue was not identified because the change involved a particular kind of legacy case that was not covered by our automated continuous deployment suites or our manual test scripts.
We are prioritizing the following improvement actions to avoid repeating this type of incident:
Furthermore, we deploy our changes progressively (by cloud region) to avoid broad impact. In this case, our detection instrumentation could have worked better. To minimize the impact of breaking changes to our environments, we will implement additional preventative measures such as:
We apologize to customers whose services were impacted by this incident and are taking immediate steps to improve the platform’s performance and availability.
Thanks,
Atlassian Customer Support