Degraded performance in Cloud
Incident Report for Jira Service Management
Postmortem

SUMMARY

Between September 18, 2023, 5:47 PM UTC and September 19, 2023, 4:15 AM UTC, customers using Confluence Cloud, Jira Software, Jira Service Management, Jira Work Management, Jira Product Discovery, and Trello with services hosted in the AWS us-east-1 and us-west-2 regions experienced slow performance and/or page load failures. The cause was an AWS issue that began on September 18, 2023, at 4:00 PM UTC.

The incident was triggered by an underlying networking fault at our cloud provider, AWS, which affected multiple AWS services that Atlassian uses in the us-west-2 and us-east-1 regions. Our monitoring systems detected the incident within one minute. Affected Atlassian services recovered on a product-by-product basis, and all products had fully recovered by September 19, 2023, 4:15 AM UTC.
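
The intervals implied by these timestamps can be worked out directly. The following is a minimal sketch that uses only the times stated in this summary; the variable names are illustrative.

```python
from datetime import datetime, timezone

# Timestamps as stated in the summary (all UTC).
aws_issue_start = datetime(2023, 9, 18, 16, 0, tzinfo=timezone.utc)
impact_start = datetime(2023, 9, 18, 17, 47, tzinfo=timezone.utc)
full_recovery = datetime(2023, 9, 19, 4, 15, tzinfo=timezone.utc)

# Time between the start of the AWS issue and the start of customer impact.
lag = impact_start - aws_issue_start

# Total length of the customer impact window.
impact_duration = full_recovery - impact_start

print(lag)              # 1:47:00
print(impact_duration)  # 10:28:00
```

Customer impact therefore began 1 hour 47 minutes after the start of the AWS issue and lasted 10 hours 28 minutes.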

IMPACT

Product impact varied depending on the regions and availability zones that each service uses, with services hosted in us-west-2 affected more than those hosted in us-east-1.

Product-specific impacts are listed below:

  • Jira Software - A number of Jira nodes experienced highly elevated error rates because Jira databases in us-east-1 and us-west-2 were impacted. The impact varied: some Jira nodes were unusable, while others were usable but degraded.
  • Jira Service Management - Some users hosted in us-east-1 and us-west-2 experienced problems when creating issues through the Help Center, viewing issues, transitioning issues, posting comments, and using queues.
  • Jira Work Management - Users based in us-west-2 experienced minor service degradation.
  • Jira Product Discovery - Users experienced some issues when loading insights.
  • Confluence Cloud - Impact was limited to customers hosted in the us-west-2 region. During this time, users attempting to load Confluence pages experienced sporadic product degradation, including brief periods where Confluence was inaccessible, complete and partial page load failures, page timeouts, and increased request latency.
  • Trello - Users experienced minimal service degradation; only 0.1% of Trello users had automation rules impacted.

ROOT CAUSE

The root cause was an issue with the subsystem responsible for network mapping propagation within Amazon Virtual Private Cloud in the us-east-1 (use1-az1) and us-west-2 (usw2-az1 and usw2-az2) regions. This impacted network connectivity for multiple AWS services that Atlassian products rely upon.

There was a delay between the start of the AWS incident and the impact on Atlassian because existing compute instances and resources were not affected by the issue. However, any changes to networking state, such as scaling up with additional compute nodes, experienced delays in the propagation of network mappings. This led to network connectivity issues until these network mappings had been fully propagated. Other AWS services that create or modify networking resources also saw impact as a result of this issue.
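
To illustrate the failure mode described above, the following is a minimal sketch of a readiness check that a newly launched compute node could run before accepting traffic. It is an illustration only, not a description of Atlassian's or AWS's implementation. The function names, the endpoint in the usage note, and the retry parameters are all hypothetical.

```python
import socket
import time
from typing import Callable, Iterable, Tuple

Endpoint = Tuple[str, int]  # (host, port)


def tcp_reachable(endpoint: Endpoint, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to the endpoint succeeds within the timeout."""
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_reachable(
    endpoints: Iterable[Endpoint],
    probe: Callable[[Endpoint], bool] = tcp_reachable,
    max_attempts: int = 10,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until every endpoint is reachable, or give up after max_attempts rounds.

    A new node whose network mappings have not yet propagated would fail the
    probe; it should only be put into rotation once this returns True.
    """
    pending = list(endpoints)
    for attempt in range(max_attempts):
        # Keep only the endpoints that still fail the probe in this round.
        pending = [endpoint for endpoint in pending if not probe(endpoint)]
        if not pending:
            return True
        if attempt < max_attempts - 1:
            sleep(delay_seconds)
    return False
```

A node's startup routine could call, for example, `wait_until_reachable([("db.internal.example", 5432)])` (a hypothetical dependency) and register with the load balancer only if the result is `True`. The probe is injectable so that the polling logic can be exercised without real network access.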

We have not identified any Atlassian-driven events in the lead-up that caused or contributed to this incident.

REMEDIAL ACTIONS PLAN & NEXT STEPS

To avoid repeating this type of incident, we are prioritizing documenting and evaluating ways to improve Availability Zone failure resiliency.

Thanks,

Atlassian Customer Support

Posted Oct 06, 2023 - 04:56 UTC

Resolved
Between 09/18 10:47 UTC and 09/19 04:15 UTC, we experienced degraded performance for some Confluence, Jira Work Management, Jira Service Management, Jira Software, Trello, Atlassian Access, and Jira Product Discovery customers. The issue has been resolved and the service is operating normally.
Posted Sep 19, 2023 - 07:23 UTC
Update
Services are confirmed to be stable. We are performing final validation checks before confirming the incident's resolution.
Posted Sep 19, 2023 - 05:45 UTC
Monitoring
We have identified the root cause of the degraded performance and have mitigated the problem. We are monitoring this closely.
Posted Sep 19, 2023 - 04:56 UTC
Investigating
We are investigating cases of degraded performance for some Confluence, Jira Work Management, Jira Service Management, Jira Software, Trello, and Jira Product Discovery Cloud customers. We will provide more details within the next hour.
Posted Sep 19, 2023 - 02:53 UTC
This incident affected: Jira Service Management Web, Service Portal, Opsgenie Incident Flow, Opsgenie Alert Flow, Jira Service Management Email Requests, Authentication and User Management, Signup, Automation for Jira, and Assist.