** Description changed:

  [SRU Justification]
  
  [Impact]
  
  Systems on Jammy running high-throughput DMA workloads experience soft lockups
  and RCU stalls in fq_flush_timeout, which result in system hangs.
  
  The IOVA allocator in the 5.15 kernel uses a per-CPU magazine cache (rcache) to
  avoid expensive rbtree operations. Each CPU has two magazines of 128 PFNs; when
  both are full, the primary "loaded" magazine is pushed to a global depot (a
  fixed-size array of 32 magazines per size-bin). When the depot is also full, the
  overflow magazine is freed via iova_magazine_free_pfns(), which acquires
  iova_rbtree_lock and performs up to 128 rbtree lookups and removals while
  holding it.
  
  The problem manifests through the flush-queue timer. Every 10ms,
  fq_flush_timeout fires in softirq context and drains all CPUs' flush queues in a
  single non-preemptible loop. Because __iova_rcache_insert uses raw_cpu_ptr(),
  all recycled IOVAs are funnelled into the timer CPU's magazines. Once those
  magazines and the shared depot are full, every subsequent overflow triggers
  the expensive iova_magazine_free_pfns, resulting in up to 128 rbtree operations
  under iova_rbtree_lock, all within the same softirq:
  
    fq_flush_timeout (timer softirq on CPU X)
      iova_domain_flush
      for_each_possible_cpu(cpu):
        fq_ring_free (up to IOVA_FQ_SIZE=256 entries)
          free_iova_fast
            __iova_rcache_insert (into CPU X's rcache via raw_cpu_ptr)
              if depot_size >= 32:
                iova_magazine_free_pfns (128 rbtree ops under iova_rbtree_lock)
  
  The RCU stall trace from an affected system on 5.15.0-117 confirms this exact
  path with reliable stack frames:
  
    native_queued_spin_lock_slowpath+0x2c/0x40
    _raw_spin_lock_irqsave+0x3d/0x50
    iova_magazine_free_pfns.part.0+0x20/0xd0
    free_iova_fast+0x219/0x290
    fq_ring_free+0xa8/0x170
    fq_flush_timeout+0x74/0xc0
    call_timer_fn
    run_timer_softirq
    __do_softirq
  
  [Fix]
  
  Backport upstream commits, adapted for the 5.15 codebase:
  1. 911aa1245da8 ("iommu/iova: Make the rcache depot scale better")
  2. 233045378dbb ("iommu/iova: Manage the depot list size")
  
  Cherry-pick upstream commit:
  3. 7591c127f3b1 ("kmemleak: iommu/iova: fix transient kmemleak false positive")
  
  Patch 1 replaces the fixed-size depot array with an unbounded singly-linked
  list. Magazines are always pushed to the depot regardless of size. As a result,
  the overflow path and its inline call to iova_magazine_free_pfns are eliminated
  from __iova_rcache_insert.
  
  Patch 2 prevents unbounded memory growth of the now-unlimited depot by adding a
  delayed_work (background workqueue) that trims the depot when it exceeds
  num_online_cpus() magazines. This reclaim runs in process context, which is
  preemptible and sleepable, and therefore cannot cause soft lockups.
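
The depot change described for patches 1 and 2 can be sketched in user space as follows. This is a minimal illustration, not the kernel code: struct mag, struct depot, depot_push, depot_pop, depot_trim and MAG_SIZE are hypothetical names, the layout of struct mag is an assumption, and the trim loop only stands in for the delayed_work (how the real worker paces itself is not modelled).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumption for this sketch: 127 PFN slots plus one word of bookkeeping. */
#define MAG_SIZE 127

/* Hypothetical user-space stand-in for a magazine; not the kernel's struct. */
struct mag {
    union {
        unsigned long size;  /* meaningful while a CPU owns the magazine */
        struct mag *next;    /* meaningful while it sits in the depot */
    };
    unsigned long pfns[MAG_SIZE];
};

/* Unbounded singly-linked depot with a running element count. */
struct depot {
    struct mag *head;
    size_t count;
};

/* O(1) push with no capacity check, so the insert path never frees inline. */
static void depot_push(struct depot *d, struct mag *m)
{
    assert(m != NULL);
    m->next = d->head;
    d->head = m;
    d->count++;
}

/* O(1) pop; returns NULL when the depot is empty. */
static struct mag *depot_pop(struct depot *d)
{
    struct mag *m = d->head;

    if (m) {
        d->head = m->next;
        d->count--;
    }
    return m;
}

/*
 * Deferred reclaim: release magazines until at most `limit` remain.
 * In the description above the limit is num_online_cpus(); here it is
 * a plain parameter. Returns the number of magazines released.
 */
static size_t depot_trim(struct depot *d, size_t limit)
{
    size_t freed = 0;

    while (d->count > limit) {
        free(depot_pop(d));
        freed++;
    }
    return freed;
}
```

The point of the sketch is the split of work: depot_push stays constant-time on the hot path, while the expensive release happens only in depot_trim, which a caller can run from a context that is allowed to sleep.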
  
  Patch 3 fixes a kmemleak false positive introduced by patch 1.
  
  Adaptations made for 5.15 backport:
  
  - Patches 1 and 2 modify both drivers/iommu/iova.c and include/linux/iova.h
    because in 5.15, struct iova_rcache is defined in the header (upstream moved
    it into iova.c in a prior refactoring series not present in 5.15).
  - The rcache init function in 5.15 is init_iova_rcaches() (static void, called
    unconditionally from init_iova_domain) rather than upstream's
    iova_domain_init_rcaches() (exported, returns int with error cleanup). The
    backport preserves the 5.15 function signature and error handling pattern.
  - 5.15 uses top-of-function variable declarations rather than upstream's C99
    in-loop declarations.
  - The core logic (depot linked-list, overflow elimination, background worker) is
    identical between upstream and the backport.
  
  [Test Plan]
  
  TODO
  
+ Test kernel at:
+ https://launchpad.net/~munirsid/+archive/ubuntu/sf4384770-bp
+ 
  [Where problems could occur]
  
  Regression risk is low as changes in patches 1 and 2 are confined to the IOVA
  rcache depot internals (drivers/iommu/iova.c and include/linux/iova.h). No
  changes have been made to IOVA allocation or free semantics from the caller's
  perspective. Patch 3 is purely diagnostic and has no runtime effect. Moreover,
  the fix is already available on Noble and Resolute, where it has been thoroughly
  tested.
  
  One behavioral change worth noting is the depot memory usage profile. The old
  code enforced a hard cap of 32 magazines per size-bin; when the depot was full,
  overflow was freed immediately. The new code removes that cap and relies on a
  delayed_work firing every 100ms to trim the depot. This means a burst of DMA
  unmaps can temporarily accumulate more depot memory than the old code would have
  allowed, since the background reclaim only runs on a 100ms clock. This is not a
  bug in the patches, as upstream implements the same design. Rather, it is a
  change in behavior compared to what 5.15 users have today. In practice, the risk
- is low: each magazine is 1024 bytes, so even a large spike of unmaps on a 
+ is low: each magazine is 1024 bytes, so even a large spike of unmaps on a
  many-CPU system represents modest memory, and the reclaim worker converges
  quickly.
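
The memory estimate above is simple arithmetic, sketched here with hypothetical helper names (depot_bytes, excess_over_old_cap). The 1024-byte magazine size and the old cap of 32 magazines per size-bin are taken from the text; the burst sizes a caller passes in are illustrative inputs, not measurements.

```c
#include <assert.h>
#include <stddef.h>

#define MAG_BYTES     1024u  /* from the text: each magazine is 1024 bytes */
#define OLD_DEPOT_CAP 32u    /* from the text: old hard cap per size-bin */

/* Sanity check: 1024 bytes is 128 eight-byte words. */
static_assert(MAG_BYTES == 128 * 8, "magazine size mismatch");

/* Bytes held by a depot containing `magazines` magazines. */
static size_t depot_bytes(size_t magazines)
{
    return magazines * MAG_BYTES;
}

/* Extra bytes one size-bin may hold compared with the old capped depot. */
static size_t excess_over_old_cap(size_t magazines)
{
    if (magazines <= OLD_DEPOT_CAP)
        return 0;
    return depot_bytes(magazines - OLD_DEPOT_CAP);
}
```

For instance, depot_bytes(32) gives the 32 KiB ceiling per size-bin under the old code, and excess_over_old_cap(n) shows how far a transient backlog of n magazines exceeds it before the worker trims the depot.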
  
  [Other Info]
  
  Similar issues have been reported in [0], [1], and [2]. The fix has already been
  integrated into Noble and subsequent releases. Backporting this fix ensures
  stability for users of the 5.15 kernel.
  
  [0] - https://lkml.rescloud.iu.edu/2304.1/01286.html
  [1] - https://mailweb.openeuler.org/archives/list/[email protected]/message/FAOBDKYWJ5SNADM625H2A4YCOPRAIRGB/
  [2] - https://access.redhat.com/solutions/7031930

** Description changed:

  [SRU Justification]
  
  [Impact]
  
  Systems on Jammy running high-throughput DMA workloads experience soft lockups
  and RCU stalls in fq_flush_timeout, which result in system hangs.
  
  The IOVA allocator in the 5.15 kernel uses a per-CPU magazine cache (rcache) to
  avoid expensive rbtree operations. Each CPU has two magazines of 128 PFNs; when
  both are full, the primary "loaded" magazine is pushed to a global depot (a
  fixed-size array of 32 magazines per size-bin). When the depot is also full, the
  overflow magazine is freed via iova_magazine_free_pfns(), which acquires
  iova_rbtree_lock and performs up to 128 rbtree lookups and removals while
  holding it.
  
  The problem manifests through the flush-queue timer. Every 10ms,
  fq_flush_timeout fires in softirq context and drains all CPUs' flush queues in a
  single non-preemptible loop. Because __iova_rcache_insert uses raw_cpu_ptr(),
  all recycled IOVAs are funnelled into the timer CPU's magazines. Once those
  magazines and the shared depot are full, every subsequent overflow triggers
  the expensive iova_magazine_free_pfns, resulting in up to 128 rbtree operations
  under iova_rbtree_lock, all within the same softirq:
  
    fq_flush_timeout (timer softirq on CPU X)
      iova_domain_flush
      for_each_possible_cpu(cpu):
        fq_ring_free (up to IOVA_FQ_SIZE=256 entries)
          free_iova_fast
            __iova_rcache_insert (into CPU X's rcache via raw_cpu_ptr)
              if depot_size >= 32:
                iova_magazine_free_pfns (128 rbtree ops under iova_rbtree_lock)
  
  The RCU stall trace from an affected system on 5.15.0-117 confirms this exact
  path with reliable stack frames:
  
    native_queued_spin_lock_slowpath+0x2c/0x40
    _raw_spin_lock_irqsave+0x3d/0x50
    iova_magazine_free_pfns.part.0+0x20/0xd0
    free_iova_fast+0x219/0x290
    fq_ring_free+0xa8/0x170
    fq_flush_timeout+0x74/0xc0
    call_timer_fn
    run_timer_softirq
    __do_softirq
  
  [Fix]
  
  Backport upstream commits, adapted for the 5.15 codebase:
  1. 911aa1245da8 ("iommu/iova: Make the rcache depot scale better")
  2. 233045378dbb ("iommu/iova: Manage the depot list size")
  
  Cherry-pick upstream commit:
  3. 7591c127f3b1 ("kmemleak: iommu/iova: fix transient kmemleak false positive")
  
  Patch 1 replaces the fixed-size depot array with an unbounded singly-linked
  list. Magazines are always pushed to the depot regardless of size. As a result,
  the overflow path and its inline call to iova_magazine_free_pfns are eliminated
  from __iova_rcache_insert.
  
  Patch 2 prevents unbounded memory growth of the now-unlimited depot by adding a
  delayed_work (background workqueue) that trims the depot when it exceeds
  num_online_cpus() magazines. This reclaim runs in process context, which is
  preemptible and sleepable, and therefore cannot cause soft lockups.
  
  Patch 3 fixes a kmemleak false positive introduced by patch 1.
  
  Adaptations made for 5.15 backport:
  
  - Patches 1 and 2 modify both drivers/iommu/iova.c and include/linux/iova.h
    because in 5.15, struct iova_rcache is defined in the header (upstream moved
    it into iova.c in a prior refactoring series not present in 5.15).
  - The rcache init function in 5.15 is init_iova_rcaches() (static void, called
    unconditionally from init_iova_domain) rather than upstream's
    iova_domain_init_rcaches() (exported, returns int with error cleanup). The
    backport preserves the 5.15 function signature and error handling pattern.
  - 5.15 uses top-of-function variable declarations rather than upstream's C99
    in-loop declarations.
  - The core logic (depot linked-list, overflow elimination, background worker) is
    identical between upstream and the backport.
  
  [Test Plan]
  
  TODO
  
- Test kernel at:
+ Test kernel in:
  https://launchpad.net/~munirsid/+archive/ubuntu/sf4384770-bp
  
  [Where problems could occur]
  
  Regression risk is low as changes in patches 1 and 2 are confined to the IOVA
  rcache depot internals (drivers/iommu/iova.c and include/linux/iova.h). No
  changes have been made to IOVA allocation or free semantics from the caller's
  perspective. Patch 3 is purely diagnostic and has no runtime effect. Moreover,
  the fix is already available on Noble and Resolute, where it has been thoroughly
  tested.
  
  One behavioral change worth noting is the depot memory usage profile. The old
  code enforced a hard cap of 32 magazines per size-bin; when the depot was full,
  overflow was freed immediately. The new code removes that cap and relies on a
  delayed_work firing every 100ms to trim the depot. This means a burst of DMA
  unmaps can temporarily accumulate more depot memory than the old code would have
  allowed, since the background reclaim only runs on a 100ms clock. This is not a
  bug in the patches, as upstream implements the same design. Rather, it is a
  change in behavior compared to what 5.15 users have today. In practice, the risk
  is low: each magazine is 1024 bytes, so even a large spike of unmaps on a
  many-CPU system represents modest memory, and the reclaim worker converges
  quickly.
  
  [Other Info]
  
  Similar issues have been reported in [0], [1], and [2]. The fix has already been
  integrated into Noble and subsequent releases. Backporting this fix ensures
  stability for users of the 5.15 kernel.
  
  [0] - https://lkml.rescloud.iu.edu/2304.1/01286.html
  [1] - https://mailweb.openeuler.org/archives/list/[email protected]/message/FAOBDKYWJ5SNADM625H2A4YCOPRAIRGB/
  [2] - https://access.redhat.com/solutions/7031930

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2158106

Title:
  [Jammy] soft lockups and rcu stalls in fq_flush_timeout causing system
  hangs

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2158106/+subscriptions
