** Description changed:
[SRU Justification]
[Impact]
Systems on Jammy running high-throughput DMA workloads experience soft lockups
and RCU stalls in fq_flush_timeout, which result in system hangs.
The IOVA allocator in the 5.15 kernel uses a per-CPU magazine cache (rcache) to
avoid expensive rbtree operations. Each CPU has two magazines of 128 PFNs; when
both are full, the primary "loaded" magazine is pushed to a global depot (a
fixed-size array of 32 magazines per size-bin). When the depot is also full, the
overflow magazine is freed via iova_magazine_free_pfns(), which acquires
iova_rbtree_lock and performs up to 128 rbtree lookups and removals while
holding it.
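The fill-and-overflow sequence described above can be sketched as a small
user-space C model. This is an illustration only, not the kernel code:
struct rcache_model and rcache_insert() are hypothetical names, only fill
levels are tracked, and locking, per-CPU handling and magazine allocation
failure are left out. The constants 128 and 32 are the figures quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAG_SIZE       128 /* PFNs per magazine, as quoted above */
#define MAX_DEPOT_MAGS 32  /* fixed depot capacity per size-bin */

/* Hypothetical model: tracks fill levels only, not the PFNs themselves. */
struct rcache_model {
    size_t loaded;     /* PFNs in this CPU's "loaded" magazine */
    size_t prev;       /* PFNs in this CPU's second magazine */
    size_t depot_mags; /* full magazines parked in the depot */
    size_t rbtree_ops; /* rbtree removals performed on the overflow path */
};

/* Insert one freed PFN; returns true if the slow overflow path was taken. */
static bool rcache_insert(struct rcache_model *rc)
{
    bool overflow = false;

    if (rc->loaded == MAG_SIZE) {
        if (rc->prev < MAG_SIZE) {
            /* Second magazine still has room: swap the two. */
            size_t tmp = rc->loaded;
            rc->loaded = rc->prev;
            rc->prev = tmp;
        } else if (rc->depot_mags < MAX_DEPOT_MAGS) {
            /* Both magazines full: park the loaded one in the depot. */
            rc->depot_mags++;
            rc->loaded = 0;
        } else {
            /* Depot full too: the whole magazine goes back to the rbtree. */
            rc->rbtree_ops += MAG_SIZE;
            rc->loaded = 0;
            overflow = true;
        }
    }
    rc->loaded++;
    return overflow;
}
```

In this model an initially empty rcache absorbs (2 + 32) * 128 = 4352
insertions before the first overflow. With flush queues of up to 256 entries
(IOVA_FQ_SIZE), that corresponds to 17 full queues, assuming every entry falls
in the same size-bin and nothing is allocated in between.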
The problem manifests through the flush-queue timer. Every 10ms,
fq_flush_timeout fires in softirq context and drains all CPUs' flush queues in a
single non-preemptible loop. Because __iova_rcache_insert uses raw_cpu_ptr(),
all recycled IOVAs are funnelled into the timer CPU's magazines. Once those
magazines and the shared depot are full, every subsequent overflow triggers
the expensive iova_magazine_free_pfns, resulting in up to 128 rbtree operations
under iova_rbtree_lock, all within the same softirq:
- fq_flush_timeout (timer softirq on CPU X)
- iova_domain_flush
- for_each_possible_cpu(cpu):
- fq_ring_free (up to IOVA_FQ_SIZE=256 entries)
- free_iova_fast
- __iova_rcache_insert (into CPU X's rcache via raw_cpu_ptr)
- if depot_size >= 32:
- iova_magazine_free_pfns (128 rbtree ops under iova_rbtree_lock)
+ fq_flush_timeout (timer softirq on CPU X)
+   iova_domain_flush
+   for_each_possible_cpu(cpu):
+     fq_ring_free (up to IOVA_FQ_SIZE=256 entries)
+       free_iova_fast
+         __iova_rcache_insert (into CPU X's rcache via raw_cpu_ptr)
+           if depot_size >= 32:
+             iova_magazine_free_pfns (128 rbtree ops under iova_rbtree_lock)
The RCU stall trace from an affected system on 5.15.0-117 confirms this exact
path with reliable stack frames:
- native_queued_spin_lock_slowpath+0x2c/0x40
- _raw_spin_lock_irqsave+0x3d/0x50
- iova_magazine_free_pfns.part.0+0x20/0xd0
- free_iova_fast+0x219/0x290
- fq_ring_free+0xa8/0x170
- fq_flush_timeout+0x74/0xc0
- call_timer_fn
- run_timer_softirq
- __do_softirq
+ native_queued_spin_lock_slowpath+0x2c/0x40
+ _raw_spin_lock_irqsave+0x3d/0x50
+ iova_magazine_free_pfns.part.0+0x20/0xd0
+ free_iova_fast+0x219/0x290
+ fq_ring_free+0xa8/0x170
+ fq_flush_timeout+0x74/0xc0
+ call_timer_fn
+ run_timer_softirq
+ __do_softirq
[Fix]
Backport upstream commits, adapted for the 5.15 codebase:
1. 911aa1245da8 ("iommu/iova: Make the rcache depot scale better")
2. 233045378dbb ("iommu/iova: Manage the depot list size")
Cherry-pick upstream commit:
3. 7591c127f3b1 ("kmemleak: iommu/iova: fix transient kmemleak false positive")
Patch 1 replaces the fixed-size depot array with an unbounded singly-linked
list. Magazines are always pushed to the depot regardless of its current size.
As a result, the overflow path and its inline call to iova_magazine_free_pfns
are eliminated from __iova_rcache_insert.
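A minimal user-space sketch of such a list-based depot follows. The names
struct mag, struct depot, depot_push() and depot_pop() are hypothetical and
chosen for illustration; the real patch operates on struct iova_magazine and
struct iova_rcache under the rcache lock, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a magazine; only the list link matters here. */
struct mag {
    struct mag *next;
};

/* Depot as an unbounded singly-linked stack of full magazines. */
struct depot {
    struct mag *head;
    size_t size;
};

/* Push never fails and never frees: there is no capacity check. */
static void depot_push(struct depot *d, struct mag *m)
{
    m->next = d->head;
    d->head = m;
    d->size++;
}

/* Pop the most recently pushed magazine, or NULL if the depot is empty. */
static struct mag *depot_pop(struct depot *d)
{
    struct mag *m = d->head;

    if (!m)
        return NULL;
    d->head = m->next;
    m->next = NULL;
    d->size--;
    return m;
}
```

Because the push is a constant-time pointer update with no capacity check, the
insert path in this model has no branch left that could fall back to freeing
128 PFNs inline.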
Patch 2 prevents unbounded memory growth of the now-unlimited depot by adding a
delayed_work item (run from a background workqueue) that trims the depot when
it exceeds num_online_cpus() magazines. This reclaim runs in process context,
which is preemptible and sleepable, and therefore cannot cause soft lockups.
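The trimming rule can be sketched as follows. This is a user-space model under
stated assumptions, not the patch itself: struct depot_model and
depot_work_run() are hypothetical names, and the model assumes that each run of
the worker releases at most one magazine and asks to be re-armed only when it
released something.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one size-bin's depot and the trim threshold. */
struct depot_model {
    size_t size;        /* magazines currently in the depot */
    size_t online_cpus; /* stands in for num_online_cpus() */
};

/*
 * One run of the modelled worker. Returns true if a magazine was released,
 * meaning the work should be scheduled again after the delay (assumption).
 */
static bool depot_work_run(struct depot_model *d)
{
    if (d->size <= d->online_cpus)
        return false; /* at or below the threshold: nothing to trim */

    /* Stands in for popping one magazine and returning its PFNs. */
    d->size--;
    return true;
}
```

For example, with 8 online CPUs and 10 magazines in the depot, two runs bring
the depot down to 8 and the third run does nothing.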
Patch 3 fixes a kmemleak false positive introduced by patch 1.
Adaptations made for 5.15 backport:
- Patches 1 and 2 modify both drivers/iommu/iova.c and include/linux/iova.h
- because in 5.15, struct iova_rcache is defined in the header (upstream moved
- it into iova.c in a prior refactoring series not present in 5.15).
+ because in 5.15, struct iova_rcache is defined in the header (upstream moved
+ it into iova.c in a prior refactoring series not present in 5.15).
- The rcache init function in 5.15 is init_iova_rcaches() (static void, called
- unconditionally from init_iova_domain) rather than upstream's
- iova_domain_init_rcaches() (exported, returns int with error cleanup). The
- backport preserves the 5.15 function signature and error handling pattern.
+ unconditionally from init_iova_domain) rather than upstream's
+ iova_domain_init_rcaches() (exported, returns int with error cleanup). The
+ backport preserves the 5.15 function signature and error handling pattern.
- 5.15 uses top-of-function variable declarations rather than upstream's C99
- in-loop declarations.
+ in-loop declarations.
- The core logic (depot linked-list, overflow elimination, background worker)
is
- identical between upstream and the backport.
+ identical between upstream and the backport.
[Test Plan]
TODO
[Where problems could occur]
Regression risk is low, as changes in patches 1 and 2 are confined to the IOVA
rcache depot internals (drivers/iommu/iova.c and include/linux/iova.h). No
changes have been made to IOVA allocation or free semantics from the caller's
perspective. Patch 3 is purely diagnostic and has no runtime effect. Moreover,
the fix is already available on Noble and Resolute, where it has been thoroughly
tested.
+ One behavioral change worth noting is the depot memory usage profile. The old
+ code enforced a hard cap of 32 magazines per size-bin; when the depot was full,
+ overflow was freed immediately. The new code removes that cap and relies on a
+ delayed_work firing every 100ms to trim the depot. This means a burst of DMA
+ unmaps can temporarily accumulate more depot memory than the old code would have
+ allowed, since the background reclaim only runs on a 100ms clock. This is not a
+ bug in the patches, as upstream implements the same design. Rather, it is a
+ change in behavior compared to what 5.15 users have today. In practice, the risk
+ is low: each magazine is 1024 bytes, so even a large spike of unmaps on a
+ many-CPU system represents modest memory, and the reclaim worker converges
+ quickly.
+
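The memory estimate above can be made concrete with a small calculation. The
sketch takes the 1024-byte magazine size and the 128 PFNs per magazine stated
in this description at face value; depot_bytes() and depot_pfns() are
hypothetical helper names, and the magazine count used in the example is an
arbitrary illustration, not a measured value.

```c
#include <assert.h>
#include <stddef.h>

#define MAG_BYTES 1024 /* magazine size as stated in this description */
#define MAG_PFNS  128  /* PFNs per magazine as stated in this description */

/* Memory held by a depot of the given number of magazines, in bytes. */
static size_t depot_bytes(size_t magazines)
{
    return magazines * MAG_BYTES;
}

/* Number of cached PFNs those magazines represent. */
static size_t depot_pfns(size_t magazines)
{
    return magazines * MAG_PFNS;
}
```

Under these figures, a depot that has grown to 4096 magazines in one size-bin
holds 524288 cached PFNs and occupies 4 MiB.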
[Other Info]
Similar issues have been reported in [0], [1], and [2]. The fix has already been
integrated into Noble and subsequent releases. Backporting this fix ensures
stability for users of the 5.15 kernel.
[0] - https://lkml.rescloud.iu.edu/2304.1/01286.html
[1] - https://mailweb.openeuler.org/archives/list/[email protected]/message/FAOBDKYWJ5SNADM625H2A4YCOPRAIRGB/
[2] - https://access.redhat.com/solutions/7031930
--
https://bugs.launchpad.net/bugs/2158106
Title:
[Jammy] soft lockups and rcu stalls in fq_flush_timeout causing system hangs