** Description changed:

  [SRU Justification]
  
  [Impact]
  
  The current epoll implementation in the 5.15 kernel uses a reader-writer
  lock (rwlock_t) to protect the ready event list. While this allows
  multiple producers to concurrently add items, it introduces a scheduling
  priority inversion vulnerability.
  
- If a high-priority consumer (such as a real-time thread calling epoll_wait) is
+ If a high-priority consumer, such as a real-time thread calling epoll_wait, is
  blocked waiting for the exclusive write lock, it can be indefinitely stalled
  by a low-priority producer holding the read lock. This results in
  non-deterministic system stalls and latency spikes.
  
  This was observed on the 5.15.0-1087-realtime kernel, where a high-priority
  hardware IRQ thread was blocked by a low-priority worker thread holding the
  epoll read lock while throttled by CFS CPU quota limits. Because PREEMPT_RT
  does not extend priority inheritance to rwlock readers, the IRQ thread had no
  mechanism to boost the throttled worker, resulting in a deadlock.
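  
  The effect of missing priority inheritance can be illustrated with a toy
  single-CPU model. This is only a sketch under stated assumptions, not the
  kernel scheduler: the task names, priorities and tick counts are
  hypothetical, and the "noise" task is a stand-in for whatever keeps the
  low-priority lock holder off the CPU (CFS quota throttling in the case
  described above).

```python
def irq_latency(priority_inheritance: bool) -> int:
    """Return the tick at which the high-priority 'irq' task finishes.

    Toy fixed-priority scheduler on one CPU. All numbers are hypothetical.
    """
    # name -> [priority (higher wins), remaining ticks of work]
    tasks = {"worker": [1, 3], "noise": [5, 10], "irq": [9, 1]}
    lock_owner = "worker"  # the low-priority worker already holds the lock
    tick = 0
    while tasks["irq"][1] > 0:
        runnable = {}
        for name, (prio, left) in tasks.items():
            if left == 0:
                continue  # task has finished
            if name == "irq" and lock_owner is not None:
                continue  # irq is blocked until the lock is released
            if name == lock_owner and priority_inheritance:
                # Boost the lock holder to the priority of its waiter.
                prio = max(prio, tasks["irq"][0])
            runnable[name] = prio
        current = max(runnable, key=runnable.get)
        tasks[current][1] -= 1
        tick += 1
        # The worker releases the lock when its critical section is done.
        if current == lock_owner and tasks[current][1] == 0:
            lock_owner = None
    return tick


print("without inheritance:", irq_latency(False), "ticks")
print("with inheritance:   ", irq_latency(True), "ticks")
```

  In this model the lock holder only runs after the unrelated task has
  finished unless it is boosted, which is the property the rwlock_t read
  side lacks on PREEMPT_RT.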
  
  [Fix]
  
  Cherry-pick upstream commit:
  0c43094f8cc9 ("eventpoll: Replace rwlock with spinlock")
  
  The fix replaces rwlock_t with spinlock_t and removes the now-redundant
  lockless helper functions (list_add_tail_lockless and chain_epi_lockless).
  This ensures that under real-time configurations, priority inheritance
  works correctly across the epoll subsystem.
  
  [Test Plan]
  
  This is a priority inversion race condition, so it is highly non-deterministic
  and impractical to trigger on command, which makes a reliable reproduction
  script infeasible.
  
  Validation therefore relies on verifying that the replacement locking
  mechanism functions correctly, introduces no regressions, and scales safely
  under synthetic load.
  
  Validation was performed on a 2-core/4GB RAM x86 VM running the test kernel in
  the following PPA: https://launchpad.net/~munirsid/+archive/ubuntu/lp2154194.
  
  As mentioned in the upstream commit, we ran `perf bench epoll wait` with 4
  threads (-t 4) and 10 iterations (-r 10). By configuring 4 threads on a 2-core
  VM, we intentionally overcommit the CPUs to force heavy context-switching and
  lock preemption in order to stress-test the new spinlock boundaries under
  contention.
  
  Observed Results:
  
  Before patch (5.15.0-179-generic #189-Ubuntu):
  $ perf bench epoll wait -t 4 -r 10
  [thread  0] fdmap: 0x556599b44e80 ... 0x556599b44f7c [ 281994 ops/sec ]
  [thread  1] fdmap: 0x556599b451a0 ... 0x556599b4529c [ 279775 ops/sec ]
  [thread  2] fdmap: 0x556599b45420 ... 0x556599b4551c [ 267177 ops/sec ]
  [thread  3] fdmap: 0x556599b456a0 ... 0x556599b4579c [ 270819 ops/sec ]
  Averaged 274941 operations/sec (+- 1.29%), total secs = 10
  
  After patch (5.15.0-183-generic #193+TEST427638v20260525b1-Ubuntu):
  $ perf bench epoll wait -t 4 -r 10
  [thread  0] fdmap: 0x55a665734e80 ... 0x55a665734f7c [ 291941 ops/sec ]
  [thread  1] fdmap: 0x55a6657351a0 ... 0x55a66573529c [ 306480 ops/sec ]
  [thread  2] fdmap: 0x55a665735420 ... 0x55a66573551c [ 286868 ops/sec ]
  [thread  3] fdmap: 0x55a6657356a0 ... 0x55a66573579c [ 312054 ops/sec ]
  Averaged 299335 operations/sec (+- 1.98%), total secs = 10
  
  Consistent with the upstream commit description for x86, per-thread
  throughput improved on all 4 threads, with a ~8.9% improvement in average
  throughput.
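  
  The quoted improvement can be re-derived from the figures above. The
  snippet below is a small sketch that only reuses the ops/sec values copied
  from the two perf bench runs; the variable and function names are
  illustrative.

```python
# Per-thread ops/sec copied from the two perf bench runs above.
before = [281994, 279775, 267177, 270819]
after = [291941, 306480, 286868, 312054]

# Averages as printed by perf bench.
reported_before = 274941
reported_after = 299335


def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


per_thread = [pct_change(b, a) for b, a in zip(before, after)]
average_gain = pct_change(reported_before, reported_after)

for i, gain in enumerate(per_thread):
    print(f"thread {i}: {gain:+.1f}%")
print(f"average: {average_gain:+.1f}%")
```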
  
  No regression was observed and the logs showed no lockups, RCU stalls, or
  kernel warnings across multiple iterations.
  
  [Where Problems Could Occur]
  
  There could be a performance degradation with some synthetic workloads on the
  GA kernel as seen in the upstream commit description [0]. In artificial
  benchmarks where hundreds of threads continuously spam epoll events,
  throughput can drop due to serialization around the new spinlock.
  
  However, testing with realistic workloads (via perf bench epoll wait) actually
  demonstrates a performance improvement on x86 architectures, as mentioned in
  the upstream commit, and demonstrated in the Test Plan section above.
  
  The regression potential for real-world production environments is low, as
  typical workloads do not exhibit continuous, uninterrupted event-spamming
  behavior. Moreover, the fix is strictly isolated to fs/eventpoll.c and is
  already available on Noble and Resolute, where it has been thoroughly tested.
  
  [Other Info]
  
  Similar issues have been reported in [1] and [2]. This bug was addressed
  upstream [0] and the fix has already been integrated into Noble and subsequent
  releases. Backporting this fix ensures stability for users of the 5.15
  real-time kernel.
  
  [0] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0c43094f8cc9d3d99d835c0ac9c4fe1ccc62babd
  [1] - https://lore.kernel.org/linux-rt-users/[email protected]/
  [2] - https://lore.kernel.org/linux-rt-users/20210825132754.GA895675@lothringen/

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2154194

Title:
  [Jammy] Priority inversion problem in epoll for rt kernel

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2154194/+subscriptions

