** Description changed:
[SRU Justification]
[Impact]
The current epoll implementation in the 5.15 kernel uses a read-write lock
(rwlock_t) to protect the ready event list. While this allows multiple
producers to add items concurrently, it introduces a scheduling priority
inversion vulnerability.
If a high-priority consumer (such as a real-time thread calling epoll_wait) is
blocked waiting for the exclusive write lock, it can be stalled indefinitely
by a low-priority producer holding the read lock. This results in
non-deterministic system stalls and latency spikes.
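The stall described above can be illustrated with a toy model. The sketch
below is not kernel code: it is a single-CPU, tick-based simulation in which
the task names, priorities, and tick counts are all made up for illustration.
It assumes a third, medium-priority task that keeps preempting the
low-priority lock holder, which is the classic way such an inversion becomes
unbounded.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    prio: int   # higher number = more urgent (illustrative values only)
    work: int   # ticks of CPU time the task still needs


def ticks_until_consumer_done(priority_inheritance: bool) -> int:
    """Return the tick at which the high-priority consumer finishes."""
    low = Task("producer", prio=1, work=2)     # holds the lock for its 2 ticks
    mid = Task("cpu-hog", prio=5, work=10)     # unrelated, never takes the lock
    high = Task("epoll_wait", prio=9, work=1)  # needs the lock before it can run
    lock_held = True                           # the producer starts inside the lock
    tick = 0

    while high.work > 0:
        tick += 1

        def effective_prio(task: Task) -> int:
            # With inheritance, the lock holder borrows the waiter's priority.
            if priority_inheritance and task is low and lock_held:
                return max(low.prio, high.prio)
            return task.prio

        runnable = [t for t in (low, mid) if t.work > 0]
        if not lock_held:
            runnable.append(high)              # consumer is blocked while the lock is held

        current = max(runnable, key=effective_prio)
        current.work -= 1
        if current is low and low.work == 0:
            lock_held = False                  # producer leaves the critical section

    return tick


if __name__ == "__main__":
    print("without inheritance:", ticks_until_consumer_done(False))
    print("with inheritance:   ", ticks_until_consumer_done(True))
```

In this model the consumer's delay without inheritance grows with however
long the medium-priority task runs, while with inheritance it is bounded by
the length of the producer's critical section.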
[Fix]
Cherry-pick upstream commit:
0c43094f8cc9 ("eventpoll: Replace rwlock with spinlock")
The fix involves replacing rwlock_t with spinlock_t, and removing the
now-redundant lockless helper functions (list_add_tail_lockless and
chain_epi_lockless). This ensures that under real-time configurations,
priority inheritance works correctly across the epoll subsystem.
[Test Plan]
This is a priority inversion race condition, so it is highly non-deterministic
and impractical to trigger on command. For that reason it is not feasible to
provide a reliable reproduction script.
Therefore, validation relies on verifying that the replacement locking
mechanism functions correctly, introduces no regressions, and scales safely
under synthetic load.
Validation was performed on a 2-core/4GB RAM x86 VM running the test kernel in
the following PPA: https://launchpad.net/~munirsid/+archive/ubuntu/lp2154194.
As mentioned in the upstream commit, we ran `perf bench epoll wait` with 4
threads (-t 4) and 10 iterations (-r 10). By configuring 4 threads on a 2-core
VM, we intentionally overcommit the CPUs to force heavy context-switching and
lock preemption in order to stress-test the new spinlock boundaries under
- contention.
+ contention.
Observed Results:
Before patch (5.15.0-179-generic #189-Ubuntu):
$ perf bench epoll wait -t 4 -r 10
[thread 0] fdmap: 0x556599b44e80 ... 0x556599b44f7c [ 281994 ops/sec ]
[thread 1] fdmap: 0x556599b451a0 ... 0x556599b4529c [ 279775 ops/sec ]
[thread 2] fdmap: 0x556599b45420 ... 0x556599b4551c [ 267177 ops/sec ]
[thread 3] fdmap: 0x556599b456a0 ... 0x556599b4579c [ 270819 ops/sec ]
Averaged 274941 operations/sec (+- 1.29%), total secs = 10
After patch (5.15.0-183-generic #193+TEST427638v20260525b1-Ubuntu):
$ perf bench epoll wait -t 4 -r 10
[thread 0] fdmap: 0x55a665734e80 ... 0x55a665734f7c [ 291941 ops/sec ]
[thread 1] fdmap: 0x55a6657351a0 ... 0x55a66573529c [ 306480 ops/sec ]
[thread 2] fdmap: 0x55a665735420 ... 0x55a66573551c [ 286868 ops/sec ]
[thread 3] fdmap: 0x55a6657356a0 ... 0x55a66573579c [ 312054 ops/sec ]
Averaged 299335 operations/sec (+- 1.98%), total secs = 10
Consistent with the upstream commit description for x86, we observed per-thread
- throughput improve across all 4 threads, with ~8.9% average improvement in
+ throughput improve across all 4 threads, with ~8.9% improvement in average
throughput.
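As a cross-check of the ~8.9% figure, the short sketch below recomputes the
averages from the per-thread numbers quoted in the two runs above. Pairing
thread N of the first run with thread N of the second is only an assumption
made for the per-thread comparison; the runs are independent.

```python
# Per-thread ops/sec copied from the two perf bench runs quoted above.
before = [281994, 279775, 267177, 270819]  # 5.15.0-179-generic
after = [291941, 306480, 286868, 312054]   # 5.15.0-183-generic test kernel


def mean(values):
    return sum(values) / len(values)


def percent_gain(old, new):
    return (new - old) / old * 100.0


avg_before = mean(before)
avg_after = mean(after)
average_gain = percent_gain(avg_before, avg_after)

# Index-wise comparison; thread N before vs. thread N after is an assumption.
per_thread_gain = [percent_gain(b, a) for b, a in zip(before, after)]

if __name__ == "__main__":
    print(f"average before: {avg_before:.0f} ops/sec")
    print(f"average after:  {avg_after:.0f} ops/sec")
    print(f"average gain:   {average_gain:.1f}%")
    for i, gain in enumerate(per_thread_gain):
        print(f"thread {i}: {gain:+.1f}%")
```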
No regression was observed and the logs showed no lockups, RCU stalls, or
kernel warnings across multiple iterations.
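A log check of this kind can be scripted. The sketch below is one possible
approach, not the procedure actually used for this validation: the helper
name find_problems is hypothetical, and the patterns are a non-exhaustive
guess at how soft/hard lockups, RCU stalls, and WARNING splats typically
appear in the kernel log.

```python
import re

# Assumed, non-exhaustive patterns for common kernel log trouble reports.
PATTERNS = [
    re.compile(r"soft lockup"),
    re.compile(r"hard LOCKUP"),
    re.compile(r"rcu.*detected stall"),
    re.compile(r"WARNING: CPU: \d+ PID: \d+"),
]


def find_problems(log_text: str) -> list[str]:
    """Return the log lines that match any of the patterns above."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in PATTERNS)
    ]


if __name__ == "__main__":
    import sys

    # Example: dmesg | python3 this_script.py
    hits = find_problems(sys.stdin.read())
    for line in hits:
        print(line)
    sys.exit(1 if hits else 0)
```

An empty result only means none of the listed patterns matched; it does not
replace reading the log after each run.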
[Where Problems Could Occur]
There could be a performance degradation with some synthetic workloads on the
GA kernel as seen in the upstream commit description [0]. In artificial
benchmarks where hundreds of threads continuously spam epoll events,
throughput can drop due to serialization around the new spinlock.
However, testing with realistic workloads (via perf bench epoll wait) actually
demonstrates a performance improvement on x86 architectures, as mentioned in
the upstream commit, and demonstrated in the Test Plan section above.
The regression potential for real-world production environments is low, as
typical workloads do not exhibit continuous, uninterrupted event-spamming
behavior. Moreover, the fix is strictly isolated to fs/eventpoll.c and is
already available on Noble and Resolute, where it has been thoroughly tested.
[Other Info]
Similar issues have been reported in [1] and [2]. This bug was addressed
upstream [0] and the fix has already been integrated into Noble and subsequent
releases. Backporting this fix ensures stability for users of the 5.15
real-time kernel.
[0] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0c43094f8cc9d3d99d835c0ac9c4fe1ccc62babd
[1] - https://lore.kernel.org/linux-rt-users/[email protected]/
[2] - https://lore.kernel.org/linux-rt-users/20210825132754.GA895675@lothringen/
--
https://bugs.launchpad.net/bugs/2154194
Title:
[Jammy] Priority inversion problem in epoll for rt kernel