Public bug reported:
`lru_add_drain_all()` deadlocks on `6.17.0-1007-aws #7~24.04.1-Ubuntu` (aarch64, EC2 m7gd.8xlarge). Two tasks are involved:

1. **`flb-pipeline` (FluentBit, pid 18416)**: called `fadvise(POSIX_FADV_DONTNEED)` -> `generic_fadvise` -> `lru_add_drain_all`, which acquired the global `lru_drain_lock` mutex, scheduled per-CPU kworker drain jobs via `queue_work_on()`, and is now stuck in `flush_work()` -> `wait_for_completion()` waiting for the drain work to complete. It never does. The mutex is held the entire time.

2. **`khugepaged` (pid 220)**: called `lru_add_drain_all` -> `mutex_lock` and is blocked waiting to acquire the same `lru_drain_lock` mutex that `flb-pipeline` holds. The kernel explicitly reports the dependency: `INFO: task khugepaged:220 is blocked on a mutex likely owned by task flb-pipeline:18416`.

The result is an effectively system-wide deadlock: the mutex holder (`flb-pipeline`) is stuck waiting on a completion that never arrives, and all subsequent callers of `lru_add_drain_all()` (including `khugepaged`, but also any code path touching page cache or LRU state) pile up on the mutex indefinitely.

The exact mechanism by which the drain work fails to complete is unclear. On a preemptible kernel, the kworker should be able to get scheduled even on busy CPUs. The same bug was reported on Amazon Linux 2023 (issue #993), where a memory dump showed `mutex.owner = 1` (only `MUTEX_FLAG_WAITERS` set, no actual owner) -- suggesting the completion or mutex release was lost, not that the work couldn't be scheduled. This points to a bug in the `lru_add_drain_all` mechanism itself rather than a scheduling issue.

### Impact

In our case this caused a production outage. The deadlock stalled I/O completion, causing Redpanda (which uses the Seastar framework, which in turn uses libaio + direct I/O) to experience I/O hangs: the disk was idle, but I/O completions were never delivered to userspace.

### Trigger

The deadlock is between two callers of `lru_add_drain_all()`:

1. **FluentBit** (`flb-pipeline`, pid 18416) -- a userspace logging process -- calls `fadvise(POSIX_FADV_DONTNEED)` -> `generic_fadvise` -> `lru_add_drain_all`. This thread acquired the mutex, scheduled per-CPU drain work, and is stuck in `flush_work()` -> `wait_for_completion()` waiting for the drain to complete. It never does.

2. **khugepaged** (pid 220) -- a kernel thread -- calls `lru_add_drain_all` -> `mutex_lock` and is blocked waiting on the mutex held by flb-pipeline. THP was enabled (default Ubuntu config).

The kernel explicitly reports the dependency: `INFO: task khugepaged:220 is blocked on a mutex likely owned by task flb-pipeline:18416`.

Note that although neither stack trace involves Redpanda/Seastar directly, the system was running Redpanda, which performs heavy direct I/O (DIO) via libaio on XFS. This generates significant kernel-side I/O completion work and LRU activity, which may or may not be relevant to why the drain work fails to complete.

The key question is why the per-CPU drain work scheduled by flb-pipeline's `lru_add_drain_all` never completes. On a preemptible kernel, the kworker threads should be schedulable regardless of CPU load. The Amazon Linux #993 report's memory dump showed `mutex.owner = 1` (only MUTEX_FLAG_WAITERS set, no actual owner) in the same scenario, suggesting the bug is in the drain/completion mechanism itself -- not a scheduling issue.

### Kernel version

```
6.17.0-1007-aws #7~24.04.1-Ubuntu
Architecture: aarch64 (ARM64, Graviton3, EC2 m7gd.8xlarge)
Distribution: Ubuntu 24.04 LTS (Noble Numbat), HWE kernel
```

### dmesg evidence

First occurrence at boot + ~37 minutes:

```
[Thu Apr 2 09:41:07 2026] INFO: task khugepaged:220 blocked for more than 122 seconds.
[Thu Apr 2 09:41:07 2026] Not tainted 6.17.0-1007-aws #7~24.04.1-Ubuntu
[Thu Apr 2 09:41:07 2026] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Apr 2 09:41:07 2026] task:khugepaged state:D stack:0 pid:220 tgid:220 ppid:2 task_flags:0x200040 flags:0x00000010
[Thu Apr 2 09:41:07 2026] Call trace:
[Thu Apr 2 09:41:07 2026]  __switch_to+0xf0/0x178 (T)
[Thu Apr 2 09:41:07 2026]  __schedule+0x2e0/0x790
[Thu Apr 2 09:41:07 2026]  schedule+0x34/0xc0
[Thu Apr 2 09:41:07 2026]  schedule_preempt_disabled+0x1c/0x40
[Thu Apr 2 09:41:07 2026]  __mutex_lock.constprop.0+0x420/0xcb0
[Thu Apr 2 09:41:07 2026]  __mutex_lock_slowpath+0x20/0x48
[Thu Apr 2 09:41:07 2026]  mutex_lock+0x8c/0xc0
[Thu Apr 2 09:41:07 2026]  __lru_add_drain_all+0x50/0x2e8
[Thu Apr 2 09:41:07 2026]  lru_add_drain_all+0x20/0x48
[Thu Apr 2 09:41:07 2026]  khugepaged+0xa8/0x2c8
[Thu Apr 2 09:41:07 2026]  kthread+0xfc/0x110
[Thu Apr 2 09:41:07 2026]  ret_from_fork+0x10/0x20
[Thu Apr 2 09:41:07 2026] INFO: task khugepaged:220 is blocked on a mutex likely owned by task flb-pipeline:18416.
[Thu Apr 2 09:41:07 2026] task:flb-pipeline state:D stack:0 pid:18416 tgid:18403 ppid:17913 task_flags:0x400140 flags:0x00800008
[Thu Apr 2 09:41:07 2026] Call trace:
[Thu Apr 2 09:41:07 2026]  __switch_to+0xf0/0x178 (T)
[Thu Apr 2 09:41:07 2026]  __schedule+0x2e0/0x790
[Thu Apr 2 09:41:07 2026]  schedule+0x34/0xc0
[Thu Apr 2 09:41:07 2026]  schedule_timeout+0x13c/0x150
[Thu Apr 2 09:41:07 2026]  __wait_for_common+0xe4/0x2a8
[Thu Apr 2 09:41:07 2026]  wait_for_completion+0x2c/0x60
[Thu Apr 2 09:41:07 2026]  __flush_work+0x98/0x138
[Thu Apr 2 09:41:07 2026]  flush_work+0x30/0x58
[Thu Apr 2 09:41:07 2026]  __lru_add_drain_all+0x1bc/0x2e8
[Thu Apr 2 09:41:07 2026]  lru_add_drain_all+0x20/0x48
[Thu Apr 2 09:41:07 2026]  generic_fadvise+0x228/0x3b8
[Thu Apr 2 09:41:07 2026]  __arm64_sys_fadvise64_64+0xa8/0x138
[Thu Apr 2 09:41:07 2026]  invoke_syscall+0x74/0x128
[Thu Apr 2 09:41:07 2026]  el0_svc_common.constprop.0+0x4c/0x140
[Thu Apr 2 09:41:07 2026]  do_el0_svc+0x28/0x58
[Thu Apr 2 09:41:07 2026]  el0_svc+0x40/0x160
[Thu Apr 2 09:41:07 2026]  el0t_64_sync_handler+0xc0/0x108
[Thu Apr 2 09:41:07 2026]  el0t_64_sync+0x1b8/0x1c0
[Thu Apr 2 09:41:07 2026] INFO: task flb-pipeline:18416 blocked for more than 122 seconds.
[Thu Apr 2 09:41:07 2026] Not tainted 6.17.0-1007-aws #7~24.04.1-Ubuntu
```

The deadlock was reported 5 times at ~2 minute intervals from 09:41:07 to 09:49:19 (122s to 614s blocked), at which point the hung task warning limit (10) was exhausted and further reports were suppressed. Presumably the deadlock continued indefinitely, as the machine continued experiencing symptoms until it was shut down. That is, we believe it never self-resolved:

- Both tasks are in `state:D` (TASK_UNINTERRUPTIBLE), waiting on kernel primitives without timeouts (`mutex_lock` and `wait_for_completion`)
- The cluster continued to exhibit severe I/O stalls, leaderless partitions, and stuck partition movements on this node for the remainder of its lifetime (~10 hours after the deadlock was first observed)
- A second, distinct kernel deadlock (XFS DIO rwsem owner-not-found, see `dmesg-nicolae.txt`) was observed on the same node ~7 hours later at 20:26 UTC, after we manually re-enabled hung task reporting (`sysctl -w kernel.hung_task_warnings=-1`, since the original 10-warning limit had been exhausted). This suggests the system was in a degraded state throughout, accumulating further deadlocks.
- The node was ultimately decommissioned and removed from the cluster; at no point did its symptoms improve

### Very similar bug on Amazon Linux 2023

https://github.com/amazonlinux/amazon-linux-2023/issues/993 -- a very similar deadlock on kernel 6.12.37 (x86_64). A memory dump showed `mutex.owner = 1` (MUTEX_FLAG_WAITERS set, no actual owner). In their case the second blocked process was a tokio runtime thread (rather than `khugepaged` in ours), but the mechanism is the same: one task holds the `lru_add_drain_all` mutex and blocks in `flush_work`, while another task blocks on the mutex itself.
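For context, the code pattern both reports implicate is the drain loop in `mm/swap.c`. The following is a simplified sketch of that pattern, abbreviated from the 6.x source for illustration (helper and field names are condensed, not verbatim):

```
/* Simplified sketch of __lru_add_drain_all() in mm/swap.c (6.x),
 * abbreviated for illustration -- not the verbatim source. */
static void __lru_add_drain_all(bool force_all_cpus)
{
	static DEFINE_MUTEX(lock);   /* the mutex both tasks contend on */
	static struct cpumask has_work;
	int cpu;

	mutex_lock(&lock);           /* khugepaged blocks here */

	cpumask_clear(&has_work);
	for_each_online_cpu(cpu) {
		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

		/* queue a drain job on every CPU with pending LRU batches */
		if (cpu_needs_drain(cpu)) {
			INIT_WORK(work, lru_add_drain_per_cpu);
			queue_work_on(cpu, mm_percpu_wq, work);
			__cpumask_set_cpu(cpu, &has_work);
		}
	}

	/* flb-pipeline blocks here: flush_work() waits on a completion
	 * that is signalled when the per-CPU work finishes -- which,
	 * in this bug, never happens. */
	for_each_cpu(cpu, &has_work)
		flush_work(&per_cpu(lru_add_drain_work, cpu));

	mutex_unlock(&lock);
}
```

If the completion signalled by a per-CPU work item is lost (consistent with the `mutex.owner = 1` observation in #993), the holder never reaches `mutex_unlock()`, and every subsequent caller queues behind it indefinitely.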
The issue appears to have been resolved at some point (the reporter confirmed it stopped occurring), though this is not definitive and it is unclear which patch, if any, fixed it.

### Reproducer

We have not yet built a minimal reproducer. Based on the evidence, the conditions that were present when the bug triggered appear to be:

1. THP enabled (the default `madvise` mode), so `khugepaged` is periodically calling `lru_add_drain_all()`
2. A process calling `fadvise(POSIX_FADV_DONTNEED)` (FluentBit in our case), which also calls `lru_add_drain_all()`
3. Sufficient load to trigger the bug -- the deadlock appeared on all 4 involved hosts fairly quickly once production load was applied (~37 minutes after boot)

Notably, other clusters running the same kernel (6.17 ARM64) with only 4 vCPUs and modest load do not exhibit the issue, suggesting it is load-dependent.

We are happy to work with the kernel team to develop a reproducer.

** Affects: linux-aws (Ubuntu)
     Importance: Undecided
         Status: New

--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2147564

Title:
  lru_add_drain_all() deadlock on 6.17.0-1007-aws ARM64

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-aws/+bug/2147564/+subscriptions

--
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
