Public bug reported:

`lru_add_drain_all()` deadlocks on `6.17.0-1007-aws #7~24.04.1-Ubuntu` 
(aarch64, EC2 m7gd.8xlarge). Two tasks are involved:

1. **`flb-pipeline` (FluentBit, pid 18416)**: Called
`fadvise(POSIX_FADV_DONTNEED)` -> `generic_fadvise` ->
`lru_add_drain_all`, which acquired the global `lru_drain_lock` mutex,
scheduled per-CPU kworker drain jobs via `queue_work_on()`, and is now
stuck in `flush_work()` -> `wait_for_completion()` waiting for the drain
work to complete. It never does. The mutex is held the entire time.

2. **`khugepaged` (pid 220)**: Called `lru_add_drain_all` ->
`mutex_lock` and is blocked waiting to acquire the same `lru_drain_lock`
mutex that `flb-pipeline` holds.

The kernel explicitly reports the dependency: `INFO: task khugepaged:220
is blocked on a mutex likely owned by task flb-pipeline:18416`.

The result is an effectively system-wide deadlock: the mutex holder
(`flb-pipeline`) is stuck waiting on a completion that never arrives, and
all subsequent callers of `lru_add_drain_all()` (including `khugepaged`,
but also any code path touching page cache or LRU state) pile up on the
mutex indefinitely.

The exact mechanism by which the drain work fails to complete is
unclear. On a preemptive kernel, the kworker should be able to get
scheduled even on busy CPUs. The same bug was reported on Amazon Linux
2023 (issue #993), where a memory dump showed `mutex.owner = 1` (only
`MUTEX_FLAG_WAITERS` set, no actual owner) -- suggesting the completion
or mutex release was lost, not that the work couldn't be scheduled. This
points to a bug in the `lru_add_drain_all` mechanism itself rather than
a scheduling issue.

### Impact

In our case this caused a production outage. The deadlock stalled I/O
completion, causing I/O hangs in Redpanda (built on the Seastar
framework, which uses libaio with direct I/O): the disk was idle, but
I/O completions were never delivered to userspace.

### Trigger

The deadlock is between the two callers of `lru_add_drain_all()`
described above: FluentBit's `flb-pipeline` thread (pid 18416), a
userspace logging process entering via `fadvise(POSIX_FADV_DONTNEED)`,
which holds the mutex and is stuck in `flush_work()` ->
`wait_for_completion()`; and the `khugepaged` kernel thread (pid 220),
which is blocked on the mutex. THP was enabled (default Ubuntu config),
so `khugepaged` was active.

Note that although neither stack trace involves Redpanda/Seastar
directly, the system was running Redpanda which performs heavy DIO
(direct I/O) via libaio on XFS. This generates significant kernel-side
I/O completion work and LRU activity, which may or may not be relevant
to why the drain work fails to complete.

### Kernel version

```
6.17.0-1007-aws #7~24.04.1-Ubuntu
Architecture: aarch64 (ARM64, Graviton3, EC2 m7gd.8xlarge)
Distribution: Ubuntu 24.04 LTS (Noble Numbat), HWE kernel
```

### dmesg evidence

First occurrence at boot + ~37 minutes:

```
[Thu Apr  2 09:41:07 2026] INFO: task khugepaged:220 blocked for more than 122 seconds.
[Thu Apr  2 09:41:07 2026]       Not tainted 6.17.0-1007-aws #7~24.04.1-Ubuntu
[Thu Apr  2 09:41:07 2026] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Apr  2 09:41:07 2026] task:khugepaged      state:D stack:0     pid:220   tgid:220   ppid:2      task_flags:0x200040 flags:0x00000010
[Thu Apr  2 09:41:07 2026] Call trace:
[Thu Apr  2 09:41:07 2026]  __switch_to+0xf0/0x178 (T)
[Thu Apr  2 09:41:07 2026]  __schedule+0x2e0/0x790
[Thu Apr  2 09:41:07 2026]  schedule+0x34/0xc0
[Thu Apr  2 09:41:07 2026]  schedule_preempt_disabled+0x1c/0x40
[Thu Apr  2 09:41:07 2026]  __mutex_lock.constprop.0+0x420/0xcb0
[Thu Apr  2 09:41:07 2026]  __mutex_lock_slowpath+0x20/0x48
[Thu Apr  2 09:41:07 2026]  mutex_lock+0x8c/0xc0
[Thu Apr  2 09:41:07 2026]  __lru_add_drain_all+0x50/0x2e8
[Thu Apr  2 09:41:07 2026]  lru_add_drain_all+0x20/0x48
[Thu Apr  2 09:41:07 2026]  khugepaged+0xa8/0x2c8
[Thu Apr  2 09:41:07 2026]  kthread+0xfc/0x110
[Thu Apr  2 09:41:07 2026]  ret_from_fork+0x10/0x20
[Thu Apr  2 09:41:07 2026] INFO: task khugepaged:220 is blocked on a mutex likely owned by task flb-pipeline:18416.
[Thu Apr  2 09:41:07 2026] task:flb-pipeline    state:D stack:0     pid:18416 tgid:18403 ppid:17913  task_flags:0x400140 flags:0x00800008
[Thu Apr  2 09:41:07 2026] Call trace:
[Thu Apr  2 09:41:07 2026]  __switch_to+0xf0/0x178 (T)
[Thu Apr  2 09:41:07 2026]  __schedule+0x2e0/0x790
[Thu Apr  2 09:41:07 2026]  schedule+0x34/0xc0
[Thu Apr  2 09:41:07 2026]  schedule_timeout+0x13c/0x150
[Thu Apr  2 09:41:07 2026]  __wait_for_common+0xe4/0x2a8
[Thu Apr  2 09:41:07 2026]  wait_for_completion+0x2c/0x60
[Thu Apr  2 09:41:07 2026]  __flush_work+0x98/0x138
[Thu Apr  2 09:41:07 2026]  flush_work+0x30/0x58
[Thu Apr  2 09:41:07 2026]  __lru_add_drain_all+0x1bc/0x2e8
[Thu Apr  2 09:41:07 2026]  lru_add_drain_all+0x20/0x48
[Thu Apr  2 09:41:07 2026]  generic_fadvise+0x228/0x3b8
[Thu Apr  2 09:41:07 2026]  __arm64_sys_fadvise64_64+0xa8/0x138
[Thu Apr  2 09:41:07 2026]  invoke_syscall+0x74/0x128
[Thu Apr  2 09:41:07 2026]  el0_svc_common.constprop.0+0x4c/0x140
[Thu Apr  2 09:41:07 2026]  do_el0_svc+0x28/0x58
[Thu Apr  2 09:41:07 2026]  el0_svc+0x40/0x160
[Thu Apr  2 09:41:07 2026]  el0t_64_sync_handler+0xc0/0x108
[Thu Apr  2 09:41:07 2026]  el0t_64_sync+0x1b8/0x1c0
[Thu Apr  2 09:41:07 2026] INFO: task flb-pipeline:18416 blocked for more than 122 seconds.
[Thu Apr  2 09:41:07 2026]       Not tainted 6.17.0-1007-aws #7~24.04.1-Ubuntu
```

The deadlock was reported 5 times at ~2 minute intervals from 09:41:07
to 09:49:19 (122s to 614s blocked), at which point the hung task warning
limit (10) was exhausted and further reports were suppressed.

We believe the deadlock persisted indefinitely and never self-resolved;
the machine continued to exhibit symptoms until it was shut down:

- Both tasks are in `state:D` (TASK_UNINTERRUPTIBLE) waiting on kernel 
primitives without timeouts (`mutex_lock` and `wait_for_completion`)
- The cluster continued to exhibit severe I/O stalls, leaderless partitions, 
and stuck partition movements on this node for the remainder of its lifetime 
(~10 hours after the deadlock was first observed)
- A second, distinct kernel deadlock (XFS DIO rwsem owner-not-found, see 
`dmesg-nicolae.txt`) was observed on the same node ~7 hours later at 20:26 UTC 
after we manually re-enabled hung task reporting (`sysctl -w 
kernel.hung_task_warnings=-1`, since the original 10-warning limit had been 
exhausted). This suggests the system was in a degraded state throughout, 
accumulating further deadlocks.
- The node was ultimately decommissioned and removed from the cluster; at no 
point did its symptoms improve

### Very similar bug on Amazon Linux 2023

https://github.com/amazonlinux/amazon-linux-2023/issues/993 -- very
similar deadlock on kernel 6.12.37 (x86_64). Memory dump showed
`mutex.owner = 1` (MUTEX_FLAG_WAITERS set, no actual owner). In their
case the second blocked process was a tokio runtime thread (rather than
`khugepaged` in ours), but the mechanism is the same: one task holds the
`lru_add_drain_all` mutex and blocks in `flush_work`, while another task
blocks on the mutex itself. The issue appears to have been resolved at
some point (the reporter confirmed it stopped occurring), though this is
not definitive and it is unclear which patch, if any, fixed it.

### Reproducer

We have not yet built a minimal reproducer. Based on the evidence, the
conditions that were present when the bug triggered appear to be:

1. THP enabled (the default `madvise` mode), so `khugepaged` periodically calls 
`lru_add_drain_all()`
2. A process calling `fadvise(POSIX_FADV_DONTNEED)` (FluentBit in our case), 
which also calls `lru_add_drain_all()`
3. Sufficient load to trigger the bug -- the deadlock appeared on all 4 
involved hosts, fairly quickly once production load was applied (~37 minutes 
after boot)

Notably, other clusters running the same kernel (6.17 ARM64) with only 4
vCPUs and modest load do not exhibit the issue, suggesting it is load-
dependent.

We are happy to work with the kernel team to develop a reproducer.

** Affects: linux-aws (Ubuntu)
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/2147564

Title:
  lru_add_drain_all() deadlock on 6.17.0-1007-aws ARM64


