Philip Cox - I think we have an RCA.

Below is the call stack of “iptables” at the moment of the hang (which is the same
across all collected kernel dumps):
```
crash> bt 25894
PID: 25894    TASK: ffff89094bce8000  CPU: 1    COMMAND: "iptables"
 #0 [ffffadb9456ab8f8] __schedule at ffffffffa5ba8b8d
 #1 [ffffadb9456ab980] preempt_schedule_common at ffffffffa5ba92a8
 #2 [ffffadb9456ab998] __cond_resched at ffffffffa5ba92e6
 #3 [ffffadb9456ab9a8] down_read at ffffffffa5bab823
 #4 [ffffadb9456ab9c0] kernfs_walk_and_get_ns at ffffffffa5248b16
 #5 [ffffadb9456ab9f8] cgroup_get_from_path at ffffffffa4fa87fa
 #6 [ffffadb9456aba20] cgroup_mt_check_v2 at ffffffffc07bf083 [xt_cgroup]
 #7 [ffffadb9456aba48] xt_check_match at ffffffffc01304c1 [x_tables]
 #8 [ffffadb9456abb08] find_check_entry at ffffffffc014315e [ip_tables]
 #9 [ffffadb9456abbc8] translate_table at ffffffffc0144429 [ip_tables]
#10 [ffffadb9456abc68] do_ipt_set_ctl at ffffffffc014579c [ip_tables]
#11 [ffffadb9456abd10] nf_setsockopt at ffffffffa598d697
#12 [ffffadb9456abd50] ip_setsockopt at ffffffffa59a140a
#13 [ffffadb9456abd90] raw_setsockopt at ffffffffa59d44bf
#14 [ffffadb9456abd98] security_socket_setsockopt at ffffffffa533c5d2
#15 [ffffadb9456abdc8] __sys_setsockopt at ffffffffa58c1699
#16 [ffffadb9456abe10] __x64_sys_setsockopt at ffffffffa58c17c5
#17 [ffffadb9456abe20] x64_sys_call at ffffffffa4e06bab
#18 [ffffadb9456abe30] do_syscall_64 at ffffffffa5b9a9e4
#19 [ffffadb9456abe88] handle_mm_fault at ffffffffa51027d8
#20 [ffffadb9456abec8] do_user_addr_fault at ffffffffa4ea4b40
#21 [ffffadb9456abf00] irqentry_exit_to_user_mode at ffffffffa5b9f43e
#22 [ffffadb9456abf10] irqentry_exit at ffffffffa5b9f46d
#23 [ffffadb9456abf18] clear_bhb_loop at ffffffffa5c018c5
#24 [ffffadb9456abf28] clear_bhb_loop at ffffffffa5c018c5
#25 [ffffadb9456abf38] clear_bhb_loop at ffffffffa5c018c5
#26 [ffffadb9456abf50] entry_SYSCALL_64_after_hwframe at ffffffffa5c00124
    RIP: 00007f715892496e  RSP: 00007ffddb994cf8  RFLAGS: 00000206
    RAX: ffffffffffffffda  RBX: 00005589d9902dc8  RCX: 00007f715892496e
    RDX: 0000000000000040  RSI: 0000000000000000  RDI: 0000000000000004
    RBP: 00005589d9909ec0   R8: 0000000000003348   R9: 0000000000000052
    R10: 00005589d9909ec0  R11: 0000000000000206  R12: 00005589d99097d0
    R13: 00005589d9902dc8  R14: 00005589d9902dc0  R15: 00005589d9909f20
    ORIG_RAX: 0000000000000036  CS: 0033  SS: 002b
```

There are two cgroup-related functions on the stack, and the buggy one
is cgroup_get_from_path: it acquires a spinlock and then calls
kernfs_walk_and_get_ns, which takes a read semaphore via down_read and
may therefore put the current process to sleep (frames #0 to #4 above).
The process is scheduled out with the spinlock still held, which
triggers the subsequent hard lockup.
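
To make it easier to confirm the same sleeping path in the other dumps, here is a minimal Python sketch that pulls the frames out of `bt` output and prints the call path from a given function down to the innermost frame. The helper names (`parse_frames`, `path_from`) are my own, and the regular expression only assumes the frame layout visible in the trace above; it is not a general parser for crash output.

```python
import re

# Matches crash "bt" frame lines such as:
#   #3 [ffffadb9456ab9a8] down_read at ffffffffa5bab823
#   #6 [ffffadb9456aba20] cgroup_mt_check_v2 at ffffffffc07bf083 [xt_cgroup]
FRAME_RE = re.compile(
    r"^\s*#(\d+)\s+\[[0-9a-f]+\]\s+(\S+)\s+at\s+[0-9a-f]+(?:\s+\[(\w+)\])?\s*$"
)


def parse_frames(bt_text):
    """Return a list of (index, function, module_or_None) for each frame line."""
    frames = []
    for line in bt_text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            frames.append((int(m.group(1)), m.group(2), m.group(3)))
    return frames


def path_from(frames, caller):
    """Call path from `caller` to the innermost frame (#0), caller first."""
    names = [name for _, name, _ in frames]
    if caller not in names:
        return []
    idx = names.index(caller)
    return list(reversed(names[: idx + 1]))


# First seven frames copied from the backtrace above.
SAMPLE = """\
 #0 [ffffadb9456ab8f8] __schedule at ffffffffa5ba8b8d
 #1 [ffffadb9456ab980] preempt_schedule_common at ffffffffa5ba92a8
 #2 [ffffadb9456ab998] __cond_resched at ffffffffa5ba92e6
 #3 [ffffadb9456ab9a8] down_read at ffffffffa5bab823
 #4 [ffffadb9456ab9c0] kernfs_walk_and_get_ns at ffffffffa5248b16
 #5 [ffffadb9456ab9f8] cgroup_get_from_path at ffffffffa4fa87fa
 #6 [ffffadb9456aba20] cgroup_mt_check_v2 at ffffffffc07bf083 [xt_cgroup]
"""

frames = parse_frames(SAMPLE)
print(" -> ".join(path_from(frames, "cgroup_get_from_path")))
```

Running the same helper over each dump's `bt` output and comparing the printed paths should show quickly whether every hang goes through cgroup_get_from_path into down_read.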

The good news is that the bug appears to have been present only briefly
in the 5.15 kernel: it was first introduced in 5.15.75 and “fixed” in 5.16.1
(https://github.com/torvalds/linux/commit/46307fd6e27a3f678a1678b02e667678c22aa8cc).

So three follow-up questions for you at your convenience:

1. Does this RCA seem reasonable / correct to you?
2. If yes to 1), can Canonical backport this fix to the 5.15 and 5.0.4-fips kernels?
3. If yes to 1), in the meantime, is there a good way for me to find the version of
the AWS Ubuntu kernel which would not contain this issue? In other words, how
can I translate 5.15.0-1072-aws to 5.15.xx so we can pin the kernel to the
previous revision, if it is not too far back?
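
For question 3, the sketch below shows the kind of translation I have in mind. It assumes that Ubuntu kernels expose a signature line of the form `Ubuntu <package-version>-<flavour> <upstream-version>` in `/proc/version_signature`; I have not confirmed this for the aws flavour, so please treat the file path and layout as assumptions. The function names are my own, and the example string is made up for illustration, not a real mapping for any AWS kernel.

```python
import re

# Assumed layout: "Ubuntu <package-version>-<flavour> <upstream-version>"
SIG_RE = re.compile(r"^Ubuntu\s+(\S+)\s+(\d+)\.(\d+)\.(\d+)$")


def upstream_version(signature):
    """Extract the upstream (major, minor, patch) tuple from a signature line."""
    m = SIG_RE.match(signature.strip())
    if m is None:
        raise ValueError("unrecognised signature: %r" % signature)
    return (int(m.group(2)), int(m.group(3)), int(m.group(4)))


def predates_regression(version, first_bad=(5, 15, 75)):
    """True if `version` is in the same major.minor series as `first_bad`
    and older than it. 5.15.75 is the first affected release per the
    analysis in this message."""
    return version[:2] == first_bad[:2] and version < first_bad


def read_signature(path="/proc/version_signature"):
    """Read the signature on a running system (path is an assumption)."""
    with open(path) as fh:
        return fh.read()


# Hypothetical signature string, for illustration only.
example = "Ubuntu 5.15.0-9999.99-aws 5.15.60"
version = upstream_version(example)
print(version, predates_regression(version))
```

On a candidate instance, `upstream_version(read_signature())` would give the 5.15.xx value to compare against 5.15.75.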

https://bugs.launchpad.net/bugs/2089318

Title:
  kernel hard lockup 5.15.0-1072-aws