Public bug reported:

[Impact]

 * This problem hard-locks 2 CPUs in a deadlock, which in turn
   causes soft lockups on other CPUs; the system becomes
   unusable.

 * This is relatively rare and difficult to hit because it is a
   corner case in scheduling/load balancing that depends on
   timing against the CPU stopper code, and it requires an SMP
   _NUMA_ system. (It can, however, be hit with the synthetic
   test case attached to this LP bug.)

 * Since SMP plus NUMA usually means _servers_, it is worth
   preventing this bug and the resulting hard lockups and reboots.

 * The fix resolves the potential deadlock by moving one of the
   calls needed for the deadlock out of the locked section.
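
The idea in the last bullet can be pictured with a small userspace analogy.
This is only a sketch, not the kernel patch: it assumes the call moved out
from under the lock is the stopper-thread wakeup (the wake_up_process frame
under stop_two_cpus in the CPU 6 trace in the original description points
that way), and Stopper, wake() and queue_work_on_two() are hypothetical
names invented for the illustration.

```python
import threading


class Stopper:
    """Hypothetical stand-in for a per-CPU stopper: lock, work list, wakeup flag."""

    def __init__(self, cpu):
        self.cpu = cpu
        self.lock = threading.Lock()
        self.works = []
        self.woken = threading.Event()


def wake(stopper):
    # A wakeup may need other locks of its own, so it should not run while
    # a stopper lock is held. This check is only meaningful in a
    # single-threaded demo, where nobody else can legitimately hold the lock.
    assert not stopper.lock.locked(), "wakeup issued under stopper lock"
    stopper.woken.set()


def queue_work_on_two(s1, s2, work):
    """Queue `work` on both stoppers; defer the wakeups until after unlock."""
    pending = []  # wakeups recorded while the locks are held
    first, second = sorted((s1, s2), key=lambda s: s.cpu)  # fixed lock order
    with first.lock:
        with second.lock:
            for s in (s1, s2):
                s.works.append(work)
                pending.append(s)  # record only; do not wake here
    for s in pending:  # both locks have been released by now
        wake(s)
```

The design point is that the locked section only touches the work lists;
anything that could block on, or spin for, another lock is postponed until
the stopper locks are dropped.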

[Test Case]

 * There's a synthetic test case to reproduce this problem
   (although without the stack traces - just a system hang)
   attached to this LP bug.

 * It uses kprobes/mdelay/cpu stopper calls to force the code
   to execute and force the timing/locking condition to occur.

 * $ sudo insmod kmod-stopper.ko

   Some dmesg logging occurs, and the system either hangs or does not.
   See examples in the comments.
   
[Regression Potential] 

 * These are patches to the CPU stopper code in stop_machine.c, and
   they slightly change how it works; however, there are no further
   upstream fixes for these patches, and they are still at the top
   of the 'git log --oneline -- kernel/stop_machine.c' output.

 * These patches have been verified with the synthetic test case
   and 'stress-ng --class scheduler --sequential 0' (no regressions)
   on a guest with 2 CPUs and on one physical system with 24 CPUs.

[Other Info]
 
 * The patches are required on Xenial and later.
 * There are 4 patches for Xenial, and 2 patches pending for Bionic.
 * All patches are applied from Cosmic onwards.

[Original Description]

These 2 hard lockups appeared suddenly in the logs, followed by many soft
lockups as fallout.

    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.477086] NMI watchdog: Watchdog detected hard LOCKUP on cpu 10
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.483800] Modules linked in: <...>
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484066] CPU: 10 PID: 58 Comm: migration/10 Not tainted 4.4.0-116-generic #140~14.04.1-Ubuntu
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484068] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 02/17/2017
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484070] task: ffff883ff2a76200 ti: ffff883ff2110000 task.ti: ffff883ff2110000
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484071] RIP: 0010:[<ffffffff810c8cb0>]  [<ffffffff810c8cb0>] native_queued_spin_lock_slowpath+0x160/0x170
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484079] RSP: 0000:ffff883ff2113c58  EFLAGS: 00000002
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484080] RAX: 0000000000000101 RBX: 0000000000000086 RCX: 0000000000000001
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484081] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffff881fff991ba8
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484083] RBP: ffff883ff2113c58 R08: 0000000000000101 R09: ffff883ff082e200
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484084] R10: 0000000000002e04 R11: 0000000000002e04 R12: ffff881fff997c60
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484085] R13: ffff881fff991ba8 R14: 0000000000000000 R15: ffff881fff997300
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484087] FS:  0000000000000000(0000) GS:ffff883fff000000(0000) knlGS:0000000000000000
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484088] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484090] CR2: 00007f7caaa23020 CR3: 0000001f46740000 CR4: 0000000000160670
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484091] Stack:
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484092]  ffff883ff2113c68 ffffffff811870eb ffff883ff2113c80 ffffffff81819907
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484094]  ffff881fff991ba0 ffff883ff2113cb0 ffffffff8111c600 ffff881fff997300
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484096]  ffff881fff997c90 ffff881ff03dd400 0000000000000000 ffff883ff2113cc0
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484098] Call Trace:
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484105]  [<ffffffff811870eb>] queued_spin_lock_slowpath+0xb/0xf
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484109]  [<ffffffff81819907>] _raw_spin_lock_irqsave+0x37/0x40
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484113]  [<ffffffff8111c600>] cpu_stop_queue_work+0x30/0x80
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484116]  [<ffffffff8111ccd0>] stop_one_cpu_nowait+0x30/0x40
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484119]  [<ffffffff810bbb5b>] load_balance+0x71b/0x940
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484122]  [<ffffffff810bbff5>] pick_next_task_fair+0x275/0x4b0
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484126]  [<ffffffff81816166>] __schedule+0x6c6/0x7f0
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484132]  [<ffffffff810a2560>] ? sort_range+0x30/0x30
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484134]  [<ffffffff818162c5>] schedule+0x35/0x80
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484136]  [<ffffffff810a262d>] smpboot_thread_fn+0xcd/0x180
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484139]  [<ffffffff8109f138>] kthread+0xd8/0xf0
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484141]  [<ffffffff8109f060>] ? kthread_park+0x60/0x60
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484143]  [<ffffffff81819ff5>] ret_from_fork+0x55/0x80
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484144]  [<ffffffff8109f060>] ? kthread_park+0x60/0x60

    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.644471] NMI watchdog: Watchdog detected hard LOCKUP on cpu 6
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651086] Modules linked in: <...>
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651342] CPU: 6 PID: 204932 Comm: ceph-osd Not tainted 4.4.0-116-generic #140~14.04.1-Ubuntu
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651344] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 02/17/2017
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651345] task: ffff881ff03dd400 ti: ffff883cda77c000 task.ti: ffff883cda77c000
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651347] RIP: 0010:[<ffffffff810aacb6>]  [<ffffffff810aacb6>] try_to_wake_up+0x86/0x3f0
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651353] RSP: 0000:ffff883cda77fa78  EFLAGS: 00000002
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651354] RAX: 0000000000000001 RBX: ffff883ff2a76200 RCX: 0000000000000000
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651355] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff883ff2a768d4
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651356] RBP: ffff883cda77fab8 R08: 000000000000000a R09: ffff881ff03dd400
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651357] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000017300
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651359] R13: ffff883ff2a768d4 R14: 0000000000000046 R15: 0000000000000000
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651360] FS:  00007ff8ecbc9700(0000) GS:ffff881fff980000(0000) knlGS:0000000000000000
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651362] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651363] CR2: 0000000014583550 CR3: 0000003d4ac96000 CR4: 0000000000160670
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651364] Stack:
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651365]  0000000000000202 ffff883cda77fa98 0000000000000003 0000000000000006
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651368]  000000000000000a ffff883cda77fb70 ffff883fff011ba0 ffff881fff991ba0
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651370]  ffff883cda77fac8 ffffffff810ab035 ffff883cda77fbc8 ffffffff8111cc22
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651372] Call Trace:
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651375]  [<ffffffff810ab035>] wake_up_process+0x15/0x20
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651379]  [<ffffffff8111cc22>] stop_two_cpus+0x1b2/0x230
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651382]  [<ffffffff8111c650>] ? cpu_stop_queue_work+0x80/0x80
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651384]  [<ffffffff810b5d15>] ? dequeue_entity+0x455/0x8a0
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651386]  [<ffffffff8111c650>] ? cpu_stop_queue_work+0x80/0x80
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651388]  [<ffffffff810aaa70>] ? __migrate_swap_task.part.83+0x80/0x80
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651390]  [<ffffffff810ab18e>] migrate_swap+0xae/0x130
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651392]  [<ffffffff810b4e44>] task_numa_migrate+0x504/0x930
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651394]  [<ffffffff810b52e9>] numa_migrate_preferred+0x79/0x80
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651396]  [<ffffffff810b9373>] task_numa_fault+0x923/0xcd0
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651400]  [<ffffffff8175e407>] ? tcp_recvmsg+0x6b7/0xbd0
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651404]  [<ffffffff811da9be>] ? mpol_misplaced+0x14e/0x190
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651408]  [<ffffffff811b7836>] handle_pte_fault+0x5a6/0x1440
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651411]  [<ffffffff816f6693>] ? sock_recvmsg+0x43/0x50
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651413]  [<ffffffff811b9540>] handle_mm_fault+0x250/0x540
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651417]  [<ffffffff81069e1a>] __do_page_fault+0x19a/0x430
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651419]  [<ffffffff8106a0d2>] do_page_fault+0x22/0x30
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651423]  [<ffffffff8181c5a8>] page_fault+0x28/0x30
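
For comparing the two reports side by side, it can help to reduce each one
to its CPU number and its reliable call chain. The following is a rough
sketch rather than an existing tool: summarize_lockups() is a hypothetical
helper, and it assumes the 4.4-era trace layout shown above, where each
frame reads "[<address>] symbol+0xoff/0xsize" and frames prefixed with "?"
are addresses found on the stack that are not confirmed to be part of the
call chain.

```python
import re

LOCKUP_RE = re.compile(r"hard LOCKUP on cpu (\d+)")
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def summarize_lockups(log_text):
    """Return [(cpu, [symbols of reliable frames, innermost first]), ...]."""
    # Collapse all whitespace so that records wrapped by a mail client
    # and unwrapped records are handled the same way.
    flat = " ".join(log_text.split())
    # With one capture group, split() yields [prefix, cpu, body, cpu, body, ...]
    parts = LOCKUP_RE.split(flat)
    reports = []
    for cpu, body in zip(parts[1::2], parts[2::2]):
        # Only look after "Call Trace:" so the RIP line is not counted.
        _, _, trace = body.partition("Call Trace:")
        frames = [
            m.group(2)
            for m in FRAME_RE.finditer(trace)
            if not m.group(1)  # skip "?" (unreliable) frames
        ]
        reports.append((int(cpu), frames))
    return reports
```

Applied to the two reports above, this should give CPU 10 a chain beginning
with queued_spin_lock_slowpath, _raw_spin_lock_irqsave, cpu_stop_queue_work,
and CPU 6 a chain beginning with wake_up_process, stop_two_cpus,
migrate_swap.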

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: Incomplete


** Tags: xenial

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1821259

Title:
  Hard lockup in 2 CPUs due to deadlock in cpu_stoppers
