Public bug reported:
[Impact]
* This problem hard locks up 2 CPUs in a deadlock, and this
soft locks up other CPUs as an effect; the system becomes
unusable.
* This is relatively rare / difficult to hit because it's a
corner case in scheduling/load balancing that needs timing
with CPU stopper code. And it needs SMP plus _NUMA_ system.
(but it can be hit with synthetic test case attached in LP.)
* Since SMP plus NUMA usually equals _servers_ it looks like
a good idea to prevent this bug / hard lockups / rebooting.
* The fix resolves the potential deadlock by removing one of
the calls required to deadlock from under the locked code.
[Test Case]
* There's a synthetic test case to reproduce this problem
(although without the stack traces - just a system hang)
attached to this LP bug.
* It uses kprobes/mdelay/cpu stopper calls to force the code
to execute and force the timing/locking condition to occur.
* $ sudo insmod kmod-stopper.ko
Some dmesg logging occurs, and systems either hangs or not.
See examples in comments.
[Regression Potential]
* These are patches to the cpu stop_machine.c code, and they
change a bit how it works; however, there are no upstream
fixes for these patches anymore and they are still the top
of the 'git log --oneline -- kernel/stop_machine.c' output.
* These patches have been verified with the synthetic test case
and 'stress-ng --class scheduler --sequential 0' (no regressions)
on guest with 2 CPUs and one physical system with 24 CPUs.
[Other Info]
* The patches are required on Xenial and later.
* There are 4 patches for Xenial, and 2 patches pending for Bionic.
* All patches are applied from Cosmic onwards.
[Original Description]
These 2 hard lockups happened all of a sudden in the logs, and many soft
lockups occur after them as a fallout.
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.477086] NMI watchdog: Watchdog
detected hard LOCKUP on cpu 10
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.483800] Modules linked in:
<...>
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484066] CPU: 10 PID: 58 Comm:
migration/10 Not tainted 4.4.0-116-generic #140~14.04.1-Ubuntu
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484068] Hardware name: HP
ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 02/17/2017
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484070] task: ffff883ff2a76200
ti: ffff883ff2110000 task.ti: ffff883ff2110000
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484071] RIP:
0010:[<ffffffff810c8cb0>] [<ffffffff810c8cb0>]
native_queued_spin_lock_slowpath+0x160/0x170
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484079] RSP:
0000:ffff883ff2113c58 EFLAGS: 00000002
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484080] RAX: 0000000000000101
RBX: 0000000000000086 RCX: 0000000000000001
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484081] RDX: 0000000000000101
RSI: 0000000000000001 RDI: ffff881fff991ba8
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484083] RBP: ffff883ff2113c58
R08: 0000000000000101 R09: ffff883ff082e200
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484084] R10: 0000000000002e04
R11: 0000000000002e04 R12: ffff881fff997c60
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484085] R13: ffff881fff991ba8
R14: 0000000000000000 R15: ffff881fff997300
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484087] FS:
0000000000000000(0000) GS:ffff883fff000000(0000) knlGS:0000000000000000
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484088] CS: 0010 DS: 0000 ES:
0000 CR0: 0000000080050033
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484090] CR2: 00007f7caaa23020
CR3: 0000001f46740000 CR4: 0000000000160670
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484091] Stack:
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484092] ffff883ff2113c68
ffffffff811870eb ffff883ff2113c80 ffffffff81819907
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484094] ffff881fff991ba0
ffff883ff2113cb0 ffffffff8111c600 ffff881fff997300
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484096] ffff881fff997c90
ffff881ff03dd400 0000000000000000 ffff883ff2113cc0
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484098] Call Trace:
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484105] [<ffffffff811870eb>]
queued_spin_lock_slowpath+0xb/0xf
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484109] [<ffffffff81819907>]
_raw_spin_lock_irqsave+0x37/0x40
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484113] [<ffffffff8111c600>]
cpu_stop_queue_work+0x30/0x80
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484116] [<ffffffff8111ccd0>]
stop_one_cpu_nowait+0x30/0x40
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484119] [<ffffffff810bbb5b>]
load_balance+0x71b/0x940
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484122] [<ffffffff810bbff5>]
pick_next_task_fair+0x275/0x4b0
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484126] [<ffffffff81816166>]
__schedule+0x6c6/0x7f0
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484132] [<ffffffff810a2560>]
? sort_range+0x30/0x30
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484134] [<ffffffff818162c5>]
schedule+0x35/0x80
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484136] [<ffffffff810a262d>]
smpboot_thread_fn+0xcd/0x180
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484139] [<ffffffff8109f138>]
kthread+0xd8/0xf0
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484141] [<ffffffff8109f060>]
? kthread_park+0x60/0x60
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484143] [<ffffffff81819ff5>]
ret_from_fork+0x55/0x80
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484144] [<ffffffff8109f060>]
? kthread_park+0x60/0x60
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.644471] NMI watchdog: Watchdog
detected hard LOCKUP on cpu 6
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651086] Modules linked in:
<...>
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651342] CPU: 6 PID: 204932
Comm: ceph-osd Not tainted 4.4.0-116-generic #140~14.04.1-Ubuntu
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651344] Hardware name: HP
ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 02/17/2017
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651345] task: ffff881ff03dd400
ti: ffff883cda77c000 task.ti: ffff883cda77c000
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651347] RIP:
0010:[<ffffffff810aacb6>] [<ffffffff810aacb6>] try_to_wake_up+0x86/0x3f0
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651353] RSP:
0000:ffff883cda77fa78 EFLAGS: 00000002
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651354] RAX: 0000000000000001
RBX: ffff883ff2a76200 RCX: 0000000000000000
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651355] RDX: 0000000000000001
RSI: 0000000000000003 RDI: ffff883ff2a768d4
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651356] RBP: ffff883cda77fab8
R08: 000000000000000a R09: ffff881ff03dd400
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651357] R10: 0000000000000001
R11: 0000000000000000 R12: 0000000000017300
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651359] R13: ffff883ff2a768d4
R14: 0000000000000046 R15: 0000000000000000
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651360] FS:
00007ff8ecbc9700(0000) GS:ffff881fff980000(0000) knlGS:0000000000000000
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651362] CS: 0010 DS: 0000 ES:
0000 CR0: 0000000080050033
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651363] CR2: 0000000014583550
CR3: 0000003d4ac96000 CR4: 0000000000160670
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651364] Stack:
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651365] 0000000000000202
ffff883cda77fa98 0000000000000003 0000000000000006
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651368] 000000000000000a
ffff883cda77fb70 ffff883fff011ba0 ffff881fff991ba0
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651370] ffff883cda77fac8
ffffffff810ab035 ffff883cda77fbc8 ffffffff8111cc22
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651372] Call Trace:
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651375] [<ffffffff810ab035>]
wake_up_process+0x15/0x20
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651379] [<ffffffff8111cc22>]
stop_two_cpus+0x1b2/0x230
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651382] [<ffffffff8111c650>]
? cpu_stop_queue_work+0x80/0x80
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651384] [<ffffffff810b5d15>]
? dequeue_entity+0x455/0x8a0
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651386] [<ffffffff8111c650>]
? cpu_stop_queue_work+0x80/0x80
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651388] [<ffffffff810aaa70>]
? __migrate_swap_task.part.83+0x80/0x80
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651390] [<ffffffff810ab18e>]
migrate_swap+0xae/0x130
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651392] [<ffffffff810b4e44>]
task_numa_migrate+0x504/0x930
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651394] [<ffffffff810b52e9>]
numa_migrate_preferred+0x79/0x80
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651396] [<ffffffff810b9373>]
task_numa_fault+0x923/0xcd0
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651400] [<ffffffff8175e407>]
? tcp_recvmsg+0x6b7/0xbd0
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651404] [<ffffffff811da9be>]
? mpol_misplaced+0x14e/0x190
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651408] [<ffffffff811b7836>]
handle_pte_fault+0x5a6/0x1440
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651411] [<ffffffff816f6693>]
? sock_recvmsg+0x43/0x50
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651413] [<ffffffff811b9540>]
handle_mm_fault+0x250/0x540
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651417] [<ffffffff81069e1a>]
__do_page_fault+0x19a/0x430
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651419] [<ffffffff8106a0d2>]
do_page_fault+0x22/0x30
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651423] [<ffffffff8181c5a8>]
page_fault+0x28/0x30
** Affects: linux (Ubuntu)
Importance: Undecided
Status: Incomplete
** Tags: xenial
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1821259
Title:
Hard lockup in 2 CPUs due to deadlock in cpu_stoppers
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1821259/+subscriptions
--
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs