[Kernel-packages] [Bug 1821259] Re: Hard lockup in 2 CPUs due to deadlock in cpu_stoppers

2019-03-21 Thread Mauricio Faria de Oliveira
[X][PATCH 0/4] LP#1821259 Fix for deadlock in cpu_stopper
https://lists.ubuntu.com/archives/kernel-team/2019-March/099427.html

[B][PATCH 0/2] LP#1821259 (pending patches) Fix for deadlock in cpu_stopper
https://lists.ubuntu.com/archives/kernel-team/2019-March/099432.html

** Also affects: linux (Ubuntu Bionic)
   Importance: Undecided
   Status: New

** Also affects: linux (Ubuntu Xenial)
   Importance: Undecided
   Status: New

** No longer affects: linux (Ubuntu)

** Changed in: linux (Ubuntu Bionic)
   Status: New => Confirmed

** Changed in: linux (Ubuntu Xenial)
   Status: New => Confirmed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1821259

Title:
  Hard lockup in 2 CPUs due to deadlock in cpu_stoppers

Status in linux source package in Xenial:
  Confirmed
Status in linux source package in Bionic:
  Confirmed

Bug description:
  [Impact]

   * This problem hard locks up 2 CPUs in a deadlock, which in
     turn soft locks up other CPUs; the system becomes unusable.

   * It is relatively rare / difficult to hit because it is a
     corner case in scheduling/load balancing that needs precise
     timing with the CPU stopper code, and it requires an SMP
     system with _NUMA_ (but it can be hit with the synthetic
     test case attached to this LP bug).

   * Since SMP plus NUMA usually means _servers_, it is worth
     preventing these hard lockups and the reboots they cause.

   * The fix resolves the potential deadlock by moving one of
     the calls required for the deadlock out from under the
     locked code.

  [Test Case]

   * There's a synthetic test case that reproduces this problem
     (although without the stack traces; just a system hang),
     attached to this LP bug.

   * It uses kprobes, mdelay(), and CPU stopper calls to force
     the code to execute and the timing/locking condition to
     occur.

   * $ sudo insmod kmod-stopper.ko

     Some dmesg logging occurs, and the system either hangs or
     does not. See examples in the comments.
 
  [Regression Potential]

   * These are patches to kernel/stop_machine.c, and they change
     a bit how it works; however, there are no upstream fixes on
     top of these patches, and they are still the most recent
     commits in the 'git log --oneline -- kernel/stop_machine.c'
     output.

   * The patches have been verified with the synthetic test case
     and with 'stress-ng --class scheduler --sequential 0' (no
     regressions) on a guest with 2 CPUs and a physical system
     with 24 CPUs.

  [Other Info]
   
   * The patches are required on Xenial and later.
   * There are 4 patches for Xenial, and 2 patches pending for Bionic.
   * All patches are applied from Cosmic onwards.

  [Original Description]

  These 2 hard lockups appeared all of a sudden in the logs, and many
  soft lockups occurred after them as fallout.

  Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.477086] NMI watchdog: Watchdog detected hard LOCKUP on cpu 10
  Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.483800] Modules linked in: <...>
  Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484066] CPU: 10 PID: 58 Comm: migration/10 Not tainted 4.4.0-116-generic #140~14.04.1-Ubuntu
  Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484068] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 02/17/2017
  Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484070] task: 883ff2a76200 ti: 883ff211 task.ti: 883ff211
  Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484071] RIP: 0010:[]  [] native_queued_spin_lock_slowpath+0x160/0x170
  Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484079] RSP: :883ff2113c58  EFLAGS: 0002
  Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484080] RAX: 0101 RBX: 0086 RCX: 0001
  Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484081] RDX: 0101 RSI: 0001 RDI: 881fff991ba8
  Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484083] RBP: 883ff2113c58 R08: 0101 R09: 883ff082e200
  Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484084] R10: 2e04 R11: 2e04 R12: 881fff997c60
  Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484085] R13: 881fff991ba8 R14:  R15: 881fff997300
  Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484087] FS:  () GS:883fff00() knlGS:
  Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484088] CS:  0010 DS:  ES:  CR0: 80050033
  Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484090] CR2: 7f7caaa23020 CR3: 001f4674 CR4: 00160670
  Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484091] Stack:
  Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484092]  883ff2113c68 811870eb 883ff2113c80 81819907
  Nov 23 15:48:33 SYSTEM_NAME kernel

[Kernel-packages] [Bug 1821259] Re: Hard lockup in 2 CPUs due to deadlock in cpu_stoppers

2019-03-21 Thread Mauricio Faria de Oliveira
Since Bionic already has the fix commit applied,
the original kernel version doesn't hit the problem.


[Kernel-packages] [Bug 1821259] Re: Hard lockup in 2 CPUs due to deadlock in cpu_stoppers

2019-03-21 Thread Mauricio Faria de Oliveira
Test-case on Xenial:

$ ls -1d /sys/devices/system/cpu/cpu[0-9]*
/sys/devices/system/cpu/cpu0
/sys/devices/system/cpu/cpu1


Original:
---
$ uname -rv
4.4.0-144-generic #170-Ubuntu SMP Thu Mar 14 11:56:20 UTC 2019

$ sudo insmod kmod-stopper/kmod-stopper.ko
[   74.198379] mod_init() :: this cpu = 0x1, that cpu = 0x0
[   74.199613] mod_init() :: that_cpu_stopper_task = 88003d80e600, comm = migration/0
[   74.206194] kp2/stop_two_cpus() :: this cpu = 0x1, that cpu = 0x0
[   74.206196] do_nothing() :: this cpu = 0x0, that cpu = 0x1
[   74.206201] kp1/pick_next_task_fair() :: this cpu = 0x0, that cpu = 0x1
[   74.206203] kp1/pick_next_task_fair() :: before sleep (1000 msecs)
[   74.212759] kp2/stop_two_cpus() :: before sleep (500 msecs)
[   74.710138] kp2/stop_two_cpus() :: after  sleep (500 msecs)
[   75.198324] kp1/pick_next_task_fair() :: after  sleep (1000 msecs)
[   75.199814] kp1/pick_next_task_fair() :: stopping other cpu...


On the original kernel, the test case failed in only 2 out of 50+ runs.


Patched:
---

$ uname -rv
4.4.0-144-generic #170+test20190320b1 SMP Wed Mar 20 18:35:06 UTC 2019

$ sudo insmod kmod-stopper/kmod-stopper.ko
[   85.958527] mod_init() :: this cpu = 0x1, that cpu = 0x0
[   85.965876] mod_init() :: that_cpu_stopper_task = 88003d80e600, comm = migration/0
[   85.993446] kp2/stop_two_cpus() :: this cpu = 0x1, that cpu = 0x0
[   85.993471] do_nothing() :: this cpu = 0x0, that cpu = 0x1
[   85.993477] kp1/pick_next_task_fair() :: this cpu = 0x0, that cpu = 0x1
[   85.993480] kp1/pick_next_task_fair() :: before sleep (1000 msecs)
[   86.019469] kp2/stop_two_cpus() :: before sleep (500 msecs)
[   86.521688] kp2/stop_two_cpus() :: after  sleep (500 msecs)
[   86.987662] kp1/pick_next_task_fair() :: after  sleep (1000 msecs)
[   86.989427] kp1/pick_next_task_fair() :: stopping other cpu...
[   86.991109] do_nothing() :: this cpu = 0x1, that cpu = 0x0
[   86.992615] do_nothing() :: this cpu = 0x1, that cpu = 0x0


It passes every time (50+ tests).


[Kernel-packages] [Bug 1821259] Re: Hard lockup in 2 CPUs due to deadlock in cpu_stoppers

2019-03-21 Thread Mauricio Faria de Oliveira
Both the Xenial and Bionic original/patched kernels were tested with
the stress-ng scheduler class, and no regressions were observed.

$ stress-ng --version
stress-ng, version 0.09.56 (gcc 8.3, x86_64 Linux 4.15.0-47-generic) 💻🔥

$ sudo stress-ng --class scheduler --sequential 0

Xenial, original:
$ uname -rv
4.4.0-144-generic #170-Ubuntu SMP Thu Mar 14 11:56:20 UTC 2019

Xenial, patched:
$ uname -rv
4.4.0-144-generic #170+test20190320b1 SMP Wed Mar 20 18:35:06 UTC 2019

Bionic, original:
$ uname -rv
4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019

Bionic, patched:
$ uname -rv
4.15.0-47-generic #50+test20190320b1 SMP Wed Mar 20 20:08:03 UTC 2019



[Kernel-packages] [Bug 1821259] Re: Hard lockup in 2 CPUs due to deadlock in cpu_stoppers

2019-03-21 Thread Mauricio Faria de Oliveira
Test-case (kmod-stopper.c)
--------------------------

$ sudo apt-get -y install gcc make libelf-dev linux-headers-$(uname -r)

$ touch Makefile # fake it, and use this make line:
$ make -C /lib/modules/$(uname -r)/build M=$(pwd) obj-m=kmod-stopper.o modules

$ echo 9 | sudo tee /proc/sys/kernel/printk

$ sudo insmod kmod-stopper.ko



$ sudo rmmod kmod-stopper
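Instead of the empty-Makefile trick above, a conventional out-of-tree kbuild Makefile works as well. This is a sketch under the assumption that the module source is kmod-stopper.c in the current directory; the KDIR default matches the linux-headers package installed above:

```make
# Minimal out-of-tree kbuild Makefile for the reproducer module.
obj-m := kmod-stopper.o

KDIR ?= /lib/modules/$(shell uname -r)/build

all:
	$(MAKE) -C $(KDIR) M=$(CURDIR) modules

clean:
	$(MAKE) -C $(KDIR) M=$(CURDIR) clean
```

With this in place, a plain 'make' replaces the explicit make -C invocation.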


** Attachment added: "kmod-stopper.c"
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1821259/+attachment/5248313/+files/kmod-stopper.c


[Kernel-packages] [Bug 1821259] Re: Hard lockup in 2 CPUs due to deadlock in cpu_stoppers

2019-03-21 Thread Mauricio Faria de Oliveira
Analysis
--------
The 1st hard lockup is harder to extract interesting data from, as the
registers holding the cpu-number variables have apparently been
clobbered by more recent calls in the spinlock path.

Looking at the 2nd hard lockup:

addr2line + code shows us that try_to_wake_up() at line 1997 is indeed
looping with IRQs disabled (taken at line 1939), thus a hard lockup:

$ addr2line -pifae ddeb-116.140/usr/lib/debug/boot/vmlinux-4.4.0-116-generic 0x810aacb6
0x810aacb6: try_to_wake_up at /build/linux-lts-xenial-ozsla7/linux-lts-xenial-4.4.0/kernel/sched/core.c:1997

1926 static int
1927 try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
1928 {
...
1939         raw_spin_lock_irqsave(&p->pi_lock, flags);
...
1993         /*
1994          * If the owning (remote) cpu is still in the middle of schedule() with
1995          * this task as prev, wait until its done referencing the task.
1996          */
1997         while (p->on_cpu)
1998                 cpu_relax();
...
2027         raw_spin_unlock_irqrestore(&p->pi_lock, flags);
2028
2029         return success;
2030 }

The objdump disassembly of try_to_wake_up() in vmlinux at the RIP
instruction address (810aacb6) shows a while loop that just checks for
a non-zero 'p->on_cpu' and calls cpu_relax() (which translates to the
'pause' instruction):

810aacb1:   f3 90                   pause
810aacb3:   8b 43 28                mov    0x28(%rbx),%eax
810aacb6:   85 c0                   test   %eax,%eax
810aacb8:   75 f7                   jne    810aacb1


So, it checks the value at the pointer in RBX plus offset 0x28, which,
according to the 'pahole' tool, is indeed the 'on_cpu' field:

$ pahole --hex -C task_struct ddeb-116.140/usr/lib/debug/boot/vmlinux-4.4.0-116-generic | grep on_cpu
        int                        on_cpu;              /*  0x28   0x4 */

So, the task_struct pointer is in RBX, which is:

RBX: 883ff2a76200

And that matches the other hard locked up task on CPU 10 (see its
'task:' field).
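The RBX + 0x28 dereference can be cross-checked with offsetof(). The struct below is a hypothetical stand-in whose field layout merely reproduces the 0x28 offset reported by pahole; it is not the real task_struct:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical prefix of a 4.4-era task_struct, for illustration only.
 * With 8-byte pointer alignment, on_cpu lands at offset 0x28. */
struct task_prefix {
    long state;              /* 0x00 */
    void *stack;             /* 0x08 */
    int usage;               /* 0x10 (atomic_t in the kernel) */
    unsigned int flags;      /* 0x14 */
    unsigned int ptrace;     /* 0x18 */
    void *wake_entry;        /* 0x20 (llist_node in the kernel); 8-byte aligned */
    int on_cpu;              /* 0x28 */
};
```

Any layout with the same alignment rules gives the same result; the point is only that 'mov 0x28(%rbx),%eax' reads the field at offset 0x28 of the task pointer held in RBX.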

Per the stack trace in CPU 10, the identical timestamps of the two hard
lockup messages, and the fact that both stack traces are cpu_stopper
related, it does look like CPU 10 is waiting on the spinlock of one of
the 2 cpu stoppers held by CPU 6, which is exactly the scenario in the
suggested patch.

The problem/fix has been verified with a synthetic test-case (attached).


commit 0b26351b910fb8fe6a056f8a1bbccabe50c0e19f
Author: Peter Zijlstra 
Date:   Fri Apr 20 11:50:05 2018 +0200

stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock

Matt reported the following deadlock:

    CPU0                            CPU1

    schedule(.prev=migrate/0)
      pick_next_task()              ...
        idle_balance()              migrate_swap()
          active_balance()            stop_two_cpus()
                                        spin_lock(stopper0->lock)
                                        spin_lock(stopper1->lock)
                                        ttwu(migrate/0)
                                          smp_cond_load_acquire()
                                            -- waits for schedule()
            stop_one_cpu(1)
              spin_lock(stopper1->lock) -- waits for stopper lock

Fix this deadlock by taking the wakeups out from under stopper->lock.
This allows the active_balance() to queue the stop work and finish the
context switch, which in turn allows the wakeup from migrate_swap() to
observe the context and complete the wakeup.
<...>


The stop_two_cpus() call can only happen on a NUMA system, per its caller chain:
  stop_two_cpus() <- migrate_swap() <- task_numa_migrate() <- numa_migrate_preferred() <- [task_numa_placement()] <- task_numa_fault()
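The essence of the fix (taking the wakeup out from under stopper->lock) can be modeled in userspace. This is a sketch with illustrative names, not the kernel code; the kernel implements the same deferral with its wake_q mechanism:

```c
#include <assert.h>

/* Toy model of the fix: wakeups are queued while the stopper lock is
 * held and issued only after it is dropped. All names are illustrative. */

static int stopper_lock_held;          /* stands in for stopper->lock */
static int woken[4];                   /* which "tasks" got woken */

struct wake_q { int pending[4]; int n; };

static void wake_q_add(struct wake_q *q, int task)
{
    q->pending[q->n++] = task;         /* defer: no wakeup happens yet */
}

static void wake_up_q(struct wake_q *q)
{
    assert(!stopper_lock_held);        /* must run with the lock dropped */
    for (int i = 0; i < q->n; i++)
        woken[q->pending[i]] = 1;      /* stands in for wake_up_process() */
    q->n = 0;
}

/* cpu_stop_queue_work() after the fix: enqueue under the lock,
 * wake the stopper thread only after releasing it. */
static void queue_stop_work(int stopper_task)
{
    struct wake_q q = { .n = 0 };

    stopper_lock_held = 1;             /* spin_lock(stopper->lock) */
    /* ... add work to the stopper's work list ... */
    wake_q_add(&q, stopper_task);      /* wakeup deferred, not taken here */
    stopper_lock_held = 0;             /* spin_unlock(stopper->lock) */

    wake_up_q(&q);                     /* wakeup happens lock-free */
}
```

Because the wakeup no longer runs under the stopper lock, the active_balance() side can finish its context switch, which lets the ttwu() from the migrate_swap() side complete instead of spinning on p->on_cpu.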
