Re: [PATCH v2] KVM: PPC: Book3S HV: use smp_mb() when setting/clearing host_ipi flag

2019-09-25 Thread Michael Ellerman
On Wed, 2019-09-11 at 22:31:55 UTC, Michael Roth wrote:
> On a 2-socket Power9 system with 32 cores/128 threads (SMT4) and 1TB
> of memory running the following guest configs:
...
> To handle both cases, this patch splits kvmppc_set_host_ipi() into
> separate set/clear functions, where we execute smp_mb() prior to
> setting host_ipi flag, and after clearing host_ipi flag. These
> functions pair with each other to synchronize the sender and receiver
> sides.
> 
> With that change in place the above workload ran for 20 hours without
> triggering any lock-ups.
> 
> Fixes: 755563bc79c7 ("powerpc/powernv: Fixes for hypervisor doorbell handling") # v4.0
> Cc: Michael Ellerman 
> Cc: Paul Mackerras 
> Cc: Nicholas Piggin 
> Cc: kvm-...@vger.kernel.org
> Signed-off-by: Michael Roth 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/3a83f677a6eeff65751b29e3648d7c69c3be83f3

cheers
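
For readers following the barrier placement described in the quoted
changelog, here is a minimal userspace sketch of the set/clear pairing.
It is not the kernel code. It assumes C11 atomics, with
atomic_thread_fence(memory_order_seq_cst) standing in for smp_mb().
The names host_ipi, ipi_message, full_barrier(), set_host_ipi(),
clear_host_ipi(), send_ipi() and handle_ipi() are hypothetical, and a
single global flag replaces the per-CPU one.

```c
#include <stdatomic.h>

/* Hypothetical stand-ins for illustration; not the kernel's definitions. */
static int ipi_message;      /* payload published before the flag is raised */
static atomic_int host_ipi;  /* per-CPU in the real code; one global here */

/* Assumed userspace analogue of smp_mb(): a full memory barrier. */
static inline void full_barrier(void)
{
	atomic_thread_fence(memory_order_seq_cst);
}

/* Sender side: barrier first, then set the flag, so earlier stores
 * (the message) are ordered before the flag becomes visible. */
static void set_host_ipi(void)
{
	full_barrier();
	atomic_store_explicit(&host_ipi, 1, memory_order_relaxed);
}

/* Receiver side: clear the flag, then barrier, so the clear is ordered
 * before the receiver goes on to look at pending messages. */
static void clear_host_ipi(void)
{
	atomic_store_explicit(&host_ipi, 0, memory_order_relaxed);
	full_barrier();
}

/* Publish a message, then raise the flag. */
static void send_ipi(int msg)
{
	ipi_message = msg;
	set_host_ipi();
}

/* Return the pending message, or -1 if the flag is not set. */
static int handle_ipi(void)
{
	if (!atomic_load_explicit(&host_ipi, memory_order_relaxed))
		return -1;
	clear_host_ipi();
	return ipi_message;
}
```

Calling send_ipi(42) and then handle_ipi() on one thread returns 42 and
leaves the flag clear. That only exercises the call order. The ordering
itself matters when sender and receiver run on different CPUs.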


Re: [PATCH v2] KVM: PPC: Book3S HV: use smp_mb() when setting/clearing host_ipi flag

2019-09-23 Thread Paul Mackerras
On Wed, Sep 11, 2019 at 05:31:55PM -0500, Michael Roth wrote:
> On a 2-socket Power9 system with 32 cores/128 threads (SMT4) and 1TB
> of memory running the following guest configs:
> 
>   guest A:
>     - 224GB of memory
>     - 56 VCPUs (sockets=1,cores=28,threads=2), where:
>       VCPUs 0-1 are pinned to CPUs 0-3,
>       VCPUs 2-3 are pinned to CPUs 4-7,
>       ...
>       VCPUs 54-55 are pinned to CPUs 108-111
> 
>   guest B:
>     - 4GB of memory
>     - 4 VCPUs (sockets=1,cores=4,threads=1)
> 
> with the following workloads (with KSM and THP enabled in all):
> 
>   guest A:
>     stress --cpu 40 --io 20 --vm 20 --vm-bytes 512M
> 
>   guest B:
>     stress --cpu 4 --io 4 --vm 4 --vm-bytes 512M
> 
>   host:
>     stress --cpu 4 --io 4 --vm 2 --vm-bytes 256M
> 
> the below soft-lockup traces were observed after an hour or so and
> persisted until the host was reset (this was found to be reliably
> reproducible for this configuration, for kernels 4.15, 4.18, 5.0,
> and 5.3-rc5):
> 
>   [ 1253.183290] rcu: INFO: rcu_sched self-detected stall on CPU
>   [ 1253.183319] rcu: 124-: (5250 ticks this GP) idle=10a/1/0x4002 softirq=5408/5408 fqs=1941
>   [ 1256.287426] watchdog: BUG: soft lockup - CPU#105 stuck for 23s! [CPU 52/KVM:19709]
>   [ 1264.075773] watchdog: BUG: soft lockup - CPU#24 stuck for 23s! [worker:19913]
>   [ 1264.079769] watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [worker:20331]
>   [ 1264.095770] watchdog: BUG: soft lockup - CPU#45 stuck for 23s! [worker:20338]
>   [ 1264.131773] watchdog: BUG: soft lockup - CPU#64 stuck for 23s! [avocado:19525]
>   [ 1280.408480] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
>   [ 1316.198012] rcu: INFO: rcu_sched self-detected stall on CPU
>   [ 1316.198032] rcu: 124-: (21003 ticks this GP) idle=10a/1/0x4002 softirq=5408/5408 fqs=8243
>   [ 1340.411024] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
>   [ 1379.212609] rcu: INFO: rcu_sched self-detected stall on CPU
>   [ 1379.212629] rcu: 124-: (36756 ticks this GP) idle=10a/1/0x4002 softirq=5408/5408 fqs=14714
>   [ 1404.413615] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
>   [ 1442.227095] rcu: INFO: rcu_sched self-detected stall on CPU
>   [ 1442.227115] rcu: 124-: (52509 ticks this GP) idle=10a/1/0x4002 softirq=5408/5408 fqs=21403
>   [ 1455.111787] INFO: task worker:19907 blocked for more than 120 seconds.
>   [ 1455.111822]   Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
>   [ 1455.111833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   [ 1455.111884] INFO: task worker:19908 blocked for more than 120 seconds.
>   [ 1455.111905]   Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
>   [ 1455.111925] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   [ 1455.111966] INFO: task worker:20328 blocked for more than 120 seconds.
>   [ 1455.111986]   Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
>   [ 1455.111998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   [ 1455.112048] INFO: task worker:20330 blocked for more than 120 seconds.
>   [ 1455.112068]   Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
>   [ 1455.112097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   [ 1455.112138] INFO: task worker:20332 blocked for more than 120 seconds.
>   [ 1455.112159]   Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
>   [ 1455.112179] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   [ 1455.112210] INFO: task worker:20333 blocked for more than 120 seconds.
>   [ 1455.112231]   Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
>   [ 1455.112242] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   [ 1455.112282] INFO: task worker:20335 blocked for more than 120 seconds.
>   [ 1455.112303]   Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
>   [ 1455.112332] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   [ 1455.112372] INFO: task worker:20336 blocked for more than 120 seconds.
>   [ 1455.112392]   Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
> 
> CPUs 45, 24, and 124 are stuck on spin locks, likely held by
> CPUs 105 and 31.
> 
> CPUs 105 and 31 are stuck in smp_call_function_many(), waiting on
> target CPU 42. For instance:
> 
>   # CPU 105 registers (via xmon)
>   R00 = c020b20c   R16 = 7d1bcd80
>   R01 = c0363eaa7970   R17 = 0001
>   R02 = c19b3a00   R18 = 006b
>   R03 = 002a   R19 = 7d537d7aecf0
>   R04 = 002a   R20 = 60e0
>   R05 = 002a   R21 = 08010080
>   R06 = c0002073fb0caa08   R22 = 0d60
>   R07 = c19ddd78   R

[PATCH v2] KVM: PPC: Book3S HV: use smp_mb() when setting/clearing host_ipi flag

2019-09-11 Thread Michael Roth
On a 2-socket Power9 system with 32 cores/128 threads (SMT4) and 1TB
of memory running the following guest configs:

  guest A:
    - 224GB of memory
    - 56 VCPUs (sockets=1,cores=28,threads=2), where:
      VCPUs 0-1 are pinned to CPUs 0-3,
      VCPUs 2-3 are pinned to CPUs 4-7,
      ...
      VCPUs 54-55 are pinned to CPUs 108-111

  guest B:
    - 4GB of memory
    - 4 VCPUs (sockets=1,cores=4,threads=1)

with the following workloads (with KSM and THP enabled in all):

  guest A:
    stress --cpu 40 --io 20 --vm 20 --vm-bytes 512M

  guest B:
    stress --cpu 4 --io 4 --vm 4 --vm-bytes 512M

  host:
    stress --cpu 4 --io 4 --vm 2 --vm-bytes 256M

the below soft-lockup traces were observed after an hour or so and
persisted until the host was reset (this was found to be reliably
reproducible for this configuration, for kernels 4.15, 4.18, 5.0,
and 5.3-rc5):

  [ 1253.183290] rcu: INFO: rcu_sched self-detected stall on CPU
  [ 1253.183319] rcu: 124-: (5250 ticks this GP) idle=10a/1/0x4002 softirq=5408/5408 fqs=1941
  [ 1256.287426] watchdog: BUG: soft lockup - CPU#105 stuck for 23s! [CPU 52/KVM:19709]
  [ 1264.075773] watchdog: BUG: soft lockup - CPU#24 stuck for 23s! [worker:19913]
  [ 1264.079769] watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [worker:20331]
  [ 1264.095770] watchdog: BUG: soft lockup - CPU#45 stuck for 23s! [worker:20338]
  [ 1264.131773] watchdog: BUG: soft lockup - CPU#64 stuck for 23s! [avocado:19525]
  [ 1280.408480] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
  [ 1316.198012] rcu: INFO: rcu_sched self-detected stall on CPU
  [ 1316.198032] rcu: 124-: (21003 ticks this GP) idle=10a/1/0x4002 softirq=5408/5408 fqs=8243
  [ 1340.411024] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
  [ 1379.212609] rcu: INFO: rcu_sched self-detected stall on CPU
  [ 1379.212629] rcu: 124-: (36756 ticks this GP) idle=10a/1/0x4002 softirq=5408/5408 fqs=14714
  [ 1404.413615] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
  [ 1442.227095] rcu: INFO: rcu_sched self-detected stall on CPU
  [ 1442.227115] rcu: 124-: (52509 ticks this GP) idle=10a/1/0x4002 softirq=5408/5408 fqs=21403
  [ 1455.111787] INFO: task worker:19907 blocked for more than 120 seconds.
  [ 1455.111822]   Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
  [ 1455.111833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 1455.111884] INFO: task worker:19908 blocked for more than 120 seconds.
  [ 1455.111905]   Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
  [ 1455.111925] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 1455.111966] INFO: task worker:20328 blocked for more than 120 seconds.
  [ 1455.111986]   Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
  [ 1455.111998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 1455.112048] INFO: task worker:20330 blocked for more than 120 seconds.
  [ 1455.112068]   Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
  [ 1455.112097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 1455.112138] INFO: task worker:20332 blocked for more than 120 seconds.
  [ 1455.112159]   Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
  [ 1455.112179] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 1455.112210] INFO: task worker:20333 blocked for more than 120 seconds.
  [ 1455.112231]   Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
  [ 1455.112242] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 1455.112282] INFO: task worker:20335 blocked for more than 120 seconds.
  [ 1455.112303]   Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
  [ 1455.112332] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 1455.112372] INFO: task worker:20336 blocked for more than 120 seconds.
  [ 1455.112392]   Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1

CPUs 45, 24, and 124 are stuck on spin locks, likely held by
CPUs 105 and 31.

CPUs 105 and 31 are stuck in smp_call_function_many(), waiting on
target CPU 42. For instance:

  # CPU 105 registers (via xmon)
  R00 = c020b20c   R16 = 7d1bcd80
  R01 = c0363eaa7970   R17 = 0001
  R02 = c19b3a00   R18 = 006b
  R03 = 002a   R19 = 7d537d7aecf0
  R04 = 002a   R20 = 60e0
  R05 = 002a   R21 = 08010080
  R06 = c0002073fb0caa08   R22 = 0d60
  R07 = c19ddd78   R23 = 0001
  R08 = 002a   R24 = c147a700
  R09 = 0001   R25 = c0002073fb0ca908
  R10 = c08ffeb4e660   R26 = 
  R11 = c0002073fb0ca900   R27 = c19e2464
  R12 = c0050790   R28 = c00812b0
  R