Re: [PATCH v2] KVM: PPC: Book3S HV: use smp_mb() when setting/clearing host_ipi flag
On Wed, 2019-09-11 at 22:31:55 UTC, Michael Roth wrote:
> On a 2-socket Power9 system with 32 cores/128 threads (SMT4) and 1TB
> of memory running the following guest configs:
...
> To handle both cases, this patch splits kvmppc_set_host_ipi() into
> separate set/clear functions, where we execute smp_mb() prior to
> setting host_ipi flag, and after clearing host_ipi flag. These
> functions pair with each other to synchronize the sender and receiver
> sides.
>
> With that change in place the above workload ran for 20 hours without
> triggering any lock-ups.
>
> Fixes: 755563bc79c7 ("powerpc/powernv: Fixes for hypervisor doorbell handling") # v4.0
> Cc: Michael Ellerman
> Cc: Paul Mackerras
> Cc: Nicholas Piggin
> Cc: kvm-...@vger.kernel.org
> Signed-off-by: Michael Roth

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/3a83f677a6eeff65751b29e3648d7c69c3be83f3

cheers
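The barrier placement described in the quoted patch text (a full barrier before setting the flag, and another after clearing it) can be sketched in user-space C11. This is an illustration only, not the kernel code from the patch: atomic_thread_fence(memory_order_seq_cst) stands in for smp_mb(), and struct cpu_state, full_barrier(), set_host_ipi(), clear_host_ipi(), send_ipi() and handle_ipi() are hypothetical names chosen for the example.

```c
#include <stdatomic.h>

/* Hypothetical per-CPU state; stands in for the real per-CPU host_ipi flag. */
struct cpu_state {
    atomic_int ipi_message; /* pending IPI message bits */
    atomic_int host_ipi;    /* 1 = a host IPI is pending for this CPU */
};

/* User-space stand-in (assumption) for the kernel's smp_mb(). */
static inline void full_barrier(void)
{
    atomic_thread_fence(memory_order_seq_cst);
}

/* Sender side: full barrier BEFORE setting the flag, so that earlier
 * message stores are ordered ahead of the flag store. */
static inline void set_host_ipi(struct cpu_state *cpu)
{
    full_barrier();
    atomic_store_explicit(&cpu->host_ipi, 1, memory_order_relaxed);
}

/* Receiver side: full barrier AFTER clearing the flag, so that the
 * clear is ordered ahead of the message processing that follows. */
static inline void clear_host_ipi(struct cpu_state *cpu)
{
    atomic_store_explicit(&cpu->host_ipi, 0, memory_order_relaxed);
    full_barrier();
}

/* Hypothetical sender: post a message bit, then raise the flag. */
static void send_ipi(struct cpu_state *cpu, int msg_bit)
{
    atomic_fetch_or_explicit(&cpu->ipi_message, 1 << msg_bit,
                             memory_order_relaxed);
    set_host_ipi(cpu);
}

/* Hypothetical receiver: if the flag is set, clear it and then consume
 * all pending message bits. Returns the bits consumed (0 if none). */
static int handle_ipi(struct cpu_state *cpu)
{
    if (!atomic_load_explicit(&cpu->host_ipi, memory_order_relaxed))
        return 0;
    clear_host_ipi(cpu);
    return atomic_exchange_explicit(&cpu->ipi_message, 0,
                                    memory_order_relaxed);
}
```

The two fences pair with each other: a receiver that observes the flag set by set_host_ipi() and then passes through the fence in clear_host_ipi() is guaranteed to see the message stored before the sender's fence.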
Re: [PATCH v2] KVM: PPC: Book3S HV: use smp_mb() when setting/clearing host_ipi flag
On Wed, Sep 11, 2019 at 05:31:55PM -0500, Michael Roth wrote:
> On a 2-socket Power9 system with 32 cores/128 threads (SMT4) and 1TB
> of memory running the following guest configs:
>
>   guest A:
>     - 224GB of memory
>     - 56 VCPUs (sockets=1,cores=28,threads=2), where:
>       VCPUs 0-1 are pinned to CPUs 0-3,
>       VCPUs 2-3 are pinned to CPUs 4-7,
>       ...
>       VCPUs 54-55 are pinned to CPUs 108-111
>
>   guest B:
>     - 4GB of memory
>     - 4 VCPUs (sockets=1,cores=4,threads=1)
>
> with the following workloads (with KSM and THP enabled in all):
>
>   guest A:
>     stress --cpu 40 --io 20 --vm 20 --vm-bytes 512M
>
>   guest B:
>     stress --cpu 4 --io 4 --vm 4 --vm-bytes 512M
>
>   host:
>     stress --cpu 4 --io 4 --vm 2 --vm-bytes 256M
>
> the below soft-lockup traces were observed after an hour or so and
> persisted until the host was reset (this was found to be reliably
> reproducible for this configuration, for kernels 4.15, 4.18, 5.0,
> and 5.3-rc5):
>
>   [ 1253.183290] rcu: INFO: rcu_sched self-detected stall on CPU
>   [ 1253.183319] rcu: 124-: (5250 ticks this GP) idle=10a/1/0x4002 softirq=5408/5408 fqs=1941
>   [ 1256.287426] watchdog: BUG: soft lockup - CPU#105 stuck for 23s! [CPU 52/KVM:19709]
>   [ 1264.075773] watchdog: BUG: soft lockup - CPU#24 stuck for 23s! [worker:19913]
>   [ 1264.079769] watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [worker:20331]
>   [ 1264.095770] watchdog: BUG: soft lockup - CPU#45 stuck for 23s! [worker:20338]
>   [ 1264.131773] watchdog: BUG: soft lockup - CPU#64 stuck for 23s! [avocado:19525]
>   [ 1280.408480] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
>   [ 1316.198012] rcu: INFO: rcu_sched self-detected stall on CPU
>   [ 1316.198032] rcu: 124-: (21003 ticks this GP) idle=10a/1/0x4002 softirq=5408/5408 fqs=8243
>   [ 1340.411024] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
>   [ 1379.212609] rcu: INFO: rcu_sched self-detected stall on CPU
>   [ 1379.212629] rcu: 124-: (36756 ticks this GP) idle=10a/1/0x4002 softirq=5408/5408 fqs=14714
>   [ 1404.413615] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
>   [ 1442.227095] rcu: INFO: rcu_sched self-detected stall on CPU
>   [ 1442.227115] rcu: 124-: (52509 ticks this GP) idle=10a/1/0x4002 softirq=5408/5408 fqs=21403
>   [ 1455.111787] INFO: task worker:19907 blocked for more than 120 seconds.
>   [ 1455.111822] Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
>   [ 1455.111833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   [ 1455.111884] INFO: task worker:19908 blocked for more than 120 seconds.
>   [ 1455.111905] Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
>   [ 1455.111925] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   [ 1455.111966] INFO: task worker:20328 blocked for more than 120 seconds.
>   [ 1455.111986] Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
>   [ 1455.111998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   [ 1455.112048] INFO: task worker:20330 blocked for more than 120 seconds.
>   [ 1455.112068] Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
>   [ 1455.112097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   [ 1455.112138] INFO: task worker:20332 blocked for more than 120 seconds.
>   [ 1455.112159] Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
>   [ 1455.112179] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   [ 1455.112210] INFO: task worker:20333 blocked for more than 120 seconds.
>   [ 1455.112231] Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
>   [ 1455.112242] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   [ 1455.112282] INFO: task worker:20335 blocked for more than 120 seconds.
>   [ 1455.112303] Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
>   [ 1455.112332] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   [ 1455.112372] INFO: task worker:20336 blocked for more than 120 seconds.
>   [ 1455.112392] Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
>
> CPUs 45, 24, and 124 are stuck on spin locks, likely held by
> CPUs 105 and 31.
>
> CPUs 105 and 31 are stuck in smp_call_function_many(), waiting on
> target CPU 42. For instance:
>
>   # CPU 105 registers (via xmon)
>   R00 = c020b20c R16 = 7d1bcd80
>   R01 = c0363eaa7970 R17 = 0001
>   R02 = c19b3a00 R18 = 006b
>   R03 = 002a R19 = 7d537d7aecf0
>   R04 = 002a R20 = 60e0
>   R05 = 002a R21 = 08010080
>   R06 = c0002073fb0caa08 R22 = 0d60
>   R07 = c19ddd78 R
[PATCH v2] KVM: PPC: Book3S HV: use smp_mb() when setting/clearing host_ipi flag
On a 2-socket Power9 system with 32 cores/128 threads (SMT4) and 1TB
of memory running the following guest configs:

  guest A:
    - 224GB of memory
    - 56 VCPUs (sockets=1,cores=28,threads=2), where:
      VCPUs 0-1 are pinned to CPUs 0-3,
      VCPUs 2-3 are pinned to CPUs 4-7,
      ...
      VCPUs 54-55 are pinned to CPUs 108-111

  guest B:
    - 4GB of memory
    - 4 VCPUs (sockets=1,cores=4,threads=1)

with the following workloads (with KSM and THP enabled in all):

  guest A:
    stress --cpu 40 --io 20 --vm 20 --vm-bytes 512M

  guest B:
    stress --cpu 4 --io 4 --vm 4 --vm-bytes 512M

  host:
    stress --cpu 4 --io 4 --vm 2 --vm-bytes 256M

the below soft-lockup traces were observed after an hour or so and
persisted until the host was reset (this was found to be reliably
reproducible for this configuration, for kernels 4.15, 4.18, 5.0,
and 5.3-rc5):

  [ 1253.183290] rcu: INFO: rcu_sched self-detected stall on CPU
  [ 1253.183319] rcu: 124-: (5250 ticks this GP) idle=10a/1/0x4002 softirq=5408/5408 fqs=1941
  [ 1256.287426] watchdog: BUG: soft lockup - CPU#105 stuck for 23s! [CPU 52/KVM:19709]
  [ 1264.075773] watchdog: BUG: soft lockup - CPU#24 stuck for 23s! [worker:19913]
  [ 1264.079769] watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [worker:20331]
  [ 1264.095770] watchdog: BUG: soft lockup - CPU#45 stuck for 23s! [worker:20338]
  [ 1264.131773] watchdog: BUG: soft lockup - CPU#64 stuck for 23s! [avocado:19525]
  [ 1280.408480] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
  [ 1316.198012] rcu: INFO: rcu_sched self-detected stall on CPU
  [ 1316.198032] rcu: 124-: (21003 ticks this GP) idle=10a/1/0x4002 softirq=5408/5408 fqs=8243
  [ 1340.411024] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
  [ 1379.212609] rcu: INFO: rcu_sched self-detected stall on CPU
  [ 1379.212629] rcu: 124-: (36756 ticks this GP) idle=10a/1/0x4002 softirq=5408/5408 fqs=14714
  [ 1404.413615] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
  [ 1442.227095] rcu: INFO: rcu_sched self-detected stall on CPU
  [ 1442.227115] rcu: 124-: (52509 ticks this GP) idle=10a/1/0x4002 softirq=5408/5408 fqs=21403
  [ 1455.111787] INFO: task worker:19907 blocked for more than 120 seconds.
  [ 1455.111822] Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
  [ 1455.111833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 1455.111884] INFO: task worker:19908 blocked for more than 120 seconds.
  [ 1455.111905] Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
  [ 1455.111925] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 1455.111966] INFO: task worker:20328 blocked for more than 120 seconds.
  [ 1455.111986] Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
  [ 1455.111998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 1455.112048] INFO: task worker:20330 blocked for more than 120 seconds.
  [ 1455.112068] Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
  [ 1455.112097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 1455.112138] INFO: task worker:20332 blocked for more than 120 seconds.
  [ 1455.112159] Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
  [ 1455.112179] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 1455.112210] INFO: task worker:20333 blocked for more than 120 seconds.
  [ 1455.112231] Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
  [ 1455.112242] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 1455.112282] INFO: task worker:20335 blocked for more than 120 seconds.
  [ 1455.112303] Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
  [ 1455.112332] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 1455.112372] INFO: task worker:20336 blocked for more than 120 seconds.
  [ 1455.112392] Tainted: G L5.3.0-rc5-mdr-vanilla+ #1

CPUs 45, 24, and 124 are stuck on spin locks, likely held by
CPUs 105 and 31.

CPUs 105 and 31 are stuck in smp_call_function_many(), waiting on
target CPU 42. For instance:

  # CPU 105 registers (via xmon)
  R00 = c020b20c R16 = 7d1bcd80
  R01 = c0363eaa7970 R17 = 0001
  R02 = c19b3a00 R18 = 006b
  R03 = 002a R19 = 7d537d7aecf0
  R04 = 002a R20 = 60e0
  R05 = 002a R21 = 08010080
  R06 = c0002073fb0caa08 R22 = 0d60
  R07 = c19ddd78 R23 = 0001
  R08 = 002a R24 = c147a700
  R09 = 0001 R25 = c0002073fb0ca908
  R10 = c08ffeb4e660 R26 =
  R11 = c0002073fb0ca900 R27 = c19e2464
  R12 = c0050790 R28 = c00812b0
  R