Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2021-11-04 Thread Nicholas Piggin
Excerpts from Michal Suchánek's message of November 3, 2021 1:48 am:
> On Thu, Jan 14, 2021 at 11:08:03PM +1000, Nicholas Piggin wrote:
>> Excerpts from Michal Suchánek's message of January 14, 2021 10:40 pm:
>> > On Mon, Oct 19, 2020 at 02:50:51PM +1000, Nicholas Piggin wrote:
>> >> Excerpts from Nicholas Piggin's message of October 19, 2020 11:00 am:
>> >> > Excerpts from Michal Suchánek's message of October 17, 2020 6:14 am:
>> >> >> On Mon, Sep 07, 2020 at 11:13:47PM +1000, Nicholas Piggin wrote:
>> >> >>> Excerpts from Michael Ellerman's message of August 31, 2020 8:50 pm:
>> >> >>> > Michal Suchánek  writes:
>> >> >>> >> On Mon, Aug 31, 2020 at 11:14:18AM +1000, Nicholas Piggin wrote:
>> >> >>> >>> Excerpts from Michal Suchánek's message of August 31, 2020 6:11 
>> >> >>> >>> am:
>> >> >>> >>> > Hello,
>> >> >>> >>> > 
>> >> >>> >>> > on POWER8 KVM hosts lock up since commit 10d91611f426 
>> >> >>> >>> > ("powerpc/64s:
>> >> >>> >>> > Reimplement book3s idle code in C").
>> >> >>> >>> > 
>> >> >>> >>> > The symptom is host locking up completely after some hours of 
>> >> >>> >>> > KVM
>> >> >>> >>> > workload with messages like
>> >> >>> >>> > 
>> >> >>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't 
>> >> >>> >>> > grab cpu 47
>> >> >>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't 
>> >> >>> >>> > grab cpu 71
>> >> >>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't 
>> >> >>> >>> > grab cpu 47
>> >> >>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't 
>> >> >>> >>> > grab cpu 71
>> >> >>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't 
>> >> >>> >>> > grab cpu 47
>> >> >>> >>> > 
>> >> >>> >>> > printed before the host locks up.
>> >> >>> >>> > 
>> >> >>> >>> > The machines run sandboxed builds which is a mixed workload 
>> >> >>> >>> > resulting in
>> >> >>> >>> > IO/single core/multiple core load over time and there are 
>> >> >>> >>> > periods of no
>> >> >>> >>> > activity and no VMs running as well. The VMs are short-lived so VM
>> >> >>> >>> > setup/teardown is somewhat exercised as well.
>> >> >>> >>> > 
>> >> >>> >>> > POWER9 with the new guest entry fast path does not seem to be 
>> >> >>> >>> > affected.
>> >> >>> >>> > 
>> >> >>> >>> > Reverted the patch and the followup idle fixes on top of 5.2.14 
>> >> >>> >>> > and
>> >> >>> >>> > re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore 
>> >> >>> >>> > IAMR
>> >> >>> >>> > after idle") which gives same idle code as 5.1.16 and the 
>> >> >>> >>> > kernel seems
>> >> >>> >>> > stable.
>> >> >>> >>> > 
>> >> >>> >>> > Config is attached.
>> >> >>> >>> > 
>> >> >>> >>> > I cannot easily revert this commit, especially if I want to use 
>> >> >>> >>> > the same
>> >> >>> >>> > kernel on POWER8 and POWER9 - many of the POWER9 fixes are 
>> >> >>> >>> > applicable
>> >> >>> >>> > only to the new idle code.
>> >> >>> >>> > 
>> >> >>> >>> > Any idea what can be the problem?
>> >> >>> >>> 
>> >> >>> >>> So hwthread_state is never getting back to HWTHREAD_IN_IDLE on
>> >> >>> >>> those threads. I wonder what they are doing. POWER8 doesn't have 
>> >> >>> >>> a good
>> >> >>> >>> NMI IPI and I don't know if it supports pdbg dumping registers 
>> >> >>> >>> from the
>> >> >>> >>> BMC unfortunately.
>> >> >>> >>
>> >> >>> >> It may be possible to set up fadump with a later kernel version 
>> >> >>> >> that
>> >> >>> >> supports it on powernv and dump the whole kernel.
>> >> >>> > 
>> >> >>> > Your firmware won't support it AFAIK.
>> >> >>> > 
>> >> >>> > You could try kdump, but if we have CPUs stuck in KVM then there's a
>> >> >>> > good chance it won't work :/
>> >> >>> 
>> >> >>> I haven't had any luck yet reproducing this still. Testing with sub 
>> >> >>> cores of various different combinations, etc. I'll keep trying though.
>> >> >> 
>> >> >> Hello,
>> >> >> 
>> >> >> I tried running some KVM guests to simulate the workload and what I get
>> >> >> is guests failing to start with an RCU stall. Tried both 5.3 and 5.9
>> >> >> kernel and qemu 4.2.1 and 5.1.0
>> >> >> 
>> >> >> To start some guests I run
>> >> >> 
>> >> >> for i in $(seq 0 9) ; do /opt/qemu/bin/qemu-system-ppc64 -m 2048 
>> >> >> -accel kvm -smp 8 -kernel /boot/vmlinux -initrd /boot/initrd 
>> >> >> -nodefaults -nographic -serial mon:telnet::444$i,server,wait & done
>> >> >> 
>> >> >> To simulate some workload I run
>> >> >> 
>> >> >> xz -zc9T0 < /dev/zero > /dev/null &
>> >> >> while true; do
>> >> >> killall -STOP xz; sleep 1; killall -CONT xz; sleep 1;
>> >> >> done &
>> >> >> 
>> >> >> on the host and add a job that executes this to the ramdisk. However, 
>> >> >> most
>> >> >> guests never get to the point where the job is executed.
>> >> >> 
>> >> >> Any idea what might be the problem?
>> >> > 
>> >> > I would say try without pv queued spin locks (but if the same thing is 
>> >> > happening with 5.3 then it must be something else I guess). 
>> >> > 
>> >> > 

Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2021-11-02 Thread Michal Suchánek
On Thu, Jan 14, 2021 at 11:08:03PM +1000, Nicholas Piggin wrote:
> Excerpts from Michal Suchánek's message of January 14, 2021 10:40 pm:
> > On Mon, Oct 19, 2020 at 02:50:51PM +1000, Nicholas Piggin wrote:
> >> Excerpts from Nicholas Piggin's message of October 19, 2020 11:00 am:
> >> > Excerpts from Michal Suchánek's message of October 17, 2020 6:14 am:
> >> >> On Mon, Sep 07, 2020 at 11:13:47PM +1000, Nicholas Piggin wrote:
> >> >>> Excerpts from Michael Ellerman's message of August 31, 2020 8:50 pm:
> >> >>> > Michal Suchánek  writes:
> >> >>> >> On Mon, Aug 31, 2020 at 11:14:18AM +1000, Nicholas Piggin wrote:
> >> >>> >>> Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am:
> >> >>> >>> > Hello,
> >> >>> >>> > 
> >> >>> >>> > on POWER8 KVM hosts lock up since commit 10d91611f426 
> >> >>> >>> > ("powerpc/64s:
> >> >>> >>> > Reimplement book3s idle code in C").
> >> >>> >>> > 
> >> >>> >>> > The symptom is host locking up completely after some hours of KVM
> >> >>> >>> > workload with messages like
> >> >>> >>> > 
> >> >>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't 
> >> >>> >>> > grab cpu 47
> >> >>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't 
> >> >>> >>> > grab cpu 71
> >> >>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't 
> >> >>> >>> > grab cpu 47
> >> >>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't 
> >> >>> >>> > grab cpu 71
> >> >>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't 
> >> >>> >>> > grab cpu 47
> >> >>> >>> > 
> >> >>> >>> > printed before the host locks up.
> >> >>> >>> > 
> >> >>> >>> > The machines run sandboxed builds which is a mixed workload 
> >> >>> >>> > resulting in
> >> >>> >>> > IO/single core/multiple core load over time and there are periods 
> >> >>> >>> > of no
> >> >>> >>> > activity and no VMs running as well. The VMs are short-lived so VM
> >> >>> >>> > setup/teardown is somewhat exercised as well.
> >> >>> >>> > 
> >> >>> >>> > POWER9 with the new guest entry fast path does not seem to be 
> >> >>> >>> > affected.
> >> >>> >>> > 
> >> >>> >>> > Reverted the patch and the followup idle fixes on top of 5.2.14 
> >> >>> >>> > and
> >> >>> >>> > re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore 
> >> >>> >>> > IAMR
> >> >>> >>> > after idle") which gives same idle code as 5.1.16 and the kernel 
> >> >>> >>> > seems
> >> >>> >>> > stable.
> >> >>> >>> > 
> >> >>> >>> > Config is attached.
> >> >>> >>> > 
> >> >>> >>> > I cannot easily revert this commit, especially if I want to use 
> >> >>> >>> > the same
> >> >>> >>> > kernel on POWER8 and POWER9 - many of the POWER9 fixes are 
> >> >>> >>> > applicable
> >> >>> >>> > only to the new idle code.
> >> >>> >>> > 
> >> >>> >>> > Any idea what can be the problem?
> >> >>> >>> 
> >> >>> >>> So hwthread_state is never getting back to HWTHREAD_IN_IDLE on
> >> >>> >>> those threads. I wonder what they are doing. POWER8 doesn't have a 
> >> >>> >>> good
> >> >>> >>> NMI IPI and I don't know if it supports pdbg dumping registers 
> >> >>> >>> from the
> >> >>> >>> BMC unfortunately.
> >> >>> >>
> >> >>> >> It may be possible to set up fadump with a later kernel version that
> >> >>> >> supports it on powernv and dump the whole kernel.
> >> >>> > 
> >> >>> > Your firmware won't support it AFAIK.
> >> >>> > 
> >> >>> > You could try kdump, but if we have CPUs stuck in KVM then there's a
> >> >>> > good chance it won't work :/
> >> >>> 
> >> >>> I haven't had any luck yet reproducing this still. Testing with sub 
> >> >>> cores of various different combinations, etc. I'll keep trying though.
> >> >> 
> >> >> Hello,
> >> >> 
> >> >> I tried running some KVM guests to simulate the workload and what I get
> >> >> is guests failing to start with an RCU stall. Tried both 5.3 and 5.9
> >> >> kernel and qemu 4.2.1 and 5.1.0
> >> >> 
> >> >> To start some guests I run
> >> >> 
> >> >> for i in $(seq 0 9) ; do /opt/qemu/bin/qemu-system-ppc64 -m 2048 -accel 
> >> >> kvm -smp 8 -kernel /boot/vmlinux -initrd /boot/initrd -nodefaults 
> >> >> -nographic -serial mon:telnet::444$i,server,wait & done
> >> >> 
> >> >> To simulate some workload I run
> >> >> 
> >> >> xz -zc9T0 < /dev/zero > /dev/null &
> >> >> while true; do
> >> >> killall -STOP xz; sleep 1; killall -CONT xz; sleep 1;
> >> >> done &
> >> >> 
> >> >> on the host and add a job that executes this to the ramdisk. However, 
> >> >> most
> >> >> guests never get to the point where the job is executed.
> >> >> 
> >> >> Any idea what might be the problem?
> >> > 
> >> > I would say try without pv queued spin locks (but if the same thing is 
> >> > happening with 5.3 then it must be something else I guess). 
> >> > 
> >> > I'll try to test a similar setup on a POWER8 here.
> >> 
> >> Couldn't reproduce the guest hang, they seem to run fine even with 
> >> queued spinlocks. Might have a different .config.
> >> 
> >> I might have got a lockup in 

Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2021-01-14 Thread Nicholas Piggin
Excerpts from Michal Suchánek's message of January 14, 2021 10:40 pm:
> On Mon, Oct 19, 2020 at 02:50:51PM +1000, Nicholas Piggin wrote:
>> Excerpts from Nicholas Piggin's message of October 19, 2020 11:00 am:
>> > Excerpts from Michal Suchánek's message of October 17, 2020 6:14 am:
>> >> On Mon, Sep 07, 2020 at 11:13:47PM +1000, Nicholas Piggin wrote:
>> >>> Excerpts from Michael Ellerman's message of August 31, 2020 8:50 pm:
>> >>> > Michal Suchánek  writes:
>> >>> >> On Mon, Aug 31, 2020 at 11:14:18AM +1000, Nicholas Piggin wrote:
>> >>> >>> Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am:
>> >>> >>> > Hello,
>> >>> >>> > 
>> >>> >>> > on POWER8 KVM hosts lock up since commit 10d91611f426 
>> >>> >>> > ("powerpc/64s:
>> >>> >>> > Reimplement book3s idle code in C").
>> >>> >>> > 
>> >>> >>> > The symptom is host locking up completely after some hours of KVM
>> >>> >>> > workload with messages like
>> >>> >>> > 
>> >>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab 
>> >>> >>> > cpu 47
>> >>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab 
>> >>> >>> > cpu 71
>> >>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab 
>> >>> >>> > cpu 47
>> >>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab 
>> >>> >>> > cpu 71
>> >>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab 
>> >>> >>> > cpu 47
>> >>> >>> > 
>> >>> >>> > printed before the host locks up.
>> >>> >>> > 
>> >>> >>> > The machines run sandboxed builds which is a mixed workload 
>> >>> >>> > resulting in
>> >>> >>> > IO/single core/multiple core load over time and there are periods 
>> >>> >>> > of no
>> >>> >>> > activity and no VMs running as well. The VMs are short-lived so VM
>> >>> >>> > setup/teardown is somewhat exercised as well.
>> >>> >>> > 
>> >>> >>> > POWER9 with the new guest entry fast path does not seem to be 
>> >>> >>> > affected.
>> >>> >>> > 
>> >>> >>> > Reverted the patch and the followup idle fixes on top of 5.2.14 and
>> >>> >>> > re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR
>> >>> >>> > after idle") which gives same idle code as 5.1.16 and the kernel 
>> >>> >>> > seems
>> >>> >>> > stable.
>> >>> >>> > 
>> >>> >>> > Config is attached.
>> >>> >>> > 
>> >>> >>> > I cannot easily revert this commit, especially if I want to use 
>> >>> >>> > the same
>> >>> >>> > kernel on POWER8 and POWER9 - many of the POWER9 fixes are 
>> >>> >>> > applicable
>> >>> >>> > only to the new idle code.
>> >>> >>> > 
>> >>> >>> > Any idea what can be the problem?
>> >>> >>> 
>> >>> >>> So hwthread_state is never getting back to HWTHREAD_IN_IDLE on
>> >>> >>> those threads. I wonder what they are doing. POWER8 doesn't have a 
>> >>> >>> good
>> >>> >>> NMI IPI and I don't know if it supports pdbg dumping registers from 
>> >>> >>> the
>> >>> >>> BMC unfortunately.
>> >>> >>
>> >>> >> It may be possible to set up fadump with a later kernel version that
>> >>> >> supports it on powernv and dump the whole kernel.
>> >>> > 
>> >>> > Your firmware won't support it AFAIK.
>> >>> > 
>> >>> > You could try kdump, but if we have CPUs stuck in KVM then there's a
>> >>> > good chance it won't work :/
>> >>> 
>> >>> I haven't had any luck yet reproducing this still. Testing with sub 
>> >>> cores of various different combinations, etc. I'll keep trying though.
>> >> 
>> >> Hello,
>> >> 
>> >> I tried running some KVM guests to simulate the workload and what I get
>> >> is guests failing to start with an RCU stall. Tried both 5.3 and 5.9
>> >> kernel and qemu 4.2.1 and 5.1.0
>> >> 
>> >> To start some guests I run
>> >> 
>> >> for i in $(seq 0 9) ; do /opt/qemu/bin/qemu-system-ppc64 -m 2048 -accel 
>> >> kvm -smp 8 -kernel /boot/vmlinux -initrd /boot/initrd -nodefaults 
>> >> -nographic -serial mon:telnet::444$i,server,wait & done
>> >> 
>> >> To simulate some workload I run
>> >> 
>> >> xz -zc9T0 < /dev/zero > /dev/null &
>> >> while true; do
>> >> killall -STOP xz; sleep 1; killall -CONT xz; sleep 1;
>> >> done &
>> >> 
>> >> on the host and add a job that executes this to the ramdisk. However, most
>> >> guests never get to the point where the job is executed.
>> >> 
>> >> Any idea what might be the problem?
>> > 
>> > I would say try without pv queued spin locks (but if the same thing is 
>> > happening with 5.3 then it must be something else I guess). 
>> > 
>> > I'll try to test a similar setup on a POWER8 here.
>> 
>> Couldn't reproduce the guest hang, they seem to run fine even with 
>> queued spinlocks. Might have a different .config.
>> 
>> I might have got a lockup in the host (although different symptoms than 
>> the original report). I'll look into that a bit further.
> 
> Hello,
> 
> any progress on this?

No progress, I still wasn't able to reproduce, and it fell off the 
radar sorry.

I expect hwthread_state must be getting corrupted somewhere or a
secondary thread getting stuck 

Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2021-01-14 Thread Michal Suchánek
On Mon, Oct 19, 2020 at 02:50:51PM +1000, Nicholas Piggin wrote:
> Excerpts from Nicholas Piggin's message of October 19, 2020 11:00 am:
> > Excerpts from Michal Suchánek's message of October 17, 2020 6:14 am:
> >> On Mon, Sep 07, 2020 at 11:13:47PM +1000, Nicholas Piggin wrote:
> >>> Excerpts from Michael Ellerman's message of August 31, 2020 8:50 pm:
> >>> > Michal Suchánek  writes:
> >>> >> On Mon, Aug 31, 2020 at 11:14:18AM +1000, Nicholas Piggin wrote:
> >>> >>> Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am:
> >>> >>> > Hello,
> >>> >>> > 
> >>> >>> > on POWER8 KVM hosts lock up since commit 10d91611f426 ("powerpc/64s:
> >>> >>> > Reimplement book3s idle code in C").
> >>> >>> > 
> >>> >>> > The symptom is host locking up completely after some hours of KVM
> >>> >>> > workload with messages like
> >>> >>> > 
> >>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab 
> >>> >>> > cpu 47
> >>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab 
> >>> >>> > cpu 71
> >>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab 
> >>> >>> > cpu 47
> >>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab 
> >>> >>> > cpu 71
> >>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab 
> >>> >>> > cpu 47
> >>> >>> > 
> >>> >>> > printed before the host locks up.
> >>> >>> > 
> >>> >>> > The machines run sandboxed builds which is a mixed workload 
> >>> >>> > resulting in
> >>> >>> > IO/single core/multiple core load over time and there are periods of 
> >>> >>> > no
> >>> >>> > activity and no VMs running as well. The VMs are short-lived so VM
> >>> >>> > setup/teardown is somewhat exercised as well.
> >>> >>> > 
> >>> >>> > POWER9 with the new guest entry fast path does not seem to be 
> >>> >>> > affected.
> >>> >>> > 
> >>> >>> > Reverted the patch and the followup idle fixes on top of 5.2.14 and
> >>> >>> > re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR
> >>> >>> > after idle") which gives same idle code as 5.1.16 and the kernel 
> >>> >>> > seems
> >>> >>> > stable.
> >>> >>> > 
> >>> >>> > Config is attached.
> >>> >>> > 
> >>> >>> > I cannot easily revert this commit, especially if I want to use the 
> >>> >>> > same
> >>> >>> > kernel on POWER8 and POWER9 - many of the POWER9 fixes are 
> >>> >>> > applicable
> >>> >>> > only to the new idle code.
> >>> >>> > 
> >>> >>> > Any idea what can be the problem?
> >>> >>> 
> >>> >>> So hwthread_state is never getting back to HWTHREAD_IN_IDLE on
> >>> >>> those threads. I wonder what they are doing. POWER8 doesn't have a 
> >>> >>> good
> >>> >>> NMI IPI and I don't know if it supports pdbg dumping registers from 
> >>> >>> the
> >>> >>> BMC unfortunately.
> >>> >>
> >>> >> It may be possible to set up fadump with a later kernel version that
> >>> >> supports it on powernv and dump the whole kernel.
> >>> > 
> >>> > Your firmware won't support it AFAIK.
> >>> > 
> >>> > You could try kdump, but if we have CPUs stuck in KVM then there's a
> >>> > good chance it won't work :/
> >>> 
> >>> I haven't had any luck yet reproducing this still. Testing with sub 
> >>> cores of various different combinations, etc. I'll keep trying though.
> >> 
> >> Hello,
> >> 
> >> I tried running some KVM guests to simulate the workload and what I get
> >> is guests failing to start with an RCU stall. Tried both 5.3 and 5.9
> >> kernel and qemu 4.2.1 and 5.1.0
> >> 
> >> To start some guests I run
> >> 
> >> for i in $(seq 0 9) ; do /opt/qemu/bin/qemu-system-ppc64 -m 2048 -accel 
> >> kvm -smp 8 -kernel /boot/vmlinux -initrd /boot/initrd -nodefaults 
> >> -nographic -serial mon:telnet::444$i,server,wait & done
> >> 
> >> To simulate some workload I run
> >> 
> >> xz -zc9T0 < /dev/zero > /dev/null &
> >> while true; do
> >> killall -STOP xz; sleep 1; killall -CONT xz; sleep 1;
> >> done &
> >> 
> >> on the host and add a job that executes this to the ramdisk. However, most
> >> guests never get to the point where the job is executed.
> >> 
> >> Any idea what might be the problem?
> > 
> > I would say try without pv queued spin locks (but if the same thing is 
> > happening with 5.3 then it must be something else I guess). 
> > 
> > I'll try to test a similar setup on a POWER8 here.
> 
> Couldn't reproduce the guest hang, they seem to run fine even with 
> queued spinlocks. Might have a different .config.
> 
> I might have got a lockup in the host (although different symptoms than 
> the original report). I'll look into that a bit further.

Hello,

any progress on this?

I considered reinstating the old assembly code for POWER[78] but even
the way it's called has changed slightly.

Thanks

Michal


Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2020-10-18 Thread Nicholas Piggin
Excerpts from Nicholas Piggin's message of October 19, 2020 11:00 am:
> Excerpts from Michal Suchánek's message of October 17, 2020 6:14 am:
>> On Mon, Sep 07, 2020 at 11:13:47PM +1000, Nicholas Piggin wrote:
>>> Excerpts from Michael Ellerman's message of August 31, 2020 8:50 pm:
>>> > Michal Suchánek  writes:
>>> >> On Mon, Aug 31, 2020 at 11:14:18AM +1000, Nicholas Piggin wrote:
>>> >>> Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am:
>>> >>> > Hello,
>>> >>> > 
>>> >>> > on POWER8 KVM hosts lock up since commit 10d91611f426 ("powerpc/64s:
>>> >>> > Reimplement book3s idle code in C").
>>> >>> > 
>>> >>> > The symptom is host locking up completely after some hours of KVM
>>> >>> > workload with messages like
>>> >>> > 
>>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab 
>>> >>> > cpu 47
>>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab 
>>> >>> > cpu 71
>>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab 
>>> >>> > cpu 47
>>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab 
>>> >>> > cpu 71
>>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab 
>>> >>> > cpu 47
>>> >>> > 
>>> >>> > printed before the host locks up.
>>> >>> > 
>>> >>> > The machines run sandboxed builds which is a mixed workload resulting 
>>> >>> > in
>>> >>> > IO/single core/multiple core load over time and there are periods of no
>>> >>> > activity and no VMs running as well. The VMs are short-lived so VM
>>> >>> > setup/teardown is somewhat exercised as well.
>>> >>> > 
>>> >>> > POWER9 with the new guest entry fast path does not seem to be 
>>> >>> > affected.
>>> >>> > 
>>> >>> > Reverted the patch and the followup idle fixes on top of 5.2.14 and
>>> >>> > re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR
>>> >>> > after idle") which gives same idle code as 5.1.16 and the kernel seems
>>> >>> > stable.
>>> >>> > 
>>> >>> > Config is attached.
>>> >>> > 
>>> >>> > I cannot easily revert this commit, especially if I want to use the 
>>> >>> > same
>>> >>> > kernel on POWER8 and POWER9 - many of the POWER9 fixes are applicable
>>> >>> > only to the new idle code.
>>> >>> > 
>>> >>> > Any idea what can be the problem?
>>> >>> 
>>> >>> So hwthread_state is never getting back to HWTHREAD_IN_IDLE on
>>> >>> those threads. I wonder what they are doing. POWER8 doesn't have a good
>>> >>> NMI IPI and I don't know if it supports pdbg dumping registers from the
>>> >>> BMC unfortunately.
>>> >>
>>> >> It may be possible to set up fadump with a later kernel version that
>>> >> supports it on powernv and dump the whole kernel.
>>> > 
>>> > Your firmware won't support it AFAIK.
>>> > 
>>> > You could try kdump, but if we have CPUs stuck in KVM then there's a
>>> > good chance it won't work :/
>>> 
>>> I haven't had any luck yet reproducing this still. Testing with sub 
>>> cores of various different combinations, etc. I'll keep trying though.
>> 
>> Hello,
>> 
>> I tried running some KVM guests to simulate the workload and what I get
>> is guests failing to start with an RCU stall. Tried both 5.3 and 5.9
>> kernel and qemu 4.2.1 and 5.1.0
>> 
>> To start some guests I run
>> 
>> for i in $(seq 0 9) ; do /opt/qemu/bin/qemu-system-ppc64 -m 2048 -accel kvm 
>> -smp 8 -kernel /boot/vmlinux -initrd /boot/initrd -nodefaults -nographic 
>> -serial mon:telnet::444$i,server,wait & done
>> 
>> To simulate some workload I run
>> 
>> xz -zc9T0 < /dev/zero > /dev/null &
>> while true; do
>> killall -STOP xz; sleep 1; killall -CONT xz; sleep 1;
>> done &
>> 
>> on the host and add a job that executes this to the ramdisk. However, most
>> guests never get to the point where the job is executed.
>> 
>> Any idea what might be the problem?
> 
> I would say try without pv queued spin locks (but if the same thing is 
> happening with 5.3 then it must be something else I guess). 
> 
> I'll try to test a similar setup on a POWER8 here.

Couldn't reproduce the guest hang, they seem to run fine even with 
queued spinlocks. Might have a different .config.

I might have got a lockup in the host (although different symptoms than 
the original report). I'll look into that a bit further.

Thanks,
Nick
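
To compare a lockup against the original report, it can help to check
whether the host log contains the "KVM: couldn't grab cpu N" messages and
which CPUs they name. The following is a minimal sketch, assuming
syslog-style lines like those quoted in the original report, one message
per line; the function name count_grab_failures is hypothetical and not
part of any existing tool.

```shell
# Hypothetical helper: tally "KVM: couldn't grab cpu N" messages per CPU.
# Reads kernel log text on stdin and prints "cpu <N>: <count>",
# sorted by CPU number. Lines that do not match are ignored.
count_grab_failures() {
    sed -n "s/.*KVM: couldn't grab cpu \([0-9][0-9]*\).*/\1/p" |
        sort -n | uniq -c |
        awk '{ printf "cpu %s: %s\n", $2, $1 }'
}
```

It could be fed from the kernel log, for example with
"dmesg | count_grab_failures". If the same few CPUs dominate the output,
that would be consistent with particular threads never returning to the
idle state, although it does not show why.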


Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2020-10-18 Thread Nicholas Piggin
Excerpts from Michal Suchánek's message of October 17, 2020 6:14 am:
> On Mon, Sep 07, 2020 at 11:13:47PM +1000, Nicholas Piggin wrote:
>> Excerpts from Michael Ellerman's message of August 31, 2020 8:50 pm:
>> > Michal Suchánek  writes:
>> >> On Mon, Aug 31, 2020 at 11:14:18AM +1000, Nicholas Piggin wrote:
>> >>> Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am:
>> >>> > Hello,
>> >>> > 
>> >>> > on POWER8 KVM hosts lock up since commit 10d91611f426 ("powerpc/64s:
>> >>> > Reimplement book3s idle code in C").
>> >>> > 
>> >>> > The symptom is host locking up completely after some hours of KVM
>> >>> > workload with messages like
>> >>> > 
>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 
>> >>> > 47
>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 
>> >>> > 71
>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 
>> >>> > 47
>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 
>> >>> > 71
>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 
>> >>> > 47
>> >>> > 
>> >>> > printed before the host locks up.
>> >>> > 
>> >>> > The machines run sandboxed builds which is a mixed workload resulting 
>> >>> > in
>> >>> > IO/single core/multiple core load over time and there are periods of no
>> >>> > activity and no VMs running as well. The VMs are short-lived so VM
>> >>> > setup/teardown is somewhat exercised as well.
>> >>> > 
>> >>> > POWER9 with the new guest entry fast path does not seem to be affected.
>> >>> > 
>> >>> > Reverted the patch and the followup idle fixes on top of 5.2.14 and
>> >>> > re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR
>> >>> > after idle") which gives same idle code as 5.1.16 and the kernel seems
>> >>> > stable.
>> >>> > 
>> >>> > Config is attached.
>> >>> > 
>> >>> > I cannot easily revert this commit, especially if I want to use the 
>> >>> > same
>> >>> > kernel on POWER8 and POWER9 - many of the POWER9 fixes are applicable
>> >>> > only to the new idle code.
>> >>> > 
>> >>> > Any idea what can be the problem?
>> >>> 
>> >>> So hwthread_state is never getting back to HWTHREAD_IN_IDLE on
>> >>> those threads. I wonder what they are doing. POWER8 doesn't have a good
>> >>> NMI IPI and I don't know if it supports pdbg dumping registers from the
>> >>> BMC unfortunately.
>> >>
>> >> It may be possible to set up fadump with a later kernel version that
>> >> supports it on powernv and dump the whole kernel.
>> > 
>> > Your firmware won't support it AFAIK.
>> > 
>> > You could try kdump, but if we have CPUs stuck in KVM then there's a
>> > good chance it won't work :/
>> 
>> I haven't had any luck yet reproducing this still. Testing with sub 
>> cores of various different combinations, etc. I'll keep trying though.
> 
> Hello,
> 
> I tried running some KVM guests to simulate the workload and what I get
> is guests failing to start with an RCU stall. Tried both 5.3 and 5.9
> kernel and qemu 4.2.1 and 5.1.0
> 
> To start some guests I run
> 
> for i in $(seq 0 9) ; do /opt/qemu/bin/qemu-system-ppc64 -m 2048 -accel kvm 
> -smp 8 -kernel /boot/vmlinux -initrd /boot/initrd -nodefaults -nographic 
> -serial mon:telnet::444$i,server,wait & done
> 
> To simulate some workload I run
> 
> xz -zc9T0 < /dev/zero > /dev/null &
> while true; do
> killall -STOP xz; sleep 1; killall -CONT xz; sleep 1;
> done &
> 
> on the host and add a job that executes this to the ramdisk. However, most
> guests never get to the point where the job is executed.
> 
> Any idea what might be the problem?

I would say try without pv queued spin locks (but if the same thing is 
happening with 5.3 then it must be something else I guess). 

I'll try to test a similar setup on a POWER8 here.

Thanks,
Nick

> 
> In the past I was able to boot guests quite reliably.
> 
> This is boot log of one of the VMs
> 
> Trying ::1...
> Connected to localhost.
> Escape character is '^]'.
> 
> 
> SLOF **
> QEMU Starting
>  Build Date = Jul 17 2020 11:15:24
>  FW Version = git-e18ddad8516ff2cf
>  Press "s" to enter Open Firmware.
> 
> Populating /vdevice methods
> Populating /vdevice/vty@7100
> Populating /vdevice/nvram@7101
> Populating /pci@8002000
> No NVRAM common partition, re-initializing...
> Scanning USB 
> Using default console: /vdevice/vty@7100
> Detected RAM kernel at 40 (27c8620 bytes) 
>  
>   Welcome to Open Firmware
> 
>   Copyright (c) 2004, 2017 IBM Corporation All rights reserved.
>   This program and the accompanying materials are made available
>   under the terms of the BSD License available at
>   http://www.opensource.org/licenses/bsd-license.php
> 
> Booting from memory...
> OF stdout device is: /vdevice/vty@7100
> Preparing to boot Linux version 5.9.0-1.g11733e1-default (geeko@buildhost) 
> (gcc (SUSE Linux) 10.2.1 20200825 

Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2020-10-16 Thread Michal Suchánek
On Mon, Sep 07, 2020 at 11:13:47PM +1000, Nicholas Piggin wrote:
> Excerpts from Michael Ellerman's message of August 31, 2020 8:50 pm:
> > Michal Suchánek  writes:
> >> On Mon, Aug 31, 2020 at 11:14:18AM +1000, Nicholas Piggin wrote:
> >>> Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am:
> >>> > Hello,
> >>> > 
> >>> > on POWER8 KVM hosts lock up since commit 10d91611f426 ("powerpc/64s:
> >>> > Reimplement book3s idle code in C").
> >>> > 
> >>> > The symptom is host locking up completely after some hours of KVM
> >>> > workload with messages like
> >>> > 
> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 
> >>> > 47
> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 
> >>> > 71
> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 
> >>> > 47
> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 
> >>> > 71
> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 
> >>> > 47
> >>> > 
> >>> > printed before the host locks up.
> >>> > 
> >>> > The machines run sandboxed builds which is a mixed workload resulting in
> >>> > IO/single core/multiple core load over time and there are periods of no
> >>> > activity and no VMs running as well. The VMs are short-lived so VM
> >>> > setup/teardown is somewhat exercised as well.
> >>> > 
> >>> > POWER9 with the new guest entry fast path does not seem to be affected.
> >>> > 
> >>> > Reverted the patch and the followup idle fixes on top of 5.2.14 and
> >>> > re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR
> >>> > after idle") which gives same idle code as 5.1.16 and the kernel seems
> >>> > stable.
> >>> > 
> >>> > Config is attached.
> >>> > 
> >>> > I cannot easily revert this commit, especially if I want to use the same
> >>> > kernel on POWER8 and POWER9 - many of the POWER9 fixes are applicable
> >>> > only to the new idle code.
> >>> > 
> >>> > Any idea what can be the problem?
> >>> 
> >>> So hwthread_state is never getting back to HWTHREAD_IN_IDLE on
> >>> those threads. I wonder what they are doing. POWER8 doesn't have a good
> >>> NMI IPI and I don't know if it supports pdbg dumping registers from the
> >>> BMC unfortunately.
> >>
> >> It may be possible to set up fadump with a later kernel version that
> >> supports it on powernv and dump the whole kernel.
> > 
> > Your firmware won't support it AFAIK.
> > 
> > You could try kdump, but if we have CPUs stuck in KVM then there's a
> > good chance it won't work :/
> 
> I haven't had any luck yet reproducing this still. Testing with sub 
> cores of various different combinations, etc. I'll keep trying though.

Hello,

I tried running some KVM guests to simulate the workload, and what I get
is guests failing to start with an RCU stall. I tried both the 5.3 and 5.9
kernels and QEMU 4.2.1 and 5.1.0.

To start some guests I run

for i in $(seq 0 9) ; do /opt/qemu/bin/qemu-system-ppc64 -m 2048 -accel kvm \
  -smp 8 -kernel /boot/vmlinux -initrd /boot/initrd -nodefaults -nographic \
  -serial mon:telnet::444$i,server,wait & done

To simulate some workload I run

xz -zc9T0 < /dev/zero > /dev/null &
while true; do
killall -STOP xz; sleep 1; killall -CONT xz; sleep 1;
done &

on the host and add a job that executes this to the ramdisk. However, most
guests never get to the point where the job is executed.
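
For anyone who wants to script the guest start loop above, here is a minimal Python sketch of the same invocation. The function names `qemu_argv` and `start_guests` are made up for illustration; the binary path, kernel and initrd paths, and options are simply copied from the shell loop and would need adjusting for another host.

```python
import subprocess


def qemu_argv(index, qemu="/opt/qemu/bin/qemu-system-ppc64"):
    """Build the argument list for guest number `index`, mirroring the shell loop."""
    if not 0 <= index <= 9:
        # The shell loop appends a single digit to "444", so stay in that range.
        raise ValueError("index must be between 0 and 9")
    return [
        qemu,
        "-m", "2048",
        "-accel", "kvm",
        "-smp", "8",
        "-kernel", "/boot/vmlinux",
        "-initrd", "/boot/initrd",
        "-nodefaults",
        "-nographic",
        # Each guest gets its own telnet port: 4440, 4441, ...
        "-serial", f"mon:telnet::444{index},server,wait",
    ]


def start_guests(count=10):
    """Start `count` guests in the background and return their Popen handles."""
    return [subprocess.Popen(qemu_argv(i)) for i in range(count)]
```

Note that with `server,wait` each QEMU process waits for a telnet client to connect to its port before the guest starts running.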

Any idea what might be the problem?

In the past I was able to boot guests quite reliably.

This is the boot log of one of the VMs:

Trying ::1...
Connected to localhost.
Escape character is '^]'.


SLOF **
QEMU Starting
 Build Date = Jul 17 2020 11:15:24
 FW Version = git-e18ddad8516ff2cf
 Press "s" to enter Open Firmware.

Populating /vdevice methods
Populating /vdevice/vty@7100
Populating /vdevice/nvram@7101
Populating /pci@8002000
No NVRAM common partition, re-initializing...
Scanning USB 
Using default console: /vdevice/vty@7100
Detected RAM kernel at 40 (27c8620 bytes) 
 
  Welcome to Open Firmware

  Copyright (c) 2004, 2017 IBM Corporation All rights reserved.
  This program and the accompanying materials are made available
  under the terms of the BSD License available at
  http://www.opensource.org/licenses/bsd-license.php

Booting from memory...
OF stdout device is: /vdevice/vty@7100
Preparing to boot Linux version 5.9.0-1.g11733e1-default (geeko@buildhost) (gcc (SUSE Linux) 10.2.1 20200825 [revision c0746a1beb1ba073c7981eb09f55b3d993b32e5c], GNU ld (GNU Binutils; openSUSE Tumbleweed) 2.34.0.20200325-1) #1 SMP Sun Oct 11 22:20:46 UTC 2020 (11733e1)
Detected machine type: 0101
command line:  
Max number of cores passed to firmware: 2048 (NR_CPUS = 2048)
Calling ibm,client-architecture-support... done
memory layout at init:
  memory_limit :  (16 MB aligned)
  alloc_bottom : 0381
  alloc_top: 3000
  

Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2020-09-07 Thread Michal Suchánek
On Mon, Sep 07, 2020 at 11:13:47PM +1000, Nicholas Piggin wrote:
> Excerpts from Michael Ellerman's message of August 31, 2020 8:50 pm:
> > Michal Suchánek  writes:
> >> On Mon, Aug 31, 2020 at 11:14:18AM +1000, Nicholas Piggin wrote:
> >>> Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am:
> >>> > Hello,
> >>> > 
> >>> > on POWER8 KVM hosts lock up since commit 10d91611f426 ("powerpc/64s:
> >>> > Reimplement book3s idle code in C").
> >>> > 
> >>> > The symptom is host locking up completely after some hours of KVM
> >>> > workload with messages like
> >>> > 
> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> >>> > 
> >>> > printed before the host locks up.
> >>> > 
> >>> > The machines run sandboxed builds, which is a mixed workload resulting in
> >>> > IO/single core/multiple core load over time, and there are periods of no
> >>> > activity and no VMs running as well. The VMs are short-lived, so VM
> >>> > setup/teardown is somewhat exercised as well.
> >>> > 
> >>> > POWER9 with the new guest entry fast path does not seem to be affected.
> >>> > 
> >>> > Reverted the patch and the followup idle fixes on top of 5.2.14 and
> >>> > re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR
> >>> > after idle") which gives same idle code as 5.1.16 and the kernel seems
> >>> > stable.
> >>> > 
> >>> > Config is attached.
> >>> > 
> >>> > I cannot easily revert this commit, especially if I want to use the same
> >>> > kernel on POWER8 and POWER9 - many of the POWER9 fixes are applicable
> >>> > only to the new idle code.
> >>> > 
> >>> > Any idea what can be the problem?
> >>> 
> >>> So hwthread_state is never getting back to HWTHREAD_IN_IDLE on
> >>> those threads. I wonder what they are doing. POWER8 doesn't have a good
> >>> NMI IPI and I don't know if it supports pdbg dumping registers from the
> >>> BMC unfortunately.
> >>
> >> It may be possible to set up fadump with a later kernel version that
> >> supports it on powernv and dump the whole kernel.
> > 
> > Your firmware won't support it AFAIK.
> > 
> > You could try kdump, but if we have CPUs stuck in KVM then there's a
> > good chance it won't work :/
> 
> I haven't had any luck yet reproducing this still. Testing with sub 
> cores of various different combinations, etc. I'll keep trying though.
> 
> I don't know if there's much we can add to debug it. Can we run pdbg
> on the BMCs on these things?

I suppose it depends on the machine type?

Thanks

Michal


Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2020-09-07 Thread Nicholas Piggin
Excerpts from Michael Ellerman's message of August 31, 2020 8:50 pm:
> Michal Suchánek  writes:
>> On Mon, Aug 31, 2020 at 11:14:18AM +1000, Nicholas Piggin wrote:
>>> Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am:
>>> > Hello,
>>> > 
>>> > on POWER8 KVM hosts lock up since commit 10d91611f426 ("powerpc/64s:
>>> > Reimplement book3s idle code in C").
>>> > 
>>> > The symptom is host locking up completely after some hours of KVM
>>> > workload with messages like
>>> > 
>>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
>>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
>>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
>>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
>>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
>>> > 
>>> > printed before the host locks up.
>>> > 
>>> > The machines run sandboxed builds, which is a mixed workload resulting in
>>> > IO/single core/multiple core load over time, and there are periods of no
>>> > activity and no VMs running as well. The VMs are short-lived, so VM
>>> > setup/teardown is somewhat exercised as well.
>>> > 
>>> > POWER9 with the new guest entry fast path does not seem to be affected.
>>> > 
>>> > Reverted the patch and the followup idle fixes on top of 5.2.14 and
>>> > re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR
>>> > after idle") which gives same idle code as 5.1.16 and the kernel seems
>>> > stable.
>>> > 
>>> > Config is attached.
>>> > 
>>> > I cannot easily revert this commit, especially if I want to use the same
>>> > kernel on POWER8 and POWER9 - many of the POWER9 fixes are applicable
>>> > only to the new idle code.
>>> > 
>>> > Any idea what can be the problem?
>>> 
>>> So hwthread_state is never getting back to HWTHREAD_IN_IDLE on
>>> those threads. I wonder what they are doing. POWER8 doesn't have a good
>>> NMI IPI and I don't know if it supports pdbg dumping registers from the
>>> BMC unfortunately.
>>
>> It may be possible to set up fadump with a later kernel version that
>> supports it on powernv and dump the whole kernel.
> 
> Your firmware won't support it AFAIK.
> 
> You could try kdump, but if we have CPUs stuck in KVM then there's a
> good chance it won't work :/

I still haven't had any luck reproducing this. I've been testing with
various subcore combinations, etc. I'll keep trying though.

I don't know if there's much we can add to debug it. Can we run pdbg
on the BMCs on these things?

Thanks,
Nick


Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2020-08-31 Thread Ruediger Oertel
Am 31.08.20 um 14:58 schrieb Michael Ellerman:
[...]
>> The machines are running in OPAL/PowerNV mode, with "ppc64_cpu --smt=off".
>> The number of VMs varies across the machines:
>> obs-power8-01: 18 instances, "-smp 16,threads=8"
>> obs-power8-02: 20 instances, "-smp 8,threads=8"
>> obs-power8-03: 30 instances, "-smp 8,threads=8"
>> obs-power8-04: 20 instances, "-smp 8,threads=8"
> 
> Can you send us the output of:
> 
> # grep . /sys/module/kvm_hv/parameters/*

of course, the current values are:

/sys/module/kvm_hv/parameters/dynamic_mt_modes:6
/sys/module/kvm_hv/parameters/h_ipi_redirect:1
/sys/module/kvm_hv/parameters/indep_threads_mode:Y
/sys/module/kvm_hv/parameters/kvm_irq_bypass:1
/sys/module/kvm_hv/parameters/nested:Y
/sys/module/kvm_hv/parameters/one_vm_per_core:N
/sys/module/kvm_hv/parameters/target_smt_mode:0

(actually identical on all 5 above)
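
To compare these values across hosts without eyeballing the `grep` output, something like the following Python sketch could be used. The helper names `read_module_params` and `diff_params` are hypothetical; the only thing taken from the thread is the sysfs directory.

```python
from pathlib import Path


def read_module_params(param_dir="/sys/module/kvm_hv/parameters"):
    """Return {parameter name: value} for every readable file in param_dir."""
    params = {}
    for path in sorted(Path(param_dir).iterdir()):
        if not path.is_file():
            continue
        try:
            params[path.name] = path.read_text().strip()
        except OSError:
            # Skip parameters that cannot be read by the current user.
            continue
    return params


def diff_params(a, b):
    """Return {name: (value_in_a, value_in_b)} for parameters that differ."""
    names = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in names if a.get(k) != b.get(k)}
```

Running `read_module_params()` on each host and passing two of the results to `diff_params` returns an empty dict when the settings are identical.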

-- 
with kind regards (mit freundlichem Grinsen),
  Ruediger Oertel (r...@suse.com,r...@suse.de,bugfin...@t-online.de)
Do-Not-Accept-Binary-Blobs.Ever.From-Anyone.
Key fingerprint   =   17DC 6553 86A7 384B 53C5  CA5C 3CE4 F2E7 23F2 B417
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg,
  Germany, (HRB 36809, AG Nürnberg), Geschäftsführer: Felix Imendörffer





Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2020-08-31 Thread Michael Ellerman
Ruediger Oertel  writes:
> Am 31.08.20 um 03:14 schrieb Nicholas Piggin:
>> Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am:
>>> Hello,
>>>
>>> on POWER8 KVM hosts lock up since commit 10d91611f426 ("powerpc/64s:
>>> Reimplement book3s idle code in C").
>>>
>>> The symptom is host locking up completely after some hours of KVM
>>> workload with messages like
>>>
>>> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
>>> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
>>> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
>>> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
>>> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
>>>
>>> printed before the host locks up.
>>>
>>> The machines run sandboxed builds, which is a mixed workload resulting in
>>> IO/single core/multiple core load over time, and there are periods of no
>>> activity and no VMs running as well. The VMs are short-lived, so VM
>>> setup/teardown is somewhat exercised as well.
>>>
>>> POWER9 with the new guest entry fast path does not seem to be affected.
>>>
>>> Reverted the patch and the followup idle fixes on top of 5.2.14 and
>>> re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR
>>> after idle") which gives same idle code as 5.1.16 and the kernel seems
>>> stable.
>>>
>>> Config is attached.
>>>
>>> I cannot easily revert this commit, especially if I want to use the same
>>> kernel on POWER8 and POWER9 - many of the POWER9 fixes are applicable
>>> only to the new idle code.
>>>
>>> Any idea what can be the problem?
>> 
>> So hwthread_state is never getting back to HWTHREAD_IN_IDLE on
>> those threads. I wonder what they are doing. POWER8 doesn't have a good
>> NMI IPI and I don't know if it supports pdbg dumping registers from the
>> BMC unfortunately. Do the messages always come in pairs of CPUs?
>> 
>> I'm not sure where to start with reproducing, I'll have to try. How many
>> vCPUs in the guests? Do you have several guests running at once?
>
> Hello all,
>
> some details on the setup...
> these machines are buildservice workers, (build.opensuse.org) and all they
> do is spawn new VMs, run a package building job inside (rpmbuild, debbuild,...)
>
> The machines are running in OPAL/PowerNV mode, with "ppc64_cpu --smt=off".
> The number of VMs varies across the machines:
> obs-power8-01: 18 instances, "-smp 16,threads=8"
> obs-power8-02: 20 instances, "-smp 8,threads=8"
> obs-power8-03: 30 instances, "-smp 8,threads=8"
> obs-power8-04: 20 instances, "-smp 8,threads=8"

Can you send us the output of:

# grep . /sys/module/kvm_hv/parameters/*

cheers



Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2020-08-31 Thread Michael Ellerman
Michal Suchánek  writes:
> On Mon, Aug 31, 2020 at 11:14:18AM +1000, Nicholas Piggin wrote:
>> Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am:
>> > Hello,
>> > 
>> > on POWER8 KVM hosts lock up since commit 10d91611f426 ("powerpc/64s:
>> > Reimplement book3s idle code in C").
>> > 
>> > The symptom is host locking up completely after some hours of KVM
>> > workload with messages like
>> > 
>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
>> > 
>> > printed before the host locks up.
>> > 
>> > The machines run sandboxed builds, which is a mixed workload resulting in
>> > IO/single core/multiple core load over time, and there are periods of no
>> > activity and no VMs running as well. The VMs are short-lived, so VM
>> > setup/teardown is somewhat exercised as well.
>> > 
>> > POWER9 with the new guest entry fast path does not seem to be affected.
>> > 
>> > Reverted the patch and the followup idle fixes on top of 5.2.14 and
>> > re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR
>> > after idle") which gives same idle code as 5.1.16 and the kernel seems
>> > stable.
>> > 
>> > Config is attached.
>> > 
>> > I cannot easily revert this commit, especially if I want to use the same
>> > kernel on POWER8 and POWER9 - many of the POWER9 fixes are applicable
>> > only to the new idle code.
>> > 
>> > Any idea what can be the problem?
>> 
>> So hwthread_state is never getting back to HWTHREAD_IN_IDLE on
>> those threads. I wonder what they are doing. POWER8 doesn't have a good
>> NMI IPI and I don't know if it supports pdbg dumping registers from the
>> BMC unfortunately.
>
> It may be possible to set up fadump with a later kernel version that
> supports it on powernv and dump the whole kernel.

Your firmware won't support it AFAIK.

You could try kdump, but if we have CPUs stuck in KVM then there's a
good chance it won't work :/

cheers


Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2020-08-31 Thread Ruediger Oertel
Am 31.08.20 um 03:14 schrieb Nicholas Piggin:

> So hwthread_state is never getting back to HWTHREAD_IN_IDLE on
> those threads. I wonder what they are doing. POWER8 doesn't have a good
> NMI IPI and I don't know if it supports pdbg dumping registers from the
> BMC unfortunately.


> Do the messages always come in pairs of CPUs?

well,
- one problem is that at some point the machine just locks up completely,
  so I cannot tell whether some lines were no longer printed, and in some
  cases all I get is a single line
- looking at the stats, in general it's either one CPU printed several
  times or a pair alternating (though not strictly)

2020-07-30T03:16:16+00:00 obs-power8-03 kernel: [51284.029821] KVM: couldn't grab cpu 92
2020-07-30T03:16:16+00:00 obs-power8-03 kernel: [51284.058630] KVM: couldn't grab cpu 92
2020-07-30T03:16:16+00:00 obs-power8-03 kernel: [51284.108268] KVM: couldn't grab cpu 92
2020-07-30T03:16:16+00:00 obs-power8-03 kernel: [51284.210206] KVM: couldn't grab cpu 92
2020-07-30T03:16:16+00:00 obs-power8-03 kernel: [51284.323465] KVM: couldn't grab cpu 92
2020-07-30T03:16:16+00:00 obs-power8-03 kernel: [51284.334420] KVM: couldn't grab cpu 92
2020-07-30T03:16:16+00:00 obs-power8-03 kernel: [51284.345470] KVM: couldn't grab cpu 92
2020-07-30T03:16:16+00:00 obs-power8-03 kernel: [51284.395185] KVM: couldn't grab cpu 92
2020-07-30T03:16:16+00:00 obs-power8-03 kernel: [51284.517182] KVM: couldn't grab cpu 92
2020-07-30T03:16:17+00:00 obs-power8-03 kernel: [51284.600716] KVM: couldn't grab cpu 92
2020-07-30T03:16:18+00:00 obs-power8-03 kernel: [51286.201589] KVM: couldn't grab cpu 92
2020-07-30T03:16:19+00:00 obs-power8-03 kernel: [51286.627273] KVM: couldn't grab cpu 92

2020-07-30T16:44:16+00:00 obs-power8-04 kernel: [30099.726288] KVM: couldn't grab cpu 61
2020-07-30T16:44:16+00:00 obs-power8-04 kernel: [30099.736843] KVM: couldn't grab cpu 125
2020-07-30T16:44:17+00:00 obs-power8-04 kernel: [30099.747429] KVM: couldn't grab cpu 125
2020-07-30T16:44:17+00:00 obs-power8-04 kernel: [30099.877138] KVM: couldn't grab cpu 61
2020-07-30T16:44:17+00:00 obs-power8-04 kernel: [30099.916422] KVM: couldn't grab cpu 125
2020-07-30T16:44:17+00:00 obs-power8-04 kernel: [30099.931755] KVM: couldn't grab cpu 61
2020-07-30T16:44:17+00:00 obs-power8-04 kernel: [30100.029003] KVM: couldn't grab cpu 61
2020-07-30T16:44:17+00:00 obs-power8-04 kernel: [30100.334895] KVM: couldn't grab cpu 125
2020-07-30T16:44:17+00:00 obs-power8-04 kernel: [30100.392713] KVM: couldn't grab cpu 61
2020-07-30T16:44:17+00:00 obs-power8-04 kernel: [30100.569011] KVM: couldn't grab cpu 125
2020-07-30T16:44:17+00:00 obs-power8-04 kernel: [30100.617048] KVM: couldn't grab cpu 125
2020-07-30T16:44:17+00:00 obs-power8-04 kernel: [30100.628107] KVM: couldn't grab cpu 125
2020-07-30T16:44:18+00:00 obs-power8-04 kernel: [30100.809046] KVM: couldn't grab cpu 125
2020-07-30T16:44:18+00:00 obs-power8-04 kernel: [30101.001097] KVM: couldn't grab cpu 61
2020-07-30T16:44:19+00:00 obs-power8-04 kernel: [30102.109007] KVM: couldn't grab cpu 125
2020-07-30T16:44:19+00:00 obs-power8-04 kernel: [30102.254470] KVM: couldn't grab cpu 61
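
The "one CPU or a pair" observation can be checked mechanically by tallying the messages per host and CPU. Below is a rough Python sketch. The function names are made up for illustration, and the mapping from CPU number to core assumes the usual POWER8 numbering of 8 hardware threads per core, which is an assumption and not something confirmed in this thread.

```python
import re
from collections import Counter

# Matches syslog lines like the ones quoted above; the [uptime] field is optional.
LINE_RE = re.compile(
    r"^\S+ (?P<host>\S+) kernel: (?:\[\s*[\d.]+\] )?"
    r"KVM: couldn't grab cpu (?P<cpu>\d+)$"
)

# Assumption: CPU n belongs to core n // 8 (POWER8 with 8 threads per core).
THREADS_PER_CORE = 8


def tally(lines):
    """Count "couldn't grab cpu" messages per (host, cpu) pair."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            counts[(m.group("host"), int(m.group("cpu")))] += 1
    return counts


def cores(counts):
    """Map each host to the sorted list of cores its reported CPUs fall on."""
    result = {}
    for host, cpu in counts:
        result.setdefault(host, set()).add(cpu // THREADS_PER_CORE)
    return {host: sorted(c) for host, c in result.items()}
```

Feeding the journal of each host through `tally` and then `cores` shows whether the reported CPUs cluster on one core or are spread over several.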



> 
> I'm not sure where to start with reproducing, I'll have to try. How many
> vCPUs in the guests? Do you have several guests running at once?
> 
> Thanks,
> Nick
> 


-- 
with kind regards (mit freundlichem Grinsen),
  Ruediger Oertel (r...@suse.com,r...@suse.de,bugfin...@t-online.de)
Do-Not-Accept-Binary-Blobs.Ever.From-Anyone.
Key fingerprint   =   17DC 6553 86A7 384B 53C5  CA5C 3CE4 F2E7 23F2 B417
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg,
  Germany, (HRB 36809, AG Nürnberg), Geschäftsführer: Felix Imendörffer





Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2020-08-31 Thread Ruediger Oertel
Am 31.08.20 um 03:14 schrieb Nicholas Piggin:
> Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am:
>> Hello,
>>
>> on POWER8 KVM hosts lock up since commit 10d91611f426 ("powerpc/64s:
>> Reimplement book3s idle code in C").
>>
>> The symptom is host locking up completely after some hours of KVM
>> workload with messages like
>>
>> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
>> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
>> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
>> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
>> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
>>
>> printed before the host locks up.
>>
>> The machines run sandboxed builds, which is a mixed workload resulting in
>> IO/single core/multiple core load over time, and there are periods of no
>> activity and no VMs running as well. The VMs are short-lived, so VM
>> setup/teardown is somewhat exercised as well.
>>
>> POWER9 with the new guest entry fast path does not seem to be affected.
>>
>> Reverted the patch and the followup idle fixes on top of 5.2.14 and
>> re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR
>> after idle") which gives same idle code as 5.1.16 and the kernel seems
>> stable.
>>
>> Config is attached.
>>
>> I cannot easily revert this commit, especially if I want to use the same
>> kernel on POWER8 and POWER9 - many of the POWER9 fixes are applicable
>> only to the new idle code.
>>
>> Any idea what can be the problem?
> 
> So hwthread_state is never getting back to HWTHREAD_IN_IDLE on
> those threads. I wonder what they are doing. POWER8 doesn't have a good
> NMI IPI and I don't know if it supports pdbg dumping registers from the
> BMC unfortunately. Do the messages always come in pairs of CPUs?
> 
> I'm not sure where to start with reproducing, I'll have to try. How many
> vCPUs in the guests? Do you have several guests running at once?

Hello all,

some details on the setup...
these machines are buildservice workers, (build.opensuse.org) and all they
do is spawn new VMs, run a package building job inside (rpmbuild, debbuild,...)

The machines are running in OPAL/PowerNV mode, with "ppc64_cpu --smt=off".
The number of VMs varies across the machines:
obs-power8-01: 18 instances, "-smp 16,threads=8"
obs-power8-02: 20 instances, "-smp 8,threads=8"
obs-power8-03: 30 instances, "-smp 8,threads=8"
obs-power8-04: 20 instances, "-smp 8,threads=8"
obs-power8-05: 36 instances, "-smp 4,threads=2" (this one with "ppc64_cpu --subcores-per-core=4")

but anyway the stalls can be seen on all of them, sometimes after 4 hours,
sometimes only after about a day. obs-power8-01, with more CPU overcommit,
seems to reproduce the problem a little faster, but that's more a gut
feeling than anything backed by real numbers.


-- 
with kind regards (mit freundlichem Grinsen),
  Ruediger Oertel (r...@suse.com,r...@suse.de,bugfin...@t-online.de)
Do-Not-Accept-Binary-Blobs.Ever.From-Anyone.
Key fingerprint   =   17DC 6553 86A7 384B 53C5  CA5C 3CE4 F2E7 23F2 B417
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg,
  Germany, (HRB 36809, AG Nürnberg), Geschäftsführer: Felix Imendörffer





Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2020-08-31 Thread Michal Suchánek
On Mon, Aug 31, 2020 at 11:14:18AM +1000, Nicholas Piggin wrote:
> Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am:
> > Hello,
> > 
> > on POWER8 KVM hosts lock up since commit 10d91611f426 ("powerpc/64s:
> > Reimplement book3s idle code in C").
> > 
> > The symptom is host locking up completely after some hours of KVM
> > workload with messages like
> > 
> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> > 
> > printed before the host locks up.
> > 
> > The machines run sandboxed builds, which is a mixed workload resulting in
> > IO/single core/multiple core load over time, and there are periods of no
> > activity and no VMs running as well. The VMs are short-lived, so VM
> > setup/teardown is somewhat exercised as well.
> > 
> > POWER9 with the new guest entry fast path does not seem to be affected.
> > 
> > Reverted the patch and the followup idle fixes on top of 5.2.14 and
> > re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR
> > after idle") which gives same idle code as 5.1.16 and the kernel seems
> > stable.
> > 
> > Config is attached.
> > 
> > I cannot easily revert this commit, especially if I want to use the same
> > kernel on POWER8 and POWER9 - many of the POWER9 fixes are applicable
> > only to the new idle code.
> > 
> > Any idea what can be the problem?
> 
> So hwthread_state is never getting back to HWTHREAD_IN_IDLE on
> those threads. I wonder what they are doing. POWER8 doesn't have a good
> NMI IPI and I don't know if it supports pdbg dumping registers from the
> BMC unfortunately.
It may be possible to set up fadump with a later kernel version that
supports it on powernv and dump the whole kernel.

Thanks

Michal


Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2020-08-31 Thread Michal Suchánek
On Mon, Aug 31, 2020 at 11:14:18AM +1000, Nicholas Piggin wrote:
> Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am:
> > Hello,
> > 
> > on POWER8 KVM hosts lock up since commit 10d91611f426 ("powerpc/64s:
> > Reimplement book3s idle code in C").
> > 
> > The symptom is host locking up completely after some hours of KVM
> > workload with messages like
> > 
> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> > 
> > printed before the host locks up.
> > 
> > The machines run sandboxed builds, which is a mixed workload resulting in
> > IO/single core/multiple core load over time, and there are periods of no
> > activity and no VMs running as well. The VMs are short-lived, so VM
> > setup/teardown is somewhat exercised as well.
> > 
> > POWER9 with the new guest entry fast path does not seem to be affected.
> > 
> > Reverted the patch and the followup idle fixes on top of 5.2.14 and
> > re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR
> > after idle") which gives same idle code as 5.1.16 and the kernel seems
> > stable.
> > 
> > Config is attached.
> > 
> > I cannot easily revert this commit, especially if I want to use the same
> > kernel on POWER8 and POWER9 - many of the POWER9 fixes are applicable
> > only to the new idle code.
> > 
> > Any idea what can be the problem?
> 
> So hwthread_state is never getting back to HWTHREAD_IN_IDLE on
> those threads. I wonder what they are doing. POWER8 doesn't have a good
> NMI IPI and I don't know if it supports pdbg dumping registers from the
> BMC unfortunately. Do the messages always come in pairs of CPUs?
> 
> I'm not sure where to start with reproducing, I'll have to try. How many
> vCPUs in the guests? Do you have several guests running at once?

The guests are spawned on demand - there are like 20-30 'slots'
configured where a VM may be running or it may be idle with no VM
spawned when there are no jobs available.

Thanks

Michal


Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2020-08-30 Thread Nicholas Piggin
Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am:
> Hello,
> 
> on POWER8 KVM hosts lock up since commit 10d91611f426 ("powerpc/64s:
> Reimplement book3s idle code in C").
> 
> The symptom is host locking up completely after some hours of KVM
> workload with messages like
> 
> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> 
> printed before the host locks up.
> 
> The machines run sandboxed builds, which is a mixed workload resulting in
> IO/single core/multiple core load over time, and there are periods of no
> activity and no VMs running as well. The VMs are short-lived, so VM
> setup/teardown is somewhat exercised as well.
> 
> POWER9 with the new guest entry fast path does not seem to be affected.
> 
> Reverted the patch and the followup idle fixes on top of 5.2.14 and
> re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR
> after idle") which gives same idle code as 5.1.16 and the kernel seems
> stable.
> 
> Config is attached.
> 
> I cannot easily revert this commit, especially if I want to use the same
> kernel on POWER8 and POWER9 - many of the POWER9 fixes are applicable
> only to the new idle code.
> 
> Any idea what can be the problem?

So hwthread_state is never getting back to HWTHREAD_IN_IDLE on
those threads. I wonder what they are doing. POWER8 doesn't have a good
NMI IPI and I don't know if it supports pdbg dumping registers from the
BMC unfortunately. Do the messages always come in pairs of CPUs?

I'm not sure where to start with reproducing, I'll have to try. How many
vCPUs in the guests? Do you have several guests running at once?

Thanks,
Nick