Re: fixing kvm lapic hangs
On Thu, Feb 08, 2018 at 04:32:29PM +1300, Jonathan Matthew wrote:
> This diff (most of which has been around for a while) changes delay_func
> when running in KVM to use pvclock (effectively the TSC) to determine when to
> stop spinning. Since this is done in a KVM-specific driver, it won't have
> any effect anywhere other than in KVM guests.
>
> Using pvclock rather than the lapic avoids the hangs caused by KVM's use of
> the VMX preemption timer to emulate a lapic. To summarise that problem:
> occasionally KVM fails to restart the lapic timer until it gets an exit from
> the guest, and if we're busy polling the lapic, we'll never generate such an
> exit, and the lapic counter will never reach the value we're waiting for, at
> which point we're stuck.
>
> It also adds a timecounter based on pvclock, with its priority set below
> acpihpet, so it won't be used for now. Later on I'd like to make this
> the preferred timecounter as it's much faster to read than the emulated
> acpihpet.
>
> A couple of testers have confirmed that this does *not* fix the time slowdowns
> seen on KVM guests. I believe fixing that will require more invasive changes.
>
> Since this fixes a concrete problem, rather than just making me feel better
> about doing lots of clock reads in some other work I'm doing, I'd like to get
> this in now.
>
> ok?

This diff needs a bit more work. It introduces a race in MP kernels that
often causes the machine to crash shortly after pvbus attaches.

While we're attaching mainbus etc., secondary cpus are waiting in a delay loop
for the primary cpu to tell them it's ok to run. pvclock requires some per-cpu
initialization that only happens after this delay loop finishes, so it's not
safe for kvmclock to change the delay_func pointer until all cpus have finished
hatching.

Another exciting discovery that came out of this is that you can't enter ddb
on a cpu that isn't meant to be running yet.
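The ordering constraint described in this message can be pictured with a small userland sketch. This is not the kvmclock code: NCPUS, cpu_pvclock_ready, pvclock_cpu_init and kvmclock_try_switch_delay are hypothetical names made up for illustration, and the two delay stubs only return a tag so the choice is observable. The only point is that the delay_func pointer is left alone until every cpu has done its per-cpu pvclock setup.

```c
#include <stdbool.h>
#include <stddef.h>

#define NCPUS 4	/* hypothetical cpu count for this sketch */

/* Stand-ins for the real delay routines; each just returns a tag. */
static int lapic_delay_stub(int usecs) { (void)usecs; return 1; }
static int pvclock_delay_stub(int usecs) { (void)usecs; return 2; }

/* Mirrors the idea of the kernel's delay_func pointer. */
static int (*delay_func)(int) = lapic_delay_stub;

/* Set once a cpu has finished its per-cpu pvclock initialization. */
static bool cpu_pvclock_ready[NCPUS];

/* Hypothetical per-cpu hook, run after the cpu leaves its hatch loop. */
static void
pvclock_cpu_init(size_t cpu)
{
	cpu_pvclock_ready[cpu] = true;
}

/*
 * Switch delay_func only when every cpu is ready, so no cpu still
 * spinning in its hatch delay loop can end up calling the new function.
 */
static bool
kvmclock_try_switch_delay(void)
{
	for (size_t i = 0; i < NCPUS; i++) {
		if (!cpu_pvclock_ready[i])
			return false;
	}
	delay_func = pvclock_delay_stub;
	return true;
}
```

In this sketch the switch attempt is simply refused while any cpu is unprepared; where the real driver would retry or hook the switch is not something this thread settles.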
Re: fixing kvm lapic hangs
On Thu, Feb 08, 2018 at 04:32:29PM +1300, Jonathan Matthew wrote:
> This diff (most of which has been around for a while) changes delay_func
> when running in KVM to use pvclock (effectively the TSC) to determine when to
> stop spinning. Since this is done in a KVM-specific driver, it won't have
> any effect anywhere other than in KVM guests.
>
> Using pvclock rather than the lapic avoids the hangs caused by KVM's use of
> the VMX preemption timer to emulate a lapic. To summarise that problem:
> occasionally KVM fails to restart the lapic timer until it gets an exit from
> the guest, and if we're busy polling the lapic, we'll never generate such an
> exit, and the lapic counter will never reach the value we're waiting for, at
> which point we're stuck.

From my understanding of your explanation, that fixes the hangs I was seeing
with KVM VMs, hangs which were worked around by disabling the host 'preemption
timer' by putting kvm-intel.preemption_timer=0 on the host kernel command
line. So if I get it right, once this diff is in, there is no need for this
workaround on the host? Correct?

> A couple of testers have confirmed that this does *not* fix the time slowdowns
> seen on KVM guests. I believe fixing that will require more invasive changes.

Sign me up for testing those changes then!

Thanks for your work on it :)

Landry
Re: fixing kvm lapic hangs
On Wed, Feb 07, 2018 at 09:35:04PM -0800, Mike Larkin wrote:
> On Thu, Feb 08, 2018 at 04:32:29PM +1300, Jonathan Matthew wrote:
> > This diff (most of which has been around for a while) changes delay_func
> > when running in KVM to use pvclock (effectively the TSC) to determine when to
> > stop spinning. Since this is done in a KVM-specific driver, it won't have
> > any effect anywhere other than in KVM guests.
> >
> > Using pvclock rather than the lapic avoids the hangs caused by KVM's use of
> > the VMX preemption timer to emulate a lapic. To summarise that problem:
> > occasionally KVM fails to restart the lapic timer until it gets an exit from
> > the guest, and if we're busy polling the lapic, we'll never generate such an
> > exit, and the lapic counter will never reach the value we're waiting for, at
> > which point we're stuck.
> >
> > It also adds a timecounter based on pvclock, with its priority set below
> > acpihpet, so it won't be used for now. Later on I'd like to make this
> > the preferred timecounter as it's much faster to read than the emulated
> > acpihpet.
> >
> > A couple of testers have confirmed that this does *not* fix the time slowdowns
> > seen on KVM guests. I believe fixing that will require more invasive changes.
> >
> > Since this fixes a concrete problem, rather than just making me feel better
> > about doing lots of clock reads in some other work I'm doing, I'd like to get
> > this in now.
> >
> > ok?
>
> After discussing this with jmatthew@ offline, I understand what he's trying to
> do here, and I withdraw the earlier objection.
>
> Some comments:
>
> 1. does this need special treatment in RAMDISK?

I'll double check that it works on ramdisks before committing, but it should
work the same way there as GENERIC.

> 2. can you add a man page for kvmclock(4) and describe what it's for (and
> summarize why it's desirable on kvm)?

sure.

> 3. do you want to include this in i386? is it even an issue there?

i386 has the same lapic code, so it'll hang the same way as amd64, and it can
benefit from better timecounters too. I've tested this on i386 and it works
there.

> other than those things, ok mlarkin

thanks for taking a look.
Re: fixing kvm lapic hangs
On Thu, Feb 08, 2018 at 04:32:29PM +1300, Jonathan Matthew wrote:
> This diff (most of which has been around for a while) changes delay_func
> when running in KVM to use pvclock (effectively the TSC) to determine when to
> stop spinning. Since this is done in a KVM-specific driver, it won't have
> any effect anywhere other than in KVM guests.
>
> Using pvclock rather than the lapic avoids the hangs caused by KVM's use of
> the VMX preemption timer to emulate a lapic. To summarise that problem:
> occasionally KVM fails to restart the lapic timer until it gets an exit from
> the guest, and if we're busy polling the lapic, we'll never generate such an
> exit, and the lapic counter will never reach the value we're waiting for, at
> which point we're stuck.
>
> It also adds a timecounter based on pvclock, with its priority set below
> acpihpet, so it won't be used for now. Later on I'd like to make this
> the preferred timecounter as it's much faster to read than the emulated
> acpihpet.
>
> A couple of testers have confirmed that this does *not* fix the time slowdowns
> seen on KVM guests. I believe fixing that will require more invasive changes.
>
> Since this fixes a concrete problem, rather than just making me feel better
> about doing lots of clock reads in some other work I'm doing, I'd like to get
> this in now.
>
> ok?

After discussing this with jmatthew@ offline, I understand what he's trying to
do here, and I withdraw the earlier objection.

Some comments:

1. does this need special treatment in RAMDISK?

2. can you add a man page for kvmclock(4) and describe what it's for (and
   summarize why it's desirable on kvm)?

3. do you want to include this in i386? is it even an issue there?

other than those things, ok mlarkin
Re: fixing kvm lapic hangs
On Thu, Feb 08, 2018 at 04:32:29PM +1300, Jonathan Matthew wrote:
> This diff (most of which has been around for a while) changes delay_func
> when running in KVM to use pvclock (effectively the TSC) to determine when to
> stop spinning. Since this is done in a KVM-specific driver, it won't have
> any effect anywhere other than in KVM guests.
>
> Using pvclock rather than the lapic avoids the hangs caused by KVM's use of
> the VMX preemption timer to emulate a lapic. To summarise that problem:
> occasionally KVM fails to restart the lapic timer until it gets an exit from
> the guest, and if we're busy polling the lapic, we'll never generate such an
> exit, and the lapic counter will never reach the value we're waiting for, at

... should we instead just force an exit by doing an intentional VMCALL if
we detect this? Seems like it might be a less arduous approach than creating
a whole device. If we're going to do that, we might as well call it
kvmbrokenclock(4)

-ml

> which point we're stuck.
>
> It also adds a timecounter based on pvclock, with its priority set below
> acpihpet, so it won't be used for now. Later on I'd like to make this
> the preferred timecounter as it's much faster to read than the emulated
> acpihpet.
>
> A couple of testers have confirmed that this does *not* fix the time slowdowns
> seen on KVM guests. I believe fixing that will require more invasive changes.
>
> Since this fixes a concrete problem, rather than just making me feel better
> about doing lots of clock reads in some other work I'm doing, I'd like to get
> this in now.
>
> ok?
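For comparison, the alternative suggested in this message (notice the stall, then force an exit) might look roughly like the userland simulation below. Everything here is an assumption made for illustration: STALL_POLLS is an arbitrary threshold, read_counter stands in for reading the lapic current-count register, and force_vm_exit stands in for an intentional VMCALL. In this toy it merely un-sticks a fake timer; whether a real VMCALL gets KVM to restart the timer is not established anywhere in this thread.

```c
#include <stdint.h>

#define STALL_POLLS 1000	/* hypothetical: identical reads before we kick */

static uint32_t fake_counter = 500;	/* simulated down-counter */
static int timer_stuck = 1;		/* simulated "KVM did not restart it" */
static int forced_exits;

/* Stand-in for reading the lapic current-count register. */
static uint32_t
read_counter(void)
{
	if (!timer_stuck && fake_counter > 0)
		fake_counter--;
	return fake_counter;
}

/* Stand-in for an intentional VMCALL; here it only un-sticks the fake timer. */
static void
force_vm_exit(void)
{
	forced_exits++;
	timer_stuck = 0;
}

/*
 * Spin until the counter has dropped by `ticks`.  If the value does not
 * change for STALL_POLLS consecutive reads, assume the timer is stalled
 * and force an exit.  Counter reload/wrap is ignored in this toy, so
 * callers must keep `ticks` no larger than the starting count.
 */
static void
delay_with_stall_check(uint32_t ticks)
{
	uint32_t start = read_counter();
	uint32_t last = start;
	unsigned int same = 0;

	while (start - last < ticks) {
		uint32_t cur = read_counter();

		if (cur == last) {
			if (++same >= STALL_POLLS) {
				force_vm_exit();
				same = 0;
			}
		} else {
			same = 0;
			last = cur;
		}
	}
}
```

The trade-off is visible even in the toy: the threshold has to be long enough that a slow but healthy timer is not mistaken for a stalled one, and that tuning is exactly what a pvclock-based delay avoids.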
fixing kvm lapic hangs
This diff (most of which has been around for a while) changes delay_func
when running in KVM to use pvclock (effectively the TSC) to determine when to
stop spinning. Since this is done in a KVM-specific driver, it won't have
any effect anywhere other than in KVM guests.

Using pvclock rather than the lapic avoids the hangs caused by KVM's use of
the VMX preemption timer to emulate a lapic. To summarise that problem:
occasionally KVM fails to restart the lapic timer until it gets an exit from
the guest, and if we're busy polling the lapic, we'll never generate such an
exit, and the lapic counter will never reach the value we're waiting for, at
which point we're stuck.

It also adds a timecounter based on pvclock, with its priority set below
acpihpet, so it won't be used for now. Later on I'd like to make this
the preferred timecounter as it's much faster to read than the emulated
acpihpet.

A couple of testers have confirmed that this does *not* fix the time slowdowns
seen on KVM guests. I believe fixing that will require more invasive changes.

Since this fixes a concrete problem, rather than just making me feel better
about doing lots of clock reads in some other work I'm doing, I'd like to get
this in now.

ok?


Index: arch/amd64/conf/GENERIC
===
RCS file: /cvs/src/sys/arch/amd64/conf/GENERIC,v
retrieving revision 1.450
diff -u -p -u -p -r1.450 GENERIC
--- arch/amd64/conf/GENERIC	24 Dec 2017 19:50:56 - 1.450
+++ arch/amd64/conf/GENERIC	8 Feb 2018 02:32:49 -
@@ -81,6 +81,8 @@ hyperv0	at pvbus?	# Hyper-V guest
 hvn*	at hyperv?	# Hyper-V NetVSC
 hvs*	at hyperv?	# Hyper-V StorVSC
 
+kvmclock0 at pvbus?	# KVM clock
+
 option	PCIVERBOSE
 option	USBVERBOSE
 
Index: arch/amd64/conf/RAMDISK_CD
===
RCS file: /cvs/src/sys/arch/amd64/conf/RAMDISK_CD,v
retrieving revision 1.169
diff -u -p -u -p -r1.169 RAMDISK_CD
--- arch/amd64/conf/RAMDISK_CD	16 Nov 2017 18:12:27 - 1.169
+++ arch/amd64/conf/RAMDISK_CD	8 Feb 2018 02:32:49 -
@@ -64,6 +64,8 @@ hyperv0	at pvbus?	# Hyper-V guest
 hvn*	at hyperv?	# Hyper-V NetVSC
 hvs*	at hyperv?	# Hyper-V StorVSC
 
+kvmclock0 at pvbus?	# KVM clock
+
 pchb*	at pci?	# PCI-Host bridges
 aapic*	at pci?	# AMD 8131 IO apic
 ppb*	at pci?	# PCI-PCI bridges
Index: arch/i386/conf/GENERIC
===
RCS file: /cvs/src/sys/arch/i386/conf/GENERIC,v
retrieving revision 1.829
diff -u -p -u -p -r1.829 GENERIC
--- arch/i386/conf/GENERIC	28 Aug 2017 19:32:53 - 1.829
+++ arch/i386/conf/GENERIC	8 Feb 2018 02:32:49 -
@@ -40,6 +40,7 @@ amdmsr0	at mainbus?	# MSR access for AM
 
 pvbus0	at mainbus0	# Paravirtual device bus
 vmt0	at pvbus?	# VMware Tools
+kvmclock0 at pvbus?	# KVM clock
 
 acpitimer*	at acpi?
 acpihpet*	at acpi?
Index: dev/pv/files.pv
===
RCS file: /cvs/src/sys/dev/pv/files.pv,v
retrieving revision 1.13
diff -u -p -u -p -r1.13 files.pv
--- dev/pv/files.pv	14 Jun 2017 10:25:40 - 1.13
+++ dev/pv/files.pv	8 Feb 2018 02:32:49 -
@@ -75,3 +75,8 @@ file	dev/pv/vioscsi.c	vioscsi
 device	vmmci
 attach	vmmci at virtio
 file	dev/pv/vmmci.c	vmmci
+
+device	kvmclock
+attach	kvmclock at pvbus
+file	dev/pv/kvmclock.c	kvmclock
+file	dev/pv/pvclock.c	kvmclock
Index: dev/pv/kvmclock.c
===
RCS file: dev/pv/kvmclock.c
diff -N dev/pv/kvmclock.c
--- /dev/null	1 Jan 1970 00:00:00 -
+++ dev/pv/kvmclock.c	8 Feb 2018 02:32:49 -
@@ -0,0 +1,151 @@
+/*	$OpenBSD$	*/
+
+/*
+ * Copyright (c) 2017 Jonathan Matthew
+ *
+ * Permission to use, copy, modify, and distribute this software for any
+ * purpose with or without fee is hereby granted, provided that the above
+ * copyright notice and this permission notice appear in all copies.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+ * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+ * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+ * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+ * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+ * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+ */
+
+#include
+#include
+#include
+#include
+#include
+
+#include
+#include
+
+#include
+
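Since the diff is cut off in this archive before the driver body, here is a rough standalone sketch of what "use pvclock (effectively the TSC) to determine when to stop spinning" can amount to. It is not the code from the diff. The structure follows the per-vcpu time layout of the KVM pvclock interface, but the names pvclock_time_info, pvclock_ns, pvclock_delay, fake_tsc and fake_read_tsc are made up for this example, the TSC is a fake counter so it can run in userland, and the version-based retry a real reader needs for a consistent snapshot is left out.

```c
#include <stdint.h>

/* Per-vcpu time info as laid out by the KVM pvclock interface. */
struct pvclock_time_info {
	uint32_t version;		/* odd while the host is updating */
	uint32_t pad0;
	uint64_t tsc_timestamp;		/* TSC value at the last update */
	uint64_t system_time;		/* nanoseconds at the last update */
	uint32_t tsc_to_system_mul;	/* 32.32 fixed-point multiplier */
	int8_t   tsc_shift;		/* pre-scale shift for the TSC delta */
	uint8_t  flags;
	uint8_t  pad[2];
};

/* Convert a raw TSC reading to nanoseconds using one snapshot. */
static uint64_t
pvclock_ns(const struct pvclock_time_info *ti, uint64_t tsc)
{
	uint64_t delta = tsc - ti->tsc_timestamp;
	uint64_t hi, lo;

	if (ti->tsc_shift < 0)
		delta >>= -ti->tsc_shift;
	else
		delta <<= ti->tsc_shift;

	/* (delta * mul) >> 32, split to avoid a 128-bit product. */
	hi = delta >> 32;
	lo = delta & 0xffffffffULL;
	return ti->system_time + hi * ti->tsc_to_system_mul +
	    ((lo * ti->tsc_to_system_mul) >> 32);
}

/* Fake TSC for the sketch: advances 100 ticks per read. */
static uint64_t fake_tsc;

static uint64_t
fake_read_tsc(void)
{
	return fake_tsc += 100;
}

/* Spin until at least `usecs` microseconds of pvclock time have passed. */
static void
pvclock_delay(const struct pvclock_time_info *ti,
    uint64_t (*read_tsc)(void), uint64_t usecs)
{
	uint64_t start = pvclock_ns(ti, read_tsc());

	while (pvclock_ns(ti, read_tsc()) - start < usecs * 1000)
		continue;
}
```

The property that matters for the hang is that the loop's exit condition depends only on the TSC advancing, which it does whether or not the emulated lapic timer has been restarted.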