Re: [PATCH v2 1/4] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
On Sun, Dec 20, 2015 at 03:05:41AM -0800, Andy Lutomirski wrote:
> From: Andy Lutomirski
>
> The pvclock vdso code was too abstracted to understand easily and
> excessively paranoid. Simplify it for a huge speedup.
>
> This opens the door for additional simplifications, as the vdso no
> longer accesses the pvti for any vcpu other than vcpu 0.
>
> Before, vclock_gettime using kvm-clock took about 45ns on my machine.
> With this change, it takes 29ns, which is almost as fast as the pure TSC
> implementation.
>
> Reviewed-by: Paolo Bonzini
> Signed-off-by: Andy Lutomirski
> ---
>  arch/x86/entry/vdso/vclock_gettime.c | 81
>  1 file changed, 46 insertions(+), 35 deletions(-)
>
> diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
> index ca94fa649251..c325ba1bdddf 100644
> --- a/arch/x86/entry/vdso/vclock_gettime.c
> +++ b/arch/x86/entry/vdso/vclock_gettime.c
> @@ -78,47 +78,58 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
>
>  static notrace cycle_t vread_pvclock(int *mode)
>  {
> -	const struct pvclock_vsyscall_time_info *pvti;
> +	const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
>  	cycle_t ret;
> -	u64 last;
> -	u32 version;
> -	u8 flags;
> -	unsigned cpu, cpu1;
> -
> +	u64 tsc, pvti_tsc;
> +	u64 last, delta, pvti_system_time;
> +	u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
>
>  	/*
> -	 * Note: hypervisor must guarantee that:
> -	 * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
> -	 * 2. that per-CPU pvclock time info is updated if the
> -	 *    underlying CPU changes.
> -	 * 3. that version is increased whenever underlying CPU
> -	 *    changes.
> +	 * Note: The kernel and hypervisor must guarantee that cpu ID
> +	 * number maps 1:1 to per-CPU pvclock time info.
> +	 *
> +	 * Because the hypervisor is entirely unaware of guest userspace
> +	 * preemption, it cannot guarantee that per-CPU pvclock time
> +	 * info is updated if the underlying CPU changes or that that
> +	 * version is increased whenever underlying CPU changes.
> 	 *
> +	 * On KVM, we are guaranteed that pvti updates for any vCPU are
> +	 * atomic as seen by *all* vCPUs. This is an even stronger
> +	 * guarantee than we get with a normal seqlock.
> +	 *
> +	 * On Xen, we don't appear to have that guarantee, but Xen still
> +	 * supplies a valid seqlock using the version field.
> +	 *
> +	 * We only do pvclock vdso timing at all if
> +	 * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
> +	 * mean that all vCPUs have matching pvti and that the TSC is
> +	 * synced, so we can just look at vCPU 0's pvti.
> 	 */
> -	do {
> -		cpu = __getcpu() & VGETCPU_CPU_MASK;
> -		/* TODO: We can put vcpu id into higher bits of pvti.version.
> -		 * This will save a couple of cycles by getting rid of
> -		 * __getcpu() calls (Gleb).
> -		 */
> -
> -		pvti = get_pvti(cpu);
> -
> -		version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
> -
> -		/*
> -		 * Test we're still on the cpu as well as the version.
> -		 * We could have been migrated just after the first
> -		 * vgetcpu but before fetching the version, so we
> -		 * wouldn't notice a version change.
> -		 */
> -		cpu1 = __getcpu() & VGETCPU_CPU_MASK;
> -	} while (unlikely(cpu != cpu1 ||
> -			  (pvti->pvti.version & 1) ||
> -			  pvti->pvti.version != version));
> -
> -	if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
> +
> +	if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
>  		*mode = VCLOCK_NONE;
> +		return 0;
> +	}
> +
> +	do {
> +		version = pvti->version;
> +
> +		/* This is also a read barrier, so we'll read version first. */
> +		tsc = rdtsc_ordered();
> +
> +		pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
> +		pvti_tsc_shift = pvti->tsc_shift;
> +		pvti_system_time = pvti->system_time;
> +		pvti_tsc = pvti->tsc_timestamp;
> +
> +		/* Make sure that the version double-check is last. */
> +		smp_rmb();
> +	} while (unlikely((version & 1) || version != pvti->version));

Andy,

What happens if PVCLOCK_TSC_STABLE_BIT is disabled here?

> +
> +	delta = tsc - pvti_tsc;
> +	ret = pvti_system_time +
> +		pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
> +				    pvti_tsc_shift);
>
> 	/* refer to tsc.c read_tsc() comment for rationale */
> 	last = gtod->cycle_last;
> --
> 2.5.0
Re: [PATCH v2 1/4] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
On Mon, Jan 04, 2016 at 02:33:12PM -0800, Andy Lutomirski wrote: > On Mon, Jan 4, 2016 at 12:26 PM, Marcelo Tosatti <mtosa...@redhat.com> wrote: > > On Sun, Dec 20, 2015 at 03:05:41AM -0800, Andy Lutomirski wrote: > >> From: Andy Lutomirski <l...@amacapital.net> > >> > >> The pvclock vdso code was too abstracted to understand easily and > >> excessively paranoid. Simplify it for a huge speedup. > >> > >> This opens the door for additional simplifications, as the vdso no > >> longer accesses the pvti for any vcpu other than vcpu 0. > >> > >> Before, vclock_gettime using kvm-clock took about 45ns on my machine. > >> With this change, it takes 29ns, which is almost as fast as the pure TSC > >> implementation. > >> > >> Reviewed-by: Paolo Bonzini <pbonz...@redhat.com> > >> Signed-off-by: Andy Lutomirski <l...@amacapital.net> > >> --- > >> arch/x86/entry/vdso/vclock_gettime.c | 81 > >> > >> 1 file changed, 46 insertions(+), 35 deletions(-) > >> > >> diff --git a/arch/x86/entry/vdso/vclock_gettime.c > >> b/arch/x86/entry/vdso/vclock_gettime.c > >> index ca94fa649251..c325ba1bdddf 100644 > >> --- a/arch/x86/entry/vdso/vclock_gettime.c > >> +++ b/arch/x86/entry/vdso/vclock_gettime.c > >> @@ -78,47 +78,58 @@ static notrace const struct pvclock_vsyscall_time_info > >> *get_pvti(int cpu) > >> > >> static notrace cycle_t vread_pvclock(int *mode) > >> { > >> - const struct pvclock_vsyscall_time_info *pvti; > >> + const struct pvclock_vcpu_time_info *pvti = _pvti(0)->pvti; > >> cycle_t ret; > >> - u64 last; > >> - u32 version; > >> - u8 flags; > >> - unsigned cpu, cpu1; > >> - > >> + u64 tsc, pvti_tsc; > >> + u64 last, delta, pvti_system_time; > >> + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift; > >> > >> /* > >> - * Note: hypervisor must guarantee that: > >> - * 1. cpu ID number maps 1:1 to per-CPU pvclock time info. > >> - * 2. that per-CPU pvclock time info is updated if the > >> - *underlying CPU changes. > >> - * 3. 
that version is increased whenever underlying CPU > >> - *changes. > >> + * Note: The kernel and hypervisor must guarantee that cpu ID > >> + * number maps 1:1 to per-CPU pvclock time info. > >> + * > >> + * Because the hypervisor is entirely unaware of guest userspace > >> + * preemption, it cannot guarantee that per-CPU pvclock time > >> + * info is updated if the underlying CPU changes or that that > >> + * version is increased whenever underlying CPU changes. > >>* > >> + * On KVM, we are guaranteed that pvti updates for any vCPU are > >> + * atomic as seen by *all* vCPUs. This is an even stronger > >> + * guarantee than we get with a normal seqlock. > >> + * > >> + * On Xen, we don't appear to have that guarantee, but Xen still > >> + * supplies a valid seqlock using the version field. > >> + > >> + * We only do pvclock vdso timing at all if > >> + * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to > >> + * mean that all vCPUs have matching pvti and that the TSC is > >> + * synced, so we can just look at vCPU 0's pvti. > >>*/ > >> - do { > >> - cpu = __getcpu() & VGETCPU_CPU_MASK; > >> - /* TODO: We can put vcpu id into higher bits of pvti.version. > >> - * This will save a couple of cycles by getting rid of > >> - * __getcpu() calls (Gleb). > >> - */ > >> - > >> - pvti = get_pvti(cpu); > >> - > >> - version = __pvclock_read_cycles(>pvti, , ); > >> - > >> - /* > >> - * Test we're still on the cpu as well as the version. > >> - * We could have been migrated just after the first > >> - * vgetcpu but before fetching the version, so we > >> - * wouldn't notice a version change. > >> - */ > >> - cpu1 = __getcpu() & VGETCPU_CPU_MASK; > >> - } while (unlikely(cpu != cpu1 || > >> - (pvti->pvti.version & 1) || > >&g
Re: kvmclock doesn't work, help?
On Mon, Dec 21, 2015 at 02:49:25PM -0800, Andy Lutomirski wrote: > On Fri, Dec 18, 2015 at 1:49 PM, Marcelo Tosatti <mtosa...@redhat.com> wrote: > > On Fri, Dec 18, 2015 at 12:25:11PM -0800, Andy Lutomirski wrote: > >> [cc: John Stultz -- maybe you have ideas on how this should best > >> integrate with the core code] > >> > >> On Fri, Dec 18, 2015 at 11:45 AM, Marcelo Tosatti <mtosa...@redhat.com> > >> wrote: > > >> > Can you write an actual proposal (with details) that accomodates the > >> > issue described at "Assuming a stable TSC across physical CPUS, and a > >> > stable TSC" ? > >> > > >> > Yes it would be nicer, the IPIs (to stop the vcpus) are problematic for > >> > realtime guests. > >> > >> This shouldn't require many details, and I don't think there's an ABI > >> change. The rules are: > >> > >> When the overall system timebase changes (e.g. when the selected > >> clocksource changes or when update_pvclock_gtod is called), the KVM > >> host would: > >> > >> optionally: preempt_disable(); /* for performance */ > >> > >> for all vms { > >> > >> for all registered pvti structures { > >> pvti->version++; /* should be odd now */ > >> } > > > > pvti is userspace data, so you have to pin it before? > > Yes. > > Fortunately, most systems probably only have one page of pvti > structures, I think (unless there are a ton of vcpus), so the > performance impact should be negligible. > > > > >> /* Note: right now, any vcpu that tries to access pvti will start > >> infinite looping. We should add cpu_relax() to the guests. */ > >> > >> for all registered pvti structures { > >> update everything except pvti->version; > >> } > >> > >> for all registered pvti structures { > >> pvti->version++; /* should be even now */ > >> } > >> > >> cond_resched(); > >> } > >> > >> Is this enough detail? This should work with all existing guests, > >> too, unless there's a buggy guest out there that actually fails to > >> double-check version. 
> > > > What is the advantage of this over the brute force method, given > > that guests will busy spin? > > > > (busy spin is equally problematic as IPI for realtime guests). > > I disagree. It's never been safe to call clock_gettime from an RT > task and expect a guarantee of real-time performance. We could fix > that, but it's not even safe on non-KVM. The problem is how long the IPI (or busy spinning in case of version above) interrupts the vcpu. > Sending an IPI *always* stalls the task. Taking a lock (which is > effectively what this is doing) only stalls the tasks that contend for > the lock, which, most of the time, means that nothing stalls. > > Also, if the host disables preemption or otherwise boosts its priority > while version is odd, then the actual stall will be very short, in > contrast to an IPI-induced stall, which will be much, much longer. > > --Andy 1) The updates are rare. 2) There are no user complaints about the IPI mechanism. Don't see a reason to change this. For the suspend issue, though, there are complaints (guests on laptops which fail to use masterclock). -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvmclock doesn't work, help?
On Fri, Dec 18, 2015 at 12:25:11PM -0800, Andy Lutomirski wrote: > [cc: John Stultz -- maybe you have ideas on how this should best > integrate with the core code] > > On Fri, Dec 18, 2015 at 11:45 AM, Marcelo Tosatti <mtosa...@redhat.com> wrote: > > On Fri, Dec 18, 2015 at 11:27:13AM -0800, Andy Lutomirski wrote: > >> On Fri, Dec 18, 2015 at 3:47 AM, Marcelo Tosatti <mtosa...@redhat.com> > >> wrote: > >> > On Thu, Dec 17, 2015 at 05:12:59PM -0800, Andy Lutomirski wrote: > >> >> On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti <mtosa...@redhat.com> > >> >> wrote: > >> >> > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote: > >> >> >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti > >> >> >> <mtosa...@redhat.com> wrote: > >> >> >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote: > >> >> >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski > >> >> >> >> <l...@amacapital.net> wrote: > >> >> >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini > >> >> >> >> > <pbonz...@redhat.com> wrote: > >> >> >> >> >> > >> >> >> >> >> > >> >> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote: > >> >> >> >> >>> > RAW TSC NTP corrected TSC > >> >> >> >> >>> > t0 10 10 > >> >> >> >> >>> > t1 20 19.99 > >> >> >> >> >>> > t2 30 29.98 > >> >> >> >> >>> > t3 40 39.97 > >> >> >> >> >>> > t4 50 49.96 > >> >> > > >> >> > (1) > >> >> > > >> >> >> >> >>> > > >> >> >> >> >>> > ... > >> >> >> >> >>> > > >> >> >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC, > >> >> >> >> >>> > you can see what will happen. > >> >> >> >> >>> > >> >> >> >> >>> Sure, but why would you ever switch from one to the other? > >> >> >> >> >> > >> >> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend. 
> >> >> >> >> >> After > >> >> >> >> >> resume, the TSC certainly increases at the same rate as > >> >> >> >> >> before, but the > >> >> >> >> >> raw TSC restarted counting from 0 and systemtime has increased > >> >> >> >> >> slower > >> >> >> >> >> than the guest kvmclock. > >> >> >> >> > > >> >> >> >> > Wait, are we talking about the host's NTP or the guest's NTP? > >> >> >> >> > > >> >> >> >> > If it's the host's, then wouldn't systemtime be reset after > >> >> >> >> > resume to > >> >> >> >> > the NTP corrected value? If so, the guest wouldn't see time go > >> >> >> >> > backwards. > >> >> >> >> > > >> >> >> >> > If it's the guest's, then the guest's NTP correction is applied > >> >> >> >> > on top > >> >> >> >> > of kvmclock, and this shouldn't matter. > >> >> >> >> > > >> >> >> >> > I still feel like I'm missing something very basic here. > >> >> >> >> > > >> >> >> >> > >> >> >> >> OK, I think I get it. > >> >> >> >> > >> >> >> >> Marcelo, I thought that kvmclock was supposed to propagate the > >> >> >> >> host's > >> >> >> >> correction to the guest. If it did, indeed, propagate the > >> >> >> >> correction > >> >> >> >> then, after resume, the host's new system_time would ma
Re: kvmclock doesn't work, help?
On Fri, Dec 18, 2015 at 11:27:13AM -0800, Andy Lutomirski wrote: > On Fri, Dec 18, 2015 at 3:47 AM, Marcelo Tosatti <mtosa...@redhat.com> wrote: > > On Thu, Dec 17, 2015 at 05:12:59PM -0800, Andy Lutomirski wrote: > >> On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti <mtosa...@redhat.com> > >> wrote: > >> > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote: > >> >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti <mtosa...@redhat.com> > >> >> wrote: > >> >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote: > >> >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski > >> >> >> <l...@amacapital.net> wrote: > >> >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini > >> >> >> > <pbonz...@redhat.com> wrote: > >> >> >> >> > >> >> >> >> > >> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote: > >> >> >> >>> > RAW TSC NTP corrected TSC > >> >> >> >>> > t0 10 10 > >> >> >> >>> > t1 20 19.99 > >> >> >> >>> > t2 30 29.98 > >> >> >> >>> > t3 40 39.97 > >> >> >> >>> > t4 50 49.96 > >> > > >> > (1) > >> > > >> >> >> >>> > > >> >> >> >>> > ... > >> >> >> >>> > > >> >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC, > >> >> >> >>> > you can see what will happen. > >> >> >> >>> > >> >> >> >>> Sure, but why would you ever switch from one to the other? > >> >> >> >> > >> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend. > >> >> >> >> After > >> >> >> >> resume, the TSC certainly increases at the same rate as before, > >> >> >> >> but the > >> >> >> >> raw TSC restarted counting from 0 and systemtime has increased > >> >> >> >> slower > >> >> >> >> than the guest kvmclock. > >> >> >> > > >> >> >> > Wait, are we talking about the host's NTP or the guest's NTP? > >> >> >> > > >> >> >> > If it's the host's, then wouldn't systemtime be reset after resume > >> >> >> > to > >> >> >> > the NTP corrected value? If so, the guest wouldn't see time go > >> >> >> > backwards. 
> >> >> >> > > >> >> >> > If it's the guest's, then the guest's NTP correction is applied on > >> >> >> > top > >> >> >> > of kvmclock, and this shouldn't matter. > >> >> >> > > >> >> >> > I still feel like I'm missing something very basic here. > >> >> >> > > >> >> >> > >> >> >> OK, I think I get it. > >> >> >> > >> >> >> Marcelo, I thought that kvmclock was supposed to propagate the host's > >> >> >> correction to the guest. If it did, indeed, propagate the correction > >> >> >> then, after resume, the host's new system_time would match the > >> >> >> guest's > >> >> >> idea of it (after accounting for the guest's long nap), and I don't > >> >> >> think there would be a problem. > >> >> >> That being said, I can't find the code in the masterclock stuff that > >> >> >> would actually do this. > >> >> > > >> >> > Guest clock is maintained by guest timekeeping code, which does: > >> >> > > >> >> > timer_interrupt() > >> >> > offset = read clocksource since last timer interrupt > >> >> > accumulate_to_systemclock(offset) > >> >> > > >> >> > The frequency correction of NTP in the host can be applied to > >> >> > kvmclock, which will be visible to the
Re: kvmclock doesn't work, help?
On Thu, Dec 17, 2015 at 05:12:59PM -0800, Andy Lutomirski wrote: > On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti <mtosa...@redhat.com> wrote: > > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote: > >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti <mtosa...@redhat.com> > >> wrote: > >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote: > >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski <l...@amacapital.net> > >> >> wrote: > >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini <pbonz...@redhat.com> > >> >> > wrote: > >> >> >> > >> >> >> > >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote: > >> >> >>> > RAW TSC NTP corrected TSC > >> >> >>> > t0 10 10 > >> >> >>> > t1 20 19.99 > >> >> >>> > t2 30 29.98 > >> >> >>> > t3 40 39.97 > >> >> >>> > t4 50 49.96 > > > > (1) > > > >> >> >>> > > >> >> >>> > ... > >> >> >>> > > >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC, > >> >> >>> > you can see what will happen. > >> >> >>> > >> >> >>> Sure, but why would you ever switch from one to the other? > >> >> >> > >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend. After > >> >> >> resume, the TSC certainly increases at the same rate as before, but > >> >> >> the > >> >> >> raw TSC restarted counting from 0 and systemtime has increased slower > >> >> >> than the guest kvmclock. > >> >> > > >> >> > Wait, are we talking about the host's NTP or the guest's NTP? > >> >> > > >> >> > If it's the host's, then wouldn't systemtime be reset after resume to > >> >> > the NTP corrected value? If so, the guest wouldn't see time go > >> >> > backwards. > >> >> > > >> >> > If it's the guest's, then the guest's NTP correction is applied on top > >> >> > of kvmclock, and this shouldn't matter. > >> >> > > >> >> > I still feel like I'm missing something very basic here. > >> >> > > >> >> > >> >> OK, I think I get it. 
> >> >> > >> >> Marcelo, I thought that kvmclock was supposed to propagate the host's > >> >> correction to the guest. If it did, indeed, propagate the correction > >> >> then, after resume, the host's new system_time would match the guest's > >> >> idea of it (after accounting for the guest's long nap), and I don't > >> >> think there would be a problem. > >> >> That being said, I can't find the code in the masterclock stuff that > >> >> would actually do this. > >> > > >> > Guest clock is maintained by guest timekeeping code, which does: > >> > > >> > timer_interrupt() > >> > offset = read clocksource since last timer interrupt > >> > accumulate_to_systemclock(offset) > >> > > >> > The frequency correction of NTP in the host can be applied to > >> > kvmclock, which will be visible to the guest > >> > at "read clocksource since last timer interrupt" > >> > (kvmclock_clocksource_read function). > >> > >> pvclock_clocksource_read? That seems to do the same thing as all the > >> other clocksource access functions. > >> > >> > > >> > This does not mean that the NTP correction in the host is propagated > >> > to the guests system clock directly. > >> > > >> > (For example, the guest can run NTP which is free to do further > >> > adjustments at "accumulate_to_systemclock(offset)" time). > >> > >> Of course. But I expected that, in the absence of NTP on the guest, > >> that the guest would track the host's *corrected* time. > >> > >> > > >> >> If, on the other hand, the host's NTP correction is not supposed to &g
Re: kvmclock doesn't work, help?
On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote: > On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirskiwrote: > > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini wrote: > >> > >> > >> On 14/12/2015 23:31, Andy Lutomirski wrote: > >>> > RAW TSC NTP corrected TSC > >>> > t0 10 10 > >>> > t1 20 19.99 > >>> > t2 30 29.98 > >>> > t3 40 39.97 > >>> > t4 50 49.96 > >>> > > >>> > ... > >>> > > >>> > if you suddenly switch from RAW TSC to NTP corrected TSC, > >>> > you can see what will happen. > >>> > >>> Sure, but why would you ever switch from one to the other? > >> > >> The guest uses the raw TSC and systemtime = 0 until suspend. After > >> resume, the TSC certainly increases at the same rate as before, but the > >> raw TSC restarted counting from 0 and systemtime has increased slower > >> than the guest kvmclock. > > > > Wait, are we talking about the host's NTP or the guest's NTP? > > > > If it's the host's, then wouldn't systemtime be reset after resume to > > the NTP corrected value? If so, the guest wouldn't see time go > > backwards. > > > > If it's the guest's, then the guest's NTP correction is applied on top > > of kvmclock, and this shouldn't matter. > > > > I still feel like I'm missing something very basic here. > > > > OK, I think I get it. > > Marcelo, I thought that kvmclock was supposed to propagate the host's > correction to the guest. If it did, indeed, propagate the correction > then, after resume, the host's new system_time would match the guest's > idea of it (after accounting for the guest's long nap), and I don't > think there would be a problem. > That being said, I can't find the code in the masterclock stuff that > would actually do this. 
Guest clock is maintained by guest timekeeping code, which does:

	timer_interrupt()
		offset = read clocksource since last timer interrupt
		accumulate_to_systemclock(offset)

The frequency correction of NTP in the host can be applied to
kvmclock, which will be visible to the guest at "read clocksource
since last timer interrupt" (kvmclock_clocksource_read function).

This does not mean that the NTP correction in the host is propagated
to the guest's system clock directly.

(For example, the guest can run NTP which is free to do further
adjustments at "accumulate_to_systemclock(offset)" time).

> If, on the other hand, the host's NTP correction is not supposed to
> propagate to the guest,

This is optional. There is a module option to control this, in fact.

It's nice to have, because then you can execute a guest without NTP
(say without network connection), and have a kvmclock (kvmclock is a
clocksource, not a guest system clock) which is NTP corrected.

> then shouldn't KVM just update system_time on
> resume to whatever the guest would think it had (which I think would
> be equivalent to the host's CLOCK_MONOTONIC_RAW value, possibly
> shifted by some per-guest constant offset).
>
> --Andy

Sure, you could add a correction to compensate and make sure the guest
clock does not see time backwards.
Re: kvmclock doesn't work, help?
On Mon, Dec 14, 2015 at 02:31:10PM -0800, Andy Lutomirski wrote:
> On Mon, Dec 14, 2015 at 2:00 PM, Marcelo Tosatti <mtosa...@redhat.com> wrote:
> > On Mon, Dec 14, 2015 at 02:44:15PM +0100, Paolo Bonzini wrote:
> >>
> >>
> >> On 11/12/2015 22:57, Andy Lutomirski wrote:
> >> > I'm still not seeing the issue.
> >> >
> >> > The formula is:
> >> >
> >> > (((rdtsc - pvti->tsc_timestamp) * pvti->tsc_to_system_mul) >>
> >> > pvti->tsc_shift) + pvti->system_time
> >> >
> >> > Obviously, if you reset pvti->tsc_timestamp to the current tsc value
> >> > after suspend/resume, you would also need to update system_time.
> >> >
> >> > I don't see what this has to do with suspend/resume or with whether
> >> > the effective scale factor is greater than or less than one. The only
> >> > suspend/resume interaction I can see is that, if the host allows the
> >> > guest-observed TSC value to jump (which is arguably a bug, but that's
> >> > not important here), it needs to update pvti before resuming the
> >> > guest.
> >>
> >> Which is not an issue, since freezing obviously gets all CPUs out of
> >> guest mode.
> >>
> >> Marcelo, can you provide an example with made-up values for tsc and pvti?
> >
> > I meant "systemtime" at ^.
> >
> > guest visible clock = systemtime (updated at time 0, guest initialization)
> >                       + scaled tsc reads = LARGE VALUE.
> > guest reads clock to memory at location A = scaled tsc read.
> > -> suspend resume event
> > guest visible clock = systemtime (updated at time AFTER SUSPEND)
> >                       + scaled tsc reads = 0.
> > guest reads clock to memory at location B.
> >
> > So before the suspend/resume event, the clock is the RAW TSC values
> > (scaled by kvmclock, but the frequency of the RAW TSC).
> >
> > After suspend/resume event, the clock is updated from the host
> > via get_kernel_ns(), which reads the corrected NTP frequency TSC.
> >
> > So you switch the timebase, from a clock running at a given frequency,
> > to a clock running at another frequency (effective frequency).
> >
> > Example:
> >
> >         RAW TSC    NTP corrected TSC
> > t0      10         10
> > t1      20         19.99
> > t2      30         29.98
> > t3      40         39.97
> > t4      50         49.96
> >
> > ...
> >
> > if you suddenly switch from RAW TSC to NTP corrected TSC,
> > you can see what will happen.
>
> Sure, but why would you ever switch from one to the other?

Because that's what happens when you ask kvmclock to update from system
time (which is a reliable clock, resistant to suspend/resume issues).

> I'm still not seeing the scenario under which this discontinuity is
> visible to anything other than the kvmclock code itself.

Host userspace can see if it uses TSC and clock_gettime() and expects
them to run hand in hand.

> The only things that need to be monotonic are the output from
> vread_pvclock and the in-kernel equivalent, I think.
>
> --Andy

clock_gettime as well, should be monotonic.
Re: kvmclock doesn't work, help?
On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote: > On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti <mtosa...@redhat.com> wrote: > > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote: > >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski <l...@amacapital.net> > >> wrote: > >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini <pbonz...@redhat.com> > >> > wrote: > >> >> > >> >> > >> >> On 14/12/2015 23:31, Andy Lutomirski wrote: > >> >>> > RAW TSC NTP corrected TSC > >> >>> > t0 10 10 > >> >>> > t1 20 19.99 > >> >>> > t2 30 29.98 > >> >>> > t3 40 39.97 > >> >>> > t4 50 49.96 (1) > >> >>> > > >> >>> > ... > >> >>> > > >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC, > >> >>> > you can see what will happen. > >> >>> > >> >>> Sure, but why would you ever switch from one to the other? > >> >> > >> >> The guest uses the raw TSC and systemtime = 0 until suspend. After > >> >> resume, the TSC certainly increases at the same rate as before, but the > >> >> raw TSC restarted counting from 0 and systemtime has increased slower > >> >> than the guest kvmclock. > >> > > >> > Wait, are we talking about the host's NTP or the guest's NTP? > >> > > >> > If it's the host's, then wouldn't systemtime be reset after resume to > >> > the NTP corrected value? If so, the guest wouldn't see time go > >> > backwards. > >> > > >> > If it's the guest's, then the guest's NTP correction is applied on top > >> > of kvmclock, and this shouldn't matter. > >> > > >> > I still feel like I'm missing something very basic here. > >> > > >> > >> OK, I think I get it. > >> > >> Marcelo, I thought that kvmclock was supposed to propagate the host's > >> correction to the guest. If it did, indeed, propagate the correction > >> then, after resume, the host's new system_time would match the guest's > >> idea of it (after accounting for the guest's long nap), and I don't > >> think there would be a problem. 
> >> That being said, I can't find the code in the masterclock stuff that > >> would actually do this. > > > > Guest clock is maintained by guest timekeeping code, which does: > > > > timer_interrupt() > > offset = read clocksource since last timer interrupt > > accumulate_to_systemclock(offset) > > > > The frequency correction of NTP in the host can be applied to > > kvmclock, which will be visible to the guest > > at "read clocksource since last timer interrupt" > > (kvmclock_clocksource_read function). > > pvclock_clocksource_read? That seems to do the same thing as all the > other clocksource access functions. > > > > > This does not mean that the NTP correction in the host is propagated > > to the guests system clock directly. > > > > (For example, the guest can run NTP which is free to do further > > adjustments at "accumulate_to_systemclock(offset)" time). > > Of course. But I expected that, in the absence of NTP on the guest, > that the guest would track the host's *corrected* time. > > > > >> If, on the other hand, the host's NTP correction is not supposed to > >> propagate to the guest, > > > > This is optional. There is a module option to control this, in fact. > > > > Its nice to have, because then you can execute a guest without NTP > > (say without network connection), and have a kvmclock (kvmclock is a > > clocksource, not a guest system clock) which is NTP corrected. > > Can you point to how this works? I found kvm_guest_time_update, whch > is called under circumstances that I haven't untangled. I can't > really tell what it's trying to do. Documentation/virtual/kvm/timekeeping.txt. > In any case, this still seems much more convoluted than it has to be. > In the case in which the host has a stable TSC (tsc is selected in the > core timekeeping code, VCLOCK_TSC is set, etc), which is basically all > the time on the last few generations of CPUs, then the core > timekeeping code is already exposing
Re: kvmclock doesn't work, help?
On Fri, Dec 11, 2015 at 01:57:23PM -0800, Andy Lutomirski wrote: > On Thu, Dec 10, 2015 at 1:32 PM, Marcelo Tosatti <mtosa...@redhat.com> wrote: > > On Wed, Dec 09, 2015 at 01:10:59PM -0800, Andy Lutomirski wrote: > >> I'm trying to clean up kvmclock and I can't get it to work at all. My > >> host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC. > >> > >> If I boot an SMP (2 vcpus) guest, tracing says: > >> > >> qemu-system-x86-2517 [001] 102242.610654: kvm_update_master_clock: > >> masterclock 0 hostclock tsc offsetmatched 0 > >> qemu-system-x86-2521 [000] 102242.613742: kvm_track_tsc: > >> vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc > >> qemu-system-x86-2522 [000] 102242.622959: kvm_track_tsc: > >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > >> qemu-system-x86-2521 [000] 102242.645123: kvm_track_tsc: > >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > >> qemu-system-x86-2522 [000] 102242.647291: kvm_track_tsc: > >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > >> qemu-system-x86-2521 [000] 102242.653369: kvm_track_tsc: > >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > >> qemu-system-x86-2522 [000] 102242.653429: kvm_track_tsc: > >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > >> qemu-system-x86-2517 [001] 102242.653447: kvm_update_master_clock: > >> masterclock 0 hostclock tsc offsetmatched 1 > >> qemu-system-x86-2521 [000] 102242.653657: kvm_update_master_clock: > >> masterclock 0 hostclock tsc offsetmatched 1 > >> qemu-system-x86-2522 [002] 102242.664448: kvm_update_master_clock: > >> masterclock 0 hostclock tsc offsetmatched 1 > >> > >> > >> If I boot a UP guest, tracing says: > >> > >> qemu-system-x86-2567 [001] 102370.447484: kvm_update_master_clock: > >> masterclock 0 hostclock tsc offsetmatched 1 > >> qemu-system-x86-2571 [002] 102370.447688: kvm_update_master_clock: > >> masterclock 0 hostclock tsc 
offsetmatched 1 > >> > >> I suspect, but I haven't verified, that this is fallout from: > >> > >> commit 16a9602158861687c78b6de6dc6a79e6e8a9136f > >> Author: Marcelo Tosatti <mtosa...@redhat.com> > >> Date: Wed May 14 12:43:24 2014 -0300 > >> > >> KVM: x86: disable master clock if TSC is reset during suspend > >> > >> Updating system_time from the kernel clock once master clock > >> has been enabled can result in time backwards event, in case > >> kernel clock frequency is lower than TSC frequency. > >> > >> Disable master clock in case it is necessary to update it > >> from the resume path. > >> > >> Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com> > >> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com> > >> > >> > >> Can we please stop making kvmclock more complex? It's a beast right > >> now, and not in a good way. It's far too tangled with the vclock > >> machinery on both the host and guest sides, the pvclock stuff is not > >> well thought out (even in principle in an ABI sense), and it's never > >> been clear to my what problem exactly the kvmclock stuff is supposed > >> to solve. > >> > >> I'm somewhat tempted to suggest that we delete kvmclock entirely and > >> start over. A correctly functioning KVM guest using TSC (i.e. > >> ignoring kvmclock entirely) > >> seems to work rather more reliably and > >> considerably faster than a kvmclock guest. > >> > >> --Andy > >> > >> -- > >> Andy Lutomirski > >> AMA Capital Management, LLC > > > > Andy, > > > > I am all for solving practical problems rather than pleasing aesthetic > > pleasure. > > > >> Updating system_time from the kernel clock once master clock > >> has been enabled can result in time backwards event, in case > >> kernel clock frequency is lower than TSC frequency. > >> > >> Disable master clock in case it is necessary to update it > >> from the resume path. 
> > > >> once master clock > >> has been enabled can result in time backwards event, in case > >> kernel clock frequency is lower than TSC frequency. > > > > guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc reads. > > >
Re: kvmclock doesn't work, help?
On Mon, Dec 14, 2015 at 10:07:21AM -0800, Andy Lutomirski wrote: > On Fri, Dec 11, 2015 at 3:48 PM, Marcelo Tosatti <mtosa...@redhat.com> wrote: > > On Fri, Dec 11, 2015 at 01:57:23PM -0800, Andy Lutomirski wrote: > >> On Thu, Dec 10, 2015 at 1:32 PM, Marcelo Tosatti <mtosa...@redhat.com> > >> wrote: > >> > On Wed, Dec 09, 2015 at 01:10:59PM -0800, Andy Lutomirski wrote: > >> >> I'm trying to clean up kvmclock and I can't get it to work at all. My > >> >> host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC. > >> >> > >> >> If I boot an SMP (2 vcpus) guest, tracing says: > >> >> > >> >> qemu-system-x86-2517 [001] 102242.610654: kvm_update_master_clock: > >> >> masterclock 0 hostclock tsc offsetmatched 0 > >> >> qemu-system-x86-2521 [000] 102242.613742: kvm_track_tsc: > >> >> vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc > >> >> qemu-system-x86-2522 [000] 102242.622959: kvm_track_tsc: > >> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > >> >> qemu-system-x86-2521 [000] 102242.645123: kvm_track_tsc: > >> >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > >> >> qemu-system-x86-2522 [000] 102242.647291: kvm_track_tsc: > >> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > >> >> qemu-system-x86-2521 [000] 102242.653369: kvm_track_tsc: > >> >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > >> >> qemu-system-x86-2522 [000] 102242.653429: kvm_track_tsc: > >> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > >> >> qemu-system-x86-2517 [001] 102242.653447: kvm_update_master_clock: > >> >> masterclock 0 hostclock tsc offsetmatched 1 > >> >> qemu-system-x86-2521 [000] 102242.653657: kvm_update_master_clock: > >> >> masterclock 0 hostclock tsc offsetmatched 1 > >> >> qemu-system-x86-2522 [002] 102242.664448: kvm_update_master_clock: > >> >> masterclock 0 hostclock tsc offsetmatched 1 > >> >> > >> >> > >> >> If I boot a UP guest, tracing says: 
> >> >> > >> >> qemu-system-x86-2567 [001] 102370.447484: kvm_update_master_clock: > >> >> masterclock 0 hostclock tsc offsetmatched 1 > >> >> qemu-system-x86-2571 [002] 102370.447688: kvm_update_master_clock: > >> >> masterclock 0 hostclock tsc offsetmatched 1 > >> >> > >> >> I suspect, but I haven't verified, that this is fallout from: > >> >> > >> >> commit 16a9602158861687c78b6de6dc6a79e6e8a9136f > >> >> Author: Marcelo Tosatti <mtosa...@redhat.com> > >> >> Date: Wed May 14 12:43:24 2014 -0300 > >> >> > >> >> KVM: x86: disable master clock if TSC is reset during suspend > >> >> > >> >> Updating system_time from the kernel clock once master clock > >> >> has been enabled can result in time backwards event, in case > >> >> kernel clock frequency is lower than TSC frequency. > >> >> > >> >> Disable master clock in case it is necessary to update it > >> >> from the resume path. > >> >> > >> >> Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com> > >> >> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com> > >> >> > >> >> > >> >> Can we please stop making kvmclock more complex? It's a beast right > >> >> now, and not in a good way. It's far too tangled with the vclock > >> >> machinery on both the host and guest sides, the pvclock stuff is not > >> >> well thought out (even in principle in an ABI sense), and it's never > >> >> been clear to my what problem exactly the kvmclock stuff is supposed > >> >> to solve. > >> >> > >> >> I'm somewhat tempted to suggest that we delete kvmclock entirely and > >> >> start over. A correctly functioning KVM guest using TSC (i.e. > >> >> ignoring kvmclock entirely) > >> >> seems to work rather more reliably and > >> >> considerably faster than a kvmclock guest. > >> >> > >> >> --Andy > >> >> > >> >> -- > >> >> Andy Lutomirski > >> >> AMA
Re: kvmclock doesn't work, help?
On Mon, Dec 14, 2015 at 02:44:15PM +0100, Paolo Bonzini wrote: > > > On 11/12/2015 22:57, Andy Lutomirski wrote: > > I'm still not seeing the issue. > > > > The formula is: > > > > (((rdtsc - pvti->tsc_timestamp) * pvti->tsc_to_system_mul) >> > > pvti->tsc_shift) + pvti->system_time > > > > Obviously, if you reset pvti->tsc_timestamp to the current tsc value > > after suspend/resume, you would also need to update system_time. > > > > I don't see what this has to do with suspend/resume or with whether > > the effective scale factor is greater than or less than one. The only > > suspend/resume interaction I can see is that, if the host allows the > > guest-observed TSC value to jump (which is arguably a bug, but that's > > not important here), it needs to update pvti before resuming the > > guest. > > Which is not an issue, since freezing obviously gets all CPUs out of > guest mode. > > Marcelo, can you provide an example with made-up values for tsc and pvti?

I meant "systemtime" in my earlier mail where I wrote "tsc_timestamp".

guest visible clock = systemtime (updated at time 0, guest initialization) + scaled tsc reads = LARGE VALUE.

guest reads clock to memory at location A = scaled tsc read.

-> suspend/resume event

guest visible clock = systemtime (updated at time AFTER SUSPEND) + scaled tsc reads = 0.

guest reads clock to memory at location B.

So before the suspend/resume event, the clock is the RAW TSC value (scaled by kvmclock, but at the frequency of the RAW TSC). After the suspend/resume event, the clock is updated from the host via get_kernel_ns(), which reads the NTP-frequency-corrected TSC. So you switch the timebase from a clock running at a given frequency to a clock running at another (effective) frequency. Example:

        RAW TSC    NTP corrected TSC
t0      10         10
t1      20         19.99
t2      30         29.98
t3      40         39.97
t4      50         49.96
...

If you suddenly switch from the RAW TSC to the NTP corrected TSC, you can see what will happen. Does that make sense?
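The formula quoted above can be turned into a small numeric sketch of the time-backwards scenario. This is a hedged user-space toy, not kernel code: the struct below is a simplified pvti (the real pvclock_vcpu_time_info also carries version and flags), and all numbers are made up, chosen so that one TSC cycle scales to exactly one nanosecond.

```c
#include <stdint.h>

/* Simplified pvti: only the fields used by the formula above. */
struct pvti {
    uint64_t tsc_timestamp;      /* host TSC at the last update */
    uint64_t system_time;        /* guest nanoseconds at the last update */
    uint32_t tsc_to_system_mul;
    int8_t   tsc_shift;
};

/* Scale a TSC delta to nanoseconds the way pvclock does:
 * apply the shift first, then multiply and keep bits [95:32]. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    if (shift >= 0)
        delta <<= shift;
    else
        delta >>= -shift;
    return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

/* guest_ns = ((tsc - tsc_timestamp) * mul >> shift) + system_time */
static uint64_t guest_clock(const struct pvti *p, uint64_t tsc)
{
    return scale_delta(tsc - p->tsc_timestamp,
                       p->tsc_to_system_mul, p->tsc_shift) + p->system_time;
}
```

With mul = 0x80000000 and shift = 1 (1 cycle == 1 ns), a guest reading at tsc = 50 sees 50 ns. If resume then resets tsc_timestamp to 50 but refreshes system_time from the slower NTP-corrected kernel clock (49 ns), an immediate re-read returns 49 ns: time went backwards, which is exactly the frequency-switch problem described above.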
> > guest visible clock = tsc_timestamp (updated at time N) + scaled tsc > > reads=0. > > > Can you clarify concretely what goes wrong here? > > > > (I'm also at a bit of a loss as to why this needs both system_time and > > tsc_timestamp. They're redundant in the sense that you could set > > tsc_timestamp to zero and subtract (tsc_timestamp * tsc_to_system_mul) >> > > tsc_shift to system_time without changing the result of the > > calculation.) > > You would have to ensure that all elements of pvti are rounded correctly > whenever the base TSC is updated. Doable, but it does seem simpler to > keep subtract-TSC and add-nanoseconds separate. > > Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvmclock doesn't work, help?
On Wed, Dec 09, 2015 at 01:10:59PM -0800, Andy Lutomirski wrote: > I'm trying to clean up kvmclock and I can't get it to work at all. My > host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC. > > If I boot an SMP (2 vcpus) guest, tracing says: > > qemu-system-x86-2517 [001] 102242.610654: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 0 > qemu-system-x86-2521 [000] 102242.613742: kvm_track_tsc: > vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc > qemu-system-x86-2522 [000] 102242.622959: kvm_track_tsc: > vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > qemu-system-x86-2521 [000] 102242.645123: kvm_track_tsc: > vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > qemu-system-x86-2522 [000] 102242.647291: kvm_track_tsc: > vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > qemu-system-x86-2521 [000] 102242.653369: kvm_track_tsc: > vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > qemu-system-x86-2522 [000] 102242.653429: kvm_track_tsc: > vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > qemu-system-x86-2517 [001] 102242.653447: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 1 > qemu-system-x86-2521 [000] 102242.653657: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 1 > qemu-system-x86-2522 [002] 102242.664448: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 1 > > > If I boot a UP guest, tracing says: > > qemu-system-x86-2567 [001] 102370.447484: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 1 > qemu-system-x86-2571 [002] 102370.447688: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 1 > > I suspect, but I haven't verified, that this is fallout from: > > commit 16a9602158861687c78b6de6dc6a79e6e8a9136f > Author: Marcelo Tosatti <mtosa...@redhat.com> > Date: Wed May 14 12:43:24 2014 -0300 > > KVM: x86: disable master clock if 
TSC is reset during suspend > > Updating system_time from the kernel clock once master clock > has been enabled can result in time backwards event, in case > kernel clock frequency is lower than TSC frequency. > > Disable master clock in case it is necessary to update it > from the resume path. > > Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com> > Signed-off-by: Paolo Bonzini <pbonz...@redhat.com> > > > Can we please stop making kvmclock more complex? It's a beast right > now, and not in a good way. It's far too tangled with the vclock > machinery on both the host and guest sides, the pvclock stuff is not > well thought out (even in principle in an ABI sense), and it's never > been clear to my what problem exactly the kvmclock stuff is supposed > to solve. > > I'm somewhat tempted to suggest that we delete kvmclock entirely and > start over. A correctly functioning KVM guest using TSC (i.e. > ignoring kvmclock entirely) > seems to work rather more reliably and > considerably faster than a kvmclock guest. > > --Andy > > -- > Andy Lutomirski > AMA Capital Management, LLC Andy, I am all for solving practical problems rather than pleasing aesthetic pleasure. > Updating system_time from the kernel clock once master clock > has been enabled can result in time backwards event, in case > kernel clock frequency is lower than TSC frequency. > > Disable master clock in case it is necessary to update it > from the resume path. > once master clock > has been enabled can result in time backwards event, in case > kernel clock frequency is lower than TSC frequency. guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc reads. If the effective frequency of the kernel clock is lower (for example due to NTP correcting the TSC frequency of the system), and you resume and update the system, the following happens: guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc reads=LARGE VALUE. suspend/resume event. 
guest visible clock = tsc_timestamp (updated at time N) + scaled tsc reads=0.
Re: kvmclock doesn't work, help?
On Wed, Dec 09, 2015 at 02:27:36PM -0800, Andy Lutomirski wrote: > On Wed, Dec 9, 2015 at 2:12 PM, Paolo Bonzini wrote: > > > > > > On 09/12/2015 22:49, Andy Lutomirski wrote: > >> On Wed, Dec 9, 2015 at 1:16 PM, Paolo Bonzini wrote: > >>> > >>> > >>> On 09/12/2015 22:10, Andy Lutomirski wrote: > Can we please stop making kvmclock more complex? It's a beast right > now, and not in a good way. It's far too tangled with the vclock > machinery on both the host and guest sides, the pvclock stuff is not > well thought out (even in principle in an ABI sense), and it's never > been clear to me what problem exactly the kvmclock stuff is supposed > to solve. > >>> > >>> It's supposed to solve the problem that: > >>> > >>> - not all hosts have a working TSC > >> > >> Fine, but we don't need any vdso integration for that. > > > > Well, you still want a fast time source. That was a given. :) > > If the host can't figure out how to give *itself* a fast time source, > I'd be surprised if KVM can manage to give the guest a fast, reliable > time source. > > > > >>> - even if they all do, virtual machines can be migrated (or > >>> saved/restored) to a host with a different TSC frequency > >>> > >>> - any MMIO- or PIO-based mechanism to access the current time is orders > >>> of magnitude slower than the TSC and less precise too. > >> > >> Yup. But TSC by itself gets that benefit, too. > > > > Yes, the problem is if you want to solve all three of them. The first > > two are solved by the ACPI PM timer with a decent resolution (70 > > ns---much faster anyway than an I/O port access). The third is solved > > by TSC. To solve all three, you need kvmclock. > > Still confused. Is kvmclock really used in cases where even the host > can't pull off a working TSC? > > > > I'm somewhat tempted to suggest that we delete kvmclock entirely and > start over. A correctly functioning KVM guest using TSC (i.e.
> ignoring kvmclock entirely) seems to work rather more reliably and > considerably faster than a kvmclock guest. > >>> > >>> If all your hosts have a working TSC and you don't do migration or > >>> save/restore, that's a valid configuration. It's not a good default, > >>> however. > >> > >> Er? > >> > >> kvmclock is still really quite slow and buggy. > > > > Unless it takes 3-4000 clock cycles for a gettimeofday, which it > > shouldn't even with vdso disabled, it's definitely not slower than PIO. > > > >> And the patch I identified is definitely a problem here: > >> > >> [ 136.131241] KVM: disabling fast timing permanently due to inability > >> to recover from suspend > >> > >> I got that on the host with this whitespace-damaged patch: > >> > >> if (backwards_tsc) { > >> u64 delta_cyc = max_tsc - local_tsc; > >> + if (!backwards_tsc_observed) > >> + pr_warn("KVM: disabling fast timing > >> permanently due to inability to recover from suspend\n"); > >> > >> when I suspended and resumed. > >> > >> Can anyone explain what problem > >> 16a9602158861687c78b6de6dc6a79e6e8a9136f is supposed to solve? On > >> brief inspection, it just seems to be incorrect. Shouldn't KVM's > >> normal TSC logic handle that case right? After all, all vcpus should > >> be paused when we resume from suspend. At worst, we should just need > >> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu) on all vcpus. (Actually, > >> shouldn't we do that regardless of which way the TSC jumped on > >> suspend/resume? After all, the jTSC-to-wall-clock offset is quite > >> likely to change except on the very small handful of CPUs (if any) > >> that keep the TSC running in S3 and hibernate. > > > > I don't recall the details of that patch, so Marcelo will have to answer > > this, or Alex too since he chimed in the original thread. 
At least it > > should be made conditional on the existence of a VM at suspend time (and > > the master clock stuff should be made per VM, as I suggested at > > https://www.mail-archive.com/kvm@vger.kernel.org/msg102316.html). > > > > It would indeed be great if the master clock could be dropped. But I'm > > definitely missing some of the subtle details. :( > > Me, too. > > Anyway, see the attached untested patch. Marcelo? > > --Andy Read the last email, about the problem.
Re: kvmclock doesn't work, help?
On Wed, Dec 09, 2015 at 01:10:59PM -0800, Andy Lutomirski wrote: > I'm trying to clean up kvmclock and I can't get it to work at all. My > host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC. > > If I boot an SMP (2 vcpus) guest, tracing says: > > qemu-system-x86-2517 [001] 102242.610654: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 0 > qemu-system-x86-2521 [000] 102242.613742: kvm_track_tsc: > vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc > qemu-system-x86-2522 [000] 102242.622959: kvm_track_tsc: > vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > qemu-system-x86-2521 [000] 102242.645123: kvm_track_tsc: > vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > qemu-system-x86-2522 [000] 102242.647291: kvm_track_tsc: > vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > qemu-system-x86-2521 [000] 102242.653369: kvm_track_tsc: > vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > qemu-system-x86-2522 [000] 102242.653429: kvm_track_tsc: > vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > qemu-system-x86-2517 [001] 102242.653447: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 1 > qemu-system-x86-2521 [000] 102242.653657: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 1 > qemu-system-x86-2522 [002] 102242.664448: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 1 > > > If I boot a UP guest, tracing says: > > qemu-system-x86-2567 [001] 102370.447484: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 1 > qemu-system-x86-2571 [002] 102370.447688: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 1 > > I suspect, but I haven't verified, that this is fallout from: > > commit 16a9602158861687c78b6de6dc6a79e6e8a9136f > Author: Marcelo Tosatti <mtosa...@redhat.com> > Date: Wed May 14 12:43:24 2014 -0300 > > KVM: x86: disable master clock if 
TSC is reset during suspend > > Updating system_time from the kernel clock once master clock > has been enabled can result in time backwards event, in case > kernel clock frequency is lower than TSC frequency. > > Disable master clock in case it is necessary to update it > from the resume path. > > Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com> > Signed-off-by: Paolo Bonzini <pbonz...@redhat.com> > > > Can we please stop making kvmclock more complex? It's a beast right > now, and not in a good way. It's far too tangled with the vclock > machinery on both the host and guest sides, the pvclock stuff is not > well thought out (even in principle in an ABI sense), and it's never > been clear to me what problem exactly the kvmclock stuff is supposed > to solve. > > > I'm somewhat tempted to suggest that we delete kvmclock entirely and > start over. A correctly functioning KVM guest using TSC (i.e. > ignoring kvmclock entirely) seems to work rather more reliably and > considerably faster than a kvmclock guest. > > --Andy Users can do that, if they want. "clocksource=tsc" kernel option.
Re: [PATCH 05/10] KVM: x86: MMU: Use for_each_rmap_spte macro instead of pte_list_walk()
On Fri, Nov 13, 2015 at 07:47:28PM -0200, Marcelo Tosatti wrote: > On Thu, Nov 12, 2015 at 08:52:45PM +0900, Takuya Yoshikawa wrote: > > kvm_mmu_mark_parents_unsync() alone uses pte_list_walk(), which does > > nearly the same as the for_each_rmap_spte macro. The only difference > > is that is_shadow_present_pte() checks cannot be placed there because > > kvm_mmu_mark_parents_unsync() can be called with a new parent pointer > > whose entry is not set yet. > > > > By calling mark_unsync() separately for the parent and adding the parent > > pointer to the parent_ptes chain later in kvm_mmu_get_page(), the macro > > works with no problem. > > > > Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya...@lab.ntt.co.jp> > > --- > > arch/x86/kvm/mmu.c | 36 +++++++++++++----------------------- > > 1 file changed, 13 insertions(+), 23 deletions(-) > > > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > > index e8cfdc4..1691171 100644 > > --- a/arch/x86/kvm/mmu.c > > +++ b/arch/x86/kvm/mmu.c > > @@ -1007,26 +1007,6 @@ static void pte_list_remove(u64 *spte, unsigned long > > *pte_list) > > } > > } > > > > -typedef void (*pte_list_walk_fn) (u64 *spte); > > -static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn) > > -{ > > - struct pte_list_desc *desc; > > - int i; > > - > > - if (!*pte_list) > > - return; > > - > > - if (!(*pte_list & 1)) > > - return fn((u64 *)*pte_list); > > - > > - desc = (struct pte_list_desc *)(*pte_list & ~1ul); > > - while (desc) { > > - for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i) > > - fn(desc->sptes[i]); > > - desc = desc->more; > > - } > > -} > > - > > static unsigned long *__gfn_to_rmap(gfn_t gfn, int level, > > struct kvm_memory_slot *slot) > > { > > @@ -1741,7 +1721,12 @@ static struct kvm_mmu_page > > *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, > > static void mark_unsync(u64 *spte); > > static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp) > > { > > - pte_list_walk(&sp->parent_ptes, mark_unsync); > > + u64 *sptep; > > + struct rmap_iterator iter; > > + > > + for_each_rmap_spte(&sp->parent_ptes, &iter, sptep) { > > + mark_unsync(sptep); > > + } > > } > > > > static void mark_unsync(u64 *spte) > > @@ -2111,12 +2096,17 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct > > kvm_vcpu *vcpu, > > Faulting a spte, and one of the levels of sptes, either > > spte-1 > spte-2 > spte-3 > > has present bit clear. So we're searching for a guest page to shadow, with > gfn "gfn". > > > if (sp->unsync && kvm_sync_page_transient(vcpu, sp)) > > break; > > If a shadow for gfn exists, but is unsync, sync guest-page ---to--> kvm > sptes. > > > - mmu_page_add_parent_pte(vcpu, sp, parent_pte); > > add "gfn" (actually its "struct kvm_mmu_page *sp" pointer) to > the parent. > > if (sp->unsync_children) { > > kvm_make_request(KVM_REQ_MMU_SYNC, vcpu); > > kvm_mmu_mark_parents_unsync(sp); > > kvm_mmu_mark_parents_unsync relied on the links from current level all > the way to top level to mark all levels unsync, so that on guest entry, > KVM_REQ_MMU_SYNC is processed and any level is brought from guest --> > kvm pages. This now fails, because you removed "mmu_page_add_parent_pte" > (the link is not formed all the way to root). > > Unless I am missing something, this is not correct.

The actual issue is this: a higher level page that had, under its children, no out of sync pages, now has, due to your addition, a child that is unsync:

initial state:
level1

final state:
level1 -x-> level2 -x-> level3

Where -x-> are the links created by this pagefault fixing round. If _any_ page under you is unsync (not necessarily the ones this pagefault is accessing), you have to mark parents unsync.
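The propagation Marcelo describes can be modeled with a small user-space toy. This is only a sketch of the idea, not the kernel structures: the real struct kvm_mmu_page tracks parents via parent_ptes rmaps and an unsync_child_bitmap, while this model keeps just parent links and a single flag.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PARENTS 4

/* Toy shadow page: parent links plus an "unsync below me" flag. */
struct shadow_page {
    bool unsync_children;
    struct shadow_page *parents[MAX_PARENTS];
    size_t nr_parents;
};

static void link_parent(struct shadow_page *child, struct shadow_page *parent)
{
    child->parents[child->nr_parents++] = parent;
}

/* Propagate "something below is unsync" up every parent link all the
 * way to the root.  The early return mirrors mark_unsync() stopping
 * once a page is already flagged. */
static void mark_unsync(struct shadow_page *sp)
{
    size_t i;

    if (sp->unsync_children)
        return;
    sp->unsync_children = true;
    for (i = 0; i < sp->nr_parents; i++)
        mark_unsync(sp->parents[i]);
}
```

The point of the objection is visible in the model: marking level3 only reaches level1 if the level3 -> level2 -> level1 links already exist; a page whose parent link has not been added yet (the removed mmu_page_add_parent_pte ordering) never flags the root.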
Re: [PATCH 05/10] KVM: x86: MMU: Use for_each_rmap_spte macro instead of pte_list_walk()
On Thu, Nov 12, 2015 at 08:52:45PM +0900, Takuya Yoshikawa wrote: > kvm_mmu_mark_parents_unsync() alone uses pte_list_walk(), which does > nearly the same as the for_each_rmap_spte macro. The only difference > is that is_shadow_present_pte() checks cannot be placed there because > kvm_mmu_mark_parents_unsync() can be called with a new parent pointer > whose entry is not set yet. > > By calling mark_unsync() separately for the parent and adding the parent > pointer to the parent_ptes chain later in kvm_mmu_get_page(), the macro > works with no problem. > > Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya...@lab.ntt.co.jp> > --- > arch/x86/kvm/mmu.c | 36 +++++++++++++----------------------- > 1 file changed, 13 insertions(+), 23 deletions(-) > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > index e8cfdc4..1691171 100644 > --- a/arch/x86/kvm/mmu.c > +++ b/arch/x86/kvm/mmu.c > @@ -1007,26 +1007,6 @@ static void pte_list_remove(u64 *spte, unsigned long > *pte_list) > } > } > > -typedef void (*pte_list_walk_fn) (u64 *spte); > -static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn) > -{ > - struct pte_list_desc *desc; > - int i; > - > - if (!*pte_list) > - return; > - > - if (!(*pte_list & 1)) > - return fn((u64 *)*pte_list); > - > - desc = (struct pte_list_desc *)(*pte_list & ~1ul); > - while (desc) { > - for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i) > - fn(desc->sptes[i]); > - desc = desc->more; > - } > -} > - > static unsigned long *__gfn_to_rmap(gfn_t gfn, int level, > struct kvm_memory_slot *slot) > { > @@ -1741,7 +1721,12 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct > kvm_vcpu *vcpu, > static void mark_unsync(u64 *spte); > static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp) > { > - pte_list_walk(&sp->parent_ptes, mark_unsync); > + u64 *sptep; > + struct rmap_iterator iter; > + > + for_each_rmap_spte(&sp->parent_ptes, &iter, sptep) { > + mark_unsync(sptep); > + } > } > > static void mark_unsync(u64 *spte) > @@ -2111,12 +2096,17 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct > kvm_vcpu *vcpu,

Faulting a spte, and one of the levels of sptes, either

spte-1
spte-2
spte-3

has present bit clear. So we're searching for a guest page to shadow, with gfn "gfn". > if (sp->unsync && kvm_sync_page_transient(vcpu, sp)) > break; If a shadow for gfn exists, but is unsync, sync guest-page ---to--> kvm sptes. > - mmu_page_add_parent_pte(vcpu, sp, parent_pte); add "gfn" (actually its "struct kvm_mmu_page *sp" pointer) to the parent. > if (sp->unsync_children) { > kvm_make_request(KVM_REQ_MMU_SYNC, vcpu); > kvm_mmu_mark_parents_unsync(sp); kvm_mmu_mark_parents_unsync relied on the links from current level all the way to top level to mark all levels unsync, so that on guest entry, KVM_REQ_MMU_SYNC is processed and any level is brought from guest --> kvm pages. This now fails, because you removed "mmu_page_add_parent_pte" (the link is not formed all the way to root). Unless I am missing something, this is not correct. > - } else if (sp->unsync) > + if (parent_pte) > + mark_unsync(parent_pte); > + } else if (sp->unsync) { > kvm_mmu_mark_parents_unsync(sp); > + if (parent_pte) > + mark_unsync(parent_pte); > + } > + mmu_page_add_parent_pte(vcpu, sp, parent_pte); > > __clear_sp_write_flooding_count(sp); > trace_kvm_mmu_get_page(sp, false);
Re: [PATCH 06/10] KVM: x86: MMU: Consolidate WARN_ON/BUG_ON checks for reverse-mapped sptes
On Thu, Nov 12, 2015 at 08:53:43PM +0900, Takuya Yoshikawa wrote: > At some call sites of rmap_get_first() and rmap_get_next(), BUG_ON is > placed right after the call to detect unrelated sptes which must not be > found in the reverse-mapping list. > > Move this check in rmap_get_first/next() so that all call sites, not > just the users of the for_each_rmap_spte() macro, will be checked the > same way. In addition, change the BUG_ON to WARN_ON since killing the > whole host is the last thing that KVM should try. It should be a BUG_ON, if KVM continues it will corrupt (more) memory. > One thing to keep in mind is that kvm_mmu_unlink_parents() also uses > rmap_get_first() to handle parent sptes. The change will not break it > because parent sptes are present, at least until drop_parent_pte() > actually unlinks them, and not mmio-sptes. > > Signed-off-by: Takuya Yoshikawa> --- > Documentation/virtual/kvm/mmu.txt | 4 ++-- > arch/x86/kvm/mmu.c| 26 +- > 2 files changed, 19 insertions(+), 11 deletions(-) > > diff --git a/Documentation/virtual/kvm/mmu.txt > b/Documentation/virtual/kvm/mmu.txt > index 3a4d681..daf9c0f 100644 > --- a/Documentation/virtual/kvm/mmu.txt > +++ b/Documentation/virtual/kvm/mmu.txt > @@ -203,10 +203,10 @@ Shadow pages contain the following information: > page cannot be destroyed. See role.invalid. >parent_ptes: > The reverse mapping for the pte/ptes pointing at this page's spt. If > -parent_ptes bit 0 is zero, only one spte points at this pages and > +parent_ptes bit 0 is zero, only one spte points at this page and > parent_ptes points at this single spte, otherwise, there exists multiple > sptes pointing at this page and (parent_ptes & ~0x1) points at a data > -structure with a list of parent_ptes. > +structure with a list of parent sptes. >unsync: > If true, then the translations in this page may not match the guest's > translation. 
This is equivalent to the state of the tlb when a pte is > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > index 1691171..ee7b101 100644 > --- a/arch/x86/kvm/mmu.c > +++ b/arch/x86/kvm/mmu.c > @@ -1079,17 +1079,23 @@ struct rmap_iterator { > */ > static u64 *rmap_get_first(unsigned long rmap, struct rmap_iterator *iter) > { > + u64 *sptep; > + > if (!rmap) > return NULL; > > if (!(rmap & 1)) { > iter->desc = NULL; > - return (u64 *)rmap; > + sptep = (u64 *)rmap; > + goto out; > } > > iter->desc = (struct pte_list_desc *)(rmap & ~1ul); > iter->pos = 0; > - return iter->desc->sptes[iter->pos]; > + sptep = iter->desc->sptes[iter->pos]; > +out: > + WARN_ON(!is_shadow_present_pte(*sptep)); > + return sptep; > } > > /* > @@ -1099,14 +1105,14 @@ static u64 *rmap_get_first(unsigned long rmap, struct > rmap_iterator *iter) > */ > static u64 *rmap_get_next(struct rmap_iterator *iter) > { > + u64 *sptep; > + > if (iter->desc) { > if (iter->pos < PTE_LIST_EXT - 1) { > - u64 *sptep; > - > ++iter->pos; > sptep = iter->desc->sptes[iter->pos]; > if (sptep) > - return sptep; > + goto out; > } > > iter->desc = iter->desc->more; > @@ -1114,17 +1120,20 @@ static u64 *rmap_get_next(struct rmap_iterator *iter) > if (iter->desc) { > iter->pos = 0; > /* desc->sptes[0] cannot be NULL */ > - return iter->desc->sptes[iter->pos]; > + sptep = iter->desc->sptes[iter->pos]; > + goto out; > } > } > > return NULL; > +out: > + WARN_ON(!is_shadow_present_pte(*sptep)); > + return sptep; > } > > #define for_each_rmap_spte(_rmap_, _iter_, _spte_) \ > for (_spte_ = rmap_get_first(*_rmap_, _iter_); \ > - _spte_ && ({BUG_ON(!is_shadow_present_pte(*_spte_)); 1;}); \ > - _spte_ = rmap_get_next(_iter_)) > + _spte_; _spte_ = rmap_get_next(_iter_)) > > static void drop_spte(struct kvm *kvm, u64 *sptep) > { > @@ -1338,7 +1347,6 @@ static bool kvm_zap_rmapp(struct kvm *kvm, unsigned > long *rmapp) > bool flush = false; > > while ((sptep = rmap_get_first(*rmapp, ))) { > - BUG_ON(!(*sptep & 
PT_PRESENT_MASK)); > rmap_printk("%s: spte %p %llx.\n", __func__, sptep, *sptep); > > drop_spte(kvm, sptep); > -- > 2.1.0 > > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
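The iterator contract being modified in the patch above can be sketched outside the kernel in a few lines of C. This is only an illustration: `PTE_LIST_EXT` and the types below are simplified stand-ins for the real `struct pte_list_desc` chaining, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

#define PTE_LIST_EXT 3

/* Simplified stand-in for struct pte_list_desc: a block of spte pointers
 * chained through ->more when several sptes map the same gfn. */
struct pte_list_desc {
    unsigned long *sptes[PTE_LIST_EXT];
    struct pte_list_desc *more;
};

struct rmap_iterator {
    struct pte_list_desc *desc;
    int pos;
};

/* rmap bit 0 clear: rmap itself is the lone spte pointer.
 * rmap bit 0 set:   (rmap & ~1) points at a descriptor chain. */
unsigned long *rmap_get_first(unsigned long rmap, struct rmap_iterator *iter)
{
    if (!rmap)
        return NULL;
    if (!(rmap & 1)) {
        iter->desc = NULL;
        return (unsigned long *)rmap;
    }
    iter->desc = (struct pte_list_desc *)(rmap & ~1ul);
    iter->pos = 0;
    return iter->desc->sptes[0];
}

unsigned long *rmap_get_next(struct rmap_iterator *iter)
{
    if (!iter->desc)
        return NULL;
    if (iter->pos < PTE_LIST_EXT - 1) {
        ++iter->pos;
        if (iter->desc->sptes[iter->pos])
            return iter->desc->sptes[iter->pos];
    }
    iter->desc = iter->desc->more;
    if (iter->desc) {
        iter->pos = 0;
        return iter->desc->sptes[0]; /* desc->sptes[0] cannot be NULL */
    }
    return NULL;
}

/* Walks an rmap the way for_each_rmap_spte() does and counts entries. */
int count_rmap(unsigned long rmap)
{
    struct rmap_iterator iter;
    int n = 0;
    for (unsigned long *s = rmap_get_first(rmap, &iter); s; s = rmap_get_next(&iter))
        n++;
    return n;
}
```

The point of the patch is that the present-spte check now lives inside `rmap_get_first`/`rmap_get_next`, so every walk of this shape is checked identically, not just the macro users.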
Re: [RFC PATCH] VFIO: Add a parameter to force nonthread IRQ
On Wed, Oct 28, 2015 at 06:05:00PM +0100, Paolo Bonzini wrote: > > > On 28/10/2015 17:00, Alex Williamson wrote: > > > Alex, would it make sense to use the IRQ bypass infrastructure always, > > > not just for VT-d, to do the MSI injection directly from the VFIO > > > interrupt handler and bypass the eventfd? Basically this would add an > > > RCU-protected list of consumers matching the token to struct > > > irq_bypass_producer, and a > > > > > > int (*inject)(struct irq_bypass_consumer *); > > > > > > callback to struct irq_bypass_consumer. If any callback returns true, > > > the eventfd is not signaled. > > > > Yeah, that might be a good idea, it's probably more plausible than > > making the eventfd_signal() code friendly to call from hard interrupt > > context. On the vfio side can we use request_threaded_irq() directly > > for this? > > I don't know if that gives you a non-threaded IRQ with the real-time > kernel... CCing Marcelo to get some insight. The vfio interrupt handler (threaded or not) runs at a higher priority than the vcpu thread. So don't worry about -RT. About bypass: the smaller number of instructions between device ISR and injection of interrupt to guest, the better, as that will translate directly to reduction in interrupt latency times, which is important, as it determines 1. how often you can switch from pollmode to ACPI C-states. 2. whether the realtime workload is virtualizable. The answer to properties of request_threaded_irq() is: don't know. > > Making the hard irq handler return IRQ_HANDLED if we can use > > the irq bypass manager or IRQ_WAKE_THREAD if we need to use the eventfd. > > I think we need some way to get back to irq thread context to use > > eventfd_signal(). > > The irqfd is already able to schedule a work item, because it runs with > interrupts disabled, so I think we can always return IRQ_HANDLED. > > There's another little complication. Right now, only x86 has > kvm_set_msi_inatomic. 
We should merge kvm_set_msi_inatomic,
> kvm_set_irq_inatomic and kvm_arch_set_irq.
>
> Some cleanups are needed there; the flow between the functions is really
> badly structured because the API grew somewhat by accretion. I'll get
> to it next week or on the way back to Italy.
>
> > Would we ever not want to use the direct bypass
> > manager path if available? Thanks,
>
> I don't think so. KVM always registers itself as a consumer, even if
> there are no VT-d posted interrupts. add_producer simply returns -EINVAL
> then.
>
> Paolo
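Paolo's `inject` callback idea can be sketched in plain C. This is only an illustration of the proposed flow, not the existing irq bypass manager API: the real `struct irq_bypass_consumer` has no such callback today, and all names below are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical consumer carrying the proposed inject() callback.  A real
 * implementation would keep an RCU-protected list keyed by the token. */
struct irq_bypass_consumer {
    int (*inject)(struct irq_bypass_consumer *);
    struct irq_bypass_consumer *next;
};

static int eventfd_signal_count; /* stand-in for eventfd_signal() effects */

/* Called from the VFIO hard interrupt handler: offer the interrupt to each
 * registered consumer; fall back to the eventfd only if none injected. */
bool producer_fire(struct irq_bypass_consumer *head)
{
    for (struct irq_bypass_consumer *c = head; c; c = c->next)
        if (c->inject && c->inject(c))
            return true;          /* injected directly, eventfd skipped */
    eventfd_signal_count++;       /* the IRQ_WAKE_THREAD / eventfd path */
    return false;
}

/* Sample consumers: one that can inject (e.g. KVM with posted interrupts
 * available) and one that cannot. */
int inject_ok(struct irq_bypass_consumer *c)   { (void)c; return 1; }
int inject_fail(struct irq_bypass_consumer *c) { (void)c; return 0; }
```

This mirrors the discussion: the fewer instructions between the device ISR and guest injection, the lower the interrupt latency, so the eventfd is only signaled when no consumer could handle the interrupt directly.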
Re: [PATCH v2 0/3] target-i386: save/restore vcpu's TSC rate during migration
On Thu, Oct 22, 2015 at 04:45:21PM -0200, Eduardo Habkost wrote: > On Tue, Oct 20, 2015 at 03:22:51PM +0800, Haozhong Zhang wrote: > > This patchset enables QEMU to save/restore vcpu's TSC rate during the > > migration. When cooperating with KVM which supports TSC scaling, guest > > programs can observe a consistent guest TSC rate even though they are > > migrated among machines with different host TSC rates. > > > > A pair of cpu options 'save-tsc-freq' and 'load-tsc-freq' are added to > > control the migration of vcpu's TSC rate. > > The requirements and goals aren't clear to me. I see two possible use > cases, here: > > 1) Best effort to keep TSC frequency constant if possible (but not >aborting migration if not possible). This would be an interesting >default, but a bit unpredictable. > 2) Strictly ensuring TSC frequency stays constant on migration (and >aborting migration if not possible). This would be an useful feature, >but can't be enabled by default unless both hosts have the same TSC >frequency or support TSC scaling. Only destination needs to support TSC scaling, to match the frequency of the incoming host. The KVM code for this feature has submitted or integrated? > Which one(s) you are trying to implement? > > In other words, what is the right behavior when KVM_SET_TSC_KHZ fails or > KVM_CAP_TSC_CONTROL is not available? We can't answer that question if > the requirements and goals are not clear. > > Once we know what exactly is the goal, we could enable the new mode with > a single option, instead of raw options to control migration stream > loading/saving. Windows and Linux guests have paravirt clocks and/or options to disable direct TSC usage for timekeeping purposes. So disabling migration seems overkill. > > > > * By default, the migration of vcpu's TSC rate is enabled only on > >pc-*-2.5 and newer machine types. If the cpu option 'save-tsc-freq' > >is present, the vcpu's TSC rate will be migrated from older machine > >types as well. 
> > * Another cpu option 'load-tsc-freq' controls whether the migrated > >vcpu's TSC rate is used. By default, QEMU will not use the migrated > >TSC rate if this option is not present. Otherwise, QEMU will use > >the migrated TSC rate and override the TSC rate given by the cpu > >option 'tsc-freq'. > > > > Changes in v2: > > * Add a pair of cpu options 'save-tsc-freq' and 'load-tsc-freq' to > >control the migration of vcpu's TSC rate. > > * Move all logic of setting TSC rate to target-i386. > > * Remove the duplicated TSC setup in kvm_arch_init_vcpu(). > > > > Haozhong Zhang (3): > > target-i386: add a subsection for migrating vcpu's TSC rate > > target-i386: calculate vcpu's TSC rate to be migrated > > target-i386: load the migrated vcpu's TSC rate > > > > include/hw/i386/pc.h | 5 + > > target-i386/cpu.c | 2 ++ > > target-i386/cpu.h | 3 +++ > > target-i386/kvm.c | 61 > > +++ > > target-i386/machine.c | 19 > > 5 files changed, 81 insertions(+), 9 deletions(-) > > > > -- > > 2.4.8 > > > > -- > Eduardo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
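The two policies Eduardo distinguishes can be captured in a small decision helper. This is a sketch with made-up names, not QEMU code; it encodes the point made above that only the destination needs TSC scaling (KVM_CAP_TSC_CONTROL), since it is the side that must reproduce the incoming guest frequency.

```c
#include <assert.h>
#include <stdbool.h>

/* Can the destination host present the incoming guest TSC frequency?
 * Either its TSC already runs at that rate, or hardware TSC scaling
 * (KVM_CAP_TSC_CONTROL + KVM_SET_TSC_KHZ) can emulate it. */
bool can_keep_tsc_freq(long guest_khz, long dest_host_khz, bool dest_has_scaling)
{
    return guest_khz == dest_host_khz || dest_has_scaling;
}

/* strict == true models policy (2): abort migration when the frequency
 * cannot be preserved.  strict == false models best-effort policy (1):
 * migrate anyway and let the guest's paravirt clock absorb the change. */
bool migration_allowed(long guest_khz, long dest_host_khz,
                       bool dest_has_scaling, bool strict)
{
    return !strict || can_keep_tsc_freq(guest_khz, dest_host_khz, dest_has_scaling);
}
```

As Marcelo notes, aborting migration is probably overkill for guests that use kvmclock or other paravirt clocks, which argues for the best-effort default.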
[PATCH] KVM: x86: move steal time initialization to vcpu entry time
As reported at https://bugs.launchpad.net/qemu/+bug/1494350, it is possible to have vcpu->arch.st.last_steal initialized from a thread other than vcpu thread, say the iothread, via KVM_SET_MSRS. Which can cause an overflow later (when subtracting from vcpu threads sched_info.run_delay). To avoid that, move steal time accumulation to vcpu entry time, before copying steal time data to guest. Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 8f0f6ec..0e0332e 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2030,6 +2030,8 @@ static void accumulate_steal_time(struct kvm_vcpu *vcpu) static void record_steal_time(struct kvm_vcpu *vcpu) { + accumulate_steal_time(vcpu); + if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED)) return; @@ -2182,12 +2184,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) if (!(data & KVM_MSR_ENABLED)) break; - vcpu->arch.st.last_steal = current->sched_info.run_delay; - - preempt_disable(); - accumulate_steal_time(vcpu); - preempt_enable(); - kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu); break; @@ -2830,7 +2826,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) vcpu->cpu = cpu; } - accumulate_steal_time(vcpu); kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu); } -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
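The effect of the patch can be modeled in user space. Because `record_steal_time()` runs on the vcpu thread at entry time, `last_steal` and `run_delay` always come from the same thread's `sched_info`, so the subtraction cannot go backwards. A rough model (not kernel code; the KVM_MSR_ENABLED check and guest copy are omitted):

```c
#include <assert.h>
#include <stdint.h>

struct vcpu_steal {
    uint64_t steal;      /* cumulative steal time exposed to the guest */
    uint64_t last_steal; /* previous snapshot of sched_info.run_delay */
};

/* Runs on the vcpu thread right before steal data is copied to the
 * guest, mirroring record_steal_time() after the patch. */
void record_steal_time(struct vcpu_steal *st, uint64_t run_delay)
{
    st->steal += run_delay - st->last_steal; /* same thread: never wraps */
    st->last_steal = run_delay;
}
```

The bug being fixed is visible if `last_steal` is seeded from a different thread (e.g. the iothread via KVM_SET_MSRS) whose `run_delay` is larger than the vcpu thread's: the unsigned delta then wraps to a huge value.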
Re: [PATCH 2/2] Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR"
On Tue, Sep 22, 2015 at 09:52:49PM +0200, Paolo Bonzini wrote: > > > On 22/09/2015 21:01, Marcelo Tosatti wrote: > > On Fri, Sep 18, 2015 at 05:54:30PM +0200, Radim Krčmář wrote: > >> Shifting pvclock_vcpu_time_info.system_time on write to KVM system time > >> MSR is a change of ABI. Probably only 2.6.16 based SLES 10 breaks due > >> to its custom enhancements to kvmclock, but KVM never declared the MSR > >> only for one-shot initialization. (Doc says that only one write is > >> needed.) > >> > >> This reverts commit b7e60c5aedd2b63f16ef06fde4f81ca032211bc5. > >> And adds a note to the definition of PVCLOCK_COUNTS_FROM_ZERO. > >> > >> Signed-off-by: Radim Krčmář <rkrc...@redhat.com> > >> --- > >> arch/x86/include/asm/pvclock-abi.h | 1 + > >> arch/x86/kvm/x86.c | 4 > >> 2 files changed, 1 insertion(+), 4 deletions(-) > >> > >> diff --git a/arch/x86/include/asm/pvclock-abi.h > >> b/arch/x86/include/asm/pvclock-abi.h > >> index 655e07a48f6c..67f08230103a 100644 > >> --- a/arch/x86/include/asm/pvclock-abi.h > >> +++ b/arch/x86/include/asm/pvclock-abi.h > >> @@ -41,6 +41,7 @@ struct pvclock_wall_clock { > >> > >> #define PVCLOCK_TSC_STABLE_BIT(1 << 0) > >> #define PVCLOCK_GUEST_STOPPED (1 << 1) > >> +/* PVCLOCK_COUNTS_FROM_ZERO broke ABI and can't be used anymore. 
*/ > >> #define PVCLOCK_COUNTS_FROM_ZERO (1 << 2) > >> #endif /* __ASSEMBLY__ */ > >> #endif /* _ASM_X86_PVCLOCK_ABI_H */ > >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > >> index 18d59b584dee..34d33f4757d2 100644 > >> --- a/arch/x86/kvm/x86.c > >> +++ b/arch/x86/kvm/x86.c > >> @@ -1707,8 +1707,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v) > >>vcpu->pvclock_set_guest_stopped_request = false; > >>} > >> > >> - pvclock_flags |= PVCLOCK_COUNTS_FROM_ZERO; > >> - > >>/* If the host uses TSC clocksource, then it is stable */ > >>if (use_master_clock) > >>pvclock_flags |= PVCLOCK_TSC_STABLE_BIT; > >> @@ -2006,8 +2004,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct > >> msr_data *msr_info) > >>>requests); > >> > >>ka->boot_vcpu_runs_old_kvmclock = tmp; > >> - > >> - ka->kvmclock_offset = -get_kernel_ns(); > >>} > >> > >>vcpu->arch.time = data; > >> -- > >> 2.5.2 > > > > ACK > > So I suppose you changed your mind :) but can you explain the reasoning? > > Paolo The patch is correct. Overflow (issue raised) is only an issue without PVCLOCK_TSC_STABLE_BIT. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] x86: kvmclock: abolish PVCLOCK_COUNTS_FROM_ZERO
On Fri, Sep 18, 2015 at 05:54:29PM +0200, Radim Krčmář wrote: > Newer KVM won't be exposing PVCLOCK_COUNTS_FROM_ZERO anymore. > The purpose of that flags was to start counting system time from 0 when > the KVM clock has been initialized. > We can achieve the same by selecting one read as the initial point. > > A simple subtraction will work unless the KVM clock count overflows > earlier (has smaller width) than scheduler's cycle count. We should be > safe till x86_128. > > Because PVCLOCK_COUNTS_FROM_ZERO was enabled only on new hypervisors, > setting sched clock as stable based on PVCLOCK_TSC_STABLE_BIT might > regress on older ones. > > I presume we don't need to change kvm_clock_read instead of introducing > kvm_sched_clock_read. A problem could arise in case sched_clock is > expected to return the same value as get_cycles, but we should have > merged those clocks in that case. > > Signed-off-by: Radim Krčmář> --- > arch/x86/kernel/kvmclock.c | 46 > +++--- > 1 file changed, 35 insertions(+), 11 deletions(-) > > diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c > index 2c7aafa70702..ef5b3d2cecce 100644 > --- a/arch/x86/kernel/kvmclock.c > +++ b/arch/x86/kernel/kvmclock.c > @@ -32,6 +32,7 @@ > static int kvmclock = 1; > static int msr_kvm_system_time = MSR_KVM_SYSTEM_TIME; > static int msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK; > +static cycle_t kvm_sched_clock_offset; > > static int parse_no_kvmclock(char *arg) > { > @@ -92,6 +93,29 @@ static cycle_t kvm_clock_get_cycles(struct clocksource *cs) > return kvm_clock_read(); > } > > +static cycle_t kvm_sched_clock_read(void) > +{ > + return kvm_clock_read() - kvm_sched_clock_offset; > +} > + > +static inline void kvm_sched_clock_init(bool stable) > +{ > + if (!stable) { > + pv_time_ops.sched_clock = kvm_clock_read; > + return; > + } > + > + kvm_sched_clock_offset = kvm_clock_read(); > + pv_time_ops.sched_clock = kvm_sched_clock_read; > + set_sched_clock_stable(); > + > + printk("kvm-clock: using 
sched offset of %llu cycles\n", > + kvm_sched_clock_offset); > + > + BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) > > + sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time)); > +} > + > /* > * If we don't do that, there is the possibility that the guest > * will calibrate under heavy load - thus, getting a lower lpj - > @@ -248,7 +272,17 @@ void __init kvmclock_init(void) > memblock_free(mem, size); > return; > } > - pv_time_ops.sched_clock = kvm_clock_read; > + > + if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT)) > + pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT); > + > + cpu = get_cpu(); > + vcpu_time = _clock[cpu].pvti; > + flags = pvclock_read_flags(vcpu_time); > + > + kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT); > + put_cpu(); > + > x86_platform.calibrate_tsc = kvm_get_tsc_khz; > x86_platform.get_wallclock = kvm_get_wallclock; > x86_platform.set_wallclock = kvm_set_wallclock; > @@ -265,16 +299,6 @@ void __init kvmclock_init(void) > kvm_get_preset_lpj(); > clocksource_register_hz(_clock, NSEC_PER_SEC); > pv_info.name = "KVM"; > - > - if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT)) > - pvclock_set_flags(~0); > - > - cpu = get_cpu(); > - vcpu_time = _clock[cpu].pvti; > - flags = pvclock_read_flags(vcpu_time); > - if (flags & PVCLOCK_COUNTS_FROM_ZERO) > - set_sched_clock_stable(); > - put_cpu(); > } > > int __init kvm_setup_vsyscall_timeinfo(void) > -- > 2.5.2 ACK -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
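The core of [1/2] is just "remember one kvmclock read, subtract it from every later read". A user-space model of `kvm_sched_clock_init()`/`kvm_sched_clock_read()`, with `kvm_clock_read()` replaced by a fake counter:

```c
#include <assert.h>
#include <stdint.h>

static uint64_t fake_ns;                /* stand-in for kvm_clock_read() */
static uint64_t kvm_sched_clock_offset;

static uint64_t kvm_clock_read(void) { return fake_ns; }

/* One read is selected as the initial point ... */
void kvm_sched_clock_init(void)
{
    kvm_sched_clock_offset = kvm_clock_read();
}

/* ... so sched_clock counts from zero with no hypervisor-side
 * PVCLOCK_COUNTS_FROM_ZERO support needed. */
uint64_t kvm_sched_clock_read(void)
{
    return kvm_clock_read() - kvm_sched_clock_offset;
}
```

This works as long as no later kvmclock read on any VCPU falls below the chosen offset, which is exactly the overflow concern raised later in the thread.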
Re: [PATCH v1] Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR"
On Tue, Sep 22, 2015 at 06:33:46PM +0200, Radim Krčmář wrote: > PVCLOCK_COUNTS_FROM_ZERO broke ABI and (at least) three things with it. > All problems stem from repeated writes to MSR_KVM_SYSTEM_TIME(_NEW). > The reverted patch treated the MSR write as a one-shot initializer: > any write from VCPU 0 would reset system_time. > > And this is what broke for Linux guests: > * Onlining/hotplugging of VCPU 0 > > VCPU has to assign an address to KVM clock before use, which is done > with MSR_KVM_SYSTEM_TIME_NEW. Linux has an idea that time should not > jump backward, so any `sleep` won't return before |system_time at the > point of offline| elapses since the online. Be sure to run ntp. > > * S3 and S4 resume > > If you don't have PVCLOCK_TSC_STABLE_BIT in Linux, resume will freeze > for |system_time at the point of suspend|, because pvclock ensures > monoticity and kvmclock did not think about it. > > If you have stable clock, execution will resume immediately, but > restoring KVM clock writes to the MSR and dmesg starts to count from > zero. It's better than the onlining, but not what we want either. > > * Boot of SLES 10 guest > > SLES 10 has a custom implementation of kvm clock that calls > MSR_KVM_SYSTEM_TIME before every read to enhance precision ... > Two things are happening at the same time: > 1) The guest periodically receives an interrupt that is handled by > main_timer_handler(): > a) get time using the kvm clock: > 1) write the address to MSR_KVM_SYSTEM_TIME > 2) read tsc and pvclock (tsc_offset, system_time) > 3) time = tsc - tsc_offset + system_time > b) compute time since the last main_timer_handler() > c) bump jiffies if enough time has elapsed > 2) the guest wants to calibrate loops per jiffy [1]: > a) read tsc > b) loop till jiffies increase > c) compute lpj > > Because (1a1) always resets the system_time to 0, we read the same > value over and over so the condition for (1c) is never true and > jiffies remain constant. 
A hang happens in (2b) as it is the first > place that depends on jiffies. > > > We could make hypervisor workaround for this, but that is just asking > for more trouble. Luckily, reverting does not break to guests that > learned about PVCLOCK_COUNTS_FROM_ZERO, in new ways. > Only 4.2+ guests with NOHZ_FULL wanted PVCLOCK_COUNTS_FROM_ZERO, which > is a good trade-off for not regressing. > > This reverts commit b7e60c5aedd2b63f16ef06fde4f81ca032211bc5. > And adds a note to the definition of PVCLOCK_COUNTS_FROM_ZERO. > > Cc: sta...@vger.kernel.org > Signed-off-by: Radim Krčmář> --- > v1: Extended commit message based on a discussion with Marcelo > > arch/x86/include/asm/pvclock-abi.h | 1 + > arch/x86/kvm/x86.c | 4 > 2 files changed, 1 insertion(+), 4 deletions(-) > > diff --git a/arch/x86/include/asm/pvclock-abi.h > b/arch/x86/include/asm/pvclock-abi.h > index 655e07a48f6c..67f08230103a 100644 > --- a/arch/x86/include/asm/pvclock-abi.h > +++ b/arch/x86/include/asm/pvclock-abi.h > @@ -41,6 +41,7 @@ struct pvclock_wall_clock { > > #define PVCLOCK_TSC_STABLE_BIT (1 << 0) > #define PVCLOCK_GUEST_STOPPED(1 << 1) > +/* PVCLOCK_COUNTS_FROM_ZERO broke ABI and can't be used anymore. 
*/ > #define PVCLOCK_COUNTS_FROM_ZERO (1 << 2) > #endif /* __ASSEMBLY__ */ > #endif /* _ASM_X86_PVCLOCK_ABI_H */ > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 4bca39f0fdb3..71731994d897 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -1711,8 +1711,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v) > vcpu->pvclock_set_guest_stopped_request = false; > } > > - pvclock_flags |= PVCLOCK_COUNTS_FROM_ZERO; > - > /* If the host uses TSC clocksource, then it is stable */ > if (use_master_clock) > pvclock_flags |= PVCLOCK_TSC_STABLE_BIT; > @@ -2010,8 +2008,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct > msr_data *msr_info) > >requests); > > ka->boot_vcpu_runs_old_kvmclock = tmp; > - > - ka->kvmclock_offset = -get_kernel_ns(); > } > > vcpu->arch.time = data; > -- > 2.5.3 NACK, please use original patchset. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR"
On Fri, Sep 18, 2015 at 05:54:30PM +0200, Radim Krčmář wrote: > Shifting pvclock_vcpu_time_info.system_time on write to KVM system time > MSR is a change of ABI. Probably only 2.6.16 based SLES 10 breaks due > to its custom enhancements to kvmclock, but KVM never declared the MSR > only for one-shot initialization. (Doc says that only one write is > needed.) > > This reverts commit b7e60c5aedd2b63f16ef06fde4f81ca032211bc5. > And adds a note to the definition of PVCLOCK_COUNTS_FROM_ZERO. > > Signed-off-by: Radim Krčmář> --- > arch/x86/include/asm/pvclock-abi.h | 1 + > arch/x86/kvm/x86.c | 4 > 2 files changed, 1 insertion(+), 4 deletions(-) > > diff --git a/arch/x86/include/asm/pvclock-abi.h > b/arch/x86/include/asm/pvclock-abi.h > index 655e07a48f6c..67f08230103a 100644 > --- a/arch/x86/include/asm/pvclock-abi.h > +++ b/arch/x86/include/asm/pvclock-abi.h > @@ -41,6 +41,7 @@ struct pvclock_wall_clock { > > #define PVCLOCK_TSC_STABLE_BIT (1 << 0) > #define PVCLOCK_GUEST_STOPPED(1 << 1) > +/* PVCLOCK_COUNTS_FROM_ZERO broke ABI and can't be used anymore. 
*/ > #define PVCLOCK_COUNTS_FROM_ZERO (1 << 2) > #endif /* __ASSEMBLY__ */ > #endif /* _ASM_X86_PVCLOCK_ABI_H */ > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 18d59b584dee..34d33f4757d2 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -1707,8 +1707,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v) > vcpu->pvclock_set_guest_stopped_request = false; > } > > - pvclock_flags |= PVCLOCK_COUNTS_FROM_ZERO; > - > /* If the host uses TSC clocksource, then it is stable */ > if (use_master_clock) > pvclock_flags |= PVCLOCK_TSC_STABLE_BIT; > @@ -2006,8 +2004,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct > msr_data *msr_info) > >requests); > > ka->boot_vcpu_runs_old_kvmclock = tmp; > - > - ka->kvmclock_offset = -get_kernel_ns(); > } > > vcpu->arch.time = data; > -- > 2.5.2 ACK -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.
> So either:
>
> Proceed with guest solution:
> -> Make sure the overflow can't happen (and write down why not in the
>    code). Don't assume a small delta between kvmclock values of vcpus.
> -> Handle stable -> non-stable kvmclock transition.
> -> kvmclock counts from zero should not depend on stable kvmclock
>    (because nohz_full should work on hpet host systems).
>
> Enable counts-from-zero on MSR_KVM_SYSTEM_TIME_NEW:
> -> Figure out what's wrong with different kvmclock values on hotplug,
>    and fix it.

Find data which allows you to differentiate between hotplug of vCPU-0 and system initialization.

Easy one: whether MSR_KVM_SYSTEM_TIME_NEW contains valid data (that is, kvmclock is enabled) for vCPUs other than vCPU-0. This can't be the case on system initialization (otherwise the host would be corrupting guest memory), and must be the case when hotplugging vCPU-0.
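The heuristic suggested above is easy to express as code. This is a sketch, not KVM code; the per-vCPU "kvmclock enabled" flag stands in for MSR_KVM_SYSTEM_TIME_NEW holding a valid address.

```c
#include <assert.h>
#include <stdbool.h>

/* On a write to vCPU-0's MSR_KVM_SYSTEM_TIME_NEW: if any other vCPU
 * already has kvmclock enabled, vCPU-0 is being hotplugged and
 * system_time must not be reset; if none do, this is system
 * initialization and counting from zero would be safe. */
bool vcpu0_write_is_hotplug(const bool *kvmclock_enabled, int nr_vcpus)
{
    for (int i = 1; i < nr_vcpus; i++)
        if (kvmclock_enabled[i])
            return true;
    return false;
}
```

At initialization no other vCPU can have a registered kvmclock page (the host would otherwise be writing into unclaimed guest memory), which is what makes the test unambiguous.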
Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.
On Tue, Sep 22, 2015 at 12:00:39AM +0200, Radim Krčmář wrote: > 2015-09-21 17:53-0300, Marcelo Tosatti: > > On Mon, Sep 21, 2015 at 10:00:27PM +0200, Radim Krčmář wrote: > >> 2015-09-21 12:52-0300, Marcelo Tosatti: > >> > On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim Krčmář wrote: > >> >> 2015-09-20 19:57-0300, Marcelo Tosatti: > >> >>> Is it counting from zero that breaks SLES10? > >> >> > >> >> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did. > >> >> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to, > >> >> while still keeping system time; we used to allow that, which means an > >> >> ABI breakage. (And we can't even say that guest's behaviour is against > >> >> the spec ...) > >> > > >> > Because this behaviour was not defined. > >> > >> It is defined by implementation. > >> > >> > Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on > >> > boot_vcpu_runs_old_kvmclock == false? > >> > The patch would be much simpler. > >> > >> If you mean the hunk in cover letter, I don't like it because we presume > >> that no other guests were broken. > > > > Yes patch 1. > > (I'd call them: a hunk in cover letter [0/2], patch 1 = [1/2], and > patch 2 = [2/2].) > > > Do you have an example of another guest that is broken? > > Yes, the hot-plug in any relatively recent Linux guest. > > > Really, assuming things about the value of the msr read when you > > write to the MSR is very awkward. That SuSE code is incorrect, it fails > > on other occasions as well (see > > 54750f2cf042c42b4223d67b1bd20138464bde0e). > > Yeah, I even read the SUSE implementation, sad times. > > >> I really don't like it > > > > Because it changes behaviour that was previously unspecified? > > No, because it adds another exemption to already ugly code. > (Talking about the hunk in [0/2].) > > >> so I thought about other problems with > >> PVCLOCK_COUNTS_FROM_ZERO ... have you tried to hot-replug VCPU 0? > > > > Can't unplug VCPU 0 i think. 
> > You can. > > >> It doesn't work well ;) > > > > Can you hot-unplug vcpu 0? > > I can, I did, hot-plug bugged. Due to PVCLOCK_COUNTS_FROM_ZERO patch? > >> > The problem is, "selecting one read as the initial point" is inherently > >> > racy: that delta is relative to one moment (kvmclock read) at one vcpu, > >> > but must be applied to all vcpus. > >> > >> I don't think that is a problem. > >> > >> Kvmclock has a notion of a global system_time in nanoseconds (one value > >> that defines the time, assigned with VM ioctl KVM_SET_CLOCK) and tries > >> to propagate system_time into guest VCPUs as precisely as possible with > >> the help of TSC. > > > >Different pairs of values (system_time, tsc reads) are fundamentally > >broken. This is why > > > >commit 489fb490dbf8dab0249ad82b56688ae3842a79e8 > >x86, paravirt: Add a global synchronization point for pvclock > > > >Exists. > > > >Then to guarantee monotonicity you need to stop those updates > >(or perform the pair update in lock step): > > > >commit d828199e84447795c6669ff0e6c6d55eb9beeff6 > >KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag > > (Thanks for the commits, split values are the core design that avoids > modifying rdtsc, Which is broken (thats the point). > so I'll be grad to read its details.) > > >> > 2) You rely on monotonicity across vcpus to perform > >> > the 'minus delta that was read on vcpu0' calculation, but > >> > monotonicity across vcpus can fail during runtime > >> >(say host clocksource goes tsc->hpet due to tsc instability). > >> > >> That could be a problem, but I'm adding a VCPU independent constant to > >> all reads -- does the new code rely on monoticity in places where the > >> old one didn't? 
> > The problem is overflow:
> > kvmclock non-monotonic across vcpus AND delta subtraction:
> >
> > ptime | vcpu0 kvmclock time | vcpu0 sched_clock | vcpu1 kvmclock time | vcpu1 sched_clock
> >   1   |          1          |         1         |                     |
> >   2   |          2          |         2         |          1          |
>
> KVM sched clock not used before this point (I even moved the code to make
Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.
On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim Krčmář wrote: > 2015-09-20 19:57-0300, Marcelo Tosatti: > > On Fri, Sep 18, 2015 at 05:54:28PM +0200, Radim Krčmář wrote: > >> This patch series will be disabling PVCLOCK_COUNTS_FROM_ZERO flag and is > >> RFC because I haven't explored many potential problems or tested it. > > > > The justification to disable PVCLOCK_COUNTS_FROM_ZERO is because you > > haven't explored potential problems or tested it? Sorry can't parse it. > > > >> > >> [1/2] uses a different algorithm in the guest to start counting from 0. > >> [2/2] stops exposing PVCLOCK_COUNTS_FROM_ZERO in the hypervisor. > >> > >> A viable alternative would be to implement opt-in features in kvm clock. > >> > >> And because we probably only broke one old user (the infamous SLES 10), a > >> workaround like this is also possible: (but I'd rather not do that) > > > > Please describe why SLES 10 breaks in detail: the state of the guest and > > the host before the patch, the state of the guest and host after the > > patch. > > 1) The guest periodically receives an interrupt that is handled by >main_timer_handler(): >a) get time using the kvm clock: > 1) write the address to MSR_KVM_SYSTEM_TIME > 2) read tsc and pvclock (tsc_offset, system_time) > 3) time = tsc - tsc_offset + system_time >b) compute time since the last main_timer_handler() >c) bump jiffies if enough time has elapsed > 2) the guest wants to calibrate loops per jiffy [1]: >a) read tsc >b) loop till jiffies increase >c) compute lpj > > Because (1a1) always resets the system_time to 0, we read the same value > over and over so the condition for (1c) is never true and jiffies remain > constant. This is the problem. A hang happens in (2b) as it is the > first place that depends on jiffies. > > > What does SLES10 expect? > > That a write to MSR_KVM_SYSTEM_TIME does not reset the system time. > > > Is it counting from zero that breaks SLES10? 
> > Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did. > The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to, > while still keeping system time; we used to allow that, which means an > ABI breakage. (And we can't even say that guest's behaviour is against > the spec ...) Because this behaviour was not defined. Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on boot_vcpu_runs_old_kvmclock == false? The patch would be much simpler. The problem is, "selecting one read as the initial point" is inherently racy: that delta is relative to one moment (kvmclock read) at one vcpu, but must be applied to all vcpus. Besides: 1) Stable sched clock in guest does not depend on KVM_FEATURE_CLOCKSOURCE_STABLE_BIT. 2) You rely on monotonicity across vcpus to perform the 'minus delta that was read on vcpu0' calculation, but monotonicity across vcpus can fail during runtime (say host clocksource goes tsc->hpet due to tsc instability). > > > --- > 1: I also did diassembly, but the reproducer is easier to paste >(couldn't find debuginfo) ># qemu-kvm -nographic -kernel vmlinuz-2.6.16.60-0.85.1-default \ > -serial stdio -monitor /dev/null -append 'console=ttyS0' > > and you can get a bit further when setting loops per jiffy manually, > -serial stdio -monitor /dev/null -append 'console=ttyS0 lpj=12345678' > > The dmesg for failing run is > Initializing CPU#0 > PID hash table entries: 512 (order: 9, 16384 bytes) > kvm-clock: cpu 0, msr 0:3f6041, boot clock > kvm_get_tsc_khz: cpu 0, msr 0:e001 > time.c: Using tsc for timekeeping HZ 250 > time.c: Using 100.00 MHz WALL KVM GTOD KVM timer. > time.c: Detected 2591.580 MHz processor. > Console: colour VGA+ 80x25 > Dentry cache hash table entries: 16384 (order: 5, 131072 bytes) > Inode-cache hash table entries: 8192 (order: 4, 65536 bytes) > Checking aperture... 
> Nosave address range: 0009f000 - 000a
> Nosave address range: 000a - 000f
> Nosave address range: 000f - 0010
> Memory: 124884k/130944k available (1856k kernel code, 5544k reserved,
> 812k data, 188k init)
> [Infinitely querying kvm clock here ...]
>
> With '-cpu kvm64,-kvmclock', the next line is
> Calibrating delay using timer specific routine.. 5199.75 BogoMIPS
> (lpj=10399519)
>
> With 'lpj=10399519',
> Calibrating delay loop (skipped)... 5199.75 BogoMIPS preset
> [Manages to get stuck later, in default_idle.]
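The hang described in steps (1) and (2) can be reproduced in miniature: if every write to MSR_KVM_SYSTEM_TIME zeroes system_time, consecutive reads never differ, so jiffies stay constant and the lpj loop in (2b) spins forever. A toy model (the tsc - tsc_offset term is omitted for simplicity):

```c
#include <assert.h>
#include <stdint.h>

static int reset_on_write;   /* 1: MSR write treated as one-shot init (bug) */
static uint64_t system_time; /* pvclock system_time, ns */
static uint64_t host_ns;     /* actual host time, ns */

static void wrmsr_kvm_system_time(void)
{
    system_time = reset_on_write ? 0 : host_ns;
}

/* SLES 10 rewrites the MSR before every clock read (step 1a1). */
static uint64_t guest_clock_read(void)
{
    wrmsr_kvm_system_time();
    return system_time;
}

/* Simulate n timer interrupts of 4 ms each (HZ=250); return how many
 * times main_timer_handler() (step 1c) managed to bump jiffies. */
uint64_t run_ticks(int n, int buggy)
{
    reset_on_write = buggy;
    host_ns = 0;
    uint64_t jiffies = 0, last = guest_clock_read();
    for (int i = 0; i < n; i++) {
        host_ns += 4000000;
        uint64_t now = guest_clock_read();
        if (now - last >= 4000000)
            jiffies++;
        last = now;
    }
    return jiffies;
}
```

With the broken one-shot behavior, `guest_clock_read()` returns the same value on every tick, the elapsed-time condition is never true, and jiffies stay frozen, exactly the state the lpj calibration loop waits on.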
Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.
On Mon, Sep 21, 2015 at 10:00:27PM +0200, Radim Krčmář wrote: > 2015-09-21 12:52-0300, Marcelo Tosatti: > > On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim Krčmář wrote: > >> 2015-09-20 19:57-0300, Marcelo Tosatti: > >>> Is it counting from zero that breaks SLES10? > >> > >> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did. > >> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to, > >> while still keeping system time; we used to allow that, which means an > >> ABI breakage. (And we can't even say that guest's behaviour is against > >> the spec ...) > > > > Because this behaviour was not defined. > > It is defined by implementation. > > > Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on > > boot_vcpu_runs_old_kvmclock == false? > > The patch would be much simpler. > > > If you mean the hunk in cover letter, I don't like it because we presume > that no other guests were broken. Yes patch 1. Do you have an example of another guest that is broken? Really, assuming things about the value of the msr read when you write to the MSR is very awkward. That SuSE code is incorrect, it fails on other occasions as well (see 54750f2cf042c42b4223d67b1bd20138464bde0e). > I really don't like it Because it changes behaviour that was previously unspecified? > so I thought about other problems with > PVCLOCK_COUNTS_FROM_ZERO ... have you tried to hot-replug VCPU 0? Can't unplug VCPU 0 i think. > It doesn't work well ;) Can you hot-unplug vcpu 0? > We don't want to guess what the guest wants so I'd go for the opt-in > paravirt feature unless counting from zero can be done in guest alone. The problem it is tricky (to do the counting inside the guest). Its much cleaner and less complicated if the host starts counting from zero. > > > The problem is, "selecting one read as the initial point" is inherently > > racy: that delta is relative to one moment (kvmclock read) at one vcpu, > > but must be applied to all vcpus. 
> I don't think that is a problem.
>
> Kvmclock has a notion of a global system_time in nanoseconds (one value
> that defines the time, assigned with VM ioctl KVM_SET_CLOCK) and tries
> to propagate system_time into guest VCPUs as precisely as possible with
> the help of TSC.

Different pairs of values (system_time, tsc reads) are fundamentally broken. This is why commit 489fb490dbf8dab0249ad82b56688ae3842a79e8 ("x86, paravirt: Add a global synchronization point for pvclock") exists. Then to guarantee monotonicity you need to stop those updates (or perform the pair update in lock step):

commit d828199e84447795c6669ff0e6c6d55eb9beeff6 ("KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag")

> sched_clock uses kvmclock to get nanoseconds since the system was
> brought up and [1/2] only works with this abstracted ns count.
> A poorly synchronized initial read is irrelevant because all VCPUs will
> be using the same constant offset.
> (We can never know the precise time anyway.)
>
> > Besides:
> >
> > 1) Stable sched clock in guest does not depend on
> >    KVM_FEATURE_CLOCKSOURCE_STABLE_BIT.
>
> Yes, thanks, I will remove that requirement in v1. (We'd need to
> improve a loss of PVCLOCK_TSC_STABLE_BIT otherwise anyway.)
>
> Because the clutchy dependency on PVCLOCK_TSC_STABLE_BIT is going away,
> there is now one possible unsigned overflow: in case the clock was very
> imprecise and VCPU1 managed to get smaller system_time than VCPU0 at the
> time of initialization. It's very unlikely that we'll ever reach legal
> overflow so I can add a condition there.
>
> > 2) You rely on monotonicity across vcpus to perform
> >    the 'minus delta that was read on vcpu0' calculation, but
> >    monotonicity across vcpus can fail during runtime
> >    (say host clocksource goes tsc->hpet due to tsc instability).
>
> That could be a problem, but I'm adding a VCPU independent constant to
> all reads -- does the new code rely on monotonicity in places where the
> old one didn't?
The problem is overflow: kvmclock non-monotonic across vcpus AND delta subtraction:

ptime | vcpu0 kvmclock time | vcpu0 sched_clock | vcpu1 kvmclock time | vcpu1 sched_clock
  1   |          1          |         1         |                     |
  2   |          2          |         2         |          1          |
  3   |          3          |         0         |          2          |        -1
  4   |          4          |         1         |          3          |         0
  5   |          5          |         2         |          4          |         1
  6   |          6          |         3         |          5          |         2
  7   |          7          |         4         |                     |
Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.
On Fri, Sep 18, 2015 at 05:54:28PM +0200, Radim Krčmář wrote: > This patch series will be disabling PVCLOCK_COUNTS_FROM_ZERO flag and is > RFC because I haven't explored many potential problems or tested it. The justification to disable PVCLOCK_COUNTS_FROM_ZERO is because you haven't explored potential problems or tested it? Sorry can't parse it. > > [1/2] uses a different algorithm in the guest to start counting from 0. > [2/2] stops exposing PVCLOCK_COUNTS_FROM_ZERO in the hypervisor. > > A viable alternative would be to implement opt-in features in kvm clock. > > And because we probably only broke one old user (the infamous SLES 10), a > workaround like this is also possible: (but I'd rather not do that) Please describe why SLES 10 breaks in detail: the state of the guest and the host before the patch, the state of the guest and host after the patch. What does SLES10 expect? Is it counting from zero that breaks SLES10? > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index a60bdbccff51..ae9049248aaf 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -2007,7 +2007,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct > msr_data *msr_info) > > ka->boot_vcpu_runs_old_kvmclock = tmp; > > - ka->kvmclock_offset = -get_kernel_ns(); > + if (!ka->boot_vcpu_runs_old_kvmclock) > + ka->kvmclock_offset = -get_kernel_ns(); > } > > vcpu->arch.time = data; > > > Radim Krčmář (2): > x86: kvmclock: abolish PVCLOCK_COUNTS_FROM_ZERO > Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock > system MSR" > > arch/x86/include/asm/pvclock-abi.h | 1 + > arch/x86/kernel/kvmclock.c | 46 > +- > arch/x86/kvm/x86.c | 4 > 3 files changed, 36 insertions(+), 15 deletions(-) > > -- > 2.5.2 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[GIT PULL] KVM fixes for 4.1-rc8
Linus, Please pull from git://git.kernel.org/pub/scm/virt/kvm/kvm.git master To receive the following KVM bug fix, which restores APIC migration functionality. Radim Krčmář (1): KVM: x86: fix lapic.timer_mode on restore arch/x86/kvm/lapic.c | 26 -- 1 file changed, 16 insertions(+), 10 deletions(-)
Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU is blocked
On Tue, Apr 14, 2015 at 07:37:44AM +, Wu, Feng wrote: -Original Message- From: Marcelo Tosatti [mailto:mtosa...@redhat.com] Sent: Tuesday, March 31, 2015 7:56 AM To: Wu, Feng Cc: h...@zytor.com; t...@linutronix.de; mi...@redhat.com; x...@kernel.org; g...@kernel.org; pbonz...@redhat.com; dw...@infradead.org; j...@8bytes.org; alex.william...@redhat.com; jiang@linux.intel.com; eric.au...@linaro.org; linux-ker...@vger.kernel.org; io...@lists.linux-foundation.org; kvm@vger.kernel.org Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU is blocked On Mon, Mar 30, 2015 at 04:46:55AM +, Wu, Feng wrote: -Original Message- From: Marcelo Tosatti [mailto:mtosa...@redhat.com] Sent: Saturday, March 28, 2015 3:30 AM To: Wu, Feng Cc: h...@zytor.com; t...@linutronix.de; mi...@redhat.com; x...@kernel.org; g...@kernel.org; pbonz...@redhat.com; dw...@infradead.org; j...@8bytes.org; alex.william...@redhat.com; jiang@linux.intel.com; eric.au...@linaro.org; linux-ker...@vger.kernel.org; io...@lists.linux-foundation.org; kvm@vger.kernel.org Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU is blocked On Fri, Mar 27, 2015 at 06:34:14AM +, Wu, Feng wrote: Currently, the following code is executed before local_irq_disable() is called, so do you mean 1)moving local_irq_disable() to the place before it. 2) after interrupt is disabled, set KVM_REQ_EVENT in case the ON bit is set? 2) after interrupt is disabled, set KVM_REQ_EVENT in case the ON bit is set. Here is my understanding about your comments here: - Disable interrupts - Check 'ON' - Set KVM_REQ_EVENT if 'ON' is set Then we can put the above code inside if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) just like it used to be. However, I still have some questions about this comment: 1. Where should I set KVM_REQ_EVENT? In function vcpu_enter_guest(), or other places? 
See below: If in vcpu_enter_guest(), since currently local_irq_disable() is called after 'KVM_REQ_EVENT' is checked, is it helpful to set KVM_REQ_EVENT after local_irq_disable() is called? local_irq_disable(); *** add code here *** So we need to add code like the following here, right? if ('ON' is set) kvm_make_request(KVM_REQ_EVENT, vcpu); Hi Marcelo, I changed the code as above, then I found that the ping latency was extremely big (70ms - 400ms). I dug into it and got the root cause. We cannot use checking-on as the judgment, since 'ON' can be cleared by hypervisor software in lots of places. In this case, KVM_REQ_EVENT cannot be set when we check the 'ON' bit, hence the interrupts are not injected to the guest in time. Please refer to the following code, in which the 'ON' bit can be cleared: apic_find_highest_irr() -> vmx_sync_pir_to_irr() -> pi_test_and_clear_on() Searching from the code step by step, apic_find_highest_irr() can be called by many other guys. Thanks, Ok then, ignore my suggestion. Can you resend the latest version please?
[patch v2 0/3] kvmclock: allow stable sched clock
kvmclock provides the behaviour sched_clock users expect. Mark it as stable allowing nohz_full in guests. See individual patches for more details. v2: address comments from Paolo: - set COUNTS_FROM_ZERO unconditionally (host). - rely on KVM_FEATURE_CLOCKSOURCE_STABLE_BIT (guest).
[patch v2 3/3] KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR
Initialize kvmclock base, on kvmclock system MSR write time, so that the guest sees kvmclock counting from zero. This matches baremetal behaviour when kvmclock in guest sets sched clock stable.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

---
 arch/x86/kvm/x86.c | 4
 1 file changed, 4 insertions(+)

Index: kvm/arch/x86/kvm/x86.c
===
--- kvm.orig/arch/x86/kvm/x86.c	2015-05-28 19:18:12.621372286 -0300
+++ kvm/arch/x86/kvm/x86.c	2015-05-28 19:19:17.738268690 -0300
@@ -1700,6 +1700,8 @@
 		vcpu->pvclock_set_guest_stopped_request = false;
 	}

+	/* pvclock counts from zero */
+	pvclock_flags |= PVCLOCK_COUNTS_FROM_ZERO;
 	/* If the host uses TSC clocksource, then it is stable */
 	if (use_master_clock)
 		pvclock_flags |= PVCLOCK_TSC_STABLE_BIT;
@@ -2282,6 +2284,8 @@
 				&vcpu->requests);

 		ka->boot_vcpu_runs_old_kvmclock = tmp;
+
+		ka->kvmclock_offset = -get_kernel_ns();
 	}

 	vcpu->arch.time = data;
[patch v2 2/3] x86: kvmclock: set scheduler clock stable
From: Luiz Capitulino lcapitul...@redhat.com If you try to enable NOHZ_FULL on a guest today, you'll get the following error when the guest tries to deactivate the scheduler tick: WARNING: CPU: 3 PID: 2182 at kernel/time/tick-sched.c:192 can_stop_full_tick+0xb9/0x290() NO_HZ FULL will not work with unstable sched clock CPU: 3 PID: 2182 Comm: kworker/3:1 Not tainted 4.0.0-10545-gb9bb6fb #204 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Workqueue: events flush_to_ldisc 8162a0c7 88011f583e88 814e6ba0 0002 88011f583ed8 88011f583ec8 8104d095 88011f583eb8 0003 0001 0001 Call Trace: IRQ [814e6ba0] dump_stack+0x4f/0x7b [8104d095] warn_slowpath_common+0x85/0xc0 [8104d146] warn_slowpath_fmt+0x46/0x50 [810bd2a9] can_stop_full_tick+0xb9/0x290 [810bd9ed] tick_nohz_irq_exit+0x8d/0xb0 [810511c5] irq_exit+0xc5/0x130 [814f180a] smp_apic_timer_interrupt+0x4a/0x60 [814eff5e] apic_timer_interrupt+0x6e/0x80 EOI [814ee5d1] ? _raw_spin_unlock_irqrestore+0x31/0x60 [8108bbc8] __wake_up+0x48/0x60 [8134836c] n_tty_receive_buf_common+0x49c/0xba0 [8134a6bf] ? tty_ldisc_ref+0x1f/0x70 [81348a84] n_tty_receive_buf2+0x14/0x20 [8134b390] flush_to_ldisc+0xe0/0x120 [81064d05] process_one_work+0x1d5/0x540 [81064c81] ? process_one_work+0x151/0x540 [81065191] worker_thread+0x121/0x470 [81065070] ? process_one_work+0x540/0x540 [8106b4df] kthread+0xef/0x110 [8106b3f0] ? __kthread_parkme+0xa0/0xa0 [814ef4f2] ret_from_fork+0x42/0x70 [8106b3f0] ? __kthread_parkme+0xa0/0xa0 ---[ end trace 06e3507544a38866 ]--- However, it turns out that kvmclock does provide a stable sched_clock callback. So, let the scheduler know this which in turn makes NOHZ_FULL work in the guest. 
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Luiz Capitulino lcapitul...@redhat.com
---
 arch/x86/kernel/kvmclock.c | 14
 1 file changed, 12 insertions(+), 2 deletions(-)

Index: kvm/arch/x86/kernel/kvmclock.c
===
--- kvm.orig/arch/x86/kernel/kvmclock.c	2015-05-27 18:00:53.616391551 -0300
+++ kvm/arch/x86/kernel/kvmclock.c	2015-05-28 19:19:13.878274636 -0300
@@ -24,6 +24,7 @@
 #include <linux/percpu.h>
 #include <linux/hardirq.h>
 #include <linux/memblock.h>
+#include <linux/sched.h>

 #include <asm/x86_init.h>
 #include <asm/reboot.h>
@@ -217,8 +218,10 @@

 void __init kvmclock_init(void)
 {
+	struct pvclock_vcpu_time_info *vcpu_time;
 	unsigned long mem;
-	int size;
+	int size, cpu;
+	u8 flags;

 	size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
@@ -264,7 +267,14 @@
 	pv_info.name = "KVM";

 	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
-		pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
+		pvclock_set_flags(~0);
+
+	cpu = get_cpu();
+	vcpu_time = &hv_clock[cpu].pvti;
+	flags = pvclock_read_flags(vcpu_time);
+	if (flags & PVCLOCK_COUNTS_FROM_ZERO)
+		set_sched_clock_stable();
+	put_cpu();
 }

 int __init kvm_setup_vsyscall_timeinfo(void)
[patch v2 1/3] x86: kvmclock: add flag to indicate pvclock counts from zero
Setting sched clock stable for kvmclock causes the printk timestamps to not start from zero, which is different from baremetal and can possibly break userspace. Add a flag to indicate that the hypervisor sets the clock base to zero when kvmclock is initialized.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 arch/x86/include/asm/pvclock-abi.h | 1 +
 1 file changed, 1 insertion(+)

Index: kvm/arch/x86/include/asm/pvclock-abi.h
===
--- kvm.orig/arch/x86/include/asm/pvclock-abi.h	2014-11-06 23:59:14.615913334 -0200
+++ kvm/arch/x86/include/asm/pvclock-abi.h	2015-05-27 17:40:53.435192771 -0300
@@ -41,5 +41,6 @@
 #define PVCLOCK_TSC_STABLE_BIT	(1 << 0)
 #define PVCLOCK_GUEST_STOPPED	(1 << 1)
+#define PVCLOCK_COUNTS_FROM_ZERO (1 << 2)
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_X86_PVCLOCK_ABI_H */
[patch 2/3] x86: kvmclock: set scheduler clock stable
From: Luiz Capitulino lcapitul...@redhat.com If you try to enable NOHZ_FULL on a guest today, you'll get the following error when the guest tries to deactivate the scheduler tick: WARNING: CPU: 3 PID: 2182 at kernel/time/tick-sched.c:192 can_stop_full_tick+0xb9/0x290() NO_HZ FULL will not work with unstable sched clock CPU: 3 PID: 2182 Comm: kworker/3:1 Not tainted 4.0.0-10545-gb9bb6fb #204 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Workqueue: events flush_to_ldisc 8162a0c7 88011f583e88 814e6ba0 0002 88011f583ed8 88011f583ec8 8104d095 88011f583eb8 0003 0001 0001 Call Trace: IRQ [814e6ba0] dump_stack+0x4f/0x7b [8104d095] warn_slowpath_common+0x85/0xc0 [8104d146] warn_slowpath_fmt+0x46/0x50 [810bd2a9] can_stop_full_tick+0xb9/0x290 [810bd9ed] tick_nohz_irq_exit+0x8d/0xb0 [810511c5] irq_exit+0xc5/0x130 [814f180a] smp_apic_timer_interrupt+0x4a/0x60 [814eff5e] apic_timer_interrupt+0x6e/0x80 EOI [814ee5d1] ? _raw_spin_unlock_irqrestore+0x31/0x60 [8108bbc8] __wake_up+0x48/0x60 [8134836c] n_tty_receive_buf_common+0x49c/0xba0 [8134a6bf] ? tty_ldisc_ref+0x1f/0x70 [81348a84] n_tty_receive_buf2+0x14/0x20 [8134b390] flush_to_ldisc+0xe0/0x120 [81064d05] process_one_work+0x1d5/0x540 [81064c81] ? process_one_work+0x151/0x540 [81065191] worker_thread+0x121/0x470 [81065070] ? process_one_work+0x540/0x540 [8106b4df] kthread+0xef/0x110 [8106b3f0] ? __kthread_parkme+0xa0/0xa0 [814ef4f2] ret_from_fork+0x42/0x70 [8106b3f0] ? __kthread_parkme+0xa0/0xa0 ---[ end trace 06e3507544a38866 ]--- However, it turns out that kvmclock does provide a stable sched_clock callback. So, let the scheduler know this which in turn makes NOHZ_FULL work in the guest. 
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Luiz Capitulino lcapitul...@redhat.com
---
 arch/x86/kernel/kvmclock.c | 17
 1 file changed, 15 insertions(+), 2 deletions(-)

Index: kvm/arch/x86/kernel/kvmclock.c
===
--- kvm.orig/arch/x86/kernel/kvmclock.c	2015-05-27 18:00:53.616391551 -0300
+++ kvm/arch/x86/kernel/kvmclock.c	2015-05-27 22:43:14.474432962 -0300
@@ -24,6 +24,7 @@
 #include <linux/percpu.h>
 #include <linux/hardirq.h>
 #include <linux/memblock.h>
+#include <linux/sched.h>

 #include <asm/x86_init.h>
 #include <asm/reboot.h>
@@ -217,8 +218,10 @@

 void __init kvmclock_init(void)
 {
+	struct pvclock_vcpu_time_info *vcpu_time;
 	unsigned long mem;
-	int size;
+	int size, cpu;
+	u8 flags;

 	size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
@@ -263,8 +266,18 @@
 	clocksource_register_hz(&kvm_clock, NSEC_PER_SEC);
 	pv_info.name = "KVM";

+	flags = PVCLOCK_COUNTS_FROM_ZERO;
 	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
-		pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
+		flags |= PVCLOCK_TSC_STABLE_BIT;
+
+	pvclock_set_flags(flags);
+
+	cpu = get_cpu();
+	vcpu_time = &hv_clock[cpu].pvti;
+	flags = pvclock_read_flags(vcpu_time);
+	if (flags & PVCLOCK_COUNTS_FROM_ZERO)
+		set_sched_clock_stable();
+	put_cpu();
 }

 int __init kvm_setup_vsyscall_timeinfo(void)
[patch 1/3] x86: kvmclock: add flag to indicate pvclock counts from zero
Setting sched clock stable for kvmclock causes the printk timestamps to not start from zero, which is different from baremetal and can possibly break userspace. Add a flag to indicate that the hypervisor sets the clock base to zero when kvmclock is initialized.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 arch/x86/include/asm/pvclock-abi.h | 1 +
 1 file changed, 1 insertion(+)

Index: kvm/arch/x86/include/asm/pvclock-abi.h
===
--- kvm.orig/arch/x86/include/asm/pvclock-abi.h	2014-11-06 23:59:14.615913334 -0200
+++ kvm/arch/x86/include/asm/pvclock-abi.h	2015-05-27 17:40:53.435192771 -0300
@@ -41,5 +41,6 @@
 #define PVCLOCK_TSC_STABLE_BIT	(1 << 0)
 #define PVCLOCK_GUEST_STOPPED	(1 << 1)
+#define PVCLOCK_COUNTS_FROM_ZERO (1 << 2)
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_X86_PVCLOCK_ABI_H */
[patch 0/3] kvmclock: allow stable sched clock
kvmclock provides the behaviour sched_clock users expect. Mark it as stable allowing nohz_full in guests. See individual patches for more details.
[patch 3/3] KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR
Initialize kvmclock base, on kvmclock system MSR write time, so that the guest sees kvmclock counting from zero. This matches baremetal behaviour when kvmclock in guest sets sched clock stable.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 arch/x86/kvm/x86.c | 5 +
 1 file changed, 5 insertions(+)

Index: kvm/arch/x86/kvm/x86.c
===
--- kvm.orig/arch/x86/kvm/x86.c	2015-05-27 17:40:46.948189811 -0300
+++ kvm/arch/x86/kvm/x86.c	2015-05-27 22:43:47.340413347 -0300
@@ -1703,6 +1703,8 @@
 	/* If the host uses TSC clocksource, then it is stable */
 	if (use_master_clock)
 		pvclock_flags |= PVCLOCK_TSC_STABLE_BIT;
+	if (ka->kvmclk_counts_from_zero)
+		pvclock_flags |= PVCLOCK_COUNTS_FROM_ZERO;

 	vcpu->hv_clock.flags = pvclock_flags;
@@ -2282,6 +2284,9 @@
 				&vcpu->requests);

 		ka->boot_vcpu_runs_old_kvmclock = tmp;
+
+		ka->kvmclock_offset = -get_kernel_ns();
+		ka->kvmclk_counts_from_zero = true;
 	}

 	vcpu->arch.time = data;
[PATCH v2] KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR
Initialize kvmclock base, on kvmclock system MSR write time, so that the guest sees kvmclock counting from zero. This matches baremetal behaviour when kvmclock in guest sets sched clock stable.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cc2c759..ea40d24 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2188,6 +2188,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 				&vcpu->requests);

 		ka->boot_vcpu_runs_old_kvmclock = tmp;
+
+		ka->kvmclock_offset = -get_kernel_ns();
 	}

 	vcpu->arch.time = data;
KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR
Initialize kvmclock base, on kvmclock system MSR write time, so that the guest sees kvmclock counting from zero. This matches baremetal behaviour when kvmclock in guest sets sched clock stable.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cc2c759..ea40d24 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2188,6 +2188,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 				&vcpu->requests);

 		ka->boot_vcpu_runs_old_kvmclock = tmp;
+
+		ka->kvmclock_offset = -get_kernel_ns();
 	}

 	vcpu->arch.time = data;
Re: kvm: odd time values since kvmclock: set scheduler clock stable
On Mon, May 18, 2015 at 10:13:03PM -0400, Sasha Levin wrote: On 05/18/2015 10:02 PM, Sasha Levin wrote: On 05/18/2015 08:13 PM, Marcelo Tosatti wrote: On Mon, May 18, 2015 at 07:45:41PM -0400, Sasha Levin wrote: On 05/18/2015 06:39 PM, Marcelo Tosatti wrote: On Tue, May 12, 2015 at 07:17:24PM -0400, Sasha Levin wrote: Hi all, I'm seeing odd jump in time values during boot of a KVM guest: [...] [0.00] tsc: Detected 2260.998 MHz processor [3376355.247558] Calibrating delay loop (skipped) preset value.. [...] I've bisected it to: Paolo, Sasha, Although this might seem undesirable, there is no requirement for sched_clock to initialize at 0: * * There is no strict promise about the base, although it tends to start * at 0 on boot (but people really shouldn't rely on that). * Sasha, are you seeing any problem other than the apparent time jump? Nope, but I've looked at it again and it seems that it jumps to the host's clock (that is, in the example above the 3376355 value was the host's clock value). Thanks, Sasha Sasha, that's right. It's the host monotonic clock. It's worth figuring out if (what) userspace breaks on that. I know it says that you shouldn't rely on that, but I'd happily place a bet on at least one userspace treating it as seconds since boot or something similar. Didn't need to go far... In the guest: # date Tue May 19 02:11:46 UTC 2015 # echo hi > /dev/kmsg [3907533.080112] hi # dmesg -T [Fri Jul 3 07:33:41 2015] hi Sasha, Can you give the suggested patch (hypervisor patch...) a try please? (with a patched guest, obviously). KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR
[PATCH v2] KVM: x86: add module parameter to disable periodic kvmclock sync
The periodic kvmclock sync can be an undesired source of latencies. When running cyclictest on a guest, a latency spike is visible. With kvmclock periodic sync disabled, the spike is gone.

Guests should use NTP, which means the propagation of NTP corrections from the host clock is unnecessary.

v2:
- Make parameter read-only (Radim)
- Return early on kvmclock_sync_fn (Andrew)

Reported-and-tested-by: Luiz Capitulino lcapitul...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 986b3f5..4c0c954 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -99,6 +99,9 @@ module_param(ignore_msrs, bool, S_IRUGO | S_IWUSR);
 unsigned int min_timer_period_us = 500;
 module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);

+static bool __read_mostly kvmclock_periodic_sync = true;
+module_param(kvmclock_periodic_sync, bool, S_IRUGO);
+
 bool kvm_has_tsc_control;
 EXPORT_SYMBOL_GPL(kvm_has_tsc_control);
 u32 kvm_max_guest_tsc_khz;
@@ -1768,6 +1771,9 @@ static void kvmclock_sync_fn(struct work_struct *work)
 					   kvmclock_sync_work);
 	struct kvm *kvm = container_of(ka, struct kvm, arch);

+	if (!kvmclock_periodic_sync)
+		return;
+
 	schedule_delayed_work(&kvm->arch.kvmclock_update_work, 0);
 	schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
 					KVMCLOCK_SYNC_PERIOD);
@@ -7109,6 +7115,9 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 	kvm_write_tsc(vcpu, &msr);
 	vcpu_put(vcpu);

+	if (!kvmclock_periodic_sync)
+		return;
+
 	schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
 					KVMCLOCK_SYNC_PERIOD);
 }
Re: [PATCH] kvmclock: set scheduler clock stable
On Thu, Apr 23, 2015 at 05:12:42PM -0400, Luiz Capitulino wrote: If you try to enable NOHZ_FULL on a guest today, you'll get the following error when the guest tries to deactivate the scheduler tick: WARNING: CPU: 3 PID: 2182 at kernel/time/tick-sched.c:192 can_stop_full_tick+0xb9/0x290() NO_HZ FULL will not work with unstable sched clock CPU: 3 PID: 2182 Comm: kworker/3:1 Not tainted 4.0.0-10545-gb9bb6fb #204 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Workqueue: events flush_to_ldisc 8162a0c7 88011f583e88 814e6ba0 0002 88011f583ed8 88011f583ec8 8104d095 88011f583eb8 0003 0001 0001 Call Trace: IRQ [814e6ba0] dump_stack+0x4f/0x7b [8104d095] warn_slowpath_common+0x85/0xc0 [8104d146] warn_slowpath_fmt+0x46/0x50 [810bd2a9] can_stop_full_tick+0xb9/0x290 [810bd9ed] tick_nohz_irq_exit+0x8d/0xb0 [810511c5] irq_exit+0xc5/0x130 [814f180a] smp_apic_timer_interrupt+0x4a/0x60 [814eff5e] apic_timer_interrupt+0x6e/0x80 EOI [814ee5d1] ? _raw_spin_unlock_irqrestore+0x31/0x60 [8108bbc8] __wake_up+0x48/0x60 [8134836c] n_tty_receive_buf_common+0x49c/0xba0 [8134a6bf] ? tty_ldisc_ref+0x1f/0x70 [81348a84] n_tty_receive_buf2+0x14/0x20 [8134b390] flush_to_ldisc+0xe0/0x120 [81064d05] process_one_work+0x1d5/0x540 [81064c81] ? process_one_work+0x151/0x540 [81065191] worker_thread+0x121/0x470 [81065070] ? process_one_work+0x540/0x540 [8106b4df] kthread+0xef/0x110 [8106b3f0] ? __kthread_parkme+0xa0/0xa0 [814ef4f2] ret_from_fork+0x42/0x70 [8106b3f0] ? __kthread_parkme+0xa0/0xa0 ---[ end trace 06e3507544a38866 ]--- However, it turns out that kvmclock does provide a stable sched_clock callback. So, let the scheduler know this which in turn makes NOHZ_FULL work in the guest. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Signed-off-by: Luiz Capitulino lcapitul...@redhat.com --- PS: Original author of this patch is Marcelo. I did most of the testing and backported it to an older real-time kernel tree. Works like a charm. 
arch/x86/kernel/kvmclock.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 42caaef..4e03921 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -24,6 +24,7 @@
 #include <linux/percpu.h>
 #include <linux/hardirq.h>
 #include <linux/memblock.h>
+#include <linux/sched.h>

 #include <asm/x86_init.h>
 #include <asm/reboot.h>
@@ -265,6 +266,8 @@ void __init kvmclock_init(void)

 	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
 		pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
+
+	set_sched_clock_stable();
 }

 int __init kvm_setup_vsyscall_timeinfo(void)
--
1.9.3

Ping?
Re: x86: kvmclock: drop rdtsc_barrier()
On Fri, Apr 24, 2015 at 10:36:14PM -0300, Marcelo Tosatti wrote:
Drop unnecessary rdtsc_barrier(), as has been determined empirically, see 057e6a8c660e95c3f4e7162e00e2fee1fc90c50d for details. Noticed by Andy Lutomirski. Improves clock_gettime() by approximately 15% on Intel i7-3520M @ 2.90GHz.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 25b1cc0..a5beb23 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -86,7 +86,6 @@ unsigned __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src,
 	offset = pvclock_get_nsec_offset(src);
 	ret = src->system_time + offset;
 	ret_flags = src->flags;
-	rdtsc_barrier();

 	*cycles = ret;
 	*flags = ret_flags;

Ping?
x86: kvmclock: drop rdtsc_barrier()
Drop unnecessary rdtsc_barrier(), as has been determined empirically, see 057e6a8c660e95c3f4e7162e00e2fee1fc90c50d for details. Noticed by Andy Lutomirski. Improves clock_gettime() by approximately 15% on Intel i7-3520M @ 2.90GHz.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 25b1cc0..a5beb23 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -86,7 +86,6 @@ unsigned __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src,
 	offset = pvclock_get_nsec_offset(src);
 	ret = src->system_time + offset;
 	ret_flags = src->flags;
-	rdtsc_barrier();

 	*cycles = ret;
 	*flags = ret_flags;
Re: [PATCH] kvm: x86: fix kvmclock update protocol
On Thu, Apr 23, 2015 at 01:46:55PM +0200, Paolo Bonzini wrote:

From: Radim Krčmář rkrc...@redhat.com

The kvmclock spec says that the host will increment a version field to an odd number, then update stuff, then increment it to an even number. The host is buggy and doesn't do this, and the result is observable when one vcpu reads another vcpu's kvmclock data.

There's no good way for a guest kernel to keep its vdso from reading a different vcpu's kvmclock data, but we don't need to care about changing VCPUs as long as we read consistent data from kvmclock. (The VCPU can change outside of this loop too, so it doesn't matter if we return a value not fit for this VCPU.)

Based on a patch by Radim Krčmář.

Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 arch/x86/kvm/x86.c | 33 +++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ed31c31b2485..c73efcd03e29 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1669,12 +1669,28 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 		&guest_hv_clock, sizeof(guest_hv_clock))))
 		return 0;
 
-	/*
-	 * The interface expects us to write an even number signaling that the
-	 * update is finished. Since the guest won't see the intermediate
-	 * state, we just increase by 2 at the end.
+	/* This VCPU is paused, but it's legal for a guest to read another
+	 * VCPU's kvmclock, so we really have to follow the specification where
+	 * it says that version is odd if data is being modified, and even after
+	 * it is consistent.
+	 *
+	 * Version field updates must be kept separate.  This is because
+	 * kvm_write_guest_cached might use a rep movs instruction, and
+	 * writes within a string instruction are weakly ordered.  So there
+	 * are three writes overall.
+	 *
+	 * As a small optimization, only write the version field in the first
+	 * and third write.  The vcpu->pv_time cache is still valid, because the
+	 * version field is the first in the struct.
 	 */
-	vcpu->hv_clock.version = guest_hv_clock.version + 2;
+	BUILD_BUG_ON(offsetof(struct pvclock_vcpu_time_info, version) != 0);
+
+	vcpu->hv_clock.version = guest_hv_clock.version + 1;
+	kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
+				&vcpu->hv_clock,
+				sizeof(vcpu->hv_clock.version));
+
+	smp_wmb();
 
 	/* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
 	pvclock_flags = (guest_hv_clock.flags & PVCLOCK_GUEST_STOPPED);
@@ -1695,6 +1711,13 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 	kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
 				&vcpu->hv_clock,
 				sizeof(vcpu->hv_clock));
+
+	smp_wmb();
+
+	vcpu->hv_clock.version++;
+	kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
+				&vcpu->hv_clock,
+				sizeof(vcpu->hv_clock.version));
 	return 0;
 }
-- 
1.8.3.1

Acked-by: Marcelo Tosatti mtosa...@redhat.com
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT PULL] First batch of KVM changes for 4.1
On Thu, Apr 23, 2015 at 02:02:29PM +0200, Paolo Bonzini wrote: On 23/04/2015 13:51, Marcelo Tosatti wrote: https://bugzilla.redhat.com/show_bug.cgi?id=1174664 That was the missing volatile in an asm. Older compilers didn't catch it. :( How do you know that? It looks like memory corruption (look at the pattern at the end). I suspect some kind of operator error there; it makes no sense:

	if (unlikely(s->flags & SLAB_POISON))
		memset(start, POISON_INUSE, PAGE_SIZE << order);

	 * Padding is done using 0x5a (POISON_INUSE)

On the other hand, bug 1178975 is much clearer and the symptoms are the same. In that bug, you can see that the same kernel source works on f20 (package version 3.17.7-200.fc20.x86_64) and fails on f21 (package version 3.17.7-300.fc21.x86_64). Of course the compiler is different. The newer one hoists the lsl out of the loop; if you get a CPU migration at the wrong time, the cpu != cpu1 condition will always be true and the loop will never exit.

Paolo
Re: [GIT PULL] First batch of KVM changes for 4.1
On Thu, Apr 23, 2015 at 11:13:23AM +0200, Paolo Bonzini wrote: On 22/04/2015 23:21, Marcelo Tosatti wrote: On Mon, Apr 20, 2015 at 01:27:58PM -0700, Andy Lutomirski wrote: On Mon, Apr 20, 2015 at 9:59 AM, Paolo Bonzini pbonz...@redhat.com wrote: On 17/04/2015 22:18, Marcelo Tosatti wrote: The bug which this is fixing is very rare, have no memory of a report. In fact, it's even difficult to create a synthetic reproducer. But then why was the task migration notifier even in Jeremy's original code for Xen? Was it supposed to work even on non-synchronized TSC? If that's the case, then it could be reverted indeed; but then why did you commit this patch to 4.1? Did you think of something that would cause the seqcount-like protocol to fail, and that turned out not to be the case later? I was only following the mailing list sparsely in March. I don't think anyone ever tried that hard to test this stuff. There was an infinite loop that Firefox was triggering as a KVM guest somewhat reliably until a couple months ago in the same vdso code. :( https://bugzilla.redhat.com/show_bug.cgi?id=1174664 That was the missing volatile in an asm. Older compilers didn't catch it. :( How do you know that? It looks like memory corruption (look at the pattern at the end).
Re: [GIT PULL] First batch of KVM changes for 4.1
On Mon, Apr 20, 2015 at 01:27:58PM -0700, Andy Lutomirski wrote: On Mon, Apr 20, 2015 at 9:59 AM, Paolo Bonzini pbonz...@redhat.com wrote: On 17/04/2015 22:18, Marcelo Tosatti wrote: The bug which this is fixing is very rare, have no memory of a report. In fact, its even difficult to create a synthetic reproducer. But then why was the task migration notifier even in Jeremy's original code for Xen? Was it supposed to work even on non-synchronized TSC? If that's the case, then it could be reverted indeed; but then why did you commit this patch to 4.1? Did you think of something that would cause the seqcount-like protocol to fail, and that turned out not to be the case later? I was only following the mailing list sparsely in March. I don't think anyone ever tried that hard to test this stuff. There was an infinte loop that Firefox was triggering as a KVM guest somewhat reliably until a couple months ago in the same vdso code. :( https://bugzilla.redhat.com/show_bug.cgi?id=1174664 --- Comment #5 from Juan Quintela quint...@redhat.com --- Another round # dmesg | grep msr [0.00] kvm-clock: Using msrs 4b564d01 and 4b564d00 [0.00] kvm-clock: cpu 0, msr 1:1ffd8001, primary cpu clock [0.00] kvm-stealtime: cpu 0, msr 11fc0d100 [0.041174] kvm-clock: cpu 1, msr 1:1ffd8041, secondary cpu clock [0.053011] kvm-stealtime: cpu 1, msr 11fc8d100 After start: [root@trasno yum.repos.d]# virsh qemu-monitor-command --hmp browser 'xp /8x 0x1ffd8000' 1ffd8000: 0x3b401060 0xfffc7f4b 0x3b42d040 0xfffc7f4b 1ffd8010: 0x3b42d460 0xfffc7f4b 0x3b42d4c0 0xfffc7f4b [root@trasno yum.repos.d]# virsh qemu-monitor-command --hmp browser 'xp /8x 0x1ffd8040' 1ffd8040: 0x3b42d700 0xfffc7f4b 0x3b42d760 0xfffc7f4b 1ffd8050: 0x3b42d7c0 0xfffc7f4b 0x3b42d820 0xfffc7f4b When firefox hangs [root@trasno yum.repos.d]# virsh qemu-monitor-command --hmp browser 'xp /8x 0x1ffd8000' 1ffd8000: 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a 1ffd8010: 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a [root@trasno yum.repos.d]# 
virsh qemu-monitor-command --hmp browser 'xp /8x 0x1ffd8040' 1ffd8040: 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a 1ffd8050: 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT PULL] First batch of KVM changes for 4.1
On Wed, Apr 22, 2015 at 11:01:49PM +0200, Paolo Bonzini wrote: On 22/04/2015 22:56, Marcelo Tosatti wrote: But then why was the task migration notifier even in Jeremy's original code for Xen? To cover for the vcpu1 -> vcpu2 -> vcpu1 case, I believe. Ok, to cover it for non-synchronized TSC. While KVM requires synchronized TSC. If that's the case, then it could be reverted indeed; but then why did you commit this patch to 4.1? Because it fixes the problem Andy reported (see Subject: KVM: x86: fix kvmclock write race (v2) on kvm@). As long as you have Radim's fix on top. But if it's so rare, and it was known that fixing the host protocol was just as good a solution, why was the guest fix committed? I don't know. Should have fixed the host protocol. I'm just trying to understand. I am worried that this patch was rushed in; so far I had assumed it wasn't (a revert of a revert is rare enough that you don't do it lightly...) but maybe I was wrong. Yes, it was rushed in. Right now I cannot even decide whether to revert it (and please Peter in the process :)) or submit the Kconfig symbol patch officially. Paolo
Re: [GIT PULL] First batch of KVM changes for 4.1
On Mon, Apr 20, 2015 at 06:59:04PM +0200, Paolo Bonzini wrote: On 17/04/2015 22:18, Marcelo Tosatti wrote: The bug which this is fixing is very rare, have no memory of a report. In fact, it's even difficult to create a synthetic reproducer. But then why was the task migration notifier even in Jeremy's original code for Xen? To cover for the vcpu1 -> vcpu2 -> vcpu1 case, I believe. Was it supposed to work even on non-synchronized TSC? Yes, it is supposed to work on non-synchronized TSC. If that's the case, then it could be reverted indeed; but then why did you commit this patch to 4.1? Because it fixes the problem Andy reported (see Subject: KVM: x86: fix kvmclock write race (v2) on kvm@). As long as you have Radim's fix on top. Did you think of something that would cause the seqcount-like protocol to fail, and that turned out not to be the case later? I was only following the mailing list sparsely in March. No.
Re: [GIT PULL] First batch of KVM changes for 4.1
On Fri, Apr 17, 2015 at 09:57:12PM +0200, Paolo Bonzini wrote:

From 4eb9d7132e1990c0586f28af3103675416d38974 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini pbonz...@redhat.com
Date: Fri, 17 Apr 2015 14:57:34 +0200
Subject: [PATCH] sched: add CONFIG_TASK_MIGRATION_NOTIFIER

The task migration notifier is only used in x86 paravirt. Make it possible to compile it out. While at it, move some code around to ensure tmn is filled from CPU registers.

Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 arch/x86/Kconfig    | 1 +
 init/Kconfig        | 3 +++
 kernel/sched/core.c | 9 ++++++++-
 3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d43e7e1c784b..9af252c8698d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -649,6 +649,7 @@ if HYPERVISOR_GUEST
 
 config PARAVIRT
 	bool "Enable paravirtualization code"
+	select TASK_MIGRATION_NOTIFIER
 	---help---
 	  This changes the kernel so it can modify itself when it is run
 	  under a hypervisor, potentially improving performance significantly

diff --git a/init/Kconfig b/init/Kconfig
index 3b9df1aa35db..891917123338 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -2016,6 +2016,9 @@ source "block/Kconfig"
 config PREEMPT_NOTIFIERS
 	bool
 
+config TASK_MIGRATION_NOTIFIER
+	bool
+
 config PADATA
 	depends on SMP
 	bool

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f9123a82cbb6..c07a53aa543c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1016,12 +1016,14 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
 		rq_clock_skip_update(rq, true);
 }
 
+#ifdef CONFIG_TASK_MIGRATION_NOTIFIER
 static ATOMIC_NOTIFIER_HEAD(task_migration_notifier);
 
 void register_task_migration_notifier(struct notifier_block *n)
 {
 	atomic_notifier_chain_register(&task_migration_notifier, n);
 }
+#endif
 
 #ifdef CONFIG_SMP
 void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
@@ -1053,18 +1055,23 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	trace_sched_migrate_task(p, new_cpu);
 
 	if (task_cpu(p) != new_cpu) {
+#ifdef CONFIG_TASK_MIGRATION_NOTIFIER
 		struct task_migration_notifier tmn;
+		int from_cpu = task_cpu(p);
+#endif
 
 		if (p->sched_class->migrate_task_rq)
 			p->sched_class->migrate_task_rq(p, new_cpu);
 		p->se.nr_migrations++;
 		perf_sw_event_sched(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 0);
 
+#ifdef CONFIG_TASK_MIGRATION_NOTIFIER
 		tmn.task = p;
-		tmn.from_cpu = task_cpu(p);
+		tmn.from_cpu = from_cpu;
 		tmn.to_cpu = new_cpu;
 
 		atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn);
+#endif
 	}
 
 	__set_task_cpu(p, new_cpu);
-- 
2.3.5

Paolo, please revert the patch -- we can fix this properly in the host, which also conforms to the documented KVM guest/host protocol. Radim submitted a patch to kvm@ to split the kvm_write_guest in two with a barrier in between, I think. I'll review that patch.

You're thinking of http://article.gmane.org/gmane.linux.kernel.stable/129187, but see Andy's reply:

I think there are at least two ways that would work:

a) If KVM incremented version as advertised:

	cpu = getcpu();
	pvti = pvti for cpu;
	ver1 = pvti->version;
	check stable bit;
	rdtsc_barrier, rdtsc, read scale, shift, etc.
	if (getcpu() != cpu) retry;
	if (pvti->version != ver1) retry;

I think this is safe because we're guaranteed that there was an interval (between the two version reads) in which the vcpu we think we're on was running and the kvmclock data was valid and marked stable, and we know that the tsc we read came from that interval. Note: rdtscp isn't needed. If we're stable, it makes no difference which cpu's tsc we actually read.

b) If version remains buggy but we use this migrations_from hack:

	cpu = getcpu();
	pvti = pvti for cpu;
	m1 = pvti->migrations_from;
	barrier();
	ver1 = pvti->version;
	check stable bit;
	rdtsc_barrier, rdtsc, read scale, shift, etc.
	if (getcpu() != cpu) retry;
	if (pvti->version != ver1) retry; /* probably not really needed */
	barrier();
	if (pvti->migrations_from != m1) retry;

This is just like (a), except that we're using a guest kernel hack to ensure that no one migrated off the vcpu during the version-protected critical section and that we were, in fact, on that vcpu at some point during that critical section. Once we've ensured that we were on pvti's associated vcpu for the entire time we were reading it, then we are protected by the existing versioning in the host.

(a) is not going to happen until 4.2, and there are too many buggy hosts around so we'd have to define new ABI that lets the guest
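Scheme (a) above can be sketched in plain C. Everything in this sketch is a hypothetical stand-in for the vdso environment it assumes (a per-vCPU pvti array, a getcpu() that may change between calls, a fake synced TSC, a bounded retry count); it shows only the retry structure, not real timekeeping math:

```c
#include <stdint.h>

/* Hypothetical per-vCPU time info; "stable" stands in for
 * PVCLOCK_TSC_STABLE_BIT, and version follows the odd/even protocol. */
struct fake_pvti {
	uint32_t version;
	uint8_t  stable;
	uint64_t system_time;
};

static struct fake_pvti pvti_for[2];	/* two fake vCPUs */
static int cur_cpu;			/* what getcpu() reports */
static uint64_t rdtsc_value;		/* fake, synced TSC */

static int getcpu(void) { return cur_cpu; }

/* Scheme (a): read version, compute, then re-check both the cpu and
 * the version.  A bounded loop replaces "retry" for the sketch. */
static int read_clock(uint64_t *out)
{
	for (int tries = 0; tries < 100; tries++) {
		int cpu = getcpu();
		const struct fake_pvti *p = &pvti_for[cpu];
		uint32_t ver1 = p->version;

		if (ver1 & 1)
			continue;		/* host mid-update: retry */
		if (!p->stable)
			return -1;		/* fall back to a syscall */

		/* stand-in for the scale/shift arithmetic */
		uint64_t t = p->system_time + rdtsc_value;

		if (getcpu() != cpu)
			continue;		/* migrated: pvti may be stale */
		if (p->version != ver1)
			continue;		/* host updated underneath us */
		*out = t;
		return 0;
	}
	return -1;
}
```

The point of the two trailing checks is the interval argument from the quoted mail: if both pass, there was a window in which we were on that vCPU while its data was valid and unchanging.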
Re: [GIT PULL] First batch of KVM changes for 4.1
On Fri, Apr 17, 2015 at 03:38:58PM +0200, Paolo Bonzini wrote: On 17/04/2015 15:10, Peter Zijlstra wrote: On Fri, Apr 17, 2015 at 02:46:57PM +0200, Paolo Bonzini wrote: On 17/04/2015 12:55, Peter Zijlstra wrote: Also, it looks like you already do exactly this for other things, look at:

	kvm_sched_in()
	  kvm_arch_vcpu_load()
	    if (unlikely(vcpu->cpu != cpu) ... )

So no, I don't believe for one second you need this. This [...] brings us back to where we were last time. There is _0_ justification for this in the patches, that alone is grounds enough to reject it. Oh, we totally agree on that. I didn't commit that patch, but I already said the commit message was insufficient. Why should the guest task care about the physical cpu of the vcpu; that's a layering fail if ever there was one. It's totally within your right to not read the code, but then please don't try commenting at it. This code:

	kvm_sched_in()
	  kvm_arch_vcpu_load()
	    if (unlikely(vcpu->cpu != cpu) ... )

runs in the host. The hypervisor obviously cares if the physical CPU of the VCPU changes. It has to tell the source processor (vcpu->cpu) to release the VCPU's data structure, and only then can it use it on the target processor (cpu). No layering violation here. The task migration notifier runs in the guest, whenever the VCPU of a task changes. Furthermore, the only thing that migration handler seems to do is increment a variable that is not actually used in that file. It's used in the vDSO, so you cannot increment it in the file that uses it. And frankly, I think the static key is snake oil. The cost of task migration in terms of cache misses and TLB misses is in no way comparable to the cost of filling in a structure on the stack, dereferencing the head of the notifiers list and seeing that it's NULL. The path this notifier is called from has nothing to do with those costs. How not? The task is going to incur those costs; it's not like half a dozen extra instructions make any difference. But anyway... And the fact you're inflicting these costs on _everyone_ for a single x86_64-paravirt case is insane. ... that's a valid objection. Please look at the patch below. I've had enough of this, the below goes into sched/urgent and you can come back with sane patches if and when you're ready. Oh, please, cut the alpha male crap.

Paolo

--- 8< ---
From 4eb9d7132e1990c0586f28af3103675416d38974 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini pbonz...@redhat.com
Date: Fri, 17 Apr 2015 14:57:34 +0200
Subject: [PATCH] sched: add CONFIG_TASK_MIGRATION_NOTIFIER

The task migration notifier is only used in x86 paravirt. Make it possible to compile it out. While at it, move some code around to ensure tmn is filled from CPU registers.

Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 arch/x86/Kconfig    | 1 +
 init/Kconfig        | 3 +++
 kernel/sched/core.c | 9 ++++++++-
 3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d43e7e1c784b..9af252c8698d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -649,6 +649,7 @@ if HYPERVISOR_GUEST
 
 config PARAVIRT
 	bool "Enable paravirtualization code"
+	select TASK_MIGRATION_NOTIFIER
 	---help---
 	  This changes the kernel so it can modify itself when it is run
 	  under a hypervisor, potentially improving performance significantly

diff --git a/init/Kconfig b/init/Kconfig
index 3b9df1aa35db..891917123338 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -2016,6 +2016,9 @@ source "block/Kconfig"
 config PREEMPT_NOTIFIERS
 	bool
 
+config TASK_MIGRATION_NOTIFIER
+	bool
+
 config PADATA
 	depends on SMP
 	bool

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f9123a82cbb6..c07a53aa543c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1016,12 +1016,14 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
 		rq_clock_skip_update(rq, true);
 }
 
+#ifdef CONFIG_TASK_MIGRATION_NOTIFIER
 static ATOMIC_NOTIFIER_HEAD(task_migration_notifier);
 
 void register_task_migration_notifier(struct notifier_block *n)
 {
 	atomic_notifier_chain_register(&task_migration_notifier, n);
 }
+#endif
 
 #ifdef CONFIG_SMP
 void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
@@ -1053,18 +1055,23 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	trace_sched_migrate_task(p, new_cpu);
 
 	if (task_cpu(p) != new_cpu) {
+#ifdef CONFIG_TASK_MIGRATION_NOTIFIER
 		struct task_migration_notifier tmn;
+		int from_cpu = task_cpu(p);
+#endif
 
 		if (p->sched_class->migrate_task_rq)
 			p->sched_class->migrate_task_rq(p, new_cpu);
 		p->se.nr_migrations++;
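The notifier chain being argued about is just a callback list; registering is a list insert and firing it walks the list. A hypothetical userspace analog (names modeled loosely on the kernel API, not the real one) shows the register/fire pattern the config symbol compiles out:

```c
#include <stddef.h>

/* Hypothetical analog of an atomic notifier chain: callers register
 * callbacks, and the scheduler-side hook fires the whole chain on a
 * task migration.  Not thread-safe; the kernel version uses RCU. */
struct notifier_block {
	int (*call)(struct notifier_block *nb, int from_cpu, int to_cpu);
	struct notifier_block *next;
};

static struct notifier_block *migration_chain;	/* head of the list */

static void register_task_migration_notifier(struct notifier_block *nb)
{
	nb->next = migration_chain;
	migration_chain = nb;
}

static void task_migration_notify(int from_cpu, int to_cpu)
{
	for (struct notifier_block *nb = migration_chain; nb; nb = nb->next)
		nb->call(nb, from_cpu, to_cpu);
}

/* Example subscriber: records the last migration it saw. */
static int last_from = -1, last_to = -1;

static int record_migration(struct notifier_block *nb, int from_cpu, int to_cpu)
{
	(void)nb;
	last_from = from_cpu;
	last_to = to_cpu;
	return 0;
}
```

When the chain is empty, firing it is a single NULL check, which is the crux of the cost argument in the thread; the Kconfig symbol removes even that from non-paravirt builds.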
Re: [GIT PULL] First batch of KVM changes for 4.1
On Fri, Apr 17, 2015 at 03:25:28PM -0700, Andy Lutomirski wrote: On Fri, Apr 17, 2015 at 3:04 PM, Linus Torvalds torva...@linux-foundation.org wrote: On Fri, Apr 17, 2015 at 5:42 PM, Andy Lutomirski l...@amacapital.net wrote: Muahaha. The auditors have invaded your system. (I did my little benchmark with a more sensible configuration -- see way below). Can you send the output of: # auditctl -s # auditctl -l # auditctl -s enabled 1 flag 1 pid 822 rate_limit 0 backlog_limit 320 lost 0 backlog 0 backlog_wait_time 6 loginuid_immutable 0 unlocked # auditctl -l No rules Yes, No rules doesn't mean don't audit. Are you, perchance, using Fedora? F21. Yup. I used to just disable auditing in the kernel entirely, but then I ended up deciding that I need to run something closer to the broken Fedora config (selinux in particular) in order to actually optimize the real-world pathname handling situation rather than the _sane_ one. Oh well. I think audit support got enabled at the same time in my kernels because I ended up using the default config and then taking out the truly crazy stuff without noticing AUDITSYSCALL. I lobbied rather heavily, and successfully, to get the default configuration to stop auditing. Unfortunately, the fix wasn't retroactive, so, unless you have a very fresh install, you might want to apply the fix yourself: Is that fix happening in Fedora going forward, though? Like F22? It's in F21, actually. Unfortunately, the audit package is really weird. It installs /etc/audit/rules.d/audit.rules, containing: # This file contains the auditctl rules that are loaded # whenever the audit daemon is started via the initscripts. # The rules are simply the parameters that would be passed # to auditctl. # First rule - delete all -D # This suppresses syscall auditing for all tasks started # with this rule in effect. Remove it if you need syscall # auditing. -a task,never Then, if it's a fresh install (i.e. 
/etc/audit/audit.rules doesn't exist) it copies that file to /etc/audit/audit.rules post-install. (No, I don't know why it works this way.) IOW, you might want to copy your /etc/audit/rules.d/audit.rules to /etc/audit/audit.rules. I think you need to reboot to get the full effect. You could apply the rule manually (or maybe restart the audit service), but it will only affect newly-started tasks. Amdy Lumirtowsky thinks he meant to attach a condition to his maintainerish activities: he will do his best to keep the audit code *out* of the low-level stuff, but he's going to try to avoid ever touching the audit code itself, because if he ever had to change it, he might accidentally delete the entire file. Oooh. That would be _such_ a shame. Can we please do it by mistake? Oops, my fingers slipped Seriously, wasn't there a TAINT_PERFORMANCE thing proposed at some point? I would love auditing to set some really loud global warning that you've just done a Bad Thing (tm) performance-wise by enabling it. Or even just a big fat warning in dmesg the first time auditing triggers. Back to timing. With kvm-clock, I see: 23.80% timing_test_64 [kernel.kallsyms] [k] pvclock_clocksource_read Oh wow. How can that possibly be sane? Isn't the *whole* point of pvclock_clocksource_read() to be a native rdtsc with scaling? How does it cause that kind of insane pain? An unnecessarily complicated protocol, a buggy host implementation, and an unnecessarily complicated guest implementation :( How about start by removing the unnecessary rdtsc-barrier? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM: x86: fix kvmclock write race
On Fri, Apr 17, 2015 at 05:04:29PM -0700, Andy Lutomirski wrote: On Fri, Apr 17, 2015 at 4:38 PM, Marcelo Tosatti mtosa...@redhat.com wrote: From: Radim Krčmář rkrc...@redhat.com As noted by Andy Lutomirski, kvm does not follow the documented version protocol. Fix it. Note: this bug results in a race which can occur if the following three conditions are met: 1) There is KVM guest time update (there is one every 5 minutes). 2) Which races with a thread in the guest in the following way: The execution of these 29 instructions has to take at _least_ 2 seconds (rebalance interval is 1 second). lsl%r9w,%esi mov%esi,%r8d and$0x3f,%esi and$0xfff,%r8d test $0xfc0,%r8d jne0xa12 vread_pvclock+210 shl$0x6,%rsi mov-0xa01000(%rsi),%r10d data32 xchg %ax,%ax data32 xchg %ax,%ax rdtsc shl$0x20,%rdx mov%eax,%eax movsbl -0xa00fe4(%rsi),%ecx or %rax,%rdx sub-0xa00ff8(%rsi),%rdx mov-0xa00fe8(%rsi),%r11d mov%rdx,%rax shl%cl,%rax test %ecx,%ecx js 0xa08 vread_pvclock+200 mov%r11d,%edx movzbl -0xa00fe3(%rsi),%ecx mov-0xa00ff0(%rsi),%r11 mul%rdx shrd $0x20,%rdx,%rax data32 xchg %ax,%ax data32 xchg %ax,%ax lsl%r9w,%edx 3) Scheduler moves the task, while executing these 29 instructions, to a destination processor, then back to the source processor. 4) Source processor, after has been moved back from destination, perceives data out of order as written by processor performing guest time update (item 1), with string mov. Given the rarity of this condition, and the fact it was never observed or reported, reverting pvclock vsyscall on systems whose host is susceptible to the race, seems unnecessary. 
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index cc2c759f69a3..8658599e0024 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1658,12 +1658,24 @@ static int kvm_guest_time_update(struct kvm_vcpu *v) guest_hv_clock, sizeof(guest_hv_clock return 0; - /* -* The interface expects us to write an even number signaling that the -* update is finished. Since the guest won't see the intermediate -* state, we just increase by 2 at the end. + /* A guest can read other VCPU's kvmclock; specification says that +* version is odd if data is being modified and even after it is +* consistent. +* We write three times to be sure. +* 1) update version to odd number +* 2) write modified data (version is still odd) +* 3) update version to even number +* +* TODO: optimize +* - only two writes should be enough -- version is first +* - the second write could update just version You're relying on lots of barely-defined behavior here, since I think that both copies could use fast string operations. Those are explicitly unordered internally, so I think you really do need three writes. Correct, 3 writes are needed. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM: x86: fix kvmclock write race
From: Radim Krčmář rkrc...@redhat.com

As noted by Andy Lutomirski, kvm does not follow the documented version protocol. Fix it.

Note: this bug results in a race which can occur if the following conditions are met:

1) There is a KVM guest time update (there is one every 5 minutes).

2) Which races with a thread in the guest in the following way: the execution of these 29 instructions has to take at _least_ 2 seconds (rebalance interval is 1 second).

	lsl    %r9w,%esi
	mov    %esi,%r8d
	and    $0x3f,%esi
	and    $0xfff,%r8d
	test   $0xfc0,%r8d
	jne    0xa12 <vread_pvclock+210>
	shl    $0x6,%rsi
	mov    -0xa01000(%rsi),%r10d
	data32 xchg %ax,%ax
	data32 xchg %ax,%ax
	rdtsc
	shl    $0x20,%rdx
	mov    %eax,%eax
	movsbl -0xa00fe4(%rsi),%ecx
	or     %rax,%rdx
	sub    -0xa00ff8(%rsi),%rdx
	mov    -0xa00fe8(%rsi),%r11d
	mov    %rdx,%rax
	shl    %cl,%rax
	test   %ecx,%ecx
	js     0xa08 <vread_pvclock+200>
	mov    %r11d,%edx
	movzbl -0xa00fe3(%rsi),%ecx
	mov    -0xa00ff0(%rsi),%r11
	mul    %rdx
	shrd   $0x20,%rdx,%rax
	data32 xchg %ax,%ax
	data32 xchg %ax,%ax
	lsl    %r9w,%edx

3) The scheduler moves the task, while executing these 29 instructions, to a destination processor, then back to the source processor.

4) The source processor, after the task has been moved back from the destination, perceives data out of order as written by the processor performing the guest time update (item 1) with string mov.

Given the rarity of this condition, and the fact it was never observed or reported, reverting pvclock vsyscall on systems whose host is susceptible to the race seems unnecessary.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cc2c759f69a3..8658599e0024 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1658,12 +1658,24 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 		&guest_hv_clock, sizeof(guest_hv_clock))))
 		return 0;
 
-	/*
-	 * The interface expects us to write an even number signaling that the
-	 * update is finished. Since the guest won't see the intermediate
-	 * state, we just increase by 2 at the end.
+	/* A guest can read other VCPU's kvmclock; specification says that
+	 * version is odd if data is being modified and even after it is
+	 * consistent.
+	 * We write three times to be sure.
+	 *  1) update version to odd number
+	 *  2) write modified data (version is still odd)
+	 *  3) update version to even number
+	 *
+	 * TODO: optimize
+	 *  - only two writes should be enough -- version is first
+	 *  - the second write could update just version
 	 */
-	vcpu->hv_clock.version = guest_hv_clock.version + 2;
+	guest_hv_clock.version += 1;
+	kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
+				&guest_hv_clock,
+				sizeof(guest_hv_clock));
+
+	vcpu->hv_clock.version = guest_hv_clock.version;
 
 	/* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
 	pvclock_flags = (guest_hv_clock.flags & PVCLOCK_GUEST_STOPPED);
@@ -1684,6 +1696,11 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 	kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
 				&vcpu->hv_clock,
 				sizeof(vcpu->hv_clock));
+
+	vcpu->hv_clock.version += 1;
+	kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
+				&vcpu->hv_clock,
+				sizeof(vcpu->hv_clock));
 	return 0;
 }
Re: KVM: x86: fix kvmclock write race
On Fri, Apr 17, 2015 at 05:04:29PM -0700, Andy Lutomirski wrote: On Fri, Apr 17, 2015 at 4:38 PM, Marcelo Tosatti mtosa...@redhat.com wrote: From: Radim Krčmář rkrc...@redhat.com As noted by Andy Lutomirski, kvm does not follow the documented version protocol. Fix it. Note: this bug results in a race which can occur if the following three conditions are met: 1) There is KVM guest time update (there is one every 5 minutes). 2) Which races with a thread in the guest in the following way: The execution of these 29 instructions has to take at _least_ 2 seconds (rebalance interval is 1 second). lsl%r9w,%esi mov%esi,%r8d and$0x3f,%esi and$0xfff,%r8d test $0xfc0,%r8d jne0xa12 vread_pvclock+210 shl$0x6,%rsi mov-0xa01000(%rsi),%r10d data32 xchg %ax,%ax data32 xchg %ax,%ax rdtsc shl$0x20,%rdx mov%eax,%eax movsbl -0xa00fe4(%rsi),%ecx or %rax,%rdx sub-0xa00ff8(%rsi),%rdx mov-0xa00fe8(%rsi),%r11d mov%rdx,%rax shl%cl,%rax test %ecx,%ecx js 0xa08 vread_pvclock+200 mov%r11d,%edx movzbl -0xa00fe3(%rsi),%ecx mov-0xa00ff0(%rsi),%r11 mul%rdx shrd $0x20,%rdx,%rax data32 xchg %ax,%ax data32 xchg %ax,%ax lsl%r9w,%edx 3) Scheduler moves the task, while executing these 29 instructions, to a destination processor, then back to the source processor. 4) Source processor, after has been moved back from destination, perceives data out of order as written by processor performing guest time update (item 1), with string mov. Given the rarity of this condition, and the fact it was never observed or reported, reverting pvclock vsyscall on systems whose host is susceptible to the race, seems unnecessary. 
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index cc2c759f69a3..8658599e0024 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1658,12 +1658,24 @@ static int kvm_guest_time_update(struct kvm_vcpu *v) guest_hv_clock, sizeof(guest_hv_clock return 0; - /* -* The interface expects us to write an even number signaling that the -* update is finished. Since the guest won't see the intermediate -* state, we just increase by 2 at the end. + /* A guest can read other VCPU's kvmclock; specification says that +* version is odd if data is being modified and even after it is +* consistent. +* We write three times to be sure. +* 1) update version to odd number +* 2) write modified data (version is still odd) +* 3) update version to even number +* +* TODO: optimize +* - only two writes should be enough -- version is first +* - the second write could update just version You're relying on lots of barely-defined behavior here, since I think that both copies could use fast string operations. Those are explicitly unordered internally, so I think you really do need three writes. The memory-ordering model respects the follow principles: 1. Stores within a single string operation may be executed out of order. 2. Stores from separate string operations (for example, stores from consecutive string operations) do not execute out of order. All the stores from an earlier string operation will complete before any store from a later string operation. Personally, if I wanted to optimize this (I'm not convinced it matters), It does not matter. I'd add a write-a-single-word primitive and use that for the version. Anyway, I think this code looks okay as is. Looking forward to see the patchset to have all vcpus reading from vcpu0 pvclock area. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM: x86: fix kvmclock write race (v2)
From: Radim Krčmář rkrc...@redhat.com As noted by Andy Lutomirski, kvm does not follow the documented version protocol. Fix it. Note: this bug results in a race which can occur if the following three conditions are met: 1) There is KVM guest time update (there is one every 5 minutes). 2) Which races with a thread in the guest in the following way: The execution of these 29 instructions has to take at _least_ 2 seconds (rebalance interval is 1 second). lsl%r9w,%esi mov%esi,%r8d and$0x3f,%esi and$0xfff,%r8d test $0xfc0,%r8d jne0xa12 vread_pvclock+210 shl$0x6,%rsi mov-0xa01000(%rsi),%r10d data32 xchg %ax,%ax data32 xchg %ax,%ax rdtsc shl$0x20,%rdx mov%eax,%eax movsbl -0xa00fe4(%rsi),%ecx or %rax,%rdx sub-0xa00ff8(%rsi),%rdx mov-0xa00fe8(%rsi),%r11d mov%rdx,%rax shl%cl,%rax test %ecx,%ecx js 0xa08 vread_pvclock+200 mov%r11d,%edx movzbl -0xa00fe3(%rsi),%ecx mov-0xa00ff0(%rsi),%r11 mul%rdx shrd $0x20,%rdx,%rax data32 xchg %ax,%ax data32 xchg %ax,%ax lsl%r9w,%edx 3) Scheduler moves the task, while executing these 29 instructions, to a destination processor, then back to the source processor. 4) Source processor, after has been moved back from destination, perceives data out of order as written by processor performing guest time update (item 1), with string mov. Given the rarity of this condition, and the fact it was never observed or reported, reverting pvclock vsyscall on systems whose host is susceptible to the race, seems an excessive measure. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Cc: sta...@kernel.org diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index cc2c759..e75fafd 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1658,12 +1658,20 @@ static int kvm_guest_time_update(struct kvm_vcpu *v) guest_hv_clock, sizeof(guest_hv_clock return 0; - /* -* The interface expects us to write an even number signaling that the -* update is finished. Since the guest won't see the intermediate -* state, we just increase by 2 at the end. 
+	/* A guest can read other VCPU's kvmclock; specification says that
+	 * version is odd if data is being modified and even after it is
+	 * consistent.
+	 * We write three times to be sure.
+	 *  1) update version to odd number
+	 *  2) write modified data (version is still odd)
+	 *  3) update version to even number
 	 */
-	vcpu->hv_clock.version = guest_hv_clock.version + 2;
+	guest_hv_clock.version += 1;
+	kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
+				&guest_hv_clock,
+				sizeof(guest_hv_clock));
+
+	vcpu->hv_clock.version = guest_hv_clock.version;
 
 	/* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
 	pvclock_flags = (guest_hv_clock.flags & PVCLOCK_GUEST_STOPPED);
@@ -1684,6 +1692,11 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 	kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
 				&vcpu->hv_clock,
 				sizeof(vcpu->hv_clock));
+
+	vcpu->hv_clock.version += 1;
+	kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
+				&vcpu->hv_clock,
+				sizeof(vcpu->hv_clock));
 	return 0;
 }

-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch -rt 2/2] KVM: lapic: mark LAPIC timer handler as irqsafe
Since the lapic timer handler only wakes up a simple waitqueue, it can be executed from hardirq context. Also handle the case where hrtimer_start_expires fails due to -ETIME, by injecting the interrupt to the guest immediately.

Reduces average cyclictest latency by 3us.

Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>

---
 arch/x86/kvm/lapic.c | 42 +++++++++++++++++++++++++++++++++++++++---
 1 file changed, 39 insertions(+), 3 deletions(-)

Index: rt-linux/arch/x86/kvm/lapic.c
===================================================================
--- rt-linux.orig/arch/x86/kvm/lapic.c	2015-04-08 20:20:41.000000000 -0300
+++ rt-linux/arch/x86/kvm/lapic.c	2015-04-08 20:21:16.592739674 -0300
@@ -1034,8 +1034,38 @@
 				   apic->divide_count);
 }
 
+static enum hrtimer_restart apic_timer_fn(struct hrtimer *data);
+
+static void apic_timer_expired(struct hrtimer *data)
+{
+	int ret, i = 0;
+	enum hrtimer_restart r;
+	struct kvm_timer *ktimer = container_of(data, struct kvm_timer, timer);
+
+	r = apic_timer_fn(data);
+
+	if (r == HRTIMER_RESTART) {
+		do {
+			ret = hrtimer_start_expires(data, HRTIMER_MODE_ABS);
+			if (ret == -ETIME)
+				hrtimer_add_expires_ns(&ktimer->timer,
+						       ktimer->period);
+			i++;
+		} while (ret == -ETIME && i < 10);
+
+		if (ret == -ETIME) {
+			printk_once(KERN_ERR "%s: failed to reprogram timer\n",
+				    __func__);
+			WARN_ON_ONCE(1);
+		}
+	}
+}
+
+
 static void start_apic_timer(struct kvm_lapic *apic)
 {
+	int ret;
 	ktime_t now;
 
 	atomic_set(&apic->lapic_timer.pending, 0);
@@ -1065,9 +1095,11 @@
 		}
 	}
 
-	hrtimer_start(&apic->lapic_timer.timer,
+	ret = hrtimer_start(&apic->lapic_timer.timer,
 		      ktime_add_ns(now, apic->lapic_timer.period),
 		      HRTIMER_MODE_ABS);
+	if (ret == -ETIME)
+		apic_timer_expired(&apic->lapic_timer.timer);
 
 	apic_debug("%s: bus cycle is %" PRId64 "ns, now 0x%016" PRIx64 ", "
@@ -1097,8 +1129,10 @@
 		ns = (tscdeadline - guest_tsc) * 1000000ULL;
 		do_div(ns, this_tsc_khz);
 	}
-	hrtimer_start(&apic->lapic_timer.timer,
+	ret = hrtimer_start(&apic->lapic_timer.timer,
 		ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
+	if (ret == -ETIME)
+		apic_timer_expired(&apic->lapic_timer.timer);
 
 	local_irq_restore(flags);
 }
@@ -1587,6 +1621,7
@@
 	hrtimer_init(&apic->lapic_timer.timer, CLOCK_MONOTONIC,
 		     HRTIMER_MODE_ABS);
 	apic->lapic_timer.timer.function = apic_timer_fn;
+	apic->lapic_timer.timer.irqsafe = 1;
 
 	/*
 	 * APIC is created enabled. This will prevent kvm_lapic_set_base from
@@ -1707,7 +1742,8 @@
 	timer = &vcpu->arch.apic->lapic_timer.timer;
 	if (hrtimer_cancel(timer))
-		hrtimer_start_expires(timer, HRTIMER_MODE_ABS);
+		if (hrtimer_start_expires(timer, HRTIMER_MODE_ABS) == -ETIME)
+			apic_timer_expired(timer);
 }
[patch -rt 1/2] KVM: use simple waitqueue for vcpu->wq
The problem:

On -RT, an emulated LAPIC timer instance has the following path:

1) hard interrupt
2) ksoftirqd is scheduled
3) ksoftirqd wakes up vcpu thread
4) vcpu thread is scheduled

This extra context switch introduces unnecessary latency in the LAPIC path for a KVM guest.

The solution:

Allow waking up vcpu thread from hardirq context, thus avoiding the need for ksoftirqd to be scheduled.

Normal waitqueues make use of spinlocks, which on -RT are sleepable locks. Therefore, waking up a waitqueue waiter involves locking a sleeping lock, which is not allowed from hard interrupt context.

cyclictest command line:
# cyclictest -m -n -q -p99 -l 100 -h60 -D 1m

This patch reduces the average latency in my tests from 14us to 11us.

Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>

---
 arch/arm/kvm/arm.c                  |  4 ++--
 arch/arm/kvm/psci.c                 |  4 ++--
 arch/powerpc/include/asm/kvm_host.h |  4 ++--
 arch/powerpc/kvm/book3s_hv.c        | 20 ++--
 arch/s390/include/asm/kvm_host.h    |  2 +-
 arch/s390/kvm/interrupt.c           |  8 ++++----
 arch/x86/kvm/lapic.c                |  6 +++---
 include/linux/kvm_host.h            |  4 ++--
 virt/kvm/async_pf.c                 |  4 ++--
 virt/kvm/kvm_main.c                 | 16 ++++++++--------
 10 files changed, 36 insertions(+), 36 deletions(-)

Index: rt-linux/arch/arm/kvm/arm.c
===================================================================
--- rt-linux.orig/arch/arm/kvm/arm.c	2015-04-08 20:20:39.962649422 -0300
+++ rt-linux/arch/arm/kvm/arm.c	2015-04-08 20:20:41.966654408 -0300
@@ -441,9 +441,9 @@
 static void vcpu_pause(struct kvm_vcpu *vcpu)
 {
-	wait_queue_head_t *wq = kvm_arch_vcpu_wq(vcpu);
+	struct swait_head *wq = kvm_arch_vcpu_wq(vcpu);
 
-	wait_event_interruptible(*wq, !vcpu->arch.pause);
+	swait_event_interruptible(*wq, !vcpu->arch.pause);
 }
 
 static int kvm_vcpu_initialized(struct kvm_vcpu *vcpu)

Index: rt-linux/arch/arm/kvm/psci.c
===================================================================
--- rt-linux.orig/arch/arm/kvm/psci.c	2015-04-08 20:20:39.962649422 -0300
+++ rt-linux/arch/arm/kvm/psci.c	2015-04-08 20:20:41.966654408 -0300
@@ -66,7 +66,7 @@
 {
 	struct kvm *kvm = source_vcpu->kvm;
 	struct kvm_vcpu *vcpu = NULL, *tmp;
-	wait_queue_head_t *wq;
+	struct swait_head *wq;
 	unsigned
	long cpu_id;
	unsigned long context_id;
	unsigned long mpidr;
@@ -123,7 +123,7 @@
 	smp_mb();		/* Make sure the above is visible */
 
 	wq = kvm_arch_vcpu_wq(vcpu);
-	wake_up_interruptible(wq);
+	swait_wake_interruptible(wq);
 
 	return PSCI_RET_SUCCESS;
 }

Index: rt-linux/arch/powerpc/include/asm/kvm_host.h
===================================================================
--- rt-linux.orig/arch/powerpc/include/asm/kvm_host.h	2015-04-08 20:20:39.963649425 -0300
+++ rt-linux/arch/powerpc/include/asm/kvm_host.h	2015-04-08 20:20:41.966654408 -0300
@@ -296,7 +296,7 @@
 	u8 in_guest;
 	struct list_head runnable_threads;
 	spinlock_t lock;
-	wait_queue_head_t wq;
+	struct swait_head wq;
 	u64 stolen_tb;
 	u64 preempt_tb;
 	struct kvm_vcpu *runner;
@@ -618,7 +618,7 @@
 	u8 prodded;
 	u32 last_inst;
 
-	wait_queue_head_t *wqp;
+	struct swait_head *wqp;
 	struct kvmppc_vcore *vcore;
 	int ret;
 	int trap;

Index: rt-linux/arch/powerpc/kvm/book3s_hv.c
===================================================================
--- rt-linux.orig/arch/powerpc/kvm/book3s_hv.c	2015-04-08 20:20:39.964649427 -0300
+++ rt-linux/arch/powerpc/kvm/book3s_hv.c	2015-04-08 20:20:41.966654408 -0300
@@ -84,11 +84,11 @@
 {
 	int me;
 	int cpu = vcpu->cpu;
-	wait_queue_head_t *wqp;
+	struct swait_head *wqp;
 
 	wqp = kvm_arch_vcpu_wq(vcpu);
-	if (waitqueue_active(wqp)) {
-		wake_up_interruptible(wqp);
+	if (swaitqueue_active(wqp)) {
+		swait_wake_interruptible(wqp);
 		++vcpu->stat.halt_wakeup;
 	}
@@ -639,8 +639,8 @@
 		tvcpu->arch.prodded = 1;
 		smp_mb();
 		if (vcpu->arch.ceded) {
-			if (waitqueue_active(&vcpu->wq)) {
-				wake_up_interruptible(&vcpu->wq);
+			if (swaitqueue_active(&vcpu->wq)) {
+				swait_wake_interruptible(&vcpu->wq);
 				vcpu->stat.halt_wakeup++;
 			}
 		}
@@ -1357,7 +1357,7 @@
 	INIT_LIST_HEAD(&vcore->runnable_threads);
 	spin_lock_init(&vcore->lock);
-	init_waitqueue_head(&vcore->wq);
+	init_swait_head(&vcore->wq);
 	vcore->preempt_tb = TB_NIL;
 	vcore->lpcr = kvm->arch.lpcr;
 	vcore->first_vcpuid = core * threads_per_subcore;
@@ -1826,13 +1826,13 @@
 	 */
 static
[patch -rt 0/2] use simple waitqueue for kvm vcpu waitqueue (v5)
Sebastian, rebased against v3.18.7-rt2 as requested. The problem: On -RT, an emulated LAPIC timer instance has the following path: 1) hard interrupt 2) ksoftirqd is scheduled 3) ksoftirqd wakes up vcpu thread 4) vcpu thread is scheduled This extra context switch introduces unnecessary latency in the LAPIC path for a KVM guest. The solution: Allow waking up vcpu thread from hardirq context, thus avoiding the need for ksoftirqd to be scheduled. Normal waitqueues make use of spinlocks, which on -RT are sleepable locks. Therefore, waking up a waitqueue waiter involves locking a sleeping lock, which is not allowed from hard interrupt context. cyclictest command line: # cyclictest -m -n -q -p99 -l 100 -h60 -D 1m This patch reduces the average latency in my tests from 14us to 11us. v2: improve changelog (Rik van Riel) v3: limit (once) guest triggered printk and WARN_ON (Paolo Bonzini) v4: fix typo (Steven Rostedt) v5: rebase against v3.18.7-rt2. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] x86: vdso: fix pvclock races with task migration
On Thu, Apr 02, 2015 at 08:44:23PM +0200, Radim Krčmář wrote:

 If we were migrated right after __getcpu, but before reading the
 migration_count, we wouldn't notice that we read the TSC of a different
 VCPU, nor that KVM's bug made pvti invalid, as only migration_count
 on the source VCPU is increased.

 Change the vdso instead of updating migration_count on destination.

 Fixes: 0a4e6be9ca17 (x86: kvm: Revert remove sched notifier for cross-cpu migrations)
 Cc: sta...@vger.kernel.org
 Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
 ---
 Because we'll get a complete rewrite, this series does not
 - remove the outdated 'TODO: We can put [...]' comment
 - use a proper encapsulation for the inner do-while loop
 - optimize the outer do-while loop (no need to re-read cpu id on
   version mismatch)

  arch/x86/vdso/vclock_gettime.c | 20 ++++++++++++--------
  1 file changed, 12 insertions(+), 8 deletions(-)

 diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
 index 30933760ee5f..40d2473836c9 100644
 --- a/arch/x86/vdso/vclock_gettime.c
 +++ b/arch/x86/vdso/vclock_gettime.c
 @@ -99,21 +99,25 @@ static notrace cycle_t vread_pvclock(int *mode)
 	 * __getcpu() calls (Gleb).
 	 */
 
 -	pvti = get_pvti(cpu);
 +	/* Make sure migrate_count will change if we leave the VCPU. */
 +	do {
 +		pvti = get_pvti(cpu);
 +		migrate_count = pvti->migrate_count;
 
 -	migrate_count = pvti->migrate_count;
 +		cpu1 = cpu;
 +		cpu = __getcpu() & VGETCPU_CPU_MASK;
 +	} while (unlikely(cpu != cpu1));
 
 	version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
 
 	/*
 	 * Test we're still on the cpu as well as the version.
 -	 * We could have been migrated just after the first
 -	 * vgetcpu but before fetching the version, so we
 -	 * wouldn't notice a version change.
 +	 * - We must read TSC of pvti's VCPU.
 +	 * - KVM doesn't follow the versioning protocol, so data could
 +	 *   change before version if we left the VCPU.
 	 */
 -	cpu1 = __getcpu() & VGETCPU_CPU_MASK;
 -	} while (unlikely(cpu != cpu1 ||
 -			  (pvti->pvti.version & 1) ||
 +	smp_rmb();
 +	} while (unlikely((pvti->pvti.version & 1) ||
 			  pvti->pvti.version != version ||
 			  pvti->migrate_count != migrate_count));
 --
 2.3.4

Reviewed-by: Marcelo Tosatti <mtosa...@redhat.com>
Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU is blocked
On Mon, Mar 30, 2015 at 04:46:55AM +, Wu, Feng wrote: -Original Message- From: Marcelo Tosatti [mailto:mtosa...@redhat.com] Sent: Saturday, March 28, 2015 3:30 AM To: Wu, Feng Cc: h...@zytor.com; t...@linutronix.de; mi...@redhat.com; x...@kernel.org; g...@kernel.org; pbonz...@redhat.com; dw...@infradead.org; j...@8bytes.org; alex.william...@redhat.com; jiang@linux.intel.com; eric.au...@linaro.org; linux-ker...@vger.kernel.org; io...@lists.linux-foundation.org; kvm@vger.kernel.org Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU is blocked On Fri, Mar 27, 2015 at 06:34:14AM +, Wu, Feng wrote: Currently, the following code is executed before local_irq_disable() is called, so do you mean 1)moving local_irq_disable() to the place before it. 2) after interrupt is disabled, set KVM_REQ_EVENT in case the ON bit is set? 2) after interrupt is disabled, set KVM_REQ_EVENT in case the ON bit is set. Here is my understanding about your comments here: - Disable interrupts - Check 'ON' - Set KVM_REQ_EVENT if 'ON' is set Then we can put the above code inside if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) just like it used to be. However, I still have some questions about this comment: 1. Where should I set KVM_REQ_EVENT? In function vcpu_enter_guest(), or other places? See below: If in vcpu_enter_guest(), since currently local_irq_disable() is called after 'KVM_REQ_EVENT' is checked, is it helpful to set KVM_REQ_EVENT after local_irq_disable() is called? local_irq_disable(); *** add code here *** So we need add code like the following here, right? if ('ON' is set) kvm_make_request(KVM_REQ_EVENT, vcpu); Yes. if (vcpu-mode == EXITING_GUEST_MODE || vcpu-requests ^^ Point *1. || need_resched() || signal_pending(current)) { vcpu-mode = OUTSIDE_GUEST_MODE; smp_wmb(); local_irq_enable(); preempt_enable(); vcpu-srcu_idx = srcu_read_lock(vcpu-kvm-srcu); r = 1; goto cancel_injection; } 2. 
'ON' is set by VT-d hardware, it can be set even when interrupt is disabled (the related bit in PIR is also set). Yes, we are checking if the HW has set an interrupt in PIR while outside VM (which requires PIR-VIRR transfer by software). If the interrupt it set by hardware after local_irq_disable(), VMX-entry will handle the interrupt and perform the PIR-VIRR transfer and reevaluate interrupts, injecting to guest if necessary, is that correct ? So does it make sense to check 'ON' and set KVM_REQ_EVENT accordingly after interrupt is disabled? To replace the costly +*/ + if (kvm_x86_ops-hwapic_irr_update) + kvm_x86_ops-hwapic_irr_update(vcpu, + kvm_lapic_find_highest_irr(vcpu)); Yes, i think so. After adding the checking ON and setting KVM_REQ_EVENT operations listed in my comments above, do you mean we still need to keep the costly code above inside if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {} in function vcpu_enter_guest() as it used to be? If yes, my question is what is the exact purpose of checking ON and setting KVM_REQ_EVENT operations? Here is the code flow in vcpu_enter_guest(): 1. Check KVM_REQ_EVENT, if it is set, sync pir-virr 2. Disable interrupts 3. Check ON and set KVM_REQ_EVENT -- Here, we set KVM_REQ_EVENT, but it is checked in the step 1, which means, we cannot get any benefits even we set it here, since the pir-virr sync operation was done in step 1, between step 3 and VM-Entry, we don't synchronize the pir to virr. So even we set KVM_REQ_EVENT here, the interrupts remaining in PIR cannot be delivered to guest during this VM-Entry, right? Please check point *1 above. The code will go back to if (kvm_check_request(KVM_REQ_EVENT, vcpu) And perform the pir-virr sync. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU is blocked
On Fri, Mar 27, 2015 at 06:34:14AM +, Wu, Feng wrote: Currently, the following code is executed before local_irq_disable() is called, so do you mean 1)moving local_irq_disable() to the place before it. 2) after interrupt is disabled, set KVM_REQ_EVENT in case the ON bit is set? 2) after interrupt is disabled, set KVM_REQ_EVENT in case the ON bit is set. Here is my understanding about your comments here: - Disable interrupts - Check 'ON' - Set KVM_REQ_EVENT if 'ON' is set Then we can put the above code inside if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) just like it used to be. However, I still have some questions about this comment: 1. Where should I set KVM_REQ_EVENT? In function vcpu_enter_guest(), or other places? See below: If in vcpu_enter_guest(), since currently local_irq_disable() is called after 'KVM_REQ_EVENT' is checked, is it helpful to set KVM_REQ_EVENT after local_irq_disable() is called? local_irq_disable(); *** add code here *** if (vcpu-mode == EXITING_GUEST_MODE || vcpu-requests ^^ || need_resched() || signal_pending(current)) { vcpu-mode = OUTSIDE_GUEST_MODE; smp_wmb(); local_irq_enable(); preempt_enable(); vcpu-srcu_idx = srcu_read_lock(vcpu-kvm-srcu); r = 1; goto cancel_injection; } 2. 'ON' is set by VT-d hardware, it can be set even when interrupt is disabled (the related bit in PIR is also set). Yes, we are checking if the HW has set an interrupt in PIR while outside VM (which requires PIR-VIRR transfer by software). If the interrupt it set by hardware after local_irq_disable(), VMX-entry will handle the interrupt and perform the PIR-VIRR transfer and reevaluate interrupts, injecting to guest if necessary, is that correct ? So does it make sense to check 'ON' and set KVM_REQ_EVENT accordingly after interrupt is disabled? To replace the costly +*/ + if (kvm_x86_ops-hwapic_irr_update) + kvm_x86_ops-hwapic_irr_update(vcpu, + kvm_lapic_find_highest_irr(vcpu)); Yes, i think so. 
I might miss something in your comments, if so please point out. Thanks a lot! Thanks, Feng if (kvm_x86_ops-hwapic_irr_update) kvm_x86_ops-hwapic_irr_update(vcpu, kvm_lapic_find_highest_irr(vcpu)); kvm_lapic_find_highest_irr(vcpu) eats some cache (4 cachelines) versus 1 cacheline for reading ON bit. Please remove blocked and wakeup_cpu, they should not be necessary. Why do you think wakeup_cpu is not needed, when vCPU is blocked, wakeup_cpu saves the cpu which the vCPU is blocked on, after vCPU is woken up, it can run on a different cpu, so we need wakeup_cpu to find the right list to wake up the vCPU. If the vCPU was moved it should have updated IRTE destination field to the pCPU which it has moved to? Every time a vCPU is scheduled to a new pCPU, the IRTE destination filed would be updated accordingly. When vCPU is blocked. To wake up the blocked vCPU, we need to find which list the vCPU is blocked on, and this is what wakeup_cpu used for? Right, perhaps prev_vcpu is a better name. Do you mean prev_pcpu? Yes. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[GIT PULL] KVM fixes for 4.0-rc5
Linus,

Please pull from

  git://git.kernel.org/pub/scm/virt/kvm/kvm.git master

to receive the following PPC KVM bug fixes.

Marcelo Tosatti (1):
      Merge tag 'signed-for-4.0' of git://github.com/agraf/linux-2.6

Paul Mackerras (3):
      KVM: PPC: Book3S HV: Fix spinlock/mutex ordering issue in kvmppc_set_lpcr()
      KVM: PPC: Book3S HV: Endian fix for accessing VPA yield count
      KVM: PPC: Book3S HV: Fix instruction emulation

 arch/powerpc/kvm/book3s_hv.c            | 8
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 1 +
 2 files changed, 5 insertions(+), 4 deletions(-)
Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations
On Wed, Mar 25, 2015 at 04:22:03PM -0700, Andy Lutomirski wrote: On Wed, Mar 25, 2015 at 4:13 PM, Marcelo Tosatti mtosa...@redhat.com wrote: On Wed, Mar 25, 2015 at 03:48:02PM -0700, Andy Lutomirski wrote: On Wed, Mar 25, 2015 at 3:41 PM, Marcelo Tosatti mtosa...@redhat.com wrote: On Wed, Mar 25, 2015 at 03:33:10PM -0700, Andy Lutomirski wrote: On Mar 25, 2015 2:29 PM, Marcelo Tosatti mtosa...@redhat.com wrote: On Wed, Mar 25, 2015 at 01:52:15PM +0100, Radim Krčmář wrote: 2015-03-25 12:08+0100, Radim Krčmář: Reverting the patch protects us from any migration, but I don't think we need to care about changing VCPUs as long as we read a consistent data from kvmclock. (VCPU can change outside of this loop too, so it doesn't matter if we return a value not fit for this VCPU.) I think we could drop the second __getcpu if our kvmclock was being handled better; maybe with a patch like the one below: The second __getcpu is not neccessary, but I forgot about rdtsc. We need to either use rtdscp, know the host has synchronized tsc, or monitor VCPU migrations. Only the last one works everywhere. The vdso code is only used if host has synchronized tsc. But you have to handle the case where host goes from synchronized tsc to unsynchronized tsc (see the clocksource notifier in the host side). Can't we change the host to freeze all vcpus and clear the stable bit on all of them if this happens? This would simplify and speed up vclock_gettime. --Andy Seems interesting to do on 512-vcpus, but sure, could be done. If you have a 512-vcpu system that switches between stable and unstable more than once per migration, then I expect that you have serious problems and this is the least of your worries. Personally, I'd *much* rather we just made vcpu 0's pvti authoritative if we're stable. If nothing else, I'm not even remotely convinced that the current scheme gives monotonic timing due to skew between when the updates happen on different vcpus. Can you write down the problem ? I can try. 
Suppose we start out with all vcpus agreeing on their pvti and perfect invariant TSCs. Now the host updates its frequency (due to NTP or whatever). KVM updates vcpu 0's pvti. Before KVM updates vcpu 1's pvti, guest code on vcpus 0 and 1 see synced TSCs but different pvti. They'll disagree on the time, and one of them will be ahead until vcpu 1's pvti gets updated. The masterclock scheme enforces the same system_timestamp/tsc_timestamp pairs to be visible at one time, for all vcpus. * That is, when timespec0 != timespec1, M N. Unfortunately that is * not * always the case (the difference between two distinct xtime instances * might be smaller then the difference between corresponding TSC reads, * when updating guest vcpus pvclock areas). * * To avoid that problem, do not allow visibility of distinct * system_timestamp/tsc_timestamp values simultaneously: use a master * copy of host monotonic time values. Update that master copy * in lockstep. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations
On Thu, Mar 26, 2015 at 04:28:37PM -0700, Andy Lutomirski wrote: On Thu, Mar 26, 2015 at 4:22 PM, Marcelo Tosatti mtosa...@redhat.com wrote: On Thu, Mar 26, 2015 at 04:09:53PM -0700, Andy Lutomirski wrote: On Thu, Mar 26, 2015 at 3:56 PM, Marcelo Tosatti mtosa...@redhat.com wrote: On Thu, Mar 26, 2015 at 01:58:25PM -0700, Andy Lutomirski wrote: On Thu, Mar 26, 2015 at 1:31 PM, Radim Krcmar rkrc...@redhat.com wrote: 2015-03-26 11:51-0700, Andy Lutomirski: On Thu, Mar 26, 2015 at 4:29 AM, Marcelo Tosatti mtosa...@redhat.com wrote: On Wed, Mar 25, 2015 at 04:22:03PM -0700, Andy Lutomirski wrote: Suppose we start out with all vcpus agreeing on their pvti and perfect invariant TSCs. Now the host updates its frequency (due to NTP or whatever). KVM updates vcpu 0's pvti. Before KVM updates vcpu 1's pvti, guest code on vcpus 0 and 1 see synced TSCs but different pvti. They'll disagree on the time, and one of them will be ahead until vcpu 1's pvti gets updated. The masterclock scheme enforces the same system_timestamp/tsc_timestamp pairs to be visible at one time, for all vcpus. * That is, when timespec0 != timespec1, M N. Unfortunately that is * not * always the case (the difference between two distinct xtime instances * might be smaller then the difference between corresponding TSC reads, * when updating guest vcpus pvclock areas). * * To avoid that problem, do not allow visibility of distinct * system_timestamp/tsc_timestamp values simultaneously: use a master * copy of host monotonic time values. Update that master copy * in lockstep. Yuck. So we have per cpu timing data, but the protocol is only usable for monotonic timing because we forcibly freeze all vcpus when we update the nominally per cpu data. The obvious guest implementations are still unnecessarily slow, though. It would be nice if the guest could get away without using any getcpu operation at all. Even if we fixed the host to increment version as advertised, I think we can't avoid two getcpu ops. 
We need one before rdtsc to figure out which pvti to look at, Yes. and we need another to make sure that we were actually on that cpu at the time we did rdtsc. (Rdtscp doesn't help -- we need to check version before rdtsc, and we don't know what version to check until we do a getcpu.). Exactly, reading cpuid after rdtsc doesn't do that though, we could have migrated back between those reads. rtdscp would allow us to check that we read tsc of pvti's cpu. (It doesn't get rid of that first read.) The migration hook has the same issue -- we need to check the migration count, then confirm we're on that cpu, then check the migration count again, and we can't do that until we know what cpu we're on. True; the revert has a bug -- we need to check cpuid for the second time before rdtsc. (Migration hook is there just because we don't know which cpu executed rdtsc.) One way or another, I'm planning on completely rewriting the vdso code. An early draft is here: https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdsoid=57ace6e6e032afc4faf7b9ec52f78a8e6642c980 but I can't finish it until the KVM side shakes out. I think there are at least two ways that would work: a) If KVM incremented version as advertised: All for it. cpu = getcpu(); pvti = pvti for cpu; ver1 = pvti-version; check stable bit; rdtsc_barrier, rdtsc, read scale, shift, etc. if (getcpu() != cpu) retry; if (pvti-version != ver1) retry; I think this is safe because, we're guaranteed that there was an interval (between the two version reads) in which the vcpu we think we're on was running and the kvmclock data was valid and marked stable, and we know that the tsc we read came from that interval. Note: rdtscp isn't needed. If we're stable, is makes no difference which cpu's tsc we actually read. Yes, can't see a problem with that. b) If version remains buggy but we use this migrations_from hack: There is no reason for version to remain buggy. 
cpu = getcpu(); pvti = pvti for cpu; m1 = pvti-migrations_from; barrier(); ver1 = pvti-version; check stable bit; rdtsc_barrier, rdtsc, read scale, shift, etc. if (getcpu() != cpu) retry; if (pvti-version != ver1) retry; /* probably not really needed */ barrier(); if (pvti-migrations_from != m1) retry; This is just like (a), except that we're using a guest kernel hack to ensure that no one migrated off the vcpu during the version-protected critical section
Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations
On Thu, Mar 26, 2015 at 09:59:24PM +0100, Radim Krčmář wrote:
 2015-03-23 20:21-0300, Marcelo Tosatti:
  The following point:

  2. per-CPU pvclock time info is updated if the underlying CPU changes.

  Is not true anymore since "KVM: x86: update pvclock area conditionally, on cpu migration".

  Add task migration notification back.

  Problem noticed by Andy Lutomirski.

  Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>
  CC: sta...@kernel.org # 3.11+

 The revert contains a bug that got pointed out in the discussion:

 diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
 	do {
 		cpu = __getcpu() & VGETCPU_CPU_MASK;
 		pvti = get_pvti(cpu);

 We can migrate to 'other cpu' here.

 +		migrate_count = pvti->migrate_count;
 +
 		version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);

 And migrate back to 'cpu' here.

Migrating back will increase pvti->migrate_count, right?

 rdtsc was executed on a different cpu, so pvti and tsc might not be in sync, but migrate_count hasn't changed.

 		cpu1 = __getcpu() & VGETCPU_CPU_MASK;

 (Reading cpuid here is useless.)

 	} while (unlikely(cpu != cpu1 ||
 			  (pvti->pvti.version & 1) ||
 -			  pvti->pvti.version != version));
 +			  pvti->pvti.version != version ||
 +			  pvti->migrate_count != migrate_count));

 We can work around the bug with:

 	cpu = __getcpu() & VGETCPU_CPU_MASK;
 	pvti = get_pvti(cpu);
 	migrate_count = pvti->migrate_count;
 	if (cpu != (__getcpu() & VGETCPU_CPU_MASK))
 		continue;
Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations
On Thu, Mar 26, 2015 at 03:24:10PM -0700, Andy Lutomirski wrote: On Thu, Mar 26, 2015 at 3:22 PM, Marcelo Tosatti mtosa...@redhat.com wrote: On Thu, Mar 26, 2015 at 09:59:24PM +0100, Radim Krčmář wrote: 2015-03-23 20:21-0300, Marcelo Tosatti: The following point: 2. per-CPU pvclock time info is updated if the underlying CPU changes. Is not true anymore since KVM: x86: update pvclock area conditionally, on cpu migration. Add task migration notification back. Problem noticed by Andy Lutomirski. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com CC: sta...@kernel.org # 3.11+ Revert contains a bug that got pointed out in the discussion: diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c do { cpu = __getcpu() VGETCPU_CPU_MASK; pvti = get_pvti(cpu); We can migrate to 'other cpu' here. + migrate_count = pvti-migrate_count; + version = __pvclock_read_cycles(pvti-pvti, ret, flags); And migrate back to 'cpu' here. Migrating back will increase pvti-migrate_count, right ? I thought it only increased the count when we migrated away. Right. --Andy rdtsc was executed on different cpu, so pvti and tsc might not be in sync, but migrate_count hasn't changed. cpu1 = __getcpu() VGETCPU_CPU_MASK; (Reading cpuid here is useless.) } while (unlikely(cpu != cpu1 || (pvti-pvti.version 1) || - pvti-pvti.version != version)); + pvti-pvti.version != version || + pvti-migrate_count != migrate_count)); We can workaround the bug with, cpu = __getcpu() VGETCPU_CPU_MASK; pvti = get_pvti(cpu); migrate_count = pvti-migrate_count; if (cpu != (__getcpu() VGETCPU_CPU_MASK)) continue; Looks good, please submit a fix. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations
On Thu, Mar 26, 2015 at 01:58:25PM -0700, Andy Lutomirski wrote: On Thu, Mar 26, 2015 at 1:31 PM, Radim Krcmar rkrc...@redhat.com wrote: 2015-03-26 11:51-0700, Andy Lutomirski: On Thu, Mar 26, 2015 at 4:29 AM, Marcelo Tosatti mtosa...@redhat.com wrote: On Wed, Mar 25, 2015 at 04:22:03PM -0700, Andy Lutomirski wrote: Suppose we start out with all vcpus agreeing on their pvti and perfect invariant TSCs. Now the host updates its frequency (due to NTP or whatever). KVM updates vcpu 0's pvti. Before KVM updates vcpu 1's pvti, guest code on vcpus 0 and 1 see synced TSCs but different pvti. They'll disagree on the time, and one of them will be ahead until vcpu 1's pvti gets updated. The masterclock scheme enforces the same system_timestamp/tsc_timestamp pairs to be visible at one time, for all vcpus. * That is, when timespec0 != timespec1, M N. Unfortunately that is * not * always the case (the difference between two distinct xtime instances * might be smaller then the difference between corresponding TSC reads, * when updating guest vcpus pvclock areas). * * To avoid that problem, do not allow visibility of distinct * system_timestamp/tsc_timestamp values simultaneously: use a master * copy of host monotonic time values. Update that master copy * in lockstep. Yuck. So we have per cpu timing data, but the protocol is only usable for monotonic timing because we forcibly freeze all vcpus when we update the nominally per cpu data. The obvious guest implementations are still unnecessarily slow, though. It would be nice if the guest could get away without using any getcpu operation at all. Even if we fixed the host to increment version as advertised, I think we can't avoid two getcpu ops. We need one before rdtsc to figure out which pvti to look at, Yes. and we need another to make sure that we were actually on that cpu at the time we did rdtsc. (Rdtscp doesn't help -- we need to check version before rdtsc, and we don't know what version to check until we do a getcpu.). 
Exactly; reading cpuid after rdtsc doesn't do that though, we could have migrated back between those reads. rdtscp would allow us to check that we read the tsc of pvti's cpu. (It doesn't get rid of that first read.) The migration hook has the same issue -- we need to check the migration count, then confirm we're on that cpu, then check the migration count again, and we can't do that until we know what cpu we're on. True; the revert has a bug -- we need to check cpuid for the second time before rdtsc. (The migration hook is there just because we don't know which cpu executed rdtsc.) One way or another, I'm planning on completely rewriting the vdso code. An early draft is here: https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso&id=57ace6e6e032afc4faf7b9ec52f78a8e6642c980 but I can't finish it until the KVM side shakes out. I think there are at least two ways that would work: a) If KVM incremented version as advertised: All for it. cpu = getcpu(); pvti = pvti for cpu; ver1 = pvti->version; check stable bit; rdtsc_barrier, rdtsc, read scale, shift, etc. if (getcpu() != cpu) retry; if (pvti->version != ver1) retry; I think this is safe because we're guaranteed that there was an interval (between the two version reads) in which the vcpu we think we're on was running and the kvmclock data was valid and marked stable, and we know that the tsc we read came from that interval. Note: rdtscp isn't needed. If we're stable, it makes no difference which cpu's tsc we actually read. Yes, can't see a problem with that. b) If version remains buggy but we use this migrations_from hack: There is no reason for version to remain buggy. cpu = getcpu(); pvti = pvti for cpu; m1 = pvti->migrations_from; barrier(); ver1 = pvti->version; check stable bit; rdtsc_barrier, rdtsc, read scale, shift, etc.
if (getcpu() != cpu) retry; if (pvti->version != ver1) retry; /* probably not really needed */ barrier(); if (pvti->migrations_from != m1) retry; This is just like (a), except that we're using a guest kernel hack to ensure that no one migrated off the vcpu during the version-protected critical section and that we were, in fact, on that vcpu at some point during that critical section. Once we've ensured that we were on pvti's associated vcpu for the entire time we were reading it, then we are protected by the existing versioning in the host. If, on the other hand, we could rely on having all of these things in sync, then this complication goes away, and we go down from two getcpu ops to zero. (Yeah, we should look at what the drawbacks of doing it differently are.) If the versioning were fixed, I think we could almost get away with: pvti = pvti for vcpu 0; ver1
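To make the disagreement scenario described in this thread concrete, here is a toy userspace model of the pvclock computation (struct layout simplified, all field names and numbers invented for illustration, and the 128-bit multiply is a GCC/Clang extension): two vcpus read the same TSC value, but one still holds the pre-update pvti snapshot, so they compute different times.

```c
#include <stdint.h>

/* Toy pvti: time = system_time + (tsc - tsc_timestamp) * mul_frac / 2^32,
 * where mul_frac encodes ns-per-cycle as a 0.32 fixed-point fraction.
 * This mirrors the shape of the real pvclock formula, not its exact code. */
struct toy_pvti {
    uint64_t system_time;   /* ns at the moment tsc_timestamp was taken */
    uint64_t tsc_timestamp; /* TSC value at that moment */
    uint32_t mul_frac;      /* ns per cycle, as a 0.32 fixed-point value */
};

static uint64_t toy_read(const struct toy_pvti *p, uint64_t tsc)
{
    /* 64x32 multiply, keeping the high part of the 96-bit product. */
    return p->system_time +
           (uint64_t)(((unsigned __int128)(tsc - p->tsc_timestamp) *
                       p->mul_frac) >> 32);
}
```

With a stale 0.5 ns/cycle snapshot versus an updated 0.25 ns/cycle one (taken at tsc = 1000, time = 500 ns), the same TSC read of 2000 yields 1000 ns on one vcpu and 750 ns on the other — exactly the skew the masterclock scheme is meant to prevent.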
Re: [PATCH v2 06/12] KVM: mark kvm->buses as empty once they were destroyed
On Wed, Mar 25, 2015 at 05:09:13PM +, Marc Zyngier wrote: On 23/03/15 15:58, Andre Przywara wrote: In kvm_destroy_vm() we call kvm_io_bus_destroy() pretty early, especially before calling kvm_arch_destroy_vm(). To avoid unregistering devices from the already destroyed bus, let's mark the bus with NULL to let other users know it has been destroyed already. This avoids a crash on a VM shutdown with the VGIC using the kvm_io_bus later (the unregistering is in there to be able to roll back a faulting init). Signed-off-by: Andre Przywara andre.przyw...@arm.com That seems sensible, but I don't see why nobody else hits that. What are we doing differently? Otherwise, Reviewed-by: Marc Zyngier marc.zyng...@arm.com Paolo, Marcelo, can we have your Ack on this? Thanks, M. --- virt/kvm/kvm_main.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 8c7ab0b..6f164eb 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -604,8 +604,10 @@ static void kvm_destroy_vm(struct kvm *kvm) list_del(&kvm->vm_list); spin_unlock(&kvm_lock); kvm_free_irq_routing(kvm); - for (i = 0; i < KVM_NR_BUSES; i++) + for (i = 0; i < KVM_NR_BUSES; i++) { kvm_io_bus_destroy(kvm->buses[i]); + kvm->buses[i] = NULL; + } kvm_coalesced_mmio_free(kvm); #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER) mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm); -- Jazz is not dead. It just smells funny... Reviewed-by: Marcelo Tosatti mtosa...@redhat.com -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3] KVM: nVMX: Add support for rdtscp
On Mon, Mar 23, 2015 at 07:27:19PM +0100, Jan Kiszka wrote: From: Jan Kiszka jan.kis...@siemens.com If the guest CPU is supposed to support rdtscp and the host has rdtscp enabled in the secondary execution controls, we can also expose this feature to L1. Just extend nested_vmx_exit_handled to properly route EXIT_REASON_RDTSCP. Signed-off-by: Jan Kiszka jan.kis...@siemens.com Applied, thanks.
Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations
On Wed, Mar 25, 2015 at 01:52:15PM +0100, Radim Krčmář wrote: 2015-03-25 12:08+0100, Radim Krčmář: Reverting the patch protects us from any migration, but I don't think we need to care about changing VCPUs as long as we read consistent data from kvmclock. (The VCPU can change outside of this loop too, so it doesn't matter if we return a value not fit for this VCPU.) I think we could drop the second __getcpu if our kvmclock was being handled better; maybe with a patch like the one below: The second __getcpu is not necessary, but I forgot about rdtsc. We need to either use rdtscp, know the host has synchronized tsc, or monitor VCPU migrations. Only the last one works everywhere. The vdso code is only used if the host has synchronized tsc. But you have to handle the case where the host goes from synchronized tsc to unsynchronized tsc (see the clocksource notifier on the host side).
Re: [PATCH v2 04/12] KVM: x86: remove now unneeded include directory from Makefile
On Wed, Mar 25, 2015 at 05:01:52PM +, Marc Zyngier wrote: On 23/03/15 15:58, Andre Przywara wrote: virt/kvm was never really a good include directory for anything else than locally included headers. With the move of iodev.h there is no need anymore to add this directory to the compiler's include path, so remove it from the x86 kvm Makefile. Signed-off-by: Andre Przywara andre.przyw...@arm.com Reviewed-by: Marc Zyngier marc.zyng...@arm.com Paolo, Marcelo: can we have your Ack on this? Thanks, M. Reviewed-by: Marcelo Tosatti mtosa...@redhat.com
Re: [PATCH v2 02/12] KVM: move iodev.h from virt/kvm/ to include/kvm
On Wed, Mar 25, 2015 at 05:00:02PM +, Marc Zyngier wrote: On 23/03/15 15:58, Andre Przywara wrote: iodev.h contains definitions for the kvm_io_bus framework. This is needed both by the generic KVM code in virt/kvm as well as by architecture specific code under arch/. Putting the header file in virt/kvm and using local includes in the architecture part seems at least dodgy to me, so let's move the file into include/kvm, so that a more natural #include <kvm/iodev.h> can be used by all of the code. This also solves a problem later when using struct kvm_io_device in arm_vgic.h. Fixing up the FSF address in the GPL header and a wrong include path on the way. Signed-off-by: Andre Przywara andre.przyw...@arm.com Acked-by: Christoffer Dall christoffer.d...@linaro.org Reviewed-by: Marc Zyngier marc.zyng...@arm.com Paolo, Marcelo: can we have your Ack on this? Thanks, Reviewed-by: Marcelo Tosatti mtosa...@redhat.com
Re: [PATCH v2 06/12] KVM: mark kvm->buses as empty once they were destroyed
On Wed, Mar 25, 2015 at 05:09:13PM +, Marc Zyngier wrote: On 23/03/15 15:58, Andre Przywara wrote: In kvm_destroy_vm() we call kvm_io_bus_destroy() pretty early, especially before calling kvm_arch_destroy_vm(). To avoid unregistering devices from the already destroyed bus, let's mark the bus with NULL to let other users know it has been destroyed already. This avoids a crash on a VM shutdown with the VGIC using the kvm_io_bus later (the unregistering is in there to be able to roll back a faulting init). Signed-off-by: Andre Przywara andre.przyw...@arm.com That seems sensible, but I don't see why nobody else hits that. What are we doing differently? It should be valid to call kvm_io_bus_unregister_dev after kvm_io_bus_destroy. Are you patching it to handle a NULL kvm->buses[bus_idx]? Otherwise, Reviewed-by: Marc Zyngier marc.zyng...@arm.com Paolo, Marcelo, can we have your Ack on this? Thanks,
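A minimal userspace sketch of the invariant the patch establishes, and of the guard Marcelo asks about (all names here are illustrative, not the real KVM API): destruction frees every bus and marks its slot NULL, and an unregister path that checks for NULL becomes a harmless no-op instead of dereferencing a freed bus.

```c
#include <stddef.h>
#include <stdlib.h>

#define NR_BUSES 4

struct io_bus { int dev_count; };

struct vm { struct io_bus *buses[NR_BUSES]; };

/* Tear down all buses and mark the slots NULL, as the patch does in
 * kvm_destroy_vm(), so later callers can tell the bus is gone. */
static void vm_destroy_buses(struct vm *vm)
{
    for (int i = 0; i < NR_BUSES; i++) {
        free(vm->buses[i]);
        vm->buses[i] = NULL;
    }
}

/* Hypothetical NULL-aware unregister: returns 0 if a device was
 * removed, -1 if the bus was already destroyed. */
static int vm_unregister_dev(struct vm *vm, int bus_idx)
{
    struct io_bus *bus = vm->buses[bus_idx];
    if (!bus)
        return -1;  /* bus already torn down: nothing to roll back */
    if (bus->dev_count > 0)
        bus->dev_count--;
    return 0;
}
```

The design point is that the NULL marker turns a use-after-free into an observable, recoverable condition at the single place that would otherwise crash.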
Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations
On Wed, Mar 25, 2015 at 03:33:10PM -0700, Andy Lutomirski wrote: On Mar 25, 2015 2:29 PM, Marcelo Tosatti mtosa...@redhat.com wrote: On Wed, Mar 25, 2015 at 01:52:15PM +0100, Radim Krčmář wrote: 2015-03-25 12:08+0100, Radim Krčmář: Reverting the patch protects us from any migration, but I don't think we need to care about changing VCPUs as long as we read consistent data from kvmclock. (The VCPU can change outside of this loop too, so it doesn't matter if we return a value not fit for this VCPU.) I think we could drop the second __getcpu if our kvmclock was being handled better; maybe with a patch like the one below: The second __getcpu is not necessary, but I forgot about rdtsc. We need to either use rdtscp, know the host has synchronized tsc, or monitor VCPU migrations. Only the last one works everywhere. The vdso code is only used if the host has synchronized tsc. But you have to handle the case where the host goes from synchronized tsc to unsynchronized tsc (see the clocksource notifier on the host side). Can't we change the host to freeze all vcpus and clear the stable bit on all of them if this happens? This would simplify and speed up vclock_gettime. --Andy Seems interesting to do on 512 vcpus, but sure, could be done.
Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU is blocked
On Mon, Mar 16, 2015 at 11:42:06AM +, Wu, Feng wrote: Do you have any reason why having the code at vcpu_put/vcpu_load is better than the proposal to have the code at kvm_vcpu_block? I think your proposal is good, I just want to better understand your idea here. :) Reduce the overhead of vcpu sched in / vcpu sched out, basically. One thing: even if we put the code in kvm_vcpu_block, we still need to add code at vcpu_put/vcpu_load for the preemption case, like what I did now. Global vectors are a limited resource in the system, and this involves common x86 interrupt code changes. I am not sure we can allocate so many dedicated global vectors for KVM usage. Why not? Have KVM use all free vectors (so if vectors are necessary for other purposes, people should shrink the KVM vector pool). If we want to allocate more global vectors for this usage, we need hpa's input about it. Peter, what is your opinion? Peter? BTW the Intel docs talk about that (one vector per vCPU). Yes, the spec talks about this, but it is more complex using one vector per vCPU. It seems there is a bunch free: commit 52aec3308db85f4e9f5c8b9f5dc4fbd0138c6fa4 Author: Alex Shi alex@intel.com Date: Thu Jun 28 09:02:23 2012 +0800 x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR Can you add only vcpus which have posted IRTEs that point to this pCPU to the HLT'ed vcpu lists? (So, for example, vcpus without assigned devices are not part of the list.) Is it easy to find whether a vCPU (or the associated domain) has assigned devices? If so, we can only add those vCPUs with assigned devices. When configuring the IRTE, at kvm_arch_vfio_update_pi_irte? Yes.
+ static int __init vmx_init(void) { int r, i, msr; @@ -9429,6 +9523,8 @@ static int __init vmx_init(void) update_ple_window_actual_max(); + wakeup_handler_callback = wakeup_handler; + return 0; out7: diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 0033df3..1551a46 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -6152,6 +6152,21 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) kvm_vcpu_reload_apic_access_page(vcpu); } + /* + * Since posted-interrupts can be set by VT-d HW now, in this + * case, KVM_REQ_EVENT is not set. We move the following + * operations out of the if statement. + */ + if (kvm_lapic_enabled(vcpu)) { + /* + * Update architecture specific hints for APIC + * virtual interrupt delivery. + */ + if (kvm_x86_ops->hwapic_irr_update) + kvm_x86_ops->hwapic_irr_update(vcpu, + kvm_lapic_find_highest_irr(vcpu)); + } + This is a hot fast path. You can set KVM_REQ_EVENT from wakeup_handler. I am afraid setting KVM_REQ_EVENT from wakeup_handler doesn't help much. If the vCPU is running in root mode, and VT-d hardware issues a notification event, the POSTED_INTR_VECTOR interrupt handler will be called. If the vCPU is in root mode, remapping HW will find the IRTE configured with vector == POSTED_INTR_WAKEUP_VECTOR, use that vector, which will VM-exit, and execute the interrupt handler wakeup_handler. Right? There are two cases: Case 1: the vCPU is blocked, so it is in root mode; this is what you described above. Case 2: the vCPU is running in root mode, such as when handling vm-exits; in this case, the notification vector is 'POSTED_INTR_VECTOR', and if external interrupts from assigned devices happen, the handler of 'POSTED_INTR_VECTOR' will be called (it is 'smp_kvm_posted_intr_ipi' in fact). This routine doesn't need to do real things, since the pending interrupts in PIR will be synced to vIRR before VM-Entry (this code has already been there when enabling CPU-side posted-interrupt along with APICv).
Like what I said before, it is a little hard to get vCPU-related information in it; even if we get it, it is not accurate and may harm the performance (it needs a search). So only setting KVM_REQ_EVENT in wakeup_handler cannot cover the notification event for 'POSTED_INTR_VECTOR'. The point of this comment is that you can keep the if (kvm_x86_ops->hwapic_irr_update) kvm_x86_ops->hwapic_irr_update(vcpu, kvm_lapic_find_highest_irr(vcpu)); code inside the KVM_REQ_EVENT handling section of vcpu_run, as long as wakeup_handler sets KVM_REQ_EVENT. Please see above. OK can you set KVM_REQ_EVENT in case the ON bit is set, after disabling
Re: [PULL 0/3] 4.0 patch queue 2015-03-25
On Wed, Mar 25, 2015 at 10:58:54PM +0100, Alexander Graf wrote: Hi Paolo, This is my current patch queue for 4.0. Please pull. Alex The following changes since commit f710a12d73dfa1c3a5d2417f2482b970f03bb850: Merge tag 'kvm-arm-fixes-4.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm (2015-03-16 20:08:56 -0300) are available in the git repository at: git://github.com/agraf/linux-2.6.git tags/signed-for-4.0 for you to fetch changes up to 2bf27601c7b50b6ced72f27304109dc52eb52919: KVM: PPC: Book3S HV: Fix instruction emulation (2015-03-20 11:42:33 +0100) Pulled, thanks.
Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations
On Wed, Mar 25, 2015 at 03:48:02PM -0700, Andy Lutomirski wrote: On Wed, Mar 25, 2015 at 3:41 PM, Marcelo Tosatti mtosa...@redhat.com wrote: On Wed, Mar 25, 2015 at 03:33:10PM -0700, Andy Lutomirski wrote: On Mar 25, 2015 2:29 PM, Marcelo Tosatti mtosa...@redhat.com wrote: On Wed, Mar 25, 2015 at 01:52:15PM +0100, Radim Krčmář wrote: 2015-03-25 12:08+0100, Radim Krčmář: Reverting the patch protects us from any migration, but I don't think we need to care about changing VCPUs as long as we read consistent data from kvmclock. (The VCPU can change outside of this loop too, so it doesn't matter if we return a value not fit for this VCPU.) I think we could drop the second __getcpu if our kvmclock was being handled better; maybe with a patch like the one below: The second __getcpu is not necessary, but I forgot about rdtsc. We need to either use rdtscp, know the host has synchronized tsc, or monitor VCPU migrations. Only the last one works everywhere. The vdso code is only used if the host has synchronized tsc. But you have to handle the case where the host goes from synchronized tsc to unsynchronized tsc (see the clocksource notifier on the host side). Can't we change the host to freeze all vcpus and clear the stable bit on all of them if this happens? This would simplify and speed up vclock_gettime. --Andy Seems interesting to do on 512 vcpus, but sure, could be done. If you have a 512-vcpu system that switches between stable and unstable more than once per migration, then I expect that you have serious problems and this is the least of your worries. Personally, I'd *much* rather we just made vcpu 0's pvti authoritative if we're stable. If nothing else, I'm not even remotely convinced that the current scheme gives monotonic timing due to skew between when the updates happen on different vcpus. Can you write down the problem?
[GIT PULL] KVM fixes for 4.0-rc5
Linus, Please pull from git://git.kernel.org/pub/scm/virt/kvm/kvm.git master To receive the following: a fix for higher-order page allocation failures, a fix for Xen-on-KVM with x2apic, and a fix for an L1 crash with unrestricted guest mode (nested VMX). Igor Mammedov (1): kvm: avoid page allocation failure in kvm_set_memory_region() Radim Krčmář (2): KVM: nVMX: mask unrestricted_guest if disabled on L0 KVM: x86: call irq notifiers with directed EOI arch/x86/kvm/ioapic.c | 4 +++- arch/x86/kvm/lapic.c | 3 +-- arch/x86/kvm/vmx.c | 7 +-- virt/kvm/kvm_main.c | 14 +++--- 4 files changed, 16 insertions(+), 12 deletions(-)
Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations
On Tue, Mar 24, 2015 at 04:34:12PM +0100, Radim Krčmář wrote: 2015-03-23 20:21-0300, Marcelo Tosatti: The following point: 2. per-CPU pvclock time info is updated if the underlying CPU changes. Is not true anymore since "KVM: x86: update pvclock area conditionally, on cpu migration". I think that the revert doesn't fix point 2: "KVM: x86: update pvclock [...]" changed the host to skip the clock update on physical CPU change, but the guest's task migration notifier isn't tied to it at all. "per-CPU pvclock time info is updated if the underlying CPU changes" is the same as "always perform clock update on physical CPU change". That was a requirement for the original patch, to drop migration notifiers. (The guest can have all tasks pinned, so the revert changed nothing.) Add task migration notification back. Problem noticed by Andy Lutomirski. What is the problem? Thanks. The problem is this: T1) guest thread1 on vcpu1. T2) guest thread1 on vcpu2. T3) guest thread1 on vcpu1, inside a pvclock read loop. Since the writes by the hypervisor of the pvclock area are not ordered, you cannot rely on version being updated _before_ the rest of the pvclock data. (In the case above, the "has the physical cpu changed" check, inside the guest's thread1, obviously fails.)
Re: [GIT PULL 00/11] KVM: s390: Features and fixes for 4.1 (kvm/next)
On Wed, Mar 18, 2015 at 12:43:58PM +0100, Christian Borntraeger wrote: Paolo, Marcelo, here is the followup pull request. As Marcelo has not yet pushed out queue or next to git.kernel.org, this request is based on the previous s390 pull request and should merge without conflicts. For details see tag description. Christian The following changes since commit 13211ea7b47db3d8ee2ff258a9a973a6d3aa3d43: KVM: s390: Enable vector support for capable guest (2015-03-06 13:49:35 +0100) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git tags/kvm-s390-next-20150318 for you to fetch changes up to 18280d8b4bcd4a2b174ee3cd748166c6190acacb: KVM: s390: represent SIMD cap in kvm facility (2015-03-17 16:33:14 +0100) Pulled, thanks.
x86: kvm: Revert remove sched notifier for cross-cpu migrations
The following point: 2. per-CPU pvclock time info is updated if the underlying CPU changes. Is not true anymore since "KVM: x86: update pvclock area conditionally, on cpu migration". Add task migration notification back. Problem noticed by Andy Lutomirski. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com CC: sta...@kernel.org # 3.11+ diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h index d6b078e..25b1cc0 100644 --- a/arch/x86/include/asm/pvclock.h +++ b/arch/x86/include/asm/pvclock.h @@ -95,6 +95,7 @@ unsigned __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src, struct pvclock_vsyscall_time_info { struct pvclock_vcpu_time_info pvti; + u32 migrate_count; } __attribute__((__aligned__(SMP_CACHE_BYTES))); #define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info) diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c index 2f355d2..e5ecd20 100644 --- a/arch/x86/kernel/pvclock.c +++ b/arch/x86/kernel/pvclock.c @@ -141,7 +141,46 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock, set_normalized_timespec(ts, now.tv_sec, now.tv_nsec); } +static struct pvclock_vsyscall_time_info *pvclock_vdso_info; + +static struct pvclock_vsyscall_time_info * +pvclock_get_vsyscall_user_time_info(int cpu) +{ + if (!pvclock_vdso_info) { + BUG(); + return NULL; + } + + return &pvclock_vdso_info[cpu]; +} + +struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu) +{ + return &pvclock_get_vsyscall_user_time_info(cpu)->pvti; +} + #ifdef CONFIG_X86_64 +static int pvclock_task_migrate(struct notifier_block *nb, unsigned long l, + void *v) +{ + struct task_migration_notifier *mn = v; + struct pvclock_vsyscall_time_info *pvti; + + pvti = pvclock_get_vsyscall_user_time_info(mn->from_cpu); + + /* this is NULL when pvclock vsyscall is not initialized */ + if (unlikely(pvti == NULL)) + return NOTIFY_DONE; + + pvti->migrate_count++; + + return NOTIFY_DONE; +} + +static struct notifier_block pvclock_migrate = { + .notifier_call = pvclock_task_migrate, +}; + /* * Initialize the generic pvclock vsyscall state. This will allocate * a/some page(s) for the per-vcpu pvclock information, set up a @@ -155,12 +194,17 @@ int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i, WARN_ON (size != PVCLOCK_VSYSCALL_NR_PAGES*PAGE_SIZE); + pvclock_vdso_info = i; + for (idx = 0; idx <= (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) { __set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx, __pa(i) + (idx*PAGE_SIZE), PAGE_KERNEL_VVAR); } + + register_task_migration_notifier(&pvclock_migrate); + return 0; } #endif diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c index 9793322..3093376 100644 --- a/arch/x86/vdso/vclock_gettime.c +++ b/arch/x86/vdso/vclock_gettime.c @@ -82,18 +82,15 @@ static notrace cycle_t vread_pvclock(int *mode) cycle_t ret; u64 last; u32 version; + u32 migrate_count; u8 flags; unsigned cpu, cpu1; /* -* Note: hypervisor must guarantee that: -* 1. cpu ID number maps 1:1 to per-CPU pvclock time info. -* 2. that per-CPU pvclock time info is updated if the -*underlying CPU changes. -* 3. that version is increased whenever underlying CPU -*changes. -* +* When looping to get a consistent (time-info, tsc) pair, we +* also need to deal with the possibility we can switch vcpus, +* so make sure we always re-fetch time-info for the current vcpu.
*/ do { cpu = __getcpu() & VGETCPU_CPU_MASK; @@ -104,6 +101,8 @@ static notrace cycle_t vread_pvclock(int *mode) pvti = get_pvti(cpu); + migrate_count = pvti->migrate_count; + version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags); /* @@ -115,7 +114,8 @@ static notrace cycle_t vread_pvclock(int *mode) cpu1 = __getcpu() & VGETCPU_CPU_MASK; } while (unlikely(cpu != cpu1 || (pvti->pvti.version & 1) || - pvti->pvti.version != version)); + pvti->pvti.version != version || + pvti->migrate_count != migrate_count)); if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT))) *mode = VCLOCK_NONE; diff --git a/include/linux/sched.h b/include/linux/sched.h index 6d77432..be98910 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -176,6 +176,14 @@ extern void get_iowait_load(unsigned long *nr_waiters, unsigned long *load); extern void calc_global_load(unsigned long ticks); extern void update_cpu_load_nohz(void
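The vread_pvclock() loop in the patch above hides the actual time computation inside __pvclock_read_cycles(). As a minimal userspace sketch of that arithmetic (mirroring the logic of the kernel's pvclock_scale_delta(), though the function name and the 128-bit shortcut here are illustrative), the TSC delta is first shifted by tsc_shift and then multiplied by the 0.32 fixed-point fraction tsc_to_system_mul:

```c
#include <stdint.h>

/* Scale a TSC delta into nanoseconds:
 * ns = (delta << tsc_shift) * tsc_to_system_mul / 2^32,
 * where a negative tsc_shift means a right shift. */
static uint64_t scale_delta(uint64_t delta, uint32_t tsc_to_system_mul,
                            int8_t tsc_shift)
{
    if (tsc_shift < 0)
        delta >>= -tsc_shift;
    else
        delta <<= tsc_shift;
    /* Keep the high 64 bits of the 96-bit product (GCC/Clang
     * __int128 extension as a shortcut for the 64x32 multiply). */
    return (uint64_t)(((unsigned __int128)delta * tsc_to_system_mul) >> 32);
}
```

With tsc_to_system_mul = 1 << 31 (i.e. 0.5) and tsc_shift = 1, the shift and the fraction cancel out, so a delta of 1000 cycles maps to 1000 ns.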
Re: [PATCH] KVM: x86: call irq notifiers with directed EOI
On Wed, Mar 18, 2015 at 07:38:22PM +0100, Radim Krčmář wrote: kvm_ioapic_update_eoi() wasn't called if directed EOI was enabled. We need to do that for irq notifiers. (Like with edge interrupts.) Fix it by skipping EOI broadcast only. Bug: https://bugzilla.kernel.org/show_bug.cgi?id=82211 Signed-off-by: Radim Krčmář rkrc...@redhat.com --- arch/x86/kvm/ioapic.c | 4 +++- arch/x86/kvm/lapic.c | 3 +-- 2 files changed, 4 insertions(+), 3 deletions(-) Applied to master, thanks.
Re: [PATCH] KVM: x86: inline kvm_ioapic_handles_vector()
On Thu, Mar 19, 2015 at 09:52:41PM +0100, Radim Krčmář wrote: An overhead from a function call is not appropriate for its size and frequency of execution. Suggested-by: Paolo Bonzini pbonz...@redhat.com Signed-off-by: Radim Krčmář rkrc...@redhat.com --- I'm not very fond of that smp_rmb(): there is no real synchronization against update_handled_vectors(), Yes, because the guest OS should provide synchronization (it should shut down interrupts before attempting to modify the IOAPIC table). The smp_wmb is necessary. so the only point I see is to drop the cached value of handled_vectors, which seems like a bad use of LFENCE. test_bit has volatile on *addr, so I don't see why the smp_rmb is necessary at all. Applied, thanks.
x86: kvm: rename migrate_count variable
As that's more indicative of the variable's usage. Suggested by Andy Lutomirski.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 25b1cc0..1c1b474 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -95,7 +95,7 @@ unsigned __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src,
 struct pvclock_vsyscall_time_info {
 	struct pvclock_vcpu_time_info pvti;
-	u32 migrate_count;
+	u32 migrate_from_count;
 } __attribute__((__aligned__(SMP_CACHE_BYTES)));

 #define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info)
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index e5ecd20..8eaf04b 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -172,7 +172,7 @@ static int pvclock_task_migrate(struct notifier_block *nb, unsigned long l,
 	if (unlikely(pvti == NULL))
 		return NOTIFY_DONE;

-	pvti->migrate_count++;
+	pvti->migrate_from_count++;

 	return NOTIFY_DONE;
 }
diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
index 3093376..ef8bb76 100644
--- a/arch/x86/vdso/vclock_gettime.c
+++ b/arch/x86/vdso/vclock_gettime.c
@@ -82,7 +82,7 @@ static notrace cycle_t vread_pvclock(int *mode)
 	cycle_t ret;
 	u64 last;
 	u32 version;
-	u32 migrate_count;
+	u32 migrate_from_count;
 	u8 flags;
 	unsigned cpu, cpu1;

@@ -101,7 +101,7 @@ static notrace cycle_t vread_pvclock(int *mode)

 		pvti = get_pvti(cpu);

-		migrate_count = pvti->migrate_count;
+		migrate_from_count = pvti->migrate_from_count;

 		version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);

@@ -115,7 +115,7 @@ static notrace cycle_t vread_pvclock(int *mode)
 	} while (unlikely(cpu != cpu1 ||
 			  (pvti->pvti.version & 1) ||
 			  pvti->pvti.version != version ||
-			  pvti->migrate_count != migrate_count));
+			  pvti->migrate_from_count != migrate_from_count));

 	if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
 		*mode = VCLOCK_NONE;
Re: [PATCH] kvm: avoid page allocation failure in kvm_set_memory_region()
On Fri, Mar 20, 2015 at 09:51:26AM +0000, Igor Mammedov wrote:
> KVM guest can fail to start up with the following trace on the host:
>
> qemu-system-x86: page allocation failure: order:4, mode:0x40d0
> Call Trace:
>  dump_stack+0x47/0x67
>  warn_alloc_failed+0xee/0x150
>  __alloc_pages_direct_compact+0x14a/0x150
>  __alloc_pages_nodemask+0x776/0xb80
>  alloc_kmem_pages+0x3a/0x110
>  kmalloc_order+0x13/0x50
>  kmemdup+0x1b/0x40
>  __kvm_set_memory_region+0x24a/0x9f0 [kvm]
>  kvm_set_ioapic+0x130/0x130 [kvm]
>  kvm_set_memory_region+0x21/0x40 [kvm]
>  kvm_vm_ioctl+0x43f/0x750 [kvm]
>
> Failure happens when attempting to allocate pages for 'struct
> kvm_memslots'; however, it doesn't have to be in physically contiguous
> (kmalloc-ed) address space. Change the allocation to kvm_kvzalloc() so
> that it will be vmalloc-ed when its size is more than a page.
>
> Signed-off-by: Igor Mammedov imamm...@redhat.com

Igor, two things:

1) kvm_create_vm should also use vmalloc:

	r = -ENOMEM;
	kvm->memslots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
	if (!kvm->memslots)
		goto out_err_no_srcu;

2) there are additional places where it's necessary to use the proper
freeing function, I believe:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ce7888a..651ff2d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -522,7 +522,7 @@ out_err_no_srcu:
 out_err_no_disable:
 	for (i = 0; i < KVM_NR_BUSES; i++)
 		kfree(kvm->buses[i]);
-	kfree(kvm->memslots);
+	kvfree(kvm->memslots);
 	kvm_arch_free_vm(kvm);
 	return ERR_PTR(r);
 }
@@ -570,7 +570,7 @@ static void kvm_free_physmem(struct kvm *kvm)
 	kvm_for_each_memslot(memslot, slots)
 		kvm_free_physmem_slot(kvm, memslot, NULL);

-	kfree(kvm->memslots);
+	kvfree(kvm->memslots);
 }

 static void kvm_destroy_devices(struct kvm *kvm)
@@ -909,7 +922,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	kvm_arch_commit_memory_region(kvm, mem, old, change);

 	kvm_free_physmem_slot(kvm, old, new);
-	kfree(old_memslots);
+	kvfree(old_memslots);

 	/*
 	 * IOMMU mapping:  New slots need to be mapped.  Old slots need to be

> ---
> TODO:
>  - work on follow-up patches to allocate space for the actual amount of
>    memory_slots instead of the possible maximum.
> ---
>  virt/kvm/kvm_main.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index a2214d9..7ed1f5c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -871,10 +871,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
> 		goto out_free;
> 	}
>
> -	slots = kmemdup(kvm->memslots, sizeof(struct kvm_memslots),
> -			GFP_KERNEL);
> +	slots = kvm_kvzalloc(sizeof(struct kvm_memslots));
> 	if (!slots)
> 		goto out_free;
> +	memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
>
> 	if ((change == KVM_MR_DELETE) || (change == KVM_MR_MOVE)) {
> 		slot = id_to_memslot(slots, mem->slot);
> @@ -936,7 +936,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> 	return 0;
>
> out_slots:
> -	kfree(slots);
> +	kvfree(slots);
> out_free:
> 	kvm_free_physmem_slot(kvm, new, old);
> out:
> --
> 1.8.3.1
Re: [PATCH/RFC 4/9] KVM: s390: Add MEMOP ioctls for reading/writing guest memory
On Thu, Mar 19, 2015 at 02:15:18PM +0100, Thomas Huth wrote:
> > Date: Wed, 18 Mar 2015 21:23:48 -0300
> > From: Marcelo Tosatti mtosa...@redhat.com
> > On Mon, Mar 16, 2015 at 09:51:40AM +0100, Christian Borntraeger wrote:
> > > From: Thomas Huth th...@linux.vnet.ibm.com
> > >
> > > On s390, we've got to make sure to hold the IPTE lock while accessing
> > > logical memory. So let's add an ioctl for reading and writing logical
> > > memory to provide this feature for userspace, too.
> > ...
> > What I was wondering is why you can't translate the address in the
> > kernel and let userspace perform the actual read/write?
>
> The idea here is to protect the read/write access with the ipte-lock,
> too. That way, the whole address translation _and_ the read/write access
> are protected together against invalidate-page-table operations from
> other CPUs, so the whole memory access looks atomic for other VCPUs. And
> since we do not want to expose the ipte lock directly to user space,
> both have to be done in the kernel.
>
> We already had a long internal discussion about this in our team, and
> indeed, if the ipte-lock were the only problem that we face on s390, we
> might also come up with a solution where the memory read/write access is
> done in userspace instead. However, for full architecture compliance, we
> later have got to support the so-called storage keys during memory
> accesses, too, and this can hardly be done accurately and safely from
> userspace. So I'm afraid it's somewhat ugly that we've got to provide an
> ioctl here to read/write the guest memory, but it's the only feasible
> solution that I could think of.

I see, thanks.
Re: [PATCH] KVM: x86: call irq notifiers with directed EOI
On Wed, Mar 18, 2015 at 07:38:22PM +0100, Radim Krčmář wrote:
> kvm_ioapic_update_eoi() wasn't called if directed EOI was enabled. We
> need to do that for irq notifiers. (Like with edge interrupts.)
> Fix it by skipping EOI broadcast only.
>
> Bug: https://bugzilla.kernel.org/show_bug.cgi?id=82211
> Signed-off-by: Radim Krčmář rkrc...@redhat.com
> ---
>  arch/x86/kvm/ioapic.c | 4 +++-
>  arch/x86/kvm/lapic.c  | 3 +--
>  2 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
> index b1947e0f3e10..46d4449772bc 100644
> --- a/arch/x86/kvm/ioapic.c
> +++ b/arch/x86/kvm/ioapic.c
> @@ -422,6 +422,7 @@ static void __kvm_ioapic_update_eoi(struct kvm_vcpu *vcpu,
> 			struct kvm_ioapic *ioapic, int vector, int trigger_mode)
> {
> 	int i;
> +	struct kvm_lapic *apic = vcpu->arch.apic;
>
> 	for (i = 0; i < IOAPIC_NUM_PINS; i++) {
> 		union kvm_ioapic_redirect_entry *ent = &ioapic->redirtbl[i];
> @@ -443,7 +444,8 @@ static void __kvm_ioapic_update_eoi(struct kvm_vcpu *vcpu,
> 		kvm_notify_acked_irq(ioapic->kvm, KVM_IRQCHIP_IOAPIC, i);
> 		spin_lock(&ioapic->lock);
> -		if (trigger_mode != IOAPIC_LEVEL_TRIG)
> +		if (trigger_mode != IOAPIC_LEVEL_TRIG ||
> +		    kvm_apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI)
> 			continue;

Don't you have to handle kvm_ioapic_eoi_inject_work as well?

> 		ASSERT(ent->fields.trig_mode == IOAPIC_LEVEL_TRIG);

This assert can now fail?

> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index bd4e34de24c7..4ee827d7bf36 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -833,8 +833,7 @@ int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2)
>
> static void kvm_ioapic_send_eoi(struct kvm_lapic *apic, int vector)
> {
> -	if (!(kvm_apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI) &&
> -	    kvm_ioapic_handles_vector(apic->vcpu->kvm, vector)) {
> +	if (kvm_ioapic_handles_vector(apic->vcpu->kvm, vector)) {
> 		int trigger_mode;
> 		if (apic_test_vector(vector, apic->regs + APIC_TMR))
> 			trigger_mode = IOAPIC_LEVEL_TRIG;
> --
> 2.3.3
Re: [Patch v5] x86: irq_comm: Add check for RH bit in kvm_set_msi_irq
On Fri, Mar 13, 2015 at 09:14:35AM -0600, James Sullivan wrote:
> This patch adds a check for RH=1 in kvm_set_msi_irq. Currently the DM
> bit is the only thing used to decide irq->dest_mode (logical when DM
> set, physical when unset). Documentation indicates that the DM bit will
> be 'ignored' when the RH bit is unset, and physical destination mode is
> used in this case.
>
> Fixed this to set irq->dest_mode to APIC_DEST_LOGICAL just in case both
> RH=1/DM=1.
>
> This patch doesn't completely handle RH=1; if RH=1 then the delivery
> will behave as in low priority mode (deliver the interrupt to only the
> lowest priority processor), but the delivery mode may still be used to
> specify the semantics of the delivery beyond its destination. I will be
> trying and comparing a few options to handle this fully (extension of
> struct kvm_lapic_irq, introduction of MSI specific delivery functions or
> helpers, etc) and hope to have some patches to show in the near future.
>
> Signed-off-by: James Sullivan sullivan.jame...@gmail.com

The documentation states the following:

* When RH is 0, the interrupt is directed to the processor listed in the
  Destination ID field.
* If RH is 0, then the DM bit is ignored and the message is sent ahead
  independent of whether the physical or logical destination mode is used.

However, from the POV of a device writing to memory to generate an MSI
interrupt, there is no (or I can't see any) other information that can be
used to infer logical or physical mode for the interrupt message.

Before your patch:

(dm, rh) = (0, 0) => irq->dest_mode = 0
(dm, rh) = (0, 1) => irq->dest_mode = 0
(dm, rh) = (1, 0) => irq->dest_mode = 1
(dm, rh) = (1, 1) => irq->dest_mode = 1

After your patch:

(dm, rh) = (0, 0) => irq->dest_mode = 0
(dm, rh) = (0, 1) => irq->dest_mode = 0
(dm, rh) = (1, 0) => irq->dest_mode = 0
(dm, rh) = (1, 1) => irq->dest_mode = 1

Am I missing some explicit documentation that refers to
(dm, rh) = (1, 0) => irq->dest_mode = 0 ?

See native_compose_msi_msg:

	msg->address_lo =
		MSI_ADDR_BASE_LO |
		((apic->irq_dest_mode == 0) ?
			MSI_ADDR_DEST_MODE_PHYSICAL :
			MSI_ADDR_DEST_MODE_LOGICAL) |
		((apic->irq_delivery_mode != dest_LowestPrio) ?
			MSI_ADDR_REDIRECTION_CPU :
			MSI_ADDR_REDIRECTION_LOWPRI) |
		MSI_ADDR_DEST_ID(dest);

So it does configure DM = MSI_ADDR_DEST_MODE_LOGICAL and
RH = MSI_ADDR_REDIRECTION_LOWPRI.
Re: XP machine freeze
On Mon, Mar 16, 2015 at 04:10:40PM +0100, Saso Slavicic wrote:
> Hi,
>
> I'm fairly experienced with KVM (CentOS 5/6), running about a dozen
> servers with 20-30 different (Linux & MS platform) systems. I have one
> Windows XP machine that acts very strangely - it freezes. I get a ping
> timeout for the VM from my monitoring and the machine spins 2 or 3 cores
> using all the CPU. Now the interesting thing that happens is that once
> you open the console, it suddenly starts working again. You can see the
> clock catching up as it was frozen in time and everything works normally
> once the timer catches up. It usually happens probably about once a
> month, although it happened yesterday and today again.
> This machine is on CentOS 6, qemu-kvm-0.12.1.2-2.448.el6_6, kernel
> 2.6.32-504.3.3.el6.x86_64.
>
> I was able to do some debugging when the machine was frozen, so I got
> some things to work with:
>
> # virsh qemu-monitor-command --hmp DBserver 'info cpus'
> * CPU #0: pc=0x80501fdd thread_id=32595
>   CPU #1: pc=0x806e7a9b thread_id=32596
>   CPU #2: pc=0xba2da162 (halted) thread_id=32597
>   CPU #3: pc=0xba2da162 (halted) thread_id=32598
>
> Now, in both yesterday's and today's event the CPU0 was stopped at
> 0x80501fdd. I've disassembled the function and got this:
>
> 0x80501fb5:  int3
> 0x80501fb6:  mov    %edi,%edi
> 0x80501fb8:  push   %ebp
> 0x80501fb9:  mov    %esp,%ebp
> 0x80501fbb:  push   %esi
> 0x80501fbc:  mov    %fs:0x20,%eax
> 0x80501fc2:  mov    0x8(%ebp),%ecx
> 0x80501fc5:  lea    -0x1(%ecx),%esi
> 0x80501fc8:  test   %esi,%ecx
> 0x80501fca:  lea    0x7ec(%eax),%edx
> 0x80501fd0:  pop    %esi
> 0x80501fd1:  je     0x80501fdd
> 0x80501fd3:  lea    0x7a0(%eax),%edx
> 0x80501fd9:  jmp    0x80501fdd
> *0x80501fdb: pause
> 0x80501fdd:  cmpl   $0x0,(%edx)
> 0x80501fe0:  jne    0x80501fdb
> 0x80501fe2:  pop    %ebp
> 0x80501fe3:  ret    $0x4
> 0x80501fe6:  int3
>
> mov %edi,%edi is clearly the start of some function. From what I've been
> able to understand, the code fetches the _KPRCB structure (%fs:0x20) and
> then does a spinlock between fdb and fe0, checking for PacketBarrier (?)
> in EDX (0xffdff8c0).
> Now, $pc always shows the fdd address; shouldn't it jump between fdb and
> fe0? It seems as if it was stuck at fdd.
>
> # virsh qemu-monitor-command --hmp DBserver 'info registers'
> EAX=ffdff120 EBX=c06ddf58 ECX=000e EDX=ffdff8c0 ESI=be6e3921 EDI=c06ddf60
> EBP=ba4ff708 ESP=ba4ff708 EIP=80501fdd EFL=0202 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =0023 00c0f300 DPL=3 DS [-WA]
> CS =0008 00c09b00 DPL=0 CS32 [-RA]
> SS =0010 00c09300 DPL=0 DS [-WA]
> DS =0023 00c0f300 DPL=3 DS [-WA]
> FS =0030 ffdff000 1fff 00c09300 DPL=0 DS [-WA]
> GS = 000f LDT= 000f
> TR =0028 80042000 20ab 8b00 DPL=0 TSS32-busy
> GDT= 8003f000 03ff
> IDT= 8003f400 07ff
> CR0=8001003b CR2=dbbec000 CR3=0b3c0020 CR4=06f8
> DR0= DR1= DR2= DR3= DR6=0ff0 DR7=0400
> FCW=027f FSW=0020 [ST=0] FTW=00 MXCSR=1fa0
> FPR0=8053632b003c1658 c048 FPR1=e1e0c048bf80f6ab 76f8
> FPR2=e1e0 0023 FPR3=0b017c30003c1658
> FPR4=003bba1a7604 1e64 FPR5=0007268c 003b
> FPR6=0202001b 2684 FPR7=e3e0a9b4e1b50de4 ca0b
> XMM00=00a1fc950020027f XMM01=1fa01c4c0001 XMM02=c0488053632b003c1658
> XMM03=76f8e1e0c048bf80f6ab XMM04=0023e1e0 XMM05=0b017c30003c1658
> XMM06=1e64003bba1a7604 XMM07=003b0007268c
>
> Clearly, the address in EDX is not 0:
>
> [root@linux ~]# virsh qemu-monitor-command --hmp DBserver 'x/1xb 0xFFDFF8C0'
> ffdff8c0: 0x0e
> [root@linux ~]# virt-manager
> [root@linux ~]# virsh qemu-monitor-command --hmp DBserver 'x/1xb 0xFFDFF8C0'
> ffdff8c0: 0x00
>
> However, as soon as the VM console is opened and the machine starts, the
> address in EDX is set to 0 and the loop is broken.
>
> Does anybody recognize what function that is? What could possibly happen
> that opening the console and moving the mouse a little unfreezes the
> machine?
> VM has .81 virtio drivers from the Fedora repo at the moment.

Generate a Windows dump?

https://support.microsoft.com/en-us/kb/254649
https://support.microsoft.com/en-us/kb/972110

"Step 7: Generate a complete crash dump file or a kernel crash dump file
by using an NMI on a Windows-based system" (you can inject NMIs via the
QEMU monitor).
> The configuration of the machine is pretty standard:
>
> <!--
> WARNING: THIS IS AN AUTO-GENERATED FILE. CHANGES TO IT ARE LIKELY TO BE
> OVERWRITTEN AND LOST. Changes to this xml configuration should be made using:
>   virsh edit
Re: [Patch v5] x86: irq_comm: Add check for RH bit in kvm_set_msi_irq
On Wed, Mar 18, 2015 at 06:59:22PM -0600, James Sullivan wrote:
> > The documentation states the following:
> >
> > * When RH is 0, the interrupt is directed to the processor listed in
> >   the Destination ID field.
> > * If RH is 0, then the DM bit is ignored and the message is sent ahead
> >   independent of whether the physical or logical destination mode is
> >   used.
> >
> > However, from the POV of a device writing to memory to generate an MSI
> > interrupt, there is no (or I can't see any) other information that can
> > be used to infer logical or physical mode for the interrupt message.
> >
> > Before your patch:
> >
> > (dm, rh) = (0, 0) => irq->dest_mode = 0
> > (dm, rh) = (0, 1) => irq->dest_mode = 0
> > (dm, rh) = (1, 0) => irq->dest_mode = 1
> > (dm, rh) = (1, 1) => irq->dest_mode = 1
> >
> > After your patch:
> >
> > (dm, rh) = (0, 0) => irq->dest_mode = 0
> > (dm, rh) = (0, 1) => irq->dest_mode = 0
> > (dm, rh) = (1, 0) => irq->dest_mode = 0
> > (dm, rh) = (1, 1) => irq->dest_mode = 1
> >
> > Am I missing some explicit documentation that refers to
> > (dm, rh) = (1, 0) => irq->dest_mode = 0 ?
>
> From the IA32 manual (Vol. 3, 10.11.2):
>
> * When RH is 0, the interrupt is directed to the processor listed in the
>   Destination ID field.
> * When RH is 1 and the physical destination mode is used, the
>   Destination ID field must not be set to FFH; it must point to a
>   processor that is present and enabled to receive the interrupt.
> * When RH is 1 and the logical destination mode is active in a system
>   using a flat addressing model, the Destination ID field must be set so
>   that bits set to 1 identify processors that are present and enabled to
>   receive the interrupt.
> * If RH is set to 1 and the logical destination mode is active in a
>   system using the cluster addressing model, then the Destination ID
>   field must not be set to FFH; the processors identified with this
>   field must be present and enabled to receive the interrupt.
>
> My interpretation of this is that RH=0 indicates that the Dest. ID field
> contains an APIC ID, and as such destination mode is physical.
> When RH=1, depending on the value of DM, we either use physical or
> logical dest mode. The result of this is that logical dest mode is set
> just when RH=1/DM=1, as far as I understand.
>
> > See native_compose_msi_msg:
> >
> > 	msg->address_lo =
> > 		MSI_ADDR_BASE_LO |
> > 		((apic->irq_dest_mode == 0) ?
> > 			MSI_ADDR_DEST_MODE_PHYSICAL :
> > 			MSI_ADDR_DEST_MODE_LOGICAL) |
> > 		((apic->irq_delivery_mode != dest_LowestPrio) ?
> > 			MSI_ADDR_REDIRECTION_CPU :
> > 			MSI_ADDR_REDIRECTION_LOWPRI) |
> > 		MSI_ADDR_DEST_ID(dest);
> >
> > So it does configure DM = MSI_ADDR_DEST_MODE_LOGICAL and
> > RH = MSI_ADDR_REDIRECTION_LOWPRI.
>
> ...and yet this is a good counterexample against my argument :)
>
> What I think I'll do is revert this particular change so that dest_mode
> is set independently of RH. While I'm not entirely convinced that this
> is the intended interpretation, I do think that consistency with the
> existing logic is probably desirable for the time being. If I can get
> closure on the matter I'll re-submit that change, but for the time being
> I will undo it.
>
> -James

Just write MSI-X table entries on real hardware (say: modify
native_compose_msi_msg or the MSI-X equivalent), with all RH/DM
combinations, and see what behaviour comes up?
Re: [Patch v5] x86: irq_comm: Add check for RH bit in kvm_set_msi_irq
On Wed, Mar 18, 2015 at 06:59:22PM -0600, James Sullivan wrote:
> > The documentation states the following:
> >
> > * When RH is 0, the interrupt is directed to the processor listed in
> >   the Destination ID field.
> > * If RH is 0, then the DM bit is ignored and the message is sent ahead
> >   independent of whether the physical or logical destination mode is
> >   used.
> >
> > However, from the POV of a device writing to memory to generate an MSI
> > interrupt, there is no (or I can't see any) other information that can
> > be used to infer logical or physical mode for the interrupt message.
> >
> > Before your patch:
> >
> > (dm, rh) = (0, 0) => irq->dest_mode = 0
> > (dm, rh) = (0, 1) => irq->dest_mode = 0
> > (dm, rh) = (1, 0) => irq->dest_mode = 1
> > (dm, rh) = (1, 1) => irq->dest_mode = 1
> >
> > After your patch:
> >
> > (dm, rh) = (0, 0) => irq->dest_mode = 0
> > (dm, rh) = (0, 1) => irq->dest_mode = 0
> > (dm, rh) = (1, 0) => irq->dest_mode = 0
> > (dm, rh) = (1, 1) => irq->dest_mode = 1
> >
> > Am I missing some explicit documentation that refers to
> > (dm, rh) = (1, 0) => irq->dest_mode = 0 ?
>
> From the IA32 manual (Vol. 3, 10.11.2):
>
> * When RH is 0, the interrupt is directed to the processor listed in the
>   Destination ID field.
> * When RH is 1 and the physical destination mode is used, the
>   Destination ID field must not be set to FFH; it must point to a
>   processor that is present and enabled to receive the interrupt.
> * When RH is 1 and the logical destination mode is active in a system
>   using a flat addressing model, the Destination ID field must be set so
>   that bits set to 1 identify processors that are present and enabled to
>   receive the interrupt.
> * If RH is set to 1 and the logical destination mode is active in a
>   system using the cluster addressing model, then the Destination ID
>   field must not be set to FFH; the processors identified with this
>   field must be present and enabled to receive the interrupt.
>
> My interpretation of this is that RH=0 indicates that the Dest. ID field
> contains an APIC ID, and as such destination mode is physical.
> When RH=1, depending on the value of DM, we either use physical or
> logical dest mode. The result of this is that logical dest mode is set
> just when RH=1/DM=1, as far as I understand.
>
> > See native_compose_msi_msg:
> >
> > 	msg->address_lo =
> > 		MSI_ADDR_BASE_LO |
> > 		((apic->irq_dest_mode == 0) ?
> > 			MSI_ADDR_DEST_MODE_PHYSICAL :
> > 			MSI_ADDR_DEST_MODE_LOGICAL) |
> > 		((apic->irq_delivery_mode != dest_LowestPrio) ?
> > 			MSI_ADDR_REDIRECTION_CPU :
> > 			MSI_ADDR_REDIRECTION_LOWPRI) |
> > 		MSI_ADDR_DEST_ID(dest);
> >
> > So it does configure DM = MSI_ADDR_DEST_MODE_LOGICAL and
> > RH = MSI_ADDR_REDIRECTION_LOWPRI.
>
> ...and yet this is a good counterexample against my argument :)
>
> What I think I'll do is revert this particular change so that dest_mode
> is set independently of RH. While I'm not entirely convinced that this
> is the intended interpretation,

Where would the logical/physical information come from, if not from the
DM bit?

> I do think that consistency with the existing logic is probably
> desirable for the time being. If I can get closure on the matter I'll
> re-submit that change, but for the time being I will undo it.
>
> -James
Re: [PATCH] KVM: Eliminate extra function calls in kvm_get_dirty_log_protect()
On Tue, Mar 17, 2015 at 09:58:21AM +0100, Paolo Bonzini wrote:
> On 17/03/2015 08:19, Takuya Yoshikawa wrote:
> > When all bits in mask are not set,
> > kvm_arch_mmu_enable_log_dirty_pt_masked() has nothing to do. But since
> > it needs to be called from the generic code, it cannot be inlined, and
> > a few function calls, two when PML is enabled, are wasted.
> >
> > Since it is common to see many pages remain clean, e.g. framebuffers
> > can stay calm for a long time, it is worth eliminating this overhead.
> >
> > Signed-off-by: Takuya Yoshikawa yoshikawa_takuya...@lab.ntt.co.jp
> > ---
> >  virt/kvm/kvm_main.c | 8 +++++---
> >  1 file changed, 5 insertions(+), 3 deletions(-)
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index a109370..420d8cf 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -1061,9 +1061,11 @@ int kvm_get_dirty_log_protect(struct kvm *kvm,
> > 		mask = xchg(&dirty_bitmap[i], 0);
> > 		dirty_bitmap_buffer[i] = mask;
> >
> > -		offset = i * BITS_PER_LONG;
> > -		kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset,
> > -							mask);
> > +		if (mask) {
> > +			offset = i * BITS_PER_LONG;
> > +			kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
> > +								offset, mask);
> > +		}
> > 	}
> >
> > 	spin_unlock(&kvm->mmu_lock);
>
> Good catch!
>
> Reviewed-by: Paolo Bonzini pbonz...@redhat.com

Applied, thanks.
Re: [PATCH] KVM: SVM: Fix confusing message if no exit handlers are installed
On Mon, Mar 16, 2015 at 05:18:25PM -0400, Bandan Das wrote:
> I hit this path on an AMD box and thought someone was playing an April
> Fool's joke on me.
>
> Signed-off-by: Bandan Das b...@redhat.com
> ---
>  arch/x86/kvm/svm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Applied, thanks.