Re: [PATCH v2 1/4] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

2016-01-04 Thread Marcelo Tosatti
On Sun, Dec 20, 2015 at 03:05:41AM -0800, Andy Lutomirski wrote:
> From: Andy Lutomirski 
> 
> The pvclock vdso code was too abstracted to understand easily and
> excessively paranoid.  Simplify it for a huge speedup.
> 
> This opens the door for additional simplifications, as the vdso no
> longer accesses the pvti for any vcpu other than vcpu 0.
> 
> Before, vclock_gettime using kvm-clock took about 45ns on my machine.
> With this change, it takes 29ns, which is almost as fast as the pure TSC
> implementation.
> 
> Reviewed-by: Paolo Bonzini 
> Signed-off-by: Andy Lutomirski 
> ---
>  arch/x86/entry/vdso/vclock_gettime.c | 81 
> 
>  1 file changed, 46 insertions(+), 35 deletions(-)
> 
> diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
> b/arch/x86/entry/vdso/vclock_gettime.c
> index ca94fa649251..c325ba1bdddf 100644
> --- a/arch/x86/entry/vdso/vclock_gettime.c
> +++ b/arch/x86/entry/vdso/vclock_gettime.c
> @@ -78,47 +78,58 @@ static notrace const struct pvclock_vsyscall_time_info 
> *get_pvti(int cpu)
>  
>  static notrace cycle_t vread_pvclock(int *mode)
>  {
> - const struct pvclock_vsyscall_time_info *pvti;
> + const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
>   cycle_t ret;
> - u64 last;
> - u32 version;
> - u8 flags;
> - unsigned cpu, cpu1;
> -
> + u64 tsc, pvti_tsc;
> + u64 last, delta, pvti_system_time;
> + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
>  
>   /*
> -  * Note: hypervisor must guarantee that:
> -  * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
> -  * 2. that per-CPU pvclock time info is updated if the
> -  *underlying CPU changes.
> -  * 3. that version is increased whenever underlying CPU
> -  *changes.
> +  * Note: The kernel and hypervisor must guarantee that cpu ID
> +  * number maps 1:1 to per-CPU pvclock time info.
> +  *
> +  * Because the hypervisor is entirely unaware of guest userspace
> +  * preemption, it cannot guarantee that per-CPU pvclock time
> +  * info is updated if the underlying CPU changes or that that
> +  * version is increased whenever underlying CPU changes.
>*
> +  * On KVM, we are guaranteed that pvti updates for any vCPU are
> +  * atomic as seen by *all* vCPUs.  This is an even stronger
> +  * guarantee than we get with a normal seqlock.
> +  *
> +  * On Xen, we don't appear to have that guarantee, but Xen still
> +  * supplies a valid seqlock using the version field.
> +
> +  * We only do pvclock vdso timing at all if
> +  * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
> +  * mean that all vCPUs have matching pvti and that the TSC is
> +  * synced, so we can just look at vCPU 0's pvti.
>*/
> - do {
> - cpu = __getcpu() & VGETCPU_CPU_MASK;
> - /* TODO: We can put vcpu id into higher bits of pvti.version.
> -  * This will save a couple of cycles by getting rid of
> -  * __getcpu() calls (Gleb).
> -  */
> -
> - pvti = get_pvti(cpu);
> -
> - version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
> -
> - /*
> -  * Test we're still on the cpu as well as the version.
> -  * We could have been migrated just after the first
> -  * vgetcpu but before fetching the version, so we
> -  * wouldn't notice a version change.
> -  */
> - cpu1 = __getcpu() & VGETCPU_CPU_MASK;
> - } while (unlikely(cpu != cpu1 ||
> -   (pvti->pvti.version & 1) ||
> -   pvti->pvti.version != version));
> -
> - if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
> +
> + if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
>   *mode = VCLOCK_NONE;
> + return 0;
> + }
> +
> + do {
> + version = pvti->version;
> +
> + /* This is also a read barrier, so we'll read version first. */
> + tsc = rdtsc_ordered();
> +
> + pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
> + pvti_tsc_shift = pvti->tsc_shift;
> + pvti_system_time = pvti->system_time;
> + pvti_tsc = pvti->tsc_timestamp;
> +
> + /* Make sure that the version double-check is last. */
> + smp_rmb();
> + } while (unlikely((version & 1) || version != pvti->version));

Andy,

What happens if PVCLOCK_TSC_STABLE_BIT is disabled here?

> +
> + delta = tsc - pvti_tsc;
> + ret = pvti_system_time +
> + pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
> + pvti_tsc_shift);
>  
>   /* refer to tsc.c read_tsc() comment for rationale */
>   last = gtod->cycle_last;
> -- 
> 2.5.0
> 

Re: [PATCH v2 1/4] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

2016-01-04 Thread Marcelo Tosatti
On Mon, Jan 04, 2016 at 02:33:12PM -0800, Andy Lutomirski wrote:
> On Mon, Jan 4, 2016 at 12:26 PM, Marcelo Tosatti <mtosa...@redhat.com> wrote:
> > On Sun, Dec 20, 2015 at 03:05:41AM -0800, Andy Lutomirski wrote:
> >> From: Andy Lutomirski <l...@amacapital.net>
> >>
> >> The pvclock vdso code was too abstracted to understand easily and
> >> excessively paranoid.  Simplify it for a huge speedup.
> >>
> >> This opens the door for additional simplifications, as the vdso no
> >> longer accesses the pvti for any vcpu other than vcpu 0.
> >>
> >> Before, vclock_gettime using kvm-clock took about 45ns on my machine.
> >> With this change, it takes 29ns, which is almost as fast as the pure TSC
> >> implementation.
> >>
> >> Reviewed-by: Paolo Bonzini <pbonz...@redhat.com>
> >> Signed-off-by: Andy Lutomirski <l...@amacapital.net>
> >> ---
> >>  arch/x86/entry/vdso/vclock_gettime.c | 81 
> >> 
> >>  1 file changed, 46 insertions(+), 35 deletions(-)
> >>
> >> diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
> >> b/arch/x86/entry/vdso/vclock_gettime.c
> >> index ca94fa649251..c325ba1bdddf 100644
> >> --- a/arch/x86/entry/vdso/vclock_gettime.c
> >> +++ b/arch/x86/entry/vdso/vclock_gettime.c
> >> @@ -78,47 +78,58 @@ static notrace const struct pvclock_vsyscall_time_info 
> >> *get_pvti(int cpu)
> >>
> >>  static notrace cycle_t vread_pvclock(int *mode)
> >>  {
> >> - const struct pvclock_vsyscall_time_info *pvti;
> >> + const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
> >>   cycle_t ret;
> >> - u64 last;
> >> - u32 version;
> >> - u8 flags;
> >> - unsigned cpu, cpu1;
> >> -
> >> + u64 tsc, pvti_tsc;
> >> + u64 last, delta, pvti_system_time;
> >> + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
> >>
> >>   /*
> >> -  * Note: hypervisor must guarantee that:
> >> -  * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
> >> -  * 2. that per-CPU pvclock time info is updated if the
> >> -  *underlying CPU changes.
> >> -  * 3. that version is increased whenever underlying CPU
> >> -  *changes.
> >> +  * Note: The kernel and hypervisor must guarantee that cpu ID
> >> +  * number maps 1:1 to per-CPU pvclock time info.
> >> +  *
> >> +  * Because the hypervisor is entirely unaware of guest userspace
> >> +  * preemption, it cannot guarantee that per-CPU pvclock time
> >> +  * info is updated if the underlying CPU changes or that that
> >> +  * version is increased whenever underlying CPU changes.
> >>*
> >> +  * On KVM, we are guaranteed that pvti updates for any vCPU are
> >> +  * atomic as seen by *all* vCPUs.  This is an even stronger
> >> +  * guarantee than we get with a normal seqlock.
> >> +  *
> >> +  * On Xen, we don't appear to have that guarantee, but Xen still
> >> +  * supplies a valid seqlock using the version field.
> >> +
> >> +  * We only do pvclock vdso timing at all if
> >> +  * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
> >> +  * mean that all vCPUs have matching pvti and that the TSC is
> >> +  * synced, so we can just look at vCPU 0's pvti.
> >>*/
> >> - do {
> >> - cpu = __getcpu() & VGETCPU_CPU_MASK;
> >> - /* TODO: We can put vcpu id into higher bits of pvti.version.
> >> -  * This will save a couple of cycles by getting rid of
> >> -  * __getcpu() calls (Gleb).
> >> -  */
> >> -
> >> - pvti = get_pvti(cpu);
> >> -
> >> - version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
> >> -
> >> - /*
> >> -  * Test we're still on the cpu as well as the version.
> >> -  * We could have been migrated just after the first
> >> -  * vgetcpu but before fetching the version, so we
> >> -  * wouldn't notice a version change.
> >> -  */
> >> - cpu1 = __getcpu() & VGETCPU_CPU_MASK;
> >> - } while (unlikely(cpu != cpu1 ||
> >> -   (pvti->pvti.version & 1) ||
> >> -   pvti->pvti.version != version));

Re: kvmclock doesn't work, help?

2015-12-23 Thread Marcelo Tosatti
On Mon, Dec 21, 2015 at 02:49:25PM -0800, Andy Lutomirski wrote:
> On Fri, Dec 18, 2015 at 1:49 PM, Marcelo Tosatti <mtosa...@redhat.com> wrote:
> > On Fri, Dec 18, 2015 at 12:25:11PM -0800, Andy Lutomirski wrote:
> >> [cc: John Stultz -- maybe you have ideas on how this should best
> >> integrate with the core code]
> >>
> >> On Fri, Dec 18, 2015 at 11:45 AM, Marcelo Tosatti <mtosa...@redhat.com> 
> >> wrote:
> 
> >> > Can you write an actual proposal (with details) that accomodates the
> >> > issue described at "Assuming a stable TSC across physical CPUS, and a
> >> > stable TSC" ?
> >> >
> >> > Yes it would be nicer, the IPIs (to stop the vcpus) are problematic for
> >> > realtime guests.
> >>
> >> This shouldn't require many details, and I don't think there's an ABI
> >> change.  The rules are:
> >>
> >> When the overall system timebase changes (e.g. when the selected
> >> clocksource changes or when update_pvclock_gtod is called), the KVM
> >> host would:
> >>
> >> optionally: preempt_disable();  /* for performance */
> >>
> >> for all vms {
> >>
> >>   for all registered pvti structures {
> >> pvti->version++;  /* should be odd now */
> >>   }
> >
> > pvti is userspace data, so you have to pin it first?
> 
> Yes.
> 
> Fortunately, most systems probably only have one page of pvti
> structures, I think (unless there are a ton of vcpus), so the
> performance impact should be negligible.
> 
> >
> >>   /* Note: right now, any vcpu that tries to access pvti will start
> >> infinite looping.  We should add cpu_relax() to the guests. */
> >>
> >>   for all registered pvti structures {
> >> update everything except pvti->version;
> >>   }
> >>
> >>   for all registered pvti structures {
> >> pvti->version++;  /* should be even now */
> >>   }
> >>
> >>   cond_resched();
> >> }
> >>
> >> Is this enough detail?  This should work with all existing guests,
> >> too, unless there's a buggy guest out there that actually fails to
> >> double-check version.
> >
> > What is the advantage of this over the brute force method, given
> > that guests will busy spin?
> >
> > (busy spin is equally problematic as IPI for realtime guests).
> 
> I disagree.  It's never been safe to call clock_gettime from an RT
> task and expect a guarantee of real-time performance.  We could fix
> that, but it's not even safe on non-KVM.

The problem is how long the IPI (or the busy spinning, in the case of
the version scheme above) interrupts the vcpu.

> Sending an IPI *always* stalls the task.  Taking a lock (which is
> effectively what this is doing) only stalls the tasks that contend for
> the lock, which, most of the time, means that nothing stalls.
> 
> Also, if the host disables preemption or otherwise boosts its priority
> while version is odd, then the actual stall will be very short, in
> contrast to an IPI-induced stall, which will be much, much longer.
> 
> --Andy

1) The updates are rare.
2) There are no user complaints about the IPI mechanism.

I don't see a reason to change this.

For the suspend issue, though, there are complaints (guests on 
laptops which fail to use masterclock). 
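
For reference, a minimal user-space sketch of the publish protocol Andy
describes earlier in this message (hypothetical struct and function
names; the real pvti layout and KVM update path differ):

#include <stdint.h>

/* Hypothetical pvti-like layout; field names follow the thread, not the
 * actual pvclock ABI. */
struct pvti {
	uint32_t version;		/* odd => update in progress */
	uint64_t tsc_timestamp;
	uint64_t system_time;
	uint32_t tsc_to_system_mul;
	int8_t   tsc_shift;
};

/* Host side: bump version to odd, update the payload, bump back to even. */
static void pvti_update(struct pvti *p, uint64_t tsc, uint64_t ns,
			uint32_t mul, int8_t shift)
{
	p->version++;					/* now odd */
	__atomic_thread_fence(__ATOMIC_RELEASE);	/* like smp_wmb() */
	p->tsc_timestamp = tsc;
	p->system_time = ns;
	p->tsc_to_system_mul = mul;
	p->tsc_shift = shift;
	__atomic_thread_fence(__ATOMIC_RELEASE);	/* like smp_wmb() */
	p->version++;					/* even again */
}

/* Guest side: seqlock-style retry loop, mirroring the vdso patch above. */
static struct pvti pvti_read(const struct pvti *p)
{
	struct pvti snap;
	uint32_t v;

	do {
		v = p->version;
		__atomic_thread_fence(__ATOMIC_ACQUIRE);  /* like smp_rmb() */
		snap = *p;
		__atomic_thread_fence(__ATOMIC_ACQUIRE);  /* re-check last */
	} while ((v & 1) || v != p->version);

	return snap;
}

As Andy notes, a reader only spins while version is odd, i.e. while an
update is actually in flight, whereas an IPI stalls the vcpu
unconditionally.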



Re: kvmclock doesn't work, help?

2015-12-21 Thread Marcelo Tosatti
On Fri, Dec 18, 2015 at 12:25:11PM -0800, Andy Lutomirski wrote:
> [cc: John Stultz -- maybe you have ideas on how this should best
> integrate with the core code]
> 
> On Fri, Dec 18, 2015 at 11:45 AM, Marcelo Tosatti <mtosa...@redhat.com> wrote:
> > On Fri, Dec 18, 2015 at 11:27:13AM -0800, Andy Lutomirski wrote:
> >> On Fri, Dec 18, 2015 at 3:47 AM, Marcelo Tosatti <mtosa...@redhat.com> 
> >> wrote:
> >> > On Thu, Dec 17, 2015 at 05:12:59PM -0800, Andy Lutomirski wrote:
> >> >> On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti <mtosa...@redhat.com> 
> >> >> wrote:
> >> >> > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote:
> >> >> >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti 
> >> >> >> <mtosa...@redhat.com> wrote:
> >> >> >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote:
> >> >> >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski 
> >> >> >> >> <l...@amacapital.net> wrote:
> >> >> >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini 
> >> >> >> >> > <pbonz...@redhat.com> wrote:
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote:
> >> >> >> >> >>> > RAW TSC NTP corrected TSC
> >> >> >> >> >>> > t0  10  10
> >> >> >> >> >>> > t1  20  19.99
> >> >> >> >> >>> > t2  30  29.98
> >> >> >> >> >>> > t3  40  39.97
> >> >> >> >> >>> > t4  50  49.96
> >> >> >
> >> >> > (1)
> >> >> >
> >> >> >> >> >>> >
> >> >> >> >> >>> > ...
> >> >> >> >> >>> >
> >> >> >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
> >> >> >> >> >>> > you can see what will happen.
> >> >> >> >> >>>
> >> >> >> >> >>> Sure, but why would you ever switch from one to the other?
> >> >> >> >> >>
> >> >> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend.  
> >> >> >> >> >> After
> >> >> >> >> >> resume, the TSC certainly increases at the same rate as 
> >> >> >> >> >> before, but the
> >> >> >> >> >> raw TSC restarted counting from 0 and systemtime has increased 
> >> >> >> >> >> slower
> >> >> >> >> >> than the guest kvmclock.
> >> >> >> >> >
> >> >> >> >> > Wait, are we talking about the host's NTP or the guest's NTP?
> >> >> >> >> >
> >> >> >> >> > If it's the host's, then wouldn't systemtime be reset after 
> >> >> >> >> > resume to
> >> >> >> >> > the NTP corrected value?  If so, the guest wouldn't see time go
> >> >> >> >> > backwards.
> >> >> >> >> >
> >> >> >> >> > If it's the guest's, then the guest's NTP correction is applied 
> >> >> >> >> > on top
> >> >> >> >> > of kvmclock, and this shouldn't matter.
> >> >> >> >> >
> >> >> >> >> > I still feel like I'm missing something very basic here.
> >> >> >> >> >
> >> >> >> >>
> >> >> >> >> OK, I think I get it.
> >> >> >> >>
> >> >> >> >> Marcelo, I thought that kvmclock was supposed to propagate the 
> >> >> >> >> host's
> >> >> >> >> correction to the guest.  If it did, indeed, propagate the 
> >> >> >> >> correction
> >> >> >> >> then, after resume, the host's new system_time would ma

Re: kvmclock doesn't work, help?

2015-12-18 Thread Marcelo Tosatti
On Fri, Dec 18, 2015 at 11:27:13AM -0800, Andy Lutomirski wrote:
> On Fri, Dec 18, 2015 at 3:47 AM, Marcelo Tosatti <mtosa...@redhat.com> wrote:
> > On Thu, Dec 17, 2015 at 05:12:59PM -0800, Andy Lutomirski wrote:
> >> On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti <mtosa...@redhat.com> 
> >> wrote:
> >> > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote:
> >> >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti <mtosa...@redhat.com> 
> >> >> wrote:
> >> >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote:
> >> >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski 
> >> >> >> <l...@amacapital.net> wrote:
> >> >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini 
> >> >> >> > <pbonz...@redhat.com> wrote:
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote:
> >> >> >> >>> > RAW TSC NTP corrected TSC
> >> >> >> >>> > t0  10  10
> >> >> >> >>> > t1  20  19.99
> >> >> >> >>> > t2  30  29.98
> >> >> >> >>> > t3  40  39.97
> >> >> >> >>> > t4  50  49.96
> >> >
> >> > (1)
> >> >
> >> >> >> >>> >
> >> >> >> >>> > ...
> >> >> >> >>> >
> >> >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
> >> >> >> >>> > you can see what will happen.
> >> >> >> >>>
> >> >> >> >>> Sure, but why would you ever switch from one to the other?
> >> >> >> >>
> >> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend.  
> >> >> >> >> After
> >> >> >> >> resume, the TSC certainly increases at the same rate as before, 
> >> >> >> >> but the
> >> >> >> >> raw TSC restarted counting from 0 and systemtime has increased 
> >> >> >> >> slower
> >> >> >> >> than the guest kvmclock.
> >> >> >> >
> >> >> >> > Wait, are we talking about the host's NTP or the guest's NTP?
> >> >> >> >
> >> >> >> > If it's the host's, then wouldn't systemtime be reset after resume 
> >> >> >> > to
> >> >> >> > the NTP corrected value?  If so, the guest wouldn't see time go
> >> >> >> > backwards.
> >> >> >> >
> >> >> >> > If it's the guest's, then the guest's NTP correction is applied on 
> >> >> >> > top
> >> >> >> > of kvmclock, and this shouldn't matter.
> >> >> >> >
> >> >> >> > I still feel like I'm missing something very basic here.
> >> >> >> >
> >> >> >>
> >> >> >> OK, I think I get it.
> >> >> >>
> >> >> >> Marcelo, I thought that kvmclock was supposed to propagate the host's
> >> >> >> correction to the guest.  If it did, indeed, propagate the correction
> >> >> >> then, after resume, the host's new system_time would match the 
> >> >> >> guest's
> >> >> >> idea of it (after accounting for the guest's long nap), and I don't
> >> >> >> think there would be a problem.
> >> >> >> That being said, I can't find the code in the masterclock stuff that
> >> >> >> would actually do this.
> >> >> >
> >> >> > Guest clock is maintained by guest timekeeping code, which does:
> >> >> >
> >> >> > timer_interrupt()
> >> >> > offset = read clocksource since last timer interrupt
> >> >> > accumulate_to_systemclock(offset)
> >> >> >
> >> >> > The frequency correction of NTP in the host can be applied to
> >> >> > kvmclock, which will be visible to the 

Re: kvmclock doesn't work, help?

2015-12-18 Thread Marcelo Tosatti
On Thu, Dec 17, 2015 at 05:12:59PM -0800, Andy Lutomirski wrote:
> On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti <mtosa...@redhat.com> wrote:
> > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote:
> >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti <mtosa...@redhat.com> 
> >> wrote:
> >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote:
> >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski <l...@amacapital.net> 
> >> >> wrote:
> >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini <pbonz...@redhat.com> 
> >> >> > wrote:
> >> >> >>
> >> >> >>
> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote:
> >> >> >>> > RAW TSC NTP corrected TSC
> >> >> >>> > t0  10  10
> >> >> >>> > t1  20  19.99
> >> >> >>> > t2  30  29.98
> >> >> >>> > t3  40  39.97
> >> >> >>> > t4  50  49.96
> >
> > (1)
> >
> >> >> >>> >
> >> >> >>> > ...
> >> >> >>> >
> >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
> >> >> >>> > you can see what will happen.
> >> >> >>>
> >> >> >>> Sure, but why would you ever switch from one to the other?
> >> >> >>
> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend.  After
> >> >> >> resume, the TSC certainly increases at the same rate as before, but 
> >> >> >> the
> >> >> >> raw TSC restarted counting from 0 and systemtime has increased slower
> >> >> >> than the guest kvmclock.
> >> >> >
> >> >> > Wait, are we talking about the host's NTP or the guest's NTP?
> >> >> >
> >> >> > If it's the host's, then wouldn't systemtime be reset after resume to
> >> >> > the NTP corrected value?  If so, the guest wouldn't see time go
> >> >> > backwards.
> >> >> >
> >> >> > If it's the guest's, then the guest's NTP correction is applied on top
> >> >> > of kvmclock, and this shouldn't matter.
> >> >> >
> >> >> > I still feel like I'm missing something very basic here.
> >> >> >
> >> >>
> >> >> OK, I think I get it.
> >> >>
> >> >> Marcelo, I thought that kvmclock was supposed to propagate the host's
> >> >> correction to the guest.  If it did, indeed, propagate the correction
> >> >> then, after resume, the host's new system_time would match the guest's
> >> >> idea of it (after accounting for the guest's long nap), and I don't
> >> >> think there would be a problem.
> >> >> That being said, I can't find the code in the masterclock stuff that
> >> >> would actually do this.
> >> >
> >> > Guest clock is maintained by guest timekeeping code, which does:
> >> >
> >> > timer_interrupt()
> >> > offset = read clocksource since last timer interrupt
> >> > accumulate_to_systemclock(offset)
> >> >
> >> > The frequency correction of NTP in the host can be applied to
> >> > kvmclock, which will be visible to the guest
> >> > at "read clocksource since last timer interrupt"
> >> > (kvmclock_clocksource_read function).
> >>
> >> pvclock_clocksource_read?  That seems to do the same thing as all the
> >> other clocksource access functions.
> >>
> >> >
> >> > This does not mean that the NTP correction in the host is propagated
> >> > to the guests system clock directly.
> >> >
> >> > (For example, the guest can run NTP which is free to do further
> >> > adjustments at "accumulate_to_systemclock(offset)" time).
> >>
> >> Of course.  But I expected that, in the absence of NTP on the guest,
> >> that the guest would track the host's *corrected* time.
> >>
> >> >
> >> >> If, on the other hand, the host's NTP correction is not supposed to
> >> propagate to the guest,

Re: kvmclock doesn't work, help?

2015-12-17 Thread Marcelo Tosatti
On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote:
> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski  wrote:
> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini  wrote:
> >>
> >>
> >> On 14/12/2015 23:31, Andy Lutomirski wrote:
> >>> > RAW TSC NTP corrected TSC
> >>> > t0  10  10
> >>> > t1  20  19.99
> >>> > t2  30  29.98
> >>> > t3  40  39.97
> >>> > t4  50  49.96
> >>> >
> >>> > ...
> >>> >
> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
> >>> > you can see what will happen.
> >>>
> >>> Sure, but why would you ever switch from one to the other?
> >>
> >> The guest uses the raw TSC and systemtime = 0 until suspend.  After
> >> resume, the TSC certainly increases at the same rate as before, but the
> >> raw TSC restarted counting from 0 and systemtime has increased slower
> >> than the guest kvmclock.
> >
> > Wait, are we talking about the host's NTP or the guest's NTP?
> >
> > If it's the host's, then wouldn't systemtime be reset after resume to
> > the NTP corrected value?  If so, the guest wouldn't see time go
> > backwards.
> >
> > If it's the guest's, then the guest's NTP correction is applied on top
> > of kvmclock, and this shouldn't matter.
> >
> > I still feel like I'm missing something very basic here.
> >
> 
> OK, I think I get it.
> 
> Marcelo, I thought that kvmclock was supposed to propagate the host's
> correction to the guest.  If it did, indeed, propagate the correction
> then, after resume, the host's new system_time would match the guest's
> idea of it (after accounting for the guest's long nap), and I don't
> think there would be a problem.
> That being said, I can't find the code in the masterclock stuff that
> would actually do this.

Guest clock is maintained by guest timekeeping code, which does:

timer_interrupt() 
offset = read clocksource since last timer interrupt
accumulate_to_systemclock(offset)

The frequency correction of NTP in the host can be applied to 
kvmclock, which will be visible to the guest 
at "read clocksource since last timer interrupt" 
(kvmclock_clocksource_read function).

This does not mean that the NTP correction in the host is propagated
to the guests system clock directly.

(For example, the guest can run NTP which is free to do further
adjustments at "accumulate_to_systemclock(offset)" time).

> If, on the other hand, the host's NTP correction is not supposed to
> propagate to the guest, 

This is optional. There is a module option to control this, in fact.

It's nice to have, because then you can run a guest without NTP
(say without network connection), and have a kvmclock (kvmclock is a
clocksource, not a guest system clock) which is NTP corrected.

> then shouldn't KVM just update system_time on
> resume to whatever the guest would think it had (which I think would
> be equivalent to the host's CLOCK_MONOTONIC_RAW value, possibly
> shifted by some per-guest constant offset).
> 
> --Andy

Sure, you could add a correction to compensate and make sure 
the guest clock does not see time backwards.
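
In terms of the pvclock formula, such a correction would amount to
choosing the new base so the clock reads continuously across the
update, roughly:

	new system_time = last clock value read before suspend
			  - scale(tsc_now - new tsc_timestamp)

so the first read after resume picks up exactly where the guest left
off. (A sketch of the idea only, not what KVM currently does.)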



Re: kvmclock doesn't work, help?

2015-12-17 Thread Marcelo Tosatti
On Mon, Dec 14, 2015 at 02:31:10PM -0800, Andy Lutomirski wrote:
> On Mon, Dec 14, 2015 at 2:00 PM, Marcelo Tosatti <mtosa...@redhat.com> wrote:
> > On Mon, Dec 14, 2015 at 02:44:15PM +0100, Paolo Bonzini wrote:
> >>
> >>
> >> On 11/12/2015 22:57, Andy Lutomirski wrote:
> >> > I'm still not seeing the issue.
> >> >
> >> > The formula is:
> >> >
> >> > (((rdtsc - pvti->tsc_timestamp) * pvti->tsc_to_system_mul) >>
> >> > pvti->tsc_shift) + pvti->system_time
> >> >
> >> > Obviously, if you reset pvti->tsc_timestamp to the current tsc value
> >> > after suspend/resume, you would also need to update system_time.
> >> >
> >> > I don't see what this has to do with suspend/resume or with whether
> >> > the effective scale factor is greater than or less than one.  The only
> >> > suspend/resume interaction I can see is that, if the host allows the
> >> > guest-observed TSC value to jump (which is arguably a bug, but that's
> >> > not important here), it needs to update pvti before resuming the
> >> > guest.
> >>
> >> Which is not an issue, since freezing obviously gets all CPUs out of
> >> guest mode.
> >>
> >> Marcelo, can you provide an example with made-up values for tsc and pvti?
> >
> > I meant "systemtime" at ^.
> >
> > guest visible clock = systemtime (updated at time 0, guest initialization) 
> > + scaled tsc reads=LARGE VALUE.
> >   ^^
> > guest reads clock to memory at location A = scaled tsc read.
> > -> suspend resume event
> > guest visible clock = systemtime (updated at time AFTER SUSPEND) + scaled 
> > tsc reads=0.
> > guest reads clock to memory at location B.
> >
> > So before the suspend/resume event, the clock is the RAW TSC values
> > (scaled by kvmclock, but the frequency of the RAW TSC).
> >
> > After suspend/resume event, the clock is updated from the host
> > via get_kernel_ns(), which reads the corrected NTP frequency TSC.
> >
> > So you switch the timebase, from a clock running at a given frequency,
> > to a clock running at another frequency (effective frequency).
> >
> > Example:
> >
> > RAW TSC NTP corrected TSC
> > t0  10  10
> > t1  20  19.99
> > t2  30  29.98
> > t3  40  39.97
> > t4  50  49.96
> >
> > ...
> >
> > if you suddenly switch from RAW TSC to NTP corrected TSC,
> > you can see what will happen.
> 
> Sure, but why would you ever switch from one to the other? 

Because that's what happens when you ask kvmclock to update from system
time (which is a reliable clock, resistant to suspend/resume issues).

>  I'm still not seeing the scenario under which this discontinuity is
> visible to anything other than the kvmclock code itself.

Host userspace can see it if it uses the TSC and clock_gettime()
and expects them to run hand in hand.

> The only things that need to be monotonic are the output from
> vread_pvclock and the in-kernel equivalent, I think.
> 
> --Andy

clock_gettime should be monotonic as well.
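
For context, the "refer to tsc.c read_tsc() comment" line in the patch
points at exactly this kind of clamp: a reading that lands behind
gtod->cycle_last is replaced by cycle_last, so callers never observe
time moving backwards. A sketch:

#include <stdint.h>

/* Never return a value behind the timekeeping core's last recorded
 * reading; clamping to cycle_last keeps callers monotonic. */
static uint64_t clamp_cycles(uint64_t ret, uint64_t cycle_last)
{
	return ret >= cycle_last ? ret : cycle_last;
}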



Re: kvmclock doesn't work, help?

2015-12-17 Thread Marcelo Tosatti
On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote:
> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti <mtosa...@redhat.com> wrote:
> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote:
> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski <l...@amacapital.net> 
> >> wrote:
> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini <pbonz...@redhat.com> 
> >> > wrote:
> >> >>
> >> >>
> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote:
> >> >>> > RAW TSC NTP corrected TSC
> >> >>> > t0  10  10
> >> >>> > t1  20  19.99
> >> >>> > t2  30  29.98
> >> >>> > t3  40  39.97
> >> >>> > t4  50  49.96

(1)

> >> >>> >
> >> >>> > ...
> >> >>> >
> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
> >> >>> > you can see what will happen.
> >> >>>
> >> >>> Sure, but why would you ever switch from one to the other?
> >> >>
> >> >> The guest uses the raw TSC and systemtime = 0 until suspend.  After
> >> >> resume, the TSC certainly increases at the same rate as before, but the
> >> >> raw TSC restarted counting from 0 and systemtime has increased slower
> >> >> than the guest kvmclock.
> >> >
> >> > Wait, are we talking about the host's NTP or the guest's NTP?
> >> >
> >> > If it's the host's, then wouldn't systemtime be reset after resume to
> >> > the NTP corrected value?  If so, the guest wouldn't see time go
> >> > backwards.
> >> >
> >> > If it's the guest's, then the guest's NTP correction is applied on top
> >> > of kvmclock, and this shouldn't matter.
> >> >
> >> > I still feel like I'm missing something very basic here.
> >> >
> >>
> >> OK, I think I get it.
> >>
> >> Marcelo, I thought that kvmclock was supposed to propagate the host's
> >> correction to the guest.  If it did, indeed, propagate the correction
> >> then, after resume, the host's new system_time would match the guest's
> >> idea of it (after accounting for the guest's long nap), and I don't
> >> think there would be a problem.
> >> That being said, I can't find the code in the masterclock stuff that
> >> would actually do this.
> >
> > Guest clock is maintained by guest timekeeping code, which does:
> >
> > timer_interrupt()
> > offset = read clocksource since last timer interrupt
> > accumulate_to_systemclock(offset)
> >
> > The frequency correction of NTP in the host can be applied to
> > kvmclock, which will be visible to the guest
> > at "read clocksource since last timer interrupt"
> > (kvmclock_clocksource_read function).
> 
> pvclock_clocksource_read?  That seems to do the same thing as all the
> other clocksource access functions.
> 
> >
> > This does not mean that the NTP correction in the host is propagated
> > to the guests system clock directly.
> >
> > (For example, the guest can run NTP which is free to do further
> > adjustments at "accumulate_to_systemclock(offset)" time).
> 
> Of course.  But I expected that, in the absence of NTP on the guest,
> that the guest would track the host's *corrected* time.
> 
> >
> >> If, on the other hand, the host's NTP correction is not supposed to
> >> propagate to the guest,
> >
> > This is optional. There is a module option to control this, in fact.
> >
> > Its nice to have, because then you can execute a guest without NTP
> > (say without network connection), and have a kvmclock (kvmclock is a
> > clocksource, not a guest system clock) which is NTP corrected.
> 
> Can you point to how this works?  I found kvm_guest_time_update, whch
> is called under circumstances that I haven't untangled.  I can't
> really tell what it's trying to do.

Documentation/virtual/kvm/timekeeping.txt.

> In any case, this still seems much more convoluted than it has to be.
> In the case in which the host has a stable TSC (tsc is selected in the
> core timekeeping code, VCLOCK_TSC is set, etc), which is basically all
> the time on the last few generations of CPUs, then the core
> timekeeping code is already exposing

Re: kvmclock doesn't work, help?

2015-12-14 Thread Marcelo Tosatti
On Fri, Dec 11, 2015 at 01:57:23PM -0800, Andy Lutomirski wrote:
> On Thu, Dec 10, 2015 at 1:32 PM, Marcelo Tosatti <mtosa...@redhat.com> wrote:
> > On Wed, Dec 09, 2015 at 01:10:59PM -0800, Andy Lutomirski wrote:
> >> I'm trying to clean up kvmclock and I can't get it to work at all.  My
> >> host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC.
> >>
> >> If I boot an SMP (2 vcpus) guest, tracing says:
> >>
> >>  qemu-system-x86-2517  [001] 102242.610654: kvm_update_master_clock:
> >> masterclock 0 hostclock tsc offsetmatched 0
> >>  qemu-system-x86-2521  [000] 102242.613742: kvm_track_tsc:
> >> vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc
> >>  qemu-system-x86-2522  [000] 102242.622959: kvm_track_tsc:
> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >>  qemu-system-x86-2521  [000] 102242.645123: kvm_track_tsc:
> >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >>  qemu-system-x86-2522  [000] 102242.647291: kvm_track_tsc:
> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >>  qemu-system-x86-2521  [000] 102242.653369: kvm_track_tsc:
> >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >>  qemu-system-x86-2522  [000] 102242.653429: kvm_track_tsc:
> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >>  qemu-system-x86-2517  [001] 102242.653447: kvm_update_master_clock:
> >> masterclock 0 hostclock tsc offsetmatched 1
> >>  qemu-system-x86-2521  [000] 102242.653657: kvm_update_master_clock:
> >> masterclock 0 hostclock tsc offsetmatched 1
> >>  qemu-system-x86-2522  [002] 102242.664448: kvm_update_master_clock:
> >> masterclock 0 hostclock tsc offsetmatched 1
> >>
> >>
> >> If I boot a UP guest, tracing says:
> >>
> >>  qemu-system-x86-2567  [001] 102370.447484: kvm_update_master_clock:
> >> masterclock 0 hostclock tsc offsetmatched 1
> >>  qemu-system-x86-2571  [002] 102370.447688: kvm_update_master_clock:
> >> masterclock 0 hostclock tsc offsetmatched 1
> >>
> >> I suspect, but I haven't verified, that this is fallout from:
> >>
> >> commit 16a9602158861687c78b6de6dc6a79e6e8a9136f
> >> Author: Marcelo Tosatti <mtosa...@redhat.com>
> >> Date:   Wed May 14 12:43:24 2014 -0300
> >>
> >> KVM: x86: disable master clock if TSC is reset during suspend
> >>
> >> Updating system_time from the kernel clock once master clock
> >> has been enabled can result in time backwards event, in case
> >> kernel clock frequency is lower than TSC frequency.
> >>
> >> Disable master clock in case it is necessary to update it
> >> from the resume path.
> >>
> >> Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>
> >> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
> >>
> >>
> >> Can we please stop making kvmclock more complex?  It's a beast right
> >> now, and not in a good way.  It's far too tangled with the vclock
> >> machinery on both the host and guest sides, the pvclock stuff is not
> >> well thought out (even in principle in an ABI sense), and it's never
> >> been clear to me what problem exactly the kvmclock stuff is supposed
> >> to solve.
> >>
> >> I'm somewhat tempted to suggest that we delete kvmclock entirely and
> >> start over.  A correctly functioning KVM guest using TSC (i.e.
> >> ignoring kvmclock entirely)
> >> seems to work rather more reliably and
> >> considerably faster than a kvmclock guest.
> >>
> >> --Andy
> >>
> >> --
> >> Andy Lutomirski
> >> AMA Capital Management, LLC
> >
> > Andy,
> >
> > I am all for solving practical problems rather than pursuing aesthetic
> > pleasure.
> >
> >> Updating system_time from the kernel clock once master clock
> >> has been enabled can result in time backwards event, in case
> >> kernel clock frequency is lower than TSC frequency.
> >>
> >> Disable master clock in case it is necessary to update it
> >> from the resume path.
> >
> >> once master clock
> >> has been enabled can result in time backwards event, in case
> >> kernel clock frequency is lower than TSC frequency.
> >
> > guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc reads.
> >
>

Re: kvmclock doesn't work, help?

2015-12-14 Thread Marcelo Tosatti
On Mon, Dec 14, 2015 at 10:07:21AM -0800, Andy Lutomirski wrote:
> On Fri, Dec 11, 2015 at 3:48 PM, Marcelo Tosatti <mtosa...@redhat.com> wrote:
> > On Fri, Dec 11, 2015 at 01:57:23PM -0800, Andy Lutomirski wrote:
> >> On Thu, Dec 10, 2015 at 1:32 PM, Marcelo Tosatti <mtosa...@redhat.com> 
> >> wrote:
> >> > On Wed, Dec 09, 2015 at 01:10:59PM -0800, Andy Lutomirski wrote:
> >> >> I'm trying to clean up kvmclock and I can't get it to work at all.  My
> >> >> host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC.
> >> >>
> >> >> If I boot an SMP (2 vcpus) guest, tracing says:
> >> >>
> >> >>  qemu-system-x86-2517  [001] 102242.610654: kvm_update_master_clock:
> >> >> masterclock 0 hostclock tsc offsetmatched 0
> >> >>  qemu-system-x86-2521  [000] 102242.613742: kvm_track_tsc:
> >> >> vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc
> >> >>  qemu-system-x86-2522  [000] 102242.622959: kvm_track_tsc:
> >> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >> >>  qemu-system-x86-2521  [000] 102242.645123: kvm_track_tsc:
> >> >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >> >>  qemu-system-x86-2522  [000] 102242.647291: kvm_track_tsc:
> >> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >> >>  qemu-system-x86-2521  [000] 102242.653369: kvm_track_tsc:
> >> >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >> >>  qemu-system-x86-2522  [000] 102242.653429: kvm_track_tsc:
> >> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >> >>  qemu-system-x86-2517  [001] 102242.653447: kvm_update_master_clock:
> >> >> masterclock 0 hostclock tsc offsetmatched 1
> >> >>  qemu-system-x86-2521  [000] 102242.653657: kvm_update_master_clock:
> >> >> masterclock 0 hostclock tsc offsetmatched 1
> >> >>  qemu-system-x86-2522  [002] 102242.664448: kvm_update_master_clock:
> >> >> masterclock 0 hostclock tsc offsetmatched 1
> >> >>
> >> >>
> >> >> If I boot a UP guest, tracing says:
> >> >>
> >> >>  qemu-system-x86-2567  [001] 102370.447484: kvm_update_master_clock:
> >> >> masterclock 0 hostclock tsc offsetmatched 1
> >> >>  qemu-system-x86-2571  [002] 102370.447688: kvm_update_master_clock:
> >> >> masterclock 0 hostclock tsc offsetmatched 1
> >> >>
> >> >> I suspect, but I haven't verified, that this is fallout from:
> >> >>
> >> >> commit 16a9602158861687c78b6de6dc6a79e6e8a9136f
> >> >> Author: Marcelo Tosatti <mtosa...@redhat.com>
> >> >> Date:   Wed May 14 12:43:24 2014 -0300
> >> >>
> >> >> KVM: x86: disable master clock if TSC is reset during suspend
> >> >>
> >> >> Updating system_time from the kernel clock once master clock
> >> >> has been enabled can result in time backwards event, in case
> >> >> kernel clock frequency is lower than TSC frequency.
> >> >>
> >> >> Disable master clock in case it is necessary to update it
> >> >> from the resume path.
> >> >>
> >> >> Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>
> >> >> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
> >> >>
> >> >>
> >> >> Can we please stop making kvmclock more complex?  It's a beast right
> >> >> now, and not in a good way.  It's far too tangled with the vclock
> >> >> machinery on both the host and guest sides, the pvclock stuff is not
> >> >> well thought out (even in principle in an ABI sense), and it's never
> >> >> been clear to me what problem exactly the kvmclock stuff is supposed
> >> >> to solve.
> >> >>
> >> >> I'm somewhat tempted to suggest that we delete kvmclock entirely and
> >> >> start over.  A correctly functioning KVM guest using TSC (i.e.
> >> >> ignoring kvmclock entirely)
> >> >> seems to work rather more reliably and
> >> >> considerably faster than a kvmclock guest.
> >> >>
> >> >> --Andy
> >> >>
> >> >> --
> >> >> Andy Lutomirski
> >> >> AMA

Re: kvmclock doesn't work, help?

2015-12-14 Thread Marcelo Tosatti
On Mon, Dec 14, 2015 at 02:44:15PM +0100, Paolo Bonzini wrote:
> 
> 
> On 11/12/2015 22:57, Andy Lutomirski wrote:
> > I'm still not seeing the issue.
> > 
> > The formula is:
> > 
> > (((rdtsc - pvti->tsc_timestamp) * pvti->tsc_to_system_mul) >>
> > pvti->tsc_shift) + pvti->system_time
> > 
> > Obviously, if you reset pvti->tsc_timestamp to the current tsc value
> > after suspend/resume, you would also need to update system_time.
> > 
> > I don't see what this has to do with suspend/resume or with whether
> > the effective scale factor is greater than or less than one.  The only
> > suspend/resume interaction I can see is that, if the host allows the
> > guest-observed TSC value to jump (which is arguably a bug, but that's
> > not important here), it needs to update pvti before resuming the
> > guest.
> 
> Which is not an issue, since freezing obviously gets all CPUs out of
> guest mode.
> 
> Marcelo, can you provide an example with made-up values for tsc and pvti?

I meant "systemtime" at ^.

guest visible clock = systemtime (updated at time 0, guest initialization) + 
scaled tsc reads=LARGE VALUE.
  ^^
guest reads clock to memory at location A = scaled tsc read.
-> suspend resume event 
guest visible clock = systemtime (updated at time AFTER SUSPEND) + scaled tsc 
reads=0.
guest reads clock to memory at location B.

So before the suspend/resume event, the clock is the RAW TSC values
(scaled by kvmclock, but the frequency of the RAW TSC). 

After suspend/resume event, the clock is updated from the host
via get_kernel_ns(), which reads the corrected NTP frequency TSC.

So you switch the timebase, from a clock running at a given frequency,
to a clock running at another frequency (effective frequency).

Example:

RAW TSC NTP corrected TSC
t0  10  10
t1  20  19.99
t2  30  29.98
t3  40  39.97
t4  50  49.96

...

if you suddenly switch from RAW TSC to NTP corrected TSC, 
you can see what will happen.
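
Concretely, with the numbers above: a reader that samples the clock at
t4 just before the switch sees 50 (raw TSC); a read just after the
switch computes 49.96, so the clock appears to step backwards by 0.04
units.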

Does that make sense?

> > suspend/resume event.
> > guest visible clock = tsc_timestamp (updated at time N) + scaled tsc
> > reads=0.


> 
> > Can you clarify concretely what goes wrong here?
> > 
> > (I'm also at a bit of a loss as to why this needs both system_time and
> > tsc_timestamp.  They're redundant in the sense that you could set
> > tsc_timestamp to zero and subtract (tsc_timestamp * tsc_to_system_mul) >>
> > tsc_shift to system_time without changing the result of the
> > calculation.)
> 
> You would have to ensure that all elements of pvti are rounded correctly
> whenever the base TSC is updated.  Doable, but it does seem simpler to
> keep subtract-TSC and add-nanoseconds separate.
> 
> Paolo


Re: kvmclock doesn't work, help?

2015-12-11 Thread Marcelo Tosatti
On Wed, Dec 09, 2015 at 01:10:59PM -0800, Andy Lutomirski wrote:
> I'm trying to clean up kvmclock and I can't get it to work at all.  My
> host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC.
> 
> If I boot an SMP (2 vcpus) guest, tracing says:
> 
>  qemu-system-x86-2517  [001] 102242.610654: kvm_update_master_clock:
> masterclock 0 hostclock tsc offsetmatched 0
>  qemu-system-x86-2521  [000] 102242.613742: kvm_track_tsc:
> vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc
>  qemu-system-x86-2522  [000] 102242.622959: kvm_track_tsc:
> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>  qemu-system-x86-2521  [000] 102242.645123: kvm_track_tsc:
> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>  qemu-system-x86-2522  [000] 102242.647291: kvm_track_tsc:
> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>  qemu-system-x86-2521  [000] 102242.653369: kvm_track_tsc:
> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>  qemu-system-x86-2522  [000] 102242.653429: kvm_track_tsc:
> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>  qemu-system-x86-2517  [001] 102242.653447: kvm_update_master_clock:
> masterclock 0 hostclock tsc offsetmatched 1
>  qemu-system-x86-2521  [000] 102242.653657: kvm_update_master_clock:
> masterclock 0 hostclock tsc offsetmatched 1
>  qemu-system-x86-2522  [002] 102242.664448: kvm_update_master_clock:
> masterclock 0 hostclock tsc offsetmatched 1
> 
> 
> If I boot a UP guest, tracing says:
> 
>  qemu-system-x86-2567  [001] 102370.447484: kvm_update_master_clock:
> masterclock 0 hostclock tsc offsetmatched 1
>  qemu-system-x86-2571  [002] 102370.447688: kvm_update_master_clock:
> masterclock 0 hostclock tsc offsetmatched 1
> 
> I suspect, but I haven't verified, that this is fallout from:
> 
> commit 16a9602158861687c78b6de6dc6a79e6e8a9136f
> Author: Marcelo Tosatti <mtosa...@redhat.com>
> Date:   Wed May 14 12:43:24 2014 -0300
> 
> KVM: x86: disable master clock if TSC is reset during suspend
> 
> Updating system_time from the kernel clock once master clock
> has been enabled can result in time backwards event, in case
> kernel clock frequency is lower than TSC frequency.
> 
> Disable master clock in case it is necessary to update it
> from the resume path.
> 
> Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>
> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
> 
> 
> Can we please stop making kvmclock more complex?  It's a beast right
> now, and not in a good way.  It's far too tangled with the vclock
> machinery on both the host and guest sides, the pvclock stuff is not
> well thought out (even in principle in an ABI sense), and it's never
> been clear to me what problem exactly the kvmclock stuff is supposed
> to solve.
>
> I'm somewhat tempted to suggest that we delete kvmclock entirely and
> start over.  A correctly functioning KVM guest using TSC (i.e.
> ignoring kvmclock entirely) 
> seems to work rather more reliably and
> considerably faster than a kvmclock guest.
> 
> --Andy
> 
> -- 
> Andy Lutomirski
> AMA Capital Management, LLC

Andy,

I am all for solving practical problems rather than pursuing aesthetic
pleasure.

> Updating system_time from the kernel clock once master clock
> has been enabled can result in time backwards event, in case
> kernel clock frequency is lower than TSC frequency.
> 
> Disable master clock in case it is necessary to update it
> from the resume path.

> once master clock
> has been enabled can result in time backwards event, in case
> kernel clock frequency is lower than TSC frequency.

guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc reads.

If the effective frequency of the kernel clock is lower (for example
due to NTP correcting the TSC frequency of the system), and you resume 
and update the system, the following happens:

guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc 
reads=LARGE VALUE.
suspend/resume event.
guest visible clock = tsc_timestamp (updated at time N) + scaled tsc reads=0.
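
In other words, because the kernel clock accumulates more slowly than
the raw-TSC-scaled kvmclock, the base taken at time N can be behind the
value the guest last read, so the first read after resume can go
backwards. That is the "time backwards event" the commit message refers
to.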



Re: kvmclock doesn't work, help?

2015-12-11 Thread Marcelo Tosatti
On Wed, Dec 09, 2015 at 02:27:36PM -0800, Andy Lutomirski wrote:
> On Wed, Dec 9, 2015 at 2:12 PM, Paolo Bonzini  wrote:
> >
> >
> > On 09/12/2015 22:49, Andy Lutomirski wrote:
> >> On Wed, Dec 9, 2015 at 1:16 PM, Paolo Bonzini  wrote:
> >>>
> >>>
> >>> On 09/12/2015 22:10, Andy Lutomirski wrote:
>  Can we please stop making kvmclock more complex?  It's a beast right
>  now, and not in a good way.  It's far too tangled with the vclock
>  machinery on both the host and guest sides, the pvclock stuff is not
>  well thought out (even in principle in an ABI sense), and it's never
>  been clear to me what problem exactly the kvmclock stuff is supposed
>  to solve.
> >>>
> >>> It's supposed to solve the problem that:
> >>>
> >>> - not all hosts have a working TSC
> >>
> >> Fine, but we don't need any vdso integration for that.
> >
> > Well, you still want a fast time source.  That was a given. :)
> 
> If the host can't figure out how to give *itself* a fast time source,
> I'd be surprised if KVM can manage to give the guest a fast, reliable
> time source.
> 
> >
> >>> - even if they all do, virtual machines can be migrated (or
> >>> saved/restored) to a host with a different TSC frequency
> >>>
> >>> - any MMIO- or PIO-based mechanism to access the current time is orders
> >>> of magnitude slower than the TSC and less precise too.
> >>
> >> Yup.  But TSC by itself gets that benefit, too.
> >
> > Yes, the problem is if you want to solve all three of them.  The first
> > two are solved by the ACPI PM timer with a decent resolution (70
> > ns---much faster anyway than an I/O port access).  The third is solved
> > by TSC.  To solve all three, you need kvmclock.
> 
> Still confused.  Is kvmclock really used in cases where even the host
> can't pull of working TSC?
> 
> >
>  I'm somewhat tempted to suggest that we delete kvmclock entirely and
>  start over.  A correctly functioning KVM guest using TSC (i.e.
>  ignoring kvmclock entirely) seems to work rather more reliably and
>  considerably faster than a kvmclock guest.
> >>>
> >>> If all your hosts have a working TSC and you don't do migration or
> >>> save/restore, that's a valid configuration.  It's not a good default,
> >>> however.
> >>
> >> Er?
> >>
> >> kvmclock is still really quite slow and buggy.
> >
> > Unless it takes 3-4000 clock cycles for a gettimeofday, which it
> > shouldn't even with vdso disabled, it's definitely not slower than PIO.
> >
> >> And the patch I identified is definitely a problem here:
> >>
> >> [  136.131241] KVM: disabling fast timing permanently due to inability
> >> to recover from suspend
> >>
> >> I got that on the host with this whitespace-damaged patch:
> >>
> >> if (backwards_tsc) {
> >> u64 delta_cyc = max_tsc - local_tsc;
> >> +   if (!backwards_tsc_observed)
> >> +   pr_warn("KVM: disabling fast timing
> >> permanently due to inability to recover from suspend\n");
> >>
> >> when I suspended and resumed.
> >>
> >> Can anyone explain what problem
> >> 16a9602158861687c78b6de6dc6a79e6e8a9136f is supposed to solve?  On
> >> brief inspection, it just seems to be incorrect.  Shouldn't KVM's
> >> normal TSC logic handle that case right?  After all, all vcpus should
> >> be paused when we resume from suspend.  At worst, we should just need
> >> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu) on all vcpus.  (Actually,
> >> shouldn't we do that regardless of which way the TSC jumped on
> >> suspend/resume?  After all, the TSC-to-wall-clock offset is quite
> >> likely to change except on the very small handful of CPUs (if any)
> >> that keep the TSC running in S3 and hibernate.)
> >
> > I don't recall the details of that patch, so Marcelo will have to answer
> > this, or Alex too since he chimed in the original thread.  At least it
> > should be made conditional on the existence of a VM at suspend time (and
> > the master clock stuff should be made per VM, as I suggested at
> > https://www.mail-archive.com/kvm@vger.kernel.org/msg102316.html).
> >
> > It would indeed be great if the master clock could be dropped.  But I'm
> > definitely missing some of the subtle details. :(
> 
> Me, too.
> 
> Anyway, see the attached untested patch.  Marcelo?
> 
> --Andy

Read the last email about the problem.



Re: kvmclock doesn't work, help?

2015-12-11 Thread Marcelo Tosatti
On Wed, Dec 09, 2015 at 01:10:59PM -0800, Andy Lutomirski wrote:
> I'm trying to clean up kvmclock and I can't get it to work at all.  My
> host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC.
> 
> If I boot an SMP (2 vcpus) guest, tracing says:
> 
>  qemu-system-x86-2517  [001] 102242.610654: kvm_update_master_clock:
> masterclock 0 hostclock tsc offsetmatched 0
>  qemu-system-x86-2521  [000] 102242.613742: kvm_track_tsc:
> vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc
>  qemu-system-x86-2522  [000] 102242.622959: kvm_track_tsc:
> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>  qemu-system-x86-2521  [000] 102242.645123: kvm_track_tsc:
> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>  qemu-system-x86-2522  [000] 102242.647291: kvm_track_tsc:
> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>  qemu-system-x86-2521  [000] 102242.653369: kvm_track_tsc:
> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>  qemu-system-x86-2522  [000] 102242.653429: kvm_track_tsc:
> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>  qemu-system-x86-2517  [001] 102242.653447: kvm_update_master_clock:
> masterclock 0 hostclock tsc offsetmatched 1
>  qemu-system-x86-2521  [000] 102242.653657: kvm_update_master_clock:
> masterclock 0 hostclock tsc offsetmatched 1
>  qemu-system-x86-2522  [002] 102242.664448: kvm_update_master_clock:
> masterclock 0 hostclock tsc offsetmatched 1
> 
> 
> If I boot a UP guest, tracing says:
> 
>  qemu-system-x86-2567  [001] 102370.447484: kvm_update_master_clock:
> masterclock 0 hostclock tsc offsetmatched 1
>  qemu-system-x86-2571  [002] 102370.447688: kvm_update_master_clock:
> masterclock 0 hostclock tsc offsetmatched 1
> 
> I suspect, but I haven't verified, that this is fallout from:
> 
> commit 16a9602158861687c78b6de6dc6a79e6e8a9136f
> Author: Marcelo Tosatti <mtosa...@redhat.com>
> Date:   Wed May 14 12:43:24 2014 -0300
> 
> KVM: x86: disable master clock if TSC is reset during suspend
> 
> Updating system_time from the kernel clock once master clock
> has been enabled can result in time backwards event, in case
> kernel clock frequency is lower than TSC frequency.
> 
> Disable master clock in case it is necessary to update it
> from the resume path.
> 
> Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>
> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
> 
> 
> Can we please stop making kvmclock more complex?  It's a beast right
> now, and not in a good way.  It's far too tangled with the vclock
> machinery on both the host and guest sides, the pvclock stuff is not
> well thought out (even in principle in an ABI sense), and it's never
> been clear to me what problem exactly the kvmclock stuff is supposed
> to solve.
> 
> 
> I'm somewhat tempted to suggest that we delete kvmclock entirely and
> start over.  A correctly functioning KVM guest using TSC (i.e.
> ignoring kvmclock entirely) seems to work rather more reliably and
> considerably faster than a kvmclock guest.
> 
> --Andy

Users can do that, if they want, with the "clocksource=tsc" kernel option.
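
For example, booting the guest with "clocksource=tsc" appended to its
kernel command line makes the guest select the raw TSC as its
clocksource and bypass kvm-clock entirely.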




Re: [PATCH 05/10] KVM: x86: MMU: Use for_each_rmap_spte macro instead of pte_list_walk()

2015-11-15 Thread Marcelo Tosatti
On Fri, Nov 13, 2015 at 07:47:28PM -0200, Marcelo Tosatti wrote:
> On Thu, Nov 12, 2015 at 08:52:45PM +0900, Takuya Yoshikawa wrote:
> > kvm_mmu_mark_parents_unsync() alone uses pte_list_walk(), which does
> > nearly the same as the for_each_rmap_spte macro.  The only difference
> > is that is_shadow_present_pte() checks cannot be placed there because
> > kvm_mmu_mark_parents_unsync() can be called with a new parent pointer
> > whose entry is not set yet.
> > 
> > By calling mark_unsync() separately for the parent and adding the parent
> > pointer to the parent_ptes chain later in kvm_mmu_get_page(), the macro
> > works with no problem.
> > 
> > Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya...@lab.ntt.co.jp>
> > ---
> >  arch/x86/kvm/mmu.c | 36 +---
> >  1 file changed, 13 insertions(+), 23 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index e8cfdc4..1691171 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -1007,26 +1007,6 @@ static void pte_list_remove(u64 *spte, unsigned long 
> > *pte_list)
> > }
> >  }
> >  
> > -typedef void (*pte_list_walk_fn) (u64 *spte);
> > -static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
> > -{
> > -   struct pte_list_desc *desc;
> > -   int i;
> > -
> > -   if (!*pte_list)
> > -   return;
> > -
> > -   if (!(*pte_list & 1))
> > -   return fn((u64 *)*pte_list);
> > -
> > -   desc = (struct pte_list_desc *)(*pte_list & ~1ul);
> > -   while (desc) {
> > -   for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
> > -   fn(desc->sptes[i]);
> > -   desc = desc->more;
> > -   }
> > -}
> > -
> >  static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
> > struct kvm_memory_slot *slot)
> >  {
> > @@ -1741,7 +1721,12 @@ static struct kvm_mmu_page 
> > *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
> >  static void mark_unsync(u64 *spte);
> >  static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
> >  {
> > -   pte_list_walk(&sp->parent_ptes, mark_unsync);
> > +   u64 *sptep;
> > +   struct rmap_iterator iter;
> > +
> > +   for_each_rmap_spte(&sp->parent_ptes, &iter, sptep) {
> > +   mark_unsync(sptep);
> > +   }
> >  }
> >  
> >  static void mark_unsync(u64 *spte)
> > @@ -2111,12 +2096,17 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
> > kvm_vcpu *vcpu,
> 
> Faulting in a spte, where one of the levels of sptes, either
> 
> 
>   spte-1
>   spte-2
>   spte-3
> 
> has the present bit clear. So we're searching for a guest page to shadow, with
> gfn "gfn".
> 
> > if (sp->unsync && kvm_sync_page_transient(vcpu, sp))
> > break;
> 
> If a shadow for gfn exists, but is unsync, sync guest-page ---to--> kvm
> sptes.
> 
> > -   mmu_page_add_parent_pte(vcpu, sp, parent_pte);
> 
> add "gfn" (actually its "struct kvm_mmu_page *sp" pointer) to
> the parent.
> > if (sp->unsync_children) {
> > kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> > kvm_mmu_mark_parents_unsync(sp);
> 
> kvm_mmu_mark_parents_unsync relied on the links from current level all
> the way to top level to mark all levels unsync, so that on guest entry,
> KVM_REQ_MMU_SYNC is processed and any level is brought from guest -->
> kvm pages. This now fails, because you removed "mmu_page_add_parent_pte"
> (the link is not formed all the way to root).
> 
> Unless I am missing something, this is not correct.

The actual issue is this: a higher level page that previously had no
out-of-sync pages under its children now, due to your addition, has a
child that is unsync:

initial state:
level1 

final state:

level1 -x-> level2 -x-> level3

Where -x-> are the links created by this pagefault fixing round.

If _any_ page under you is unsync (not necessarily the ones this
pagefault is accessing), you have to mark parents unsync.
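
To illustrate, a minimal userspace model of the invariant (names are
simplified stand-ins for KVM's parent_ptes / mark_unsync() machinery;
this is not kernel code):

	#include <stdbool.h>
	#include <stdio.h>

	struct sp {
		struct sp *parent;	/* reverse link, created at fault time */
		bool unsync;		/* this page's translations may be stale */
		int unsync_children;	/* unsync pages exist somewhere below */
	};

	static void mark_parents_unsync(struct sp *p)
	{
		struct sp *a;

		for (a = p->parent; a; a = a->parent)
			a->unsync_children++;
	}

	int main(void)
	{
		struct sp level1 = {0}, level2 = {0}, level3 = {0};

		level2.parent = &level1;
		level3.unsync = true;

		/* Broken ordering: the walk happens before the
		 * level3 -> level2 link exists. */
		mark_parents_unsync(&level3);
		level3.parent = &level2;	/* linked too late */
		printf("level1 unsync_children=%d\n", level1.unsync_children); /* 0 */

		/* Fixed ordering: mark the new parent explicitly once it
		 * is linked, so the fact still propagates to the root. */
		level2.unsync_children++;	/* mark_unsync(parent_pte) analogue */
		mark_parents_unsync(&level2);
		printf("level1 unsync_children=%d\n", level1.unsync_children); /* 1 */
		return 0;
	}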



Re: [PATCH 05/10] KVM: x86: MMU: Use for_each_rmap_spte macro instead of pte_list_walk()

2015-11-13 Thread Marcelo Tosatti
On Thu, Nov 12, 2015 at 08:52:45PM +0900, Takuya Yoshikawa wrote:
> kvm_mmu_mark_parents_unsync() alone uses pte_list_walk(), which does
> nearly the same as the for_each_rmap_spte macro.  The only difference
> is that is_shadow_present_pte() checks cannot be placed there because
> kvm_mmu_mark_parents_unsync() can be called with a new parent pointer
> whose entry is not set yet.
> 
> By calling mark_unsync() separately for the parent and adding the parent
> pointer to the parent_ptes chain later in kvm_mmu_get_page(), the macro
> works with no problem.
> 
> Signed-off-by: Takuya Yoshikawa 
> ---
>  arch/x86/kvm/mmu.c | 36 +++++++++++++-----------------------
>  1 file changed, 13 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index e8cfdc4..1691171 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1007,26 +1007,6 @@ static void pte_list_remove(u64 *spte, unsigned long 
> *pte_list)
>   }
>  }
>  
> -typedef void (*pte_list_walk_fn) (u64 *spte);
> -static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
> -{
> - struct pte_list_desc *desc;
> - int i;
> -
> - if (!*pte_list)
> - return;
> -
> - if (!(*pte_list & 1))
> - return fn((u64 *)*pte_list);
> -
> - desc = (struct pte_list_desc *)(*pte_list & ~1ul);
> - while (desc) {
> - for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
> - fn(desc->sptes[i]);
> - desc = desc->more;
> - }
> -}
> -
>  static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
>   struct kvm_memory_slot *slot)
>  {
> @@ -1741,7 +1721,12 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct 
> kvm_vcpu *vcpu,
>  static void mark_unsync(u64 *spte);
>  static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
>  {
> - pte_list_walk(&sp->parent_ptes, mark_unsync);
> + u64 *sptep;
> + struct rmap_iterator iter;
> +
> + for_each_rmap_spte(&sp->parent_ptes, &iter, sptep) {
> + mark_unsync(sptep);
> + }
>  }
>  
>  static void mark_unsync(u64 *spte)
> @@ -2111,12 +2096,17 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
> kvm_vcpu *vcpu,

Faulting in a spte, where one of the levels of sptes, either


spte-1
spte-2
spte-3

has the present bit clear. So we're searching for a guest page to shadow, with
gfn "gfn".

>   if (sp->unsync && kvm_sync_page_transient(vcpu, sp))
>   break;

If a shadow for gfn exists, but is unsync, sync guest-page ---to--> kvm
sptes.

> - mmu_page_add_parent_pte(vcpu, sp, parent_pte);

add "gfn" (actually its "struct kvm_mmu_page *sp" pointer) to
the parent.
>   if (sp->unsync_children) {
>   kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
>   kvm_mmu_mark_parents_unsync(sp);

kvm_mmu_mark_parents_unsync relied on the links from current level all
the way to top level to mark all levels unsync, so that on guest entry,
KVM_REQ_MMU_SYNC is processed and any level is brought from guest -->
kvm pages. This now fails, because you removed "mmu_page_add_parent_pte"
(the link is not formed all the way to root).

Unless I am missing something, this is not correct.

> - } else if (sp->unsync)
> + if (parent_pte)
> + mark_unsync(parent_pte);
> + } else if (sp->unsync) {
>   kvm_mmu_mark_parents_unsync(sp);
> + if (parent_pte)
> + mark_unsync(parent_pte);
> + }
> + mmu_page_add_parent_pte(vcpu, sp, parent_pte);
>  
>   __clear_sp_write_flooding_count(sp);
>   trace_kvm_mmu_get_page(sp, false);



Re: [PATCH 06/10] KVM: x86: MMU: Consolidate WARN_ON/BUG_ON checks for reverse-mapped sptes

2015-11-13 Thread Marcelo Tosatti
On Thu, Nov 12, 2015 at 08:53:43PM +0900, Takuya Yoshikawa wrote:
> At some call sites of rmap_get_first() and rmap_get_next(), BUG_ON is
> placed right after the call to detect unrelated sptes which must not be
> found in the reverse-mapping list.
> 
> Move this check in rmap_get_first/next() so that all call sites, not
> just the users of the for_each_rmap_spte() macro, will be checked the
> same way.  In addition, change the BUG_ON to WARN_ON since killing the
> whole host is the last thing that KVM should try.

It should be a BUG_ON: if KVM continues, it will corrupt (more) memory.
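
Concretely, the difference (illustrative only, not the patched code):

	/* WARN_ON(): logs and continues -- the bogus sptep is still
	 * returned, and the caller may write through it, corrupting
	 * more memory. */
	WARN_ON(!is_shadow_present_pte(*sptep));
	return sptep;

	/* BUG_ON(): halts the host at the first sign of rmap
	 * corruption, before the bad pointer can do further damage. */
	BUG_ON(!is_shadow_present_pte(*sptep));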

> One thing to keep in mind is that kvm_mmu_unlink_parents() also uses
> rmap_get_first() to handle parent sptes.  The change will not break it
> because parent sptes are present, at least until drop_parent_pte()
> actually unlinks them, and not mmio-sptes.
> 
> Signed-off-by: Takuya Yoshikawa 
> ---
>  Documentation/virtual/kvm/mmu.txt |  4 ++--
>  arch/x86/kvm/mmu.c| 26 +-
>  2 files changed, 19 insertions(+), 11 deletions(-)
>
> diff --git a/Documentation/virtual/kvm/mmu.txt 
> b/Documentation/virtual/kvm/mmu.txt
> index 3a4d681..daf9c0f 100644
> --- a/Documentation/virtual/kvm/mmu.txt
> +++ b/Documentation/virtual/kvm/mmu.txt
> @@ -203,10 +203,10 @@ Shadow pages contain the following information:
>  page cannot be destroyed.  See role.invalid.
>parent_ptes:
>  The reverse mapping for the pte/ptes pointing at this page's spt. If
> -parent_ptes bit 0 is zero, only one spte points at this pages and
> +parent_ptes bit 0 is zero, only one spte points at this page and
>  parent_ptes points at this single spte, otherwise, there exists multiple
>  sptes pointing at this page and (parent_ptes & ~0x1) points at a data
> -structure with a list of parent_ptes.
> +structure with a list of parent sptes.
>unsync:
>  If true, then the translations in this page may not match the guest's
>  translation.  This is equivalent to the state of the tlb when a pte is
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 1691171..ee7b101 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1079,17 +1079,23 @@ struct rmap_iterator {
>   */
>  static u64 *rmap_get_first(unsigned long rmap, struct rmap_iterator *iter)
>  {
> + u64 *sptep;
> +
>   if (!rmap)
>   return NULL;
>  
>   if (!(rmap & 1)) {
>   iter->desc = NULL;
> - return (u64 *)rmap;
> + sptep = (u64 *)rmap;
> + goto out;
>   }
>  
>   iter->desc = (struct pte_list_desc *)(rmap & ~1ul);
>   iter->pos = 0;
> - return iter->desc->sptes[iter->pos];
> + sptep = iter->desc->sptes[iter->pos];
> +out:
> + WARN_ON(!is_shadow_present_pte(*sptep));
> + return sptep;
>  }
>  
>  /*
> @@ -1099,14 +1105,14 @@ static u64 *rmap_get_first(unsigned long rmap, struct 
> rmap_iterator *iter)
>   */
>  static u64 *rmap_get_next(struct rmap_iterator *iter)
>  {
> + u64 *sptep;
> +
>   if (iter->desc) {
>   if (iter->pos < PTE_LIST_EXT - 1) {
> - u64 *sptep;
> -
>   ++iter->pos;
>   sptep = iter->desc->sptes[iter->pos];
>   if (sptep)
> - return sptep;
> + goto out;
>   }
>  
>   iter->desc = iter->desc->more;
> @@ -1114,17 +1120,20 @@ static u64 *rmap_get_next(struct rmap_iterator *iter)
>   if (iter->desc) {
>   iter->pos = 0;
>   /* desc->sptes[0] cannot be NULL */
> - return iter->desc->sptes[iter->pos];
> + sptep = iter->desc->sptes[iter->pos];
> + goto out;
>   }
>   }
>  
>   return NULL;
> +out:
> + WARN_ON(!is_shadow_present_pte(*sptep));
> + return sptep;
>  }
>  
>  #define for_each_rmap_spte(_rmap_, _iter_, _spte_)   \
>  for (_spte_ = rmap_get_first(*_rmap_, _iter_);   \
> - _spte_ && ({BUG_ON(!is_shadow_present_pte(*_spte_)); 1;});  \
> - _spte_ = rmap_get_next(_iter_))
> + _spte_; _spte_ = rmap_get_next(_iter_))
>  
>  static void drop_spte(struct kvm *kvm, u64 *sptep)
>  {
> @@ -1338,7 +1347,6 @@ static bool kvm_zap_rmapp(struct kvm *kvm, unsigned 
> long *rmapp)
>   bool flush = false;
>  
>   while ((sptep = rmap_get_first(*rmapp, &iter))) {
> - BUG_ON(!(*sptep & PT_PRESENT_MASK));
>   rmap_printk("%s: spte %p %llx.\n", __func__, sptep, *sptep);
>  
>   drop_spte(kvm, sptep);
> -- 
> 2.1.0
> 

Re: [RFC PATCH] VFIO: Add a parameter to force nonthread IRQ

2015-10-28 Thread Marcelo Tosatti
On Wed, Oct 28, 2015 at 06:05:00PM +0100, Paolo Bonzini wrote:
> 
> 
> On 28/10/2015 17:00, Alex Williamson wrote:
> > > Alex, would it make sense to use the IRQ bypass infrastructure always,
> > > not just for VT-d, to do the MSI injection directly from the VFIO
> > > interrupt handler and bypass the eventfd?  Basically this would add an
> > > RCU-protected list of consumers matching the token to struct
> > > irq_bypass_producer, and a
> > > 
> > >   int (*inject)(struct irq_bypass_consumer *);
> > > 
> > > callback to struct irq_bypass_consumer.  If any callback returns true,
> > > the eventfd is not signaled.
> >
> > Yeah, that might be a good idea, it's probably more plausible than
> > making the eventfd_signal() code friendly to call from hard interrupt
> > context.  On the vfio side can we use request_threaded_irq() directly
> > for this?
> 
> I don't know if that gives you a non-threaded IRQ with the real-time
> kernel...  CCing Marcelo to get some insight.

The vfio interrupt handler (threaded or not) runs at a higher priority
than the vcpu thread. So don't worry about -RT.

About bypass: the fewer instructions between the device ISR and
injection of the interrupt into the guest, the better, as that
translates directly into lower interrupt latency, which is important
because it determines

1. how often you can switch from poll mode to ACPI C-states.
2. whether the realtime workload is virtualizable.

As for the properties of request_threaded_irq(): I don't know.
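
For reference, a sketch of the split Alex describes below (illustrative
only; try_direct_inject() and to_trigger() are invented placeholders,
not VFIO API):

	/* hard-irq context, interrupts off: try the bypass fast path */
	static irqreturn_t vfio_msi_hard_handler(int irq, void *dev_id)
	{
		if (try_direct_inject(dev_id))	/* hypothetical hook */
			return IRQ_HANDLED;	/* no thread, no eventfd */
		return IRQ_WAKE_THREAD;		/* defer to thread below */
	}

	/* threaded handler: sleepable, safe for eventfd_signal() */
	static irqreturn_t vfio_msi_thread_fn(int irq, void *dev_id)
	{
		eventfd_signal(to_trigger(dev_id), 1);
		return IRQ_HANDLED;
	}

	/* request_threaded_irq(irq, vfio_msi_hard_handler,
	 *			vfio_msi_thread_fn, 0, "vfio-msi", dev); */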

> > Making the hard irq handler return IRQ_HANDLED if we can use
> > the irq bypass manager or IRQ_WAKE_THREAD if we need to use the eventfd.
> > I think we need some way to get back to irq thread context to use
> > eventfd_signal().
> 
> The irqfd is already able to schedule a work item, because it runs with
> interrupts disabled, so I think we can always return IRQ_HANDLED.
> 
> There's another little complication.  Right now, only x86 has
> kvm_set_msi_inatomic.  We should merge kvm_set_msi_inatomic,
> kvm_set_irq_inatomic and kvm_arch_set_irq.
> 
> Some cleanups are needed there; the flow between the functions is really
> badly structured because the API grew somewhat by accretion.  I'll get
> to it next week or on the way back to Italy.
> 
> > Would we ever not want to use the direct bypass
> > manager path if available?  Thanks,
> 
> I don't think so.  KVM always registers itself as a consumer, even if
> there is no VT-d posted interrupts.  add_producer simply returns -EINVAL
> then.
> 
> Paolo


Re: [PATCH v2 0/3] target-i386: save/restore vcpu's TSC rate during migration

2015-10-23 Thread Marcelo Tosatti
On Thu, Oct 22, 2015 at 04:45:21PM -0200, Eduardo Habkost wrote:
> On Tue, Oct 20, 2015 at 03:22:51PM +0800, Haozhong Zhang wrote:
> > This patchset enables QEMU to save/restore vcpu's TSC rate during the
> > migration. When cooperating with KVM which supports TSC scaling, guest
> > programs can observe a consistent guest TSC rate even though they are
> > migrated among machines with different host TSC rates.
> > 
> > A pair of cpu options 'save-tsc-freq' and 'load-tsc-freq' are added to
> > control the migration of vcpu's TSC rate.
> 
> The requirements and goals aren't clear to me. I see two possible use
> cases, here:
> 
> 1) Best effort to keep TSC frequency constant if possible (but not
>aborting migration if not possible). This would be an interesting
>default, but a bit unpredictable.
> 2) Strictly ensuring TSC frequency stays constant on migration (and
>aborting migration if not possible). This would be an useful feature,
>but can't be enabled by default unless both hosts have the same TSC
>frequency or support TSC scaling.

Only the destination needs to support TSC scaling, to match the
frequency of the incoming host.

Has the KVM code for this feature been submitted or integrated?

> Which one(s) you are trying to implement?
> 
> In other words, what is the right behavior when KVM_SET_TSC_KHZ fails or
> KVM_CAP_TSC_CONTROL is not available? We can't answer that question if
> the requirements and goals are not clear.
> 
> Once we know what exactly is the goal, we could enable the new mode with
> a single option, instead of raw options to control migration stream
> loading/saving.

Windows and Linux guests have paravirt clocks and/or options to
disable direct TSC usage for timekeeping purposes. So blocking
migration seems overkill.
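
A sketch of the destination-side handling this implies (hypothetical
QEMU helper; KVM_CAP_TSC_CONTROL and KVM_SET_TSC_KHZ are the real
capability/ioctl pair):

	static void kvm_apply_migrated_tsc_khz(KVMState *s, CPUState *cs,
					       int64_t tsc_khz)
	{
		if (tsc_khz <= 0)
			return;		/* nothing migrated, keep host rate */

		if (kvm_check_extension(s, KVM_CAP_TSC_CONTROL) &&
		    kvm_vcpu_ioctl(cs, KVM_SET_TSC_KHZ, tsc_khz) == 0)
			return;		/* destination scales TSC to match */

		/* No TSC scaling: warn instead of aborting, since guests
		 * have paravirt clock fallbacks (see above). */
		error_report("warning: cannot set vcpu TSC rate to %" PRId64
			     " kHz", tsc_khz);
	}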

> 
> 
> >  * By default, the migration of vcpu's TSC rate is enabled only on
> >pc-*-2.5 and newer machine types. If the cpu option 'save-tsc-freq'
> >is present, the vcpu's TSC rate will be migrated from older machine
> >types as well.
> >  * Another cpu option 'load-tsc-freq' controls whether the migrated
> >vcpu's TSC rate is used. By default, QEMU will not use the migrated
> >TSC rate if this option is not present. Otherwise, QEMU will use
> >the migrated TSC rate and override the TSC rate given by the cpu
> >option 'tsc-freq'.
> > 
> > Changes in v2:
> >  * Add a pair of cpu options 'save-tsc-freq' and 'load-tsc-freq' to
> >control the migration of vcpu's TSC rate.
> >  * Move all logic of setting TSC rate to target-i386.
> >  * Remove the duplicated TSC setup in kvm_arch_init_vcpu().
> > 
> > Haozhong Zhang (3):
> >   target-i386: add a subsection for migrating vcpu's TSC rate
> >   target-i386: calculate vcpu's TSC rate to be migrated
> >   target-i386: load the migrated vcpu's TSC rate
> > 
> >  include/hw/i386/pc.h  |  5 +
> >  target-i386/cpu.c |  2 ++
> >  target-i386/cpu.h |  3 +++
> >  target-i386/kvm.c | 61 
> > +++
> >  target-i386/machine.c | 19 
> >  5 files changed, 81 insertions(+), 9 deletions(-)
> > 
> > -- 
> > 2.4.8
> > 
> 
> -- 
> Eduardo


[PATCH] KVM: x86: move steal time initialization to vcpu entry time

2015-10-14 Thread Marcelo Tosatti

As reported at https://bugs.launchpad.net/qemu/+bug/1494350, 
it is possible to have vcpu->arch.st.last_steal initialized
from a thread other than the vcpu thread, say the iothread, via
KVM_SET_MSRS.

This can cause an overflow later (when subtracting from the vcpu
thread's sched_info.run_delay).
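
For illustration, the subtraction in question (in
accumulate_steal_time()) is roughly:

	delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;

If last_steal was sampled from the iothread's sched_info, it can exceed
the vcpu thread's run_delay, and the unsigned subtraction wraps around,
reporting an enormous steal time to the guest.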

To avoid that, move steal time accumulation to vcpu entry time,
before copying steal time data to the guest.

Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8f0f6ec..0e0332e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2030,6 +2030,8 @@ static void accumulate_steal_time(struct kvm_vcpu *vcpu)
 
 static void record_steal_time(struct kvm_vcpu *vcpu)
 {
+   accumulate_steal_time(vcpu);
+
if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
return;
 
@@ -2182,12 +2184,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
if (!(data & KVM_MSR_ENABLED))
break;
 
-   vcpu->arch.st.last_steal = current->sched_info.run_delay;
-
-   preempt_disable();
-   accumulate_steal_time(vcpu);
-   preempt_enable();
-
kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
 
break;
@@ -2830,7 +2826,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
vcpu->cpu = cpu;
}
 
-   accumulate_steal_time(vcpu);
kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
 }
 


Re: [PATCH 2/2] Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR"

2015-09-23 Thread Marcelo Tosatti
On Tue, Sep 22, 2015 at 09:52:49PM +0200, Paolo Bonzini wrote:
> 
> 
> On 22/09/2015 21:01, Marcelo Tosatti wrote:
> > On Fri, Sep 18, 2015 at 05:54:30PM +0200, Radim Krčmář wrote:
> >> Shifting pvclock_vcpu_time_info.system_time on write to KVM system time
> >> MSR is a change of ABI.  Probably only 2.6.16 based SLES 10 breaks due
> >> to its custom enhancements to kvmclock, but KVM never declared the MSR
> >> only for one-shot initialization.  (Doc says that only one write is
> >> needed.)
> >>
> >> This reverts commit b7e60c5aedd2b63f16ef06fde4f81ca032211bc5.
> >> And adds a note to the definition of PVCLOCK_COUNTS_FROM_ZERO.
> >>
> >> Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
> >> ---
> >>  arch/x86/include/asm/pvclock-abi.h | 1 +
> >>  arch/x86/kvm/x86.c | 4 ----
> >>  2 files changed, 1 insertion(+), 4 deletions(-)
> >>
> >> diff --git a/arch/x86/include/asm/pvclock-abi.h 
> >> b/arch/x86/include/asm/pvclock-abi.h
> >> index 655e07a48f6c..67f08230103a 100644
> >> --- a/arch/x86/include/asm/pvclock-abi.h
> >> +++ b/arch/x86/include/asm/pvclock-abi.h
> >> @@ -41,6 +41,7 @@ struct pvclock_wall_clock {
> >>  
> >>  #define PVCLOCK_TSC_STABLE_BIT(1 << 0)
> >>  #define PVCLOCK_GUEST_STOPPED (1 << 1)
> >> +/* PVCLOCK_COUNTS_FROM_ZERO broke ABI and can't be used anymore. */
> >>  #define PVCLOCK_COUNTS_FROM_ZERO (1 << 2)
> >>  #endif /* __ASSEMBLY__ */
> >>  #endif /* _ASM_X86_PVCLOCK_ABI_H */
> >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >> index 18d59b584dee..34d33f4757d2 100644
> >> --- a/arch/x86/kvm/x86.c
> >> +++ b/arch/x86/kvm/x86.c
> >> @@ -1707,8 +1707,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
> >>vcpu->pvclock_set_guest_stopped_request = false;
> >>}
> >>  
> >> -  pvclock_flags |= PVCLOCK_COUNTS_FROM_ZERO;
> >> -
> >>/* If the host uses TSC clocksource, then it is stable */
> >>if (use_master_clock)
> >>pvclock_flags |= PVCLOCK_TSC_STABLE_BIT;
> >> @@ -2006,8 +2004,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct 
> >> msr_data *msr_info)
> >>    &vcpu->requests);
> >>  
> >>ka->boot_vcpu_runs_old_kvmclock = tmp;
> >> -
> >> -  ka->kvmclock_offset = -get_kernel_ns();
> >>}
> >>  
> >>vcpu->arch.time = data;
> >> -- 
> >> 2.5.2
> > 
> > ACK
> 
> So I suppose you changed your mind :) but can you explain the reasoning?
> 
> Paolo

The patch is correct. The overflow (the issue I raised) is only a
problem without PVCLOCK_TSC_STABLE_BIT: with the stable bit set,
kvmclock is monotonic across vcpus, so the delta subtraction cannot
wrap.



Re: [PATCH 1/2] x86: kvmclock: abolish PVCLOCK_COUNTS_FROM_ZERO

2015-09-22 Thread Marcelo Tosatti
On Fri, Sep 18, 2015 at 05:54:29PM +0200, Radim Krčmář wrote:
> Newer KVM won't be exposing PVCLOCK_COUNTS_FROM_ZERO anymore.
> The purpose of that flag was to start counting system time from 0 when
> the KVM clock has been initialized.
> We can achieve the same by selecting one read as the initial point.
> 
> A simple subtraction will work unless the KVM clock count overflows
> earlier (has smaller width) than scheduler's cycle count.  We should be
> safe till x86_128.
> 
> Because PVCLOCK_COUNTS_FROM_ZERO was enabled only on new hypervisors,
> setting sched clock as stable based on PVCLOCK_TSC_STABLE_BIT might
> regress on older ones.
> 
> I presume we don't need to change kvm_clock_read instead of introducing
> kvm_sched_clock_read.  A problem could arise in case sched_clock is
> expected to return the same value as get_cycles, but we should have
> merged those clocks in that case.
> 
> Signed-off-by: Radim Krčmář 
> ---
>  arch/x86/kernel/kvmclock.c | 46 +++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 35 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index 2c7aafa70702..ef5b3d2cecce 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -32,6 +32,7 @@
>  static int kvmclock = 1;
>  static int msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
>  static int msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
> +static cycle_t kvm_sched_clock_offset;
>  
>  static int parse_no_kvmclock(char *arg)
>  {
> @@ -92,6 +93,29 @@ static cycle_t kvm_clock_get_cycles(struct clocksource *cs)
>   return kvm_clock_read();
>  }
>  
> +static cycle_t kvm_sched_clock_read(void)
> +{
> + return kvm_clock_read() - kvm_sched_clock_offset;
> +}
> +
> +static inline void kvm_sched_clock_init(bool stable)
> +{
> + if (!stable) {
> + pv_time_ops.sched_clock = kvm_clock_read;
> + return;
> + }
> +
> + kvm_sched_clock_offset = kvm_clock_read();
> + pv_time_ops.sched_clock = kvm_sched_clock_read;
> + set_sched_clock_stable();
> +
> + printk("kvm-clock: using sched offset of %llu cycles\n",
> + kvm_sched_clock_offset);
> +
> + BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
> +  sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
> +}
> +
>  /*
>   * If we don't do that, there is the possibility that the guest
>   * will calibrate under heavy load - thus, getting a lower lpj -
> @@ -248,7 +272,17 @@ void __init kvmclock_init(void)
>   memblock_free(mem, size);
>   return;
>   }
> - pv_time_ops.sched_clock = kvm_clock_read;
> +
> + if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
> + pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
> +
> + cpu = get_cpu();
> + vcpu_time = &hv_clock[cpu].pvti;
> + flags = pvclock_read_flags(vcpu_time);
> +
> + kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
> + put_cpu();
> +
>   x86_platform.calibrate_tsc = kvm_get_tsc_khz;
>   x86_platform.get_wallclock = kvm_get_wallclock;
>   x86_platform.set_wallclock = kvm_set_wallclock;
> @@ -265,16 +299,6 @@ void __init kvmclock_init(void)
>   kvm_get_preset_lpj();
>   clocksource_register_hz(&kvm_clock, NSEC_PER_SEC);
>   pv_info.name = "KVM";
> -
> - if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
> - pvclock_set_flags(~0);
> -
> - cpu = get_cpu();
> - vcpu_time = &hv_clock[cpu].pvti;
> - flags = pvclock_read_flags(vcpu_time);
> - if (flags & PVCLOCK_COUNTS_FROM_ZERO)
> - set_sched_clock_stable();
> - put_cpu();
>  }
>  
>  int __init kvm_setup_vsyscall_timeinfo(void)
> -- 
> 2.5.2

ACK



Re: [PATCH v1] Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR"

2015-09-22 Thread Marcelo Tosatti
On Tue, Sep 22, 2015 at 06:33:46PM +0200, Radim Krčmář wrote:
> PVCLOCK_COUNTS_FROM_ZERO broke ABI and (at least) three things with it.
> All problems stem from repeated writes to MSR_KVM_SYSTEM_TIME(_NEW).
> The reverted patch treated the MSR write as a one-shot initializer:
> any write from VCPU 0 would reset system_time.
> 
> And this is what broke for Linux guests:
> * Onlining/hotplugging of VCPU 0
> 
>   VCPU has to assign an address to KVM clock before use, which is done
>   with MSR_KVM_SYSTEM_TIME_NEW.  Linux has an idea that time should not
>   jump backward, so any `sleep` won't return before |system_time at the
>   point of offline| elapses since the online.  Be sure to run ntp.
> 
> * S3 and S4 resume
> 
>   If you don't have PVCLOCK_TSC_STABLE_BIT in Linux, resume will freeze
>   for |system_time at the point of suspend|, because pvclock ensures
>   monoticity and kvmclock did not think about it.
> 
>   If you have stable clock, execution will resume immediately, but
>   restoring KVM clock writes to the MSR and dmesg starts to count from
>   zero.  It's better than the onlining, but not what we want either.
> 
> * Boot of SLES 10 guest
> 
>   SLES 10 has a custom implementation of kvm clock that calls
>   MSR_KVM_SYSTEM_TIME before every read to enhance precision ...
>   Two things are happening at the same time:
>   1) The guest periodically receives an interrupt that is handled by
>  main_timer_handler():
>  a) get time using the kvm clock:
> 1) write the address to MSR_KVM_SYSTEM_TIME
> 2) read tsc and pvclock (tsc_offset, system_time)
> 3) time = tsc - tsc_offset + system_time
>  b) compute time since the last main_timer_handler()
>  c) bump jiffies if enough time has elapsed
>   2) the guest wants to calibrate loops per jiffy [1]:
>  a) read tsc
>  b) loop till jiffies increase
>  c) compute lpj
> 
>   Because (1a1) always resets the system_time to 0, we read the same
>   value over and over so the condition for (1c) is never true and
>   jiffies remain constant.  A hang happens in (2b) as it is the first
>   place that depends on jiffies.
> 
> 
> We could make a hypervisor workaround for this, but that is just asking
> for more trouble.  Luckily, reverting does not break guests that
> learned about PVCLOCK_COUNTS_FROM_ZERO in new ways.
> Only 4.2+ guests with NOHZ_FULL wanted PVCLOCK_COUNTS_FROM_ZERO, which
> is a good trade-off for not regressing.
> 
> This reverts commit b7e60c5aedd2b63f16ef06fde4f81ca032211bc5.
> And adds a note to the definition of PVCLOCK_COUNTS_FROM_ZERO.
> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Radim Krčmář 
> ---
>  v1: Extended commit message based on a discussion with Marcelo
> 
>  arch/x86/include/asm/pvclock-abi.h | 1 +
>  arch/x86/kvm/x86.c | 4 ----
>  2 files changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pvclock-abi.h 
> b/arch/x86/include/asm/pvclock-abi.h
> index 655e07a48f6c..67f08230103a 100644
> --- a/arch/x86/include/asm/pvclock-abi.h
> +++ b/arch/x86/include/asm/pvclock-abi.h
> @@ -41,6 +41,7 @@ struct pvclock_wall_clock {
>  
>  #define PVCLOCK_TSC_STABLE_BIT   (1 << 0)
>  #define PVCLOCK_GUEST_STOPPED(1 << 1)
> +/* PVCLOCK_COUNTS_FROM_ZERO broke ABI and can't be used anymore. */
>  #define PVCLOCK_COUNTS_FROM_ZERO (1 << 2)
>  #endif /* __ASSEMBLY__ */
>  #endif /* _ASM_X86_PVCLOCK_ABI_H */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4bca39f0fdb3..71731994d897 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1711,8 +1711,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>   vcpu->pvclock_set_guest_stopped_request = false;
>   }
>  
> - pvclock_flags |= PVCLOCK_COUNTS_FROM_ZERO;
> -
>   /* If the host uses TSC clocksource, then it is stable */
>   if (use_master_clock)
>   pvclock_flags |= PVCLOCK_TSC_STABLE_BIT;
> @@ -2010,8 +2008,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info)
>   &vcpu->requests);
>  
>   ka->boot_vcpu_runs_old_kvmclock = tmp;
> -
> - ka->kvmclock_offset = -get_kernel_ns();
>   }
>  
>   vcpu->arch.time = data;
> -- 
> 2.5.3

NACK, please use original patchset.


Re: [PATCH 2/2] Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR"

2015-09-22 Thread Marcelo Tosatti
On Fri, Sep 18, 2015 at 05:54:30PM +0200, Radim Krčmář wrote:
> Shifting pvclock_vcpu_time_info.system_time on write to KVM system time
> MSR is a change of ABI.  Probably only 2.6.16 based SLES 10 breaks due
> to its custom enhancements to kvmclock, but KVM never declared the MSR
> only for one-shot initialization.  (Doc says that only one write is
> needed.)
> 
> This reverts commit b7e60c5aedd2b63f16ef06fde4f81ca032211bc5.
> And adds a note to the definition of PVCLOCK_COUNTS_FROM_ZERO.
> 
> Signed-off-by: Radim Krčmář 
> ---
>  arch/x86/include/asm/pvclock-abi.h | 1 +
>  arch/x86/kvm/x86.c | 4 ----
>  2 files changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pvclock-abi.h 
> b/arch/x86/include/asm/pvclock-abi.h
> index 655e07a48f6c..67f08230103a 100644
> --- a/arch/x86/include/asm/pvclock-abi.h
> +++ b/arch/x86/include/asm/pvclock-abi.h
> @@ -41,6 +41,7 @@ struct pvclock_wall_clock {
>  
>  #define PVCLOCK_TSC_STABLE_BIT   (1 << 0)
>  #define PVCLOCK_GUEST_STOPPED(1 << 1)
> +/* PVCLOCK_COUNTS_FROM_ZERO broke ABI and can't be used anymore. */
>  #define PVCLOCK_COUNTS_FROM_ZERO (1 << 2)
>  #endif /* __ASSEMBLY__ */
>  #endif /* _ASM_X86_PVCLOCK_ABI_H */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 18d59b584dee..34d33f4757d2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1707,8 +1707,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>   vcpu->pvclock_set_guest_stopped_request = false;
>   }
>  
> - pvclock_flags |= PVCLOCK_COUNTS_FROM_ZERO;
> -
>   /* If the host uses TSC clocksource, then it is stable */
>   if (use_master_clock)
>   pvclock_flags |= PVCLOCK_TSC_STABLE_BIT;
> @@ -2006,8 +2004,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info)
>   &vcpu->requests);
>  
>   ka->boot_vcpu_runs_old_kvmclock = tmp;
> -
> - ka->kvmclock_offset = -get_kernel_ns();
>   }
>  
>   vcpu->arch.time = data;
> -- 
> 2.5.2

ACK



Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.

2015-09-21 Thread Marcelo Tosatti
> So either:
> 
> Proceed with guest solution:
> -> Make sure the overflow can't happen (and write down why not in the
> code). Don't assume a small delta between kvmclock values of vcpus.
> -> Handle stable -> non-stable kvmclock transition.
> -> kvmclock counts from zero should not depend on stable kvmclock
> (because nohz_full should work on hpet host systems).
> 
> Enable counts-from-zero on MSR_KVM_SYSTEM_TIME_NEW:
> -> Figure out whats wrong with different kvmclock values on hotplug, 
> and fix it.

Find data which allows you to differentiate between hotplug of vCPU-0
and system initialization.

Easy one: whether MSR_KVM_SYSTEM_TIME_NEW contains valid data (that is,
kvmclock is enabled) for vCPUs other than vCPU-0.

This can't be the case on system initialization (otherwise the host will
be corrupting guest memory), and must be the case when hotplugging
vCPU-0.
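
Sketched on the host side (hypothetical helper; bit 0 of the MSR value
is the kvmclock enable bit, and vcpu->arch.time holds the last written
value):

	/* At vCPU-0's write to MSR_KVM_SYSTEM_TIME_NEW: if any other
	 * vCPU already has kvmclock enabled, this is a vCPU-0 hot-plug,
	 * not system initialization, so do not reset the clock base. */
	static bool vcpu0_write_is_hotplug(struct kvm *kvm)
	{
		struct kvm_vcpu *vcpu;
		int i;

		kvm_for_each_vcpu(i, vcpu, kvm)
			if (vcpu->vcpu_id != 0 && (vcpu->arch.time & 1))
				return true;
		return false;
	}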



Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.

2015-09-21 Thread Marcelo Tosatti
On Tue, Sep 22, 2015 at 12:00:39AM +0200, Radim Krčmář wrote:
> 2015-09-21 17:53-0300, Marcelo Tosatti:
> > On Mon, Sep 21, 2015 at 10:00:27PM +0200, Radim Krčmář wrote:
> >> 2015-09-21 12:52-0300, Marcelo Tosatti:
> >> > On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim Krčmář wrote:
> >> >> 2015-09-20 19:57-0300, Marcelo Tosatti:
> >> >>> Is it counting from zero that breaks SLES10?
> >> >> 
> >> >> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did.
> >> >> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to,
> >> >> while still keeping system time;  we used to allow that, which means an
> >> >> ABI breakage.  (And we can't even say that guest's behaviour is against
> >> >> the spec ...)
> >> > 
> >> > Because this behaviour was not defined.
> >> 
> >> It is defined by implementation.
> >> 
> >> > Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on
> >> > boot_vcpu_runs_old_kvmclock == false? 
> >> > The patch would be much simpler.
> >> 
> >> If you mean the hunk in cover letter, I don't like it because we presume
> >> that no other guests were broken.
> > 
> > Yes patch 1.
> 
> (I'd call them: a hunk in cover letter [0/2], patch 1 = [1/2], and
>  patch 2 = [2/2].)
> 
> > Do you have an example of another guest that is broken? 
> 
> Yes, the hot-plug in any relatively recent Linux guest.
> 
> > Really, assuming things about the value of the MSR read when you
> > write to the MSR is very awkward. That SuSE code is incorrect; it fails
> > on other occasions as well (see
> > 54750f2cf042c42b4223d67b1bd20138464bde0e).
> 
> Yeah, I even read the SUSE implementation, sad times.
> 
> >> I really don't like it 
> > 
> > Because it changes behaviour that was previously unspecified?
> 
> No, because it adds another exemption to already ugly code.
> (Talking about the hunk in [0/2].)
> 
> >> so I thought about other problems with
> >> PVCLOCK_COUNTS_FROM_ZERO ... have you tried to hot-replug VCPU 0?
> > 
> > Can't unplug VCPU 0, I think.
> 
> You can.
> 
> >> It doesn't work well ;)
> > 
> > Can you hot-unplug vcpu 0? 
> 
> I can, I did, hot-plug bugged.

Due to the PVCLOCK_COUNTS_FROM_ZERO patch?

> >> > The problem is, "selecting one read as the initial point" is inherently
> >> > racy: that delta is relative to one moment (kvmclock read) at one vcpu,
> >> > but must be applied to all vcpus.
> >> 
> >> I don't think that is a problem.
> >> 
> >> Kvmclock has a notion of a global system_time in nanoseconds (one value
> >> that defines the time, assigned with VM ioctl KVM_SET_CLOCK) and tries
> >> to propagate system_time into guest VCPUs as precisely as possible with
> >> the help of TSC.
> >
> >Different pairs of values (system_time, tsc reads) are fundamentally
> >broken. This is why 
> >
> >commit 489fb490dbf8dab0249ad82b56688ae3842a79e8
> >x86, paravirt: Add a global synchronization point for pvclock
> >
> >Exists.
> >
> >Then to guarantee monotonicity you need to stop those updates
> >(or perform the pair update in lock step):
> >
> >commit d828199e84447795c6669ff0e6c6d55eb9beeff6
> >KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag
> 
> (Thanks for the commits, split values are the core design that avoids
>  modifying rdtsc, 

Which is broken (that's the point).

> so I'll be glad to read its details.)
> 
> >> >  2) You rely on monotonicity across vcpus to perform 
> >> > the 'minus delta that was read on vcpu0' calculation, but 
> >> > monotonicity across vcpus can fail during runtime
> >> >(say host clocksource goes tsc->hpet due to tsc instability).
> >> 
> >> That could be a problem, but I'm adding a VCPU independent constant to
> >> all reads -- does the new code rely on monoticity in places where the
> >> old one didn't?
> > 
> > The problem is overflow:
> > kvmclock non-monotonic across vcpus AND delta subtraction:
> > 
> > ptime | vcpu0 kvmclock time | vcpu0 sched_clock | vcpu1 kvmclock time | vcpu1 sched_clock
> >   1   |          1          |         1         |                     |
> >   2   |          2          |         2         |          1          |
> 
> KVM sched clock not used before this point (I even moved the code to
> make

Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.

2015-09-21 Thread Marcelo Tosatti
On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim Krčmář wrote:
> 2015-09-20 19:57-0300, Marcelo Tosatti:
> > On Fri, Sep 18, 2015 at 05:54:28PM +0200, Radim Krčmář wrote:
> >> This patch series will be disabling PVCLOCK_COUNTS_FROM_ZERO flag and is
> >> RFC because I haven't explored many potential problems or tested it.
> > 
> > The justification to disable PVCLOCK_COUNTS_FROM_ZERO is because you
> > haven't explored potential problems or tested it? Sorry, can't parse it.
> > 
> >> 
> >> [1/2] uses a different algorithm in the guest to start counting from 0.
> >> [2/2] stops exposing PVCLOCK_COUNTS_FROM_ZERO in the hypervisor.
> >> 
> >> A viable alternative would be to implement opt-in features in kvm clock.
> >> 
> >> And because we probably only broke one old user (the infamous SLES 10), a
> >> workaround like this is also possible: (but I'd rather not do that)
> > 
> > Please describe why SLES 10 breaks in detail: the state of the guest and
> > the host before the patch, the state of the guest and host after the
> > patch.
> 
> 1) The guest periodically receives an interrupt that is handled by
>main_timer_handler():
>a) get time using the kvm clock:
>   1) write the address to MSR_KVM_SYSTEM_TIME
>   2) read tsc and pvclock (tsc_offset, system_time)
>   3) time = tsc - tsc_offset + system_time
>b) compute time since the last main_timer_handler()
>c) bump jiffies if enough time has elapsed
> 2) the guest wants to calibrate loops per jiffy [1]:
>a) read tsc
>b) loop till jiffies increase
>c) compute lpj
> 
> Because (1a1) always resets the system_time to 0, we read the same value
> over and over so the condition for (1c) is never true and jiffies remain
> constant.  This is the problem.  A hang happens in (2b) as it is the
> first place that depends on jiffies.
> 
> > What does SLES10 expect?
> 
> That a write to MSR_KVM_SYSTEM_TIME does not reset the system time.
> 
> > Is it counting from zero that breaks SLES10?
> 
> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did.
> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to,
> while still keeping system time;  we used to allow that, which means an
> ABI breakage.  (And we can't even say that guest's behaviour is against
> the spec ...)

Because this behaviour was not defined.

Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on
boot_vcpu_runs_old_kvmclock == false? 
The patch would be much simpler.

The problem is, "selecting one read as the initial point" is inherently
racy: that delta is relative to one moment (kvmclock read) at one vcpu,
but must be applied to all vcpus.

Besides:

1) Stable sched clock in guest does not depend on
   KVM_FEATURE_CLOCKSOURCE_STABLE_BIT.
2) You rely on monotonicity across vcpus to perform 
   the 'minus delta that was read on vcpu0' calculation, but 
   monotonicity across vcpus can fail during runtime
   (say host clocksource goes tsc->hpet due to tsc instability).
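
To make (2) concrete, the guest-side scheme in [1/2] amounts to
(sketch):

	static u64 kvm_sched_clock_offset;	/* kvm_clock_read() sampled
						 * once, on the boot vcpu */

	static u64 kvm_sched_clock_read(void)
	{
		/*
		 * If this vcpu's kvmclock is behind the value sampled on
		 * the boot vcpu -- possible once the host clocksource
		 * falls back from tsc to hpet -- the u64 subtraction
		 * wraps to a huge value.
		 */
		return kvm_clock_read() - kvm_sched_clock_offset;
	}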


> 
> 
> ---
> 1: I also did diassembly, but the reproducer is easier to paste
>(couldn't find debuginfo)
># qemu-kvm -nographic -kernel vmlinuz-2.6.16.60-0.85.1-default \
> -serial stdio -monitor /dev/null -append 'console=ttyS0'
>   
>   and you can get a bit further when setting loops per jiffy manually,
> -serial stdio -monitor /dev/null -append 'console=ttyS0 lpj=12345678'
> 
>   The dmesg for failing run is
> Initializing CPU#0
> PID hash table entries: 512 (order: 9, 16384 bytes)
> kvm-clock: cpu 0, msr 0:3f6041, boot clock
> kvm_get_tsc_khz: cpu 0, msr 0:e001
> time.c: Using tsc for timekeeping HZ 250
> time.c: Using 100.00 MHz WALL KVM GTOD KVM timer.
> time.c: Detected 2591.580 MHz processor.
> Console: colour VGA+ 80x25
> Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
> Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
> Checking aperture...
> Nosave address range: 0009f000 - 000a
> Nosave address range: 000a - 000f
> Nosave address range: 000f - 0010
> Memory: 124884k/130944k available (1856k kernel code, 5544k reserved, 
> 812k data, 188k init)
> [Infinitely querying kvm clock here ...]
> 
>   With '-cpu kvm64,-kvmclock', the next line is
> Calibrating delay using timer specific routine.. 5199.75 BogoMIPS 
> (lpj=10399519)
> 
>   With 'lpj=10399519',
> Calibrating delay loop (skipped)... 5199.75 BogoMIPS preset
> [Manages to get stuck later, in default_idle.]


Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.

2015-09-21 Thread Marcelo Tosatti
On Mon, Sep 21, 2015 at 10:00:27PM +0200, Radim Krčmář wrote:
> 2015-09-21 12:52-0300, Marcelo Tosatti:
> > On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim Krčmář wrote:
> >> 2015-09-20 19:57-0300, Marcelo Tosatti:
> >>> Is it counting from zero that breaks SLES10?
> >> 
> >> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did.
> >> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to,
> >> while still keeping system time;  we used to allow that, which means an
> >> ABI breakage.  (And we can't even say that guest's behaviour is against
> >> the spec ...)
> > 
> > Because this behaviour was not defined.
> 
> It is defined by implementation.
> 
> > Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on
> > boot_vcpu_runs_old_kvmclock == false? 
> > The patch would be much simpler.
> 
> 
> If you mean the hunk in cover letter, I don't like it because we presume
> that no other guests were broken.

Yes patch 1.

Do you have an example of another guest that is broken? 

Really, assuming things about the value of the MSR read when you
write to the MSR is very awkward. That SuSE code is incorrect; it fails
on other occasions as well (see
54750f2cf042c42b4223d67b1bd20138464bde0e).

> I really don't like it 

Because it changes behaviour that was previously unspecified?

> so I thought about other problems with
> PVCLOCK_COUNTS_FROM_ZERO ... have you tried to hot-replug VCPU 0?

Can't unplug VCPU 0, I think.

> It doesn't work well ;)

Can you hot-unplug vcpu 0? 

> We don't want to guess what the guest wants so I'd go for the opt-in
> paravirt feature unless counting from zero can be done in guest alone.

The problem is that it is tricky to do the counting inside the guest.

It's much cleaner and less complicated if the host starts counting from
zero.

> 
> > The problem is, "selecting one read as the initial point" is inherently
> > racy: that delta is relative to one moment (kvmclock read) at one vcpu,
> > but must be applied to all vcpus.
> 
> I don't think that is a problem.
> 
> Kvmclock has a notion of a global system_time in nanoseconds (one value
> that defines the time, assigned with VM ioctl KVM_SET_CLOCK) and tries
> to propagate system_time into guest VCPUs as precisely as possible with
> the help of TSC.

Different pairs of values (system_time, tsc read) across vcpus are
fundamentally broken. This is why

commit 489fb490dbf8dab0249ad82b56688ae3842a79e8
x86, paravirt: Add a global synchronization point for pvclock

Exists.

Then to guarantee monotonicity you need to stop those updates
(or perform the pair update in lock step):

commit d828199e84447795c6669ff0e6c6d55eb9beeff6
KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag
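
For reference, the core of 489fb490dbf8 is roughly this (abbreviated
from arch/x86/kernel/pvclock.c):

	static atomic64_t last_value = ATOMIC64_INIT(0);

	/* in pvclock_clocksource_read(), after computing ret: never
	 * return a value behind what some other vcpu already returned */
	last = atomic64_read(&last_value);
	do {
		if (ret < last)
			return last;
		last = atomic64_cmpxchg(&last_value, last, ret);
	} while (unlikely(last != ret));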


> sched_clock uses kvmclock to get nanoseconds since the system was
> brought up and [1/2] only works with this abstracted ns count.
> A poorly synchronized initial read is irrelevant because all VCPUs will
> be using the same constant offset.
> (We can never know the precise time anyway.)
> 
> > Besides:
> > 
> > 1) Stable sched clock in guest does not depend on
> >KVM_FEATURE_CLOCKSOURCE_STABLE_BIT.
> 
> Yes, thanks, I will remove that requirement in v1.  (We'd need to
> improve a loss of PVCLOCK_TSC_STABLE_BIT otherwise anyway.)
> 
> Because the clutchy dependency on PVCLOCK_TSC_STABLE_BIT is going away,
> there is now one possible unsigned overflow: in case the clock was very
> imprecise and VCPU1 managed to get smaller system_time than VCPU0 at the
> time of initialization.  It's very unlikely that we'll ever reach legal
> overflow so I can add a condition there.
> 
> > 2) You rely on monotonicity across vcpus to perform 
> >the 'minus delta that was read on vcpu0' calculation, but 
> >monotonicity across vcpus can fail during runtime
> >(say host clocksource goes tsc->hpet due to tsc instability).
> 
> That could be a problem, but I'm adding a VCPU independent constant to
> all reads -- does the new code rely on monoticity in places where the
> old one didn't?

The problem is overflow:
kvmclock non-monotonic across vcpus AND delta subtraction:

ptime | vcpu0 kvmclock time | vcpu0 sched_clock | vcpu1 kvmclock time | vcpu1 sched_clock
  1   |          1          |         1         |                     |
  2   |          2          |         2         |          1          |
  3   |          3          |         0         |          2          |        -1
  4   |          4          |         1         |          3          |         0
  5   |          5          |         2         |          4          |         1
  6   |          6          |         3         |          5          |         2
  7   |          7          |         4         |                     |

Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.

2015-09-20 Thread Marcelo Tosatti
On Fri, Sep 18, 2015 at 05:54:28PM +0200, Radim Krčmář wrote:
> This patch series will be disabling PVCLOCK_COUNTS_FROM_ZERO flag and is
> RFC because I haven't explored many potential problems or tested it.

The justification to disable PVCLOCK_COUNTS_FROM_ZERO is because you
haven't explored potential problems or tested it? Sorry, can't parse it.

> 
> [1/2] uses a different algorithm in the guest to start counting from 0.
> [2/2] stops exposing PVCLOCK_COUNTS_FROM_ZERO in the hypervisor.
> 
> A viable alternative would be to implement opt-in features in kvm clock.
> 
> And because we probably only broke one old user (the infamous SLES 10), a
> workaround like this is also possible: (but I'd rather not do that)

Please describe why SLES 10 breaks in detail: the state of the guest and
the host before the patch, the state of the guest and host after the
patch.

What does SLES10 expect?
Is it counting from zero that breaks SLES10? 

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a60bdbccff51..ae9049248aaf 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2007,7 +2007,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info)
>  
>   ka->boot_vcpu_runs_old_kvmclock = tmp;
>  
> - ka->kvmclock_offset = -get_kernel_ns();
> + if (!ka->boot_vcpu_runs_old_kvmclock)
> + ka->kvmclock_offset = -get_kernel_ns();
>   }
>  
>   vcpu->arch.time = data;
> 
> 
> Radim Krčmář (2):
>   x86: kvmclock: abolish PVCLOCK_COUNTS_FROM_ZERO
>   Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock
> system MSR"
> 
>  arch/x86/include/asm/pvclock-abi.h |  1 +
>  arch/x86/kernel/kvmclock.c | 46 +++++++++++++++++++++++++++++++++++-----------
>  arch/x86/kvm/x86.c |  4 
>  3 files changed, 36 insertions(+), 15 deletions(-)
> 
> -- 
> 2.5.2


[GIT PULL] KVM fixes for 4.1-rc8

2015-06-16 Thread Marcelo Tosatti

Linus,

Please pull from

git://git.kernel.org/pub/scm/virt/kvm/kvm.git master

To receive the following KVM bug fix, which restores 
APIC migration functionality.


Radim Krčmář (1):
  KVM: x86: fix lapic.timer_mode on restore

 arch/x86/kvm/lapic.c |   26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)


Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU is blocked

2015-06-05 Thread Marcelo Tosatti
On Tue, Apr 14, 2015 at 07:37:44AM +0000, Wu, Feng wrote:
 
 
  -Original Message-
  From: Marcelo Tosatti [mailto:mtosa...@redhat.com]
  Sent: Tuesday, March 31, 2015 7:56 AM
  To: Wu, Feng
  Cc: h...@zytor.com; t...@linutronix.de; mi...@redhat.com; x...@kernel.org;
  g...@kernel.org; pbonz...@redhat.com; dw...@infradead.org;
  j...@8bytes.org; alex.william...@redhat.com; jiang@linux.intel.com;
  eric.au...@linaro.org; linux-ker...@vger.kernel.org;
  io...@lists.linux-foundation.org; kvm@vger.kernel.org
  Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU
  is blocked
  
   On Mon, Mar 30, 2015 at 04:46:55AM +0000, Wu, Feng wrote:
  
  
-Original Message-
From: Marcelo Tosatti [mailto:mtosa...@redhat.com]
Sent: Saturday, March 28, 2015 3:30 AM
To: Wu, Feng
Cc: h...@zytor.com; t...@linutronix.de; mi...@redhat.com;
  x...@kernel.org;
g...@kernel.org; pbonz...@redhat.com; dw...@infradead.org;
j...@8bytes.org; alex.william...@redhat.com; jiang@linux.intel.com;
eric.au...@linaro.org; linux-ker...@vger.kernel.org;
io...@lists.linux-foundation.org; kvm@vger.kernel.org
Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when
  vCPU
is blocked
   
    On Fri, Mar 27, 2015 at 06:34:14AM +0000, Wu, Feng wrote:
   Currently, the following code is executed before 
   local_irq_disable() is
called,
   so do you mean 1)moving local_irq_disable() to the place before 
   it. 2)
  after
  interrupt
   is disabled, set KVM_REQ_EVENT in case the ON bit is set?
 
  2) after interrupt is disabled, set KVM_REQ_EVENT in case the ON bit
  is set.

 Here is my understanding about your comments here:
 - Disable interrupts
 - Check 'ON'
 - Set KVM_REQ_EVENT if 'ON' is set

 Then we can put the above code inside  if
(kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) 
 just like it used to be. However, I still have some questions about 
 this
comment:

 1. Where should I set KVM_REQ_EVENT? In function vcpu_enter_guest(),
  or
other places?
   
See below:
   
 If in vcpu_enter_guest(), since currently local_irq_disable() is 
 called after
'KVM_REQ_EVENT'
 is checked, is it helpful to set KVM_REQ_EVENT after 
 local_irq_disable() is
called?
   
local_irq_disable();
   
*** add code here ***
  
   So we need add code like the following here, right?
  
 if ('ON' is set)
 kvm_make_request(KVM_REQ_EVENT, vcpu);
  
 
 Hi Marcelo,
 
 I changed the code as above, then I found that the ping latency was extremely 
 big, (70ms - 400ms).
 I digged into it and got the root cause. We cannot use checking-on as the 
 judgment, since 'ON'
 can be cleared by hypervisor software in lots of places. In this case, 
 KVM_REQ_EVENT cannot be
 set when we check 'ON' bit, hence the interrupts are not injected to the 
 guest in time.
 
 Please refer to the following code, in which 'ON' bit can be cleared:
 
 apic_find_highest_irr () -- vmx_sync_pir_to_irr () -- pi_test_and_clear_on()
 
 Searching from the code step by step, apic_find_highest_irr() can be called 
 by many other guys.
 
 Thanks,

Ok then, ignore my suggestion.

Can you resend the latest version please ?




[patch v2 0/3] kvmclock: allow stable sched clock

2015-05-28 Thread Marcelo Tosatti
kvmclock provides the behaviour sched_clock users expect.
Mark it as stable, allowing nohz_full in guests.
See individual patches for more details.

v2: address comments from Paolo:
- set COUNTS_FROM_ZERO unconditionally (host).
- rely on KVM_FEATURE_CLOCKSOURCE_STABLE_BIT (guest).




[patch v2 3/3] KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR

2015-05-28 Thread Marcelo Tosatti
Initialize the kvmclock base at kvmclock system MSR write time,
so that the guest sees kvmclock counting from zero.

This matches bare-metal behaviour when kvmclock in the guest
sets sched clock stable.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

---
 arch/x86/kvm/x86.c |    4 ++++
 1 file changed, 4 insertions(+)

Index: kvm/arch/x86/kvm/x86.c
===
--- kvm.orig/arch/x86/kvm/x86.c 2015-05-28 19:18:12.621372286 -0300
+++ kvm/arch/x86/kvm/x86.c  2015-05-28 19:19:17.738268690 -0300
@@ -1700,6 +1700,8 @@
 	vcpu->pvclock_set_guest_stopped_request = false;
 	}
 
+	/* pvclock counts from zero */
+	pvclock_flags |= PVCLOCK_COUNTS_FROM_ZERO;
 	/* If the host uses TSC clocksource, then it is stable */
 	if (use_master_clock)
 		pvclock_flags |= PVCLOCK_TSC_STABLE_BIT;
@@ -2282,6 +2284,8 @@
 			&vcpu->requests);
 
 		ka->boot_vcpu_runs_old_kvmclock = tmp;
+
+		ka->kvmclock_offset = -get_kernel_ns();
 	}
 
 	vcpu->arch.time = data;




[patch v2 2/3] x86: kvmclock: set scheduler clock stable

2015-05-28 Thread Marcelo Tosatti
From: Luiz Capitulino lcapitul...@redhat.com

If you try to enable NOHZ_FULL on a guest today, you'll get
the following error when the guest tries to deactivate the
scheduler tick:

 WARNING: CPU: 3 PID: 2182 at kernel/time/tick-sched.c:192 
can_stop_full_tick+0xb9/0x290()
 NO_HZ FULL will not work with unstable sched clock
 CPU: 3 PID: 2182 Comm: kworker/3:1 Not tainted 4.0.0-10545-gb9bb6fb #204
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 Workqueue: events flush_to_ldisc
  8162a0c7 88011f583e88 814e6ba0 0002
  88011f583ed8 88011f583ec8 8104d095 88011f583eb8
   0003 0001 0001
 Call Trace:
  IRQ  [814e6ba0] dump_stack+0x4f/0x7b
  [8104d095] warn_slowpath_common+0x85/0xc0
  [8104d146] warn_slowpath_fmt+0x46/0x50
  [810bd2a9] can_stop_full_tick+0xb9/0x290
  [810bd9ed] tick_nohz_irq_exit+0x8d/0xb0
  [810511c5] irq_exit+0xc5/0x130
  [814f180a] smp_apic_timer_interrupt+0x4a/0x60
  [814eff5e] apic_timer_interrupt+0x6e/0x80
  EOI  [814ee5d1] ? _raw_spin_unlock_irqrestore+0x31/0x60
  [8108bbc8] __wake_up+0x48/0x60
  [8134836c] n_tty_receive_buf_common+0x49c/0xba0
  [8134a6bf] ? tty_ldisc_ref+0x1f/0x70
  [81348a84] n_tty_receive_buf2+0x14/0x20
  [8134b390] flush_to_ldisc+0xe0/0x120
  [81064d05] process_one_work+0x1d5/0x540
  [81064c81] ? process_one_work+0x151/0x540
  [81065191] worker_thread+0x121/0x470
  [81065070] ? process_one_work+0x540/0x540
  [8106b4df] kthread+0xef/0x110
  [8106b3f0] ? __kthread_parkme+0xa0/0xa0
  [814ef4f2] ret_from_fork+0x42/0x70
  [8106b3f0] ? __kthread_parkme+0xa0/0xa0
 ---[ end trace 06e3507544a38866 ]---

However, it turns out that kvmclock does provide a stable
sched_clock callback. So, let the scheduler know this which
in turn makes NOHZ_FULL work in the guest.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Luiz Capitulino lcapitul...@redhat.com

---
 arch/x86/kernel/kvmclock.c |   14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

Index: kvm/arch/x86/kernel/kvmclock.c
===
--- kvm.orig/arch/x86/kernel/kvmclock.c 2015-05-27 18:00:53.616391551 -0300
+++ kvm/arch/x86/kernel/kvmclock.c  2015-05-28 19:19:13.878274636 -0300
@@ -24,6 +24,7 @@
 #include <linux/percpu.h>
 #include <linux/hardirq.h>
 #include <linux/memblock.h>
+#include <linux/sched.h>
 
 #include <asm/x86_init.h>
 #include <asm/reboot.h>
@@ -217,8 +218,10 @@
 
 void __init kvmclock_init(void)
 {
+   struct pvclock_vcpu_time_info *vcpu_time;
unsigned long mem;
-   int size;
+   int size, cpu;
+   u8 flags;
 
size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
 
@@ -264,7 +267,14 @@
	pv_info.name = "KVM";
 
if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
-   pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
+   pvclock_set_flags(~0);
+
+   cpu = get_cpu();
+	vcpu_time = &hv_clock[cpu].pvti;
+	flags = pvclock_read_flags(vcpu_time);
+	if (flags & PVCLOCK_COUNTS_FROM_ZERO)
+   set_sched_clock_stable();
+   put_cpu();
 }
 
 int __init kvm_setup_vsyscall_timeinfo(void)




[patch v2 1/3] x86: kvmclock: add flag to indicate pvclock counts from zero

2015-05-28 Thread Marcelo Tosatti
Setting sched clock stable for kvmclock causes the printk timestamps
to not start from zero, which is different from bare metal and
can possibly break userspace. Add a flag to indicate that the
hypervisor sets the clock base to zero when kvmclock is initialized.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

---
 arch/x86/include/asm/pvclock-abi.h |1 +
 1 file changed, 1 insertion(+)

Index: kvm/arch/x86/include/asm/pvclock-abi.h
===
--- kvm.orig/arch/x86/include/asm/pvclock-abi.h 2014-11-06 23:59:14.615913334 -0200
+++ kvm/arch/x86/include/asm/pvclock-abi.h  2015-05-27 17:40:53.435192771 -0300
@@ -41,5 +41,6 @@
 
 #define PVCLOCK_TSC_STABLE_BIT (1 << 0)
 #define PVCLOCK_GUEST_STOPPED  (1 << 1)
+#define PVCLOCK_COUNTS_FROM_ZERO (1 << 2)
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_X86_PVCLOCK_ABI_H */




[patch 2/3] x86: kvmclock: set scheduler clock stable

2015-05-27 Thread Marcelo Tosatti
From: Luiz Capitulino lcapitul...@redhat.com

If you try to enable NOHZ_FULL on a guest today, you'll get
the following error when the guest tries to deactivate the
scheduler tick:

 WARNING: CPU: 3 PID: 2182 at kernel/time/tick-sched.c:192 
can_stop_full_tick+0xb9/0x290()
 NO_HZ FULL will not work with unstable sched clock
 CPU: 3 PID: 2182 Comm: kworker/3:1 Not tainted 4.0.0-10545-gb9bb6fb #204
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 Workqueue: events flush_to_ldisc
  8162a0c7 88011f583e88 814e6ba0 0002
  88011f583ed8 88011f583ec8 8104d095 88011f583eb8
   0003 0001 0001
 Call Trace:
  IRQ  [814e6ba0] dump_stack+0x4f/0x7b
  [8104d095] warn_slowpath_common+0x85/0xc0
  [8104d146] warn_slowpath_fmt+0x46/0x50
  [810bd2a9] can_stop_full_tick+0xb9/0x290
  [810bd9ed] tick_nohz_irq_exit+0x8d/0xb0
  [810511c5] irq_exit+0xc5/0x130
  [814f180a] smp_apic_timer_interrupt+0x4a/0x60
  [814eff5e] apic_timer_interrupt+0x6e/0x80
  EOI  [814ee5d1] ? _raw_spin_unlock_irqrestore+0x31/0x60
  [8108bbc8] __wake_up+0x48/0x60
  [8134836c] n_tty_receive_buf_common+0x49c/0xba0
  [8134a6bf] ? tty_ldisc_ref+0x1f/0x70
  [81348a84] n_tty_receive_buf2+0x14/0x20
  [8134b390] flush_to_ldisc+0xe0/0x120
  [81064d05] process_one_work+0x1d5/0x540
  [81064c81] ? process_one_work+0x151/0x540
  [81065191] worker_thread+0x121/0x470
  [81065070] ? process_one_work+0x540/0x540
  [8106b4df] kthread+0xef/0x110
  [8106b3f0] ? __kthread_parkme+0xa0/0xa0
  [814ef4f2] ret_from_fork+0x42/0x70
  [8106b3f0] ? __kthread_parkme+0xa0/0xa0
 ---[ end trace 06e3507544a38866 ]---

However, it turns out that kvmclock does provide a stable
sched_clock callback. So, let the scheduler know this, which
in turn makes NOHZ_FULL work in the guest.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Luiz Capitulino lcapitul...@redhat.com

---
 arch/x86/kernel/kvmclock.c |   17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

Index: kvm/arch/x86/kernel/kvmclock.c
===
--- kvm.orig/arch/x86/kernel/kvmclock.c 2015-05-27 18:00:53.616391551 -0300
+++ kvm/arch/x86/kernel/kvmclock.c  2015-05-27 22:43:14.474432962 -0300
@@ -24,6 +24,7 @@
 #include <linux/percpu.h>
 #include <linux/hardirq.h>
 #include <linux/memblock.h>
+#include <linux/sched.h>
 
 #include <asm/x86_init.h>
 #include <asm/reboot.h>
@@ -217,8 +218,10 @@
 
 void __init kvmclock_init(void)
 {
+   struct pvclock_vcpu_time_info *vcpu_time;
unsigned long mem;
-   int size;
+   int size, cpu;
+   u8 flags;
 
size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
 
@@ -263,8 +266,18 @@
clocksource_register_hz(&kvm_clock, NSEC_PER_SEC);
pv_info.name = "KVM";
 
+   flags = PVCLOCK_COUNTS_FROM_ZERO;
if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
-   pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
+   flags |= PVCLOCK_TSC_STABLE_BIT;
+
+   pvclock_set_flags(flags);
+
+   cpu = get_cpu();
+   vcpu_time = &hv_clock[cpu].pvti;
+   flags = pvclock_read_flags(vcpu_time);
+   if (flags & PVCLOCK_COUNTS_FROM_ZERO)
+   set_sched_clock_stable();
+   put_cpu();
 }
 
 int __init kvm_setup_vsyscall_timeinfo(void)




[patch 1/3] x86: kvmclock: add flag to indicate pvclock counts from zero

2015-05-27 Thread Marcelo Tosatti
Setting sched clock stable for kvmclock causes the printk timestamps
to not start from zero, which is different from baremetal and
can possibly break userspace. Add a flag to indicate that the
hypervisor sets the clock base to zero when kvmclock is initialized.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

---
 arch/x86/include/asm/pvclock-abi.h |1 +
 1 file changed, 1 insertion(+)

Index: kvm/arch/x86/include/asm/pvclock-abi.h
===
--- kvm.orig/arch/x86/include/asm/pvclock-abi.h 2014-11-06 23:59:14.615913334 -0200
+++ kvm/arch/x86/include/asm/pvclock-abi.h  2015-05-27 17:40:53.435192771 -0300
@@ -41,5 +41,6 @@
 
 #define PVCLOCK_TSC_STABLE_BIT (1 << 0)
 #define PVCLOCK_GUEST_STOPPED  (1 << 1)
+#define PVCLOCK_COUNTS_FROM_ZERO (1 << 2)
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_X86_PVCLOCK_ABI_H */




[patch 0/3] kvmclock: allow stable sched clock

2015-05-27 Thread Marcelo Tosatti
kvmclock provides the behaviour sched_clock users expect.
Mark it as stable allowing nohz_full in guests.
See individual patches for more details.




[patch 3/3] KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR

2015-05-27 Thread Marcelo Tosatti
Initialize the kvmclock base at kvmclock system MSR write time,
so that the guest sees kvmclock counting from zero.

This matches baremetal behaviour when kvmclock in guest
sets sched clock stable.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

---
 arch/x86/kvm/x86.c |    5 +++++
 1 file changed, 5 insertions(+)

Index: kvm/arch/x86/kvm/x86.c
===
--- kvm.orig/arch/x86/kvm/x86.c 2015-05-27 17:40:46.948189811 -0300
+++ kvm/arch/x86/kvm/x86.c  2015-05-27 22:43:47.340413347 -0300
@@ -1703,6 +1703,8 @@
/* If the host uses TSC clocksource, then it is stable */
if (use_master_clock)
pvclock_flags |= PVCLOCK_TSC_STABLE_BIT;
+   if (ka->kvmclk_counts_from_zero)
+   pvclock_flags |= PVCLOCK_COUNTS_FROM_ZERO;
 
vcpu->hv_clock.flags = pvclock_flags;
 
@@ -2282,6 +2284,9 @@
&vcpu->requests);
 
ka->boot_vcpu_runs_old_kvmclock = tmp;
+
+   ka->kvmclock_offset = -get_kernel_ns();
+   ka->kvmclk_counts_from_zero = true;
}
 
vcpu->arch.time = data;




[PATCH v2] KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR

2015-05-23 Thread Marcelo Tosatti
Initialize the kvmclock base at kvmclock system MSR write time,
so that the guest sees kvmclock counting from zero.

This matches baremetal behaviour when kvmclock in guest
sets sched clock stable.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cc2c759..ea40d24 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2188,6 +2188,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
&vcpu->requests);
 
ka->boot_vcpu_runs_old_kvmclock = tmp;
+
+   ka->kvmclock_offset = -get_kernel_ns();
}
 
vcpu->arch.time = data;


KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR

2015-05-21 Thread Marcelo Tosatti

Initialize the kvmclock base at kvmclock system MSR write time,
so that the guest sees kvmclock counting from zero.

This matches baremetal behaviour when kvmclock in guest
sets sched clock stable.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cc2c759..ea40d24 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2188,6 +2188,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
&vcpu->requests);
 
ka->boot_vcpu_runs_old_kvmclock = tmp;
+
+   ka->kvmclock_offset = get_kernel_ns();
}
 
vcpu->arch.time = data;


Re: kvm: odd time values since kvmclock: set scheduler clock stable

2015-05-21 Thread Marcelo Tosatti
On Mon, May 18, 2015 at 10:13:03PM -0400, Sasha Levin wrote:
 On 05/18/2015 10:02 PM, Sasha Levin wrote:
  On 05/18/2015 08:13 PM, Marcelo Tosatti wrote:
  On Mon, May 18, 2015 at 07:45:41PM -0400, Sasha Levin wrote:
  On 05/18/2015 06:39 PM, Marcelo Tosatti wrote:
  On Tue, May 12, 2015 at 07:17:24PM -0400, Sasha Levin wrote:
  Hi all,
 
  I'm seeing odd jump in time values during boot of a KVM guest:
 
  [...]
  [0.00] tsc: Detected 2260.998 MHz processor
  [3376355.247558] Calibrating delay loop (skipped) preset value..
  [...]
 
  I've bisected it to:
 
  Paolo, Sasha,
 
  Although this might seem undesirable, there is no requirement 
  for sched_clock to initialize at 0:
 
  
   *
   * There is no strict promise about the base, although it tends to start
   * at 0 on boot (but people really shouldn't rely on that).
   *
  
 
  Sasha, are you seeing any problem other than the apparent time jump?
 
  Nope, but I've looked at it again and it seems that it jumps to the 
  host's
  clock (that is, in the example above the 3376355 value was the host's 
  clock
  value).
 
 
  Thanks,
  Sasha
  Sasha, that's right. It's the host monotonic clock.
  
  It's worth figuring out if (what) userspace breaks on that. I know it says 
  that
  you shouldn't rely on that, but I'd happily place a bet on at least one 
  userspace
  treating it as seconds since boot or something similar.
 
 Didn't need to go far... In the guest:
 
 # date
 Tue May 19 02:11:46 UTC 2015
 # echo hi > /dev/kmsg
 [3907533.080112] hi
 # dmesg -T
 [Fri Jul  3 07:33:41 2015] hi

Sasha,

Can you give the suggested patch (hypervisor patch...) a try please?
(with a patched guest, obviously).

KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system
MSR




[PATCH v2] KVM: x86: add module parameter to disable periodic kvmclock sync

2015-05-12 Thread Marcelo Tosatti

The periodic kvmclock sync can be an undesired source of latencies.

When running cyclictest on a guest, a latency spike is visible.
With kvmclock periodic sync disabled, the spike is gone.

Guests should use NTP, which means the propagation of NTP corrections
from the host clock is unnecessary.

v2:
- Make parameter read-only (Radim)
- Return early on kvmclock_sync_fn (Andrew)

Reported-and-tested-by: Luiz Capitulino lcapitul...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 986b3f5..4c0c954 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -99,6 +99,9 @@ module_param(ignore_msrs, bool, S_IRUGO | S_IWUSR);
 unsigned int min_timer_period_us = 500;
 module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
 
+static bool __read_mostly kvmclock_periodic_sync = true;
+module_param(kvmclock_periodic_sync, bool, S_IRUGO);
+
 bool kvm_has_tsc_control;
 EXPORT_SYMBOL_GPL(kvm_has_tsc_control);
 u32  kvm_max_guest_tsc_khz;
@@ -1768,6 +1771,9 @@ static void kvmclock_sync_fn(struct work_struct *work)
   kvmclock_sync_work);
struct kvm *kvm = container_of(ka, struct kvm, arch);
 
+   if (!kvmclock_periodic_sync)
+   return;
+
schedule_delayed_work(&kvm->arch.kvmclock_update_work, 0);
schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
KVMCLOCK_SYNC_PERIOD);
@@ -7109,6 +7115,9 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
kvm_write_tsc(vcpu, &msr);
vcpu_put(vcpu);
 
+   if (!kvmclock_periodic_sync)
+   return;
+
schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
KVMCLOCK_SYNC_PERIOD);
 }
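
Usage note (a sketch of how to apply this, assuming the mainline packaging where arch/x86/kvm/x86.c builds into kvm.ko): the parameter is registered read-only (S_IRUGO), so it has to be set when the module is loaded or on the host kernel command line, e.g.

# modprobe kvm kvmclock_periodic_sync=0

or boot the host with kvm.kvmclock_periodic_sync=0.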


Re: [PATCH] kvmclock: set scheduler clock stable

2015-04-28 Thread Marcelo Tosatti
On Thu, Apr 23, 2015 at 05:12:42PM -0400, Luiz Capitulino wrote:
 If you try to enable NOHZ_FULL on a guest today, you'll get
 the following error when the guest tries to deactivate the
 scheduler tick:
 
  WARNING: CPU: 3 PID: 2182 at kernel/time/tick-sched.c:192 
 can_stop_full_tick+0xb9/0x290()
  NO_HZ FULL will not work with unstable sched clock
  CPU: 3 PID: 2182 Comm: kworker/3:1 Not tainted 4.0.0-10545-gb9bb6fb #204
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  Workqueue: events flush_to_ldisc
   8162a0c7 88011f583e88 814e6ba0 0002
   88011f583ed8 88011f583ec8 8104d095 88011f583eb8
    0003 0001 0001
  Call Trace:
   IRQ  [814e6ba0] dump_stack+0x4f/0x7b
   [8104d095] warn_slowpath_common+0x85/0xc0
   [8104d146] warn_slowpath_fmt+0x46/0x50
   [810bd2a9] can_stop_full_tick+0xb9/0x290
   [810bd9ed] tick_nohz_irq_exit+0x8d/0xb0
   [810511c5] irq_exit+0xc5/0x130
   [814f180a] smp_apic_timer_interrupt+0x4a/0x60
   [814eff5e] apic_timer_interrupt+0x6e/0x80
   EOI  [814ee5d1] ? _raw_spin_unlock_irqrestore+0x31/0x60
   [8108bbc8] __wake_up+0x48/0x60
   [8134836c] n_tty_receive_buf_common+0x49c/0xba0
   [8134a6bf] ? tty_ldisc_ref+0x1f/0x70
   [81348a84] n_tty_receive_buf2+0x14/0x20
   [8134b390] flush_to_ldisc+0xe0/0x120
   [81064d05] process_one_work+0x1d5/0x540
   [81064c81] ? process_one_work+0x151/0x540
   [81065191] worker_thread+0x121/0x470
   [81065070] ? process_one_work+0x540/0x540
   [8106b4df] kthread+0xef/0x110
   [8106b3f0] ? __kthread_parkme+0xa0/0xa0
   [814ef4f2] ret_from_fork+0x42/0x70
   [8106b3f0] ? __kthread_parkme+0xa0/0xa0
  ---[ end trace 06e3507544a38866 ]---
 
 However, it turns out that kvmclock does provide a stable
 sched_clock callback. So, let the scheduler know this, which
 in turn makes NOHZ_FULL work in the guest.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 Signed-off-by: Luiz Capitulino lcapitul...@redhat.com
 ---
 
 PS: Original author of this patch is Marcelo. I did most of the
 testing and backported it to an older real-time kernel tree. Works
 like a charm.
 
  arch/x86/kernel/kvmclock.c | 3 +++
  1 file changed, 3 insertions(+)
 
 diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
 index 42caaef..4e03921 100644
 --- a/arch/x86/kernel/kvmclock.c
 +++ b/arch/x86/kernel/kvmclock.c
 @@ -24,6 +24,7 @@
  #include <linux/percpu.h>
  #include <linux/hardirq.h>
  #include <linux/memblock.h>
 +#include <linux/sched.h>
  
  #include <asm/x86_init.h>
  #include <asm/reboot.h>
 @@ -265,6 +266,8 @@ void __init kvmclock_init(void)
  
   if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
   pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
 +
 + set_sched_clock_stable();
  }
  
  int __init kvm_setup_vsyscall_timeinfo(void)
 -- 
 1.9.3
 

Ping?



Re: x86: kvmclock: drop rdtsc_barrier()

2015-04-28 Thread Marcelo Tosatti
On Fri, Apr 24, 2015 at 10:36:14PM -0300, Marcelo Tosatti wrote:
 
 Drop unnecessary rdtsc_barrier(), as has been determined empirically,
 see 057e6a8c660e95c3f4e7162e00e2fee1fc90c50d for details.
 
 Noticed by Andy Lutomirski.
 
 Improves clock_gettime() by approximately 15% on 
 Intel i7-3520M @ 2.90GHz.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 
 diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
 index 25b1cc0..a5beb23 100644
 --- a/arch/x86/include/asm/pvclock.h
 +++ b/arch/x86/include/asm/pvclock.h
  @@ -86,7 +86,6 @@ unsigned __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src,
   offset = pvclock_get_nsec_offset(src);
   ret = src->system_time + offset;
   ret_flags = src->flags;
 - rdtsc_barrier();
  
   *cycles = ret;
   *flags = ret_flags;
 

Ping?



x86: kvmclock: drop rdtsc_barrier()

2015-04-24 Thread Marcelo Tosatti

Drop unnecessary rdtsc_barrier(), as has been determined empirically,
see 057e6a8c660e95c3f4e7162e00e2fee1fc90c50d for details.

Noticed by Andy Lutomirski.

Improves clock_gettime() by approximately 15% on 
Intel i7-3520M @ 2.90GHz.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 25b1cc0..a5beb23 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -86,7 +86,6 @@ unsigned __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src,
offset = pvclock_get_nsec_offset(src);
ret = src->system_time + offset;
ret_flags = src->flags;
-   rdtsc_barrier();
 
*cycles = ret;
*flags = ret_flags;
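
For context, a sketch of what the dropped barrier expands to (an approximation: the real kernel selects LFENCE on Intel or MFENCE on AMD through the alternatives mechanism, so the exact instruction varies by CPU). Its only job is to keep rdtsc from executing speculatively ahead of earlier loads:

static inline void rdtsc_barrier_sketch(void)
{
	/* approximate expansion of rdtsc_barrier(); illustrative only */
	asm volatile("lfence" ::: "memory");
}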



Re: [PATCH] kvm: x86: fix kvmclock update protocol

2015-04-24 Thread Marcelo Tosatti
On Thu, Apr 23, 2015 at 01:46:55PM +0200, Paolo Bonzini wrote:
 From: Radim Krčmář rkrc...@redhat.com
 
 The kvmclock spec says that the host will increment a version field to
 an odd number, then update stuff, then increment it to an even number.
 The host is buggy and doesn't do this, and the result is observable
 when one vcpu reads another vcpu's kvmclock data.
 
 There's no good way for a guest kernel to keep its vdso from reading
 a different vcpu's kvmclock data, but we don't need to care about
 changing VCPUs as long as we read a consistent data from kvmclock.
 (VCPU can change outside of this loop too, so it doesn't matter if we
 return a value not fit for this VCPU.)
 
 Based on a patch by Radim Krčmář.
 
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 ---
  arch/x86/kvm/x86.c | 33 -
  1 file changed, 28 insertions(+), 5 deletions(-)
 
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index ed31c31b2485..c73efcd03e29 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -1669,12 +1669,28 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
  &guest_hv_clock, sizeof(guest_hv_clock))))
   return 0;
  
 - /*
 -  * The interface expects us to write an even number signaling that the
 -  * update is finished. Since the guest won't see the intermediate
 -  * state, we just increase by 2 at the end.
 + /* This VCPU is paused, but it's legal for a guest to read another
 +  * VCPU's kvmclock, so we really have to follow the specification where
 +  * it says that version is odd if data is being modified, and even after
 +  * it is consistent.
 +  *
 +  * Version field updates must be kept separate.  This is because
 +  * kvm_write_guest_cached might use a rep movs instruction, and
 +  * writes within a string instruction are weakly ordered.  So there
 +  * are three writes overall.
 +  *
 +  * As a small optimization, only write the version field in the first
 +  * and third write.  The vcpu->pv_time cache is still valid, because the
 +  * version field is the first in the struct.
 */
 - vcpu->hv_clock.version = guest_hv_clock.version + 2;
 + BUILD_BUG_ON(offsetof(struct pvclock_vcpu_time_info, version) != 0);
 +
 + vcpu->hv_clock.version = guest_hv_clock.version + 1;
 + kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
 + &vcpu->hv_clock,
 + sizeof(vcpu->hv_clock.version));
 +
 + smp_wmb();
  
  /* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
  pvclock_flags = (guest_hv_clock.flags & PVCLOCK_GUEST_STOPPED);
 @@ -1695,6 +1711,13 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
  kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
  &vcpu->hv_clock,
  sizeof(vcpu->hv_clock));
 +
 + smp_wmb();
  
 + vcpu->hv_clock.version++;
 + kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
 + &vcpu->hv_clock,
 + sizeof(vcpu->hv_clock.version));
   return 0;
  }
  
 -- 
 1.8.3.1

Acked-by: Marcelo Tosatti mtosa...@redhat.com



Re: [GIT PULL] First batch of KVM changes for 4.1

2015-04-23 Thread Marcelo Tosatti
On Thu, Apr 23, 2015 at 02:02:29PM +0200, Paolo Bonzini wrote:
 
 
 On 23/04/2015 13:51, Marcelo Tosatti wrote:
https://bugzilla.redhat.com/show_bug.cgi?id=1174664
   
   That was the missing volatile in an asm.  Older compilers didn't catch
   it. :(
  How do you know that? It looks like memory corruption (look at the
  pattern at the end).
 
 I suspect some kind of operator error there, it makes no sense.


if (unlikely(s->flags & SLAB_POISON))
memset(start, POISON_INUSE, PAGE_SIZE << order);

 *  Padding is done using 0x5a (POISON_INUSE)

 On the other hand, bug 1178975 is much clearer and the symptoms are the
 same.  In that bug, you can see that the same kernel source works on f20
 (package version 3.17.7-200.fc20.x86_64) and fails on f21 (package
 version 3.17.7-300.fc21.x86_64).  Of course the compiler is different.
 The newer one hoists the lsl out of the loop; if you get a CPU migration
 at the wrong time, the cpu != cpu1 condition will always be true and the
 loop will never exit.
 
 Paolo
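
The hoisting failure described above is easy to see in miniature: without volatile, the compiler may treat the lsl asm as a pure function of its inputs, compute it once before the retry loop, and the cpu != cpu1 retry condition can then stay true forever. A minimal sketch (the constraints are illustrative and 0x7b stands in for the per-CPU segment selector; the real vDSO helper differs in detail):

static inline unsigned int getcpu_sketch(void)
{
	unsigned int p;

	/* volatile forces the lsl to be re-executed on every
	 * iteration of the caller's retry loop */
	asm volatile("lsl %1, %0" : "=r" (p) : "r" (0x7bu));
	return p;
}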


Re: [GIT PULL] First batch of KVM changes for 4.1

2015-04-23 Thread Marcelo Tosatti
On Thu, Apr 23, 2015 at 11:13:23AM +0200, Paolo Bonzini wrote:
 
 
 On 22/04/2015 23:21, Marcelo Tosatti wrote:
  On Mon, Apr 20, 2015 at 01:27:58PM -0700, Andy Lutomirski wrote:
  On Mon, Apr 20, 2015 at 9:59 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 
 
  On 17/04/2015 22:18, Marcelo Tosatti wrote:
  The bug which this is fixing is very rare; I have no memory of a report.
 
  In fact, it's even difficult to create a synthetic reproducer.
 
  But then why was the task migration notifier even in Jeremy's original
  code for Xen?  Was it supposed to work even on non-synchronized TSC?
 
  If that's the case, then it could be reverted indeed; but then why did
  you commit this patch to 4.1?  Did you think of something that would
  cause the seqcount-like protocol to fail, and that turned out not to be
  the case later?  I was only following the mailing list sparsely in March.
 
  I don't think anyone ever tried that hard to test this stuff.  There
  was an infinte loop that Firefox was triggering as a KVM guest
  somewhat reliably until a couple months ago in the same vdso code.  :(
  
  https://bugzilla.redhat.com/show_bug.cgi?id=1174664
 
 That was the missing volatile in an asm.  Older compilers didn't catch
 it. :(

How do you know that? It looks like memory corruption (look at the
pattern at the end).




Re: [GIT PULL] First batch of KVM changes for 4.1

2015-04-22 Thread Marcelo Tosatti
On Mon, Apr 20, 2015 at 01:27:58PM -0700, Andy Lutomirski wrote:
 On Mon, Apr 20, 2015 at 9:59 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 
 
  On 17/04/2015 22:18, Marcelo Tosatti wrote:
  The bug which this is fixing is very rare; I have no memory of a report.
 
  In fact, it's even difficult to create a synthetic reproducer.
 
  But then why was the task migration notifier even in Jeremy's original
  code for Xen?  Was it supposed to work even on non-synchronized TSC?
 
  If that's the case, then it could be reverted indeed; but then why did
  you commit this patch to 4.1?  Did you think of something that would
  cause the seqcount-like protocol to fail, and that turned out not to be
  the case later?  I was only following the mailing list sparsely in March.
 
 I don't think anyone ever tried that hard to test this stuff.  There
 was an infinte loop that Firefox was triggering as a KVM guest
 somewhat reliably until a couple months ago in the same vdso code.  :(

https://bugzilla.redhat.com/show_bug.cgi?id=1174664

--- Comment #5 from Juan Quintela quint...@redhat.com ---

Another round

# dmesg | grep msr
[0.00] kvm-clock: Using msrs 4b564d01 and 4b564d00
[0.00] kvm-clock: cpu 0, msr 1:1ffd8001, primary cpu clock
[0.00] kvm-stealtime: cpu 0, msr 11fc0d100
[0.041174] kvm-clock: cpu 1, msr 1:1ffd8041, secondary cpu clock
[0.053011] kvm-stealtime: cpu 1, msr 11fc8d100


After start:

[root@trasno yum.repos.d]# virsh qemu-monitor-command --hmp browser 'xp /8x 0x1ffd8000'
1ffd8000: 0x3b401060 0xfffc7f4b 0x3b42d040 0xfffc7f4b
1ffd8010: 0x3b42d460 0xfffc7f4b 0x3b42d4c0 0xfffc7f4b


[root@trasno yum.repos.d]# virsh qemu-monitor-command --hmp browser 'xp /8x 0x1ffd8040'
1ffd8040: 0x3b42d700 0xfffc7f4b 0x3b42d760 0xfffc7f4b
1ffd8050: 0x3b42d7c0 0xfffc7f4b 0x3b42d820 0xfffc7f4b

When firefox hangs

[root@trasno yum.repos.d]# virsh qemu-monitor-command --hmp browser 'xp /8x 0x1ffd8000'
1ffd8000: 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a
1ffd8010: 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a


[root@trasno yum.repos.d]# virsh qemu-monitor-command --hmp browser 'xp /8x 0x1ffd8040'
1ffd8040: 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a
1ffd8050: 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a




Re: [GIT PULL] First batch of KVM changes for 4.1

2015-04-22 Thread Marcelo Tosatti
On Wed, Apr 22, 2015 at 11:01:49PM +0200, Paolo Bonzini wrote:
 
 
 On 22/04/2015 22:56, Marcelo Tosatti wrote:
   But then why was the task migration notifier even in Jeremy's original
   code for Xen? 
  To cover for the vcpu1 -> vcpu2 -> vcpu1 case, I believe.
 
 Ok, to cover it for non-synchronized TSC.  While KVM requires
 synchronized TSC.
 
   If that's the case, then it could be reverted indeed; but then why did
   you commit this patch to 4.1? 
  
  Because it fixes the problem Andy reported (see Subject: KVM: x86: fix
  kvmclock write race (v2) on kvm@). As long as you have Radim's
  fix on top.
 
 But if it's so rare, and it was known that fixing the host protocol was
 just as good a solution, why was the guest fix committed?

I don't know. Should have fixed the host protocol.

 I'm just trying to understand.  I am worried that this patch was rushed
 in; so far I had assumed it wasn't (a revert of a revert is rare enough
 that you don't do it lightly...) but maybe I was wrong.

Yes it was rushed in.

 Right now I cannot even decide whether to revert it (and please Peter in
 the process :)) or submit the Kconfig symbol patch officially.
 
 Paolo


Re: [GIT PULL] First batch of KVM changes for 4.1

2015-04-22 Thread Marcelo Tosatti
On Mon, Apr 20, 2015 at 06:59:04PM +0200, Paolo Bonzini wrote:
 
 
 On 17/04/2015 22:18, Marcelo Tosatti wrote:
  The bug which this is fixing is very rare; I have no memory of a report.
  
  In fact, it's even difficult to create a synthetic reproducer.
 
 But then why was the task migration notifier even in Jeremy's original
 code for Xen? 

To cover for the vcpu1 -> vcpu2 -> vcpu1 case, I believe.

 Was it supposed to work even on non-synchronized TSC?

Yes it is supposed to work on non-synchronized TSC.

 If that's the case, then it could be reverted indeed; but then why did
 you commit this patch to 4.1? 

Because it fixes the problem Andy reported (see Subject: KVM: x86: fix
kvmclock write race (v2) on kvm@). As long as you have Radim's
fix on top.

 Did you think of something that would
 cause the seqcount-like protocol to fail, and that turned out not to be
 the case later?  I was only following the mailing list sparsely in March.

No.



Re: [GIT PULL] First batch of KVM changes for 4.1

2015-04-17 Thread Marcelo Tosatti
On Fri, Apr 17, 2015 at 09:57:12PM +0200, Paolo Bonzini wrote:
 
 
  From 4eb9d7132e1990c0586f28af3103675416d38974 Mon Sep 17 00:00:00 2001
  From: Paolo Bonzini pbonz...@redhat.com
  Date: Fri, 17 Apr 2015 14:57:34 +0200
  Subject: [PATCH] sched: add CONFIG_TASK_MIGRATION_NOTIFIER
 
  The task migration notifier is only used in x86 paravirt.  Make it
  possible to compile it out.
 
  While at it, move some code around to ensure tmn is filled from CPU
  registers.
 
  Signed-off-by: Paolo Bonzini pbonz...@redhat.com
  ---
   arch/x86/Kconfig| 1 +
   init/Kconfig| 3 +++
   kernel/sched/core.c | 9 -
   3 files changed, 12 insertions(+), 1 deletion(-)
 
  diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
  index d43e7e1c784b..9af252c8698d 100644
  --- a/arch/x86/Kconfig
  +++ b/arch/x86/Kconfig
  @@ -649,6 +649,7 @@ if HYPERVISOR_GUEST
   
   config PARAVIRT
bool "Enable paravirtualization code"
  +  select TASK_MIGRATION_NOTIFIER
 ---help---
   This changes the kernel so it can modify itself when it is run
   under a hypervisor, potentially improving performance significantly
  diff --git a/init/Kconfig b/init/Kconfig
  index 3b9df1aa35db..891917123338 100644
  --- a/init/Kconfig
  +++ b/init/Kconfig
@@ -2016,6 +2016,9 @@ source "block/Kconfig"
   config PREEMPT_NOTIFIERS
 bool
   
  +config TASK_MIGRATION_NOTIFIER
  +  bool
  +
   config PADATA
 depends on SMP
 bool
  diff --git a/kernel/sched/core.c b/kernel/sched/core.c
  index f9123a82cbb6..c07a53aa543c 100644
  --- a/kernel/sched/core.c
  +++ b/kernel/sched/core.c
@@ -1016,12 +1016,14 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
 rq_clock_skip_update(rq, true);
   }
   
  +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER
   static ATOMIC_NOTIFIER_HEAD(task_migration_notifier);
   
   void register_task_migration_notifier(struct notifier_block *n)
   {
atomic_notifier_chain_register(&task_migration_notifier, n);
   }
  +#endif
   
   #ifdef CONFIG_SMP
   void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
@@ -1053,18 +1055,23 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 trace_sched_migrate_task(p, new_cpu);
   
 if (task_cpu(p) != new_cpu) {
  +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER
 struct task_migration_notifier tmn;
  +  int from_cpu = task_cpu(p);
  +#endif
   
if (p->sched_class->migrate_task_rq)
p->sched_class->migrate_task_rq(p, new_cpu);
p->se.nr_migrations++;
 perf_sw_event_sched(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 0);
   
  +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER
 tmn.task = p;
  -  tmn.from_cpu = task_cpu(p);
  +  tmn.from_cpu = from_cpu;
 tmn.to_cpu = new_cpu;
   
atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn);
  +#endif
 }
   
 __set_task_cpu(p, new_cpu);
  -- 
  2.3.5
  
  Paolo, 
  
  Please revert the patch -- can fix properly in the host
  which also conforms the KVM guest/host documented protocol.
  
  Radim submitted a patch to kvm@ to split 
  the kvm_write_guest in two with a barrier in between, i think.
  
  I'll review that patch.
 
 You're thinking of
 http://article.gmane.org/gmane.linux.kernel.stable/129187, but see
 Andy's reply:
 
  
  I think there are at least two ways that would work:
  
  a) If KVM incremented version as advertised:
  
  cpu = getcpu();
  pvti = pvti for cpu;
  
  ver1 = pvti->version;
  check stable bit;
  rdtsc_barrier, rdtsc, read scale, shift, etc.
  if (getcpu() != cpu) retry;
  if (pvti->version != ver1) retry;
  
  I think this is safe because, we're guaranteed that there was an
  interval (between the two version reads) in which the vcpu we think
  we're on was running and the kvmclock data was valid and marked
  stable, and we know that the tsc we read came from that interval.
  
  Note: rdtscp isn't needed. If we're stable, is makes no difference
  which cpu's tsc we actually read.
  
  b) If version remains buggy but we use this migrations_from hack:
  
  cpu = getcpu();
  pvti = pvti for cpu;
  m1 = pvti->migrations_from;
  barrier();
  
  ver1 = pvti->version;
  check stable bit;
  rdtsc_barrier, rdtsc, read scale, shift, etc.
  if (getcpu() != cpu) retry;
  if (pvti->version != ver1) retry;  /* probably not really needed */
  
  barrier();
  if (pvti->migrations_from != m1) retry;
  
  This is just like (a), except that we're using a guest kernel hack to
  ensure that no one migrated off the vcpu during the version-protected
  critical section and that we were, in fact, on that vcpu at some point
  during that critical section.  Once we've ensured that we were on
  pvti's associated vcpu for the entire time we were reading it, then we
  are protected by the existing versioning in the host.
 
 (a) is not going to happen until 4.2, and there are too many buggy hosts
 around so we'd have to define new ABI that lets the guest 
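
A sketch of option (a) in C. Here getcpu() and pvti_for_cpu() are placeholders for the real vDSO helpers, the PVCLOCK_TSC_STABLE_BIT fallback is elided, and pvclock_scale_delta() is assumed to behave like the pvclock.h helper of that name:

static u64 vread_pvclock_sketch(void)
{
	const struct pvclock_vcpu_time_info *pvti;
	u64 cycles;
	u32 ver;
	int cpu;

	do {
		cpu = getcpu();			/* placeholder helper */
		pvti = pvti_for_cpu(cpu);	/* placeholder helper */
		ver = pvti->version;
		/* stable-bit check elided; fall back if it is clear */
		rdtsc_barrier();
		cycles = pvti->system_time +
			 pvclock_scale_delta(rdtsc() - pvti->tsc_timestamp,
					     pvti->tsc_to_system_mul,
					     pvti->tsc_shift);
	} while (getcpu() != cpu || (ver & 1) || pvti->version != ver);

	return cycles;
}

Passing both rechecks proves the task was on pvti's vCPU for an interval in which the version was even and unchanged, which is exactly the guarantee described above.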

Re: [GIT PULL] First batch of KVM changes for 4.1

2015-04-17 Thread Marcelo Tosatti
On Fri, Apr 17, 2015 at 03:38:58PM +0200, Paolo Bonzini wrote:
 On 17/04/2015 15:10, Peter Zijlstra wrote:
  On Fri, Apr 17, 2015 at 02:46:57PM +0200, Paolo Bonzini wrote:
  On 17/04/2015 12:55, Peter Zijlstra wrote:
  Also, it looks like you already do exactly this for other things, look
  at:
 
kvm_sched_in()
  kvm_arch_vcpu_load()
if (unlikely(vcpu->cpu != cpu) ... )
 
  So no, I don't believe for one second you need this.
  
  This [...] brings us back to where we were last
  time. There is _0_ justification for this in the patches, that alone is
  grounds enough to reject it.
 
 Oh, we totally agree on that.  I didn't commit that patch, but I already
 said the commit message was insufficient.
 
  Why should the guest task care about the physical cpu of the vcpu;
  that's a layering fail if ever there was one.
 
 It's totally within your right to not read the code, but then please
 don't try commenting at it.
 
 This code:
 
   kvm_sched_in()
 kvm_arch_vcpu_load()
   if (unlikely(vcpu->cpu != cpu) ... )
 
 runs in the host.  The hypervisor obviously cares if the physical CPU of
 the VCPU changes.  It has to tell the source processor (vcpu->cpu) to
 release the VCPU's data structure and only then it can use it in the
 target processor (cpu).  No layering violation here.
 
 The task migration notifier runs in the guest, whenever the VCPU of
 a task changes.
 
  Furthermore, the only thing that migration handler seems to do is
  increment a variable that is not actually used in that file.
 
 It's used in the vDSO, so you cannot increment it in the file that uses it.
 
  And frankly, I think the static key is snake oil.  The cost of task 
  migration in terms of cache misses and TLB misses is in no way 
  comparable to the cost of filling in a structure on the stack, 
  dereferencing the head of the notifiers list and seeing that it's NULL.
  
  The path this notifier is called from has nothing to do with those
  costs.
 
 How not?  The task is going to incur those costs, it's not like half
 a dozen extra instruction make any difference.  But anyway...
 
  And the fact you're inflicting these costs on _everyone_ for a
  single x86_64-paravirt case is insane.
 
 ... that's a valid objection.  Please look at the patch below.
 
  I've had enough of this, the below goes into sched/urgent and you can
  come back with sane patches if and when you're ready.
 
 Oh, please, cut the alpha male crap.
 
 Paolo
 
 --- 8 
 From 4eb9d7132e1990c0586f28af3103675416d38974 Mon Sep 17 00:00:00 2001
 From: Paolo Bonzini pbonz...@redhat.com
 Date: Fri, 17 Apr 2015 14:57:34 +0200
 Subject: [PATCH] sched: add CONFIG_TASK_MIGRATION_NOTIFIER
 
 The task migration notifier is only used in x86 paravirt.  Make it
 possible to compile it out.
 
 While at it, move some code around to ensure tmn is filled from CPU
 registers.
 
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 ---
  arch/x86/Kconfig| 1 +
  init/Kconfig| 3 +++
  kernel/sched/core.c | 9 -
  3 files changed, 12 insertions(+), 1 deletion(-)
 
 diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
 index d43e7e1c784b..9af252c8698d 100644
 --- a/arch/x86/Kconfig
 +++ b/arch/x86/Kconfig
 @@ -649,6 +649,7 @@ if HYPERVISOR_GUEST
  
  config PARAVIRT
   bool "Enable paravirtualization code"
 + select TASK_MIGRATION_NOTIFIER
   ---help---
 This changes the kernel so it can modify itself when it is run
 under a hypervisor, potentially improving performance significantly
 diff --git a/init/Kconfig b/init/Kconfig
 index 3b9df1aa35db..891917123338 100644
 --- a/init/Kconfig
 +++ b/init/Kconfig
 @@ -2016,6 +2016,9 @@ source "block/Kconfig"
  config PREEMPT_NOTIFIERS
   bool
  
 +config TASK_MIGRATION_NOTIFIER
 + bool
 +
  config PADATA
   depends on SMP
   bool
 diff --git a/kernel/sched/core.c b/kernel/sched/core.c
 index f9123a82cbb6..c07a53aa543c 100644
 --- a/kernel/sched/core.c
 +++ b/kernel/sched/core.c
 @@ -1016,12 +1016,14 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
   rq_clock_skip_update(rq, true);
  }
  
 +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER
  static ATOMIC_NOTIFIER_HEAD(task_migration_notifier);
  
  void register_task_migration_notifier(struct notifier_block *n)
  {
   atomic_notifier_chain_register(&task_migration_notifier, n);
  }
 +#endif
  
  #ifdef CONFIG_SMP
  void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 @@ -1053,18 +1055,23 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
   trace_sched_migrate_task(p, new_cpu);
  
   if (task_cpu(p) != new_cpu) {
 +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER
   struct task_migration_notifier tmn;
 + int from_cpu = task_cpu(p);
 +#endif
  
   if (p->sched_class->migrate_task_rq)
   p->sched_class->migrate_task_rq(p, new_cpu);
   p->se.nr_migrations++;

Re: [GIT PULL] First batch of KVM changes for 4.1

2015-04-17 Thread Marcelo Tosatti
On Fri, Apr 17, 2015 at 03:25:28PM -0700, Andy Lutomirski wrote:
 On Fri, Apr 17, 2015 at 3:04 PM, Linus Torvalds
 torva...@linux-foundation.org wrote:
  On Fri, Apr 17, 2015 at 5:42 PM, Andy Lutomirski l...@amacapital.net 
  wrote:
 
  Muahaha.  The auditors have invaded your system.  (I did my little
  benchmark with a more sensible configuration -- see way below).
 
  Can you send the output of:
 
  # auditctl -s
  # auditctl -l
 
# auditctl -s
enabled 1
flag 1
pid 822
rate_limit 0
backlog_limit 320
lost 0
backlog 0
backlog_wait_time 6
loginuid_immutable 0 unlocked
# auditctl -l
No rules
 
 Yes, "No rules" doesn't mean "don't audit".
 
 
  Are you, perchance, using Fedora?
 
  F21. Yup.
 
  I used to just disable auditing in the kernel entirely, but then I
  ended up deciding that I need to run something closer to the broken
  Fedora config (selinux in particular) in order to actually optimize
  the real-world pathname handling situation rather than the _sane_ one.
  Oh well. I think audit support got enabled at the same time in my
  kernels because I ended up using the default config and then taking
  out the truly crazy stuff without noticing AUDITSYSCALL.
 
  I lobbied rather heavily, and
  successfully, to get the default configuration to stop auditing.
  Unfortunately, the fix wasn't retroactive, so, unless you have a very
  fresh install, you might want to apply the fix yourself:
 
  Is that fix happening in Fedora going forward, though? Like F22?
 
 It's in F21, actually.  Unfortunately, the audit package is really
 weird.  It installs /etc/audit/rules.d/audit.rules, containing:
 
 # This file contains the auditctl rules that are loaded
 # whenever the audit daemon is started via the initscripts.
 # The rules are simply the parameters that would be passed
 # to auditctl.
 
 # First rule - delete all
 -D
 
 # This suppresses syscall auditing for all tasks started
 # with this rule in effect.  Remove it if you need syscall
 # auditing.
 -a task,never
 
 Then, if it's a fresh install (i.e. /etc/audit/audit.rules doesn't
 exist) it copies that file to /etc/audit/audit.rules post-install.
 (No, I don't know why it works this way.)
 
 IOW, you might want to copy your /etc/audit/rules.d/audit.rules to
 /etc/audit/audit.rules.  I think you need to reboot to get the full
 effect.  You could apply the rule manually (or maybe restart the audit
 service), but it will only affect newly-started tasks.
 
 
  Amdy Lumirtowsky thinks he meant to attach a condition to his
  maintainerish activities: he will do his best to keep the audit code
  *out* of the low-level stuff, but he's going to try to avoid ever
  touching the audit code itself, because if he ever had to change it,
  he might accidentally delete the entire file.
 
  Oooh. That would be _such_ a shame.
 
  Can we please do it by mistake? Oops, my fingers slipped
 
  Seriously, wasn't there a TAINT_PERFORMANCE thing proposed at some
  point?  I would love auditing to set some really loud global warning
  that you've just done a Bad Thing (tm) performance-wise by enabling
  it.
 
  Or even just a big fat warning in dmesg the first time auditing triggers.
 
  Back to timing.  With kvm-clock, I see:
 
23.80%  timing_test_64  [kernel.kallsyms]   [k] pvclock_clocksource_read
 
  Oh wow. How can that possibly be sane?
 
  Isn't the *whole* point of pvclock_clocksource_read() to be a native
  rdtsc with scaling? How does it cause that kind of insane pain?
 
 An unnecessarily complicated protocol, a buggy host implementation,
 and an unnecessarily complicated guest implementation :(

How about starting by removing the unnecessary rdtsc_barrier()?



Re: KVM: x86: fix kvmclock write race

2015-04-17 Thread Marcelo Tosatti
On Fri, Apr 17, 2015 at 05:04:29PM -0700, Andy Lutomirski wrote:
 On Fri, Apr 17, 2015 at 4:38 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
 
  From: Radim Krčmář rkrc...@redhat.com
 
  As noted by Andy Lutomirski, kvm does not follow the documented version
  protocol. Fix it.
 
  Note: this bug results in a race which can occur if the following three
  conditions are met:
 
  1) There is KVM guest time update (there is one every 5 minutes).
 
  2) Which races with a thread in the guest in the following way:
  The execution of these 29 instructions has to take at _least_
  2 seconds (rebalance interval is 1 second).
 
  lsl%r9w,%esi
  mov%esi,%r8d
  and$0x3f,%esi
  and$0xfff,%r8d
  test   $0xfc0,%r8d
  jne    0xa12 <vread_pvclock+210>
  shl$0x6,%rsi
  mov-0xa01000(%rsi),%r10d
  data32 xchg %ax,%ax
  data32 xchg %ax,%ax
  rdtsc
  shl$0x20,%rdx
  mov%eax,%eax
  movsbl -0xa00fe4(%rsi),%ecx
  or %rax,%rdx
  sub-0xa00ff8(%rsi),%rdx
  mov-0xa00fe8(%rsi),%r11d
  mov%rdx,%rax
  shl%cl,%rax
  test   %ecx,%ecx
  js     0xa08 <vread_pvclock+200>
  mov%r11d,%edx
  movzbl -0xa00fe3(%rsi),%ecx
  mov-0xa00ff0(%rsi),%r11
  mul%rdx
  shrd   $0x20,%rdx,%rax
  data32 xchg %ax,%ax
  data32 xchg %ax,%ax
  lsl%r9w,%edx
 
  3) Scheduler moves the task, while executing these 29 instructions, to a
  destination processor, then back to the source processor.
 
  4) The source processor, after the task has been moved back from the
  destination, perceives the data out of order as written by the processor
  performing the guest time update (item 1) with string mov.
 
  Given the rarity of this condition, and the fact it was never observed
  or reported, reverting pvclock vsyscall on systems whose host is
  susceptible to the race seems unnecessary.
 
  Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 
  diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
  index cc2c759f69a3..8658599e0024 100644
  --- a/arch/x86/kvm/x86.c
  +++ b/arch/x86/kvm/x86.c
  @@ -1658,12 +1658,24 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
  &guest_hv_clock, sizeof(guest_hv_clock))))
  return 0;
 
  -   /*
  -* The interface expects us to write an even number signaling that the
  -* update is finished. Since the guest won't see the intermediate
  -* state, we just increase by 2 at the end.
  +   /* A guest can read other VCPU's kvmclock; specification says that
  +* version is odd if data is being modified and even after it is
  +* consistent.
  +* We write three times to be sure.
  +*  1) update version to odd number
  +*  2) write modified data (version is still odd)
  +*  3) update version to even number
  +*
  +* TODO: optimize
  +*  - only two writes should be enough -- version is first
  +*  - the second write could update just version
 
 You're relying on lots of barely-defined behavior here, since I think
 that both copies could use fast string operations.  Those are
 explicitly unordered internally, so I think you really do need three
 writes.

Correct, 3 writes are needed.



KVM: x86: fix kvmclock write race

2015-04-17 Thread Marcelo Tosatti

From: Radim Krčmář rkrc...@redhat.com

As noted by Andy Lutomirski, kvm does not follow the documented version
protocol. Fix it.

Note: this bug results in a race which can occur if the following three
conditions are met:

1) There is KVM guest time update (there is one every 5 minutes).

2) Which races with a thread in the guest in the following way:
The execution of these 29 instructions has to take at _least_
2 seconds (rebalance interval is 1 second).

lsl%r9w,%esi
mov%esi,%r8d
and$0x3f,%esi
and$0xfff,%r8d
test   $0xfc0,%r8d
jne    0xa12 <vread_pvclock+210>
shl$0x6,%rsi
mov-0xa01000(%rsi),%r10d
data32 xchg %ax,%ax
data32 xchg %ax,%ax
rdtsc  
shl$0x20,%rdx
mov%eax,%eax
movsbl -0xa00fe4(%rsi),%ecx
or %rax,%rdx
sub-0xa00ff8(%rsi),%rdx
mov-0xa00fe8(%rsi),%r11d
mov%rdx,%rax
shl%cl,%rax
test   %ecx,%ecx
js     0xa08 <vread_pvclock+200>
mov%r11d,%edx
movzbl -0xa00fe3(%rsi),%ecx
mov-0xa00ff0(%rsi),%r11
mul%rdx
shrd   $0x20,%rdx,%rax
data32 xchg %ax,%ax
data32 xchg %ax,%ax
lsl%r9w,%edx

3) Scheduler moves the task, while executing these 29 instructions, to a
destination processor, then back to the source processor.

4) The source processor, after the task has been moved back from the
destination, perceives the data out of order as written by the processor
performing the guest time update (item 1) with string mov.

Given the rarity of this condition, and the fact it was never observed
or reported, reverting pvclock vsyscall on systems whose host is
susceptible to the race seems unnecessary.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cc2c759f69a3..8658599e0024 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1658,12 +1658,24 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
&guest_hv_clock, sizeof(guest_hv_clock))))
return 0;
 
-   /*
-* The interface expects us to write an even number signaling that the
-* update is finished. Since the guest won't see the intermediate
-* state, we just increase by 2 at the end.
+   /* A guest can read other VCPU's kvmclock; specification says that
+* version is odd if data is being modified and even after it is
+* consistent.
+* We write three times to be sure.
+*  1) update version to odd number
+*  2) write modified data (version is still odd)
+*  3) update version to even number
+*
+* TODO: optimize
+*  - only two writes should be enough -- version is first
+*  - the second write could update just version
 */
-   vcpu->hv_clock.version = guest_hv_clock.version + 2;
+   guest_hv_clock.version += 1;
+   kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
+   &guest_hv_clock,
+   sizeof(guest_hv_clock));
+
+   vcpu->hv_clock.version = guest_hv_clock.version;
 
/* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
pvclock_flags = (guest_hv_clock.flags & PVCLOCK_GUEST_STOPPED);
@@ -1684,6 +1696,11 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
&vcpu->hv_clock,
sizeof(vcpu->hv_clock));
+
+   vcpu->hv_clock.version += 1;
+   kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
+   &vcpu->hv_clock,
+   sizeof(vcpu->hv_clock));
return 0;
 }
 


Re: KVM: x86: fix kvmclock write race

2015-04-17 Thread Marcelo Tosatti
On Fri, Apr 17, 2015 at 05:04:29PM -0700, Andy Lutomirski wrote:
 On Fri, Apr 17, 2015 at 4:38 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
 
  From: Radim Krčmář rkrc...@redhat.com
 
  As noted by Andy Lutomirski, kvm does not follow the documented version
  protocol. Fix it.
 
  Note: this bug results in a race which can occur if the following three
  conditions are met:
 
  1) There is KVM guest time update (there is one every 5 minutes).
 
  2) Which races with a thread in the guest in the following way:
  The execution of these 29 instructions has to take at _least_
  2 seconds (rebalance interval is 1 second).
 
  lsl%r9w,%esi
  mov%esi,%r8d
  and$0x3f,%esi
  and$0xfff,%r8d
  test   $0xfc0,%r8d
  jne    0xa12 <vread_pvclock+210>
  shl$0x6,%rsi
  mov-0xa01000(%rsi),%r10d
  data32 xchg %ax,%ax
  data32 xchg %ax,%ax
  rdtsc
  shl$0x20,%rdx
  mov%eax,%eax
  movsbl -0xa00fe4(%rsi),%ecx
  or %rax,%rdx
  sub-0xa00ff8(%rsi),%rdx
  mov-0xa00fe8(%rsi),%r11d
  mov%rdx,%rax
  shl%cl,%rax
  test   %ecx,%ecx
  js     0xa08 <vread_pvclock+200>
  mov%r11d,%edx
  movzbl -0xa00fe3(%rsi),%ecx
  mov-0xa00ff0(%rsi),%r11
  mul%rdx
  shrd   $0x20,%rdx,%rax
  data32 xchg %ax,%ax
  data32 xchg %ax,%ax
  lsl%r9w,%edx
 
  3) Scheduler moves the task, while executing these 29 instructions, to a
  destination processor, then back to the source processor.
 
  4) The source processor, after the task has been moved back from the
  destination, perceives the data out of order as written by the processor
  performing the guest time update (item 1) with string mov.
 
  Given the rarity of this condition, and the fact it was never observed
  or reported, reverting pvclock vsyscall on systems whose host is
  susceptible to the race seems unnecessary.
 
  Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 
  diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
  index cc2c759f69a3..8658599e0024 100644
  --- a/arch/x86/kvm/x86.c
  +++ b/arch/x86/kvm/x86.c
  @@ -1658,12 +1658,24 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
  &guest_hv_clock, sizeof(guest_hv_clock))))
  return 0;
 
  -   /*
  -* The interface expects us to write an even number signaling that the
  -* update is finished. Since the guest won't see the intermediate
  -* state, we just increase by 2 at the end.
  +   /* A guest can read other VCPU's kvmclock; specification says that
  +* version is odd if data is being modified and even after it is
  +* consistent.
  +* We write three times to be sure.
  +*  1) update version to odd number
  +*  2) write modified data (version is still odd)
  +*  3) update version to even number
  +*
  +* TODO: optimize
  +*  - only two writes should be enough -- version is first
  +*  - the second write could update just version
 
 You're relying on lots of barely-defined behavior here, since I think
 that both copies could use fast string operations.  Those are
 explicitly unordered internally, so I think you really do need three
 writes.

The memory-ordering model respects the follow principles:
1. Stores within a single string operation may be executed out of order.
2. Stores from separate string operations (for example, stores from
consecutive string operations) do not execute
out of order. All the stores from an earlier string operation will
complete before any store from a later string
operation.

 Personally, if I wanted to optimize this (I'm not convinced it
 matters), 

It does not matter.

 I'd add a write-a-single-word primitive and use that for the
 version.
 
 Anyway, I think this code looks okay as is.

Looking forward to see the patchset to have all vcpus reading
from vcpu0 pvclock area.
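
The two ordering rules quoted above are what make the split kvm_write_guest_cached() calls meaningful: stores inside one copy may be shuffled, but a version write issued in a later call cannot pass the payload written by an earlier one. As a self-contained illustration of the resulting odd/even protocol, a user-space analogue (a sketch in C11; the kernel uses smp_wmb() and guest-memory copies rather than these atomics):

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

struct pvti_demo {
	_Atomic uint32_t version;	/* odd = update in progress */
	uint64_t system_time;		/* stands in for the pvti payload */
};

static void writer_update(struct pvti_demo *p, uint64_t now)
{
	uint32_t v = atomic_load_explicit(&p->version, memory_order_relaxed);

	atomic_store_explicit(&p->version, v + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* ~ smp_wmb() */
	p->system_time = now;				/* payload write */
	atomic_thread_fence(memory_order_release);	/* ~ smp_wmb() */
	atomic_store_explicit(&p->version, v + 2, memory_order_relaxed);
}

static uint64_t reader_read(const struct pvti_demo *p)
{
	uint32_t v;
	uint64_t t;

	do {
		v = atomic_load_explicit(&p->version, memory_order_acquire);
		t = p->system_time;	/* racy read; the retry makes it safe */
		atomic_thread_fence(memory_order_acquire);
	} while ((v & 1) ||
		 atomic_load_explicit(&p->version, memory_order_relaxed) != v);

	return t;
}

int main(void)
{
	struct pvti_demo p = { .version = 0, .system_time = 0 };

	writer_update(&p, 123456789);
	printf("%llu\n", (unsigned long long)reader_read(&p));
	return 0;
}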




KVM: x86: fix kvmclock write race (v2)

2015-04-17 Thread Marcelo Tosatti

From: Radim Krčmář rkrc...@redhat.com

As noted by Andy Lutomirski, kvm does not follow the documented version
protocol. Fix it.

Note: this bug results in a race which can occur if the following three
conditions are met:

1) There is KVM guest time update (there is one every 5 minutes).

2) Which races with a thread in the guest in the following way:
The execution of these 29 instructions has to take at _least_
2 seconds (rebalance interval is 1 second).

lsl%r9w,%esi
mov%esi,%r8d
and$0x3f,%esi
and$0xfff,%r8d
test   $0xfc0,%r8d
jne    0xa12 <vread_pvclock+210>
shl$0x6,%rsi
mov-0xa01000(%rsi),%r10d
data32 xchg %ax,%ax
data32 xchg %ax,%ax
rdtsc  
shl$0x20,%rdx
mov%eax,%eax
movsbl -0xa00fe4(%rsi),%ecx
or %rax,%rdx
sub-0xa00ff8(%rsi),%rdx
mov-0xa00fe8(%rsi),%r11d
mov%rdx,%rax
shl%cl,%rax
test   %ecx,%ecx
js     0xa08 <vread_pvclock+200>
mov%r11d,%edx
movzbl -0xa00fe3(%rsi),%ecx
mov-0xa00ff0(%rsi),%r11
mul%rdx
shrd   $0x20,%rdx,%rax
data32 xchg %ax,%ax
data32 xchg %ax,%ax
lsl%r9w,%edx

3) Scheduler moves the task, while executing these 29 instructions, to a
destination processor, then back to the source processor.

4) The source processor, after the task has been moved back from the
destination, perceives the data out of order as written by the processor
performing the guest time update (item 1) with string mov.

Given the rarity of this condition, and the fact it was never observed
or reported, reverting pvclock vsyscall on systems whose host is
susceptible to the race seems an excessive measure.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Cc: sta...@kernel.org

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cc2c759..e75fafd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1658,12 +1658,20 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
&guest_hv_clock, sizeof(guest_hv_clock))))
return 0;
 
-   /*
-* The interface expects us to write an even number signaling that the
-* update is finished. Since the guest won't see the intermediate
-* state, we just increase by 2 at the end.
+   /* A guest can read other VCPU's kvmclock; specification says that
+* version is odd if data is being modified and even after it is
+* consistent.
+* We write three times to be sure.
+*  1) update version to odd number
+*  2) write modified data (version is still odd)
+*  3) update version to even number
 */
-   vcpu->hv_clock.version = guest_hv_clock.version + 2;
+   guest_hv_clock.version += 1;
+   kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
+   &guest_hv_clock,
+   sizeof(guest_hv_clock));
+
+   vcpu->hv_clock.version = guest_hv_clock.version;
 
/* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
pvclock_flags = (guest_hv_clock.flags & PVCLOCK_GUEST_STOPPED);
@@ -1684,6 +1692,11 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
&vcpu->hv_clock,
sizeof(vcpu->hv_clock));
+
+   vcpu->hv_clock.version += 1;
+   kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
+   &vcpu->hv_clock,
+   sizeof(vcpu->hv_clock));
return 0;
 }
 


[patch -rt 2/2] KVM: lapic: mark LAPIC timer handler as irqsafe

2015-04-08 Thread Marcelo Tosatti
Since the LAPIC timer handler only wakes up a simple waitqueue,
it can be executed from hardirq context.

Also handle the case where hrtimer_start_expires fails due to -ETIME
by injecting the interrupt into the guest immediately.

Reduces average cyclictest latency by 3us.
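
In outline, the retry logic added below is the following (a simplified
sketch of the patch, not a literal excerpt; on -rt,
hrtimer_start_expires() can return -ETIME when the expiry has already
passed):

static void timer_expired_sketch(struct hrtimer *t)
{
	struct kvm_timer *ktimer = container_of(t, struct kvm_timer, timer);
	int ret, i = 0;

	if (apic_timer_fn(t) != HRTIMER_RESTART)
		return;

	/* If re-arming fails because the new expiry is already in the
	 * past, push it forward by one period and retry, bounded. */
	do {
		ret = hrtimer_start_expires(t, HRTIMER_MODE_ABS);
		if (ret == -ETIME)
			hrtimer_add_expires_ns(&ktimer->timer, ktimer->period);
	} while (ret == -ETIME && ++i < 10);
}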

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

---
 arch/x86/kvm/lapic.c |   42 +++---
 1 file changed, 39 insertions(+), 3 deletions(-)

Index: rt-linux/arch/x86/kvm/lapic.c
===
--- rt-linux.orig/arch/x86/kvm/lapic.c  2015-04-08 20:20:41.0 -0300
+++ rt-linux/arch/x86/kvm/lapic.c   2015-04-08 20:21:16.592739674 -0300
@@ -1034,8 +1034,38 @@
   apic->divide_count);
 }
 
+
+static enum hrtimer_restart apic_timer_fn(struct hrtimer *data);
+
+static void apic_timer_expired(struct hrtimer *data)
+{
+   int ret, i = 0;
+   enum hrtimer_restart r;
+   struct kvm_timer *ktimer = container_of(data, struct kvm_timer, timer);
+
+   r = apic_timer_fn(data);
+
+   if (r == HRTIMER_RESTART) {
+   do {
+   ret = hrtimer_start_expires(data, HRTIMER_MODE_ABS);
+   if (ret == -ETIME)
+   hrtimer_add_expires_ns(&ktimer->timer,
+   ktimer->period);
+   i++;
+   } while (ret == -ETIME && i < 10);
+
+   if (ret == -ETIME) {
+   printk_once(KERN_ERR "%s: failed to reprogram timer\n",
+  __func__);
+   WARN_ON_ONCE(1);
+   }
+   }
+}
+
+
 static void start_apic_timer(struct kvm_lapic *apic)
 {
+   int ret;
ktime_t now;
atomic_set(&apic->lapic_timer.pending, 0);
 
@@ -1065,9 +1095,11 @@
}
}
 
-   hrtimer_start(&apic->lapic_timer.timer,
+   ret = hrtimer_start(&apic->lapic_timer.timer,
  ktime_add_ns(now, apic->lapic_timer.period),
  HRTIMER_MODE_ABS);
+   if (ret == -ETIME)
+   apic_timer_expired(&apic->lapic_timer.timer);
 
apic_debug("%s: bus cycle is %" PRId64 "ns, now 0x%016"
   PRIx64 ", "
@@ -1097,8 +1129,10 @@
ns = (tscdeadline - guest_tsc) * 1000000ULL;
do_div(ns, this_tsc_khz);
}
-   hrtimer_start(&apic->lapic_timer.timer,
+   ret = hrtimer_start(&apic->lapic_timer.timer,
ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
+   if (ret == -ETIME)
+   apic_timer_expired(&apic->lapic_timer.timer);
 
local_irq_restore(flags);
}
@@ -1587,6 +1621,7 @@
hrtimer_init(&apic->lapic_timer.timer, CLOCK_MONOTONIC,
 HRTIMER_MODE_ABS);
apic->lapic_timer.timer.function = apic_timer_fn;
+   apic->lapic_timer.timer.irqsafe = 1;
 
/*
 * APIC is created enabled. This will prevent kvm_lapic_set_base from
@@ -1707,7 +1742,8 @@
 
timer = &vcpu->arch.apic->lapic_timer.timer;
if (hrtimer_cancel(timer))
-   hrtimer_start_expires(timer, HRTIMER_MODE_ABS);
+   if (hrtimer_start_expires(timer, HRTIMER_MODE_ABS) == -ETIME)
+   apic_timer_expired(timer);
 }
 
 /*




[patch -rt 1/2] KVM: use simple waitqueue for vcpu->wq

2015-04-08 Thread Marcelo Tosatti
The problem:

On -RT, an emulated LAPIC timer instance has the following path:

1) hard interrupt
2) ksoftirqd is scheduled
3) ksoftirqd wakes up vcpu thread
4) vcpu thread is scheduled

This extra context switch introduces unnecessary latency in the 
LAPIC path for a KVM guest.

The solution:

Allow waking up vcpu thread from hardirq context,
thus avoiding the need for ksoftirqd to be scheduled.

Normal waitqueues make use of spinlocks, which on -RT 
are sleepable locks. Therefore, waking up a waitqueue 
waiter involves locking a sleeping lock, which 
is not allowed from hard interrupt context.

cyclictest command line:
# cyclictest -m -n -q -p99 -l 100 -h60  -D 1m

This patch reduces the average latency in my tests from 14us to 11us.
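
For illustration, the simple waitqueue pattern the series converts to
looks like this minimal sketch (API names as in the 3.18-rt tree's
linux/wait-simple.h; not part of the patch itself):

#include <linux/wait-simple.h>

static struct swait_head wq;
static bool woken;

static void waiter_setup(void)
{
	init_swait_head(&wq);
}

static void waiter(void)
{
	/* sleeps until woken becomes true */
	swait_event_interruptible(wq, woken);
}

static void waker(void)	/* legal from hardirq context on -rt */
{
	woken = true;
	swait_wake_interruptible(&wq);
}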

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

---
 arch/arm/kvm/arm.c  |4 ++--
 arch/arm/kvm/psci.c |4 ++--
 arch/powerpc/include/asm/kvm_host.h |4 ++--
 arch/powerpc/kvm/book3s_hv.c|   20 ++--
 arch/s390/include/asm/kvm_host.h|2 +-
 arch/s390/kvm/interrupt.c   |8 
 arch/x86/kvm/lapic.c|6 +++---
 include/linux/kvm_host.h|4 ++--
 virt/kvm/async_pf.c |4 ++--
 virt/kvm/kvm_main.c |   16 
 10 files changed, 36 insertions(+), 36 deletions(-)

Index: rt-linux/arch/arm/kvm/arm.c
===
--- rt-linux.orig/arch/arm/kvm/arm.c2015-04-08 20:20:39.962649422 -0300
+++ rt-linux/arch/arm/kvm/arm.c 2015-04-08 20:20:41.966654408 -0300
@@ -441,9 +441,9 @@
 
 static void vcpu_pause(struct kvm_vcpu *vcpu)
 {
-   wait_queue_head_t *wq = kvm_arch_vcpu_wq(vcpu);
+   struct swait_head *wq = kvm_arch_vcpu_wq(vcpu);
 
-   wait_event_interruptible(*wq, !vcpu->arch.pause);
+   swait_event_interruptible(*wq, !vcpu->arch.pause);
 }
 
 static int kvm_vcpu_initialized(struct kvm_vcpu *vcpu)
Index: rt-linux/arch/arm/kvm/psci.c
===
--- rt-linux.orig/arch/arm/kvm/psci.c   2015-04-08 20:20:39.962649422 -0300
+++ rt-linux/arch/arm/kvm/psci.c2015-04-08 20:20:41.966654408 -0300
@@ -66,7 +66,7 @@
 {
struct kvm *kvm = source_vcpu-kvm;
struct kvm_vcpu *vcpu = NULL, *tmp;
-   wait_queue_head_t *wq;
+   struct swait_head *wq;
unsigned long cpu_id;
unsigned long context_id;
unsigned long mpidr;
@@ -123,7 +123,7 @@
smp_mb();   /* Make sure the above is visible */
 
wq = kvm_arch_vcpu_wq(vcpu);
-   wake_up_interruptible(wq);
+   swait_wake_interruptible(wq);
 
return PSCI_RET_SUCCESS;
 }
Index: rt-linux/arch/powerpc/include/asm/kvm_host.h
===
--- rt-linux.orig/arch/powerpc/include/asm/kvm_host.h   2015-04-08 
20:20:39.963649425 -0300
+++ rt-linux/arch/powerpc/include/asm/kvm_host.h2015-04-08 
20:20:41.966654408 -0300
@@ -296,7 +296,7 @@
u8 in_guest;
struct list_head runnable_threads;
spinlock_t lock;
-   wait_queue_head_t wq;
+   struct swait_head wq;
u64 stolen_tb;
u64 preempt_tb;
struct kvm_vcpu *runner;
@@ -618,7 +618,7 @@
u8 prodded;
u32 last_inst;
 
-   wait_queue_head_t *wqp;
+   struct swait_head *wqp;
struct kvmppc_vcore *vcore;
int ret;
int trap;
Index: rt-linux/arch/powerpc/kvm/book3s_hv.c
===
--- rt-linux.orig/arch/powerpc/kvm/book3s_hv.c  2015-04-08 20:20:39.964649427 
-0300
+++ rt-linux/arch/powerpc/kvm/book3s_hv.c   2015-04-08 20:20:41.966654408 
-0300
@@ -84,11 +84,11 @@
 {
int me;
int cpu = vcpu->cpu;
-   wait_queue_head_t *wqp;
+   struct swait_head *wqp;
 
wqp = kvm_arch_vcpu_wq(vcpu);
-   if (waitqueue_active(wqp)) {
-   wake_up_interruptible(wqp);
+   if (swaitqueue_active(wqp)) {
+   swait_wake_interruptible(wqp);
++vcpu->stat.halt_wakeup;
}
 
@@ -639,8 +639,8 @@
tvcpu->arch.prodded = 1;
smp_mb();
if (vcpu->arch.ceded) {
-   if (waitqueue_active(&vcpu->wq)) {
-   wake_up_interruptible(&vcpu->wq);
+   if (swaitqueue_active(&vcpu->wq)) {
+   swait_wake_interruptible(&vcpu->wq);
vcpu->stat.halt_wakeup++;
}
}
@@ -1357,7 +1357,7 @@
 
INIT_LIST_HEAD(&vcore->runnable_threads);
spin_lock_init(&vcore->lock);
-   init_waitqueue_head(&vcore->wq);
+   init_swait_head(&vcore->wq);
vcore->preempt_tb = TB_NIL;
vcore->lpcr = kvm->arch.lpcr;
vcore->first_vcpuid = core * threads_per_subcore;
@@ -1826,13 +1826,13 @@
  */
 static

[patch -rt 0/2] use simple waitqueue for kvm vcpu waitqueue (v5)

2015-04-08 Thread Marcelo Tosatti
Sebastian,

rebased against v3.18.7-rt2 as requested.

The problem:

On -RT, an emulated LAPIC timer instance has the following path:

1) hard interrupt
2) ksoftirqd is scheduled
3) ksoftirqd wakes up vcpu thread
4) vcpu thread is scheduled

This extra context switch introduces unnecessary latency in the
LAPIC path for a KVM guest.

The solution:

Allow waking up vcpu thread from hardirq context,
thus avoiding the need for ksoftirqd to be scheduled.

Normal waitqueues make use of spinlocks, which on -RT
are sleepable locks. Therefore, waking up a waitqueue
waiter involves locking a sleeping lock, which
is not allowed from hard interrupt context.

cyclictest command line:
# cyclictest -m -n -q -p99 -l 100 -h60  -D 1m

This patch reduces the average latency in my tests from 14us to 11us.

v2: improve changelog (Rik van Riel)
v3: limit (once) guest triggered printk and WARN_ON (Paolo Bonzini)
v4: fix typo (Steven Rostedt)
v5: rebase against v3.18.7-rt2.





Re: [PATCH] x86: vdso: fix pvclock races with task migration

2015-04-06 Thread Marcelo Tosatti
On Thu, Apr 02, 2015 at 08:44:23PM +0200, Radim Krčmář wrote:
 If we were migrated right after __getcpu, but before reading the
 migration_count, we wouldn't notice that we read TSC of a different
 VCPU, nor that KVM's bug made pvti invalid, as only migration_count
 on source VCPU is increased.
 
 Change vdso instead of updating migration_count on destination.
 
 Fixes: 0a4e6be9ca17 (x86: kvm: Revert remove sched notifier for cross-cpu 
 migrations)
 Cc: sta...@vger.kernel.org
 Signed-off-by: Radim Krčmář rkrc...@redhat.com
 ---
 Because we'll get a complete rewrite, this series does not
  - remove the outdated 'TODO: We can put [...]' comment
  - use a proper encapsulation for the inner do-while loop
  - optimize the outer do-while loop
(no need to re-read cpu id on version mismatch)
 
  arch/x86/vdso/vclock_gettime.c | 20 
  1 file changed, 12 insertions(+), 8 deletions(-)
 
 diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
 index 30933760ee5f..40d2473836c9 100644
 --- a/arch/x86/vdso/vclock_gettime.c
 +++ b/arch/x86/vdso/vclock_gettime.c
 @@ -99,21 +99,25 @@ static notrace cycle_t vread_pvclock(int *mode)
* __getcpu() calls (Gleb).
*/
  
 - pvti = get_pvti(cpu);
 + /* Make sure migrate_count will change if we leave the VCPU. */
 + do {
 + pvti = get_pvti(cpu);
 +   migrate_count = pvti->migrate_count;
  
 -   migrate_count = pvti->migrate_count;
 +   cpu1 = cpu;
 +   cpu = __getcpu() & VGETCPU_CPU_MASK;
 + } while (unlikely(cpu != cpu1));
  
   version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
  
   /*
* Test we're still on the cpu as well as the version.
 -  * We could have been migrated just after the first
 -  * vgetcpu but before fetching the version, so we
 -  * wouldn't notice a version change.
 +  * - We must read TSC of pvti's VCPU.
 +  * - KVM doesn't follow the versioning protocol, so data could
 +  *   change before version if we left the VCPU.
*/
 -   cpu1 = __getcpu() & VGETCPU_CPU_MASK;
 -   } while (unlikely(cpu != cpu1 ||
 -   (pvti->pvti.version & 1) ||
 +   smp_rmb();
 +   } while (unlikely((pvti->pvti.version & 1) ||
  pvti->pvti.version != version ||
  pvti->migrate_count != migrate_count));
  
 -- 
 2.3.4
 

Reviewed-by: Marcelo Tosatti mtosa...@redhat.com



Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU is blocked

2015-03-30 Thread Marcelo Tosatti
On Mon, Mar 30, 2015 at 04:46:55AM +, Wu, Feng wrote:
 
 
  -Original Message-
  From: Marcelo Tosatti [mailto:mtosa...@redhat.com]
  Sent: Saturday, March 28, 2015 3:30 AM
  To: Wu, Feng
  Cc: h...@zytor.com; t...@linutronix.de; mi...@redhat.com; x...@kernel.org;
  g...@kernel.org; pbonz...@redhat.com; dw...@infradead.org;
  j...@8bytes.org; alex.william...@redhat.com; jiang@linux.intel.com;
  eric.au...@linaro.org; linux-ker...@vger.kernel.org;
  io...@lists.linux-foundation.org; kvm@vger.kernel.org
  Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU
  is blocked
  
  On Fri, Mar 27, 2015 at 06:34:14AM +, Wu, Feng wrote:
 Currently, the following code is executed before local_irq_disable() 
 is
  called,
 so do you mean 1)moving local_irq_disable() to the place before it. 
 2) after
interrupt
 is disabled, set KVM_REQ_EVENT in case the ON bit is set?
   
2) after interrupt is disabled, set KVM_REQ_EVENT in case the ON bit
is set.
  
   Here is my understanding about your comments here:
   - Disable interrupts
   - Check 'ON'
   - Set KVM_REQ_EVENT if 'ON' is set
  
   Then we can put the above code inside  if
  (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) 
   just like it used to be. However, I still have some questions about this
  comment:
  
   1. Where should I set KVM_REQ_EVENT? In function vcpu_enter_guest(), or
  other places?
  
  See below:
  
   If in vcpu_enter_guest(), since currently local_irq_disable() is called 
   after
  'KVM_REQ_EVENT'
   is checked, is it helpful to set KVM_REQ_EVENT after local_irq_disable() 
   is
  called?
  
  local_irq_disable();
  
  *** add code here ***
 
 So we need add code like the following here, right?
 
   if ('ON' is set)
   kvm_make_request(KVM_REQ_EVENT, vcpu);

Yes.

  if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
  ^^

Point *1.

  || need_resched() || signal_pending(current)) {
  vcpu->mode = OUTSIDE_GUEST_MODE;
  smp_wmb();
  local_irq_enable();
  preempt_enable();
  vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
  r = 1;
  goto cancel_injection;
  }
  
   2. 'ON' is set by VT-d hardware, it can be set even when interrupt is 
   disabled
  (the related bit in PIR is also set).
  
  Yes, we are checking if the HW has set an interrupt in PIR while
  outside VM (which requires PIR->VIRR transfer by software).
  
  If the interrupt is set by hardware after local_irq_disable(),
  VMX-entry will handle the interrupt and perform the PIR->VIRR
  transfer and reevaluate interrupts, injecting to guest
  if necessary, is that correct ?
  
   So does it make sense to check 'ON' and set KVM_REQ_EVENT accordingly
  after interrupt is disabled?
  
  To replace the costly
  
  +*/
  +   if (kvm_x86_ops->hwapic_irr_update)
  +   kvm_x86_ops->hwapic_irr_update(vcpu,
  +   kvm_lapic_find_highest_irr(vcpu));
  
  Yes, i think so.
 
 After adding the checking ON and setting KVM_REQ_EVENT operations listed in 
 my
 comments above, do you mean we still need to keep the costly code above
 inside if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {} in 
 function
 vcpu_enter_guest() as it used to be? If yes, my question is what is the exact 
 purpose
 of checking ON and setting KVM_REQ_EVENT operations? Here is the code flow 
 in
 vcpu_enter_guest():
 
 1. Check KVM_REQ_EVENT, if it is set, sync pir->virr
 2. Disable interrupts
 3. Check ON and set KVM_REQ_EVENT -- Here, we set KVM_REQ_EVENT, but it is
 checked in the step 1, which means, we cannot get any benefits even we set it 
 here,
 since the pir->virr sync operation was done in step 1, between step 3 and 
 VM-Entry,
 we don't synchronize the pir to virr. So even we set KVM_REQ_EVENT here, the 
 interrupts
 remaining in PIR cannot be delivered to guest during this VM-Entry, right?

Please check point *1 above. The code will go back to  

if (kvm_check_request(KVM_REQ_EVENT, vcpu)

And perform the pir->virr sync.
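
Putting this exchange together, the flow under discussion is roughly
the following (a schematic sketch, not the actual patch; pi_test_on()
is the PIR 'ON' bit test from the VT-d posted-interrupt series):

static int check_posted_before_entry(struct kvm_vcpu *vcpu,
				     struct pi_desc *pi)
{
	local_irq_disable();

	/* VT-d may have posted an interrupt while we were in root
	 * mode; turn it into a request so the KVM_REQ_EVENT path
	 * performs the PIR->VIRR sync. */
	if (pi_test_on(pi))
		kvm_make_request(KVM_REQ_EVENT, vcpu);

	if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests ||
	    need_resched() || signal_pending(current)) {
		vcpu->mode = OUTSIDE_GUEST_MODE;
		local_irq_enable();
		return 1;	/* cancel injection: the outer loop
				 * re-runs the KVM_REQ_EVENT handling
				 * and syncs PIR->VIRR */
	}
	return 0;		/* proceed with VM-entry */
}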



Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU is blocked

2015-03-27 Thread Marcelo Tosatti
On Fri, Mar 27, 2015 at 06:34:14AM +, Wu, Feng wrote:
   Currently, the following code is executed before local_irq_disable() is 
   called,
   so do you mean 1)moving local_irq_disable() to the place before it. 2) 
   after
  interrupt
   is disabled, set KVM_REQ_EVENT in case the ON bit is set?
  
  2) after interrupt is disabled, set KVM_REQ_EVENT in case the ON bit
  is set.
 
 Here is my understanding about your comments here:
 - Disable interrupts
 - Check 'ON'
 - Set KVM_REQ_EVENT if 'ON' is set
 
 Then we can put the above code inside  if (kvm_check_request(KVM_REQ_EVENT, 
 vcpu) || req_int_win) 
 just like it used to be. However, I still have some questions about this 
 comment:
 
 1. Where should I set KVM_REQ_EVENT? In function vcpu_enter_guest(), or other 
 places?

See below:

 If in vcpu_enter_guest(), since currently local_irq_disable() is called after 
 'KVM_REQ_EVENT'
 is checked, is it helpful to set KVM_REQ_EVENT after local_irq_disable() is 
 called?

local_irq_disable();

*** add code here ***

if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
^^
|| need_resched() || signal_pending(current)) {
vcpu->mode = OUTSIDE_GUEST_MODE;
smp_wmb();
local_irq_enable();
preempt_enable();
vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
r = 1;
goto cancel_injection;
}

 2. 'ON' is set by VT-d hardware, it can be set even when interrupt is 
 disabled (the related bit in PIR is also set).

Yes, we are checking if the HW has set an interrupt in PIR while
outside VM (which requires PIR->VIRR transfer by software).

If the interrupt is set by hardware after local_irq_disable(), 
VMX-entry will handle the interrupt and perform the PIR->VIRR
transfer and reevaluate interrupts, injecting to guest 
if necessary, is that correct ?

 So does it make sense to check 'ON' and set KVM_REQ_EVENT accordingly after 
 interrupt is disabled?

To replace the costly 

+*/
+   if (kvm_x86_ops->hwapic_irr_update)
+   kvm_x86_ops->hwapic_irr_update(vcpu,
+   kvm_lapic_find_highest_irr(vcpu));

Yes, i think so.

 I might miss something in your comments, if so please point out. Thanks a lot!
 
 Thanks,
 Feng
 
  
  
   if (kvm_x86_ops->hwapic_irr_update)
 kvm_x86_ops->hwapic_irr_update(vcpu,
 kvm_lapic_find_highest_irr(vcpu));
  
kvm_lapic_find_highest_irr(vcpu) eats some cache
(4 cachelines) versus 1 cacheline for reading ON bit.
   
Please remove blocked and wakeup_cpu, they should not be
  necessary.
  
   Why do you think wakeup_cpu is not needed, when vCPU is blocked,
   wakeup_cpu saves the cpu which the vCPU is blocked on, after vCPU
   is woken up, it can run on a different cpu, so we need wakeup_cpu 
   to
   find the right list to wake up the vCPU.
 
  If the vCPU was moved it should have updated IRTE destination field
  to the pCPU which it has moved to?

 Every time a vCPU is scheduled to a new pCPU, the IRTE destination 
 field
 would be updated accordingly.

 When a vCPU is blocked, to wake up the blocked vCPU we need to find
  which
 list the vCPU is blocked on, and this is what wakeup_cpu is used for?
   
Right, perhaps prev_vcpu is a better name.
  
   Do you mean prev_pcpu?
  
  Yes.
  
 


[GIT PULL] KVM fixes for 4.0-rc5

2015-03-27 Thread Marcelo Tosatti

Linus,

Please pull from

git://git.kernel.org/pub/scm/virt/kvm/kvm.git master

To receive the following PPC KVM bug fixes


Marcelo Tosatti (1):
  Merge tag 'signed-for-4.0' of git://github.com/agraf/linux-2.6

Paul Mackerras (3):
  KVM: PPC: Book3S HV: Fix spinlock/mutex ordering issue in 
kvmppc_set_lpcr()
  KVM: PPC: Book3S HV: Endian fix for accessing VPA yield count
  KVM: PPC: Book3S HV: Fix instruction emulation

 arch/powerpc/kvm/book3s_hv.c|8 
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |1 +
 2 files changed, 5 insertions(+), 4 deletions(-)


Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-26 Thread Marcelo Tosatti
On Wed, Mar 25, 2015 at 04:22:03PM -0700, Andy Lutomirski wrote:
 On Wed, Mar 25, 2015 at 4:13 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
  On Wed, Mar 25, 2015 at 03:48:02PM -0700, Andy Lutomirski wrote:
  On Wed, Mar 25, 2015 at 3:41 PM, Marcelo Tosatti mtosa...@redhat.com 
  wrote:
   On Wed, Mar 25, 2015 at 03:33:10PM -0700, Andy Lutomirski wrote:
   On Mar 25, 2015 2:29 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
   
On Wed, Mar 25, 2015 at 01:52:15PM +0100, Radim Krčmář wrote:
 2015-03-25 12:08+0100, Radim Krčmář:
  Reverting the patch protects us from any migration, but I don't 
  think we
  need to care about changing VCPUs as long as we read a consistent 
  data
  from kvmclock.  (VCPU can change outside of this loop too, so it 
  doesn't
  matter if we return a value not fit for this VCPU.)
 
  I think we could drop the second __getcpu if our kvmclock was 
  being
  handled better;  maybe with a patch like the one below:

 The second __getcpu is not necessary, but I forgot about rdtsc.
 We need to either use rdtscp, know the host has synchronized tsc, or
 monitor VCPU migrations.  Only the last one works everywhere.
   
The vdso code is only used if host has synchronized tsc.
   
But you have to handle the case where host goes from synchronized tsc 
to
unsynchronized tsc (see the clocksource notifier in the host side).
   
  
   Can't we change the host to freeze all vcpus and clear the stable bit
   on all of them if this happens?  This would simplify and speed up
   vclock_gettime.
  
   --Andy
  
   Seems interesting to do on 512-vcpus, but sure, could be done.
  
 
  If you have a 512-vcpu system that switches between stable and
  unstable more than once per migration, then I expect that you have
  serious problems and this is the least of your worries.
 
  Personally, I'd *much* rather we just made vcpu 0's pvti authoritative
  if we're stable.  If nothing else, I'm not even remotely convinced
  that the current scheme gives monotonic timing due to skew between
  when the updates happen on different vcpus.
 
  Can you write down the problem ?
 
 
 I can try.
 
 Suppose we start out with all vcpus agreeing on their pvti and perfect
 invariant TSCs.  Now the host updates its frequency (due to NTP or
 whatever).  KVM updates vcpu 0's pvti.  Before KVM updates vcpu 1's
 pvti, guest code on vcpus 0 and 1 see synced TSCs but different pvti.
 They'll disagree on the time, and one of them will be ahead until vcpu
 1's pvti gets updated.

The masterclock scheme enforces the same system_timestamp/tsc_timestamp pairs
to be visible at one time, for all vcpus.


 * That is, when timespec0 != timespec1, M < N. Unfortunately that is
 * not
 * always the case (the difference between two distinct xtime instances
 * might be smaller than the difference between corresponding TSC reads,
 * when updating guest vcpus pvclock areas).
 *
 * To avoid that problem, do not allow visibility of distinct
 * system_timestamp/tsc_timestamp values simultaneously: use a master
 * copy of host monotonic time values. Update that master copy
 * in lockstep.




Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-26 Thread Marcelo Tosatti
On Thu, Mar 26, 2015 at 04:28:37PM -0700, Andy Lutomirski wrote:
 On Thu, Mar 26, 2015 at 4:22 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
  On Thu, Mar 26, 2015 at 04:09:53PM -0700, Andy Lutomirski wrote:
  On Thu, Mar 26, 2015 at 3:56 PM, Marcelo Tosatti mtosa...@redhat.com 
  wrote:
   On Thu, Mar 26, 2015 at 01:58:25PM -0700, Andy Lutomirski wrote:
   On Thu, Mar 26, 2015 at 1:31 PM, Radim Krcmar rkrc...@redhat.com 
   wrote:
2015-03-26 11:51-0700, Andy Lutomirski:
On Thu, Mar 26, 2015 at 4:29 AM, Marcelo Tosatti 
mtosa...@redhat.com wrote:
 On Wed, Mar 25, 2015 at 04:22:03PM -0700, Andy Lutomirski wrote:
 Suppose we start out with all vcpus agreeing on their pvti and 
 perfect
 invariant TSCs.  Now the host updates its frequency (due to NTP or
 whatever).  KVM updates vcpu 0's pvti.  Before KVM updates vcpu 
 1's
 pvti, guest code on vcpus 0 and 1 see synced TSCs but different 
 pvti.
 They'll disagree on the time, and one of them will be ahead until 
 vcpu
 1's pvti gets updated.

 The masterclock scheme enforces the same 
 system_timestamp/tsc_timestamp pairs
 to be visible at one time, for all vcpus.


  * That is, when timespec0 != timespec1, M < N. Unfortunately that 
 is
  * not
  * always the case (the difference between two distinct xtime 
 instances
  * might be smaller than the difference between corresponding TSC 
 reads,
  * when updating guest vcpus pvclock areas).
  *
  * To avoid that problem, do not allow visibility of distinct
  * system_timestamp/tsc_timestamp values simultaneously: use a 
 master
  * copy of host monotonic time values. Update that master copy
  * in lockstep.
   
Yuck.  So we have per cpu timing data, but the protocol is only 
usable
for monotonic timing because we forcibly freeze all vcpus when we
update the nominally per cpu data.
   
The obvious guest implementations are still unnecessarily slow,
though.  It would be nice if the guest could get away without using
any getcpu operation at all.
   
Even if we fixed the host to increment version as advertised, I think
we can't avoid two getcpu ops.  We need one before rdtsc to figure 
out
which pvti to look at,
   
Yes.
   
   and we need another to make sure that we were
actually on that cpu at the time we did rdtsc.  (Rdtscp doesn't help
-- we need to check version before rdtsc, and we don't know what
version to check until we do a getcpu.).
   
Exactly, reading cpuid after rdtsc doesn't do that though, we could 
have
migrated back between those reads.
rdtscp would allow us to check that we read tsc of pvti's cpu.
(It doesn't get rid of that first read.)
   
 The migration hook has the
same issue -- we need to check the migration count, then confirm 
we're
on that cpu, then check the migration count again, and we can't do
that until we know what cpu we're on.
   
True;  the revert has a bug -- we need to check cpuid for the second
time before rdtsc.  (Migration hook is there just because we don't 
know
which cpu executed rdtsc.)
  
   One way or another, I'm planning on completely rewriting the vdso
   code.  An early draft is here:
  
   https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdsoid=57ace6e6e032afc4faf7b9ec52f78a8e6642c980
  
   but I can't finish it until the KVM side shakes out.
  
   I think there are at least two ways that would work:
  
   a) If KVM incremented version as advertised:
  
   All for it.
  
   cpu = getcpu();
   pvti = pvti for cpu;
  
 ver1 = pvti->version;
   check stable bit;
   rdtsc_barrier, rdtsc, read scale, shift, etc.
   if (getcpu() != cpu) retry;
 if (pvti->version != ver1) retry;
  
   I think this is safe because, we're guaranteed that there was an
   interval (between the two version reads) in which the vcpu we think
   we're on was running and the kvmclock data was valid and marked
   stable, and we know that the tsc we read came from that interval.
  
 Note: rdtscp isn't needed. If we're stable, it makes no difference
   which cpu's tsc we actually read.
  
   Yes, can't see a problem with that.
  
   b) If version remains buggy but we use this migrations_from hack:
  
   There is no reason for version to remain buggy.
  
   cpu = getcpu();
   pvti = pvti for cpu;
 m1 = pvti->migrations_from;
 barrier();

 ver1 = pvti->version;
 check stable bit;
 rdtsc_barrier, rdtsc, read scale, shift, etc.
 if (getcpu() != cpu) retry;
 if (pvti->version != ver1) retry;  /* probably not really needed */

 barrier();
 if (pvti->migrations_from != m1) retry;
  
   This is just like (a), except that we're using a guest kernel hack to
   ensure that no one migrated off the vcpu during the version-protected
   critical section

Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-26 Thread Marcelo Tosatti
On Thu, Mar 26, 2015 at 09:59:24PM +0100, Radim Krčmář wrote:
 2015-03-23 20:21-0300, Marcelo Tosatti:
  
  The following point:
  
  2. per-CPU pvclock time info is updated if the
 underlying CPU changes.
  
  Is not true anymore since KVM: x86: update pvclock area conditionally,
  on cpu migration.
  
  Add task migration notification back.
  
  Problem noticed by Andy Lutomirski.
  
  Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
  CC: sta...@kernel.org # 3.11+
 
 Revert contains a bug that got pointed out in the discussion:
 
  diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
  do {
  cpu = __getcpu() & VGETCPU_CPU_MASK;
   
  pvti = get_pvti(cpu);
 
 We can migrate to 'other cpu' here.
 
  +   migrate_count = pvti->migrate_count;
  +
  version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
 
 And migrate back to 'cpu' here.

Migrating back will increase pvti->migrate_count, right?

 rdtsc was executed on different cpu, so pvti and tsc might not be in
 sync, but migrate_count hasn't changed.
 
  cpu1 = __getcpu() & VGETCPU_CPU_MASK;
 
 (Reading cpuid here is useless.)
 
  } while (unlikely(cpu != cpu1 ||
 (pvti->pvti.version & 1) ||
  -   pvti->pvti.version != version));
  +   pvti->pvti.version != version ||
  +   pvti->migrate_count != migrate_count));
 
 We can workaround the bug with,
 
   cpu = __getcpu() & VGETCPU_CPU_MASK;
   pvti = get_pvti(cpu);
   migrate_count = pvti->migrate_count;
   if (cpu != (__getcpu() & VGETCPU_CPU_MASK))
   continue;


Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-26 Thread Marcelo Tosatti
On Thu, Mar 26, 2015 at 03:24:10PM -0700, Andy Lutomirski wrote:
 On Thu, Mar 26, 2015 at 3:22 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
  On Thu, Mar 26, 2015 at 09:59:24PM +0100, Radim Krčmář wrote:
  2015-03-23 20:21-0300, Marcelo Tosatti:
  
   The following point:
  
   2. per-CPU pvclock time info is updated if the
  underlying CPU changes.
  
   Is not true anymore since KVM: x86: update pvclock area conditionally,
   on cpu migration.
  
   Add task migration notification back.
  
   Problem noticed by Andy Lutomirski.
  
   Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
   CC: sta...@kernel.org # 3.11+
 
  Revert contains a bug that got pointed out in the discussion:
 
   diff --git a/arch/x86/vdso/vclock_gettime.c 
   b/arch/x86/vdso/vclock_gettime.c
   do {
   cpu = __getcpu() & VGETCPU_CPU_MASK;
  
   pvti = get_pvti(cpu);
 
  We can migrate to 'other cpu' here.
 
   +   migrate_count = pvti->migrate_count;
   +
   version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
 
  And migrate back to 'cpu' here.
 
  Migrating back will increase pvti->migrate_count, right?
 
 I thought it only increased the count when we migrated away.

Right.

 --Andy
 
 
  rdtsc was executed on different cpu, so pvti and tsc might not be in
  sync, but migrate_count hasn't changed.
 
   cpu1 = __getcpu() & VGETCPU_CPU_MASK;
 
  (Reading cpuid here is useless.)
 
   } while (unlikely(cpu != cpu1 ||
 (pvti->pvti.version & 1) ||
   -   pvti->pvti.version != version));
   +   pvti->pvti.version != version ||
   +   pvti->migrate_count != migrate_count));
 
  We can workaround the bug with,
 
cpu = __getcpu() & VGETCPU_CPU_MASK;
pvti = get_pvti(cpu);
migrate_count = pvti->migrate_count;
if (cpu != (__getcpu() & VGETCPU_CPU_MASK))
continue;

Looks good, please submit a fix.




Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-26 Thread Marcelo Tosatti
On Thu, Mar 26, 2015 at 01:58:25PM -0700, Andy Lutomirski wrote:
 On Thu, Mar 26, 2015 at 1:31 PM, Radim Krcmar rkrc...@redhat.com wrote:
  2015-03-26 11:51-0700, Andy Lutomirski:
  On Thu, Mar 26, 2015 at 4:29 AM, Marcelo Tosatti mtosa...@redhat.com 
  wrote:
   On Wed, Mar 25, 2015 at 04:22:03PM -0700, Andy Lutomirski wrote:
   Suppose we start out with all vcpus agreeing on their pvti and perfect
   invariant TSCs.  Now the host updates its frequency (due to NTP or
   whatever).  KVM updates vcpu 0's pvti.  Before KVM updates vcpu 1's
   pvti, guest code on vcpus 0 and 1 see synced TSCs but different pvti.
   They'll disagree on the time, and one of them will be ahead until vcpu
   1's pvti gets updated.
  
   The masterclock scheme enforces the same system_timestamp/tsc_timestamp 
   pairs
   to be visible at one time, for all vcpus.
  
  
* That is, when timespec0 != timespec1, M < N. Unfortunately that is
* not
* always the case (the difference between two distinct xtime instances
* might be smaller than the difference between corresponding TSC reads,
* when updating guest vcpus pvclock areas).
*
* To avoid that problem, do not allow visibility of distinct
* system_timestamp/tsc_timestamp values simultaneously: use a master
* copy of host monotonic time values. Update that master copy
* in lockstep.
 
  Yuck.  So we have per cpu timing data, but the protocol is only usable
  for monotonic timing because we forcibly freeze all vcpus when we
  update the nominally per cpu data.
 
  The obvious guest implementations are still unnecessarily slow,
  though.  It would be nice if the guest could get away without using
  any getcpu operation at all.
 
  Even if we fixed the host to increment version as advertised, I think
  we can't avoid two getcpu ops.  We need one before rdtsc to figure out
  which pvti to look at,
 
  Yes.
 
 and we need another to make sure that we were
  actually on that cpu at the time we did rdtsc.  (Rdtscp doesn't help
  -- we need to check version before rdtsc, and we don't know what
  version to check until we do a getcpu.).
 
  Exactly, reading cpuid after rdtsc doesn't do that though, we could have
  migrated back between those reads.
  rdtscp would allow us to check that we read tsc of pvti's cpu.
  (It doesn't get rid of that first read.)
 
   The migration hook has the
  same issue -- we need to check the migration count, then confirm we're
  on that cpu, then check the migration count again, and we can't do
  that until we know what cpu we're on.
 
  True;  the revert has a bug -- we need to check cpuid for the second
  time before rdtsc.  (Migration hook is there just because we don't know
  which cpu executed rdtsc.)
 
 One way or another, I'm planning on completely rewriting the vdso
 code.  An early draft is here:
 
 https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdsoid=57ace6e6e032afc4faf7b9ec52f78a8e6642c980
 
 but I can't finish it until the KVM side shakes out.
 
 I think there are at least two ways that would work:
 
 a) If KVM incremented version as advertised:

All for it.

 cpu = getcpu();
 pvti = pvti for cpu;
 
 ver1 = pvti->version;
 check stable bit;
 rdtsc_barrier, rdtsc, read scale, shift, etc.
 if (getcpu() != cpu) retry;
 if (pvti->version != ver1) retry;
 
 I think this is safe because, we're guaranteed that there was an
 interval (between the two version reads) in which the vcpu we think
 we're on was running and the kvmclock data was valid and marked
 stable, and we know that the tsc we read came from that interval.
 
 Note: rdtscp isn't needed. If we're stable, it makes no difference
 which cpu's tsc we actually read.

Yes, can't see a problem with that.
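
Algorithm (a) above, written out as a sketch (getcpu(), pvti_for_cpu()
and rdtsc() are illustrative stand-ins for the vdso helpers, not real
function names):

static u64 vread_pvclock_sketch(void)
{
	const struct pvclock_vcpu_time_info *pvti;
	unsigned cpu;
	u32 ver1;
	u64 t;

	for (;;) {
		cpu = getcpu();
		pvti = pvti_for_cpu(cpu);

		ver1 = pvti->version;
		if (ver1 & 1)
			continue;	/* update in progress */
		if (!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))
			return 0;	/* caller falls back to a syscall */

		rdtsc_barrier();
		t = pvti->system_time +
		    pvclock_scale_delta(rdtsc() - pvti->tsc_timestamp,
					pvti->tsc_to_system_mul,
					pvti->tsc_shift);

		/* we stayed on 'cpu' for the whole versioned interval */
		if (getcpu() == cpu && pvti->version == ver1)
			return t;
	}
}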

 b) If version remains buggy but we use this migrations_from hack:

There is no reason for version to remain buggy.

 cpu = getcpu();
 pvti = pvti for cpu;
 m1 = pvti->migrations_from;
 barrier();
 
 ver1 = pvti->version;
 check stable bit;
 rdtsc_barrier, rdtsc, read scale, shift, etc.
 if (getcpu() != cpu) retry;
 if (pvti->version != ver1) retry;  /* probably not really needed */
 
 barrier();
 if (pvti->migrations_from != m1) retry;
 
 This is just like (a), except that we're using a guest kernel hack to
 ensure that no one migrated off the vcpu during the version-protected
 critical section and that we were, in fact, on that vcpu at some point
 during that critical section.  Once we've ensured that we were on
 pvti's associated vcpu for the entire time we were reading it, then we
 are protected by the existing versioning in the host.
 
 
  If, on the other hand, we could rely on having all of these things in
  sync, then this complication goes away, and we go down from two getcpu
  ops to zero.
 
  (Yeah, we should look what are the drawbacks of doing it differently.)
 
 If the versioning were fixed, I think we could almost get away with:
 
 pvti = pvti for vcpu 0;
 
 ver1

Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-26 Thread Marcelo Tosatti
On Thu, Mar 26, 2015 at 04:09:53PM -0700, Andy Lutomirski wrote:
 On Thu, Mar 26, 2015 at 3:56 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
  On Thu, Mar 26, 2015 at 01:58:25PM -0700, Andy Lutomirski wrote:
  On Thu, Mar 26, 2015 at 1:31 PM, Radim Krcmar rkrc...@redhat.com wrote:
   2015-03-26 11:51-0700, Andy Lutomirski:
   On Thu, Mar 26, 2015 at 4:29 AM, Marcelo Tosatti mtosa...@redhat.com 
   wrote:
On Wed, Mar 25, 2015 at 04:22:03PM -0700, Andy Lutomirski wrote:
Suppose we start out with all vcpus agreeing on their pvti and 
perfect
invariant TSCs.  Now the host updates its frequency (due to NTP or
whatever).  KVM updates vcpu 0's pvti.  Before KVM updates vcpu 1's
pvti, guest code on vcpus 0 and 1 see synced TSCs but different pvti.
They'll disagree on the time, and one of them will be ahead until 
vcpu
1's pvti gets updated.
   
The masterclock scheme enforces the same 
system_timestamp/tsc_timestamp pairs
to be visible at one time, for all vcpus.
   
   
 * That is, when timespec0 != timespec1, M < N. Unfortunately that is
 * not
 * always the case (the difference between two distinct xtime 
instances
 * might be smaller than the difference between corresponding TSC 
reads,
 * when updating guest vcpus pvclock areas).
 *
 * To avoid that problem, do not allow visibility of distinct
 * system_timestamp/tsc_timestamp values simultaneously: use a master
 * copy of host monotonic time values. Update that master copy
 * in lockstep.
  
   Yuck.  So we have per cpu timing data, but the protocol is only usable
   for monotonic timing because we forcibly freeze all vcpus when we
   update the nominally per cpu data.
  
   The obvious guest implementations are still unnecessarily slow,
   though.  It would be nice if the guest could get away without using
   any getcpu operation at all.
  
   Even if we fixed the host to increment version as advertised, I think
   we can't avoid two getcpu ops.  We need one before rdtsc to figure out
   which pvti to look at,
  
   Yes.
  
  and we need another to make sure that we were
   actually on that cpu at the time we did rdtsc.  (Rdtscp doesn't help
   -- we need to check version before rdtsc, and we don't know what
   version to check until we do a getcpu.).
  
   Exactly, reading cpuid after rdtsc doesn't do that though, we could have
   migrated back between those reads.
   rdtscp would allow us to check that we read tsc of pvti's cpu.
   (It doesn't get rid of that first read.)
  
The migration hook has the
   same issue -- we need to check the migration count, then confirm we're
   on that cpu, then check the migration count again, and we can't do
   that until we know what cpu we're on.
  
   True;  the revert has a bug -- we need to check cpuid for the second
   time before rdtsc.  (Migration hook is there just because we don't know
   which cpu executed rdtsc.)
 
  One way or another, I'm planning on completely rewriting the vdso
  code.  An early draft is here:
 
  https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdsoid=57ace6e6e032afc4faf7b9ec52f78a8e6642c980
 
  but I can't finish it until the KVM side shakes out.
 
  I think there are at least two ways that would work:
 
  a) If KVM incremented version as advertised:
 
  All for it.
 
  cpu = getcpu();
  pvti = pvti for cpu;
 
  ver1 = pvti->version;
  check stable bit;
  rdtsc_barrier, rdtsc, read scale, shift, etc.
  if (getcpu() != cpu) retry;
  if (pvti->version != ver1) retry;
 
  I think this is safe because, we're guaranteed that there was an
  interval (between the two version reads) in which the vcpu we think
  we're on was running and the kvmclock data was valid and marked
  stable, and we know that the tsc we read came from that interval.
 
  Note: rdtscp isn't needed. If we're stable, it makes no difference
  which cpu's tsc we actually read.
 
  Yes, can't see a problem with that.
 
  b) If version remains buggy but we use this migrations_from hack:
 
  There is no reason for version to remain buggy.
 
  cpu = getcpu();
  pvti = pvti for cpu;
  m1 = pvti->migrations_from;
  barrier();
 
  ver1 = pvti->version;
  check stable bit;
  rdtsc_barrier, rdtsc, read scale, shift, etc.
  if (getcpu() != cpu) retry;
  if (pvti->version != ver1) retry;  /* probably not really needed */
 
  barrier();
  if (pvti->migrations_from != m1) retry;
 
  This is just like (a), except that we're using a guest kernel hack to
  ensure that no one migrated off the vcpu during the version-protected
  critical section and that we were, in fact, on that vcpu at some point
  during that critical section.  Once we've ensured that we were on
  pvti's associated vcpu for the entire time we were reading it, then we
  are protected by the existing versioning in the host.
 
  
   If, on the other hand, we could rely on having all

Re: [PATCH v2 06/12] KVM: mark kvm->buses as empty once they were destroyed

2015-03-26 Thread Marcelo Tosatti
On Wed, Mar 25, 2015 at 05:09:13PM +, Marc Zyngier wrote:
 On 23/03/15 15:58, Andre Przywara wrote:
  In kvm_destroy_vm() we call kvm_io_bus_destroy() pretty early,
  especially before calling kvm_arch_destroy_vm(). To avoid
  unregistering devices from the already destroyed bus, let's mark
  the bus with NULL to let other users know it has been destroyed
  already.
  This avoids a crash on a VM shutdown with the VGIC using the
  kvm_io_bus later (the unregistering is in there to be able to roll
  back a faulting init).
  
  Signed-off-by: Andre Przywara andre.przyw...@arm.com
 
 That seems sensible, but I don't see why nobody else hits that. What are
 we doing differently?
 
 Otherwise,
 
 Reviewed-by: Marc Zyngier marc.zyng...@arm.com
 
 Paolo, Marcelo, can we have your Ack on this?
 
 Thanks,
 
   M.
 
  ---
   virt/kvm/kvm_main.c |4 +++-
   1 file changed, 3 insertions(+), 1 deletion(-)
  
  diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
  index 8c7ab0b..6f164eb 100644
  --- a/virt/kvm/kvm_main.c
  +++ b/virt/kvm/kvm_main.c
  @@ -604,8 +604,10 @@ static void kvm_destroy_vm(struct kvm *kvm)
list_del(&kvm->vm_list);
spin_unlock(&kvm_lock);
kvm_free_irq_routing(kvm);
-   for (i = 0; i < KVM_NR_BUSES; i++)
+   for (i = 0; i < KVM_NR_BUSES; i++) {
kvm_io_bus_destroy(kvm->buses[i]);
+   kvm->buses[i] = NULL;
+   }
kvm_coalesced_mmio_free(kvm);
 #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
  
 
 
 -- 
 Jazz is not dead. It just smells funny...

Reviewed-by: Marcelo Tosatti mtosa...@redhat.com



Re: [PATCH v3] KVM: nVMX: Add support for rdtscp

2015-03-26 Thread Marcelo Tosatti
On Mon, Mar 23, 2015 at 07:27:19PM +0100, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com
 
 If the guest CPU is supposed to support rdtscp and the host has rdtscp
 enabled in the secondary execution controls, we can also expose this
 feature to L1. Just extend nested_vmx_exit_handled to properly route
 EXIT_REASON_RDTSCP.
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com

Applied, thanks.



Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-25 Thread Marcelo Tosatti
On Wed, Mar 25, 2015 at 01:52:15PM +0100, Radim Krčmář wrote:
 2015-03-25 12:08+0100, Radim Krčmář:
  Reverting the patch protects us from any migration, but I don't think we
  need to care about changing VCPUs as long as we read a consistent data
  from kvmclock.  (VCPU can change outside of this loop too, so it doesn't
  matter if we return a value not fit for this VCPU.)
  
  I think we could drop the second __getcpu if our kvmclock was being
  handled better;  maybe with a patch like the one below:
 
 The second __getcpu is not necessary, but I forgot about rdtsc.
 We need to either use rdtscp, know the host has synchronized tsc, or
 monitor VCPU migrations.  Only the last one works everywhere.

The vdso code is only used if host has synchronized tsc.

But you have to handle the case where host goes from synchronized tsc to
unsynchronized tsc (see the clocksource notifier in the host side).



Re: [PATCH v2 04/12] KVM: x86: remove now unneeded include directory from Makefile

2015-03-25 Thread Marcelo Tosatti
On Wed, Mar 25, 2015 at 05:01:52PM +, Marc Zyngier wrote:
 On 23/03/15 15:58, Andre Przywara wrote:
  virt/kvm was never really a good include directory for anything else
  than locally included headers.
  With the move of iodev.h there is no need anymore to add this
  directory to the compiler's include path, so remove it from the x86 kvm
  Makefile.
  
  Signed-off-by: Andre Przywara andre.przyw...@arm.com
 
 Reviewed-by: Marc Zyngier marc.zyng...@arm.com
 
 Paolo, Marcelo: can we have your Ack on this?
 
 Thanks,
 
   M.

Reviewed-by: Marcelo Tosatti mtosa...@redhat.com



Re: [PATCH v2 02/12] KVM: move iodev.h from virt/kvm/ to include/kvm

2015-03-25 Thread Marcelo Tosatti
On Wed, Mar 25, 2015 at 05:00:02PM +, Marc Zyngier wrote:
 On 23/03/15 15:58, Andre Przywara wrote:
  iodev.h contains definitions for the kvm_io_bus framework. This is
  needed both by the generic KVM code in virt/kvm as well as by
  architecture specific code under arch/. Putting the header file in
  virt/kvm and using local includes in the architecture part seems at
  least dodgy to me, so let's move the file into include/kvm, so that a
  more natural #include <kvm/iodev.h> can be used by all of the code.
  This also solves a problem later when using struct kvm_io_device
  in arm_vgic.h.
  Fixing up the FSF address in the GPL header and a wrong include path
  on the way.
  
  Signed-off-by: Andre Przywara andre.przyw...@arm.com
  Acked-by: Christoffer Dall christoffer.d...@linaro.org
 
 Reviewed-by: Marc Zyngier marc.zyng...@arm.com
 
 Paolo, Marcelo: can we have your Ack on this?
 
 Thanks,

Reviewed-by: Marcelo Tosatti mtosa...@redhat.com



Re: [PATCH v2 06/12] KVM: mark kvm->buses as empty once they were destroyed

2015-03-25 Thread Marcelo Tosatti
On Wed, Mar 25, 2015 at 05:09:13PM +, Marc Zyngier wrote:
 On 23/03/15 15:58, Andre Przywara wrote:
  In kvm_destroy_vm() we call kvm_io_bus_destroy() pretty early,
  especially before calling kvm_arch_destroy_vm(). To avoid
  unregistering devices from the already destroyed bus, let's mark
  the bus with NULL to let other users know it has been destroyed
  already.
  This avoids a crash on a VM shutdown with the VGIC using the
  kvm_io_bus later (the unregistering is in there to be able to roll
  back a faulting init).
  
  Signed-off-by: Andre Przywara andre.przyw...@arm.com
 
 That seems sensible, but I don't see why nobody else hits that. What are
 we doing differently?

It should be valid to call kvm_io_bus_unregister_dev after
kvm_io_bus_destroy.

Are you patching it to handle NULL kvm->buses[bus_idx]?

 Otherwise,
 
 Reviewed-by: Marc Zyngier marc.zyng...@arm.com
 
 Paolo, Marcelo, can we have your Ack on this?
 
 Thanks,



Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-25 Thread Marcelo Tosatti
On Wed, Mar 25, 2015 at 03:33:10PM -0700, Andy Lutomirski wrote:
 On Mar 25, 2015 2:29 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
 
  On Wed, Mar 25, 2015 at 01:52:15PM +0100, Radim Krčmář wrote:
   2015-03-25 12:08+0100, Radim Krčmář:
Reverting the patch protects us from any migration, but I don't think we
need to care about changing VCPUs as long as we read a consistent data
from kvmclock.  (VCPU can change outside of this loop too, so it doesn't
matter if we return a value not fit for this VCPU.)
   
I think we could drop the second __getcpu if our kvmclock was being
handled better;  maybe with a patch like the one below:
  
   The second __getcpu is not necessary, but I forgot about rdtsc.
   We need to either use rdtscp, know the host has synchronized tsc, or
   monitor VCPU migrations.  Only the last one works everywhere.
 
  The vdso code is only used if host has synchronized tsc.
 
  But you have to handle the case where host goes from synchronized tsc to
  unsynchronized tsc (see the clocksource notifier in the host side).
 
 
 Can't we change the host to freeze all vcpus and clear the stable bit
 on all of them if this happens?  This would simplify and speed up
 vclock_gettime.
 
 --Andy

Seems interesting to do on 512-vcpus, but sure, could be done.



Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU is blocked

2015-03-25 Thread Marcelo Tosatti
On Mon, Mar 16, 2015 at 11:42:06AM +, Wu, Feng wrote:
  Do you have any reason why having the code at vcpu_put/vcpu_load is
  better than the proposal to have the code at kvm_vcpu_block?
 
 I think your proposal is good, I just want to better understand your idea 
 here.:)

Reduce the overhead of vcpu sched in / vcpu sched out, basically.

 One thing, even we put the code to kvm_vcpu_block, we still need to add code
 at vcpu_put/vcpu_load for the preemption case like what I did now.
 
  
  
   Global vector is a limited resources in the system, and this involves
   common x86 interrupt code changes. I am not sure we can allocate
   so many dedicated global vector for KVM usage.
  
  Why not? Have KVM use all free vectors (so if vectors are necessary for
  other purposes, people should shrink the KVM vector pool).
 
 If we want to allocate more global vector for this usage, we need hpa's
 input about it. Peter, what is your opinion?

Peter?

  BTW the Intel docs talk about that (one vector per vCPU).
 Yes, the Spec talks about this, but it is more complex using one vector per 
 vCPU.
 
  
  It seems there is a bunch free:
 
  commit 52aec3308db85f4e9f5c8b9f5dc4fbd0138c6fa4
  Author: Alex Shi alex@intel.com
  Date:   Thu Jun 28 09:02:23 2012 +0800
 
  x86/tlb: replace INVALIDATE_TLB_VECTOR by
CALL_FUNCTION_VECTOR
 
  Can you add only vcpus which have posted IRTEs that point to this 
  pCPU
  to the HLT'ed vcpu lists? (so for example, vcpus without assigned
  devices are not part of the list).

 Is it easy to find whether a vCPU (or the associated domain) has 
 assigned
devices?
 If so, we can only add those vCPUs with assigned devices.
   
When configuring IRTE, at kvm_arch_vfio_update_pi_irte?
  
   Yes.
  
   
   +
static int __init vmx_init(void)
{
 int r, i, msr;
   @@ -9429,6 +9523,8 @@ static int __init vmx_init(void)
  
 update_ple_window_actual_max();
  
   + wakeup_handler_callback = wakeup_handler;
   +
 return 0;
  
out7:
   diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
   index 0033df3..1551a46 100644
   --- a/arch/x86/kvm/x86.c
   +++ b/arch/x86/kvm/x86.c
   @@ -6152,6 +6152,21 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 kvm_vcpu_reload_apic_access_page(vcpu);
 }
  
   + /*
   +  * Since posted-interrupts can be set by VT-d HW now, in this
   +  * case, KVM_REQ_EVENT is not set. We move the following
   +  * operations out of the if statement.
   +  */
   + if (kvm_lapic_enabled(vcpu)) {
   + /*
   +  * Update architecture specific hints for APIC
   +  * virtual interrupt delivery.
   +  */
   + if (kvm_x86_ops->hwapic_irr_update)
   + kvm_x86_ops->hwapic_irr_update(vcpu,
   + kvm_lapic_find_highest_irr(vcpu));
   + }
   +
 
  This is a hot fast path. You can set KVM_REQ_EVENT from
  wakeup_handler.

 I am afraid Setting KVM_REQ_EVENT from wakeup_handler doesn't help
much,
 if vCPU is running in ROOT mode, and VT-d hardware issues a 
 notification
event,
 POSTED_INTR_VECTOR interrupt handler will be called.
   
If vCPU is in root mode, remapping HW will find IRTE configured with
vector == POSTED_INTR_WAKEUP_VECTOR, use that vector, which will
VM-exit, and execute the interrupt handler wakeup_handler. Right?
  
   There are two cases:
   Case 1: vCPU is blocked, so it is in root mode, this is what you described
  above.
   Case 2, vCPU is running in root mode, such as, handling vm-exits, in this 
   case,
   the notification vector is 'POSTED_INTR_VECTOR', and if external 
   interrupts
   from assigned devices happen, the handler of 'POSTED_INTR_VECTOR' will
   be called (it is 'smp_kvm_posted_intr_ipi' in fact), this routine 
   doesn't need to
   do real things, since the pending interrupts in PIR will be synced to vIRR
  before
   VM-Entry (this code have already been there when enabling CPU-side
   posted-interrupt along with APICv). Like what I said before, it is a 
   little hard
  to
   get vCPU related information in it, even if we get, it is not accurate 
   and may
  harm
   the performance.(need search)
  
   So only setting KVM_REQ_EVENT in wakeup_handler cannot cover the
  notification
   event for 'POSTED_INTR_VECTOR'.
  
   
The point of this comment is that you can keep the
   
if (kvm_x86_ops->hwapic_irr_update)
kvm_x86_ops->hwapic_irr_update(vcpu,
kvm_lapic_find_highest_irr(vcpu));

   
Code inside KVM_REQ_EVENT handling section of vcpu_run, as long as
wakeup_handler sets KVM_REQ_EVENT.
  
   Please see above.
  
  OK, can you set KVM_REQ_EVENT in case the ON bit is set,
  after disabling 

Re: [PULL 0/3] 4.0 patch queue 2015-03-25

2015-03-25 Thread Marcelo Tosatti
On Wed, Mar 25, 2015 at 10:58:54PM +0100, Alexander Graf wrote:
 Hi Paolo,
 
 This is my current patch queue for 4.0.  Please pull.
 
 Alex
 
 
 The following changes since commit f710a12d73dfa1c3a5d2417f2482b970f03bb850:
 
   Merge tag 'kvm-arm-fixes-4.0-rc5' of 
 git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm (2015-03-16 
 20:08:56 -0300)
 
 are available in the git repository at:
 
 
   git://github.com/agraf/linux-2.6.git tags/signed-for-4.0
 
 for you to fetch changes up to 2bf27601c7b50b6ced72f27304109dc52eb52919:
 
   KVM: PPC: Book3S HV: Fix instruction emulation (2015-03-20 11:42:33 +0100)

Pulled, thanks.




Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-25 Thread Marcelo Tosatti
On Wed, Mar 25, 2015 at 03:48:02PM -0700, Andy Lutomirski wrote:
 On Wed, Mar 25, 2015 at 3:41 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
  On Wed, Mar 25, 2015 at 03:33:10PM -0700, Andy Lutomirski wrote:
  On Mar 25, 2015 2:29 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
  
   On Wed, Mar 25, 2015 at 01:52:15PM +0100, Radim Krčmář wrote:
2015-03-25 12:08+0100, Radim Krčmář:
 Reverting the patch protects us from any migration, but I don't 
 think we
 need to care about changing VCPUs as long as we read a consistent 
 data
 from kvmclock.  (VCPU can change outside of this loop too, so it 
 doesn't
 matter if we return a value not fit for this VCPU.)

 I think we could drop the second __getcpu if our kvmclock was being
 handled better;  maybe with a patch like the one below:
   
The second __getcpu is not necessary, but I forgot about rdtsc.
We need to either use rdtscp, know the host has synchronized tsc, or
monitor VCPU migrations.  Only the last one works everywhere.
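
(For reference, migration detection via rdtscp would look roughly like
this. A hypothetical sketch: Linux stores the CPU number in the low
bits of IA32_TSC_AUX, which rdtscp returns in ECX along with the TSC:)

    /* Read the TSC and the current CPU number in one instruction. */
    static inline u64 rdtscp_cpu(unsigned int *cpu)
    {
            unsigned int lo, hi, aux;

            asm volatile("rdtscp" : "=a" (lo), "=d" (hi), "=c" (aux));
            *cpu = aux & 0xfff;     /* low 12 bits: CPU number */
            return ((u64)hi << 32) | lo;
    }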
  
   The vdso code is only used if host has synchronized tsc.
  
   But you have to handle the case where host goes from synchronized tsc to
   unsynchronized tsc (see the clocksource notifier in the host side).
  
 
  Can't we change the host to freeze all vcpus and clear the stable bit
  on all of them if this happens?  This would simplify and speed up
  vclock_gettime.
 
  --Andy
 
  Seems interesting to do on 512-vcpus, but sure, could be done.
 
 
 If you have a 512-vcpu system that switches between stable and
 unstable more than once per migration, then I expect that you have
 serious problems and this is the least of your worries.
 
 Personally, I'd *much* rather we just made vcpu 0's pvti authoritative
 if we're stable.  If nothing else, I'm not even remotely convinced
 that the current scheme gives monotonic timing due to skew between
 when the updates happen on different vcpus.

Can you write down the problem?



[GIT PULL] KVM fixes for 4.0-rc5

2015-03-24 Thread Marcelo Tosatti

Linus,

Please pull from

git://git.kernel.org/pub/scm/virt/kvm/kvm.git master

To receive the following: a fix for higher-order page
allocation failures, a fix for Xen-on-KVM with x2apic, and a fix for
an L1 crash with unrestricted guest mode (nested VMX).


Igor Mammedov (1):
  kvm: avoid page allocation failure in kvm_set_memory_region()

Radim Krčmář (2):
  KVM: nVMX: mask unrestricted_guest if disabled on L0
  KVM: x86: call irq notifiers with directed EOI

 arch/x86/kvm/ioapic.c |4 +++-
 arch/x86/kvm/lapic.c  |3 +--
 arch/x86/kvm/vmx.c|7 +--
 virt/kvm/kvm_main.c   |   14 +++---
 4 files changed, 16 insertions(+), 12 deletions(-)


Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-24 Thread Marcelo Tosatti
On Tue, Mar 24, 2015 at 04:34:12PM +0100, Radim Krčmář wrote:
 2015-03-23 20:21-0300, Marcelo Tosatti:
  The following point:
  
  2. per-CPU pvclock time info is updated if the
 underlying CPU changes.
  
  Is not true anymore since "KVM: x86: update pvclock area
  conditionally, on cpu migration".
 
 I think that the revert doesn't fix point 2.:  KVM: x86: update pvclock
 [...] changed the host to skip clock update on physical CPU change, but
 guest's task migration notifier isn't tied to it at all.

"per-CPU pvclock time info is updated if the underlying CPU changes"
is the same as
"always perform clock update on physical CPU change".

That was a requirement for the original patch, to drop migration
notifiers.

 (Guest can have all tasks pinned, so the revert changed nothing.)
 
  Add task migration notification back.
  
  Problem noticed by Andy Lutomirski.
 
 What is the problem?
 
 Thanks.

The problem is this:

T1) guest thread1 on vcpu1.
T2) guest thread1 on vcpu2.
T3) guest thread1 on vcpu1.

Inside a pvclock read loop.

Since the hypervisor's writes to the pvclock area are not ordered,
you cannot rely on version being updated _before_
the rest of the pvclock data.

(in the case above, the "has the physical cpu changed" check inside
the guest's thread1 obviously fails).
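
(To make the window concrete, a sketch of the hazard; illustration
only, not actual kernel code, and mul_shift() stands in for the
pvclock scale/shift arithmetic:)

    /* Host updates vcpu1's pvti with unordered stores: */
    pvti->tsc_timestamp = new_tsc;      /* (A) */
    pvti->system_time   = new_time;     /* (B) */
    pvti->version++;                    /* (C) may become visible
                                           before or after (A), (B) */

    /* Guest thread, migrated vcpu1 -> vcpu2 -> vcpu1 in the loop: */
    do {
            version = pvti->version;    /* still sees the old version */
            time = pvti->system_time +
                   mul_shift(rdtsc() - pvti->tsc_timestamp);
                                        /* may mix old/new (A), (B) */
    } while ((pvti->version & 1) || pvti->version != version);
                                        /* (C) not yet visible, so the
                                           check passes anyway */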






Re: [GIT PULL 00/11] KVM: s390: Features and fixes for 4.1 (kvm/next)

2015-03-23 Thread Marcelo Tosatti
On Wed, Mar 18, 2015 at 12:43:58PM +0100, Christian Borntraeger wrote:
 Paolo, Marcelo,
 
 here is the followup pull request. As Marcelo has not yet pushed out
 queue or next to git.kernel.org, this request is based on the previous
 s390 pull request and should merge without conflicts.
 
 For details see tag description.
 
 Christian
 
 The following changes since commit 13211ea7b47db3d8ee2ff258a9a973a6d3aa3d43:
 
   KVM: s390: Enable vector support for capable guest (2015-03-06 13:49:35 
 +0100)
 
 are available in the git repository at:
 
   git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git  
 tags/kvm-s390-next-20150318
 
 for you to fetch changes up to 18280d8b4bcd4a2b174ee3cd748166c6190acacb:
 
   KVM: s390: represent SIMD cap in kvm facility (2015-03-17 16:33:14 +0100)

Pulled, thanks.



x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-23 Thread Marcelo Tosatti

The following point:

2. per-CPU pvclock time info is updated if the
   underlying CPU changes.

Is not true anymore since "KVM: x86: update pvclock area
conditionally, on cpu migration".

Add task migration notification back.

Problem noticed by Andy Lutomirski.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
CC: sta...@kernel.org # 3.11+

diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index d6b078e..25b1cc0 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -95,6 +95,7 @@ unsigned __pvclock_read_cycles(const struct 
pvclock_vcpu_time_info *src,
 
 struct pvclock_vsyscall_time_info {
struct pvclock_vcpu_time_info pvti;
+   u32 migrate_count;
 } __attribute__((__aligned__(SMP_CACHE_BYTES)));
 
 #define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info)
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 2f355d2..e5ecd20 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -141,7 +141,46 @@ void pvclock_read_wallclock(struct pvclock_wall_clock 
*wall_clock,
set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
 }
 
+static struct pvclock_vsyscall_time_info *pvclock_vdso_info;
+
+static struct pvclock_vsyscall_time_info *
+pvclock_get_vsyscall_user_time_info(int cpu)
+{
+   if (!pvclock_vdso_info) {
+   BUG();
+   return NULL;
+   }
+
+   return &pvclock_vdso_info[cpu];
+}
+
+struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu)
+{
+   return &pvclock_get_vsyscall_user_time_info(cpu)->pvti;
+}
+
 #ifdef CONFIG_X86_64
+static int pvclock_task_migrate(struct notifier_block *nb, unsigned long l,
+   void *v)
+{
+   struct task_migration_notifier *mn = v;
+   struct pvclock_vsyscall_time_info *pvti;
+
+   pvti = pvclock_get_vsyscall_user_time_info(mn->from_cpu);
+
+   /* this is NULL when pvclock vsyscall is not initialized */
+   if (unlikely(pvti == NULL))
+   return NOTIFY_DONE;
+
+   pvti->migrate_count++;
+
+   return NOTIFY_DONE;
+}
+
+static struct notifier_block pvclock_migrate = {
+   .notifier_call = pvclock_task_migrate,
+};
+
 /*
  * Initialize the generic pvclock vsyscall state.  This will allocate
  * a/some page(s) for the per-vcpu pvclock information, set up a
@@ -155,12 +194,17 @@ int __init pvclock_init_vsyscall(struct 
pvclock_vsyscall_time_info *i,
 
WARN_ON (size != PVCLOCK_VSYSCALL_NR_PAGES*PAGE_SIZE);
 
+   pvclock_vdso_info = i;
+
for (idx = 0; idx <= (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) {
__set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx,
 __pa(i) + (idx*PAGE_SIZE),
 PAGE_KERNEL_VVAR);
}
 
+
+   register_task_migration_notifier(pvclock_migrate);
+
return 0;
 }
 #endif
diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
index 9793322..3093376 100644
--- a/arch/x86/vdso/vclock_gettime.c
+++ b/arch/x86/vdso/vclock_gettime.c
@@ -82,18 +82,15 @@ static notrace cycle_t vread_pvclock(int *mode)
cycle_t ret;
u64 last;
u32 version;
+   u32 migrate_count;
u8 flags;
unsigned cpu, cpu1;
 
 
/*
-* Note: hypervisor must guarantee that:
-* 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
-* 2. that per-CPU pvclock time info is updated if the
-*underlying CPU changes.
-* 3. that version is increased whenever underlying CPU
-*changes.
-*
+* When looping to get a consistent (time-info, tsc) pair, we
+* also need to deal with the possibility we can switch vcpus,
+* so make sure we always re-fetch time-info for the current vcpu.
 */
do {
cpu = __getcpu() & VGETCPU_CPU_MASK;
@@ -104,6 +101,8 @@ static notrace cycle_t vread_pvclock(int *mode)
 
pvti = get_pvti(cpu);
 
+   migrate_count = pvti->migrate_count;
+
version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
 
/*
@@ -115,7 +114,8 @@ static notrace cycle_t vread_pvclock(int *mode)
cpu1 = __getcpu() & VGETCPU_CPU_MASK;
} while (unlikely(cpu != cpu1 ||
  (pvti->pvti.version & 1) ||
- pvti->pvti.version != version));
+ pvti->pvti.version != version ||
+ pvti->migrate_count != migrate_count));
 
if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
*mode = VCLOCK_NONE;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d77432..be98910 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -176,6 +176,14 @@ extern void get_iowait_load(unsigned long *nr_waiters, 
unsigned long *load);
 extern void calc_global_load(unsigned long ticks);
 extern void update_cpu_load_nohz(void

Re: [PATCH] KVM: x86: call irq notifiers with directed EOI

2015-03-23 Thread Marcelo Tosatti
On Wed, Mar 18, 2015 at 07:38:22PM +0100, Radim Krčmář wrote:
 kvm_ioapic_update_eoi() wasn't called if directed EOI was enabled.
 We need to do that for irq notifiers.  (Like with edge interrupts.)
 
 Fix it by skipping EOI broadcast only.
 
 Bug: https://bugzilla.kernel.org/show_bug.cgi?id=82211
 Signed-off-by: Radim Krčmář rkrc...@redhat.com
 ---
  arch/x86/kvm/ioapic.c | 4 +++-
  arch/x86/kvm/lapic.c  | 3 +--
  2 files changed, 4 insertions(+), 3 deletions(-)

Applied to master, thanks.



Re: [PATCH] KVM: x86: inline kvm_ioapic_handles_vector()

2015-03-23 Thread Marcelo Tosatti
On Thu, Mar 19, 2015 at 09:52:41PM +0100, Radim Krčmář wrote:
 An overhead from function call is not appropriate for its size and
 frequency of execution.
 
 Suggested-by: Paolo Bonzini pbonz...@redhat.com
 Signed-off-by: Radim Krčmář rkrc...@redhat.com
 ---
   I'm not very fond of that smp_rmb(): there is no real synchronization
   against update_handled_vectors(), 

Yes, because the guest OS should provide synchronization (it should
shut down interrupts before attempting to modify the IOAPIC table).

The smp_wmb is necessary.

  so the only point I see is to drop
   cached value of handled_vectors, which seems like bad use of LFENCE.

test_bit has volatile on *addr, so I don't see why the smp_rmb is
necessary at all.

Applied, thanks.
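
(For reference, the inlined helper under discussion is roughly the
following; a sketch based on the description above, not necessarily
the exact committed code:)

    static inline int kvm_ioapic_handles_vector(struct kvm *kvm, int vector)
    {
            struct kvm_ioapic *ioapic = kvm->arch.vioapic;

            /* Pairs with the smp_wmb() in update_handled_vectors(). */
            smp_rmb();
            return test_bit(vector, ioapic->handled_vectors);
    }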



x86: kvm: rename migrate_count variable

2015-03-23 Thread Marcelo Tosatti

As that's more indicative of the variable's usage.

Suggested by Andy Lutomirski.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 25b1cc0..1c1b474 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -95,7 +95,7 @@ unsigned __pvclock_read_cycles(const struct 
pvclock_vcpu_time_info *src,
 
 struct pvclock_vsyscall_time_info {
struct pvclock_vcpu_time_info pvti;
-   u32 migrate_count;
+   u32 migrate_from_count;
 } __attribute__((__aligned__(SMP_CACHE_BYTES)));
 
 #define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info)
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index e5ecd20..8eaf04b 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -172,7 +172,7 @@ static int pvclock_task_migrate(struct notifier_block *nb, 
unsigned long l,
if (unlikely(pvti == NULL))
return NOTIFY_DONE;
 
-   pvti->migrate_count++;
+   pvti->migrate_from_count++;
 
return NOTIFY_DONE;
 }
diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
index 3093376..ef8bb76 100644
--- a/arch/x86/vdso/vclock_gettime.c
+++ b/arch/x86/vdso/vclock_gettime.c
@@ -82,7 +82,7 @@ static notrace cycle_t vread_pvclock(int *mode)
cycle_t ret;
u64 last;
u32 version;
-   u32 migrate_count;
+   u32 migrate_from_count;
u8 flags;
unsigned cpu, cpu1;
 
@@ -101,7 +101,7 @@ static notrace cycle_t vread_pvclock(int *mode)
 
pvti = get_pvti(cpu);
 
-   migrate_count = pvti->migrate_count;
+   migrate_from_count = pvti->migrate_from_count;
 
version = __pvclock_read_cycles(pvti-pvti, ret, flags);
 
@@ -115,7 +115,7 @@ static notrace cycle_t vread_pvclock(int *mode)
} while (unlikely(cpu != cpu1 ||
  (pvti->pvti.version & 1) ||
  pvti->pvti.version != version ||
- pvti->migrate_count != migrate_count));
+ pvti->migrate_from_count != migrate_from_count));
 
if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
*mode = VCLOCK_NONE;



Re: [PATCH] kvm: avoid page allocation failure in kvm_set_memory_region()

2015-03-20 Thread Marcelo Tosatti
On Fri, Mar 20, 2015 at 09:51:26AM +, Igor Mammedov wrote:
 KVM guest can fail to start up with the following trace on the host:
 
 qemu-system-x86: page allocation failure: order:4, mode:0x40d0
 Call Trace:
   dump_stack+0x47/0x67
   warn_alloc_failed+0xee/0x150
   __alloc_pages_direct_compact+0x14a/0x150
   __alloc_pages_nodemask+0x776/0xb80
   alloc_kmem_pages+0x3a/0x110
   kmalloc_order+0x13/0x50
   kmemdup+0x1b/0x40
   __kvm_set_memory_region+0x24a/0x9f0 [kvm]
   kvm_set_ioapic+0x130/0x130 [kvm]
   kvm_set_memory_region+0x21/0x40 [kvm]
   kvm_vm_ioctl+0x43f/0x750 [kvm]
 
 Failure happens when attempting to allocate pages for
 'struct kvm_memslots'; however, it doesn't have to be
 present in physically contiguous (kmalloc-ed) address
 space. Change the allocation to kvm_kvzalloc() so that
 it will be vmalloc-ed when its size is more than a page.
 
 Signed-off-by: Igor Mammedov imamm...@redhat.com

Igor, two things:

1) kvm_create_vm should also use vmalloc

r = -ENOMEM;
kvm->memslots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
if (!kvm->memslots)
goto out_err_no_srcu;

2) there are additional places where it's necessary to use the
proper freeing function, I believe:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ce7888a..651ff2d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -522,7 +522,7 @@ out_err_no_srcu:
 out_err_no_disable:
for (i = 0; i < KVM_NR_BUSES; i++)
kfree(kvm->buses[i]);
-   kfree(kvm->memslots);
+   kvfree(kvm->memslots);
kvm_arch_free_vm(kvm);
return ERR_PTR(r);
 }
@@ -570,7 +570,7 @@ static void kvm_free_physmem(struct kvm *kvm)
kvm_for_each_memslot(memslot, slots)
kvm_free_physmem_slot(kvm, memslot, NULL);
 
-   kfree(kvm->memslots);
+   kvfree(kvm->memslots);
 }
 
 static void kvm_destroy_devices(struct kvm *kvm)
@@ -909,7 +922,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
kvm_arch_commit_memory_region(kvm, mem, &old, change);
 
kvm_free_physmem_slot(kvm, &old, &new);
-   kfree(old_memslots);
+   kvfree(old_memslots);
 
/*
 * IOMMU mapping:  New slots need to be mapped.  Old slots need to be

 ---
 TODO:
  - work on follow up patches to allocate space for
actual amount of memory_slots instead of possible maximum.
 ---
  virt/kvm/kvm_main.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)
 
 diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
 index a2214d9..7ed1f5c 100644
 --- a/virt/kvm/kvm_main.c
 +++ b/virt/kvm/kvm_main.c
 @@ -871,10 +871,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
   goto out_free;
   }
  
  - slots = kmemdup(kvm->memslots, sizeof(struct kvm_memslots),
 - GFP_KERNEL);
 + slots = kvm_kvzalloc(sizeof(struct kvm_memslots));
   if (!slots)
   goto out_free;
  + memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
  
   if ((change == KVM_MR_DELETE) || (change == KVM_MR_MOVE)) {
   slot = id_to_memslot(slots, mem->slot);
 @@ -936,7 +936,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
   return 0;
  
  out_slots:
 - kfree(slots);
 + kvfree(slots);
  out_free:
   kvm_free_physmem_slot(kvm, new, old);
  out:
 -- 
 1.8.3.1
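
(For context, the kvm_kvzalloc()/kvfree() pattern referred to above is
roughly the following; a sketch of the fallback idiom, not necessarily
the exact tree code:)

    /* Zeroed allocation, falling back to vmalloc for multi-page sizes. */
    void *kvm_kvzalloc(unsigned long size)
    {
            if (size > PAGE_SIZE)
                    return vzalloc(size);
            return kzalloc(size, GFP_KERNEL);
    }

    /* Free memory obtained from either allocator. */
    void kvfree(const void *addr)
    {
            if (is_vmalloc_addr(addr))
                    vfree(addr);
            else
                    kfree(addr);
    }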


Re: [PATCH/RFC 4/9] KVM: s390: Add MEMOP ioctls for reading/writing guest memory

2015-03-19 Thread Marcelo Tosatti
On Thu, Mar 19, 2015 at 02:15:18PM +0100, Thomas Huth wrote:
  Date: Wed, 18 Mar 2015 21:23:48 -0300
  From: Marcelo Tosatti mtosa...@redhat.com
 
  On Mon, Mar 16, 2015 at 09:51:40AM +0100, Christian Borntraeger wrote:
   From: Thomas Huth th...@linux.vnet.ibm.com
  
   On s390, we've got to make sure to hold the IPTE lock while accessing
   logical memory. So let's add an ioctl for reading and writing logical
   memory to provide this feature for userspace, too.
 ...
 
  What i was wondering is why you can't translate the address
  in the kernel and let userspace perform the actual read/write?
 
 The idea here is to protect the read/write access with the ipte-lock, too.
 That way, the whole address translation _and_ the read/write access are
 protected together against invalidate-page-table operations from other
 CPUs, so the whole memory access looks atomic for other VCPUs. And since
 we do not want to expose the ipte lock directly to user space, both have
 to be done in the kernel.
 We already had a long internal discussion about this in our team, and
 indeed, if the ipte-lock would be the only problem that we face on s390,
 we might also come up with a solution where the memory read/write access
 is done in userspace instead. However, for full architecture compliance,
 we later have got to support the so-called storage keys during memory
 accesses, too, and this can hardly be done accurately and safely from
 userspace. So I'm afraid, it's somewhat ugly that we've got to provide an
 ioctl here to read/write the guest memory, but it's the only feasible
 solution that I could think of.

I see, thanks.



Re: [PATCH] KVM: x86: call irq notifiers with directed EOI

2015-03-19 Thread Marcelo Tosatti
On Wed, Mar 18, 2015 at 07:38:22PM +0100, Radim Krčmář wrote:
 kvm_ioapic_update_eoi() wasn't called if directed EOI was enabled.
 We need to do that for irq notifiers.  (Like with edge interrupts.)
 
 Fix it by skipping EOI broadcast only.
 
 Bug: https://bugzilla.kernel.org/show_bug.cgi?id=82211
 Signed-off-by: Radim Krčmář rkrc...@redhat.com
 ---
  arch/x86/kvm/ioapic.c | 4 +++-
  arch/x86/kvm/lapic.c  | 3 +--
  2 files changed, 4 insertions(+), 3 deletions(-)
 
 diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
 index b1947e0f3e10..46d4449772bc 100644
 --- a/arch/x86/kvm/ioapic.c
 +++ b/arch/x86/kvm/ioapic.c
 @@ -422,6 +422,7 @@ static void __kvm_ioapic_update_eoi(struct kvm_vcpu *vcpu,
   struct kvm_ioapic *ioapic, int vector, int trigger_mode)
  {
   int i;
  + struct kvm_lapic *apic = vcpu->arch.apic;
  
   for (i = 0; i < IOAPIC_NUM_PINS; i++) {
   union kvm_ioapic_redirect_entry *ent = &ioapic->redirtbl[i];
 @@ -443,7 +444,8 @@ static void __kvm_ioapic_update_eoi(struct kvm_vcpu *vcpu,
   kvm_notify_acked_irq(ioapic->kvm, KVM_IRQCHIP_IOAPIC, i);
   spin_lock(&ioapic->lock);
  
 - if (trigger_mode != IOAPIC_LEVEL_TRIG)
 + if (trigger_mode != IOAPIC_LEVEL_TRIG ||
  + kvm_apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI)
   continue;

Don't you have to handle kvm_ioapic_eoi_inject_work as well?

   ASSERT(ent->fields.trig_mode == IOAPIC_LEVEL_TRIG);

This assert can now fail? 

 diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
 index bd4e34de24c7..4ee827d7bf36 100644
 --- a/arch/x86/kvm/lapic.c
 +++ b/arch/x86/kvm/lapic.c
 @@ -833,8 +833,7 @@ int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct 
 kvm_vcpu *vcpu2)
  
  static void kvm_ioapic_send_eoi(struct kvm_lapic *apic, int vector)
  {
  - if (!(kvm_apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI) &&
  - kvm_ioapic_handles_vector(apic->vcpu->kvm, vector)) {
  + if (kvm_ioapic_handles_vector(apic->vcpu->kvm, vector)) {
   int trigger_mode;
   if (apic_test_vector(vector, apic-regs + APIC_TMR))
   trigger_mode = IOAPIC_LEVEL_TRIG;
 -- 
 2.3.3
 


Re: [Patch v5] x86: irq_comm: Add check for RH bit in kvm_set_msi_irq

2015-03-18 Thread Marcelo Tosatti
On Fri, Mar 13, 2015 at 09:14:35AM -0600, James Sullivan wrote:
 This patch adds a check for RH=1 in kvm_set_msi_irq. Currently the
 DM bit is the only thing used to decide irq-dest_mode (logical when DM
 set, physical when unset). Documentation indicates that the DM bit will
 be 'ignored' when the RH bit is unset, and physical destination mode is
 used in this case.
 
 Fixed this to set irq-dest_mode to APIC_DEST_LOGICAL just in case both
 RH=1/DM=1.
 
 This patch doesn't completely handle RH=1; if RH=1 then the delivery will
 behave as in low priority mode (deliver the interrupt to only the lowest
 priority processor), but the delivery mode may still be used to specify
 the semantics of the delivery beyond its destination.
 
 I will be trying and comparing a few options to handle this fully
 (extension of struct kvm_lapic_irq, introduction of MSI-specific delivery
 functions or helpers, etc.) and hope to have some patches to show in the
 near future.
 
 
 Signed-off-by: James Sullivan sullivan.jame...@gmail.com

The documentation states the following:

* When RH is 0, the interrupt is directed to the processor listed in the
Destination ID field.

* If RH is 0, then the DM bit is ignored and the message is sent ahead
independent of whether the physical or logical destination mode is used.

However, from the POV of a device writing to memory to generate an MSI 
interrupt, there is no (or i can't see any) other information that 
can be used to infer logical or physical mode for the interrupt message.

Before your patch:

(dm, rh) = (0, 0) => irq->dest_mode = 0
(dm, rh) = (0, 1) => irq->dest_mode = 0
(dm, rh) = (1, 0) => irq->dest_mode = 1
(dm, rh) = (1, 1) => irq->dest_mode = 1

After your patch:

(dm, rh) = (0, 0) => irq->dest_mode = 0
(dm, rh) = (0, 1) => irq->dest_mode = 0
(dm, rh) = (1, 0) => irq->dest_mode = 0
(dm, rh) = (1, 1) => irq->dest_mode = 1
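
(The two tables, as code; illustration only, with dm and rh standing
for the decoded bits:)

    irq->dest_mode = dm;            /* before the patch */
    irq->dest_mode = dm && rh;      /* after the patch  */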


Am i missing some explicit documentation that refers
to (dm, rh) = (1, 0) => irq->dest_mode = 0 ?

See native_compose_msi_msg:

msg->address_lo =
MSI_ADDR_BASE_LO |
((apic->irq_dest_mode == 0) ?
MSI_ADDR_DEST_MODE_PHYSICAL :
MSI_ADDR_DEST_MODE_LOGICAL) |
((apic->irq_delivery_mode != dest_LowestPrio) ?
MSI_ADDR_REDIRECTION_CPU :
MSI_ADDR_REDIRECTION_LOWPRI) |
MSI_ADDR_DEST_ID(dest);


So it does configure DM = MSI_ADDR_DEST_MODE_LOGICAL
and RH = MSI_ADDR_REDIRECTION_LOWPRI.



Re: XP machine freeze

2015-03-18 Thread Marcelo Tosatti
On Mon, Mar 16, 2015 at 04:10:40PM +0100, Saso Slavicic wrote:
 Hi,
 
 I'm fairly experienced with KVM (Centos 5/6), running about a dozen servers
 with 20-30 different (Linux & MS platform) systems.
 I have one Windows XP machine that acts very strangely - it freezes. I get
 ping timeout for the VM from my monitoring and the machine spins 2 or 3
 cores using all the cpu. Now the interesting thing that happens is that once
 you open the console, it suddenly starts working again. You can see the
 clock catching up as it was frozen in time and everything works normally
 once the timer catches up. It usually happens probably about once a month,
 although it happened yesterday and today again.
 
 This machine is on Centos 6, qemu-kvm-0.12.1.2-2.448.el6_6, kernel
 2.6.32-504.3.3.el6.x86_64.
 I was able to do some debugging when the machine was frozen, so I got some
 things to work with:
 
 # virsh qemu-monitor-command --hmp DBserver 'info cpus'
 * CPU #0: pc=0x80501fdd thread_id=32595
   CPU #1: pc=0x806e7a9b thread_id=32596
   CPU #2: pc=0xba2da162 (halted) thread_id=32597
   CPU #3: pc=0xba2da162 (halted) thread_id=32598
 
 Now, in both yesterday's and today's event the CPU0 was stopped at
 0x80501fdd. I've disassembled the function and got this:
 
 0x80501fb5:  int3
 0x80501fb6:  mov    %edi,%edi
 0x80501fb8:  push   %ebp
 0x80501fb9:  mov    %esp,%ebp
 0x80501fbb:  push   %esi
 0x80501fbc:  mov    %fs:0x20,%eax
 0x80501fc2:  mov    0x8(%ebp),%ecx
 0x80501fc5:  lea    -0x1(%ecx),%esi
 0x80501fc8:  test   %esi,%ecx
 0x80501fca:  lea    0x7ec(%eax),%edx
 0x80501fd0:  pop    %esi
 0x80501fd1:  je     0x80501fdd
 0x80501fd3:  lea    0x7a0(%eax),%edx
 0x80501fd9:  jmp    0x80501fdd
 *0x80501fdb:  pause
 0x80501fdd:  cmpl   $0x0,(%edx)
 0x80501fe0:  jne    0x80501fdb
 0x80501fe2:  pop    %ebp
 0x80501fe3:  ret    $0x4
 0x80501fe6:  int3
 
 Mov %edi,%edi is clearly the start of some function. From what I've been
 able to understand, the code fetches _KPRCB structure (%fs:0x20) and then
 does a spinlock between fdb and fe0 checking for PacketBarrier (?) in EDX
 (0xffdff8c0). Now, $pc always shows the fdd address; shouldn't it jump
 between fdb and fe0? It seems as if it was stuck at fdd.
 
 # virsh qemu-monitor-command --hmp DBserver 'info registers'
  EAX=ffdff120 EBX=c06ddf58 ECX=000e EDX=ffdff8c0
  ESI=be6e3921 EDI=c06ddf60 EBP=ba4ff708 ESP=ba4ff708
  EIP=80501fdd EFL=0202 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
  ES =0023   00c0f300 DPL=3 DS   [-WA]
  CS =0008   00c09b00 DPL=0 CS32 [-RA]
  SS =0010   00c09300 DPL=0 DS   [-WA]
  DS =0023   00c0f300 DPL=3 DS   [-WA]
  FS =0030 ffdff000 1fff 00c09300 DPL=0 DS   [-WA]
  GS =  000f 
  LDT=  000f 
  TR =0028 80042000 20ab 8b00 DPL=0 TSS32-busy
  GDT= 8003f000 03ff
  IDT= 8003f400 07ff
  CR0=8001003b CR2=dbbec000 CR3=0b3c0020 CR4=06f8
  DR0= DR1= DR2= DR3=
  DR6=0ff0 DR7=0400
  FCW=027f FSW=0020 [ST=0] FTW=00 MXCSR=1fa0
  FPR0=8053632b003c1658 c048 FPR1=e1e0c048bf80f6ab 76f8
  FPR2=e1e0 0023 FPR3=0b017c30003c1658 
  FPR4=003bba1a7604 1e64 FPR5=0007268c 003b
  FPR6=0202001b 2684 FPR7=e3e0a9b4e1b50de4 ca0b
  XMM00=00a1fc950020027f
 XMM01=1fa01c4c0001
  XMM02=c0488053632b003c1658
 XMM03=76f8e1e0c048bf80f6ab
  XMM04=0023e1e0
 XMM05=0b017c30003c1658
  XMM06=1e64003bba1a7604
 XMM07=003b0007268c
 
 Clearly, the address in EDX is not 0:
 
 [root@linux ~]# virsh qemu-monitor-command --hmp DBserver 'x/1xb 0xFFDFF8C0'
 ffdff8c0: 0x0e
 
 [root@linux ~]# virt-manager
 
 [root@linux ~]# virsh qemu-monitor-command --hmp DBserver 'x/1xb 0xFFDFF8C0'
 ffdff8c0: 0x00
 
 However, as soon as the VM console is opened and the machine starts, the address
 in EDX is set to 0 and the loop is broken.
 Does anybody recognize what function that is? What could possibly happen
 that opening the console and moving the mouse a little, unfreezes the
 machine?
 VM has .81 virtio drivers from Fedora repo at the moment.

Generate a Windows dump? 

https://support.microsoft.com/en-us/kb/254649

https://support.microsoft.com/en-us/kb/972110
Step 7: Generate a complete crash dump file or a kernel crash dump file
by using an NMI on a Windows-based system

(you can inject NMIs via QEMU monitor).
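
(For example, with a recent enough libvirt this is a single command,
and older setups can use the HMP 'nmi' monitor command instead:)

    virsh inject-nmi DBserver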

 
 The configuration of the machine is pretty standard:
 
 <!--
 WARNING: THIS IS AN AUTO-GENERATED FILE. CHANGES TO IT ARE LIKELY TO BE
 OVERWRITTEN AND LOST. Changes to this xml configuration should be made
 using:
   virsh edit 

Re: [Patch v5] x86: irq_comm: Add check for RH bit in kvm_set_msi_irq

2015-03-18 Thread Marcelo Tosatti
On Wed, Mar 18, 2015 at 06:59:22PM -0600, James Sullivan wrote:
  
  The documentation states the following:
  
  * When RH is 0, the interrupt is directed to the processor listed in the
  Destination ID field.
  
  * If RH is 0, then the DM bit is ignored and the message is sent ahead
  independent of whether the physical or logical destination mode is used.
  
  However, from the POV of a device writing to memory to generate an MSI 
  interrupt, there is no (or i can't see any) other information that 
  can be used to infer logical or physical mode for the interrupt message.
  
  Before your patch:
  
   (dm, rh) = (0, 0) => irq->dest_mode = 0
   (dm, rh) = (0, 1) => irq->dest_mode = 0
   (dm, rh) = (1, 0) => irq->dest_mode = 1
   (dm, rh) = (1, 1) => irq->dest_mode = 1
  
  After your patch:
  
   (dm, rh) = (0, 0) => irq->dest_mode = 0
   (dm, rh) = (0, 1) => irq->dest_mode = 0
   (dm, rh) = (1, 0) => irq->dest_mode = 0
   (dm, rh) = (1, 1) => irq->dest_mode = 1
  
  
   Am i missing some explicit documentation that refers
   to (dm, rh) = (1, 0) => irq->dest_mode = 0 ?
 
 From the IA32 manual (Vol. 3, 10.11.2):
 
  * When RH is 0, the interrupt is directed to the processor listed
in the Destination ID field.
  * When RH is 1 and the physical destination mode is used, the Destination
ID field must not be set to FFH; it must point to a processor that is
present and enabled to receive the interrupt.
  * When RH is 1 and the logical destination mode is active in a system using
a flat addressing model, the Destination ID field must be set so that bits
set to 1 identify processors that are present and enabled to receive the
interrupt.
  * If RH is set to 1 and the logical destination mode is active in a system
using cluster addressing model, then Destination ID field must not be
set to FFH; the processors identified with this field must be present
and enabled to receive the interrupt.
 
 My interpretation of this is that RH=0 indicates that the Dest. ID field
 contains an APIC ID, and as such destination mode is physical. When RH=1,
 depending on the value of DM, we either use physical or logical dest mode.
 The result of this is that logical dest mode is set just when RH=1/DM=1,
 as far as I understand.
 
  
  See native_compose_msi_msg:
  
   msg->address_lo =
   MSI_ADDR_BASE_LO |
   ((apic->irq_dest_mode == 0) ?
   MSI_ADDR_DEST_MODE_PHYSICAL :
   MSI_ADDR_DEST_MODE_LOGICAL) |
   ((apic->irq_delivery_mode != dest_LowestPrio) ?
   MSI_ADDR_REDIRECTION_CPU :
   MSI_ADDR_REDIRECTION_LOWPRI) |
   MSI_ADDR_DEST_ID(dest);
  
  
  So it does configure DM = MSI_ADDR_DEST_MODE_LOGICAL
  and RH = MSI_ADDR_REDIRECTION_LOWPRI.
  
 
 ...and yet this is a good counterexample against my argument :)
 
 What I think I'll do is revert this particular change so that dest_mode is
 set independently of RH. While I'm not entirely convinced that this is the
 intended interpretation, I do think that consistency with the existing logic
 is probably desirable for the time being. If I can get closure on the matter
 I'll re-submit that change, but for the time being I will undo it.
 
 -James

Just write MSI-X table entries on real hardware (say: modify
native_compose_msi_msg or the MSI-X equivalent), with all RH/DM
combinations, and see what behaviour comes up?



Re: [Patch v5] x86: irq_comm: Add check for RH bit in kvm_set_msi_irq

2015-03-18 Thread Marcelo Tosatti
On Wed, Mar 18, 2015 at 06:59:22PM -0600, James Sullivan wrote:
  
  The documentation states the following:
  
  * When RH is 0, the interrupt is directed to the processor listed in the
  Destination ID field.
  
  * If RH is 0, then the DM bit is ignored and the message is sent ahead
  independent of whether the physical or logical destination mode is used.
  
  However, from the POV of a device writing to memory to generate an MSI 
  interrupt, there is no (or i can't see any) other information that 
  can be used to infer logical or physical mode for the interrupt message.
  
  Before your patch:
  
   (dm, rh) = (0, 0) => irq->dest_mode = 0
   (dm, rh) = (0, 1) => irq->dest_mode = 0
   (dm, rh) = (1, 0) => irq->dest_mode = 1
   (dm, rh) = (1, 1) => irq->dest_mode = 1
  
  After your patch:
  
   (dm, rh) = (0, 0) => irq->dest_mode = 0
   (dm, rh) = (0, 1) => irq->dest_mode = 0
   (dm, rh) = (1, 0) => irq->dest_mode = 0
   (dm, rh) = (1, 1) => irq->dest_mode = 1
  
  
   Am i missing some explicit documentation that refers
   to (dm, rh) = (1, 0) => irq->dest_mode = 0 ?
 
 From the IA32 manual (Vol. 3, 10.11.2):
 
  * When RH is 0, the interrupt is directed to the processor listed
in the Destination ID field.
  * When RH is 1 and the physical destination mode is used, the Destination
ID field must not be set to FFH; it must point to a processor that is
present and enabled to receive the interrupt.
  * When RH is 1 and the logical destination mode is active in a system using
a flat addressing model, the Destination ID field must be set so that bits
set to 1 identify processors that are present and enabled to receive the
interrupt.
  * If RH is set to 1 and the logical destination mode is active in a system
using cluster addressing model, then Destination ID field must not be
set to FFH; the processors identified with this field must be present
and enabled to receive the interrupt.
 
 My interpretation of this is that RH=0 indicates that the Dest. ID field
 contains an APIC ID, and as such destination mode is physical. When RH=1,
 depending on the value of DM, we either use physical or logical dest mode.
 The result of this is that logical dest mode is set just when RH=1/DM=1,
 as far as I understand.
 
  
  See native_compose_msi_msg:
  
   msg->address_lo =
   MSI_ADDR_BASE_LO |
   ((apic->irq_dest_mode == 0) ?
   MSI_ADDR_DEST_MODE_PHYSICAL :
   MSI_ADDR_DEST_MODE_LOGICAL) |
   ((apic->irq_delivery_mode != dest_LowestPrio) ?
   MSI_ADDR_REDIRECTION_CPU :
   MSI_ADDR_REDIRECTION_LOWPRI) |
   MSI_ADDR_DEST_ID(dest);
  
  
  So it does configure DM = MSI_ADDR_DEST_MODE_LOGICAL
  and RH = MSI_ADDR_REDIRECTION_LOWPRI.
  
 
 ...and yet this is a good counterexample against my argument :)
 
 What I think I'll do is revert this particular change so that dest_mode is
 set independently of RH. While I'm not entirely convinced that this is the
 intended interpretation, 

Where would the logical/physical information come from, if not from the
DM bit?

 I do think that consistency with the existing logic
 is probably desirable for the time being. If I can get closure on the matter
 I'll re-submit that change, but for the time being I will undo it.
 
 -James




Re: [PATCH] KVM: Eliminate extra function calls in kvm_get_dirty_log_protect()

2015-03-18 Thread Marcelo Tosatti
On Tue, Mar 17, 2015 at 09:58:21AM +0100, Paolo Bonzini wrote:
 
 
 On 17/03/2015 08:19, Takuya Yoshikawa wrote:
  When all bits in mask are not set,
  kvm_arch_mmu_enable_log_dirty_pt_masked() has nothing to do.  But since
  it needs to be called from the generic code, it cannot be inlined, and
  a few function calls, two when PML is enabled, are wasted.
  
  Since it is common to see many pages remain clean, e.g. framebuffers can
  stay calm for a long time, it is worth eliminating this overhead.
  
  Signed-off-by: Takuya Yoshikawa yoshikawa_takuya...@lab.ntt.co.jp
  ---
   virt/kvm/kvm_main.c | 8 +---
   1 file changed, 5 insertions(+), 3 deletions(-)
  
  diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
  index a109370..420d8cf 100644
  --- a/virt/kvm/kvm_main.c
  +++ b/virt/kvm/kvm_main.c
  @@ -1061,9 +1061,11 @@ int kvm_get_dirty_log_protect(struct kvm *kvm,
   mask = xchg(&dirty_bitmap[i], 0);
  dirty_bitmap_buffer[i] = mask;
   
  -   offset = i * BITS_PER_LONG;
  -   kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset,
  -   mask);
  +   if (mask) {
  +   offset = i * BITS_PER_LONG;
  +   kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
  +   offset, mask);
  +   }
  }
   
   spin_unlock(&kvm->mmu_lock);
  
 
 Good catch!
 
 Reviewed-by: Paolo Bonzini pbonz...@redhat.com

Applied, thanks.



Re: [PATCH] KVM: SVM: Fix confusing message if no exit handlers are installed

2015-03-18 Thread Marcelo Tosatti
On Mon, Mar 16, 2015 at 05:18:25PM -0400, Bandan Das wrote:
 
 I hit this path on an AMD box and thought
 someone was playing an April Fool's joke on me.
 
 Signed-off-by: Bandan Das b...@redhat.com
 ---
  arch/x86/kvm/svm.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

Applied, thanks.



  1   2   3   4   5   6   7   8   9   10   >