Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-26 Thread Marcelo Tosatti
On Wed, Mar 25, 2015 at 04:22:03PM -0700, Andy Lutomirski wrote:
 On Wed, Mar 25, 2015 at 4:13 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
  On Wed, Mar 25, 2015 at 03:48:02PM -0700, Andy Lutomirski wrote:
  On Wed, Mar 25, 2015 at 3:41 PM, Marcelo Tosatti mtosa...@redhat.com 
  wrote:
   On Wed, Mar 25, 2015 at 03:33:10PM -0700, Andy Lutomirski wrote:
   On Mar 25, 2015 2:29 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
   
On Wed, Mar 25, 2015 at 01:52:15PM +0100, Radim Krčmář wrote:
 2015-03-25 12:08+0100, Radim Krčmář:
  Reverting the patch protects us from any migration, but I don't 
  think we
  need to care about changing VCPUs as long as we read a consistent 
  data
  from kvmclock.  (VCPU can change outside of this loop too, so it 
  doesn't
  matter if we return a value not fit for this VCPU.)
 
  I think we could drop the second __getcpu if our kvmclock was 
  being
  handled better;  maybe with a patch like the one below:

 The second __getcpu is not necessary, but I forgot about rdtsc.
 We need to either use rdtscp, know the host has synchronized tsc, or
 monitor VCPU migrations.  Only the last one works everywhere.
   
The vdso code is only used if host has synchronized tsc.
   
But you have to handle the case where host goes from synchronized tsc 
to
unsynchronized tsc (see the clocksource notifier in the host side).
   
  
   Can't we change the host to freeze all vcpus and clear the stable bit
   on all of them if this happens?  This would simplify and speed up
   vclock_gettime.
  
   --Andy
  
   Seems interesting to do on 512-vcpus, but sure, could be done.
  
 
  If you have a 512-vcpu system that switches between stable and
  unstable more than once per migration, then I expect that you have
  serious problems and this is the least of your worries.
 
  Personally, I'd *much* rather we just made vcpu 0's pvti authoritative
  if we're stable.  If nothing else, I'm not even remotely convinced
  that the current scheme gives monotonic timing due to skew between
  when the updates happen on different vcpus.
 
  Can you write down the problem ?
 
 
 I can try.
 
 Suppose we start out with all vcpus agreeing on their pvti and perfect
 invariant TSCs.  Now the host updates its frequency (due to NTP or
 whatever).  KVM updates vcpu 0's pvti.  Before KVM updates vcpu 1's
 pvti, guest code on vcpus 0 and 1 see synced TSCs but different pvti.
 They'll disagree on the time, and one of them will be ahead until vcpu
 1's pvti gets updated.

The masterclock scheme enforces the same system_timestamp/tsc_timestamp pairs
to be visible at one time, for all vcpus.


 * That is, when timespec0 != timespec1, M < N. Unfortunately that is not
 * always the case (the difference between two distinct xtime instances
 * might be smaller than the difference between corresponding TSC reads,
 * when updating guest vcpus pvclock areas).
 *
 * To avoid that problem, do not allow visibility of distinct
 * system_timestamp/tsc_timestamp values simultaneously: use a master
 * copy of host monotonic time values. Update that master copy
 * in lockstep.


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-26 Thread Radim Krcmar
2015-03-26 11:51-0700, Andy Lutomirski:
 On Thu, Mar 26, 2015 at 4:29 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
  On Wed, Mar 25, 2015 at 04:22:03PM -0700, Andy Lutomirski wrote:
  Suppose we start out with all vcpus agreeing on their pvti and perfect
  invariant TSCs.  Now the host updates its frequency (due to NTP or
  whatever).  KVM updates vcpu 0's pvti.  Before KVM updates vcpu 1's
  pvti, guest code on vcpus 0 and 1 see synced TSCs but different pvti.
  They'll disagree on the time, and one of them will be ahead until vcpu
  1's pvti gets updated.
 
  The masterclock scheme enforces the same system_timestamp/tsc_timestamp 
  pairs
  to be visible at one time, for all vcpus.
 
 
   * That is, when timespec0 != timespec1, M < N. Unfortunately that is not
   * always the case (the difference between two distinct xtime instances
   * might be smaller than the difference between corresponding TSC reads,
   * when updating guest vcpus pvclock areas).
   *
   * To avoid that problem, do not allow visibility of distinct
   * system_timestamp/tsc_timestamp values simultaneously: use a master
   * copy of host monotonic time values. Update that master copy
   * in lockstep.
 
 Yuck.  So we have per cpu timing data, but the protocol is only usable
 for monotonic timing because we forcibly freeze all vcpus when we
 update the nominally per cpu data.
 
 The obvious guest implementations are still unnecessarily slow,
 though.  It would be nice if the guest could get away without using
 any getcpu operation at all.
 
 Even if we fixed the host to increment version as advertised, I think
 we can't avoid two getcpu ops.  We need one before rdtsc to figure out
 which pvti to look at,

Yes.

and we need another to make sure that we were
 actually on that cpu at the time we did rdtsc.  (Rdtscp doesn't help
 -- we need to check version before rdtsc, and we don't know what
 version to check until we do a getcpu.).

Exactly, reading cpuid after rdtsc doesn't do that though, we could have
migrated back between those reads.
rdtscp would allow us to check that we read tsc of pvti's cpu.
(It doesn't get rid of that first read.)

  The migration hook has the
 same issue -- we need to check the migration count, then confirm we're
 on that cpu, then check the migration count again, and we can't do
 that until we know what cpu we're on.

True;  the revert has a bug -- we need to check cpuid for the second
time before rdtsc.  (Migration hook is there just because we don't know
which cpu executed rdtsc.)

 If, on the other hand, we could rely on having all of these things in
 sync, then this complication goes away, and we go down from two getcpu
 ops to zero.

(Yeah, we should look what are the drawbacks of doing it differently.)


Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-26 Thread Andy Lutomirski
On Thu, Mar 26, 2015 at 4:29 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
 On Wed, Mar 25, 2015 at 04:22:03PM -0700, Andy Lutomirski wrote:
 On Wed, Mar 25, 2015 at 4:13 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
  On Wed, Mar 25, 2015 at 03:48:02PM -0700, Andy Lutomirski wrote:
  On Wed, Mar 25, 2015 at 3:41 PM, Marcelo Tosatti mtosa...@redhat.com 
  wrote:
   On Wed, Mar 25, 2015 at 03:33:10PM -0700, Andy Lutomirski wrote:
   On Mar 25, 2015 2:29 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
   
On Wed, Mar 25, 2015 at 01:52:15PM +0100, Radim Krčmář wrote:
 2015-03-25 12:08+0100, Radim Krčmář:
  Reverting the patch protects us from any migration, but I don't 
  think we
  need to care about changing VCPUs as long as we read a 
  consistent data
  from kvmclock.  (VCPU can change outside of this loop too, so it 
  doesn't
  matter if we return a value not fit for this VCPU.)
 
  I think we could drop the second __getcpu if our kvmclock was 
  being
  handled better;  maybe with a patch like the one below:

 The second __getcpu is not necessary, but I forgot about rdtsc.
 We need to either use rdtscp, know the host has synchronized tsc, or
 monitor VCPU migrations.  Only the last one works everywhere.
   
The vdso code is only used if host has synchronized tsc.
   
But you have to handle the case where host goes from synchronized 
tsc to
unsynchronized tsc (see the clocksource notifier in the host side).
   
  
   Can't we change the host to freeze all vcpus and clear the stable bit
   on all of them if this happens?  This would simplify and speed up
   vclock_gettime.
  
   --Andy
  
   Seems interesting to do on 512-vcpus, but sure, could be done.
  
 
  If you have a 512-vcpu system that switches between stable and
  unstable more than once per migration, then I expect that you have
  serious problems and this is the least of your worries.
 
  Personally, I'd *much* rather we just made vcpu 0's pvti authoritative
  if we're stable.  If nothing else, I'm not even remotely convinced
  that the current scheme gives monotonic timing due to skew between
  when the updates happen on different vcpus.
 
  Can you write down the problem ?
 

 I can try.

 Suppose we start out with all vcpus agreeing on their pvti and perfect
 invariant TSCs.  Now the host updates its frequency (due to NTP or
 whatever).  KVM updates vcpu 0's pvti.  Before KVM updates vcpu 1's
 pvti, guest code on vcpus 0 and 1 see synced TSCs but different pvti.
 They'll disagree on the time, and one of them will be ahead until vcpu
 1's pvti gets updated.

 The masterclock scheme enforces the same system_timestamp/tsc_timestamp pairs
 to be visible at one time, for all vcpus.


  * That is, when timespec0 != timespec1, M < N. Unfortunately that is not
  * always the case (the difference between two distinct xtime instances
  * might be smaller than the difference between corresponding TSC reads,
  * when updating guest vcpus pvclock areas).
  *
  * To avoid that problem, do not allow visibility of distinct
  * system_timestamp/tsc_timestamp values simultaneously: use a master
  * copy of host monotonic time values. Update that master copy
  * in lockstep.



[resend without HTML]

Yuck.  So we have per cpu timing data, but the protocol is only usable
for monotonic timing because we forcibly freeze all vcpus when we
update the nominally per cpu data.

The obvious guest implementations are still unnecessarily slow,
though.  It would be nice if the guest could get away without using
any getcpu operation at all.

Even if we fixed the host to increment version as advertised, I think
we can't avoid two getcpu ops.  We need one before rdtsc to figure out
which pvti to look at, and we need another to make sure that we were
actually on that cpu at the time we did rdtsc.  (Rdtscp doesn't help
-- we need to check version before rdtsc, and we don't know what
version to check until we do a getcpu.). The migration hook has the
same issue -- we need to check the migration count, then confirm we're
on that cpu, then check the migration count again, and we can't do
that until we know what cpu we're on.

If, on the other hand, we could rely on having all of these things in
sync, then this complication goes away, and we go down from two getcpu
ops to zero.

--Andy


Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-26 Thread Radim Krčmář
2015-03-26 11:47-0700, Andy Lutomirski:
 On Wed, Mar 25, 2015 at 4:08 AM, Radim Krčmář rkrc...@redhat.com wrote:
  diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
  +   /* A guest can read other VCPU's kvmclock; specification says that
  +* version is odd if data is being modified and even after it is
  +* consistent.
  +* We write three times to be sure.
  +*  1) update version to odd number
  +*  2) write modified data (version is still odd)
  +*  3) update version to even number
  +*
  +* TODO: optimize
  +*  - only two writes should be enough -- version is first
  +*  - the second write could update just version
   */
 
 The trouble with this is that kvm_write_guest_cached seems to
 correspond roughly to a rep movs variant, and those are weakly
 ordered.  As a result, I don't really know whether they have
 well-defined semantics wrt concurrent reads.  What we really want is
 just mov.

Ah, so the first optimization TODO is not possible, but stores are
weakly ordered only within one rep movs.   We're safe if compiler
outputs three mov-like instructions.

(Btw. does current hardware reorder string stores?)


Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-26 Thread Andy Lutomirski
On Thu, Mar 26, 2015 at 1:31 PM, Radim Krcmar rkrc...@redhat.com wrote:
 2015-03-26 11:51-0700, Andy Lutomirski:
 On Thu, Mar 26, 2015 at 4:29 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
  On Wed, Mar 25, 2015 at 04:22:03PM -0700, Andy Lutomirski wrote:
  Suppose we start out with all vcpus agreeing on their pvti and perfect
  invariant TSCs.  Now the host updates its frequency (due to NTP or
  whatever).  KVM updates vcpu 0's pvti.  Before KVM updates vcpu 1's
  pvti, guest code on vcpus 0 and 1 see synced TSCs but different pvti.
  They'll disagree on the time, and one of them will be ahead until vcpu
  1's pvti gets updated.
 
  The masterclock scheme enforces the same system_timestamp/tsc_timestamp 
  pairs
  to be visible at one time, for all vcpus.
 
 
   * That is, when timespec0 != timespec1, M < N. Unfortunately that is not
   * always the case (the difference between two distinct xtime instances
   * might be smaller than the difference between corresponding TSC reads,
   * when updating guest vcpus pvclock areas).
   *
   * To avoid that problem, do not allow visibility of distinct
   * system_timestamp/tsc_timestamp values simultaneously: use a master
   * copy of host monotonic time values. Update that master copy
   * in lockstep.

 Yuck.  So we have per cpu timing data, but the protocol is only usable
 for monotonic timing because we forcibly freeze all vcpus when we
 update the nominally per cpu data.

 The obvious guest implementations are still unnecessarily slow,
 though.  It would be nice if the guest could get away without using
 any getcpu operation at all.

 Even if we fixed the host to increment version as advertised, I think
 we can't avoid two getcpu ops.  We need one before rdtsc to figure out
 which pvti to look at,

 Yes.

and we need another to make sure that we were
 actually on that cpu at the time we did rdtsc.  (Rdtscp doesn't help
 -- we need to check version before rdtsc, and we don't know what
 version to check until we do a getcpu.).

 Exactly, reading cpuid after rdtsc doesn't do that though, we could have
 migrated back between those reads.
 rdtscp would allow us to check that we read tsc of pvti's cpu.
 (It doesn't get rid of that first read.)

  The migration hook has the
 same issue -- we need to check the migration count, then confirm we're
 on that cpu, then check the migration count again, and we can't do
 that until we know what cpu we're on.

 True;  the revert has a bug -- we need to check cpuid for the second
 time before rdtsc.  (Migration hook is there just because we don't know
 which cpu executed rdtsc.)

One way or another, I'm planning on completely rewriting the vdso
code.  An early draft is here:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso&id=57ace6e6e032afc4faf7b9ec52f78a8e6642c980

but I can't finish it until the KVM side shakes out.

I think there are at least two ways that would work:

a) If KVM incremented version as advertised:

cpu = getcpu();
pvti = pvti for cpu;

ver1 = pvti->version;
check stable bit;
rdtsc_barrier, rdtsc, read scale, shift, etc.
if (getcpu() != cpu) retry;
if (pvti->version != ver1) retry;

I think this is safe because, we're guaranteed that there was an
interval (between the two version reads) in which the vcpu we think
we're on was running and the kvmclock data was valid and marked
stable, and we know that the tsc we read came from that interval.

Note: rdtscp isn't needed. If we're stable, it makes no difference
which cpu's tsc we actually read.

b) If version remains buggy but we use this migrations_from hack:

cpu = getcpu();
pvti = pvti for cpu;
m1 = pvti->migrations_from;
barrier();

ver1 = pvti->version;
check stable bit;
rdtsc_barrier, rdtsc, read scale, shift, etc.
if (getcpu() != cpu) retry;
if (pvti->version != ver1) retry;  /* probably not really needed */

barrier();
if (pvti->migrations_from != m1) retry;

This is just like (a), except that we're using a guest kernel hack to
ensure that no one migrated off the vcpu during the version-protected
critical section and that we were, in fact, on that vcpu at some point
during that critical section.  Once we've ensured that we were on
pvti's associated vcpu for the entire time we were reading it, then we
are protected by the existing versioning in the host.


 If, on the other hand, we could rely on having all of these things in
 sync, then this complication goes away, and we go down from two getcpu
 ops to zero.

 (Yeah, we should look what are the drawbacks of doing it differently.)

If the versioning were fixed, I think we could almost get away with:

pvti = pvti for vcpu 0;

ver1 = pvti->version;
check stable bit;
rdtsc_barrier, rdtsc, read scale, shift, etc.
if (pvti->version != ver1) retry;

This guarantees that the tsc came from an interval in which vcpu0's
kvmclock was *marked* stable.  If vcpu0's kvmclock were genuinely
stable in that interval, then we'd be 

Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-26 Thread Paolo Bonzini


On 26/03/2015 21:10, Radim Krčmář wrote:
 2015-03-26 11:47-0700, Andy Lutomirski:
 On Wed, Mar 25, 2015 at 4:08 AM, Radim Krčmář rkrc...@redhat.com wrote:
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 +   /* A guest can read other VCPU's kvmclock; specification says that
 +* version is odd if data is being modified and even after it is
 +* consistent.
 +* We write three times to be sure.
 +*  1) update version to odd number
 +*  2) write modified data (version is still odd)
 +*  3) update version to even number
 +*
 +* TODO: optimize
 +*  - only two writes should be enough -- version is first
 +*  - the second write could update just version
  */

 The trouble with this is that kvm_write_guest_cached seems to
 correspond roughly to a rep movs variant, and those are weakly
 ordered.  As a result, I don't really know whether they have
 well-defined semantics wrt concurrent reads.  What we really want is
 just mov.
 
 Ah, so the first optimization TODO is not possible, but stores are
 weakly ordered only within one rep movs.   We're safe if compiler
 outputs three mov-like instructions.
 
 (Btw. does current hardware reorder string stores?)

It probably does so if they hit multiple cache lines.  Within a cache
line, probably not.

We can add kvm_map/unmap_guest_cached and then use __put_user.

Paolo


Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-26 Thread Radim Krčmář
2015-03-23 20:21-0300, Marcelo Tosatti:
 
 The following point:
 
 2. per-CPU pvclock time info is updated if the
underlying CPU changes.
 
 Is not true anymore since KVM: x86: update pvclock area conditionally,
 on cpu migration.
 
 Add task migration notification back.
 
 Problem noticed by Andy Lutomirski.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 CC: sta...@kernel.org # 3.11+

Revert contains a bug that got pointed out in the discussion:

 diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
   do {
   cpu = __getcpu() & VGETCPU_CPU_MASK;
  
   pvti = get_pvti(cpu);

We can migrate to 'other cpu' here.

 + migrate_count = pvti->migrate_count;
 +
   version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);

And migrate back to 'cpu' here.

rdtsc was executed on different cpu, so pvti and tsc might not be in
sync, but migrate_count hasn't changed.

   cpu1 = __getcpu() & VGETCPU_CPU_MASK;

(Reading cpuid here is useless.)

   } while (unlikely(cpu != cpu1 ||
 (pvti->pvti.version & 1) ||
-   pvti->pvti.version != version));
+   pvti->pvti.version != version ||
+   pvti->migrate_count != migrate_count));

We can workaround the bug with,

cpu = __getcpu() & VGETCPU_CPU_MASK;
pvti = get_pvti(cpu);
migrate_count = pvti->migrate_count;
if (cpu != (__getcpu() & VGETCPU_CPU_MASK))
continue;


Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-26 Thread Andy Lutomirski
On Wed, Mar 25, 2015 at 4:08 AM, Radim Krčmář rkrc...@redhat.com wrote:
 2015-03-24 15:33-0700, Andy Lutomirski:
 On Tue, Mar 24, 2015 at 8:34 AM, Radim Krčmář rkrc...@redhat.com wrote:
  What is the problem?

 The kvmclock spec says that the host will increment a version field to
 an odd number, then update stuff, then increment it to an even number.
 The host is buggy and doesn't do this, and the result is observable
 when one vcpu reads another vcpu's kvmclock data.

 Since there's no good way for a guest kernel to keep its vdso from
 reading a different vcpu's kvmclock data, this is a real corner-case
 bug.  This patch allows the vdso to retry when this happens.  I don't
 think it's a great solution, but it should mostly work.

 Great explanation, thank you.

 Reverting the patch protects us from any migration, but I don't think we
 need to care about changing VCPUs as long as we read a consistent data
 from kvmclock.  (VCPU can change outside of this loop too, so it doesn't
 matter if we return a value not fit for this VCPU.)

 I think we could drop the second __getcpu if our kvmclock was being
 handled better;  maybe with a patch like the one below:

 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index cc2c759f69a3..8658599e0024 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -1658,12 +1658,24 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 &guest_hv_clock, sizeof(guest_hv_clock
 return 0;

 -   /*
 -* The interface expects us to write an even number signaling that the
 -* update is finished. Since the guest won't see the intermediate
 -* state, we just increase by 2 at the end.
 +   /* A guest can read other VCPU's kvmclock; specification says that
 +* version is odd if data is being modified and even after it is
 +* consistent.
 +* We write three times to be sure.
 +*  1) update version to odd number
 +*  2) write modified data (version is still odd)
 +*  3) update version to even number
 +*
 +* TODO: optimize
 +*  - only two writes should be enough -- version is first
 +*  - the second write could update just version
  */
 -   vcpu->hv_clock.version = guest_hv_clock.version + 2;
 +   guest_hv_clock.version += 1;
 +   kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
 +   &guest_hv_clock,
 +   sizeof(guest_hv_clock));
 +
 +   vcpu->hv_clock.version = guest_hv_clock.version;

 /* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
 pvclock_flags = (guest_hv_clock.flags & PVCLOCK_GUEST_STOPPED);
 @@ -1684,6 +1696,11 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
 &vcpu->hv_clock,
 sizeof(vcpu->hv_clock));
 +
 +   vcpu->hv_clock.version += 1;
 +   kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
 +   &vcpu->hv_clock,
 +   sizeof(vcpu->hv_clock));
 return 0;
  }


The trouble with this is that kvm_write_guest_cached seems to
correspond roughly to a rep movs variant, and those are weakly
ordered.  As a result, I don't really know whether they have
well-defined semantics wrt concurrent reads.  What we really want is
just mov.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-26 Thread Andy Lutomirski
On Thu, Mar 26, 2015 at 4:22 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
 On Thu, Mar 26, 2015 at 04:09:53PM -0700, Andy Lutomirski wrote:
 On Thu, Mar 26, 2015 at 3:56 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
  On Thu, Mar 26, 2015 at 01:58:25PM -0700, Andy Lutomirski wrote:
  On Thu, Mar 26, 2015 at 1:31 PM, Radim Krcmar rkrc...@redhat.com wrote:
   2015-03-26 11:51-0700, Andy Lutomirski:
   On Thu, Mar 26, 2015 at 4:29 AM, Marcelo Tosatti mtosa...@redhat.com 
   wrote:
On Wed, Mar 25, 2015 at 04:22:03PM -0700, Andy Lutomirski wrote:
Suppose we start out with all vcpus agreeing on their pvti and 
perfect
invariant TSCs.  Now the host updates its frequency (due to NTP or
whatever).  KVM updates vcpu 0's pvti.  Before KVM updates vcpu 1's
pvti, guest code on vcpus 0 and 1 see synced TSCs but different 
pvti.
They'll disagree on the time, and one of them will be ahead until 
vcpu
1's pvti gets updated.
   
The masterclock scheme enforces the same 
system_timestamp/tsc_timestamp pairs
to be visible at one time, for all vcpus.
   
   
  * That is, when timespec0 != timespec1, M < N. Unfortunately that is not
  * always the case (the difference between two distinct xtime instances
  * might be smaller than the difference between corresponding TSC reads,
  * when updating guest vcpus pvclock areas).
  *
  * To avoid that problem, do not allow visibility of distinct
  * system_timestamp/tsc_timestamp values simultaneously: use a master
  * copy of host monotonic time values. Update that master copy
  * in lockstep.
  
   Yuck.  So we have per cpu timing data, but the protocol is only usable
   for monotonic timing because we forcibly freeze all vcpus when we
   update the nominally per cpu data.
  
   The obvious guest implementations are still unnecessarily slow,
   though.  It would be nice if the guest could get away without using
   any getcpu operation at all.
  
   Even if we fixed the host to increment version as advertised, I think
   we can't avoid two getcpu ops.  We need one before rdtsc to figure out
   which pvti to look at,
  
   Yes.
  
  and we need another to make sure that we were
   actually on that cpu at the time we did rdtsc.  (Rdtscp doesn't help
   -- we need to check version before rdtsc, and we don't know what
   version to check until we do a getcpu.).
  
   Exactly, reading cpuid after rdtsc doesn't do that though, we could have
   migrated back between those reads.
    rdtscp would allow us to check that we read tsc of pvti's cpu.
   (It doesn't get rid of that first read.)
  
The migration hook has the
   same issue -- we need to check the migration count, then confirm we're
   on that cpu, then check the migration count again, and we can't do
   that until we know what cpu we're on.
  
   True;  the revert has a bug -- we need to check cpuid for the second
   time before rdtsc.  (Migration hook is there just because we don't know
   which cpu executed rdtsc.)
 
  One way or another, I'm planning on completely rewriting the vdso
  code.  An early draft is here:
 
  https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso&id=57ace6e6e032afc4faf7b9ec52f78a8e6642c980
 
  but I can't finish it until the KVM side shakes out.
 
  I think there are at least two ways that would work:
 
  a) If KVM incremented version as advertised:
 
  All for it.
 
  cpu = getcpu();
  pvti = pvti for cpu;
 
  ver1 = pvti->version;
  check stable bit;
  rdtsc_barrier, rdtsc, read scale, shift, etc.
  if (getcpu() != cpu) retry;
  if (pvti->version != ver1) retry;
 
  I think this is safe because, we're guaranteed that there was an
  interval (between the two version reads) in which the vcpu we think
  we're on was running and the kvmclock data was valid and marked
  stable, and we know that the tsc we read came from that interval.
 
  Note: rdtscp isn't needed. If we're stable, it makes no difference
  which cpu's tsc we actually read.
 
  Yes, can't see a problem with that.
 
  b) If version remains buggy but we use this migrations_from hack:
 
  There is no reason for version to remain buggy.
 
  cpu = getcpu();
  pvti = pvti for cpu;
  m1 = pvti->migrations_from;
  barrier();

  ver1 = pvti->version;
  check stable bit;
  rdtsc_barrier, rdtsc, read scale, shift, etc.
  if (getcpu() != cpu) retry;
  if (pvti->version != ver1) retry;  /* probably not really needed */

  barrier();
  if (pvti->migrations_from != m1) retry;
 
  This is just like (a), except that we're using a guest kernel hack to
  ensure that no one migrated off the vcpu during the version-protected
  critical section and that we were, in fact, on that vcpu at some point
  during that critical section.  Once we've ensured that we were on
  pvti's associated vcpu for the entire time we were reading it, then we
  are protected by the existing versioning in 

Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-26 Thread Marcelo Tosatti
On Thu, Mar 26, 2015 at 04:28:37PM -0700, Andy Lutomirski wrote:
 On Thu, Mar 26, 2015 at 4:22 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
  On Thu, Mar 26, 2015 at 04:09:53PM -0700, Andy Lutomirski wrote:
  On Thu, Mar 26, 2015 at 3:56 PM, Marcelo Tosatti mtosa...@redhat.com 
  wrote:
   On Thu, Mar 26, 2015 at 01:58:25PM -0700, Andy Lutomirski wrote:
   On Thu, Mar 26, 2015 at 1:31 PM, Radim Krcmar rkrc...@redhat.com 
   wrote:
2015-03-26 11:51-0700, Andy Lutomirski:
On Thu, Mar 26, 2015 at 4:29 AM, Marcelo Tosatti 
mtosa...@redhat.com wrote:
 On Wed, Mar 25, 2015 at 04:22:03PM -0700, Andy Lutomirski wrote:
 Suppose we start out with all vcpus agreeing on their pvti and 
 perfect
 invariant TSCs.  Now the host updates its frequency (due to NTP or
 whatever).  KVM updates vcpu 0's pvti.  Before KVM updates vcpu 
 1's
 pvti, guest code on vcpus 0 and 1 see synced TSCs but different 
 pvti.
 They'll disagree on the time, and one of them will be ahead until 
 vcpu
 1's pvti gets updated.

 The masterclock scheme enforces the same 
 system_timestamp/tsc_timestamp pairs
 to be visible at one time, for all vcpus.


  * That is, when timespec0 != timespec1, M < N. Unfortunately that is not
  * always the case (the difference between two distinct xtime instances
  * might be smaller than the difference between corresponding TSC reads,
  * when updating guest vcpus pvclock areas).
  *
  * To avoid that problem, do not allow visibility of distinct
  * system_timestamp/tsc_timestamp values simultaneously: use a master
  * copy of host monotonic time values. Update that master copy
  * in lockstep.
   
Yuck.  So we have per cpu timing data, but the protocol is only 
usable
for monotonic timing because we forcibly freeze all vcpus when we
update the nominally per cpu data.
   
The obvious guest implementations are still unnecessarily slow,
though.  It would be nice if the guest could get away without using
any getcpu operation at all.
   
Even if we fixed the host to increment version as advertised, I think
we can't avoid two getcpu ops.  We need one before rdtsc to figure 
out
which pvti to look at,
   
Yes.
   
   and we need another to make sure that we were
actually on that cpu at the time we did rdtsc.  (Rdtscp doesn't help
-- we need to check version before rdtsc, and we don't know what
version to check until we do a getcpu.).
   
 Exactly, reading cpuid after rdtsc doesn't do that though, we could have
 migrated back between those reads.
 rdtscp would allow us to check that we read tsc of pvti's cpu.
 (It doesn't get rid of that first read.)
   
 The migration hook has the
same issue -- we need to check the migration count, then confirm 
we're
on that cpu, then check the migration count again, and we can't do
that until we know what cpu we're on.
   
True;  the revert has a bug -- we need to check cpuid for the second
time before rdtsc.  (Migration hook is there just because we don't know
which cpu executed rdtsc.)
  
   One way or another, I'm planning on completely rewriting the vdso
   code.  An early draft is here:
  
   https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso&id=57ace6e6e032afc4faf7b9ec52f78a8e6642c980
  
   but I can't finish it until the KVM side shakes out.
  
   I think there are at least two ways that would work:
  
   a) If KVM incremented version as advertised:
  
   All for it.
  
   cpu = getcpu();
   pvti = pvti for cpu;
  
   ver1 = pvti->version;
   check stable bit;
   rdtsc_barrier, rdtsc, read scale, shift, etc.
   if (getcpu() != cpu) retry;
   if (pvti->version != ver1) retry;
  
   I think this is safe because we're guaranteed that there was an
   interval (between the two version reads) in which the vcpu we think
   we're on was running and the kvmclock data was valid and marked
   stable, and we know that the tsc we read came from that interval.
  
   Note: rdtscp isn't needed. If we're stable, it makes no difference
   which cpu's tsc we actually read.
  
   Yes, can't see a problem with that.
  
   b) If version remains buggy but we use this migrations_from hack:
  
   There is no reason for version to remain buggy.
  
   cpu = getcpu();
   pvti = pvti for cpu;
   m1 = pvti->migrations_from;
   barrier();
  
   ver1 = pvti->version;
   check stable bit;
   rdtsc_barrier, rdtsc, read scale, shift, etc.
   if (getcpu() != cpu) retry;
   if (pvti->version != ver1) retry;  /* probably not really needed */
  
   barrier();
   if (pvti->migrations_from != m1) retry;
  
   This is just like (a), except that we're using a guest kernel hack to
   ensure that no one migrated off the vcpu during the version-protected
   critical section and 

Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-26 Thread Andy Lutomirski
On Thu, Mar 26, 2015 at 3:56 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
 On Thu, Mar 26, 2015 at 01:58:25PM -0700, Andy Lutomirski wrote:
 On Thu, Mar 26, 2015 at 1:31 PM, Radim Krcmar rkrc...@redhat.com wrote:
  2015-03-26 11:51-0700, Andy Lutomirski:
  On Thu, Mar 26, 2015 at 4:29 AM, Marcelo Tosatti mtosa...@redhat.com 
  wrote:
   On Wed, Mar 25, 2015 at 04:22:03PM -0700, Andy Lutomirski wrote:
   Suppose we start out with all vcpus agreeing on their pvti and perfect
   invariant TSCs.  Now the host updates its frequency (due to NTP or
   whatever).  KVM updates vcpu 0's pvti.  Before KVM updates vcpu 1's
   pvti, guest code on vcpus 0 and 1 see synced TSCs but different pvti.
   They'll disagree on the time, and one of them will be ahead until vcpu
   1's pvti gets updated.
  
   The masterclock scheme enforces the same system_timestamp/tsc_timestamp 
   pairs
   to be visible at one time, for all vcpus.
  
  
* That is, when timespec0 != timespec1, M < N. Unfortunately that is not
* always the case (the difference between two distinct xtime instances
* might be smaller than the difference between corresponding TSC reads,
* when updating guest vcpus pvclock areas).
*
* To avoid that problem, do not allow visibility of distinct
* system_timestamp/tsc_timestamp values simultaneously: use a master
* copy of host monotonic time values. Update that master copy
* in lockstep.
 
  Yuck.  So we have per cpu timing data, but the protocol is only usable
  for monotonic timing because we forcibly freeze all vcpus when we
  update the nominally per cpu data.
 
  The obvious guest implementations are still unnecessarily slow,
  though.  It would be nice if the guest could get away without using
  any getcpu operation at all.
 
  Even if we fixed the host to increment version as advertised, I think
  we can't avoid two getcpu ops.  We need one before rdtsc to figure out
  which pvti to look at,
 
  Yes.
 
 and we need another to make sure that we were
  actually on that cpu at the time we did rdtsc.  (Rdtscp doesn't help
  -- we need to check version before rdtsc, and we don't know what
  version to check until we do a getcpu.).
 
  Exactly, reading cpuid after rdtsc doesn't do that though, we could have
  migrated back between those reads.
  rdtscp would allow us to check that we read the tsc of pvti's cpu.
  (It doesn't get rid of that first read.)
 
   The migration hook has the
  same issue -- we need to check the migration count, then confirm we're
  on that cpu, then check the migration count again, and we can't do
  that until we know what cpu we're on.
 
  True;  the revert has a bug -- we need to check cpuid for the second
  time before rdtsc.  (Migration hook is there just because we don't know
  which cpu executed rdtsc.)

 One way or another, I'm planning on completely rewriting the vdso
 code.  An early draft is here:

 https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso&id=57ace6e6e032afc4faf7b9ec52f78a8e6642c980

 but I can't finish it until the KVM side shakes out.

 I think there are at least two ways that would work:

 a) If KVM incremented version as advertised:

 All for it.

 cpu = getcpu();
 pvti = pvti for cpu;

 ver1 = pvti->version;
 check stable bit;
 rdtsc_barrier, rdtsc, read scale, shift, etc.
 if (getcpu() != cpu) retry;
 if (pvti->version != ver1) retry;

 I think this is safe because we're guaranteed that there was an
 interval (between the two version reads) in which the vcpu we think
 we're on was running and the kvmclock data was valid and marked
 stable, and we know that the tsc we read came from that interval.

 Note: rdtscp isn't needed. If we're stable, it makes no difference
 which cpu's tsc we actually read.

 Yes, can't see a problem with that.

 b) If version remains buggy but we use this migrations_from hack:

 There is no reason for version to remain buggy.

 cpu = getcpu();
 pvti = pvti for cpu;
 m1 = pvti->migrations_from;
 barrier();

 ver1 = pvti->version;
 check stable bit;
 rdtsc_barrier, rdtsc, read scale, shift, etc.
 if (getcpu() != cpu) retry;
 if (pvti->version != ver1) retry;  /* probably not really needed */

 barrier();
 if (pvti->migrations_from != m1) retry;

 This is just like (a), except that we're using a guest kernel hack to
 ensure that no one migrated off the vcpu during the version-protected
 critical section and that we were, in fact, on that vcpu at some point
 during that critical section.  Once we've ensured that we were on
 pvti's associated vcpu for the entire time we were reading it, then we
 are protected by the existing versioning in the host.

 
  If, on the other hand, we could rely on having all of these things in
  sync, then this complication goes away, and we go down from two getcpu
  ops to zero.
 
  (Yeah, we should look at what the drawbacks of doing it differently are.)

 If the versioning were fixed, I 

Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-26 Thread Andy Lutomirski
[much snippage]

On Thu, Mar 26, 2015 at 1:58 PM, Andy Lutomirski l...@amacapital.net wrote:

 If the versioning were fixed, I think we could almost get away with:

 pvti = pvti for vcpu 0;

 ver1 = pvti->version;
 check stable bit;
 rdtsc_barrier, rdtsc, read scale, shift, etc.
 if (pvti->version != ver1) retry;

 This guarantees that the tsc came from an interval in which vcpu0's
 kvmclock was *marked* stable.  If vcpu0's kvmclock were genuinely
 stable in that interval, then we'd be fine, but there's a race window
 in which the kvmclock is *not* stable and vcpu 0 wasn't running.

Rik pointed out that this could actually be okay. Apparently vcpu 0 is
somewhat special, and it may actually be impossible to switch from
stable to unstable while a vcpu other than 0 is running and vcpu 0
hasn't updated its kvmclock data.

--Andy


 Why doesn't KVM just update all of the kvmclock data at once?  (For
 that matter, why is the pvti in guest memory at all?  Wouldn't this
 all be simpler if the kvmclock data were host-allocated so the host
 could write it directly and maybe even share it between guests?)

 --Andy



-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-26 Thread Marcelo Tosatti
On Thu, Mar 26, 2015 at 09:59:24PM +0100, Radim Krčmář wrote:
 2015-03-23 20:21-0300, Marcelo Tosatti:
  
  The following point:
  
  2. per-CPU pvclock time info is updated if the
 underlying CPU changes.
  
  Is not true anymore since KVM: x86: update pvclock area conditionally,
  on cpu migration.
  
  Add task migration notification back.
  
  Problem noticed by Andy Lutomirski.
  
  Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
  CC: sta...@kernel.org # 3.11+
 
 Revert contains a bug that got pointed out in the discussion:
 
 diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
 do {
 cpu = __getcpu() & VGETCPU_CPU_MASK;
  
 pvti = get_pvti(cpu);

We can migrate to 'other cpu' here.

 +   migrate_count = pvti->migrate_count;
 +
 version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
 
 And migrate back to 'cpu' here.

Migrating back will increase pvti->migrate_count, right ?

 rdtsc was executed on different cpu, so pvti and tsc might not be in
 sync, but migrate_count hasn't changed.
 
 cpu1 = __getcpu() & VGETCPU_CPU_MASK;
 
 (Reading cpuid here is useless.)
 
 } while (unlikely(cpu != cpu1 ||
   (pvti->pvti.version & 1) ||
 - pvti->pvti.version != version));
 + pvti->pvti.version != version ||
 + pvti->migrate_count != migrate_count));
 
 We can workaround the bug with,
 
   cpu = __getcpu() & VGETCPU_CPU_MASK;
   pvti = get_pvti(cpu);
   migrate_count = pvti->migrate_count;
   if (cpu != (__getcpu() & VGETCPU_CPU_MASK))
   continue;


Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-26 Thread Andy Lutomirski
On Thu, Mar 26, 2015 at 3:22 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
 On Thu, Mar 26, 2015 at 09:59:24PM +0100, Radim Krčmář wrote:
 2015-03-23 20:21-0300, Marcelo Tosatti:
 
  The following point:
 
  2. per-CPU pvclock time info is updated if the
 underlying CPU changes.
 
  Is not true anymore since KVM: x86: update pvclock area conditionally,
  on cpu migration.
 
  Add task migration notification back.
 
  Problem noticed by Andy Lutomirski.
 
  Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
  CC: sta...@kernel.org # 3.11+

 Revert contains a bug that got pointed out in the discussion:

  diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
  do {
  cpu = __getcpu() & VGETCPU_CPU_MASK;
 
  pvti = get_pvti(cpu);

 We can migrate to 'other cpu' here.

  +   migrate_count = pvti->migrate_count;
  +
  version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);

 And migrate back to 'cpu' here.

 Migrating back will increase pvti->migrate_count, right ?

I thought it only increased the count when we migrated away.

--Andy


 rdtsc was executed on different cpu, so pvti and tsc might not be in
 sync, but migrate_count hasn't changed.

  cpu1 = __getcpu() & VGETCPU_CPU_MASK;

 (Reading cpuid here is useless.)

  } while (unlikely(cpu != cpu1 ||
(pvti->pvti.version & 1) ||
  - pvti->pvti.version != version));
  + pvti->pvti.version != version ||
  + pvti->migrate_count != migrate_count));

 We can workaround the bug with,

   cpu = __getcpu() & VGETCPU_CPU_MASK;
   pvti = get_pvti(cpu);
   migrate_count = pvti->migrate_count;
   if (cpu != (__getcpu() & VGETCPU_CPU_MASK))
   continue;





Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-26 Thread Marcelo Tosatti
On Thu, Mar 26, 2015 at 03:24:10PM -0700, Andy Lutomirski wrote:
 On Thu, Mar 26, 2015 at 3:22 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
  On Thu, Mar 26, 2015 at 09:59:24PM +0100, Radim Krčmář wrote:
  2015-03-23 20:21-0300, Marcelo Tosatti:
  
   The following point:
  
   2. per-CPU pvclock time info is updated if the
  underlying CPU changes.
  
   Is not true anymore since KVM: x86: update pvclock area conditionally,
   on cpu migration.
  
   Add task migration notification back.
  
   Problem noticed by Andy Lutomirski.
  
   Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
   CC: sta...@kernel.org # 3.11+
 
  Revert contains a bug that got pointed out in the discussion:
 
   diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
   do {
   cpu = __getcpu() & VGETCPU_CPU_MASK;
  
   pvti = get_pvti(cpu);
 
  We can migrate to 'other cpu' here.
 
   +   migrate_count = pvti->migrate_count;
   +
   version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
 
  And migrate back to 'cpu' here.
 
  Migrating back will increase pvti->migrate_count, right ?
 
 I thought it only increased the count when we migrated away.

Right.

 --Andy
 
 
  rdtsc was executed on different cpu, so pvti and tsc might not be in
  sync, but migrate_count hasn't changed.
 
   cpu1 = __getcpu() & VGETCPU_CPU_MASK;
 
  (Reading cpuid here is useless.)
 
   } while (unlikely(cpu != cpu1 ||
 (pvti->pvti.version & 1) ||
   - pvti->pvti.version != version));
   + pvti->pvti.version != version ||
   + pvti->migrate_count != migrate_count));
 
  We can workaround the bug with,
 
cpu = __getcpu() & VGETCPU_CPU_MASK;
pvti = get_pvti(cpu);
migrate_count = pvti->migrate_count;
if (cpu != (__getcpu() & VGETCPU_CPU_MASK))
continue;

Looks good, please submit a fix.




Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-26 Thread Marcelo Tosatti
On Thu, Mar 26, 2015 at 01:58:25PM -0700, Andy Lutomirski wrote:
 On Thu, Mar 26, 2015 at 1:31 PM, Radim Krcmar rkrc...@redhat.com wrote:
  2015-03-26 11:51-0700, Andy Lutomirski:
  On Thu, Mar 26, 2015 at 4:29 AM, Marcelo Tosatti mtosa...@redhat.com 
  wrote:
   On Wed, Mar 25, 2015 at 04:22:03PM -0700, Andy Lutomirski wrote:
   Suppose we start out with all vcpus agreeing on their pvti and perfect
   invariant TSCs.  Now the host updates its frequency (due to NTP or
   whatever).  KVM updates vcpu 0's pvti.  Before KVM updates vcpu 1's
   pvti, guest code on vcpus 0 and 1 see synced TSCs but different pvti.
   They'll disagree on the time, and one of them will be ahead until vcpu
   1's pvti gets updated.
  
   The masterclock scheme enforces the same system_timestamp/tsc_timestamp 
   pairs
   to be visible at one time, for all vcpus.
  
  
* That is, when timespec0 != timespec1, M < N. Unfortunately that is not
* always the case (the difference between two distinct xtime instances
* might be smaller than the difference between corresponding TSC reads,
* when updating guest vcpus pvclock areas).
*
* To avoid that problem, do not allow visibility of distinct
* system_timestamp/tsc_timestamp values simultaneously: use a master
* copy of host monotonic time values. Update that master copy
* in lockstep.
 
  Yuck.  So we have per cpu timing data, but the protocol is only usable
  for monotonic timing because we forcibly freeze all vcpus when we
  update the nominally per cpu data.
 
  The obvious guest implementations are still unnecessarily slow,
  though.  It would be nice if the guest could get away without using
  any getcpu operation at all.
 
  Even if we fixed the host to increment version as advertised, I think
  we can't avoid two getcpu ops.  We need one before rdtsc to figure out
  which pvti to look at,
 
  Yes.
 
 and we need another to make sure that we were
  actually on that cpu at the time we did rdtsc.  (Rdtscp doesn't help
  -- we need to check version before rdtsc, and we don't know what
  version to check until we do a getcpu.).
 
  Exactly, reading cpuid after rdtsc doesn't do that though, we could have
  migrated back between those reads.
  rdtscp would allow us to check that we read the tsc of pvti's cpu.
  (It doesn't get rid of that first read.)
 
   The migration hook has the
  same issue -- we need to check the migration count, then confirm we're
  on that cpu, then check the migration count again, and we can't do
  that until we know what cpu we're on.
 
  True;  the revert has a bug -- we need to check cpuid for the second
  time before rdtsc.  (Migration hook is there just because we don't know
  which cpu executed rdtsc.)
 
 One way or another, I'm planning on completely rewriting the vdso
 code.  An early draft is here:
 
  https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso&id=57ace6e6e032afc4faf7b9ec52f78a8e6642c980
 
 but I can't finish it until the KVM side shakes out.
 
 I think there are at least two ways that would work:
 
 a) If KVM incremented version as advertised:

All for it.

 cpu = getcpu();
 pvti = pvti for cpu;
 
 ver1 = pvti->version;
 check stable bit;
 rdtsc_barrier, rdtsc, read scale, shift, etc.
 if (getcpu() != cpu) retry;
 if (pvti->version != ver1) retry;
 
 I think this is safe because we're guaranteed that there was an
 interval (between the two version reads) in which the vcpu we think
 we're on was running and the kvmclock data was valid and marked
 stable, and we know that the tsc we read came from that interval.
 
 Note: rdtscp isn't needed. If we're stable, it makes no difference
 which cpu's tsc we actually read.

Yes, can't see a problem with that.

 b) If version remains buggy but we use this migrations_from hack:

There is no reason for version to remain buggy.

 cpu = getcpu();
 pvti = pvti for cpu;
 m1 = pvti->migrations_from;
 barrier();
 
 ver1 = pvti->version;
 check stable bit;
 rdtsc_barrier, rdtsc, read scale, shift, etc.
 if (getcpu() != cpu) retry;
 if (pvti->version != ver1) retry;  /* probably not really needed */
 
 barrier();
 if (pvti->migrations_from != m1) retry;
 
 This is just like (a), except that we're using a guest kernel hack to
 ensure that no one migrated off the vcpu during the version-protected
 critical section and that we were, in fact, on that vcpu at some point
 during that critical section.  Once we've ensured that we were on
 pvti's associated vcpu for the entire time we were reading it, then we
 are protected by the existing versioning in the host.
 
 
  If, on the other hand, we could rely on having all of these things in
  sync, then this complication goes away, and we go down from two getcpu
  ops to zero.
 
 (Yeah, we should look at what the drawbacks of doing it differently are.)
 
 If the versioning were fixed, I think we could almost get away with:
 
 pvti = pvti for vcpu 0;
 
 ver1 

Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-26 Thread Marcelo Tosatti
On Thu, Mar 26, 2015 at 04:09:53PM -0700, Andy Lutomirski wrote:
 On Thu, Mar 26, 2015 at 3:56 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
  On Thu, Mar 26, 2015 at 01:58:25PM -0700, Andy Lutomirski wrote:
  On Thu, Mar 26, 2015 at 1:31 PM, Radim Krcmar rkrc...@redhat.com wrote:
   2015-03-26 11:51-0700, Andy Lutomirski:
   On Thu, Mar 26, 2015 at 4:29 AM, Marcelo Tosatti mtosa...@redhat.com 
   wrote:
On Wed, Mar 25, 2015 at 04:22:03PM -0700, Andy Lutomirski wrote:
Suppose we start out with all vcpus agreeing on their pvti and perfect
invariant TSCs.  Now the host updates its frequency (due to NTP or
whatever).  KVM updates vcpu 0's pvti.  Before KVM updates vcpu 1's
pvti, guest code on vcpus 0 and 1 see synced TSCs but different pvti.
They'll disagree on the time, and one of them will be ahead until vcpu
1's pvti gets updated.
   
The masterclock scheme enforces the same system_timestamp/tsc_timestamp
pairs to be visible at one time, for all vcpus.
   
   
 * That is, when timespec0 != timespec1, M < N. Unfortunately that is
 * not always the case (the difference between two distinct xtime
 * instances might be smaller than the difference between corresponding
 * TSC reads, when updating guest vcpus pvclock areas).
 *
 * To avoid that problem, do not allow visibility of distinct
 * system_timestamp/tsc_timestamp values simultaneously: use a master
 * copy of host monotonic time values. Update that master copy
 * in lockstep.
  
   Yuck.  So we have per cpu timing data, but the protocol is only usable
   for monotonic timing because we forcibly freeze all vcpus when we
   update the nominally per cpu data.
  
   The obvious guest implementations are still unnecessarily slow,
   though.  It would be nice if the guest could get away without using
   any getcpu operation at all.
  
   Even if we fixed the host to increment version as advertised, I think
   we can't avoid two getcpu ops.  We need one before rdtsc to figure out
   which pvti to look at,
  
   Yes.
  
  and we need another to make sure that we were
   actually on that cpu at the time we did rdtsc.  (Rdtscp doesn't help
   -- we need to check version before rdtsc, and we don't know what
   version to check until we do a getcpu.).
  
   Exactly, reading cpuid after rdtsc doesn't do that though, we could have
   migrated back between those reads.
    rdtscp would allow us to check that we read the tsc of pvti's cpu.
   (It doesn't get rid of that first read.)
  
The migration hook has the
   same issue -- we need to check the migration count, then confirm we're
   on that cpu, then check the migration count again, and we can't do
   that until we know what cpu we're on.
  
   True;  the revert has a bug -- we need to check cpuid for the second
   time before rdtsc.  (Migration hook is there just because we don't know
   which cpu executed rdtsc.)
 
  One way or another, I'm planning on completely rewriting the vdso
  code.  An early draft is here:
 
   https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso&id=57ace6e6e032afc4faf7b9ec52f78a8e6642c980
 
  but I can't finish it until the KVM side shakes out.
 
  I think there are at least two ways that would work:
 
  a) If KVM incremented version as advertised:
 
  All for it.
 
  cpu = getcpu();
  pvti = pvti for cpu;
 
  ver1 = pvti->version;
  check stable bit;
  rdtsc_barrier, rdtsc, read scale, shift, etc.
  if (getcpu() != cpu) retry;
  if (pvti->version != ver1) retry;
 
  I think this is safe because we're guaranteed that there was an
  interval (between the two version reads) in which the vcpu we think
  we're on was running and the kvmclock data was valid and marked
  stable, and we know that the tsc we read came from that interval.
 
  Note: rdtscp isn't needed. If we're stable, it makes no difference
  which cpu's tsc we actually read.
 
  Yes, can't see a problem with that.
 
  b) If version remains buggy but we use this migrations_from hack:
 
  There is no reason for version to remain buggy.
 
  cpu = getcpu();
  pvti = pvti for cpu;
  m1 = pvti->migrations_from;
  barrier();
 
  ver1 = pvti->version;
  check stable bit;
  rdtsc_barrier, rdtsc, read scale, shift, etc.
  if (getcpu() != cpu) retry;
  if (pvti->version != ver1) retry;  /* probably not really needed */
 
  barrier();
  if (pvti->migrations_from != m1) retry;
 
  This is just like (a), except that we're using a guest kernel hack to
  ensure that no one migrated off the vcpu during the version-protected
  critical section and that we were, in fact, on that vcpu at some point
  during that critical section.  Once we've ensured that we were on
  pvti's associated vcpu for the entire time we were reading it, then we
  are protected by the existing versioning in the host.
 
  
   If, on the other hand, we could rely on having all of these 

Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-25 Thread Radim Krčmář
2015-03-25 12:08+0100, Radim Krčmář:
 Reverting the patch protects us from any migration, but I don't think we
 need to care about changing VCPUs as long as we read a consistent data
 from kvmclock.  (VCPU can change outside of this loop too, so it doesn't
 matter if we return a value not fit for this VCPU.)
 
 I think we could drop the second __getcpu if our kvmclock was being
 handled better;  maybe with a patch like the one below:

The second __getcpu is not necessary, but I forgot about rdtsc.
We need to either use rdtscp, know the host has synchronized tsc, or
monitor VCPU migrations.  Only the last one works everywhere.


Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-25 Thread Radim Krčmář
2015-03-23 20:21-0300, Marcelo Tosatti:
 The following point:
 
 2. per-CPU pvclock time info is updated if the
underlying CPU changes.
 
 Is not true anymore since KVM: x86: update pvclock area conditionally,
 on cpu migration.
 
 Add task migration notification back.
 
 Problem noticed by Andy Lutomirski.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 CC: sta...@kernel.org # 3.11+

Please improve the commit message.
KVM: x86: update pvclock area conditionally [...] was merged half a
year before the patch we are reverting and is completely unrelated to
the bug we are fixing now.  (The reverted patch was just wrong.)

Reviewed-by: Radim Krčmář rkrc...@redhat.com

 diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
 @@ -82,18 +82,15 @@ static notrace cycle_t vread_pvclock(int *mode)
   /*
 -  * Note: hypervisor must guarantee that:
 -  * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
 -  * 2. that per-CPU pvclock time info is updated if the
 -  *underlying CPU changes.
 -  * 3. that version is increased whenever underlying CPU
 -  *changes.
 -  *
 +  * When looping to get a consistent (time-info, tsc) pair, we
 +  * also need to deal with the possibility we can switch vcpus,
 +  * so make sure we always re-fetch time-info for the current vcpu.

(All points from the original comment need to hold -- it would be nicer
 to keep both.)


Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-25 Thread Marcelo Tosatti
On Wed, Mar 25, 2015 at 01:52:15PM +0100, Radim Krčmář wrote:
 2015-03-25 12:08+0100, Radim Krčmář:
  Reverting the patch protects us from any migration, but I don't think we
  need to care about changing VCPUs as long as we read a consistent data
  from kvmclock.  (VCPU can change outside of this loop too, so it doesn't
  matter if we return a value not fit for this VCPU.)
  
  I think we could drop the second __getcpu if our kvmclock was being
  handled better;  maybe with a patch like the one below:
 
 The second __getcpu is not necessary, but I forgot about rdtsc.
 We need to either use rdtscp, know the host has synchronized tsc, or
 monitor VCPU migrations.  Only the last one works everywhere.

The vdso code is only used if host has synchronized tsc.

But you have to handle the case where host goes from synchronized tsc to
unsynchronized tsc (see the clocksource notifier in the host side).



Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-25 Thread Andy Lutomirski
On Wed, Mar 25, 2015 at 3:41 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
 On Wed, Mar 25, 2015 at 03:33:10PM -0700, Andy Lutomirski wrote:
 On Mar 25, 2015 2:29 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
 
  On Wed, Mar 25, 2015 at 01:52:15PM +0100, Radim Krčmář wrote:
   2015-03-25 12:08+0100, Radim Krčmář:
Reverting the patch protects us from any migration, but I don't think we
need to care about changing VCPUs as long as we read consistent data
from kvmclock.  (VCPU can change outside of this loop too, so it doesn't
matter if we return a value not fit for this VCPU.)
   
I think we could drop the second __getcpu if our kvmclock was being
handled better;  maybe with a patch like the one below:
  
    The second __getcpu is not necessary, but I forgot about rdtsc.
    We need to either use rdtscp, know the host has synchronized tsc, or
    monitor VCPU migrations.  Only the last one works everywhere.
 
  The vdso code is only used if host has synchronized tsc.
 
  But you have to handle the case where host goes from synchronized tsc to
  unsynchronized tsc (see the clocksource notifier in the host side).
 

 Can't we change the host to freeze all vcpus and clear the stable bit
 on all of them if this happens?  This would simplify and speed up
 vclock_gettime.

 --Andy

 Seems interesting to do on 512-vcpus, but sure, could be done.


If you have a 512-vcpu system that switches between stable and
unstable more than once per migration, then I expect that you have
serious problems and this is the least of your worries.

Personally, I'd *much* rather we just made vcpu 0's pvti authoritative
if we're stable.  If nothing else, I'm not even remotely convinced
that the current scheme gives monotonic timing due to skew between
when the updates happen on different vcpus.

--Andy




Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-25 Thread Andy Lutomirski
On Wed, Mar 25, 2015 at 4:13 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
 On Wed, Mar 25, 2015 at 03:48:02PM -0700, Andy Lutomirski wrote:
 On Wed, Mar 25, 2015 at 3:41 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
  On Wed, Mar 25, 2015 at 03:33:10PM -0700, Andy Lutomirski wrote:
  On Mar 25, 2015 2:29 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
  
   On Wed, Mar 25, 2015 at 01:52:15PM +0100, Radim Krčmář wrote:
2015-03-25 12:08+0100, Radim Krčmář:
 Reverting the patch protects us from any migration, but I don't think we
 need to care about changing VCPUs as long as we read consistent data
 from kvmclock.  (VCPU can change outside of this loop too, so it doesn't
 matter if we return a value not fit for this VCPU.)

 I think we could drop the second __getcpu if our kvmclock was being
 handled better;  maybe with a patch like the one below:
   
The second __getcpu is not necessary, but I forgot about rdtsc.
We need to either use rdtscp, know the host has synchronized tsc, or
monitor VCPU migrations.  Only the last one works everywhere.
  
   The vdso code is only used if host has synchronized tsc.
  
   But you have to handle the case where host goes from synchronized tsc to
   unsynchronized tsc (see the clocksource notifier in the host side).
  
 
  Can't we change the host to freeze all vcpus and clear the stable bit
  on all of them if this happens?  This would simplify and speed up
  vclock_gettime.
 
  --Andy
 
  Seems interesting to do on 512-vcpus, but sure, could be done.
 

 If you have a 512-vcpu system that switches between stable and
 unstable more than once per migration, then I expect that you have
 serious problems and this is the least of your worries.

 Personally, I'd *much* rather we just made vcpu 0's pvti authoritative
 if we're stable.  If nothing else, I'm not even remotely convinced
 that the current scheme gives monotonic timing due to skew between
 when the updates happen on different vcpus.

 Can you write down the problem?


I can try.

Suppose we start out with all vcpus agreeing on their pvti and perfect
invariant TSCs.  Now the host updates its frequency (due to NTP or
whatever).  KVM updates vcpu 0's pvti.  Before KVM updates vcpu 1's
pvti, guest code on vcpus 0 and 1 see synced TSCs but different pvti.
They'll disagree on the time, and one of them will be ahead until vcpu
1's pvti gets updated.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-25 Thread Marcelo Tosatti
On Wed, Mar 25, 2015 at 03:33:10PM -0700, Andy Lutomirski wrote:
 On Mar 25, 2015 2:29 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
 
  On Wed, Mar 25, 2015 at 01:52:15PM +0100, Radim Krčmář wrote:
   2015-03-25 12:08+0100, Radim Krčmář:
Reverting the patch protects us from any migration, but I don't think we
need to care about changing VCPUs as long as we read consistent data
from kvmclock.  (VCPU can change outside of this loop too, so it doesn't
matter if we return a value not fit for this VCPU.)
   
I think we could drop the second __getcpu if our kvmclock was being
handled better;  maybe with a patch like the one below:
  
   The second __getcpu is not necessary, but I forgot about rdtsc.
   We need to either use rdtscp, know the host has synchronized tsc, or
   monitor VCPU migrations.  Only the last one works everywhere.
 
  The vdso code is only used if host has synchronized tsc.
 
  But you have to handle the case where host goes from synchronized tsc to
  unsynchronized tsc (see the clocksource notifier in the host side).
 
 
 Can't we change the host to freeze all vcpus and clear the stable bit
 on all of them if this happens?  This would simplify and speed up
 vclock_gettime.
 
 --Andy

Seems interesting to do on 512-vcpus, but sure, could be done.



Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-25 Thread Marcelo Tosatti
On Wed, Mar 25, 2015 at 03:48:02PM -0700, Andy Lutomirski wrote:
 On Wed, Mar 25, 2015 at 3:41 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
  On Wed, Mar 25, 2015 at 03:33:10PM -0700, Andy Lutomirski wrote:
  On Mar 25, 2015 2:29 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
  
   On Wed, Mar 25, 2015 at 01:52:15PM +0100, Radim Krčmář wrote:
2015-03-25 12:08+0100, Radim Krčmář:
 Reverting the patch protects us from any migration, but I don't think we
 need to care about changing VCPUs as long as we read consistent data
 from kvmclock.  (VCPU can change outside of this loop too, so it doesn't
 matter if we return a value not fit for this VCPU.)

 I think we could drop the second __getcpu if our kvmclock was being
 handled better;  maybe with a patch like the one below:
   
The second __getcpu is not necessary, but I forgot about rdtsc.
We need to either use rdtscp, know the host has synchronized tsc, or
monitor VCPU migrations.  Only the last one works everywhere.
  
   The vdso code is only used if host has synchronized tsc.
  
   But you have to handle the case where host goes from synchronized tsc to
   unsynchronized tsc (see the clocksource notifier in the host side).
  
 
  Can't we change the host to freeze all vcpus and clear the stable bit
  on all of them if this happens?  This would simplify and speed up
  vclock_gettime.
 
  --Andy
 
  Seems interesting to do on 512-vcpus, but sure, could be done.
 
 
 If you have a 512-vcpu system that switches between stable and
 unstable more than once per migration, then I expect that you have
 serious problems and this is the least of your worries.
 
 Personally, I'd *much* rather we just made vcpu 0's pvti authoritative
 if we're stable.  If nothing else, I'm not even remotely convinced
 that the current scheme gives monotonic timing due to skew between
 when the updates happen on different vcpus.

Can you write down the problem?



Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-25 Thread Andy Lutomirski
On Mar 25, 2015 2:29 PM, Marcelo Tosatti mtosa...@redhat.com wrote:

 On Wed, Mar 25, 2015 at 01:52:15PM +0100, Radim Krčmář wrote:
  2015-03-25 12:08+0100, Radim Krčmář:
   Reverting the patch protects us from any migration, but I don't think we
   need to care about changing VCPUs as long as we read consistent data
   from kvmclock.  (VCPU can change outside of this loop too, so it doesn't
   matter if we return a value not fit for this VCPU.)
  
   I think we could drop the second __getcpu if our kvmclock was being
   handled better;  maybe with a patch like the one below:
 
  The second __getcpu is not necessary, but I forgot about rdtsc.
  We need to either use rdtscp, know the host has synchronized tsc, or
  monitor VCPU migrations.  Only the last one works everywhere.

 The vdso code is only used if host has synchronized tsc.

 But you have to handle the case where host goes from synchronized tsc to
 unsynchronized tsc (see the clocksource notifier in the host side).


Can't we change the host to freeze all vcpus and clear the stable bit
on all of them if this happens?  This would simplify and speed up
vclock_gettime.

--Andy


Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-25 Thread Radim Krčmář
2015-03-24 15:33-0700, Andy Lutomirski:
 On Tue, Mar 24, 2015 at 8:34 AM, Radim Krčmář rkrc...@redhat.com wrote:
  What is the problem?
 
 The kvmclock spec says that the host will increment a version field to
 an odd number, then update stuff, then increment it to an even number.
 The host is buggy and doesn't do this, and the result is observable
 when one vcpu reads another vcpu's kvmclock data.
 
 Since there's no good way for a guest kernel to keep its vdso from
 reading a different vcpu's kvmclock data, this is a real corner-case
 bug.  This patch allows the vdso to retry when this happens.  I don't
 think it's a great solution, but it should mostly work.

Great explanation, thank you.

Reverting the patch protects us from any migration, but I don't think we
need to care about changing VCPUs as long as we read consistent data
from kvmclock.  (VCPU can change outside of this loop too, so it doesn't
matter if we return a value not fit for this VCPU.)

I think we could drop the second __getcpu if our kvmclock was being
handled better;  maybe with a patch like the one below:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cc2c759f69a3..8658599e0024 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1658,12 +1658,24 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 		&guest_hv_clock, sizeof(guest_hv_clock))))
 		return 0;
 
-	/*
-	 * The interface expects us to write an even number signaling that the
-	 * update is finished. Since the guest won't see the intermediate
-	 * state, we just increase by 2 at the end.
+	/* A guest can read other VCPU's kvmclock; specification says that
+	 * version is odd if data is being modified and even after it is
+	 * consistent.
+	 * We write three times to be sure.
+	 *  1) update version to odd number
+	 *  2) write modified data (version is still odd)
+	 *  3) update version to even number
+	 *
+	 * TODO: optimize
+	 *  - only two writes should be enough -- version is first
+	 *  - the second write could update just version
 	 */
-	vcpu->hv_clock.version = guest_hv_clock.version + 2;
+	guest_hv_clock.version += 1;
+	kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
+				&guest_hv_clock,
+				sizeof(guest_hv_clock));
+
+	vcpu->hv_clock.version = guest_hv_clock.version;
 
 	/* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
 	pvclock_flags = (guest_hv_clock.flags & PVCLOCK_GUEST_STOPPED);
@@ -1684,6 +1696,11 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 	kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
 				&vcpu->hv_clock,
 				sizeof(vcpu->hv_clock));
+
+	vcpu->hv_clock.version += 1;
+	kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
+				&vcpu->hv_clock,
+				sizeof(vcpu->hv_clock));
 	return 0;
 }
 


Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-25 Thread Radim Krčmář
2015-03-24 19:59-0300, Marcelo Tosatti:
 On Tue, Mar 24, 2015 at 04:34:12PM +0100, Radim Krčmář wrote:
  2015-03-23 20:21-0300, Marcelo Tosatti:
   The following point:
   
   2. per-CPU pvclock time info is updated if the
  underlying CPU changes.
   
   Is not true anymore since KVM: x86: update pvclock area conditionally,
   on cpu migration.
  
  I think that the revert doesn't fix point 2.:  KVM: x86: update pvclock
  [...] changed the host to skip clock update on physical CPU change, but
  guest's task migration notifier isn't tied to it at all.
 
 "per-CPU pvclock time info is updated if the underlying CPU changes"
 is the same as
 "always perform clock update on physical CPU change".
 
 That was a requirement for the original patch, to drop migration
 notifiers.
 
  (Guest can have all tasks pinned, so the revert changed nothing.)
  
   Add task migration notification back.
   
   Problem noticed by Andy Lutomirski.
  
  What is the problem?
  
  Thanks.
 
 The problem is this:
 
 T1) guest thread1 on vcpu1.
 T2) guest thread1 on vcpu2.
 T3) guest thread1 on vcpu1.
 
 Inside a pvclock read loop.
 
 Since the hypervisor's writes to the pvclock area are not ordered,
 you cannot rely on version being updated _before_
 the rest of the pvclock data.
 
 (in the case above, the "has the physical cpu changed" check, inside the
 guest's thread1, obviously fails).

Ah, thanks!  So "KVM: x86: update pvclock area conditionally [...]"
has nothing to do with it -- that really confused me.


Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-24 Thread Marcelo Tosatti
On Tue, Mar 24, 2015 at 04:34:12PM +0100, Radim Krčmář wrote:
 2015-03-23 20:21-0300, Marcelo Tosatti:
  The following point:
  
  2. per-CPU pvclock time info is updated if the
 underlying CPU changes.
  
  Is not true anymore since KVM: x86: update pvclock area conditionally,
  on cpu migration.
 
 I think that the revert doesn't fix point 2.:  KVM: x86: update pvclock
 [...] changed the host to skip clock update on physical CPU change, but
 guest's task migration notifier isn't tied to it at all.

"per-CPU pvclock time info is updated if the underlying CPU changes"
is the same as
"always perform clock update on physical CPU change".

That was a requirement for the original patch, to drop migration
notifiers.

 (Guest can have all tasks pinned, so the revert changed nothing.)
 
  Add task migration notification back.
  
  Problem noticed by Andy Lutomirski.
 
 What is the problem?
 
 Thanks.

The problem is this:

T1) guest thread1 on vcpu1.
T2) guest thread1 on vcpu2.
T3) guest thread1 on vcpu1.

Inside a pvclock read loop.

Since the hypervisor's writes to the pvclock area are not ordered,
you cannot rely on version being updated _before_
the rest of the pvclock data.

(in the case above, the "has the physical cpu changed" check, inside the
guest's thread1, obviously fails).






Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-24 Thread Andy Lutomirski
On Tue, Mar 24, 2015 at 8:34 AM, Radim Krčmář rkrc...@redhat.com wrote:
 2015-03-23 20:21-0300, Marcelo Tosatti:
 The following point:

 2. per-CPU pvclock time info is updated if the
underlying CPU changes.

 Is not true anymore since KVM: x86: update pvclock area conditionally,
 on cpu migration.

 I think that the revert doesn't fix point 2.:  KVM: x86: update pvclock
 [...] changed the host to skip clock update on physical CPU change, but
 guest's task migration notifier isn't tied to it at all.
 (Guest can have all tasks pinned, so the revert changed nothing.)

 Add task migration notification back.

 Problem noticed by Andy Lutomirski.

 What is the problem?

The kvmclock spec says that the host will increment a version field to
an odd number, then update stuff, then increment it to an even number.
The host is buggy and doesn't do this, and the result is observable
when one vcpu reads another vcpu's kvmclock data.

Since there's no good way for a guest kernel to keep its vdso from
reading a different vcpu's kvmclock data, this is a real corner-case
bug.  This patch allows the vdso to retry when this happens.  I don't
think it's a great solution, but it should mostly work.

--Andy


Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-24 Thread Radim Krčmář
2015-03-23 20:21-0300, Marcelo Tosatti:
 The following point:
 
 2. per-CPU pvclock time info is updated if the
underlying CPU changes.
 
 Is not true anymore since KVM: x86: update pvclock area conditionally,
 on cpu migration.

I think that the revert doesn't fix point 2.:  KVM: x86: update pvclock
[...] changed the host to skip clock update on physical CPU change, but
guest's task migration notifier isn't tied to it at all.
(Guest can have all tasks pinned, so the revert changed nothing.)

 Add task migration notification back.
 
 Problem noticed by Andy Lutomirski.

What is the problem?

Thanks.


Re: x86: kvm: Revert remove sched notifier for cross-cpu migrations

2015-03-23 Thread Andy Lutomirski
On Mon, Mar 23, 2015 at 4:21 PM, Marcelo Tosatti mtosa...@redhat.com wrote:

 The following point:

 2. per-CPU pvclock time info is updated if the
underlying CPU changes.

 Is not true anymore since KVM: x86: update pvclock area conditionally,
 on cpu migration.

 Add task migration notification back.

IMO this is a pretty big hammer to use to work around what appears to
be a bug in the host, but I guess that's okay.

It's also unfortunate in another regard: it seems non-obvious to me
how to use this without reading the cpu number twice in the vdso.  On
the other hand, unless we have a global pvti, or at least a global
indication of TSC stability, I don't see how to do that even with the
host bug fixed.

Grumble.

On a more useful note, could you rename migrate_count to
migrate_from_count, since that's what it is?

--Andy