Re: [Qemu-devel] invtsc + migration + TSC scaling

2016-10-19 Thread Eduardo Habkost
On Wed, Oct 19, 2016 at 05:42:16PM +0200, Radim Krčmář wrote:
> 2016-10-19 11:55-0200, Eduardo Habkost:
> > On Wed, Oct 19, 2016 at 03:27:52PM +0200, Radim Krčmář wrote:
> >> 2016-10-18 19:05-0200, Eduardo Habkost:
> >> > On Tue, Oct 18, 2016 at 10:52:14PM +0200, Radim Krčmář wrote:
> >> > [...]
> >> >> The main problem is that QEMU changes virtual_tsc_khz when migrating
> >> >> without hardware scaling, so KVM is forced to get nanoseconds wrong ...
> >> >> 
> >> >> If QEMU doesn't want to keep the TSC frequency constant, then it would
> >> >> be better if it didn't expose TSC in CPUID -- guest would just use
> >> >> kvmclock without being tempted by direct TSC accesses.
> >> > 
> >> > Isn't it enough to simply not expose invtsc? Aren't guests expected
> >> > to assume the TSC frequency can change if invtsc isn't set on
> >> > CPUID?
> >> 
> >> There are exceptions.  An OS can assume constant TSC on some models that
> >> QEMU emulates: coreduo, core2duo, Conroe, Penryn, n270, kvm32 and kvm64.
> >> The list from SDM (17.15 TIME-STAMP COUNTER):
> >> 
> >>   Pentium 4 processors, Intel Xeon processors (family [0FH], models [03H
> >>   and higher]); Intel Core Solo and Intel Core Duo processors (family
> >>   [06H], model [0EH]); the Intel Xeon processor 5100 series and Intel
> >>   Core 2 Duo processors (family [06H], model [0FH]); Intel Core 2 and
> >>   Intel Xeon processors (family [06H], DisplayModel [17H]); Intel Atom
> >>   processors (family [06H], DisplayModel [1CH])
> >> 
> >> Another sad part is that Linux uses the following condition to assume
> >> constant TSC frequency:
> >> 
> >>    if ((c->x86 == 0xf && c->x86_model >= 0x03) ||
> >>        (c->x86 == 0x6 && c->x86_model >= 0x0e))
> >>            set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
> >> 
> >> which sets constant TSC for all modern processors.  It's not a
> >> problem on real hardware, because all modern processors likely have
> >> invariant TSC.
> >> 
> >> Fun fact: Linux shows constant_tsc flag in /proc/cpuinfo even if the
> >>   modern CPU doesn't expose TSC in CPUID.
> >> 
> >> Considering that Linux is fixed on Nehalem and newer processors, we have
> >> a few options for the rest:
> >>  1) treat TSC like invariant TSC on those models (the guest cannot use
> >> ACPI state, so its OS might assume that they are equivalent)
> >>  2) hide TSC on those models
> >>  3) ignore the problem
> >>  4) remove those models
> >> 
> >> I don't know enough about QEMU design goals to guess which one is the
> >> most appropriate.  (4) is the clear winner for me, followed by (3). :)
> > 
> > (4) can't be implemented because it breaks existing
> > configurations. (3) is the current solution.
> 
> Existing machine types must remain compatible, but isn't it possible to
> cull options in new machine types?

We specifically promised to libvirt developers that a CPU model
that can be started with a machine-type should be still runnable
with other versions of the same machine-type family. In other
words, a running config should keep working if only the
machine-type version changed.

> 
> > Option (2) sounds attractive to me, but seems risky.
> 
> Definitely.
> If users have a setup that works, then any change can break it.
> 
> It would have been the best option a few years back when we wrote the code, but
> now the change will happen *in* the guest, so we can't control it as in
> the case of (4), where broken guests won't start, or (1), where broken
> guests won't migrate.
> 
> >  I would like
> > to understand the consequences for guests. What could stop
> > working if we remove TSC? What about kvmclock?
> 
> Hiding TSC in CPUID doesn't disable the RDTSC instruction in the guest.
> 
> kvmclock is a paravirtual device on top of TSC, so if kvmclock is
> present, then it should be safe to assume that the guest can use TSC for
> operations with kvmclock.
> Linux does that, but I don't think this behavior was ever written down,
> so other kvmclock users could break.
> 
> Maybe Hyper-V TSC page would stop working, because Windows and other
> users could have a check for CPUID.1:EDX.TSC separately.
> Linux's implementation would work, because it just checks for the
> paravirtual feature, as in the case of kvmclock.
> 
> And there are minor cases: an OS that has no other option than the TSC for a clock;
> userspace that checks TSC before using it; an OS that stops setting
> CR4.TSD and its userspace starts to use TSC; and probably many others.

OK, that sounds very risky. This means it is probably better to
let management software explicitly choose the new stricter
behavior.

...and we already have a mechanism to request stricter behavior:
explicitly disabling TSC, or setting tsc-frequency explicitly on
the command-line.
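For illustration, those existing knobs look roughly like this (abridged command lines -- the trailing "..." stands for the rest of the invocation, and option spellings are the ones QEMU accepted at the time):

```shell
# Strict behavior, option A: hide TSC from the guest's CPUID entirely
qemu-system-x86_64 -cpu Nehalem,-tsc ...

# Strict behavior, option B: pin the TSC frequency (in Hz); the guest
# then keeps one frequency and migration needs a host that can honor it
qemu-system-x86_64 -cpu Nehalem,tsc-frequency=2500000000 ...

# invtsc itself already requires opting out of live migration:
qemu-system-x86_64 -cpu host,migratable=no ...
```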

> 
> > If we implement (2), we could even add an extra check that blocks
> > migration (or at least prints a warning) in case:
> > 1) TSC is forcibly enabled in the configuration;
> > 2) TSC scaling is not available on destination; and
> > 3) the family/model values match the ones on the list above.

Re: [Qemu-devel] invtsc + migration + TSC scaling

2016-10-19 Thread Radim Krčmář
2016-10-19 11:55-0200, Eduardo Habkost:
> On Wed, Oct 19, 2016 at 03:27:52PM +0200, Radim Krčmář wrote:
>> 2016-10-18 19:05-0200, Eduardo Habkost:
>> > On Tue, Oct 18, 2016 at 10:52:14PM +0200, Radim Krčmář wrote:
>> > [...]
>> >> The main problem is that QEMU changes virtual_tsc_khz when migrating
>> >> without hardware scaling, so KVM is forced to get nanoseconds wrong ...
>> >> 
>> >> If QEMU doesn't want to keep the TSC frequency constant, then it would
>> >> be better if it didn't expose TSC in CPUID -- guest would just use
>> >> kvmclock without being tempted by direct TSC accesses.
>> > 
>> > Isn't it enough to simply not expose invtsc? Aren't guests expected
>> > to assume the TSC frequency can change if invtsc isn't set on
>> > CPUID?
>> 
>> There are exceptions.  An OS can assume constant TSC on some models that
>> QEMU emulates: coreduo, core2duo, Conroe, Penryn, n270, kvm32 and kvm64.
>> The list from SDM (17.15 TIME-STAMP COUNTER):
>> 
>>   Pentium 4 processors, Intel Xeon processors (family [0FH], models [03H
>>   and higher]); Intel Core Solo and Intel Core Duo processors (family
>>   [06H], model [0EH]); the Intel Xeon processor 5100 series and Intel
>>   Core 2 Duo processors (family [06H], model [0FH]); Intel Core 2 and
>>   Intel Xeon processors (family [06H], DisplayModel [17H]); Intel Atom
>>   processors (family [06H], DisplayModel [1CH])
>> 
>> Another sad part is that Linux uses the following condition to assume
>> constant TSC frequency:
>> 
>>    if ((c->x86 == 0xf && c->x86_model >= 0x03) ||
>>        (c->x86 == 0x6 && c->x86_model >= 0x0e))
>>            set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
>> 
>> which sets constant TSC for all modern processors.  It's not a
>> problem on real hardware, because all modern processors likely have
>> invariant TSC.
>> 
>> Fun fact: Linux shows constant_tsc flag in /proc/cpuinfo even if the
>>   modern CPU doesn't expose TSC in CPUID.
>> 
>> Considering that Linux is fixed on Nehalem and newer processors, we have
>> a few options for the rest:
>>  1) treat TSC like invariant TSC on those models (the guest cannot use
>> ACPI state, so its OS might assume that they are equivalent)
>>  2) hide TSC on those models
>>  3) ignore the problem
>>  4) remove those models
>> 
>> I don't know enough about QEMU design goals to guess which one is the
>> most appropriate.  (4) is the clear winner for me, followed by (3). :)
> 
> (4) can't be implemented because it breaks existing
> configurations. (3) is the current solution.

Existing machine types must remain compatible, but isn't it possible to
cull options in new machine types?

> Option (2) sounds attractive to me, but seems risky.

Definitely.
If users have a setup that works, then any change can break it.

It would have been the best option a few years back when we wrote the code, but
now the change will happen *in* the guest, so we can't control it as in
the case of (4), where broken guests won't start, or (1), where broken
guests won't migrate.

>  I would like
> to understand the consequences for guests. What could stop
> working if we remove TSC? What about kvmclock?

Hiding TSC in CPUID doesn't disable the RDTSC instruction in the guest.

kvmclock is a paravirtual device on top of TSC, so if kvmclock is
present, then it should be safe to assume that the guest can use TSC for
operations with kvmclock.
Linux does that, but I don't think this behavior was ever written down,
so other kvmclock users could break.

Maybe Hyper-V TSC page would stop working, because Windows and other
users could have a check for CPUID.1:EDX.TSC separately.
Linux's implementation would work, because it just checks for the
paravirtual feature, as in the case of kvmclock.

And there are minor cases: an OS that has no other option than the TSC for a clock;
userspace that checks TSC before using it; an OS that stops setting
CR4.TSD and its userspace starts to use TSC; and probably many others.

> If we implement (2), we could even add an extra check that blocks
> migration (or at least prints a warning) in case:
> 1) TSC is forcibly enabled in the configuration;
> 2) TSC scaling is not available on destination; and
> 3) the family/model values match the ones on the list above.
> 
> And we could even keep TSC enabled by default for users who don't
> want migration (using migratable=false).

That would be nice.



Re: [Qemu-devel] invtsc + migration + TSC scaling

2016-10-19 Thread Eduardo Habkost
On Wed, Oct 19, 2016 at 03:27:52PM +0200, Radim Krčmář wrote:
> 2016-10-18 19:05-0200, Eduardo Habkost:
> > On Tue, Oct 18, 2016 at 10:52:14PM +0200, Radim Krčmář wrote:
> > [...]
> >> The main problem is that QEMU changes virtual_tsc_khz when migrating
> >> without hardware scaling, so KVM is forced to get nanoseconds wrong ...
> >> 
> >> If QEMU doesn't want to keep the TSC frequency constant, then it would
> >> be better if it didn't expose TSC in CPUID -- guest would just use
> >> kvmclock without being tempted by direct TSC accesses.
> > 
> > Isn't it enough to simply not expose invtsc? Aren't guests expected
> > to assume the TSC frequency can change if invtsc isn't set on
> > CPUID?
> 
> There are exceptions.  An OS can assume constant TSC on some models that
> QEMU emulates: coreduo, core2duo, Conroe, Penryn, n270, kvm32 and kvm64.
> The list from SDM (17.15 TIME-STAMP COUNTER):
> 
>   Pentium 4 processors, Intel Xeon processors (family [0FH], models [03H
>   and higher]); Intel Core Solo and Intel Core Duo processors (family
>   [06H], model [0EH]); the Intel Xeon processor 5100 series and Intel
>   Core 2 Duo processors (family [06H], model [0FH]); Intel Core 2 and
>   Intel Xeon processors (family [06H], DisplayModel [17H]); Intel Atom
>   processors (family [06H], DisplayModel [1CH])
> 
> Another sad part is that Linux uses the following condition to assume
> constant TSC frequency:
> 
>    if ((c->x86 == 0xf && c->x86_model >= 0x03) ||
>        (c->x86 == 0x6 && c->x86_model >= 0x0e))
>            set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
> 
> which sets constant TSC for all modern processors.  It's not a
> problem on real hardware, because all modern processors likely have
> invariant TSC.
> 
> Fun fact: Linux shows constant_tsc flag in /proc/cpuinfo even if the
>   modern CPU doesn't expose TSC in CPUID.
> 
> Considering that Linux is fixed on Nehalem and newer processors, we have
> a few options for the rest:
>  1) treat TSC like invariant TSC on those models (the guest cannot use
> ACPI state, so its OS might assume that they are equivalent)
>  2) hide TSC on those models
>  3) ignore the problem
>  4) remove those models
> 
> I don't know enough about QEMU design goals to guess which one is the
> most appropriate.  (4) is the clear winner for me, followed by (3). :)

(4) can't be implemented because it breaks existing
configurations. (3) is the current solution.

Option (2) sounds attractive to me, but seems risky. I would like
to understand the consequences for guests. What could stop
working if we remove TSC? What about kvmclock?

If we implement (2), we could even add an extra check that blocks
migration (or at least prints a warning) in case:
1) TSC is forcibly enabled in the configuration;
2) TSC scaling is not available on destination; and
3) the family/model values match the ones on the list above.

And we could even keep TSC enabled by default for users who don't
want migration (using migratable=false).

-- 
Eduardo



Re: [Qemu-devel] invtsc + migration + TSC scaling

2016-10-19 Thread Radim Krčmář
2016-10-18 19:05-0200, Eduardo Habkost:
> On Tue, Oct 18, 2016 at 10:52:14PM +0200, Radim Krčmář wrote:
> [...]
>> The main problem is that QEMU changes virtual_tsc_khz when migrating
>> without hardware scaling, so KVM is forced to get nanoseconds wrong ...
>> 
>> If QEMU doesn't want to keep the TSC frequency constant, then it would
>> be better if it didn't expose TSC in CPUID -- guest would just use
>> kvmclock without being tempted by direct TSC accesses.
> 
> Isn't it enough to simply not expose invtsc? Aren't guests expected
> to assume the TSC frequency can change if invtsc isn't set on
> CPUID?

There are exceptions.  An OS can assume constant TSC on some models that
QEMU emulates: coreduo, core2duo, Conroe, Penryn, n270, kvm32 and kvm64.
The list from SDM (17.15 TIME-STAMP COUNTER):

  Pentium 4 processors, Intel Xeon processors (family [0FH], models [03H
  and higher]); Intel Core Solo and Intel Core Duo processors (family
  [06H], model [0EH]); the Intel Xeon processor 5100 series and Intel
  Core 2 Duo processors (family [06H], model [0FH]); Intel Core 2 and
  Intel Xeon processors (family [06H], DisplayModel [17H]); Intel Atom
  processors (family [06H], DisplayModel [1CH])

Another sad part is that Linux uses the following condition to assume
constant TSC frequency:

    if ((c->x86 == 0xf && c->x86_model >= 0x03) ||
        (c->x86 == 0x6 && c->x86_model >= 0x0e))
            set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);

which sets constant TSC for all modern processors.  It's not a
problem on real hardware, because all modern processors likely have
invariant TSC.

Fun fact: Linux shows constant_tsc flag in /proc/cpuinfo even if the
  modern CPU doesn't expose TSC in CPUID.

Considering that Linux is fixed on Nehalem and newer processors, we have
a few options for the rest:
 1) treat TSC like invariant TSC on those models (the guest cannot use
ACPI state, so its OS might assume that they are equivalent)
 2) hide TSC on those models
 3) ignore the problem
 4) remove those models

I don't know enough about QEMU design goals to guess which one is the
most appropriate.  (4) is the clear winner for me, followed by (3). :)



Re: [Qemu-devel] invtsc + migration + TSC scaling

2016-10-18 Thread Eduardo Habkost
On Tue, Oct 18, 2016 at 10:52:14PM +0200, Radim Krčmář wrote:
[...]
> The main problem is that QEMU changes virtual_tsc_khz when migrating
> without hardware scaling, so KVM is forced to get nanoseconds wrong ...
> 
> If QEMU doesn't want to keep the TSC frequency constant, then it would
> be better if it didn't expose TSC in CPUID -- guest would just use
> kvmclock without being tempted by direct TSC accesses.

Isn't it enough to simply not expose invtsc? Aren't guests expected
to assume the TSC frequency can change if invtsc isn't set on
CPUID?

-- 
Eduardo



Re: [Qemu-devel] invtsc + migration + TSC scaling

2016-10-18 Thread Radim Krčmář
2016-10-18 15:09-0200, Marcelo Tosatti:
> On Tue, Oct 18, 2016 at 03:41:03PM +0200, Paolo Bonzini wrote:
>> On 18/10/2016 01:58, Marcelo Tosatti wrote:
>> > > We should also blacklist the TSC deadline timer when invtsc is not
>> > > available.
>> >
>> > Actually, a nicer fix would be to check the different 
>> > frequencies and scale the deadline relative to the difference. 
>> 
>> You cannot know what exactly the guest was thinking when it set the TSC
>> deadline.  Perhaps it wanted an interrupt when the TSC was exactly
>> 9876543210.

Yep, the spec just says that the timer fires when TSC >= deadline.

> You can't expect the correlation between TSC value and timer interrupt
> execution to be precise, because of the delay between HW timer
> expiration and interrupt execution.

Yes, this is valid.

> So you have to live with the fact that the TSC deadline timer can be
> late (which is the same thing as with your paravirt solution, in case 
> of migration to host with faster TSC freq) (which to me renders the
> argument above invalid).
> 
> Solution for old guests and new guests:
> Just save how far ahead in the future the TSC deadline timer is supposed
> to expire on the source (in ns), in the destination save all registers 
> (but disable TSC deadline timer injection), arm a timer in QEMU 
> for expiration time, reenable TSC deadline timer reinjection.

The interrupt can also fire early after migrating to a TSC with a lower
frequency, which violates the definition of the TSC deadline timer: an
OS could even RDTSC a value lower than the deadline in the interrupt
handler.  (An OS has no reason to expect that.)

We already have vcpu->arch.virtual_tsc_khz, which ought to remain
constant for the lifetime of the guest, and KVM uses it to convert TSC
deltas into correct nanoseconds.

The main problem is that QEMU changes virtual_tsc_khz when migrating
without hardware scaling, so KVM is forced to get nanoseconds wrong ...

If QEMU doesn't want to keep the TSC frequency constant, then it would
be better if it didn't expose TSC in CPUID -- guest would just use
kvmclock without being tempted by direct TSC accesses.



Re: [Qemu-devel] invtsc + migration + TSC scaling

2016-10-18 Thread Marcelo Tosatti
On Tue, Oct 18, 2016 at 03:41:03PM +0200, Paolo Bonzini wrote:
> 
> 
> On 18/10/2016 01:58, Marcelo Tosatti wrote:
> > > We should also blacklist the TSC deadline timer when invtsc is not
> > > available.
> >
> > Actually, a nicer fix would be to check the different 
> > frequencies and scale the deadline relative to the difference. 
> 
> You cannot know what exactly the guest was thinking when it set the TSC
> deadline.  Perhaps it wanted an interrupt when the TSC was exactly
> 9876543210.

You can't expect the correlation between TSC value and timer interrupt
execution to be precise, because of the delay between HW timer
expiration and interrupt execution.

So you have to live with the fact that the TSC deadline timer can be
late (which is the same thing as with your paravirt solution, in case 
of migration to host with faster TSC freq) (which to me renders the
argument above invalid).

Solution for old guests and new guests:
Just save how far ahead in the future the TSC deadline timer is supposed
to expire on the source (in ns), in the destination save all registers 
(but disable TSC deadline timer injection), arm a timer in QEMU 
for expiration time, reenable TSC deadline timer reinjection.






Re: [Qemu-devel] invtsc + migration + TSC scaling

2016-10-18 Thread Radim Krčmář
2016-10-17 21:58-0200, Marcelo Tosatti:
> On Mon, Oct 17, 2016 at 07:11:01PM -0200, Eduardo Habkost wrote:
>> On Mon, Oct 17, 2016 at 06:24:38PM +0200, Paolo Bonzini wrote:
>> > On 17/10/2016 16:50, Radim Krčmář wrote:
>> > > 2016-10-17 07:47-0200, Marcelo Tosatti:
>> [...]
>> > >> since Linux guests use kvmclock and Windows guests use Hyper-V
>> > >> enlightenment, it should be fine to disable 2).
>> > 
>> > ... and 1 too.
>> > 
>> > We should also blacklist the TSC deadline timer when invtsc is not
>> > available.
> 
> Actually, a nicer fix would be to check the different 
> frequencies and scale the deadline relative to the difference. 

I think that KVM can already be configured to do that.

Paolo, we hit that TSC deadline bug because QEMU doesn't set the TSC
frequency if it would result in software scaling (which needs to update
guest TSC and kvmclock on every entry)?

Thanks.

(I just noticed a minor bug: KVM doesn't use hardware scaling when the
 TSC frequency delta is small.)

> This would take care of both patched and non-patched guests.
> 
> On a related note, what was the goal of Radim's paravirtual deadline
> TSC timer?

It'd be a paravirtual kvmclock timer -- just giving the deadline in
another time frame.  It won't confuse OSes that expect the deadline
timer to behave like it should. :)



Re: [Qemu-devel] invtsc + migration + TSC scaling

2016-10-18 Thread Paolo Bonzini


On 18/10/2016 01:58, Marcelo Tosatti wrote:
> > We should also blacklist the TSC deadline timer when invtsc is not
> > available.
>
> Actually, a nicer fix would be to check the different 
> frequencies and scale the deadline relative to the difference. 

You cannot know what exactly the guest was thinking when it set the TSC
deadline.  Perhaps it wanted an interrupt when the TSC was exactly
9876543210.

Paolo



Re: [Qemu-devel] invtsc + migration + TSC scaling

2016-10-18 Thread Radim Krčmář
2016-10-17 18:24+0200, Paolo Bonzini:
> On 17/10/2016 16:50, Radim Krčmář wrote:
>> 2016-10-17 07:47-0200, Marcelo Tosatti:
>>> On Fri, Oct 14, 2016 at 06:20:31PM -0300, Eduardo Habkost wrote:
 I have been wondering: should we allow live migration with the
 invtsc flag enabled, if TSC scaling is available on the
 destination?
>>>
>>> TSC scaling and invtsc flag, yes.
>> 
>> Yes, if we have well synchronized time between hosts, then we might be
>> able to migrate with a TSC shift that cannot be perceived by the guest.
>> 
>> Unless the VM also has a migratable assigned PCI device that uses ART,
>> because we have no protocol to update the setting of ART (in CPUID), so
>> we should keep migration forbidden then.
> 
> We don't publish the ART leaf at all, do we?

Yes, it's a matter of time, though -- someone already asked for PTP in
guests, so we'll have to either provide a paravirtual host<->guest time
synchronization protocol that shares PTP from the host or let them use
assigned devices ...

>>> 1) Migration: to host with different TSC frequency.
>> 
>> We shouldn't have done this even now when emulating anything newer than
>> Pentium 4, because those CPUs have constant TSC, which only lacks the
>> guarantee that it doesn't stop in deep C-states:
> 
> Right, but:
> 
>>> since Linux guests use kvmclock and Windows guests use Hyper-V
>>> enlightenment, it should be fine to disable 2).
> 
> ... and 1 too.

Yes.  We could stop exposing TSC then, because it should have no direct
users -- kvmclock can work even if TSC is not in CPUID, because we
paravirtualize it.

> We should also blacklist the TSC deadline timer when invtsc is not
> available.

True.

I was thinking that with Wanpeng's VMX preemption timer patches, we
might not need the TSC deadline nor a paravirtual deadline timer,
because the performance of LAPIC one-shot timers could be very similar.



Re: [Qemu-devel] invtsc + migration + TSC scaling

2016-10-18 Thread Radim Krčmář
2016-10-18 15:36+0200, Radim Krčmář:
> 2016-10-17 18:24+0200, Paolo Bonzini:
>> We should also blacklist the TSC deadline timer when invtsc is not
>> available.
> 
> True.
> 
> I was thinking that with Wanpeng's VMX preemption timer patches, we
> might not need the TSC deadline nor a paravirtual deadline timer,
> because the performance of LAPIC one-shot timers could be very similar.

No, I sent that before thinking -- the deadline timer also has better
precision, so it is still useful.



Re: [Qemu-devel] invtsc + migration + TSC scaling

2016-10-18 Thread Radim Krčmář
2016-10-17 15:20-0200, Marcelo Tosatti:
> On Mon, Oct 17, 2016 at 04:50:09PM +0200, Radim Krčmář wrote:
>> 2016-10-17 07:47-0200, Marcelo Tosatti:
>> > On Fri, Oct 14, 2016 at 06:20:31PM -0300, Eduardo Habkost wrote:
>> >> I have been wondering: should we allow live migration with the
>> >> invtsc flag enabled, if TSC scaling is available on the
>> >> destination?
>> > 
>> > TSC scaling and invtsc flag, yes.
>> 
>> Yes, if we have well synchronized time between hosts, then we might be
>> able to migrate with a TSC shift that cannot be perceived by the guest.
> 
> Even if the guest can't detect the TSC difference (relative to realtime),
> I suppose the TSC should be advanced to account for the migration stopped
> time (so that TSC appears to have incremented at a "constant rate").

I agree.

>> Unless the VM also has a migratable assigned PCI device that uses ART,
>> because we have no protocol to update the setting of ART (in CPUID), so
>> we should keep migration forbidden then.
> 
> What is the use case for ART again? (need to catchup on that).

It allows PCI devices to read a TSC-compatible timestamp and tag some
data with it for the OS -- PTP uses it to avoid OS latency.
I'm not sure if other devices use it right now.

>> > 2) Savevm: It is not safe to use the TSC for wall clock timer
>> > services.
>> 
>> With constant TSC, we could argue that a shift to deep C-state happened
>> and paused TSC, which is not a good behavior, but somewhat defensible.
>> 
>> > By allowing savevm, you make a commitment to allow a feature
>> > at the expense of not complying with the spec (specifically the "
>> > the OS may use the TSC for wall clock timer services", because the
>> > TSC stops relative to realtime for the duration of the savevm stop
>> > window).
>> 
>> Yep, we should at least guesstimate the TSC to allow the guest to resume
>> with as small TSC-shift as possible and check that hosts were somewhat
>> synchronized with UTC (or something we choose for time).
> 
> There are two options for savevm:
> 
> Option 1) Stop the TSC for savevm duration.
> 
> Option 2) Advance TSC to match realtime (this is known to overflow Linux
> timekeeping though).

Thanks, I wondered why we use (1).
Seems like something we'd like to fix.

>> > But since Linux guests use kvmclock and Windows guests use Hyper-V
>> > enlightenment, it should be fine to disable 2).
>> > 
>> > There is a bug open for this, btw: 
>> > https://bugzilla.redhat.com/show_bug.cgi?id=1353073
>> 
>> These people should be happy with just live-migrations, so can't we just
>> keep savevm forbidden?
>
> Don't see why. Perhaps savevm should be considered a "special type of
> operation" that deviates from baremetal behaviour and that if
> the user does savevm, then it knows TSC does not count "at a constant
> rate" (so savevm breaks invariant tsc behaviour).

Yes, it's an option, but some users are still going to complain;
after which we'll point them to documentation and that leaves a bad
impression on both sides.
I don't like to expose something that is well defined and say that we
redefined it.  We could expose something new (paravirtual), because the
guest application needs to be paravirtualized to expect the change
anyway.

Migration can perform simple time synchronization by exchanging
timestamps to ensure at least some level of "constant rate" illusion,
which is harder to do with savevm.



Re: [Qemu-devel] invtsc + migration + TSC scaling

2016-10-18 Thread Dr. David Alan Gilbert
* Radim Krčmář (rkrc...@redhat.com) wrote:
> 2016-10-17 07:47-0200, Marcelo Tosatti:
> > On Fri, Oct 14, 2016 at 06:20:31PM -0300, Eduardo Habkost wrote:
> >> I have been wondering: should we allow live migration with the
> >> invtsc flag enabled, if TSC scaling is available on the
> >> destination?
> > 
> > TSC scaling and invtsc flag, yes.
> 
> Yes, if we have well synchronized time between hosts, then we might be
> able to migrate with a TSC shift that cannot be perceived by the guest.
> 
> Unless the VM also has a migratable assigned PCI device that uses ART,
> because we have no protocol to update the setting of ART (in CPUID), so
> we should keep migration forbidden then.
> 
> >> For reference, this is what the Intel SDM says about invtsc:
> >> 
> >>   The time stamp counter in newer processors may support an
> >>   enhancement, referred to as invariant TSC. Processor’s support
> >>   for invariant TSC is indicated by CPUID.80000007H:EDX[8].
> >> 
> >>   The invariant TSC will run at a constant rate in all ACPI P-,
> >>   C-, and T-states. This is the architectural behavior moving
> >>   forward. On processors with invariant TSC support, the OS may
> >>   use the TSC for wall clock timer services (instead of ACPI or
> >>   HPET timers). TSC reads are much more efficient and do not
> >>   incur the overhead associated with a ring transition or access
> >>   to a platform resource.
> >
> > Yes. The blockage happened for different reasons:
> > 
> > 1) Migration: to host with different TSC frequency.
> 
> We shouldn't have done this even now when emulating anything newer than
> Pentium 4, because those CPUs have constant TSC, which only lacks the
> guarantee that it doesn't stop in deep C-states:
> 
>   For [a list of processors we emulate]: the time-stamp counter
>   increments at a constant rate. That rate may be set by the maximum
>   core-clock to bus-clock ratio of the processor or may be set by the
>   maximum resolved frequency at which the processor is booted. The
>   maximum resolved frequency may differ from the processor base
>   frequency, see Section 18.18.2 for more detail. On certain processors,
>   the TSC frequency may not be the same as the frequency in the brand
>   string.
> 
>   The specific processor configuration determines the behavior. Constant
>   TSC behavior ensures that the duration of each clock tick is uniform
>   and supports the use of the TSC as a wall clock timer even if the
>   processor core changes frequency. This is the architectural behavior
>   moving forward.
> 
> Invariant TSC is more useful, though, so more applications would break
> when migrating to a different TSC frequency.
> 
> > 2) Savevm: It is not safe to use the TSC for wall clock timer
> > services.
> 
> With constant TSC, we could argue that a shift to deep C-state happened
> and paused TSC, which is not a good behavior, but somewhat defensible.
> 
> > By allowing savevm, you make a commitment to allow a feature
> > at the expense of not complying with the spec (specifically the "
> > the OS may use the TSC for wall clock timer services", because the
> > TSC stops relative to realtime for the duration of the savevm stop
> > window).
> 
> Yep, we should at least guesstimate the TSC to allow the guest to resume
> with as small TSC-shift as possible and check that hosts were somewhat
> synchronized with UTC (or something we choose for time).
> 
> > But since Linux guests use kvmclock and Windows guests use Hyper-V
> > enlightenment, it should be fine to disable 2).
> > 
> > There is a bug open for this, btw: 
> > https://bugzilla.redhat.com/show_bug.cgi?id=1353073
> 
> These people should be happy with just live-migrations, so can't we just
> keep savevm forbidden?

Why is savevm so much harder? Is it just the difference in real time?
If so then I do worry about how small a difference you're hoping
to guarantee in live migration.

Dave

> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] invtsc + migration + TSC scaling

2016-10-17 Thread Marcelo Tosatti
On Mon, Oct 17, 2016 at 07:11:01PM -0200, Eduardo Habkost wrote:
> On Mon, Oct 17, 2016 at 06:24:38PM +0200, Paolo Bonzini wrote:
> > On 17/10/2016 16:50, Radim Krčmář wrote:
> > > 2016-10-17 07:47-0200, Marcelo Tosatti:
> [...]
> > >> since Linux guests use kvmclock and Windows guests use Hyper-V
> > >> enlightenment, it should be fine to disable 2).
> > 
> > ... and 1 too.
> > 
> > We should also blacklist the TSC deadline timer when invtsc is not
> > available.

Actually, a nicer fix would be to check the different 
frequencies and scale the deadline relative to the difference. 

This would take care of both patched and non-patched guests.

On a related note, what was the goal of Radim's paravirtual deadline
TSC timer?

> You mean on the guest-side? On the host side, it would make
> existing VMs refuse to run.
> 
> -- 
> Eduardo



Re: [Qemu-devel] invtsc + migration + TSC scaling

2016-10-17 Thread Eduardo Habkost
On Mon, Oct 17, 2016 at 06:24:38PM +0200, Paolo Bonzini wrote:
> On 17/10/2016 16:50, Radim Krčmář wrote:
> > 2016-10-17 07:47-0200, Marcelo Tosatti:
[...]
> >> since Linux guests use kvmclock and Windows guests use Hyper-V
> >> enlightenment, it should be fine to disable 2).
> 
> ... and 1 too.
> 
> We should also blacklist the TSC deadline timer when invtsc is not
> available.

You mean on the guest-side? On the host side, it would make
existing VMs refuse to run.

-- 
Eduardo



Re: [Qemu-devel] invtsc + migration + TSC scaling

2016-10-17 Thread Marcelo Tosatti
On Mon, Oct 17, 2016 at 04:50:09PM +0200, Radim Krčmář wrote:
> 2016-10-17 07:47-0200, Marcelo Tosatti:
> > On Fri, Oct 14, 2016 at 06:20:31PM -0300, Eduardo Habkost wrote:
> >> I have been wondering: should we allow live migration with the
> >> invtsc flag enabled, if TSC scaling is available on the
> >> destination?
> > 
> > TSC scaling and invtsc flag, yes.
> 
> Yes, if we have well synchronized time between hosts, then we might be
> able to migrate with a TSC shift that cannot be perceived by the guest.

Even if the guest can't detect the TSC difference (relative to realtime),
I suppose the TSC should be advanced to account for the time the VM was
stopped during migration (so that the TSC appears to have incremented at a
"constant rate").

> Unless the VM also has a migratable assigned PCI device that uses ART,
> because we have no protocol to update the setting of ART (in CPUID), so
> we should keep migration forbidden then.

What is the use case for ART again? (I need to catch up on that.)

> 
> >> For reference, this is what the Intel SDM says about invtsc:
> >> 
> >>   The time stamp counter in newer processors may support an
> >>   enhancement, referred to as invariant TSC. Processor’s support
> >>   for invariant TSC is indicated by CPUID.80000007H:EDX[8].
> >> 
> >>   The invariant TSC will run at a constant rate in all ACPI P-,
> >>   C-, and T-states. This is the architectural behavior moving
> >>   forward. On processors with invariant TSC support, the OS may
> >>   use the TSC for wall clock timer services (instead of ACPI or
> >>   HPET timers). TSC reads are much more efficient and do not
> >>   incur the overhead associated with a ring transition or access
> >>   to a platform resource.
> >
> > Yes. The blockage happened for different reasons:
> > 
> > 1) Migration: to host with different TSC frequency.
> 
> We shouldn't have done this even now when emulating anything newer than
> Pentium 4, because those CPUs have constant TSC, which only lacks the
> guarantee that it doesn't stop in deep C-states:
> 
>   For [a list of processors we emulate]: the time-stamp counter
>   increments at a constant rate. That rate may be set by the maximum
>   core-clock to bus-clock ratio of the processor or may be set by the
>   maximum resolved frequency at which the processor is booted. The
>   maximum resolved frequency may differ from the processor base
>   frequency, see Section 18.18.2 for more detail. On certain processors,
>   the TSC frequency may not be the same as the frequency in the brand
>   string.
> 
>   The specific processor configuration determines the behavior. Constant
>   TSC behavior ensures that the duration of each clock tick is uniform
>   and supports the use of the TSC as a wall clock timer even if the
>   processor core changes frequency. This is the architectural behavior
>   moving forward.
> 
> Invariant TSC is more useful, though, so more applications would break
> when migrating to a different TSC frequency.
> 
> > 2) Savevm: It is not safe to use the TSC for wall clock timer
> > services.
> 
> With constant TSC, we could argue that a shift to deep C-state happened
> and paused TSC, which is not a good behavior, but somewhat defensible.
> 
> > By allowing savevm, you make a commitment to allow a feature
> > at the expense of not complying with the spec (specifically the "
> > the OS may use the TSC for wall clock timer services", because the
> > TSC stops relative to realtime for the duration of the savevm stop
> > window).
> 
> Yep, we should at least guesstimate the TSC to allow the guest to resume
> with as small TSC-shift as possible and check that hosts were somewhat
> synchronized with UTC (or something we choose for time).

There are two options for savevm:

Option 1) Stop the TSC for savevm duration.

Option 2) Advance TSC to match realtime (this is known to overflow Linux
timekeeping though).


> 
> > But since Linux guests use kvmclock and Windows guests use Hyper-V
> > enlightenment, it should be fine to disable 2).
> > 
> > There is a bug open for this, btw: 
> > https://bugzilla.redhat.com/show_bug.cgi?id=1353073
> 
> These people should be happy with just live-migrations, so can't we just
> keep savevm forbidden?

I don't see why. Perhaps savevm should be considered a "special type of
operation" that deviates from bare-metal behaviour: if the user does a
savevm, they accept that the TSC does not count "at a constant rate"
(so savevm breaks invariant TSC behaviour).





Re: [Qemu-devel] invtsc + migration + TSC scaling

2016-10-17 Thread Paolo Bonzini


On 17/10/2016 16:50, Radim Krčmář wrote:
> 2016-10-17 07:47-0200, Marcelo Tosatti:
>> On Fri, Oct 14, 2016 at 06:20:31PM -0300, Eduardo Habkost wrote:
>>> I have been wondering: should we allow live migration with the
>>> invtsc flag enabled, if TSC scaling is available on the
>>> destination?
>>
>> TSC scaling and invtsc flag, yes.
> 
> Yes, if we have well synchronized time between hosts, then we might be
> able to migrate with a TSC shift that cannot be perceived by the guest.
> 
> Unless the VM also has a migratable assigned PCI device that uses ART,
> because we have no protocol to update the setting of ART (in CPUID), so
> we should keep migration forbidden then.

We don't publish the ART leaf at all, do we?

>> 1) Migration: to host with different TSC frequency.
> 
> We shouldn't have done this even now when emulating anything newer than
> Pentium 4, because those CPUs have constant TSC, which only lacks the
> guarantee that it doesn't stop in deep C-states:

Right, but:

>> since Linux guests use kvmclock and Windows guests use Hyper-V
>> enlightenment, it should be fine to disable 2).

... and 1 too.

We should also blacklist the TSC deadline timer when invtsc is not
available.

Paolo



Re: [Qemu-devel] invtsc + migration + TSC scaling

2016-10-17 Thread Radim Krčmář
2016-10-17 07:47-0200, Marcelo Tosatti:
> On Fri, Oct 14, 2016 at 06:20:31PM -0300, Eduardo Habkost wrote:
>> I have been wondering: should we allow live migration with the
>> invtsc flag enabled, if TSC scaling is available on the
>> destination?
> 
> TSC scaling and invtsc flag, yes.

Yes, if we have well synchronized time between hosts, then we might be
able to migrate with a TSC shift that cannot be perceived by the guest.

Unless the VM also has a migratable assigned PCI device that uses ART,
because we have no protocol to update the setting of ART (in CPUID), so
we should keep migration forbidden then.

>> For reference, this is what the Intel SDM says about invtsc:
>> 
>>   The time stamp counter in newer processors may support an
>>   enhancement, referred to as invariant TSC. Processor’s support
>>   for invariant TSC is indicated by CPUID.80000007H:EDX[8].
>> 
>>   The invariant TSC will run at a constant rate in all ACPI P-,
>>   C-, and T-states. This is the architectural behavior moving
>>   forward. On processors with invariant TSC support, the OS may
>>   use the TSC for wall clock timer services (instead of ACPI or
>>   HPET timers). TSC reads are much more efficient and do not
>>   incur the overhead associated with a ring transition or access
>>   to a platform resource.
>
> Yes. The blockage happened for different reasons:
> 
> 1) Migration: to host with different TSC frequency.

We shouldn't have done this even now when emulating anything newer than
Pentium 4, because those CPUs have constant TSC, which only lacks the
guarantee that it doesn't stop in deep C-states:

  For [a list of processors we emulate]: the time-stamp counter
  increments at a constant rate. That rate may be set by the maximum
  core-clock to bus-clock ratio of the processor or may be set by the
  maximum resolved frequency at which the processor is booted. The
  maximum resolved frequency may differ from the processor base
  frequency, see Section 18.18.2 for more detail. On certain processors,
  the TSC frequency may not be the same as the frequency in the brand
  string.

  The specific processor configuration determines the behavior. Constant
  TSC behavior ensures that the duration of each clock tick is uniform
  and supports the use of the TSC as a wall clock timer even if the
  processor core changes frequency. This is the architectural behavior
  moving forward.

Invariant TSC is more useful, though, so more applications would break
when migrating to a different TSC frequency.

> 2) Savevm: It is not safe to use the TSC for wall clock timer
> services.

With constant TSC, we could argue that a shift to deep C-state happened
and paused TSC, which is not a good behavior, but somewhat defensible.

> By allowing savevm, you make a commitment to allow a feature
> at the expense of not complying with the spec (specifically the "
> the OS may use the TSC for wall clock timer services", because the
> TSC stops relative to realtime for the duration of the savevm stop
> window).

Yep, we should at least guesstimate the TSC to allow the guest to resume
with as small TSC-shift as possible and check that hosts were somewhat
synchronized with UTC (or something we choose for time).

> But since Linux guests use kvmclock and Windows guests use Hyper-V
> enlightenment, it should be fine to disable 2).
> 
> There is a bug open for this, btw: 
> https://bugzilla.redhat.com/show_bug.cgi?id=1353073

These people should be happy with just live-migrations, so can't we just
keep savevm forbidden?



Re: [Qemu-devel] invtsc + migration + TSC scaling

2016-10-17 Thread Marcelo Tosatti
On Fri, Oct 14, 2016 at 06:20:31PM -0300, Eduardo Habkost wrote:
> I have been wondering: should we allow live migration with the
> invtsc flag enabled, if TSC scaling is available on the
> destination?

TSC scaling and invtsc flag, yes.

> 
> For reference, this is what the Intel SDM says about invtsc:
> 
>   The time stamp counter in newer processors may support an
>   enhancement, referred to as invariant TSC. Processor’s support
>   for invariant TSC is indicated by CPUID.80000007H:EDX[8].
> 
>   The invariant TSC will run at a constant rate in all ACPI P-,
>   C-, and T-states. This is the architectural behavior moving
>   forward. On processors with invariant TSC support, the OS may
>   use the TSC for wall clock timer services (instead of ACPI or
>   HPET timers). TSC reads are much more efficient and do not
>   incur the overhead associated with a ring transition or access
>   to a platform resource.
> 
> -- 
> Eduardo

Yes. The blockage happened for different reasons:

1) Migration: to host with different TSC frequency.

2) Savevm: It is not safe to use the TSC for wall clock timer
services.

By allowing savevm, you make a commitment to allow a feature
at the expense of not complying with the spec (specifically the "
the OS may use the TSC for wall clock timer services", because the
TSC stops relative to realtime for the duration of the savevm stop
window).

But since Linux guests use kvmclock and Windows guests use Hyper-V
enlightenment, it should be fine to disable 2).

There is a bug open for this, btw: 
https://bugzilla.redhat.com/show_bug.cgi?id=1353073