[RFC] CPUID usage for interaction between Hypervisors and Linux.

2008-10-01 Thread Alok Kataria
Hi,

Please find below the proposal for the generic use of the CPUID space
allotted for hypervisors. Apart from this CPUID space, another thing
worth noting is that Intel & AMD reserve the MSRs from 0x40000000
- 0x400000FF for software use. Though the proposal doesn't talk about
MSRs right now, we should be aware of these reservations, as we may want
to extend the way we use CPUID to MSR usage as well.

While we are at it, we also think we should form a group which has at
least one person representing each of the hypervisors interested in
generalizing the hypervisor CPUID space for Linux guest OS. This group
will be informed whenever a new CPUID leaf from the generic space is to
be used. This would help avoid any duplicate definitions for a CPUID
semantic by two different hypervisors. I think most of the people are
subscribed to LKML or the virtualization lists and we should use these
lists as a platform to decide on things. 

Thanks,
Alok

---

Hypervisor CPUID Interface Proposal
---

Intel & AMD have reserved CPUID levels 0x40000000 - 0x400000FF for
software use.  Hypervisors can use these levels to provide an interface
to pass information from the hypervisor to the guest running inside a
virtual machine.

This proposal defines a standard framework for the way in which the
Linux and hypervisor communities incrementally define this CPUID space.

(This proposal may be adopted by other guest OSes.  However, that is not
a requirement because a hypervisor can expose a different CPUID
interface depending on the guest OS type that is specified by the VM
configuration.)

Hypervisor Present Bit:
Bit 31 of ECX of CPUID leaf 0x1.

This bit has been reserved by Intel & AMD for use by
hypervisors, and indicates the presence of a hypervisor.

Virtual CPUs (hypervisors) set this bit to 1 and physical CPUs
(all existing and future CPUs) set this bit to zero.  Guest
software can probe this bit to detect whether it is
running inside a virtual machine.
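
For illustration, a minimal C sketch of how a guest could probe this bit; the helper name is made up, and __cpuid() is GCC's cpuid.h macro:

#include <cpuid.h>
#include <stdbool.h>

/* CPUID leaf 0x1, ECX bit 31: set by hypervisors, clear on bare metal. */
static bool running_on_hypervisor(void)
{
        unsigned int eax, ebx, ecx, edx;

        __cpuid(0x1, eax, ebx, ecx, edx);
        return (ecx >> 31) & 1;
}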

Hypervisor CPUID Information Leaf:
Leaf 0x40000000.

This leaf returns the CPUID leaf range supported by the
hypervisor and the hypervisor vendor signature.

# EAX: The maximum input value for CPUID supported by the hypervisor.
# EBX, ECX, EDX: Hypervisor vendor ID signature.
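
A sketch of reading this leaf and decoding the 12-byte signature; the variable names and output format are illustrative only:

#include <cpuid.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
        unsigned int eax, ebx, ecx, edx;
        char sig[13];

        /* Leaf 0x40000000: EAX = max hypervisor leaf, EBX/ECX/EDX = signature. */
        __cpuid(0x40000000, eax, ebx, ecx, edx);
        memcpy(sig + 0, &ebx, 4);
        memcpy(sig + 4, &ecx, 4);
        memcpy(sig + 8, &edx, 4);
        sig[12] = '\0';

        printf("hypervisor signature: %s, max leaf: 0x%x\n", sig, eax);
        return 0;
}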

Hypervisor Specific Leaves:
Leaf range 0x40000001 - 0x4000000F.

These CPUID leaves are reserved as hypervisor-specific leaves.
The semantics of these 15 leaves depend on the signature read
from the Hypervisor Information Leaf.

Generic Leaves:
Leaf range 0x40000010 - 0x400000FF.

The semantics of these leaves are consistent across all
hypervisors.  This allows the guest kernel to probe and
interpret these leaves without checking for a hypervisor
signature.

A hypervisor can indicate that a leaf or a leaf's field is
unsupported by returning zero when that leaf or field is probed.

To avoid the situation where multiple hypervisors attempt to define the
semantics for the same leaf during development, we can partition
the generic leaf space to allow each hypervisor to define a part
of the generic space.

For instance:
  VMware could define 0x4000001X
  Xen could define 0x4000002X
  KVM could define 0x4000003X
  and so on...

Note that hypervisors can implement any leaves that have been
defined in the generic leaf space whenever common features can
be found.  For example, VMware hypervisors can implement leaves
that have been defined in the KVM area 0x4000003X and vice
versa.

The kernel can detect the support for a generic field inside
leaf 0x400000XY using the following algorithm:

1.  Get EAX from Leaf 0x40000000, Hypervisor CPUID Information.
EAX returns the maximum input value for the hypervisor CPUID
space.

If EAX < 0x400000XY, then the field is not available.

2.  Else, extract the field from the target Leaf 0x400000XY
by doing cpuid(0x400000XY).

If (field == 0), this feature is unsupported/unimplemented
by the hypervisor.  The kernel should handle this case 
gracefully so that a hypervisor is never required to 
support or implement any particular generic leaf.
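
A C sketch of this probe (GCC's __cpuid() macro; the enum and helper name are illustrative):

#include <cpuid.h>

enum cpuid_reg { REG_EAX, REG_EBX, REG_ECX, REG_EDX };

/* Return one register of a generic leaf, or 0 if the hypervisor does not
 * support that leaf (step 1) or leaves the field unimplemented (step 2). */
static unsigned int hv_generic_field(unsigned int leaf, enum cpuid_reg reg)
{
        unsigned int eax, ebx, ecx, edx;

        __cpuid(0x40000000, eax, ebx, ecx, edx);    /* max hypervisor leaf */
        if (eax < leaf)
                return 0;

        __cpuid(leaf, eax, ebx, ecx, edx);
        switch (reg) {
        case REG_EAX: return eax;
        case REG_EBX: return ebx;
        case REG_ECX: return ecx;
        default:      return edx;
        }
}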



Definition of the Generic CPUID space.
Leaf 0x40000010, Timing Information.

VMware has defined the first generic leaf to provide timing
information.  This leaf returns the current TSC frequency and
current Bus frequency in kHz.

# EAX: (Virtual) TSC frequency in kHz.
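
Instantiating the probing algorithm above for this leaf (a sketch; the function name is illustrative):

#include <cpuid.h>

/* Return the (virtual) TSC frequency in kHz, or 0 if the hypervisor does
 * not implement the timing information leaf. */
static unsigned int hv_tsc_khz(void)
{
        unsigned int eax, ebx, ecx, edx;

        __cpuid(0x40000000, eax, ebx, ecx, edx);    /* max hypervisor leaf */
        if (eax < 0x40000010)
                return 0;
        __cpuid(0x40000010, eax, ebx, ecx, edx);
        return eax;                                 /* TSC frequency in kHz */
}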
   

Re: [RFC] CPUID usage for interaction between Hypervisors and Linux.

2008-10-01 Thread H. Peter Anvin
Alok Kataria wrote:
 
 (This proposal may be adopted by other guest OSes.  However, that is not
 a requirement because a hypervisor can expose a different CPUID
 interface depending on the guest OS type that is specified by the VM
 configuration.)
 

Excuse me, but that is blatantly idiotic.  Expecting the user to have to
configure a VM to match the target OS is *exactly* as stupid as
expecting the user to reconfigure the BIOS.  It's totally the wrong 
thing to do.

-hpa


Re: [RFC] CPUID usage for interaction between Hypervisors and Linux.

2008-10-01 Thread H. Peter Anvin
Alok Kataria wrote:
 
 Hypervisor CPUID Interface Proposal
 ---
 
 Intel & AMD have reserved cpuid levels 0x40000000 - 0x400000FF for
 software use.  Hypervisors can use these levels to provide an interface
 to pass information from the hypervisor to the guest running inside a
 virtual machine.
 
 This proposal defines a standard framework for the way in which the
 Linux and hypervisor communities incrementally define this CPUID space.
 

I also observe that your proposal provides no means of positive
identification, i.e. that a hypervisor actually conforms to your proposal.

-hpa


Re: [RFC] CPUID usage for interaction between Hypervisors and Linux.

2008-10-01 Thread Jeremy Fitzhardinge
Alok Kataria wrote:
 Hi,

 Please find below the proposal for the generic use of cpuid space
 allotted for hypervisors. Apart from this cpuid space another thing
 worth noting would be that, Intel & AMD reserve the MSRs from 0x40000000
 - 0x400000FF for software use. Though the proposal doesn't talk about
 MSR's right now, we should be aware of these reservations as we may want
 to extend the way we use CPUID to MSR usage as well.

 While we are at it, we also think we should form a group which has at
 least one person representing each of the hypervisors interested in
 generalizing the hypervisor CPUID space for Linux guest OS. This group
 will be informed whenever a new CPUID leaf from the generic space is to
 be used. This would help avoid any duplicate definitions for a CPUID
 semantic by two different hypervisors. I think most of the people are
 subscribed to LKML or the virtualization lists and we should use these
 lists as a platform to decide on things. 

 Thanks,
 Alok

 ---

 Hypervisor CPUID Interface Proposal
 ---

 Intel & AMD have reserved cpuid levels 0x40000000 - 0x400000FF for
 software use.  Hypervisors can use these levels to provide an interface
 to pass information from the hypervisor to the guest running inside a
 virtual machine.

 This proposal defines a standard framework for the way in which the
 Linux and hypervisor communities incrementally define this CPUID space.

 (This proposal may be adopted by other guest OSes.  However, that is not
 a requirement because a hypervisor can expose a different CPUID
 interface depending on the guest OS type that is specified by the VM
 configuration.)

 Hypervisor Present Bit:
 Bit 31 of ECX of CPUID leaf 0x1.

 This bit has been reserved by Intel & AMD for use by
 hypervisors, and indicates the presence of a hypervisor.

 Virtual CPU's (hypervisors) set this bit to 1 and physical CPU's
 (all existing and future cpu's) set this bit to zero.  This bit
   can be probed by the guest software to detect whether they are
   running inside a virtual machine.

 Hypervisor CPUID Information Leaf:
 Leaf 0x40000000.

 This leaf returns the CPUID leaf range supported by the
 hypervisor and the hypervisor vendor signature.

 # EAX: The maximum input value for CPUID supported by the hypervisor.
 # EBX, ECX, EDX: Hypervisor vendor ID signature.

 Hypervisor Specific Leaves:
 Leaf range 0x40000001 - 0x4000000F.

 These cpuid leaves are reserved as hypervisor specific leaves.
 The semantics of these 15 leaves depend on the signature read
 from the Hypervisor Information Leaf.

 Generic Leaves:
 Leaf range 0x40000010 - 0x400000FF.

 The semantics of these leaves are consistent across all
 hypervisors.  This allows the guest kernel to probe and
 interpret these leaves without checking for a hypervisor
 signature.

 A hypervisor can indicate that a leaf or a leaf's field is
 unsupported by returning zero when that leaf or field is probed.

 To avoid the situation where multiple hypervisors attempt to define 
 the
 semantics for the same leaf during development, we can partition
 the generic leaf space to allow each hypervisor to define a part
 of the generic space.

 For instance:
   VMware could define 0x4000001X
   Xen could define 0x4000002X
   KVM could define 0x4000003X
 and so on...
   

No, we're not getting anywhere.  This is an outright broken idea.  The 
space is too small to be able to chop up in this way, and the number of 
vendors too large to be able to do it without having a central oversight.

The only way this can work is by having explicit positive identification 
of each group of leaves with a signature.  If there's a recognizable 
signature, then you can inspect the rest of the group; if not, then you 
can't.  That way, you can avoid any leaf usage which doesn't conform to 
this model, and you can also simultaneously support multiple hypervisor 
ABIs.  It also accommodates existing hypervisor use of this leaf space, 
even if they currently use a fixed location within it.

A concrete counter-proposal:

The space 0x40000000-0x400000ff is reserved for hypervisor usage.

This region is divided into 16 16-leaf blocks.  Each block has the 
structure:

0x400000x0:
eax: max used leaf within the leaf block (max 0x400000xf)
e[bcd]x: leaf block signature.  This may be a hypervisor-specific 
signature, or a generic signature, depending on the contents of the block

A guest may search for any supported hypervisor ABIs by inspecting each
leaf at 0x400000x0 for a known signature, and then may choose its mode
of operation accordingly.  It must ignore any unknown signatures, and
not touch any of the leaves within an unknown leaf block.
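
A C sketch of this scan over the 16 candidate block bases; the helper name is made up, and the signature string in the comment is only an example:

#include <cpuid.h>
#include <string.h>

/* Walk the 16 block bases 0x40000000, 0x40000010, ... 0x400000f0 and
 * return the base whose signature matches, or 0 if none is recognized. */
static unsigned int find_leaf_block(const char *wanted /* e.g. "XenVMMXenVMM" */)
{
        unsigned int base, eax, ebx, ecx, edx;
        char sig[13];

        for (base = 0x40000000; base <= 0x400000f0; base += 0x10) {
                __cpuid(base, eax, ebx, ecx, edx);
                memcpy(sig + 0, &ebx, 4);
                memcpy(sig + 4, &ecx, 4);
                memcpy(sig + 8, &edx, 4);
                sig[12] = '\0';
                if (!strcmp(sig, wanted))
                        return base;    /* leaves base .. base + 0xf belong to it */
        }
        return 0;                       /* unknown blocks must be left alone */
}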

Hypervisor vendors who want to add a 

Re: [RFC] CPUID usage for interaction between Hypervisors and Linux.

2008-10-01 Thread Jeremy Fitzhardinge
H. Peter Anvin wrote:
 Jeremy Fitzhardinge wrote:

 No, we're not getting anywhere.  This is an outright broken idea.  
 The space is too small to be able to chop up in this way, and the 
 number of vendors too large to be able to do it without having a 
 central oversight.


 I suspect we can get a larger number space if we ask Intel & AMD.  In
 fact, I think we should request that the entire 0x40xxxxxx numberspace
 is assigned to virtualization *anyway*.

Yes, that would be good.  In that case I'd revise my proposal to make
each leaf block 256 leaves instead of 16.  But it still needs to be a
proper enumeration with signatures, rather than assigning fixed points 
in that space to specific interfaces.

J


Re: [RFC] CPUID usage for interaction between Hypervisors and Linux.

2008-10-01 Thread H. Peter Anvin
Jeremy Fitzhardinge wrote:
 
 No, we're not getting anywhere.  This is an outright broken idea.  The 
 space is too small to be able to chop up in this way, and the number of 
 vendors too large to be able to do it without having a central oversight.
 

I suspect we can get a larger number space if we ask Intel & AMD.  In
fact, I think we should request that the entire 0x40xxxxxx numberspace
is assigned to virtualization *anyway*.

-hpa


Re: [RFC] CPUID usage for interaction between Hypervisors and Linux.

2008-10-01 Thread Jeremy Fitzhardinge
H. Peter Anvin wrote:
 With a sufficiently large block, we could use fixed points, e.g. by
 having each vendor create interfaces in the 0x40VVVVXX range, where
 VVVV is the PCI ID they use for PCI devices.

Sure, you could do that, but you'd still want to have a signature in
0x40VVVV00 to positively identify the chunk.  And what if you wanted
more than 256 leaves?

 Note that I said "create interfaces".  The important thing about this is
 who specified the interface -- for "what hypervisor is this", just use
 0x40000000 and disambiguate based on that.

"What hypervisor is this?" isn't a very interesting question; if you're
even asking it then it suggests that something has gone wrong.  It's much
more useful to ask "what interfaces does this hypervisor support?", and
enumerating a smallish range of well-known leaves looking for signatures
is the simplest way to do that.  (We could use signatures derived from
the PCI vendor IDs, which would help with managing that namespace.)
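
For illustration only, here is what the 0x40VVVVXX mapping would look like in C; 0x15AD is VMware's PCI vendor ID, and nothing in this thread makes the mapping a standard:

/* Map a 16-bit PCI vendor ID into the suggested 0x40VVVVXX leaf range. */
static unsigned int vendor_leaf_base(unsigned short pci_vendor_id)
{
        return 0x40000000u | ((unsigned int)pci_vendor_id << 8);
}

/* Example: vendor_leaf_base(0x15AD) == 0x4015AD00, giving that vendor
 * the 256 leaves 0x4015AD00 - 0x4015ADFF. */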

J


Re: [RFC] CPUID usage for interaction between Hypervisors and Linux.

2008-10-01 Thread Jeremy Fitzhardinge
H. Peter Anvin wrote:
 What you'd want, at least, is a standard CPUID identification and 
 range leaf at the top.  256 leaves is a *lot*, though; I'm not saying 
 one couldn't run out, but it'd be hard.  Keep in mind that for large 
 objects there are counting CPUID levels, as much as I personally 
 dislike them, and one could easily argue that if you're doing 
 something that would require anywhere near 256 leaves you probably are 
 storing bulk data that belongs elsewhere.

I agree, but it just makes the proposal a bit more brittle.

 Of course, if we had some kind of central authority assigning 8-bit 
 IDs that would be even better, especially since there are tools in the 
 field which already scan on 64K boundaries.  I don't know, though, how 
 likely it is that we'll have to deal with 256 hypervisors.

I'm assuming that the likelihood of getting all possible vendors - 
current and future - to agree to a scheme like this is pretty small.  We 
need to come up with something that will work well when there are 
non-cooperative parties to deal with.

 I agree completely, of course (except that "what hypervisor is this"
 still has limited usage, especially when it comes to dealing with bug
 workarounds.  Similar to the way we use CPU vendor IDs and stepping
 numbers for physical CPUs.)

I guess.  It's certainly useful to be able to identify the hypervisor for
bug reporting and just general status information.  But making
functional changes on that basis should be a last resort.

J


Re: [RFC] CPUID usage for interaction between Hypervisors and Linux.

2008-10-01 Thread Anthony Liguori
Jeremy Fitzhardinge wrote:
 Alok Kataria wrote:
 
 No, we're not getting anywhere.  This is an outright broken idea.  The 
 space is too small to be able to chop up in this way, and the number of 
 vendors too large to be able to do it without having a central oversight.
 
 The only way this can work is by having explicit positive identification 
 of each group of leaves with a signature.  If there's a recognizable 
 signature, then you can inspect the rest of the group; if not, then you 
 can't.  That way, you can avoid any leaf usage which doesn't conform to 
 this model, and you can also simultaneously support multiple hypervisor 
 ABIs.  It also accommodates existing hypervisor use of this leaf space, 
 even if they currently use a fixed location within it.
 
 A concrete counter-proposal:

Mmm, cpuid bikeshedding :-)

 The space 0x40000000-0x400000ff is reserved for hypervisor usage.
 
 This region is divided into 16 16-leaf blocks.  Each block has the 
 structure:
 
 0x400000x0:
 eax: max used leaf within the leaf block (max 0x400000xf)

Why even bother with this?  It doesn't seem necessary in your proposal.

Regards,

Anthony Liguori


Re: [RFC] CPUID usage for interaction between Hypervisors and Linux.

2008-10-01 Thread Jeremy Fitzhardinge
Anthony Liguori wrote:
 Mmm, cpuid bikeshedding :-)

My shade of blue is better.

 The space 0x40000000-0x400000ff is reserved for hypervisor usage.

 This region is divided into 16 16-leaf blocks.  Each block has the 
 structure:

 0x400000x0:
 eax: max used leaf within the leaf block (max 0x400000xf)

 Why even bother with this?  It doesn't seem necessary in your proposal.

It allows someone to incrementally add things to their block in a fairly
orderly way.  But more importantly, it's the prevailing idiom, and the
existing and proposed cpuid schemes already do this, so they'd fit in as-is.

J


Re: [RFC] CPUID usage for interaction between Hypervisors and Linux.

2008-10-01 Thread Anthony Liguori
Jeremy Fitzhardinge wrote:
 Anthony Liguori wrote:
 Mmm, cpuid bikeshedding :-)
 
 My shade of blue is better.
 
 The space 0x40000000-0x400000ff is reserved for hypervisor usage.

 This region is divided into 16 16-leaf blocks.  Each block has the 
 structure:

 0x400000x0:
 eax: max used leaf within the leaf block (max 0x400000xf)
 Why even bother with this?  It doesn't seem necessary in your proposal.
 
 It allows someone to incrementally add things to their block in a fairly 
 orderly way.  But more importantly, its the prevailing idiom, and the 
 existing and proposed cpuid schemes already do this, so they'd fit in as-is.

We just leave eax as zero.  It wouldn't be that upsetting to change this 
as it would only keep new guests from working on older KVMs.

However, I see little incentive to change anything unless there's 
something compelling that we would get in return.  Since we're only 
talking about Linux guests, it's just as easy for us to add things to 
our paravirt_ops implementation as it would be to add things using this 
new model.

If this was something that other guests were all agreeing to support 
(even if it was just the BSDs and OpenSolaris), then there may be value 
to it.  Right now, I see no real value in changing the status quo.

Regards,

Anthony Liguori


 J



Re: [RFC] CPUID usage for interaction between Hypervisors and Linux.

2008-10-01 Thread Anthony Liguori
Alok Kataria wrote:
 On Wed, 2008-10-01 at 11:04 -0700, Jeremy Fitzhardinge wrote:
   
 2. Divergence in the interface provided by the hypervisors  : 
   The reason we brought up a flat hierarchy is because we think we should
 be moving towards an approach where the guest code doesn't diverge too
 much when running under different hypervisors. That is, the guest
 essentially does the same thing if it's running on, say, Xen or VMware.

 This design, IMO, will take us a step backward to what we have already
 seen with paravirt ops. Each hypervisor (mostly) defines its own cpuid
 block, the guest correspondingly needs to have code to handle each of
 these cpuid blocks, with these blocks mostly being exclusive.
   

What's wrong with what we have in paravirt_ops?  Just agreeing on CPUID 
doesn't help very much.  You still need a mechanism for doing hypercalls 
to implement anything meaningful.  We aren't going to agree on a 
hypercall mechanism.  KVM uses direct hypercall instructions, Xen uses a 
hypercall page, VMware uses VMI, Hyper-V uses MSR writes.  We all have 
already defined the hypercall namespace in a certain way.

We've already gone down the road of trying to make standard paravirtual 
interfaces (via virtio).  No one was sufficiently interested in 
collaborating.  I don't see why other paravirtualizations are going to 
be much different.

Regards,

Anthony Liguori


Re: [RFC] CPUID usage for interaction between Hypervisors and Linux.

2008-10-01 Thread Chris Wright
* Anthony Liguori ([EMAIL PROTECTED]) wrote:
 We've already gone down the road of trying to make standard paravirtual  
 interfaces (via virtio).  No one was sufficiently interested in  
 collaborating.  I don't see why other paravirtualizations are going to  
 be much different.

The point is to be able to support those interfaces.  Presently a Linux guest
will test and find out which HV it's running on, and adapt.  Another
guest will fail to enlighten itself, and perf will suffer...yadda, yadda.

thanks,
-chris


Re: [RFC] CPUID usage for interaction between Hypervisors and Linux.

2008-10-01 Thread Jeremy Fitzhardinge
Alok Kataria wrote:
 1. Kernel complexity : Just thinking about the complexity that this will
 put in the kernel to handle these multiple ABI signatures and scanning
 all of these leaf blocks is difficult to digest.
   

The scanning for the signatures is trivial; it's not a significant
amount of code.  Actually implementing them is a different matter, but
that's the same regardless of where they are placed or how they're
discovered.  After discovery it's the same either way: there's a leaf
base with offsets from it.

 2. Divergence in the interface provided by the hypervisors  : 
   The reason we brought up a flat hierarchy is because we think we should
 be moving towards an approach where the guest code doesn't diverge too
 much when running under different hypervisors. That is, the guest
 essentially does the same thing if it's running on, say, Xen or VMware.
   

I guess, but the bulk of the uses of this stuff are going to be
hypervisor-specific.  You're hard-pressed to come up with any other
generic uses beyond tsc.  In general, if a hypervisor is going to put
something in a special cpuid leaf, it's because there's no other good way
to represent it.  Generic things are generally going to appear as an
emulated piece of the virtualized platform, in ACPI, DMI, a
hardware-defined cpuid leaf, etc...

 3. Is there a need to do all this over-engineering:
   Aren't we over-engineering a simple interface over here? The point is,
 there are right now 256 cpuid leaves; do we realistically think we are
 ever going to exhaust all these leaves? We are really surprised to know
 that people may think this space is small enough. It would be
 interesting to know what all uses you might want to put cpuid to.
   

Look, if you want to propose a way to use that cpuid space in a
reasonably flexible way that allows it to be used as the need arises,
then we can talk about it.  But I think your proposal is a poor way to
achieve those ends.

If you want blessing for something that you've already implemented and 
shipped, well, you don't need anyone's blessing for that.

J


Re: [RFC] CPUID usage for interaction between Hypervisors and Linux.

2008-10-01 Thread Anthony Liguori
Alok Kataria wrote:
 Your explanation below answers the question you raised, the problem
 being we need to have support for each of these different hypercall
 mechanisms in the kernel. 
 I understand that this was the correct thing to do at that moment. 
 But do we want to go the same way again for CPUID when we can make it
 generic (flat enough) for anybody to use it in the same manner and
 expose a generic interface to the kernel.
   

But what sort of information can be stored in cpuid that's actually
useful?  Right now we just use it in KVM for feature bits.  Most of the
stuff that's interesting is stored in shared memory because a guest can
read that without taking a vmexit, or via a hypercall.

We can all agree upon a common mechanism for doing something but if no 
one is using that mechanism to do anything significant, what purpose 
does it serve?

Regards,

Anthony Liguori



Re: [RFC] CPUID usage for interaction between Hypervisors and Linux.

2008-10-01 Thread Anthony Liguori
Chris Wright wrote:
 * Anthony Liguori ([EMAIL PROTECTED]) wrote:
   
 We've already gone down the road of trying to make standard paravirtual  
 interfaces (via virtio).  No one was sufficiently interested in  
 collaborating.  I don't see why other paravirtualizations are going to  
 be much different.
 

 The point is to be able to support those interfaces.  Presently a Linux guest
 will test and find out which HV it's running on, and adapt.  Another
 guest will fail to enlighten itself, and perf will suffer...yadda, yadda.
   

Agreeing on CPUID does not get us close at all to having shared 
interfaces for paravirtualization.  As I said in another note, there are 
more fundamental things that we differ on (like hypercall mechanism) 
that's going to make that challenging.

We already are sharing code, when appropriate (see the Xen/KVM PV clock 
interface).

Regards,

Anthony Liguori

 thanks,
 -chris
   



Re: [RFC] CPUID usage for interaction between Hypervisors and Linux.

2008-10-01 Thread Anthony Liguori
Jeremy Fitzhardinge wrote:
 Alok Kataria wrote:

 I guess, but the bulk of the uses of this stuff are going to be 
 hypervisor-specific.  You're hard-pressed to come up with any other 
 generic uses beyond tsc.

And arguably, storing TSC frequency in CPUID is a terrible interface 
because the TSC frequency can change any time a guest is entered.  It 
really should be a shared memory area so that a guest doesn't have to 
vmexit to read it (like it is with the Xen/KVM paravirt clock).
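
For reference, the shared-memory record alluded to here looks roughly like the Xen/KVM pvclock structure below (paraphrased from Linux's pvclock ABI; treat the exact layout as a sketch):

#include <stdint.h>

/* Per-vCPU time info the hypervisor publishes in guest memory.  The guest
 * reads it locklessly (version is odd while an update is in progress), so
 * no vmexit is needed to convert TSC ticks to nanoseconds. */
struct pvclock_vcpu_time_info {
        uint32_t version;           /* bumped twice per hypervisor update   */
        uint32_t pad0;
        uint64_t tsc_timestamp;     /* guest TSC at the time of the update  */
        uint64_t system_time;       /* ns of system time at tsc_timestamp   */
        uint32_t tsc_to_system_mul; /* fixed-point multiplier ...           */
        int8_t   tsc_shift;         /* ... and shift for the tsc->ns scale  */
        uint8_t  pad[3];
} __attribute__((packed));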

Regards,

Anthony Liguori

   In general, if a hypervisor is going to put something in a special 
 cpuid leaf, its because there's no other good way to represent it.  
 Generic things are generally going to appear as an emulated piece of 
 the virtualized platform, in ACPI, DMI, a hardware-defined cpuid leaf, 
 etc...



Re: [RFC] CPUID usage for interaction between Hypervisors and Linux.

2008-10-01 Thread Zachary Amsden
On Wed, 2008-10-01 at 14:34 -0700, Anthony Liguori wrote:
 Jeremy Fitzhardinge wrote:
  Alok Kataria wrote:
 
  I guess, but the bulk of the uses of this stuff are going to be
  hypervisor-specific.  You're hard-pressed to come up with any other
  generic uses beyond tsc.
 
 And arguably, storing TSC frequency in CPUID is a terrible interface
 because the TSC frequency can change any time a guest is entered.  It
 really should be a shared memory area so that a guest doesn't have to
 vmexit to read it (like it is with the Xen/KVM paravirt clock).

It's not terrible, it's actually brilliant.  TSC is part of the
processor architecture; the processor should have a way to tell us what
speed it is.

Having a TSC with no interface to determine the frequency is a terrible
design flaw.  This is what caused the problem in the first place.

And now we're trying to fiddle around with software wizardry to do what
should be done in hardware in the first place.  Once again, para-virtualization
is basically useless.  We can't agree on a solution without
over-designing some complex system with interface signatures and
multi-vendor cooperation and nonsense.  Solve the non-virtualized
problem and the virtualized problem goes away.

Jun, you work at Intel.  Can you ask for a new architecturally defined
MSR that returns the TSC frequency?  Not a virtualization specific MSR.
A real MSR that would exist on physical processors.  The TSC started as
an MSR anyway.  There should be another MSR that tells the frequency.
If it's hard to do in hardware, it can be a write-once MSR that gets
initialized by the BIOS.  It's really a very simple solution to a very
common problem.  Other MSRs are dedicated to bus speed and so on, this
seems remarkably similar.

Once the physical problem is solved, the virtualized problem doesn't
even exist.  We simply add support for the newly defined MSR and voilà.
Other chipmakers probably agree it's a good idea and go along with it
too, and in the meantime, reading a non-existent MSR is a fairly
harmlessly handled #GP.
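
A kernel-style sketch of what consuming such an MSR could look like; the MSR name and index are invented for illustration, since no such architectural MSR exists:

#include <asm/msr.h>    /* rdmsr_safe(); Linux kernel context assumed */

/* Hypothetical architectural "TSC frequency" MSR.  The index below is a
 * placeholder, not a real MSR. */
#define MSR_TSC_FREQ_KHZ        0x12345678

static unsigned long tsc_khz_from_msr(void)
{
        u32 lo, hi;

        /* rdmsr_safe() catches the #GP raised by CPUs without the MSR. */
        if (rdmsr_safe(MSR_TSC_FREQ_KHZ, &lo, &hi))
                return 0;       /* MSR absent: fall back to calibration */
        return lo;              /* frequency in kHz in the low 32 bits  */
}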

I realize it's the wrong thing for us now, but long term, it's the only
architecturally 'correct' approach.  You can even extend it to have
visible TSC frequency changes clocked via performance counter events
(and then get interrupts on those events if you so wish), solving the
dynamic problem too.

Paravirtualization is a symptom of an architectural problem.  We should
always be trying to fix the architecture first.

Zach



Re: [RFC] CPUID usage for interaction between Hypervisors and Linux.

2008-10-01 Thread H. Peter Anvin
Zachary Amsden wrote:
 
 Jun, you work at Intel.  Can you ask for a new architecturally defined
 MSR that returns the TSC frequency?  Not a virtualization specific MSR.
 A real MSR that would exist on physical processors.  The TSC started as
 an MSR anyway.  There should be another MSR that tells the frequency.
 If it's hard to do in hardware, it can be a write-once MSR that gets
 initialized by the BIOS.  It's really a very simple solution to a very
 common problem.  Other MSRs are dedicated to bus speed and so on, this
 seems remarkably similar.
 

Ah, if it was only that simple.  Transmeta actually did this, but it's 
not as useful as you think.

There are at least three crystals in modern PCs: one at 32.768 kHz (for 
the RTC), one at 14.31818 MHz (PIT, PMTMR and HPET), and one at a higher 
frequency (often 200 MHz.)

All the main data distribution clocks in the system are derived from the 
third, which is subject to spread-spectrum modulation due to RFI 
concerns.  Therefore, relying on the *nominal* frequency of this clock 
is vastly incorrect; often by as much as 2%.  Spread-spectrum modulation 
is supposed to vary around zero enough that the spreading averages out, 
but the only way to know what the center frequency actually is is to 
average.  Furthermore, this high-frequency clock is generally not 
calibrated anywhere near as well as the 14 MHz clock; in good designs 
the 14 MHz is actually a TCXO (temperature compensated crystal 
oscillator), which is accurate to something like ±2 ppm.
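
A userspace-flavored sketch of the averaging described here, using clock_gettime() as the reference where a kernel would use the PIT or PMTMR; the 250 ms window and names are illustrative:

#include <stdint.h>
#include <time.h>
#include <x86intrin.h>          /* __rdtsc() */

/* Estimate the TSC frequency in kHz by averaging over a window long
 * enough that spread-spectrum modulation averages out. */
static uint64_t measure_tsc_khz(void)
{
        struct timespec t0, t1;
        uint64_t tsc0, tsc1, ns;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        tsc0 = __rdtsc();
        do {
                clock_gettime(CLOCK_MONOTONIC, &t1);
                ns = (t1.tv_sec - t0.tv_sec) * 1000000000ULL
                     + (t1.tv_nsec - t0.tv_nsec);
        } while (ns < 250000000ULL);            /* 250 ms sample interval */
        tsc1 = __rdtsc();

        return (tsc1 - tsc0) * 1000000ULL / ns; /* cycles per ms == kHz   */
}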

-hpa

RE: [RFC] CPUID usage for interaction between Hypervisors and Linux.

2008-10-01 Thread Nakajima, Jun
On 10/1/2008 3:46:45 PM, H. Peter Anvin wrote:
 Alok Kataria wrote:
   No, that's always a terrible idea.  Sure, it's necessary to deal
   with some backward-compatibility issues, but we shouldn't even
   consider a new interface which assumes this kind of thing.  We
   want properly enumerable interfaces.
 
  The reason we still have to do this is because Microsoft has
  already defined a CPUID format which is way different from what you
  or I are proposing (with the current case of 256 leaves being
  available). And I doubt they would change the way they deal with it on
  their OS.
  Any proposal that we go with, we will have to export a different CPUID
  interface from the hypervisor for the two OSes in question.

  So I think this is something that we will have to do anyway, and it is
  not worth bringing up in the discussion.

 No, that's a good hint that what you and I are proposing is utterly
 broken and exactly underscores what I have been stressing about
 noncompliant hypervisors.

 All I have seen out of Microsoft only covers CPUID levels 0x40000000
 as a vendor identification leaf and 0x40000001 as a hypervisor
 identification leaf, but you might have access to other information.

No, it says Leaf 0x40000001 is the hypervisor vendor-neutral interface
identification, which determines the semantics of leaves from 0x40000002
through 0x400000FF. Leaf 0x40000000 returns the vendor identifier signature
(i.e. hypervisor identification) and the hypervisor CPUID leaf range, as in the
proposal.


 This further underscores my belief that using 0x400000xx for anything
 standards-based at all is utterly futile, and that this space should
 be treated as vendor identification and the rest as vendor-specific.
 Any hope of creating a standard that's actually usable needs to be
 outside this space, e.g. in the 0x40VVVVXX space I proposed earlier.


Actually I'm not sure I'm following your logic. Are you saying that using
0x400000xx for anything standards-based is utterly futile because Microsoft
said the range is hypervisor vendor-neutral? Or were you not sure what they
meant there? If we are not clear, we can ask them.


 -hpa
Jun Nakajima | Intel Open Source Technology Center


Re: [RFC] CPUID usage for interaction between Hypervisors and Linux.

2008-10-01 Thread Zachary Amsden
On Wed, 2008-10-01 at 17:39 -0700, H. Peter Anvin wrote:
 third, which is subject to spread-spectrum modulation due to RFI
 concerns.  Therefore, relying on the *nominal* frequency of this clock

I'm not suggesting using the nominal value.  I'm suggesting the
measurement be done in the one and only place where there is perfect
control of the system, the processor boot-strapping in the BIOS.

Only the platform designers themselves know the speed of the oscillator
which is modulating the clock and so only they should be calibrating the
speed of the TSC.

If this modulation really does alter the frequency by +/- 2% (seems high
to me, but hey, I don't design motherboards), using an LFO, then
basically all the calibration done in Linux is broken and has been for
some time.  You can't calibrate only once, or risk being off by 2%, you
can't calibrate repeatedly and take the fastest estimate, or you are off
by 2%, and you can't calibrate repeatedly and take the average without
risking SMI noise affecting the lowest clock speed measurement,
contributing unknown error.

Hmm.  Re-reading your e-mail, I see you are saying the nominal frequency
may be off by 2% (and I easily believe that), not necessarily that the
frequency modulation may be 2% (which I still think is high).  Does
anyone know what the actual bounds on spread spectrum modulation are or
how fast the clock is modulated?

Zach



Re: [RFC] CPUID usage for interaction between Hypervisors and Linux.

2008-10-01 Thread H. Peter Anvin
Zachary Amsden wrote:
 
 I'm not suggesting using the nominal value.  I'm suggesting the
 measurement be done in the one and only place where there is perfect
 control of the system, the processor boot-strapping in the BIOS.
 
 Only the platform designers themselves know the speed of the oscillator
 which is modulating the clock and so only they should be calibrating the
 speed of the TSC.
 

No.  *No one*, including the manufacturers, knows the speed of the
oscillator which is modulating the clock.  What you have to do is
average over a timespan which is long enough that the SSM averages out 
(a relatively small fraction of a second.)

As for trusting the BIOS on this, that's a total joke.  Firmware vendors 
can't get the most basic details right.

 If this modulation really does alter the frequency by +/- 2% (seems high
 to me, but hey, I don't design motherboards), using an LFO, then
 basically all the calibration done in Linux is broken and has been for
 some time.  You can't calibrate only once, or risk being off by 2%, you
 can't calibrate repeatedly and take the fastest estimate, or you are off
 by 2%, and you can't calibrate repeatedly and take the average without
 risking SMI noise affecting the lowest clock speed measurement,
 contributing unknown error.

You have to calibrate over a sample interval long enough that the SSM 
averages out.

 Hmm.  Re-reading your e-mail, I see you are saying the nominal frequency
 may be off by 2% (and I easily believe that), not necessarily that the
 frequency modulation may be 2% (which I still think is high).  Does
 anyone know what the actual bounds on spread spectrum modulation are or
 how fast the clock is modulated?

No, I'm saying the frequency modulation may be up to 2%.  Typically it 
is something like [-2%,+0%].

-hpa