RE: [Intel-gfx] [ANNOUNCE][RFC] KVMGT - the implementation of Intel GVT-g(full GPU virtualization) for KVM

2014-12-10 Thread Tian, Kevin
 From: Paolo Bonzini [mailto:pbonz...@redhat.com]
 Sent: Thursday, December 11, 2014 12:59 AM
 
 On 09/12/2014 03:49, Tian, Kevin wrote:
  - Now XenGT and KVMGT are maintained separately, and KVMGT lags
  behind XenGT in features and quality. You'll likely continue to
  see stale code (like the Xen instruction decoder) for some time. In the future we
  plan to maintain a single kernel repo for both, so KVMGT can reach the
  same quality as XenGT once the KVM in-kernel dm framework is stable.
 
  - Regarding the Qemu hacks, KVMGT really doesn't have any requirements
  different from what has been discussed for GPU pass-through, e.g.
  about the ISA bridge. Our implementation is based on an old Qemu repo
  and, honestly speaking, is not cleanly developed, because we know we
  can leverage GPU pass-through support once it's in Qemu. At
  that point we'll reuse the same logic with minimal changes to
  hook the KVMGT mgmt. APIs (e.g. create/destroy a vGPU instance). So
  we can ignore this area for now. :-)
 
 Could the virtual device model introduce new registers in order to avoid
 poking at the ISA bridge?  I'm not sure that you can leverage from GPU
 pass-through support once it's in Qemu, since the Xen IGD passthrough
 support is being added to a separate machine that is specific to Xen IGD
 passthrough; no ISA bridge hacking will probably be allowed on the -M
 pc and -M q35 machine types.
 

My point is that KVMGT doesn't introduce any requirements beyond what's
already needed for the IGD passthrough case, because all the hacks you see now
exist to satisfy the guest graphics driver's expectations. I haven't followed the
KVM IGD passthrough progress, but if it doesn't require ISA bridge hacking,
the same trick can be adopted by KVMGT too. You may know that Allen is
working on driver changes to avoid the need for those hacks on the Qemu side.
That effort will benefit us too. So I don't think this is a KVMGT-specific
issue; we need a common solution to close this gap instead of
hacking the vGPU device model alone.

Thanks
Kevin


RE: [Intel-gfx] [ANNOUNCE][RFC] KVMGT - the implementation of Intel GVT-g(full GPU virtualization) for KVM

2014-12-09 Thread Tian, Kevin
 From: Song, Jike
 Sent: Wednesday, December 10, 2014 2:34 PM
 
 CC Kevin.
 
 
 On 12/09/2014 05:54 PM, Jan Kiszka wrote:
  On 2014-12-04 03:24, Jike Song wrote:
  Hi all,
 
We are pleased to announce the first release of KVMGT project. KVMGT
 is
  the implementation of Intel GVT-g technology, a full GPU virtualization
  solution. Under Intel GVT-g, a virtual GPU instance is maintained for
  each VM, with part of performance critical resources directly assigned.
  The capability of running native graphics driver inside a VM, without
  hypervisor intervention in performance critical paths, achieves a good
  balance of performance, feature, and sharing capability.
 
 
KVMGT is still in the early stage:
 
  - Basic functions of full GPU virtualization work; the guest can see a
   full-featured vGPU.
   We ran several 3D workloads such as lightsmark, nexuiz, urbanterror
  and warsow.
 
  - Only Linux guests are supported so far, and PPGTT must be disabled in the
   guest through a
    kernel parameter (see README.kvmgt in QEMU).
 
 - This drop also includes some Xen specific changes, which will be
  cleaned up later.
 
  - Our end goal is to upstream both XenGT and KVMGT, which share ~90% of
   the logic for the vGPU device model (it will be part of the i915 driver),
   with the only difference being the hypervisor-specific services.
 
 - insufficient test coverage, so please bear with stability issues :)
 
 
 
 There are things that need to be improved, esp. the KVM interfacing part:
 
    1. a domid was added to each KVMGT guest
 
    An ID is needed for foreground OS switching, e.g.
 
    # echo <domid> > /sys/kernel/vgt/control/foreground_vm
 
    domid 0 is reserved for the host OS.
 
 
    2. SRCU workarounds.
 
    Some KVM functions, such as:
 
        kvm_io_bus_register_dev
        install_new_memslots
 
    must be called *without* kvm->srcu read-locked. Otherwise it
   hangs.
 
    In KVMGT, we need to register an iodev only *after* BAR
   registers are
    written by the guest. That means we already hold kvm->srcu --
    trapping/emulating PIO (BAR registers) puts us in such a
   condition.
    That will make kvm_io_bus_register_dev hang.
 
    Currently we have to disable rcu_assign_pointer() in such
   functions.
 
    These are dirty workarounds; your suggestions are highly
   welcome! (A minimal sketch of the deadlock appears after this list.)
 
 
    3. syscalls are used to access /dev/mem from the kernel
 
    An in-kernel memslot was added for the aperture, but syscalls
   like
    open and mmap are used to open and access the character device
    /dev/mem for pass-through.
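 
   For issue 2, a minimal sketch of the deadlock, assuming the 3.x-era
   kvm_io_bus_register_dev() signature (register_bar_iodev() is a hypothetical
   helper, not code from this release):
 
       static int register_bar_iodev(struct kvm *kvm, struct kvm_io_device *iodev,
                                     gpa_t bar_addr, int bar_len)
       {
               int idx, r;
 
               /* Held because we got here by trapping a PIO write to a BAR. */
               idx = srcu_read_lock(&kvm->srcu);
 
               mutex_lock(&kvm->slots_lock);
               /*
                * kvm_io_bus_register_dev() ends with a synchronize_srcu() on
                * kvm->srcu, which cannot complete while this thread still holds
                * the read lock -- so the call below never returns.
                */
               r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS, bar_addr,
                                           bar_len, iodev);
               mutex_unlock(&kvm->slots_lock);
 
               srcu_read_unlock(&kvm->srcu, idx);
               return r;
       }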
 
 
 
 
   The source code (kernel, qemu as well as seabios) is available on github:
 
   git://github.com/01org/KVMGT-kernel
   git://github.com/01org/KVMGT-qemu
   git://github.com/01org/KVMGT-seabios
 
   In the KVMGT-qemu repository, there is a README.kvmgt to refer to.
 
 
 
  More information about Intel GVT-g and KVMGT can be found at:
 
 
  https://www.usenix.org/conference/atc14/technical-sessions/presentation/tian
 
 
  http://events.linuxfoundation.org/sites/events/files/slides/KVMGT-a%20Full%20GPU%20Virtualization%20Solution_1.pdf
 
 
 
  Appreciate your comments, BUG reports, and contributions!
 
 
   There is an ever-increasing interest in keeping KVM's in-kernel guest
   interface as small as possible, specifically for security reasons. I'm
   sure there are some good performance reasons to create a new in-kernel
   device model, but I suppose those will need good evidence for why things
   are done the way they finally should be - and not via a user-space
   device model. This is likely not a binary decision (all userspace vs. no
   userspace); it is more about the size and robustness of the in-kernel
   model vs. its performance.

Thanks for explaining the background. We're not against the userspace
model where it applies, but based on our analysis we concluded that the
in-kernel model is the best fit, not just for performance reasons, but also
because of the tight coupling to i915 functionality (scheduling, interrupts,
security, etc.) and hypervisor functionality (GPU shadow page table, etc.),
which are best handled directly in the kernel. We definitely don't want to
split it just for performance reasons, without a functionally clear
separation, because that just creates unnecessary/messy user/kernel
interfaces. And now we've got the i915 community's signal that they're
willing to pick the core code into the i915 driver, which we're currently
working on.

So, without ruling out a user/kernel split, how about we first look at those
in-kernel dm changes for KVM? Then you can help judge whether it's a
reasonable change or whether there's a better option. Jike will summarize
and start the discussion in a separate thread.

 
  One aspect could also be important: Are there hardware improvements in
  sight that will eventually help to reduce the in-kernel 

RE: [Intel-gfx] [ANNOUNCE][RFC] KVMGT - the implementation of Intel GVT-g(full GPU virtualization) for KVM

2014-12-08 Thread Tian, Kevin
Here is some background on this KVMGT release:

- The major purpose is early experimentation with this technique in KVM, and to
surface issues around adding an in-kernel device model (or mediated pass-through
framework) to KVM.

- KVMGT shares ~90% of its code with XenGT as far as the vGPU device model is
concerned. The only difference is the in-kernel dm interface. The vGPU device
model will be split out and integrated into the i915 driver. It will register with
the in-kernel dm framework provided by either Xen or KVM at boot time.
Upstreaming of the vGPU device model is already in progress, with valuable
comments received from the i915 community. However, the refactoring mostly
happens in the XenGT repo for now.

- Now XenGT and KVMGT are maintained separately, and KVMGT lags behind
XenGT in features and quality. You'll likely continue to see stale code
(like the Xen instruction decoder) for some time. In the future we plan to
maintain a single kernel repo for both, so KVMGT can reach the same quality
as XenGT once the KVM in-kernel dm framework is stable.

- Regarding the Qemu hacks, KVMGT really doesn't have any requirements
different from what has been discussed for GPU pass-through, e.g. about the
ISA bridge. Our implementation is based on an old Qemu repo and, honestly
speaking, is not cleanly developed, because we know we can leverage GPU
pass-through support once it's in Qemu. At that point we'll reuse the same
logic with minimal changes to hook the KVMGT mgmt. APIs (e.g. create/destroy
a vGPU instance). So we can ignore this area for now. :-)

Thanks
Kevin

 From: Paolo Bonzini
 Sent: Friday, December 05, 2014 9:04 PM
 
 
 
 On 05/12/2014 09:50, Gerd Hoffmann wrote:
  A few comments on the kernel stuff (brief look so far, also
  compile-tested only, intel gfx on my test machine is too old).
 
   * Noticed the kernel bits don't even compile when configured as
 module.  Everything (vgt, i915, kvm) must be compiled into the
 kernel.
 
 I'll add that the patch is basically impossible to review with all the
 XenGT bits still in.  For example, the x86 emulator seems to be
 unnecessary for KVMGT, but I am not 100% sure.
 
 I would like a clear understanding of why/how Andrew Barnes was able to
 do i915 passthrough (GVT-d) without hacking the ISA bridge, and why this
 does not apply to GVT-g.
 
 Paolo
 
   * Design approach still seems to be i915 on vgt not the other way
 around.
 
  Qemu/SeaBIOS bits:
 
   I've seen the host bridge change identity from i440fx to
   copy-pci-ids-from-host.  Guess the reason for this is that seabios uses
   this device to figure out whether it is running on i440fx or q35.  Correct?
 
  What are the exact requirements for the device?  Must it match the host
  exactly, to not confuse the guest intel graphics driver?  Or would
  something more recent -- such as the q35 emulation qemu has -- be good
  enough to make things work (assuming we add support for the
  graphic-related pci config space registers there)?
 
   The patch also adds a dummy isa bridge at 0x1f.  Similar question here:
  What exactly is needed here?  Would things work if we simply use the q35
  lpc device here?
 
  more to come after I've read the paper linked above ...
 
  cheers,
Gerd
 
 


RE: [Intel-gfx] [ANNOUNCE][RFC] KVMGT - the implementation of Intel GVT-g(full GPU virtualization) for KVM

2014-12-08 Thread Tian, Kevin
 From: Daniel Vetter
 Sent: Monday, December 08, 2014 6:21 PM
 
 On Mon, Dec 08, 2014 at 10:55:01AM +0100, Gerd Hoffmann wrote:
  On Sa, 2014-12-06 at 12:17 +0800, Jike Song wrote:
    I don't know what exactly is needed; we also need to have the Windows
    driver considered.  However, I'm quite confident that, if things are going
    to work for IGD passthrough, they will work for GVT-g.
 
  I'd suggest to focus on q35 emulation.  q35 is new enough that a version
  with integrated graphics exists, so the gap we have to close is *much*
  smaller.
 
  In case guests expect a northbridge matching the chipset generation of
  the graphics device (which I'd expect is the case, after digging a bit
  in the igd and agpgart linux driver code) I think we should add proper
  device emulation for them, i.e. comply q35-pcihost with
  sandybridge-pcihost + ivybridge-pcihost + haswell-pcihost instead of
  just copying over the pci ids from the host.  Most likely all those
  variants can share most of the emulation code.
 
 I don't think i915.ko should care about either northbridge nor pch on
 para-virtualized platforms. We do noodle around in there for the oddball
 memory controller setting and for some display stuff. But neither of that
 really applies to paravirtualized hw. And if there's any case like that we
 should  patch it out (like we do with some of the runtime pm code
 already).

Agreed. Allen is now working on how to avoid that tricky platform
stickiness in the Windows gfx driver. We should do the same in the
Linux driver too.

Thanks
Kevin


RE: [PATCH] KVM: Split up MSI-X assigned device IRQ handler

2011-09-13 Thread Tian, Kevin
 From: Jan Kiszka
 Sent: Tuesday, September 13, 2011 12:58 AM
 
 The threaded IRQ handler for MSI-X has almost nothing in common with the
 INTx/MSI handler. Move its code into a dedicated handler.

If it's desired to go further down this cleanup path, there's also no need to
register an ack notifier for MSI-X type devices, since all the logic there is
simply a nop:

static void kvm_assigned_dev_ack_irq(struct kvm_irq_ack_notifier *kian)
{
	struct kvm_assigned_dev_kernel *dev;

	if (kian->gsi == -1)
		return;

	dev = container_of(kian, struct kvm_assigned_dev_kernel,
			   ack_notifier);

	kvm_set_irq(dev->kvm, dev->irq_source_id, dev->guest_irq, 0);

	/* The guest irq may be shared so this ack may be
	 * from another device.
	 */
	spin_lock(&dev->intx_lock);
	if (dev->host_irq_disabled) {
		enable_irq(dev->host_irq);
		dev->host_irq_disabled = false;
	}
	spin_unlock(&dev->intx_lock);
}
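
For reference, the reason it's a nop for MSI-X is that the ack notifier is
registered with gsi == -1, so the function above returns immediately. Roughly
(quoting from memory of that era's virt/kvm/assigned-dev.c, so treat it as a
sketch rather than the exact code):

static int assigned_device_enable_guest_msix(struct kvm *kvm,
					     struct kvm_assigned_dev_kernel *dev,
					     struct kvm_assigned_irq *irq)
{
	dev->guest_irq = irq->guest_irq;
	/* -1 makes kvm_assigned_dev_ack_irq() above bail out right away */
	dev->ack_notifier.gsi = -1;
	return 0;
}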

Thanks
Kevin

 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---
  virt/kvm/assigned-dev.c |   32 +++-
  1 files changed, 19 insertions(+), 13 deletions(-)
 
 diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
 index 84ead54..7debe8c 100644
 --- a/virt/kvm/assigned-dev.c
 +++ b/virt/kvm/assigned-dev.c
 @@ -58,8 +58,6 @@ static int find_index_from_host_irq(struct kvm_assigned_dev_kernel
  static irqreturn_t kvm_assigned_dev_thread(int irq, void *dev_id)
  {
  	struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
 -	u32 vector;
 -	int index;
 
  	if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_INTX) {
  		spin_lock(&assigned_dev->intx_lock);
 @@ -68,20 +66,28 @@ static irqreturn_t kvm_assigned_dev_thread(int irq, void *dev_id)
  		spin_unlock(&assigned_dev->intx_lock);
  	}
 
 -	if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
 -		index = find_index_from_host_irq(assigned_dev, irq);
 -		if (index >= 0) {
 -			vector = assigned_dev->
 -				guest_msix_entries[index].vector;
 -			kvm_set_irq(assigned_dev->kvm,
 -				    assigned_dev->irq_source_id, vector, 1);
 -		}
 -	} else
 +	kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
 +		    assigned_dev->guest_irq, 1);
 +
 +	return IRQ_HANDLED;
 +}
 +
 +#ifdef __KVM_HAVE_MSIX
 +static irqreturn_t kvm_assigned_dev_thread_msix(int irq, void *dev_id)
 +{
 +	struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
 +	int index = find_index_from_host_irq(assigned_dev, irq);
 +	u32 vector;
 +
 +	if (index >= 0) {
 +		vector = assigned_dev->guest_msix_entries[index].vector;
  		kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
 -			    assigned_dev->guest_irq, 1);
 +			    vector, 1);
 +	}
 
  	return IRQ_HANDLED;
  }
 +#endif
 
  /* Ack the irq line for an assigned device */
  static void kvm_assigned_dev_ack_irq(struct kvm_irq_ack_notifier *kian)
 @@ -279,7 +285,7 @@ static int assigned_device_enable_host_msix(struct kvm *kvm,
 
  	for (i = 0; i < dev->entries_nr; i++) {
  		r = request_threaded_irq(dev->host_msix_entries[i].vector,
 -					 NULL, kvm_assigned_dev_thread,
 +					 NULL, kvm_assigned_dev_thread_msix,
 					 0, dev->irq_name, dev);
  		if (r)
  			goto err;
 --
 1.7.3.4


RE: [PATCH] KVM: Split up MSI-X assigned device IRQ handler

2011-09-13 Thread Tian, Kevin
 From: Jan Kiszka [mailto:jan.kis...@siemens.com]
 Sent: Tuesday, September 13, 2011 3:30 PM
 
 On 2011-09-13 08:40, Tian, Kevin wrote:
  From: Jan Kiszka
  Sent: Tuesday, September 13, 2011 12:58 AM
 
  The threaded IRQ handler for MSI-X has almost nothing in common with the
  INTx/MSI handler. Move its code into a dedicated handler.
 
   if it's desired to go further down this cleanup path, there's also no need to
   register an ack notifier for MSI-X type devices, since all the logic there is
   simply a nop:
 
 You mean
 http://thread.gmane.org/gmane.comp.emulators.kvm.devel/79075? :)
 

yes, and sorry that I didn't catch that thread. :-)

Thanks
Kevin


RE: [PATCH v2] KVM: APIC: avoid instruction emulation for EOI writes

2011-08-30 Thread Tian, Kevin
 From: Avi Kivity [mailto:a...@redhat.com]
 Sent: Tuesday, August 30, 2011 6:57 PM
 
 On 08/30/2011 04:15 AM, Tian, Kevin wrote:
  v2 changes:
  define exit qualification fields for APIC-Access in vmx.h
  use apic_reg_write instead of apic_set_eoi, to avoid breaking tracing
  add fasteoi option to allow disabling this acceleration
 
  
 
  commit 2a66a12cb6928c962d24907e6d39b6eb9ac94b4b
  Author: Kevin Tiankevin.t...@intel.com
  Date:   Mon Aug 29 13:08:28 2011 +0800
 
   KVM: APIC: avoid instruction emulation for EOI writes
 
   Instruction emulation for EOI writes can be skipped, since sane
   guest simply uses MOV instead of string operations. This is a nice
   improvement when guest doesn't support x2apic or hyper-V EOI
   support.
 
   a single VM bandwidth is observed with ~8% bandwidth improvement
    (7.4Gbps->8Gbps), by saving ~5% cycles from EOI emulation.
 
 
 
 Thanks, applied.  Please use git format-patch (and git send-email)
 instead of git show in the future.
 

OK, and sorry for the formatting issue.

Thanks
Kevin


[PATCH] KVM: APIC: avoid instruction emulation for EOI writes

2011-08-29 Thread Tian, Kevin
Hi, Avi,

Here comes the patch:

KVM: APIC: avoid instruction emulation for EOI writes

Instruction emulation for EOI writes can be skipped, since sane
guest simply uses MOV instead of string operations. This is a nice
improvement when guest doesn't support x2apic or hyper-V EOI
support.

a single VM bandwidth is observed with ~8% bandwidth improvement
(7.4Gbps->8Gbps), by saving ~5% cycles from EOI emulation.

Signed-off-by: Kevin Tian kevin.t...@intel.com
Based on earlier work from:
Signed-off-by: Eddie Dong eddie.d...@intel.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 57dcbd4..933187e 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -864,6 +864,15 @@ static int apic_mmio_write(struct kvm_io_device *this,
return 0;
 }
 
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+
+	if (apic)
+		apic_set_eoi(vcpu->arch.apic);
+}
+EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
+
 void kvm_free_lapic(struct kvm_vcpu *vcpu)
 {
	if (!vcpu->arch.apic)
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 52c9e6b..8287243 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -26,6 +26,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu);
 void kvm_lapic_reset(struct kvm_vcpu *vcpu);
 u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
 u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
 void kvm_apic_set_version(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5e8d411..35e4af7 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4540,6 +4540,22 @@ static int handle_xsetbv(struct kvm_vcpu *vcpu)
 
 static int handle_apic_access(struct kvm_vcpu *vcpu)
 {
+   unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+   int access_type, offset;
+
+	access_type = (exit_qualification >> 12) & 0xf;
+	offset = exit_qualification & 0xfff;
+   /*
+* Sane guest uses MOV instead of string operations to
+* write EOI, with written value not cared. So make a
+* short-circuit here by avoiding heavy instruction 
+* emulation.
+*/
+	if ((access_type == 1) && (offset == APIC_EOI)) {
+   kvm_lapic_set_eoi(vcpu);
+   skip_emulated_instruction(vcpu);
+   return 1;
+   }
return emulate_instruction(vcpu, 0) == EMULATE_DONE;
 }

Thanks
Kevin

 -Original Message-
 From: Avi Kivity [mailto:a...@redhat.com]
 Sent: Thursday, August 25, 2011 12:39 PM
 To: Tian, Kevin
 Cc: kvm@vger.kernel.org; Nakajima, Jun
 Subject: Re: about vEOI optimization
 
 On 08/25/2011 05:24 AM, Tian, Kevin wrote:
  
Another option is the hyper-V EOI support, which can also eliminate the
EOI exit when no additional interrupt is pending.  This can improve EOI
performance even more.
  
 
  yes, and this is an orthogonal option.
 
  So if you agree, I'll send out an updated patch atop their work.
 
 
 
 Thanks.
 
 --
 I have a truly marvellous patch that fixes the bug which this
 signature is too narrow to contain.





RE: [PATCH] KVM: emulate lapic tsc deadline timer for hvm

2011-08-29 Thread Tian, Kevin
 From: Marcelo Tosatti
 Sent: Tuesday, August 23, 2011 6:47 PM
 
  +	if (!apic->lapic_timer.tscdeadline)
  +		return;
  +
  +	tsc_target = kvm_x86_ops->
  +		guest_to_host_tsc(apic->lapic_timer.tscdeadline);
  +	rdtscll(tsc_now);
  +	tsc_delta = tsc_target - tsc_now;
 
  This only works if we have a constant tsc, that's not true for large

that type of machine exposes tricky issues to time virtualization in 
general.

  multiboard machines.  Need to do this with irqs disabled as well
  (reading both 'now' and 'tsc_now' in the same critical section).
 
 Should look like this:
 
 local_irq_disable();
  u64 guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
  if (guest_tsc >= tscdeadline)
 hrtimer_start(now);
 else {
   ns = convert_to_ns(guest_tsc - tscdeadline);
   hrtimer_start(now + ns);
 }
 local_irq_enable();

The above is overkill; only the guest TSC delta calculation needs to be protected.

On the other hand, I don't think masking IRQs here adds an obvious benefit.
Virtualization doesn't ensure micro-level time accuracy, since many events can
delay virtual timer interrupt injection even when the timer emulation logic
triggers it on time. That's the case even on bare metal, though with a smaller
scope, and thus most guests are able to handle such situations. Masking IRQs
in non-critical regions simply slows down the system's response to other events.

 
 Note the vcpus tsc can have different frequency than the hosts, so
 vcpu_tsc_khz() should be used to convert to nanoseconds, not tsc_khz.
 

yes, that's a good suggestion.

btw, Jinsong is on vacation this week. He will update the patch according
to your comments when he is back.

Thanks
Kevin


RE: [PATCH] KVM: APIC: avoid instruction emulation for EOI writes

2011-08-29 Thread Tian, Kevin
 From: Avi Kivity [mailto:a...@redhat.com]
 Sent: Monday, August 29, 2011 3:24 PM
 
 On 08/29/2011 09:09 AM, Tian, Kevin wrote:
 
static int handle_apic_access(struct kvm_vcpu *vcpu)
{
  +   unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
  +   int access_type, offset;
  +
  +	access_type = (exit_qualification >> 12) & 0xf;
  +	offset = exit_qualification & 0xfff;
  +   /*
  +* Sane guest uses MOV instead of string operations to
  +* write EOI, with written value not cared. So make a
  +* short-circuit here by avoiding heavy instruction
  +* emulation.
  +*/
  +	if ((access_type == 1) && (offset == APIC_EOI)) {
 
 Please replace this 1 with a #define.
 

updated.

--

KVM: APIC: avoid instruction emulation for EOI writes

Instruction emulation for EOI writes can be skipped, since sane
guest simply uses MOV instead of string operations. This is a nice
improvement when guest doesn't support x2apic or hyper-V EOI
support.

a single VM bandwidth is observed with ~8% bandwidth improvement
(7.4Gbps->8Gbps), by saving ~5% cycles from EOI emulation.

Signed-off-by: Kevin Tian kevin.t...@intel.com
Based on earlier work from:
Signed-off-by: Eddie Dong eddie.d...@intel.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 57dcbd4..933187e 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -864,6 +864,15 @@ static int apic_mmio_write(struct kvm_io_device *this,
return 0;
 }
 
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+
+	if (apic)
+		apic_set_eoi(vcpu->arch.apic);
+}
+EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
+
 void kvm_free_lapic(struct kvm_vcpu *vcpu)
 {
	if (!vcpu->arch.apic)
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 52c9e6b..8287243 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -26,6 +26,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu);
 void kvm_lapic_reset(struct kvm_vcpu *vcpu);
 u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
 u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
 void kvm_apic_set_version(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5e8d411..34c094f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4538,8 +4538,25 @@ static int handle_xsetbv(struct kvm_vcpu *vcpu)
return 1;
 }
 
+#define APIC_ACCESS_TYPE_W 1 /* Linear write access during inst execution */
 static int handle_apic_access(struct kvm_vcpu *vcpu)
 {
+   unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+   int access_type, offset;
+
+	access_type = (exit_qualification >> 12) & 0xf;
+	offset = exit_qualification & 0xfff;
+	/*
+	 * Sane guest uses MOV instead of string operations to
+	 * write EOI, with written value not cared. So make a
+	 * short-circuit here by avoiding heavy instruction
+	 * emulation.
+	 */
+	if ((access_type == APIC_ACCESS_TYPE_W) && (offset == APIC_EOI)) {
+   kvm_lapic_set_eoi(vcpu);
+   skip_emulated_instruction(vcpu);
+   return 1;
+   }
return emulate_instruction(vcpu, 0) == EMULATE_DONE;
 }

Thanks
Kevin




[PATCH v2] KVM: APIC: avoid instruction emulation for EOI writes

2011-08-29 Thread Tian, Kevin
v2 changes:
define exit qualification fields for APIC-Access in vmx.h
use apic_reg_write instead of apic_set_eoi, to avoid breaking tracing
add fasteoi option to allow disabling this acceleration



commit 2a66a12cb6928c962d24907e6d39b6eb9ac94b4b
Author: Kevin Tian kevin.t...@intel.com
Date:   Mon Aug 29 13:08:28 2011 +0800

KVM: APIC: avoid instruction emulation for EOI writes

Instruction emulation for EOI writes can be skipped, since sane
guest simply uses MOV instead of string operations. This is a nice
improvement when guest doesn't support x2apic or hyper-V EOI
support.

a single VM bandwidth is observed with ~8% bandwidth improvement
(7.4Gbps->8Gbps), by saving ~5% cycles from EOI emulation.

Signed-off-by: Kevin Tian kevin.t...@intel.com
Based on earlier work from:
Signed-off-by: Eddie Dong eddie.d...@intel.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 2caf290..31f180c 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -350,6 +350,18 @@ enum vmcs_field {
 #define DEBUG_REG_ACCESS_REG(eq)	(((eq) >> 8) & 0xf) /* 11:8, general purpose reg. */
 
 
+/*
+ * Exit Qualifications for APIC-Access
+ */
+#define APIC_ACCESS_OFFSET	0xfff	/* 11:0, offset within the APIC page */
+#define APIC_ACCESS_TYPE	0xf000	/* 15:12, access type */
+#define TYPE_LINEAR_APIC_INST_READ	(0 << 12)
+#define TYPE_LINEAR_APIC_INST_WRITE	(1 << 12)
+#define TYPE_LINEAR_APIC_INST_FETCH	(2 << 12)
+#define TYPE_LINEAR_APIC_EVENT		(3 << 12)
+#define TYPE_PHYSICAL_APIC_EVENT	(10 << 12)
+#define TYPE_PHYSICAL_APIC_INST		(15 << 12)
+
 /* segment AR */
 #define SEGMENT_AR_L_MASK (1 << 13)
 
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 57dcbd4..52645f2 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -864,6 +864,15 @@ static int apic_mmio_write(struct kvm_io_device *this,
return 0;
 }
 
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+
+	if (apic)
+		apic_reg_write(vcpu->arch.apic, APIC_EOI, 0);
+}
+EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
+
 void kvm_free_lapic(struct kvm_vcpu *vcpu)
 {
	if (!vcpu->arch.apic)
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 52c9e6b..8287243 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -26,6 +26,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu);
 void kvm_lapic_reset(struct kvm_vcpu *vcpu);
 u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
 u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
 void kvm_apic_set_version(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5e8d411..9d0364b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -71,6 +71,9 @@ module_param(vmm_exclusive, bool, S_IRUGO);
 static int __read_mostly yield_on_hlt = 1;
 module_param(yield_on_hlt, bool, S_IRUGO);
 
+static int __read_mostly fasteoi = 1;
+module_param(fasteoi, bool, S_IRUGO);
+
 /*
  * If nested=1, nested virtualization is supported, i.e., guests may use
  * VMX and be a hypervisor for its own guests. If nested=0, guests may not
@@ -4540,6 +4543,24 @@ static int handle_xsetbv(struct kvm_vcpu *vcpu)
 
 static int handle_apic_access(struct kvm_vcpu *vcpu)
 {
+	if (likely(fasteoi)) {
+		unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+		int access_type, offset;
+
+		access_type = exit_qualification & APIC_ACCESS_TYPE;
+		offset = exit_qualification & APIC_ACCESS_OFFSET;
+		/*
+		 * Sane guest uses MOV to write EOI, with written value
+		 * not cared. So make a short-circuit here by avoiding
+		 * heavy instruction emulation.
+		 */
+		if ((access_type == TYPE_LINEAR_APIC_INST_WRITE) &&
+		    (offset == APIC_EOI)) {
+   kvm_lapic_set_eoi(vcpu);
+   skip_emulated_instruction(vcpu);
+   return 1;
+   }
+   }
return emulate_instruction(vcpu, 0) == EMULATE_DONE;
 }


Thanks
Kevin




RE: about vEOI optimization

2011-08-24 Thread Tian, Kevin
 From: Avi Kivity [mailto:a...@redhat.com]
 Sent: Wednesday, August 24, 2011 6:00 PM
 
 On 08/23/2011 11:09 AM, Tian, Kevin wrote:
  Hi, Avi,
 
  Both Eddie and Marcello once suggested vEOI optimization by skipping
  heavy-weight instruction decode, which reduces vEOI overhead greatly:
 
  http://www.mail-archive.com/kvm@vger.kernel.org/msg18619.html
  http://www.spinics.net/lists/kvm/msg36691.html
 
   Though virtual x2apic serves a similar purpose, it depends on the guest OS
   supporting x2apic. Many Windows versions don't support x2apic though,
   including Win7, Windows Server before 2008 R2, etc. Given that
   virtualization needs to support various OS versions, any chance to
   incorporate the above vEOI optimization in KVM as an alternative to boost
   performance when the guest doesn't support x2apic?
 
 
 Yes.  There was a problem with the guest using MOVSD or STOSD to write
 the EOI; if we don't emulate, then registers don't get updated.  I guess
 we can ignore it since no sane guest will use those instructions for EOI.

yes, sane guests all use MOV for EOI. btw, Xen has integrated a similar
acceleration for several releases. When we measure 10G SR-IOV network
performance, vEOI access overhead can be up to 6%-8% depending on the
interrupt rate, which is one factor causing KVM to lag behind.

 
 Another option is the hyper-V EOI support, which can also eliminate the
 EOI exit when no additional interrupt is pending.  This can improve EOI
 performance even more.
 

yes, and this is an orthogonal option.

So if you agree, I'll send out an updated patch atop their work.

Thanks
Kevin


about vEOI optimization

2011-08-23 Thread Tian, Kevin
Hi, Avi,

Both Eddie and Marcello once suggested vEOI optimization by skipping
heavy-weight instruction decode, which reduces vEOI overhead greatly:

http://www.mail-archive.com/kvm@vger.kernel.org/msg18619.html 
http://www.spinics.net/lists/kvm/msg36691.html

Though virtual x2apic serves a similar purpose, it depends on the guest OS
supporting x2apic. Many Windows versions don't support x2apic though,
including Win7, Windows Server before 2008 R2, etc. Given that virtualization
needs to support various OS versions, any chance to incorporate the above vEOI
optimization in KVM as an alternative to boost performance when the guest
doesn't support x2apic?

Thanks
Kevin


RE: large amount of NMI_INTERRUPT disgrade winxp VM performance much.

2011-08-10 Thread Tian, Kevin
 From: ya su
 Sent: Thursday, August 11, 2011 11:57 AM
 
  When I run winxp guest on one server, copy one file about 4G will
 take a time of 40-50 min; if I run a FC14 guest, it will take about
 2-3 min;
 
  I copy and run the winxp image on another server, it works well, take
 about 3min.
 
  I run trace-cmd while copying files, the main difference of  the two
 outputs is that: the slow one's output have many NMI_INTERRUPT
 vm_exit, while the fast output has no such vm_exit. both of the two
 servers have NMI enabled default. the slow one's output as the
 following:
  qemu-system-x86-4454  [004]   549.958147: kvm_entry:
 vcpu 0
  qemu-system-x86-4454  [004]   549.958172: kvm_exit:
 reason EXCEPTION_NMI rip 0x8051d5e1
  qemu-system-x86-4454  [004]   549.958172: kvm_page_fault:
 address c8f8a000 error_code b
  qemu-system-x86-4454  [004]   549.958177: kvm_entry:
 vcpu 0
  qemu-system-x86-4454  [004]   549.958202: kvm_exit:
 reason EXCEPTION_NMI rip 0x8051d5e1
  qemu-system-x86-4454  [004]   549.958204: kvm_page_fault:
 address c8f8b000 error_code b
  qemu-system-x86-4454  [004]   549.958209: kvm_entry:
 vcpu 0
  qemu-system-x86-4454  [004]   549.958234: kvm_exit:
 reason EXCEPTION_NMI rip 0x8051d5e1
  qemu-system-x86-4454  [004]   549.958234: kvm_page_fault:
 address c8f8c000 error_code b
  qemu-system-x86-4454  [004]   549.958239: kvm_entry:
 vcpu 0
  qemu-system-x86-4454  [004]   549.958264: kvm_exit:
 reason EXCEPTION_NMI rip 0x8051d5e1
  qemu-system-x86-4454  [004]   549.958264: kvm_page_fault:
 address c8f8d000 error_code b
  qemu-system-x86-4454  [004]   549.958267: kvm_entry:
 vcpu 0
  qemu-system-x86-4454  [004]   549.958292: kvm_exit:
 reason EXCEPTION_NMI rip 0x8051d5e1
  qemu-system-x86-4454  [004]   549.958294: kvm_page_fault:
 address c8f8e000 error_code b
  qemu-system-x86-4454  [004]   549.958299: kvm_entry:
 vcpu 0
  qemu-system-x86-4454  [004]   549.958324: kvm_exit:
 reason EXCEPTION_NMI rip 0x8051d5e1
  qemu-system-x86-4454  [004]   549.958324: kvm_page_fault:
 address c8f8f000 error_code b
  qemu-system-x86-4454  [004]   549.958329: kvm_entry:
 vcpu 0
  qemu-system-x86-4454  [004]   549.958447: kvm_exit:
 reason EXTERNAL_INTERRUPT rip 0x80547ac8
  qemu-system-x86-4454  [004]   549.958450: kvm_entry:
 vcpu 0
  qemu-system-x86-4454  [004]   549.958461: kvm_exit:
 reason CR_ACCESS rip 0x8054428c
  qemu-system-x86-4454  [004]   549.958461: kvm_cr:
 cr_write 0 = 0x80010031
  qemu-system-x86-4454  [004]   549.958541: kvm_entry:
 vcpu 0
  qemu-system-x86-4454  [004]   549.958573: kvm_exit:
 reason CR_ACCESS rip 0x80546beb
  qemu-system-x86-4454  [004]   549.958575: kvm_cr:
 cr_write 0 = 0x8001003b
  qemu-system-x86-4454  [004]   549.958585: kvm_entry:
 vcpu 0
  qemu-system-x86-4454  [004]   549.958610: kvm_exit:
 reason CR_ACCESS rip 0x80546b6c
  qemu-system-x86-4454  [004]   549.958610: kvm_cr:
 cr_write 3 = 0x6e00020
  qemu-system-x86-4454  [004]   549.958621: kvm_entry:
 vcpu 0
  qemu-system-x86-4454  [004]   549.958645: kvm_exit:
 reason EXCEPTION_NMI rip 0x8051d7f4
  qemu-system-x86-4454  [004]   549.958645: kvm_page_fault:
 address c0648200 error_code 3
  qemu-system-x86-4454  [004]   549.958653: kvm_entry:
 vcpu 0
  qemu-system-x86-4454  [004]   549.958725: kvm_exit:
 reason EXCEPTION_NMI rip 0x8050a26a
  qemu-system-x86-4454  [004]   549.958726: kvm_page_fault:
 address c0796994 error_code 3
  qemu-system-x86-4454  [004]   549.958738: kvm_entry:
 vcpu 0
  qemu-system-x86-4454  [004]   549.958750: kvm_exit:
 reason IO_INSTRUCTION rip 0x806edad0
  qemu-system-x86-4454  [004]   549.958750: kvm_pio:
 pio_write at 0xc050 size 2 count 1
  qemu-system-x86-4454  [004]   549.958838: kvm_entry:
 vcpu 0
  qemu-system-x86-4454  [004]   549.958844: kvm_exit:
 reason APIC_ACCESS rip 0x806e7b85
  qemu-system-x86-4454  [004]   549.958852: kvm_apic:
 apic_read APIC_ICR = 0x40041
  qemu-system-x86-4454  [004]   549.958855: kvm_mmio:
 mmio
 read len 4 gpa 0xfee00300 val 0x40041
  qemu-system-x86-4454  [004]   549.958857: kvm_mmio:
 mmio
 write len 4 gpa 0xfee00300 val 0x40041
  qemu-system-x86-4454  [004]   549.958858: kvm_apic:
 apic_write APIC_ICR = 0x40041
  qemu-system-x86-4454  [004]   549.958860: kvm_apic_ipi: dst 1
 vec 65 (Fixed|physical|de-assert|edge|self)
  qemu-system-x86-4454  [004]   549.958860: kvm_apic_accept_irq:
 apicid 0 vec 65 (Fixed|edge)
 
  Even I disable the NMI when I boot the kernel with nmi_watchdog=0,
 the trace-cmd output still show there are many NMI_INTERRUPT. I find
 that in /proc/interrupts, the amount of NMI is 0. Does this mean that
 NMI is produced in winxp guest OS, or this setting can not hinder kvm
 to catch NMI interrupt?
 
   I think the difference between FC14 and winxp is that: fc14 process
 the NMI interrupt correctly, but winxp can not, is this right?
 
   I run qemu-kvm with version 0.14.0, kernel version 2.6.32-131.6.4. I
 change kvm-kmod to 2.6.32-27, it produce the same result.
 
   Any suggestions? thanks.
 

Here  reason EXCEPTION_NMI implicates 

RE: kvm upstream build error..

2011-07-13 Thread Tian, Kevin
It works on my side; the difference comes from the kernel config. It is caused
by the recent steal time feature.

int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
case MSR_KVM_STEAL_TIME:

if (unlikely(!sched_info_on()))
return 1;

static inline int sched_info_on(void)
{
#ifdef CONFIG_SCHEDSTATS
return 1;
#elif defined(CONFIG_TASK_DELAY_ACCT)
extern int delayacct_on;
return delayacct_on;
#else
return 0;
#endif
}

I have CONFIG_SCHEDSTATS enabled, while your config has CONFIG_SCHEDSTATS
as 'n' and CONFIG_TASK_DELAY_ACCT as 'y'. However, delayacct_on is not a
symbol exported to modules.
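
One obvious way out is to export the symbol (illustrative only; not
necessarily the fix that will land upstream):

/* kernel/delayacct.c */
int delayacct_on __read_mostly = 1;	/* Delay accounting turned on/off */
EXPORT_SYMBOL_GPL(delayacct_on);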

Thanks
Kevin

 From: Ren, Yongjie
 Sent: Wednesday, July 13, 2011 10:05 AM
 
 Hi Avi,
 KVM upstream have a build error in our build system. Do you have some
 comments on this issue? my kvm commit is
 0af9df4e80a3a620c52c3a9777191c54e615d068
 build error info:
   LD  arch/x86/boot/setup.elf
   OBJCOPY arch/x86/boot/setup.bin
   BUILD   arch/x86/boot/bzImage
 Root device is (8, 1)
 Setup is 15672 bytes (padded to 15872 bytes).
 System is 2556 kB
 CRC c5b53854
 Kernel: arch/x86/boot/bzImage is ready  (#1)
 ERROR: delayacct_on [arch/x86/kvm/kvm.ko] undefined!
 make[3]: *** [__modpost] Error 1
 make[2]: *** [modules] Error 2
 error: Bad exit status from /var/tmp/rpm-tmp.81681 (%build)
 
 Best Regards,
  Yongjie Ren  (Jay)


RE: [PATCH 2/2] vmx,svm: Print errors if SVM or VMX were already set

2011-07-05 Thread Tian, Kevin
 From: Sasha Levin
 Sent: Tuesday, July 05, 2011 7:09 AM
 
 Instead of exiting quietly, print an error if the VMX or the SVM bits
 were already set when loading the module.
 
 Having VMX/SVM bits set means that either there is someone else doing
 hardware virtualization, or that the BIOS is buggy and sets it on
 by default.
 
 Cc: Avi Kivity a...@redhat.com
 Cc: Marcelo Tosatti mtosa...@redhat.com
 Signed-off-by: Sasha Levin levinsasha...@gmail.com
 ---
  arch/x86/kvm/svm.c |5 -
  arch/x86/kvm/vmx.c |4 +++-
  2 files changed, 7 insertions(+), 2 deletions(-)
 
 diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
 index 5ca76e3..2a1df2e 100644
 --- a/arch/x86/kvm/svm.c
 +++ b/arch/x86/kvm/svm.c
 @@ -590,8 +590,11 @@ static int svm_hardware_enable(void *garbage)
   int me = raw_smp_processor_id();
 
   rdmsrl(MSR_EFER, efer);
 -	if (check_inuse && (efer & EFER_SVME))
 +	if (check_inuse && (efer & EFER_SVME)) {
 +		printk(KERN_ERR "svm_hardware_enable: SVM already set on %d\n",
 +		       me);
 		return -EBUSY;
 +	}
 
 	if (!has_svm()) {
 		printk(KERN_ERR "svm_hardware_enable: err EOPNOTSUPP on %d\n",
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 3046b07..df69b1d 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -2233,8 +2233,10 @@ static int hardware_enable(void *garbage)
   u64 phys_addr = __pa(per_cpu(vmxarea, cpu));
   u64 old, test_bits;
 
 -	if (check_inuse && (read_cr4() & X86_CR4_VMXE))
 +	if (check_inuse && (read_cr4() & X86_CR4_VMXE)) {
 +		printk(KERN_ERR "hardware_enable: VMX already set\n");
 		return -EBUSY;
 +	}
 

Make the error message consistent between vmx and svm, i.e. add the CPU id
for vmx too.

Thanks
Kevin


RE: [PATCH 2/2] vmx,svm: Print errors if SVM or VMX were already set

2011-07-05 Thread Tian, Kevin
 From: Alexander Graf
 Sent: Tuesday, July 05, 2011 8:42 AM
 
 
 On 05.07.2011, at 01:09, Sasha Levin wrote:
 
  Instead of exiting quietly, print an error if the VMX or the SVM bits
  were already set when loading the module.
 
  Having VMX/SVM bits set means that either there is someone else doing
  hardware virtualization, or that the BIOS is buggy and sets it on
  by default.
 
  Cc: Avi Kivity a...@redhat.com
  Cc: Marcelo Tosatti mtosa...@redhat.com
  Signed-off-by: Sasha Levin levinsasha...@gmail.com
  ---
  arch/x86/kvm/svm.c |5 -
  arch/x86/kvm/vmx.c |4 +++-
  2 files changed, 7 insertions(+), 2 deletions(-)
 
  diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
  index 5ca76e3..2a1df2e 100644
  --- a/arch/x86/kvm/svm.c
  +++ b/arch/x86/kvm/svm.c
  @@ -590,8 +590,11 @@ static int svm_hardware_enable(void *garbage)
  int me = raw_smp_processor_id();
 
  rdmsrl(MSR_EFER, efer);
 -	if (check_inuse && (efer & EFER_SVME))
 +	if (check_inuse && (efer & EFER_SVME)) {
 +		printk(KERN_ERR "svm_hardware_enable: SVM already set on %d\n",
 
 CPU%d
 
 Also I'd rephrase it as already in use on. Otherwise looks good :)
 

A more elaborate message sounds better, like the explanation of the possible
cause given in the commit message. Also, advertising the check_inuse option
in the message would be a good thing. :-)

Thanks
Kevin


RE: [Patch v5 1/4] Remove SMEP bit from CR4_RESERVED_BITS

2011-06-01 Thread Tian, Kevin
 From: Ingo Molnar
 Sent: Monday, May 30, 2011 3:41 PM
 
 
 * Yang, Wei Y wei.y.y...@intel.com wrote:
 
  This patch removes SMEP bit from CR4_RESERVED_BITS.
 
 I'm wondering, what is the best-practice way for tools/kvm/ to set
 SMEP for the guest kernel automatically, even if the guest kernel
 itsef has not requested SMEP?
 

Enabling SMEP without the guest's knowledge can be problematic if the guest
does a U/S 0->1 bit change without TLB invalidation, which is a required
action to ensure SMEP protection works correctly. Linux versions known so far
don't have this behavior, because the TLB invalidation done for a P bit change
covers the U/S 0->1 change. But given that end users may deploy various OSes
within the guest, enabling SMEP this way requires a solid understanding of the
internals of those OSes. Otherwise it's uncertain whether SMEP protection
fully works on such guests.
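
To illustrate the guest sequence that breaks transparently enabled SMEP
(lookup_ptep() is a hypothetical helper; this is only an illustration, not code
from any patch):

static void make_page_user_visible(struct mm_struct *mm, unsigned long addr)
{
	pte_t *ptep = lookup_ptep(mm, addr);

	/* U/S 0 -> 1: the page becomes user accessible. */
	set_pte(ptep, __pte(pte_val(*ptep) | _PAGE_USER));

	/*
	 * No invlpg/TLB flush here: a stale supervisor translation may stay
	 * cached, so kernel fetches from this page keep working and the SMEP
	 * protection silently enabled by the host is bypassed.
	 */
}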

Thanks
Kevin


RE: [Patch v5 0/4] Enable SMEP feature support for kvm

2011-05-30 Thread Tian, Kevin
 From: Avi Kivity
 Sent: Monday, May 30, 2011 4:52 PM
 
 On 05/30/2011 06:01 AM, Yang, Wei Y wrote:
  This patchset enables a new CPU feature SMEP (Supervisor Mode Execution
  Protection) in KVM. SMEP prevents kernel from executing code in application.
  Updated Intel SDM describes this CPU feature. The document will be
  published soon.
 
  This patchset is based on Fenghua's SMEP patch series, as referred by:
  https://lkml.org/lkml/2011/5/17/523
 
 Looks good.  I'll post the cr0.wp=0 fixup soon.
 

what's your planned fix? through the NX bit? :-)

btw, why is the current scheme used to emulate the cr0.wp=0 case instead of
simply emulating the instruction?

Thanks
Kevin


RE: [Patch v5 0/4] Enable SMEP feature support for kvm

2011-05-30 Thread Tian, Kevin
 From: Avi Kivity [mailto:a...@redhat.com]
 Sent: Monday, May 30, 2011 5:14 PM
 
 On 05/30/2011 12:08 PM, Tian, Kevin wrote:
From: Avi Kivity
Sent: Monday, May 30, 2011 4:52 PM
  
On 05/30/2011 06:01 AM, Yang, Wei Y wrote:
  This patchset enables a new CPU feature SMEP (Supervisor Mode
 Execution
  Protection) in KVM. SMEP prevents kernel from executing code in
 application.
  Updated Intel SDM describes this CPU feature. The document will be
  published soon.

  This patchset is based on Fenghua's SMEP patch series, as referred
 by:
  https://lkml.org/lkml/2011/5/17/523
  
Looks good.  I'll post the cr0.wp=0 fixup soon.
  
 
  what's your planned fix? through NX bit? :-)
 
 Yes.
 
  btw, why is current scheme used to emulate cr0.wp=0 case instead of simply
  emulating it?
 
 How would you simply emulate it?
 
 We have to force cr0.wp=1, otherwise we cannot write-protect guest page
 tables.  Once we do that, we have to set U=1 to allow user reads or U=0
 to allow kernel writes.
 

I mean using instruction emulation instead of changing permissions and
re-executing the faulting instruction. Or is the current KVM instruction
emulator not complete enough to handle various memory access instructions
(i.e. designed just for page table access and real-mode instructions)?

Thanks
Kevin


RE: [Patch v5 0/4] Enable SMEP feature support for kvm

2011-05-30 Thread Tian, Kevin
 From: Avi Kivity [mailto:a...@redhat.com]
 Sent: Monday, May 30, 2011 6:00 PM
 
 On 05/30/2011 12:18 PM, Tian, Kevin wrote:
From: Avi Kivity [mailto:a...@redhat.com]
Sent: Monday, May 30, 2011 5:14 PM
  
On 05/30/2011 12:08 PM, Tian, Kevin wrote:
 From: Avi Kivity
 Sent: Monday, May 30, 2011 4:52 PM
  
 On 05/30/2011 06:01 AM, Yang, Wei Y wrote:
This patchset enables a new CPU feature SMEP (Supervisor
 Mode
Execution
Protection) in KVM. SMEP prevents kernel from executing
 code in
application.
Updated Intel SDM describes this CPU feature. The
 document will be
published soon.
 
This patchset is based on Fenghua's SMEP patch series, as
 referred
by:
https://lkml.org/lkml/2011/5/17/523
  
 Looks good.  I'll post the cr0.wp=0 fixup soon.
  

  what's your planned fix? through NX bit? :-)
  
Yes.
  
  btw, why is current scheme used to emulate cr0.wp=0 case instead of
 simply
  emulating it?
  
How would you simply emulate it?
  
We have to force cr0.wp=1, otherwise we cannot write-protect guest
 page
tables.  Once we do that, we have to set U=1 to allow user reads or U=0
to allow kernel writes.
  
 
  I mean using instruction emulation instead of changing permission to
 re-execute
  faulting instruction. Or is current KVM instruction emulator not complete
 enough
  to handle various memory access instructions (just designed for page table
 access
  and real mode instructions?)?
 
 I think by now it's complete enough (it wasn't when the shadow mmu was
 written).  But emulation will be slow if the guest writes a lot of data
 to the page.

OK, got it.

Thanks
Kevin


RE: [Patch v3] Enable CPU SMEP feature for KVM

2011-05-26 Thread Tian, Kevin
 From: Yang, Wei Y
 Sent: Thursday, May 26, 2011 9:29 PM
 
 This patchset enables a new CPU feature SMEP (Supervisor Mode Execution
 Protection) in KVM. SMEP prevents kernel from executing code in application.
 Updated Intel SDM describes this CPU feature. The document will be published
 soon.
 
 This patchset is based on Fenghua's SMEP patch series, as referred by:
 https://lkml.org/lkml/2011/5/17/523
 
 Changes since v2: enable SMEP for spt mode.

Another change in this version is to avoid adding SMEP to cr4_guest_owned_bits,
because architecturally a TLB flush is required when CR4.SMEP is changed, and
that has to be emulated.

Also, based on your comment, SMEP is now permitted based on the cpuid setting.
One pending issue, though, is check_cr_write in the emulation path. We changed
cr4_reserved_bits to be vcpu-specific instead of global, but check_cr_write
doesn't take a vcpu parameter. It looks like the whole emulation logic avoids
using vcpu context in order to stay neutral. Avi, do you have any suggestion
for a clean change here? Add a cr_reserved_bits array to the emulation context
instead of hard-coding it? (A rough sketch of that option follows.)
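
All names below are hypothetical, not the actual emulator code; it only shows
the shape of carrying the per-VCPU mask in the emulation context:

struct x86_emulate_ctxt_sketch {
	/* ... existing emulator state ... */
	unsigned long cr4_reserved_bits;  /* copied from vcpu->arch.cr4_reserved_bits */
};

static int check_cr4_write_sketch(struct x86_emulate_ctxt_sketch *ctxt,
				  unsigned long new_cr4)
{
	if (new_cr4 & ctxt->cr4_reserved_bits)
		return 1;	/* would raise #GP in the real emulator */
	return 0;
}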

Thanks
Kevin

 
  Signed-off-by: Yang Wei wei.y.y...@intel.com
  Signed-off-by: Shan Haitao haitao.s...@intel.com
 
 ---
  arch/x86/include/asm/kvm_host.h |1 +
  arch/x86/kvm/paging_tmpl.h  |   15 +--
  arch/x86/kvm/vmx.c  |9 +
  arch/x86/kvm/x86.c  |7 +--
  4 files changed, 28 insertions(+), 4 deletions(-)
 
 diff --git a/arch/x86/include/asm/kvm_host.h
 b/arch/x86/include/asm/kvm_host.h
 index d2ac8e2..154287b 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -306,6 +306,7 @@ struct kvm_vcpu_arch {
   unsigned long cr3;
   unsigned long cr4;
   unsigned long cr4_guest_owned_bits;
 + unsigned long cr4_reserved_bits;
   unsigned long cr8;
   u32 hflags;
   u64 efer;
 diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
 index 6c4dc01..7e0b2f8 100644
 --- a/arch/x86/kvm/paging_tmpl.h
 +++ b/arch/x86/kvm/paging_tmpl.h
 @@ -120,7 +120,7 @@ static int FNAME(walk_addr_generic)(struct
 guest_walker *walker,
   struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
   gva_t addr, u32 access)
  {
 - pt_element_t pte;
 + pt_element_t pte, pte_smep;
   pt_element_t __user *ptep_user;
   gfn_t table_gfn;
   unsigned index, pt_access, uninitialized_var(pte_access);
 @@ -150,7 +150,11 @@ walk:
   }
   --walker-level;
   }
 + pte_smep = ~0ULL;
 +#else
 + pte_smep = ~0U;
  #endif
 +
 	ASSERT((!is_long_mode(vcpu) && is_pae(vcpu)) ||
 	       (mmu->get_cr3(vcpu) & CR3_NONPAE_RESERVED_BITS) == 0);
 
 @@ -234,6 +238,8 @@ walk:
 
 	walker->ptes[walker->level - 1] = pte;
 
+	pte_smep &= pte;
+
 	if ((walker->level == PT_PAGE_TABLE_LEVEL) ||
 	    ((walker->level == PT_DIRECTORY_LEVEL) &&
 	    is_large_pte(pte) &&
@@ -246,6 +252,11 @@ walk:
 		gfn_t gfn;
 		u32 ac;
 
+		if (unlikely(fetch_fault && !user_fault))
+			if ((vcpu->arch.cr4 & X86_CR4_SMEP) &&
+			    (pte_smep & PT_USER_MASK))
+				eperm = true;
+
 		gfn = gpte_to_gfn_lvl(pte, lvl);
 		gfn += (addr & PT_LVL_OFFSET_MASK(lvl)) >> PAGE_SHIFT;
 
 @@ -305,7 +316,7 @@ error:
 
 	walker->fault.error_code |= write_fault | user_fault;
 
-	if (fetch_fault && mmu->nx)
+	if (fetch_fault && (mmu->nx || (vcpu->arch.cr4 & X86_CR4_SMEP)))
 		walker->fault.error_code |= PFERR_FETCH_MASK;
 	if (rsvd_fault)
 		walker->fault.error_code |= PFERR_RSVD_MASK;
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 4c3fa0f..7ad24fd 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -4507,6 +4507,15 @@ static void vmx_cpuid_update(struct kvm_vcpu
 *vcpu)
   }
   }
   }
 +
 + best = kvm_find_cpuid_entry(vcpu, 7, 0);
+	if (best && (best->ebx & bit(X86_FEATURE_SMEP))) {
+		if (boot_cpu_has(X86_FEATURE_SMEP))
+			vcpu->arch.cr4_reserved_bits &=
+				~((unsigned long)X86_CR4_SMEP);
+		else
+			best->ebx &= ~(bit(X86_FEATURE_SMEP));
 + }
  }
 
  static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2
 *entry)
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 77c9d86..6ead39e 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -598,9 +598,10 @@ static void update_cpuid(struct kvm_vcpu *vcpu)
  int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
  {
   unsigned long old_cr4 = kvm_read_cr4(vcpu);
 - unsigned long pdptr_bits = X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE;
 + unsigned 

RE: [PATCH 21/31] nVMX: vmcs12 checks on nested entry

2011-05-25 Thread Tian, Kevin
 From: Nadav Har'El [mailto:n...@math.technion.ac.il]
 Sent: Wednesday, May 25, 2011 1:38 PM
 
 On Wed, May 25, 2011, Tian, Kevin wrote about RE: [PATCH 21/31] nVMX:
 vmcs12 checks on nested entry:
  +	if (vmcs12->launch_state == launch) {
   + nested_vmx_failValid(vcpu,
   + launch ? VMXERR_VMLAUNCH_NONCLEAR_VMCS
   +: VMXERR_VMRESUME_NONLAUNCHED_VMCS);
   + return 1;
   + }
 
  from SDM:
  ELSIF (VMLAUNCH and launch state of current VMCS is not clear)
  THEN VMfailValid(VMLAUNCH with non-clear VMCS);
  ELSIF (VMRESUME and launch state of current VMCS is not launched)
  THEN VMfailValid(VMRESUME with non-launched VMCS);
 
   So it's legal to use VMLAUNCH on a launched VMCS. However here you
   change this behavior. On the other hand, do you want to add a 'clear' state
  along with L1 VMCLEAR to catch the failure here?
 
 I don't understand: I always understood the spec to mean that clear and
 launched the two opposite states of the launch state bit? If it isn't,
 what does clear mean?
 
 Is it really legal to use a VMLAUNCH on a launched VMCS?
 If it is, why does KVM, for example, go to great lengths to VMLAUNCH the
 first time, and VMRESUME all subsequent times?
 

You're correct. I got my head messed up on this point. :-)

Thanks
Kevin


RE: [PATCH 22/31] nVMX: Deciding if L0 or L1 should handle an L2 exit

2011-05-25 Thread Tian, Kevin
 From: Nadav Har'El
 Sent: Tuesday, May 17, 2011 3:55 AM
 
 This patch contains the logic of whether an L2 exit should be handled by L0
 and then L2 should be resumed, or whether L1 should be run to handle this
 exit (using the nested_vmx_vmexit() function of the previous patch).
 
 The basic idea is to let L1 handle the exit only if it actually asked to
 trap this sort of event. For example, when L2 exits on a change to CR0,
 we check L1's CR0_GUEST_HOST_MASK to see if L1 expressed interest in any
 bit which changed; If it did, we exit to L1. But if it didn't it means that
 it is we (L0) that wished to trap this event, so we handle it ourselves.
 
 The next two patches add additional logic of what to do when an interrupt or
 exception is injected: Does L0 need to do it, should we exit to L1 to do it,
 or should we resume L2 and keep the exception to be injected later.
 
 We keep a new flag, nested_run_pending, which can override the decision of
 which should run next, L1 or L2. nested_run_pending=1 means that we *must*
 run
 L2 next, not L1. This is necessary in particular when L1 did a VMLAUNCH of L2
 and therefore expects L2 to be run (and perhaps be injected with an event it
 specified, etc.). Nested_run_pending is especially intended to avoid switching
 to L1 in the injection decision-point described above.
 
 Signed-off-by: Nadav Har'El n...@il.ibm.com
 ---
  arch/x86/kvm/vmx.c |  256
 ++-
  1 file changed, 255 insertions(+), 1 deletion(-)
 
 --- .before/arch/x86/kvm/vmx.c2011-05-16 22:36:49.0 +0300
 +++ .after/arch/x86/kvm/vmx.c 2011-05-16 22:36:49.0 +0300
 @@ -351,6 +351,8 @@ struct nested_vmx {
   /* Saving the VMCS that we used for running L1 */
   struct saved_vmcs saved_vmcs01;
   u64 vmcs01_tsc_offset;
 + /* L2 must run next, and mustn't decide to exit to L1. */
 + bool nested_run_pending;
   /*
* Guest pages referred to in vmcs02 with host-physical pointers, so
* we must keep them pinned while L2 runs.
 @@ -870,6 +872,20 @@ static inline bool nested_cpu_has2(struc
   (vmcs12->secondary_vm_exec_control & bit);
  }
 
 +static inline bool nested_cpu_has_virtual_nmis(struct kvm_vcpu *vcpu)
 +{
 + return is_guest_mode(vcpu) &&
 + (get_vmcs12(vcpu)->pin_based_vm_exec_control &
 + PIN_BASED_VIRTUAL_NMIS);
 +}

any reason to add the guest mode check here? I didn't see such a check in your
earlier nested_cpu_has_xxx helpers. It would be clearer to use the existing
nested_cpu_has_xxx along with is_guest_mode explicitly, which makes such usage
consistent.
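For illustration, a minimal sketch of what I mean (the helper tests only
vmcs12, and the call site adds is_guest_mode() explicitly; the call site shown
here is just an example):

    static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12)
    {
        return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS;
    }

    /* and at the call site, make the guest-mode check explicit: */
    if (is_guest_mode(vcpu) && nested_cpu_has_virtual_nmis(get_vmcs12(vcpu)))
        /* ... handle the virtual-NMI case ... */;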

 +
 +static inline bool is_exception(u32 intr_info)
 +{
 + return (intr_info & (INTR_INFO_INTR_TYPE_MASK |
 INTR_INFO_VALID_MASK))
 + == (INTR_TYPE_HARD_EXCEPTION | INTR_INFO_VALID_MASK);
 +}
 +
 +static void nested_vmx_vmexit(struct kvm_vcpu *vcpu);
  static void nested_vmx_entry_failure(struct kvm_vcpu *vcpu,
   struct vmcs12 *vmcs12,
   u32 reason, unsigned long qualification);
 @@ -5281,6 +5297,232 @@ static int (*kvm_vmx_exit_handlers[])(st
  static const int kvm_vmx_max_exit_handlers =
   ARRAY_SIZE(kvm_vmx_exit_handlers);
 
 +/*
 + * Return 1 if we should exit from L2 to L1 to handle an MSR access,
 + * rather than handle it ourselves in L0. I.e., check whether L1 expressed
 + * disinterest in the current event (read or write a specific MSR) by using 
 an
 + * MSR bitmap. This may be the case even when L0 doesn't use MSR bitmaps.
 + */
 +static bool nested_vmx_exit_handled_msr(struct kvm_vcpu *vcpu,
 + struct vmcs12 *vmcs12, u32 exit_reason)
 +{
 + u32 msr_index = vcpu->arch.regs[VCPU_REGS_RCX];
 + gpa_t bitmap;
 +
 + if (!nested_cpu_has(get_vmcs12(vcpu),
 CPU_BASED_USE_MSR_BITMAPS))
 + return 1;
 +
 + /*
 +  * The MSR_BITMAP page is divided into four 1024-byte bitmaps,
 +  * for the four combinations of read/write and low/high MSR numbers.
 +  * First we need to figure out which of the four to use:
 +  */
 + bitmap = vmcs12->msr_bitmap;
 + if (exit_reason == EXIT_REASON_MSR_WRITE)
 + bitmap += 2048;
 + if (msr_index >= 0xc0000000) {
 + msr_index -= 0xc0000000;
 + bitmap += 1024;
 + }
 +
 + /* Then read the msr_index'th bit from this bitmap: */
 + if (msr_index < 1024*8) {
 + unsigned char b;
 + kvm_read_guest(vcpu->kvm, bitmap + msr_index/8, &b, 1);
 + return 1 & (b >> (msr_index & 7));
 + } else
 + return 1; /* let L1 handle the wrong parameter */
 +}
 +
 +/*
 + * Return 1 if we should exit from L2 to L1 to handle a CR access exit,
 + * rather than handle it ourselves in L0. I.e., check if L1 wanted to
 + * intercept (via guest_host_mask etc.) the current event.
 + */
 +static bool nested_vmx_exit_handled_cr(struct kvm_vcpu *vcpu,
 + struct vmcs12 *vmcs12)
 +{
 + unsigned long exit_qualification 

RE: [PATCH 18/31] nVMX: Implement VMLAUNCH and VMRESUME

2011-05-25 Thread Tian, Kevin
 From: Nadav Har'El
 Sent: Tuesday, May 17, 2011 3:53 AM
 
 Implement the VMLAUNCH and VMRESUME instructions, allowing a guest
 hypervisor to run its own guests.
 
 This patch does not include some of the necessary validity checks on
 vmcs12 fields before the entry. These will appear in a separate patch
 below.
 
 Signed-off-by: Nadav Har'El n...@il.ibm.com
 ---

[...]

 +static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
 +{
 + struct vmcs12 *vmcs12;
 + struct vcpu_vmx *vmx = to_vmx(vcpu);
 + int cpu;
 + struct saved_vmcs *saved_vmcs02;
 +
 + if (!nested_vmx_check_permission(vcpu))
 + return 1;
 + skip_emulated_instruction(vcpu);
 +
 + vmcs12 = get_vmcs12(vcpu);
 +
 + enter_guest_mode(vcpu);
 +
 + vmx->nested.vmcs01_tsc_offset = vmcs_read64(TSC_OFFSET);
 +
 + /*
 +  * Switch from L1's VMCS (vmcs01), to L2's VMCS (vmcs02). Remember
 +  * vmcs01, on which CPU it was last loaded, and whether it was launched
 +  * (we need all these values next time we will use L1). Then recall
 +  * these values from the last time vmcs02 was used.
 +  */
 + saved_vmcs02 = nested_get_current_vmcs02(vmx);
 + if (!saved_vmcs02)
 + return -ENOMEM;
 +

we shouldn't return an error after the guest mode is updated. Or else move
enter_guest_mode to a later place...
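i.e. something along these lines (a minimal sketch; it assumes
nested_get_current_vmcs02() itself has no side effects that depend on already
being in guest mode):

    saved_vmcs02 = nested_get_current_vmcs02(vmx);
    if (!saved_vmcs02)
        return -ENOMEM;         /* fail before any guest state is touched */

    enter_guest_mode(vcpu);     /* only now commit to running L2 */
    vmx->nested.vmcs01_tsc_offset = vmcs_read64(TSC_OFFSET);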

Thanks
Kevin


RE: [PATCH 20/31] nVMX: Exiting from L2 to L1

2011-05-25 Thread Tian, Kevin
 From: Nadav Har'El [mailto:n...@math.technion.ac.il]
 Sent: Wednesday, May 25, 2011 4:06 PM
 
 On Wed, May 25, 2011, Tian, Kevin wrote about RE: [PATCH 20/31] nVMX:
 Exiting from L2 to L1:
  IOW, I disagree that if L0/L1 set same bit in cr0_guest_host_mask, then
  the bit is identical in vmcs02.guest_cr0 and vmcs12.guest_cr0 because L1
  has no permission to set its bit effectively in this case.
 ...
  this is the problem:
 
  (vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask) !=
  (vmcs02->guest_cr0 & vmcs12->cr0_guest_host_mask)
 
 Sorry for arguing previously, this is a very good, and correct, point, which
 I missed. When both L0 and L1 are KVM, this didn't cause problems because
 the
 only problematic bit has been the TS bit, and when KVM wants to override this
 bit it always does it to 1.
 
 So I've rewritten this function, based on my new understanding following your
 insights. I believe it now implements your formula *exactly*. Please take a
 look at the comments and the code, and see if you now agree with them:
 
 /*
  * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date
  * because L2 may have changed some cr0 bits directly
 (CR0_GUEST_HOST_MASK).
  * This function returns the new value we should put in vmcs12.guest_cr0.
  * It's not enough to just return the vmcs02 GUEST_CR0. Rather,
  *  1. Bits that neither L0 nor L1 trapped, were set directly by L2 and are 
 now
  * available in vmcs02 GUEST_CR0. (Note: It's enough to check that L0
  * didn't trap the bit, because if L1 did, so would L0).
  *  2. Bits that L1 asked to trap (and therefore L0 also did) could not have
  * been modified by L2, and L1 knows it. So just leave the old value of
  * the bit from vmcs12.guest_cr0. Note that the bit from vmcs02
 GUEST_CR0
  * isn't relevant, because if L0 traps this bit it can set it to anything.
  *  3. Bits that L1 didn't trap, but L0 did. L1 believes the guest could have
  * changed these bits, and therefore they need to be updated, but L0
  * didn't necessarily allow them to be changed in GUEST_CR0 - and
 rather
  * put them in vmcs02 CR0_READ_SHADOW. So take these bits from
 there.
  */
 static inline unsigned long
 vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 {
   return
   /*1*/ (vmcs_readl(GUEST_CR0) & vcpu->arch.cr0_guest_owned_bits) |
   /*2*/ (vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask) |
   /*3*/ (vmcs_readl(CR0_READ_SHADOW) & ~(vmcs12->cr0_guest_host_mask |
         vcpu->arch.cr0_guest_owned_bits));
 }
 
 

This looks good to me. :-)

Thanks
Kevin


RE: [PATCH 23/31] nVMX: Correct handling of interrupt injection

2011-05-25 Thread Tian, Kevin
 From: Nadav Har'El
 Sent: Tuesday, May 17, 2011 3:56 AM
 
 The code in this patch correctly emulates external-interrupt injection
 while a nested guest L2 is running.
 
 Because of this code's relative un-obviousness, I include here a longer-than-
 usual justification for what it does - much longer than the code itself ;-)
 
 To understand how to correctly emulate interrupt injection while L2 is
 running, let's look first at what we need to emulate: How would things look
 like if the extra L0 hypervisor layer is removed, and instead of L0 injecting
 an interrupt, we had hardware delivering an interrupt?
 
 Now we have L1 running on bare metal with a guest L2, and the hardware
 generates an interrupt. Assuming that L1 set PIN_BASED_EXT_INTR_MASK to
 1, and
 VM_EXIT_ACK_INTR_ON_EXIT to 0 (we'll revisit these assumptions below),
 what
 happens now is this: The processor exits from L2 to L1, with an external-
 interrupt exit reason but without an interrupt vector. L1 runs, with
 interrupts disabled, and it doesn't yet know what the interrupt was. Soon
 after, it enables interrupts and only at that moment, it gets the interrupt
 from the processor. when L1 is KVM, Linux handles this interrupt.
 
 Now we need exactly the same thing to happen when that L1-L2 system runs
 on top of L0, instead of real hardware. This is how we do this:
 
 When L0 wants to inject an interrupt, it needs to exit from L2 to L1, with
 external-interrupt exit reason (with an invalid interrupt vector), and run L1.
 Just like in the bare metal case, it likely can't deliver the interrupt to
 L1 now because L1 is running with interrupts disabled, in which case it turns
 on the interrupt window when running L1 after the exit. L1 will soon enable
 interrupts, and at that point L0 will gain control again and inject the
 interrupt to L1.
 
 Finally, there is an extra complication in the code: when nested_run_pending,
 we cannot return to L1 now, and must launch L2. We need to remember the
 interrupt we wanted to inject (and not clear it now), and do it on the
 next exit.
 
 The above explanation shows that the relative strangeness of the nested
 interrupt injection code in this patch, and the extra interrupt-window
 exit incurred, are in fact necessary for accurate emulation, and are not
 just an unoptimized implementation.
 
 Let's revisit now the two assumptions made above:
 
 If L1 turns off PIN_BASED_EXT_INTR_MASK (no hypervisor that I know
 does, by the way), things are simple: L0 may inject the interrupt directly
 to the L2 guest - using the normal code path that injects to any guest.
 We support this case in the code below.
 
 If L1 turns on VM_EXIT_ACK_INTR_ON_EXIT (again, no hypervisor that I know
 does), things look very different from the description above: L1 expects

A type-1 bare-metal hypervisor such as Xen may enable this bit. This bit is
really prepared for the L2 hypervisor case, since an L2 hypervisor normally
finds it tricky to touch the generic interrupt logic, and thus prefers not to
ack the interrupt until interrupts are enabled, at which point the hardware
will steer to the kernel interrupt handler automatically.

 to see an exit from L2 with the interrupt vector already filled in the exit
 information, and does not expect to be interrupted again with this interrupt.
 The current code does not (yet) support this case, so we do not allow the
 VM_EXIT_ACK_INTR_ON_EXIT exit-control to be turned on by L1.

Then just fill the interrupt vector field with the highest unmasked bit
from pending vIRR.
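As a rough, purely illustrative sketch of that on the L2->L1 exit path (this
assumes kvm_cpu_get_interrupt() is a suitable way to pop and ack the
highest-priority pending vector from the virtual irqchip; the exact field
encoding is per the SDM):

    if (vmcs12->vm_exit_controls & VM_EXIT_ACK_INTR_ON_EXIT) {
        int vector = kvm_cpu_get_interrupt(vcpu);   /* pops and acks vIRR */
        vmcs12->vm_exit_intr_info = vector |
            INTR_TYPE_EXT_INTR | INTR_INFO_VALID_MASK;
    }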

 
 Signed-off-by: Nadav Har'El n...@il.ibm.com
 ---
  arch/x86/kvm/vmx.c |   36 
  1 file changed, 36 insertions(+)
 
 --- .before/arch/x86/kvm/vmx.c2011-05-16 22:36:49.0 +0300
 +++ .after/arch/x86/kvm/vmx.c 2011-05-16 22:36:49.0 +0300
 @@ -1788,6 +1788,7 @@ static __init void nested_vmx_setup_ctls
 
   /* exit controls */
   nested_vmx_exit_ctls_low = 0;
 + /* Note that guest use of VM_EXIT_ACK_INTR_ON_EXIT is not supported.
 */
  #ifdef CONFIG_X86_64
   nested_vmx_exit_ctls_high = VM_EXIT_HOST_ADDR_SPACE_SIZE;
  #else
 @@ -3733,9 +3734,25 @@ out:
   return ret;
  }
 
 +/*
 + * In nested virtualization, check if L1 asked to exit on external 
 interrupts.
 + * For most existing hypervisors, this will always return true.
 + */
 +static bool nested_exit_on_intr(struct kvm_vcpu *vcpu)
 +{
 + return get_vmcs12(vcpu)->pin_based_vm_exec_control &
 + PIN_BASED_EXT_INTR_MASK;
 +}
 +

this could be a common wrapper similar to nested_cpu_has()...
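e.g. something like this hypothetical pin-based analogue of nested_cpu_has():

    static inline bool nested_cpu_has_pin(struct vmcs12 *vmcs12, u32 bit)
    {
        return vmcs12->pin_based_vm_exec_control & bit;
    }

    /* nested_exit_on_intr() then reduces to */
    return nested_cpu_has_pin(get_vmcs12(vcpu), PIN_BASED_EXT_INTR_MASK);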

Thanks,
Kevin


RE: [PATCH 23/31] nVMX: Correct handling of interrupt injection

2011-05-25 Thread Tian, Kevin
 From: Tian, Kevin
 Sent: Wednesday, May 25, 2011 4:40 PM
  If L1 turns on VM_EXIT_ACK_INTR_ON_EXIT (again, no hypervisor that I know
  does), things look very different from the description above: L1 expects
 
 A type-1 bare-metal hypervisor such as Xen may enable this bit. This bit is
 really prepared for the L2 hypervisor case, since an L2 hypervisor normally
 finds it tricky to touch the generic interrupt logic, and thus prefers not to
 ack the interrupt until interrupts are enabled, at which point the hardware
 will steer to the kernel interrupt handler automatically.
 
  to see an exit from L2 with the interrupt vector already filled in the exit
  information, and does not expect to be interrupted again with this 
  interrupt.
  The current code does not (yet) support this case, so we do not allow the
  VM_EXIT_ACK_INTR_ON_EXIT exit-control to be turned on by L1.
 
 Then just fill the interrupt vector field with the highest unmasked bit
 from pending vIRR.
 

And also ack virtual irqchip accordingly...

Thanks
Kevin


RE: [PATCH 23/31] nVMX: Correct handling of interrupt injection

2011-05-25 Thread Tian, Kevin
 From: Nadav Har'El
 Sent: Tuesday, May 17, 2011 3:56 AM
 
 The code in this patch correctly emulates external-interrupt injection
 while a nested guest L2 is running.
 
 Because of this code's relative un-obviousness, I include here a longer-than-
 usual justification for what it does - much longer than the code itself ;-)
 
 To understand how to correctly emulate interrupt injection while L2 is
 running, let's look first at what we need to emulate: How would things look
 like if the extra L0 hypervisor layer is removed, and instead of L0 injecting
 an interrupt, we had hardware delivering an interrupt?
 
 Now we have L1 running on bare metal with a guest L2, and the hardware
 generates an interrupt. Assuming that L1 set PIN_BASED_EXT_INTR_MASK to
 1, and
 VM_EXIT_ACK_INTR_ON_EXIT to 0 (we'll revisit these assumptions below),
 what
 happens now is this: The processor exits from L2 to L1, with an external-
 interrupt exit reason but without an interrupt vector. L1 runs, with
 interrupts disabled, and it doesn't yet know what the interrupt was. Soon
 after, it enables interrupts and only at that moment, it gets the interrupt
 from the processor. when L1 is KVM, Linux handles this interrupt.
 
 Now we need exactly the same thing to happen when that L1-L2 system runs
 on top of L0, instead of real hardware. This is how we do this:
 
 When L0 wants to inject an interrupt, it needs to exit from L2 to L1, with
 external-interrupt exit reason (with an invalid interrupt vector), and run L1.
 Just like in the bare metal case, it likely can't deliver the interrupt to
 L1 now because L1 is running with interrupts disabled, in which case it turns
 on the interrupt window when running L1 after the exit. L1 will soon enable
 interrupts, and at that point L0 will gain control again and inject the
 interrupt to L1.
 
 Finally, there is an extra complication in the code: when nested_run_pending,
 we cannot return to L1 now, and must launch L2. We need to remember the
 interrupt we wanted to inject (and not clear it now), and do it on the
 next exit.
 
 The above explanation shows that the relative strangeness of the nested
 interrupt injection code in this patch, and the extra interrupt-window
 exit incurred, are in fact necessary for accurate emulation, and are not
 just an unoptimized implementation.
 
 Let's revisit now the two assumptions made above:
 
 If L1 turns off PIN_BASED_EXT_INTR_MASK (no hypervisor that I know
 does, by the way), things are simple: L0 may inject the interrupt directly
 to the L2 guest - using the normal code path that injects to any guest.
 We support this case in the code below.
 
 If L1 turns on VM_EXIT_ACK_INTR_ON_EXIT (again, no hypervisor that I know
 does), things look very different from the description above: L1 expects
 to see an exit from L2 with the interrupt vector already filled in the exit
 information, and does not expect to be interrupted again with this interrupt.
 The current code does not (yet) support this case, so we do not allow the
 VM_EXIT_ACK_INTR_ON_EXIT exit-control to be turned on by L1.
 
 Signed-off-by: Nadav Har'El n...@il.ibm.com
 ---
  arch/x86/kvm/vmx.c |   36 
  1 file changed, 36 insertions(+)
 
 --- .before/arch/x86/kvm/vmx.c2011-05-16 22:36:49.0 +0300
 +++ .after/arch/x86/kvm/vmx.c 2011-05-16 22:36:49.0 +0300
 @@ -1788,6 +1788,7 @@ static __init void nested_vmx_setup_ctls
 
   /* exit controls */
   nested_vmx_exit_ctls_low = 0;
 + /* Note that guest use of VM_EXIT_ACK_INTR_ON_EXIT is not supported.
 */
  #ifdef CONFIG_X86_64
   nested_vmx_exit_ctls_high = VM_EXIT_HOST_ADDR_SPACE_SIZE;
  #else
 @@ -3733,9 +3734,25 @@ out:
   return ret;
  }
 
 +/*
 + * In nested virtualization, check if L1 asked to exit on external 
 interrupts.
 + * For most existing hypervisors, this will always return true.
 + */
 +static bool nested_exit_on_intr(struct kvm_vcpu *vcpu)
 +{
 + return get_vmcs12(vcpu)->pin_based_vm_exec_control &
 + PIN_BASED_EXT_INTR_MASK;
 +}
 +
  static void enable_irq_window(struct kvm_vcpu *vcpu)
  {
   u32 cpu_based_vm_exec_control;
 + if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
 + /* We can get here when nested_run_pending caused
 +  * vmx_interrupt_allowed() to return false. In this case, do
 +  * nothing - the interrupt will be injected later.
 +  */

I think this is not a rare path? When the vcpu is in guest mode with L2 as the
current vmx context, this function could be invoked multiple times, since the
kvm thread can be scheduled out/in randomly.

 + return;
 
   cpu_based_vm_exec_control =
 vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
   cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
 @@ -3858,6 +3875,17 @@ static void vmx_set_nmi_mask(struct kvm_
 
  static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
  {
 + if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {

RE: [PATCH 25/31] nVMX: Correct handling of idt vectoring info

2011-05-25 Thread Tian, Kevin
 From: Nadav Har'El
 Sent: Tuesday, May 17, 2011 3:57 AM
 
 This patch adds correct handling of IDT_VECTORING_INFO_FIELD for the
 nested
 case.
 
 When a guest exits while delivering an interrupt or exception, we get this
 information in IDT_VECTORING_INFO_FIELD in the VMCS. When L2 exits to L1,
 there's nothing we need to do, because L1 will see this field in vmcs12, and
 handle it itself. However, when L2 exits and L0 handles the exit itself and
 plans to return to L2, L0 must inject this event to L2.
 
 In the normal non-nested case, the idt_vectoring_info case is discovered after
 the exit, and the decision to inject (though not the injection itself) is made
 at that point. However, in the nested case a decision of whether to return
 to L2 or L1 also happens during the injection phase (see the previous
 patches), so in the nested case we can only decide what to do about the
 idt_vectoring_info right after the injection, i.e., in the beginning of
 vmx_vcpu_run, which is the first time we know for sure if we're staying in
 L2.
 
 Therefore, when we exit L2 (is_guest_mode(vcpu)), we disable the regular
 vmx_complete_interrupts() code which queues the idt_vectoring_info for
 injection on next entry - because such injection would not be appropriate
 if we will decide to exit to L1. Rather, we just save the idt_vectoring_info
 and related fields in vmcs12 (which is a convenient place to save these
 fields). On the next entry in vmx_vcpu_run (*after* the injection phase,
 potentially exiting to L1 to inject an event requested by user space), if
 we find ourselves in L1 we don't need to do anything with those values
 we saved (as explained above). But if we find that we're in L2, or rather
 *still* at L2 (it's not nested_run_pending, meaning that this is the first
 round of L2 running after L1 having just launched it), we need to inject
 the event saved in those fields - by writing the appropriate VMCS fields.
 
 Signed-off-by: Nadav Har'El n...@il.ibm.com
 ---
  arch/x86/kvm/vmx.c |   30 ++
  1 file changed, 30 insertions(+)
 
 --- .before/arch/x86/kvm/vmx.c2011-05-16 22:36:50.0 +0300
 +++ .after/arch/x86/kvm/vmx.c 2011-05-16 22:36:50.0 +0300
 @@ -5804,6 +5804,8 @@ static void __vmx_complete_interrupts(st
 
  static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
  {
 + if (is_guest_mode(&vmx->vcpu))
 + return;
   __vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
 VM_EXIT_INSTRUCTION_LEN,
 IDT_VECTORING_ERROR_CODE);
 @@ -5811,6 +5813,8 @@ static void vmx_complete_interrupts(stru
 
  static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
  {
 + if (is_guest_mode(vcpu))
 + return;
   __vmx_complete_interrupts(to_vmx(vcpu),
 vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
 VM_ENTRY_INSTRUCTION_LEN,
 @@ -5831,6 +5835,21 @@ static void __noclone vmx_vcpu_run(struc
  {
   struct vcpu_vmx *vmx = to_vmx(vcpu);
 
 + if (is_guest_mode(vcpu) && !vmx->nested.nested_run_pending) {
 + struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
 + if (vmcs12->idt_vectoring_info_field &
 + VECTORING_INFO_VALID_MASK) {
 + vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
 + vmcs12->idt_vectoring_info_field);
 + vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
 + vmcs12->vm_exit_instruction_len);
 + if (vmcs12->idt_vectoring_info_field &
 + VECTORING_INFO_DELIVER_CODE_MASK)
 + vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
 + vmcs12->idt_vectoring_error_code);
 + }
 + }

One question here: what if L2 has interrupt exiting disabled? In that case L0
is expected to inject the virtual interrupt directly into L2, so simply
overwriting the interrupt info field here looks incorrect. As you said, a
typical hypervisor doesn't turn interrupt exiting off, but it is an
architecturally valid configuration. I think here you should compare the
current INTR_INFO_FIELD with the saved IDT vectoring info and choose the
higher-priority event when L2 has interrupt exiting disabled.
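Very roughly, the kind of guard I mean (just a sketch; the real fix needs a
proper priority comparison between the two events, which is not shown here):

    u32 pending = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);

    if (!(pending & INTR_INFO_VALID_MASK)) {
        /* nothing queued by L0 yet, so re-injecting the saved event is safe */
        vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
                     vmcs12->idt_vectoring_info_field);
    }
    /* else: both events are valid and the higher-priority one must win */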

Thanks
Kevin


RE: [PATCH 25/31] nVMX: Correct handling of idt vectoring info

2011-05-25 Thread Tian, Kevin
 From: Nadav Har'El
 Sent: Wednesday, May 25, 2011 6:14 PM
 
 On Wed, May 25, 2011, Tian, Kevin wrote about RE: [PATCH 25/31] nVMX:
 Correct handling of idt vectoring info:
  Here got one question. How about L2 has interrupt exiting disabled? That
 way
 
 You are right, this is a bug in my code. It's a known bug - listed on my
 private bugzilla - but I didn't get around to fixing it.
 
 As I said, most hypervisors in the wild do not have interrupt exiting
 disabled, so this bug will not appear in practice, but definitely it should
 be fixed, and I will.
 
 When the nvmx patches are merged, I can copy my private bug tracker's
 open issues to an open bug tracker (I'm hoping KVM has one?).
 

That's fine, or at least convert your private bug list into TODO comments in
the right places in the code.

Thanks
Kevin


RE: [PATCH 31/31] nVMX: Documentation

2011-05-25 Thread Tian, Kevin
 From: Nadav Har'El
 Sent: Tuesday, May 17, 2011 4:00 AM
 
 This patch includes a brief introduction to the nested vmx feature in the
 Documentation/kvm directory. The document also includes a copy of the
 vmcs12 structure, as requested by Avi Kivity.
 
 Signed-off-by: Nadav Har'El n...@il.ibm.com
 ---
  Documentation/kvm/nested-vmx.txt |  243
 +
  1 file changed, 243 insertions(+)
 
 --- .before/Documentation/kvm/nested-vmx.txt  2011-05-16
 22:36:51.0 +0300
 +++ .after/Documentation/kvm/nested-vmx.txt   2011-05-16
 22:36:51.0 +0300
 @@ -0,0 +1,243 @@
 +Nested VMX
 +==
 +
 +Overview
 +-
 +
 +On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
 +to easily and efficiently run guest operating systems. Normally, these guests
 +*cannot* themselves be hypervisors running their own guests, because in
 VMX,
 +guests cannot use VMX instructions.

"because in VMX, guests cannot use VMX instructions" doesn't look correct, or
else you couldn't add nVMX support. :-) It's just that currently KVM doesn't
emulate those VMX instructions.

 +
 +The Nested VMX feature adds this missing capability - of running guest
 +hypervisors (which use VMX) with their own nested guests. It does so by
 +allowing a guest to use VMX instructions, and correctly and efficiently
 +emulating them using the single level of VMX available in the hardware.
 +
 +We describe in much greater detail the theory behind the nested VMX
 feature,
 +its implementation and its performance characteristics, in the OSDI 2010
 paper
 +The Turtles Project: Design and Implementation of Nested Virtualization,
 +available at:
 +
 + http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf
 +
 +
 +Terminology
 +---
 +
 +Single-level virtualization has two levels - the host (KVM) and the guests.
 +In nested virtualization, we have three levels: The host (KVM), which we call
 +L0, the guest hypervisor, which we call L1, and its nested guest, which we
 +call L2.

Adding a brief introduction to vmcs01/vmcs02/vmcs12 would also be helpful here,
given that this doc is a centralized place to gain a quick picture of nested VMX.

 +
 +
 +Known limitations
 +-
 +
 +The current code supports running Linux guests under KVM guests.
 +Only 64-bit guest hypervisors are supported.
 +
 +Additional patches for running Windows under guest KVM, and Linux under
 +guest VMware server, and support for nested EPT, are currently running in
 +the lab, and will be sent as follow-on patchsets.

any plan on nested VTD?

 +
 +
 +Running nested VMX
 +--
 +
 +The nested VMX feature is disabled by default. It can be enabled by giving
 +the nested=1 option to the kvm-intel module.
 +
 +No modifications are required to user space (qemu). However, qemu's default
 +emulated CPU type (qemu64) does not list the VMX CPU feature, so it must
 be
 +explicitly enabled, by giving qemu one of the following options:
 +
 + -cpu host  (emulated CPU has all features of the real
 CPU)
 +
 + -cpu qemu64,+vmx   (add just the vmx feature to a named CPU
 type)
 +
 +
 +ABIs
 +
 +
 +Nested VMX aims to present a standard and (eventually) fully-functional VMX
 +implementation for a guest hypervisor to use. As such, the official
 +specification of the ABI that it provides is Intel's VMX specification,
 +namely volume 3B of their Intel 64 and IA-32 Architectures Software
 +Developer's Manual. Not all of VMX's features are currently fully supported,
 +but the goal is to eventually support them all, starting with the VMX 
 features
 +which are used in practice by popular hypervisors (KVM and others).

It'd be good to provide a list of known supported features. In your current
code, people have to look at the code to understand the current status. If you
can keep a supported and verified feature list here, it'd be great.

Thanks
Kevin


RE: [PATCH] new version of loaded_vmcs patch

2011-05-25 Thread Tian, Kevin
 From: Avi Kivity
 Sent: Tuesday, May 24, 2011 9:57 PM
 
 On 05/24/2011 04:14 PM, Avi Kivity wrote:
  On 05/24/2011 03:26 PM, Nadav Har'El wrote:
  Hi Avi, here is an updated version of the loaded_vmcs patch which you
  asked me
  to send before the rest of the nvmx patches.
 
  Please let me know when you'd like me to send you an updated version of
  the rest of the patches.
 
  Do you have a list of unresolved issues?
 
  If not, please repost the series.
 
 
 btw, since Kevin is now plowing through the patchset, please wait for
 the rest of his review.
 

I have finished my current round of review of nVMX-v10 with all comments
sent out.

Thanks
Kevin


RE: [PATCH 31/31] nVMX: Documentation

2011-05-25 Thread Tian, Kevin
 From: Nadav Har'El
 Sent: Wednesday, May 25, 2011 7:55 PM
 
 On Wed, May 25, 2011, Tian, Kevin wrote about RE: [PATCH 31/31] nVMX:
 Documentation:
   +On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
   +to easily and efficiently run guest operating systems. Normally, these
 guests
   +*cannot* themselves be hypervisors running their own guests, because in
   VMX,
   +guests cannot use VMX instructions.
 
  because in VMX, guests cannot use VMX instructions looks not correct or
 else
  you can't add nVMX support. :-) It's just because currently KVM doesn't
 emulate
  those VMX instructions.
 
 It depends on whether you look on the half-empty or half-full part of the
 glass ;-)
 
 The VMX instructions, when used in L1, do trap - as mandated by Popek and
 Goldberg's theorem (that sensitive instructions must trap) - but they
 don't just work like, for example, arithmetic instructions just work -
 they need to be emulated by the VMM.
 
   +Terminology
   +---
   +
   +Single-level virtualization has two levels - the host (KVM) and the 
   guests.
   +In nested virtualization, we have three levels: The host (KVM), which we
 call
   +L0, the guest hypervisor, which we call L1, and its nested guest, which 
   we
   +call L2.
 
  Add a brief introduction about vmcs01/vmcs02/vmcs12 is also helpful here,
 given
  that this doc is a centralized place to gain quick picture of the nested 
  VMX.
 
 I'm adding now a short mention. However, I think this file should be viewed
 as a user's guide, not a developer's guide. Developers should probably read
 our full paper, where this terminology is explained, as well as how vmcs02
 is related to the two others.

I agree with the purpose of this doc. 

 
   +Additional patches for running Windows under guest KVM, and Linux under
   +guest VMware server, and support for nested EPT, are currently running in
   +the lab, and will be sent as follow-on patchsets.
 
  any plan on nested VTD?
 
 Yes, for some definition of Yes ;-)
 
 We do have an experimental nested IOMMU implementation: In our nested
 VMX
 paper we showed how giving L1 an IOMMU allows for efficient nested device
 assignment (L0 assigns a PCI device to L1, and L1 does the same to L2).
 In that work we used a very simplistic paravirtual IOMMU instead of fully
 emulating an IOMMU for L1.
 Later, we did develop a full emulation of an IOMMU for L1, although we didn't
 test it in the context of nested VMX (we used it to allow L1 to use an IOMMU
 for better DMA protection inside the guest).
 
 The IOMMU emulation work was done by Nadav Amit, Muli Ben-Yehuda, et al.,
 and will be described in the upcoming Usenix ATC conference
 (http://www.usenix.org/event/atc11/tech/techAbstracts.html#Amit).
 After the conference in June, the paper will be available at this URL:
 http://www.usenix.org/event/atc11/tech/final_files/Amit.pdf
 
 If there is interest, they can perhaps contribute their work to
 KVM (and QEMU) - if you're interested, please get in touch with them directly.

Thanks, and good to know that information

 
  It'd be good to provide a list of known supported features. In your current
 code,
  people have to look at code to understand current status. If you can keep a
  supported and verified feature list here, it'd be great.
 
 It will be even better to support all features ;-)
 
 But seriously, the VMX spec is hundreds of pages long, with hundreds of
 features, sub-features, and sub-sub-features and myriads of subcase-of-
 subfeature and combinations thereof, so I don't think such a list would be
 practical - or ever be accurate.

no need for all subfeatures; a list of perhaps a dozen features which people
enabled one-by-one would be very welcome, especially for things which may
accelerate L2 perf, such as virtual NMI, TPR shadow, virtual x2APIC, ...

 
 In the Known Limitations section of this document, I'd like to list major
 features which are missing, and perhaps more importantly - L1 and L2
 guests which are known NOT to work.

yes, that info is also important, so that people can easily reproduce your
success.

 
 By the way, it appears that you've been going over the patches in increasing
 numerical order, and this is the last patch ;-) Have you finished your
 review iteration?
 

yes, I've finished my review on all of your v10 patches. :-)

Thanks
Kevin


RE: [PATCH 23/31] nVMX: Correct handling of interrupt injection

2011-05-25 Thread Tian, Kevin
 From: Nadav Har'El [mailto:n...@math.technion.ac.il]
 Sent: Wednesday, May 25, 2011 8:34 PM
 
 On Wed, May 25, 2011, Tian, Kevin wrote about RE: [PATCH 23/31] nVMX:
 Correct handling of interrupt injection:
static void enable_irq_window(struct kvm_vcpu *vcpu)
{
 u32 cpu_based_vm_exec_control;
    + if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
   + /* We can get here when nested_run_pending caused
   +  * vmx_interrupt_allowed() to return false. In this case, do
   +  * nothing - the interrupt will be injected later.
   +  */
 
  I think this is not a rare path? when vcpu is in guest mode with L2 as 
  current
  vmx context, this function could be invoked multiple times since kvm thread
  can be scheduled out/in randomly.
 
 As I wrote in this comment, this can only happen on nested_run_pending
 (i.e., VMLAUNCH/VMRESUME emulation), because if !nested_run_pending,
 and nested_exit_on_intr(), vmx_interrupt_allowed() would have already
 exited L2, and we wouldn't be in this case.
 
 I don't know if to classify this as a rare path - it's definitely not
 very common. But what does it matter if it's rare or common?

It doesn't matter much. I just tried to understand your comment.

 
 
    + if (to_vmx(vcpu)->nested.nested_run_pending)
   + return 0;
 
  Well, now I can see why you require this special 'nested_run_pending' flag
  because there're places where L0 injects virtual interrupts right after
  VMLAUNCH/VMRESUME emulation and before entering L2. :-)
 
 Indeed. I tried to explain that in the patch description, where I wrote
 
  We keep a new flag, nested_run_pending, which can override the decision
 of
  which should run next, L1 or L2. nested_run_pending=1 means that we
 *must* run
  L2 next, not L1. This is necessary in particular when L1 did a VMLAUNCH of L2
  and therefore expects L2 to be run (and perhaps be injected with an event it
  specified, etc.). Nested_run_pending is especially intended to avoid 
 switching
  to L1 in the injection decision-point described above.

When nested_run_pending is first introduced, its usage is simple, which made
me think this field may not be required. But several later key patches do
depend on this flag for correctness. :-)

 
   + nested_vmx_vmexit(vcpu);
   + vmcs12 = get_vmcs12(vcpu);
    + vmcs12->vm_exit_reason =
 EXIT_REASON_EXTERNAL_INTERRUPT;
   + vmcs12-vm_exit_intr_info = 0;
   + /* fall through to normal code, but now in L1, not L2 */
   + }
   +
 
  This is a bad place to add this logic. vmx_interrupt_allowed is simply a
  query function but you make it an absolute trigger point for switching from
  L2 to L1. This is fine as now only point calling vmx_interrupt_allowed is
  when there's vNMI pending. But it's dangerous to have such assumption
  for pending events inside vmx_interrupt_allowed.
 
 Now you're beating a dead horse ;-)
 
 Gleb, and to some degree Avi, already argued that this is the wrong place
 to do this exit, and if anything the exit should be done (or just decided on)
 in enable_irq_window().
 
 My counter-argument was that the right way is *neither* of these approaches
 -
 any attempt to commandeer one of the existing x86 ops, be they
 vmx_interrupt_allowed() or enable_irq_window() to do in the L2 case things
 they were never designed to do is both ugly, and dangerous if the call sites
 change at some time in the future.
 
 So rather than changing one ugly abuse of one function, to the (arguably
 also ugly) abuse of another function, what I'd like to see is a better overall
 design, where the call sites in x86.c know about the possibility of a nested
 guest (they already do - like we previously discussed, an is_guest_mode()
 function was recently added), and when they need, *they* will call an
 exit-to-L1 function, rather than calling a function called enable_irq_window
 or vmx_interrupt_allowed which mysteriously will do the exit.
 

I agree with your point here.

 
  On the other hand, I think there's one area which is not handled timely.
 I think you need to kick an L2->L1 transition when L0 wants to inject
  virtual interrupt. Consider your current logic:
 
  a) L2 is running on cpu1
  b) L0 on cpu 0 decides to post a virtual interrupt to L1. An IPI is issued 
  to
  cpu1 after updating virqchip
  c) L2 on cpu0 vmexit to L0, and checks whether L0 or L1 should handle
  the event. As it's an external interrupt, L0 will handle it. As it's a 
  notification
  IPI, nothing is required.
  d) L0 on cpu0 then decides to resume, and find KVM_REQ_EVENT
 
  At this point you only add logic to enable_irq_window, but there's no
 action to trigger L2->L1 transition. So what will happen? Will the event
  be injected into L2 instead or pend until next switch happens due to
  other cause?
 
 I'm afraid I'm missing something in your explanation... In step d, L0 finds
 an interrupt in the injection queue, so isn't the first thing it does

RE: [PATCH 20/31] nVMX: Exiting from L2 to L1

2011-05-25 Thread Tian, Kevin
 From: Nadav Har'El
 Sent: Wednesday, May 25, 2011 9:21 PM
 
 On Wed, May 25, 2011, Tian, Kevin wrote about RE: [PATCH 20/31] nVMX:
 Exiting from L2 to L1:
 How about SYSENTER and PERF_GLOBAL_CTRL MSRs? At least a TODO comment
 here would make the whole load process complete. :-)
 
 Also, isn't it more sane to update vmcs01's guest segment info based on
 vmcs12's host segment info? Though you can assume the environment in L1
 doesn't change from VMLAUNCH/VMRESUME to the VMEXIT handler, it's more
 architecturally clear to load those segment fields according to L1's desire.
 
 Right... One of these days, I (or some other volunteer ;-)) would need to
 print out the relevant sections of the SDM, sit down with a marker, a read
 it line by line marking lines, fields, capabilities, and so on, which we
 forgot to implement...

You've done a great job.

 
 How about these additions:
 
   vmcs_write32(GUEST_SYSENTER_CS, vmcs12->host_ia32_sysenter_cs);
   vmcs_writel(GUEST_SYSENTER_ESP, vmcs12->host_ia32_sysenter_esp);
   vmcs_writel(GUEST_SYSENTER_EIP, vmcs12->host_ia32_sysenter_eip);
   vmcs_writel(GUEST_IDTR_BASE, vmcs12->host_idtr_base);
   vmcs_writel(GUEST_GDTR_BASE, vmcs12->host_gdtr_base);
   vmcs_writel(GUEST_TR_BASE, vmcs12->host_tr_base);
   vmcs_writel(GUEST_GS_BASE, vmcs12->host_gs_base);
   vmcs_writel(GUEST_FS_BASE, vmcs12->host_fs_base);
   vmcs_write16(GUEST_ES_SELECTOR, vmcs12->host_es_selector);
   vmcs_write16(GUEST_CS_SELECTOR, vmcs12->host_cs_selector);
   vmcs_write16(GUEST_SS_SELECTOR, vmcs12->host_ss_selector);
   vmcs_write16(GUEST_DS_SELECTOR, vmcs12->host_ds_selector);
   vmcs_write16(GUEST_FS_SELECTOR, vmcs12->host_fs_selector);
   vmcs_write16(GUEST_GS_SELECTOR, vmcs12->host_gs_selector);
   vmcs_write16(GUEST_TR_SELECTOR, vmcs12->host_tr_selector);

   if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PAT)
   vmcs_write64(GUEST_IA32_PAT, vmcs12->host_ia32_pat);
   if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL)
   vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL,
   vmcs12->host_ia32_perf_global_ctrl);
 

looks good.

Thanks
Kevin


RE: [PATCH 18/31] nVMX: Implement VMLAUNCH and VMRESUME

2011-05-25 Thread Tian, Kevin
 From: Nadav Har'El [mailto:n...@math.technion.ac.il]
 Sent: Wednesday, May 25, 2011 9:26 PM
 
 On Wed, May 25, 2011, Tian, Kevin wrote about RE: [PATCH 18/31] nVMX:
 Implement VMLAUNCH and VMRESUME:
   + if (!saved_vmcs02)
   + return -ENOMEM;
   +
 
  we shouldn't return error after the guest mode is updated. Or else move
  enter_guest_mode to a later place...
 
 I moved things around, but I don't think it matters anyway: If we return
 ENOMEM, the KVM ioctl fails, and the whole L1 guest dies - it doesn't matter
 at this point if we were in the middle of updating its state.
 

yes, but the code is cleaner this way: guest mode is set only when we will
certainly enter L2 next.

Thanks
Kevin


RE: [PATCH 0/31] nVMX: Nested VMX, v11

2011-05-25 Thread Tian, Kevin
 From: Nadav Har'El
 Sent: Thursday, May 26, 2011 4:01 AM
 
 Hi,
 
 This is the eleventh iteration of the nested VMX patch set, and hopefully the
 last in this format.
 
 Improvements in this version over the previous one include:
 
  * Overhauled vmcs, cpu, and launched handling (new loaded_vmcs structure
 and
per-cpu linked-list replacing old vcpu linked-list).
 
  * Added missing checks that VMPTRLD was done, in emulating VMLAUNCH
 and some
other VMX instructions.
 
  * prepare_vmcs02 is now void, and cannot fail. Correctly handle the case
of apic-access page with an invalid GPA.
 
  * A bunch of corrections to errors and omissions discovered by Kevin Tian.
 
  * Miscellenous function name changes, and other cosmetic improvements.
 
 This new set of patches applies to the current KVM trunk (I checked with
 5b186b275c0288c8576d26baebe9019875d68a2b).
 
 Avi, please apply.

This version looks good to me; just please make sure unhandled comments
are recorded in your private bugzilla and eventually filed in a public place.
:-)

Acked-by: Kevin Tian kevin.t...@intel.com

Thanks
Kevin

 
 Nadav.
 
 
 About nested VMX:
 -
 
 The following 31 patches implement nested VMX support. This feature enables
 a guest to use the VMX APIs in order to run its own nested guests.
 In other words, it allows running hypervisors (that use VMX) under KVM.
 Multiple guest hypervisors can be run concurrently, and each of those can
 in turn host multiple guests.
 
 The theory behind this work, our implementation, and its performance
 characteristics were presented in OSDI 2010 (the USENIX Symposium on
 Operating Systems Design and Implementation). Our paper was titled
 The Turtles Project: Design and Implementation of Nested Virtualization,
 and was awarded Jay Lepreau Best Paper. The paper is available online, at:
 
   http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf
 
 This patch set does not include all the features described in the paper.
 In particular, this patch set is missing nested EPT (L1 can't use EPT and
 must use shadow page tables). It is also missing some features required to
 run VMWare hypervisors as a guest. These missing features will be sent as
 follow-on patches.
 
 Running nested VMX:
 --
 
 The nested VMX feature is currently disabled by default. It must be
 explicitly enabled with the nested=1 option to the kvm-intel module.
 
 No modifications are required to user space (qemu). However, qemu's default
 emulated CPU type (qemu64) does not list the VMX CPU feature, so it must
 be
 explicitly enabled, by giving qemu one of the following options:
 
  -cpu host  (emulated CPU has all features of the real
 CPU)
 
  -cpu qemu64,+vmx   (add just the vmx feature to a named CPU
 type)
 
 
 This version was only tested with KVM (64-bit) as a guest hypervisor, and
 Linux as a nested guest.
 
 
 Patch statistics:
 -
 
  Documentation/kvm/nested-vmx.txt |  251 ++
  arch/x86/include/asm/kvm_host.h  |2
  arch/x86/include/asm/msr-index.h |   12
  arch/x86/include/asm/vmx.h   |   43
  arch/x86/kvm/svm.c   |6
  arch/x86/kvm/vmx.c   | 2733
 +++--
  arch/x86/kvm/x86.c   |   11
  arch/x86/kvm/x86.h   |8
  8 files changed, 2907 insertions(+), 159 deletions(-)
 
 --
 Nadav Har'El
 IBM Haifa Research Lab


RE: [PATCH 17/31] nVMX: Prepare vmcs02 from vmcs01 and vmcs12

2011-05-24 Thread Tian, Kevin
 From: Nadav Har'El
 Sent: Tuesday, May 17, 2011 3:53 AM

 This patch contains code to prepare the VMCS which can be used to actually
 run the L2 guest, vmcs02. prepare_vmcs02 appropriately merges the
 information
 in vmcs12 (the vmcs that L1 built for L2) and in vmcs01 (our desires for our
 own guests).

 Signed-off-by: Nadav Har'El n...@il.ibm.com
 ---
  arch/x86/kvm/vmx.c |  269
 +++
  1 file changed, 269 insertions(+)

 --- .before/arch/x86/kvm/vmx.c2011-05-16 22:36:48.0 +0300
 +++ .after/arch/x86/kvm/vmx.c 2011-05-16 22:36:48.0 +0300
 @@ -347,6 +347,12 @@ struct nested_vmx {
   /* vmcs02_list cache of VMCSs recently used to run L2 guests */
   struct list_head vmcs02_pool;
   int vmcs02_num;
 + u64 vmcs01_tsc_offset;
 + /*
 +  * Guest pages referred to in vmcs02 with host-physical pointers, so
 +  * we must keep them pinned while L2 runs.
 +  */
 + struct page *apic_access_page;
  };

  struct vcpu_vmx {
 @@ -849,6 +855,18 @@ static inline bool report_flexpriority(v
   return flexpriority_enabled;
  }

 +static inline bool nested_cpu_has(struct vmcs12 *vmcs12, u32 bit)
 +{
 + return vmcs12->cpu_based_vm_exec_control & bit;
 +}
 +
 +static inline bool nested_cpu_has2(struct vmcs12 *vmcs12, u32 bit)
 +{
 + return (vmcs12->cpu_based_vm_exec_control &
 + CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) &&
 + (vmcs12->secondary_vm_exec_control & bit);
 +}
 +
  static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
  {
   int i;
 @@ -1435,6 +1453,22 @@ static void vmx_fpu_activate(struct kvm_

  static void vmx_decache_cr0_guest_bits(struct kvm_vcpu *vcpu);

 +/*
 + * Return the cr0 value that a nested guest would read. This is a combination
 + * of the real cr0 used to run the guest (guest_cr0), and the bits shadowed 
 by
 + * its hypervisor (cr0_read_shadow).
 + */
 +static inline unsigned long guest_readable_cr0(struct vmcs12 *fields)
 +{
 + return (fields->guest_cr0 & ~fields->cr0_guest_host_mask) |
 + (fields->cr0_read_shadow & fields->cr0_guest_host_mask);
 +}
 +static inline unsigned long guest_readable_cr4(struct vmcs12 *fields)
 +{
 + return (fields->guest_cr4 & ~fields->cr4_guest_host_mask) |
 + (fields->cr4_read_shadow & fields->cr4_guest_host_mask);
 +}
 +

will the guest_ prefix look confusing here? 'guest' has a broad meaning, which
makes the above two functions look like they can be used in the non-nested
case. Should we stick to a nested_ prefix for nested-specific facilities?
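For illustration, the kind of rename I have in mind (hypothetical name, same
body as the quoted helper):

    static inline unsigned long nested_read_cr0(struct vmcs12 *fields)
    {
        return (fields->guest_cr0 & ~fields->cr0_guest_host_mask) |
               (fields->cr0_read_shadow & fields->cr0_guest_host_mask);
    }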

  static void vmx_fpu_deactivate(struct kvm_vcpu *vcpu)
  {
   vmx_decache_cr0_guest_bits(vcpu);
 @@ -3423,6 +3457,9 @@ static void set_cr4_guest_host_mask(stru
   vmx->vcpu.arch.cr4_guest_owned_bits =
 KVM_CR4_GUEST_OWNED_BITS;
   if (enable_ept)
   vmx->vcpu.arch.cr4_guest_owned_bits |= X86_CR4_PGE;
 + if (is_guest_mode(&vmx->vcpu))
 + vmx->vcpu.arch.cr4_guest_owned_bits &=
 + ~get_vmcs12(&vmx->vcpu)->cr4_guest_host_mask;

why not is_nested_mode()? :-P

   vmcs_writel(CR4_GUEST_HOST_MASK,
 ~vmx->vcpu.arch.cr4_guest_owned_bits);
  }

 @@ -4760,6 +4797,11 @@ static void free_nested(struct vcpu_vmx
   vmx->nested.current_vmptr = -1ull;
   vmx->nested.current_vmcs12 = NULL;
   }
 + /* Unpin physical memory we referred to in current vmcs02 */
 + if (vmx->nested.apic_access_page) {
 + nested_release_page(vmx->nested.apic_access_page);
 + vmx->nested.apic_access_page = 0;
 + }

   nested_free_all_saved_vmcss(vmx);
  }
 @@ -5829,6 +5871,233 @@ static void vmx_set_supported_cpuid(u32
  }

  /*
 + * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
 + * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function merges it
 + * with L0's requirements for its guest (a.k.a. vmcs01), so we can run the L2
 + * guest in a way that will both be appropriate to L1's requests, and our
 + * needs. In addition to modifying the active vmcs (which is vmcs02), this
 + * function also has additional necessary side-effects, like setting various
 + * vcpu-arch fields.
 + */
 +static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 +{
 + struct vcpu_vmx *vmx = to_vmx(vcpu);
 + u32 exec_control;
 +
 + vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector);
 + vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector);
 + vmcs_write16(GUEST_SS_SELECTOR, vmcs12->guest_ss_selector);
 + vmcs_write16(GUEST_DS_SELECTOR, vmcs12->guest_ds_selector);
 + vmcs_write16(GUEST_FS_SELECTOR, vmcs12->guest_fs_selector);
 + vmcs_write16(GUEST_GS_SELECTOR, vmcs12->guest_gs_selector);
 + vmcs_write16(GUEST_LDTR_SELECTOR, vmcs12->guest_ldtr_selector);
 + vmcs_write16(GUEST_TR_SELECTOR, vmcs12->guest_tr_selector);
 + vmcs_write32(GUEST_ES_LIMIT, vmcs12->guest_es_limit);
 + vmcs_write32(GUEST_CS_LIMIT, vmcs12->guest_cs_limit);
 + 

RE: [PATCH 08/31] nVMX: Fix local_vcpus_link handling

2011-05-24 Thread Tian, Kevin
 From: Nadav Har'El [mailto:n...@math.technion.ac.il]
 Sent: Tuesday, May 24, 2011 3:57 PM
 
 On Tue, May 24, 2011, Tian, Kevin wrote about RE: [PATCH 08/31] nVMX: Fix
 local_vcpus_link handling:
   +static inline void loaded_vmcs_init(struct loaded_vmcs *loaded_vmcs)
   +{
   + vmcs_clear(loaded_vmcs->vmcs);
   + loaded_vmcs->cpu = -1;
   + loaded_vmcs->launched = 0;
   +}
   +
 
  call it vmcs_init instead since you now remove original vmcs_init 
  invocation,
  which more reflect the necessity of adding VMCLEAR for a new vmcs?
 
 The best name for this function, I think, would have been loaded_vmcs_clear,
 because this function isn't necessarily used to init - it's also called to
 VMCLEAR an old vmcs (and flush its content back to memory) - in that sense
 it is definitely not a vmcs_init.
 
 Unfortunately, I already have a whole chain of functions with this name :(
 the existing loaded_vmcs_clear() does an IPI to the CPU which has this VMCS
 loaded, and causes it to run __loaded_vmcs_clear(), which in turn calls
 the above loaded_vmcs_init(). I wish I could call all three functions
 loaded_vmcs_clear(), but of course I can't. If anyone reading this has a good
 suggestion on how to name these three functions, please let me know.

how about loaded_vmcs_reset?

 
   +static void __loaded_vmcs_clear(void *arg)
{
   - struct vcpu_vmx *vmx = arg;
   + struct loaded_vmcs *loaded_vmcs = arg;
 int cpu = raw_smp_processor_id();
  
   - if (vmx->vcpu.cpu == cpu)
   - vmcs_clear(vmx->vmcs);
   - if (per_cpu(current_vmcs, cpu) == vmx->vmcs)
   + if (loaded_vmcs->cpu != cpu)
   + return; /* cpu migration can race with cpu offline */
 
  what do you mean by cpu migration here? why does 'cpu offline'
  matter here regarding to the cpu change?
 
 __loaded_vmcs_clear() is typically called in one of two cases: cpu migration
 means that a guest that used to run on one CPU, and had its VMCS loaded
 there,
 suddenly needs to run on a different CPU, so we need to clear the VMCS on
 the old CPU. cpu offline means that we want to take a certain CPU offline,
 and before we do that we should VMCLEAR all VMCSs which were loaded on it.

So here you need to explicitly differentiate a vcpu from a real cpu. In the 1st
case it's just 'vcpu migration', and in the latter it's the real 'cpu offline'.
'cpu migration' is generally a RAS feature in the mission-critical world. :-)

 
 The (vmx->cpu.cpu != cpu) case in __loaded_vmcs_clear should ideally never
 happen: In the cpu offline path, we only call it for the loaded_vmcss which
 we know for sure are loaded on the current cpu. In the cpu migration path,
 loaded_vmcs_clear runs __loaded_vmcs_clear on the right CPU, which ensures
 that
 equality.
 
 But, there can be a race condition (this was actually explained to me a while
 back by Avi - I never seen this happening in practice): Imagine that cpu
 migration calls loaded_vmcs_clear, which tells the old cpu (via IPI) to
 VMCLEAR this vmcs. But before that old CPU gets a chance to act on that IPI,
 a decision is made to take it offline, and all loaded_vmcs loaded on it
 (including the one in question) are cleared. When that CPU acts on this IPI,
 it notices that vmx->cpu.cpu == -1, i.e., != cpu, so it doesn't need to do
 anything (in the new version of the code, I made this more explicit, by
 returning immediately in this case).

The reverse also holds true. Right between the point where cpu_offline hits
a loaded_vmcs and the point where it calls __loaded_vmcs_clear, it's possible
that the vcpu is migrated to another cpu, and it's likely that the migration
path (vmx_vcpu_load) has invoked loaded_vmcs_clear but hasn't yet deleted this
vmcs from the old cpu's linked list. This way, later, when __loaded_vmcs_clear
is invoked on the offlined cpu, there's still a chance to observe cpu as -1.

 
 At least this is the theory. As I said, I didn't see this problem in practice
 (unsurprising, since I never offlined any CPU). Maybe Avi or someone else can
 comment more about this (vmx->cpu.cpu == cpu) check, which existed before
 my patch - in __vcpu_clear().

I agree this check is necessary, but I just want you to make the comment clear
with the right terms.

 
   +static void free_loaded_vmcs(struct loaded_vmcs *loaded_vmcs)
   +{
   + if (!loaded_vmcs->vmcs)
   + return;
   + loaded_vmcs_clear(loaded_vmcs);
   + free_vmcs(loaded_vmcs->vmcs);
   + loaded_vmcs->vmcs = NULL;
   +}
 
  prefer to not do cleanup work through loaded_vmcs since it's just a pointer
  to a loaded vmcs structure. Though you can carefully arrange the nested
  vmcs cleanup happening before it, it's not very clean and a bit error prone
  simply from this function itself. It's clearer to directly cleanup vmcs01, 
  and
  if you want an assertion could be added to make sure loaded_vmcs doesn't
  point to any stale vmcs02 structure after nested cleanup step.
 
 I'm afraid I didn't understand what you meant here... Basically, this
 free_loaded_vmcs() is just a shortcut for loaded_vmcs_clear

RE: [PATCH 18/31] nVMX: Implement VMLAUNCH and VMRESUME

2011-05-24 Thread Tian, Kevin
 From: Nadav Har'El
 Sent: Tuesday, May 17, 2011 3:53 AM
 
 Implement the VMLAUNCH and VMRESUME instructions, allowing a guest
 hypervisor to run its own guests.
 
 This patch does not include some of the necessary validity checks on
 vmcs12 fields before the entry. These will appear in a separate patch
 below.
 
 Signed-off-by: Nadav Har'El n...@il.ibm.com
 ---
  arch/x86/kvm/vmx.c |   84
 +--
  1 file changed, 82 insertions(+), 2 deletions(-)
 
 --- .before/arch/x86/kvm/vmx.c2011-05-16 22:36:49.0 +0300
 +++ .after/arch/x86/kvm/vmx.c 2011-05-16 22:36:49.0 +0300
 @@ -347,6 +347,9 @@ struct nested_vmx {
   /* vmcs02_list cache of VMCSs recently used to run L2 guests */
   struct list_head vmcs02_pool;
   int vmcs02_num;
 +
 + /* Saving the VMCS that we used for running L1 */
 + struct saved_vmcs saved_vmcs01;
   u64 vmcs01_tsc_offset;
   /*
* Guest pages referred to in vmcs02 with host-physical pointers, so
 @@ -4668,6 +4671,8 @@ static void nested_free_all_saved_vmcss(
   kfree(item);
   }
   vmx-nested.vmcs02_num = 0;
 + if (is_guest_mode(&vmx->vcpu))
 + nested_free_saved_vmcs(vmx, &vmx->nested.saved_vmcs01);
  }
 
  /* Get a vmcs02 for the current vmcs12. */
 @@ -4959,6 +4964,21 @@ static int handle_vmclear(struct kvm_vcp
   return 1;
  }
 
 +static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch);
 +
 +/* Emulate the VMLAUNCH instruction */
 +static int handle_vmlaunch(struct kvm_vcpu *vcpu)
 +{
 + return nested_vmx_run(vcpu, true);
 +}
 +
 +/* Emulate the VMRESUME instruction */
 +static int handle_vmresume(struct kvm_vcpu *vcpu)
 +{
 +
 + return nested_vmx_run(vcpu, false);
 +}
 +
  enum vmcs_field_type {
   VMCS_FIELD_TYPE_U16 = 0,
   VMCS_FIELD_TYPE_U64 = 1,
 @@ -5239,11 +5259,11 @@ static int (*kvm_vmx_exit_handlers[])(st
   [EXIT_REASON_INVLPG]  = handle_invlpg,
   [EXIT_REASON_VMCALL]  = handle_vmcall,
   [EXIT_REASON_VMCLEAR] = handle_vmclear,
 - [EXIT_REASON_VMLAUNCH]= handle_vmx_insn,
 + [EXIT_REASON_VMLAUNCH]= handle_vmlaunch,
   [EXIT_REASON_VMPTRLD] = handle_vmptrld,
   [EXIT_REASON_VMPTRST] = handle_vmptrst,
   [EXIT_REASON_VMREAD]  = handle_vmread,
 - [EXIT_REASON_VMRESUME]= handle_vmx_insn,
 + [EXIT_REASON_VMRESUME]= handle_vmresume,
   [EXIT_REASON_VMWRITE] = handle_vmwrite,
   [EXIT_REASON_VMOFF]   = handle_vmoff,
   [EXIT_REASON_VMON]= handle_vmon,
 @@ -6129,6 +6149,66 @@ static void nested_maintain_per_cpu_list
   }
  }
 
 +/*
 + * nested_vmx_run() handles a nested entry, i.e., a VMLAUNCH or
 VMRESUME on L1
 + * for running an L2 nested guest.
 + */
 +static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
 +{
 + struct vmcs12 *vmcs12;
 + struct vcpu_vmx *vmx = to_vmx(vcpu);
 + int cpu;
 + struct saved_vmcs *saved_vmcs02;
 +
 + if (!nested_vmx_check_permission(vcpu))
 + return 1;
 + skip_emulated_instruction(vcpu);
 +
 + vmcs12 = get_vmcs12(vcpu);
 +
 + enter_guest_mode(vcpu);
 +
  + vmx->nested.vmcs01_tsc_offset = vmcs_read64(TSC_OFFSET);
 +
 + /*
 +  * Switch from L1's VMCS (vmcs01), to L2's VMCS (vmcs02). Remember
 +  * vmcs01, on which CPU it was last loaded, and whether it was launched
 +  * (we need all these values next time we will use L1). Then recall
 +  * these values from the last time vmcs02 was used.
 +  */
 + saved_vmcs02 = nested_get_current_vmcs02(vmx);
 + if (!saved_vmcs02)
 + return -ENOMEM;
 +
 + cpu = get_cpu();
  + vmx->nested.saved_vmcs01.vmcs = vmx->vmcs;
  + vmx->nested.saved_vmcs01.cpu = vcpu->cpu;
  + vmx->nested.saved_vmcs01.launched = vmx->launched;
  + vmx->vmcs = saved_vmcs02->vmcs;
  + vcpu->cpu = saved_vmcs02->cpu;

this may be another valid reason for your check on cpu_online in your
latest [08/31] local_vcpus_link fix, since cpu may be offlined after
this assignment. :-)

  + vmx->launched = saved_vmcs02->launched;
  +
  + nested_maintain_per_cpu_lists(vmx,
  + saved_vmcs02, &vmx->nested.saved_vmcs01);
 +
 + vmx_vcpu_put(vcpu);
 + vmx_vcpu_load(vcpu, cpu);
  + vcpu->cpu = cpu;
 + put_cpu();
 +
  + vmcs12->launch_state = 1;
 +
 + prepare_vmcs02(vcpu, vmcs12);

Since prepare_vmcs02 may fail, add a check here and move the launch_state
assignment after its success?
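
Something like this is what I mean (just a sketch, assuming prepare_vmcs02()
is changed to return an error code; the names are the ones used in this patch):

	if (prepare_vmcs02(vcpu, vmcs12))
		return 1;	/* or whatever failure path we settle on */

	vmcs12->launch_state = 1;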

Thanks
Kevin

 +
 + /*
 +  * Note no nested_vmx_succeed or nested_vmx_fail here. At this point
 +  * we are no longer running L1, and VMLAUNCH/VMRESUME has not yet
 +  * returned as far as L1 is concerned. It will only return (and set
 +  * the success flag) when L2 exits (see nested_vmx_vmexit()).
 +  */
 + return 

RE: [PATCH 17/31] nVMX: Prepare vmcs02 from vmcs01 and vmcs12

2011-05-24 Thread Tian, Kevin
 From: Nadav Har'El
 Sent: Tuesday, May 24, 2011 5:19 PM
 
 On Tue, May 24, 2011, Tian, Kevin wrote about RE: [PATCH 17/31] nVMX:
 Prepare vmcs02 from vmcs01 and vmcs12:
   +static inline unsigned long guest_readable_cr4(struct vmcs12 *fields)
   +{
   + return (fields->guest_cr4 & ~fields->cr4_guest_host_mask) |
   +        (fields->cr4_read_shadow & fields->cr4_guest_host_mask);
   +}
   +
 
  will guest_ prefix look confusing here? The 'guest' has a broad range which
  makes the above two functions look like they can be used in the non-nested
  case. Should we stick to a nested_ prefix for nested-specific facilities?
 
 I don't know, I thought it made calls like
 
   vmcs_writel(CR0_READ_SHADOW, guest_readable_cr0(vmcs12));
 
 readable, and the comments (and the parameters) make it obvious it's for
 nested only.
 
 I now renamed these functions nested_read_cr0(), nested_read_cr4() - I hope
 you like these names better.

yes.

 
   + if (is_guest_mode(&vmx->vcpu))
   + vmx->vcpu.arch.cr4_guest_owned_bits =
   + ~get_vmcs12(&vmx->vcpu)->cr4_guest_host_mask;
 
  why not is_nested_mode()? :-P
 
 I assume you're wondering why the function is called is_guest_mode(), and
 not is_nested_mode()?

yes

 
 This name was chosen by Avi Kivity in November last year, for the function
 previously introduced by Joerg Roedel. My original code (before Joerg added
 this function to x86.c) indeed used the term nested_mode, not
 guest_mode.
 
 In January, I pointed to the possibility of confusion between the new
 is_guest_mode() and other things called guest mode, and Avi Kivity said
 he will rename it to is_nested_guest() - see
 http://lkml.indiana.edu/hypermail/linux/kernel/1101.1/01418.html
 But as you can see, he never did this renaming.
 
 That being said, after half a year, I got used to the name is_guest_mode(),
 and am no longer convinced it should be changed. It checks whether the vcpu
 (not the underlying CPU) is in Intel-SDM-terminology guest mode. Just like
 is_long_mode() checks if the vcpu is in long mode. So I'm fine with leaving
 its current name.

well, it's a small issue, and I'm fine with leaving it though I don't like 
'guest' here. :-)

 
   +static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12
 *vmcs12)
   +{
 ...
   + if (!vmx->rdtscp_enabled)
   + exec_control &= ~SECONDARY_EXEC_RDTSCP;
   + /* Take the following fields only from vmcs12 */
   + exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
   + if (nested_cpu_has(vmcs12,
   + CPU_BASED_ACTIVATE_SECONDARY_CONTROLS))
   + exec_control |= vmcs12->secondary_vm_exec_control;
 
  should this 2nd exec_control be merged in a clear case-by-case flavor?
 
  what about L0 sets the virtualize x2APIC bit while L1 doesn't?
 
  Or what about L0 disables EPT while L1 sets it?
 
  I think it's better to scrutinize every 2nd exec_control feature with a
  clear policy:
  - whether we want to use the stricter policy, which is only set when both L0
  and L1 set it
  - whether we want to use L1's setting absolutely regardless of L0's setting,
  like what you did for virtualize APIC access
 
 Please note that most of the examples you give cannot happen in practice,
 because we tell L1 (via MSR) which features it is allowed to use, and we
 fail entry if it tries to use disallowed features (before ever reaching
 the merge code you're commenting on). So we don't allow L1, for example,
 to use the EPT feature (and when nested-EPT support is added, we won't
 allow L1 to use EPT if L0 didn't). The general thinking was that for most
 fields that we do explicitly allow, OR is the right choice.

This really depends on the particular control bit. To achieve the strictest
setting between L0/L1, sometimes you want to use AND and sometimes you
want to use OR.

From a design p.o.v. it's better not to bake such an implicit assumption into
other places; just make it clean and correct. Also, your reasoning doesn't cover
the case where L0 sets some bits which are not exposed to L1 via the MSR. For
example, as I said earlier, what about L0 setting virtualize x2APIC mode while
it's not enabled by, or not exposed to, L1? With OR, you then also enable this
mode for L2 unconditionally, while L1 has no logic to handle it.

I'd like to see a clean policy for the known control bits here, even if it is a
strict policy that incurs more VM-exits, which can be optimized in the future.
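
For instance, this is the kind of per-bit merge I have in mind (a rough sketch
only, reusing helpers from this patch; which bit goes into which class is
exactly the decision I'd like to see spelled out):

	u32 l0_ctl = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
	u32 l1_ctl = nested_cpu_has(vmcs12, CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) ?
			vmcs12->secondary_vm_exec_control : 0;
	/* example classification only */
	u32 strict_bits  = SECONDARY_EXEC_ENABLE_EPT;               /* need L0 && L1 */
	u32 l1_only_bits = SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; /* take from L1  */
	u32 exec_control;

	exec_control = (l0_ctl & ~(strict_bits | l1_only_bits)) |   /* L0's own value */
		       (l0_ctl & l1_ctl & strict_bits) |            /* AND policy     */
		       (l1_ctl & l1_only_bits);                     /* L1-only policy */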

 
 I'll add this to my bugzilla, and think about it again later.


Thanks
Kevin


RE: [PATCH 18/31] nVMX: Implement VMLAUNCH and VMRESUME

2011-05-24 Thread Tian, Kevin
 From: Nadav Har'El [mailto:n...@math.technion.ac.il]
 Sent: Tuesday, May 24, 2011 5:45 PM
 
 On Tue, May 24, 2011, Tian, Kevin wrote about RE: [PATCH 18/31] nVMX:
 Implement VMLAUNCH and VMRESUME:
   + /*
   +  * Switch from L1's VMCS (vmcs01), to L2's VMCS (vmcs02).
 Remember
   +  * vmcs01, on which CPU it was last loaded, and whether it was
 launched
   +  * (we need all these values next time we will use L1). Then recall
   +  * these values from the last time vmcs02 was used.
   +  */
   + saved_vmcs02 = nested_get_current_vmcs02(vmx);
   + if (!saved_vmcs02)
   + return -ENOMEM;
   +
   + cpu = get_cpu();
   + vmx->nested.saved_vmcs01.vmcs = vmx->vmcs;
   + vmx->nested.saved_vmcs01.cpu = vcpu->cpu;
   + vmx->nested.saved_vmcs01.launched = vmx->launched;
   + vmx->vmcs = saved_vmcs02->vmcs;
   + vcpu->cpu = saved_vmcs02->cpu;
 
  this may be another valid reason for your check on cpu_online in your
  latest [08/31] local_vcpus_link fix, since cpu may be offlined after
  this assignment. :-)
 
 I believe that wrapping this part of the code with get_cpu()/put_cpu()
 protected me from these kinds of race conditions.

you're right.

 
 By the way, please note that this part of the code was changed after my
 latest loaded_vmcs overhaul. It now looks like this:
 
   vmcs02 = nested_get_current_vmcs02(vmx);
   if (!vmcs02)
   return -ENOMEM;
 
   cpu = get_cpu();
    vmx->loaded_vmcs = vmcs02;
    vmx_vcpu_put(vcpu);
    vmx_vcpu_load(vcpu, cpu);
    vcpu->cpu = cpu;
   put_cpu();
 
 (if Avi gives me the green light, I'll send the entire, up-to-date, patch set
 again).

Generally your new patch looks good.

 
   + vmcs12->launch_state = 1;
   +
   + prepare_vmcs02(vcpu, vmcs12);
 
  Since prepare_vmcs02 may fail, add a check here and move the launch_state
  assignment after its success?
 
 prepare_vmcs02() cannot fail. All the checks that need to be done on vmcs12
 are done before calling it, in nested_vmx_run().
 
 Currently, there's a single case where prepare_vmcs02 fails when it fails
 to access apic_access_addr memory. This is wrong - the check should have
 been
 done earlier. I'll fix that, and make prepare_vmcs02() void.
 

then no problem, as long as you keep this choice clear.

Thanks
Kevin


RE: [PATCH 08/31] nVMX: Fix local_vcpus_link handling

2011-05-24 Thread Tian, Kevin
 From: Avi Kivity [mailto:a...@redhat.com]
 Sent: Tuesday, May 24, 2011 7:06 PM
 
 On 05/24/2011 11:20 AM, Tian, Kevin wrote:
  
The (vmx->cpu.cpu != cpu) case in __loaded_vmcs_clear should ideally
 never
happen: In the cpu offline path, we only call it for the loaded_vmcss 
   which
we know for sure are loaded on the current cpu. In the cpu migration
 path,
loaded_vmcs_clear runs __loaded_vmcs_clear on the right CPU, which
 ensures
that
equality.
  
But, there can be a race condition (this was actually explained to me a
 while
back by Avi - I never seen this happening in practice): Imagine that cpu
migration calls loaded_vmcs_clear, which tells the old cpu (via IPI) to
VMCLEAR this vmcs. But before that old CPU gets a chance to act on that
 IPI,
a decision is made to take it offline, and all loaded_vmcs loaded on it
(including the one in question) are cleared. When that CPU acts on this
 IPI,
 it notices that vmx->cpu.cpu == -1, i.e., != cpu, so it doesn't need to do
anything (in the new version of the code, I made this more explicit, by
returning immediately in this case).
 
  the reverse also holds true. Right between the point where cpu_offline hits
  a loaded_vmcs and the point where it calls __loaded_vmcs_clear, it's 
  possible
  that the vcpu is migrated to another cpu, and it's likely that migration 
  path
  (vmx_vcpu_load) has invoked loaded_vmcs_clear but hasn't delete this vmcs
  from old cpu's linked list. This way later when __loaded_vmcs_clear is
  invoked on the offlined cpu, there's still chance to observe cpu as -1.
 
 I don't think it's possible.  Both calls are done with interrupts disabled.

If that's the case then there's another potential issue: a deadlock may happen
when calling smp_call_function_single with interrupts disabled.

Thanks
Kevin


RE: [PATCH 08/31] nVMX: Fix local_vcpus_link handling

2011-05-24 Thread Tian, Kevin
 From: Avi Kivity [mailto:a...@redhat.com]
 Sent: Tuesday, May 24, 2011 7:27 PM
 
 On 05/24/2011 02:20 PM, Tian, Kevin wrote:
I don't think it's possible.  Both calls are done with interrupts 
   disabled.
 
  If that's the case then there's another potential issue. Deadlock may happen
  when calling smp_call_function_single with interrupt disabled.
 
 We don't do that.  vcpu migration calls vcpu_clear() with interrupts
 enabled, which then calls smp_call_function_single(), which calls
 __vcpu_clear() with interrupts disabled.  vmclear_local_vcpus() is
 called from interrupts disabled (and calls __vcpu_clear() directly).
 

OK, that's clear to me now. 

Thanks
Kevin


RE: [PATCH 08/31] nVMX: Fix local_vcpus_link handling

2011-05-24 Thread Tian, Kevin
 From: Avi Kivity [mailto:a...@redhat.com]
 Sent: Tuesday, May 24, 2011 7:37 PM
 
 On 05/24/2011 02:30 PM, Tian, Kevin wrote:
  
We don't do that.  vcpu migration calls vcpu_clear() with interrupts
enabled, which then calls smp_call_function_single(), which calls
__vcpu_clear() with interrupts disabled.  vmclear_local_vcpus() is
called from interrupts disabled (and calls __vcpu_clear() directly).
  
 
  OK, that's clear to me now.
 
 Are there still open issues about the patch?
 
 (Nadav, please post patches in the future in new threads so they're
 easier to find)
 

I'm fine with this patch except that Nadav needs to clarify the comment
in __loaded_vmcs_clear (regarding the 'cpu migration' and 'cpu offline' part,
which I replied to in another mail).

Thanks
Kevin


RE: [PATCH 20/31] nVMX: Exiting from L2 to L1

2011-05-24 Thread Tian, Kevin
 From: Nadav Har'El
 Sent: Tuesday, May 17, 2011 3:54 AM
 
 This patch implements nested_vmx_vmexit(), called when the nested L2 guest
 exits and we want to run its L1 parent and let it handle this exit.
 
 Note that this will not necessarily be called on every L2 exit. L0 may decide
 to handle a particular exit on its own, without L1's involvement; In that
 case, L0 will handle the exit, and resume running L2, without running L1 and
 without calling nested_vmx_vmexit(). The logic for deciding whether to handle
 a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
 will appear in a separate patch below.
 
 Signed-off-by: Nadav Har'El n...@il.ibm.com
 ---
  arch/x86/kvm/vmx.c |  257
 +++
  1 file changed, 257 insertions(+)
 
 --- .before/arch/x86/kvm/vmx.c2011-05-16 22:36:49.0 +0300
 +++ .after/arch/x86/kvm/vmx.c 2011-05-16 22:36:49.0 +0300
 @@ -6203,6 +6203,263 @@ static int nested_vmx_run(struct kvm_vcp
   return 1;
  }
 
 +/*
 + * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date
 + * because L2 may have changed some cr0 bits directly (see
 CR0_GUEST_HOST_MASK)
 + * without L0 trapping the change and updating vmcs12.
 + * This function returns the value we should put in vmcs12.guest_cr0. It's 
 not
 + * enough to just return the current (vmcs02) GUEST_CR0 - that may not be
 the
 + * guest cr0 that L1 thought it was giving its L2 guest; It is possible that
 + * L1 wished to allow its guest to set some cr0 bit directly, but we (L0) 
 asked
 + * to trap this change and instead set just the read shadow bit. If this is 
 the
 + * case, we need to copy these read-shadow bits back to vmcs12.guest_cr0,
 where
 + * L1 believes they already are.
 + */
 +static inline unsigned long
 +vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 +{
 + /*
 +  * As explained above, we take a bit from GUEST_CR0 if we allowed the
 +  * guest to modify it untrapped (vcpu->arch.cr0_guest_owned_bits), or
 +  * if we did trap it - if we did so because L1 asked to trap this bit
 +  * (vmcs12->cr0_guest_host_mask). Otherwise (bits we trapped but L1
 +  * didn't expect us to trap) we read from CR0_READ_SHADOW.
 +  */
 + unsigned long guest_cr0_bits =
 + vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask;
 + return (vmcs_readl(GUEST_CR0) & guest_cr0_bits) |
 +        (vmcs_readl(CR0_READ_SHADOW) & ~guest_cr0_bits);
 +}

Hi, Nadav,

Not sure whether I get the above operation wrong, but at a glance it doesn't
look exactly correct to me. Say a bit is set in both L0's and L1's
cr0_guest_host_mask. In that case the bit from vmcs12's GUEST_CR0 resides in
vmcs02's CR0_READ_SHADOW, yet the operation above will return the vmcs02
GUEST_CR0 bit instead.

Instead of constructing vmcs12's GUEST_CR0 completely from vmcs02's GUEST_CR0,
why not just update the bits which can be altered, while keeping the rest of
the bits from vmcs12's GUEST_CR0? Say something like:

	vmcs12->guest_cr0 &= vmcs12->cr0_guest_host_mask; /* keep unchanged bits */
	vmcs12->guest_cr0 |= (vmcs_readl(GUEST_CR0) & vcpu->arch.cr0_guest_owned_bits) |
		(vmcs_readl(CR0_READ_SHADOW) &
		 ~(vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask));

Thanks
Kevin


RE: [PATCH] new version of loaded_vmcs patch

2011-05-24 Thread Tian, Kevin
 From: Nadav Har'El
 Sent: Tuesday, May 24, 2011 8:26 PM
 
 Hi Avi, here is a updated version of the loaded_vmcs patch which you asked me
 to send before the rest of the nvmx patches.
 
 Please let me know when you'd like me to send you an updated version of
 the rest of the patches.
 
 
 Subject: [PATCH 01/31] nVMX: Keep list of loaded VMCSs, instead of vcpus.
 
 In VMX, before we bring down a CPU we must VMCLEAR all VMCSs loaded on it
 because (at least in theory) the processor might not have written all of its
 content back to memory. Since a patch from June 26, 2008, this is done using
 a per-cpu vcpus_on_cpu linked list of vcpus loaded on each CPU.
 
 The problem is that with nested VMX, we no longer have the concept of a
 vcpu being loaded on a cpu: A vcpu has multiple VMCSs (one for L1, a pool for
 L2s), and each of those may have been last loaded on a different cpu.
 
 So instead of linking the vcpus, we link the VMCSs, using a new structure
 loaded_vmcs. This structure contains the VMCS, and the information
 pertaining
 to its loading on a specific cpu (namely, the cpu number, and whether it
 was already launched on this cpu once). In nested we will also use the same
 structure to hold L2 VMCSs, and vmx->loaded_vmcs is a pointer to the
 currently active VMCS.
 
 Signed-off-by: Nadav Har'El n...@il.ibm.com

Acked-by: Kevin Tian kevin.t...@intel.com

Thanks
Kevin

 ---
  arch/x86/kvm/vmx.c |  150 ---
  1 file changed, 86 insertions(+), 64 deletions(-)
 
 --- .before/arch/x86/kvm/vmx.c2011-05-24 15:12:22.0 +0300
 +++ .after/arch/x86/kvm/vmx.c 2011-05-24 15:12:22.0 +0300
 @@ -116,6 +116,18 @@ struct vmcs {
   char data[0];
  };
 
 +/*
 + * Track a VMCS that may be loaded on a certain CPU. If it is (cpu!=-1), also
 + * remember whether it was VMLAUNCHed, and maintain a linked list of all
 VMCSs
 + * loaded on this CPU (so we can clear them if the CPU goes down).
 + */
 +struct loaded_vmcs {
 + struct vmcs *vmcs;
 + int cpu;
 + int launched;
 + struct list_head loaded_vmcss_on_cpu_link;
 +};
 +
  struct shared_msr_entry {
   unsigned index;
   u64 data;
 @@ -124,9 +136,7 @@ struct shared_msr_entry {
 
  struct vcpu_vmx {
   struct kvm_vcpu   vcpu;
 - struct list_head  local_vcpus_link;
   unsigned long host_rsp;
 - int   launched;
   u8fail;
   u8cpl;
   bool  nmi_known_unmasked;
 @@ -140,7 +150,14 @@ struct vcpu_vmx {
   u64   msr_host_kernel_gs_base;
   u64   msr_guest_kernel_gs_base;
  #endif
 - struct vmcs  *vmcs;
 + /*
 +  * loaded_vmcs points to the VMCS currently used in this vcpu. For a
 +  * non-nested (L1) guest, it always points to vmcs01. For a nested
 +  * guest (L2), it points to a different VMCS.
 +  */
 + struct loaded_vmcsvmcs01;
 + struct loaded_vmcs   *loaded_vmcs;
 + bool  __launched; /* temporary, used in
 vmx_vcpu_run */
   struct msr_autoload {
   unsigned nr;
   struct vmx_msr_entry guest[NR_AUTOLOAD_MSRS];
 @@ -200,7 +217,11 @@ static int vmx_set_tss_addr(struct kvm *
 
  static DEFINE_PER_CPU(struct vmcs *, vmxarea);
  static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
 -static DEFINE_PER_CPU(struct list_head, vcpus_on_cpu);
 +/*
 + * We maintain a per-CPU linked-list of VMCS loaded on that CPU. This is
 needed
 + * when a CPU is brought down, and we need to VMCLEAR all VMCSs loaded
 on it.
 + */
 +static DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
  static DEFINE_PER_CPU(struct desc_ptr, host_gdt);
 
  static unsigned long *vmx_io_bitmap_a;
 @@ -501,6 +522,13 @@ static void vmcs_clear(struct vmcs *vmcs
  vmcs, phys_addr);
  }
 
 +static inline void loaded_vmcs_init(struct loaded_vmcs *loaded_vmcs)
 +{
 + vmcs_clear(loaded_vmcs->vmcs);
 + loaded_vmcs->cpu = -1;
 + loaded_vmcs->launched = 0;
 +}
 +
  static void vmcs_load(struct vmcs *vmcs)
  {
   u64 phys_addr = __pa(vmcs);
 @@ -514,25 +542,24 @@ static void vmcs_load(struct vmcs *vmcs)
  vmcs, phys_addr);
  }
 
 -static void __vcpu_clear(void *arg)
 +static void __loaded_vmcs_clear(void *arg)
  {
 - struct vcpu_vmx *vmx = arg;
 + struct loaded_vmcs *loaded_vmcs = arg;
   int cpu = raw_smp_processor_id();
 
 - if (vmx->vcpu.cpu == cpu)
 - vmcs_clear(vmx->vmcs);
 - if (per_cpu(current_vmcs, cpu) == vmx->vmcs)
 + if (loaded_vmcs->cpu != cpu)
 + return; /* vcpu migration can race with cpu offline */
 + if (per_cpu(current_vmcs, cpu) == loaded_vmcs->vmcs)
   per_cpu(current_vmcs, cpu) = NULL;
 - list_del(&vmx->local_vcpus_link);
 - vmx->vcpu.cpu = -1;
 - vmx->launched = 0;
 + 

RE: [PATCH 20/31] nVMX: Exiting from L2 to L1

2011-05-24 Thread Tian, Kevin
 From: Nadav Har'El [mailto:n...@math.technion.ac.il]
 Sent: Tuesday, May 24, 2011 9:43 PM
 
 On Tue, May 24, 2011, Tian, Kevin wrote about RE: [PATCH 20/31] nVMX:
 Exiting from L2 to L1:
   +vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
   +{
   + /*
   +  * As explained above, we take a bit from GUEST_CR0 if we allowed the
   +  * guest to modify it untrapped (vcpu->arch.cr0_guest_owned_bits), or
   +  * if we did trap it - if we did so because L1 asked to trap this bit
   +  * (vmcs12->cr0_guest_host_mask). Otherwise (bits we trapped but L1
   +  * didn't expect us to trap) we read from CR0_READ_SHADOW.
   +  */
   + unsigned long guest_cr0_bits =
   + vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask;
   + return (vmcs_readl(GUEST_CR0) & guest_cr0_bits) |
   +        (vmcs_readl(CR0_READ_SHADOW) & ~guest_cr0_bits);
   +}
 
  Hi, Nadav,
 
  Not sure whether I get above operation wrong.
 
 This is one of the trickiest functions in nested VMX, which is why I added
 15 lines of comments (!) on just two statements of code.

I read the comment carefully, and the scenario I described is not covered there.

 
  But it looks not exactly correct to me
  in a glimpse. Say a bit set both in L0/L1's cr0_guest_host_mask. In such 
  case
 that
  bit from vmcs12_GUEST_CR0 resides in vmcs02_CR0_READ_SHADOW,
 however above
  operation will make vmcs02_GUEST_CR0 bit returned instead.
 
 This behavior is correct: If a bit is set in L1's cr0_guest_host_mask (and
 in particular, if it is set in both L0's and L1's), we always exit to L1 when
 L2 changes this bit, and this bit cannot change while L2 is running, so
 naturally after the run vmcs02.guest_cr0 and vmcs12.guest_cr0 are still
  identical in that bit.

Are you sure this is the case? Preparing vmcs02 treats vmcs12.guest_cr0 exactly
as if L1 itself tried to update GUEST_CR0, which is why you use
vmx_set_cr0(vcpu, vmcs12->guest_cr0) in prepare_vmcs02. If L0 has a bit set in
L0's cr0_guest_host_mask, the corresponding bit of vmcs12.guest_cr0 will be
cached in vmcs02.cr0_read_shadow anyway. This is not related to whether L2
changes that bit.

IOW, I disagree that if L0 and L1 set the same bit in cr0_guest_host_mask, then
the bit is identical in vmcs02.guest_cr0 and vmcs12.guest_cr0, because L1
has no permission to set its bit effectively in this case.

 Copying that bit from vmcs02_CR0_READ_SHADOW, like you suggested, would
 be
 completely wrong in this case: When L1 set a bit in cr0_guest_host_mask,
 the vmcs02-cr0_read_shadow is vmcs12-cr0_read_shadow (see
 nested_read_cr0),
 and is just a pretense that L1 set up for L2 - it is NOT the real bit of
 guest_cr0, so copying it into guest_cr0 would be wrong.

So I'm talking about preserving that bit from vmcs12.guest_cr0 when it's set
in vmcs12.cr0_guest_host_mask, which is the natural output.

 
 Note that this function is completely different from nested_read_cr0 (the
 new name), which behaves similar to what you suggested but serves a
 completely
 different (and in some respect, opposite) function.
 
 I think my comments in the code are clearer than what I just wrote here, so
 please take a look at them again, and let me know if you find any errors.
 
  Instead of constructing vmcs12_GUEST_CR0 completely from
 vmcs02_GUEST_CR0,
  why not just updating bits which can be altered while keeping the rest bits
 from
  vmcs12_GUEST_CR0? Say something like:
 
  	vmcs12->guest_cr0 &= vmcs12->cr0_guest_host_mask; /* keep unchanged bits */
  	vmcs12->guest_cr0 |= (vmcs_readl(GUEST_CR0) & vcpu->arch.cr0_guest_owned_bits) |
  		(vmcs_readl(CR0_READ_SHADOW) &
  		 ~(vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask));
 
 I guess I could do something like this, but do you think it's clearer?
 I don't. Behind all the details, my formula emphasises that MOST cr0 bits
 can be just copied from vmcs02 to vmcs12 as is - and we only have to do
 something strange for special bits - where L0 wanted to trap but L1 didn't.
 In your formula, it looks like there are 3 different cases instead of 2.

But my formula is clearer given that it sticks to the implication of
cr0_guest_host_mask: you only need to update the cr0 bits which can be modified
by L2 without a trap, while just keeping the rest.

 
 In any case, your formula is definitely not more correct, because the formulas
 are in fact equivalent - let me prove:
 
 If, instead of taking the unchanged bits (as you call them) from
 vmcs12-guest_cr0, you take them from vmcs02-guest_cr0 (you can,
 because they couldn't have changed), you end up with *exactly* the same
 formula I used. Here is the proof:
 
   yourformula =
  	(vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask) |
  	(vmcs_readl(GUEST_CR0) & vcpu->arch.cr0_guest_owned_bits) |
  	(vmcs_readl(CR0_READ_SHADOW) &
  	 ~(vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask))
  
  Now because of the unchanged bits,
  	(vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask) ==
  	(vmcs02->guest_cr0

RE: [PATCH 20/31] nVMX: Exiting from L2 to L1

2011-05-24 Thread Tian, Kevin
 From: Nadav Har'El
 Sent: Tuesday, May 17, 2011 3:54 AM
 
 This patch implements nested_vmx_vmexit(), called when the nested L2 guest
 exits and we want to run its L1 parent and let it handle this exit.
 
 Note that this will not necessarily be called on every L2 exit. L0 may decide
 to handle a particular exit on its own, without L1's involvement; In that
 case, L0 will handle the exit, and resume running L2, without running L1 and
 without calling nested_vmx_vmexit(). The logic for deciding whether to handle
 a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
 will appear in a separate patch below.
 
 Signed-off-by: Nadav Har'El n...@il.ibm.com

 +/*
 + * A part of what we need to when the nested L2 guest exits and we want to
 + * run its L1 parent, is to reset L1's guest state to the host state 
 specified
 + * in vmcs12.
 + * This function is to be called not only on normal nested exit, but also on
 + * a nested entry failure, as explained in Intel's spec, 3B.23.7 (VM-Entry
 + * Failures During or After Loading Guest State).
 + * This function should be called when the active VMCS is L1's (vmcs01).
 + */
 +void load_vmcs12_host_state(struct kvm_vcpu *vcpu, struct vmcs12
 *vmcs12)
 +{
 + if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER)
 + vcpu->arch.efer = vmcs12->host_ia32_efer;
 + if (vmcs12->vm_exit_controls & VM_EXIT_HOST_ADDR_SPACE_SIZE)
 + vcpu->arch.efer |= (EFER_LMA | EFER_LME);
 + else
 + vcpu->arch.efer &= ~(EFER_LMA | EFER_LME);
 + vmx_set_efer(vcpu, vcpu->arch.efer);
 +
 + if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PAT)
 + vmcs_write64(GUEST_IA32_PAT, vmcs12->host_ia32_pat);
 +
 + kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->host_rsp);
 + kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->host_rip);
 + /*
 +  * Note that calling vmx_set_cr0 is important, even if cr0 hasn't
 +  * actually changed, because it depends on the current state of
 +  * fpu_active (which may have changed).
 +  * Note that vmx_set_cr0 refers to efer set above.
 +  */
 + kvm_set_cr0(vcpu, vmcs12->host_cr0);
 + /*
 +  * If we did fpu_activate()/fpu_deactivate() during L2's run, we need
 +  * to apply the same changes to L1's vmcs. We just set cr0 correctly,
 +  * but we also need to update cr0_guest_host_mask and
 exception_bitmap.
 +  */
 + update_exception_bitmap(vcpu);
 + vcpu->arch.cr0_guest_owned_bits = (vcpu->fpu_active ? X86_CR0_TS : 0);
 + vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
 +
 + /*
 +  * Note that CR4_GUEST_HOST_MASK is already set in the original
 vmcs01
 +  * (KVM doesn't change it)- no reason to call set_cr4_guest_host_mask();
 +  */
 + vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
 + kvm_set_cr4(vcpu, vmcs12->host_cr4);
 +
 + /* shadow page tables on either EPT or shadow page tables */
 + kvm_set_cr3(vcpu, vmcs12->host_cr3);
 + kvm_mmu_reset_context(vcpu);
 +
 + if (enable_vpid) {
 + /*
 +  * Trivially support vpid by letting L2s share their parent
 +  * L1's vpid. TODO: move to a more elaborate solution, giving
 +  * each L2 its own vpid and exposing the vpid feature to L1.
 +  */
 + vmx_flush_tlb(vcpu);
 + }

How about the SYSENTER and PERF_GLOBAL_CTRL MSRs? At least a TODO comment
here would make the whole load process complete. :-)

Also, isn't it saner to update vmcs01's guest segment info based on vmcs12's
host segment info? Though you can assume the environment in L1 doesn't change
from VMLAUNCH/VMRESUME to the VMEXIT handler, it's architecturally cleaner
to load those segment fields according to L1's wishes.
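
For example, something along these lines (a sketch only; the GUEST_SYSENTER_*
fields of vmcs01 are what L1 will run with after the exit,
host_ia32_sysenter_cs is assumed alongside the esp/eip fields already in
vmcs12, and host_ia32_perf_global_ctrl would be a vmcs12 field to add if we
decide to honor VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL):

	vmcs_write32(GUEST_SYSENTER_CS, vmcs12->host_ia32_sysenter_cs);
	vmcs_writel(GUEST_SYSENTER_ESP, vmcs12->host_ia32_sysenter_esp);
	vmcs_writel(GUEST_SYSENTER_EIP, vmcs12->host_ia32_sysenter_eip);
	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL)
		vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL,
			     vmcs12->host_ia32_perf_global_ctrl);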

Thanks
Kevin


RE: [PATCH 07/31] nVMX: Introduce vmcs02: VMCS used to run L2

2011-05-23 Thread Tian, Kevin
 From: Nadav Har'El [mailto:n...@math.technion.ac.il]
 Sent: Sunday, May 22, 2011 3:23 PM
 
 Hi,
 
 On Sun, May 22, 2011, Tian, Kevin wrote about RE: [PATCH 07/31] nVMX:
 Introduce vmcs02: VMCS used to run L2:
  Here the vmcs02 being overridden may have been run on another processor
 before
  but is not vmclear-ed yet. When you resume this vmcs02 with new content on
 a
  separate processor, the risk of corruption exists.
 
 I still believe that my current code is correct (in this area). I'll try to
 explain it here and would be grateful if you could point to me the error (if
 there is one) in my logic:
 
 Nested_vmx_run() is our function which is switches from running L1 to L2
 (patch 18).
 
 This function starts by calling nested_get_current_vmcs02(), which gets us
 *some* vmcs to use for vmcs02. This may be a fresh new VMCS, or a
 recycled
 VMCS, some VMCS we've previously used to run some, potentially different L2
 guest on some, potentially different, CPU.
 nested_get_current_vmcs02() returns a saved_vmcs structure, which
 not only contains a VMCS, but also remembers on which (if any) cpu it is
 currently loaded (and whether it was VMLAUNCHed once on that cpu).
 
 The next thing that Nested_vmx_run() now does is to set up in the vcpu object
 the vmcs, cpu and launched fields according to what was returned above.
 
 Now it calls vmx_vcpu_load(). This standard KVM function checks if we're now
 running on a different CPU from the vcpu->cpu, and if it is a different one, it
 uses vcpu_clear() to VMCLEAR the vmcs on the CPU where it was last loaded
 (using an IPI). Only after it vmclears the VMCS on the old CPU, it can finally
 load the VMCS on the new CPU.
 
 Only now Nested_vmx_run() can call prepare_vmcs02, which starts
 VMWRITEing
 to this VMCS, and finally returns.
 

yes, you're correct. Previously I had just looked around 07/31 and raised the
above concern. Along with the nested_vmx_run flow you explained above, this
part is clear to me now. :-)

 P.S. Seeing that you're from Intel, maybe you can help me with a pointer:
 I found what appears to be a small error in the SDM - who can I report it to?
 

Let me ask for you.

Thanks
Kevin


RE: [PATCH 08/31] nVMX: Fix local_vcpus_link handling

2011-05-23 Thread Tian, Kevin
 From: Avi Kivity
 Sent: Monday, May 23, 2011 11:49 PM
 (regarding interrupts, I think we can do that work post-merge.  But I'd
 like to see Kevin's comments addressed)

My earlier comment has been addressed by Nadav with his explanation.

Thanks
Kevin


RE: [PATCH 07/31] nVMX: Introduce vmcs02: VMCS used to run L2

2011-05-23 Thread Tian, Kevin
 From: Nadav Har'El [mailto:n...@math.technion.ac.il]
 Sent: Sunday, May 22, 2011 4:30 PM
 
 Hi,
 
 On Fri, May 20, 2011, Tian, Kevin wrote about RE: [PATCH 07/31] nVMX:
 Introduce vmcs02: VMCS used to run L2:
  Possibly we can maintain the vmcs02 pool along with L1 VMCLEAR ops, which
  is similar to the hardware behavior regarding to cleared and launched state.
 
 If you set VMCS02_POOL_SIZE to a large size, and L1, like typical hypervisors,
 only keeps around a few VMCSs (and VMCLEARs the ones it will not use again),
 then we'll only have a few vmcs02: handle_vmclear() removes from the pool the
 vmcs02 that L1 explicitly told us it won't need again.

yes

 
   +struct saved_vmcs {
   + struct vmcs *vmcs;
   + int cpu;
   + int launched;
   +};
 
  saved looks a bit misleading here. It's simply a list of all active vmcs02
 tracked
  by kvm, isn't it?
 
 I have rewritten this part of the code, based on Avi's and Marcelo's requests,
 and the new name for this structure is loaded_vmcs, i.e., a structure
 describing where a VMCS was loaded.

great, I'll take a look at your new code.

Thanks
Kevin


RE: [PATCH 08/31] nVMX: Fix local_vcpus_link handling

2011-05-23 Thread Tian, Kevin
 From: Nadav Har'El
 Sent: Tuesday, May 24, 2011 2:51 AM
 
  +  vmcs_init(vmx->loaded_vmcs->vmcs);
  +  vmx->loaded_vmcs->cpu = -1;
  +  vmx->loaded_vmcs->launched = 0;
 
  Perhaps a loaded_vmcs_init() to encapsulate initialization of these
  three fields, you'll probably reuse it later.
 
 It's good you pointed this out, because it made me suddenly realise that I
 forgot to VMCLEAR the new vmcs02's I allocate. In practice it never made a
 difference, but better safe than sorry.

yes, that's what the spec requires. You need VMCLEAR on any new VMCS; it is
VMCLEAR that does the implementation-specific initialization of that VMCS
region.
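
i.e., something like this at vmcs02 allocation time (a sketch; the helper name
is hypothetical, while alloc_vmcs() and loaded_vmcs_init() are the existing/new
helpers from this patch):

	static struct vmcs *nested_alloc_vmcs02(struct loaded_vmcs *vmcs02)
	{
		vmcs02->vmcs = alloc_vmcs();
		if (!vmcs02->vmcs)
			return NULL;
		/* VMCLEAR it and mark it not loaded / not launched */
		loaded_vmcs_init(vmcs02);
		return vmcs02->vmcs;
	}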

 
 I had to restructure some of the code a bit to be able to properly use this
 new function (in 3 places - __loaded_vmcs_clear, nested_get_current_vmcs02,
 vmx_create_cpu).
 
  Please repost separately after the fix, I'd like to apply it before the
  rest of the series.
 
 I am adding a new version of this patch at the end of this mail.
 
  (regarding interrupts, I think we can do that work post-merge.  But I'd
  like to see Kevin's comments addressed)
 
 I replied to his comments. Done some of the things he asked, and asked for
 more info on why/where he believes the current code is incorrect where I
 didn't understand what problems he pointed to, and am now waiting for him
 to reply.

As I replied in another thread, I believe this has been explained clearly by 
Nadav.

 
 
 --- 8 -- 8 -- 8 -- 8 --- 8 ---
 
 Subject: [PATCH 01/31] nVMX: Keep list of loaded VMCSs, instead of vcpus.
 
 In VMX, before we bring down a CPU we must VMCLEAR all VMCSs loaded on it
 because (at least in theory) the processor might not have written all of its
 content back to memory. Since a patch from June 26, 2008, this is done using
 a per-cpu vcpus_on_cpu linked list of vcpus loaded on each CPU.
 
 The problem is that with nested VMX, we no longer have the concept of a
 vcpu being loaded on a cpu: A vcpu has multiple VMCSs (one for L1, a pool for
 L2s), and each of those may have been last loaded on a different cpu.
 
 So instead of linking the vcpus, we link the VMCSs, using a new structure
 loaded_vmcs. This structure contains the VMCS, and the information
 pertaining
 to its loading on a specific cpu (namely, the cpu number, and whether it
 was already launched on this cpu once). In nested we will also use the same
 structure to hold L2 VMCSs, and vmx->loaded_vmcs is a pointer to the
 currently active VMCS.
 
 Signed-off-by: Nadav Har'El n...@il.ibm.com
 ---
  arch/x86/kvm/vmx.c |  150 ---
  1 file changed, 86 insertions(+), 64 deletions(-)
 
 --- .before/arch/x86/kvm/vmx.c2011-05-23 21:46:14.0 +0300
 +++ .after/arch/x86/kvm/vmx.c 2011-05-23 21:46:14.0 +0300
 @@ -116,6 +116,18 @@ struct vmcs {
   char data[0];
  };
 
 +/*
 + * Track a VMCS that may be loaded on a certain CPU. If it is (cpu!=-1), also
 + * remember whether it was VMLAUNCHed, and maintain a linked list of all
 VMCSs
 + * loaded on this CPU (so we can clear them if the CPU goes down).
 + */
 +struct loaded_vmcs {
 + struct vmcs *vmcs;
 + int cpu;
 + int launched;
 + struct list_head loaded_vmcss_on_cpu_link;
 +};
 +
  struct shared_msr_entry {
   unsigned index;
   u64 data;
 @@ -124,9 +136,7 @@ struct shared_msr_entry {
 
  struct vcpu_vmx {
   struct kvm_vcpu   vcpu;
 - struct list_head  local_vcpus_link;
   unsigned long host_rsp;
 - int   launched;
   u8fail;
   u8cpl;
   bool  nmi_known_unmasked;
 @@ -140,7 +150,14 @@ struct vcpu_vmx {
   u64   msr_host_kernel_gs_base;
   u64   msr_guest_kernel_gs_base;
  #endif
 - struct vmcs  *vmcs;
 + /*
 +  * loaded_vmcs points to the VMCS currently used in this vcpu. For a
 +  * non-nested (L1) guest, it always points to vmcs01. For a nested
 +  * guest (L2), it points to a different VMCS.
 +  */
 + struct loaded_vmcsvmcs01;
 + struct loaded_vmcs   *loaded_vmcs;
 + bool  __launched; /* temporary, used in
 vmx_vcpu_run */
   struct msr_autoload {
   unsigned nr;
   struct vmx_msr_entry guest[NR_AUTOLOAD_MSRS];
 @@ -200,7 +217,11 @@ static int vmx_set_tss_addr(struct kvm *
 
  static DEFINE_PER_CPU(struct vmcs *, vmxarea);
  static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
 -static DEFINE_PER_CPU(struct list_head, vcpus_on_cpu);
 +/*
 + * We maintain a per-CPU linked-list of VMCS loaded on that CPU. This is
 needed
 + * when a CPU is brought down, and we need to VMCLEAR all VMCSs loaded
 on it.
 + */
 +static DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
  static DEFINE_PER_CPU(struct desc_ptr, host_gdt);
 
  static unsigned long *vmx_io_bitmap_a;
 @@ -501,6 +522,13 @@ static void vmcs_clear(struct vmcs *vmcs
   

RE: [PATCH 07/31] nVMX: Introduce vmcs02: VMCS used to run L2

2011-05-21 Thread Tian, Kevin
 From: Nadav Har'El [mailto:n...@math.technion.ac.il]
 Sent: Saturday, May 21, 2011 4:32 AM
 
 On Fri, May 20, 2011, Tian, Kevin wrote about RE: [PATCH 07/31] nVMX:
 Introduce vmcs02: VMCS used to run L2:
  btw, shouldn't you clear recycled VMCS and reset 'cpu' and 'launched' 
  fields?
 
 Well, I believe the answer is no: As far as I understood, a host is allowed
 to take a VMCS that has been used once to launch a certain guest, and then
 modify all the VMCS's fields to define a completely different guest, and then
 VMRESUME it, without doing the regular VMCLEAR/VMLAUNCH, even though
 it's
 a different guest. Is there something wrong in my assumption? Does VMX
 keep
 anything constant between successive VMRESUMEs?

Yes, you can reuse a VMCS with completely different state if the VMCS is used
on the same processor, and you must ensure that the VMCS does not have dirty
state on any other processor. The SDM 3B (21.10.1) explicitly requires:



No VMCS should ever be active on more than one logical processor. If a VMCS is
to be migrated from one logical processor to another, the first logical
processor should execute VMCLEAR for the VMCS (to make it inactive on that
logical processor and to ensure that all VMCS data are in memory) before the
other logical processor executes VMPTRLD for the VMCS (to make it active on
the second logical processor). A VMCS that is made active on more than one
logical processor may become corrupted.



Here the vmcs02 being overridden may have been run on another processor before
but is not vmclear-ed yet. When you resume this vmcs02 with new content on a 
separate processor, the risk of corruption exists.

 
  Have you tried SMP L2 guest?
 
 It sort of works, but it *does* appear to still have a bug which I didn't
 yet have the time to hunt... In one case, for example, an 8-vcpu L2 on an
 8-vcpu L1 seemed to work well (e.g., doing parallel make) for about a minute,
 and then hung with some sort of page fault in the kernel.
 

See whether cleaning up above can help here.

Thanks
Kevin


RE: [PATCH 02/31] nVMX: Implement VMXON and VMXOFF

2011-05-20 Thread Tian, Kevin
 From: Nadav Har'El
 Sent: Tuesday, May 17, 2011 3:45 AM
 
 This patch allows a guest to use the VMXON and VMXOFF instructions, and
 emulates them accordingly. Basically this amounts to checking some
 prerequisites, and then remembering whether the guest has enabled or
 disabled VMX operation.
 
 Signed-off-by: Nadav Har'El n...@il.ibm.com
 ---
  arch/x86/kvm/vmx.c |  110
 ++-
  1 file changed, 108 insertions(+), 2 deletions(-)
 
 --- .before/arch/x86/kvm/vmx.c2011-05-16 22:36:46.0 +0300
 +++ .after/arch/x86/kvm/vmx.c 2011-05-16 22:36:46.0 +0300
 @@ -130,6 +130,15 @@ struct shared_msr_entry {
   u64 mask;
  };
 
 +/*
 + * The nested_vmx structure is part of vcpu_vmx, and holds information
 +we need
 + * for correct emulation of VMX (i.e., nested VMX) on this vcpu.
 + */
 +struct nested_vmx {
 + /* Has the level1 guest done vmxon? */
 + bool vmxon;
 +};
 +
  struct vcpu_vmx {
   struct kvm_vcpu   vcpu;
   struct list_head  local_vcpus_link;
 @@ -184,6 +193,9 @@ struct vcpu_vmx {
   u32 exit_reason;
 
   bool rdtscp_enabled;
 +
 + /* Support for a guest hypervisor (nested VMX) */
 + struct nested_vmx nested;
  };
 
  enum segment_cache_field {
 @@ -3890,6 +3902,99 @@ static int handle_invalid_op(struct kvm_  }
 
  /*
 + * Emulate the VMXON instruction.
 + * Currently, we just remember that VMX is active, and do not save or
 +even
 + * inspect the argument to VMXON (the so-called VMXON pointer)
 +because we
 + * do not currently need to store anything in that guest-allocated
 +memory

Though we don't need to store anything, VMXON needs to check the revision ID of
the VMXON region to make sure it matches the processor's expectation. Consider
a user who uses nVMX to practice VMM development and forgets to fill the
revision ID into the region; we should fail the instruction in the first place.
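
Something along these lines is what I'm thinking of (a sketch only, assuming
vmptr already holds the guest-physical VMXON operand decoded from the
instruction, and failing with whatever VMfail helper the later patches
introduce):

	u32 revision;

	if (kvm_read_guest(vcpu->kvm, vmptr, &revision, sizeof(revision)) ||
	    revision != VMCS12_REVISION) {
		nested_vmx_failInvalid(vcpu);	/* hypothetical VMfail helper */
		skip_emulated_instruction(vcpu);
		return 1;
	}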

 + * region. Consequently, VMCLEAR and VMPTRLD also do not verify that
 +the their
 + * argument is different from the VMXON pointer (which the spec says they
 do).
 + */
 +static int handle_vmon(struct kvm_vcpu *vcpu) {
 + struct kvm_segment cs;
 + struct vcpu_vmx *vmx = to_vmx(vcpu);
 +
 + /* The Intel VMX Instruction Reference lists a bunch of bits that
 +  * are prerequisite to running VMXON, most notably cr4.VMXE must be
 +  * set to 1 (see vmx_set_cr4() for when we allow the guest to set this).
 +  * Otherwise, we should fail with #UD. We test these now:
 +  */
 + if (!kvm_read_cr4_bits(vcpu, X86_CR4_VMXE) ||
 + !kvm_read_cr0_bits(vcpu, X86_CR0_PE) ||
 + (vmx_get_rflags(vcpu) & X86_EFLAGS_VM)) {
 + kvm_queue_exception(vcpu, UD_VECTOR);
 + return 1;
 + }
 +
 + vmx_get_segment(vcpu, &cs, VCPU_SREG_CS);
 + if (is_long_mode(vcpu) && !cs.l) {
 + kvm_queue_exception(vcpu, UD_VECTOR);
 + return 1;
 + }
 +
 + if (vmx_get_cpl(vcpu)) {
 + kvm_inject_gp(vcpu, 0);
 + return 1;
 + }

You also need to check bits 0/1/2 of the IA32_FEATURE_CONTROL MSR, as described
in the SDM.

The same goes for the checks on 4K alignment and physical-address width of the
VMXON region.
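
For illustration (a sketch only; it assumes we track the value L1 wrote to
IA32_FEATURE_CONTROL somewhere, e.g. a new vmx->nested field, vmptr is again
the decoded VMXON operand, and FEATURE_CONTROL_* are the msr-index.h bit
names):

	/* SDM: VMXON #GPs unless the lock bit and the VMX enable bit are set */
	if ((vmx->nested.msr_ia32_feature_control &
	     (FEATURE_CONTROL_LOCKED | FEATURE_CONTROL_VMXON_ENABLED)) !=
	    (FEATURE_CONTROL_LOCKED | FEATURE_CONTROL_VMXON_ENABLED)) {
		kvm_inject_gp(vcpu, 0);
		return 1;
	}

	/* SDM: the VMXON region must be 4K aligned and fit within the
	 * supported physical-address width */
	if ((vmptr & (PAGE_SIZE - 1)) || (vmptr >> cpuid_maxphyaddr(vcpu))) {
		nested_vmx_failInvalid(vcpu);	/* hypothetical VMfail helper */
		skip_emulated_instruction(vcpu);
		return 1;
	}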

 +
 + vmx->nested.vmxon = true;
 +
 + skip_emulated_instruction(vcpu);
 + return 1;
 +}
 +
 +/*
 + * Intel's VMX Instruction Reference specifies a common set of
 +prerequisites
 + * for running VMX instructions (except VMXON, whose prerequisites are
 + * slightly different). It also specifies what exception to inject otherwise.
 + */
 +static int nested_vmx_check_permission(struct kvm_vcpu *vcpu) {
 + struct kvm_segment cs;
 + struct vcpu_vmx *vmx = to_vmx(vcpu);
 +
 + if (!vmx->nested.vmxon) {
 + kvm_queue_exception(vcpu, UD_VECTOR);
 + return 0;
 + }
 +
 + vmx_get_segment(vcpu, &cs, VCPU_SREG_CS);
 + if ((vmx_get_rflags(vcpu) & X86_EFLAGS_VM) ||
 + (is_long_mode(vcpu) && !cs.l)) {
 + kvm_queue_exception(vcpu, UD_VECTOR);
 + return 0;
 + }
 +
 + if (vmx_get_cpl(vcpu)) {
 + kvm_inject_gp(vcpu, 0);
 + return 0;
 + }
 +
 + return 1;
 +}
 +
 +/*
 + * Free whatever needs to be freed from vmx-nested when L1 goes down,
 +or
 + * just stops using VMX.
 + */
 +static void free_nested(struct vcpu_vmx *vmx) {
 + if (!vmx->nested.vmxon)
 + return;
 + vmx->nested.vmxon = false;
 +}
 +
 +/* Emulate the VMXOFF instruction */
 +static int handle_vmoff(struct kvm_vcpu *vcpu) {
 + if (!nested_vmx_check_permission(vcpu))
 + return 1;

a check on CR0.PE is missing here
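
e.g. (a sketch; kvm_read_cr0_bits() is the same helper used in handle_vmon above):

	if (!kvm_read_cr0_bits(vcpu, X86_CR0_PE)) {
		kvm_queue_exception(vcpu, UD_VECTOR);
		return 1;
	}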

 + free_nested(to_vmx(vcpu));
 + skip_emulated_instruction(vcpu);
 + return 1;
 +}
 +
 +/*
   * The exit handlers return 1 if the exit was handled fully and guest 
 execution
   * may resume.  Otherwise they set the kvm_run parameter to indicate
 what needs
   * to be done to userspace and return 0.
 @@ -3917,8 +4022,8 @@ static int (*kvm_vmx_exit_handlers[])(st
 

RE: [PATCH 07/31] nVMX: Introduce vmcs02: VMCS used to run L2

2011-05-20 Thread Tian, Kevin
 From: Nadav Har'El
 Sent: Tuesday, May 17, 2011 3:48 AM
 
 We saw in a previous patch that L1 controls its L2 guest with a vcms12.
 L0 needs to create a real VMCS for running L2. We call that vmcs02.
 A later patch will contain the code, prepare_vmcs02(), for filling the vmcs02
 fields. This patch only contains code for allocating vmcs02.
 
 In this version, prepare_vmcs02() sets *all* of vmcs02's fields each time we
 enter from L1 to L2, so keeping just one vmcs02 for the vcpu is enough: It can
 be reused even when L1 runs multiple L2 guests. However, in future versions
 we'll probably want to add an optimization where vmcs02 fields that rarely
 change will not be set each time. For that, we may want to keep around several
 vmcs02s of L2 guests that have recently run, so that potentially we could run
 these L2s again more quickly because less vmwrites to vmcs02 will be needed.

That would be a neat enhancement and should give an obvious improvement.
Possibly we can maintain the vmcs02 pool along with L1's VMCLEAR ops, which
is similar to the hardware behavior regarding the cleared and launched states.
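
e.g. the VMCLEAR handler could drop the matching entry from the pool, something
like this (a sketch; the fields are those of vmcs02_list/nested_vmx in this
patch, and the free helper is hypothetical):

	struct vmcs02_list *item;

	list_for_each_entry(item, &vmx->nested.vmcs02_pool, list) {
		if (item->vmcs12_addr != vmptr)
			continue;
		free_saved_vmcs(&item->vmcs02);	/* hypothetical free helper */
		list_del(&item->list);
		kfree(item);
		vmx->nested.vmcs02_num--;
		break;
	}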

 
 This patch adds to each vcpu a vmcs02 pool, vmx-nested.vmcs02_pool,
 which remembers the vmcs02s last used to run up to VMCS02_POOL_SIZE L2s.
 As explained above, in the current version we choose VMCS02_POOL_SIZE=1,
 I.e., one vmcs02 is allocated (and loaded onto the processor), and it is
 reused to enter any L2 guest. In the future, when prepare_vmcs02() is
 optimized not to set all fields every time, VMCS02_POOL_SIZE should be
 increased.
 
 Signed-off-by: Nadav Har'El n...@il.ibm.com
 ---
  arch/x86/kvm/vmx.c |  139
 +++
  1 file changed, 139 insertions(+)
 
 --- .before/arch/x86/kvm/vmx.c2011-05-16 22:36:47.0 +0300
 +++ .after/arch/x86/kvm/vmx.c 2011-05-16 22:36:47.0 +0300
 @@ -117,6 +117,7 @@ static int ple_window = KVM_VMX_DEFAULT_
  module_param(ple_window, int, S_IRUGO);
 
  #define NR_AUTOLOAD_MSRS 1
 +#define VMCS02_POOL_SIZE 1
 
  struct vmcs {
   u32 revision_id;
 @@ -166,6 +167,30 @@ struct __packed vmcs12 {
  #define VMCS12_SIZE 0x1000
 
  /*
 + * When we temporarily switch a vcpu's VMCS (e.g., stop using an L1's VMCS
 + * while we use L2's VMCS), and we wish to save the previous VMCS, we must
 also
 + * remember on which CPU it was last loaded (vcpu-cpu), so when we return
 to
 + * using this VMCS we'll know if we're now running on a different CPU and
 need
 + * to clear the VMCS on the old CPU, and load it on the new one. 
 Additionally,
 + * we need to remember whether this VMCS was launched (vmx-launched),
 so when
 + * we return to it we know if to VMLAUNCH or to VMRESUME it (we cannot
 deduce
 + * this from other state, because it's possible that this VMCS had once been
 + * launched, but has since been cleared after a CPU switch).
 + */
 +struct saved_vmcs {
 + struct vmcs *vmcs;
 + int cpu;
 + int launched;
 +};

saved looks a bit misleading here. It's simply a list of all active vmcs02 
tracked
by kvm, isn't it?

 +
 +/* Used to remember the last vmcs02 used for some recently used vmcs12s
 */
 +struct vmcs02_list {
 + struct list_head list;
 + gpa_t vmcs12_addr;

make the name uniform with 'vmptr' as in the nested_vmx structure:
 /* The guest-physical address of the current VMCS L1 keeps for L2 */
gpa_t current_vmptr;
/* The host-usable pointer to the above */
struct page *current_vmcs12_page;
struct vmcs12 *current_vmcs12;

you should keep consistent meaning for vmcs12, which means the arch-neutral
state interpreted by KVM only.

 + struct saved_vmcs vmcs02;
 +};
 +
 +/*
   * The nested_vmx structure is part of vcpu_vmx, and holds information we
 need
   * for correct emulation of VMX (i.e., nested VMX) on this vcpu.
   */
 @@ -178,6 +203,10 @@ struct nested_vmx {
   /* The host-usable pointer to the above */
   struct page *current_vmcs12_page;
   struct vmcs12 *current_vmcs12;
 +
 + /* vmcs02_list cache of VMCSs recently used to run L2 guests */
 + struct list_head vmcs02_pool;
 + int vmcs02_num;
  };
 
  struct vcpu_vmx {
 @@ -4200,6 +4229,111 @@ static int handle_invalid_op(struct kvm_
  }
 
  /*
 + * To run an L2 guest, we need a vmcs02 based the L1-specified vmcs12.
 + * We could reuse a single VMCS for all the L2 guests, but we also want the
 + * option to allocate a separate vmcs02 for each separate loaded vmcs12 -
 this
 + * allows keeping them loaded on the processor, and in the future will allow
 + * optimizations where prepare_vmcs02 doesn't need to set all the fields on
 + * every entry if they never change.
 + * So we keep, in vmx-nested.vmcs02_pool, a cache of size
 VMCS02_POOL_SIZE
 + * (=0) with a vmcs02 for each recently loaded vmcs12s, most recent first.
 + *
 + * The following functions allocate and free a vmcs02 in this pool.
 + */
 +
 +static void __nested_free_saved_vmcs(void *arg)
 +{
 + struct saved_vmcs *saved_vmcs = arg;
 +
 

RE: [PATCH 09/31] nVMX: Add VMCS fields to the vmcs12

2011-05-20 Thread Tian, Kevin
 From: Nadav Har'El
 Sent: Tuesday, May 17, 2011 3:49 AM
 
 In this patch we add to vmcs12 (the VMCS that L1 keeps for L2) all the
 standard VMCS fields.
 
 Later patches will enable L1 to read and write these fields using VMREAD/
 VMWRITE, and they will be used during a VMLAUNCH/VMRESUME in preparing
 vmcs02,
 a hardware VMCS for running L2.
 
 Signed-off-by: Nadav Har'El n...@il.ibm.com
 ---
  arch/x86/kvm/vmx.c |  275
 +++
  1 file changed, 275 insertions(+)
 
 --- .before/arch/x86/kvm/vmx.c2011-05-16 22:36:47.0 +0300
 +++ .after/arch/x86/kvm/vmx.c 2011-05-16 22:36:47.0 +0300
 @@ -144,12 +144,148 @@ struct shared_msr_entry {
   * machines (necessary for live migration).
   * If there are changes in this struct, VMCS12_REVISION must be changed.
   */
 +typedef u64 natural_width;
  struct __packed vmcs12 {
   /* According to the Intel spec, a VMCS region must start with the
* following two fields. Then follow implementation-specific data.
*/
   u32 revision_id;
   u32 abort;
 +
 + u64 io_bitmap_a;
 + u64 io_bitmap_b;
 + u64 msr_bitmap;
 + u64 vm_exit_msr_store_addr;
 + u64 vm_exit_msr_load_addr;
 + u64 vm_entry_msr_load_addr;
 + u64 tsc_offset;
 + u64 virtual_apic_page_addr;
 + u64 apic_access_addr;
 + u64 ept_pointer;
 + u64 guest_physical_address;
 + u64 vmcs_link_pointer;
 + u64 guest_ia32_debugctl;
 + u64 guest_ia32_pat;
 + u64 guest_ia32_efer;
 + u64 guest_pdptr0;
 + u64 guest_pdptr1;
 + u64 guest_pdptr2;
 + u64 guest_pdptr3;
 + u64 host_ia32_pat;
 + u64 host_ia32_efer;
 + u64 padding64[8]; /* room for future expansion */
 + /*
 +  * To allow migration of L1 (complete with its L2 guests) between
 +  * machines of different natural widths (32 or 64 bit), we cannot have
 +  * unsigned long fields with no explict size. We use u64 (aliased
 +  * natural_width) instead. Luckily, x86 is little-endian.
 +  */
 + natural_width cr0_guest_host_mask;
 + natural_width cr4_guest_host_mask;
 + natural_width cr0_read_shadow;
 + natural_width cr4_read_shadow;
 + natural_width cr3_target_value0;
 + natural_width cr3_target_value1;
 + natural_width cr3_target_value2;
 + natural_width cr3_target_value3;
 + natural_width exit_qualification;
 + natural_width guest_linear_address;
 + natural_width guest_cr0;
 + natural_width guest_cr3;
 + natural_width guest_cr4;
 + natural_width guest_es_base;
 + natural_width guest_cs_base;
 + natural_width guest_ss_base;
 + natural_width guest_ds_base;
 + natural_width guest_fs_base;
 + natural_width guest_gs_base;
 + natural_width guest_ldtr_base;
 + natural_width guest_tr_base;
 + natural_width guest_gdtr_base;
 + natural_width guest_idtr_base;
 + natural_width guest_dr7;
 + natural_width guest_rsp;
 + natural_width guest_rip;
 + natural_width guest_rflags;
 + natural_width guest_pending_dbg_exceptions;
 + natural_width guest_sysenter_esp;
 + natural_width guest_sysenter_eip;
 + natural_width host_cr0;
 + natural_width host_cr3;
 + natural_width host_cr4;
 + natural_width host_fs_base;
 + natural_width host_gs_base;
 + natural_width host_tr_base;
 + natural_width host_gdtr_base;
 + natural_width host_idtr_base;
 + natural_width host_ia32_sysenter_esp;
 + natural_width host_ia32_sysenter_eip;
 + natural_width host_rsp;
 + natural_width host_rip;
 + natural_width paddingl[8]; /* room for future expansion */
 + u32 pin_based_vm_exec_control;
 + u32 cpu_based_vm_exec_control;
 + u32 exception_bitmap;
 + u32 page_fault_error_code_mask;
 + u32 page_fault_error_code_match;
 + u32 cr3_target_count;
 + u32 vm_exit_controls;
 + u32 vm_exit_msr_store_count;
 + u32 vm_exit_msr_load_count;
 + u32 vm_entry_controls;
 + u32 vm_entry_msr_load_count;
 + u32 vm_entry_intr_info_field;
 + u32 vm_entry_exception_error_code;
 + u32 vm_entry_instruction_len;
 + u32 tpr_threshold;
 + u32 secondary_vm_exec_control;
 + u32 vm_instruction_error;
 + u32 vm_exit_reason;
 + u32 vm_exit_intr_info;
 + u32 vm_exit_intr_error_code;
 + u32 idt_vectoring_info_field;
 + u32 idt_vectoring_error_code;
 + u32 vm_exit_instruction_len;
 + u32 vmx_instruction_info;
 + u32 guest_es_limit;
 + u32 guest_cs_limit;
 + u32 guest_ss_limit;
 + u32 guest_ds_limit;
 + u32 guest_fs_limit;
 + u32 guest_gs_limit;
 + u32 guest_ldtr_limit;
 + u32 guest_tr_limit;
 + u32 guest_gdtr_limit;
 + u32 guest_idtr_limit;
 + u32 guest_es_ar_bytes;
 + u32 guest_cs_ar_bytes;
 + u32 guest_ss_ar_bytes;
 + u32 guest_ds_ar_bytes;
 + u32 guest_fs_ar_bytes;
 + u32 guest_gs_ar_bytes;
 + u32 guest_ldtr_ar_bytes;
 + u32 

RE: [PATCH 07/31] nVMX: Introduce vmcs02: VMCS used to run L2

2011-05-20 Thread Tian, Kevin
 From: Tian, Kevin
 Sent: Friday, May 20, 2011 4:05 PM
 
  From: Nadav Har'El
  Sent: Tuesday, May 17, 2011 3:48 AM
 
  We saw in a previous patch that L1 controls its L2 guest with a vcms12.
  L0 needs to create a real VMCS for running L2. We call that vmcs02.
  A later patch will contain the code, prepare_vmcs02(), for filling the 
  vmcs02
  fields. This patch only contains code for allocating vmcs02.
 
  In this version, prepare_vmcs02() sets *all* of vmcs02's fields each time we
  enter from L1 to L2, so keeping just one vmcs02 for the vcpu is enough: It can
  be reused even when L1 runs multiple L2 guests. However, in future versions
  we'll probably want to add an optimization where vmcs02 fields that rarely
  change will not be set each time. For that, we may want to keep around several
  vmcs02s of L2 guests that have recently run, so that potentially we could run
  these L2s again more quickly because less vmwrites to vmcs02 will be needed.
 
 That would be a neat enhancement and should give an obvious improvement.
 Possibly we can maintain the vmcs02 pool along with L1 VMCLEAR ops, which
 is similar to the hardware behavior regarding cleared and launched state.
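 To make the idea concrete, a minimal sketch of such an eviction (hypothetical
 helper, not part of this patch; it assumes the vmcs02_pool/vmcs02_list fields
 introduced below and the existing free_vmcs() helper in vmx.c):

	/*
	 * Hypothetical: when emulating L1's VMCLEAR on a vmcs12, also drop the
	 * matching vmcs02 from the pool, mirroring hardware's cleared state.
	 */
	static void nested_evict_vmcs02(struct vcpu_vmx *vmx, gpa_t vmptr)
	{
		struct vmcs02_list *item;

		list_for_each_entry(item, &vmx->nested.vmcs02_pool, list) {
			if (item->vmcs12_addr != vmptr)
				continue;
			list_del(&item->list);
			free_vmcs(item->vmcs02.vmcs);
			kfree(item);
			vmx->nested.vmcs02_num--;
			return;
		}
	}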
 
 
  This patch adds to each vcpu a vmcs02 pool, vmx->nested.vmcs02_pool, which
  remembers the vmcs02s last used to run up to VMCS02_POOL_SIZE L2s.
  As explained above, in the current version we choose VMCS02_POOL_SIZE=1,
  i.e., one vmcs02 is allocated (and loaded onto the processor), and it is
  reused to enter any L2 guest. In the future, when prepare_vmcs02() is
  optimized not to set all fields every time, VMCS02_POOL_SIZE should be
  increased.
 
  Signed-off-by: Nadav Har'El n...@il.ibm.com
  ---
   arch/x86/kvm/vmx.c |  139 +++
    1 file changed, 139 insertions(+)
 
  --- .before/arch/x86/kvm/vmx.c  2011-05-16 22:36:47.0 +0300
  +++ .after/arch/x86/kvm/vmx.c   2011-05-16 22:36:47.0 +0300
  @@ -117,6 +117,7 @@ static int ple_window = KVM_VMX_DEFAULT_
   module_param(ple_window, int, S_IRUGO);
 
   #define NR_AUTOLOAD_MSRS 1
  +#define VMCS02_POOL_SIZE 1
 
   struct vmcs {
  u32 revision_id;
  @@ -166,6 +167,30 @@ struct __packed vmcs12 {
   #define VMCS12_SIZE 0x1000
 
   /*
   + * When we temporarily switch a vcpu's VMCS (e.g., stop using an L1's VMCS
   + * while we use L2's VMCS), and we wish to save the previous VMCS, we must also
   + * remember on which CPU it was last loaded (vcpu->cpu), so when we return to
   + * using this VMCS we'll know if we're now running on a different CPU and need
   + * to clear the VMCS on the old CPU, and load it on the new one. Additionally,
   + * we need to remember whether this VMCS was launched (vmx->launched), so when
   + * we return to it we know if to VMLAUNCH or to VMRESUME it (we cannot deduce
   + * this from other state, because it's possible that this VMCS had once been
   + * launched, but has since been cleared after a CPU switch).
   + */
  +struct saved_vmcs {
  +   struct vmcs *vmcs;
  +   int cpu;
  +   int launched;
  +};
 
  'saved' looks a bit misleading here. It's simply a list of all active vmcs02s
  tracked by kvm, isn't it?
 
  +
   +/* Used to remember the last vmcs02 used for some recently used vmcs12s */
  +struct vmcs02_list {
  +   struct list_head list;
  +   gpa_t vmcs12_addr;
 
  better to use the uniform name 'vmptr', as in the nested_vmx structure:
  /* The guest-physical address of the current VMCS L1 keeps for L2 */
   gpa_t current_vmptr;
   /* The host-usable pointer to the above */
   struct page *current_vmcs12_page;
   struct vmcs12 *current_vmcs12;
 
  You should keep a consistent meaning for vmcs12, which refers only to the
  arch-neutral state interpreted by KVM.
 
  +   struct saved_vmcs vmcs02;
  +};
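  For illustration only, a sketch of the pool entry with that naming
  (hypothetical, not the actual patch):

	struct vmcs02_list {
		struct list_head list;
		gpa_t vmptr;              /* guest-physical address of the vmcs12 */
		struct saved_vmcs vmcs02; /* hardware VMCS used to run that L2 */
	};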
  +
  +/*
* The nested_vmx structure is part of vcpu_vmx, and holds information we
  need
* for correct emulation of VMX (i.e., nested VMX) on this vcpu.
*/
  @@ -178,6 +203,10 @@ struct nested_vmx {
  /* The host-usable pointer to the above */
  struct page *current_vmcs12_page;
  struct vmcs12 *current_vmcs12;
  +
  +   /* vmcs02_list cache of VMCSs recently used to run L2 guests */
  +   struct list_head vmcs02_pool;
  +   int vmcs02_num;
   };
 
   struct vcpu_vmx {
  @@ -4200,6 +4229,111 @@ static int handle_invalid_op(struct kvm_
   }
 
   /*
   + * To run an L2 guest, we need a vmcs02 based on the L1-specified vmcs12.
   + * We could reuse a single VMCS for all the L2 guests, but we also want the
   + * option to allocate a separate vmcs02 for each separate loaded vmcs12 - this
   + * allows keeping them loaded on the processor, and in the future will allow
   + * optimizations where prepare_vmcs02 doesn't need to set all the fields on
   + * every entry if they never change.
   + * So we keep, in vmx->nested.vmcs02_pool, a cache of size VMCS02_POOL_SIZE
   + * (>=0) with a vmcs02 for each recently loaded vmcs12s, most recent first.
  + *
  + * The following

RE: Remaining passthrough/VT-d tasks list

2008-09-27 Thread Tian, Kevin
From: Avi Kivity
Sent: September 27, 2008 17:50

Yang, Sheng wrote:
 After checking the host shared interrupts situation, I got a question here:

 If I understand correctly, the current solution doesn't block host shared
 irqs, it just comes with a performance penalty. The penalty comes from the
 host irq line being disabled for a period; we have to wait for the guest to
 write EOI. But I fail to see the correctness problem here (except a lot of
 spurious interrupts in the guest).

 I've checked the mail archive, but can't find a clue about that. Can you
 explain the situation?



If the guest fails to disable interrupts on a device that shares an
interrupt line with the host, the host will experience an interrupt
flood.  Eventually the host will disable the host device as well.


This issue also exists on the host side: one misbehaving driver
can hurt all other drivers sharing the same irq line. But there seems
to be no good way to avoid it. Since not all devices support MSI, we
still need to support irq sharing, possibly with the caveats above.

The existing approach at least works with a sane guest driver, with
some performance penalty.

Or do you have better alternative?

Thanks,
Kevin


RE: Remaining passthrough/VT-d tasks list

2008-09-27 Thread Tian, Kevin
From: Dong, Eddie
Sent: September 28, 2008 10:04

Tian, Kevin wrote:
 From: Avi Kivity
 Sent: September 27, 2008 17:50

 Yang, Sheng wrote:
 After checking the host shared interrupts situation, I got a
 question here:

 If I understand correctly, the current solution doesn't block
 host shared irqs, it just comes with a performance penalty.
 The penalty comes from the host irq line being disabled for a
 period; we have to wait for the guest to write EOI. But I fail
 to see the correctness problem here (except a lot of spurious
 interrupts in the guest).

 I've checked the mail archive, but can't find a clue about
 that. Can you explain the situation?



 If the guest fails to disable interrupts on a device
 that shares an interrupt line with the host, the host
 will experience an interrupt flood.  Eventually the host
 will disable the host device as well.


 This issue also exists on the host side: one misbehaving
 driver can hurt all other drivers sharing the same irq
 line. But there seems to be no good way to avoid it.
 Since not all devices support MSI, we still need to
 support irq sharing, possibly with the caveats above.

 The existing approach at least works with a sane guest
 driver, with some performance penalty.

 Or do you have better alternative?

 Thanks,
 Kevin


MSI is always the 1st choice, including using host MSI for the guest
IOAPIC case, because we don't know if the guest OS has MSI support
but we are sure host Linux does.

When MSI is impossible, I recommend we either disable device
assignment for devices sharing an interrupt, or assign all devices
with the same interrupt to the same guest. Yes, the issue is the
same on native, but on native the whole OS (kernel) is in the same
isolation domain, while different guests are in different
isolation domains :(

In one word, MSI is pretty important for direct I/O, and SR-IOV is
the #1 usage in the future. Just advocating more, and wishing more
people can ack the SR-IOV patch from ZhaoYU so that we can see
2.6.28 work for direct I/O without sacrificing sharing :)


Yes, irq sharing is tricky stuff, and it's hard to make it
architecturally clean. Besides the irq storm mentioned by Avi,
driver timeouts or device buffer overflows can also be subtly
triggered by the guest sharing the irq. Guest inter-dependency
can impact shared irq handling too. If people do care about those
issues, which the known irq sharing approaches can't address,
your recommendation makes sense.

Thanks
Kevin

RE: Remaining passthrough/VT-d tasks list

2008-09-27 Thread Tian, Kevin
From: Avi Kivity [mailto:[EMAIL PROTECTED]
Sent: September 28, 2008 12:23

There is no issue on the host, since all drivers operate on the same
trust level. A misbehaving driver on the host will take down the entire
system even without shared interrupts, by corrupting memory, not
releasing a lock, etc.

But if you move a driver to the guest, you expect it will be isolated
from the rest of the system, and if there are shared
interrupts, it isn't.


Yes, you're right

 Or do you have better alternative?


No. Maybe the Neocleus polarity trick (which also reduces performance).


To my knowledge, the Neocleus polarity trick can't solve this isolation
issue; it just provides an efficient way to track assertion/deassertion
transitions on the irq line. For example, reverse the polarity when receiving
an interrupt instance; a new irq instance then occurs when all devices
deassert the shared irq line, and the polarity is then restored. In the case
you're concerned about, where the guest driver misbehaves, this polarity
trick can't work either, as one device always asserts the line.
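Just to illustrate the mechanism (all function and constant names below are
made up; this is only a sketch of the trick, not KVM code):

	#include <linux/interrupt.h>

	static irqreturn_t shared_line_isr(int irq, void *opaque)
	{
		if (!line_polarity_inverted(irq)) {
			/*
			 * Line just asserted: invert the pin polarity so the
			 * next interrupt fires only once every device has
			 * deasserted, and forward the assertion to the guest.
			 */
			set_line_polarity(irq, POLARITY_INVERTED);
			inject_guest_irq(irq, 1 /* assert */);
		} else {
			/* All devices deasserted: restore the polarity and
			 * lower the guest's line. */
			set_line_polarity(irq, POLARITY_NORMAL);
			inject_guest_irq(irq, 0 /* deassert */);
		}
		return IRQ_HANDLED;
	}

As noted, this only tracks the transitions; it doesn't help when one guest
device never deasserts the line.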

Thanks,
Kevin

RE: [RFC 1/2] Simulate Intel cpufreq MSRs in kvm guests to influence nice priority

2008-07-27 Thread Tian, Kevin
From: Avi Kivity [mailto:[EMAIL PROTECTED] 
Sent: July 27, 2008 16:27

Tian, Kevin wrote:
 From: Darrick J. Wong
Sent: July 16, 2008 7:18

 I envision four scenarios:

 0. Guests that don't know about cpufreq still run at whatever 
 nice level
 they started with.

 1. If we have a system with a lot of idle VMs, they will all 
 run with +5
 nice and this patch has no effect.

 2. If we have a system with a lot of busy VMs, they all run 
 with -5 nice
 and this patch also has no effect.

 3. If, however, we have a lot of idle VMs and a few busy 
ones, then the
 -5 nice of the busy VMs will get those VMs extra CPU time.  
On a really
 crummy FPU microbenchmark I have, the score goes from about 
500 to 2000
 with the patch applied, though of course YMMV.  In some 
respects this
 

 How many VMs did you run in this test? All the VMs are idle except
 the one where your benchmark runs?

 How about the actual effect when several VMs are doing some stuff?

 There's another scenario where some VMs don't support cpufreq while
 others do. Here is it unfair to just renice the latter when 
the former is
 not 'nice' at all?
   

I guess the solution for such issues is not to have kvm (or qemu) play
with nice levels, but instead send notifications on virtual frequency
changes on the qemu monitor. The management application can then choose
whether to ignore the information, play with nice levels, or even
propagate the frequency change to the host (useful in client-side
virtualization).


Yes, that'd be more flexible and cleaner.

Thanks,
Kevin


RE: [RFC 1/2] Simulate Intel cpufreq MSRs in kvm guests to influence nice priority

2008-07-17 Thread Tian, Kevin
From: Darrick J. Wong [mailto:[EMAIL PROTECTED] 
Sent: July 18, 2008 3:05

If there are multiple VMs that are busy, the busy ones will fight among
themselves for CPU time.  I still see some priority boost, just not as
much.

Some micro-level analysis is useful here.


I wonder how stable the virtual tsc is...?  Will have to study this.

My point is that exposing virtual freq states doesn't change whether the
virtual tsc is stable, since the interception logic for virtual freq change
requests only impacts the nice value. That's expected behavior.

Whether the virtual tsc is stable is then a separate issue, outside this
feature.


IDA has the same problem... the T61 BIOS compensates for this fakery
by reporting a frequency of $max_freq + 1 so if you're smart 
then you'll
somehow know that you might see a boost that you can't measure. :P

It can be measured: one necessary requirement on any hardware-coordinated
logic is to provide some type of feedback mechanism. For example, Intel
processors provide the APERF/MPERF pair, with MPERF incremented in
proportion to a fixed boot frequency, while APERF increments in proportion
to actual performance. Software should use APERF/MPERF to understand the
actual freq in the elapsed sampling period.
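As a minimal sketch (kernel context assumed; this is just to illustrate the
APERF/MPERF ratio, not code from the patch):

	#include <asm/msr.h>
	#include <linux/math64.h>

	static u64 effective_khz(u64 base_khz, u64 *last_aperf, u64 *last_mperf)
	{
		u64 aperf, mperf, da, dm;

		rdmsrl(MSR_IA32_APERF, aperf);	/* counts at the actual frequency */
		rdmsrl(MSR_IA32_MPERF, mperf);	/* counts at the fixed base frequency */
		da = aperf - *last_aperf;
		dm = mperf - *last_mperf;
		*last_aperf = aperf;
		*last_mperf = mperf;

		/* actual freq ~= base freq * delta(APERF) / delta(MPERF) */
		return dm ? div64_u64(base_khz * da, dm) : base_khz;
	}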


I suppose the problem here is that p-states were designed on the
assumption that you're directly manipulating hardware speeds, whereas
what we really want in both this patch and IDA are qualitative values
(medium speed, highest speed, ludicrous speed?)

It's still a bit different. 

For IDA, when ludicrous speed is requested, it may be granted. 
However when it's not, the actual freq will be still at highest speed
and never be lower.

However for this feature, how much cpu cycles can be granted is
not decided by a single 'nice' value, which instead depends on num
of active vcpus at given time on given cpu. Whatever speed is 
requested, either medium, highest or ludicrous, granted cycles can
always vary from some minimal (many vcpus contends) to 100%
(only current is active).  

On the other hand, if you get the same performance  at both 
high and low
speeds, then it doesn't really matter which one you choose.  At least
not until the load changes.  I suppose the next question is, how much
software is dependent on knowing the exact CPU frequency, and are
workload schedulers smart enough to realize that performance
characteristics can change over time (throttling, TM1/TM2, etc)?
Inasmuch as you actually ever know, since with hardware coordination of
cpufreq the hardware can do whatever it wants.

Throttling and TM1/TM2 are thermal mechanisms triggered when some threshold
is reached. Here let's focus on DBS (Demand Based Switching),
which is actively conducted by OSPM based on workload estimation. A
typical freq demotion flow looks like below:

If (PercentBusy * Pc/Pn) < threshold
    switch Pc to Pn;

Here PercentBusy represents CPU utilization in the elapsed sampling
period, Pc stands for the freq used in the elapsed period, and Pn is the
candidate lower freq to change to. If the freq change can still keep
CPU utilization under the predefined threshold, the transition is viable.

The key point here is PercentBusy and Pc, which may make the
final decision pointless if inaccurate. That's why hardware coordination
logic is required to provide some feedback to get Pc.
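A tiny sketch of that demotion check (illustrative only, names are made up):

	/* Demote from the current freq Pc to a lower candidate Pn only if the
	 * utilization projected at Pn stays under the threshold. */
	static int should_demote(unsigned int percent_busy,	/* 0..100, last period  */
				 unsigned int pc_khz,		/* current frequency    */
				 unsigned int pn_khz,		/* lower candidate freq */
				 unsigned int threshold)	/* e.g. 80 (%)          */
	{
		/* The busy percentage scales up by Pc/Pn when the clock is lowered. */
		return (percent_busy * pc_khz) / pn_khz < threshold;
	}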

I agree that the guest should eventually be able to catch up if a wrong
decision leaves its workload constrained or over-granted. E.g. when there's
only one vcpu active on a pcpu and it requests a medium speed,
100% of the cycles are granted, making it think that medium speed is
enough for its current workload. Later, when other vcpus become active
on the same pcpu, its granted cycles shrink; it may then realize that
medium speed is not enough and request the highest speed,
which may add back some cycles via a lower nice value.

But it's better to do some micro-level analysis to understand
whether it works as expected and, more importantly, how fast this
catch-up may be. Note that the guest checks for freq changes at
roughly the 20ms level, so we need to make sure no thrashing is caused
that would mess up both guest and host.

Actually, another concern just raised is the measurement of
PercentBusy. Take Linux for example: it normally subtracts
idle time from elapsed time. If the guest doesn't understand steal
time, PercentBusy may not reflect reality at all. For example,
say a vcpu is continuously busy for 10ms, then happens
to enter the idle loop and is scheduled out for 20ms. Next time
it is re-scheduled in, its dbs timer will compute PercentBusy as
33.3%, though it was actually fully busy. However, when the vcpu is
scheduled out outside of the idle loop, that steal time is counted
as the busy portion. I'm still not clear how this may affect this patch...
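Putting numbers on that example (just a worked illustration of the gap):

	static void percent_busy_example(void)
	{
		unsigned int busy_ms    = 10;	/* vcpu genuinely busy while running */
		unsigned int stolen_ms  = 20;	/* scheduled out after entering idle */
		unsigned int elapsed_ms = busy_ms + stolen_ms;

		/* steal-unaware guest: 10 / 30 ~= 33% busy -> may wrongly demote */
		unsigned int naive_busy = busy_ms * 100 / elapsed_ms;

		/* steal-aware accounting: 10 / (30 - 20) = 100% busy */
		unsigned int aware_busy = busy_ms * 100 / (elapsed_ms - stolen_ms);

		(void)naive_busy;
		(void)aware_busy;
	}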


I don't think it's easily deduced.  I also don't think 
APERF/MPERF are emulated in
kvm yet.  I suppose it wouldn't be difficult to add those two, though
measuring that might be a bit messy.

Maybe the cheap workaround for now is to report 

RE: [RFC 1/2] Simulate Intel cpufreq MSRs in kvm guests to influence nice priority

2008-07-15 Thread Tian, Kevin
From: Darrick J. Wong
Sent: July 16, 2008 7:18

I envision four scenarios:

0. Guests that don't know about cpufreq still run at whatever 
nice level
they started with.

1. If we have a system with a lot of idle VMs, they will all 
run with +5
nice and this patch has no effect.

2. If we have a system with a lot of busy VMs, they all run 
with -5 nice
and this patch also has no effect.

3. If, however, we have a lot of idle VMs and a few busy ones, then the
-5 nice of the busy VMs will get those VMs extra CPU time.  On a really
crummy FPU microbenchmark I have, the score goes from about 500 to 2000
with the patch applied, though of course YMMV.  In some respects this

How many VMs did you run in this test? All the VMs are idle except
the one where your benchmark runs?

How about the actual effect when several VMs are doing some stuff?

There's another scenario where some VMs don't support cpufreq while
others do. Here is it unfair to just renice the latter when the former is
not 'nice' at all?

I guess this feature has to be applied with some qualifications, e.g.
within a group of VMs known to have the same PM abilities...


There are some warts to this patch--most notably, the current
implementation uses the Intel MSRs and EST feature flag ... even if the
guest reports the CPU as being AuthenticAMD.  Also, there could be
timing problems introduced by this change--the OS thinks the CPU
frequency changes, but I don't know the effect on the guest CPU TSCs.

You can report the constant-tsc feature via cpuid virtualization. Of course,
if the physical TSC is unstable, how to mark the guest TSC as untrustworthy
is another story. (E.g. Marcelo developed one method by simulating C2.)


Control values are as follows:
0: Nobody's touched cpufreq.  nice is whatever the default is.
1: Lowest speed.  nice +5.
2: Medium speed.  nice is reset.
3: High speed.  nice -5.

This description seems to mismatch the implementation, which pushes
+10 and -10 for the 1 and 3 cases. Maybe I misinterpret the code?

One interesting point is the initial value of the PERF_CTL MSR. The current
'zero' doesn't reflect a meaningful state to the guest, since there's no perf
entry in the ACPI table to carry such a value. One likely result is that the
guest would think the current freq is 0 when initializing the ACPI cpufreq
driver. So it would make more sense to set the initial value to 2 (P1),
keeping the default nice value, or even 3 (P0), if you treat that state as
IDA-style, which may over-clock but doesn't guarantee it.
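For illustration, a hypothetical sketch of such a mapping (using the +5/-5
deltas from the description above and an initial value of 2; not the actual
patch code):

	#define VCPU_PERF_CTL_DEFAULT	2	/* medium speed, default nice */

	static int perf_ctl_to_nice_delta(u64 perf_ctl)
	{
		switch (perf_ctl) {
		case 1:		return  5;	/* lowest speed  -> deprioritize  */
		case 2:		return  0;	/* medium speed  -> default nice  */
		case 3:		return -5;	/* highest speed -> prioritize    */
		default:	return  0;	/* 0 or unknown: leave nice alone */
		}
	}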

A more critical point to think about further, if we expect this feature to see
real use, is the definition of the exposed virtual freq states and how these
states map to scheduler knobs. Inappropriate exposure may cause the guest
to bounce excessively between virtual freq points. For example, the 'nice'
value is only a relative hint to the scheduler, and there's no guarantee that
cpu cycles are added in the same proportion as the 'nice' value changes.
There's even the case where the guest requests the lowest speed while the
actual cpu cycles allocated to it stay similar to the last epoch when it was
at high speed. This will fool the guest into thinking that the lowest speed can
satisfy its requirement. It's similar to the requirement on core-based
hardware coordination logic, where some feedback mechanism (e.g. the
APERF/MPERF MSR pair) is required to reveal the actual freq in the last
sampling period. The VM case may need a similar virtualized feedback
mechanism. I'm not sure whether the 'actual' freq is easily deduced, however.

Maybe it's worthwhile to compare the freq change count for the same benchmark
between VM and native, and, more interestingly, to look at the effect when
multiple VMs all make use of such a feature. For example, is the expected
effect counteracted, with only overhead added? Are any strange behaviors
exposed, given that a real 'nice' value wouldn't normally change so frequently,
at the level of dozens of ms? :-)

Thanks,
Kevin

