Re: [Qemu-devel] [PATCH 06/21] vl: add a tmp pointer so that a handler can delete the entry to which it belongs.

2010-12-08 Thread Yoshiaki Tamura
2010/12/8 Isaku Yamahata yamah...@valinux.co.jp:
 QLIST_FOREACH_SAFE?

Thanks! So, it should be,

QLIST_FOREACH_SAFE(e, &vm_change_state_head, entries, ne) {
    e->cb(e->opaque, running, reason);
}

I'll put it in the next spin.

Yoshi


 On Thu, Nov 25, 2010 at 03:06:45PM +0900, Yoshiaki Tamura wrote:
 By copying the next entry to a tmp pointer,
 qemu_del_vm_change_state_handler() can be called in the handler.

 Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
 ---
  vl.c |    5 +++--
  1 files changed, 3 insertions(+), 2 deletions(-)

 diff --git a/vl.c b/vl.c
 index 805e11f..6b6aec0 100644
 --- a/vl.c
 +++ b/vl.c
 @@ -1073,11 +1073,12 @@ void qemu_del_vm_change_state_handler(VMChangeStateEntry *e)

  void vm_state_notify(int running, int reason)
  {
 -    VMChangeStateEntry *e;
 +    VMChangeStateEntry *e, *ne;

      trace_vm_state_notify(running, reason);

 -    for (e = vm_change_state_head.lh_first; e; e = e->entries.le_next) {
 +    for (e = vm_change_state_head.lh_first; e; e = ne) {
 +        ne = e->entries.le_next;
          e->cb(e->opaque, running, reason);
      }
  }
 --
 1.7.1.2



 --
 yamahata
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: [PATCH 07/15] ftrace: fix event alignment: kvm:kvm_hv_hypercall

2010-12-08 Thread Avi Kivity

On 12/07/2010 11:16 PM, David Sharp wrote:


  I don't understand this.  Can you elaborate?  What does 32-bit addressable
  mean?

The ring buffer gives you space that is a multiple of 4 bytes in
length and 32-bit aligned. It is therefore useless to align the
structure beyond a 32-bit boundary, e.g. to a 64-bit boundary,
because it is unpredictable whether the memory the structure will be
written to is at a 64-bit boundary (addr % 8 could be 0 or 4).

  And predicated on packing the event structures?  Is the structure
  __attribute__((packed)), or is it not?

It is not packed in Linus' tree, but one of the patches earlier in
this series adds __attribute__((packed)). This patch assumes that the
event-packing patch has been applied, and should not be applied if the
packing patch is not (hence, "predicated").


Thanks for the explanations; it makes sense now.

--
error compiling committee.c: too many arguments to function



Re: [PATCH] KVM: Fix OSXSAVE after migration

2010-12-08 Thread Avi Kivity

On 12/08/2010 04:49 AM, Sheng Yang wrote:

CPUID's OSXSAVE bit is a mirror of the CR4.OSXSAVE bit. We need to update
the CPUID after migration.



Applied, thanks.


@@ -5585,6 +5585,8 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,

	mmu_reset_needed |= kvm_read_cr4(vcpu) != sregs->cr4;
	kvm_x86_ops->set_cr4(vcpu, sregs->cr4);
+	if (sregs->cr4 & X86_CR4_OSXSAVE)
+		update_cpuid(vcpu);
	if (!is_long_mode(vcpu) && is_pae(vcpu)) {
		load_pdptrs(vcpu, vcpu->arch.walk_mmu, vcpu->arch.cr3);
mmu_reset_needed = 1;



We really should use kvm_set_crX() here.

--
error compiling committee.c: too many arguments to function



Re: [PATCH] ceph/rbd block driver for qemu-kvm (v9)

2010-12-08 Thread Kevin Wolf
Am 06.12.2010 20:53, schrieb Christian Brunner:
 This is a new version of the rbd driver. The only difference from v8 is
 a check for a recent librados version in configure. If the librados version
 is too old, rbd support will be disabled.
 
 RBD is a block driver for the distributed file system Ceph
 (http://ceph.newdream.net/). This driver uses librados (which is part
 of the Ceph server) for direct access to the Ceph object store and
 runs entirely in userspace. (Yehuda also wrote a driver for the
 Linux kernel that can be used to access rbd volumes as a block
 device.)
 
 Regards,
 Christian
 
 Signed-off-by: Yehuda Sadeh yeh...@hq.newdream.net
 Signed-off-by: Christian Brunner c...@muc.de

Thanks. I still haven't managed to actually test it, but I've applied
this to the block branch now based on your testing and Stefan's review
(and the fact that it doesn't break my build any more).

Kevin


Re: [PATCH 2/2] KVM: SVM: Add xsetbv intercept

2010-12-08 Thread Avi Kivity

On 12/07/2010 06:15 PM, Joerg Roedel wrote:

This patch implements the xsetbv intercept in the AMD part
of KVM. This makes AVX usable in a safe way for the guest on
AVX-capable AMD hardware.
The patch was tested by using AVX in the guest and host in
parallel and checking for data corruption. I also ran the
KVM xsave unit tests and they all pass.



Applied both, thanks.

--
error compiling committee.c: too many arguments to function



Re: [PATCH] kvm/x86: enlarge number of possible CPUID leaves

2010-12-08 Thread Andre Przywara

Avi, Marcelo,

can you please commit this simple fix (turning 40 into 80)?
Without it, QEMU crashes reliably on our new CPUs (they return 46 leaves)
and causes pain in our testing, because we have to manually apply this
patch to each tree.


Thanks!
Andre.


Currently the number of CPUID leaves KVM handles is limited to 40.
My desktop machine (AthlonII) already has 35 and future CPUs will
expand this well beyond the limit. Extend the limit to 80 to make
room for future processors.

Signed-off-by: Andre Przywara andre.przyw...@amd.com
---
 arch/x86/include/asm/kvm_host.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

Hi,
I found that either KVM or QEMU (possibly both) is broken with respect
to handling more CPUID entries than the limit dictates. KVM returns
-E2BIG, which is the same error as when the user hasn't provided
enough space to hold all entries. QEMU will then keep enlarging the
allocated memory until it runs into an out-of-memory condition.
I have tried to fix this by teaching KVM how to deal with a capped
number of entries (there are some bugs in the current code), but this
would limit the number of CPUID entries KVM handles, which would surely
cut off the last appended PV leaves.
A proper fix would be to make this allocation dynamic. Is this a
feasible way, or will it lead to issues or side effects?

Regards,
Andre.

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 54e42c8..3cc80c4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -79,7 +79,7 @@
 #define KVM_NUM_MMU_PAGES (1 << KVM_MMU_HASH_SHIFT)
 #define KVM_MIN_FREE_MMU_PAGES 5
 #define KVM_REFILL_PAGES 25
-#define KVM_MAX_CPUID_ENTRIES 40
+#define KVM_MAX_CPUID_ENTRIES 80
 #define KVM_NR_FIXED_MTRR_REGION 88
 #define KVM_NR_VAR_MTRR 8
 





[PATCH] kvm: cleanup CR8 handling

2010-12-08 Thread Andre Przywara
The handling of CR8 writes in KVM is currently somewhat cumbersome.
This patch makes it look like the other CR register handlers
and fixes a possible issue in VMX, where the RIP would be incremented
despite an injected #GP.

Signed-off-by: Andre Przywara andre.przyw...@amd.com
---
 arch/x86/include/asm/kvm_host.h |2 +-
 arch/x86/kvm/svm.c  |7 ---
 arch/x86/kvm/vmx.c  |4 ++--
 arch/x86/kvm/x86.c  |   18 --
 4 files changed, 15 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d968cc5..2b89195 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -659,7 +659,7 @@ int kvm_task_switch(struct kvm_vcpu *vcpu, u16 tss_selector, int reason,
 int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
 int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3);
 int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
-void kvm_set_cr8(struct kvm_vcpu *vcpu, unsigned long cr8);
+int kvm_set_cr8(struct kvm_vcpu *vcpu, unsigned long cr8);
 int kvm_set_dr(struct kvm_vcpu *vcpu, int dr, unsigned long val);
 int kvm_get_dr(struct kvm_vcpu *vcpu, int dr, unsigned long *val);
 unsigned long kvm_get_cr8(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ae943bb..ed5950c 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2610,16 +2610,17 @@ static int cr0_write_interception(struct vcpu_svm *svm)
 static int cr8_write_interception(struct vcpu_svm *svm)
 {
	struct kvm_run *kvm_run = svm->vcpu.run;
+	int r;
 
	u8 cr8_prev = kvm_get_cr8(&svm->vcpu);
	/* instruction emulation calls kvm_set_cr8() */
-	emulate_instruction(&svm->vcpu, 0, 0, 0);
+	r = emulate_instruction(&svm->vcpu, 0, 0, 0);
	if (irqchip_in_kernel(svm->vcpu.kvm)) {
		clr_cr_intercept(svm, INTERCEPT_CR8_WRITE);
-		return 1;
+		return r == EMULATE_DONE;
	}
	if (cr8_prev <= kvm_get_cr8(&svm->vcpu))
-		return 1;
+		return r == EMULATE_DONE;
	kvm_run->exit_reason = KVM_EXIT_SET_TPR;
return 0;
 }
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 72cfdb7..e5ef924 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3164,8 +3164,8 @@ static int handle_cr(struct kvm_vcpu *vcpu)
case 8: {
		u8 cr8_prev = kvm_get_cr8(vcpu);
		u8 cr8 = kvm_register_read(vcpu, reg);
-		kvm_set_cr8(vcpu, cr8);
-		skip_emulated_instruction(vcpu);
+		err = kvm_set_cr8(vcpu, cr8);
+		complete_insn_gp(vcpu, err);
		if (irqchip_in_kernel(vcpu->kvm))
			return 1;
		if (cr8_prev <= cr8)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ed373ba..63b8877 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -667,7 +667,7 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 }
 EXPORT_SYMBOL_GPL(kvm_set_cr3);
 
-int __kvm_set_cr8(struct kvm_vcpu *vcpu, unsigned long cr8)
+int kvm_set_cr8(struct kvm_vcpu *vcpu, unsigned long cr8)
 {
	if (cr8 & CR8_RESERVED_BITS)
return 1;
@@ -677,12 +677,6 @@ int __kvm_set_cr8(struct kvm_vcpu *vcpu, unsigned long cr8)
	vcpu->arch.cr8 = cr8;
return 0;
 }
-
-void kvm_set_cr8(struct kvm_vcpu *vcpu, unsigned long cr8)
-{
-   if (__kvm_set_cr8(vcpu, cr8))
-   kvm_inject_gp(vcpu, 0);
-}
 EXPORT_SYMBOL_GPL(kvm_set_cr8);
 
 unsigned long kvm_get_cr8(struct kvm_vcpu *vcpu)
@@ -4104,7 +4098,7 @@ static int emulator_set_cr(int cr, unsigned long val, struct kvm_vcpu *vcpu)
		res = kvm_set_cr4(vcpu, mk_cr_64(kvm_read_cr4(vcpu), val));
		break;
	case 8:
-		res = __kvm_set_cr8(vcpu, val & 0xfUL);
+		res = kvm_set_cr8(vcpu, val);
		break;
	default:
		vcpu_printf(vcpu, "%s: unexpected cr %u\n", __func__, cr);
@@ -5379,8 +5373,12 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
}
 
	/* re-sync apic's tpr */
-	if (!irqchip_in_kernel(vcpu->kvm))
-		kvm_set_cr8(vcpu, kvm_run->cr8);
+	if (!irqchip_in_kernel(vcpu->kvm)) {
+		if (kvm_set_cr8(vcpu, kvm_run->cr8) != 0) {
+			r = -EINVAL;
+			goto out;
+		}
+	}
 
	if (vcpu->arch.pio.count || vcpu->mmio_needed) {
		if (vcpu->mmio_needed) {
-- 
1.6.4




[PATCHv8 03/16] Keep track of ISA ports ISA device is using in qdev.

2010-12-08 Thread Gleb Natapov
Store all io ports used by a device in the ISADevice structure.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 hw/cs4231a.c |1 +
 hw/fdc.c |3 +++
 hw/gus.c |4 
 hw/ide/isa.c |2 ++
 hw/isa-bus.c |   25 +
 hw/isa.h |4 
 hw/m48t59.c  |1 +
 hw/mc146818rtc.c |1 +
 hw/ne2000-isa.c  |3 +++
 hw/parallel.c|5 +
 hw/pckbd.c   |3 +++
 hw/sb16.c|4 
 hw/serial.c  |1 +
 13 files changed, 57 insertions(+), 0 deletions(-)

diff --git a/hw/cs4231a.c b/hw/cs4231a.c
index 4d5ce5c..598f032 100644
--- a/hw/cs4231a.c
+++ b/hw/cs4231a.c
@@ -645,6 +645,7 @@ static int cs4231a_initfn (ISADevice *dev)
     isa_init_irq (dev, &s->pic, s->irq);
 
     for (i = 0; i < 4; i++) {
+        isa_init_ioport(dev, i);
         register_ioport_write (s->port + i, 1, 1, cs_write, s);
         register_ioport_read (s->port + i, 1, 1, cs_read, s);
 }
diff --git a/hw/fdc.c b/hw/fdc.c
index a467c4b..22fb64a 100644
--- a/hw/fdc.c
+++ b/hw/fdc.c
@@ -1983,6 +1983,9 @@ static int isabus_fdc_init1(ISADevice *dev)
   fdctrl_write_port, fdctrl);
 register_ioport_write(iobase + 0x07, 1, 1,
   fdctrl_write_port, fdctrl);
+isa_init_ioport_range(dev, iobase, 6);
+isa_init_ioport(dev, iobase + 7);
+
     isa_init_irq(&isa->busdev, &fdctrl->irq, isairq);
     fdctrl->dma_chann = dma_chann;
 
diff --git a/hw/gus.c b/hw/gus.c
index e9016d8..ff9e7c7 100644
--- a/hw/gus.c
+++ b/hw/gus.c
@@ -264,20 +264,24 @@ static int gus_initfn (ISADevice *dev)
 
     register_ioport_write (s->port, 1, 1, gus_writeb, s);
     register_ioport_write (s->port, 1, 2, gus_writew, s);
+    isa_init_ioport_range(dev, s->port, 2);
 
     register_ioport_read ((s->port + 0x100) & 0xf00, 1, 1, gus_readb, s);
     register_ioport_read ((s->port + 0x100) & 0xf00, 1, 2, gus_readw, s);
+    isa_init_ioport_range(dev, (s->port + 0x100) & 0xf00, 2);
 
     register_ioport_write (s->port + 6, 10, 1, gus_writeb, s);
     register_ioport_write (s->port + 6, 10, 2, gus_writew, s);
     register_ioport_read (s->port + 6, 10, 1, gus_readb, s);
     register_ioport_read (s->port + 6, 10, 2, gus_readw, s);
+    isa_init_ioport_range(dev, s->port + 6, 10);
 
 
     register_ioport_write (s->port + 0x100, 8, 1, gus_writeb, s);
     register_ioport_write (s->port + 0x100, 8, 2, gus_writew, s);
     register_ioport_read (s->port + 0x100, 8, 1, gus_readb, s);
     register_ioport_read (s->port + 0x100, 8, 2, gus_readw, s);
+    isa_init_ioport_range(dev, s->port + 0x100, 8);
 
     DMA_register_channel (s->emu.gusdma, GUS_read_DMA, s);
     s->emu.himemaddr = s->himem;
diff --git a/hw/ide/isa.c b/hw/ide/isa.c
index 9856435..4206afd 100644
--- a/hw/ide/isa.c
+++ b/hw/ide/isa.c
@@ -70,6 +70,8 @@ static int isa_ide_initfn(ISADevice *dev)
     ide_bus_new(&s->bus, &s->dev.qdev);
     ide_init_ioport(&s->bus, s->iobase, s->iobase2);
     isa_init_irq(dev, &s->irq, s->isairq);
+    isa_init_ioport_range(dev, s->iobase, 8);
+    isa_init_ioport(dev, s->iobase2);
     ide_init2(&s->bus, s->irq);
     vmstate_register(&dev->qdev, 0, &vmstate_ide_isa, s);
 return 0;
diff --git a/hw/isa-bus.c b/hw/isa-bus.c
index 26036e0..c0ac7e9 100644
--- a/hw/isa-bus.c
+++ b/hw/isa-bus.c
@@ -92,6 +92,31 @@ void isa_init_irq(ISADevice *dev, qemu_irq *p, int isairq)
 dev-nirqs++;
 }
 
+static void isa_init_ioport_one(ISADevice *dev, uint16_t ioport)
+{
+    assert(dev->nioports < ARRAY_SIZE(dev->ioports));
+    dev->ioports[dev->nioports++] = ioport;
+}
+
+static int isa_cmp_ports(const void *p1, const void *p2)
+{
+    return *(uint16_t*)p1 - *(uint16_t*)p2;
+}
+
+void isa_init_ioport_range(ISADevice *dev, uint16_t start, uint16_t length)
+{
+    int i;
+    for (i = start; i < start + length; i++) {
+        isa_init_ioport_one(dev, i);
+    }
+    qsort(dev->ioports, dev->nioports, sizeof(dev->ioports[0]), isa_cmp_ports);
+}
+
+void isa_init_ioport(ISADevice *dev, uint16_t ioport)
+{
+isa_init_ioport_range(dev, ioport, 1);
+}
+
 static int isa_qdev_init(DeviceState *qdev, DeviceInfo *base)
 {
 ISADevice *dev = DO_UPCAST(ISADevice, qdev, qdev);
diff --git a/hw/isa.h b/hw/isa.h
index aaf0272..4794b76 100644
--- a/hw/isa.h
+++ b/hw/isa.h
@@ -14,6 +14,8 @@ struct ISADevice {
 DeviceState qdev;
 uint32_t isairq[2];
 int nirqs;
+uint16_t ioports[32];
+int nioports;
 };
 
 typedef int (*isa_qdev_initfn)(ISADevice *dev);
@@ -26,6 +28,8 @@ ISABus *isa_bus_new(DeviceState *dev);
 void isa_bus_irqs(qemu_irq *irqs);
 qemu_irq isa_reserve_irq(int isairq);
 void isa_init_irq(ISADevice *dev, qemu_irq *p, int isairq);
+void isa_init_ioport(ISADevice *dev, uint16_t ioport);
+void isa_init_ioport_range(ISADevice *dev, uint16_t start, uint16_t length);
 void isa_qdev_register(ISADeviceInfo *info);
 ISADevice *isa_create(const char *name);
 ISADevice *isa_create_simple(const char *name);
diff --git a/hw/m48t59.c b/hw/m48t59.c
index c7492a6..75a94e1 100644
--- 

[PATCHv8 02/16] Introduce new BusInfo callback get_fw_dev_path.

2010-12-08 Thread Gleb Natapov
The new get_fw_dev_path callback will be used to build a device path
usable by firmware, in contrast to the qdev QEMU-internal device path.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 hw/qdev.h |7 +++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/hw/qdev.h b/hw/qdev.h
index bc71110..f72fbde 100644
--- a/hw/qdev.h
+++ b/hw/qdev.h
@@ -49,6 +49,12 @@ struct DeviceState {
 
 typedef void (*bus_dev_printfn)(Monitor *mon, DeviceState *dev, int indent);
 typedef char *(*bus_get_dev_path)(DeviceState *dev);
+/*
+ * This callback is used to create an Open Firmware device path in accordance
+ * with the OF spec http://forthworks.com/standards/of1275.pdf. Individual bus
+ * bindings can be found here: http://playground.sun.com/1275/bindings/.
+ */
+typedef char *(*bus_get_fw_dev_path)(DeviceState *dev);
 typedef int (qbus_resetfn)(BusState *bus);
 
 struct BusInfo {
@@ -56,6 +62,7 @@ struct BusInfo {
 size_t size;
 bus_dev_printfn print_dev;
 bus_get_dev_path get_dev_path;
+bus_get_fw_dev_path get_fw_dev_path;
 qbus_resetfn *reset;
 Property *props;
 };
-- 
1.7.2.3



[PATCHv8 16/16] Pass boot device list to firmware.

2010-12-08 Thread Gleb Natapov

Signed-off-by: Gleb Natapov g...@redhat.com
---
 hw/fw_cfg.c |   14 ++
 sysemu.h|1 +
 vl.c|   48 
 3 files changed, 63 insertions(+), 0 deletions(-)

diff --git a/hw/fw_cfg.c b/hw/fw_cfg.c
index 7b9434f..20a816f 100644
--- a/hw/fw_cfg.c
+++ b/hw/fw_cfg.c
@@ -53,6 +53,7 @@ struct FWCfgState {
 FWCfgFiles *files;
 uint16_t cur_entry;
 uint32_t cur_offset;
+Notifier machine_ready;
 };
 
 static void fw_cfg_write(FWCfgState *s, uint8_t value)
@@ -315,6 +316,15 @@ int fw_cfg_add_file(FWCfgState *s,  const char *filename, uint8_t *data,
 return 1;
 }
 
+static void fw_cfg_machine_ready(struct Notifier* n)
+{
+    uint32_t len;
+    FWCfgState *s = container_of(n, FWCfgState, machine_ready);
+    char *bootindex = get_boot_devices_list(&len);
+
+    fw_cfg_add_file(s, "bootorder", (uint8_t*)bootindex, len);
+}
+
 FWCfgState *fw_cfg_init(uint32_t ctl_port, uint32_t data_port,
                         target_phys_addr_t ctl_addr, target_phys_addr_t data_addr)
 {
@@ -343,6 +353,10 @@ FWCfgState *fw_cfg_init(uint32_t ctl_port, uint32_t data_port,
 fw_cfg_add_i16(s, FW_CFG_MAX_CPUS, (uint16_t)max_cpus);
 fw_cfg_add_i16(s, FW_CFG_BOOT_MENU, (uint16_t)boot_menu);
 
+
+    s->machine_ready.notify = fw_cfg_machine_ready;
+    qemu_add_machine_init_done_notifier(&s->machine_ready);
+
 return s;
 }
 
diff --git a/sysemu.h b/sysemu.h
index c42f33a..38a20a3 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -196,4 +196,5 @@ void register_devices(void);
 
 void add_boot_device_path(int32_t bootindex, DeviceState *dev,
   const char *suffix);
+char *get_boot_devices_list(uint32_t *size);
 #endif
diff --git a/vl.c b/vl.c
index 0d20d26..c4d3fc0 100644
--- a/vl.c
+++ b/vl.c
@@ -736,6 +736,54 @@ void add_boot_device_path(int32_t bootindex, DeviceState *dev,
     QTAILQ_INSERT_TAIL(&fw_boot_order, node, link);
 }
 
+/*
+ * This function returns a null-terminated string that consists of
+ * newline-separated device paths.
+ *
+ * The memory pointed to by size is assigned the total length of the
+ * array in bytes.
+ */
+char *get_boot_devices_list(uint32_t *size)
+{
+    FWBootEntry *i;
+    uint32_t total = 0;
+    char *list = NULL;
+
+    QTAILQ_FOREACH(i, &fw_boot_order, link) {
+        char *devpath = NULL, *bootpath;
+        int len;
+
+        if (i->dev) {
+            devpath = qdev_get_fw_dev_path(i->dev);
+            assert(devpath);
+        }
+
+        if (i->suffix && devpath) {
+            bootpath = qemu_malloc(strlen(devpath) + strlen(i->suffix) + 1);
+            sprintf(bootpath, "%s%s", devpath, i->suffix);
+            qemu_free(devpath);
+        } else if (devpath) {
+            bootpath = devpath;
+        } else {
+            bootpath = strdup(i->suffix);
+            assert(bootpath);
+        }
+
+        if (total) {
+            list[total-1] = '\n';
+        }
+        len = strlen(bootpath) + 1;
+        list = qemu_realloc(list, total + len);
+        memcpy(&list[total], bootpath, len);
+        total += len;
+        qemu_free(bootpath);
+    }
+
+    *size = total;
+
+    return list;
+}
+
 static void numa_add(const char *optarg)
 {
 char option[128];
-- 
1.7.2.3



[PATCHv8 04/16] Add get_fw_dev_path callback to ISA bus in qdev.

2010-12-08 Thread Gleb Natapov
Use the device's io ports to create a unique device path.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 hw/isa-bus.c |   16 
 1 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/hw/isa-bus.c b/hw/isa-bus.c
index c0ac7e9..c423c1b 100644
--- a/hw/isa-bus.c
+++ b/hw/isa-bus.c
@@ -31,11 +31,13 @@ static ISABus *isabus;
 target_phys_addr_t isa_mem_base = 0;
 
 static void isabus_dev_print(Monitor *mon, DeviceState *dev, int indent);
+static char *isabus_get_fw_dev_path(DeviceState *dev);
 
 static struct BusInfo isa_bus_info = {
     .name      = "ISA",
 .size  = sizeof(ISABus),
 .print_dev = isabus_dev_print,
+.get_fw_dev_path = isabus_get_fw_dev_path,
 };
 
 ISABus *isa_bus_new(DeviceState *dev)
@@ -188,4 +190,18 @@ static void isabus_register_devices(void)
 sysbus_register_withprop(isabus_bridge_info);
 }
 
+static char *isabus_get_fw_dev_path(DeviceState *dev)
+{
+    ISADevice *d = (ISADevice*)dev;
+    char path[40];
+    int off;
+
+    off = snprintf(path, sizeof(path), "%s", qdev_fw_name(dev));
+    if (d->nioports) {
+        snprintf(path + off, sizeof(path) - off, "@%04x", d->ioports[0]);
+    }
+
+    return strdup(path);
+}
+
 device_init(isabus_register_devices)
-- 
1.7.2.3



[PATCHv8 11/16] Add get_fw_dev_path callback to scsi bus.

2010-12-08 Thread Gleb Natapov

Signed-off-by: Gleb Natapov g...@redhat.com
---
 hw/scsi-bus.c |   23 +++
 1 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/hw/scsi-bus.c b/hw/scsi-bus.c
index 93f0e9a..7febb86 100644
--- a/hw/scsi-bus.c
+++ b/hw/scsi-bus.c
@@ -5,9 +5,12 @@
 #include qdev.h
 #include blockdev.h
 
+static char *scsibus_get_fw_dev_path(DeviceState *dev);
+
 static struct BusInfo scsi_bus_info = {
     .name  = "SCSI",
     .size  = sizeof(SCSIBus),
+    .get_fw_dev_path = scsibus_get_fw_dev_path,
     .props = (Property[]) {
         DEFINE_PROP_UINT32("scsi-id", SCSIDevice, id, -1),
         DEFINE_PROP_END_OF_LIST(),
@@ -518,3 +521,23 @@ void scsi_req_complete(SCSIRequest *req)
            req->tag,
            req->status);
 }
+
+static char *scsibus_get_fw_dev_path(DeviceState *dev)
+{
+    SCSIDevice *d = (SCSIDevice*)dev;
+    SCSIBus *bus = scsi_bus_from_device(d);
+    char path[100];
+    int i;
+
+    for (i = 0; i < bus->ndev; i++) {
+        if (bus->devs[i] == d) {
+            break;
+        }
+    }
+
+    assert(i != bus->ndev);
+
+    snprintf(path, sizeof(path), "%s@%x", qdev_fw_name(dev), i);
+
+    return strdup(path);
+}
+}
-- 
1.7.2.3



[PATCHv8 10/16] Add get_fw_dev_path callback for usb bus.

2010-12-08 Thread Gleb Natapov

Signed-off-by: Gleb Natapov g...@redhat.com
---
 hw/usb-bus.c |   42 ++
 1 files changed, 42 insertions(+), 0 deletions(-)

diff --git a/hw/usb-bus.c b/hw/usb-bus.c
index 256b881..8b4583c 100644
--- a/hw/usb-bus.c
+++ b/hw/usb-bus.c
@@ -5,11 +5,13 @@
 #include monitor.h
 
 static void usb_bus_dev_print(Monitor *mon, DeviceState *qdev, int indent);
+static char *usbbus_get_fw_dev_path(DeviceState *dev);
 
 static struct BusInfo usb_bus_info = {
     .name  = "USB",
     .size  = sizeof(USBBus),
     .print_dev = usb_bus_dev_print,
+    .get_fw_dev_path = usbbus_get_fw_dev_path,
 };
 static int next_usb_bus = 0;
 static QTAILQ_HEAD(, USBBus) busses = QTAILQ_HEAD_INITIALIZER(busses);
@@ -307,3 +309,43 @@ USBDevice *usbdevice_create(const char *cmdline)
 }
     return usb->usbdevice_init(params);
 }
+
+static int usbbus_get_fw_dev_path_helper(USBDevice *d, USBBus *bus, char *p,
+                                         int len)
+{
+    int l = 0;
+    USBPort *port;
+
+    QTAILQ_FOREACH(port, &bus->used, next) {
+        if (port->dev == d) {
+            if (port->pdev) {
+                l = usbbus_get_fw_dev_path_helper(port->pdev, bus, p, len);
+            }
+            l += snprintf(p + l, len - l, "%s@%x/", qdev_fw_name(&d->qdev),
+                          port->index);
+            break;
+        }
+    }
+
+    return l;
+}
+
+static char *usbbus_get_fw_dev_path(DeviceState *dev)
+{
+    USBDevice *d = (USBDevice*)dev;
+    USBBus *bus = usb_bus_from_device(d);
+    char path[100];
+    int l;
+
+    assert(d->attached != 0);
+
+    l = usbbus_get_fw_dev_path_helper(d, bus, path, sizeof(path));
+
+    if (l == 0) {
+        abort();
+    }
+
+    path[l-1] = '\0';
+
+    return strdup(path);
+}
-- 
1.7.2.3



[PATCHv8 01/16] Introduce fw_name field to DeviceInfo structure.

2010-12-08 Thread Gleb Natapov
Add fw_name to DeviceInfo for use in device path building. In
contrast to name, fw_name should refer to the functionality the device
provides rather than to a particular device model the way name does.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 hw/fdc.c|1 +
 hw/ide/isa.c|1 +
 hw/ide/qdev.c   |1 +
 hw/isa-bus.c|1 +
 hw/lance.c  |1 +
 hw/piix_pci.c   |1 +
 hw/qdev.h   |6 ++
 hw/scsi-disk.c  |1 +
 hw/usb-hub.c|1 +
 hw/usb-net.c|1 +
 hw/virtio-pci.c |1 +
 11 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/hw/fdc.c b/hw/fdc.c
index c159dcb..a467c4b 100644
--- a/hw/fdc.c
+++ b/hw/fdc.c
@@ -2040,6 +2040,7 @@ static const VMStateDescription vmstate_isa_fdc ={
 static ISADeviceInfo isa_fdc_info = {
 .init = isabus_fdc_init1,
     .qdev.name  = "isa-fdc",
+    .qdev.fw_name  = "fdc",
 .qdev.size  = sizeof(FDCtrlISABus),
 .qdev.no_user = 1,
 .qdev.vmsd  = vmstate_isa_fdc,
diff --git a/hw/ide/isa.c b/hw/ide/isa.c
index 6b57e0d..9856435 100644
--- a/hw/ide/isa.c
+++ b/hw/ide/isa.c
@@ -98,6 +98,7 @@ ISADevice *isa_ide_init(int iobase, int iobase2, int isairq,
 
 static ISADeviceInfo isa_ide_info = {
     .qdev.name  = "isa-ide",
+    .qdev.fw_name  = "ide",
 .qdev.size  = sizeof(ISAIDEState),
 .init   = isa_ide_initfn,
 .qdev.reset = isa_ide_reset,
diff --git a/hw/ide/qdev.c b/hw/ide/qdev.c
index 0808760..6d27b60 100644
--- a/hw/ide/qdev.c
+++ b/hw/ide/qdev.c
@@ -134,6 +134,7 @@ static int ide_drive_initfn(IDEDevice *dev)
 
 static IDEDeviceInfo ide_drive_info = {
     .qdev.name  = "ide-drive",
+    .qdev.fw_name  = "drive",
 .qdev.size  = sizeof(IDEDrive),
 .init   = ide_drive_initfn,
 .qdev.props = (Property[]) {
diff --git a/hw/isa-bus.c b/hw/isa-bus.c
index 4e306de..26036e0 100644
--- a/hw/isa-bus.c
+++ b/hw/isa-bus.c
@@ -153,6 +153,7 @@ static int isabus_bridge_init(SysBusDevice *dev)
 static SysBusDeviceInfo isabus_bridge_info = {
 .init = isabus_bridge_init,
     .qdev.name  = "isabus-bridge",
+    .qdev.fw_name  = "isa",
 .qdev.size  = sizeof(SysBusDevice),
 .qdev.no_user = 1,
 };
diff --git a/hw/lance.c b/hw/lance.c
index dc12144..1a3bb1a 100644
--- a/hw/lance.c
+++ b/hw/lance.c
@@ -141,6 +141,7 @@ static void lance_reset(DeviceState *dev)
 static SysBusDeviceInfo lance_info = {
 .init   = lance_init,
     .qdev.name  = "lance",
+    .qdev.fw_name  = "ethernet",
 .qdev.size  = sizeof(SysBusPCNetState),
 .qdev.reset = lance_reset,
 .qdev.vmsd  = vmstate_lance,
diff --git a/hw/piix_pci.c b/hw/piix_pci.c
index b5589b9..38f9d9e 100644
--- a/hw/piix_pci.c
+++ b/hw/piix_pci.c
@@ -365,6 +365,7 @@ static PCIDeviceInfo i440fx_info[] = {
 static SysBusDeviceInfo i440fx_pcihost_info = {
 .init = i440fx_pcihost_initfn,
     .qdev.name= "i440FX-pcihost",
+    .qdev.fw_name = "pci",
 .qdev.size= sizeof(I440FXState),
 .qdev.no_user = 1,
 };
diff --git a/hw/qdev.h b/hw/qdev.h
index 3fac364..bc71110 100644
--- a/hw/qdev.h
+++ b/hw/qdev.h
@@ -141,6 +141,7 @@ typedef void (*qdev_resetfn)(DeviceState *dev);
 
 struct DeviceInfo {
 const char *name;
+const char *fw_name;
 const char *alias;
 const char *desc;
 size_t size;
@@ -306,6 +307,11 @@ void qdev_prop_set_defaults(DeviceState *dev, Property 
*props);
 void qdev_prop_register_global_list(GlobalProperty *props);
 void qdev_prop_set_globals(DeviceState *dev);
 
+static inline const char *qdev_fw_name(DeviceState *dev)
+{
+    return dev->info->fw_name ? : dev->info->alias ? : dev->info->name;
+}
+
 /* This is a nasty hack to allow passing a NULL bus to qdev_create.  */
 extern struct BusInfo system_bus_info;
 
diff --git a/hw/scsi-disk.c b/hw/scsi-disk.c
index 6e49404..851046f 100644
--- a/hw/scsi-disk.c
+++ b/hw/scsi-disk.c
@@ -1230,6 +1230,7 @@ static int scsi_disk_initfn(SCSIDevice *dev)
 
 static SCSIDeviceInfo scsi_disk_info = {
     .qdev.name= "scsi-disk",
+    .qdev.fw_name = "disk",
     .qdev.desc= "virtual scsi disk or cdrom",
 .qdev.size= sizeof(SCSIDiskState),
 .qdev.reset   = scsi_disk_reset,
diff --git a/hw/usb-hub.c b/hw/usb-hub.c
index 2a1edfc..8e3a96b 100644
--- a/hw/usb-hub.c
+++ b/hw/usb-hub.c
@@ -545,6 +545,7 @@ static int usb_hub_initfn(USBDevice *dev)
 static struct USBDeviceInfo hub_info = {
     .product_desc   = "QEMU USB Hub",
     .qdev.name  = "usb-hub",
+    .qdev.fw_name= "hub",
 .qdev.size  = sizeof(USBHubState),
 .init   = usb_hub_initfn,
 .handle_packet  = usb_hub_handle_packet,
diff --git a/hw/usb-net.c b/hw/usb-net.c
index 58c672f..f6bed21 100644
--- a/hw/usb-net.c
+++ b/hw/usb-net.c
@@ -1496,6 +1496,7 @@ static USBDevice *usb_net_init(const char *cmdline)
 static struct USBDeviceInfo net_info = {
     .product_desc   = "QEMU USB Network Interface",
     .qdev.name  = "usb-net",
+    .qdev.fw_name= "network",
 .qdev.size  = sizeof(USBNetState),
 .init   = usb_net_initfn,
 .handle_packet  = 

[PATCHv8 07/16] Add get_fw_dev_path callback for system bus.

2010-12-08 Thread Gleb Natapov
Prints out the mmio or pio range used to access the child device.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 hw/pci_host.c |2 ++
 hw/sysbus.c   |   30 ++
 hw/sysbus.h   |4 
 3 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/hw/pci_host.c b/hw/pci_host.c
index bc5b771..28d45bf 100644
--- a/hw/pci_host.c
+++ b/hw/pci_host.c
@@ -197,6 +197,7 @@ void pci_host_conf_register_ioport(pio_addr_t ioport, PCIHostState *s)
 {
     pci_host_init(s);
     register_ioport_simple(&s->conf_noswap_handler, ioport, 4, 4);
+    sysbus_init_ioports(&s->busdev, ioport, 4);
 }
 
 int pci_host_data_register_mmio(PCIHostState *s, int swap)
@@ -215,4 +216,5 @@ void pci_host_data_register_ioport(pio_addr_t ioport, PCIHostState *s)
     register_ioport_simple(&s->data_noswap_handler, ioport, 4, 1);
     register_ioport_simple(&s->data_noswap_handler, ioport, 4, 2);
     register_ioport_simple(&s->data_noswap_handler, ioport, 4, 4);
+    sysbus_init_ioports(&s->busdev, ioport, 4);
 }
diff --git a/hw/sysbus.c b/hw/sysbus.c
index d817721..1583bd8 100644
--- a/hw/sysbus.c
+++ b/hw/sysbus.c
@@ -22,11 +22,13 @@
 #include monitor.h
 
 static void sysbus_dev_print(Monitor *mon, DeviceState *dev, int indent);
+static char *sysbus_get_fw_dev_path(DeviceState *dev);
 
 struct BusInfo system_bus_info = {
     .name       = "System",
 .size   = sizeof(BusState),
 .print_dev  = sysbus_dev_print,
+.get_fw_dev_path = sysbus_get_fw_dev_path,
 };
 
 void sysbus_connect_irq(SysBusDevice *dev, int n, qemu_irq irq)
@@ -106,6 +108,16 @@ void sysbus_init_mmio_cb(SysBusDevice *dev, target_phys_addr_t size,
     dev->mmio[n].cb = cb;
 }
 }
 
+void sysbus_init_ioports(SysBusDevice *dev, pio_addr_t ioport, pio_addr_t size)
+{
+pio_addr_t i;
+
+for (i = 0; i < size; i++) {
+assert(dev->num_pio < QDEV_MAX_PIO);
+dev->pio[dev->num_pio++] = ioport++;
+}
+}
+
 static int sysbus_device_init(DeviceState *dev, DeviceInfo *base)
 {
 SysBusDeviceInfo *info = container_of(base, SysBusDeviceInfo, qdev);
@@ -171,3 +183,21 @@ static void sysbus_dev_print(Monitor *mon, DeviceState *dev, int indent)
indent, "", s->mmio[i].addr, s->mmio[i].size);
 }
 }
+
+static char *sysbus_get_fw_dev_path(DeviceState *dev)
+{
+SysBusDevice *s = sysbus_from_qdev(dev);
+char path[40];
+int off;
+
+off = snprintf(path, sizeof(path), "%s", qdev_fw_name(dev));
+
+if (s->num_mmio) {
+snprintf(path + off, sizeof(path) - off, "@" TARGET_FMT_plx,
+ s->mmio[0].addr);
+} else if (s->num_pio) {
+snprintf(path + off, sizeof(path) - off, "@i%04x", s->pio[0]);
+}
+
+return strdup(path);
+}
diff --git a/hw/sysbus.h b/hw/sysbus.h
index 5980901..e9eb618 100644
--- a/hw/sysbus.h
+++ b/hw/sysbus.h
@@ -6,6 +6,7 @@
 #include "qdev.h"
 
 #define QDEV_MAX_MMIO 32
+#define QDEV_MAX_PIO 32
 #define QDEV_MAX_IRQ 256
 
 typedef struct SysBusDevice SysBusDevice;
@@ -23,6 +24,8 @@ struct SysBusDevice {
 mmio_mapfunc cb;
 ram_addr_t iofunc;
 } mmio[QDEV_MAX_MMIO];
+int num_pio;
+pio_addr_t pio[QDEV_MAX_PIO];
 };
 
 typedef int (*sysbus_initfn)(SysBusDevice *dev);
@@ -45,6 +48,7 @@ void sysbus_init_mmio_cb(SysBusDevice *dev, target_phys_addr_t size,
 mmio_mapfunc cb);
 void sysbus_init_irq(SysBusDevice *dev, qemu_irq *p);
 void sysbus_pass_irq(SysBusDevice *dev, SysBusDevice *target);
+void sysbus_init_ioports(SysBusDevice *dev, pio_addr_t ioport, pio_addr_t size);
 
 
 void sysbus_connect_irq(SysBusDevice *dev, int n, qemu_irq irq);
-- 
1.7.2.3



[PATCHv8 13/16] Change fw_cfg_add_file() to get full file path as a parameter.

2010-12-08 Thread Gleb Natapov
Change fw_cfg_add_file() to get the full file path as a parameter instead
of building one internally. Two reasons for that: first, the caller may
need to know how the file is named; second, it moves the file-naming
policy out of fw_cfg. A platform may want to use more than two levels of
directories, for instance.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 hw/fw_cfg.c |   16 
 hw/fw_cfg.h |4 ++--
 hw/loader.c |   16 ++--
 3 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/hw/fw_cfg.c b/hw/fw_cfg.c
index 72866ae..7b9434f 100644
--- a/hw/fw_cfg.c
+++ b/hw/fw_cfg.c
@@ -277,10 +277,9 @@ int fw_cfg_add_callback(FWCfgState *s, uint16_t key, FWCfgCallback callback,
 return 1;
 }
 
-int fw_cfg_add_file(FWCfgState *s,  const char *dir, const char *filename,
-uint8_t *data, uint32_t len)
+int fw_cfg_add_file(FWCfgState *s,  const char *filename, uint8_t *data,
+uint32_t len)
 {
-const char *basename;
 int i, index;
 
 if (!s->files) {
@@ -297,15 +296,8 @@ int fw_cfg_add_file(FWCfgState *s,  const char *dir, const char *filename,
 
 fw_cfg_add_bytes(s, FW_CFG_FILE_FIRST + index, data, len);
 
-basename = strrchr(filename, '/');
-if (basename) {
-basename++;
-} else {
-basename = filename;
-}
-
-snprintf(s->files->f[index].name, sizeof(s->files->f[index].name),
- "%s/%s", dir, basename);
+pstrcpy(s->files->f[index].name, sizeof(s->files->f[index].name),
+filename);
 for (i = 0; i < index; i++) {
 if (strcmp(s->files->f[index].name, s->files->f[i].name) == 0) {
 FW_CFG_DPRINTF("%s: skip duplicate: %s\n", __FUNCTION__,
diff --git a/hw/fw_cfg.h b/hw/fw_cfg.h
index 4d13a4f..856bf91 100644
--- a/hw/fw_cfg.h
+++ b/hw/fw_cfg.h
 int fw_cfg_add_i32(FWCfgState *s, uint16_t key, uint32_t value);
 int fw_cfg_add_i64(FWCfgState *s, uint16_t key, uint64_t value);
 int fw_cfg_add_callback(FWCfgState *s, uint16_t key, FWCfgCallback callback,
 void *callback_opaque, uint8_t *data, size_t len);
-int fw_cfg_add_file(FWCfgState *s, const char *dir, const char *filename,
-uint8_t *data, uint32_t len);
+int fw_cfg_add_file(FWCfgState *s, const char *filename, uint8_t *data,
+uint32_t len);
 FWCfgState *fw_cfg_init(uint32_t ctl_port, uint32_t data_port,
 target_phys_addr_t ctl_addr, target_phys_addr_t data_addr);
 
diff --git a/hw/loader.c b/hw/loader.c
index 49ac1fa..1e98326 100644
--- a/hw/loader.c
+++ b/hw/loader.c
@@ -592,8 +592,20 @@ int rom_add_file(const char *file, const char *fw_dir,
 }
 close(fd);
 rom_insert(rom);
-if (rom->fw_file && fw_cfg)
-fw_cfg_add_file(fw_cfg, rom->fw_dir, rom->fw_file, rom->data, rom->romsize);
+if (rom->fw_file && fw_cfg) {
+const char *basename;
+char fw_file_name[56];
+
+basename = strrchr(rom->fw_file, '/');
+if (basename) {
+basename++;
+} else {
+basename = rom->fw_file;
+}
+snprintf(fw_file_name, sizeof(fw_file_name), "%s/%s", rom->fw_dir,
+ basename);
+fw_cfg_add_file(fw_cfg, fw_file_name, rom->data, rom->romsize);
+}
 return 0;
 
 err:
-- 
1.7.2.3



[PATCHv8 12/16] Add bootindex parameter to net/block/fd device

2010-12-08 Thread Gleb Natapov
If bootindex is specified on the command line, a string that describes the
device in a firmware-readable way is added to a sorted list. Later this
list will be passed to the firmware to control the boot order.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 block_int.h |4 +++-
 hw/e1000.c  |4 
 hw/eepro100.c   |3 +++
 hw/fdc.c|8 
 hw/ide/qdev.c   |5 +
 hw/ne2000.c |3 +++
 hw/pcnet.c  |4 
 hw/qdev.c   |   32 
 hw/qdev.h   |1 +
 hw/rtl8139.c|4 
 hw/scsi-disk.c  |1 +
 hw/usb-net.c|2 ++
 hw/virtio-blk.c |2 ++
 hw/virtio-net.c |2 ++
 net.h   |4 +++-
 sysemu.h|2 ++
 vl.c|   40 
 17 files changed, 119 insertions(+), 2 deletions(-)

diff --git a/block_int.h b/block_int.h
index 3c3adb5..0a0e47d 100644
--- a/block_int.h
+++ b/block_int.h
@@ -227,6 +227,7 @@ typedef struct BlockConf {
 uint16_t logical_block_size;
 uint16_t min_io_size;
 uint32_t opt_io_size;
+int32_t bootindex;
 } BlockConf;
 
 static inline unsigned int get_physical_block_exp(BlockConf *conf)
@@ -249,6 +250,7 @@ static inline unsigned int get_physical_block_exp(BlockConf *conf)
 DEFINE_PROP_UINT16("physical_block_size", _state,   \
 _conf.physical_block_size, 512), \
 DEFINE_PROP_UINT16("min_io_size", _state, _conf.min_io_size, 0),  \
-DEFINE_PROP_UINT32("opt_io_size", _state, _conf.opt_io_size, 0)
+DEFINE_PROP_UINT32("opt_io_size", _state, _conf.opt_io_size, 0),\
+DEFINE_PROP_INT32("bootindex", _state, _conf.bootindex, -1) \
 
 #endif /* BLOCK_INT_H */
diff --git a/hw/e1000.c b/hw/e1000.c
index 57d08cf..e411b03 100644
--- a/hw/e1000.c
+++ b/hw/e1000.c
@@ -30,6 +30,7 @@
 #include "net.h"
 #include "net/checksum.h"
 #include "loader.h"
+#include "sysemu.h"
 
 #include "e1000_hw.h"
 
@@ -1154,6 +1155,9 @@ static int pci_e1000_init(PCIDevice *pci_dev)
   d->dev.qdev.info->name, d->dev.qdev.id, d);
 
 qemu_format_nic_info_str(&d->nic->nc, macaddr);
+
+add_boot_device_path(d->conf.bootindex, &pci_dev->qdev, "/ethernet-phy@0");
+
 return 0;
 }
 
diff --git a/hw/eepro100.c b/hw/eepro100.c
index f8a700a..a464e9b 100644
--- a/hw/eepro100.c
+++ b/hw/eepro100.c
@@ -46,6 +46,7 @@
 #include "pci.h"
 #include "net.h"
 #include "eeprom93xx.h"
+#include "sysemu.h"
 
 #define KiB 1024
 
@@ -1907,6 +1908,8 @@ static int e100_nic_init(PCIDevice *pci_dev)
 s->vmstate->name = s->nic->nc.model;
 vmstate_register(&pci_dev->qdev, -1, s->vmstate, s);
 
+add_boot_device_path(s->conf.bootindex, &pci_dev->qdev, "/ethernet-phy@0");
+
 return 0;
 }
 
diff --git a/hw/fdc.c b/hw/fdc.c
index 22fb64a..a7c7c17 100644
--- a/hw/fdc.c
+++ b/hw/fdc.c
@@ -35,6 +35,7 @@
 #include "sysbus.h"
 #include "qdev-addr.h"
 #include "blockdev.h"
+#include "sysemu.h"
 
 //
 /* debug Floppy devices */
@@ -523,6 +524,8 @@ typedef struct FDCtrlSysBus {
 typedef struct FDCtrlISABus {
 ISADevice busdev;
 struct FDCtrl state;
+int32_t bootindexA;
+int32_t bootindexB;
 } FDCtrlISABus;
 
 static uint32_t fdctrl_read (void *opaque, uint32_t reg)
@@ -1992,6 +1995,9 @@ static int isabus_fdc_init1(ISADevice *dev)
 qdev_set_legacy_instance_id(&dev->qdev, iobase, 2);
 ret = fdctrl_init_common(fdctrl);
 
+add_boot_device_path(isa->bootindexA, &dev->qdev, "/floppy@0");
+add_boot_device_path(isa->bootindexB, &dev->qdev, "/floppy@1");
+
 return ret;
 }
 
@@ -2051,6 +2057,8 @@ static ISADeviceInfo isa_fdc_info = {
 .qdev.props = (Property[]) {
 DEFINE_PROP_DRIVE("driveA", FDCtrlISABus, state.drives[0].bs),
 DEFINE_PROP_DRIVE("driveB", FDCtrlISABus, state.drives[1].bs),
+DEFINE_PROP_INT32("bootindexA", FDCtrlISABus, bootindexA, -1),
+DEFINE_PROP_INT32("bootindexB", FDCtrlISABus, bootindexB, -1),
 DEFINE_PROP_END_OF_LIST(),
 },
 };
diff --git a/hw/ide/qdev.c b/hw/ide/qdev.c
index 01a181b..2bb5c27 100644
--- a/hw/ide/qdev.c
+++ b/hw/ide/qdev.c
@@ -21,6 +21,7 @@
 #include "qemu-error.h"
 #include <hw/ide/internal.h>
 #include "blockdev.h"
+#include "sysemu.h"
 
 /* - */
 
@@ -143,6 +144,10 @@ static int ide_drive_initfn(IDEDevice *dev)
 if (!dev->serial) {
 dev->serial = qemu_strdup(s->drive_serial_str);
 }
+
+add_boot_device_path(dev->conf.bootindex, &dev->qdev,
+ dev->unit ? "/disk@1" : "/disk@0");
+
 return 0;
 }
 
diff --git a/hw/ne2000.c b/hw/ne2000.c
index 126e7cf..a030106 100644
--- a/hw/ne2000.c
+++ b/hw/ne2000.c
@@ -26,6 +26,7 @@
 #include "net.h"
 #include "ne2000.h"
 #include "loader.h"
+#include "sysemu.h"
 
 /* debug NE2000 card */
 //#define DEBUG_NE2000
@@ -746,6 +747,8 @@ static int pci_ne2000_init(PCIDevice *pci_dev)
 }
 }
 
+add_boot_device_path(s->c.bootindex, &pci_dev->qdev, "/ethernet-phy@0");
+
 return 0;
 }
 

[PATCHv8 09/16] Record which USBDevice a USBPort belongs to.

2010-12-08 Thread Gleb Natapov
Ports on the root hub will have NULL here. This is needed to reconstruct
the path from a device to its root hub when building a device path.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 hw/usb-bus.c  |3 ++-
 hw/usb-hub.c  |2 +-
 hw/usb-musb.c |2 +-
 hw/usb-ohci.c |2 +-
 hw/usb-uhci.c |2 +-
 hw/usb.h  |3 ++-
 6 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/hw/usb-bus.c b/hw/usb-bus.c
index b692503..256b881 100644
--- a/hw/usb-bus.c
+++ b/hw/usb-bus.c
@@ -110,11 +110,12 @@ USBDevice *usb_create_simple(USBBus *bus, const char *name)
 }
 
 void usb_register_port(USBBus *bus, USBPort *port, void *opaque, int index,
-   usb_attachfn attach)
+   USBDevice *pdev, usb_attachfn attach)
 {
 port->opaque = opaque;
 port->index = index;
 port->attach = attach;
+port->pdev = pdev;
 QTAILQ_INSERT_TAIL(&bus->free, port, next);
 bus->nfree++;
 }
diff --git a/hw/usb-hub.c b/hw/usb-hub.c
index 8e3a96b..8a3f829 100644
--- a/hw/usb-hub.c
+++ b/hw/usb-hub.c
@@ -535,7 +535,7 @@ static int usb_hub_initfn(USBDevice *dev)
 for (i = 0; i < s->nb_ports; i++) {
 port = &s->ports[i];
 usb_register_port(usb_bus_from_device(dev),
-  &port->port, s, i, usb_hub_attach);
+  &port->port, s, i, &s->dev, usb_hub_attach);
 port->wPortStatus = PORT_STAT_POWER;
 port->wPortChange = 0;
 }
diff --git a/hw/usb-musb.c b/hw/usb-musb.c
index 7f15842..9efe7a6 100644
--- a/hw/usb-musb.c
+++ b/hw/usb-musb.c
@@ -343,7 +343,7 @@ struct MUSBState {
 }
 
 usb_bus_new(&s->bus, NULL /* FIXME */);
-usb_register_port(&s->bus, &s->port, s, 0, musb_attach);
+usb_register_port(&s->bus, &s->port, s, 0, NULL, musb_attach);
 
 return s;
 }
diff --git a/hw/usb-ohci.c b/hw/usb-ohci.c
index 8fb2f83..1247295 100644
--- a/hw/usb-ohci.c
+++ b/hw/usb-ohci.c
@@ -1705,7 +1705,7 @@ static void usb_ohci_init(OHCIState *ohci, DeviceState *dev,
 usb_bus_new(&ohci->bus, dev);
 ohci->num_ports = num_ports;
 for (i = 0; i < num_ports; i++) {
-usb_register_port(&ohci->bus, &ohci->rhport[i].port, ohci, i, ohci_attach);
+usb_register_port(&ohci->bus, &ohci->rhport[i].port, ohci, i, NULL, ohci_attach);
 }
 
 ohci-async_td = 0;
diff --git a/hw/usb-uhci.c b/hw/usb-uhci.c
index 1d83400..b9b822f 100644
--- a/hw/usb-uhci.c
+++ b/hw/usb-uhci.c
@@ -1115,7 +1115,7 @@ static int usb_uhci_common_initfn(UHCIState *s)
 
 usb_bus_new(&s->bus, &s->dev.qdev);
 for(i = 0; i < NB_PORTS; i++) {
-usb_register_port(&s->bus, &s->ports[i].port, s, i, uhci_attach);
+usb_register_port(&s->bus, &s->ports[i].port, s, i, NULL, uhci_attach);
 }
 s->frame_timer = qemu_new_timer(vm_clock, uhci_frame_timer, s);
 s->expire_time = qemu_get_clock(vm_clock) +
diff --git a/hw/usb.h b/hw/usb.h
index 00d2802..0b32d77 100644
--- a/hw/usb.h
+++ b/hw/usb.h
@@ -203,6 +203,7 @@ struct USBPort {
 USBDevice *dev;
 usb_attachfn attach;
 void *opaque;
+USBDevice *pdev;
 int index; /* internal port index, may be used with the opaque */
 QTAILQ_ENTRY(USBPort) next;
 };
@@ -312,7 +313,7 @@ USBDevice *usb_create(USBBus *bus, const char *name);
 USBDevice *usb_create_simple(USBBus *bus, const char *name);
 USBDevice *usbdevice_create(const char *cmdline);
 void usb_register_port(USBBus *bus, USBPort *port, void *opaque, int index,
-   usb_attachfn attach);
+   USBDevice *pdev, usb_attachfn attach);
 void usb_unregister_port(USBBus *bus, USBPort *port);
 int usb_device_attach(USBDevice *dev);
 int usb_device_detach(USBDevice *dev);
-- 
1.7.2.3



[PATCHv8 08/16] Add get_fw_dev_path callback for pci bus.

2010-12-08 Thread Gleb Natapov

Signed-off-by: Gleb Natapov g...@redhat.com
---
 hw/pci.c |  108 -
 1 files changed, 85 insertions(+), 23 deletions(-)

diff --git a/hw/pci.c b/hw/pci.c
index 0c15b13..e7ea907 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -43,6 +43,7 @@
 
 static void pcibus_dev_print(Monitor *mon, DeviceState *dev, int indent);
 static char *pcibus_get_dev_path(DeviceState *dev);
+static char *pcibus_get_fw_dev_path(DeviceState *dev);
 static int pcibus_reset(BusState *qbus);
 
 struct BusInfo pci_bus_info = {
@@ -50,6 +51,7 @@ struct BusInfo pci_bus_info = {
 .size   = sizeof(PCIBus),
 .print_dev  = pcibus_dev_print,
 .get_dev_path = pcibus_get_dev_path,
+.get_fw_dev_path = pcibus_get_fw_dev_path,
 .reset  = pcibus_reset,
 .props  = (Property[]) {
 DEFINE_PROP_PCI_DEVFN(addr, PCIDevice, devfn, -1),
@@ -1117,45 +1119,63 @@ void pci_msi_notify(PCIDevice *dev, unsigned int vector)
 typedef struct {
 uint16_t class;
 const char *desc;
+const char *fw_name;
+uint16_t fw_ign_bits;
 } pci_class_desc;
 
 static const pci_class_desc pci_class_descriptions[] =
 {
-{ 0x0100, "SCSI controller"},
-{ 0x0101, "IDE controller"},
-{ 0x0102, "Floppy controller"},
-{ 0x0103, "IPI controller"},
-{ 0x0104, "RAID controller"},
+{ 0x0001, "VGA controller", "display"},
+{ 0x0100, "SCSI controller", "scsi"},
+{ 0x0101, "IDE controller", "ide"},
+{ 0x0102, "Floppy controller", "fdc"},
+{ 0x0103, "IPI controller", "ipi"},
+{ 0x0104, "RAID controller", "raid"},
 { 0x0106, "SATA controller"},
 { 0x0107, "SAS controller"},
 { 0x0180, "Storage controller"},
-{ 0x0200, "Ethernet controller"},
-{ 0x0201, "Token Ring controller"},
-{ 0x0202, "FDDI controller"},
-{ 0x0203, "ATM controller"},
+{ 0x0200, "Ethernet controller", "ethernet"},
+{ 0x0201, "Token Ring controller", "token-ring"},
+{ 0x0202, "FDDI controller", "fddi"},
+{ 0x0203, "ATM controller", "atm"},
 { 0x0280, "Network controller"},
-{ 0x0300, "VGA controller"},
+{ 0x0300, "VGA controller", "display", 0x00ff},
 { 0x0301, "XGA controller"},
 { 0x0302, "3D controller"},
 { 0x0380, "Display controller"},
-{ 0x0400, "Video controller"},
-{ 0x0401, "Audio controller"},
+{ 0x0400, "Video controller", "video"},
+{ 0x0401, "Audio controller", "sound"},
 { 0x0402, "Phone"},
 { 0x0480, "Multimedia controller"},
-{ 0x0500, "RAM controller"},
-{ 0x0501, "Flash controller"},
+{ 0x0500, "RAM controller", "memory"},
+{ 0x0501, "Flash controller", "flash"},
 { 0x0580, "Memory controller"},
-{ 0x0600, "Host bridge"},
-{ 0x0601, "ISA bridge"},
-{ 0x0602, "EISA bridge"},
-{ 0x0603, "MC bridge"},
-{ 0x0604, "PCI bridge"},
-{ 0x0605, "PCMCIA bridge"},
-{ 0x0606, "NUBUS bridge"},
-{ 0x0607, "CARDBUS bridge"},
+{ 0x0600, "Host bridge", "host"},
+{ 0x0601, "ISA bridge", "isa"},
+{ 0x0602, "EISA bridge", "eisa"},
+{ 0x0603, "MC bridge", "mca"},
+{ 0x0604, "PCI bridge", "pci"},
+{ 0x0605, "PCMCIA bridge", "pcmcia"},
+{ 0x0606, "NUBUS bridge", "nubus"},
+{ 0x0607, "CARDBUS bridge", "cardbus"},
 { 0x0608, "RACEWAY bridge"},
 { 0x0680, "Bridge"},
-{ 0x0c03, "USB controller"},
+{ 0x0700, "Serial port", "serial"},
+{ 0x0701, "Parallel port", "parallel"},
+{ 0x0800, "Interrupt controller", "interrupt-controller"},
+{ 0x0801, "DMA controller", "dma-controller"},
+{ 0x0802, "Timer", "timer"},
+{ 0x0803, "RTC", "rtc"},
+{ 0x0900, "Keyboard", "keyboard"},
+{ 0x0901, "Pen", "pen"},
+{ 0x0902, "Mouse", "mouse"},
+{ 0x0A00, "Dock station", "dock", 0x00ff},
+{ 0x0B00, "i386 cpu", "cpu", 0x00ff},
+{ 0x0c00, "Fireware controller", "fireware"},
+{ 0x0c01, "Access bus controller", "access-bus"},
+{ 0x0c02, "SSA controller", "ssa"},
+{ 0x0c03, "USB controller", "usb"},
+{ 0x0c04, "Fibre channel controller", "fibre-channel"},
 { 0, NULL}
 };
 
@@ -1960,6 +1980,48 @@ static void pcibus_dev_print(Monitor *mon, DeviceState *dev, int indent)
 }
 }
 
+static char *pci_dev_fw_name(DeviceState *dev, char *buf, int len)
+{
+PCIDevice *d = (PCIDevice *)dev;
+const char *name = NULL;
+const pci_class_desc *desc =  pci_class_descriptions;
+int class = pci_get_word(d->config + PCI_CLASS_DEVICE);
+
+while (desc->desc &&
+  (class & ~desc->fw_ign_bits) !=
+  (desc->class & ~desc->fw_ign_bits)) {
+desc++;
+}
+
+if (desc->desc) {
+name = desc->fw_name;
+}
+
+if (name) {
+pstrcpy(buf, len, name);
+} else {
+snprintf(buf, len, "pci%04x,%04x",
+ pci_get_word(d->config + PCI_VENDOR_ID),
+ pci_get_word(d->config + PCI_DEVICE_ID));
+}
+
+return buf;
+}
+
+static char *pcibus_get_fw_dev_path(DeviceState *dev)
+{
+PCIDevice *d = (PCIDevice *)dev;
+char path[50], name[33];
+int off;
+
+off = snprintf(path, sizeof(path), "%s@%x",
+   pci_dev_fw_name(dev, name, sizeof name),
+   PCI_SLOT(d->devfn));
+if 

[PATCHv8 05/16] Store IDE bus id in IDEBus structure for easy access.

2010-12-08 Thread Gleb Natapov

Signed-off-by: Gleb Natapov g...@redhat.com
---
 hw/ide/cmd646.c   |4 ++--
 hw/ide/internal.h |3 ++-
 hw/ide/isa.c  |2 +-
 hw/ide/piix.c |4 ++--
 hw/ide/qdev.c |3 ++-
 hw/ide/via.c  |4 ++--
 6 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/hw/ide/cmd646.c b/hw/ide/cmd646.c
index dfe6091..ea5d2dc 100644
--- a/hw/ide/cmd646.c
+++ b/hw/ide/cmd646.c
@@ -253,8 +253,8 @@ static int pci_cmd646_ide_initfn(PCIDevice *dev)
 pci_conf[PCI_INTERRUPT_PIN] = 0x01; // interrupt on pin 1
 
 irq = qemu_allocate_irqs(cmd646_set_irq, d, 2);
-ide_bus_new(&d->bus[0], &d->dev.qdev);
-ide_bus_new(&d->bus[1], &d->dev.qdev);
+ide_bus_new(&d->bus[0], &d->dev.qdev, 0);
+ide_bus_new(&d->bus[1], &d->dev.qdev, 1);
 ide_init2(&d->bus[0], irq[0]);
 ide_init2(&d->bus[1], irq[1]);
 
diff --git a/hw/ide/internal.h b/hw/ide/internal.h
index 85f4a16..71af66f 100644
--- a/hw/ide/internal.h
+++ b/hw/ide/internal.h
@@ -449,6 +449,7 @@ struct IDEBus {
 IDEDevice *slave;
 BMDMAState *bmdma;
 IDEState ifs[2];
+int bus_id;
 uint8_t unit;
 uint8_t cmd;
 qemu_irq irq;
 void ide_init2_with_non_qdev_drives(IDEBus *bus, DriveInfo *hd0,
 void ide_init_ioport(IDEBus *bus, int iobase, int iobase2);
 
 /* hw/ide/qdev.c */
-void ide_bus_new(IDEBus *idebus, DeviceState *dev);
+void ide_bus_new(IDEBus *idebus, DeviceState *dev, int bus_id);
 IDEDevice *ide_create_drive(IDEBus *bus, int unit, DriveInfo *drive);
 
 #endif /* HW_IDE_INTERNAL_H */
diff --git a/hw/ide/isa.c b/hw/ide/isa.c
index 4206afd..8c59c5a 100644
--- a/hw/ide/isa.c
+++ b/hw/ide/isa.c
@@ -67,7 +67,7 @@ static int isa_ide_initfn(ISADevice *dev)
 {
 ISAIDEState *s = DO_UPCAST(ISAIDEState, dev, dev);
 
-ide_bus_new(&s->bus, &s->dev.qdev);
+ide_bus_new(&s->bus, &s->dev.qdev, 0);
 ide_init_ioport(&s->bus, s->iobase, s->iobase2);
 isa_init_irq(dev, &s->irq, s->isairq);
 isa_init_ioport_range(dev, s->iobase, 8);
diff --git a/hw/ide/piix.c b/hw/ide/piix.c
index e02b89a..1c0cb0c 100644
--- a/hw/ide/piix.c
+++ b/hw/ide/piix.c
@@ -125,8 +125,8 @@ static int pci_piix_ide_initfn(PCIIDEState *d)
 
 vmstate_register(&d->dev.qdev, 0, &vmstate_ide_pci, d);
 
-ide_bus_new(&d->bus[0], &d->dev.qdev);
-ide_bus_new(&d->bus[1], &d->dev.qdev);
+ide_bus_new(&d->bus[0], &d->dev.qdev, 0);
+ide_bus_new(&d->bus[1], &d->dev.qdev, 1);
 ide_init_ioport(&d->bus[0], 0x1f0, 0x3f6);
 ide_init_ioport(&d->bus[1], 0x170, 0x376);
 
diff --git a/hw/ide/qdev.c b/hw/ide/qdev.c
index 6d27b60..88ff657 100644
--- a/hw/ide/qdev.c
+++ b/hw/ide/qdev.c
@@ -29,9 +29,10 @@ static struct BusInfo ide_bus_info = {
 .size  = sizeof(IDEBus),
 };
 
-void ide_bus_new(IDEBus *idebus, DeviceState *dev)
+void ide_bus_new(IDEBus *idebus, DeviceState *dev, int bus_id)
 {
 qbus_create_inplace(&idebus->qbus, &ide_bus_info, dev, NULL);
+idebus->bus_id = bus_id;
 }
 
 static int ide_qdev_init(DeviceState *qdev, DeviceInfo *base)
diff --git a/hw/ide/via.c b/hw/ide/via.c
index 66be0c4..78857e8 100644
--- a/hw/ide/via.c
+++ b/hw/ide/via.c
@@ -154,8 +154,8 @@ static int vt82c686b_ide_initfn(PCIDevice *dev)
 
 vmstate_register(&dev->qdev, 0, &vmstate_ide_pci, d);
 
-ide_bus_new(&d->bus[0], &d->dev.qdev);
-ide_bus_new(&d->bus[1], &d->dev.qdev);
+ide_bus_new(&d->bus[0], &d->dev.qdev, 0);
+ide_bus_new(&d->bus[1], &d->dev.qdev, 1);
 ide_init2(&d->bus[0], isa_reserve_irq(14));
 ide_init2(&d->bus[1], isa_reserve_irq(15));
 ide_init_ioport(&d->bus[0], 0x1f0, 0x3f6);
-- 
1.7.2.3



[PATCHv8 06/16] Add get_fw_dev_path callback to IDE bus.

2010-12-08 Thread Gleb Natapov

Signed-off-by: Gleb Natapov g...@redhat.com
---
 hw/ide/qdev.c |   13 +
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/hw/ide/qdev.c b/hw/ide/qdev.c
index 88ff657..01a181b 100644
--- a/hw/ide/qdev.c
+++ b/hw/ide/qdev.c
@@ -24,9 +24,12 @@
 
 /* - */
 
+static char *idebus_get_fw_dev_path(DeviceState *dev);
+
 static struct BusInfo ide_bus_info = {
 .name  = IDE,
 .size  = sizeof(IDEBus),
+.get_fw_dev_path = idebus_get_fw_dev_path,
 };
 
 void ide_bus_new(IDEBus *idebus, DeviceState *dev, int bus_id)
@@ -35,6 +38,16 @@ void ide_bus_new(IDEBus *idebus, DeviceState *dev, int 
bus_id)
 idebus-bus_id = bus_id;
 }
 
+static char *idebus_get_fw_dev_path(DeviceState *dev)
+{
+char path[30];
+
+snprintf(path, sizeof(path), "%s@%d", qdev_fw_name(dev),
+ ((IDEBus*)dev->parent_bus)->bus_id);
+
+return strdup(path);
+}
+
 static int ide_qdev_init(DeviceState *qdev, DeviceInfo *base)
 {
 IDEDevice *dev = DO_UPCAST(IDEDevice, qdev, qdev);
-- 
1.7.2.3



[PATCHv8 14/16] Add bootindex for option roms.

2010-12-08 Thread Gleb Natapov
Extend the -option-rom command line option to take an additional ",bootindex=" parameter.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 hw/loader.c|   16 +++-
 hw/loader.h|8 
 hw/multiboot.c |3 ++-
 hw/ne2000.c|2 +-
 hw/nseries.c   |4 ++--
 hw/palm.c  |6 +++---
 hw/pc.c|7 ---
 hw/pci.c   |2 +-
 hw/pcnet-pci.c |2 +-
 qemu-config.c  |   17 +
 sysemu.h   |6 +-
 vl.c   |   11 +--
 12 files changed, 60 insertions(+), 24 deletions(-)

diff --git a/hw/loader.c b/hw/loader.c
index 1e98326..eb198f6 100644
--- a/hw/loader.c
+++ b/hw/loader.c
@@ -107,7 +107,7 @@ int load_image_targphys(const char *filename,
 
 size = get_image_size(filename);
 if (size > 0)
-rom_add_file_fixed(filename, addr);
+rom_add_file_fixed(filename, addr, -1);
 return size;
 }
 
@@ -557,10 +557,11 @@ static void rom_insert(Rom *rom)
 }
 
 int rom_add_file(const char *file, const char *fw_dir,
- target_phys_addr_t addr)
+ target_phys_addr_t addr, int32_t bootindex)
 {
 Rom *rom;
 int rc, fd = -1;
+char devpath[100];
 
 rom = qemu_mallocz(sizeof(*rom));
 rom->name = qemu_strdup(file);
@@ -605,7 +606,12 @@ int rom_add_file(const char *file, const char *fw_dir,
 snprintf(fw_file_name, sizeof(fw_file_name), "%s/%s", rom->fw_dir,
  basename);
 fw_cfg_add_file(fw_cfg, fw_file_name, rom->data, rom->romsize);
+snprintf(devpath, sizeof(devpath), "/rom@%s", fw_file_name);
+} else {
+snprintf(devpath, sizeof(devpath), "/rom@" TARGET_FMT_plx, addr);
 }
+
+add_boot_device_path(bootindex, NULL, devpath);
 return 0;
 
 err:
@@ -635,12 +641,12 @@ int rom_add_blob(const char *name, const void *blob, size_t len,
 
 int rom_add_vga(const char *file)
 {
-return rom_add_file(file, "vgaroms", 0);
+return rom_add_file(file, "vgaroms", 0, -1);
 }
 
-int rom_add_option(const char *file)
+int rom_add_option(const char *file, int32_t bootindex)
 {
-return rom_add_file(file, "genroms", 0);
+return rom_add_file(file, "genroms", 0, bootindex);
 }
 
 static void rom_reset(void *unused)
diff --git a/hw/loader.h b/hw/loader.h
index 1f82fc5..fc6bdff 100644
--- a/hw/loader.h
+++ b/hw/loader.h
@@ -22,7 +22,7 @@ void pstrcpy_targphys(const char *name,
 
 
 int rom_add_file(const char *file, const char *fw_dir,
- target_phys_addr_t addr);
+ target_phys_addr_t addr, int32_t bootindex);
 int rom_add_blob(const char *name, const void *blob, size_t len,
  target_phys_addr_t addr);
 int rom_load_all(void);
@@ -31,8 +31,8 @@ int rom_copy(uint8_t *dest, target_phys_addr_t addr, size_t size);
 void *rom_ptr(target_phys_addr_t addr);
 void do_info_roms(Monitor *mon);
 
-#define rom_add_file_fixed(_f, _a)  \
-rom_add_file(_f, NULL, _a)
+#define rom_add_file_fixed(_f, _a, _i)  \
+rom_add_file(_f, NULL, _a, _i)
 #define rom_add_blob_fixed(_f, _b, _l, _a)  \
 rom_add_blob(_f, _b, _l, _a)
 
@@ -43,6 +43,6 @@ void do_info_roms(Monitor *mon);
 #define PC_ROM_SIZE(PC_ROM_MAX - PC_ROM_MIN_VGA)
 
 int rom_add_vga(const char *file);
-int rom_add_option(const char *file);
+int rom_add_option(const char *file, int32_t bootindex);
 
 #endif
diff --git a/hw/multiboot.c b/hw/multiboot.c
index e710bbb..7cc3055 100644
--- a/hw/multiboot.c
+++ b/hw/multiboot.c
@@ -331,7 +331,8 @@ int load_multiboot(void *fw_cfg,
 fw_cfg_add_bytes(fw_cfg, FW_CFG_INITRD_DATA, mb_bootinfo_data,
  sizeof(bootinfo));
 
-option_rom[nb_option_roms] = "multiboot.bin";
+option_rom[nb_option_roms].name = "multiboot.bin";
+option_rom[nb_option_roms].bootindex = 0;
 nb_option_roms++;
 
 return 1; /* yes, we are multiboot */
diff --git a/hw/ne2000.c b/hw/ne2000.c
index a030106..5966359 100644
--- a/hw/ne2000.c
+++ b/hw/ne2000.c
@@ -742,7 +742,7 @@ static int pci_ne2000_init(PCIDevice *pci_dev)
 if (!pci_dev-qdev.hotplugged) {
 static int loaded = 0;
 if (!loaded) {
-rom_add_option("pxe-ne2k_pci.bin");
+rom_add_option("pxe-ne2k_pci.bin", -1);
 loaded = 1;
 }
 }
diff --git a/hw/nseries.c b/hw/nseries.c
index 04a028d..2f6f473 100644
--- a/hw/nseries.c
+++ b/hw/nseries.c
@@ -1326,7 +1326,7 @@ static void n8x0_init(ram_addr_t ram_size, const char *boot_device,
 qemu_register_reset(n8x0_boot_init, s);
 }
 
-if (option_rom[0] && (boot_device[0] == 'n' || !kernel_filename)) {
+if (option_rom[0].name && (boot_device[0] == 'n' || !kernel_filename)) {
 int rom_size;
 uint8_t nolo_tags[0x1];
 /* No, wait, better start at the ROM.  */
@@ -1341,7 +1341,7 @@ static void n8x0_init(ram_addr_t ram_size, const char *boot_device,
  *
  * The code above is for loading the `zImage' file from Nokia
  * images.  */
-rom_size = 

[PATCHv8 00/16] boot order specification

2010-12-08 Thread Gleb Natapov
Forgot to save a couple of buffers before sending version 7 :(

Anthony, Blue, can this be applied now?

Gleb Natapov (16):
  Introduce fw_name field to DeviceInfo structure.
  Introduce new BusInfo callback get_fw_dev_path.
  Keep track of ISA ports ISA device is using in qdev.
  Add get_fw_dev_path callback to ISA bus in qdev.
  Store IDE bus id in IDEBus structure for easy access.
  Add get_fw_dev_path callback to IDE bus.
  Add get_fw_dev_path callback for system bus.
  Add get_fw_dev_path callback for pci bus.
  Record which USBDevice a USBPort belongs to.
  Add get_fw_dev_path callback for usb bus.
  Add get_fw_dev_path callback to scsi bus.
  Add bootindex parameter to net/block/fd device
  Change fw_cfg_add_file() to get full file path as a parameter.
  Add bootindex for option roms.
  Add notifier that will be called when machine is fully created.
  Pass boot device list to firmware.

 block_int.h   |4 +-
 hw/cs4231a.c  |1 +
 hw/e1000.c|4 ++
 hw/eepro100.c |3 +
 hw/fdc.c  |   12 ++
 hw/fw_cfg.c   |   30 --
 hw/fw_cfg.h   |4 +-
 hw/gus.c  |4 ++
 hw/ide/cmd646.c   |4 +-
 hw/ide/internal.h |3 +-
 hw/ide/isa.c  |5 ++-
 hw/ide/piix.c |4 +-
 hw/ide/qdev.c |   22 ++-
 hw/ide/via.c  |4 +-
 hw/isa-bus.c  |   42 +++
 hw/isa.h  |4 ++
 hw/lance.c|1 +
 hw/loader.c   |   32 ---
 hw/loader.h   |8 ++--
 hw/m48t59.c   |1 +
 hw/mc146818rtc.c  |1 +
 hw/multiboot.c|3 +-
 hw/ne2000-isa.c   |3 +
 hw/ne2000.c   |5 ++-
 hw/nseries.c  |4 +-
 hw/palm.c |6 +-
 hw/parallel.c |5 ++
 hw/pc.c   |7 ++-
 hw/pci.c  |  110 ---
 hw/pci_host.c |2 +
 hw/pckbd.c|3 +
 hw/pcnet-pci.c|2 +-
 hw/pcnet.c|4 ++
 hw/piix_pci.c |1 +
 hw/qdev.c |   32 +++
 hw/qdev.h |   14 ++
 hw/rtl8139.c  |4 ++
 hw/sb16.c |4 ++
 hw/scsi-bus.c |   23 +++
 hw/scsi-disk.c|2 +
 hw/serial.c   |1 +
 hw/sysbus.c   |   30 ++
 hw/sysbus.h   |4 ++
 hw/usb-bus.c  |   45 -
 hw/usb-hub.c  |3 +-
 hw/usb-musb.c |2 +-
 hw/usb-net.c  |3 +
 hw/usb-ohci.c |2 +-
 hw/usb-uhci.c |2 +-
 hw/usb.h  |3 +-
 hw/virtio-blk.c   |2 +
 hw/virtio-net.c   |2 +
 hw/virtio-pci.c   |1 +
 net.h |4 +-
 qemu-config.c |   17 
 sysemu.h  |   11 +-
 vl.c  |  114 -
 57 files changed, 593 insertions(+), 80 deletions(-)

-- 
1.7.2.3



[PATCHv8 15/16] Add notifier that will be called when machine is fully created.

2010-12-08 Thread Gleb Natapov
Actions that depend on a fully initialized device model should register
with this notifier chain.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 sysemu.h |2 ++
 vl.c |   15 +++
 2 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/sysemu.h b/sysemu.h
index 48f8eee..c42f33a 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -60,6 +60,8 @@ void qemu_system_reset(void);
 void qemu_add_exit_notifier(Notifier *notify);
 void qemu_remove_exit_notifier(Notifier *notify);
 
+void qemu_add_machine_init_done_notifier(Notifier *notify);
+
 void do_savevm(Monitor *mon, const QDict *qdict);
 int load_vmstate(const char *name);
 void do_delvm(Monitor *mon, const QDict *qdict);
diff --git a/vl.c b/vl.c
index 844d6a5..0d20d26 100644
--- a/vl.c
+++ b/vl.c
@@ -254,6 +254,9 @@ static void *boot_set_opaque;
 static NotifierList exit_notifiers =
 NOTIFIER_LIST_INITIALIZER(exit_notifiers);
 
+static NotifierList machine_init_done_notifiers =
+NOTIFIER_LIST_INITIALIZER(machine_init_done_notifiers);
+
 int kvm_allowed = 0;
 uint32_t xen_domid;
 enum xen_mode xen_mode = XEN_EMULATE;
@@ -1782,6 +1785,16 @@ static void qemu_run_exit_notifiers(void)
 notifier_list_notify(&exit_notifiers);
 }
 
+void qemu_add_machine_init_done_notifier(Notifier *notify)
+{
+notifier_list_add(&machine_init_done_notifiers, notify);
+}
+
+static void qemu_run_machine_init_done_notifiers(void)
+{
+notifier_list_notify(&machine_init_done_notifiers);
+}
+
 static const QEMUOption *lookup_opt(int argc, char **argv,
 const char **poptarg, int *poptind)
 {
@@ -3028,6 +3041,8 @@ int main(int argc, char **argv, char **envp)
 }
 
 qemu_register_reset((void *)qbus_reset_all, sysbus_get_default());
+qemu_run_machine_init_done_notifiers();
+
 qemu_system_reset();
 if (loadvm) {
 if (load_vmstate(loadvm)  0) {
-- 
1.7.2.3



Re: [PATCH] kvm/x86: enlarge number of possible CPUID leaves

2010-12-08 Thread Avi Kivity

On 12/08/2010 01:13 PM, Andre Przywara wrote:

Avi, Marcelo,

can you please commit this simple fix? (turning 40 to 80?)
Without it QEMU crashes reliably on our new CPUs (they return 46 
leaves) and causes pain in our testing, because we have to manually 
apply this patch on each tree.


Sorry about that, applied now.

--
error compiling committee.c: too many arguments to function



Re: [PATCH] kvm/x86: enlarge number of possible CPUID leaves

2010-12-08 Thread Avi Kivity

On 12/01/2010 02:55 PM, Andre Przywara wrote:

Avi Kivity wrote:

On 12/01/2010 01:17 PM, Andre Przywara wrote:

Currently the number of CPUID leaves KVM handles is limited to 40.
My desktop machine (AthlonII) already has 35 and future CPUs will
expand this well beyond the limit. Extend the limit to 80 to make
room for future processors.

Signed-off-by: Andre Przywara andre.przyw...@amd.com
---
  arch/x86/include/asm/kvm_host.h |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

Hi,
I found that either KVM or QEMU (possibly both) are broken with respect
to handling more CPUID entries than the limit dictates. KVM will
return -E2BIG, which is the same error as if the user hadn't provided
enough space to hold all entries, so QEMU keeps enlarging the
allocated memory until it runs into an out-of-memory condition.
I have tried to fix this by teaching KVM how to deal with a capped
number of entries (there are some bugs in the current code), but this
limits the number of CPUID entries KVM handles, which will surely
cut off the last appended PV leaves.
A proper fix would be to make this allocation dynamic. Is this a
feasible way or will this lead to issues or side-effects?



Well, there has to be some limit, or userspace can allocate unbounded 
kernel memory.
But this limit should not be a compile-time constant, but a runtime 
one. The needed size depends on the host CPU (plus the KVM PV leaves) 
and thus could be determined once for all VMs and vCPUs at module 
load-time. But then we cannot use the static array allocation we 
currently have in struct kvm_vcpu_arch:

struct kvm_cpuid_entry2 cpuid_entries[KVM_MAX_CPUID_ENTRIES];
So we would use a kind of dynamic allocation bounded by the host CPU's 
needs. But for the code it does not make much difference compared to 
truly dynamic allocation.


Also we could implement kvm_dev_ioctl_get_supported_cpuid without the 
vmalloc, if we don't care about some dozens of copy_to_user() calls in 
this function. Then we would not need this limit in 
GET_SUPPORTED_CPUID at all, but it will strike us again at 
KVM_SET_CPUID[2], where we may not fulfill the promises we gave earlier.

Having said this, what about that:
kvm_dev_ioctl_get_supported_cpuid is invariant to the VM or vCPU (as 
it is used by a system ioctl), so it could be run once at 
initialization, which would limit the ioctl implementation to a plain 
bounded copy.
Would you want such a patch (removing the vmalloc and maybe even the 
limit)?


Making GET_SUPPORTED_CPUID data static would be an improvement, yes.

--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH 06/21] vl: add a tmp pointer so that a handler can delete the entry to which it belongs.

2010-12-08 Thread Anthony Liguori

On 12/08/2010 02:11 AM, Yoshiaki Tamura wrote:

2010/12/8 Isaku Yamahata yamah...@valinux.co.jp:

QLIST_FOREACH_SAFE?

Thanks! So, it should be,

QLIST_FOREACH_SAFE(e, &vm_change_state_head, entries, ne) {
 e->cb(e->opaque, running, reason);
}

I'll put it in the next spin.
   


This is still brittle though because it only allows the current handler 
to delete itself.  A better approach is to borrow the technique we use 
with file descriptors (using a deleted flag) as that is robust against 
deletion of any elements in a handler.


Regards,

Anthony Liguori


Yoshi

   

On Thu, Nov 25, 2010 at 03:06:45PM +0900, Yoshiaki Tamura wrote:
 

By copying the next entry to a tmp pointer,
qemu_del_vm_change_state_handler() can be called in the handler.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
  vl.c |5 +++--
  1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/vl.c b/vl.c
index 805e11f..6b6aec0 100644
--- a/vl.c
+++ b/vl.c
@@ -1073,11 +1073,12 @@ void 
qemu_del_vm_change_state_handler(VMChangeStateEntry *e)

  void vm_state_notify(int running, int reason)
  {
-VMChangeStateEntry *e;
+VMChangeStateEntry *e, *ne;

  trace_vm_state_notify(running, reason);

-for (e = vm_change_state_head.lh_first; e; e = e->entries.le_next) {
+for (e = vm_change_state_head.lh_first; e; e = ne) {
+ne = e->entries.le_next;
  e->cb(e->opaque, running, reason);
  }
  }
--
1.7.1.2


   

--
yamahata

 




[PATCH] KVM: Fix build error on s390 due to missing tlbs_dirty

2010-12-08 Thread Avi Kivity
Make it available for all archs.

Signed-off-by: Avi Kivity a...@redhat.com
---
 include/linux/kvm_host.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index bd0da8f..b5021db 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -256,8 +256,8 @@ struct kvm {
struct mmu_notifier mmu_notifier;
unsigned long mmu_notifier_seq;
long mmu_notifier_count;
-   long tlbs_dirty;
 #endif
+   long tlbs_dirty;
 };
 
 /* The guest did something we don't support. */
-- 
1.7.1



[PATCH 0/28] nVMX: Nested VMX, v7

2010-12-08 Thread Nadav Har'El
Hi,

This is the seventh iteration of the nested VMX patch set. It fixes a bunch
of bugs in the previous iteration, and in particular it now works correctly
with EPT in the L0 hypervisor, so ept=0 no longer needs to be specified.

This new set of patches should apply to the current KVM trunk (I checked with
66fc6be8d2b04153b753182610f919faf9c705bc). In particular it uses the recently
added is_guest_mode() function (common to both nested svm and vmx) instead of
inventing our own flag.

About nested VMX:
-

The following 28 patches implement nested VMX support. This feature enables a
guest to use the VMX APIs in order to run its own nested guests. In other
words, it allows running hypervisors (that use VMX) under KVM.
Multiple guest hypervisors can be run concurrently, and each of those can
in turn host multiple guests.

The theory behind this work, our implementation, and its performance
characteristics were presented in OSDI 2010 (the USENIX Symposium on
Operating Systems Design and Implementation). Our paper was titled
The Turtles Project: Design and Implementation of Nested Virtualization,
and was awarded Jay Lepreau Best Paper. The paper is available online, at:

http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf

This patch set does not include all the features described in the paper.
In particular, this patch set is missing nested EPT (shadow page tables are
used in L1, while L0 can use shadow page tables or EPT). It is also missing
some features required to run VMware Server as a guest. These missing features
will be sent as follow-on patches.

Running nested VMX:
--

The current patches have a number of requirements, which will be relaxed in
follow-on patches:

1. This version was only tested with KVM (64-bit) as a guest hypervisor, and
   Linux as a nested guest.

2. SMP is supported in the code, but is unfortunately buggy in this version
   and often leads to hangs. Use the "nosmp" option in the L0 (topmost)
   kernel to avoid this bug (and to reduce your performance ;-)).

3. No modifications are required to user space (qemu). However, qemu does not
   currently list VMX as a CPU feature in its emulated CPUs (even when they
   are named after CPUs that do normally have VMX). Therefore, the -cpu host
   option should be given to qemu, to tell it to support CPU features which
   exist in the host - and in particular VMX.
   This requirement can be made unnecessary by a trivial patch to qemu (which
   I will submit in the future).

4. The nested VMX feature is currently disabled by default. It must be
   explicitly enabled with the nested=1 option to the kvm-intel module.

5. Nested VPID is not properly supported in this version. You must give the
   vpid=0 module options to kvm-intel to turn this feature off.
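
Putting requirements 2-5 together, a session could be started roughly like this (a sketch only — the image name and memory size are placeholders for your own setup):

```shell
# Requirements 4 and 5: enable nested VMX, disable VPID.
modprobe kvm-intel nested=1 vpid=0

# Requirement 3: pass -cpu host so the guest sees the VMX feature bit.
qemu-system-x86_64 -enable-kvm -cpu host -m 2048 l1-with-kvm.img

# Requirement 2: the L0 (host) kernel itself should have been booted
# with the "nosmp" kernel command-line parameter.
```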


Patch statistics:
-

 Documentation/kvm/nested-vmx.txt |  237 ++
 arch/x86/include/asm/kvm_host.h  |2 
 arch/x86/include/asm/vmx.h   |   31 
 arch/x86/kvm/svm.c   |6 
 arch/x86/kvm/vmx.c   | 2416 -
 arch/x86/kvm/x86.c   |   16 
 arch/x86/kvm/x86.h   |6 
 7 files changed, 2676 insertions(+), 38 deletions(-)

--
Nadav Har'El
IBM Haifa Research Lab


[PATCH 01/28] nVMX: Add nested module option to vmx.c

2010-12-08 Thread Nadav Har'El
This patch adds a module option nested to vmx.c, which controls whether
the guest can use VMX instructions, i.e., whether we allow nested
virtualization. A similar, but separate, option already exists for the
SVM module.

This option currently defaults to 0, meaning that nested VMX must be
explicitly enabled by giving nested=1. When nested VMX matures, the default
should probably be changed to enable nested VMX by default - just like
nested SVM is currently enabled by default.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |8 
 1 file changed, 8 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2010-12-08 18:56:48.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2010-12-08 18:56:48.0 +0200
@@ -69,6 +69,14 @@ module_param(emulate_invalid_guest_state
 static int __read_mostly vmm_exclusive = 1;
 module_param(vmm_exclusive, bool, S_IRUGO);
 
+/*
+ * If nested=1, nested virtualization is supported, i.e., the guest may use
+ * VMX and be a hypervisor for its own guests. If nested=0, the guest may not
+ * use VMX instructions.
+ */
+static int nested = 0;
+module_param(nested, int, S_IRUGO);
+
 #define KVM_GUEST_CR0_MASK_UNRESTRICTED_GUEST  \
(X86_CR0_WP | X86_CR0_NE | X86_CR0_NW | X86_CR0_CD)
 #define KVM_GUEST_CR0_MASK \


[PATCH 02/28] nVMX: Add VMX and SVM to list of supported cpuid features

2010-12-08 Thread Nadav Har'El
If the nested module option is enabled, add the VMX CPU feature to the
list of CPU features KVM advertises with the KVM_GET_SUPPORTED_CPUID ioctl.

Qemu uses this ioctl, and intersects KVM's list with its own list of desired
cpu features (depending on the -cpu option given to qemu) to determine the
final list of features presented to the guest.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |2 ++
 1 file changed, 2 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2010-12-08 18:56:48.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2010-12-08 18:56:48.0 +0200
@@ -4284,6 +4284,8 @@ static void vmx_cpuid_update(struct kvm_
 
 static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
 {
+   if (func == 1 && nested)
+   entry->ecx |= bit(X86_FEATURE_VMX);
 }
 
 static struct kvm_x86_ops vmx_x86_ops = {


[PATCH 03/28] nVMX: Implement VMXON and VMXOFF

2010-12-08 Thread Nadav Har'El
This patch allows a guest to use the VMXON and VMXOFF instructions, and
emulates them accordingly. Basically this amounts to checking some
prerequisites, and then remembering whether the guest has enabled or disabled
VMX operation.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |  102 ++-
 1 file changed, 100 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/vmx.c  2010-12-08 18:56:48.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2010-12-08 18:56:48.0 +0200
@@ -127,6 +127,17 @@ struct shared_msr_entry {
u64 mask;
 };
 
+/*
+ * The nested_vmx structure is part of vcpu_vmx, and holds information we need
+ * for correct emulation of VMX (i.e., nested VMX) on this vcpu. For example,
+ * the current VMCS set by L1, a list of the VMCSs used to run the active
+ * L2 guests on the hardware, and more.
+ */
+struct nested_vmx {
+   /* Has the level1 guest done vmxon? */
+   bool vmxon;
+};
+
 struct vcpu_vmx {
struct kvm_vcpu   vcpu;
struct list_head  local_vcpus_link;
@@ -174,6 +185,9 @@ struct vcpu_vmx {
u32 exit_reason;
 
bool rdtscp_enabled;
+
+   /* Support for a guest hypervisor (nested VMX) */
+   struct nested_vmx nested;
 };
 
 static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu)
@@ -3653,6 +3667,90 @@ static int handle_invalid_op(struct kvm_
 }
 
 /*
+ * Emulate the VMXON instruction.
+ * Currently, we just remember that VMX is active, and do not save or even
+ * inspect the argument to VMXON (the so-called VMXON pointer) because we
+ * do not currently need to store anything in that guest-allocated memory
+ * region. Consequently, VMCLEAR and VMPTRLD also do not verify that their
+ * argument is different from the VMXON pointer (which the spec says they do).
+ */
+static int handle_vmon(struct kvm_vcpu *vcpu)
+{
+   struct kvm_segment cs;
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+   /* The Intel VMX Instruction Reference lists a bunch of bits that
+* are prerequisite to running VMXON, most notably CR4.VMXE must be
+* set to 1. Otherwise, we should fail with #UD. We test these now:
+*/
+   if (!nested ||
+   !kvm_read_cr4_bits(vcpu, X86_CR4_VMXE) ||
+   !kvm_read_cr0_bits(vcpu, X86_CR0_PE) ||
+   (vmx_get_rflags(vcpu) & X86_EFLAGS_VM)) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+   }
+
+   vmx_get_segment(vcpu, cs, VCPU_SREG_CS);
+   if (is_long_mode(vcpu) && !cs.l) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+   }
+
+   if (vmx_get_cpl(vcpu)) {
+   kvm_inject_gp(vcpu, 0);
+   return 1;
+   }
+
+   vmx->nested.vmxon = true;
+
+   skip_emulated_instruction(vcpu);
+   return 1;
+}
+
+/*
+ * Intel's VMX Instruction Reference specifies a common set of prerequisites
+ * for running VMX instructions (except VMXON, whose prerequisites are
+ * slightly different). It also specifies what exception to inject otherwise.
+ */
+static int nested_vmx_check_permission(struct kvm_vcpu *vcpu)
+{
+   struct kvm_segment cs;
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+   if (!vmx->nested.vmxon) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 0;
+   }
+
+   vmx_get_segment(vcpu, cs, VCPU_SREG_CS);
+   if ((vmx_get_rflags(vcpu) & X86_EFLAGS_VM) ||
+   (is_long_mode(vcpu) && !cs.l)) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 0;
+   }
+
+   if (vmx_get_cpl(vcpu)) {
+   kvm_inject_gp(vcpu, 0);
+   return 0;
+   }
+
+   return 1;
+}
+
+/* Emulate the VMXOFF instruction */
+static int handle_vmoff(struct kvm_vcpu *vcpu)
+{
+   if (!nested_vmx_check_permission(vcpu))
+   return 1;
+
+   to_vmx(vcpu)->nested.vmxon = false;
+
+   skip_emulated_instruction(vcpu);
+   return 1;
+}
+
+/*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
  * to be done to userspace and return 0.
@@ -3680,8 +3778,8 @@ static int (*kvm_vmx_exit_handlers[])(st
[EXIT_REASON_VMREAD]  = handle_vmx_insn,
[EXIT_REASON_VMRESUME]= handle_vmx_insn,
[EXIT_REASON_VMWRITE] = handle_vmx_insn,
-   [EXIT_REASON_VMOFF]   = handle_vmx_insn,
-   [EXIT_REASON_VMON]= handle_vmx_insn,
+   [EXIT_REASON_VMOFF]   = handle_vmoff,
+   [EXIT_REASON_VMON]= handle_vmon,
[EXIT_REASON_TPR_BELOW_THRESHOLD] = handle_tpr_below_threshold,
[EXIT_REASON_APIC_ACCESS] = handle_apic_access,
[EXIT_REASON_WBINVD]  = handle_wbinvd,
--
To unsubscribe from this list: send 

[PATCH 04/28] nVMX: Allow setting the VMXE bit in CR4

2010-12-08 Thread Nadav Har'El
This patch allows the guest to enable the VMXE bit in CR4, which is a
prerequisite to running VMXON.

Whether to allow setting the VMXE bit now depends on the architecture (svm
or vmx), so the check has moved to kvm_x86_ops->set_cr4(). This function
now returns an int: if kvm_x86_ops->set_cr4() returns 1, __kvm_set_cr4()
will also return 1, and this will cause kvm_set_cr4() to throw a #GP.

Turning on the VMXE bit is allowed only when the nested module option is on,
and turning it off is forbidden after a vmxon.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/include/asm/kvm_host.h |2 +-
 arch/x86/kvm/svm.c  |6 +-
 arch/x86/kvm/vmx.c  |   13 +++--
 arch/x86/kvm/x86.c  |4 +---
 4 files changed, 18 insertions(+), 7 deletions(-)

--- .before/arch/x86/include/asm/kvm_host.h 2010-12-08 18:56:49.0 
+0200
+++ .after/arch/x86/include/asm/kvm_host.h  2010-12-08 18:56:49.0 
+0200
@@ -535,7 +535,7 @@ struct kvm_x86_ops {
void (*decache_cr4_guest_bits)(struct kvm_vcpu *vcpu);
void (*set_cr0)(struct kvm_vcpu *vcpu, unsigned long cr0);
void (*set_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3);
-   void (*set_cr4)(struct kvm_vcpu *vcpu, unsigned long cr4);
+   int (*set_cr4)(struct kvm_vcpu *vcpu, unsigned long cr4);
void (*set_efer)(struct kvm_vcpu *vcpu, u64 efer);
void (*get_idt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
void (*set_idt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
--- .before/arch/x86/kvm/svm.c  2010-12-08 18:56:49.0 +0200
+++ .after/arch/x86/kvm/svm.c   2010-12-08 18:56:49.0 +0200
@@ -1370,11 +1370,14 @@ static void svm_set_cr0(struct kvm_vcpu 
update_cr0_intercept(svm);
 }
 
-static void svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+static int svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
unsigned long host_cr4_mce = read_cr4() & X86_CR4_MCE;
unsigned long old_cr4 = to_svm(vcpu)->vmcb->save.cr4;
 
+   if (cr4 & X86_CR4_VMXE)
+   return 1;
+
if (npt_enabled && ((old_cr4 ^ cr4) & X86_CR4_PGE))
force_new_asid(vcpu);
 
@@ -1383,6 +1386,7 @@ static void svm_set_cr4(struct kvm_vcpu 
cr4 |= X86_CR4_PAE;
cr4 |= host_cr4_mce;
to_svm(vcpu)->vmcb->save.cr4 = cr4;
+   return 0;
 }
 
 static void svm_set_segment(struct kvm_vcpu *vcpu,
--- .before/arch/x86/kvm/x86.c  2010-12-08 18:56:49.0 +0200
+++ .after/arch/x86/kvm/x86.c   2010-12-08 18:56:49.0 +0200
@@ -610,11 +610,9 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, u
!load_pdptrs(vcpu, vcpu->arch.walk_mmu, vcpu->arch.cr3))
return 1;
 
-   if (cr4 & X86_CR4_VMXE)
+   if (kvm_x86_ops->set_cr4(vcpu, cr4))
return 1;
 
-   kvm_x86_ops->set_cr4(vcpu, cr4);
-
if ((cr4 ^ old_cr4) & pdptr_bits)
kvm_mmu_reset_context(vcpu);
 
--- .before/arch/x86/kvm/vmx.c  2010-12-08 18:56:49.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2010-12-08 18:56:49.0 +0200
@@ -1876,7 +1876,7 @@ static void ept_save_pdptrs(struct kvm_v
  (unsigned long *)vcpu->arch.regs_dirty);
 }
 
-static void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
+static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
 
 static void ept_update_paging_mode_cr0(unsigned long *hw_cr0,
unsigned long cr0,
@@ -1971,11 +1971,19 @@ static void vmx_set_cr3(struct kvm_vcpu 
vmcs_writel(GUEST_CR3, guest_cr3);
 }
 
-static void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
unsigned long hw_cr4 = cr4 | (to_vmx(vcpu)->rmode.vm86_active ?
KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON);
 
+   if (cr4 & X86_CR4_VMXE) {
+   if (!nested)
+   return 1;
+   } else {
+   if (nested && to_vmx(vcpu)->nested.vmxon)
+   return 1;
+   }
+
vcpu->arch.cr4 = cr4;
if (enable_ept) {
if (!is_paging(vcpu)) {
@@ -1988,6 +1996,7 @@ static void vmx_set_cr4(struct kvm_vcpu 
 
vmcs_writel(CR4_READ_SHADOW, cr4);
vmcs_writel(GUEST_CR4, hw_cr4);
+   return 0;
 }
 
 static u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)


[PATCH 05/28] nVMX: Introduce vmcs12: a VMCS structure for L1

2010-12-08 Thread Nadav Har'El
An implementation of VMX needs to define a VMCS structure. This structure
is kept in guest memory, but is opaque to the guest (who can only read or
write it with VMX instructions).

This patch starts to define the VMCS structure which our nested VMX
implementation will present to L1. We call it vmcs12, as it is the VMCS
that L1 keeps for its L2 guests. We will add more content to this structure
in later patches.

This patch also adds the notion (as required by the VMX spec) of L1's current
VMCS, and finally includes utility functions for mapping the guest-allocated
VMCSs in host memory.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |   64 +++
 1 file changed, 64 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2010-12-08 18:56:49.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2010-12-08 18:56:49.0 +0200
@@ -128,6 +128,34 @@ struct shared_msr_entry {
 };
 
 /*
+ * struct vmcs12 describes the state that our guest hypervisor (L1) keeps for a
+ * single nested guest (L2), hence the name vmcs12. Any VMX implementation has
+ * a VMCS structure, and vmcs12 is our emulated VMX's VMCS. This structure is
+ * stored in guest memory specified by VMPTRLD, but is opaque to the guest,
+ * which must access it using VMREAD/VMWRITE/VMCLEAR instructions. More
+ * than one of these structures may exist, if L1 runs multiple L2 guests.
+ * nested_vmx_run() will use the data here to build a vmcs02: a VMCS for the
+ * underlying hardware which will be used to run L2.
+ * This structure is packed in order to preserve the binary content after live
+ * migration. If there are changes in the content or layout, VMCS12_REVISION
+ * must be changed.
+ */
+struct __packed vmcs12 {
+   /* According to the Intel spec, a VMCS region must start with the
+* following two fields. Then follow implementation-specific data.
+*/
+   u32 revision_id;
+   u32 abort;
+};
+
+/*
+ * VMCS12_REVISION is an arbitrary id that should be changed if the content or
+ * layout of struct vmcs12 is changed. MSR_IA32_VMX_BASIC returns this id, and
+ * VMPTRLD verifies that the VMCS region that L1 is loading contains this id.
+ */
+#define VMCS12_REVISION 0x11e57ed0
+
+/*
  * The nested_vmx structure is part of vcpu_vmx, and holds information we need
  * for correct emulation of VMX (i.e., nested VMX) on this vcpu. For example,
  * the current VMCS set by L1, a list of the VMCSs used to run the active
@@ -136,6 +164,12 @@ struct shared_msr_entry {
 struct nested_vmx {
/* Has the level1 guest done vmxon? */
bool vmxon;
+
+   /* The guest-physical address of the current VMCS L1 keeps for L2 */
+   gpa_t current_vmptr;
+   /* The host-usable pointer to the above */
+   struct page *current_vmcs12_page;
+   struct vmcs12 *current_vmcs12;
 };
 
 struct vcpu_vmx {
@@ -195,6 +229,28 @@ static inline struct vcpu_vmx *to_vmx(st
return container_of(vcpu, struct vcpu_vmx, vcpu);
 }
 
+static struct page *nested_get_page(struct kvm_vcpu *vcpu, gpa_t addr)
+{
+   struct page *page = gfn_to_page(vcpu->kvm, addr >> PAGE_SHIFT);
+   if (is_error_page(page)) {
+   kvm_release_page_clean(page);
+   return NULL;
+   }
+   return page;
+}
+
+static void nested_release_page(struct page *page)
+{
+   kunmap(page);
+   kvm_release_page_dirty(page);
+}
+
+static void nested_release_page_clean(struct page *page)
+{
+   kunmap(page);
+   kvm_release_page_clean(page);
+}
+
 static int init_rmode(struct kvm *kvm);
 static u64 construct_eptp(unsigned long root_hpa);
 static void kvm_cpu_vmxon(u64 addr);
@@ -3755,6 +3811,9 @@ static int handle_vmoff(struct kvm_vcpu 
 
to_vmx(vcpu)->nested.vmxon = false;
 
+   if (to_vmx(vcpu)->nested.current_vmptr != -1ull)
+   nested_release_page(to_vmx(vcpu)->nested.current_vmcs12_page);
+
skip_emulated_instruction(vcpu);
return 1;
 }
@@ -4183,6 +4242,8 @@ static void vmx_free_vcpu(struct kvm_vcp
struct vcpu_vmx *vmx = to_vmx(vcpu);
 
free_vpid(vmx);
+   if (vmx->nested.vmxon && to_vmx(vcpu)->nested.current_vmptr != -1ull)
+   nested_release_page(to_vmx(vcpu)->nested.current_vmcs12_page);
vmx_free_vmcs(vcpu);
kfree(vmx-guest_msrs);
kvm_vcpu_uninit(vcpu);
@@ -4249,6 +4310,9 @@ static struct kvm_vcpu *vmx_create_vcpu(
goto free_vmcs;
}
 
+   vmx->nested.current_vmptr = -1ull;
+   vmx->nested.current_vmcs12 = NULL;
+
return vmx-vcpu;
 
 free_vmcs:


[PATCH 06/28] nVMX: Implement reading and writing of VMX MSRs

2010-12-08 Thread Nadav Har'El
When the guest can use VMX instructions (when the nested module option is
on), it should also be able to read and write VMX MSRs, e.g., to query about
VMX capabilities. This patch adds this support.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |  117 +++
 arch/x86/kvm/x86.c |6 +-
 2 files changed, 122 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/x86.c  2010-12-08 18:56:49.0 +0200
+++ .after/arch/x86/kvm/x86.c   2010-12-08 18:56:49.0 +0200
@@ -796,7 +796,11 @@ static u32 msrs_to_save[] = {
 #ifdef CONFIG_X86_64
MSR_CSTAR, MSR_KERNEL_GS_BASE, MSR_SYSCALL_MASK, MSR_LSTAR,
 #endif
-   MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA
+   MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
+   MSR_IA32_FEATURE_CONTROL,  MSR_IA32_VMX_BASIC,
+   MSR_IA32_VMX_PINBASED_CTLS, MSR_IA32_VMX_PROCBASED_CTLS,
+   MSR_IA32_VMX_EXIT_CTLS, MSR_IA32_VMX_ENTRY_CTLS,
+   MSR_IA32_VMX_PROCBASED_CTLS2, MSR_IA32_VMX_EPT_VPID_CAP,
 };
 
 static unsigned num_msrs_to_save;
--- .before/arch/x86/kvm/vmx.c  2010-12-08 18:56:49.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2010-12-08 18:56:49.0 +0200
@@ -1211,6 +1211,119 @@ static void vmx_adjust_tsc_offset(struct
 }
 
 /*
+ * If we allow our guest to use VMX instructions (i.e., nested VMX), we should
+ * also let it use VMX-specific MSRs.
+ * vmx_get_vmx_msr() and vmx_set_vmx_msr() return 0 when we handled a
+ * VMX-specific MSR, or 1 when we haven't (and the caller should handle it
+ * like all other MSRs).
+ */
+static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
+{
+   u64 vmx_msr = 0;
+   u32 vmx_msr_high, vmx_msr_low;
+
+   switch (msr_index) {
+   case MSR_IA32_FEATURE_CONTROL:
+   *pdata = 0;
+   break;
+   case MSR_IA32_VMX_BASIC:
+   /*
+* This MSR reports some information about VMX support of the
+* processor. We should return information about the VMX we
+* emulate for the guest, and the VMCS structure we give it -
+* not about the VMX support of the underlying hardware.
+* However, some capabilities of the underlying hardware are
+* used directly by our emulation (e.g., the physical address
+* width), so these are copied from what the hardware reports.
+*/
+   *pdata = VMCS12_REVISION | (((u64)sizeof(struct vmcs12)) << 32);
+   rdmsrl(MSR_IA32_VMX_BASIC, vmx_msr);
+#define VMX_BASIC_64   0x0001000000000000LLU
+#define VMX_BASIC_MEM_TYPE 0x003c000000000000LLU
+#define VMX_BASIC_INOUT    0x0040000000000000LLU
+   *pdata |= vmx_msr &
+   (VMX_BASIC_64 | VMX_BASIC_MEM_TYPE | VMX_BASIC_INOUT);
+   break;
+#define CORE2_PINBASED_CTLS_MUST_BE_ONE 0x00000016
+#define MSR_IA32_VMX_TRUE_PINBASED_CTLS 0x48d
+   case MSR_IA32_VMX_TRUE_PINBASED_CTLS:
+   case MSR_IA32_VMX_PINBASED_CTLS:
+   vmx_msr_low  = CORE2_PINBASED_CTLS_MUST_BE_ONE;
+   vmx_msr_high = CORE2_PINBASED_CTLS_MUST_BE_ONE |
+   PIN_BASED_EXT_INTR_MASK |
+   PIN_BASED_NMI_EXITING |
+   PIN_BASED_VIRTUAL_NMIS;
+   *pdata = vmx_msr_low | ((u64)vmx_msr_high << 32);
+   break;
+   case MSR_IA32_VMX_PROCBASED_CTLS:
+   /* This MSR determines which vm-execution controls the L1
+* hypervisor may ask, or may not ask, to enable. Normally we
+* can only allow enabling features which the hardware can
+* support, but we limit ourselves to allowing only known
+* features that were tested nested. We allow disabling any
+* feature (even if the hardware can't disable it).
+*/
+   rdmsr(MSR_IA32_VMX_PROCBASED_CTLS, vmx_msr_low, vmx_msr_high);
+
+   vmx_msr_low = 0; /* allow disabling any feature */
+   vmx_msr_high &= /* do not expose new untested features */
+   CPU_BASED_HLT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
+   CPU_BASED_CR3_STORE_EXITING | CPU_BASED_USE_IO_BITMAPS |
+   CPU_BASED_MOV_DR_EXITING | CPU_BASED_USE_TSC_OFFSETING |
+   CPU_BASED_MWAIT_EXITING | CPU_BASED_MONITOR_EXITING |
+   CPU_BASED_INVLPG_EXITING | CPU_BASED_TPR_SHADOW |
+   CPU_BASED_USE_MSR_BITMAPS |
+#ifdef CONFIG_X86_64
+   CPU_BASED_CR8_LOAD_EXITING |
+   CPU_BASED_CR8_STORE_EXITING |
+#endif
+   CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
+   *pdata = vmx_msr_low | ((u64)vmx_msr_high << 32);
+   break;
+   case MSR_IA32_VMX_EXIT_CTLS:
+ 

[PATCH 07/28] nVMX: Decoding memory operands of VMX instructions

2010-12-08 Thread Nadav Har'El
This patch includes a utility function for decoding pointer operands of VMX
instructions issued by L1 (a guest hypervisor)

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |   59 +++
 arch/x86/kvm/x86.c |3 +-
 arch/x86/kvm/x86.h |3 ++
 3 files changed, 64 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/x86.c  2010-12-08 18:56:49.0 +0200
+++ .after/arch/x86/kvm/x86.c   2010-12-08 18:56:49.0 +0200
@@ -3688,7 +3688,7 @@ static int kvm_fetch_guest_virt(gva_t ad
  exception);
 }
 
-static int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
+int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
   struct kvm_vcpu *vcpu,
   struct x86_exception *exception)
 {
@@ -3696,6 +3696,7 @@ static int kvm_read_guest_virt(gva_t add
return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, access,
  exception);
 }
+EXPORT_SYMBOL_GPL(kvm_read_guest_virt);
 
 static int kvm_read_guest_virt_system(gva_t addr, void *val, unsigned int 
bytes,
  struct kvm_vcpu *vcpu,
--- .before/arch/x86/kvm/x86.h  2010-12-08 18:56:49.0 +0200
+++ .after/arch/x86/kvm/x86.h   2010-12-08 18:56:49.0 +0200
@@ -74,6 +74,9 @@ void kvm_before_handle_nmi(struct kvm_vc
 void kvm_after_handle_nmi(struct kvm_vcpu *vcpu);
 int kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq);
 
+int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
+   struct kvm_vcpu *vcpu, struct x86_exception *exception);
+
 void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data);
 
 #endif
--- .before/arch/x86/kvm/vmx.c  2010-12-08 18:56:49.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2010-12-08 18:56:49.0 +0200
@@ -3936,6 +3936,65 @@ static int handle_vmoff(struct kvm_vcpu 
 }
 
 /*
+ * Decode the memory-address operand of a vmx instruction, as recorded on an
+ * exit caused by such an instruction (run by a guest hypervisor).
+ * On success, returns 0. When the operand is invalid, returns 1 and throws
+ * #UD or #GP.
+ */
+static int get_vmx_mem_address(struct kvm_vcpu *vcpu,
+unsigned long exit_qualification,
+u32 vmx_instruction_info, gva_t *ret)
+{
+   /*
+* According to Vol. 3B, Information for VM Exits Due to Instruction
+* Execution, on an exit, vmx_instruction_info holds most of the
+* addressing components of the operand. Only the displacement part
+* is put in exit_qualification (see 3B, Basic VM-Exit Information).
+* For how an actual address is calculated from all these components,
+* refer to Vol. 1, Operand Addressing.
+*/
+   int  scaling = vmx_instruction_info & 3;
+   int  addr_size = (vmx_instruction_info >> 7) & 7;
+   bool is_reg = vmx_instruction_info & (1u << 10);
+   int  seg_reg = (vmx_instruction_info >> 15) & 7;
+   int  index_reg = (vmx_instruction_info >> 18) & 0xf;
+   bool index_is_valid = !(vmx_instruction_info & (1u << 22));
+   int  base_reg   = (vmx_instruction_info >> 23) & 0xf;
+   bool base_is_valid  = !(vmx_instruction_info & (1u << 27));
+
+   if (is_reg) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+   }
+
+   switch (addr_size) {
+   case 1: /* 32 bit. high bits are undefined according to the spec: */
+   exit_qualification &= 0xffffffff;
+   break;
+   case 2: /* 64 bit */
+   break;
+   default: /* 16 bit */
+   return 1;
+   }
+
+   /* Addr = segment_base + offset */
+   /* offset = base + [index * scale] + displacement */
+   *ret = vmx_get_segment_base(vcpu, seg_reg);
+   if (base_is_valid)
+   *ret += kvm_register_read(vcpu, base_reg);
+   if (index_is_valid)
+   *ret += kvm_register_read(vcpu, index_reg)<<scaling;
+   *ret += exit_qualification; /* holds the displacement */
+   /*
+* TODO: throw #GP (and return 1) in various cases that the VM*
+* instructions require it - e.g., offset beyond segment limit,
+* unusable or unreadable/unwritable segment, non-canonical 64-bit
+* address, and so on. Currently these are not checked.
+*/
+   return 0;
+}
+
+/*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
  * to be done to userspace and return 0.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 08/28] nVMX: Hold a vmcs02 for each vmcs12

2010-12-08 Thread Nadav Har'El
In this patch we add a list of L0 (hardware) VMCSs, which we'll use to hold a 
hardware VMCS for each active vmcs12 (i.e., for each L2 guest).

We call each of these L0 VMCSs a vmcs02, as it is the VMCS that L0 uses
to run its nested guest L2.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |   96 +++
 1 file changed, 96 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2010-12-08 18:56:49.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2010-12-08 18:56:49.0 +0200
@@ -155,6 +155,12 @@ struct __packed vmcs12 {
  */
 #define VMCS12_REVISION 0x11e57ed0
 
+struct vmcs_list {
+   struct list_head list;
+   gpa_t vmcs12_addr;
+   struct vmcs *vmcs02;
+};
+
 /*
  * The nested_vmx structure is part of vcpu_vmx, and holds information we need
  * for correct emulation of VMX (i.e., nested VMX) on this vcpu. For example,
@@ -170,6 +176,10 @@ struct nested_vmx {
/* The host-usable pointer to the above */
struct page *current_vmcs12_page;
struct vmcs12 *current_vmcs12;
+
+   /* list of real (hardware) VMCS, one for each L2 guest of L1 */
+   struct list_head vmcs02_list; /* a vmcs_list */
+   int vmcs02_num;
 };
 
 struct vcpu_vmx {
@@ -1736,6 +1746,85 @@ static void free_vmcs(struct vmcs *vmcs)
free_pages((unsigned long)vmcs, vmcs_config.order);
 }
 
+static struct vmcs *nested_get_current_vmcs(struct kvm_vcpu *vcpu)
+{
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   struct vmcs_list *list_item, *n;
+
+   list_for_each_entry_safe(list_item, n, &vmx->nested.vmcs02_list, list)
+   if (list_item->vmcs12_addr == vmx->nested.current_vmptr)
+   return list_item->vmcs02;
+
+   return NULL;
+}
+
+/*
+ * Allocate an L0 VMCS (vmcs02) for the current L1 VMCS (vmcs12), if one
+ * does not already exist. The allocation is done in L0 memory, so to avoid
+ * a denial-of-service attack by guests, we limit the number of concurrently-
+ * allocated vmcs02s. A well-behaved L1 will VMCLEAR unused vmcs12s and not
+ * trigger this limit.
+ */
+static const int NESTED_MAX_VMCS = 256;
+static int nested_create_current_vmcs(struct kvm_vcpu *vcpu)
+{
+   struct vmcs_list *new_l2_guest;
+   struct vmcs *vmcs02;
+
+   if (nested_get_current_vmcs(vcpu))
+   return 0; /* nothing to do - we already have a VMCS */
+
+   if (to_vmx(vcpu)->nested.vmcs02_num >= NESTED_MAX_VMCS)
+   return -ENOMEM;
+
+   new_l2_guest = (struct vmcs_list *)
+   kmalloc(sizeof(struct vmcs_list), GFP_KERNEL);
+   if (!new_l2_guest)
+   return -ENOMEM;
+
+   vmcs02 = alloc_vmcs();
+   if (!vmcs02) {
+   kfree(new_l2_guest);
+   return -ENOMEM;
+   }
+
+   new_l2_guest->vmcs12_addr = to_vmx(vcpu)->nested.current_vmptr;
+   new_l2_guest->vmcs02 = vmcs02;
+   list_add(&(new_l2_guest->list), &(to_vmx(vcpu)->nested.vmcs02_list));
+   to_vmx(vcpu)->nested.vmcs02_num++;
+   return 0;
+}
+
+/* Free a vmcs12's associated vmcs02, and remove it from vmcs02_list */
+static void nested_free_vmcs(struct kvm_vcpu *vcpu, gpa_t vmptr)
+{
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   struct vmcs_list *list_item, *n;
+
+   list_for_each_entry_safe(list_item, n, &vmx->nested.vmcs02_list, list)
+   if (list_item->vmcs12_addr == vmptr) {
+   free_vmcs(list_item->vmcs02);
+   list_del(&(list_item->list));
+   kfree(list_item);
+   vmx->nested.vmcs02_num--;
+   return;
+   }
+}
+
+static void free_l1_state(struct kvm_vcpu *vcpu)
+{
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   struct vmcs_list *list_item, *n;
+
+   list_for_each_entry_safe(list_item, n,
+   &vmx->nested.vmcs02_list, list) {
+   free_vmcs(list_item->vmcs02);
+   list_del(&(list_item->list));
+   kfree(list_item);
+   }
+   vmx->nested.vmcs02_num = 0;
+}
+
 static void free_kvm_area(void)
 {
int cpu;
@@ -3884,6 +3973,9 @@ static int handle_vmon(struct kvm_vcpu *
return 1;
}
 
+   INIT_LIST_HEAD(&(vmx->nested.vmcs02_list));
+   vmx->nested.vmcs02_num = 0;

vmx->nested.vmxon = true;
 
skip_emulated_instruction(vcpu);
@@ -3931,6 +4023,8 @@ static int handle_vmoff(struct kvm_vcpu 
if (to_vmx(vcpu)->nested.current_vmptr != -1ull)
nested_release_page(to_vmx(vcpu)->nested.current_vmcs12_page);
 
+   free_l1_state(vcpu);
+
skip_emulated_instruction(vcpu);
return 1;
 }
@@ -4420,6 +4514,8 @@ static void vmx_free_vcpu(struct kvm_vcp
free_vpid(vmx);
if (vmx->nested.vmxon && to_vmx(vcpu)->nested.current_vmptr != -1ull)
nested_release_page(to_vmx(vcpu)->nested.current_vmcs12_page);
+   if (vmx->nested.vmxon)
+   free_l1_state(vcpu);

[PATCH 11/28] nVMX: Implement VMCLEAR

2010-12-08 Thread Nadav Har'El
This patch implements the VMCLEAR instruction.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |   60 ++-
 1 file changed, 59 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c  2010-12-08 18:56:50.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2010-12-08 18:56:50.0 +0200
@@ -279,6 +279,8 @@ struct __packed vmcs12 {
u32 abort;
 
struct vmcs_fields fields;
+
+   bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
 };
 
 /*
@@ -4413,6 +4415,62 @@ static void nested_vmx_failValid(struct 
get_vmcs12_fields(vcpu)->vm_instruction_error = vm_instruction_error;
 }
 
+/* Emulate the VMCLEAR instruction */
+static int handle_vmclear(struct kvm_vcpu *vcpu)
+{
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   gva_t gva;
+   gpa_t vmcs12_addr;
+   struct vmcs12 *vmcs12;
+   struct page *page;
+
+   if (!nested_vmx_check_permission(vcpu))
+   return 1;
+
+   if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+   vmcs_read32(VMX_INSTRUCTION_INFO), &gva))
+   return 1;
+
+   if (kvm_read_guest_virt(gva, &vmcs12_addr, sizeof(vmcs12_addr),
+   vcpu, NULL)) {
+   kvm_queue_exception(vcpu, PF_VECTOR);
+   return 1;
+   }
+
+   if (!IS_ALIGNED(vmcs12_addr, PAGE_SIZE)) {
+   nested_vmx_failValid(vcpu, VMXERR_VMCLEAR_INVALID_ADDRESS);
+   skip_emulated_instruction(vcpu);
+   return 1;
+   }
+
+   if (vmcs12_addr == vmx->nested.current_vmptr) {
+   nested_release_page(vmx->nested.current_vmcs12_page);
+   vmx->nested.current_vmptr = -1ull;
+   }
+
+   page = nested_get_page(vcpu, vmcs12_addr);
+   if (page == NULL) {
+   /*
+* For accurate processor emulation, VMCLEAR beyond available
+* physical memory should do nothing at all. However, it is
+* possible that a nested vmx bug, not a guest hypervisor bug,
+* resulted in this case, so let's shut down before doing any
+* more damage:
+*/
+   set_bit(KVM_REQ_TRIPLE_FAULT, &vcpu->requests);
+   return 1;
+   }
+   vmcs12 = kmap(page);
+   vmcs12->launch_state = 0;
+   nested_release_page(page);
+
+   nested_free_vmcs(vcpu, vmcs12_addr);
+
+   skip_emulated_instruction(vcpu);
+   nested_vmx_succeed(vcpu);
+   return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -4434,7 +4492,7 @@ static int (*kvm_vmx_exit_handlers[])(st
[EXIT_REASON_INVD]= handle_invd,
[EXIT_REASON_INVLPG]  = handle_invlpg,
[EXIT_REASON_VMCALL]  = handle_vmcall,
-   [EXIT_REASON_VMCLEAR] = handle_vmx_insn,
+   [EXIT_REASON_VMCLEAR] = handle_vmclear,
[EXIT_REASON_VMLAUNCH]= handle_vmx_insn,
[EXIT_REASON_VMPTRLD] = handle_vmx_insn,
[EXIT_REASON_VMPTRST] = handle_vmx_insn,


[PATCH 12/28] nVMX: Implement VMPTRLD

2010-12-08 Thread Nadav Har'El
This patch implements the VMPTRLD instruction.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |   61 ++-
 1 file changed, 60 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c  2010-12-08 18:56:50.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2010-12-08 18:56:50.0 +0200
@@ -4471,6 +4471,65 @@ static int handle_vmclear(struct kvm_vcp
return 1;
 }
 
+/* Emulate the VMPTRLD instruction */
+static int handle_vmptrld(struct kvm_vcpu *vcpu)
+{
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   gva_t gva;
+   gpa_t vmcs12_addr;
+
+   if (!nested_vmx_check_permission(vcpu))
+   return 1;
+
+   if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+   vmcs_read32(VMX_INSTRUCTION_INFO), &gva))
+   return 1;
+
+   if (kvm_read_guest_virt(gva, &vmcs12_addr, sizeof(vmcs12_addr),
+   vcpu, NULL)) {
+   kvm_queue_exception(vcpu, PF_VECTOR);
+   return 1;
+   }
+
+   if (!IS_ALIGNED(vmcs12_addr, PAGE_SIZE)) {
+   nested_vmx_failValid(vcpu, VMXERR_VMPTRLD_INVALID_ADDRESS);
+   skip_emulated_instruction(vcpu);
+   return 1;
+   }
+
+   if (vmx->nested.current_vmptr != vmcs12_addr) {
+   struct vmcs12 *new_vmcs12;
+   struct page *page;
+   page = nested_get_page(vcpu, vmcs12_addr);
+   if (page == NULL) {
+   nested_vmx_failInvalid(vcpu);
+   skip_emulated_instruction(vcpu);
+   return 1;
+   }
+   new_vmcs12 = kmap(page);
+   if (new_vmcs12->revision_id != VMCS12_REVISION) {
+   nested_release_page_clean(page);
+   nested_vmx_failValid(vcpu,
+   VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID);
+   skip_emulated_instruction(vcpu);
+   return 1;
+   }
+   if (vmx->nested.current_vmptr != -1ull)
+   nested_release_page(vmx->nested.current_vmcs12_page);
+
+   vmx->nested.current_vmptr = vmcs12_addr;
+   vmx->nested.current_vmcs12 = new_vmcs12;
+   vmx->nested.current_vmcs12_page = page;
+
+   if (nested_create_current_vmcs(vcpu))
+   return -ENOMEM;
+   }
+
+   nested_vmx_succeed(vcpu);
+   skip_emulated_instruction(vcpu);
+   return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -4494,7 +4553,7 @@ static int (*kvm_vmx_exit_handlers[])(st
[EXIT_REASON_VMCALL]  = handle_vmcall,
[EXIT_REASON_VMCLEAR] = handle_vmclear,
[EXIT_REASON_VMLAUNCH]= handle_vmx_insn,
-   [EXIT_REASON_VMPTRLD] = handle_vmx_insn,
+   [EXIT_REASON_VMPTRLD] = handle_vmptrld,
[EXIT_REASON_VMPTRST] = handle_vmx_insn,
[EXIT_REASON_VMREAD]  = handle_vmx_insn,
[EXIT_REASON_VMRESUME]= handle_vmx_insn,


[PATCH 13/28] nVMX: Implement VMPTRST

2010-12-08 Thread Nadav Har'El
This patch implements the VMPTRST instruction. 

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |   27 ++-
 arch/x86/kvm/x86.c |3 ++-
 arch/x86/kvm/x86.h |3 +++
 3 files changed, 31 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/x86.c  2010-12-08 18:56:50.0 +0200
+++ .after/arch/x86/kvm/x86.c   2010-12-08 18:56:50.0 +0200
@@ -3705,7 +3705,7 @@ static int kvm_read_guest_virt_system(gv
return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, 0, exception);
 }
 
-static int kvm_write_guest_virt_system(gva_t addr, void *val,
+int kvm_write_guest_virt_system(gva_t addr, void *val,
   unsigned int bytes,
   struct kvm_vcpu *vcpu,
   struct x86_exception *exception)
@@ -3736,6 +3736,7 @@ static int kvm_write_guest_virt_system(g
 out:
return r;
 }
+EXPORT_SYMBOL_GPL(kvm_write_guest_virt_system);
 
 static int emulator_read_emulated(unsigned long addr,
  void *val,
--- .before/arch/x86/kvm/x86.h  2010-12-08 18:56:50.0 +0200
+++ .after/arch/x86/kvm/x86.h   2010-12-08 18:56:50.0 +0200
@@ -77,6 +77,9 @@ int kvm_inject_realmode_interrupt(struct
 int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
struct kvm_vcpu *vcpu, struct x86_exception *exception);
 
+int kvm_write_guest_virt_system(gva_t addr, void *val, unsigned int bytes,
+   struct kvm_vcpu *vcpu, struct x86_exception *exception);
+
 void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data);
 
 #endif
--- .before/arch/x86/kvm/vmx.c  2010-12-08 18:56:50.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2010-12-08 18:56:50.0 +0200
@@ -4530,6 +4530,31 @@ static int handle_vmptrld(struct kvm_vcp
return 1;
 }
 
+/* Emulate the VMPTRST instruction */
+static int handle_vmptrst(struct kvm_vcpu *vcpu)
+{
+   unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+   u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+   gva_t vmcs_gva;
+
+   if (!nested_vmx_check_permission(vcpu))
+   return 1;
+
+   if (get_vmx_mem_address(vcpu, exit_qualification,
+   vmx_instruction_info, &vmcs_gva))
+   return 1;
+   /* ok to use *_system, because handle_vmread verified cpl=0 */
+   if (kvm_write_guest_virt_system(vmcs_gva,
+(void *)&to_vmx(vcpu)->nested.current_vmptr,
+sizeof(u64), vcpu, NULL)) {
+   kvm_queue_exception(vcpu, PF_VECTOR);
+   return 1;
+   }
+   nested_vmx_succeed(vcpu);
+   skip_emulated_instruction(vcpu);
+   return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -4554,7 +4579,7 @@ static int (*kvm_vmx_exit_handlers[])(st
[EXIT_REASON_VMCLEAR] = handle_vmclear,
[EXIT_REASON_VMLAUNCH]= handle_vmx_insn,
[EXIT_REASON_VMPTRLD] = handle_vmptrld,
-   [EXIT_REASON_VMPTRST] = handle_vmx_insn,
+   [EXIT_REASON_VMPTRST] = handle_vmptrst,
[EXIT_REASON_VMREAD]  = handle_vmx_insn,
[EXIT_REASON_VMRESUME]= handle_vmx_insn,
[EXIT_REASON_VMWRITE] = handle_vmx_insn,


[PATCH 15/28] nVMX: Prepare vmcs02 from vmcs01 and vmcs12

2010-12-08 Thread Nadav Har'El
This patch contains code to prepare the VMCS which can be used to actually
run the L2 guest, vmcs02. prepare_vmcs02 appropriately merges the information
in vmcs12 (the vmcs that L1 built for L2) and in vmcs01 (the vmcs that we
built for L1).

VMREAD/VMWRITE can only access one VMCS at a time (the current VMCS), which
makes it difficult for us to read from vmcs01 while writing to vmcs02. This
is why we first make a copy of vmcs01 in memory (vmcs01_fields) and then
read that memory copy while writing to vmcs02.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |  409 +++
 1 file changed, 409 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2010-12-08 18:56:50.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2010-12-08 18:56:50.0 +0200
@@ -805,6 +805,28 @@ static inline bool report_flexpriority(v
return flexpriority_enabled;
 }
 
+static inline bool nested_cpu_has_vmx_tpr_shadow(struct kvm_vcpu *vcpu)
+{
+   return cpu_has_vmx_tpr_shadow() &&
+   get_vmcs12_fields(vcpu)->cpu_based_vm_exec_control &
+   CPU_BASED_TPR_SHADOW;
+}
+
+static inline bool nested_cpu_has_secondary_exec_ctrls(struct kvm_vcpu *vcpu)
+{
+   return cpu_has_secondary_exec_ctrls() &&
+   get_vmcs12_fields(vcpu)->cpu_based_vm_exec_control &
+   CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
+}
+
+static inline bool nested_vm_need_virtualize_apic_accesses(struct kvm_vcpu
+  *vcpu)
+{
+   return nested_cpu_has_secondary_exec_ctrls(vcpu) &&
+   (get_vmcs12_fields(vcpu)->secondary_vm_exec_control &
+   SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
+}
+
 static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
 {
int i;
@@ -1253,6 +1275,37 @@ static void vmx_load_host_state(struct v
preempt_enable();
 }
 
+int load_vmcs_host_state(struct vmcs_fields *src)
+{
+   vmcs_write16(HOST_ES_SELECTOR, src->host_es_selector);
+   vmcs_write16(HOST_CS_SELECTOR, src->host_cs_selector);
+   vmcs_write16(HOST_SS_SELECTOR, src->host_ss_selector);
+   vmcs_write16(HOST_DS_SELECTOR, src->host_ds_selector);
+   vmcs_write16(HOST_FS_SELECTOR, src->host_fs_selector);
+   vmcs_write16(HOST_GS_SELECTOR, src->host_gs_selector);
+   vmcs_write16(HOST_TR_SELECTOR, src->host_tr_selector);
+
+   if (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PAT)
+   vmcs_write64(HOST_IA32_PAT, src->host_ia32_pat);
+
+   vmcs_write32(HOST_IA32_SYSENTER_CS, src->host_ia32_sysenter_cs);
+
+   vmcs_writel(HOST_CR0, src->host_cr0);
+   vmcs_writel(HOST_CR3, src->host_cr3);
+   vmcs_writel(HOST_CR4, src->host_cr4);
+   vmcs_writel(HOST_FS_BASE, src->host_fs_base);
+   vmcs_writel(HOST_GS_BASE, src->host_gs_base);
+   vmcs_writel(HOST_TR_BASE, src->host_tr_base);
+   vmcs_writel(HOST_GDTR_BASE, src->host_gdtr_base);
+   vmcs_writel(HOST_IDTR_BASE, src->host_idtr_base);
+   vmcs_writel(HOST_RSP, src->host_rsp);
+   vmcs_writel(HOST_RIP, src->host_rip);
+   vmcs_writel(HOST_IA32_SYSENTER_ESP, src->host_ia32_sysenter_esp);
+   vmcs_writel(HOST_IA32_SYSENTER_EIP, src->host_ia32_sysenter_eip);
+
+   return 0;
+}
+
 /*
  * Switches to specified vcpu, until a matching vcpu_put(), but assumes
  * vcpu mutex is already taken.
@@ -5365,6 +5418,362 @@ static void vmx_set_supported_cpuid(u32 
entry-ecx |= bit(X86_FEATURE_VMX);
 }
 
+/*
+ * Make a copy of the current VMCS to ordinary memory. This is needed because
+ * in VMX you cannot read and write to two VMCS at the same time, so when we
+ * want to do this (in prepare_vmcs02, which needs to read from vmcs01 while
+ * preparing vmcs02), we need to first save a copy of one VMCS's fields in
+ * memory, and then use that copy.
+ */
+void save_vmcs(struct vmcs_fields *dst)
+{
+   dst->guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR);
+   dst->guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR);
+   dst->guest_ss_selector = vmcs_read16(GUEST_SS_SELECTOR);
+   dst->guest_ds_selector = vmcs_read16(GUEST_DS_SELECTOR);
+   dst->guest_fs_selector = vmcs_read16(GUEST_FS_SELECTOR);
+   dst->guest_gs_selector = vmcs_read16(GUEST_GS_SELECTOR);
+   dst->guest_ldtr_selector = vmcs_read16(GUEST_LDTR_SELECTOR);
+   dst->guest_tr_selector = vmcs_read16(GUEST_TR_SELECTOR);
+   dst->host_es_selector = vmcs_read16(HOST_ES_SELECTOR);
+   dst->host_cs_selector = vmcs_read16(HOST_CS_SELECTOR);
+   dst->host_ss_selector = vmcs_read16(HOST_SS_SELECTOR);
+   dst->host_ds_selector = vmcs_read16(HOST_DS_SELECTOR);
+   dst->host_fs_selector = vmcs_read16(HOST_FS_SELECTOR);
+   dst->host_gs_selector = vmcs_read16(HOST_GS_SELECTOR);
+   dst->host_tr_selector = vmcs_read16(HOST_TR_SELECTOR);
+   dst->io_bitmap_a = vmcs_read64(IO_BITMAP_A);
+   dst->io_bitmap_b = vmcs_read64(IO_BITMAP_B);
+   if (cpu_has_vmx_msr_bitmap())
+   

[PATCH 16/28] nVMX: Move register-syncing to a function

2010-12-08 Thread Nadav Har'El
Move code that syncs dirty RSP and RIP registers back to the VMCS, into a
function. We will need to call this function from additional places in the
next patch.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |   15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

--- .before/arch/x86/kvm/vmx.c  2010-12-08 18:56:50.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2010-12-08 18:56:50.0 +0200
@@ -5033,6 +5033,15 @@ static void vmx_cancel_injection(struct 
vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
 }
 
+static inline void sync_cached_regs_to_vmcs(struct kvm_vcpu *vcpu)
+{
+   if (test_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_dirty))
+   vmcs_writel(GUEST_RSP, vcpu->arch.regs[VCPU_REGS_RSP]);
+   if (test_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_dirty))
+   vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]);
+   vcpu->arch.regs_dirty = 0;
+}
+
 #ifdef CONFIG_X86_64
 #define R r
 #define Q q
@@ -5054,10 +5063,7 @@ static void vmx_vcpu_run(struct kvm_vcpu
if (vmx->emulation_required && emulate_invalid_guest_state)
return;
 
-   if (test_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_dirty))
-   vmcs_writel(GUEST_RSP, vcpu->arch.regs[VCPU_REGS_RSP]);
-   if (test_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_dirty))
-   vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]);
+   sync_cached_regs_to_vmcs(vcpu);
 
/* When single-stepping over STI and MOV SS, we must clear the
 * corresponding interruptibility bits in the guest state. Otherwise
@@ -5165,7 +5171,6 @@ static void vmx_vcpu_run(struct kvm_vcpu
 
vcpu->arch.regs_avail = ~((1 << VCPU_REGS_RIP) | (1 << VCPU_REGS_RSP)
  | (1 << VCPU_EXREG_PDPTR));
-   vcpu->arch.regs_dirty = 0;
 
vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
 


[PATCH 17/28] nVMX: Implement VMLAUNCH and VMRESUME

2010-12-08 Thread Nadav Har'El
Implement the VMLAUNCH and VMRESUME instructions, allowing a guest
hypervisor to run its own guests.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |  235 ++-
 1 file changed, 232 insertions(+), 3 deletions(-)

--- .before/arch/x86/kvm/vmx.c  2010-12-08 18:56:50.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2010-12-08 18:56:50.0 +0200
@@ -281,6 +281,9 @@ struct __packed vmcs12 {
struct vmcs_fields fields;
 
bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
+
+   int cpu;
+   int launched;
 };
 
 /*
@@ -315,6 +318,21 @@ struct nested_vmx {
/* list of real (hardware) VMCS, one for each L2 guest of L1 */
struct list_head vmcs02_list; /* a vmcs_list */
int vmcs02_num;
+
+   /* Level 1 state for switching to level 2 and back */
+   struct  {
+   u64 efer;
+   u64 io_bitmap_a;
+   u64 io_bitmap_b;
+   u64 msr_bitmap;
+   int cpu;
+   int launched;
+   } l1_state;
+   /* Saving the VMCS that we used for running L1 */
+   struct vmcs *vmcs01;
+   struct vmcs_fields *vmcs01_fields;
+   /* Saving some vcpu->arch.* data we had for L1, while running L2 */
+   unsigned long l1_arch_cr3;
 };
 
 struct vcpu_vmx {
@@ -1344,6 +1362,16 @@ static void vmx_vcpu_load(struct kvm_vcp
 
rdmsrl(MSR_IA32_SYSENTER_ESP, sysenter_esp);
vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */
+
+   if (vmx->nested.vmcs01_fields != NULL) {
+   struct vmcs_fields *vmcs01 = vmx->nested.vmcs01_fields;
+   vmcs01->host_tr_base = vmcs_readl(HOST_TR_BASE);
+   vmcs01->host_gdtr_base = vmcs_readl(HOST_GDTR_BASE);
+   vmcs01->host_ia32_sysenter_esp =
+   vmcs_readl(HOST_IA32_SYSENTER_ESP);
+   if (is_guest_mode(vcpu))
+   load_vmcs_host_state(vmcs01);
+   }
}
 }
 
@@ -2173,6 +2201,9 @@ static void free_l1_state(struct kvm_vcp
kfree(list_item);
}
vmx->nested.vmcs02_num = 0;
+
+   kfree(vmx->nested.vmcs01_fields);
+   vmx->nested.vmcs01_fields = NULL;
 }
 
 static void free_kvm_area(void)
@@ -4326,6 +4357,10 @@ static int handle_vmon(struct kvm_vcpu *
INIT_LIST_HEAD(&(vmx->nested.vmcs02_list));
vmx->nested.vmcs02_num = 0;

+   vmx->nested.vmcs01_fields = kzalloc(PAGE_SIZE, GFP_KERNEL);
+   if (!vmx->nested.vmcs01_fields)
+   return -ENOMEM;
+
vmx->nested.vmxon = true;
 
skip_emulated_instruction(vcpu);
@@ -4524,6 +4559,50 @@ static int handle_vmclear(struct kvm_vcp
return 1;
 }
 
+static int nested_vmx_run(struct kvm_vcpu *vcpu);
+
+static int handle_launch_or_resume(struct kvm_vcpu *vcpu, bool launch)
+{
+   if (!nested_vmx_check_permission(vcpu))
+   return 1;
+
+   /* yet another strange pre-requisite listed in the VMX spec */
+   if (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
+   GUEST_INTR_STATE_MOV_SS) {
+   nested_vmx_failValid(vcpu,
+   VMXERR_ENTRY_EVENTS_BLOCKED_BY_MOV_SS);
+   skip_emulated_instruction(vcpu);
+   return 1;
+   }
+
+   if (to_vmx(vcpu)->nested.current_vmcs12->launch_state == launch) {
+   /* Must use VMLAUNCH for the first time, VMRESUME later */
+   nested_vmx_failValid(vcpu,
+   launch ? VMXERR_VMLAUNCH_NONCLEAR_VMCS :
+VMXERR_VMRESUME_NONLAUNCHED_VMCS);
+   skip_emulated_instruction(vcpu);
+   return 1;
+   }
+
+   skip_emulated_instruction(vcpu);
+
+   nested_vmx_run(vcpu);
+   return 1;
+}
+
+/* Emulate the VMLAUNCH instruction */
+static int handle_vmlaunch(struct kvm_vcpu *vcpu)
+{
+   return handle_launch_or_resume(vcpu, true);
+}
+
+/* Emulate the VMRESUME instruction */
+static int handle_vmresume(struct kvm_vcpu *vcpu)
+{
+
+   return handle_launch_or_resume(vcpu, false);
+}
+
 enum vmcs_field_type {
VMCS_FIELD_TYPE_U16 = 0,
VMCS_FIELD_TYPE_U64 = 1,
@@ -4797,11 +4876,11 @@ static int (*kvm_vmx_exit_handlers[])(st
[EXIT_REASON_INVLPG]  = handle_invlpg,
[EXIT_REASON_VMCALL]  = handle_vmcall,
[EXIT_REASON_VMCLEAR] = handle_vmclear,
-   [EXIT_REASON_VMLAUNCH]= handle_vmx_insn,
+   [EXIT_REASON_VMLAUNCH]= handle_vmlaunch,
[EXIT_REASON_VMPTRLD] = handle_vmptrld,
[EXIT_REASON_VMPTRST] = handle_vmptrst,
[EXIT_REASON_VMREAD]  = handle_vmread,
-   [EXIT_REASON_VMRESUME]= handle_vmx_insn,
+   [EXIT_REASON_VMRESUME]= handle_vmresume,

[PATCH 19/28] nVMX: Exiting from L2 to L1

2010-12-08 Thread Nadav Har'El
This patch implements nested_vmx_vmexit(), called when the nested L2 guest
exits and we want to run its L1 parent and let it handle this exit.

Note that this will not necessarily be called on every L2 exit. L0 may decide
to handle a particular exit on its own, without L1's involvement; In that
case, L0 will handle the exit, and resume running L2, without running L1 and
without calling nested_vmx_vmexit(). The logic for deciding whether to handle
a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
will appear in the next patch.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |  233 +++
 1 file changed, 233 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2010-12-08 18:56:51.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2010-12-08 18:56:51.0 +0200
@@ -5092,6 +5092,8 @@ static void __vmx_complete_interrupts(st
 
 static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 {
+   if (is_guest_mode(&vmx->vcpu))
+   return;
__vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
  VM_EXIT_INSTRUCTION_LEN,
  IDT_VECTORING_ERROR_CODE);
@@ -6002,6 +6004,237 @@ static int nested_vmx_run(struct kvm_vcp
return 1;
 }
 
+/*
+ * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date
+ * because L2 may have changed some cr0 bits directly (see CRO_GUEST_HOST_MASK)
+ * without L0 trapping the change and updating vmcs12.
+ * This function returns the value we should put in vmcs12.guest_cr0. It's not
+ * enough to just return the current (vmcs02) GUEST_CR0. This may not be the
+ * guest cr0 that L1 thought it was giving its L2 guest - it is possible that
+ * L1 wished to allow its guest to set a cr0 bit directly, but we (L0) asked
+ * to trap this change and instead set just the read shadow. If this is the
+ * case, we need to copy these read-shadow bits back to vmcs12.guest_cr0, where
+ * L1 believes they already are.
+ */
+static inline unsigned long
+vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs_fields *vmcs12)
+{
+   unsigned long guest_cr0_bits =
+   vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask;
+   return (vmcs_readl(GUEST_CR0) & guest_cr0_bits) |
+   (vmcs_readl(CR0_READ_SHADOW) & ~guest_cr0_bits);
+}
+
+static inline unsigned long
+vmcs12_guest_cr4(struct kvm_vcpu *vcpu, struct vmcs_fields *vmcs12)
+{
+   unsigned long guest_cr4_bits =
+   vcpu->arch.cr4_guest_owned_bits | vmcs12->cr4_guest_host_mask;
+   return (vmcs_readl(GUEST_CR4) & guest_cr4_bits) |
+   (vmcs_readl(CR4_READ_SHADOW) & ~guest_cr4_bits);
+}
+
+/*
+ * prepare_vmcs12 is called when the nested L2 guest exits and we want to
+ * prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12), and this
+ * function updates it to reflect the changes to the guest state while L2 was
+ * running (and perhaps made some exits which were handled directly by L0
+ * without going back to L1), and to reflect the exit reason.
+ * Note that we do not have to copy here all VMCS fields, just those that
+ * could have changed by the L2 guest or the exit - i.e., the guest-state and
+ * exit-information fields only. Other fields are modified by L1 with VMWRITE,
+ * which already writes to vmcs12 directly.
+ */
+void prepare_vmcs12(struct kvm_vcpu *vcpu)
+{
+   struct vmcs_fields *vmcs12 = get_vmcs12_fields(vcpu);
+
+   /* update guest state fields: */
+   vmcs12->guest_cr0 = vmcs12_guest_cr0(vcpu, vmcs12);
+   vmcs12->guest_cr4 = vmcs12_guest_cr4(vcpu, vmcs12);
+
+   vmcs12->guest_dr7 = vmcs_readl(GUEST_DR7);
+   vmcs12->guest_rsp = vmcs_readl(GUEST_RSP);
+   vmcs12->guest_rip = vmcs_readl(GUEST_RIP);
+   vmcs12->guest_rflags = vmcs_readl(GUEST_RFLAGS);
+
+   vmcs12->guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR);
+   vmcs12->guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR);
+   vmcs12->guest_ss_selector = vmcs_read16(GUEST_SS_SELECTOR);
+   vmcs12->guest_ds_selector = vmcs_read16(GUEST_DS_SELECTOR);
+   vmcs12->guest_fs_selector = vmcs_read16(GUEST_FS_SELECTOR);
+   vmcs12->guest_gs_selector = vmcs_read16(GUEST_GS_SELECTOR);
+   vmcs12->guest_ldtr_selector = vmcs_read16(GUEST_LDTR_SELECTOR);
+   vmcs12->guest_tr_selector = vmcs_read16(GUEST_TR_SELECTOR);
+   vmcs12->guest_es_limit = vmcs_read32(GUEST_ES_LIMIT);
+   vmcs12->guest_cs_limit = vmcs_read32(GUEST_CS_LIMIT);
+   vmcs12->guest_ss_limit = vmcs_read32(GUEST_SS_LIMIT);
+   vmcs12->guest_ds_limit = vmcs_read32(GUEST_DS_LIMIT);
+   vmcs12->guest_fs_limit = vmcs_read32(GUEST_FS_LIMIT);
+   vmcs12->guest_gs_limit = vmcs_read32(GUEST_GS_LIMIT);
+   vmcs12->guest_ldtr_limit = vmcs_read32(GUEST_LDTR_LIMIT);
+   vmcs12->guest_tr_limit = vmcs_read32(GUEST_TR_LIMIT);
+   vmcs12->guest_gdtr_limit = vmcs_read32(GUEST_GDTR_LIMIT);
+   vmcs12->guest_idtr_limit = 

[PATCH 21/28] nVMX: Correct handling of interrupt injection

2010-12-08 Thread Nadav Har'El
When KVM wants to inject an interrupt, the guest should think a real interrupt
has happened. Normally (in the non-nested case) this means checking that the
guest doesn't block interrupts (and if it does, inject when it doesn't - using
the interrupt window VMX mechanism), and setting up the appropriate VMCS
fields for the guest to receive the interrupt.

However, when we are running a nested guest (L2) and its hypervisor (L1)
requested exits on interrupts (as most hypervisors do), the most efficient
thing to do is to exit L2, telling L1 that the exit was caused by an
interrupt, the one we were injecting; Only when L1 asked not to be notified
of interrupts, we should inject directly to the running L2 guest (i.e.,
the normal code path).

However, properly doing what is described above requires invasive changes to
the flow of the existing code, which we elected not to do in this stage.
Instead we do something more simplistic and less efficient: we modify
vmx_interrupt_allowed(), which kvm calls to see if it can inject the interrupt
now, to exit from L2 to L1 before continuing the normal code. The normal kvm
code then notices that L1 is blocking interrupts, and sets the interrupt
window to inject the interrupt later to L1. Shortly after, L1 gets the
interrupt while it is itself running, not as an exit from L2. The cost is an
extra L1 exit (the interrupt window).

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |   33 +
 1 file changed, 33 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2010-12-08 18:56:51.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2010-12-08 18:56:51.0 +0200
@@ -3462,9 +3462,25 @@ out:
return ret;
 }
 
+/*
+ * In nested virtualization, check if L1 asked to exit on external interrupts.
+ * For most existing hypervisors, this will always return true.
+ */
+static bool nested_exit_on_intr(struct kvm_vcpu *vcpu)
+{
+   return get_vmcs12_fields(vcpu)->pin_based_vm_exec_control &
+   PIN_BASED_EXT_INTR_MASK;
+}
+
 static void enable_irq_window(struct kvm_vcpu *vcpu)
 {
u32 cpu_based_vm_exec_control;
+   if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
+   /* We can get here when nested_run_pending caused
+* vmx_interrupt_allowed() to return false. In this case, do
+* nothing - the interrupt will be injected later.
+*/
+   return;
 
cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
@@ -3578,6 +3594,13 @@ static void vmx_set_nmi_mask(struct kvm_
 
 static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
 {
+   if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
+   if (to_vmx(vcpu)-nested.nested_run_pending)
+   return 0;
+   nested_vmx_vmexit(vcpu, true);
+   /* fall through to normal code, but now in L1, not L2 */
+   }
+
return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
!(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
(GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
@@ -5126,6 +5149,14 @@ static int vmx_handle_exit(struct kvm_vc
if (enable_ept  is_paging(vcpu))
vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
 
+   /*
+* the KVM_REQ_EVENT optimization bit is only on for one entry, and if
+* we did not inject a still-pending event to L1 now because of
+* nested_run_pending, we need to re-enable this bit.
+*/
+   if (vmx->nested.nested_run_pending)
+   kvm_make_request(KVM_REQ_EVENT, vcpu);
+
if (exit_reason == EXIT_REASON_VMLAUNCH ||
exit_reason == EXIT_REASON_VMRESUME)
vmx->nested.nested_run_pending = 1;
@@ -5317,6 +5348,8 @@ static void vmx_complete_interrupts(stru
 
 static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
 {
+   if (is_guest_mode(vcpu))
+   return;
__vmx_complete_interrupts(to_vmx(vcpu),
  vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
  VM_ENTRY_INSTRUCTION_LEN,
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 22/28] nVMX: Correct handling of exception injection

2010-12-08 Thread Nadav Har'El
Similar to the previous patch, but concerning injection of exceptions rather
than external interrupts.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |   26 ++
 1 file changed, 26 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2010-12-08 18:56:51.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2010-12-08 18:56:51.0 +0200
@@ -1491,6 +1491,25 @@ static void skip_emulated_instruction(st
vmx_set_interrupt_shadow(vcpu, 0);
 }
 
+/*
+ * KVM wants to inject page-faults which it got to the guest. This function
+ * checks whether in a nested guest, we need to inject them to L1 or L2.
+ * This function assumes it is called with the exit reason in vmcs02 being
+ * a #PF exception (this is the only case in which KVM injects a #PF when L2
+ * is running).
+ */
+static int nested_pf_handled(struct kvm_vcpu *vcpu)
+{
+   struct vmcs_fields *vmcs12 = get_vmcs12_fields(vcpu);
+
+   /* TODO: also check PFEC_MATCH/MASK, not just EB.PF. */
+   if (!(vmcs12->exception_bitmap & PF_VECTOR))
+   return 0;
+
+   nested_vmx_vmexit(vcpu, false);
+   return 1;
+}
+
 static void vmx_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
bool has_error_code, u32 error_code,
bool reinject)
@@ -1498,6 +1517,10 @@ static void vmx_queue_exception(struct k
struct vcpu_vmx *vmx = to_vmx(vcpu);
u32 intr_info = nr | INTR_INFO_VALID_MASK;
 
+   if (nr == PF_VECTOR && is_guest_mode(vcpu) &&
+   nested_pf_handled(vcpu))
+   return;
+
if (has_error_code) {
vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, error_code);
intr_info |= INTR_INFO_DELIVER_CODE_MASK;
@@ -3533,6 +3556,9 @@ static void vmx_inject_nmi(struct kvm_vc
 {
struct vcpu_vmx *vmx = to_vmx(vcpu);
 
+   if (is_guest_mode(vcpu))
+   return;
+
if (!cpu_has_virtual_nmis()) {
/*
 * Tracking the NMI-blocked state in software is built upon
--


[PATCH 23/28] nVMX: Correct handling of idt vectoring info

2010-12-08 Thread Nadav Har'El
This patch adds correct handling of IDT_VECTORING_INFO_FIELD for the nested
case.

When a guest exits while handling an interrupt or exception, we get this
information in IDT_VECTORING_INFO_FIELD in the VMCS. When L2 exits to L1,
there's nothing we need to do, because L1 will see this field in vmcs12, and
handle it itself. However, when L2 exits and L0 handles the exit itself and
plans to return to L2, L0 must inject this event to L2.

In the normal non-nested case, the idt_vectoring_info case is discovered after
the exit, and the decision to inject (though not the injection itself) is made
at that point. However, in the nested case a decision of whether to return
to L2 or L1 also happens during the injection phase (see the previous
patches), so in the nested case we can only decide what to do about the
idt_vectoring_info right after the injection, i.e., in the beginning of
vmx_vcpu_run, which is the first time we know for sure if we're staying in
L2 (i.e., nested_mode is true).
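The re-injection step this patch performs in nested_handle_valid_idt_vectoring_info() is essentially a bit-field copy from the exit's vectoring-info word into the entry's interrupt-info word. A standalone sketch, using the field encodings from the Intel SDM (the helper name reinject_info is invented for this example):

```c
#include <assert.h>
#include <stdint.h>

/* Field layout of the 32-bit vectoring-info word, per the Intel SDM. */
#define VECTORING_INFO_VECTOR_MASK       0xffu        /* bits 7:0  */
#define VECTORING_INFO_TYPE_MASK         0x700u       /* bits 10:8 */
#define VECTORING_INFO_DELIVER_CODE_MASK 0x800u       /* bit 11    */
#define INTR_INFO_VALID_MASK             0x80000000u  /* bit 31    */

/* Rebuild VM_ENTRY_INTR_INFO_FIELD from a saved IDT_VECTORING_INFO_FIELD
 * so the interrupted event is re-delivered to L2 on the next entry. */
static uint32_t reinject_info(uint32_t idt_vectoring_info)
{
    uint32_t irq  = idt_vectoring_info & VECTORING_INFO_VECTOR_MASK;
    uint32_t type = idt_vectoring_info & VECTORING_INFO_TYPE_MASK;
    uint32_t err  = idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK;

    return irq | type | err | INTR_INFO_VALID_MASK;
}
```

The instruction length and error code are written to their own VMCS fields separately, as the patch below does.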

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |   32 
 1 file changed, 32 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2010-12-08 18:56:51.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2010-12-08 18:56:51.0 +0200
@@ -335,6 +335,10 @@ struct nested_vmx {
unsigned long l1_arch_cr3;
/* L2 must run next, and mustn't decide to exit to L1. */
bool nested_run_pending;
+   /* true if last exit was of L2, and had a valid idt_vectoring_info */
+   bool valid_idt_vectoring_info;
+   /* These are saved if valid_idt_vectoring_info */
+   u32 vm_exit_instruction_len, idt_vectoring_error_code;
 };
 
 struct vcpu_vmx {
@@ -5384,6 +5388,22 @@ static void vmx_cancel_injection(struct 
vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
 }
 
+static void nested_handle_valid_idt_vectoring_info(struct vcpu_vmx *vmx)
+{
+   int irq  = vmx->idt_vectoring_info & VECTORING_INFO_VECTOR_MASK;
+   int type = vmx->idt_vectoring_info & VECTORING_INFO_TYPE_MASK;
+   int errCodeValid = vmx->idt_vectoring_info &
+   VECTORING_INFO_DELIVER_CODE_MASK;
+   vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
+   irq | type | INTR_INFO_VALID_MASK | errCodeValid);
+
+   vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
+   vmx->nested.vm_exit_instruction_len);
+   if (errCodeValid)
+   vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
+   vmx->nested.idt_vectoring_error_code);
+}
+
 static inline void sync_cached_regs_to_vmcs(struct kvm_vcpu *vcpu)
 {
if (test_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_dirty))
@@ -5405,6 +5425,9 @@ static void vmx_vcpu_run(struct kvm_vcpu
 {
struct vcpu_vmx *vmx = to_vmx(vcpu);
 
+   if (is_guest_mode(vcpu) && vmx->nested.valid_idt_vectoring_info)
+   nested_handle_valid_idt_vectoring_info(vmx);
+
/* Record the guest's net vcpu time for enforced NMI injections. */
if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked))
vmx->entry_time = ktime_get();
@@ -5525,6 +5548,15 @@ static void vmx_vcpu_run(struct kvm_vcpu
 
vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
 
+   vmx->nested.valid_idt_vectoring_info = is_guest_mode(vcpu) &&
+   (vmx->idt_vectoring_info & VECTORING_INFO_VALID_MASK);
+   if (vmx->nested.valid_idt_vectoring_info) {
+   vmx->nested.vm_exit_instruction_len =
+   vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
+   vmx->nested.idt_vectoring_error_code =
+   vmcs_read32(IDT_VECTORING_ERROR_CODE);
+   }
+
asm("mov %0, %%ds; mov %0, %%es" : : "r"(__USER_DS));
vmx->launched = 1;
 
--


[PATCH 24/28] nVMX: Handling of CR0 and CR4 modifying instructions

2010-12-08 Thread Nadav Har'El
When L2 tries to modify CR0 or CR4 (with mov or clts), and modifies a bit
which L1 asked to shadow (via CR[04]_GUEST_HOST_MASK), we already do the right
thing: we let L1 handle the trap (see nested_vmx_exit_handled_cr() in a
previous patch).
When L2 modifies bits that L1 doesn't care about, we let it think (via
CR[04]_READ_SHADOW) that it did these modifications, while only changing
(in GUEST_CR[04]) the bits that L0 doesn't shadow.

This is needed for correct handling of CR0.TS for lazy FPU loading: L0 may
want to leave TS on, while pretending to allow the guest to change it.
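The bit-merging that handle_set_cr0()/handle_set_cr4() perform below can be checked in isolation. A hedged user-space sketch (the function name merge_cr is invented; the mask parameter plays the role of cr[04]_guest_owned_bits):

```c
#include <assert.h>

#define X86_CR0_TS 0x8UL

/* Bits the guest owns come from the value it wrote; bits the host
 * shadows keep their current hardware value. */
static unsigned long merge_cr(unsigned long val, unsigned long hw_cr,
                              unsigned long guest_owned_bits)
{
    return (val & guest_owned_bits) | (hw_cr & ~guest_owned_bits);
}
```

The read shadow still receives the guest's full written value, so the guest observes its own write even when the hardware bit is withheld.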

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |   54 ---
 1 file changed, 51 insertions(+), 3 deletions(-)

--- .before/arch/x86/kvm/vmx.c  2010-12-08 18:56:51.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2010-12-08 18:56:51.0 +0200
@@ -3877,6 +3877,54 @@ static void complete_insn_gp(struct kvm_
skip_emulated_instruction(vcpu);
 }
 
+/* called to set cr0 as appropriate for a mov-to-cr0 exit. */
+static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val)
+{
+   if (is_guest_mode(vcpu)) {
+   /*
+* We get here when L2 changed cr0 in a way that did not change
+* any of L1's shadowed bits (see nested_vmx_exit_handled_cr),
+* but did change L0 shadowed bits. This can currently happen
+* with the TS bit: L0 may want to leave TS on (for lazy fpu
+* loading) while pretending to allow the guest to change it.
+*/
+   vmcs_writel(GUEST_CR0,
+  (val & vcpu->arch.cr0_guest_owned_bits) |
+  (vmcs_readl(GUEST_CR0) & ~vcpu->arch.cr0_guest_owned_bits));
+   vmcs_writel(CR0_READ_SHADOW, val);
+   vcpu->arch.cr0 = val;
+   return 0;
+   } else
+   return kvm_set_cr0(vcpu, val);
+}
+
+static int handle_set_cr4(struct kvm_vcpu *vcpu, unsigned long val)
+{
+   if (is_guest_mode(vcpu)) {
+   vmcs_writel(GUEST_CR4,
+ (val & vcpu->arch.cr4_guest_owned_bits) |
+ (vmcs_readl(GUEST_CR4) & ~vcpu->arch.cr4_guest_owned_bits));
+   vmcs_writel(CR4_READ_SHADOW, val);
+   vcpu->arch.cr4 = val;
+   return 0;
+   } else
+   return kvm_set_cr4(vcpu, val);
+}
+
+
+/* called to set cr0 as appropriate for clts instruction exit. */
+static void handle_clts(struct kvm_vcpu *vcpu)
+{
+   if (is_guest_mode(vcpu)) {
+   /* As in handle_set_cr0(), we can't call vmx_set_cr0 here */
+   vmcs_writel(GUEST_CR0, vmcs_readl(GUEST_CR0) & ~X86_CR0_TS);
+   vmcs_writel(CR0_READ_SHADOW,
+   vmcs_readl(CR0_READ_SHADOW) & ~X86_CR0_TS);
+   vcpu->arch.cr0 &= ~X86_CR0_TS;
+   } else
+   vmx_set_cr0(vcpu, kvm_read_cr0_bits(vcpu, ~X86_CR0_TS));
+}
+
 static int handle_cr(struct kvm_vcpu *vcpu)
 {
unsigned long exit_qualification, val;
@@ -3893,7 +3941,7 @@ static int handle_cr(struct kvm_vcpu *vc
trace_kvm_cr_write(cr, val);
switch (cr) {
case 0:
-   err = kvm_set_cr0(vcpu, val);
+   err = handle_set_cr0(vcpu, val);
complete_insn_gp(vcpu, err);
return 1;
case 3:
@@ -3901,7 +3949,7 @@ static int handle_cr(struct kvm_vcpu *vc
complete_insn_gp(vcpu, err);
return 1;
case 4:
-   err = kvm_set_cr4(vcpu, val);
+   err = handle_set_cr4(vcpu, val);
complete_insn_gp(vcpu, err);
return 1;
case 8: {
@@ -3919,7 +3967,7 @@ static int handle_cr(struct kvm_vcpu *vc
};
break;
case 2: /* clts */
-   vmx_set_cr0(vcpu, kvm_read_cr0_bits(vcpu, ~X86_CR0_TS));
+   handle_clts(vcpu);
trace_kvm_cr_write(0, kvm_read_cr0(vcpu));
skip_emulated_instruction(vcpu);
vmx_fpu_activate(vcpu);
--


[PATCH 25/28] nVMX: Further fixes for lazy FPU loading

2010-12-08 Thread Nadav Har'El
KVM's Lazy FPU loading means that sometimes L0 needs to set CR0.TS, even
if a guest didn't set it. Moreover, L0 must also trap CR0.TS changes and
NM exceptions, even if we have a guest hypervisor (L1) who didn't want these
traps. And of course, conversely: If L1 wanted to trap these events, we
must let it, even if L0 is not interested in them.

This patch fixes some existing KVM code (in update_exception_bitmap(),
vmx_fpu_activate(), vmx_fpu_deactivate()) to do the correct merging of L0's
and L1's needs. Note that handle_cr() was already fixed in the above patch,
and that new code in introduced in previous patches already handles CR0
correctly (see prepare_vmcs02(), prepare_vmcs12(), and nested_vmx_vmexit()).
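The merge of L0's and L1's trap requirements is a bitwise OR with one L0-specific adjustment for lazy FPU loading. A user-space sketch (the function name merged_exception_bitmap is invented; the vector number follows the x86 convention):

```c
#include <assert.h>
#include <stdint.h>

#define NM_VECTOR 7   /* #NM: device-not-available, used for lazy FPU */

/* L0 may drop its own #NM trap once the guest FPU is active, but every
 * exception L1 asked to intercept for L2 must remain trapped. */
static uint32_t merged_exception_bitmap(uint32_t l0_eb, uint32_t l1_eb,
                                        int fpu_active)
{
    uint32_t eb = l0_eb;

    if (fpu_active)
        eb &= ~(1u << NM_VECTOR);
    return eb | l1_eb;
}
```

Because the OR is applied last, L1's #NM interest survives even when L0 no longer needs the trap, which is exactly the case the patch fixes.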

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |   38 +++---
 1 file changed, 35 insertions(+), 3 deletions(-)

--- .before/arch/x86/kvm/vmx.c  2010-12-08 18:56:52.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2010-12-08 18:56:52.0 +0200
@@ -1098,6 +1098,15 @@ static void update_exception_bitmap(stru
eb &= ~(1u << PF_VECTOR); /* bypass_guest_pf = 0 */
if (vcpu->fpu_active)
eb &= ~(1u << NM_VECTOR);
+
+   /* When we are running a nested L2 guest and L1 specified for it a
+* certain exception bitmap, we must trap the same exceptions and pass
+* them to L1. When running L2, we will only handle the exceptions
+* specified above if L1 did not want them.
+*/
+   if (is_guest_mode(vcpu))
+   eb |= get_vmcs12_fields(vcpu)->exception_bitmap;
+
vmcs_write32(EXCEPTION_BITMAP, eb);
 }
 
@@ -1415,8 +1424,19 @@ static void vmx_fpu_activate(struct kvm_
cr0 &= ~(X86_CR0_TS | X86_CR0_MP);
cr0 |= kvm_read_cr0_bits(vcpu, X86_CR0_TS | X86_CR0_MP);
vmcs_writel(GUEST_CR0, cr0);
-   update_exception_bitmap(vcpu);
vcpu->arch.cr0_guest_owned_bits = X86_CR0_TS;
+   if (is_guest_mode(vcpu)) {
+   /* While we (L0) no longer care about NM exceptions or cr0.TS
+* changes, our guest hypervisor (L1) might care in which case
+* we must trap them for it.
+*/
+   u32 eb = vmcs_read32(EXCEPTION_BITMAP) & ~(1u << NM_VECTOR);
+   struct vmcs_fields *vmcs12 = get_vmcs12_fields(vcpu);
+   eb |= vmcs12->exception_bitmap;
+   vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
+   vmcs_write32(EXCEPTION_BITMAP, eb);
+   } else
+   update_exception_bitmap(vcpu);
vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
 }
 
@@ -1424,12 +1444,24 @@ static void vmx_decache_cr0_guest_bits(s
 
 static void vmx_fpu_deactivate(struct kvm_vcpu *vcpu)
 {
+   /* Note that there is no vcpu->fpu_active = 0 here. The caller must
+* set this *before* calling this function.
+*/
vmx_decache_cr0_guest_bits(vcpu);
vmcs_set_bits(GUEST_CR0, X86_CR0_TS | X86_CR0_MP);
-   update_exception_bitmap(vcpu);
+   vmcs_write32(EXCEPTION_BITMAP,
+   vmcs_read32(EXCEPTION_BITMAP) | (1u << NM_VECTOR));
vcpu->arch.cr0_guest_owned_bits = 0;
vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
-   vmcs_writel(CR0_READ_SHADOW, vcpu->arch.cr0);
+   if (is_guest_mode(vcpu))
+   /* Unfortunately in nested mode we play with arch.cr0's PG
+* bit, so we mustn't copy it all, just the relevant TS bit
+*/
+   vmcs_writel(CR0_READ_SHADOW,
+   (vmcs_readl(CR0_READ_SHADOW) & ~X86_CR0_TS) |
+   (vcpu->arch.cr0 & X86_CR0_TS));
+   else
+   vmcs_writel(CR0_READ_SHADOW, vcpu->arch.cr0);
 }
 
 static unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu)
--


[PATCH 26/28] nVMX: Additional TSC-offset handling

2010-12-08 Thread Nadav Har'El
In the unlikely case that L1 does not capture MSR_IA32_TSC, L0 needs to
emulate this MSR write by L2 by modifying vmcs02.tsc_offset.
We also need to set vmcs12.tsc_offset, for this change to survive the next
nested entry (see prepare_vmcs02()).
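The arithmetic here is simply that vmcs02's total TSC offset is the sum of L0's offset for L1 (vmcs01) and L1's offset for L2 (vmcs12), so writing a new total means storing the difference back into vmcs12. A sketch with invented names:

```c
#include <assert.h>
#include <stdint.h>

struct tsc_offsets {
    int64_t vmcs01; /* L0's offset for L1 */
    int64_t vmcs12; /* L1's offset for L2; must survive prepare_vmcs02() */
};

/* Emulate an L0 write of the total (vmcs02) offset while L2 runs. */
static void write_l2_tsc_offset(struct tsc_offsets *t, int64_t total)
{
    t->vmcs12 = total - t->vmcs01;
}
```

After the write, vmcs01 + vmcs12 again equals the total offset the hardware will apply on the next nested entry.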

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |   11 +++
 1 file changed, 11 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2010-12-08 18:56:52.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2010-12-08 18:56:52.0 +0200
@@ -1665,12 +1665,23 @@ static u64 guest_read_tsc(void)
 static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 {
vmcs_write64(TSC_OFFSET, offset);
+   if (is_guest_mode(vcpu))
+   /*
+* We are only changing TSC_OFFSET when L2 is running if for
+* some reason L1 chose not to trap the TSC MSR. Since
+* prepare_vmcs12() does not copy tsc_offset, we need to also
+* set the vmcs12 field here.
+*/
+   get_vmcs12_fields(vcpu)->tsc_offset = offset -
+   to_vmx(vcpu)->nested.vmcs01_fields->tsc_offset;
 }
 
 static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
 {
u64 offset = vmcs_read64(TSC_OFFSET);
vmcs_write64(TSC_OFFSET, offset + adjustment);
+   if (is_guest_mode(vcpu))
+   get_vmcs12_fields(vcpu)->tsc_offset += adjustment;
 }
 
 /*
--


[PATCH 27/28] nVMX: Miscellaneous small corrections

2010-12-08 Thread Nadav Har'El
Small corrections of KVM (spelling, etc.) not directly related to nested VMX.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c  2010-12-08 18:56:52.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2010-12-08 18:56:52.0 +0200
@@ -933,7 +933,7 @@ static void vmcs_load(struct vmcs *vmcs)
: "=g"(error) : "a"(phys_addr), "m"(phys_addr)
: "cc", "memory");
if (error)
-   printk(KERN_ERR "kvm: vmptrld %p/%llx fail\n",
+   printk(KERN_ERR "kvm: vmptrld %p/%llx failed\n",
   vmcs, phys_addr);
 }
 
--


[PATCH 28/28] nVMX: Documentation

2010-12-08 Thread Nadav Har'El
This patch includes a brief introduction to the nested vmx feature in the
Documentation/kvm directory. The document also includes a copy of the
vmcs12 structure, as requested by Avi Kivity.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 Documentation/kvm/nested-vmx.txt |  237 +
 1 file changed, 237 insertions(+)

--- .before/Documentation/kvm/nested-vmx.txt    2010-12-08 18:56:52.0 +0200
+++ .after/Documentation/kvm/nested-vmx.txt     2010-12-08 18:56:52.0 +0200
@@ -0,0 +1,237 @@
+Nested VMX
+==========
+
+Overview
+--------
+
+On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
+to easily and efficiently run guest operating systems. Normally, these guests
+*cannot* themselves be hypervisors running their own guests, because in VMX,
+guests cannot use VMX instructions.
+
+The Nested VMX feature adds this missing capability - of running guest
+hypervisors (which use VMX) with their own nested guests. It does so by
+allowing a guest to use VMX instructions, and correctly and efficiently
+emulating them using the single level of VMX available in the hardware.
+
+We describe in much greater detail the theory behind the nested VMX feature,
+its implementation and its performance characteristics, in the OSDI 2010 paper
+The Turtles Project: Design and Implementation of Nested Virtualization,
+available at:
+
+   http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf
+
+
+Terminology
+-----------
+
+Single-level virtualization has two levels - the host (KVM) and the guests.
+In nested virtualization, we have three levels: The host (KVM), which we call
+L0, the guest hypervisor, which we call L1, and the nested guest, which we
+call L2.
+
+
+Known limitations
+-----------------
+
+The current code supports running Linux under a guest KVM using shadow
+page tables. It supports multiple guest hypervisors, each of which can run
+multiple guests. Only 64-bit guest hypervisors are supported. SMP is
+supported, but
+is known to be buggy in this release.
+Additional patches for running Windows under guest KVM, and Linux under
+guest VMware server, and support for nested EPT, are currently running in
+the lab, and will be sent as follow-on patchsets.
+
+
+Running nested VMX
+------------------
+
+The nested VMX feature is disabled by default. It can be enabled by giving
+the nested=1 option to the kvm-intel module.
+
+
+ABIs
+----
+
+Nested VMX aims to present a standard and (eventually) fully-functional VMX
+implementation for a guest hypervisor to use. As such, the official
+specification of the ABI that it provides is Intel's VMX specification,
+namely volume 3B of their Intel 64 and IA-32 Architectures Software
+Developer's Manual. Not all of VMX's features are currently fully supported,
+but the goal is to eventually support them all, starting with the VMX features
+which are used in practice by popular hypervisors (KVM and others).
+
+As a VMX implementation, nested VMX presents a VMCS structure to L1.
+As mandated by the spec, other than the two fields revision_id and abort,
+this structure is *opaque* to its user, who is not supposed to know or care
+about its internal structure. Rather, the structure is accessed through the
+VMREAD and VMWRITE instructions.
+Still, for debugging purposes, KVM developers might be interested to know the
+internals of this structure; This is struct vmcs12 from arch/x86/kvm/vmx.c.
+For convenience, we repeat its content here. If the internals of this structure
+change, this can break live migration across KVM versions. VMCS12_REVISION
+(from vmx.c) should be changed if struct vmcs12 or its inner struct shadow_vmcs
+is ever changed.
+
+struct __packed vmcs12 {
+   /* According to the Intel spec, a VMCS region must start with the
+* following two fields. Then follow implementation-specific data.
+*/
+   u32 revision_id;
+   u32 abort;
+
+   struct shadow_vmcs shadow_vmcs;
+
+   bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
+
+   int cpu;
+   int launched;
+};
+
+struct __packed shadow_vmcs {
+   u16 virtual_processor_id;
+   u16 guest_es_selector;
+   u16 guest_cs_selector;
+   u16 guest_ss_selector;
+   u16 guest_ds_selector;
+   u16 guest_fs_selector;
+   u16 guest_gs_selector;
+   u16 guest_ldtr_selector;
+   u16 guest_tr_selector;
+   u16 host_es_selector;
+   u16 host_cs_selector;
+   u16 host_ss_selector;
+   u16 host_ds_selector;
+   u16 host_fs_selector;
+   u16 host_gs_selector;
+   u16 host_tr_selector;
+   u64 io_bitmap_a;
+   u64 io_bitmap_b;
+   u64 msr_bitmap;
+   u64 vm_exit_msr_store_addr;
+   u64 vm_exit_msr_load_addr;
+   u64 vm_entry_msr_load_addr;
+   u64 tsc_offset;
+   u64 virtual_apic_page_addr;
+   u64 apic_access_addr;
+   u64 ept_pointer;
+   u64 guest_physical_address;
+   u64 vmcs_link_pointer;
+   u64 

Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-08 Thread Rik van Riel

On 12/03/2010 08:23 AM, Peter Zijlstra wrote:

On Thu, 2010-12-02 at 14:44 -0500, Rik van Riel wrote:
unsigned long clone_flags);

+
+#ifdef CONFIG_SCHED_HRTICK
+extern u64 slice_remain(struct task_struct *);
+extern void yield_to(struct task_struct *);
+#else
+static inline void yield_to(struct task_struct *p) yield()
+#endif


What does SCHED_HRTICK have to do with any of this?


Legacy from an old prototype this patch is based on.
I'll get rid of that.


+/**
+ * requeue_task - requeue a task whose priority got changed by yield_to


priority doesn't seem the right word, you're not actually changing
anything related to p-*prio


True, I'll change the comment.


+ * @rq: the task's runqueue
+ * @p: the task in question
+ * Must be called with the runqueue lock held. Will cause the CPU to
+ * reschedule if p is now at the head of the runqueue.
+ */
+void requeue_task(struct rq *rq, struct task_struct *p)
+{
+   assert_spin_locked(&rq->lock);
+
+   if (!p->se.on_rq || task_running(rq, p) || task_has_rt_policy(p))
+   return;
+
+   dequeue_task(rq, p, 0);
+   enqueue_task(rq, p, 0);
+
+   resched_task(p);


I guess that wants to be something like check_preempt_curr()


Done.  Thanks for pointing that out.


@@ -6797,6 +6817,36 @@ SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned 
int, len,
return ret;
  }

+#ifdef CONFIG_SCHED_HRTICK


Still wondering what all this has to do with SCHED_HRTICK..


+/*
+ * Yield the CPU, giving the remainder of our time slice to task p.
+ * Typically used to hand CPU time to another thread inside the same
+ * process, eg. when p holds a resource other threads are waiting for.
+ * Giving priority to p may help get that resource released sooner.
+ */
+void yield_to(struct task_struct *p)
+{
+   unsigned long flags;
+   struct sched_entity *se = &p->se;
+   struct rq *rq;
+   struct cfs_rq *cfs_rq;
+   u64 remain = slice_remain(current);
+
+   rq = task_rq_lock(p, &flags);
+   if (task_running(rq, p) || task_has_rt_policy(p))
+   goto out;


See, this all ain't nice, slice_remain() don't make no sense to be
called for !fair tasks.

Why not write:

   if (curr->sched_class == p->sched_class &&
       curr->sched_class->yield_to)
	curr->sched_class->yield_to(curr, p);

or something, and then implement sched_class_fair::yield_to only,
leaving it a NOP for all other classes.


Done.


+   cfs_rq = cfs_rq_of(se);
+   se->vruntime -= remain;
+   if (se->vruntime < cfs_rq->min_vruntime)
+   se->vruntime = cfs_rq->min_vruntime;


Now here we have another problem, remain was measured in wall-time, and
then you go change a virtual time measure using that. These things are
related like:

  vt = t/weight

So you're missing a weight factor somewhere.
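Peter's point is that CFS scales wall-clock runtime by load weight before adding it to vruntime, as the kernel's calc_delta_fair() does. A minimal sketch of the missing conversion (the function name is invented and the NICE_0_LOAD value follows the kernel's convention; this is illustrative, not the actual fix):

```c
#include <assert.h>
#include <stdint.h>

#define NICE_0_LOAD 1024u   /* load weight of a nice-0 task */

/* vt = t * NICE_0_LOAD / weight: a heavier task's vruntime advances
 * more slowly, so the same wall-clock slice is worth less virtual time. */
static uint64_t wall_to_vruntime(uint64_t delta_exec, uint32_t weight)
{
    return delta_exec * NICE_0_LOAD / weight;
}
```

Subtracting an unscaled wall-clock remainder from vruntime therefore over-credits heavy tasks and under-credits light ones.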

Also, that check against min_vruntime doesn't really make much sense.


OK, how do I do this?


+   requeue_task(rq, p);


Just makes me wonder why you added requeue task to begin with.. why not
simply dequeue at the top of this function, and enqueue at the tail,
like all the rest does: see rt_mutex_setprio(), set_user_nice(),
sched_move_task().


Done.


+ out:
+   task_rq_unlock(rq, &flags);
+   yield();
+}
+EXPORT_SYMBOL(yield_to);


EXPORT_SYMBOL_GPL() pretty please, I really hate how kvm is a module and
needs to export hooks all over the core kernel :/


Done.


Right, so another approach might be to simply swap the vruntime between
curr and p.


Doesn't that run into the same scale issue you described
above?

--
All rights reversed
--


kvm hangs on mkfs

2010-12-08 Thread Hanno Böck
Hi,

I tried a relatively simple task with qemu-kvm. I have two qcow hd images and 
try to create filesystems on them using a gentoo installation disk.

Starting qemu with:
qemu -m 512 -cdrom install-x86-minimal-20101116.iso -hda hda.img -hdb hdb.img


However, mkfs always hangs indefinitely. Doesn't really matter if ext2/3/4, it 
always hangs at
Writing superblocks and filesystem accounting information:

Any idea where to start looking for the problem? (please cc me as I'm not 
subscribed to this list)

-- 
Hanno Böck  Blog:   http://www.hboeck.de/
GPG: 3DBD3B20   Jabber/Mail:ha...@hboeck.de

http://schokokeks.org - professional webhosting




Re: [PATCH v2 1/2] Do not register kvmclock savevm section if kvmclock is disabled.

2010-12-08 Thread Marcelo Tosatti
On Tue, Dec 07, 2010 at 03:12:36PM -0200, Glauber Costa wrote:
 On Mon, 2010-12-06 at 19:04 -0200, Marcelo Tosatti wrote:
  On Mon, Dec 06, 2010 at 09:03:46AM -0500, Glauber Costa wrote:
   Usually nobody usually thinks about that scenario (me included and 
   specially),
   but kvmclock can be actually disabled in the host.
   
   It happens in two scenarios:
1. host too old.
2. we passed -kvmclock to our -cpu parameter.
   
   In both cases, we should not register kvmclock savevm section. This patch
   achieves that by registering this section only if kvmclock is actually
   currently enabled in cpuid.
   
   The only caveat is that we have to register the savevm section a little 
   bit
   later, since we won't know the final kvmclock state before cpuid gets 
   parsed.
  
  What is the problem of registering the section? Restoring the value if
  the host does not support it returns an error?
  
  Can't you ignore the error if kvmclock is not reported in cpuid, in the
  restore handler?
 
 We can change the restore handler, but not the restore handler of
 binaries that are already out there. The motivation here is precisely to
 address migration to hosts without kvmclock, so it's better to have
 a way to disable, than to count on the fact that the other side will be
 able to ignore it.

OK. Can't you register conditionally on kvmclock cpuid bit at the end of
kvm_arch_init_vcpu, in target-i386/kvm.c?

--


Re: [RFC PATCH] KVM: Fix missing lock for kvm_io_bus_unregister_dev()

2010-12-08 Thread Marcelo Tosatti
On Wed, Dec 08, 2010 at 01:45:06AM +0900, Takuya Yoshikawa wrote:
 Memo:
  - kvm_io_bus_register_dev() was protected as far as I checked.
 
  - kvm_create_pit() was commented like "Caller must hold slots_lock"
but kvm_free_pit() was not. So I don't know if I should protect
the whole kvm_free_pit().
 
 What is the best fix? -- or I misread something?
 
 Thanks,
   Takuya
 
 ===
 From: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
 
 The comment for kvm_io_bus_unregister_dev() says "Caller must hold
 slots_lock" but some callers don't do so.
 
 Though this patch fixes these, more consistent locking manner may be
 needed in the future.
 
 Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
 ---
  arch/x86/kvm/i8254.c |2 ++
  arch/x86/kvm/i8259.c |2 ++
  arch/x86/kvm/x86.c   |2 ++
  virt/kvm/ioapic.c|2 ++
  4 files changed, 8 insertions(+), 0 deletions(-)
 
 diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
 index efad723..f64393c 100644
 --- a/arch/x86/kvm/i8254.c
 +++ b/arch/x86/kvm/i8254.c
 @@ -744,9 +744,11 @@ void kvm_free_pit(struct kvm *kvm)
   struct hrtimer *timer;
  
  if (kvm->arch.vpit) {
+ mutex_lock(&kvm->slots_lock);
  kvm_io_bus_unregister_dev(kvm, KVM_PIO_BUS, 
&kvm->arch.vpit->dev);
  kvm_io_bus_unregister_dev(kvm, KVM_PIO_BUS,
&kvm->arch.vpit->speaker_dev);
+ mutex_unlock(&kvm->slots_lock);
  kvm_unregister_irq_mask_notifier(kvm, 0,
 &kvm->arch.vpit->mask_notifier);
   kvm_unregister_irq_ack_notifier(kvm,

This is supposedly safe because this is only called from vm destroy
context, when dropping the last reference.

 diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
 index f628234..d9fe35d 100644
 --- a/arch/x86/kvm/i8259.c
 +++ b/arch/x86/kvm/i8259.c
 @@ -596,7 +596,9 @@ void kvm_destroy_pic(struct kvm *kvm)
  struct kvm_pic *vpic = kvm->arch.vpic;
 
  if (vpic) {
+ mutex_lock(&kvm->slots_lock);
  kvm_io_bus_unregister_dev(kvm, KVM_PIO_BUS, &vpic->dev);
+ mutex_unlock(&kvm->slots_lock);
  kvm->arch.vpic = NULL;
   kfree(vpic);
   }
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index ed373ba..48e59d1 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -3313,8 +3313,10 @@ long kvm_arch_vm_ioctl(struct file *filp,
   if (vpic) {
   r = kvm_ioapic_init(kvm);
   if (r) {
+ mutex_lock(&kvm->slots_lock);
  kvm_io_bus_unregister_dev(kvm, KVM_PIO_BUS,
&vpic->dev);
+ mutex_unlock(&kvm->slots_lock);
   kfree(vpic);
   goto create_irqchip_unlock;
   }
 diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
 index 0b9df83..fa76380 100644
 --- a/virt/kvm/ioapic.c
 +++ b/virt/kvm/ioapic.c
 @@ -409,7 +409,9 @@ void kvm_ioapic_destroy(struct kvm *kvm)
  struct kvm_ioapic *ioapic = kvm->arch.vioapic;
 
  if (ioapic) {
+ mutex_lock(&kvm->slots_lock);
  kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS, &ioapic->dev);
+ mutex_unlock(&kvm->slots_lock);
  kvm->arch.vioapic = NULL;
   kfree(ioapic);
   }
 -- 
 1.7.1

It seems the best way to fix is to move irq_lock and slots_lock
acquision from kvm_set_irq_routing/kvm_ioapic_destroy/kvm_destroy_pic to
their callers.

--


Re: kvm hangs on mkfs

2010-12-08 Thread Brian Jackson
On Wednesday, December 08, 2010 01:09:25 pm Hanno Böck wrote:
 Hi,
 
 I tried a relatively simple task with qemu-kvm. I have two qcow hd images
 and try to create filesystems on them using a gentoo installation disk.


qcow2 (I hope you are using that vs just qcow) is known to be a tad on the 
slow side for metadata-heavy operations (i.e. mkfs, installing lots of files, 
etc.). One trick some of us use is to use the -drive syntax (vs -hda) and set 
the cache option to unsafe or writeback for the install process. The other 
alternative is to use preallocated raw images (i.e. made with dd vs qemu-img). 
I've been informed that in 0.12.5 the writeback trick won't do any good due to 
some extra fsync()s. So your best bet is to upgrade to 0.13 and use 
cache=unsafe.
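
For the archives, here is roughly what the switch to -drive looks like. A sketch only: the image and ISO names are copied from the original command, and the qemu_cmd helper is purely illustrative (not an existing tool) — it just prints the command line so the two cache modes can be compared.

```shell
# Illustrative helper (not an existing tool): print the reporter's qemu
# invocation rewritten from -hda/-hdb to -drive so a cache mode can be set.
# cache=unsafe needs qemu >= 0.13; fall back to cache=writeback on 0.12.x.
qemu_cmd() {
    cache="$1"; shift
    cmd="qemu -m 512 -cdrom install-x86-minimal-20101116.iso"
    idx=0
    for img in "$@"; do
        cmd="$cmd -drive file=$img,index=$idx,media=disk,cache=$cache"
        idx=$((idx + 1))
    done
    printf '%s\n' "$cmd"
}

qemu_cmd unsafe hda.img hdb.img
```

Running the printed command in place of the original -hda/-hdb form is all the trick amounts to; the images themselves are untouched.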


 
 Starting qemu with:
 qemu -m 512 -cdrom install-x86-minimal-20101116.iso -hda hda.img -hdb
 hdb.img
 
 
 However, mkfs always hangs indefinitely. Doesn't really matter if ext2/3/4,
 it always hangs at
 Writing superblocks and filesystem accounting information:


Have you tried strace'ing to see if it's actually doing something (just very 
slowly)?


 
 Any idea where to start looking for the problem? (please cc me as I'm not
 subscribed to this list)


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-08 Thread Peter Zijlstra
On Wed, 2010-12-08 at 12:55 -0500, Rik van Riel wrote:
 
 
  Right, so another approach might be to simply swap the vruntime between
  curr and p.
 
 Doesn't that run into the same scale issue you described
 above?

Not really, but its tricky on SMP because vruntime only has meaning
within a cfs_rq.

The below is quickly cobbled together from a few patches I had lying
around that deal with similar issues.

The avg_vruntime() stuff takes the weighted average of the vruntimes on
a cfs runqueue, this weighted average corresponds to the 0-lag point,
that is the point where an ideal scheduler would have all tasks.

Using the 0-lag point you can compute the lag of a task; the lag is a
measure of service for a task: negative lag means the task is owed
service, positive lag means it has received too much service (at least,
that's the sign convention used here, I always forget what the standard is).

What we do is, when @p, the target task, is owed less service than
current, we flip the lags such that p becomes more eligible.

The trouble with all this is that computing the weighted average is
terribly expensive (it increases cost of all scheduler hot paths).

The dyn_vruntime stuff mixed in there is an attempt at computing
something similar, although its not used and I haven't tested the
quality of the approximation in a while.

Anyway, completely untested and such..

---
 include/linux/sched.h   |2 +
 kernel/sched.c  |   39 ++
 kernel/sched_debug.c|   31 -
 kernel/sched_fair.c |  179 ++-
 kernel/sched_features.h |8 +--
 5 files changed, 203 insertions(+), 56 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b0fc8ee..538559e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1095,6 +1095,8 @@ struct sched_class {
 #ifdef CONFIG_FAIR_GROUP_SCHED
void (*task_move_group) (struct task_struct *p, int on_rq);
 #endif
+
+   void (*yield_to) (struct task_struct *p);
 };
 
 struct load_weight {
diff --git a/kernel/sched.c b/kernel/sched.c
index 0debad9..fe0adb0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -313,6 +313,9 @@ struct cfs_rq {
struct load_weight load;
unsigned long nr_running;
 
+   s64 avg_vruntime;
+   u64 avg_load;
+
u64 exec_clock;
u64 min_vruntime;
 
@@ -5062,6 +5065,42 @@ SYSCALL_DEFINE0(sched_yield)
return 0;
 }
 
+void yield_to(struct task_struct *p)
+{
+   struct task_struct *curr = current;
+   struct rq *p_rq, *rq;
+   unsigned long flags;
+   int on_rq;
+
+   local_irq_save(flags);
+   rq = this_rq();
+again:
+   p_rq = task_rq(p);
+   double_rq_lock(rq, p_rq);
+   if (p_rq != task_rq(p)) {
+   double_rq_unlock(rq, p_rq);
+   goto again;
+   }
+
+   update_rq_clock(rq);
+   update_rq_clock(p_rq);
+
+   on_rq = p->se.on_rq;
+   if (on_rq)
+   dequeue_task(p_rq, p, 0);
+
+   ret = 0;
+   if (p->sched_class == curr->sched_class && curr->sched_class->yield_to)
+   curr->sched_class->yield_to(p);
+
+   if (on_rq)
+   enqueue_task(p_rq, p, 0);
+
+   double_rq_unlock(rq, p_rq);
+   local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(yield_to);
+
 static inline int should_resched(void)
 {
	return need_resched() && !(preempt_count() & PREEMPT_ACTIVE);
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 1dfae3d..b5cc773 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -138,10 +138,9 @@ static void print_rq(struct seq_file *m, struct rq *rq, 
int rq_cpu)
 
 void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 {
-   s64 MIN_vruntime = -1, min_vruntime, max_vruntime = -1,
-   spread, rq0_min_vruntime, spread0;
+   s64 left_vruntime = -1, min_vruntime, right_vruntime = -1, spread;
struct rq *rq = cpu_rq(cpu);
-   struct sched_entity *last;
+   struct sched_entity *last, *first;
unsigned long flags;
 
	SEQ_printf(m, "\ncfs_rq[%d]:\n", cpu);
@@ -149,28 +148,26 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct 
cfs_rq *cfs_rq)
	SPLIT_NS(cfs_rq->exec_clock));
 
	raw_spin_lock_irqsave(&rq->lock, flags);
-   if (cfs_rq->rb_leftmost)
-   MIN_vruntime = (__pick_next_entity(cfs_rq))->vruntime;
+   first = __pick_first_entity(cfs_rq);
+   if (first)
+   left_vruntime = first->vruntime;
	last = __pick_last_entity(cfs_rq);
	if (last)
-   max_vruntime = last->vruntime;
+   right_vruntime = last->vruntime;
	min_vruntime = cfs_rq->min_vruntime;
-   rq0_min_vruntime = cpu_rq(0)->cfs.min_vruntime;
	raw_spin_unlock_irqrestore(&rq->lock, flags);
-   SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "MIN_vruntime",
-   SPLIT_NS(MIN_vruntime));
+   SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "left_vruntime",
+   

Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-08 Thread Peter Zijlstra
On Wed, 2010-12-08 at 21:00 +0100, Peter Zijlstra wrote:
 +   lag0 = avg_vruntime(cfs_rq_of(se));
 +   p_lag0 = avg_vruntime(cfs_rq_of(p_se));
 +
 +   lag = se->vruntime - avg_vruntime(cfs_rq);
 +   p_lag = p_se->vruntime - avg_vruntime(p_cfs_rq);
 +
 +   if (p_lag > lag) { /* if P is owed less service */
 +   se->vruntime = lag0 + p_lag;
 +   p_se->vruntime = p_lag + lag;
 +   } 

clearly that should read:

  p_se->vruntime = p_lag0 + lag;


Re: Memory leaks in virtio drivers?

2010-12-08 Thread Freddie Cash
On Wed, Dec 1, 2010 at 10:16 AM, Freddie Cash fjwc...@gmail.com wrote:
 Just an update on this.  We made the change over the weekend to enable
 cache=off for all the VMs, including the libvirt-managed ones (turns
 out, libvirtd only reads the .xml files at startup); and enabled KSM
 on the host.

 5 days later, we have only 700 MB of swap used, and 15.2 GB of VM
 committed.  This appears to be the steady-state for the host, as it
 hasn't changed (cache, free, and buffer change a bit throughout the
 day).

 Unfortunately, this has exposed just how horribly unoptimised the
 storage array underneath it is.  :(  It's a single 12-drive RAID6,
 auto-carved into 2 TB chunks, and then stitched back together via LVM
 into a single volume group, with each VM getting its own logical
 volume.  We have plans over the Christmas break to re-do this as a
 RAID50 or possibly a RAID10 + RAID50.

 Thanks for all the tips and pointers.  I'm starting to get all this
 figured out and understood.  There's been a lot of changes since
 KVM-72.  :)

One final update for the archives.

Further research showed that our I/O setup was non-optimal for the
RAID controllers we are using (3Ware 9650SE).  Modifying the
following sysfs variables dropped our I/O wait down to near 0,
allowing the controller to keep up with the requests, and eliminating
our use of swap (even without using cache=none):

(for each block device)
block/sda/queue/scheduler   = deadline
block/sda/queue/nr_requests = 512
block/sda/queue/max_sectors_kb  = 128
block/sda/queue/read_ahead_kb   = 16384
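
As a sketch, the same settings can be applied in one go with a small loop. The function and the parameterized sysfs root are additions here, purely so the loop can be exercised outside a real host; on the host itself, run it as root against /sys/block.

```shell
# Apply the tuning values above to every sd* device under a sysfs root.
# The root is a parameter only so the function can be tried out safely;
# on a real host call it as: tune_blockdevs /sys/block   (as root)
tune_blockdevs() {
    root="$1"
    for dev in "$root"/sd*; do
        [ -d "$dev/queue" ] || continue
        echo deadline > "$dev/queue/scheduler"
        echo 512      > "$dev/queue/nr_requests"
        echo 128      > "$dev/queue/max_sectors_kb"
        echo 16384    > "$dev/queue/read_ahead_kb"
    done
}
```

Note that on real sysfs, echoing into queue/scheduler selects the named elevator from the kernel's list, so this assumes the deadline scheduler is built in or loaded.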

We're in the process of adding a separate 6-disk RAID10 array for our
most I/O-heavy filesystems, which should help even more.

So, it looks like our I/O setup was so horrible originally that the
controller couldn't keep up, and more RAM in the host was used for
buffering, until the host started swapping, and things would grind to
a halt.

Over a week later, and we're still sitting with 0 bytes swap used in
the host, with several 100 MB of free RAM (fluctuates), and I/O usage
rarely above 30%.

-- 
Freddie Cash
fjwc...@gmail.com


Re: kvm hangs on mkfs

2010-12-08 Thread Hanno Böck
Am Wednesday 08 December 2010 schrieb Brian Jackson:
 qcow2 (I hope you are using that vs just qcow) is known to be a tad on the
 slow side on metadata heavy operations (i.e. mkfs, installing lots of
 files, etc.). One trick some of us use is to use the -drive syntax (vs
 -hda) and set the cache option to unsafe or writeback for the install
 process. The other alternative is to use preallocated raw images (i.e.
 made with dd vs qemu-img). I've been informed that in 0.12.5 the writeback
 trick won't do any good due to some extra fsync()s. So your best bet is to
 upgrade to 0.13 and use cache=unsafe.

This helped, thanks a lot.
(Maybe this could be documented in a more obvious place? I tried googling 
quite a lot for the solution - a FAQ entry explicitly mentioning mkfs? It 
seems quite a common task to me.)

-- 
Hanno Böck  Blog:   http://www.hboeck.de/
GPG: 3DBD3B20   Jabber/Mail:ha...@hboeck.de

http://schokokeks.org - professional webhosting




Re: [RFC PATCH 3/3] kvm: use yield_to instead of sleep in kvm_vcpu_on_spin

2010-12-08 Thread Rik van Riel

On 12/05/2010 07:56 AM, Avi Kivity wrote:


+ if (vcpu == me)
+ continue;
+ if (vcpu->spinning)
+ continue;


You may well want to wake up a spinner. Suppose

A takes a lock
B preempts A
B grabs a ticket, starts spinning, yields to A
A releases lock
A grabs ticket, starts spinning

at this point, we want A to yield to B, but it won't because of this check.


That's a good point.  I guess we'll have to benchmark both with
and without the vcpu->spinning logic.


+ if (!task)
+ continue;
+ if (waitqueue_active(&vcpu->wq))
+ continue;
+ if (task->flags & PF_VCPU)
+ continue;
+ kvm->last_boosted_vcpu = i;
+ yield_to(task);
+ break;
+ }


I think a random selection algorithm will be a better fit against
special guest behaviour.


Possibly, though I suspect we'd have to hit very heavy overcommit ratios
with very large VMs before round robin stops working.


- /* Sleep for 100 us, and hope lock-holder got scheduled */
- expires = ktime_add_ns(ktime_get(), 100000UL);
- schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
+ if (first_round && last_boosted_vcpu == kvm->last_boosted_vcpu) {
+ /* We have not found anyone yet. */
+ first_round = 0;
+ goto again;


Need to guarantee termination.


We do that by setting first_round to 0 :)

We at most walk N+1 VCPUs in a VM with N VCPUs, with
this patch.

--
All rights reversed


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-08 Thread Rik van Riel

On 12/08/2010 03:00 PM, Peter Zijlstra wrote:


Anyway, complete untested and such..


Looks very promising.  I've been making a few changes in the same
direction (except for the fancy CFS bits) and have one way to solve
the one problem you pointed out in your patch.


+void yield_to(struct task_struct *p)
+{

...

+   on_rq = p->se.on_rq;
+   if (on_rq)
+   dequeue_task(p_rq, p, 0);
+
+   ret = 0;
+   if (p->sched_class == curr->sched_class && curr->sched_class->yield_to)
+   curr->sched_class->yield_to(p);
+
+   if (on_rq)
+   enqueue_task(p_rq, p, 0);



diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c886717..8689bcd 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c



+static void yield_to_fair(struct task_struct *p)
+{
+   struct sched_entity *se = &current->se;
+   struct sched_entity *p_se = &p->se;
+   u64 lag0, p_lag0;
+   s64 lag, p_lag;
+
+   lag0 = avg_vruntime(cfs_rq_of(se));
+   p_lag0 = avg_vruntime(cfs_rq_of(p_se));
+
+   lag = se->vruntime - avg_vruntime(cfs_rq);
+   p_lag = p_se->vruntime - avg_vruntime(p_cfs_rq);
+
+   if (p_lag > lag) { /* if P is owed less service */
+   se->vruntime = lag0 + p_lag;
+   p_se->vruntime = p_lag + lag;
+   }
+
+   /*
+* XXX try something smarter here
+*/
+   resched_task(p);
+   resched_task(current);
+}


If we do the dequeue_task and enqueue_task here, we can use
check_preempt_curr in yield_to_fair.

Alternatively, we can do the rescheduling from the main
yield_to function, not from yield_to_fair, by calling
check_preempt_curr on p and current after p has been
enqueued.

--
All rights reversed


Unittests failure for latest upstream kvm.git + qemu-kvm.git + kvm-unit-tests.git

2010-12-08 Thread Lucas Meneghel Rodrigues
Failure on one of the unittests: vmexit timed out after 10 minutes:

12/08 19:18:32 DEBUG|kvm_prepro:0052| Preprocessing VM 'vm1'...
12/08 19:18:32 DEBUG|kvm_prepro:0055| VM object found in environment
12/08 19:18:32 WARNI|kvm_prepro:0089| Could not send monitor command 
'screendump /usr/local/autotest/results/default/kvm.unittest/debug/pre_vm1.ppm' 
(Broken pipe)
12/08 19:18:32 DEBUG|kvm_vm:0767| VM is already down
12/08 19:18:32 DEBUG|kvm_vm:0357| Getting output of 'qemu -help'
12/08 19:18:32 DEBUG|kvm_vm:0668| Running qemu command:
/usr/local/autotest/tests/kvm/qemu -name 'vm1' -monitor 
unix:'/tmp/monitor-humanmonitor1-20101208-191729-GpeV',server,nowait -serial 
unix:'/tmp/serial-20101208-191729-GpeV',server,nowait -m 512 -smp 2 -kernel 
'/usr/local/autotest/tests/kvm/unittests/vmexit.flat' -vnc :0 -chardev 
file,id=testlog,path=/tmp/testlog-20101208-191729-GpeV -device 
testdev,chardev=testlog  -S
12/08 19:18:33 DEBUG|kvm_vm:0735| VM appears to be alive with PID 17597
12/08 19:18:33 INFO |  unittest:0096| Waiting for unittest vmexit to complete, 
timeout 600, output in /tmp/testlog-20101208-191729-GpeV
12/08 19:28:34 DEBUG| kvm_utils:0879| Timeout elapsed
12/08 19:28:34 ERROR|  unittest:0108| Exception happened during vmexit: Timeout 
elapsed (600s)
12/08 19:28:34 INFO |  unittest:0113| Unit test log collected and available 
under /usr/local/autotest/results/default/kvm.unittest/debug/vmexit.log

Looking at vmexit log:

enabling apic
enabling apic
cpuid 3984

Commit logs:

12/08 18:56:34 DEBUG|base_utils:0106| [stdout] HEAD is now at 66fc6be KVM: MMU: 
Avoid dropping accessed bit while removing write access
12/08 18:56:56 INFO | kvm_utils:0407| Commit hash for 
git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git is 
66fc6be8d2b04153b753182610f919faf9c705bc (tag v2.6.32-57426-g66fc6be)
12/08 19:15:45 INFO | kvm_utils:0407| Commit hash for 
git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git is 
cb1983b8809d0e06a97384a40bad1194a32fc814 (tag kvm-88-6330-gcb1983b)
12/08 19:15:48 INFO | kvm_utils:0407| Commit hash for 
git://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git is 
750bbdb58dbc9017b3adc4f4709cd285183a8e30 (no tag found)

Let me know what other info I can capture about this particular failure.

Lucas



[PATCH] [qemu-kvm-next-tree] fix compile error of hw/device-assignment.c

2010-12-08 Thread Wei Yongjun
Fix the following compile error in next tree:
  CC    x86_64-softmmu/device-assignment.o
hw/device-assignment.c: In function ‘assigned_device_pci_cap_init’:
hw/device-assignment.c:1463: error: ‘PCI_PM_CTRL_NO_SOFT_RST’ undeclared (first 
use in this function)
hw/device-assignment.c:1463: error: (Each undeclared identifier is reported 
only once
hw/device-assignment.c:1463: error: for each function it appears in.)

Signed-off-by: Wei Yongjun yj...@cn.fujitsu.com
---
 hw/device-assignment.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 50c6408..8446cd4 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1460,7 +1460,7 @@ static int assigned_device_pci_cap_init(PCIDevice 
*pci_dev)
 /* assign_device will bring the device up to D0, so we don't need
  * to worry about doing that ourselves here. */
 pci_set_word(pci_dev->config + pos + PCI_PM_CTRL,
- PCI_PM_CTRL_NO_SOFT_RST);
+ PCI_PM_CTRL_NO_SOFT_RESET);
 
 pci_set_byte(pci_dev->config + pos + PCI_PM_PPB_EXTENSIONS, 0);
 pci_set_byte(pci_dev->config + pos + PCI_PM_DATA_REGISTER, 0);
-- 
1.7.0.4




[PATCH 1/6] qemu,kvm: Enable NMI support for user space irqchip

2010-12-08 Thread Lai Jiangshan

Make use of the new KVM_NMI IOCTL to send NMIs into the KVM guest if the
user space APIC emulation or some other source raised them.

Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com
---
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 7dfc357..c4ebe28 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1417,6 +1417,14 @@ int kvm_arch_get_registers(CPUState *env)
 
 int kvm_arch_pre_run(CPUState *env, struct kvm_run *run)
 {
+#ifdef KVM_CAP_USER_NMI
+if (env->interrupt_request & CPU_INTERRUPT_NMI) {
+env->interrupt_request &= ~CPU_INTERRUPT_NMI;
+DPRINTF("injected NMI\n");
+kvm_vcpu_ioctl(env, KVM_NMI);
+}
+#endif
+
 /* Try to inject an interrupt if the guest can accept it */
 if (run->ready_for_interrupt_injection &&
 (env->interrupt_request & CPU_INTERRUPT_HARD) &&


[PATCH 2/6] qemu,qmp: convert do_inject_nmi() to QObject

2010-12-08 Thread Lai Jiangshan

Convert do_inject_nmi() to QObject; we need to use it (via libvirt).

It is trivial, as it never fails, produces no output, and returns no data.

Signed-off-by:  Lai Jiangshan la...@cn.fujitsu.com
---
diff --git a/hmp-commands.hx b/hmp-commands.hx
index 7a49b74..2e6b034 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -725,7 +725,8 @@ ETEXI
 .args_type  = "cpu_index:i",
 .params = "cpu",
 .help   = "inject an NMI on the given CPU",
-.mhandler.cmd = do_inject_nmi,
+.user_print = monitor_user_noop,
+.mhandler.cmd_new = do_inject_nmi,
 },
 #endif
 STEXI
diff --git a/monitor.c b/monitor.c
index 729a7cb..1f0d29e 100644
--- a/monitor.c
+++ b/monitor.c
@@ -2120,7 +2120,7 @@ static void do_wav_capture(Monitor *mon, const QDict 
*qdict)
 #endif
 
 #if defined(TARGET_I386)
-static void do_inject_nmi(Monitor *mon, const QDict *qdict)
+static int do_inject_nmi(Monitor *mon, const QDict *qdict, QObject **ret_data)
 {
 CPUState *env;
 int cpu_index = qdict_get_int(qdict, "cpu_index");
@@ -2130,6 +2130,7 @@ static void do_inject_nmi(Monitor *mon, const QDict 
*qdict)
 cpu_interrupt(env, CPU_INTERRUPT_NMI);
 break;
 }
+return 0;
 }
 #endif
 
diff --git a/qmp-commands.hx b/qmp-commands.hx
index a385b66..2506981 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -453,6 +453,22 @@ Example:
 
 EQMP
 
+#if defined(TARGET_I386)
+{
+.name   = "nmi",
+.args_type  = "cpu_index:i",
+.params = "cpu",
+.help   = "inject an NMI on the given CPU",
+.user_print = monitor_user_noop,
+.mhandler.cmd_new = do_inject_nmi,
+},
+#endif
+SQMP
+@item nmi @var{cpu}
+@findex nmi
+Inject an NMI on the given CPU (x86 only).
+EQMP
+
 {
 .name   = "migrate",
 .args_type  = "detach:-d,blk:-b,inc:-i,uri:s",



[PATCH 3/6] qemu,qmp: QError: New QERR_INVALID_KEY

2010-12-08 Thread Lai Jiangshan

Signed-off-by:  Lai Jiangshan la...@cn.fujitsu.com
---
diff --git a/qerror.c b/qerror.c
index ac2cdaf..a7ef758 100644
--- a/qerror.c
+++ b/qerror.c
@@ -117,6 +117,10 @@ static const QErrorStringTable qerror_table[] = {
 .desc  = "Invalid block format '%(name)'",
 },
 {
+.error_fmt = QERR_INVALID_KEY,
+.desc  = "Invalid key: '%(name)...'",
+},
+{
 .error_fmt = QERR_INVALID_PARAMETER,
 .desc  = "Invalid parameter '%(name)'",
 },
diff --git a/qerror.h b/qerror.h
index 943a24b..4fa95ef 100644
--- a/qerror.h
+++ b/qerror.h
@@ -102,6 +102,9 @@ QError *qobject_to_qerror(const QObject *obj);
 #define QERR_INVALID_BLOCK_FORMAT \
 "{ 'class': 'InvalidBlockFormat', 'data': { 'name': %s } }"
 
+#define QERR_INVALID_KEY \
+"{ 'class': 'InvalidKey', 'data': { 'name': %s } }"
+
 #define QERR_INVALID_PARAMETER \
 "{ 'class': 'InvalidParameter', 'data': { 'name': %s } }"
 


[PATCH 4/6] qemu,qmp: QError: New QERR_TOO_MANY_KEYS

2010-12-08 Thread Lai Jiangshan

Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com
---
diff --git a/qerror.c b/qerror.c
index a7ef758..fd66d2a 100644
--- a/qerror.c
+++ b/qerror.c
@@ -197,6 +197,10 @@ static const QErrorStringTable qerror_table[] = {
 .desc  = "Too many open files",
 },
 {
+.error_fmt = QERR_TOO_MANY_KEYS,
+.desc  = "Too many keys",
+},
+{
 .error_fmt = QERR_UNDEFINED_ERROR,
 .desc  = "An undefined error has ocurred",
 },
diff --git a/qerror.h b/qerror.h
index 4fa95ef..7f56f12 100644
--- a/qerror.h
+++ b/qerror.h
@@ -162,6 +162,9 @@ QError *qobject_to_qerror(const QObject *obj);
 #define QERR_TOO_MANY_FILES \
 "{ 'class': 'TooManyFiles', 'data': {} }"
 
+#define QERR_TOO_MANY_KEYS \
+"{ 'class': 'TooManyKeys', 'data': {} }"
+
 #define QERR_UNDEFINED_ERROR \
 "{ 'class': 'UndefinedError', 'data': {} }"
 


[PATCH 5/6] qemu,qmp: QError: New QERR_UNKNOWN_KEY

2010-12-08 Thread Lai Jiangshan

Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com
---
diff --git a/qerror.c b/qerror.c
index fd66d2a..07b4cfc 100644
--- a/qerror.c
+++ b/qerror.c
@@ -205,6 +205,10 @@ static const QErrorStringTable qerror_table[] = {
 .desc  = "An undefined error has ocurred",
 },
 {
+.error_fmt = QERR_UNKNOWN_KEY,
+.desc  = "Unknown key: '%(name)'",
+},
+{
 .error_fmt = QERR_VNC_SERVER_FAILED,
 .desc  = "Could not start VNC server on %(target)",
 },
diff --git a/qerror.h b/qerror.h
index 7f56f12..cf3ab8f 100644
--- a/qerror.h
+++ b/qerror.h
@@ -168,6 +168,9 @@ QError *qobject_to_qerror(const QObject *obj);
 #define QERR_UNDEFINED_ERROR \
 "{ 'class': 'UndefinedError', 'data': {} }"
 
+#define QERR_UNKNOWN_KEY \
+"{ 'class': 'UnknownKey', 'data': { 'name': %s } }"
+
 #define QERR_VNC_SERVER_FAILED \
 "{ 'class': 'VNCServerFailed', 'data': { 'target': %s } }"
 


[PATCH 6/6] qemu,qmp: Convert do_sendkey() to QObject,QError

2010-12-08 Thread Lai Jiangshan

Convert do_sendkey() to QObject/QError; we need to use it (via libvirt).

It is a trivial conversion; the error reports were converted carefully.

Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com
---
diff --git a/hmp-commands.hx b/hmp-commands.hx
index 23024ba..7a49b74 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -436,7 +436,8 @@ ETEXI
 .args_type  = "string:s,hold_time:i?",
 .params = "keys [hold_ms]",
 .help   = "send keys to the VM (e.g. 'sendkey ctrl-alt-f1', default hold time=100 ms)",
-.mhandler.cmd = do_sendkey,
+.user_print = monitor_user_noop,
+.mhandler.cmd_new = do_sendkey,
 },
 
 STEXI
diff --git a/monitor.c b/monitor.c
index ec31eac..729a7cb 100644
--- a/monitor.c
+++ b/monitor.c
@@ -1678,7 +1678,7 @@ static void release_keys(void *opaque)
 }
 }
 
-static void do_sendkey(Monitor *mon, const QDict *qdict)
+static int do_sendkey(Monitor *mon, const QDict *qdict, QObject **ret_data)
 {
 char keyname_buf[16];
 char *separator;
@@ -1700,18 +1700,18 @@ static void do_sendkey(Monitor *mon, const QDict *qdict)
 if (keyname_len > 0) {
 pstrcpy(keyname_buf, sizeof(keyname_buf), string);
 if (keyname_len > sizeof(keyname_buf) - 1) {
-monitor_printf(mon, "invalid key: '%s...'\n", keyname_buf);
-return;
+qerror_report(QERR_INVALID_KEY, keyname_buf);
+return -1;
 }
 if (i == MAX_KEYCODES) {
-monitor_printf(mon, "too many keys\n");
-return;
+qerror_report(QERR_TOO_MANY_KEYS);
+return -1;
 }
 keyname_buf[keyname_len] = 0;
 keycode = get_keycode(keyname_buf);
 if (keycode < 0) {
-monitor_printf(mon, "unknown key: '%s'\n", keyname_buf);
-return;
+qerror_report(QERR_UNKNOWN_KEY, keyname_buf);
+return -1;
 }
 keycodes[i++] = keycode;
 }
@@ -1730,6 +1730,7 @@ static void do_sendkey(Monitor *mon, const QDict *qdict)
 /* delayed key up events */
 qemu_mod_timer(key_timer, qemu_get_clock(vm_clock) +
muldiv64(get_ticks_per_sec(), hold_time, 1000));
+return 0;
 }
 
 static int mouse_button_state;
diff --git a/qmp-commands.hx b/qmp-commands.hx
index e5f157f..a385b66 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -225,6 +225,30 @@ Example:
 - { return: {} }
 
 EQMP
+{
+.name   = "sendkey",
+.args_type  = "string:s,hold_time:i?",
+.params = "keys [hold_ms]",
+.help   = "send keys to the VM (e.g. 'sendkey ctrl-alt-f1', default hold time=100 ms)",
+.user_print = monitor_user_noop,
+.mhandler.cmd_new = do_sendkey,
+},
+
+SQMP
+@item sendkey @var{keys}
+@findex sendkey
+
+Send @var{keys} to the emulator. @var{keys} could be the name of the
+key or @code{#} followed by the raw value in either decimal or hexadecimal
+format. Use @code{-} to press several keys simultaneously. Example:
+@example
+sendkey ctrl-alt-f1
+@end example
+
+This command is useful to send keys that your graphical user interface
+intercepts at low level, such as @code{ctrl-alt-f1} in X Window.
+
+EQMP
 
 {
 .name   = "system_reset",


Re: [PATCH 1/6] qemu,kvm: Enable NMI support for user space irqchip

2010-12-08 Thread Jan Kiszka
Am 09.12.2010 07:58, Lai Jiangshan wrote:
 
 Make use of the new KVM_NMI IOCTL to send NMIs into the KVM guest if the
 user space APIC emulation or some other source raised them.

In that light, the subject is not absolutely correct.

 
 Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com
 ---
 diff --git a/target-i386/kvm.c b/target-i386/kvm.c
 index 7dfc357..c4ebe28 100644
 --- a/target-i386/kvm.c
 +++ b/target-i386/kvm.c
 @@ -1417,6 +1417,14 @@ int kvm_arch_get_registers(CPUState *env)
  
  int kvm_arch_pre_run(CPUState *env, struct kvm_run *run)
  {
 +#ifdef KVM_CAP_USER_NMI
 +if (env->interrupt_request & CPU_INTERRUPT_NMI) {
 +env->interrupt_request &= ~CPU_INTERRUPT_NMI;
 +DPRINTF("injected NMI\n");
 +kvm_vcpu_ioctl(env, KVM_NMI);
 +}
 +#endif
 +
  /* Try to inject an interrupt if the guest can accept it */
  if (run->ready_for_interrupt_injection &&
  (env->interrupt_request & CPU_INTERRUPT_HARD) &&

Actually, we already depend on KVM_CAP_DESTROY_MEMORY_REGION_WORKS, which
was introduced with 2.6.29 as well. I would suggest simply extending the
static configure check and avoiding new #ifdefs in the code.

Thanks for pushing this! Was obviously so trivial that it was forgotten...

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux