Re: [Qemu-devel] [PATCH 06/21] vl: add a tmp pointer so that a handler can delete the entry to which it belongs.
2010/12/8 Isaku Yamahata <yamah...@valinux.co.jp>:
> QLIST_FOREACH_SAFE?

Thanks! So, it should be,

    QLIST_FOREACH_SAFE(e, &vm_change_state_head, entries, ne) {
        e->cb(e->opaque, running, reason);
    }

I'll put it in the next spin.

Yoshi

On Thu, Nov 25, 2010 at 03:06:45PM +0900, Yoshiaki Tamura wrote:
> By copying the next entry to a tmp pointer,
> qemu_del_vm_change_state_handler() can be called in the handler.
>
> Signed-off-by: Yoshiaki Tamura <tamura.yoshi...@lab.ntt.co.jp>
> ---
>  vl.c |    5 +++--
>  1 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/vl.c b/vl.c
> index 805e11f..6b6aec0 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -1073,11 +1073,12 @@ void qemu_del_vm_change_state_handler(VMChangeStateEntry *e)
>
>  void vm_state_notify(int running, int reason)
>  {
> -    VMChangeStateEntry *e;
> +    VMChangeStateEntry *e, *ne;
>
>      trace_vm_state_notify(running, reason);
>
> -    for (e = vm_change_state_head.lh_first; e; e = e->entries.le_next) {
> +    for (e = vm_change_state_head.lh_first; e; e = ne) {
> +        ne = e->entries.le_next;
>          e->cb(e->opaque, running, reason);
>      }
>  }
> --
> 1.7.1.2

--
yamahata
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 07/15] ftrace: fix event alignment: kvm:kvm_hv_hypercall
On 12/07/2010 11:16 PM, David Sharp wrote:
>> I don't understand this. Can you elaborate? What does "32-bit
>> addressable" mean?
>
> The ring buffer gives you space that is a multiple of 4 bytes in
> length, and 32-bit aligned. Therefore it is useless to attempt to
> align the structure beyond 32-bit boundaries, e.g., a 64-bit boundary,
> because it is unpredictable if the memory the structure will be
> written to is at a 64-bit boundary (addr % 8 could be 0 or 4).
>
>> And "predicated on packing the event structures"? Is the structure
>> __attribute__((packed)), or is it not?
>
> It is not packed in Linus' tree, but one of the patches before this
> patch in this patch series adds __attribute__((packed)). This patch
> assumes that the event packing patch has been applied. This patch
> should not be applied if the packing patch is not (hence,
> "predicated").

Thanks for the explanations, it makes sense now.

--
error compiling committee.c: too many arguments to function
Re: [PATCH] KVM: Fix OSXSAVE after migration
On 12/08/2010 04:49 AM, Sheng Yang wrote:
> CPUID's OSXSAVE is a mirror of the CR4.OSXSAVE bit. We need to update
> the CPUID after migration.

Applied, thanks.

> @@ -5585,6 +5585,8 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
>
>  	mmu_reset_needed |= kvm_read_cr4(vcpu) != sregs->cr4;
>  	kvm_x86_ops->set_cr4(vcpu, sregs->cr4);
> +	if (sregs->cr4 & X86_CR4_OSXSAVE)
> +		update_cpuid(vcpu);
>
>  	if (!is_long_mode(vcpu) && is_pae(vcpu)) {
>  		load_pdptrs(vcpu, vcpu->arch.walk_mmu, vcpu->arch.cr3);
>  		mmu_reset_needed = 1;

We really should use kvm_set_crX() here.
Re: [PATCH] ceph/rbd block driver for qemu-kvm (v9)
Am 06.12.2010 20:53, schrieb Christian Brunner:
> This is a new version of the rbd driver. The only difference from v8
> is a check for a recent librados version in configure. If the librados
> version is too old, rbd support will be disabled.
>
> RBD is a block driver for the distributed file system Ceph
> (http://ceph.newdream.net/). This driver uses librados (which is part
> of the Ceph server) for direct access to the Ceph object store and
> runs entirely in userspace. (Yehuda also wrote a driver for the Linux
> kernel that can be used to access rbd volumes as a block device.)
>
> Regards,
> Christian
>
> Signed-off-by: Yehuda Sadeh <yeh...@hq.newdream.net>
> Signed-off-by: Christian Brunner <c...@muc.de>

Thanks. I still haven't managed to actually test it, but I've applied
this to the block branch now based on your testing and Stefan's review
(and the fact that it doesn't break my build any more).

Kevin
Re: [PATCH 2/2] KVM: SVM: Add xsetbv intercept
On 12/07/2010 06:15 PM, Joerg Roedel wrote:
> This patch implements the xsetbv intercept in the AMD part of KVM.
> This makes AVX usable in a safe way for the guest on AVX-capable AMD
> hardware.
>
> The patch is tested by using AVX in the guest and host in parallel
> and checking for data corruption. I also used the KVM xsave
> unit-tests and they all pass.

Applied both, thanks.
Re: [PATCH] kvm/x86: enlarge number of possible CPUID leaves
Avi, Marcelo,

can you please commit this simple fix (turning 40 into 80)?
Without it QEMU crashes reliably on our new CPUs (they return 46
leaves) and causes pain in our testing, because we have to manually
apply this patch on each tree.

Thanks!
Andre.

> Currently the number of CPUID leaves KVM handles is limited to 40.
> My desktop machine (AthlonII) already has 35 and future CPUs will
> expand this well beyond the limit. Extend the limit to 80 to make
> room for future processors.
>
> Signed-off-by: Andre Przywara <andre.przyw...@amd.com>
> ---
>  arch/x86/include/asm/kvm_host.h |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> Hi,
>
> I found that either KVM or QEMU (possibly both) are broken with
> respect to handling more CPUID entries than the limit dictates. KVM
> will return -E2BIG, which is the same error as if the user hasn't
> provided enough space to hold all entries. Now QEMU will continue to
> enlarge the allocated memory until it gets into an out-of-memory
> condition. I have tried to fix this by teaching KVM how to deal with
> a capped number of entries (there are some bugs in the current code),
> but this will limit the number of CPUID entries KVM handles, which
> will surely cut off the lastly appended PV leaves. A proper fix would
> be to make this allocation dynamic. Is this a feasible way or will
> this lead to issues or side-effects?
>
> Regards,
> Andre.
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 54e42c8..3cc80c4 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -79,7 +79,7 @@
>  #define KVM_NUM_MMU_PAGES (1 << KVM_MMU_HASH_SHIFT)
>  #define KVM_MIN_FREE_MMU_PAGES 5
>  #define KVM_REFILL_PAGES 25
> -#define KVM_MAX_CPUID_ENTRIES 40
> +#define KVM_MAX_CPUID_ENTRIES 80
>  #define KVM_NR_FIXED_MTRR_REGION 88
>  #define KVM_NR_VAR_MTRR 8
[PATCH] kvm: cleanup CR8 handling
The handling of CR8 writes in KVM is currently somewhat cumbersome.
This patch makes it look like the other CR register handlers and
fixes a possible issue in VMX, where the RIP would be incremented
despite an injected #GP.

Signed-off-by: Andre Przywara <andre.przyw...@amd.com>
---
 arch/x86/include/asm/kvm_host.h |    2 +-
 arch/x86/kvm/svm.c              |    7 ---
 arch/x86/kvm/vmx.c              |    4 ++--
 arch/x86/kvm/x86.c              |   18 --
 4 files changed, 15 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d968cc5..2b89195 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -659,7 +659,7 @@ int kvm_task_switch(struct kvm_vcpu *vcpu, u16 tss_selector, int reason,
 int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
 int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3);
 int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
-void kvm_set_cr8(struct kvm_vcpu *vcpu, unsigned long cr8);
+int kvm_set_cr8(struct kvm_vcpu *vcpu, unsigned long cr8);
 int kvm_set_dr(struct kvm_vcpu *vcpu, int dr, unsigned long val);
 int kvm_get_dr(struct kvm_vcpu *vcpu, int dr, unsigned long *val);
 unsigned long kvm_get_cr8(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ae943bb..ed5950c 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2610,16 +2610,17 @@ static int cr0_write_interception(struct vcpu_svm *svm)
 static int cr8_write_interception(struct vcpu_svm *svm)
 {
 	struct kvm_run *kvm_run = svm->vcpu.run;
+	int r;
 	u8 cr8_prev = kvm_get_cr8(&svm->vcpu);
 	/* instruction emulation calls kvm_set_cr8() */
-	emulate_instruction(&svm->vcpu, 0, 0, 0);
+	r = emulate_instruction(&svm->vcpu, 0, 0, 0);
 	if (irqchip_in_kernel(svm->vcpu.kvm)) {
 		clr_cr_intercept(svm, INTERCEPT_CR8_WRITE);
-		return 1;
+		return r == EMULATE_DONE;
 	}
 	if (cr8_prev <= kvm_get_cr8(&svm->vcpu))
-		return 1;
+		return r == EMULATE_DONE;
 	kvm_run->exit_reason = KVM_EXIT_SET_TPR;
 	return 0;
 }
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 72cfdb7..e5ef924 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3164,8 +3164,8 @@ static int handle_cr(struct kvm_vcpu *vcpu)
 		case 8: {
 			u8 cr8_prev = kvm_get_cr8(vcpu);
 			u8 cr8 = kvm_register_read(vcpu, reg);
-			kvm_set_cr8(vcpu, cr8);
-			skip_emulated_instruction(vcpu);
+			err = kvm_set_cr8(vcpu, cr8);
+			complete_insn_gp(vcpu, err);
 			if (irqchip_in_kernel(vcpu->kvm))
 				return 1;
 			if (cr8_prev <= cr8)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ed373ba..63b8877 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -667,7 +667,7 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 }
 EXPORT_SYMBOL_GPL(kvm_set_cr3);

-int __kvm_set_cr8(struct kvm_vcpu *vcpu, unsigned long cr8)
+int kvm_set_cr8(struct kvm_vcpu *vcpu, unsigned long cr8)
 {
 	if (cr8 & CR8_RESERVED_BITS)
 		return 1;
@@ -677,12 +677,6 @@ int __kvm_set_cr8(struct kvm_vcpu *vcpu, unsigned long cr8)
 		vcpu->arch.cr8 = cr8;
 	return 0;
 }
-
-void kvm_set_cr8(struct kvm_vcpu *vcpu, unsigned long cr8)
-{
-	if (__kvm_set_cr8(vcpu, cr8))
-		kvm_inject_gp(vcpu, 0);
-}
 EXPORT_SYMBOL_GPL(kvm_set_cr8);

 unsigned long kvm_get_cr8(struct kvm_vcpu *vcpu)
@@ -4104,7 +4098,7 @@ static int emulator_set_cr(int cr, unsigned long val, struct kvm_vcpu *vcpu)
 		res = kvm_set_cr4(vcpu, mk_cr_64(kvm_read_cr4(vcpu), val));
 		break;
 	case 8:
-		res = __kvm_set_cr8(vcpu, val & 0xfUL);
+		res = kvm_set_cr8(vcpu, val);
 		break;
 	default:
 		vcpu_printf(vcpu, "%s: unexpected cr %u\n", __func__, cr);
@@ -5379,8 +5373,12 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 	}

 	/* re-sync apic's tpr */
-	if (!irqchip_in_kernel(vcpu->kvm))
-		kvm_set_cr8(vcpu, kvm_run->cr8);
+	if (!irqchip_in_kernel(vcpu->kvm)) {
+		if (kvm_set_cr8(vcpu, kvm_run->cr8) != 0) {
+			r = -EINVAL;
+			goto out;
+		}
+	}

 	if (vcpu->arch.pio.count || vcpu->mmio_needed) {
 		if (vcpu->mmio_needed) {
--
1.6.4
[PATCHv8 03/16] Keep track of ISA ports ISA device is using in qdev.
Store all io ports used by a device in the ISADevice structure.

Signed-off-by: Gleb Natapov <g...@redhat.com>
---
 hw/cs4231a.c     |    1 +
 hw/fdc.c         |    3 +++
 hw/gus.c         |    4 ++++
 hw/ide/isa.c     |    2 ++
 hw/isa-bus.c     |   25 +++++++++++++++++++++++++
 hw/isa.h         |    4 ++++
 hw/m48t59.c      |    1 +
 hw/mc146818rtc.c |    1 +
 hw/ne2000-isa.c  |    3 +++
 hw/parallel.c    |    5 +++++
 hw/pckbd.c       |    3 +++
 hw/sb16.c        |    4 ++++
 hw/serial.c      |    1 +
 13 files changed, 57 insertions(+), 0 deletions(-)

diff --git a/hw/cs4231a.c b/hw/cs4231a.c
index 4d5ce5c..598f032 100644
--- a/hw/cs4231a.c
+++ b/hw/cs4231a.c
@@ -645,6 +645,7 @@ static int cs4231a_initfn (ISADevice *dev)
     isa_init_irq (dev, &s->pic, s->irq);

     for (i = 0; i < 4; i++) {
+        isa_init_ioport(dev, i);
         register_ioport_write (s->port + i, 1, 1, cs_write, s);
         register_ioport_read (s->port + i, 1, 1, cs_read, s);
     }
diff --git a/hw/fdc.c b/hw/fdc.c
index a467c4b..22fb64a 100644
--- a/hw/fdc.c
+++ b/hw/fdc.c
@@ -1983,6 +1983,9 @@ static int isabus_fdc_init1(ISADevice *dev)
                           fdctrl_write_port, fdctrl);
     register_ioport_write(iobase + 0x07, 1, 1,
                           fdctrl_write_port, fdctrl);
+    isa_init_ioport_range(dev, iobase, 6);
+    isa_init_ioport(dev, iobase + 7);
+
     isa_init_irq(&isa->busdev, &fdctrl->irq, isairq);

     fdctrl->dma_chann = dma_chann;
diff --git a/hw/gus.c b/hw/gus.c
index e9016d8..ff9e7c7 100644
--- a/hw/gus.c
+++ b/hw/gus.c
@@ -264,20 +264,24 @@ static int gus_initfn (ISADevice *dev)

     register_ioport_write (s->port, 1, 1, gus_writeb, s);
     register_ioport_write (s->port, 1, 2, gus_writew, s);
+    isa_init_ioport_range(dev, s->port, 2);

     register_ioport_read ((s->port + 0x100) & 0xf00, 1, 1, gus_readb, s);
     register_ioport_read ((s->port + 0x100) & 0xf00, 1, 2, gus_readw, s);
+    isa_init_ioport_range(dev, (s->port + 0x100) & 0xf00, 2);

     register_ioport_write (s->port + 6, 10, 1, gus_writeb, s);
     register_ioport_write (s->port + 6, 10, 2, gus_writew, s);
     register_ioport_read (s->port + 6, 10, 1, gus_readb, s);
     register_ioport_read (s->port + 6, 10, 2, gus_readw, s);
+    isa_init_ioport_range(dev, s->port + 6, 10);

     register_ioport_write (s->port + 0x100, 8, 1, gus_writeb, s);
     register_ioport_write (s->port + 0x100, 8, 2, gus_writew, s);
     register_ioport_read (s->port + 0x100, 8, 1, gus_readb, s);
     register_ioport_read (s->port + 0x100, 8, 2, gus_readw, s);
+    isa_init_ioport_range(dev, s->port + 0x100, 8);

     DMA_register_channel (s->emu.gusdma, GUS_read_DMA, s);
     s->emu.himemaddr = s->himem;
diff --git a/hw/ide/isa.c b/hw/ide/isa.c
index 9856435..4206afd 100644
--- a/hw/ide/isa.c
+++ b/hw/ide/isa.c
@@ -70,6 +70,8 @@ static int isa_ide_initfn(ISADevice *dev)
     ide_bus_new(&s->bus, &s->dev.qdev);
     ide_init_ioport(&s->bus, s->iobase, s->iobase2);
     isa_init_irq(dev, &s->irq, s->isairq);
+    isa_init_ioport_range(dev, s->iobase, 8);
+    isa_init_ioport(dev, s->iobase2);
     ide_init2(&s->bus, s->irq);
     vmstate_register(&dev->qdev, 0, &vmstate_ide_isa, s);
     return 0;
diff --git a/hw/isa-bus.c b/hw/isa-bus.c
index 26036e0..c0ac7e9 100644
--- a/hw/isa-bus.c
+++ b/hw/isa-bus.c
@@ -92,6 +92,31 @@ void isa_init_irq(ISADevice *dev, qemu_irq *p, int isairq)
     dev->nirqs++;
 }

+static void isa_init_ioport_one(ISADevice *dev, uint16_t ioport)
+{
+    assert(dev->nioports < ARRAY_SIZE(dev->ioports));
+    dev->ioports[dev->nioports++] = ioport;
+}
+
+static int isa_cmp_ports(const void *p1, const void *p2)
+{
+    return *(uint16_t*)p1 - *(uint16_t*)p2;
+}
+
+void isa_init_ioport_range(ISADevice *dev, uint16_t start, uint16_t length)
+{
+    int i;
+
+    for (i = start; i < start + length; i++) {
+        isa_init_ioport_one(dev, i);
+    }
+    qsort(dev->ioports, dev->nioports, sizeof(dev->ioports[0]), isa_cmp_ports);
+}
+
+void isa_init_ioport(ISADevice *dev, uint16_t ioport)
+{
+    isa_init_ioport_range(dev, ioport, 1);
+}
+
 static int isa_qdev_init(DeviceState *qdev, DeviceInfo *base)
 {
     ISADevice *dev = DO_UPCAST(ISADevice, qdev, qdev);
diff --git a/hw/isa.h b/hw/isa.h
index aaf0272..4794b76 100644
--- a/hw/isa.h
+++ b/hw/isa.h
@@ -14,6 +14,8 @@ struct ISADevice {
     DeviceState qdev;
     uint32_t isairq[2];
     int nirqs;
+    uint16_t ioports[32];
+    int nioports;
 };

 typedef int (*isa_qdev_initfn)(ISADevice *dev);
@@ -26,6 +28,8 @@ ISABus *isa_bus_new(DeviceState *dev);
 void isa_bus_irqs(qemu_irq *irqs);
 qemu_irq isa_reserve_irq(int isairq);
 void isa_init_irq(ISADevice *dev, qemu_irq *p, int isairq);
+void isa_init_ioport(ISADevice *dev, uint16_t ioport);
+void isa_init_ioport_range(ISADevice *dev, uint16_t start, uint16_t length);
 void isa_qdev_register(ISADeviceInfo *info);
 ISADevice *isa_create(const char *name);
 ISADevice *isa_create_simple(const char *name);
diff --git a/hw/m48t59.c b/hw/m48t59.c
index c7492a6..75a94e1 100644
---
[PATCHv8 02/16] Introduce new BusInfo callback get_fw_dev_path.
The new get_fw_dev_path callback will be used to build a device path
usable by firmware, in contrast to the qdev qemu-internal device path.

Signed-off-by: Gleb Natapov <g...@redhat.com>
---
 hw/qdev.h |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/hw/qdev.h b/hw/qdev.h
index bc71110..f72fbde 100644
--- a/hw/qdev.h
+++ b/hw/qdev.h
@@ -49,6 +49,12 @@ struct DeviceState {

 typedef void (*bus_dev_printfn)(Monitor *mon, DeviceState *dev, int indent);
 typedef char *(*bus_get_dev_path)(DeviceState *dev);
+/*
+ * This callback is used to create Open Firmware device path in accordance with
+ * OF spec http://forthworks.com/standards/of1275.pdf. Individual bus bindings
+ * can be found here http://playground.sun.com/1275/bindings/.
+ */
+typedef char *(*bus_get_fw_dev_path)(DeviceState *dev);
 typedef int (qbus_resetfn)(BusState *bus);

 struct BusInfo {
@@ -56,6 +62,7 @@ struct BusInfo {
     size_t size;
     bus_dev_printfn print_dev;
     bus_get_dev_path get_dev_path;
+    bus_get_fw_dev_path get_fw_dev_path;
     qbus_resetfn *reset;
     Property *props;
 };
--
1.7.2.3
[PATCHv8 16/16] Pass boot device list to firmware.
Signed-off-by: Gleb Natapov <g...@redhat.com>
---
 hw/fw_cfg.c |   14 ++++++++++++++
 sysemu.h    |    1 +
 vl.c        |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 63 insertions(+), 0 deletions(-)

diff --git a/hw/fw_cfg.c b/hw/fw_cfg.c
index 7b9434f..20a816f 100644
--- a/hw/fw_cfg.c
+++ b/hw/fw_cfg.c
@@ -53,6 +53,7 @@ struct FWCfgState {
     FWCfgFiles *files;
     uint16_t cur_entry;
     uint32_t cur_offset;
+    Notifier machine_ready;
 };

 static void fw_cfg_write(FWCfgState *s, uint8_t value)
@@ -315,6 +316,15 @@ int fw_cfg_add_file(FWCfgState *s, const char *filename, uint8_t *data,
     return 1;
 }

+static void fw_cfg_machine_ready(struct Notifier* n)
+{
+    uint32_t len;
+    FWCfgState *s = container_of(n, FWCfgState, machine_ready);
+    char *bootindex = get_boot_devices_list(&len);
+
+    fw_cfg_add_file(s, "bootorder", (uint8_t*)bootindex, len);
+}
+
 FWCfgState *fw_cfg_init(uint32_t ctl_port, uint32_t data_port,
                         target_phys_addr_t ctl_addr, target_phys_addr_t data_addr)
 {
@@ -343,6 +353,10 @@ FWCfgState *fw_cfg_init(uint32_t ctl_port, uint32_t data_port,
     fw_cfg_add_i16(s, FW_CFG_MAX_CPUS, (uint16_t)max_cpus);
     fw_cfg_add_i16(s, FW_CFG_BOOT_MENU, (uint16_t)boot_menu);
+
+    s->machine_ready.notify = fw_cfg_machine_ready;
+    qemu_add_machine_init_done_notifier(&s->machine_ready);
+
     return s;
 }

diff --git a/sysemu.h b/sysemu.h
index c42f33a..38a20a3 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -196,4 +196,5 @@ void register_devices(void);

 void add_boot_device_path(int32_t bootindex, DeviceState *dev,
                           const char *suffix);
+char *get_boot_devices_list(uint32_t *size);
 #endif
diff --git a/vl.c b/vl.c
index 0d20d26..c4d3fc0 100644
--- a/vl.c
+++ b/vl.c
@@ -736,6 +736,54 @@ void add_boot_device_path(int32_t bootindex, DeviceState *dev,
     QTAILQ_INSERT_TAIL(&fw_boot_order, node, link);
 }

+/*
+ * This function returns a null-terminated string that consists of
+ * newline-separated device paths.
+ *
+ * The memory pointed to by "size" is assigned the total length of the
+ * array in bytes.
+ */
+char *get_boot_devices_list(uint32_t *size)
+{
+    FWBootEntry *i;
+    uint32_t total = 0;
+    char *list = NULL;
+
+    QTAILQ_FOREACH(i, &fw_boot_order, link) {
+        char *devpath = NULL, *bootpath;
+        int len;
+
+        if (i->dev) {
+            devpath = qdev_get_fw_dev_path(i->dev);
+            assert(devpath);
+        }
+
+        if (i->suffix && devpath) {
+            bootpath = qemu_malloc(strlen(devpath) + strlen(i->suffix) + 1);
+            sprintf(bootpath, "%s%s", devpath, i->suffix);
+            qemu_free(devpath);
+        } else if (devpath) {
+            bootpath = devpath;
+        } else {
+            bootpath = strdup(i->suffix);
+            assert(bootpath);
+        }
+
+        if (total) {
+            list[total-1] = '\n';
+        }
+        len = strlen(bootpath) + 1;
+        list = qemu_realloc(list, total + len);
+        memcpy(&list[total], bootpath, len);
+        total += len;
+        qemu_free(bootpath);
+    }
+
+    *size = total;
+
+    return list;
+}
+
 static void numa_add(const char *optarg)
 {
     char option[128];
--
1.7.2.3
[PATCHv8 04/16] Add get_fw_dev_path callback to ISA bus in qdev.
Use device ioports to create a unique device path.

Signed-off-by: Gleb Natapov <g...@redhat.com>
---
 hw/isa-bus.c |   16 ++++++++++++++++
 1 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/hw/isa-bus.c b/hw/isa-bus.c
index c0ac7e9..c423c1b 100644
--- a/hw/isa-bus.c
+++ b/hw/isa-bus.c
@@ -31,11 +31,13 @@ static ISABus *isabus;
 target_phys_addr_t isa_mem_base = 0;

 static void isabus_dev_print(Monitor *mon, DeviceState *dev, int indent);
+static char *isabus_get_fw_dev_path(DeviceState *dev);

 static struct BusInfo isa_bus_info = {
     .name      = "ISA",
     .size      = sizeof(ISABus),
     .print_dev = isabus_dev_print,
+    .get_fw_dev_path = isabus_get_fw_dev_path,
 };

 ISABus *isa_bus_new(DeviceState *dev)
@@ -188,4 +190,18 @@ static void isabus_register_devices(void)
     sysbus_register_withprop(&isabus_bridge_info);
 }

+static char *isabus_get_fw_dev_path(DeviceState *dev)
+{
+    ISADevice *d = (ISADevice*)dev;
+    char path[40];
+    int off;
+
+    off = snprintf(path, sizeof(path), "%s", qdev_fw_name(dev));
+    if (d->nioports) {
+        snprintf(path + off, sizeof(path) - off, "@%04x", d->ioports[0]);
+    }
+
+    return strdup(path);
+}
+
 device_init(isabus_register_devices)
--
1.7.2.3
[PATCHv8 11/16] Add get_fw_dev_path callback to scsi bus.
Signed-off-by: Gleb Natapov <g...@redhat.com>
---
 hw/scsi-bus.c |   23 +++++++++++++++++++++++
 1 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/hw/scsi-bus.c b/hw/scsi-bus.c
index 93f0e9a..7febb86 100644
--- a/hw/scsi-bus.c
+++ b/hw/scsi-bus.c
@@ -5,9 +5,12 @@
 #include "qdev.h"
 #include "blockdev.h"

+static char *scsibus_get_fw_dev_path(DeviceState *dev);
+
 static struct BusInfo scsi_bus_info = {
     .name  = "SCSI",
     .size  = sizeof(SCSIBus),
+    .get_fw_dev_path = scsibus_get_fw_dev_path,
     .props = (Property[]) {
         DEFINE_PROP_UINT32("scsi-id", SCSIDevice, id, -1),
         DEFINE_PROP_END_OF_LIST(),
@@ -518,3 +521,23 @@ void scsi_req_complete(SCSIRequest *req)
                              req->tag, req->status);
 }
+
+static char *scsibus_get_fw_dev_path(DeviceState *dev)
+{
+    SCSIDevice *d = (SCSIDevice*)dev;
+    SCSIBus *bus = scsi_bus_from_device(d);
+    char path[100];
+    int i;
+
+    for (i = 0; i < bus->ndev; i++) {
+        if (bus->devs[i] == d) {
+            break;
+        }
+    }
+
+    assert(i != bus->ndev);
+
+    snprintf(path, sizeof(path), "%s@%x", qdev_fw_name(dev), i);
+
+    return strdup(path);
+}
--
1.7.2.3
[PATCHv8 10/16] Add get_fw_dev_path callback for usb bus.
Signed-off-by: Gleb Natapov <g...@redhat.com>
---
 hw/usb-bus.c |   42 ++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 42 insertions(+), 0 deletions(-)

diff --git a/hw/usb-bus.c b/hw/usb-bus.c
index 256b881..8b4583c 100644
--- a/hw/usb-bus.c
+++ b/hw/usb-bus.c
@@ -5,11 +5,13 @@
 #include "monitor.h"

 static void usb_bus_dev_print(Monitor *mon, DeviceState *qdev, int indent);
+static char *usbbus_get_fw_dev_path(DeviceState *dev);

 static struct BusInfo usb_bus_info = {
     .name      = "USB",
     .size      = sizeof(USBBus),
     .print_dev = usb_bus_dev_print,
+    .get_fw_dev_path = usbbus_get_fw_dev_path,
 };
 static int next_usb_bus = 0;
 static QTAILQ_HEAD(, USBBus) busses = QTAILQ_HEAD_INITIALIZER(busses);
@@ -307,3 +309,43 @@ USBDevice *usbdevice_create(const char *cmdline)
     }
     return usb->usbdevice_init(params);
 }
+
+static int usbbus_get_fw_dev_path_helper(USBDevice *d, USBBus *bus, char *p,
+                                         int len)
+{
+    int l = 0;
+    USBPort *port;
+
+    QTAILQ_FOREACH(port, &bus->used, next) {
+        if (port->dev == d) {
+            if (port->pdev) {
+                l = usbbus_get_fw_dev_path_helper(port->pdev, bus, p, len);
+            }
+            l += snprintf(p + l, len - l, "%s@%x/", qdev_fw_name(&d->qdev),
+                          port->index);
+            break;
+        }
+    }
+
+    return l;
+}
+
+static char *usbbus_get_fw_dev_path(DeviceState *dev)
+{
+    USBDevice *d = (USBDevice*)dev;
+    USBBus *bus = usb_bus_from_device(d);
+    char path[100];
+    int l;
+
+    assert(d->attached != 0);
+
+    l = usbbus_get_fw_dev_path_helper(d, bus, path, sizeof(path));
+
+    if (l == 0) {
+        abort();
+    }
+
+    path[l-1] = '\0';
+
+    return strdup(path);
+}
--
1.7.2.3
[PATCHv8 01/16] Introduce fw_name field to DeviceInfo structure.
Add fw_name to DeviceInfo to use in device path building. In contrast
to "name", fw_name should refer to the functionality the device
provides instead of a particular device model like "name" does.

Signed-off-by: Gleb Natapov <g...@redhat.com>
---
 hw/fdc.c        |    1 +
 hw/ide/isa.c    |    1 +
 hw/ide/qdev.c   |    1 +
 hw/isa-bus.c    |    1 +
 hw/lance.c      |    1 +
 hw/piix_pci.c   |    1 +
 hw/qdev.h       |    6 ++++++
 hw/scsi-disk.c  |    1 +
 hw/usb-hub.c    |    1 +
 hw/usb-net.c    |    1 +
 hw/virtio-pci.c |    1 +
 11 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/hw/fdc.c b/hw/fdc.c
index c159dcb..a467c4b 100644
--- a/hw/fdc.c
+++ b/hw/fdc.c
@@ -2040,6 +2040,7 @@ static const VMStateDescription vmstate_isa_fdc ={
 static ISADeviceInfo isa_fdc_info = {
     .init = isabus_fdc_init1,
     .qdev.name  = "isa-fdc",
+    .qdev.fw_name  = "fdc",
     .qdev.size  = sizeof(FDCtrlISABus),
     .qdev.no_user = 1,
     .qdev.vmsd  = &vmstate_isa_fdc,
diff --git a/hw/ide/isa.c b/hw/ide/isa.c
index 6b57e0d..9856435 100644
--- a/hw/ide/isa.c
+++ b/hw/ide/isa.c
@@ -98,6 +98,7 @@ ISADevice *isa_ide_init(int iobase, int iobase2, int isairq,
 static ISADeviceInfo isa_ide_info = {
     .qdev.name  = "isa-ide",
+    .qdev.fw_name = "ide",
     .qdev.size  = sizeof(ISAIDEState),
     .init       = isa_ide_initfn,
     .qdev.reset = isa_ide_reset,
diff --git a/hw/ide/qdev.c b/hw/ide/qdev.c
index 0808760..6d27b60 100644
--- a/hw/ide/qdev.c
+++ b/hw/ide/qdev.c
@@ -134,6 +134,7 @@ static int ide_drive_initfn(IDEDevice *dev)
 static IDEDeviceInfo ide_drive_info = {
     .qdev.name  = "ide-drive",
+    .qdev.fw_name = "drive",
     .qdev.size  = sizeof(IDEDrive),
     .init       = ide_drive_initfn,
     .qdev.props = (Property[]) {
diff --git a/hw/isa-bus.c b/hw/isa-bus.c
index 4e306de..26036e0 100644
--- a/hw/isa-bus.c
+++ b/hw/isa-bus.c
@@ -153,6 +153,7 @@ static int isabus_bridge_init(SysBusDevice *dev)
 static SysBusDeviceInfo isabus_bridge_info = {
     .init       = isabus_bridge_init,
     .qdev.name  = "isabus-bridge",
+    .qdev.fw_name = "isa",
     .qdev.size  = sizeof(SysBusDevice),
     .qdev.no_user = 1,
 };
diff --git a/hw/lance.c b/hw/lance.c
index dc12144..1a3bb1a 100644
--- a/hw/lance.c
+++ b/hw/lance.c
@@ -141,6 +141,7 @@ static void lance_reset(DeviceState *dev)
 static SysBusDeviceInfo lance_info = {
     .init       = lance_init,
     .qdev.name  = "lance",
+    .qdev.fw_name = "ethernet",
     .qdev.size  = sizeof(SysBusPCNetState),
     .qdev.reset = lance_reset,
     .qdev.vmsd  = &vmstate_lance,
diff --git a/hw/piix_pci.c b/hw/piix_pci.c
index b5589b9..38f9d9e 100644
--- a/hw/piix_pci.c
+++ b/hw/piix_pci.c
@@ -365,6 +365,7 @@ static PCIDeviceInfo i440fx_info[] = {
 static SysBusDeviceInfo i440fx_pcihost_info = {
     .init         = i440fx_pcihost_initfn,
     .qdev.name    = "i440FX-pcihost",
+    .qdev.fw_name = "pci",
     .qdev.size    = sizeof(I440FXState),
     .qdev.no_user = 1,
 };
diff --git a/hw/qdev.h b/hw/qdev.h
index 3fac364..bc71110 100644
--- a/hw/qdev.h
+++ b/hw/qdev.h
@@ -141,6 +141,7 @@ typedef void (*qdev_resetfn)(DeviceState *dev);

 struct DeviceInfo {
     const char *name;
+    const char *fw_name;
     const char *alias;
     const char *desc;
     size_t size;
@@ -306,6 +307,11 @@ void qdev_prop_set_defaults(DeviceState *dev, Property *props);
 void qdev_prop_register_global_list(GlobalProperty *props);
 void qdev_prop_set_globals(DeviceState *dev);

+static inline const char *qdev_fw_name(DeviceState *dev)
+{
+    return dev->info->fw_name ? : dev->info->alias ? : dev->info->name;
+}
+
 /* This is a nasty hack to allow passing a NULL bus to qdev_create.  */
 extern struct BusInfo system_bus_info;
diff --git a/hw/scsi-disk.c b/hw/scsi-disk.c
index 6e49404..851046f 100644
--- a/hw/scsi-disk.c
+++ b/hw/scsi-disk.c
@@ -1230,6 +1230,7 @@ static int scsi_disk_initfn(SCSIDevice *dev)
 static SCSIDeviceInfo scsi_disk_info = {
     .qdev.name    = "scsi-disk",
+    .qdev.fw_name = "disk",
     .qdev.desc    = "virtual scsi disk or cdrom",
     .qdev.size    = sizeof(SCSIDiskState),
     .qdev.reset   = scsi_disk_reset,
diff --git a/hw/usb-hub.c b/hw/usb-hub.c
index 2a1edfc..8e3a96b 100644
--- a/hw/usb-hub.c
+++ b/hw/usb-hub.c
@@ -545,6 +545,7 @@ static int usb_hub_initfn(USBDevice *dev)
 static struct USBDeviceInfo hub_info = {
     .product_desc   = "QEMU USB Hub",
     .qdev.name      = "usb-hub",
+    .qdev.fw_name   = "hub",
     .qdev.size      = sizeof(USBHubState),
     .init           = usb_hub_initfn,
     .handle_packet  = usb_hub_handle_packet,
diff --git a/hw/usb-net.c b/hw/usb-net.c
index 58c672f..f6bed21 100644
--- a/hw/usb-net.c
+++ b/hw/usb-net.c
@@ -1496,6 +1496,7 @@ static USBDevice *usb_net_init(const char *cmdline)
 static struct USBDeviceInfo net_info = {
     .product_desc   = "QEMU USB Network Interface",
     .qdev.name      = "usb-net",
+    .qdev.fw_name   = "network",
     .qdev.size      = sizeof(USBNetState),
     .init           = usb_net_initfn,
     .handle_packet  =
[PATCHv8 07/16] Add get_fw_dev_path callback for system bus.
Print out the mmio or pio address used to access the child device.

Signed-off-by: Gleb Natapov <g...@redhat.com>
---
 hw/pci_host.c |    2 ++
 hw/sysbus.c   |   30 ++++++++++++++++++++++++++++++
 hw/sysbus.h   |    4 ++++
 3 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/hw/pci_host.c b/hw/pci_host.c
index bc5b771..28d45bf 100644
--- a/hw/pci_host.c
+++ b/hw/pci_host.c
@@ -197,6 +197,7 @@ void pci_host_conf_register_ioport(pio_addr_t ioport, PCIHostState *s)
 {
     pci_host_init(s);
     register_ioport_simple(&s->conf_noswap_handler, ioport, 4, 4);
+    sysbus_init_ioports(&s->busdev, ioport, 4);
 }

 int pci_host_data_register_mmio(PCIHostState *s, int swap)
@@ -215,4 +216,5 @@ void pci_host_data_register_ioport(pio_addr_t ioport, PCIHostState *s)
     register_ioport_simple(&s->data_noswap_handler, ioport, 4, 1);
     register_ioport_simple(&s->data_noswap_handler, ioport, 4, 2);
     register_ioport_simple(&s->data_noswap_handler, ioport, 4, 4);
+    sysbus_init_ioports(&s->busdev, ioport, 4);
 }
diff --git a/hw/sysbus.c b/hw/sysbus.c
index d817721..1583bd8 100644
--- a/hw/sysbus.c
+++ b/hw/sysbus.c
@@ -22,11 +22,13 @@
 #include "monitor.h"

 static void sysbus_dev_print(Monitor *mon, DeviceState *dev, int indent);
+static char *sysbus_get_fw_dev_path(DeviceState *dev);

 struct BusInfo system_bus_info = {
     .name       = "System",
     .size       = sizeof(BusState),
     .print_dev  = sysbus_dev_print,
+    .get_fw_dev_path = sysbus_get_fw_dev_path,
 };

 void sysbus_connect_irq(SysBusDevice *dev, int n, qemu_irq irq)
@@ -106,6 +108,16 @@ void sysbus_init_mmio_cb(SysBusDevice *dev, target_phys_addr_t size,
     dev->mmio[n].cb = cb;
 }

+void sysbus_init_ioports(SysBusDevice *dev, pio_addr_t ioport, pio_addr_t size)
+{
+    pio_addr_t i;
+
+    for (i = 0; i < size; i++) {
+        assert(dev->num_pio < QDEV_MAX_PIO);
+        dev->pio[dev->num_pio++] = ioport++;
+    }
+}
+
 static int sysbus_device_init(DeviceState *dev, DeviceInfo *base)
 {
     SysBusDeviceInfo *info = container_of(base, SysBusDeviceInfo, qdev);
@@ -171,3 +183,21 @@ static void sysbus_dev_print(Monitor *mon, DeviceState *dev, int indent)
                        indent, "", s->mmio[i].addr, s->mmio[i].size);
     }
 }
+
+static char *sysbus_get_fw_dev_path(DeviceState *dev)
+{
+    SysBusDevice *s = sysbus_from_qdev(dev);
+    char path[40];
+    int off;
+
+    off = snprintf(path, sizeof(path), "%s", qdev_fw_name(dev));
+
+    if (s->num_mmio) {
+        snprintf(path + off, sizeof(path) - off, "@" TARGET_FMT_plx,
+                 s->mmio[0].addr);
+    } else if (s->num_pio) {
+        snprintf(path + off, sizeof(path) - off, "@i%04x", s->pio[0]);
+    }
+
+    return strdup(path);
+}
diff --git a/hw/sysbus.h b/hw/sysbus.h
index 5980901..e9eb618 100644
--- a/hw/sysbus.h
+++ b/hw/sysbus.h
@@ -6,6 +6,7 @@
 #include "qdev.h"

 #define QDEV_MAX_MMIO 32
+#define QDEV_MAX_PIO 32
 #define QDEV_MAX_IRQ 256

 typedef struct SysBusDevice SysBusDevice;
@@ -23,6 +24,8 @@ struct SysBusDevice {
         mmio_mapfunc cb;
         ram_addr_t iofunc;
     } mmio[QDEV_MAX_MMIO];
+    int num_pio;
+    pio_addr_t pio[QDEV_MAX_PIO];
 };

 typedef int (*sysbus_initfn)(SysBusDevice *dev);
@@ -45,6 +48,7 @@ void sysbus_init_mmio_cb(SysBusDevice *dev, target_phys_addr_t size,
                          mmio_mapfunc cb);
 void sysbus_init_irq(SysBusDevice *dev, qemu_irq *p);
 void sysbus_pass_irq(SysBusDevice *dev, SysBusDevice *target);
+void sysbus_init_ioports(SysBusDevice *dev, pio_addr_t ioport, pio_addr_t size);
 void sysbus_connect_irq(SysBusDevice *dev, int n, qemu_irq irq);
--
1.7.2.3
[PATCHv8 13/16] Change fw_cfg_add_file() to get full file path as a parameter.
Change fw_cfg_add_file() to get full file path as a parameter instead of building one internally. Two reasons for that. First caller may need to know how file is named. Second this moves policy of file naming out from fw_cfg. Platform may want to use more then two levels of directories for instance. Signed-off-by: Gleb Natapov g...@redhat.com --- hw/fw_cfg.c | 16 hw/fw_cfg.h |4 ++-- hw/loader.c | 16 ++-- 3 files changed, 20 insertions(+), 16 deletions(-) diff --git a/hw/fw_cfg.c b/hw/fw_cfg.c index 72866ae..7b9434f 100644 --- a/hw/fw_cfg.c +++ b/hw/fw_cfg.c @@ -277,10 +277,9 @@ int fw_cfg_add_callback(FWCfgState *s, uint16_t key, FWCfgCallback callback, return 1; } -int fw_cfg_add_file(FWCfgState *s, const char *dir, const char *filename, -uint8_t *data, uint32_t len) +int fw_cfg_add_file(FWCfgState *s, const char *filename, uint8_t *data, +uint32_t len) { -const char *basename; int i, index; if (!s-files) { @@ -297,15 +296,8 @@ int fw_cfg_add_file(FWCfgState *s, const char *dir, const char *filename, fw_cfg_add_bytes(s, FW_CFG_FILE_FIRST + index, data, len); -basename = strrchr(filename, '/'); -if (basename) { -basename++; -} else { -basename = filename; -} - -snprintf(s-files-f[index].name, sizeof(s-files-f[index].name), - %s/%s, dir, basename); +pstrcpy(s-files-f[index].name, sizeof(s-files-f[index].name), +filename); for (i = 0; i index; i++) { if (strcmp(s-files-f[index].name, s-files-f[i].name) == 0) { FW_CFG_DPRINTF(%s: skip duplicate: %s\n, __FUNCTION__, diff --git a/hw/fw_cfg.h b/hw/fw_cfg.h index 4d13a4f..856bf91 100644 --- a/hw/fw_cfg.h +++ b/hw/fw_cfg.h @@ -60,8 +60,8 @@ int fw_cfg_add_i32(FWCfgState *s, uint16_t key, uint32_t value); int fw_cfg_add_i64(FWCfgState *s, uint16_t key, uint64_t value); int fw_cfg_add_callback(FWCfgState *s, uint16_t key, FWCfgCallback callback, void *callback_opaque, uint8_t *data, size_t len); -int fw_cfg_add_file(FWCfgState *s, const char *dir, const char *filename, -uint8_t *data, uint32_t len); +int 
fw_cfg_add_file(FWCfgState *s, const char *filename, uint8_t *data, +uint32_t len); FWCfgState *fw_cfg_init(uint32_t ctl_port, uint32_t data_port, target_phys_addr_t crl_addr, target_phys_addr_t data_addr); diff --git a/hw/loader.c b/hw/loader.c index 49ac1fa..1e98326 100644 --- a/hw/loader.c +++ b/hw/loader.c @@ -592,8 +592,20 @@ int rom_add_file(const char *file, const char *fw_dir, } close(fd); rom_insert(rom); -if (rom-fw_file fw_cfg) -fw_cfg_add_file(fw_cfg, rom-fw_dir, rom-fw_file, rom-data, rom-romsize); +if (rom-fw_file fw_cfg) { +const char *basename; +char fw_file_name[56]; + +basename = strrchr(rom-fw_file, '/'); +if (basename) { +basename++; +} else { +basename = rom-fw_file; +} +snprintf(fw_file_name, sizeof(fw_file_name), %s/%s, rom-fw_dir, + basename); +fw_cfg_add_file(fw_cfg, fw_file_name, rom-data, rom-romsize); +} return 0; err: -- 1.7.2.3 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCHv8 12/16] Add bootindex parameter to net/block/fd device
If bootindex is specified on command line a string that describes device in firmware readable way is added into sorted list. Later this list will be passed into firmware to control boot order. Signed-off-by: Gleb Natapov g...@redhat.com --- block_int.h |4 +++- hw/e1000.c |4 hw/eepro100.c |3 +++ hw/fdc.c|8 hw/ide/qdev.c |5 + hw/ne2000.c |3 +++ hw/pcnet.c |4 hw/qdev.c | 32 hw/qdev.h |1 + hw/rtl8139.c|4 hw/scsi-disk.c |1 + hw/usb-net.c|2 ++ hw/virtio-blk.c |2 ++ hw/virtio-net.c |2 ++ net.h |4 +++- sysemu.h|2 ++ vl.c| 40 17 files changed, 119 insertions(+), 2 deletions(-) diff --git a/block_int.h b/block_int.h index 3c3adb5..0a0e47d 100644 --- a/block_int.h +++ b/block_int.h @@ -227,6 +227,7 @@ typedef struct BlockConf { uint16_t logical_block_size; uint16_t min_io_size; uint32_t opt_io_size; +int32_t bootindex; } BlockConf; static inline unsigned int get_physical_block_exp(BlockConf *conf) @@ -249,6 +250,7 @@ static inline unsigned int get_physical_block_exp(BlockConf *conf) DEFINE_PROP_UINT16(physical_block_size, _state, \ _conf.physical_block_size, 512), \ DEFINE_PROP_UINT16(min_io_size, _state, _conf.min_io_size, 0), \ -DEFINE_PROP_UINT32(opt_io_size, _state, _conf.opt_io_size, 0) +DEFINE_PROP_UINT32(opt_io_size, _state, _conf.opt_io_size, 0),\ +DEFINE_PROP_INT32(bootindex, _state, _conf.bootindex, -1) \ #endif /* BLOCK_INT_H */ diff --git a/hw/e1000.c b/hw/e1000.c index 57d08cf..e411b03 100644 --- a/hw/e1000.c +++ b/hw/e1000.c @@ -30,6 +30,7 @@ #include net.h #include net/checksum.h #include loader.h +#include sysemu.h #include e1000_hw.h @@ -1154,6 +1155,9 @@ static int pci_e1000_init(PCIDevice *pci_dev) d-dev.qdev.info-name, d-dev.qdev.id, d); qemu_format_nic_info_str(d-nic-nc, macaddr); + +add_boot_device_path(d-conf.bootindex, pci_dev-qdev, /ethernet-...@0); + return 0; } diff --git a/hw/eepro100.c b/hw/eepro100.c index f8a700a..a464e9b 100644 --- a/hw/eepro100.c +++ b/hw/eepro100.c @@ -46,6 +46,7 @@ #include pci.h #include net.h #include eeprom93xx.h 
+#include sysemu.h #define KiB 1024 @@ -1907,6 +1908,8 @@ static int e100_nic_init(PCIDevice *pci_dev) s-vmstate-name = s-nic-nc.model; vmstate_register(pci_dev-qdev, -1, s-vmstate, s); +add_boot_device_path(s-conf.bootindex, pci_dev-qdev, /ethernet-...@0); + return 0; } diff --git a/hw/fdc.c b/hw/fdc.c index 22fb64a..a7c7c17 100644 --- a/hw/fdc.c +++ b/hw/fdc.c @@ -35,6 +35,7 @@ #include sysbus.h #include qdev-addr.h #include blockdev.h +#include sysemu.h // /* debug Floppy devices */ @@ -523,6 +524,8 @@ typedef struct FDCtrlSysBus { typedef struct FDCtrlISABus { ISADevice busdev; struct FDCtrl state; +int32_t bootindexA; +int32_t bootindexB; } FDCtrlISABus; static uint32_t fdctrl_read (void *opaque, uint32_t reg) @@ -1992,6 +1995,9 @@ static int isabus_fdc_init1(ISADevice *dev) qdev_set_legacy_instance_id(dev-qdev, iobase, 2); ret = fdctrl_init_common(fdctrl); +add_boot_device_path(isa-bootindexA, dev-qdev, /flo...@0); +add_boot_device_path(isa-bootindexB, dev-qdev, /flo...@1); + return ret; } @@ -2051,6 +2057,8 @@ static ISADeviceInfo isa_fdc_info = { .qdev.props = (Property[]) { DEFINE_PROP_DRIVE(driveA, FDCtrlISABus, state.drives[0].bs), DEFINE_PROP_DRIVE(driveB, FDCtrlISABus, state.drives[1].bs), +DEFINE_PROP_INT32(bootindexA, FDCtrlISABus, bootindexA, -1), +DEFINE_PROP_INT32(bootindexB, FDCtrlISABus, bootindexB, -1), DEFINE_PROP_END_OF_LIST(), }, }; diff --git a/hw/ide/qdev.c b/hw/ide/qdev.c index 01a181b..2bb5c27 100644 --- a/hw/ide/qdev.c +++ b/hw/ide/qdev.c @@ -21,6 +21,7 @@ #include qemu-error.h #include hw/ide/internal.h #include blockdev.h +#include sysemu.h /* - */ @@ -143,6 +144,10 @@ static int ide_drive_initfn(IDEDevice *dev) if (!dev-serial) { dev-serial = qemu_strdup(s-drive_serial_str); } + +add_boot_device_path(dev-conf.bootindex, dev-qdev, + dev-unit ? 
/d...@1 : /d...@0); + return 0; } diff --git a/hw/ne2000.c b/hw/ne2000.c index 126e7cf..a030106 100644 --- a/hw/ne2000.c +++ b/hw/ne2000.c @@ -26,6 +26,7 @@ #include net.h #include ne2000.h #include loader.h +#include sysemu.h /* debug NE2000 card */ //#define DEBUG_NE2000 @@ -746,6 +747,8 @@ static int pci_ne2000_init(PCIDevice *pci_dev) } } +add_boot_device_path(s-c.bootindex, pci_dev-qdev, /ethernet-...@0); + return 0; }
[PATCHv8 09/16] Record which USBDevice a USBPort belongs to.
Ports on root hub will have NULL here. This is needed to reconstruct path from device to its root hub to build device path. Signed-off-by: Gleb Natapov g...@redhat.com --- hw/usb-bus.c |3 ++- hw/usb-hub.c |2 +- hw/usb-musb.c |2 +- hw/usb-ohci.c |2 +- hw/usb-uhci.c |2 +- hw/usb.h |3 ++- 6 files changed, 8 insertions(+), 6 deletions(-) diff --git a/hw/usb-bus.c b/hw/usb-bus.c index b692503..256b881 100644 --- a/hw/usb-bus.c +++ b/hw/usb-bus.c @@ -110,11 +110,12 @@ USBDevice *usb_create_simple(USBBus *bus, const char *name) } void usb_register_port(USBBus *bus, USBPort *port, void *opaque, int index, - usb_attachfn attach) + USBDevice *pdev, usb_attachfn attach) { port-opaque = opaque; port-index = index; port-attach = attach; +port-pdev = pdev; QTAILQ_INSERT_TAIL(bus-free, port, next); bus-nfree++; } diff --git a/hw/usb-hub.c b/hw/usb-hub.c index 8e3a96b..8a3f829 100644 --- a/hw/usb-hub.c +++ b/hw/usb-hub.c @@ -535,7 +535,7 @@ static int usb_hub_initfn(USBDevice *dev) for (i = 0; i s-nb_ports; i++) { port = s-ports[i]; usb_register_port(usb_bus_from_device(dev), - port-port, s, i, usb_hub_attach); + port-port, s, i, s-dev, usb_hub_attach); port-wPortStatus = PORT_STAT_POWER; port-wPortChange = 0; } diff --git a/hw/usb-musb.c b/hw/usb-musb.c index 7f15842..9efe7a6 100644 --- a/hw/usb-musb.c +++ b/hw/usb-musb.c @@ -343,7 +343,7 @@ struct MUSBState { } usb_bus_new(s-bus, NULL /* FIXME */); -usb_register_port(s-bus, s-port, s, 0, musb_attach); +usb_register_port(s-bus, s-port, s, 0, NULL, musb_attach); return s; } diff --git a/hw/usb-ohci.c b/hw/usb-ohci.c index 8fb2f83..1247295 100644 --- a/hw/usb-ohci.c +++ b/hw/usb-ohci.c @@ -1705,7 +1705,7 @@ static void usb_ohci_init(OHCIState *ohci, DeviceState *dev, usb_bus_new(ohci-bus, dev); ohci-num_ports = num_ports; for (i = 0; i num_ports; i++) { -usb_register_port(ohci-bus, ohci-rhport[i].port, ohci, i, ohci_attach); +usb_register_port(ohci-bus, ohci-rhport[i].port, ohci, i, NULL, ohci_attach); } ohci-async_td = 0; diff 
--git a/hw/usb-uhci.c b/hw/usb-uhci.c index 1d83400..b9b822f 100644 --- a/hw/usb-uhci.c +++ b/hw/usb-uhci.c @@ -1115,7 +1115,7 @@ static int usb_uhci_common_initfn(UHCIState *s) usb_bus_new(s-bus, s-dev.qdev); for(i = 0; i NB_PORTS; i++) { -usb_register_port(s-bus, s-ports[i].port, s, i, uhci_attach); +usb_register_port(s-bus, s-ports[i].port, s, i, NULL, uhci_attach); } s-frame_timer = qemu_new_timer(vm_clock, uhci_frame_timer, s); s-expire_time = qemu_get_clock(vm_clock) + diff --git a/hw/usb.h b/hw/usb.h index 00d2802..0b32d77 100644 --- a/hw/usb.h +++ b/hw/usb.h @@ -203,6 +203,7 @@ struct USBPort { USBDevice *dev; usb_attachfn attach; void *opaque; +USBDevice *pdev; int index; /* internal port index, may be used with the opaque */ QTAILQ_ENTRY(USBPort) next; }; @@ -312,7 +313,7 @@ USBDevice *usb_create(USBBus *bus, const char *name); USBDevice *usb_create_simple(USBBus *bus, const char *name); USBDevice *usbdevice_create(const char *cmdline); void usb_register_port(USBBus *bus, USBPort *port, void *opaque, int index, - usb_attachfn attach); + USBDevice *pdev, usb_attachfn attach); void usb_unregister_port(USBBus *bus, USBPort *port); int usb_device_attach(USBDevice *dev); int usb_device_detach(USBDevice *dev); -- 1.7.2.3 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCHv8 08/16] Add get_fw_dev_path callback for pci bus.
Signed-off-by: Gleb Natapov g...@redhat.com --- hw/pci.c | 108 - 1 files changed, 85 insertions(+), 23 deletions(-) diff --git a/hw/pci.c b/hw/pci.c index 0c15b13..e7ea907 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -43,6 +43,7 @@ static void pcibus_dev_print(Monitor *mon, DeviceState *dev, int indent); static char *pcibus_get_dev_path(DeviceState *dev); +static char *pcibus_get_fw_dev_path(DeviceState *dev); static int pcibus_reset(BusState *qbus); struct BusInfo pci_bus_info = { @@ -50,6 +51,7 @@ struct BusInfo pci_bus_info = { .size = sizeof(PCIBus), .print_dev = pcibus_dev_print, .get_dev_path = pcibus_get_dev_path, +.get_fw_dev_path = pcibus_get_fw_dev_path, .reset = pcibus_reset, .props = (Property[]) { DEFINE_PROP_PCI_DEVFN(addr, PCIDevice, devfn, -1), @@ -1117,45 +1119,63 @@ void pci_msi_notify(PCIDevice *dev, unsigned int vector) typedef struct { uint16_t class; const char *desc; +const char *fw_name; +uint16_t fw_ign_bits; } pci_class_desc; static const pci_class_desc pci_class_descriptions[] = { -{ 0x0100, SCSI controller}, -{ 0x0101, IDE controller}, -{ 0x0102, Floppy controller}, -{ 0x0103, IPI controller}, -{ 0x0104, RAID controller}, +{ 0x0001, VGA controller, display}, +{ 0x0100, SCSI controller, scsi}, +{ 0x0101, IDE controller, ide}, +{ 0x0102, Floppy controller, fdc}, +{ 0x0103, IPI controller, ipi}, +{ 0x0104, RAID controller, raid}, { 0x0106, SATA controller}, { 0x0107, SAS controller}, { 0x0180, Storage controller}, -{ 0x0200, Ethernet controller}, -{ 0x0201, Token Ring controller}, -{ 0x0202, FDDI controller}, -{ 0x0203, ATM controller}, +{ 0x0200, Ethernet controller, ethernet}, +{ 0x0201, Token Ring controller, token-ring}, +{ 0x0202, FDDI controller, fddi}, +{ 0x0203, ATM controller, atm}, { 0x0280, Network controller}, -{ 0x0300, VGA controller}, +{ 0x0300, VGA controller, display, 0x00ff}, { 0x0301, XGA controller}, { 0x0302, 3D controller}, { 0x0380, Display controller}, -{ 0x0400, Video controller}, -{ 0x0401, Audio controller}, +{ 0x0400, 
Video controller, video}, +{ 0x0401, Audio controller, sound}, { 0x0402, Phone}, { 0x0480, Multimedia controller}, -{ 0x0500, RAM controller}, -{ 0x0501, Flash controller}, +{ 0x0500, RAM controller, memory}, +{ 0x0501, Flash controller, flash}, { 0x0580, Memory controller}, -{ 0x0600, Host bridge}, -{ 0x0601, ISA bridge}, -{ 0x0602, EISA bridge}, -{ 0x0603, MC bridge}, -{ 0x0604, PCI bridge}, -{ 0x0605, PCMCIA bridge}, -{ 0x0606, NUBUS bridge}, -{ 0x0607, CARDBUS bridge}, +{ 0x0600, Host bridge, host}, +{ 0x0601, ISA bridge, isa}, +{ 0x0602, EISA bridge, eisa}, +{ 0x0603, MC bridge, mca}, +{ 0x0604, PCI bridge, pci}, +{ 0x0605, PCMCIA bridge, pcmcia}, +{ 0x0606, NUBUS bridge, nubus}, +{ 0x0607, CARDBUS bridge, cardbus}, { 0x0608, RACEWAY bridge}, { 0x0680, Bridge}, -{ 0x0c03, USB controller}, +{ 0x0700, Serial port, serial}, +{ 0x0701, Parallel port, parallel}, +{ 0x0800, Interrupt controller, interrupt-controller}, +{ 0x0801, DMA controller, dma-controller}, +{ 0x0802, Timer, timer}, +{ 0x0803, RTC, rtc}, +{ 0x0900, Keyboard, keyboard}, +{ 0x0901, Pen, pen}, +{ 0x0902, Mouse, mouse}, +{ 0x0A00, Dock station, dock, 0x00ff}, +{ 0x0B00, i386 cpu, cpu, 0x00ff}, +{ 0x0c00, Fireware contorller, fireware}, +{ 0x0c01, Access bus controller, access-bus}, +{ 0x0c02, SSA controller, ssa}, +{ 0x0c03, USB controller, usb}, +{ 0x0c04, Fibre channel controller, fibre-channel}, { 0, NULL} }; @@ -1960,6 +1980,48 @@ static void pcibus_dev_print(Monitor *mon, DeviceState *dev, int indent) } } +static char *pci_dev_fw_name(DeviceState *dev, char *buf, int len) +{ +PCIDevice *d = (PCIDevice *)dev; +const char *name = NULL; +const pci_class_desc *desc = pci_class_descriptions; +int class = pci_get_word(d-config + PCI_CLASS_DEVICE); + +while (desc-desc + (class ~desc-fw_ign_bits) != + (desc-class ~desc-fw_ign_bits)) { +desc++; +} + +if (desc-desc) { +name = desc-fw_name; +} + +if (name) { +pstrcpy(buf, len, name); +} else { +snprintf(buf, len, pci%04x,%04x, + pci_get_word(d-config + 
PCI_VENDOR_ID), + pci_get_word(d-config + PCI_DEVICE_ID)); +} + +return buf; +} + +static char *pcibus_get_fw_dev_path(DeviceState *dev) +{ +PCIDevice *d = (PCIDevice *)dev; +char path[50], name[33]; +int off; + +off = snprintf(path, sizeof(path), %...@%x, + pci_dev_fw_name(dev, name, sizeof name), + PCI_SLOT(d-devfn)); +if
[PATCHv8 05/16] Store IDE bus id in IDEBus structure for easy access.
Signed-off-by: Gleb Natapov g...@redhat.com --- hw/ide/cmd646.c |4 ++-- hw/ide/internal.h |3 ++- hw/ide/isa.c |2 +- hw/ide/piix.c |4 ++-- hw/ide/qdev.c |3 ++- hw/ide/via.c |4 ++-- 6 files changed, 11 insertions(+), 9 deletions(-) diff --git a/hw/ide/cmd646.c b/hw/ide/cmd646.c index dfe6091..ea5d2dc 100644 --- a/hw/ide/cmd646.c +++ b/hw/ide/cmd646.c @@ -253,8 +253,8 @@ static int pci_cmd646_ide_initfn(PCIDevice *dev) pci_conf[PCI_INTERRUPT_PIN] = 0x01; // interrupt on pin 1 irq = qemu_allocate_irqs(cmd646_set_irq, d, 2); -ide_bus_new(d-bus[0], d-dev.qdev); -ide_bus_new(d-bus[1], d-dev.qdev); +ide_bus_new(d-bus[0], d-dev.qdev, 0); +ide_bus_new(d-bus[1], d-dev.qdev, 1); ide_init2(d-bus[0], irq[0]); ide_init2(d-bus[1], irq[1]); diff --git a/hw/ide/internal.h b/hw/ide/internal.h index 85f4a16..71af66f 100644 --- a/hw/ide/internal.h +++ b/hw/ide/internal.h @@ -449,6 +449,7 @@ struct IDEBus { IDEDevice *slave; BMDMAState *bmdma; IDEState ifs[2]; +int bus_id; uint8_t unit; uint8_t cmd; qemu_irq irq; @@ -567,7 +568,7 @@ void ide_init2_with_non_qdev_drives(IDEBus *bus, DriveInfo *hd0, void ide_init_ioport(IDEBus *bus, int iobase, int iobase2); /* hw/ide/qdev.c */ -void ide_bus_new(IDEBus *idebus, DeviceState *dev); +void ide_bus_new(IDEBus *idebus, DeviceState *dev, int bus_id); IDEDevice *ide_create_drive(IDEBus *bus, int unit, DriveInfo *drive); #endif /* HW_IDE_INTERNAL_H */ diff --git a/hw/ide/isa.c b/hw/ide/isa.c index 4206afd..8c59c5a 100644 --- a/hw/ide/isa.c +++ b/hw/ide/isa.c @@ -67,7 +67,7 @@ static int isa_ide_initfn(ISADevice *dev) { ISAIDEState *s = DO_UPCAST(ISAIDEState, dev, dev); -ide_bus_new(s-bus, s-dev.qdev); +ide_bus_new(s-bus, s-dev.qdev, 0); ide_init_ioport(s-bus, s-iobase, s-iobase2); isa_init_irq(dev, s-irq, s-isairq); isa_init_ioport_range(dev, s-iobase, 8); diff --git a/hw/ide/piix.c b/hw/ide/piix.c index e02b89a..1c0cb0c 100644 --- a/hw/ide/piix.c +++ b/hw/ide/piix.c @@ -125,8 +125,8 @@ static int pci_piix_ide_initfn(PCIIDEState *d) 
vmstate_register(d-dev.qdev, 0, vmstate_ide_pci, d); -ide_bus_new(d-bus[0], d-dev.qdev); -ide_bus_new(d-bus[1], d-dev.qdev); +ide_bus_new(d-bus[0], d-dev.qdev, 0); +ide_bus_new(d-bus[1], d-dev.qdev, 1); ide_init_ioport(d-bus[0], 0x1f0, 0x3f6); ide_init_ioport(d-bus[1], 0x170, 0x376); diff --git a/hw/ide/qdev.c b/hw/ide/qdev.c index 6d27b60..88ff657 100644 --- a/hw/ide/qdev.c +++ b/hw/ide/qdev.c @@ -29,9 +29,10 @@ static struct BusInfo ide_bus_info = { .size = sizeof(IDEBus), }; -void ide_bus_new(IDEBus *idebus, DeviceState *dev) +void ide_bus_new(IDEBus *idebus, DeviceState *dev, int bus_id) { qbus_create_inplace(idebus-qbus, ide_bus_info, dev, NULL); +idebus-bus_id = bus_id; } static int ide_qdev_init(DeviceState *qdev, DeviceInfo *base) diff --git a/hw/ide/via.c b/hw/ide/via.c index 66be0c4..78857e8 100644 --- a/hw/ide/via.c +++ b/hw/ide/via.c @@ -154,8 +154,8 @@ static int vt82c686b_ide_initfn(PCIDevice *dev) vmstate_register(dev-qdev, 0, vmstate_ide_pci, d); -ide_bus_new(d-bus[0], d-dev.qdev); -ide_bus_new(d-bus[1], d-dev.qdev); +ide_bus_new(d-bus[0], d-dev.qdev, 0); +ide_bus_new(d-bus[1], d-dev.qdev, 1); ide_init2(d-bus[0], isa_reserve_irq(14)); ide_init2(d-bus[1], isa_reserve_irq(15)); ide_init_ioport(d-bus[0], 0x1f0, 0x3f6); -- 1.7.2.3 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCHv8 06/16] Add get_fw_dev_path callback to IDE bus.
Signed-off-by: Gleb Natapov <g...@redhat.com>
---
 hw/ide/qdev.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/hw/ide/qdev.c b/hw/ide/qdev.c
index 88ff657..01a181b 100644
--- a/hw/ide/qdev.c
+++ b/hw/ide/qdev.c
@@ -24,9 +24,12 @@

 /* ----------------------------------------------------------------- */

+static char *idebus_get_fw_dev_path(DeviceState *dev);
+
 static struct BusInfo ide_bus_info = {
     .name  = "IDE",
     .size  = sizeof(IDEBus),
+    .get_fw_dev_path = idebus_get_fw_dev_path,
 };

 void ide_bus_new(IDEBus *idebus, DeviceState *dev, int bus_id)
@@ -35,6 +38,16 @@ void ide_bus_new(IDEBus *idebus, DeviceState *dev, int bus_id)
     idebus->bus_id = bus_id;
 }

+static char *idebus_get_fw_dev_path(DeviceState *dev)
+{
+    char path[30];
+
+    snprintf(path, sizeof(path), "%s@%d", qdev_fw_name(dev),
+             ((IDEBus*)dev->parent_bus)->bus_id);
+
+    return strdup(path);
+}
+
 static int ide_qdev_init(DeviceState *qdev, DeviceInfo *base)
 {
     IDEDevice *dev = DO_UPCAST(IDEDevice, qdev, qdev);
--
1.7.2.3
[PATCHv8 14/16] Add bootindex for option roms.
Extend -option-rom command to have additional parameter ,bootindex=. Signed-off-by: Gleb Natapov g...@redhat.com --- hw/loader.c| 16 +++- hw/loader.h|8 hw/multiboot.c |3 ++- hw/ne2000.c|2 +- hw/nseries.c |4 ++-- hw/palm.c |6 +++--- hw/pc.c|7 --- hw/pci.c |2 +- hw/pcnet-pci.c |2 +- qemu-config.c | 17 + sysemu.h |6 +- vl.c | 11 +-- 12 files changed, 60 insertions(+), 24 deletions(-) diff --git a/hw/loader.c b/hw/loader.c index 1e98326..eb198f6 100644 --- a/hw/loader.c +++ b/hw/loader.c @@ -107,7 +107,7 @@ int load_image_targphys(const char *filename, size = get_image_size(filename); if (size 0) -rom_add_file_fixed(filename, addr); +rom_add_file_fixed(filename, addr, -1); return size; } @@ -557,10 +557,11 @@ static void rom_insert(Rom *rom) } int rom_add_file(const char *file, const char *fw_dir, - target_phys_addr_t addr) + target_phys_addr_t addr, int32_t bootindex) { Rom *rom; int rc, fd = -1; +char devpath[100]; rom = qemu_mallocz(sizeof(*rom)); rom-name = qemu_strdup(file); @@ -605,7 +606,12 @@ int rom_add_file(const char *file, const char *fw_dir, snprintf(fw_file_name, sizeof(fw_file_name), %s/%s, rom-fw_dir, basename); fw_cfg_add_file(fw_cfg, fw_file_name, rom-data, rom-romsize); +snprintf(devpath, sizeof(devpath), /r...@%s, fw_file_name); +} else { +snprintf(devpath, sizeof(devpath), /rom@ TARGET_FMT_plx, addr); } + +add_boot_device_path(bootindex, NULL, devpath); return 0; err: @@ -635,12 +641,12 @@ int rom_add_blob(const char *name, const void *blob, size_t len, int rom_add_vga(const char *file) { -return rom_add_file(file, vgaroms, 0); +return rom_add_file(file, vgaroms, 0, -1); } -int rom_add_option(const char *file) +int rom_add_option(const char *file, int32_t bootindex) { -return rom_add_file(file, genroms, 0); +return rom_add_file(file, genroms, 0, bootindex); } static void rom_reset(void *unused) diff --git a/hw/loader.h b/hw/loader.h index 1f82fc5..fc6bdff 100644 --- a/hw/loader.h +++ b/hw/loader.h @@ -22,7 +22,7 @@ void pstrcpy_targphys(const char 
*name, int rom_add_file(const char *file, const char *fw_dir, - target_phys_addr_t addr); + target_phys_addr_t addr, int32_t bootindex); int rom_add_blob(const char *name, const void *blob, size_t len, target_phys_addr_t addr); int rom_load_all(void); @@ -31,8 +31,8 @@ int rom_copy(uint8_t *dest, target_phys_addr_t addr, size_t size); void *rom_ptr(target_phys_addr_t addr); void do_info_roms(Monitor *mon); -#define rom_add_file_fixed(_f, _a) \ -rom_add_file(_f, NULL, _a) +#define rom_add_file_fixed(_f, _a, _i) \ +rom_add_file(_f, NULL, _a, _i) #define rom_add_blob_fixed(_f, _b, _l, _a) \ rom_add_blob(_f, _b, _l, _a) @@ -43,6 +43,6 @@ void do_info_roms(Monitor *mon); #define PC_ROM_SIZE(PC_ROM_MAX - PC_ROM_MIN_VGA) int rom_add_vga(const char *file); -int rom_add_option(const char *file); +int rom_add_option(const char *file, int32_t bootindex); #endif diff --git a/hw/multiboot.c b/hw/multiboot.c index e710bbb..7cc3055 100644 --- a/hw/multiboot.c +++ b/hw/multiboot.c @@ -331,7 +331,8 @@ int load_multiboot(void *fw_cfg, fw_cfg_add_bytes(fw_cfg, FW_CFG_INITRD_DATA, mb_bootinfo_data, sizeof(bootinfo)); -option_rom[nb_option_roms] = multiboot.bin; +option_rom[nb_option_roms].name = multiboot.bin; +option_rom[nb_option_roms].bootindex = 0; nb_option_roms++; return 1; /* yes, we are multiboot */ diff --git a/hw/ne2000.c b/hw/ne2000.c index a030106..5966359 100644 --- a/hw/ne2000.c +++ b/hw/ne2000.c @@ -742,7 +742,7 @@ static int pci_ne2000_init(PCIDevice *pci_dev) if (!pci_dev-qdev.hotplugged) { static int loaded = 0; if (!loaded) { -rom_add_option(pxe-ne2k_pci.bin); +rom_add_option(pxe-ne2k_pci.bin, -1); loaded = 1; } } diff --git a/hw/nseries.c b/hw/nseries.c index 04a028d..2f6f473 100644 --- a/hw/nseries.c +++ b/hw/nseries.c @@ -1326,7 +1326,7 @@ static void n8x0_init(ram_addr_t ram_size, const char *boot_device, qemu_register_reset(n8x0_boot_init, s); } -if (option_rom[0] (boot_device[0] == 'n' || !kernel_filename)) { +if (option_rom[0].name (boot_device[0] == 'n' || 
!kernel_filename)) { int rom_size; uint8_t nolo_tags[0x1]; /* No, wait, better start at the ROM. */ @@ -1341,7 +1341,7 @@ static void n8x0_init(ram_addr_t ram_size, const char *boot_device, * * The code above is for loading the `zImage' file from Nokia * images. */ -rom_size =
[PATCHv8 00/16] boot order specification
Forgot to save a couple of buffers before sending version 7 :(

Anthony, Blue, can this be applied now?

Gleb Natapov (16):
  Introduce fw_name field to DeviceInfo structure.
  Introduce new BusInfo callback get_fw_dev_path.
  Keep track of ISA ports ISA device is using in qdev.
  Add get_fw_dev_path callback to ISA bus in qdev.
  Store IDE bus id in IDEBus structure for easy access.
  Add get_fw_dev_path callback to IDE bus.
  Add get_fw_dev_path callback for system bus.
  Add get_fw_dev_path callback for pci bus.
  Record which USBDevice a USBPort belongs to.
  Add get_fw_dev_path callback for usb bus.
  Add get_fw_dev_path callback to scsi bus.
  Add bootindex parameter to net/block/fd device
  Change fw_cfg_add_file() to get full file path as a parameter.
  Add bootindex for option roms.
  Add notifier that will be called when machine is fully created.
  Pass boot device list to firmware.

 block_int.h       |    4 +-
 hw/cs4231a.c      |    1 +
 hw/e1000.c        |    4 ++
 hw/eepro100.c     |    3 +
 hw/fdc.c          |   12 ++
 hw/fw_cfg.c       |   30 --
 hw/fw_cfg.h       |    4 +-
 hw/gus.c          |    4 ++
 hw/ide/cmd646.c   |    4 +-
 hw/ide/internal.h |    3 +-
 hw/ide/isa.c      |    5 ++-
 hw/ide/piix.c     |    4 +-
 hw/ide/qdev.c     |   22 ++-
 hw/ide/via.c      |    4 +-
 hw/isa-bus.c      |   42 +++
 hw/isa.h          |    4 ++
 hw/lance.c        |    1 +
 hw/loader.c       |   32 ---
 hw/loader.h       |    8 ++--
 hw/m48t59.c       |    1 +
 hw/mc146818rtc.c  |    1 +
 hw/multiboot.c    |    3 +-
 hw/ne2000-isa.c   |    3 +
 hw/ne2000.c       |    5 ++-
 hw/nseries.c      |    4 +-
 hw/palm.c         |    6 +-
 hw/parallel.c     |    5 ++
 hw/pc.c           |    7 ++-
 hw/pci.c          |  110 ---
 hw/pci_host.c     |    2 +
 hw/pckbd.c        |    3 +
 hw/pcnet-pci.c    |    2 +-
 hw/pcnet.c        |    4 ++
 hw/piix_pci.c     |    1 +
 hw/qdev.c         |   32 +++
 hw/qdev.h         |   14 ++
 hw/rtl8139.c      |    4 ++
 hw/sb16.c         |    4 ++
 hw/scsi-bus.c     |   23 +++
 hw/scsi-disk.c    |    2 +
 hw/serial.c       |    1 +
 hw/sysbus.c       |   30 ++
 hw/sysbus.h       |    4 ++
 hw/usb-bus.c      |   45 -
 hw/usb-hub.c      |    3 +-
 hw/usb-musb.c     |    2 +-
 hw/usb-net.c      |    3 +
 hw/usb-ohci.c     |    2 +-
 hw/usb-uhci.c     |    2 +-
 hw/usb.h          |    3 +-
 hw/virtio-blk.c   |    2 +
 hw/virtio-net.c   |    2 +
 hw/virtio-pci.c   |    1 +
 net.h             |    4 +-
 qemu-config.c     |   17
 sysemu.h          |   11 +-
 vl.c              |  114 -
 57 files changed, 593 insertions(+), 80 deletions(-)
--
1.7.2.3
[PATCHv8 15/16] Add notifier that will be called when machine is fully created.
Actions that depend on a fully initialized device model should register
with this notifier chain.

Signed-off-by: Gleb Natapov <g...@redhat.com>
---
 sysemu.h |    2 ++
 vl.c     |   15 +++++++++++++++
 2 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/sysemu.h b/sysemu.h
index 48f8eee..c42f33a 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -60,6 +60,8 @@ void qemu_system_reset(void);
 void qemu_add_exit_notifier(Notifier *notify);
 void qemu_remove_exit_notifier(Notifier *notify);

+void qemu_add_machine_init_done_notifier(Notifier *notify);
+
 void do_savevm(Monitor *mon, const QDict *qdict);
 int load_vmstate(const char *name);
 void do_delvm(Monitor *mon, const QDict *qdict);
diff --git a/vl.c b/vl.c
index 844d6a5..0d20d26 100644
--- a/vl.c
+++ b/vl.c
@@ -254,6 +254,9 @@ static void *boot_set_opaque;
 static NotifierList exit_notifiers =
     NOTIFIER_LIST_INITIALIZER(exit_notifiers);

+static NotifierList machine_init_done_notifiers =
+    NOTIFIER_LIST_INITIALIZER(machine_init_done_notifiers);
+
 int kvm_allowed = 0;
 uint32_t xen_domid;
 enum xen_mode xen_mode = XEN_EMULATE;
@@ -1782,6 +1785,16 @@ static void qemu_run_exit_notifiers(void)
     notifier_list_notify(&exit_notifiers);
 }

+void qemu_add_machine_init_done_notifier(Notifier *notify)
+{
+    notifier_list_add(&machine_init_done_notifiers, notify);
+}
+
+static void qemu_run_machine_init_done_notifiers(void)
+{
+    notifier_list_notify(&machine_init_done_notifiers);
+}
+
 static const QEMUOption *lookup_opt(int argc, char **argv,
                                     const char **poptarg, int *poptind)
 {
@@ -3028,6 +3041,8 @@ int main(int argc, char **argv, char **envp)
     }
     qemu_register_reset((void *)qbus_reset_all, sysbus_get_default());

+    qemu_run_machine_init_done_notifiers();
+
     qemu_system_reset();
     if (loadvm) {
         if (load_vmstate(loadvm) < 0) {
--
1.7.2.3
Re: [PATCH] kvm/x86: enlarge number of possible CPUID leaves
On 12/08/2010 01:13 PM, Andre Przywara wrote:
> Avi, Marcelo, can you please commit this simple fix (turning 40 into
> 80)? Without it QEMU crashes reliably on our new CPUs (they return 46
> leaves) and it causes pain in our testing, because we have to apply
> this patch manually on each tree.

Sorry about that, applied now.

--
error compiling committee.c: too many arguments to function
Re: [PATCH] kvm/x86: enlarge number of possible CPUID leaves
On 12/01/2010 02:55 PM, Andre Przywara wrote:
> Avi Kivity wrote:
>> On 12/01/2010 01:17 PM, Andre Przywara wrote:
>>> Currently the number of CPUID leaves KVM handles is limited to 40.
>>> My desktop machine (AthlonII) already has 35 and future CPUs will
>>> expand this well beyond the limit. Extend the limit to 80 to make
>>> room for future processors.
>>>
>>> Signed-off-by: Andre Przywara <andre.przyw...@amd.com>
>>> ---
>>>  arch/x86/include/asm/kvm_host.h |    2 +-
>>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>>
>>> Hi,
>>> I found that either KVM or QEMU (possibly both) are broken with
>>> respect to handling more CPUID entries than the limit dictates.
>>> KVM will return -E2BIG, which is the same error as if the user
>>> hasn't provided enough space to hold all entries. Now QEMU will
>>> continue to enlarge the allocated memory until it gets into an
>>> out-of-memory condition. I have tried to fix this by teaching KVM
>>> how to deal with a capped number of entries (there are some bugs
>>> in the current code), but this will limit the number of CPUID
>>> entries KVM handles, which will surely cut off the lastly appended
>>> PV leaves. A proper fix would be to make this allocation dynamic.
>>> Is this a feasible way or will this lead to issues or side-effects?
>>
>> Well, there has to be some limit, or userspace can allocate
>> unbounded kernel memory. But this limit should not be a compile-time
>> constant, but a runtime one.
>
> The needed size depends on the host CPU (plus the KVM PV leaves) and
> thus could be determined once for all VMs and vCPUs at module
> load-time. But then we cannot use the static array allocation we
> currently have in struct kvm_vcpu_arch:
>     struct kvm_cpuid_entry2 cpuid_entries[KVM_MAX_CPUID_ENTRIES];
> So we would use a kind-of dynamic allocation bounded by the host
> CPU's need. But for the code it does not make much difference to a
> real dynamic allocation.
>
> Also we could implement kvm_dev_ioctl_get_supported_cpuid without the
> vmalloc, if we don't care about some dozens of copy_to_user() calls
> in this function. Then we would not need this limit in
> GET_SUPPORTED_CPUID at all, but it will strike us again at
> KVM_SET_CPUID[2], where we may not fulfill the promises we gave
> earlier.
>
> Having said this, what about that:
> kvm_dev_ioctl_get_supported_cpuid is invariant to the VM or vCPU (as
> it is used by a system ioctl), so it could be run once at
> initialization, which would limit the ioctl implementation to a plain
> bounded copy. Would you want such a patch (removing the vmalloc and
> maybe even the limit)?

Making GET_SUPPORTED_CPUID data static would be an improvement, yes.

--
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH 06/21] vl: add a tmp pointer so that a handler can delete the entry to which it belongs.
On 12/08/2010 02:11 AM, Yoshiaki Tamura wrote: 2010/12/8 Isaku Yamahata yamah...@valinux.co.jp: QLIST_FOREACH_SAFE? Thanks! So, it should be, QLIST_FOREACH_SAFE(e, &vm_change_state_head, entries, ne) { e->cb(e->opaque, running, reason); } I'll put it in the next spin. This is still brittle, though, because it only allows the current handler to delete itself. A better approach is to borrow the technique we use with file descriptors (using a deleted flag), as that is robust against deletion of any elements in a handler. Regards, Anthony Liguori Yoshi On Thu, Nov 25, 2010 at 03:06:45PM +0900, Yoshiaki Tamura wrote: By copying the next entry to a tmp pointer, qemu_del_vm_change_state_handler() can be called in the handler. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- vl.c | 5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/vl.c b/vl.c index 805e11f..6b6aec0 100644 --- a/vl.c +++ b/vl.c @@ -1073,11 +1073,12 @@ void qemu_del_vm_change_state_handler(VMChangeStateEntry *e) void vm_state_notify(int running, int reason) { - VMChangeStateEntry *e; + VMChangeStateEntry *e, *ne; trace_vm_state_notify(running, reason); - for (e = vm_change_state_head.lh_first; e; e = e->entries.le_next) { + for (e = vm_change_state_head.lh_first; e; e = ne) { + ne = e->entries.le_next; e->cb(e->opaque, running, reason); } } -- 1.7.1.2 -- yamahata
[PATCH] KVM: Fix build error on s390 due to missing tlbs_dirty
Make it available for all archs. Signed-off-by: Avi Kivity a...@redhat.com --- include/linux/kvm_host.h | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index bd0da8f..b5021db 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -256,8 +256,8 @@ struct kvm { struct mmu_notifier mmu_notifier; unsigned long mmu_notifier_seq; long mmu_notifier_count; - long tlbs_dirty; #endif + long tlbs_dirty; }; /* The guest did something we don't support. */ -- 1.7.1
[PATCH 0/28] nVMX: Nested VMX, v7
Hi, This is the seventh iteration of the nested VMX patch set. It fixes a bunch of bugs in the previous iteration, and in particular it now works correctly with EPT in the L0 hypervisor, so ept=0 no longer needs to be specified. This new set of patches should apply to the current KVM trunk (I checked with 66fc6be8d2b04153b753182610f919faf9c705bc). In particular it uses the recently added is_guest_mode() function (common to both nested svm and vmx) instead of inventing our own flag. About nested VMX: - The following 28 patches implement nested VMX support. This feature enables a guest to use the VMX APIs in order to run its own nested guests. In other words, it allows running hypervisors (that use VMX) under KVM. Multiple guest hypervisors can be run concurrently, and each of those can in turn host multiple guests. The theory behind this work, our implementation, and its performance characteristics were presented at OSDI 2010 (the USENIX Symposium on Operating Systems Design and Implementation). Our paper was titled The Turtles Project: Design and Implementation of Nested Virtualization, and was awarded the Jay Lepreau Best Paper award. The paper is available online at: http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf This patch set does not include all the features described in the paper. In particular, this patch set is missing nested EPT (shadow page tables are used in L1, while L0 can use shadow page tables or EPT). It is also missing some features required to run VMware Server as a guest. These missing features will be sent as follow-on patches. Running nested VMX: -- The current patches have a number of requirements, which will be relaxed in follow-on patches: 1. This version was only tested with KVM (64-bit) as a guest hypervisor, and Linux as a nested guest. 2. SMP is supported in the code, but is unfortunately buggy in this version and often leads to hangs.
Use the nosmp option in the L0 (topmost) kernel to avoid this bug (and to reduce your performance ;-)). 3. No modifications are required to user space (qemu). However, qemu does not currently list VMX as a CPU feature in its emulated CPUs (even when they are named after CPUs that do normally have VMX). Therefore, the -cpu host option should be given to qemu, to tell it to support CPU features which exist in the host - and in particular VMX. This requirement can be made unnecessary by a trivial patch to qemu (which I will submit in the future). 4. The nested VMX feature is currently disabled by default. It must be explicitly enabled with the nested=1 option to the kvm-intel module. 5. Nested VPID is not properly supported in this version. You must give the vpid=0 module option to kvm-intel to turn this feature off. Patch statistics: - Documentation/kvm/nested-vmx.txt | 237 ++ arch/x86/include/asm/kvm_host.h | 2 arch/x86/include/asm/vmx.h | 31 arch/x86/kvm/svm.c | 6 arch/x86/kvm/vmx.c | 2416 - arch/x86/kvm/x86.c | 16 arch/x86/kvm/x86.h | 6 7 files changed, 2676 insertions(+), 38 deletions(-) -- Nadav Har'El IBM Haifa Research Lab
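Putting requirements 3-5 above together, a typical setup looks like the following. The module option names (nested=1, vpid=0) and the -cpu host flag come from the text; the module reload sequence, guest image name, and memory size are illustrative and will differ per setup:

```shell
# Reload kvm-intel with nested VMX enabled and nested VPID disabled,
# as this version of the patch set requires
modprobe -r kvm-intel
modprobe kvm-intel nested=1 vpid=0

# Requirement 2: boot the L0 kernel with "nosmp" on its command line
# to avoid the known SMP hang.

# Requirement 3: pass the host's CPU features (including VMX) through
# to the L1 guest hypervisor
qemu-system-x86_64 -enable-kvm -cpu host -m 2048 l1-guest.img
```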
[PATCH 01/28] nVMX: Add nested module option to vmx.c
This patch adds a module option nested to vmx.c, which controls whether the guest can use VMX instructions, i.e., whether we allow nested virtualization. A similar, but separate, option already exists for the SVM module. This option currently defaults to 0, meaning that nested VMX must be explicitly enabled by giving nested=1. When nested VMX matures, the default should probably be changed to enable nested VMX by default - just like nested SVM is currently enabled by default. Signed-off-by: Nadav Har'El n...@il.ibm.com --- arch/x86/kvm/vmx.c | 8 1 file changed, 8 insertions(+) --- .before/arch/x86/kvm/vmx.c 2010-12-08 18:56:48.0 +0200 +++ .after/arch/x86/kvm/vmx.c 2010-12-08 18:56:48.0 +0200 @@ -69,6 +69,14 @@ module_param(emulate_invalid_guest_state static int __read_mostly vmm_exclusive = 1; module_param(vmm_exclusive, bool, S_IRUGO); +/* + * If nested=1, nested virtualization is supported, i.e., the guest may use + * VMX and be a hypervisor for its own guests. If nested=0, the guest may not + * use VMX instructions. + */ +static int nested = 0; +module_param(nested, int, S_IRUGO); + #define KVM_GUEST_CR0_MASK_UNRESTRICTED_GUEST \ (X86_CR0_WP | X86_CR0_NE | X86_CR0_NW | X86_CR0_CD) #define KVM_GUEST_CR0_MASK \
[PATCH 02/28] nVMX: Add VMX and SVM to list of supported cpuid features
If the nested module option is enabled, add the VMX CPU feature to the list of CPU features KVM advertises with the KVM_GET_SUPPORTED_CPUID ioctl. Qemu uses this ioctl, and intersects KVM's list with its own list of desired cpu features (depending on the -cpu option given to qemu) to determine the final list of features presented to the guest. Signed-off-by: Nadav Har'El n...@il.ibm.com --- arch/x86/kvm/vmx.c | 2 ++ 1 file changed, 2 insertions(+) --- .before/arch/x86/kvm/vmx.c 2010-12-08 18:56:48.0 +0200 +++ .after/arch/x86/kvm/vmx.c 2010-12-08 18:56:48.0 +0200 @@ -4284,6 +4284,8 @@ static void vmx_cpuid_update(struct kvm_ static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry) { + if (func == 1 && nested) + entry->ecx |= bit(X86_FEATURE_VMX); } static struct kvm_x86_ops vmx_x86_ops = {
[PATCH 03/28] nVMX: Implement VMXON and VMXOFF
This patch allows a guest to use the VMXON and VMXOFF instructions, and emulates them accordingly. Basically this amounts to checking some prerequisites, and then remembering whether the guest has enabled or disabled VMX operation. Signed-off-by: Nadav Har'El n...@il.ibm.com --- arch/x86/kvm/vmx.c | 102 ++- 1 file changed, 100 insertions(+), 2 deletions(-) --- .before/arch/x86/kvm/vmx.c 2010-12-08 18:56:48.0 +0200 +++ .after/arch/x86/kvm/vmx.c 2010-12-08 18:56:48.0 +0200 @@ -127,6 +127,17 @@ struct shared_msr_entry { u64 mask; }; +/* + * The nested_vmx structure is part of vcpu_vmx, and holds information we need + * for correct emulation of VMX (i.e., nested VMX) on this vcpu. For example, + * the current VMCS set by L1, a list of the VMCSs used to run the active + * L2 guests on the hardware, and more. + */ +struct nested_vmx { + /* Has the level1 guest done vmxon? */ + bool vmxon; +}; + struct vcpu_vmx { struct kvm_vcpu vcpu; struct list_head local_vcpus_link; @@ -174,6 +185,9 @@ struct vcpu_vmx { u32 exit_reason; bool rdtscp_enabled; + + /* Support for a guest hypervisor (nested VMX) */ + struct nested_vmx nested; }; static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu) @@ -3653,6 +3667,90 @@ static int handle_invalid_op(struct kvm_ } /* + * Emulate the VMXON instruction. + * Currently, we just remember that VMX is active, and do not save or even + * inspect the argument to VMXON (the so-called VMXON pointer) because we + * do not currently need to store anything in that guest-allocated memory + * region. Consequently, VMCLEAR and VMPTRLD also do not verify that their + * argument is different from the VMXON pointer (which the spec says they do). + */ +static int handle_vmon(struct kvm_vcpu *vcpu) +{ + struct kvm_segment cs; + struct vcpu_vmx *vmx = to_vmx(vcpu); + + /* The Intel VMX Instruction Reference lists a bunch of bits that +* are prerequisite to running VMXON, most notably CR4.VMXE must be +* set to 1. Otherwise, we should fail with #UD.
We test these now: +*/ + if (!nested || + !kvm_read_cr4_bits(vcpu, X86_CR4_VMXE) || + !kvm_read_cr0_bits(vcpu, X86_CR0_PE) || + (vmx_get_rflags(vcpu) & X86_EFLAGS_VM)) { + kvm_queue_exception(vcpu, UD_VECTOR); + return 1; + } + + vmx_get_segment(vcpu, &cs, VCPU_SREG_CS); + if (is_long_mode(vcpu) && !cs.l) { + kvm_queue_exception(vcpu, UD_VECTOR); + return 1; + } + + if (vmx_get_cpl(vcpu)) { + kvm_inject_gp(vcpu, 0); + return 1; + } + + vmx->nested.vmxon = true; + + skip_emulated_instruction(vcpu); + return 1; +} + +/* + * Intel's VMX Instruction Reference specifies a common set of prerequisites + * for running VMX instructions (except VMXON, whose prerequisites are + * slightly different). It also specifies what exception to inject otherwise. + */ +static int nested_vmx_check_permission(struct kvm_vcpu *vcpu) +{ + struct kvm_segment cs; + struct vcpu_vmx *vmx = to_vmx(vcpu); + + if (!vmx->nested.vmxon) { + kvm_queue_exception(vcpu, UD_VECTOR); + return 0; + } + + vmx_get_segment(vcpu, &cs, VCPU_SREG_CS); + if ((vmx_get_rflags(vcpu) & X86_EFLAGS_VM) || + (is_long_mode(vcpu) && !cs.l)) { + kvm_queue_exception(vcpu, UD_VECTOR); + return 0; + } + + if (vmx_get_cpl(vcpu)) { + kvm_inject_gp(vcpu, 0); + return 0; + } + + return 1; +} + +/* Emulate the VMXOFF instruction */ +static int handle_vmoff(struct kvm_vcpu *vcpu) +{ + if (!nested_vmx_check_permission(vcpu)) + return 1; + + to_vmx(vcpu)->nested.vmxon = false; + + skip_emulated_instruction(vcpu); + return 1; +} + +/* * The exit handlers return 1 if the exit was handled fully and guest execution * may resume. Otherwise they set the kvm_run parameter to indicate what needs * to be done to userspace and return 0.
@@ -3680,8 +3778,8 @@ static int (*kvm_vmx_exit_handlers[])(st [EXIT_REASON_VMREAD] = handle_vmx_insn, [EXIT_REASON_VMRESUME] = handle_vmx_insn, [EXIT_REASON_VMWRITE] = handle_vmx_insn, - [EXIT_REASON_VMOFF] = handle_vmx_insn, - [EXIT_REASON_VMON] = handle_vmx_insn, + [EXIT_REASON_VMOFF] = handle_vmoff, + [EXIT_REASON_VMON] = handle_vmon, [EXIT_REASON_TPR_BELOW_THRESHOLD] = handle_tpr_below_threshold, [EXIT_REASON_APIC_ACCESS] = handle_apic_access, [EXIT_REASON_WBINVD] = handle_wbinvd,
[PATCH 04/28] nVMX: Allow setting the VMXE bit in CR4
This patch allows the guest to enable the VMXE bit in CR4, which is a prerequisite to running VMXON. Whether to allow setting the VMXE bit now depends on the architecture (svm or vmx), so its checking has moved to kvm_x86_ops-set_cr4(). This function now returns an int: If kvm_x86_ops-set_cr4() returns 1, __kvm_set_cr4() will also return 1, and this will cause kvm_set_cr4() will throw a #GP. Turning on the VMXE bit is allowed only when the nested module option is on, and turning it off is forbidden after a vmxon. Signed-off-by: Nadav Har'El n...@il.ibm.com --- arch/x86/include/asm/kvm_host.h |2 +- arch/x86/kvm/svm.c |6 +- arch/x86/kvm/vmx.c | 13 +++-- arch/x86/kvm/x86.c |4 +--- 4 files changed, 18 insertions(+), 7 deletions(-) --- .before/arch/x86/include/asm/kvm_host.h 2010-12-08 18:56:49.0 +0200 +++ .after/arch/x86/include/asm/kvm_host.h 2010-12-08 18:56:49.0 +0200 @@ -535,7 +535,7 @@ struct kvm_x86_ops { void (*decache_cr4_guest_bits)(struct kvm_vcpu *vcpu); void (*set_cr0)(struct kvm_vcpu *vcpu, unsigned long cr0); void (*set_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3); - void (*set_cr4)(struct kvm_vcpu *vcpu, unsigned long cr4); + int (*set_cr4)(struct kvm_vcpu *vcpu, unsigned long cr4); void (*set_efer)(struct kvm_vcpu *vcpu, u64 efer); void (*get_idt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt); void (*set_idt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt); --- .before/arch/x86/kvm/svm.c 2010-12-08 18:56:49.0 +0200 +++ .after/arch/x86/kvm/svm.c 2010-12-08 18:56:49.0 +0200 @@ -1370,11 +1370,14 @@ static void svm_set_cr0(struct kvm_vcpu update_cr0_intercept(svm); } -static void svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) +static int svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) { unsigned long host_cr4_mce = read_cr4() X86_CR4_MCE; unsigned long old_cr4 = to_svm(vcpu)-vmcb-save.cr4; + if (cr4 X86_CR4_VMXE) + return 1; + if (npt_enabled ((old_cr4 ^ cr4) X86_CR4_PGE)) force_new_asid(vcpu); @@ -1383,6 +1386,7 @@ static void svm_set_cr4(struct 
kvm_vcpu cr4 |= X86_CR4_PAE; cr4 |= host_cr4_mce; to_svm(vcpu)-vmcb-save.cr4 = cr4; + return 0; } static void svm_set_segment(struct kvm_vcpu *vcpu, --- .before/arch/x86/kvm/x86.c 2010-12-08 18:56:49.0 +0200 +++ .after/arch/x86/kvm/x86.c 2010-12-08 18:56:49.0 +0200 @@ -610,11 +610,9 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, u !load_pdptrs(vcpu, vcpu-arch.walk_mmu, vcpu-arch.cr3)) return 1; - if (cr4 X86_CR4_VMXE) + if (kvm_x86_ops-set_cr4(vcpu, cr4)) return 1; - kvm_x86_ops-set_cr4(vcpu, cr4); - if ((cr4 ^ old_cr4) pdptr_bits) kvm_mmu_reset_context(vcpu); --- .before/arch/x86/kvm/vmx.c 2010-12-08 18:56:49.0 +0200 +++ .after/arch/x86/kvm/vmx.c 2010-12-08 18:56:49.0 +0200 @@ -1876,7 +1876,7 @@ static void ept_save_pdptrs(struct kvm_v (unsigned long *)vcpu-arch.regs_dirty); } -static void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4); +static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4); static void ept_update_paging_mode_cr0(unsigned long *hw_cr0, unsigned long cr0, @@ -1971,11 +1971,19 @@ static void vmx_set_cr3(struct kvm_vcpu vmcs_writel(GUEST_CR3, guest_cr3); } -static void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) +static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) { unsigned long hw_cr4 = cr4 | (to_vmx(vcpu)-rmode.vm86_active ? KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON); + if (cr4 X86_CR4_VMXE) { + if (!nested) + return 1; + } else { + if (nested to_vmx(vcpu)-nested.vmxon) + return 1; + } + vcpu-arch.cr4 = cr4; if (enable_ept) { if (!is_paging(vcpu)) { @@ -1988,6 +1996,7 @@ static void vmx_set_cr4(struct kvm_vcpu vmcs_writel(CR4_READ_SHADOW, cr4); vmcs_writel(GUEST_CR4, hw_cr4); + return 0; } static u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 05/28] nVMX: Introduce vmcs12: a VMCS structure for L1
An implementation of VMX needs to define a VMCS structure. This structure is kept in guest memory, but is opaque to the guest (who can only read or write it with VMX instructions). This patch starts to define the VMCS structure which our nested VMX implementation will present to L1. We call it vmcs12, as it is the VMCS that L1 keeps for its L2 guests. We will add more content to this structure in later patches. This patch also adds the notion (as required by the VMX spec) of L1's current VMCS, and finally includes utility functions for mapping the guest-allocated VMCSs in host memory. Signed-off-by: Nadav Har'El n...@il.ibm.com --- arch/x86/kvm/vmx.c | 64 +++ 1 file changed, 64 insertions(+) --- .before/arch/x86/kvm/vmx.c 2010-12-08 18:56:49.0 +0200 +++ .after/arch/x86/kvm/vmx.c 2010-12-08 18:56:49.0 +0200 @@ -128,6 +128,34 @@ struct shared_msr_entry { }; /* + * struct vmcs12 describes the state that our guest hypervisor (L1) keeps for a + * single nested guest (L2), hence the name vmcs12. Any VMX implementation has + * a VMCS structure, and vmcs12 is our emulated VMX's VMCS. This structure is + * stored in guest memory specified by VMPTRLD, but is opaque to the guest, + * which must access it using VMREAD/VMWRITE/VMCLEAR instructions. More + * than one of these structures may exist, if L1 runs multiple L2 guests. + * nested_vmx_run() will use the data here to build a vmcs02: a VMCS for the + * underlying hardware which will be used to run L2. + * This structure is packed in order to preserve the binary content after live + * migration. If there are changes in the content or layout, VMCS12_REVISION + * must be changed. + */ +struct __packed vmcs12 { + /* According to the Intel spec, a VMCS region must start with the +* following two fields. Then follow implementation-specific data. +*/ + u32 revision_id; + u32 abort; +}; + +/* + * VMCS12_REVISION is an arbitrary id that should be changed if the content or + * layout of struct vmcs12 is changed. 
MSR_IA32_VMX_BASIC returns this id, and + * VMPTRLD verifies that the VMCS region that L1 is loading contains this id. + */ +#define VMCS12_REVISION 0x11e57ed0 + +/* * The nested_vmx structure is part of vcpu_vmx, and holds information we need * for correct emulation of VMX (i.e., nested VMX) on this vcpu. For example, * the current VMCS set by L1, a list of the VMCSs used to run the active @@ -136,6 +164,12 @@ struct shared_msr_entry { struct nested_vmx { /* Has the level1 guest done vmxon? */ bool vmxon; + + /* The guest-physical address of the current VMCS L1 keeps for L2 */ + gpa_t current_vmptr; + /* The host-usable pointer to the above */ + struct page *current_vmcs12_page; + struct vmcs12 *current_vmcs12; }; struct vcpu_vmx { @@ -195,6 +229,28 @@ static inline struct vcpu_vmx *to_vmx(st return container_of(vcpu, struct vcpu_vmx, vcpu); } +static struct page *nested_get_page(struct kvm_vcpu *vcpu, gpa_t addr) +{ + struct page *page = gfn_to_page(vcpu-kvm, addr PAGE_SHIFT); + if (is_error_page(page)) { + kvm_release_page_clean(page); + return NULL; + } + return page; +} + +static void nested_release_page(struct page *page) +{ + kunmap(page); + kvm_release_page_dirty(page); +} + +static void nested_release_page_clean(struct page *page) +{ + kunmap(page); + kvm_release_page_clean(page); +} + static int init_rmode(struct kvm *kvm); static u64 construct_eptp(unsigned long root_hpa); static void kvm_cpu_vmxon(u64 addr); @@ -3755,6 +3811,9 @@ static int handle_vmoff(struct kvm_vcpu to_vmx(vcpu)-nested.vmxon = false; + if (to_vmx(vcpu)-nested.current_vmptr != -1ull) + nested_release_page(to_vmx(vcpu)-nested.current_vmcs12_page); + skip_emulated_instruction(vcpu); return 1; } @@ -4183,6 +4242,8 @@ static void vmx_free_vcpu(struct kvm_vcp struct vcpu_vmx *vmx = to_vmx(vcpu); free_vpid(vmx); + if (vmx-nested.vmxon to_vmx(vcpu)-nested.current_vmptr != -1ull) + nested_release_page(to_vmx(vcpu)-nested.current_vmcs12_page); vmx_free_vmcs(vcpu); kfree(vmx-guest_msrs); 
kvm_vcpu_uninit(vcpu); @@ -4249,6 +4310,9 @@ static struct kvm_vcpu *vmx_create_vcpu( goto free_vmcs; } + vmx->nested.current_vmptr = -1ull; + vmx->nested.current_vmcs12 = NULL; + return &vmx->vcpu; free_vmcs:
[PATCH 06/28] nVMX: Implement reading and writing of VMX MSRs
When the guest can use VMX instructions (when the nested module option is on), it should also be able to read and write VMX MSRs, e.g., to query about VMX capabilities. This patch adds this support. Signed-off-by: Nadav Har'El n...@il.ibm.com --- arch/x86/kvm/vmx.c | 117 +++ arch/x86/kvm/x86.c |6 +- 2 files changed, 122 insertions(+), 1 deletion(-) --- .before/arch/x86/kvm/x86.c 2010-12-08 18:56:49.0 +0200 +++ .after/arch/x86/kvm/x86.c 2010-12-08 18:56:49.0 +0200 @@ -796,7 +796,11 @@ static u32 msrs_to_save[] = { #ifdef CONFIG_X86_64 MSR_CSTAR, MSR_KERNEL_GS_BASE, MSR_SYSCALL_MASK, MSR_LSTAR, #endif - MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA + MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA, + MSR_IA32_FEATURE_CONTROL, MSR_IA32_VMX_BASIC, + MSR_IA32_VMX_PINBASED_CTLS, MSR_IA32_VMX_PROCBASED_CTLS, + MSR_IA32_VMX_EXIT_CTLS, MSR_IA32_VMX_ENTRY_CTLS, + MSR_IA32_VMX_PROCBASED_CTLS2, MSR_IA32_VMX_EPT_VPID_CAP, }; static unsigned num_msrs_to_save; --- .before/arch/x86/kvm/vmx.c 2010-12-08 18:56:49.0 +0200 +++ .after/arch/x86/kvm/vmx.c 2010-12-08 18:56:49.0 +0200 @@ -1211,6 +1211,119 @@ static void vmx_adjust_tsc_offset(struct } /* + * If we allow our guest to use VMX instructions (i.e., nested VMX), we should + * also let it use VMX-specific MSRs. + * vmx_get_vmx_msr() and vmx_set_vmx_msr() return 0 when we handled a + * VMX-specific MSR, or 1 when we haven't (and the caller should handled it + * like all other MSRs). + */ +static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata) +{ + u64 vmx_msr = 0; + u32 vmx_msr_high, vmx_msr_low; + + switch (msr_index) { + case MSR_IA32_FEATURE_CONTROL: + *pdata = 0; + break; + case MSR_IA32_VMX_BASIC: + /* +* This MSR reports some information about VMX support of the +* processor. We should return information about the VMX we +* emulate for the guest, and the VMCS structure we give it - +* not about the VMX support of the underlying hardware. 
+* However, some capabilities of the underlying hardware are +* used directly by our emulation (e.g., the physical address +* width), so these are copied from what the hardware reports. +*/ + *pdata = VMCS12_REVISION | (((u64)sizeof(struct vmcs12)) 32); + rdmsrl(MSR_IA32_VMX_BASIC, vmx_msr); +#define VMX_BASIC_64 0x0001LLU +#define VMX_BASIC_MEM_TYPE 0x003cLLU +#define VMX_BASIC_INOUT0x0040LLU + *pdata |= vmx_msr + (VMX_BASIC_64 | VMX_BASIC_MEM_TYPE | VMX_BASIC_INOUT); + break; +#define CORE2_PINBASED_CTLS_MUST_BE_ONE0x0016 +#define MSR_IA32_VMX_TRUE_PINBASED_CTLS0x48d + case MSR_IA32_VMX_TRUE_PINBASED_CTLS: + case MSR_IA32_VMX_PINBASED_CTLS: + vmx_msr_low = CORE2_PINBASED_CTLS_MUST_BE_ONE; + vmx_msr_high = CORE2_PINBASED_CTLS_MUST_BE_ONE | + PIN_BASED_EXT_INTR_MASK | + PIN_BASED_NMI_EXITING | + PIN_BASED_VIRTUAL_NMIS; + *pdata = vmx_msr_low | ((u64)vmx_msr_high 32); + break; + case MSR_IA32_VMX_PROCBASED_CTLS: + /* This MSR determines which vm-execution controls the L1 +* hypervisor may ask, or may not ask, to enable. Normally we +* can only allow enabling features which the hardware can +* support, but we limit ourselves to allowing only known +* features that were tested nested. We allow disabling any +* feature (even if the hardware can't disable it). +*/ + rdmsr(MSR_IA32_VMX_PROCBASED_CTLS, vmx_msr_low, vmx_msr_high); + + vmx_msr_low = 0; /* allow disabling any feature */ + vmx_msr_high = /* do not expose new untested features */ + CPU_BASED_HLT_EXITING | CPU_BASED_CR3_LOAD_EXITING | + CPU_BASED_CR3_STORE_EXITING | CPU_BASED_USE_IO_BITMAPS | + CPU_BASED_MOV_DR_EXITING | CPU_BASED_USE_TSC_OFFSETING | + CPU_BASED_MWAIT_EXITING | CPU_BASED_MONITOR_EXITING | + CPU_BASED_INVLPG_EXITING | CPU_BASED_TPR_SHADOW | + CPU_BASED_USE_MSR_BITMAPS | +#ifdef CONFIG_X86_64 + CPU_BASED_CR8_LOAD_EXITING | + CPU_BASED_CR8_STORE_EXITING | +#endif + CPU_BASED_ACTIVATE_SECONDARY_CONTROLS; + *pdata = vmx_msr_low | ((u64)vmx_msr_high 32); + break; + case MSR_IA32_VMX_EXIT_CTLS: +
[PATCH 07/28] nVMX: Decoding memory operands of VMX instructions
This patch includes a utility function for decoding pointer operands of VMX instructions issued by L1 (a guest hypervisor) Signed-off-by: Nadav Har'El n...@il.ibm.com --- arch/x86/kvm/vmx.c | 59 +++ arch/x86/kvm/x86.c |3 +- arch/x86/kvm/x86.h |3 ++ 3 files changed, 64 insertions(+), 1 deletion(-) --- .before/arch/x86/kvm/x86.c 2010-12-08 18:56:49.0 +0200 +++ .after/arch/x86/kvm/x86.c 2010-12-08 18:56:49.0 +0200 @@ -3688,7 +3688,7 @@ static int kvm_fetch_guest_virt(gva_t ad exception); } -static int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes, +int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes, struct kvm_vcpu *vcpu, struct x86_exception *exception) { @@ -3696,6 +3696,7 @@ static int kvm_read_guest_virt(gva_t add return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, access, exception); } +EXPORT_SYMBOL_GPL(kvm_read_guest_virt); static int kvm_read_guest_virt_system(gva_t addr, void *val, unsigned int bytes, struct kvm_vcpu *vcpu, --- .before/arch/x86/kvm/x86.h 2010-12-08 18:56:49.0 +0200 +++ .after/arch/x86/kvm/x86.h 2010-12-08 18:56:49.0 +0200 @@ -74,6 +74,9 @@ void kvm_before_handle_nmi(struct kvm_vc void kvm_after_handle_nmi(struct kvm_vcpu *vcpu); int kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq); +int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes, + struct kvm_vcpu *vcpu, struct x86_exception *exception); + void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data); #endif --- .before/arch/x86/kvm/vmx.c 2010-12-08 18:56:49.0 +0200 +++ .after/arch/x86/kvm/vmx.c 2010-12-08 18:56:49.0 +0200 @@ -3936,6 +3936,65 @@ static int handle_vmoff(struct kvm_vcpu } /* + * Decode the memory-address operand of a vmx instruction, as recorded on an + * exit caused by such an instruction (run by a guest hypervisor). + * On success, returns 0. When the operand is invalid, returns 1 and throws + * #UD or #GP. 
+ */ +static int get_vmx_mem_address(struct kvm_vcpu *vcpu, +unsigned long exit_qualification, +u32 vmx_instruction_info, gva_t *ret) +{ + /* +* According to Vol. 3B, Information for VM Exits Due to Instruction +* Execution, on an exit, vmx_instruction_info holds most of the +* addressing components of the operand. Only the displacement part +* is put in exit_qualification (see 3B, Basic VM-Exit Information). +* For how an actual address is calculated from all these components, +* refer to Vol. 1, Operand Addressing. +*/ + int scaling = vmx_instruction_info & 3; + int addr_size = (vmx_instruction_info >> 7) & 7; + bool is_reg = vmx_instruction_info & (1u << 10); + int seg_reg = (vmx_instruction_info >> 15) & 7; + int index_reg = (vmx_instruction_info >> 18) & 0xf; + bool index_is_valid = !(vmx_instruction_info & (1u << 22)); + int base_reg = (vmx_instruction_info >> 23) & 0xf; + bool base_is_valid = !(vmx_instruction_info & (1u << 27)); + + if (is_reg) { + kvm_queue_exception(vcpu, UD_VECTOR); + return 1; + } + + switch (addr_size) { + case 1: /* 32 bit. high bits are undefined according to the spec: */ + exit_qualification &= 0xffffffff; + break; + case 2: /* 64 bit */ + break; + default: /* 16 bit */ + return 1; + } + + /* Addr = segment_base + offset */ + /* offset = base + [index * scale] + displacement */ + *ret = vmx_get_segment_base(vcpu, seg_reg); + if (base_is_valid) + *ret += kvm_register_read(vcpu, base_reg); + if (index_is_valid) + *ret += kvm_register_read(vcpu, index_reg) << scaling; + *ret += exit_qualification; /* holds the displacement */ + /* +* TODO: throw #GP (and return 1) in various cases that the VM* +* instructions require it - e.g., offset beyond segment limit, +* unusable or unreadable/unwritable segment, non-canonical 64-bit +* address, and so on. Currently these are not checked. +*/ + return 0; +} + +/* * The exit handlers return 1 if the exit was handled fully and guest execution * may resume.
Otherwise they set the kvm_run parameter to indicate what needs * to be done to userspace and return 0.
[PATCH 08/28] nVMX: Hold a vmcs02 for each vmcs12
In this patch we add a list of L0 (hardware) VMCSs, which we'll use to hold a hardware VMCS for each active vmcs12 (i.e., for each L2 guest). We call each of these L0 VMCSs a vmcs02, as it is the VMCS that L0 uses to run its nested guest L2. Signed-off-by: Nadav Har'El n...@il.ibm.com --- arch/x86/kvm/vmx.c | 96 +++ 1 file changed, 96 insertions(+) --- .before/arch/x86/kvm/vmx.c 2010-12-08 18:56:49.0 +0200 +++ .after/arch/x86/kvm/vmx.c 2010-12-08 18:56:49.0 +0200 @@ -155,6 +155,12 @@ struct __packed vmcs12 { */ #define VMCS12_REVISION 0x11e57ed0 +struct vmcs_list { + struct list_head list; + gpa_t vmcs12_addr; + struct vmcs *vmcs02; +}; + /* * The nested_vmx structure is part of vcpu_vmx, and holds information we need * for correct emulation of VMX (i.e., nested VMX) on this vcpu. For example, @@ -170,6 +176,10 @@ struct nested_vmx { /* The host-usable pointer to the above */ struct page *current_vmcs12_page; struct vmcs12 *current_vmcs12; + + /* list of real (hardware) VMCS, one for each L2 guest of L1 */ + struct list_head vmcs02_list; /* a vmcs_list */ + int vmcs02_num; }; struct vcpu_vmx { @@ -1736,6 +1746,85 @@ static void free_vmcs(struct vmcs *vmcs) free_pages((unsigned long)vmcs, vmcs_config.order); } +static struct vmcs *nested_get_current_vmcs(struct kvm_vcpu *vcpu) +{ + struct vcpu_vmx *vmx = to_vmx(vcpu); + struct vmcs_list *list_item, *n; + + list_for_each_entry_safe(list_item, n, vmx-nested.vmcs02_list, list) + if (list_item-vmcs12_addr == vmx-nested.current_vmptr) + return list_item-vmcs02; + + return NULL; +} + +/* + * Allocate an L0 VMCS (vmcs02) for the current L1 VMCS (vmcs12), if one + * does not already exist. The allocation is done in L0 memory, so to avoid + * denial-of-service attack by guests, we limit the number of concurrently- + * allocated vmcss. A well-behaving L1 will VMCLEAR unused vmcs12s and not + * trigger this limit. 
+ */ +static const int NESTED_MAX_VMCS = 256; +static int nested_create_current_vmcs(struct kvm_vcpu *vcpu) +{ + struct vmcs_list *new_l2_guest; + struct vmcs *vmcs02; + + if (nested_get_current_vmcs(vcpu)) + return 0; /* nothing to do - we already have a VMCS */ + + if (to_vmx(vcpu)-nested.vmcs02_num = NESTED_MAX_VMCS) + return -ENOMEM; + + new_l2_guest = (struct vmcs_list *) + kmalloc(sizeof(struct vmcs_list), GFP_KERNEL); + if (!new_l2_guest) + return -ENOMEM; + + vmcs02 = alloc_vmcs(); + if (!vmcs02) { + kfree(new_l2_guest); + return -ENOMEM; + } + + new_l2_guest-vmcs12_addr = to_vmx(vcpu)-nested.current_vmptr; + new_l2_guest-vmcs02 = vmcs02; + list_add((new_l2_guest-list), (to_vmx(vcpu)-nested.vmcs02_list)); + to_vmx(vcpu)-nested.vmcs02_num++; + return 0; +} + +/* Free a vmcs12's associated vmcs02, and remove it from vmcs02_list */ +static void nested_free_vmcs(struct kvm_vcpu *vcpu, gpa_t vmptr) +{ + struct vcpu_vmx *vmx = to_vmx(vcpu); + struct vmcs_list *list_item, *n; + + list_for_each_entry_safe(list_item, n, vmx-nested.vmcs02_list, list) + if (list_item-vmcs12_addr == vmptr) { + free_vmcs(list_item-vmcs02); + list_del((list_item-list)); + kfree(list_item); + vmx-nested.vmcs02_num--; + return; + } +} + +static void free_l1_state(struct kvm_vcpu *vcpu) +{ + struct vcpu_vmx *vmx = to_vmx(vcpu); + struct vmcs_list *list_item, *n; + + list_for_each_entry_safe(list_item, n, + vmx-nested.vmcs02_list, list) { + free_vmcs(list_item-vmcs02); + list_del((list_item-list)); + kfree(list_item); + } + vmx-nested.vmcs02_num = 0; +} + static void free_kvm_area(void) { int cpu; @@ -3884,6 +3973,9 @@ static int handle_vmon(struct kvm_vcpu * return 1; } + INIT_LIST_HEAD((vmx-nested.vmcs02_list)); + vmx-nested.vmcs02_num = 0; + vmx-nested.vmxon = true; skip_emulated_instruction(vcpu); @@ -3931,6 +4023,8 @@ static int handle_vmoff(struct kvm_vcpu if (to_vmx(vcpu)-nested.current_vmptr != -1ull) nested_release_page(to_vmx(vcpu)-nested.current_vmcs12_page); + 
free_l1_state(vcpu); + skip_emulated_instruction(vcpu); return 1; } @@ -4420,6 +4514,8 @@ static void vmx_free_vcpu(struct kvm_vcp free_vpid(vmx); if (vmx-nested.vmxon to_vmx(vcpu)-nested.current_vmptr != -1ull) nested_release_page(to_vmx(vcpu)-nested.current_vmcs12_page); + if (vmx-nested.vmxon) + free_l1_state(vcpu);
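The vmcs02 lifecycle in this patch — allocate one hardware VMCS per vmcs12 address on demand, cap the total to bound L0 memory, and free entries on VMCLEAR or VMXOFF — can be sketched in user space with an ordinary singly-linked list. Everything below (the `vmcs_cache` type, `cache_*` names, the placeholder allocations) is illustrative, not the kernel code:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t gpa_t;

struct cache_entry {
	gpa_t vmcs12_addr;           /* guest-physical key, like vmcs12_addr */
	void *vmcs02;                /* stands in for the hardware VMCS */
	struct cache_entry *next;
};

struct vmcs_cache {
	struct cache_entry *head;
	int num;
	int max;                     /* cap, like NESTED_MAX_VMCS */
};

static void *cache_get(struct vmcs_cache *c, gpa_t addr)
{
	for (struct cache_entry *e = c->head; e; e = e->next)
		if (e->vmcs12_addr == addr)
			return e->vmcs02;
	return NULL;
}

/* Returns 0 on success, -1 when the cap would be exceeded (like -ENOMEM). */
static int cache_create(struct vmcs_cache *c, gpa_t addr)
{
	if (cache_get(c, addr))
		return 0;            /* already have one - nothing to do */
	if (c->num >= c->max)
		return -1;           /* a well-behaving L1 never hits this */
	struct cache_entry *e = malloc(sizeof(*e));
	if (!e)
		return -1;
	e->vmcs12_addr = addr;
	e->vmcs02 = malloc(1);       /* placeholder for alloc_vmcs() */
	e->next = c->head;
	c->head = e;
	c->num++;
	return 0;
}

/* Free the entry for one address, as VMCLEAR does for its operand. */
static void cache_free_one(struct vmcs_cache *c, gpa_t addr)
{
	for (struct cache_entry **pp = &c->head; *pp; pp = &(*pp)->next)
		if ((*pp)->vmcs12_addr == addr) {
			struct cache_entry *e = *pp;
			*pp = e->next;
			free(e->vmcs02);
			free(e);
			c->num--;
			return;
		}
}
```

The cap mirrors the denial-of-service argument in the comment above: vmcs02 allocations live in L0 memory that the guest cannot see, so without a limit a misbehaving L1 could VMPTRLD an unbounded number of addresses.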
[PATCH 11/28] nVMX: Implement VMCLEAR
This patch implements the VMCLEAR instruction. Signed-off-by: Nadav Har'El n...@il.ibm.com --- arch/x86/kvm/vmx.c | 60 ++- 1 file changed, 59 insertions(+), 1 deletion(-) --- .before/arch/x86/kvm/vmx.c 2010-12-08 18:56:50.0 +0200 +++ .after/arch/x86/kvm/vmx.c 2010-12-08 18:56:50.0 +0200 @@ -279,6 +279,8 @@ struct __packed vmcs12 { u32 abort; struct vmcs_fields fields; + + bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */ }; /* @@ -4413,6 +4415,62 @@ static void nested_vmx_failValid(struct get_vmcs12_fields(vcpu)-vm_instruction_error = vm_instruction_error; } +/* Emulate the VMCLEAR instruction */ +static int handle_vmclear(struct kvm_vcpu *vcpu) +{ + struct vcpu_vmx *vmx = to_vmx(vcpu); + gva_t gva; + gpa_t vmcs12_addr; + struct vmcs12 *vmcs12; + struct page *page; + + if (!nested_vmx_check_permission(vcpu)) + return 1; + + if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION), + vmcs_read32(VMX_INSTRUCTION_INFO), gva)) + return 1; + + if (kvm_read_guest_virt(gva, vmcs12_addr, sizeof(vmcs12_addr), + vcpu, NULL)) { + kvm_queue_exception(vcpu, PF_VECTOR); + return 1; + } + + if (!IS_ALIGNED(vmcs12_addr, PAGE_SIZE)) { + nested_vmx_failValid(vcpu, VMXERR_VMCLEAR_INVALID_ADDRESS); + skip_emulated_instruction(vcpu); + return 1; + } + + if (vmcs12_addr == vmx-nested.current_vmptr) { + nested_release_page(vmx-nested.current_vmcs12_page); + vmx-nested.current_vmptr = -1ull; + } + + page = nested_get_page(vcpu, vmcs12_addr); + if (page == NULL) { + /* +* For accurate processor emulation, VMCLEAR beyond available +* physical memory should do nothing at all. 
However, it is +* possible that a nested vmx bug, not a guest hypervisor bug, +* resulted in this case, so let's shut down before doing any +* more damage: +*/ + set_bit(KVM_REQ_TRIPLE_FAULT, vcpu-requests); + return 1; + } + vmcs12 = kmap(page); + vmcs12-launch_state = 0; + nested_release_page(page); + + nested_free_vmcs(vcpu, vmcs12_addr); + + skip_emulated_instruction(vcpu); + nested_vmx_succeed(vcpu); + return 1; +} + /* * The exit handlers return 1 if the exit was handled fully and guest execution * may resume. Otherwise they set the kvm_run parameter to indicate what needs @@ -4434,7 +4492,7 @@ static int (*kvm_vmx_exit_handlers[])(st [EXIT_REASON_INVD]= handle_invd, [EXIT_REASON_INVLPG] = handle_invlpg, [EXIT_REASON_VMCALL] = handle_vmcall, - [EXIT_REASON_VMCLEAR] = handle_vmx_insn, + [EXIT_REASON_VMCLEAR] = handle_vmclear, [EXIT_REASON_VMLAUNCH]= handle_vmx_insn, [EXIT_REASON_VMPTRLD] = handle_vmx_insn, [EXIT_REASON_VMPTRST] = handle_vmx_insn, -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
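The observable effect of handle_vmclear() above boils down to three steps: reject a non-page-aligned operand, drop the current-VMCS pointer if it matches, and zero the vmcs12's launch_state so a subsequent VMLAUNCH becomes legal. A minimal sketch, with hypothetical names (`do_vmclear`, `fake_vmcs12`) standing in for the kernel types:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096u

/* Illustrative error codes; the real code uses VMXERR_* constants. */
enum vmclear_result { VMCLEAR_OK, VMCLEAR_BAD_ALIGN };

struct fake_vmcs12 {
	int launch_state;   /* 1 after VMLAUNCH, 0 after VMCLEAR */
};

/* Hypothetical distillation of handle_vmclear()'s checks and effect. */
static enum vmclear_result do_vmclear(uint64_t vmcs12_addr,
				      struct fake_vmcs12 *mapped,
				      uint64_t *current_vmptr)
{
	if (vmcs12_addr % PAGE_SIZE != 0)
		return VMCLEAR_BAD_ALIGN;      /* nested_vmx_failValid(...) */
	if (*current_vmptr == vmcs12_addr)
		*current_vmptr = (uint64_t)-1; /* drop the current pointer */
	mapped->launch_state = 0;              /* VMLAUNCH is legal again */
	return VMCLEAR_OK;
}
```

The sketch omits the page-mapping failure path, where the real handler deliberately triple-faults rather than risk further damage from a KVM-side bug.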
[PATCH 12/28] nVMX: Implement VMPTRLD
This patch implements the VMPTRLD instruction. Signed-off-by: Nadav Har'El n...@il.ibm.com --- arch/x86/kvm/vmx.c | 61 ++- 1 file changed, 60 insertions(+), 1 deletion(-) --- .before/arch/x86/kvm/vmx.c 2010-12-08 18:56:50.0 +0200 +++ .after/arch/x86/kvm/vmx.c 2010-12-08 18:56:50.0 +0200 @@ -4471,6 +4471,65 @@ static int handle_vmclear(struct kvm_vcp return 1; } +/* Emulate the VMPTRLD instruction */ +static int handle_vmptrld(struct kvm_vcpu *vcpu) +{ + struct vcpu_vmx *vmx = to_vmx(vcpu); + gva_t gva; + gpa_t vmcs12_addr; + + if (!nested_vmx_check_permission(vcpu)) + return 1; + + if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION), + vmcs_read32(VMX_INSTRUCTION_INFO), gva)) + return 1; + + if (kvm_read_guest_virt(gva, vmcs12_addr, sizeof(vmcs12_addr), + vcpu, NULL)) { + kvm_queue_exception(vcpu, PF_VECTOR); + return 1; + } + + if (!IS_ALIGNED(vmcs12_addr, PAGE_SIZE)) { + nested_vmx_failValid(vcpu, VMXERR_VMPTRLD_INVALID_ADDRESS); + skip_emulated_instruction(vcpu); + return 1; + } + + if (vmx-nested.current_vmptr != vmcs12_addr) { + struct vmcs12 *new_vmcs12; + struct page *page; + page = nested_get_page(vcpu, vmcs12_addr); + if (page == NULL) { + nested_vmx_failInvalid(vcpu); + skip_emulated_instruction(vcpu); + return 1; + } + new_vmcs12 = kmap(page); + if (new_vmcs12-revision_id != VMCS12_REVISION) { + nested_release_page_clean(page); + nested_vmx_failValid(vcpu, + VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID); + skip_emulated_instruction(vcpu); + return 1; + } + if (vmx-nested.current_vmptr != -1ull) + nested_release_page(vmx-nested.current_vmcs12_page); + + vmx-nested.current_vmptr = vmcs12_addr; + vmx-nested.current_vmcs12 = new_vmcs12; + vmx-nested.current_vmcs12_page = page; + + if (nested_create_current_vmcs(vcpu)) + return -ENOMEM; + } + + nested_vmx_succeed(vcpu); + skip_emulated_instruction(vcpu); + return 1; +} + /* * The exit handlers return 1 if the exit was handled fully and guest execution * may resume. 
Otherwise they set the kvm_run parameter to indicate what needs @@ -4494,7 +4553,7 @@ static int (*kvm_vmx_exit_handlers[])(st [EXIT_REASON_VMCALL] = handle_vmcall, [EXIT_REASON_VMCLEAR] = handle_vmclear, [EXIT_REASON_VMLAUNCH]= handle_vmx_insn, - [EXIT_REASON_VMPTRLD] = handle_vmx_insn, + [EXIT_REASON_VMPTRLD] = handle_vmptrld, [EXIT_REASON_VMPTRST] = handle_vmx_insn, [EXIT_REASON_VMREAD] = handle_vmx_insn, [EXIT_REASON_VMRESUME]= handle_vmx_insn, -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
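handle_vmptrld() above validates its operand in a specific order: alignment first, then (only if the pointer actually changes) the revision ID of the newly mapped vmcs12, and only then does it adopt the new current pointer. A user-space sketch of that ordering, with illustrative names standing in for the kernel types:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE       4096u
#define VMCS12_REVISION 0x11e57ed0u   /* same magic as in the series */

enum ptrld_result { PTRLD_OK, PTRLD_BAD_ALIGN, PTRLD_BAD_REVISION };

struct fake_vmcs12 { uint32_t revision_id; };

/* Hypothetical distillation of handle_vmptrld(): validate, then adopt. */
static enum ptrld_result do_vmptrld(uint64_t addr,
				    const struct fake_vmcs12 *mapped,
				    uint64_t *current_vmptr)
{
	if (addr % PAGE_SIZE != 0)
		return PTRLD_BAD_ALIGN;
	if (*current_vmptr == addr)
		return PTRLD_OK;           /* already current - nothing to do */
	if (mapped->revision_id != VMCS12_REVISION)
		return PTRLD_BAD_REVISION; /* reject before dropping old one */
	*current_vmptr = addr;             /* adopt as the current VMCS */
	return PTRLD_OK;
}
```

Note that the revision check is skipped when the operand already equals the current pointer, just as the kernel code only examines a vmcs12 it is about to newly map.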
[PATCH 13/28] nVMX: Implement VMPTRST
This patch implements the VMPTRST instruction. Signed-off-by: Nadav Har'El n...@il.ibm.com --- arch/x86/kvm/vmx.c | 27 ++- arch/x86/kvm/x86.c |3 ++- arch/x86/kvm/x86.h |3 +++ 3 files changed, 31 insertions(+), 2 deletions(-) --- .before/arch/x86/kvm/x86.c 2010-12-08 18:56:50.0 +0200 +++ .after/arch/x86/kvm/x86.c 2010-12-08 18:56:50.0 +0200 @@ -3705,7 +3705,7 @@ static int kvm_read_guest_virt_system(gv return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, 0, exception); } -static int kvm_write_guest_virt_system(gva_t addr, void *val, +int kvm_write_guest_virt_system(gva_t addr, void *val, unsigned int bytes, struct kvm_vcpu *vcpu, struct x86_exception *exception) @@ -3736,6 +3736,7 @@ static int kvm_write_guest_virt_system(g out: return r; } +EXPORT_SYMBOL_GPL(kvm_write_guest_virt_system); static int emulator_read_emulated(unsigned long addr, void *val, --- .before/arch/x86/kvm/x86.h 2010-12-08 18:56:50.0 +0200 +++ .after/arch/x86/kvm/x86.h 2010-12-08 18:56:50.0 +0200 @@ -77,6 +77,9 @@ int kvm_inject_realmode_interrupt(struct int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes, struct kvm_vcpu *vcpu, struct x86_exception *exception); +int kvm_write_guest_virt_system(gva_t addr, void *val, unsigned int bytes, + struct kvm_vcpu *vcpu, struct x86_exception *exception); + void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data); #endif --- .before/arch/x86/kvm/vmx.c 2010-12-08 18:56:50.0 +0200 +++ .after/arch/x86/kvm/vmx.c 2010-12-08 18:56:50.0 +0200 @@ -4530,6 +4530,31 @@ static int handle_vmptrld(struct kvm_vcp return 1; } +/* Emulate the VMPTRST instruction */ +static int handle_vmptrst(struct kvm_vcpu *vcpu) +{ + unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION); + u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO); + gva_t vmcs_gva; + + if (!nested_vmx_check_permission(vcpu)) + return 1; + + if (get_vmx_mem_address(vcpu, exit_qualification, + vmx_instruction_info, vmcs_gva)) + return 1; + /* ok to use *_system, because 
handle_vmread verified cpl=0 */ + if (kvm_write_guest_virt_system(vmcs_gva, +(void *)to_vmx(vcpu)-nested.current_vmptr, +sizeof(u64), vcpu, NULL)) { + kvm_queue_exception(vcpu, PF_VECTOR); + return 1; + } + nested_vmx_succeed(vcpu); + skip_emulated_instruction(vcpu); + return 1; +} + /* * The exit handlers return 1 if the exit was handled fully and guest execution * may resume. Otherwise they set the kvm_run parameter to indicate what needs @@ -4554,7 +4579,7 @@ static int (*kvm_vmx_exit_handlers[])(st [EXIT_REASON_VMCLEAR] = handle_vmclear, [EXIT_REASON_VMLAUNCH]= handle_vmx_insn, [EXIT_REASON_VMPTRLD] = handle_vmptrld, - [EXIT_REASON_VMPTRST] = handle_vmx_insn, + [EXIT_REASON_VMPTRST] = handle_vmptrst, [EXIT_REASON_VMREAD] = handle_vmx_insn, [EXIT_REASON_VMRESUME]= handle_vmx_insn, [EXIT_REASON_VMWRITE] = handle_vmx_insn, -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 15/28] nVMX: Prepare vmcs02 from vmcs01 and vmcs12
This patch contains code to prepare the VMCS which can be used to actually run the L2 guest, vmcs02. prepare_vmcs02 appropriately merges the information in vmcs12 (the vmcs that L1 built for L2) and in vmcs01 (the vmcs that we built for L1). VMREAD/WRITE can only access one VMCS at a time (the current VMCS), which makes it difficult for us to read from vmcs01 while writing to vmcs12. This is why we first make a copy of vmcs01 in memory (vmcs01_fields) and then read that memory copy while writing to vmcs12. Signed-off-by: Nadav Har'El n...@il.ibm.com --- arch/x86/kvm/vmx.c | 409 +++ 1 file changed, 409 insertions(+) --- .before/arch/x86/kvm/vmx.c 2010-12-08 18:56:50.0 +0200 +++ .after/arch/x86/kvm/vmx.c 2010-12-08 18:56:50.0 +0200 @@ -805,6 +805,28 @@ static inline bool report_flexpriority(v return flexpriority_enabled; } +static inline bool nested_cpu_has_vmx_tpr_shadow(struct kvm_vcpu *vcpu) +{ + return cpu_has_vmx_tpr_shadow() + get_vmcs12_fields(vcpu)-cpu_based_vm_exec_control + CPU_BASED_TPR_SHADOW; +} + +static inline bool nested_cpu_has_secondary_exec_ctrls(struct kvm_vcpu *vcpu) +{ + return cpu_has_secondary_exec_ctrls() + get_vmcs12_fields(vcpu)-cpu_based_vm_exec_control + CPU_BASED_ACTIVATE_SECONDARY_CONTROLS; +} + +static inline bool nested_vm_need_virtualize_apic_accesses(struct kvm_vcpu + *vcpu) +{ + return nested_cpu_has_secondary_exec_ctrls(vcpu) + (get_vmcs12_fields(vcpu)-secondary_vm_exec_control + SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES); +} + static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr) { int i; @@ -1253,6 +1275,37 @@ static void vmx_load_host_state(struct v preempt_enable(); } +int load_vmcs_host_state(struct vmcs_fields *src) +{ + vmcs_write16(HOST_ES_SELECTOR, src-host_es_selector); + vmcs_write16(HOST_CS_SELECTOR, src-host_cs_selector); + vmcs_write16(HOST_SS_SELECTOR, src-host_ss_selector); + vmcs_write16(HOST_DS_SELECTOR, src-host_ds_selector); + vmcs_write16(HOST_FS_SELECTOR, src-host_fs_selector); + 
vmcs_write16(HOST_GS_SELECTOR, src-host_gs_selector); + vmcs_write16(HOST_TR_SELECTOR, src-host_tr_selector); + + if (vmcs_config.vmexit_ctrl VM_EXIT_LOAD_IA32_PAT) + vmcs_write64(HOST_IA32_PAT, src-host_ia32_pat); + + vmcs_write32(HOST_IA32_SYSENTER_CS, src-host_ia32_sysenter_cs); + + vmcs_writel(HOST_CR0, src-host_cr0); + vmcs_writel(HOST_CR3, src-host_cr3); + vmcs_writel(HOST_CR4, src-host_cr4); + vmcs_writel(HOST_FS_BASE, src-host_fs_base); + vmcs_writel(HOST_GS_BASE, src-host_gs_base); + vmcs_writel(HOST_TR_BASE, src-host_tr_base); + vmcs_writel(HOST_GDTR_BASE, src-host_gdtr_base); + vmcs_writel(HOST_IDTR_BASE, src-host_idtr_base); + vmcs_writel(HOST_RSP, src-host_rsp); + vmcs_writel(HOST_RIP, src-host_rip); + vmcs_writel(HOST_IA32_SYSENTER_ESP, src-host_ia32_sysenter_esp); + vmcs_writel(HOST_IA32_SYSENTER_EIP, src-host_ia32_sysenter_eip); + + return 0; +} + /* * Switches to specified vcpu, until a matching vcpu_put(), but assumes * vcpu mutex is already taken. @@ -5365,6 +5418,362 @@ static void vmx_set_supported_cpuid(u32 entry-ecx |= bit(X86_FEATURE_VMX); } +/* + * Make a copy of the current VMCS to ordinary memory. This is needed because + * in VMX you cannot read and write to two VMCS at the same time, so when we + * want to do this (in prepare_vmcs02, which needs to read from vmcs01 while + * preparing vmcs02), we need to first save a copy of one VMCS's fields in + * memory, and then use that copy. 
+ */ +void save_vmcs(struct vmcs_fields *dst) +{ + dst-guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR); + dst-guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR); + dst-guest_ss_selector = vmcs_read16(GUEST_SS_SELECTOR); + dst-guest_ds_selector = vmcs_read16(GUEST_DS_SELECTOR); + dst-guest_fs_selector = vmcs_read16(GUEST_FS_SELECTOR); + dst-guest_gs_selector = vmcs_read16(GUEST_GS_SELECTOR); + dst-guest_ldtr_selector = vmcs_read16(GUEST_LDTR_SELECTOR); + dst-guest_tr_selector = vmcs_read16(GUEST_TR_SELECTOR); + dst-host_es_selector = vmcs_read16(HOST_ES_SELECTOR); + dst-host_cs_selector = vmcs_read16(HOST_CS_SELECTOR); + dst-host_ss_selector = vmcs_read16(HOST_SS_SELECTOR); + dst-host_ds_selector = vmcs_read16(HOST_DS_SELECTOR); + dst-host_fs_selector = vmcs_read16(HOST_FS_SELECTOR); + dst-host_gs_selector = vmcs_read16(HOST_GS_SELECTOR); + dst-host_tr_selector = vmcs_read16(HOST_TR_SELECTOR); + dst-io_bitmap_a = vmcs_read64(IO_BITMAP_A); + dst-io_bitmap_b = vmcs_read64(IO_BITMAP_B); + if (cpu_has_vmx_msr_bitmap()) +
[PATCH 16/28] nVMX: Move register-syncing to a function
Move the code that syncs the dirty RSP and RIP registers back to the VMCS into a function. We will need to call this function from additional places in the next patch.

Signed-off-by: Nadav Har'El <n...@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:50.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:50.000000000 +0200
@@ -5033,6 +5033,15 @@ static void vmx_cancel_injection(struct
 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
 }
 
+static inline void sync_cached_regs_to_vmcs(struct kvm_vcpu *vcpu)
+{
+	if (test_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_dirty))
+		vmcs_writel(GUEST_RSP, vcpu->arch.regs[VCPU_REGS_RSP]);
+	if (test_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_dirty))
+		vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]);
+	vcpu->arch.regs_dirty = 0;
+}
+
 #ifdef CONFIG_X86_64
 #define R "r"
 #define Q "q"
@@ -5054,10 +5063,7 @@ static void vmx_vcpu_run(struct kvm_vcpu
 	if (vmx->emulation_required && emulate_invalid_guest_state)
 		return;
 
-	if (test_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_dirty))
-		vmcs_writel(GUEST_RSP, vcpu->arch.regs[VCPU_REGS_RSP]);
-	if (test_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_dirty))
-		vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]);
+	sync_cached_regs_to_vmcs(vcpu);
 
 	/* When single-stepping over STI and MOV SS, we must clear the
 	 * corresponding interruptibility bits in the guest state. Otherwise
@@ -5165,7 +5171,6 @@ static void vmx_vcpu_run(struct kvm_vcpu
 	vcpu->arch.regs_avail = ~((1 << VCPU_REGS_RIP) | (1 << VCPU_REGS_RSP)
 				  | (1 << VCPU_EXREG_PDPTR));
-	vcpu->arch.regs_dirty = 0;
 
 	vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
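The dirty-bit scheme that sync_cached_regs_to_vmcs() factors out is a classic lazy write-back: RSP and RIP live in a software cache, and the expensive VMWRITE is only issued for registers whose dirty bit is set. A small stand-alone model of that pattern (all names here are illustrative, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

enum { REG_RSP, REG_RIP, NR_REGS };

struct cpu_state {
	uint64_t regs[NR_REGS];   /* software-cached register values */
	uint64_t vmcs[NR_REGS];   /* stands in for GUEST_RSP/GUEST_RIP */
	uint32_t regs_dirty;      /* bit n set => regs[n] newer than vmcs[n] */
	int vmcs_writes;          /* counts simulated vmcs_writel() calls */
};

/* Analogue of sync_cached_regs_to_vmcs(): write back only dirty regs. */
static void sync_cached_regs(struct cpu_state *s)
{
	for (int r = 0; r < NR_REGS; r++)
		if (s->regs_dirty & (1u << r)) {
			s->vmcs[r] = s->regs[r];
			s->vmcs_writes++;   /* each one is a costly VMWRITE */
		}
	s->regs_dirty = 0;                  /* everything is clean again */
}
```

Clearing regs_dirty inside the helper is exactly why the patch could delete the `regs_dirty = 0` line from the tail of vmx_vcpu_run(): once syncing and clearing live in one function, callers cannot forget one half of the pair.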
[PATCH 17/28] nVMX: Implement VMLAUNCH and VMRESUME
Implement the VMLAUNCH and VMRESUME instructions, allowing a guest hypervisor to run its own guests. Signed-off-by: Nadav Har'El n...@il.ibm.com --- arch/x86/kvm/vmx.c | 235 ++- 1 file changed, 232 insertions(+), 3 deletions(-) --- .before/arch/x86/kvm/vmx.c 2010-12-08 18:56:50.0 +0200 +++ .after/arch/x86/kvm/vmx.c 2010-12-08 18:56:50.0 +0200 @@ -281,6 +281,9 @@ struct __packed vmcs12 { struct vmcs_fields fields; bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */ + + int cpu; + int launched; }; /* @@ -315,6 +318,21 @@ struct nested_vmx { /* list of real (hardware) VMCS, one for each L2 guest of L1 */ struct list_head vmcs02_list; /* a vmcs_list */ int vmcs02_num; + + /* Level 1 state for switching to level 2 and back */ + struct { + u64 efer; + u64 io_bitmap_a; + u64 io_bitmap_b; + u64 msr_bitmap; + int cpu; + int launched; + } l1_state; + /* Saving the VMCS that we used for running L1 */ + struct vmcs *vmcs01; + struct vmcs_fields *vmcs01_fields; + /* Saving some vcpu-arch.* data we had for L1, while running L2 */ + unsigned long l1_arch_cr3; }; struct vcpu_vmx { @@ -1344,6 +1362,16 @@ static void vmx_vcpu_load(struct kvm_vcp rdmsrl(MSR_IA32_SYSENTER_ESP, sysenter_esp); vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */ + + if (vmx-nested.vmcs01_fields != NULL) { + struct vmcs_fields *vmcs01 = vmx-nested.vmcs01_fields; + vmcs01-host_tr_base = vmcs_readl(HOST_TR_BASE); + vmcs01-host_gdtr_base = vmcs_readl(HOST_GDTR_BASE); + vmcs01-host_ia32_sysenter_esp = + vmcs_readl(HOST_IA32_SYSENTER_ESP); + if (is_guest_mode(vcpu)) + load_vmcs_host_state(vmcs01); + } } } @@ -2173,6 +2201,9 @@ static void free_l1_state(struct kvm_vcp kfree(list_item); } vmx-nested.vmcs02_num = 0; + + kfree(vmx-nested.vmcs01_fields); + vmx-nested.vmcs01_fields = NULL; } static void free_kvm_area(void) @@ -4326,6 +4357,10 @@ static int handle_vmon(struct kvm_vcpu * INIT_LIST_HEAD((vmx-nested.vmcs02_list)); vmx-nested.vmcs02_num = 0; + vmx-nested.vmcs01_fields = 
kzalloc(PAGE_SIZE, GFP_KERNEL); + if (!vmx-nested.vmcs01_fields) + return -ENOMEM; + vmx-nested.vmxon = true; skip_emulated_instruction(vcpu); @@ -4524,6 +4559,50 @@ static int handle_vmclear(struct kvm_vcp return 1; } +static int nested_vmx_run(struct kvm_vcpu *vcpu); + +static int handle_launch_or_resume(struct kvm_vcpu *vcpu, bool launch) +{ + if (!nested_vmx_check_permission(vcpu)) + return 1; + + /* yet another strange pre-requisite listed in the VMX spec */ + if (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) + GUEST_INTR_STATE_MOV_SS) { + nested_vmx_failValid(vcpu, + VMXERR_ENTRY_EVENTS_BLOCKED_BY_MOV_SS); + skip_emulated_instruction(vcpu); + return 1; + } + + if (to_vmx(vcpu)-nested.current_vmcs12-launch_state == launch) { + /* Must use VMLAUNCH for the first time, VMRESUME later */ + nested_vmx_failValid(vcpu, + launch ? VMXERR_VMLAUNCH_NONCLEAR_VMCS : +VMXERR_VMRESUME_NONLAUNCHED_VMCS); + skip_emulated_instruction(vcpu); + return 1; + } + + skip_emulated_instruction(vcpu); + + nested_vmx_run(vcpu); + return 1; +} + +/* Emulate the VMLAUNCH instruction */ +static int handle_vmlaunch(struct kvm_vcpu *vcpu) +{ + return handle_launch_or_resume(vcpu, true); +} + +/* Emulate the VMRESUME instruction */ +static int handle_vmresume(struct kvm_vcpu *vcpu) +{ + + return handle_launch_or_resume(vcpu, false); +} + enum vmcs_field_type { VMCS_FIELD_TYPE_U16 = 0, VMCS_FIELD_TYPE_U64 = 1, @@ -4797,11 +4876,11 @@ static int (*kvm_vmx_exit_handlers[])(st [EXIT_REASON_INVLPG] = handle_invlpg, [EXIT_REASON_VMCALL] = handle_vmcall, [EXIT_REASON_VMCLEAR] = handle_vmclear, - [EXIT_REASON_VMLAUNCH]= handle_vmx_insn, + [EXIT_REASON_VMLAUNCH]= handle_vmlaunch, [EXIT_REASON_VMPTRLD] = handle_vmptrld, [EXIT_REASON_VMPTRST] = handle_vmptrst, [EXIT_REASON_VMREAD] = handle_vmread, - [EXIT_REASON_VMRESUME]= handle_vmx_insn, + [EXIT_REASON_VMRESUME]=
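The launch_state check in handle_launch_or_resume() encodes the VMX rule that a clear VMCS must first be entered with VMLAUNCH, and a launched one only with VMRESUME. As a tiny pure function (names are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

enum launch_err { LAUNCH_OK, ERR_NONCLEAR_VMCS, ERR_NONLAUNCHED_VMCS };

/*
 * Distillation of the launch_state check: VMLAUNCH requires a clear
 * (launch_state == 0) VMCS; VMRESUME requires a launched one.
 */
static enum launch_err check_launch(int launch_state, bool is_vmlaunch)
{
	if (launch_state == (is_vmlaunch ? 1 : 0))
		return is_vmlaunch ? ERR_NONCLEAR_VMCS
				   : ERR_NONLAUNCHED_VMCS;
	return LAUNCH_OK;
}
```

This mirrors the `launch_state == launch` comparison in the patch: the instruction fails precisely when the VMCS state already matches what the instruction would establish.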
[PATCH 19/28] nVMX: Exiting from L2 to L1
This patch implements nested_vmx_vmexit(), called when the nested L2 guest exits and we want to run its L1 parent and let it handle this exit.

Note that this will not necessarily be called on every L2 exit. L0 may decide to handle a particular exit on its own, without L1's involvement; in that case, L0 will handle the exit, and resume running L2, without running L1 and without calling nested_vmx_vmexit(). The logic for deciding whether to handle a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(), will appear in the next patch.

Signed-off-by: Nadav Har'El <n...@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  233 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 233 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:51.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:51.000000000 +0200
@@ -5092,6 +5092,8 @@ static void __vmx_complete_interrupts(st
 
 static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 {
+	if (is_guest_mode(&vmx->vcpu))
+		return;
 	__vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
 				  VM_EXIT_INSTRUCTION_LEN,
 				  IDT_VECTORING_ERROR_CODE);
@@ -6002,6 +6004,237 @@ static int nested_vmx_run(struct kvm_vcp
 	return 1;
 }
 
+/*
+ * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date
+ * because L2 may have changed some cr0 bits directly (see CR0_GUEST_HOST_MASK)
+ * without L0 trapping the change and updating vmcs12.
+ * This function returns the value we should put in vmcs12.guest_cr0. It's not
+ * enough to just return the current (vmcs02) GUEST_CR0. This may not be the
+ * guest cr0 that L1 thought it was giving its L2 guest - it is possible that
+ * L1 wished to allow its guest to set a cr0 bit directly, but we (L0) asked
+ * to trap this change and instead set just the read shadow. If this is the
+ * case, we need to copy these read-shadow bits back to vmcs12.guest_cr0, where
+ * L1 believes they already are.
+ */ +static inline unsigned long +vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs_fields *vmcs12) +{ + unsigned long guest_cr0_bits = + vcpu-arch.cr0_guest_owned_bits | vmcs12-cr0_guest_host_mask; + return (vmcs_readl(GUEST_CR0) guest_cr0_bits) | + (vmcs_readl(CR0_READ_SHADOW) ~guest_cr0_bits); +} + +static inline unsigned long +vmcs12_guest_cr4(struct kvm_vcpu *vcpu, struct vmcs_fields *vmcs12) +{ + unsigned long guest_cr4_bits = + vcpu-arch.cr4_guest_owned_bits | vmcs12-cr4_guest_host_mask; + return (vmcs_readl(GUEST_CR4) guest_cr4_bits) | + (vmcs_readl(CR4_READ_SHADOW) ~guest_cr4_bits); +} + +/* + * prepare_vmcs12 is called when the nested L2 guest exits and we want to + * prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12), and this + * function updates it to reflect the changes to the guest state while L2 was + * running (and perhaps made some exits which were handled directly by L0 + * without going back to L1), and to reflect the exit reason. + * Note that we do not have to copy here all VMCS fields, just those that + * could have changed by the L2 guest or the exit - i.e., the guest-state and + * exit-information fields only. Other fields are modified by L1 with VMWRITE, + * which already writes to vmcs12 directly. 
+ */ +void prepare_vmcs12(struct kvm_vcpu *vcpu) +{ + struct vmcs_fields *vmcs12 = get_vmcs12_fields(vcpu); + + /* update guest state fields: */ + vmcs12-guest_cr0 = vmcs12_guest_cr0(vcpu, vmcs12); + vmcs12-guest_cr4 = vmcs12_guest_cr4(vcpu, vmcs12); + + vmcs12-guest_dr7 = vmcs_readl(GUEST_DR7); + vmcs12-guest_rsp = vmcs_readl(GUEST_RSP); + vmcs12-guest_rip = vmcs_readl(GUEST_RIP); + vmcs12-guest_rflags = vmcs_readl(GUEST_RFLAGS); + + vmcs12-guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR); + vmcs12-guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR); + vmcs12-guest_ss_selector = vmcs_read16(GUEST_SS_SELECTOR); + vmcs12-guest_ds_selector = vmcs_read16(GUEST_DS_SELECTOR); + vmcs12-guest_fs_selector = vmcs_read16(GUEST_FS_SELECTOR); + vmcs12-guest_gs_selector = vmcs_read16(GUEST_GS_SELECTOR); + vmcs12-guest_ldtr_selector = vmcs_read16(GUEST_LDTR_SELECTOR); + vmcs12-guest_tr_selector = vmcs_read16(GUEST_TR_SELECTOR); + vmcs12-guest_es_limit = vmcs_read32(GUEST_ES_LIMIT); + vmcs12-guest_cs_limit = vmcs_read32(GUEST_CS_LIMIT); + vmcs12-guest_ss_limit = vmcs_read32(GUEST_SS_LIMIT); + vmcs12-guest_ds_limit = vmcs_read32(GUEST_DS_LIMIT); + vmcs12-guest_fs_limit = vmcs_read32(GUEST_FS_LIMIT); + vmcs12-guest_gs_limit = vmcs_read32(GUEST_GS_LIMIT); + vmcs12-guest_ldtr_limit = vmcs_read32(GUEST_LDTR_LIMIT); + vmcs12-guest_tr_limit = vmcs_read32(GUEST_TR_LIMIT); + vmcs12-guest_gdtr_limit = vmcs_read32(GUEST_GDTR_LIMIT); + vmcs12-guest_idtr_limit =
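The cr0 merge performed by vmcs12_guest_cr0() is pure bit arithmetic: for bits the guest owns (under either L0's or L1's mask), take the hardware GUEST_CR0; for the rest, take CR0_READ_SHADOW, which holds what L2 believes it wrote. A stand-alone version of the same formula (the function name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Same formula as vmcs12_guest_cr0() in the patch:
 *   bits = cr0_guest_owned_bits | cr0_guest_host_mask
 *   result = (GUEST_CR0 & bits) | (CR0_READ_SHADOW & ~bits)
 */
static uint64_t merge_guest_cr0(uint64_t hw_guest_cr0, uint64_t read_shadow,
				uint64_t cr0_guest_owned_bits,
				uint64_t cr0_guest_host_mask)
{
	uint64_t bits = cr0_guest_owned_bits | cr0_guest_host_mask;
	return (hw_guest_cr0 & bits) | (read_shadow & ~bits);
}
```

The read-shadow term is the interesting half: it is how bits that L1 wanted to hand to L2, but that L0 chose to intercept and shadow, flow back into vmcs12.guest_cr0 with the value L1 expects.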
[PATCH 21/28] nVMX: Correct handling of interrupt injection
When KVM wants to inject an interrupt, the guest should think a real interrupt has happened. Normally (in the non-nested case) this means checking that the guest doesn't block interrupts (and if it does, inject when it doesn't - using the interrupt window VMX mechanism), and setting up the appropriate VMCS fields for the guest to receive the interrupt. However, when we are running a nested guest (L2) and its hypervisor (L1) requested exits on interrupts (as most hypervisors do), the most efficient thing to do is to exit L2, telling L1 that the exit was caused by an interrupt, the one we were injecting; Only when L1 asked not to be notified of interrupts, we should inject directly to the running L2 guest (i.e., the normal code path). However, properly doing what is described above requires invasive changes to the flow of the existing code, which we elected not to do in this stage. Instead we do something more simplistic and less efficient: we modify vmx_interrupt_allowed(), which kvm calls to see if it can inject the interrupt now, to exit from L2 to L1 before continuing the normal code. The normal kvm code then notices that L1 is blocking interrupts, and sets the interrupt window to inject the interrupt later to L1. Shortly after, L1 gets the interrupt while it is itself running, not as an exit from L2. The cost is an extra L1 exit (the interrupt window). Signed-off-by: Nadav Har'El n...@il.ibm.com --- arch/x86/kvm/vmx.c | 33 + 1 file changed, 33 insertions(+) --- .before/arch/x86/kvm/vmx.c 2010-12-08 18:56:51.0 +0200 +++ .after/arch/x86/kvm/vmx.c 2010-12-08 18:56:51.0 +0200 @@ -3462,9 +3462,25 @@ out: return ret; } +/* + * In nested virtualization, check if L1 asked to exit on external interrupts. + * For most existing hypervisors, this will always return true. 
+ */ +static bool nested_exit_on_intr(struct kvm_vcpu *vcpu) +{ + return get_vmcs12_fields(vcpu)-pin_based_vm_exec_control + PIN_BASED_EXT_INTR_MASK; +} + static void enable_irq_window(struct kvm_vcpu *vcpu) { u32 cpu_based_vm_exec_control; + if (is_guest_mode(vcpu) nested_exit_on_intr(vcpu)) + /* We can get here when nested_run_pending caused +* vmx_interrupt_allowed() to return false. In this case, do +* nothing - the interrupt will be injected later. +*/ + return; cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL); cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING; @@ -3578,6 +3594,13 @@ static void vmx_set_nmi_mask(struct kvm_ static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu) { + if (is_guest_mode(vcpu) nested_exit_on_intr(vcpu)) { + if (to_vmx(vcpu)-nested.nested_run_pending) + return 0; + nested_vmx_vmexit(vcpu, true); + /* fall through to normal code, but now in L1, not L2 */ + } + return (vmcs_readl(GUEST_RFLAGS) X86_EFLAGS_IF) !(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS)); @@ -5126,6 +5149,14 @@ static int vmx_handle_exit(struct kvm_vc if (enable_ept is_paging(vcpu)) vcpu-arch.cr3 = vmcs_readl(GUEST_CR3); + /* +* the KVM_REQ_EVENT optimization bit is only on for one entry, and if +* we did not inject a still-pending event to L1 now because of +* nested_run_pending, we need to re-enable this bit. 
+*/ + if (vmx-nested.nested_run_pending) + kvm_make_request(KVM_REQ_EVENT, vcpu); + if (exit_reason == EXIT_REASON_VMLAUNCH || exit_reason == EXIT_REASON_VMRESUME) vmx-nested.nested_run_pending = 1; @@ -5317,6 +5348,8 @@ static void vmx_complete_interrupts(stru static void vmx_cancel_injection(struct kvm_vcpu *vcpu) { + if (is_guest_mode(vcpu)) + return; __vmx_complete_interrupts(to_vmx(vcpu), vmcs_read32(VM_ENTRY_INTR_INFO_FIELD), VM_ENTRY_INSTRUCTION_LEN, -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
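The decision flow described above — inject directly only when L1 did not ask to intercept external interrupts, exit to L1 first when it did, and do nothing while a nested entry is pending — can be written as one small routing function (the names and the three-way enum are illustrative, not KVM API):

```c
#include <assert.h>
#include <stdbool.h>

enum intr_action {
	INJECT_TO_L2,   /* normal path: deliver to the running guest */
	EXIT_TO_L1,     /* leave L2 first; L1 intercepts external interrupts */
	DEFER           /* nested entry pending: retry on a later exit */
};

/* Distillation of the vmx_interrupt_allowed()/enable_irq_window() logic. */
static enum intr_action route_external_interrupt(bool guest_mode,
						 bool l1_exits_on_intr,
						 bool nested_run_pending)
{
	if (guest_mode && l1_exits_on_intr)
		return nested_run_pending ? DEFER : EXIT_TO_L1;
	return INJECT_TO_L2;
}
```

The EXIT_TO_L1 leg is the simplification the commit message admits to: instead of synthesizing an external-interrupt exit into L1, the code exits to L1 and lets the ordinary interrupt-window machinery deliver the interrupt there, at the cost of one extra exit.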
[PATCH 22/28] nVMX: Correct handling of exception injection
Similar to the previous patch, but concerning injection of exceptions rather than external interrupts.

Signed-off-by: Nadav Har'El <n...@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:51.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:51.000000000 +0200
@@ -1491,6 +1491,25 @@ static void skip_emulated_instruction(st
 	vmx_set_interrupt_shadow(vcpu, 0);
 }
 
+/*
+ * KVM wants to inject page-faults which it got to the guest. This function
+ * checks whether in a nested guest, we need to inject them to L1 or L2.
+ * This function assumes it is called with the exit reason in vmcs02 being
+ * a #PF exception (this is the only case in which KVM injects a #PF when L2
+ * is running).
+ */
+static int nested_pf_handled(struct kvm_vcpu *vcpu)
+{
+	struct vmcs_fields *vmcs12 = get_vmcs12_fields(vcpu);
+
+	/* TODO: also check PFEC_MATCH/MASK, not just EB.PF. */
+	if (!(vmcs12->exception_bitmap & (1u << PF_VECTOR)))
+		return 0;
+
+	nested_vmx_vmexit(vcpu, false);
+	return 1;
+}
+
 static void vmx_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
 				bool has_error_code, u32 error_code,
 				bool reinject)
@@ -1498,6 +1517,10 @@ static void vmx_queue_exception(struct k
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u32 intr_info = nr | INTR_INFO_VALID_MASK;
 
+	if (nr == PF_VECTOR && is_guest_mode(vcpu) &&
+	    nested_pf_handled(vcpu))
+		return;
+
 	if (has_error_code) {
 		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, error_code);
 		intr_info |= INTR_INFO_DELIVER_CODE_MASK;
@@ -3533,6 +3556,9 @@ static void vmx_inject_nmi(struct kvm_vc
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
+	if (is_guest_mode(vcpu))
+		return;
+
 	if (!cpu_has_virtual_nmis()) {
 		/*
 		 * Tracking the NMI-blocked state in software is built upon
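nested_pf_handled() reduces to one bitmap test: a #PF that KVM wants to inject while L2 runs is rerouted to L1 exactly when L1 set bit 14 of its exception bitmap. A minimal model of that predicate (function name is illustrative):

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

#define PF_VECTOR 14

/*
 * Analogue of the nested_pf_handled() test. The real code carries a TODO:
 * the page-fault error-code match/mask fields should also be consulted,
 * not just the exception bitmap bit.
 */
static bool pf_goes_to_l1(uint32_t l1_exception_bitmap)
{
	return (l1_exception_bitmap & (1u << PF_VECTOR)) != 0;
}
```

When the predicate is true, the queueing path returns early and a nested vmexit is performed instead, so L1 — not L2 — sees the fault.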
[PATCH 23/28] nVMX: Correct handling of idt vectoring info
This patch adds correct handling of IDT_VECTORING_INFO_FIELD for the nested case. When a guest exits while handling an interrupt or exception, we get this information in IDT_VECTORING_INFO_FIELD in the VMCS. When L2 exits to L1, there's nothing we need to do, because L1 will see this field in vmcs12, and handle it itself. However, when L2 exits and L0 handles the exit itself and plans to return to L2, L0 must inject this event to L2. In the normal non-nested case, the idt_vectoring_info is discovered after the exit, and the decision to inject (though not the injection itself) is made at that point. However, in the nested case a decision of whether to return to L2 or L1 also happens during the injection phase (see the previous patches), so in the nested case we can only decide what to do about the idt_vectoring_info right after the injection, i.e., in the beginning of vmx_vcpu_run, which is the first time we know for sure if we're staying in L2 (i.e., nested_mode is true). Signed-off-by: Nadav Har'El n...@il.ibm.com --- arch/x86/kvm/vmx.c | 32 1 file changed, 32 insertions(+) --- .before/arch/x86/kvm/vmx.c 2010-12-08 18:56:51.0 +0200 +++ .after/arch/x86/kvm/vmx.c 2010-12-08 18:56:51.0 +0200 @@ -335,6 +335,10 @@ struct nested_vmx { unsigned long l1_arch_cr3; /* L2 must run next, and mustn't decide to exit to L1.
*/ bool nested_run_pending; + /* true if last exit was of L2, and had a valid idt_vectoring_info */ + bool valid_idt_vectoring_info; + /* These are saved if valid_idt_vectoring_info */ + u32 vm_exit_instruction_len, idt_vectoring_error_code; }; struct vcpu_vmx { @@ -5384,6 +5388,22 @@ static void vmx_cancel_injection(struct vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0); } +static void nested_handle_valid_idt_vectoring_info(struct vcpu_vmx *vmx) +{ + int irq = vmx->idt_vectoring_info & VECTORING_INFO_VECTOR_MASK; + int type = vmx->idt_vectoring_info & VECTORING_INFO_TYPE_MASK; + int errCodeValid = vmx->idt_vectoring_info & + VECTORING_INFO_DELIVER_CODE_MASK; + vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, + irq | type | INTR_INFO_VALID_MASK | errCodeValid); + + vmcs_write32(VM_ENTRY_INSTRUCTION_LEN, + vmx->nested.vm_exit_instruction_len); + if (errCodeValid) + vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, + vmx->nested.idt_vectoring_error_code); +} + static inline void sync_cached_regs_to_vmcs(struct kvm_vcpu *vcpu) { if (test_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_dirty)) @@ -5405,6 +5425,9 @@ static void vmx_vcpu_run(struct kvm_vcpu { struct vcpu_vmx *vmx = to_vmx(vcpu); + if (is_guest_mode(vcpu) && vmx->nested.valid_idt_vectoring_info) + nested_handle_valid_idt_vectoring_info(vmx); + /* Record the guest's net vcpu time for enforced NMI injections.
*/ if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked)) vmx->entry_time = ktime_get(); @@ -5525,6 +5548,15 @@ static void vmx_vcpu_run(struct kvm_vcpu vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD); + vmx->nested.valid_idt_vectoring_info = is_guest_mode(vcpu) && + (vmx->idt_vectoring_info & VECTORING_INFO_VALID_MASK); + if (vmx->nested.valid_idt_vectoring_info) { + vmx->nested.vm_exit_instruction_len = + vmcs_read32(VM_EXIT_INSTRUCTION_LEN); + vmx->nested.idt_vectoring_error_code = + vmcs_read32(IDT_VECTORING_ERROR_CODE); + } + asm("mov %0, %%ds; mov %0, %%es" : : "r"(__USER_DS)); vmx->launched = 1; -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
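The reinjection the patch performs is a mask-and-reassemble on the 32-bit vectoring-info word. A standalone sketch of just that step; the mask values mirror the VMX layout (vector in bits 0-7, type in bits 8-10, error-code-valid in bit 11, valid in bit 31) but are reproduced here from memory, so treat them as illustrative rather than authoritative:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative masks; the real constants live in arch/x86/include/asm/vmx.h. */
#define VECTORING_INFO_VECTOR_MASK        0xffu
#define VECTORING_INFO_TYPE_MASK          0x700u
#define VECTORING_INFO_DELIVER_CODE_MASK  0x800u
#define INTR_INFO_VALID_MASK              0x80000000u

/* Rebuild a VM_ENTRY_INTR_INFO_FIELD value from a saved
 * IDT_VECTORING_INFO_FIELD, as nested_handle_valid_idt_vectoring_info()
 * does: keep the vector, type and error-code-valid bits, set the valid bit. */
static uint32_t reinject_info(uint32_t idt_vectoring_info)
{
    uint32_t irq  = idt_vectoring_info & VECTORING_INFO_VECTOR_MASK;
    uint32_t type = idt_vectoring_info & VECTORING_INFO_TYPE_MASK;
    uint32_t err  = idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK;

    return irq | type | err | INTR_INFO_VALID_MASK;
}
```

Bits the hardware left clear (for example the error-code-valid bit for an external interrupt) simply stay clear in the reassembled word.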
[PATCH 24/28] nVMX: Handling of CR0 and CR4 modifying instructions
When L2 tries to modify CR0 or CR4 (with mov or clts), and modifies a bit which L1 asked to shadow (via CR[04]_GUEST_HOST_MASK), we already do the right thing: we let L1 handle the trap (see nested_vmx_exit_handled_cr() in a previous patch). When L2 modifies bits that L1 doesn't care about, we let it think (via CR[04]_READ_SHADOW) that it did these modifications, while only changing (in GUEST_CR[04]) the bits that L0 doesn't shadow. This is needed for correct handling of CR0.TS for lazy FPU loading: L0 may want to leave TS on, while pretending to allow the guest to change it. Signed-off-by: Nadav Har'El n...@il.ibm.com --- arch/x86/kvm/vmx.c | 54 --- 1 file changed, 51 insertions(+), 3 deletions(-) --- .before/arch/x86/kvm/vmx.c 2010-12-08 18:56:51.0 +0200 +++ .after/arch/x86/kvm/vmx.c 2010-12-08 18:56:51.0 +0200 @@ -3877,6 +3877,54 @@ static void complete_insn_gp(struct kvm_ skip_emulated_instruction(vcpu); } +/* called to set cr0 as appropriate for a mov-to-cr0 exit. */ +static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val) +{ + if (is_guest_mode(vcpu)) { + /* +* We get here when L2 changed cr0 in a way that did not change +* any of L1's shadowed bits (see nested_vmx_exit_handled_cr), +* but did change L0 shadowed bits. This can currently happen +* with the TS bit: L0 may want to leave TS on (for lazy fpu +* loading) while pretending to allow the guest to change it.
+*/ + vmcs_writel(GUEST_CR0, + (val & vcpu->arch.cr0_guest_owned_bits) | + (vmcs_readl(GUEST_CR0) & ~vcpu->arch.cr0_guest_owned_bits)); + vmcs_writel(CR0_READ_SHADOW, val); + vcpu->arch.cr0 = val; + return 0; + } else + return kvm_set_cr0(vcpu, val); +} + +static int handle_set_cr4(struct kvm_vcpu *vcpu, unsigned long val) +{ + if (is_guest_mode(vcpu)) { + vmcs_writel(GUEST_CR4, + (val & vcpu->arch.cr4_guest_owned_bits) | + (vmcs_readl(GUEST_CR4) & ~vcpu->arch.cr4_guest_owned_bits)); + vmcs_writel(CR4_READ_SHADOW, val); + vcpu->arch.cr4 = val; + return 0; + } else + return kvm_set_cr4(vcpu, val); +} + + +/* called to set cr0 as appropriate for clts instruction exit. */ +static void handle_clts(struct kvm_vcpu *vcpu) +{ + if (is_guest_mode(vcpu)) { + /* As in handle_set_cr0(), we can't call vmx_set_cr0 here */ + vmcs_writel(GUEST_CR0, vmcs_readl(GUEST_CR0) & ~X86_CR0_TS); + vmcs_writel(CR0_READ_SHADOW, + vmcs_readl(CR0_READ_SHADOW) & ~X86_CR0_TS); + vcpu->arch.cr0 &= ~X86_CR0_TS; + } else + vmx_set_cr0(vcpu, kvm_read_cr0_bits(vcpu, ~X86_CR0_TS)); +} + static int handle_cr(struct kvm_vcpu *vcpu) { unsigned long exit_qualification, val; @@ -3893,7 +3941,7 @@ static int handle_cr(struct kvm_vcpu *vc trace_kvm_cr_write(cr, val); switch (cr) { case 0: - err = kvm_set_cr0(vcpu, val); + err = handle_set_cr0(vcpu, val); complete_insn_gp(vcpu, err); return 1; case 3: @@ -3901,7 +3949,7 @@ static int handle_cr(struct kvm_vcpu *vc complete_insn_gp(vcpu, err); return 1; case 4: - err = kvm_set_cr4(vcpu, val); + err = handle_set_cr4(vcpu, val); complete_insn_gp(vcpu, err); return 1; case 8: { @@ -3919,7 +3967,7 @@ static int handle_cr(struct kvm_vcpu *vc }; break; case 2: /* clts */ - vmx_set_cr0(vcpu, kvm_read_cr0_bits(vcpu, ~X86_CR0_TS)); + handle_clts(vcpu); trace_kvm_cr_write(0, kvm_read_cr0(vcpu)); skip_emulated_instruction(vcpu); vmx_fpu_activate(vcpu); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at
http://vger.kernel.org/majordomo-info.html
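The core of handle_set_cr0()/handle_set_cr4() is a single merge expression: guest-owned bits come from the value the guest wrote, all other bits keep L0's current view of the register. A minimal sketch of just that merge (the X86_CR0_TS value is the real one; the other register values below are made up for illustration):

```c
#include <assert.h>

#define X86_CR0_TS 0x8ul   /* bit 3 of CR0 */

/* Sketch of the merge the nested mov-to-cr0 path performs: bits listed in
 * guest_owned_bits are taken from the new value the guest wrote; every
 * other bit keeps whatever L0 currently has in GUEST_CR0. */
static unsigned long merge_cr0(unsigned long hw_cr0, unsigned long val,
                               unsigned long guest_owned_bits)
{
    return (val & guest_owned_bits) | (hw_cr0 & ~guest_owned_bits);
}
```

With TS guest-owned, a guest write that clears TS really clears it in hardware; with TS shadowed by L0, the hardware bit is left alone and only the read shadow would change.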
[PATCH 25/28] nVMX: Further fixes for lazy FPU loading
KVM's Lazy FPU loading means that sometimes L0 needs to set CR0.TS, even if a guest didn't set it. Moreover, L0 must also trap CR0.TS changes and NM exceptions, even if we have a guest hypervisor (L1) who didn't want these traps. And of course, conversely: If L1 wanted to trap these events, we must let it, even if L0 is not interested in them. This patch fixes some existing KVM code (in update_exception_bitmap(), vmx_fpu_activate(), vmx_fpu_deactivate()) to do the correct merging of L0's and L1's needs. Note that handle_cr() was already fixed in the above patch, and that new code introduced in previous patches already handles CR0 correctly (see prepare_vmcs02(), prepare_vmcs12(), and nested_vmx_vmexit()). Signed-off-by: Nadav Har'El n...@il.ibm.com --- arch/x86/kvm/vmx.c | 38 +++--- 1 file changed, 35 insertions(+), 3 deletions(-) --- .before/arch/x86/kvm/vmx.c 2010-12-08 18:56:52.0 +0200 +++ .after/arch/x86/kvm/vmx.c 2010-12-08 18:56:52.0 +0200 @@ -1098,6 +1098,15 @@ static void update_exception_bitmap(stru eb &= ~(1u << PF_VECTOR); /* bypass_guest_pf = 0 */ if (vcpu->fpu_active) eb &= ~(1u << NM_VECTOR); + + /* When we are running a nested L2 guest and L1 specified for it a +* certain exception bitmap, we must trap the same exceptions and pass +* them to L1. When running L2, we will only handle the exceptions +* specified above if L1 did not want them. +*/ + if (is_guest_mode(vcpu)) + eb |= get_vmcs12_fields(vcpu)->exception_bitmap; + vmcs_write32(EXCEPTION_BITMAP, eb); } @@ -1415,8 +1424,19 @@ static void vmx_fpu_activate(struct kvm_ cr0 &= ~(X86_CR0_TS | X86_CR0_MP); cr0 |= kvm_read_cr0_bits(vcpu, X86_CR0_TS | X86_CR0_MP); vmcs_writel(GUEST_CR0, cr0); - update_exception_bitmap(vcpu); vcpu->arch.cr0_guest_owned_bits = X86_CR0_TS; + if (is_guest_mode(vcpu)) { + /* While we (L0) no longer care about NM exceptions or cr0.TS +* changes, our guest hypervisor (L1) might care in which case +* we must trap them for it.
+*/ + u32 eb = vmcs_read32(EXCEPTION_BITMAP) & ~(1u << NM_VECTOR); + struct vmcs_fields *vmcs12 = get_vmcs12_fields(vcpu); + eb |= vmcs12->exception_bitmap; + vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask; + vmcs_write32(EXCEPTION_BITMAP, eb); + } else + update_exception_bitmap(vcpu); vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits); } @@ -1424,12 +1444,24 @@ static void vmx_decache_cr0_guest_bits(s static void vmx_fpu_deactivate(struct kvm_vcpu *vcpu) { + /* Note that there is no vcpu->fpu_active = 0 here. The caller must +* set this *before* calling this function. +*/ vmx_decache_cr0_guest_bits(vcpu); vmcs_set_bits(GUEST_CR0, X86_CR0_TS | X86_CR0_MP); - update_exception_bitmap(vcpu); + vmcs_write32(EXCEPTION_BITMAP, + vmcs_read32(EXCEPTION_BITMAP) | (1u << NM_VECTOR)); vcpu->arch.cr0_guest_owned_bits = 0; vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits); - vmcs_writel(CR0_READ_SHADOW, vcpu->arch.cr0); + if (is_guest_mode(vcpu)) + /* Unfortunately in nested mode we play with arch.cr0's PG +* bit, so we mustn't copy it all, just the relevant TS bit +*/ + vmcs_writel(CR0_READ_SHADOW, + (vmcs_readl(CR0_READ_SHADOW) & ~X86_CR0_TS) | + (vcpu->arch.cr0 & X86_CR0_TS)); + else + vmcs_writel(CR0_READ_SHADOW, vcpu->arch.cr0); } static unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
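The exception-bitmap merge in vmx_fpu_activate() has a simple shape: drop the bit L0 no longer needs, then OR in everything L1 asked for. A standalone sketch (NM_VECTOR really is vector 7; the bitmap values in the test are invented):

```c
#include <assert.h>
#include <stdint.h>

#define NM_VECTOR 7   /* #NM, device-not-available */

/* When L0 activates the guest FPU it stops caring about #NM for itself,
 * but must still trap whatever L1 asked for - a sketch of the merge done
 * for the nested case in vmx_fpu_activate(). */
static uint32_t merge_exception_bitmap(uint32_t l0_eb, uint32_t l1_eb)
{
    l0_eb &= ~(1u << NM_VECTOR);   /* L0 no longer needs #NM ... */
    return l0_eb | l1_eb;          /* ... unless L1 wants it */
}
```

So #NM stays trapped exactly when L1's bitmap has it set, which is the "conversely" case the changelog describes.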
[PATCH 26/28] nVMX: Additional TSC-offset handling
In the unlikely case that L1 does not capture MSR_IA32_TSC, L0 needs to emulate this MSR write by L2 by modifying vmcs02.tsc_offset. We also need to set vmcs12.tsc_offset, for this change to survive the next nested entry (see prepare_vmcs02()). Signed-off-by: Nadav Har'El n...@il.ibm.com --- arch/x86/kvm/vmx.c | 11 +++ 1 file changed, 11 insertions(+) --- .before/arch/x86/kvm/vmx.c 2010-12-08 18:56:52.0 +0200 +++ .after/arch/x86/kvm/vmx.c 2010-12-08 18:56:52.0 +0200 @@ -1665,12 +1665,23 @@ static u64 guest_read_tsc(void) static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset) { vmcs_write64(TSC_OFFSET, offset); + if (is_guest_mode(vcpu)) + /* +* We are only changing TSC_OFFSET when L2 is running if for +* some reason L1 chose not to trap the TSC MSR. Since +* prepare_vmcs12() does not copy tsc_offset, we need to also +* set the vmcs12 field here. +*/ + get_vmcs12_fields(vcpu)->tsc_offset = offset - + to_vmx(vcpu)->nested.vmcs01_fields->tsc_offset; } static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment) { u64 offset = vmcs_read64(TSC_OFFSET); vmcs_write64(TSC_OFFSET, offset + adjustment); + if (is_guest_mode(vcpu)) + get_vmcs12_fields(vcpu)->tsc_offset += adjustment; } /* -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
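The arithmetic in the patch follows from the composition of offsets: the offset programmed into vmcs02 is (to a first approximation) L0's offset for L1 plus L1's offset for L2. So when L0 rewrites the combined offset while L2 runs, the vmcs12 value it must record is the difference. A tiny sketch of that relation (the numbers in the test are arbitrary):

```c
#include <assert.h>
#include <stdint.h>

/* vmcs02.tsc_offset ~= vmcs01.tsc_offset + vmcs12.tsc_offset, so given a
 * new combined offset, the vmcs12 value to record is the difference -
 * exactly what vmx_write_tsc_offset() computes in the nested case. */
static int64_t vmcs12_tsc_offset(int64_t new_vmcs02_offset,
                                 int64_t vmcs01_offset)
{
    return new_vmcs02_offset - vmcs01_offset;
}
```

The round trip holds by construction: adding vmcs01's offset back to the recorded vmcs12 value reproduces the combined offset, which is why the change survives the next prepare_vmcs02().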
[PATCH 27/28] nVMX: Miscellaneous small corrections
Small corrections of KVM (spelling, etc.) not directly related to nested VMX. Signed-off-by: Nadav Har'El n...@il.ibm.com --- arch/x86/kvm/vmx.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- .before/arch/x86/kvm/vmx.c 2010-12-08 18:56:52.0 +0200 +++ .after/arch/x86/kvm/vmx.c 2010-12-08 18:56:52.0 +0200 @@ -933,7 +933,7 @@ static void vmcs_load(struct vmcs *vmcs) : "=g"(error) : "a"(phys_addr), "m"(phys_addr) : "cc", "memory"); if (error) - printk(KERN_ERR "kvm: vmptrld %p/%llx fail\n", + printk(KERN_ERR "kvm: vmptrld %p/%llx failed\n", vmcs, phys_addr); } -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 28/28] nVMX: Documentation
This patch includes a brief introduction to the nested vmx feature in the Documentation/kvm directory. The document also includes a copy of the vmcs12 structure, as requested by Avi Kivity. Signed-off-by: Nadav Har'El n...@il.ibm.com --- Documentation/kvm/nested-vmx.txt | 237 + 1 file changed, 237 insertions(+) --- .before/Documentation/kvm/nested-vmx.txt2010-12-08 18:56:52.0 +0200 +++ .after/Documentation/kvm/nested-vmx.txt 2010-12-08 18:56:52.0 +0200 @@ -0,0 +1,237 @@ +Nested VMX +== + +Overview +- + +On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions) +to easily and efficiently run guest operating systems. Normally, these guests +*cannot* themselves be hypervisors running their own guests, because in VMX, +guests cannot use VMX instructions. + +The Nested VMX feature adds this missing capability - of running guest +hypervisors (which use VMX) with their own nested guests. It does so by +allowing a guest to use VMX instructions, and correctly and efficiently +emulating them using the single level of VMX available in the hardware. + +We describe in much greater detail the theory behind the nested VMX feature, +its implementation and its performance characteristics, in the OSDI 2010 paper +The Turtles Project: Design and Implementation of Nested Virtualization, +available at: + + http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf + + +Terminology +--- + +Single-level virtualization has two levels - the host (KVM) and the guests. +In nested virtualization, we have three levels: The host (KVM), which we call +L0, the guest hypervisor, which we call L1, and the nested guest, which we +call L2. + + +Known limitations +- + +The current code supports running Linux under a guest KVM using shadow +page tables. It supports multiple guest hypervisors, each can run multiple +guests. Only 64-bit guest hypervisors are supported. SMP is supported, but +is known to be buggy in this release. 
+Additional patches for running Windows under guest KVM, and Linux under +guest VMware server, and support for nested EPT, are currently running in +the lab, and will be sent as follow-on patchsets. + + +Running nested VMX +-- + +The nested VMX feature is disabled by default. It can be enabled by giving +the nested=1 option to the kvm-intel module. + + +ABIs + + +Nested VMX aims to present a standard and (eventually) fully-functional VMX +implementation for the a guest hypervisor to use. As such, the official +specification of the ABI that it provides is Intel's VMX specification, +namely volume 3B of their Intel 64 and IA-32 Architectures Software +Developer's Manual. Not all of VMX's features are currently fully supported, +but the goal is to eventually support them all, starting with the VMX features +which are used in practice by popular hypervisors (KVM and others). + +As a VMX implementation, nested VMX presents a VMCS structure to L1. +As mandated by the spec, other than the two fields revision_id and abort, +this structure is *opaque* to its user, who is not supposed to know or care +about its internal structure. Rather, the structure is accessed through the +VMREAD and VMWRITE instructions. +Still, for debugging purposes, KVM developers might be interested to know the +internals of this structure; This is struct vmcs12 from arch/x86/kvm/vmx.c. +For convenience, we repeat its content here. If the internals of this structure +changes, this can break live migration across KVM versions. VMCS12_REVISION +(from vmx.c) should be changed if struct vmcs12 or its inner struct shadow_vmcs +is ever changed. + +struct __packed vmcs12 { + /* According to the Intel spec, a VMCS region must start with the +* following two fields. Then follow implementation-specific data. 
+*/ + u32 revision_id; + u32 abort; + + struct shadow_vmcs shadow_vmcs; + + bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */ + + int cpu; + int launched; +} + +struct __packed shadow_vmcs { + u16 virtual_processor_id; + u16 guest_es_selector; + u16 guest_cs_selector; + u16 guest_ss_selector; + u16 guest_ds_selector; + u16 guest_fs_selector; + u16 guest_gs_selector; + u16 guest_ldtr_selector; + u16 guest_tr_selector; + u16 host_es_selector; + u16 host_cs_selector; + u16 host_ss_selector; + u16 host_ds_selector; + u16 host_fs_selector; + u16 host_gs_selector; + u16 host_tr_selector; + u64 io_bitmap_a; + u64 io_bitmap_b; + u64 msr_bitmap; + u64 vm_exit_msr_store_addr; + u64 vm_exit_msr_load_addr; + u64 vm_entry_msr_load_addr; + u64 tsc_offset; + u64 virtual_apic_page_addr; + u64 apic_access_addr; + u64 ept_pointer; + u64 guest_physical_address; + u64 vmcs_link_pointer; + u64
Re: [RFC PATCH 2/3] sched: add yield_to function
On 12/03/2010 08:23 AM, Peter Zijlstra wrote: On Thu, 2010-12-02 at 14:44 -0500, Rik van Riel wrote: unsigned long clone_flags); + +#ifdef CONFIG_SCHED_HRTICK +extern u64 slice_remain(struct task_struct *); +extern void yield_to(struct task_struct *); +#else +static inline void yield_to(struct task_struct *p) { yield(); } +#endif What does SCHED_HRTICK have to do with any of this? Legacy from an old prototype this patch is based on. I'll get rid of that. +/** + * requeue_task - requeue a task whose priority got changed by yield_to priority doesn't seem the right word, you're not actually changing anything related to p->*prio True, I'll change the comment. + * @rq: the task's runqueue + * @p: the task in question + * Must be called with the runqueue lock held. Will cause the CPU to + * reschedule if p is now at the head of the runqueue. + */ +void requeue_task(struct rq *rq, struct task_struct *p) +{ + assert_spin_locked(&rq->lock); + + if (!p->se.on_rq || task_running(rq, p) || task_has_rt_policy(p)) + return; + + dequeue_task(rq, p, 0); + enqueue_task(rq, p, 0); + + resched_task(p); I guess that wants to be something like check_preempt_curr() Done. Thanks for pointing that out. @@ -6797,6 +6817,36 @@ SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len, return ret; } +#ifdef CONFIG_SCHED_HRTICK Still wondering what all this has to do with SCHED_HRTICK.. +/* + * Yield the CPU, giving the remainder of our time slice to task p. + * Typically used to hand CPU time to another thread inside the same + * process, eg. when p holds a resource other threads are waiting for. + * Giving priority to p may help get that resource released sooner.
+ */ +void yield_to(struct task_struct *p) +{ + unsigned long flags; + struct sched_entity *se = &p->se; + struct rq *rq; + struct cfs_rq *cfs_rq; + u64 remain = slice_remain(current); + + rq = task_rq_lock(p, &flags); + if (task_running(rq, p) || task_has_rt_policy(p)) + goto out; See, this all ain't nice, slice_remain() don't make no sense to be called for !fair tasks. Why not write: if (curr->sched_class == p->sched_class && curr->sched_class->yield_to) curr->sched_class->yield_to(curr, p); or something, and then implement sched_class_fair::yield_to only, leaving it a NOP for all other classes. Done. + cfs_rq = cfs_rq_of(se); + se->vruntime -= remain; + if (se->vruntime < cfs_rq->min_vruntime) + se->vruntime = cfs_rq->min_vruntime; Now here we have another problem, remain was measured in wall-time, and then you go change a virtual time measure using that. These things are related like: vt = t/weight So you're missing a weight factor somewhere. Also, that check against min_vruntime doesn't really make much sense. OK, how do I do this? + requeue_task(rq, p); Just makes me wonder why you added requeue task to begin with.. why not simply dequeue at the top of this function, and enqueue at the tail, like all the rest does: see rt_mutex_setprio(), set_user_nice(), sched_move_task(). Done. + out: + task_rq_unlock(rq, &flags); + yield(); +} +EXPORT_SYMBOL(yield_to); EXPORT_SYMBOL_GPL() pretty please, I really hate how kvm is a module and needs to export hooks all over the core kernel :/ Done. Right, so another approach might be to simply swap the vruntime between curr and p. Doesn't that run into the same scale issue you described above? -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
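Peter's weight-factor objection can be stated concretely: in CFS, vruntime advances at a rate inversely proportional to the task's load weight, so a wall-clock remainder cannot be subtracted from a vruntime directly; it first has to be scaled, roughly the way calc_delta_fair() does. A sketch of that scaling, assuming the conventional NICE_0_LOAD of 1024 (the real kernel uses fixed-point inverse weights, so this is a simplification):

```c
#include <assert.h>
#include <stdint.h>

#define NICE_0_LOAD 1024u   /* load weight of a nice-0 task (assumed here) */

/* Convert a wall-clock delta (ns) into the vruntime delta it corresponds
 * to for a task of the given load weight: vdelta = delta * NICE_0 / weight.
 * A heavier task accrues vruntime more slowly for the same wall time. */
static uint64_t wall_to_vruntime(uint64_t delta_ns, uint32_t weight)
{
    return delta_ns * NICE_0_LOAD / weight;
}
```

Subtracting the unscaled `remain` (as the patch does) over- or under-credits any task whose weight differs from nice-0, which is exactly the missing factor Peter points at.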
kvm hangs on mkfs
Hi, I tried a relatively simple task with qemu-kvm. I have two qcow hd images and try to create filesystems on them using a gentoo installation disk. Starting qemu with: qemu -m 512 -cdrom install-x86-minimal-20101116.iso -hda hda.img -hdb hdb.img However, mkfs always hangs indefinitely. Doesn't really matter if ext2/3/4, it always hangs at Writing superblocks and filesystem accounting information: Any idea where to start looking for the problem? (please cc me as I'm not subscribed to this list) -- Hanno Böck Blog: http://www.hboeck.de/ GPG: 3DBD3B20 Jabber/Mail:ha...@hboeck.de http://schokokeks.org - professional webhosting signature.asc Description: This is a digitally signed message part.
Re: [PATCH v2 1/2] Do not register kvmclock savevm section if kvmclock is disabled.
On Tue, Dec 07, 2010 at 03:12:36PM -0200, Glauber Costa wrote: On Mon, 2010-12-06 at 19:04 -0200, Marcelo Tosatti wrote: On Mon, Dec 06, 2010 at 09:03:46AM -0500, Glauber Costa wrote: Usually nobody thinks about that scenario (me included, especially), but kvmclock can actually be disabled in the host. It happens in two scenarios: 1. host too old. 2. we passed -kvmclock to our -cpu parameter. In both cases, we should not register the kvmclock savevm section. This patch achieves that by registering this section only if kvmclock is actually currently enabled in cpuid. The only caveat is that we have to register the savevm section a little bit later, since we won't know the final kvmclock state before cpuid gets parsed. What is the problem of registering the section? Restoring the value if the host does not support it returns an error? Can't you ignore the error if kvmclock is not reported in cpuid, in the restore handler? We can change the restore handler, but not the restore handler of binaries that are already out there. The motivation here is precisely to address migration to hosts without kvmclock, so it's better to have a way to disable it, than to count on the fact that the other side will be able to ignore it. OK. Can't you register conditionally on the kvmclock cpuid bit at the end of kvm_arch_init_vcpu, in target-i386/kvm.c? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] KVM: Fix missing lock for kvm_io_bus_unregister_dev()
On Wed, Dec 08, 2010 at 01:45:06AM +0900, Takuya Yoshikawa wrote: Memo: - kvm_io_bus_register_dev() was protected as far as I checked. - kvm_create_pit() was commented like "Caller must hold slots_lock" but kvm_free_pit() was not. So I don't know if I should protect the whole kvm_free_pit(). What is the best fix? -- or I misread something? Thanks, Takuya === From: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp The comment for kvm_io_bus_unregister_dev() says "Caller must hold slots_lock" but some callers don't do so. Though this patch fixes these, a more consistent locking scheme may be needed in the future. Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp --- arch/x86/kvm/i8254.c |2 ++ arch/x86/kvm/i8259.c |2 ++ arch/x86/kvm/x86.c |2 ++ virt/kvm/ioapic.c|2 ++ 4 files changed, 8 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c index efad723..f64393c 100644 --- a/arch/x86/kvm/i8254.c +++ b/arch/x86/kvm/i8254.c @@ -744,9 +744,11 @@ void kvm_free_pit(struct kvm *kvm) struct hrtimer *timer; if (kvm->arch.vpit) { + mutex_lock(&kvm->slots_lock); kvm_io_bus_unregister_dev(kvm, KVM_PIO_BUS, &kvm->arch.vpit->dev); kvm_io_bus_unregister_dev(kvm, KVM_PIO_BUS, &kvm->arch.vpit->speaker_dev); + mutex_unlock(&kvm->slots_lock); kvm_unregister_irq_mask_notifier(kvm, 0, &kvm->arch.vpit->mask_notifier); kvm_unregister_irq_ack_notifier(kvm, This is supposedly safe because this is only called from vm destroy context, when dropping the last reference.
diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c index f628234..d9fe35d 100644 --- a/arch/x86/kvm/i8259.c +++ b/arch/x86/kvm/i8259.c @@ -596,7 +596,9 @@ void kvm_destroy_pic(struct kvm *kvm) struct kvm_pic *vpic = kvm->arch.vpic; if (vpic) { + mutex_lock(&kvm->slots_lock); kvm_io_bus_unregister_dev(kvm, KVM_PIO_BUS, &vpic->dev); + mutex_unlock(&kvm->slots_lock); kvm->arch.vpic = NULL; kfree(vpic); } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index ed373ba..48e59d1 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3313,8 +3313,10 @@ long kvm_arch_vm_ioctl(struct file *filp, if (vpic) { r = kvm_ioapic_init(kvm); if (r) { + mutex_lock(&kvm->slots_lock); kvm_io_bus_unregister_dev(kvm, KVM_PIO_BUS, &vpic->dev); + mutex_unlock(&kvm->slots_lock); kfree(vpic); goto create_irqchip_unlock; } diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c index 0b9df83..fa76380 100644 --- a/virt/kvm/ioapic.c +++ b/virt/kvm/ioapic.c @@ -409,7 +409,9 @@ void kvm_ioapic_destroy(struct kvm *kvm) struct kvm_ioapic *ioapic = kvm->arch.vioapic; if (ioapic) { + mutex_lock(&kvm->slots_lock); kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS, &ioapic->dev); + mutex_unlock(&kvm->slots_lock); kvm->arch.vioapic = NULL; kfree(ioapic); } -- 1.7.1 It seems the best way to fix is to move irq_lock and slots_lock acquisition from kvm_set_irq_routing/kvm_ioapic_destroy/kvm_destroy_pic to their callers. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm hangs on mkfs
On Wednesday, December 08, 2010 01:09:25 pm Hanno Böck wrote: Hi, I tried a relatively simple task with qemu-kvm. I have two qcow hd images and try to create filesystems on them using a gentoo installation disk. qcow2 (I hope you are using that vs just qcow) is known to be a tad on the slow side on metadata heavy operations (i.e. mkfs, installing lots of files, etc.). One trick some of us use is to use the -drive syntax (vs -hda) and set the cache option to unsafe or writeback for the install process. The other alternative is to use preallocated raw images (i.e. made with dd vs qemu-img). I've been informed that in 0.12.5 the writeback trick won't do any good due to some extra fsync()s. So your best bet is to upgrade to 0.13 and use cache=unsafe. Starting qemu with: qemu -m 512 -cdrom install-x86-minimal-20101116.iso -hda hda.img -hdb hdb.img However, mkfs always hangs indefinitely. Doesn't really matter if ext2/3/4, it always hangs at Writing superblocks and filesystem accounting information: Have you tried strace'ing to see if it's actually doing something (just very slowly)? Any idea where to start looking for the problem? (please cc me as I'm not subscribed to this list) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/3] sched: add yield_to function
On Wed, 2010-12-08 at 12:55 -0500, Rik van Riel wrote: Right, so another approach might be to simply swap the vruntime between curr and p. Doesn't that run into the same scale issue you described above? Not really, but it's tricky on SMP because vruntime only has meaning within a cfs_rq. The below is quickly cobbled together from a few patches I had laying about dealing with similar issues. The avg_vruntime() stuff takes the weighted average of the vruntimes on a cfs runqueue; this weighted average corresponds to the 0-lag point, that is, the point where an ideal scheduler would have all tasks. Using the 0-lag point you can compute the lag of a task; the lag is a measure of service for a task: negative lag means the task is owed service, positive lag means it's got too much service (at least, that's the sign convention used here, I always forget what the standard is). What we do is, when @p, the target task, is owed less service than current, we flip lags such that p will become more eligible. The trouble with all this is that computing the weighted average is terribly expensive (it increases the cost of all scheduler hot paths). The dyn_vruntime stuff mixed in there is an attempt at computing something similar, although it's not used and I haven't tested the quality of the approximation in a while. Anyway, completely untested and such..
--- include/linux/sched.h |2 + kernel/sched.c | 39 ++ kernel/sched_debug.c| 31 - kernel/sched_fair.c | 179 ++- kernel/sched_features.h |8 +-- 5 files changed, 203 insertions(+), 56 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index b0fc8ee..538559e 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1095,6 +1095,8 @@ struct sched_class { #ifdef CONFIG_FAIR_GROUP_SCHED void (*task_move_group) (struct task_struct *p, int on_rq); #endif + + void (*yield_to) (struct task_struct *p); }; struct load_weight { diff --git a/kernel/sched.c b/kernel/sched.c index 0debad9..fe0adb0 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -313,6 +313,9 @@ struct cfs_rq { struct load_weight load; unsigned long nr_running; + s64 avg_vruntime; + u64 avg_load; + u64 exec_clock; u64 min_vruntime; @@ -5062,6 +5065,42 @@ SYSCALL_DEFINE0(sched_yield) return 0; } +void yield_to(struct task_struct *p) +{ + struct task_struct *curr = current; + struct rq *p_rq, *rq; + unsigned long flags; + int on_rq; + + local_irq_save(flags); + rq = this_rq(); +again: + p_rq = task_rq(p); + double_rq_lock(rq, p_rq); + if (p_rq != task_rq(p)) { + double_rq_unlock(rq, p_rq); + goto again; + } + + update_rq_clock(rq); + update_rq_clock(p_rq); + + on_rq = p-se.on_rq; + if (on_rq) + dequeue_task(p_rq, p, 0); + + ret = 0; + if (p-sched_class == curr-sched_class curr-sched_class-yield_to) + curr-sched_class-yield_to(p); + + if (on_rq) + enqueue_task(p_rq, p, 0); + + double_rq_unlock(rq, p_rq); + local_irq_restore(flags); +} +EXPORT_SYMBOL_GPL(yield_to); + static inline int should_resched(void) { return need_resched() !(preempt_count() PREEMPT_ACTIVE); diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c index 1dfae3d..b5cc773 100644 --- a/kernel/sched_debug.c +++ b/kernel/sched_debug.c @@ -138,10 +138,9 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu) void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq) { - s64 MIN_vruntime = -1, 
min_vruntime, max_vruntime = -1, - spread, rq0_min_vruntime, spread0; + s64 left_vruntime = -1, min_vruntime, right_vruntime = -1, spread; struct rq *rq = cpu_rq(cpu); - struct sched_entity *last; + struct sched_entity *last, *first; unsigned long flags; SEQ_printf(m, \ncfs_rq[%d]:\n, cpu); @@ -149,28 +148,26 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq) SPLIT_NS(cfs_rq-exec_clock)); raw_spin_lock_irqsave(rq-lock, flags); - if (cfs_rq-rb_leftmost) - MIN_vruntime = (__pick_next_entity(cfs_rq))-vruntime; + first = __pick_first_entity(cfs_rq); + if (first) + left_vruntime = first-vruntime; last = __pick_last_entity(cfs_rq); if (last) - max_vruntime = last-vruntime; + right_vruntime = last-vruntime; min_vruntime = cfs_rq-min_vruntime; - rq0_min_vruntime = cpu_rq(0)-cfs.min_vruntime; raw_spin_unlock_irqrestore(rq-lock, flags); - SEQ_printf(m, .%-30s: %Ld.%06ld\n, MIN_vruntime, - SPLIT_NS(MIN_vruntime)); + SEQ_printf(m, .%-30s: %Ld.%06ld\n, left_vruntime, +
Re: [RFC PATCH 2/3] sched: add yield_to function
On Wed, 2010-12-08 at 21:00 +0100, Peter Zijlstra wrote: + lag0 = avg_vruntime(cfs_rq_of(se)); + p_lag0 = avg_vruntime(cfs_rq_of(p_se)); + + lag = se->vruntime - avg_vruntime(cfs_rq); + p_lag = p_se->vruntime - avg_vruntime(p_cfs_rq); + + if (p_lag > lag) { /* if P is owed less service */ + se->vruntime = lag0 + p_lag; + p_se->vruntime = p_lag + lag; + } clearly that should read: p_se->vruntime = p_lag0 + lag; -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
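With the correction applied, the lag swap can be written out as a self-contained sketch: each task's lag is measured against its own queue's 0-lag point (lag0, p_lag0), and the two lags are exchanged across those points when p holds the larger (less-owed) lag. The struct and numbers below are toys standing in for sched_entity and real vruntimes:

```c
#include <assert.h>
#include <stdint.h>

struct ent { int64_t vruntime; };   /* stand-in for sched_entity */

/* Lag swap as in the patch, with the follow-up fix applied: lags are
 * computed against each queue's own 0-lag point and, if p is owed less
 * service than current (p_lag > lag), swapped across those points. */
static void swap_lags(struct ent *se, int64_t lag0,
                      struct ent *p_se, int64_t p_lag0)
{
    int64_t lag   = se->vruntime   - lag0;
    int64_t p_lag = p_se->vruntime - p_lag0;

    if (p_lag > lag) {
        se->vruntime   = lag0   + p_lag;   /* current takes p's lag */
        p_se->vruntime = p_lag0 + lag;     /* p takes current's lag */
    }
}
```

Because every new vruntime is expressed relative to the owning queue's own 0-lag point, the swap stays meaningful even when the two tasks sit on different cfs runqueues, which is the SMP subtlety the thread is about.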
Re: Memory leaks in virtio drivers?
On Wed, Dec 1, 2010 at 10:16 AM, Freddie Cash fjwc...@gmail.com wrote: Just an update on this. We made the change over the weekend to enable cache=off for all the VMs, including the libvirt managed ones (turns out, libvirtd only reads the .xml files at startup); and enabeld KSM on the host. 5 days later, we have only 700 MB of swap used, and 15.2 GB of VM committed. This appears to be the steady-state for the host, as it hasn't changed (cache, free, and buffer change a bit throughout the day). Unfortunately, this has exposed just how horribly unoptimised our storage array underneath it. :( It's a single 12-drive RAID6, auto-carved into 2 TB chunks, and then stitched back together via LVM into a single volume group. With each VM getting it's own logical volume. We have plans over the Christmas break to re-do this as a RAID50 or possible a RAID10 + RAID50. Thanks for all the tips and pointers. I'm starting to get all this figured out and understood. There's been a lot of changes since KVM-72. :) One final update for the archives. Further research showed that our I/O setup was non-optimal for the RAID controllers were are using (3Ware 9650SE). Modifying the following sysfs variables dropper our I/O wait down to near 0, allowing the controller to keepup with the requests, and eliminating our use of swap (even without using cache=none): (for each block device) block/sda/queue/scheduler = deadline block/sda/queue/nr_requests = 512 block/sda/queue/max_sectors_kb = 128 block/sda/queue/read_ahead_kb = 16384 We're in the process of adding a separate 6-disk RAID10 array for our most I/O-heavy filesystems, which should help even more. So, it looks like our I/O setup was so horrible originally that the controller couldn't keep up, and more RAM in the host was used for buffering, until the host started swapping, and things would grind to a halt. 
Over a week later, and we're still sitting with 0 bytes swap used in the host, with several 100 MB of free RAM (fluctuates), and I/O usage rarely above 30%. -- Freddie Cash fjwc...@gmail.com
Re: kvm hangs on mkfs
On Wednesday 08 December 2010, Brian Jackson wrote: qcow2 (I hope you are using that vs just qcow) is known to be a tad on the slow side on metadata heavy operations (i.e. mkfs, installing lots of files, etc.). One trick some of us use is to use the -drive syntax (vs -hda) and set the cache option to unsafe or writeback for the install process. The other alternative is to use preallocated raw images (i.e. made with dd vs qemu-img). I've been informed that in 0.12.5 the writeback trick won't do any good due to some extra fsync()s. So your best bet is to upgrade to 0.13 and use cache=unsafe. This helped, thanks a lot. (Maybe this could be documented at a more obvious place? I tried googling quite a lot for the solution - a FAQ entry explicitly mentioning mkfs? It seems a quite common task to me) -- Hanno Böck Blog: http://www.hboeck.de/ GPG: 3DBD3B20 Jabber/Mail: ha...@hboeck.de http://schokokeks.org - professional webhosting
Re: [RFC PATCH 3/3] kvm: use yield_to instead of sleep in kvm_vcpu_on_spin
On 12/05/2010 07:56 AM, Avi Kivity wrote: + if (vcpu == me) + continue; + if (vcpu->spinning) + continue; You may well want to wake up a spinner. Suppose A takes a lock B preempts A B grabs a ticket, starts spinning, yields to A A releases lock A grabs ticket, starts spinning at this point, we want A to yield to B, but it won't because of this check. That's a good point. I guess we'll have to benchmark both with and without the vcpu->spinning logic. + if (!task) + continue; + if (waitqueue_active(&vcpu->wq)) + continue; + if (task->flags & PF_VCPU) + continue; + kvm->last_boosted_vcpu = i; + yield_to(task); + break; + } I think a random selection algorithm will be a better fit against special guest behaviour. Possibly, though I suspect we'd have to hit very heavy overcommit ratios with very large VMs before round robin stops working. - /* Sleep for 100 us, and hope lock-holder got scheduled */ - expires = ktime_add_ns(ktime_get(), 100000UL); - schedule_hrtimeout(&expires, HRTIMER_MODE_ABS); + if (first_round && last_boosted_vcpu == kvm->last_boosted_vcpu) { + /* We have not found anyone yet. */ + first_round = 0; + goto again; Need to guarantee termination. We do that by setting first_round to 0 :) We at most walk N+1 VCPUs in a VM with N VCPUs, with this patch. -- All rights reversed
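Rik's termination argument can be seen in a simplified model of the round-robin scan: starting one slot past last_boosted and visiting at most N slots bounds the walk and rotates the boost target across calls. The struct and function below are illustrative stand-ins, not the real KVM code paths.

```c
/* Simplified model of the directed-yield candidate scan discussed
 * above (not the real kvm_vcpu_on_spin). 'runnable' stands in for the
 * task/waitqueue/PF_VCPU checks in the patch. */
struct fake_vcpu {
    int runnable;
    int spinning;
};

static int pick_boost_target(struct fake_vcpu *vcpus, int n,
                             int *last_boosted, int self)
{
    int i;

    /* Visit at most n slots, starting just past the last boosted vcpu,
     * so the scan always terminates and is fair across calls. */
    for (i = 0; i < n; i++) {
        int idx = (*last_boosted + 1 + i) % n;

        if (idx == self)
            continue;
        if (vcpus[idx].spinning || !vcpus[idx].runnable)
            continue;
        *last_boosted = idx;
        return idx;     /* caller would yield_to() this vcpu's task */
    }
    return -1;          /* no candidate found */
}
```

Note the `spinning` skip is exactly the check Avi questions above; dropping it from this model is a one-line change, which makes the two variants easy to benchmark.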
Re: [RFC PATCH 2/3] sched: add yield_to function
On 12/08/2010 03:00 PM, Peter Zijlstra wrote: Anyway, complete untested and such.. Looks very promising. I've been making a few changes in the same direction (except for the fancy CFS bits) and have one way to solve the one problem you pointed out in your patch. +void yield_to(struct task_struct *p) +{ ... + on_rq = p->se.on_rq; + if (on_rq) + dequeue_task(p_rq, p, 0); + + ret = 0; + if (p->sched_class == curr->sched_class && curr->sched_class->yield_to) + curr->sched_class->yield_to(p); + + if (on_rq) + enqueue_task(p_rq, p, 0); diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c index c886717..8689bcd 100644 --- a/kernel/sched_fair.c +++ b/kernel/sched_fair.c +static void yield_to_fair(struct task_struct *p) +{ + struct sched_entity *se = &current->se; + struct sched_entity *p_se = &p->se; + u64 lag0, p_lag0; + s64 lag, p_lag; + + lag0 = avg_vruntime(cfs_rq_of(se)); + p_lag0 = avg_vruntime(cfs_rq_of(p_se)); + + lag = se->vruntime - avg_vruntime(cfs_rq); + p_lag = p_se->vruntime - avg_vruntime(p_cfs_rq); + + if (p_lag > lag) { /* if P is owed less service */ + se->vruntime = lag0 + p_lag; + p_se->vruntime = p_lag + lag; + } + + /* +* XXX try something smarter here +*/ + resched_task(p); + resched_task(current); +} If we do the dequeue_task and enqueue_task here, we can use check_preempt_curr in yield_to_fair. Alternatively, we can do the rescheduling from the main yield_to function, not from yield_to_fair, by calling check_preempt_curr on p and current after p has been enqueued. -- All rights reversed
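The lag-exchange arithmetic in yield_to_fair can be checked in isolation. The sketch below models it with plain integers (not the kernel's sched_entity), and incorporates the correction Peter posted in a follow-up: the target's new vruntime is rebased on its *own* queue's average (p_lag0 + lag), not on p_lag.

```c
/* Standalone model of the vruntime lag exchange from this thread.
 * lag0/p_lag0 stand in for each cfs_rq's avg_vruntime(); the types
 * mirror the kernel's but this is illustrative, not kernel code. */
typedef long long s64;
typedef unsigned long long u64;

static void swap_lags(u64 *se_vruntime, u64 lag0,
                      u64 *p_se_vruntime, u64 p_lag0)
{
    s64 lag   = (s64)(*se_vruntime - lag0);     /* current's lag */
    s64 p_lag = (s64)(*p_se_vruntime - p_lag0); /* target's lag */

    if (p_lag > lag) {                /* only if P is owed less service */
        *se_vruntime   = lag0 + p_lag;    /* current takes P's lag */
        *p_se_vruntime = p_lag0 + lag;    /* P takes current's lag */
    }
}
```

Because each task keeps a lag relative to its own queue's average, the exchange is meaningful even when the two tasks sit on different runqueues.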
Unittests failure for latest upstream kvm.git + qemu-kvm.git + kvm-unit-tests.git
Failure on one of the unittests: vmexit timed out after 10 minutes: 12/08 19:18:32 DEBUG|kvm_prepro:0052| Preprocessing VM 'vm1'... 12/08 19:18:32 DEBUG|kvm_prepro:0055| VM object found in environment 12/08 19:18:32 WARNI|kvm_prepro:0089| Could not send monitor command 'screendump /usr/local/autotest/results/default/kvm.unittest/debug/pre_vm1.ppm' (Broken pipe) 12/08 19:18:32 DEBUG|kvm_vm:0767| VM is already down 12/08 19:18:32 DEBUG|kvm_vm:0357| Getting output of 'qemu -help' 12/08 19:18:32 DEBUG|kvm_vm:0668| Running qemu command: /usr/local/autotest/tests/kvm/qemu -name 'vm1' -monitor unix:'/tmp/monitor-humanmonitor1-20101208-191729-GpeV',server,nowait -serial unix:'/tmp/serial-20101208-191729-GpeV',server,nowait -m 512 -smp 2 -kernel '/usr/local/autotest/tests/kvm/unittests/vmexit.flat' -vnc :0 -chardev file,id=testlog,path=/tmp/testlog-20101208-191729-GpeV -device testdev,chardev=testlog -S 12/08 19:18:33 DEBUG|kvm_vm:0735| VM appears to be alive with PID 17597 12/08 19:18:33 INFO | unittest:0096| Waiting for unittest vmexit to complete, timeout 600, output in /tmp/testlog-20101208-191729-GpeV 12/08 19:28:34 DEBUG| kvm_utils:0879| Timeout elapsed 12/08 19:28:34 ERROR| unittest:0108| Exception happened during vmexit: Timeout elapsed (600s) 12/08 19:28:34 INFO | unittest:0113| Unit test log collected and available under /usr/local/autotest/results/default/kvm.unittest/debug/vmexit.log Looking at vmexit log: enabling apic enabling apic cpuid 3984 Commit logs: 12/08 18:56:34 DEBUG|base_utils:0106| [stdout] HEAD is now at 66fc6be KVM: MMU: Avoid dropping accessed bit while removing write access 12/08 18:56:56 INFO | kvm_utils:0407| Commit hash for git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git is 66fc6be8d2b04153b753182610f919faf9c705bc (tag v2.6.32-57426-g66fc6be) 12/08 19:15:45 INFO | kvm_utils:0407| Commit hash for git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git is cb1983b8809d0e06a97384a40bad1194a32fc814 (tag kvm-88-6330-gcb1983b) 12/08 19:15:48 
INFO | kvm_utils:0407| Commit hash for git://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git is 750bbdb58dbc9017b3adc4f4709cd285183a8e30 (no tag found) Let me know what other info I can capture about this particular failure. Lucas
[PATCH] [qemu-kvm-next-tree] fix compile error of hw/device-assignment.c
Fix the following compile error in next tree: CC    x86_64-softmmu/device-assignment.o hw/device-assignment.c: In function ‘assigned_device_pci_cap_init’: hw/device-assignment.c:1463: error: ‘PCI_PM_CTRL_NO_SOFT_RST’ undeclared (first use in this function) hw/device-assignment.c:1463: error: (Each undeclared identifier is reported only once hw/device-assignment.c:1463: error: for each function it appears in.) Signed-off-by: Wei Yongjun yj...@cn.fujitsu.com --- hw/device-assignment.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 50c6408..8446cd4 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -1460,7 +1460,7 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) /* assign_device will bring the device up to D0, so we don't need * to worry about doing that ourselves here. */ pci_set_word(pci_dev->config + pos + PCI_PM_CTRL, - PCI_PM_CTRL_NO_SOFT_RST); + PCI_PM_CTRL_NO_SOFT_RESET); pci_set_byte(pci_dev->config + pos + PCI_PM_PPB_EXTENSIONS, 0); pci_set_byte(pci_dev->config + pos + PCI_PM_DATA_REGISTER, 0); -- 1.7.0.4
[PATCH 1/6] qemu,kvm: Enable NMI support for user space irqchip
Make use of the new KVM_NMI IOCTL to send NMIs into the KVM guest if the user space APIC emulation or some other source raised them. Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com --- diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 7dfc357..c4ebe28 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -1417,6 +1417,14 @@ int kvm_arch_get_registers(CPUState *env) int kvm_arch_pre_run(CPUState *env, struct kvm_run *run) { +#ifdef KVM_CAP_USER_NMI +if (env->interrupt_request & CPU_INTERRUPT_NMI) { +env->interrupt_request &= ~CPU_INTERRUPT_NMI; +DPRINTF("injected NMI\n"); +kvm_vcpu_ioctl(env, KVM_NMI); +} +#endif + /* Try to inject an interrupt if the guest can accept it */ if (run->ready_for_interrupt_injection && (env->interrupt_request & CPU_INTERRUPT_HARD)
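The patch follows the usual test-and-clear pattern on the interrupt_request bitmask: check the NMI bit, clear it, then fire the ioctl. The toy model below isolates that pattern; the constants are illustrative values, not QEMU's, and the ioctl call is replaced by a return value.

```c
/* Minimal model of the check-and-clear logic the patch adds to
 * kvm_arch_pre_run. Returns nonzero if an NMI would be injected. */
#define CPU_INTERRUPT_HARD (1 << 0)   /* illustrative bit values */
#define CPU_INTERRUPT_NMI  (1 << 1)

static int maybe_inject_nmi(unsigned *interrupt_request)
{
    if (*interrupt_request & CPU_INTERRUPT_NMI) {
        *interrupt_request &= ~CPU_INTERRUPT_NMI;
        /* real code: kvm_vcpu_ioctl(env, KVM_NMI); */
        return 1;
    }
    return 0;
}
```

Clearing the bit before injecting guarantees the NMI is delivered exactly once per request, even though kvm_arch_pre_run runs before every guest entry.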
[PATCH 2/6] qemu,qmp: convert do_inject_nmi() to QObject
Convert do_inject_nmi() to QObject, we need to use it (via libvirt). It is trivial, as it never fails, doesn't have output nor return any data. Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com --- diff --git a/hmp-commands.hx b/hmp-commands.hx index 7a49b74..2e6b034 100644 --- a/hmp-commands.hx +++ b/hmp-commands.hx @@ -725,7 +725,8 @@ ETEXI .args_type = "cpu_index:i", .params = "cpu", .help = "inject an NMI on the given CPU", -.mhandler.cmd = do_inject_nmi, +.user_print = monitor_user_noop, +.mhandler.cmd_new = do_inject_nmi, }, #endif STEXI diff --git a/monitor.c b/monitor.c index 729a7cb..1f0d29e 100644 --- a/monitor.c +++ b/monitor.c @@ -2120,7 +2120,7 @@ static void do_wav_capture(Monitor *mon, const QDict *qdict) #endif #if defined(TARGET_I386) -static void do_inject_nmi(Monitor *mon, const QDict *qdict) +static int do_inject_nmi(Monitor *mon, const QDict *qdict, QObject **ret_data) { CPUState *env; int cpu_index = qdict_get_int(qdict, "cpu_index"); @@ -2130,6 +2130,7 @@ static void do_inject_nmi(Monitor *mon, const QDict *qdict) cpu_interrupt(env, CPU_INTERRUPT_NMI); break; } +return 0; } #endif diff --git a/qmp-commands.hx b/qmp-commands.hx index a385b66..2506981 100644 --- a/qmp-commands.hx +++ b/qmp-commands.hx @@ -453,6 +453,22 @@ Example: EQMP +#if defined(TARGET_I386) +{ +.name = "nmi", +.args_type = "cpu_index:i", +.params = "cpu", +.help = "inject an NMI on the given CPU", +.user_print = monitor_user_noop, +.mhandler.cmd_new = do_inject_nmi, +}, +#endif +SQMP +@item nmi @var{cpu} +@findex nmi +Inject an NMI on the given CPU (x86 only). +EQMP + { .name = "migrate", .args_type = "detach:-d,blk:-b,inc:-i,uri:s",
[PATCH 3/6] qemu,qmp: QError: New QERR_INVALID_KEY
Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com --- diff --git a/qerror.c b/qerror.c index ac2cdaf..a7ef758 100644 --- a/qerror.c +++ b/qerror.c @@ -117,6 +117,10 @@ static const QErrorStringTable qerror_table[] = { .desc = "Invalid block format '%(name)'", }, { +.error_fmt = QERR_INVALID_KEY, +.desc = "Invalid key: '%(name)...'", +}, +{ .error_fmt = QERR_INVALID_PARAMETER, .desc = "Invalid parameter '%(name)'", }, diff --git a/qerror.h b/qerror.h index 943a24b..4fa95ef 100644 --- a/qerror.h +++ b/qerror.h @@ -102,6 +102,9 @@ QError *qobject_to_qerror(const QObject *obj); #define QERR_INVALID_BLOCK_FORMAT \ "{ 'class': 'InvalidBlockFormat', 'data': { 'name': %s } }" +#define QERR_INVALID_KEY \ +"{ 'class': 'InvalidKey', 'data': { 'name': %s } }" + #define QERR_INVALID_PARAMETER \ "{ 'class': 'InvalidParameter', 'data': { 'name': %s } }"
[PATCH 4/6] qemu,qmp: QError: New QERR_TOO_MANY_KEYS
Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com --- diff --git a/qerror.c b/qerror.c index a7ef758..fd66d2a 100644 --- a/qerror.c +++ b/qerror.c @@ -197,6 +197,10 @@ static const QErrorStringTable qerror_table[] = { .desc = "Too many open files", }, { +.error_fmt = QERR_TOO_MANY_KEYS, +.desc = "Too many keys", +}, +{ .error_fmt = QERR_UNDEFINED_ERROR, .desc = "An undefined error has ocurred", }, diff --git a/qerror.h b/qerror.h index 4fa95ef..7f56f12 100644 --- a/qerror.h +++ b/qerror.h @@ -162,6 +162,9 @@ QError *qobject_to_qerror(const QObject *obj); #define QERR_TOO_MANY_FILES \ "{ 'class': 'TooManyFiles', 'data': {} }" +#define QERR_TOO_MANY_KEYS \ +"{ 'class': 'TooManyKeys', 'data': {} }" + #define QERR_UNDEFINED_ERROR \ "{ 'class': 'UndefinedError', 'data': {} }"
[PATCH 5/6] qemu,qmp: QError: New QERR_UNKNOWN_KEY
Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com --- diff --git a/qerror.c b/qerror.c index fd66d2a..07b4cfc 100644 --- a/qerror.c +++ b/qerror.c @@ -205,6 +205,10 @@ static const QErrorStringTable qerror_table[] = { .desc = "An undefined error has ocurred", }, { +.error_fmt = QERR_UNKNOWN_KEY, +.desc = "Unknown key: '%(name)'", +}, +{ .error_fmt = QERR_VNC_SERVER_FAILED, .desc = "Could not start VNC server on %(target)", }, diff --git a/qerror.h b/qerror.h index 7f56f12..cf3ab8f 100644 --- a/qerror.h +++ b/qerror.h @@ -168,6 +168,9 @@ QError *qobject_to_qerror(const QObject *obj); #define QERR_UNDEFINED_ERROR \ "{ 'class': 'UndefinedError', 'data': {} }" +#define QERR_UNKNOWN_KEY \ +"{ 'class': 'UnknownKey', 'data': { 'name': %s } }" + #define QERR_VNC_SERVER_FAILED \ "{ 'class': 'VNCServerFailed', 'data': { 'target': %s } }"
[PATCH 6/6] qemu,qmp: Convert do_sendkey() to QObject,QError
Convert do_sendkey() to QObject/QError, we need to use it (via libvirt). It is a trivial conversion, carefully converting the error reports. Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com --- diff --git a/hmp-commands.hx b/hmp-commands.hx index 23024ba..7a49b74 100644 --- a/hmp-commands.hx +++ b/hmp-commands.hx @@ -436,7 +436,8 @@ ETEXI .args_type = "string:s,hold_time:i?", .params = "keys [hold_ms]", .help = "send keys to the VM (e.g. 'sendkey ctrl-alt-f1', default hold time=100 ms)", -.mhandler.cmd = do_sendkey, +.user_print = monitor_user_noop, +.mhandler.cmd_new = do_sendkey, }, STEXI diff --git a/monitor.c b/monitor.c index ec31eac..729a7cb 100644 --- a/monitor.c +++ b/monitor.c @@ -1678,7 +1678,7 @@ static void release_keys(void *opaque) } } -static void do_sendkey(Monitor *mon, const QDict *qdict) +static int do_sendkey(Monitor *mon, const QDict *qdict, QObject **ret_data) { char keyname_buf[16]; char *separator; @@ -1700,18 +1700,18 @@ static void do_sendkey(Monitor *mon, const QDict *qdict) if (keyname_len > 0) { pstrcpy(keyname_buf, sizeof(keyname_buf), string); if (keyname_len > sizeof(keyname_buf) - 1) { -monitor_printf(mon, "invalid key: '%s...'\n", keyname_buf); -return; +qerror_report(QERR_INVALID_KEY, keyname_buf); +return -1; } if (i == MAX_KEYCODES) { -monitor_printf(mon, "too many keys\n"); -return; +qerror_report(QERR_TOO_MANY_KEYS); +return -1; } keyname_buf[keyname_len] = 0; keycode = get_keycode(keyname_buf); if (keycode < 0) { -monitor_printf(mon, "unknown key: '%s'\n", keyname_buf); -return; +qerror_report(QERR_UNKNOWN_KEY, keyname_buf); +return -1; } keycodes[i++] = keycode; } @@ -1730,6 +1730,7 @@ static void do_sendkey(Monitor *mon, const QDict *qdict) /* delayed key up events */ qemu_mod_timer(key_timer, qemu_get_clock(vm_clock) + muldiv64(get_ticks_per_sec(), hold_time, 1000)); +return 0; } static int mouse_button_state; diff --git a/qmp-commands.hx b/qmp-commands.hx index e5f157f..a385b66 100644 --- a/qmp-commands.hx +++ b/qmp-commands.hx @@ -225,6 
+225,30 @@ Example: <- { "return": {} } EQMP +{ +.name = "sendkey", +.args_type = "string:s,hold_time:i?", +.params = "keys [hold_ms]", +.help = "send keys to the VM (e.g. 'sendkey ctrl-alt-f1', default hold time=100 ms)", +.user_print = monitor_user_noop, +.mhandler.cmd_new = do_sendkey, +}, + +SQMP +@item sendkey @var{keys} +@findex sendkey + +Send @var{keys} to the emulator. @var{keys} could be the name of the +key or @code{#} followed by the raw value in either decimal or hexadecimal +format. Use @code{-} to press several keys simultaneously. Example: +@example +sendkey ctrl-alt-f1 +@end example + +This command is useful to send keys that your graphical user interface +intercepts at low level, such as @code{ctrl-alt-f1} in X Window. + +EQMP { .name = "system_reset",
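The three error paths the patch converts (invalid key, too many keys, unknown key) come from the key-string tokenizing in do_sendkey. The sketch below reimplements just that tokenizing step as a standalone function (a simplified stand-in, not QEMU's actual code): split on '-', bound the per-key name length and the total key count.

```c
#include <string.h>

#define MAX_KEYCODES 16

/* Sketch of do_sendkey's tokenizing: split "ctrl-alt-f1" into key
 * names, enforcing the limits behind QERR_INVALID_KEY (name too long)
 * and QERR_TOO_MANY_KEYS. Returns the key count, or -1 on error. */
static int split_keys(const char *string, char keys[MAX_KEYCODES][16])
{
    int i = 0;

    while (*string) {
        const char *sep = strchr(string, '-');
        size_t len = sep ? (size_t)(sep - string) : strlen(string);

        if (len > 15)
            return -1;          /* would report QERR_INVALID_KEY */
        if (i == MAX_KEYCODES)
            return -1;          /* would report QERR_TOO_MANY_KEYS */
        memcpy(keys[i], string, len);
        keys[i][len] = '\0';
        i++;
        string += len;
        if (*string == '-')
            string++;           /* skip the separator */
    }
    return i;
}
```

The real handler then maps each name through get_keycode(), reporting QERR_UNKNOWN_KEY for names the keymap does not contain.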
Re: [PATCH 1/6] qemu,kvm: Enable NMI support for user space irqchip
On 09.12.2010 07:58, Lai Jiangshan wrote: Make use of the new KVM_NMI IOCTL to send NMIs into the KVM guest if the user space APIC emulation or some other source raised them. In that light, the subject is not absolutely correct. Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com --- diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 7dfc357..c4ebe28 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -1417,6 +1417,14 @@ int kvm_arch_get_registers(CPUState *env) int kvm_arch_pre_run(CPUState *env, struct kvm_run *run) { +#ifdef KVM_CAP_USER_NMI +if (env->interrupt_request & CPU_INTERRUPT_NMI) { +env->interrupt_request &= ~CPU_INTERRUPT_NMI; +DPRINTF("injected NMI\n"); +kvm_vcpu_ioctl(env, KVM_NMI); +} +#endif + /* Try to inject an interrupt if the guest can accept it */ if (run->ready_for_interrupt_injection && (env->interrupt_request & CPU_INTERRUPT_HARD) Actually, we already depend on KVM_CAP_DESTROY_MEMORY_REGION_WORKS which was introduced with 2.6.29 as well. I would suggest to simply extend the static configure check and avoid new #ifdefs in the code. Thanks for pushing this! Was obviously so trivial that it was forgotten... Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux