Re: [Qemu-devel] Re: QEMU-KVM and video performance
On 05/09/2010 10:35 PM, Gerhard Wiesinger wrote: Please run kvm_stat and report output for both tests to confirm. See below. 2nd column is per second statistic when running the test. efer_reload 0 0 exits 18470836 554582 fpu_reload 21478333469 halt_exits2083 0 halt_wakeup 2047 0 host_state_reload 21481863470 hypercalls 0 0 insn_emulation 7688203 554244 This indicates that kvm is emulating instead of direct mapping. That's probably a bug. If you fix it, performance will increase dramatically. Where can I start here? Any ideas how to? One of my ideas: Move hw/vga.c functions vga_mem_readb vga_mem_readw vga_mem_readl vga_mem_writeb vga_mem_writew vga_mem_writel to KVM to avoid switching from KVM to QEMU (I can write C code even kernel but I'm not comfortable with KVM). Howto? That is already done (generically), it's called coalesced mmio. You only have 3470 qemu exits/sec compared to 554244 kvm writes/sec. Switching between tcg and kvm is hard, but not needed. For 256 color modes, direct map is possible and should yield good performance. Bank switching can be improved perhaps 3x, but will never be fast. Where can I start for KVM performance for the bank switching (256 color mode)? (e.g. BIOS writes to VGA window I/O port to switch the bank) Any ideas how to improve (architecture for the change)? For 256 color more the first priority is to find out why direct mapping is not used. I'd suggest tracing the code that makes this decision (in hw/*vga.c) and seeing if it's right or not. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: MMU: Don't read pdptrs with mmu spinlock held in mmu_alloc_roots
On 05/06/2010 10:10 PM, Marcelo Tosatti wrote: On Tue, May 04, 2010 at 01:03:50PM +0300, Avi Kivity wrote: On svm, kvm_read_pdptr() may require reading guest memory, which can sleep. Push the spinlock into mmu_alloc_roots(), and only take it after we've read the pdptr. Marcelo, dropping and re-acquiring the lock before mmu_sync_roots(), is fine, yes? Yes, but you should call kvm_mmu_free_some_pages after reacquiring the spin_lock, to guarantee kvm_mmu_alloc_page won't fail. We never fail due to lack of free pages (but I agree it's cleaner). The patch is already in, I'll send an incremental. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[Autotest][PATCH] KVM Test: Update qemu-ifup script to set bridge's forwarding delay to 0.
Our pxe case always fail. The problem is that the bridge takes some time to enter the forwarding state. Before that, packages are simply dropped by the bridge. If one sets: $ brctl setfd bridge 0 # forward delay = 0 It works. Signed-off-by: Feng Yang fy...@redhat.com --- client/tests/kvm/scripts/qemu-ifup |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/client/tests/kvm/scripts/qemu-ifup b/client/tests/kvm/scripts/qemu-ifup index bcd9a7a..3d23fb4 100755 --- a/client/tests/kvm/scripts/qemu-ifup +++ b/client/tests/kvm/scripts/qemu-ifup @@ -6,3 +6,4 @@ switch=$(/usr/sbin/brctl show | awk 'NR==2 { print $1 }') /sbin/ifconfig $1 0.0.0.0 up /usr/sbin/brctl addif ${switch} $1 +/usr/sbin/brctl setfd ${switch} 0 -- 1.5.5.6 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Another SIGFPE in display code, now in cirrus
On 05/06/2010 11:07 PM, Michael Tokarev wrote: There was a bug recently fixed in vnc code. Apparently there's something similar in the cirrus emulation as well. Here it triggers _always_ (including old versions of kvm) when running windows NT and hitting test button in its display resolution dialog. Here's what gdb is to say: Program received signal SIGFPE, Arithmetic exception. [Switching to Thread 0xf76cab70 (LWP 580)] 0x080c5e45 in cirrus_do_copy (s=0x86134dc, dst=96, src=0, w=2, h=9) at hw/cirrus_vga.c:687 687sx = (src % ABS(s-cirrus_blt_srcpitch)) / depth; (gdb) p depth $1 = 2 (gdb) p s-cirrus_blt_srcpitch $2 = 0 This qemu-kvm-0.12.3 - actually a debian package of it, but there's no patches relevant to video applied. Anything can be done with it? Well, it's trivial to check for the condition, but how to handle it? Need to find the spec for the chip. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCHv2] Support for booting from virtio disks
This patch adds native support for booting from virtio disks to Seabios. Signed-off-by: Gleb Natapov g...@redhat.com --- Changelog: v1-v2: - free memory in case of vq initialization error. - change license of virtio ring/pci to LGPLv3 with permission of Laurent Vivier (aka the author). diff --git a/Makefile b/Makefile index 327a1bf..d0b8881 100644 --- a/Makefile +++ b/Makefile @@ -14,7 +14,8 @@ OUT=out/ SRCBOTH=misc.c pmm.c stacks.c output.c util.c block.c floppy.c ata.c mouse.c \ kbd.c pci.c serial.c clock.c pic.c cdrom.c ps2port.c smp.c resume.c \ pnpbios.c pirtable.c vgahooks.c ramdisk.c pcibios.c blockcmd.c \ -usb.c usb-uhci.c usb-ohci.c usb-ehci.c usb-hid.c usb-msc.c +usb.c usb-uhci.c usb-ohci.c usb-ehci.c usb-hid.c usb-msc.c \ +virtio-ring.c virtio-pci.c virtio-blk.c SRC16=$(SRCBOTH) system.c disk.c apm.c font.c SRC32FLAT=$(SRCBOTH) post.c shadow.c memmap.c coreboot.c boot.c \ acpi.c smm.c mptable.c smbios.c pciinit.c optionroms.c mtrr.c \ diff --git a/src/block.c b/src/block.c index ddf441f..b6b1902 100644 --- a/src/block.c +++ b/src/block.c @@ -11,6 +11,7 @@ #include util.h // dprintf #include ata.h // process_ata_op #include usb-msc.h // process_usb_op +#include virtio-blk.h // process_virtio_op struct drives_s Drives VAR16VISIBLE; @@ -289,6 +290,8 @@ process_op(struct disk_op_s *op) return process_cdemu_op(op); case DTYPE_USB: return process_usb_op(op); +case DTYPE_VIRTIO: + return process_virtio_op(op); default: op-count = 0; return DISK_RET_EPARAM; diff --git a/src/config.h b/src/config.h index b101174..ad569c6 100644 --- a/src/config.h +++ b/src/config.h @@ -136,6 +136,9 @@ #define CONFIG_SUBMODEL_ID 0x00 #define CONFIG_BIOS_REVISION 0x01 +// Support boot from virtio storage +#define CONFIG_VIRTIO_BLK 1 + // Various memory addresses used by the code. #define BUILD_STACK_ADDR 0x7000 #define BUILD_S3RESUME_STACK_ADDR 0x1000 diff --git a/src/disk.h b/src/disk.h index 0cd1b74..9e5b083 100644 --- a/src/disk.h +++ b/src/disk.h @@ -197,6 +197,7 @@ struct drive_s { #define DTYPE_RAMDISK 0x04 #define DTYPE_CDEMU0x05 #define DTYPE_USB 0x06 +#define DTYPE_VIRTIO 0x07 #define MAXDESCSIZE 80 diff --git a/src/pci_ids.h b/src/pci_ids.h index 1800f1d..e1cded2 100644 --- a/src/pci_ids.h +++ b/src/pci_ids.h @@ -2605,3 +2605,6 @@ #define PCI_DEVICE_ID_RME_DIGI32 0x9896 #define PCI_DEVICE_ID_RME_DIGI32_PRO 0x9897 #define PCI_DEVICE_ID_RME_DIGI32_8 0x9898 + +#define PCI_VENDOR_ID_REDHAT_QUMRANET 0x1af4 +#define PCI_DEVICE_ID_VIRTIO_BLK 0x1001 diff --git a/src/post.c b/src/post.c index 638b0f7..25535e2 100644 --- a/src/post.c +++ b/src/post.c @@ -23,6 +23,7 @@ #include smbios.h // smbios_init #include paravirt.h // qemu_cfg_port_probe #include ps2port.h // ps2port_setup +#include virtio-blk.h // virtio_blk_setup void __set_irq(int vector, void *loc) @@ -184,6 +185,7 @@ init_hw(void) floppy_setup(); ata_setup(); ramdisk_setup(); +virtio_blk_setup(); } // Main setup code. diff --git a/src/virtio-blk.c b/src/virtio-blk.c new file mode 100644 index 000..a41c336 --- /dev/null +++ b/src/virtio-blk.c @@ -0,0 +1,155 @@ +// Virtio blovl boot support. +// +// Copyright (C) 2010 Red Hat Inc. +// +// Authors: +// Gleb Natapov gnata...@redhat.com +// +// This file may be distributed under the terms of the GNU LGPLv3 license. + +#include util.h // dprintf +#include pci.h // foreachpci +#include config.h // CONFIG_* +#include virtio-pci.h +#include virtio-blk.h +#include disk.h + +struct virtiodrive_s { +struct drive_s drive; +struct vring_virtqueue *vq; +u16 ioaddr; +}; + +static int +virtio_blk_read(struct disk_op_s *op) +{ +struct virtiodrive_s *vdrive_g = +container_of(op-drive_g, struct virtiodrive_s, drive); +struct vring_virtqueue *vq = GET_GLOBAL(vdrive_g-vq); +struct virtio_blk_outhdr hdr = { +.type = VIRTIO_BLK_T_IN, +.ioprio = 0, +.sector = op-lba, +}; +u8 status = VIRTIO_BLK_S_UNSUPP; +struct vring_list sg[] = { +{ +.addr = MAKE_FLATPTR(GET_SEG(SS), hdr), +.length= sizeof(hdr), +}, +{ +.addr = op-buf_fl, +.length= GET_GLOBAL(vdrive_g-drive.blksize) * op-count, +}, +{ +.addr = MAKE_FLATPTR(GET_SEG(SS), status), +.length= sizeof(status), +}, +}; + +/* Add to virtqueue and kick host */ +vring_add_buf(vq, sg, 1, 2, 0, 0); +vring_kick(GET_GLOBAL(vdrive_g-ioaddr), vq, 1); + +/* Wait for reply */ +while (!vring_more_used(vq)) +udelay(5); + +/* Reclaim virtqueue element */ +vring_get_buf(vq, NULL); +return status == VIRTIO_BLK_S_OK ? DISK_RET_SUCCESS : DISK_RET_EBADTRACK; +} + +int +process_virtio_op(struct disk_op_s *op) +{ +switch (op-command) {
Re: Another SIGFPE in display code, now in cirrus
On 05/10/2010 10:41 AM, Avi Kivity wrote: On 05/06/2010 11:07 PM, Michael Tokarev wrote: There was a bug recently fixed in vnc code. Apparently there's something similar in the cirrus emulation as well. Here it triggers _always_ (including old versions of kvm) when running windows NT and hitting test button in its display resolution dialog. Here's what gdb is to say: Program received signal SIGFPE, Arithmetic exception. [Switching to Thread 0xf76cab70 (LWP 580)] 0x080c5e45 in cirrus_do_copy (s=0x86134dc, dst=96, src=0, w=2, h=9) at hw/cirrus_vga.c:687 687sx = (src % ABS(s-cirrus_blt_srcpitch)) / depth; (gdb) p depth $1 = 2 (gdb) p s-cirrus_blt_srcpitch $2 = 0 This qemu-kvm-0.12.3 - actually a debian package of it, but there's no patches relevant to video applied. Anything can be done with it? Well, it's trivial to check for the condition, but how to handle it? That code is wrong. It shouldn't be using the bitblt source pitch, but the actual display pitch. If the two are different, then the blt doesn't actually affect a rectangular region, but instead a parallelogram. Further: if (notify) qemu_console_copy(s-vga.ds, sx, sy, dx, dy, s-cirrus_blt_width / depth, s-cirrus_blt_height); /* we don't have to notify the display that this portion has changed since qemu_console_copy implies this */ cirrus_invalidate_region(s, s-cirrus_blt_dstaddr, s-cirrus_blt_dstpitch, s-cirrus_blt_width, s-cirrus_blt_height); Shouldn't we avoid the invalidate if notify != 0, as the comment says? 31c05501c says this breaks bitblt, but I can't see why this is true. The copy should update the display. This is probably due to a miscalculation of the affected region, and now we have two invalidates instead of one, reducing performance. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
qemu-kvm-0.12.4 with bochs bios
Hi, I've tried running an image using the new qemu-kvm-0.12.4 using a BOCHS BIOS (using the -bios parameter). When the machine starts the QEMU window gets minimized into a 1x1 pixel box and just stops. Starting the same machine with the same BIOS on 0.12.3 is working great. Is starting qemu using the BOCHS BIOS no longer supported? Or is it a bug? Let me know if theres any debug info I can pull out to help finding the issue. Commandline: qemu-system-x86_64 -bios /usr/share/bochs/BIOS-bochs-latest /path-to-img/server5.img laptop bin # ./qemu-system-x86_64 --version QEMU PC emulator version 0.12.4 (qemu-kvm-0.12.4), Copyright (c) 2003-2008 Fabrice Bellard laptop bin # uname -a Linux laptop 2.6.34-rc6-git3 #1 SMP PREEMPT Wed May 5 16:32:23 IDT 2010 x86_64 Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz GenuineIntel GNU/Linux Thanks, Sasha -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCHv2] KVM: inject #UD if instruction emulation fails and exit to userspace
Do not kill VM when instruction emulation fails. Inject #UD and report failure to userspace instead. Userspace may choose to reenter guest if vcpu is in userspace (cpl == 3) in which case guest OS will kill offending process and continue running. Signed-off-by: Gleb Natapov g...@redhat.com --- Changelog: v1-v2: - always inject #UD and report failure to userspace. diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index ed48904..5aa0944 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -575,7 +575,6 @@ enum emulation_result { #define EMULTYPE_SKIP (1 2) int emulate_instruction(struct kvm_vcpu *vcpu, unsigned long cr2, u16 error_code, int emulation_type); -void kvm_report_emulation_failure(struct kvm_vcpu *cvpu, const char *context); void realmode_lgdt(struct kvm_vcpu *vcpu, u16 size, unsigned long address); void realmode_lidt(struct kvm_vcpu *vcpu, u16 size, unsigned long address); diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 95bee9d..41d2de8 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2788,11 +2788,8 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code) return 1; case EMULATE_DO_MMIO: ++vcpu-stat.mmio_exits; - return 0; + /* fall through */ case EMULATE_FAIL: - vcpu-run-exit_reason = KVM_EXIT_INTERNAL_ERROR; - vcpu-run-internal.suberror = KVM_INTERNAL_ERROR_EMULATION; - vcpu-run-internal.ndata = 0; return 0; default: BUG(); diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index cea916f..58c91f5 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1449,7 +1449,7 @@ static int io_interception(struct vcpu_svm *svm) string = (io_info SVM_IOIO_STR_MASK) != 0; in = (io_info SVM_IOIO_TYPE_MASK) != 0; if (string || in) - return !(emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DO_MMIO); + return emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE; port = io_info 16; size = (io_info SVM_IOIO_SIZE_MASK) SVM_IOIO_SIZE_SHIFT; @@ -2300,16 +2300,12 @@ static int iret_interception(struct vcpu_svm *svm) static int invlpg_interception(struct vcpu_svm *svm) { - if (emulate_instruction(svm-vcpu, 0, 0, 0) != EMULATE_DONE) - pr_unimpl(svm-vcpu, %s: failed\n, __func__); - return 1; + return emulate_instruction(svm-vcpu, 0, 0, 0) == EMULATE_DONE; } static int emulate_on_interception(struct vcpu_svm *svm) { - if (emulate_instruction(svm-vcpu, 0, 0, 0) != EMULATE_DONE) - pr_unimpl(svm-vcpu, %s: failed\n, __func__); - return 1; + return emulate_instruction(svm-vcpu, 0, 0, 0) == EMULATE_DONE; } static int cr8_write_interception(struct vcpu_svm *svm) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 777e00d..e35c479 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -3074,7 +3074,7 @@ static int handle_io(struct kvm_vcpu *vcpu) ++vcpu-stat.io_exits; if (string || in) - return !(emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DO_MMIO); + return emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE; port = exit_qualification 16; size = (exit_qualification 7) + 1; @@ -3331,22 +3331,7 @@ static int handle_wbinvd(struct kvm_vcpu *vcpu) static int handle_apic_access(struct kvm_vcpu *vcpu) { - unsigned long exit_qualification; - enum emulation_result er; - unsigned long offset; - - exit_qualification = vmcs_readl(EXIT_QUALIFICATION); - offset = exit_qualification 0xffful; - - er = emulate_instruction(vcpu, 0, 0, 0); - - if (er != EMULATE_DONE) { - printk(KERN_ERR - Fail to handle apic access vmexit! Offset is 0x%lx\n, - offset); - return -ENOEXEC; - } - return 1; + return emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE; } static int handle_task_switch(struct kvm_vcpu *vcpu) @@ -3558,13 +3543,8 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu) goto out; } - if (err != EMULATE_DONE) { - vcpu-run-exit_reason = KVM_EXIT_INTERNAL_ERROR; - vcpu-run-internal.suberror = KVM_INTERNAL_ERROR_EMULATION; - vcpu-run-internal.ndata = 0; - ret = 0; - goto out; - } + if (err != EMULATE_DONE) + return 0; if (signal_pending(current)) goto out; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index cd8a606..b5eaed4 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3587,24
[PATCH] Do not stop VM if emulation failed in userspace.
Continue vcpu execution in case emulation failure happened while vcpu was in userspace. In this case #UD will be injected into the guest allowing guest OS to kill offending process and continue. Signed-off-by: Gleb Natapov g...@redhat.com diff --git a/kvm-all.c b/kvm-all.c index 9ac35aa..db28d94 100644 --- a/kvm-all.c +++ b/kvm-all.c @@ -786,6 +786,8 @@ static void kvm_handle_internal_error(CPUState *env, struct kvm_run *run) cpu_dump_state(env, stderr, fprintf, 0); if (run-internal.suberror == KVM_INTERNAL_ERROR_EMULATION) { fprintf(stderr, emulation failure\n); +if (!kvm_arch_stop_on_emulation_error(env)) + return; } /* FIXME: Should trigger a qmp message to let management know * something went wrong. diff --git a/kvm.h b/kvm.h index 67c0392..79e59f5 100644 --- a/kvm.h +++ b/kvm.h @@ -187,4 +187,5 @@ int kvm_set_irqfd(int gsi, int fd, bool assigned) #endif int kvm_set_ioeventfd_pio_word(int fd, uint16_t adr, uint16_t val, bool assign); +bool kvm_arch_stop_on_emulation_error(CPUState *env); #endif diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 76c1adb..c333f86 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -1287,4 +1287,10 @@ void kvm_arch_update_guest_debug(CPUState *env, struct kvm_guest_debug *dbg) } #endif /* KVM_CAP_SET_GUEST_DEBUG */ +bool kvm_arch_stop_on_emulation_error(CPUState *env) +{ + return !(env-cr[0] CR0_PE_MASK) || + ((env-segs[R_CS].selector 3) != 3); +} + #include qemu-kvm-x86.c diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c index aa3d432..7b246a1 100644 --- a/target-ppc/kvm.c +++ b/target-ppc/kvm.c @@ -321,3 +321,8 @@ uint32_t kvmppc_get_tbfreq(void) retval = atoi(ns); return retval; } + +bool kvm_arch_stop_on_emulation_error(CPUState *env) +{ +return true; +} diff --git a/target-s390x/kvm.c b/target-s390x/kvm.c index 72e77b0..14f723b 100644 --- a/target-s390x/kvm.c +++ b/target-s390x/kvm.c @@ -480,3 +480,8 @@ int kvm_arch_handle_exit(CPUState *env, struct kvm_run *run) return ret; } + +bool kvm_arch_stop_on_emulation_error(CPUState *env) +{ +return true; +} -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2] Support for booting from virtio disks
diff --git a/src/virtio-blk.c b/src/virtio-blk.c new file mode 100644 index 000..a41c336 --- /dev/null +++ b/src/virtio-blk.c @@ -0,0 +1,155 @@ +// Virtio blovl boot support. Just noticed the blovl typo. + char *desc = malloc_tmphigh(MAXDESCSIZE); + struct virtiodrive_s *vdrive_g = malloc_fseg(sizeof(*vdrive_g)); + struct vring_virtqueue *vq = malloc_low(sizeof(*vq)); + if (!vdrive_g || !desc || !vq) { + warn_noalloc(); + return; + } This error return can still leak. Stefan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2] Support for booting from virtio disks
On Mon, May 10, 2010 at 09:25:20AM +0100, Stefan Hajnoczi wrote: diff --git a/src/virtio-blk.c b/src/virtio-blk.c new file mode 100644 index 000..a41c336 --- /dev/null +++ b/src/virtio-blk.c @@ -0,0 +1,155 @@ +// Virtio blovl boot support. Just noticed the blovl typo. + char *desc = malloc_tmphigh(MAXDESCSIZE); + struct virtiodrive_s *vdrive_g = malloc_fseg(sizeof(*vdrive_g)); + struct vring_virtqueue *vq = malloc_low(sizeof(*vq)); + if (!vdrive_g || !desc || !vq) { + warn_noalloc(); + return; + } This error return can still leak. Oh Gosh, programming is hard. Why don't we write bios in python? -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: qemu-kvm-0.12.4 with bochs bios
On 05/10/2010 11:16 AM, Sasha Levin wrote: Hi, I've tried running an image using the new qemu-kvm-0.12.4 using a BOCHS BIOS (using the -bios parameter). When the machine starts the QEMU window gets minimized into a 1x1 pixel box and just stops. Starting the same machine with the same BIOS on 0.12.3 is working great. Is starting qemu using the BOCHS BIOS no longer supported? Or is it a bug? The bochs bios is not supported in 0.12. Why do you need to use it? If there's a simple fix that can make it work I don't we can include it. Let me know if theres any debug info I can pull out to help finding the issue. A bisection to find the offending commit would be good. I suggest starting with upstream qemu 0.12.3 and 0.12.4. If 0.12.3 works and 0.12.4 fails, you can try a simple git bisect to find what commit introduced the problem. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCHv3] Support for booting from virtio disks
This patch adds native support for booting from virtio disks to Seabios. Signed-off-by: Gleb Natapov g...@redhat.com --- Changelog: v1-v2: - free memory in case of vq initialization error. - change license of virtio ring/pci to LGPLv3 with permission of Laurent Vivier (aka the author). v2-v3: - free memory in case one of allocations failed diff --git a/Makefile b/Makefile index 327a1bf..d0b8881 100644 --- a/Makefile +++ b/Makefile @@ -14,7 +14,8 @@ OUT=out/ SRCBOTH=misc.c pmm.c stacks.c output.c util.c block.c floppy.c ata.c mouse.c \ kbd.c pci.c serial.c clock.c pic.c cdrom.c ps2port.c smp.c resume.c \ pnpbios.c pirtable.c vgahooks.c ramdisk.c pcibios.c blockcmd.c \ -usb.c usb-uhci.c usb-ohci.c usb-ehci.c usb-hid.c usb-msc.c +usb.c usb-uhci.c usb-ohci.c usb-ehci.c usb-hid.c usb-msc.c \ +virtio-ring.c virtio-pci.c virtio-blk.c SRC16=$(SRCBOTH) system.c disk.c apm.c font.c SRC32FLAT=$(SRCBOTH) post.c shadow.c memmap.c coreboot.c boot.c \ acpi.c smm.c mptable.c smbios.c pciinit.c optionroms.c mtrr.c \ diff --git a/src/block.c b/src/block.c index ddf441f..b6b1902 100644 --- a/src/block.c +++ b/src/block.c @@ -11,6 +11,7 @@ #include util.h // dprintf #include ata.h // process_ata_op #include usb-msc.h // process_usb_op +#include virtio-blk.h // process_virtio_op struct drives_s Drives VAR16VISIBLE; @@ -289,6 +290,8 @@ process_op(struct disk_op_s *op) return process_cdemu_op(op); case DTYPE_USB: return process_usb_op(op); +case DTYPE_VIRTIO: + return process_virtio_op(op); default: op-count = 0; return DISK_RET_EPARAM; diff --git a/src/config.h b/src/config.h index b101174..ad569c6 100644 --- a/src/config.h +++ b/src/config.h @@ -136,6 +136,9 @@ #define CONFIG_SUBMODEL_ID 0x00 #define CONFIG_BIOS_REVISION 0x01 +// Support boot from virtio storage +#define CONFIG_VIRTIO_BLK 1 + // Various memory addresses used by the code. #define BUILD_STACK_ADDR 0x7000 #define BUILD_S3RESUME_STACK_ADDR 0x1000 diff --git a/src/disk.h b/src/disk.h index 0cd1b74..9e5b083 100644 --- a/src/disk.h +++ b/src/disk.h @@ -197,6 +197,7 @@ struct drive_s { #define DTYPE_RAMDISK 0x04 #define DTYPE_CDEMU0x05 #define DTYPE_USB 0x06 +#define DTYPE_VIRTIO 0x07 #define MAXDESCSIZE 80 diff --git a/src/pci_ids.h b/src/pci_ids.h index 1800f1d..e1cded2 100644 --- a/src/pci_ids.h +++ b/src/pci_ids.h @@ -2605,3 +2605,6 @@ #define PCI_DEVICE_ID_RME_DIGI32 0x9896 #define PCI_DEVICE_ID_RME_DIGI32_PRO 0x9897 #define PCI_DEVICE_ID_RME_DIGI32_8 0x9898 + +#define PCI_VENDOR_ID_REDHAT_QUMRANET 0x1af4 +#define PCI_DEVICE_ID_VIRTIO_BLK 0x1001 diff --git a/src/post.c b/src/post.c index 638b0f7..25535e2 100644 --- a/src/post.c +++ b/src/post.c @@ -23,6 +23,7 @@ #include smbios.h // smbios_init #include paravirt.h // qemu_cfg_port_probe #include ps2port.h // ps2port_setup +#include virtio-blk.h // virtio_blk_setup void __set_irq(int vector, void *loc) @@ -184,6 +185,7 @@ init_hw(void) floppy_setup(); ata_setup(); ramdisk_setup(); +virtio_blk_setup(); } // Main setup code. diff --git a/src/virtio-blk.c b/src/virtio-blk.c new file mode 100644 index 000..f106bdf --- /dev/null +++ b/src/virtio-blk.c @@ -0,0 +1,158 @@ +// Virtio block boot support. +// +// Copyright (C) 2010 Red Hat Inc. +// +// Authors: +// Gleb Natapov gnata...@redhat.com +// +// This file may be distributed under the terms of the GNU LGPLv3 license. + +#include util.h // dprintf +#include pci.h // foreachpci +#include config.h // CONFIG_* +#include virtio-pci.h +#include virtio-blk.h +#include disk.h + +struct virtiodrive_s { +struct drive_s drive; +struct vring_virtqueue *vq; +u16 ioaddr; +}; + +static int +virtio_blk_read(struct disk_op_s *op) +{ +struct virtiodrive_s *vdrive_g = +container_of(op-drive_g, struct virtiodrive_s, drive); +struct vring_virtqueue *vq = GET_GLOBAL(vdrive_g-vq); +struct virtio_blk_outhdr hdr = { +.type = VIRTIO_BLK_T_IN, +.ioprio = 0, +.sector = op-lba, +}; +u8 status = VIRTIO_BLK_S_UNSUPP; +struct vring_list sg[] = { +{ +.addr = MAKE_FLATPTR(GET_SEG(SS), hdr), +.length= sizeof(hdr), +}, +{ +.addr = op-buf_fl, +.length= GET_GLOBAL(vdrive_g-drive.blksize) * op-count, +}, +{ +.addr = MAKE_FLATPTR(GET_SEG(SS), status), +.length= sizeof(status), +}, +}; + +/* Add to virtqueue and kick host */ +vring_add_buf(vq, sg, 1, 2, 0, 0); +vring_kick(GET_GLOBAL(vdrive_g-ioaddr), vq, 1); + +/* Wait for reply */ +while (!vring_more_used(vq)) +udelay(5); + +/* Reclaim virtqueue element */ +vring_get_buf(vq, NULL); +return status == VIRTIO_BLK_S_OK ? DISK_RET_SUCCESS : DISK_RET_EBADTRACK; +} + +int
Re: qemu-kvm-0.12.4 with bochs bios
On Mon, May 10, 2010 at 11:35 AM, Avi Kivity a...@redhat.com wrote: On 05/10/2010 11:16 AM, Sasha Levin wrote: Hi, I've tried running an image using the new qemu-kvm-0.12.4 using a BOCHS BIOS (using the -bios parameter). When the machine starts the QEMU window gets minimized into a 1x1 pixel box and just stops. Starting the same machine with the same BIOS on 0.12.3 is working great. Is starting qemu using the BOCHS BIOS no longer supported? Or is it a bug? The bochs bios is not supported in 0.12. Why do you need to use it? If there's a simple fix that can make it work I don't we can include it. I've tried doing cpu hotplug using cpu_set on 0.12.3 but it caused a segfault, but now on 0.12.4 it seems to add the cpu - but it looks like SeaBIOS doesn't support cpu hotplug. I want to try doing the same with BOCHS. Let me know if theres any debug info I can pull out to help finding the issue. A bisection to find the offending commit would be good. I suggest starting with upstream qemu 0.12.3 and 0.12.4. If 0.12.3 works and 0.12.4 fails, you can try a simple git bisect to find what commit introduced the problem. I'll send an update when I find the commit. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 0/2] x86 FPU API
On 05/06/2010 11:45 AM, Avi Kivity wrote: Currently all fpu accessors are wedded to task_struct. However kvm also uses the fpu in a different context. Introduce an FPU API, and replace the current uses with the new API. While this patchset is oriented towards deeper changes, as a first step it simlifies xsave for kvm. Peter/Ingo, what are the plans for merging it? The kvm xsave work depends on this. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: update document of gfns
On 05/07/2010 11:47 AM, Lai Jiangshan wrote: When don't use -gfns when role.direct=1, update document for it. Please fold into the patch that does the change. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [SeaBIOS] [PATCHv3] Support for booting from virtio disks
Looks good. Stefan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: virtio: put last_used and last_avail index into ring itself.
On Sun, 9 May 2010 06:27:33 pm Michael S. Tsirkin wrote: On Fri, May 07, 2010 at 12:35:39PM +0930, Rusty Russell wrote: Then there's padding to page boundary. That puts us on a cacheline again for the used ring; also 2 bytes per entry. Hmm, is used ring really 2 bytes per entry? Err, no, I am an idiot. /* u32 is used here for ids for padding reasons. */ struct vring_used_elem { /* Index of start of used descriptor chain. */ __u32 id; /* Total length of the descriptor chain which was used (written to) */ __u32 len; }; struct vring_used { __u16 flags; __u16 idx; struct vring_used_elem ring[]; }; OK, now I get it. Sorry, I was focussed on the avail ring. I thought that used ring has 8 bytes per entry, and that struct vring_used is aligned at page boundary, this would mean that ring element is at offset 4 bytes from page boundary. Thus with cacheline size 128 bytes, each 4th element crosses a cacheline boundary. If we had a 4 byte padding after idx, each used element would always be completely within a single cacheline. I think the numbers are: every 16th entry hits two cachelines. So currently the first 15 entries are free (assuming we hit the idx cacheline anyway), then 1 in 16 cost 2 cachelines. That makes the aligned version win when N 240. But, we access the array linearly. So the extra cacheline cost is in fact amortized. I doubt it could be measured, but maybe vring_get_buf() should prefetch? While you're there, we could use an rather than a mod on the calculation, which may actually be measurable :) Cheers, Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] kvm: calculate correct gfn for small host pages which emulates large guest pages
On 04/30/2010 05:41 AM, Lai Jiangshan wrote: Lai Jiangshan wrote: RFC, because maybe I missing something with the old code. Frome: Lai Jiangshanla...@cn.fujitsu.com In Document/kvm/mmu.txt: gfn: Either the guest page table containing the translations shadowed by this page, or the base page frame for linear translations. See role.direct. But in function FNAME(fetch)(), sp-gfn is incorrect when one of following situations occurred: 1) guest is 32bit paging and guest uses pse-36 and the guest PDE maps a 4-MByte page(backed by 4k host pages) and bits 20:13 of the guest PDE is not equals to 0. 2) guest is long mode paging and the guest PDPTE maps a 1-GByte page (backed by 4k or 2M host pages) Resend this patch with the changelog changed. As Marcelo Tosatti and Gui Jianfeng points out, FNAME(fetch)() miss quadrant on 4mb large page emulation with shadow. Subject: [PATCH] kvm: calculate correct gfn for small host pages which emulates large guest pages In Document/kvm/mmu.txt: gfn: Either the guest page table containing the translations shadowed by this page, or the base page frame for linear translations. See role.direct. But in function FNAME(fetch)(), sp-gfn is incorrect when one of following situations occurred: 1) guest is 32bit paging and the guest PDE maps a 4-MByte page (backed by 4k host pages), FNAME(fetch)() miss handling the quadrant. And if guest use pse-36, table_gfn = gpte_to_gfn(gw-ptes[level - delta]); is incorrect. 2) guest is long mode paging and the guest PDPTE maps a 1-GByte page (backed by 4k or 2M host pages). So we fix it to suit to the document and suit to the code which requires sp-gfn correct when sp-role.direct=1. We use the goal mapping gfn(gw-gfn) to calculate the base page frame for linear translations, it is simple and easy to be understood. Signed-off-by: Lai Jiangshanla...@cn.fujitsu.com --- diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 702c016..958e9c6 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -338,10 +338,13 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr, direct = 1; if (!is_dirty_gpte(gw-ptes[level - delta])) access= ~ACC_WRITE_MASK; - table_gfn = gpte_to_gfn(gw-ptes[level - delta]); - /* advance table_gfn when emulating 1gb pages with 4k */ - if (delta == 0) - table_gfn += PT_INDEX(addr, level); + /* +* It is a large guest pages backed by small host pages, +* So we set @direct(@shadow_page-role.direct)=1, and +* set @table_gfn(@shadow_page-gfn)=the base page frame +* for linear translations. +*/ + table_gfn = gw-gfn ~(KVM_PAGES_PER_HPAGE(level) - 1); } else { direct = 0; table_gfn = gw-table_gfn[level - 2]; Looks good, indeed it is a lot easier to understand than the original calculation (a minor issue is that the variable name is misleading, but that's a problem with kvm_mmu_page definition and not this patch). -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] KVM: MMU: Fix free memory accounting race in mmu_alloc_roots()
We drop the mmu lock between freeing memory and allocating the roots; this allows some other vcpu to sneak in and allocate memory. While the race is benign (resulting only in temporary overallocation, not oom) it is simple and easy to fix by moving the freeing close to the allocation. Signed-off-by: Avi Kivity a...@redhat.com --- arch/x86/kvm/mmu.c |5 ++--- 1 files changed, 2 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 95bee9d..6857a2f 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2067,6 +2067,7 @@ static int mmu_alloc_roots(struct kvm_vcpu *vcpu) root_gfn = 0; } spin_lock(vcpu-kvm-mmu_lock); + kvm_mmu_free_some_pages(vcpu-kvm); sp = kvm_mmu_get_page(vcpu, root_gfn, 0, PT64_ROOT_LEVEL, direct, ACC_ALL, NULL); @@ -2097,6 +2098,7 @@ static int mmu_alloc_roots(struct kvm_vcpu *vcpu) root_gfn = i 30; } spin_lock(vcpu-kvm-mmu_lock); + kvm_mmu_free_some_pages(vcpu-kvm); sp = kvm_mmu_get_page(vcpu, root_gfn, i 30, PT32_ROOT_LEVEL, direct, ACC_ALL, NULL); @@ -2470,9 +2472,6 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu) r = mmu_topup_memory_caches(vcpu); if (r) goto out; - spin_lock(vcpu-kvm-mmu_lock); - kvm_mmu_free_some_pages(vcpu); - spin_unlock(vcpu-kvm-mmu_lock); r = mmu_alloc_roots(vcpu); spin_lock(vcpu-kvm-mmu_lock); mmu_sync_roots(vcpu); -- 1.7.0.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: qemu-kvm-0.12.4 with bochs bios
10.05.2010 12:16, Sasha Levin wrote: Hi, I've tried running an image using the new qemu-kvm-0.12.4 using a BOCHS BIOS (using the -bios parameter). When the machine starts the QEMU window gets minimized into a 1x1 pixel box and just stops. Starting the same machine with the same BIOS on 0.12.3 is working great. Actually I tried this many times in different situation, and it never worked for me. qemu-kvm-0.12.* with older bios (bochs) never boots for me, it opens the SDL window but that window has nothing in it, it's completely blank, and the whole thing sits here forever doing nothing. Thanks! /mjt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
migration fails with virtio-blk
Apparently there's an issue with migration when virtio-blk is in use at the time migration occurs. See Debian bug #580649 for details: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=580649 In short: 2.6.26 (guest) + 0.12.3 + virtio-blk fails. 2.6.32 guest works, 0.11 works, non-virtio-blk works. Debian qemu-kvm is compiled without --enable-iothread - JFYI. The bugreport contains a backtrace taken from guest. Now, one can say it's a bug in 2.6.26 kernel, which is quite old. But the problem does not occur with qemu-kvm-0.11, so it might be treated as a regression instead. Thanks! /mjt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2] KVM: inject #UD if instruction emulation fails and exit to userspace
On Mon, May 10, 2010 at 11:16:56AM +0300, Gleb Natapov wrote: Do not kill VM when instruction emulation fails. Inject #UD and report failure to userspace instead. Userspace may choose to reenter guest if vcpu is in userspace (cpl == 3) in which case guest OS will kill offending process and continue running. Please use this one instead. Compilation warning fixed. Signed-off-by: Gleb Natapov g...@redhat.com diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index ed48904..5aa0944 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -575,7 +575,6 @@ enum emulation_result { #define EMULTYPE_SKIP (1 2) int emulate_instruction(struct kvm_vcpu *vcpu, unsigned long cr2, u16 error_code, int emulation_type); -void kvm_report_emulation_failure(struct kvm_vcpu *cvpu, const char *context); void realmode_lgdt(struct kvm_vcpu *vcpu, u16 size, unsigned long address); void realmode_lidt(struct kvm_vcpu *vcpu, u16 size, unsigned long address); diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 95bee9d..41d2de8 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2788,11 +2788,8 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code) return 1; case EMULATE_DO_MMIO: ++vcpu-stat.mmio_exits; - return 0; + /* fall through */ case EMULATE_FAIL: - vcpu-run-exit_reason = KVM_EXIT_INTERNAL_ERROR; - vcpu-run-internal.suberror = KVM_INTERNAL_ERROR_EMULATION; - vcpu-run-internal.ndata = 0; return 0; default: BUG(); diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index cea916f..58c91f5 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1449,7 +1449,7 @@ static int io_interception(struct vcpu_svm *svm) string = (io_info SVM_IOIO_STR_MASK) != 0; in = (io_info SVM_IOIO_TYPE_MASK) != 0; if (string || in) - return !(emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DO_MMIO); + return emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE; port = io_info 16; size = (io_info SVM_IOIO_SIZE_MASK) SVM_IOIO_SIZE_SHIFT; @@ -2300,16 +2300,12 @@ static int iret_interception(struct vcpu_svm *svm) static int invlpg_interception(struct vcpu_svm *svm) { - if (emulate_instruction(svm-vcpu, 0, 0, 0) != EMULATE_DONE) - pr_unimpl(svm-vcpu, %s: failed\n, __func__); - return 1; + return emulate_instruction(svm-vcpu, 0, 0, 0) == EMULATE_DONE; } static int emulate_on_interception(struct vcpu_svm *svm) { - if (emulate_instruction(svm-vcpu, 0, 0, 0) != EMULATE_DONE) - pr_unimpl(svm-vcpu, %s: failed\n, __func__); - return 1; + return emulate_instruction(svm-vcpu, 0, 0, 0) == EMULATE_DONE; } static int cr8_write_interception(struct vcpu_svm *svm) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 777e00d..e35c479 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -3074,7 +3074,7 @@ static int handle_io(struct kvm_vcpu *vcpu) ++vcpu-stat.io_exits; if (string || in) - return !(emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DO_MMIO); + return emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE; port = exit_qualification 16; size = (exit_qualification 7) + 1; @@ -3331,22 +3331,7 @@ static int handle_wbinvd(struct kvm_vcpu *vcpu) static int handle_apic_access(struct kvm_vcpu *vcpu) { - unsigned long exit_qualification; - enum emulation_result er; - unsigned long offset; - - exit_qualification = vmcs_readl(EXIT_QUALIFICATION); - offset = exit_qualification 0xffful; - - er = emulate_instruction(vcpu, 0, 0, 0); - - if (er != EMULATE_DONE) { - printk(KERN_ERR - Fail to handle apic access vmexit! Offset is 0x%lx\n, - offset); - return -ENOEXEC; - } - return 1; + return emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE; } static int handle_task_switch(struct kvm_vcpu *vcpu) @@ -3558,13 +3543,8 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu) goto out; } - if (err != EMULATE_DONE) { - vcpu-run-exit_reason = KVM_EXIT_INTERNAL_ERROR; - vcpu-run-internal.suberror = KVM_INTERNAL_ERROR_EMULATION; - vcpu-run-internal.ndata = 0; - ret = 0; - goto out; - } + if (err != EMULATE_DONE) + return 0; if (signal_pending(current)) goto out; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index cd8a606..eb845cf 100644 --- a/arch/x86/kvm/x86.c
Re: [Autotest] [PATCH 2/2] KVM test: Support to SLES install
On Wed, 10 Mar 2010 08:45:59 -0300 Lucas Meneghel Rodrigues l...@redhat.com wrote: Hi Yogi/Lucas Thanks for including SLES guests support in KVM autotest. I tried SLES guest install. After succsfull install of SLES guest first stage, again install started. Autotest keep installing first stage in a loop. I guess some thing needs to be changed here, extra_params += -bootp /pxelinux.0 -boot n Can you send a patch to fix this. Thanks in advance. --SP Signed-off-by: Yogananth Subramanian anant...@linux.vnet.ibm.com --- client/tests/kvm/tests_base.cfg.sample | 22 ++ 1 files changed, 22 insertions(+), 0 deletions(-) diff --git a/client/tests/kvm/tests_base.cfg.sample b/client/tests/kvm/tests_base.cfg.sample index c76470d..acb2076 100644 --- a/client/tests/kvm/tests_base.cfg.sample +++ b/client/tests/kvm/tests_base.cfg.sample @@ -503,6 +503,28 @@ variants: md5sum = 2afee1b8a87175e6dee2b8dbbd1ad8e8 md5sum_1m = 768ca32503ef92c28f2d144f2a87e4d0 +- SLES: +no setup +shell_prompt = ^r...@.*[\#\$]\s*$|# +unattended_install: +pxe_image = linux +pxe_initrd = initrd +extra_params += -bootp /pxelinux.0 -boot n +kernel_args = autoyast=floppy + +variants: +- 11.64: +no setup +image_name = sles11-64 +cdrom=linux/SLES-11-DVD-x86_64-GM-DVD1.iso +md5sum = 50a2bd45cd12c3808c3ee48208e2586b +md5sum_1m = 0951cab7c32e332362fc424c1054 +unattended_install: +unattended_file = unattended/Sles11-64-autoinst.xml +tftp = images/sles11-64/tftpboot +floppy = images/sles11-64floppy.img +pxe_dir = boot/x86_64/loader + - @Ubuntu: shell_prompt = ^r...@.*[\#\$]\s*$ -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 2/5] Support adding a file to qemu's ram allocation
On 04/21/2010 08:53 PM, Cam Macdonell wrote: This avoids the need of using qemu_ram_alloc and mmap with MAP_FIXED to map a host file into guest RAM. This function mmaps the opened file anywhere and adds the memory to the ram blocks. Usage is qemu_ram_mmap(fd, size, MAP_SHARED, offset); Signoff? +ram_addr_t qemu_ram_mmap(int fd, ram_addr_t size, int flags, off_t offset) +{ +RAMBlock *new_block; + +size = TARGET_PAGE_ALIGN(size); +new_block = qemu_malloc(sizeof(*new_block)); + +/* map the file passed as a parameter to be this part of memory */ +new_block-host = mmap(0, size, PROT_READ|PROT_WRITE, flags, fd, offset); + +if (new_block-host == MAP_FAILED) +exit(1); Braces after if () +if (kvm_enabled()) +kvm_setup_guest_memory(new_block-host, size); + More braces. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 3/5] Add functions for assigning ioeventfd and irqfds.
On 04/21/2010 08:53 PM, Cam Macdonell wrote: Generic functions to assign irqfds and ioeventfds. Signoff. } #ifdef KVM_IOEVENTFD +int kvm_set_irqfd(int fd, uint16_t vector, uint32_t gsi) +{ +struct kvm_irqfd call = { }; +int r; + +call.fd = fd; +call.gsi = gsi; + +if (!kvm_enabled()) +return -ENOSYS; Braces, here and elsewhere. +r = kvm_vm_ioctl(kvm_state, KVM_IRQFD,call); + +if (r 0) { +return r; -errno +} +return 0; +} + +int kvm_set_ioeventfd_mmio_long(int fd, uint32_t addr, uint32_t val, bool assign) +{ + +int ret; +struct kvm_ioeventfd iofd; + +iofd.datamatch = val; +iofd.addr = addr; +iofd.len = 4; +iofd.flags = KVM_IOEVENTFD_FLAG_DATAMATCH; +iofd.fd = fd; + +if (!kvm_enabled()) +return -ENOSYS; +if (!assign) +iofd.flags |= KVM_IOEVENTFD_FLAG_DEASSIGN; May be more usable to have separate assign and deassign functions (that can call into a single internal implementation). + +ret = kvm_vm_ioctl(kvm_state, KVM_IOEVENTFD,iofd); + +if (ret 0) { +return ret; -errno +} + +return 0; +} + -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
network between host and guest
I have installed kvm (app-emulation/qemu-kvm-0.12.3-r1) on linux (a laptop with gentoo linux) and MS Windows 7 as guest. Here is what info I get about the network on each one: 1) on linux (host): eth0 Link encap:Ethernet HWaddr 00:03:25:43:7f:93 inet addr:192.168.0.13 Bcast:255.255.255.255 Mask:255.255.255.0 inet6 addr: fe80::203:25ff:fe43:7f93/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1492 Metric:1 RX packets:406972 errors:0 dropped:0 overruns:0 frame:0 TX packets:5268868 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:132098724 (125.9 MiB) TX bytes:7683259267 (7.1 GiB) Interrupt:26 Base address:0x2000 loLink encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 inet6 addr: ::1/128 Scope:Host UP LOOPBACK RUNNING MTU:16436 Metric:1 RX packets:505 errors:0 dropped:0 overruns:0 frame:0 TX packets:505 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:65800 (64.2 KiB) TX bytes:65800 (64.2 KiB) wlan0 Link encap:Ethernet HWaddr 00:16:44:8e:5a:0b BROADCAST MULTICAST MTU:1500 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) 2) on the guest (sorry I could't find a way copy, so I took a screenshot): http://i999.photobucket.com/albums/af112/aggman/Screenshot-QEMU.jpg Question: Where is my interface on linux with IP address of 10.0.2.2 ? The guest can actually access the internet (via the host), but where can I see that hidden routing that linux provides to it? PS: forgive my ignorance, I am new to kvm. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: virtio-win problem
-Original Message- From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Martin Maurer Sent: Thursday, May 06, 2010 5:59 PM To: kvm@vger.kernel.org Subject: RE: virtio-win problem -Original Message- From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Riccardo Veraldi Sent: Donnerstag, 06. Mai 2010 14:10 To: kvm@vger.kernel.org Subject: virtio-win problem Hello, if I install virtio-win drivers on windows 2008 Server R2, I have the problem of signed device drivers. I Can install the drivers but Windows 2008 server refuses to use them unless I start the machine pressing F8 every time at each reboot bypassing the checking of signed certified drivers, and this is annoying, since I Cannot reboot the virtual machien automatically. Anyone solved this issue ? Redhat released signed drivers working on win2008r2, but no public download or free use (you need a subscription, it's a special license) Or you need to follow the howtos on the KVM wiki pages, http://www.linux-kvm.org/page/WindowsGuestDrivers/Download_Drivers [YV] You could test sign the drivers with our own certificate (http://www.rage3d.com/board/showthread.php?t=33920573) and enable the usage of test signed drivers (Bcdedit.exe -set TESTSIGNING ON). Br, Martin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: network between host and guest
在 星期一 10 5月 2010 18:46:20,Thanasis 写道: Question: Where is my interface on linux with IP address of 10.0.2.2 ? The guest can actually access the internet (via the host), but where can I see that hidden routing that linux provides to it? I think you are using user networking(slirp) , with user networking, etch tcp connection is mapped to a tcp connection in host's kvm process, and there is no host route to the guest. Please visit http://wiki.qemu.org/Documentation/Networking for detailed information. -- Liang Guo http://bluestone.cublog.cn -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH resend 8/12] asm-generic: bitops: introduce le bit offset macro
(2010/05/06 22:38), Arnd Bergmann wrote: On Wednesday 05 May 2010, Takuya Yoshikawa wrote: Date: Yesterday 04:59:24 That's why the bitmaps are defined as little endian u64 aligned, even on big endian 32-bit systems. Little endian bitmaps are wordsize agnostic, and u64 alignment ensures we can use long-sized bitops on mixed size systems. Ok, I see. There was a suggestion to propose set_le_bit_user() kind of macros. But what I thought was these have a constraint you two explained and seemed to be a little bit specific to some area, like KVM. So I decided to propose just the offset calculation macro. I'm not sure I understand how this macro is going to be used though. If you are just using this in kernel space, that's fine, please go for it. Yes, I'm just using in kernel space: qemu has its own endian related helpers. So if you allow us to place this macro in asm-generic/bitops/* it will help us. Avi, what do you think? Do you want to place it in kvm.h ? However, if the intention is to use the same macro in user space, putting it into asm-generic/bitops/* is not going to help, because those headers are not available in user space, and I wouldn't want to change that. The definition of the macro is not part of the ABI, so just duplicate it in KVM if you need it there. Arnd -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
On 04/21/2010 08:53 PM, Cam Macdonell wrote: Support an inter-vm shared memory device that maps a shared-memory object as a PCI device in the guest. This patch also supports interrupts between guest by communicating over a unix domain socket. This patch applies to the qemu-kvm repository. -device ivshmem,size=size in format accepted by -m[,shm=shm name] Interrupts are supported between multiple VMs by using a shared memory server by using a chardev socket. -device ivshmem,size=size in format accepted by -m[,shm=shm name] [,chardev=id][,msi=on][,irqfd=on][,vectors=n] -chardev socket,path=path,id=id (shared memory server is qemu.git/contrib/ivshmem-server) Sample programs and init scripts are in a git repo here: +typedef struct EventfdEntry { +PCIDevice *pdev; +int vector; +} EventfdEntry; + +typedef struct IVShmemState { +PCIDevice dev; +uint32_t intrmask; +uint32_t intrstatus; +uint32_t doorbell; + +CharDriverState * chr; +CharDriverState ** eventfd_chr; +int ivshmem_mmio_io_addr; + +pcibus_t mmio_addr; +unsigned long ivshmem_offset; +uint64_t ivshmem_size; /* size of shared memory region */ +int shm_fd; /* shared memory file descriptor */ + +int nr_allocated_vms; +/* array of eventfds for each guest */ +int ** eventfds; +/* keep track of # of eventfds for each guest*/ +int * eventfds_posn_count; More readable: typedef struct Peer { int nb_eventfds; int *eventfds; } Peer; int nb_peers; Peer *peers; Does eventfd_chr need to be there as well? + +int nr_alloc_guests; +int vm_id; +int num_eventfds; +uint32_t vectors; +uint32_t features; +EventfdEntry *eventfd_table; + +char * shmobj; +char * sizearg; Does this need to be part of the state? +} IVShmemState; + +/* registers for the Inter-VM shared memory device */ +enum ivshmem_registers { +IntrMask = 0, +IntrStatus = 4, +IVPosition = 8, +Doorbell = 12, +}; + +static inline uint32_t ivshmem_has_feature(IVShmemState *ivs, int feature) { +return (ivs-features (1 feature)); +} + +static inline int is_power_of_two(int x) { +return (x (x-1)) == 0; +} argument needs to be uint64_t to avoid overflow with large BARs. Return type can be bool. +static void ivshmem_io_writel(void *opaque, uint8_t addr, uint32_t val) +{ +IVShmemState *s = opaque; + +u_int64_t write_one = 1; +u_int16_t dest = val 16; +u_int16_t vector = val 0xff; + +addr= 0xfe; Why 0xfe? Can understand 0xfc or 0xff. + +switch (addr) +{ +case IntrMask: +ivshmem_IntrMask_write(s, val); +break; + +case IntrStatus: +ivshmem_IntrStatus_write(s, val); +break; + +case Doorbell: +/* check doorbell range */ +if ((vector= 0) (vector s-eventfds_posn_count[dest])) { What if dest is too big? We overflow s-eventfds_posn_count. + +static void close_guest_eventfds(IVShmemState *s, int posn) +{ +int i, guest_curr_max; + +guest_curr_max = s-eventfds_posn_count[posn]; + +for (i = 0; i guest_curr_max; i++) +close(s-eventfds[posn][i]); + +free(s-eventfds[posn]); qemu_free(). +/* this function increase the dynamic storage need to store data about other + * guests */ +static void increase_dynamic_storage(IVShmemState *s, int new_min_size) { + +int j, old_nr_alloc; + +old_nr_alloc = s-nr_alloc_guests; + +while (s-nr_alloc_guests new_min_size) +s-nr_alloc_guests = s-nr_alloc_guests * 2; + +IVSHMEM_DPRINTF(bumping storage to %d guests\n, s-nr_alloc_guests); +s-eventfds = qemu_realloc(s-eventfds, s-nr_alloc_guests * +sizeof(int *)); +s-eventfds_posn_count = qemu_realloc(s-eventfds_posn_count, +s-nr_alloc_guests * +sizeof(int)); +s-eventfd_table = qemu_realloc(s-eventfd_table, s-nr_alloc_guests * +sizeof(EventfdEntry)); + +if ((s-eventfds == NULL) || (s-eventfds_posn_count == NULL) || +(s-eventfd_table == NULL)) { +fprintf(stderr, Allocation error - exiting\n); +exit(1); +} + +if (!ivshmem_has_feature(s, IVSHMEM_IRQFD)) { +s-eventfd_chr = (CharDriverState **)qemu_realloc(s-eventfd_chr, +s-nr_alloc_guests * sizeof(void *)); +if (s-eventfd_chr == NULL) { +fprintf(stderr, Allocation error - exiting\n); +exit(1); +} +} + +/* zero out new pointers */ +for (j = old_nr_alloc; j s-nr_alloc_guests; j++) { +s-eventfds[j] = NULL; eventfds_posn_count and eventfd_table want zeroing as well. +} +} + +static void ivshmem_read(void *opaque, const
Re: [RFC][PATCH resend 8/12] asm-generic: bitops: introduce le bit offset macro
On 05/10/2010 02:46 PM, Takuya Yoshikawa wrote: (2010/05/06 22:38), Arnd Bergmann wrote: On Wednesday 05 May 2010, Takuya Yoshikawa wrote: Date: Yesterday 04:59:24 That's why the bitmaps are defined as little endian u64 aligned, even on big endian 32-bit systems. Little endian bitmaps are wordsize agnostic, and u64 alignment ensures we can use long-sized bitops on mixed size systems. Ok, I see. There was a suggestion to propose set_le_bit_user() kind of macros. But what I thought was these have a constraint you two explained and seemed to be a little bit specific to some area, like KVM. So I decided to propose just the offset calculation macro. I'm not sure I understand how this macro is going to be used though. If you are just using this in kernel space, that's fine, please go for it. Yes, I'm just using in kernel space: qemu has its own endian related helpers. So if you allow us to place this macro in asm-generic/bitops/* it will help us. Avi, what do you think? Do you want to place it in kvm.h ? I really prefer anything that is generic to be outside kvm, even if kvm is the only user. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH resend 8/12] asm-generic: bitops: introduce le bit offset macro
On Monday 10 May 2010, Takuya Yoshikawa wrote: (2010/05/06 22:38), Arnd Bergmann wrote: On Wednesday 05 May 2010, Takuya Yoshikawa wrote: There was a suggestion to propose set_le_bit_user() kind of macros. But what I thought was these have a constraint you two explained and seemed to be a little bit specific to some area, like KVM. So I decided to propose just the offset calculation macro. I'm not sure I understand how this macro is going to be used though. If you are just using this in kernel space, that's fine, please go for it. Yes, I'm just using in kernel space: qemu has its own endian related helpers. So if you allow us to place this macro in asm-generic/bitops/* it will help us. No problem at all then. Thanks for the explanation. Acked-by: Arnd Bergmann a...@arndb.de -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH resend 8/12] asm-generic: bitops: introduce le bit offset macro
Yes, I'm just using in kernel space: qemu has its own endian related helpers. So if you allow us to place this macro in asm-generic/bitops/* it will help us. No problem at all then. Thanks for the explanation. Acked-by: Arnd Bergmanna...@arndb.de Thanks you both. I will add your Acked-by from now on! Takuya -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space
On 05/04/2010 03:56 PM, Takuya Yoshikawa wrote: [Performance test] We measured the tsc needed to the ioctl()s for getting dirty logs in kernel. Test environment AMD Phenom(tm) 9850 Quad-Core Processor with 8GB memory 1. GUI test (running Ubuntu guest in graphical mode) sudo qemu-system-x86_64 -hda dirtylog_test.img -boot c -m 4192 -net ... We show a relatively stable part to compare how much time is needed for the basic parts of dirty log ioctl. get.org get.opt switch.opt slots[7].len=32768 278379 66398 64024 slots[8].len=32768 181246 270 160 slots[7].len=32768 263961 64673 64494 slots[8].len=32768 181655 265 160 slots[7].len=32768 263736 64701 64610 slots[8].len=32768 182785 267 160 slots[7].len=32768 260925 65360 65042 slots[8].len=32768 182579 264 160 slots[7].len=32768 267823 65915 65682 slots[8].len=32768 186350 271 160 At a glance, we know our optimization improved significantly compared to the original get dirty log ioctl. This is true for both get.opt and switch.opt. This has a really big impact for the personal KVM users who drive KVM in GUI mode on their usual PCs. Next, we notice that switch.opt improved a hundred nano seconds or so for these slots. Although this may sound a bit tiny improvement, we can feel this as a difference of GUI's responses like mouse reactions. 100 ns... this is a bit on the low side (and if you can measure it interactively you have much better reflexes than I). To feel the difference, please try GUI on your PC with our patch series! No doubt get.org - get.opt is measurable, but get.opt-switch.opt is problematic. Have you tried profiling to see where the time is spent (well I can guess, clearing the write access from the sptes). 2. Live-migration test (4GB guest, write loop with 1GB buf) We also did a live-migration test. get.org get.opt switch.opt slots[0].len=655360 797383261144222181 slots[1].len=37570478082186721 1965244 1842824 slots[2].len=637534208 1433562 1012723 1031213 slots[3].len=131072 216858 331 331 slots[4].len=131072 121635 225 164 slots[5].len=131072 120863 356 164 slots[6].len=16777216 121746 1133 156 slots[7].len=32768 120415 230 278 slots[8].len=32768 120368 216 149 slots[0].len=655360 806497194710223582 slots[1].len=37570478082142922 1878025 1895369 slots[2].len=637534208 1386512 1021309 1000345 slots[3].len=131072 221118 459 296 slots[4].len=131072 121516 272 166 slots[5].len=131072 122652 244 173 slots[6].len=16777216 123226 99185 149 slots[7].len=32768 121803 457 505 slots[8].len=32768 121586 216 155 slots[0].len=655360 766113211317213179 slots[1].len=37570478082155662 1974790 1842361 slots[2].len=637534208 1481411 1020004 1031352 slots[3].len=131072 223100 351 295 slots[4].len=131072 122982 436 164 slots[5].len=131072 122100 300 503 slots[6].len=16777216 123653 779 151 slots[7].len=32768 122617 284 157 slots[8].len=32768 122737 253 149 For slots other than 0,1,2 we can see the similar improvement. Considering the fact that switch.opt does not depend on the bitmap length except for kvm_mmu_slot_remove_write_access(), this is the cause of some usec to msec time consumption: there might be some context switches. But note that this was done with the workload which dirtied the memory endlessly during the live-migration. In usual workload, the number of dirty pages varies a lot for each iteration and we should gain really a lot for relatively clean cases. Can you post such a test, for an idle large guest? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space
get.org get.opt switch.opt slots[7].len=32768 278379 66398 64024 slots[8].len=32768 181246 270 160 slots[7].len=32768 263961 64673 64494 slots[8].len=32768 181655 265 160 slots[7].len=32768 263736 64701 64610 slots[8].len=32768 182785 267 160 slots[7].len=32768 260925 65360 65042 slots[8].len=32768 182579 264 160 slots[7].len=32768 267823 65915 65682 slots[8].len=32768 186350 271 160 At a glance, we know our optimization improved significantly compared to the original get dirty log ioctl. This is true for both get.opt and switch.opt. This has a really big impact for the personal KVM users who drive KVM in GUI mode on their usual PCs. Next, we notice that switch.opt improved a hundred nano seconds or so for these slots. Although this may sound a bit tiny improvement, we can feel this as a difference of GUI's responses like mouse reactions. 100 ns... this is a bit on the low side (and if you can measure it interactively you have much better reflexes than I). To feel the difference, please try GUI on your PC with our patch series! No doubt get.org - get.opt is measurable, but get.opt-switch.opt is problematic. Have you tried profiling to see where the time is spent (well I can guess, clearing the write access from the sptes). Sorry but no, and I agree with your guess. Anyway, I want to do some profiling to confirm this guess. BTW, If we only think about performance improvement of time, optimized get(get.opt) may be enough at this stage. But if we consider the future expansion like using user allocated bitmaps, new API's introduced for switch.opt won't become waste, I think, because we need a structure to get and export bitmap addresses. In usual workload, the number of dirty pages varies a lot for each iteration and we should gain really a lot for relatively clean cases. Can you post such a test, for an idle large guest? OK, I'll do! -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM call agenda for May 11
Please send in any agenda items you are interested in covering. If we have a lack of agenda items I'll cancel the week's call. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 3/5] Add functions for assigning ioeventfd and irqfds.
On 05/10/2010 06:13 PM, Cam Macdonell wrote: +int kvm_set_ioeventfd_mmio_long(int fd, uint32_t addr, uint32_t val, bool assign) +{ + +int ret; +struct kvm_ioeventfd iofd; + +iofd.datamatch = val; +iofd.addr = addr; +iofd.len = 4; +iofd.flags = KVM_IOEVENTFD_FLAG_DATAMATCH; +iofd.fd = fd; + +if (!kvm_enabled()) +return -ENOSYS; +if (!assign) +iofd.flags |= KVM_IOEVENTFD_FLAG_DEASSIGN; May be more usable to have separate assign and deassign functions (that can call into a single internal implementation). I believe the convention so far is to use the 'assign' flag as Michael's patch and the PIO version kvm_set_ioeventfd_pio_word() do. I dislike bool arguments since they're hard to understand at the call site. However if there's precedent we can stick to it and perhaps change it all later. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
On Mon, May 10, 2010 at 5:59 AM, Avi Kivity a...@redhat.com wrote: On 04/21/2010 08:53 PM, Cam Macdonell wrote: Support an inter-vm shared memory device that maps a shared-memory object as a PCI device in the guest. This patch also supports interrupts between guest by communicating over a unix domain socket. This patch applies to the qemu-kvm repository. -device ivshmem,size=size in format accepted by -m[,shm=shm name] Interrupts are supported between multiple VMs by using a shared memory server by using a chardev socket. -device ivshmem,size=size in format accepted by -m[,shm=shm name] [,chardev=id][,msi=on][,irqfd=on][,vectors=n] -chardev socket,path=path,id=id (shared memory server is qemu.git/contrib/ivshmem-server) Sample programs and init scripts are in a git repo here: +typedef struct EventfdEntry { + PCIDevice *pdev; + int vector; +} EventfdEntry; + +typedef struct IVShmemState { + PCIDevice dev; + uint32_t intrmask; + uint32_t intrstatus; + uint32_t doorbell; + + CharDriverState * chr; + CharDriverState ** eventfd_chr; + int ivshmem_mmio_io_addr; + + pcibus_t mmio_addr; + unsigned long ivshmem_offset; + uint64_t ivshmem_size; /* size of shared memory region */ + int shm_fd; /* shared memory file descriptor */ + + int nr_allocated_vms; + /* array of eventfds for each guest */ + int ** eventfds; + /* keep track of # of eventfds for each guest*/ + int * eventfds_posn_count; More readable: typedef struct Peer { int nb_eventfds; int *eventfds; } Peer; int nb_peers; Peer *peers; Does eventfd_chr need to be there as well? + + int nr_alloc_guests; + int vm_id; + int num_eventfds; + uint32_t vectors; + uint32_t features; + EventfdEntry *eventfd_table; + + char * shmobj; + char * sizearg; Does this need to be part of the state? +} IVShmemState; + +/* registers for the Inter-VM shared memory device */ +enum ivshmem_registers { + IntrMask = 0, + IntrStatus = 4, + IVPosition = 8, + Doorbell = 12, +}; + +static inline uint32_t ivshmem_has_feature(IVShmemState *ivs, int feature) { + return (ivs-features (1 feature)); +} + +static inline int is_power_of_two(int x) { + return (x (x-1)) == 0; +} argument needs to be uint64_t to avoid overflow with large BARs. Return type can be bool. +static void ivshmem_io_writel(void *opaque, uint8_t addr, uint32_t val) +{ + IVShmemState *s = opaque; + + u_int64_t write_one = 1; + u_int16_t dest = val 16; + u_int16_t vector = val 0xff; + + addr= 0xfe; Why 0xfe? Can understand 0xfc or 0xff. + + switch (addr) + { + case IntrMask: + ivshmem_IntrMask_write(s, val); + break; + + case IntrStatus: + ivshmem_IntrStatus_write(s, val); + break; + + case Doorbell: + /* check doorbell range */ + if ((vector= 0) (vector s-eventfds_posn_count[dest])) { What if dest is too big? We overflow s-eventfds_posn_count. + +static void close_guest_eventfds(IVShmemState *s, int posn) +{ + int i, guest_curr_max; + + guest_curr_max = s-eventfds_posn_count[posn]; + + for (i = 0; i guest_curr_max; i++) + close(s-eventfds[posn][i]); + + free(s-eventfds[posn]); qemu_free(). +/* this function increase the dynamic storage need to store data about other + * guests */ +static void increase_dynamic_storage(IVShmemState *s, int new_min_size) { + + int j, old_nr_alloc; + + old_nr_alloc = s-nr_alloc_guests; + + while (s-nr_alloc_guests new_min_size) + s-nr_alloc_guests = s-nr_alloc_guests * 2; + + IVSHMEM_DPRINTF(bumping storage to %d guests\n, s-nr_alloc_guests); + s-eventfds = qemu_realloc(s-eventfds, s-nr_alloc_guests * + sizeof(int *)); + s-eventfds_posn_count = qemu_realloc(s-eventfds_posn_count, + s-nr_alloc_guests * + sizeof(int)); + s-eventfd_table = qemu_realloc(s-eventfd_table, s-nr_alloc_guests * + sizeof(EventfdEntry)); + + if ((s-eventfds == NULL) || (s-eventfds_posn_count == NULL) || + (s-eventfd_table == NULL)) { + fprintf(stderr, Allocation error - exiting\n); + exit(1); + } + + if (!ivshmem_has_feature(s, IVSHMEM_IRQFD)) { + s-eventfd_chr = (CharDriverState **)qemu_realloc(s-eventfd_chr, + s-nr_alloc_guests * sizeof(void *)); + if (s-eventfd_chr == NULL) { + fprintf(stderr, Allocation error - exiting\n); + exit(1); + } + } + + /* zero out new pointers */ + for (j = old_nr_alloc; j s-nr_alloc_guests; j++) { + s-eventfds[j] =
Re: [PATCH v3 0/2] x86 FPU API
On 05/10/2010 01:48 AM, Avi Kivity wrote: On 05/06/2010 11:45 AM, Avi Kivity wrote: Currently all fpu accessors are wedded to task_struct. However kvm also uses the fpu in a different context. Introduce an FPU API, and replace the current uses with the new API. While this patchset is oriented towards deeper changes, as a first step it simlifies xsave for kvm. Peter/Ingo, what are the plans for merging it? The kvm xsave work depends on this. Going to look at it today. Looks good, but I want to go over it in detail to catch any gotchas. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
On 05/10/2010 06:22 PM, Cam Macdonell wrote: + +/* if the position is -1, then it's shared memory region fd */ +if (incoming_posn == -1) { + +s-num_eventfds = 0; + +if (check_shm_size(s, incoming_fd) == -1) { +exit(-1); +} + +/* creating a BAR in qemu_chr callback may be crazy */ +create_shared_memory_BAR(s, incoming_fd); It probably is... why can't you create it during initialization? This is for the shared memory server implementation, so the fd for the shared memory has to be received (over the qemu char device) from the server before the BAR can be created via qemu_ram_mmap() which adds the necessary memory We could do the handshake during initialization. I'm worried that the device will appear without the BAR, and strange things will happen. But the chardev API is probably not geared for passing data during init. Anthony, any ideas? Otherwise, if the BAR is allocated during initialization, I would have to use MAP_FIXED to mmap the memory. This is what I did before the qemu_ram_mmap() function was added. What would happen to any data written to the BAR before the the handshake completed? I think it would disappear. So it's a good idea to make the initialization process atomic. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 2/5] Support adding a file to qemu's ram allocation
On Mon, May 10, 2010 at 4:39 AM, Avi Kivity a...@redhat.com wrote: On 04/21/2010 08:53 PM, Cam Macdonell wrote: This avoids the need of using qemu_ram_alloc and mmap with MAP_FIXED to map a host file into guest RAM. This function mmaps the opened file anywhere and adds the memory to the ram blocks. Usage is qemu_ram_mmap(fd, size, MAP_SHARED, offset); Signoff? +ram_addr_t qemu_ram_mmap(int fd, ram_addr_t size, int flags, off_t offset) +{ + RAMBlock *new_block; + + size = TARGET_PAGE_ALIGN(size); + new_block = qemu_malloc(sizeof(*new_block)); + + /* map the file passed as a parameter to be this part of memory */ + new_block-host = mmap(0, size, PROT_READ|PROT_WRITE, flags, fd, offset); + + if (new_block-host == MAP_FAILED) + exit(1); Braces after if () + if (kvm_enabled()) + kvm_setup_guest_memory(new_block-host, size); + More braces. This function is possibly made redundant by Marcelo's patch for qemu_ram_map http://kerneltrap.org/mailarchive/linux-kvm/2010/4/26/6261299 qemu_ram_map isn't merged yet either, but I'm fine with either one. Marcelo's requires the device to map the memory and then pass the pointer to be added to the memory allocation, so it gives the device full mapping control. Alternatively, I could add the protection flag to my function (I think that's all that is missing). Let me know and I'll change my patch if necessary. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
On 05/10/2010 10:28 AM, Avi Kivity wrote: On 05/10/2010 06:22 PM, Cam Macdonell wrote: + +/* if the position is -1, then it's shared memory region fd */ +if (incoming_posn == -1) { + +s-num_eventfds = 0; + +if (check_shm_size(s, incoming_fd) == -1) { +exit(-1); +} + +/* creating a BAR in qemu_chr callback may be crazy */ +create_shared_memory_BAR(s, incoming_fd); It probably is... why can't you create it during initialization? This is for the shared memory server implementation, so the fd for the shared memory has to be received (over the qemu char device) from the server before the BAR can be created via qemu_ram_mmap() which adds the necessary memory We could do the handshake during initialization. I'm worried that the device will appear without the BAR, and strange things will happen. But the chardev API is probably not geared for passing data during init. Anthony, any ideas? Why can't we create the BAR with just normal RAM and then change it to a mmap()'d fd after initialization? This will be behavior would be important for live migration as it would let you quickly migrate preserving the memory contents without waiting for an external program to reconnect. Regards, Anthony Lioguori Otherwise, if the BAR is allocated during initialization, I would have to use MAP_FIXED to mmap the memory. This is what I did before the qemu_ram_mmap() function was added. What would happen to any data written to the BAR before the the handshake completed? I think it would disappear. You don't have to do MAP_FIXED. You can allocate a ram area and map that in when disconnected. When you connect, you create another ram area and memcpy() the previous ram area to the new one. You then map the second ram area in. From the guest's perspective, it's totally transparent. For the backend, I'd suggest having an explicit initialized ack or something so that it knows that the data is now mapped to the guest. If you're doing just a ring queue in shared memory, it should allow disconnect/reconnect during live migration asynchronously to the actual qemu live migration. Regards, Anthony Liguori So it's a good idea to make the initialization process atomic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
On Mon, May 10, 2010 at 9:28 AM, Avi Kivity a...@redhat.com wrote: On 05/10/2010 06:22 PM, Cam Macdonell wrote: + + /* if the position is -1, then it's shared memory region fd */ + if (incoming_posn == -1) { + + s-num_eventfds = 0; + + if (check_shm_size(s, incoming_fd) == -1) { + exit(-1); + } + + /* creating a BAR in qemu_chr callback may be crazy */ + create_shared_memory_BAR(s, incoming_fd); It probably is... why can't you create it during initialization? This is for the shared memory server implementation, so the fd for the shared memory has to be received (over the qemu char device) from the server before the BAR can be created via qemu_ram_mmap() which adds the necessary memory We could do the handshake during initialization. I'm worried that the device will appear without the BAR, and strange things will happen. But the chardev API is probably not geared for passing data during init. More specifically, the challenge I've found is that there is no function to tell a chardev to block and wait for the initialization data. Anthony, any ideas? Otherwise, if the BAR is allocated during initialization, I would have to use MAP_FIXED to mmap the memory. This is what I did before the qemu_ram_mmap() function was added. What would happen to any data written to the BAR before the the handshake completed? I think it would disappear. But, the BAR isn't there until the handshake is completed. Only after receiving the shared memory fd does my device call pci_register_bar() in the callback function. So there may be a case with BAR2 (the shared memory BAR) missing during initialization. FWIW, I haven't encountered this. So it's a good idea to make the initialization process atomic. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2] Support for booting from virtio disks
On 05/10/2010 03:11 AM, Gleb Natapov wrote: This patch adds native support for booting from virtio disks to Seabios. Signed-off-by: Gleb Natapovg...@redhat.com A related problem that I think we need to think about how we solve is indicating to Seabios which device we want to boot from With your patch, a user can select a virtio device explicitly or if they use only one virtio device, it will Just Work. However, if a user uses IDE and virtio, or a user has multiple disks, they cannot select a device via -boot. Is this something we need to address? I don't think we'd break libvirt if we didn't. Regards, Anthony Liguori --- Changelog: v1-v2: - free memory in case of vq initialization error. - change license of virtio ring/pci to LGPLv3 with permission of Laurent Vivier (aka the author). diff --git a/Makefile b/Makefile index 327a1bf..d0b8881 100644 --- a/Makefile +++ b/Makefile @@ -14,7 +14,8 @@ OUT=out/ SRCBOTH=misc.c pmm.c stacks.c output.c util.c block.c floppy.c ata.c mouse.c \ kbd.c pci.c serial.c clock.c pic.c cdrom.c ps2port.c smp.c resume.c \ pnpbios.c pirtable.c vgahooks.c ramdisk.c pcibios.c blockcmd.c \ -usb.c usb-uhci.c usb-ohci.c usb-ehci.c usb-hid.c usb-msc.c +usb.c usb-uhci.c usb-ohci.c usb-ehci.c usb-hid.c usb-msc.c \ +virtio-ring.c virtio-pci.c virtio-blk.c SRC16=$(SRCBOTH) system.c disk.c apm.c font.c SRC32FLAT=$(SRCBOTH) post.c shadow.c memmap.c coreboot.c boot.c \ acpi.c smm.c mptable.c smbios.c pciinit.c optionroms.c mtrr.c \ diff --git a/src/block.c b/src/block.c index ddf441f..b6b1902 100644 --- a/src/block.c +++ b/src/block.c @@ -11,6 +11,7 @@ #include util.h // dprintf #include ata.h // process_ata_op #include usb-msc.h // process_usb_op +#include virtio-blk.h // process_virtio_op struct drives_s Drives VAR16VISIBLE; @@ -289,6 +290,8 @@ process_op(struct disk_op_s *op) return process_cdemu_op(op); case DTYPE_USB: return process_usb_op(op); +case DTYPE_VIRTIO: + return process_virtio_op(op); default: op-count = 0; return DISK_RET_EPARAM; diff --git a/src/config.h b/src/config.h index b101174..ad569c6 100644 --- a/src/config.h +++ b/src/config.h @@ -136,6 +136,9 @@ #define CONFIG_SUBMODEL_ID 0x00 #define CONFIG_BIOS_REVISION 0x01 +// Support boot from virtio storage +#define CONFIG_VIRTIO_BLK 1 + // Various memory addresses used by the code. #define BUILD_STACK_ADDR 0x7000 #define BUILD_S3RESUME_STACK_ADDR 0x1000 diff --git a/src/disk.h b/src/disk.h index 0cd1b74..9e5b083 100644 --- a/src/disk.h +++ b/src/disk.h @@ -197,6 +197,7 @@ struct drive_s { #define DTYPE_RAMDISK 0x04 #define DTYPE_CDEMU0x05 #define DTYPE_USB 0x06 +#define DTYPE_VIRTIO 0x07 #define MAXDESCSIZE 80 diff --git a/src/pci_ids.h b/src/pci_ids.h index 1800f1d..e1cded2 100644 --- a/src/pci_ids.h +++ b/src/pci_ids.h @@ -2605,3 +2605,6 @@ #define PCI_DEVICE_ID_RME_DIGI32 0x9896 #define PCI_DEVICE_ID_RME_DIGI32_PRO 0x9897 #define PCI_DEVICE_ID_RME_DIGI32_80x9898 + +#define PCI_VENDOR_ID_REDHAT_QUMRANET 0x1af4 +#define PCI_DEVICE_ID_VIRTIO_BLK 0x1001 diff --git a/src/post.c b/src/post.c index 638b0f7..25535e2 100644 --- a/src/post.c +++ b/src/post.c @@ -23,6 +23,7 @@ #include smbios.h // smbios_init #include paravirt.h // qemu_cfg_port_probe #include ps2port.h // ps2port_setup +#include virtio-blk.h // virtio_blk_setup void __set_irq(int vector, void *loc) @@ -184,6 +185,7 @@ init_hw(void) floppy_setup(); ata_setup(); ramdisk_setup(); +virtio_blk_setup(); } // Main setup code. diff --git a/src/virtio-blk.c b/src/virtio-blk.c new file mode 100644 index 000..a41c336 --- /dev/null +++ b/src/virtio-blk.c @@ -0,0 +1,155 @@ +// Virtio blovl boot support. +// +// Copyright (C) 2010 Red Hat Inc. +// +// Authors: +// Gleb Natapovgnata...@redhat.com +// +// This file may be distributed under the terms of the GNU LGPLv3 license. + +#include util.h // dprintf +#include pci.h // foreachpci +#include config.h // CONFIG_* +#include virtio-pci.h +#include virtio-blk.h +#include disk.h + +struct virtiodrive_s { +struct drive_s drive; +struct vring_virtqueue *vq; +u16 ioaddr; +}; + +static int +virtio_blk_read(struct disk_op_s *op) +{ +struct virtiodrive_s *vdrive_g = +container_of(op-drive_g, struct virtiodrive_s, drive); +struct vring_virtqueue *vq = GET_GLOBAL(vdrive_g-vq); +struct virtio_blk_outhdr hdr = { +.type = VIRTIO_BLK_T_IN, +.ioprio = 0, +.sector = op-lba, +}; +u8 status = VIRTIO_BLK_S_UNSUPP; +struct vring_list sg[] = { +{ +.addr = MAKE_FLATPTR(GET_SEG(SS),hdr), +.length= sizeof(hdr), +}, +{ +.addr = op-buf_fl, +.length= GET_GLOBAL(vdrive_g-drive.blksize) * op-count, +}, +{ +
[PATCH] VMX: Invalid guest state detection enhancements and bug fixes
- Correct unusable flag check on SS, DS, ES, FS, GS, and LDTR - Add rflags checks - Report failed instruction on emulation failure Signed-off-by: Mohammed Gamal m.gamal...@gmail.com --- arch/x86/kvm/vmx.c | 31 +++ 1 files changed, 27 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 777e00d..968384b 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -2122,7 +2122,7 @@ static bool stack_segment_valid(struct kvm_vcpu *vcpu) ss_rpl = ss.selector SELECTOR_RPL_MASK; if (ss.unusable) - return true; + return false; if (ss.type != 3 ss.type != 7) return false; if (!ss.s) @@ -2144,7 +2144,7 @@ static bool data_segment_valid(struct kvm_vcpu *vcpu, int seg) rpl = var.selector SELECTOR_RPL_MASK; if (var.unusable) - return true; + return false; if (!var.s) return false; if (!var.present) @@ -2185,7 +2185,7 @@ static bool ldtr_valid(struct kvm_vcpu *vcpu) vmx_get_segment(vcpu, ldtr, VCPU_SREG_LDTR); if (ldtr.unusable) - return true; + return false; if (ldtr.selector SELECTOR_TI_MASK) /* TI = 1 */ return false; if (ldtr.type != 2) @@ -2207,6 +2207,27 @@ static bool cs_ss_rpl_check(struct kvm_vcpu *vcpu) (ss.selector SELECTOR_RPL_MASK)); } +static bool rflags_valid(struct kvm_vcpu *vcpu) +{ + unsigned long rflags; + u32 entry_intr_info; + + rflags = vmcs_readl(GUEST_RFLAGS); + entry_intr_info = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD); +#ifdef CONFIG_X86_64 + if (is_long_mode(vcpu)) + if (rflags X86_EFLAGS_VM) + return false; +#endif + if ((entry_intr_info INTR_INFO_INTR_TYPE_MASK) == INTR_TYPE_EXT_INTR +(entry_intr_info INTR_INFO_VALID_MASK)) { + if (!(rflags X86_EFLAGS_IF)) + return false; + } + + return true; +} + /* * Check if guest state is valid. Returns true if valid, false if * not. @@ -2251,8 +2272,9 @@ static bool guest_state_valid(struct kvm_vcpu *vcpu) } /* TODO: * - Add checks on RIP -* - Add checks on RFLAGS */ + if (!rflags_valid(vcpu)) + return false; return true; } @@ -3559,6 +3581,7 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu) } if (err != EMULATE_DONE) { + kvm_report_emulation_failure(vcpu, invalid guest state handler); vcpu-run-exit_reason = KVM_EXIT_INTERNAL_ERROR; vcpu-run-internal.suberror = KVM_INTERNAL_ERROR_EMULATION; vcpu-run-internal.ndata = 0; -- 1.7.0.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2] Support for booting from virtio disks
On 05/10/2010 10:54 AM, Gleb Natapov wrote: On Mon, May 10, 2010 at 10:48:42AM -0500, Anthony Liguori wrote: On 05/10/2010 03:11 AM, Gleb Natapov wrote: This patch adds native support for booting from virtio disks to Seabios. Signed-off-by: Gleb Natapovg...@redhat.com A related problem that I think we need to think about how we solve is indicating to Seabios which device we want to boot from With your patch, a user can select a virtio device explicitly or if they use only one virtio device, it will Just Work. However, if a user uses IDE and virtio, or a user has multiple disks, they cannot select a device via -boot. Isn't this problem unrelated to this patch? I mean if I start qemu with two ide devices can I specify from qemu command line which one I want to boot from? That's sort of what I'm asking. If you compare this approach to extboot, extboot provided a capability to select a disk. I think it can be argued though that this isn't a necessary feature to carry over and I'm looking for additional opinions on that. Regards, Anthony Liguori -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2] Support for booting from virtio disks
On Mon, May 10, 2010 at 10:48:42AM -0500, Anthony Liguori wrote: On 05/10/2010 03:11 AM, Gleb Natapov wrote: This patch adds native support for booting from virtio disks to Seabios. Signed-off-by: Gleb Natapovg...@redhat.com A related problem that I think we need to think about how we solve is indicating to Seabios which device we want to boot from With your patch, a user can select a virtio device explicitly or if they use only one virtio device, it will Just Work. However, if a user uses IDE and virtio, or a user has multiple disks, they cannot select a device via -boot. Isn't this problem unrelated to this patch? I mean if I start qemu with two ide devices can I specify from qemu command line which one I want to boot from? Is this something we need to address? I don't think we'd break libvirt if we didn't. Regards, Anthony Liguori --- Changelog: v1-v2: - free memory in case of vq initialization error. - change license of virtio ring/pci to LGPLv3 with permission of Laurent Vivier (aka the author). diff --git a/Makefile b/Makefile index 327a1bf..d0b8881 100644 --- a/Makefile +++ b/Makefile @@ -14,7 +14,8 @@ OUT=out/ SRCBOTH=misc.c pmm.c stacks.c output.c util.c block.c floppy.c ata.c mouse.c \ kbd.c pci.c serial.c clock.c pic.c cdrom.c ps2port.c smp.c resume.c \ pnpbios.c pirtable.c vgahooks.c ramdisk.c pcibios.c blockcmd.c \ -usb.c usb-uhci.c usb-ohci.c usb-ehci.c usb-hid.c usb-msc.c +usb.c usb-uhci.c usb-ohci.c usb-ehci.c usb-hid.c usb-msc.c \ +virtio-ring.c virtio-pci.c virtio-blk.c SRC16=$(SRCBOTH) system.c disk.c apm.c font.c SRC32FLAT=$(SRCBOTH) post.c shadow.c memmap.c coreboot.c boot.c \ acpi.c smm.c mptable.c smbios.c pciinit.c optionroms.c mtrr.c \ diff --git a/src/block.c b/src/block.c index ddf441f..b6b1902 100644 --- a/src/block.c +++ b/src/block.c @@ -11,6 +11,7 @@ #include util.h // dprintf #include ata.h // process_ata_op #include usb-msc.h // process_usb_op +#include virtio-blk.h // process_virtio_op struct drives_s Drives VAR16VISIBLE; @@ -289,6 +290,8 @@ process_op(struct disk_op_s *op) return process_cdemu_op(op); case DTYPE_USB: return process_usb_op(op); +case DTYPE_VIRTIO: +return process_virtio_op(op); default: op-count = 0; return DISK_RET_EPARAM; diff --git a/src/config.h b/src/config.h index b101174..ad569c6 100644 --- a/src/config.h +++ b/src/config.h @@ -136,6 +136,9 @@ #define CONFIG_SUBMODEL_ID 0x00 #define CONFIG_BIOS_REVISION 0x01 +// Support boot from virtio storage +#define CONFIG_VIRTIO_BLK 1 + // Various memory addresses used by the code. #define BUILD_STACK_ADDR 0x7000 #define BUILD_S3RESUME_STACK_ADDR 0x1000 diff --git a/src/disk.h b/src/disk.h index 0cd1b74..9e5b083 100644 --- a/src/disk.h +++ b/src/disk.h @@ -197,6 +197,7 @@ struct drive_s { #define DTYPE_RAMDISK 0x04 #define DTYPE_CDEMU0x05 #define DTYPE_USB 0x06 +#define DTYPE_VIRTIO 0x07 #define MAXDESCSIZE 80 diff --git a/src/pci_ids.h b/src/pci_ids.h index 1800f1d..e1cded2 100644 --- a/src/pci_ids.h +++ b/src/pci_ids.h @@ -2605,3 +2605,6 @@ #define PCI_DEVICE_ID_RME_DIGI32 0x9896 #define PCI_DEVICE_ID_RME_DIGI32_PRO 0x9897 #define PCI_DEVICE_ID_RME_DIGI32_8 0x9898 + +#define PCI_VENDOR_ID_REDHAT_QUMRANET 0x1af4 +#define PCI_DEVICE_ID_VIRTIO_BLK0x1001 diff --git a/src/post.c b/src/post.c index 638b0f7..25535e2 100644 --- a/src/post.c +++ b/src/post.c @@ -23,6 +23,7 @@ #include smbios.h // smbios_init #include paravirt.h // qemu_cfg_port_probe #include ps2port.h // ps2port_setup +#include virtio-blk.h // virtio_blk_setup void __set_irq(int vector, void *loc) @@ -184,6 +185,7 @@ init_hw(void) floppy_setup(); ata_setup(); ramdisk_setup(); +virtio_blk_setup(); } // Main setup code. diff --git a/src/virtio-blk.c b/src/virtio-blk.c new file mode 100644 index 000..a41c336 --- /dev/null +++ b/src/virtio-blk.c @@ -0,0 +1,155 @@ +// Virtio blovl boot support. +// +// Copyright (C) 2010 Red Hat Inc. +// +// Authors: +// Gleb Natapovgnata...@redhat.com +// +// This file may be distributed under the terms of the GNU LGPLv3 license. + +#include util.h // dprintf +#include pci.h // foreachpci +#include config.h // CONFIG_* +#include virtio-pci.h +#include virtio-blk.h +#include disk.h + +struct virtiodrive_s { +struct drive_s drive; +struct vring_virtqueue *vq; +u16 ioaddr; +}; + +static int +virtio_blk_read(struct disk_op_s *op) +{ +struct virtiodrive_s *vdrive_g = +container_of(op-drive_g, struct virtiodrive_s, drive); +struct vring_virtqueue *vq = GET_GLOBAL(vdrive_g-vq); +struct virtio_blk_outhdr hdr = { +.type = VIRTIO_BLK_T_IN, +.ioprio = 0, +
Re: [PATCHv2] KVM: inject #UD if instruction emulation fails and exit to userspace
On Mon, May 10, 2010 at 1:25 PM, Gleb Natapov g...@redhat.com wrote: On Mon, May 10, 2010 at 11:16:56AM +0300, Gleb Natapov wrote: Do not kill VM when instruction emulation fails. Inject #UD and report failure to userspace instead. Userspace may choose to reenter guest if vcpu is in userspace (cpl == 3) in which case guest OS will kill offending process and continue running. I am curious to know what'd happen in case the vcpu is in kernel space (cpl == 0). Is that case handled? Please use this one instead. Compilation warning fixed. Signed-off-by: Gleb Natapov g...@redhat.com diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index ed48904..5aa0944 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -575,7 +575,6 @@ enum emulation_result { #define EMULTYPE_SKIP (1 2) int emulate_instruction(struct kvm_vcpu *vcpu, unsigned long cr2, u16 error_code, int emulation_type); -void kvm_report_emulation_failure(struct kvm_vcpu *cvpu, const char *context); void realmode_lgdt(struct kvm_vcpu *vcpu, u16 size, unsigned long address); void realmode_lidt(struct kvm_vcpu *vcpu, u16 size, unsigned long address); diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 95bee9d..41d2de8 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2788,11 +2788,8 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code) return 1; case EMULATE_DO_MMIO: ++vcpu-stat.mmio_exits; - return 0; + /* fall through */ case EMULATE_FAIL: - vcpu-run-exit_reason = KVM_EXIT_INTERNAL_ERROR; - vcpu-run-internal.suberror = KVM_INTERNAL_ERROR_EMULATION; - vcpu-run-internal.ndata = 0; return 0; default: BUG(); diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index cea916f..58c91f5 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1449,7 +1449,7 @@ static int io_interception(struct vcpu_svm *svm) string = (io_info SVM_IOIO_STR_MASK) != 0; in = (io_info SVM_IOIO_TYPE_MASK) != 0; if (string || in) - return !(emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DO_MMIO); + return emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE; port = io_info 16; size = (io_info SVM_IOIO_SIZE_MASK) SVM_IOIO_SIZE_SHIFT; @@ -2300,16 +2300,12 @@ static int iret_interception(struct vcpu_svm *svm) static int invlpg_interception(struct vcpu_svm *svm) { - if (emulate_instruction(svm-vcpu, 0, 0, 0) != EMULATE_DONE) - pr_unimpl(svm-vcpu, %s: failed\n, __func__); - return 1; + return emulate_instruction(svm-vcpu, 0, 0, 0) == EMULATE_DONE; } static int emulate_on_interception(struct vcpu_svm *svm) { - if (emulate_instruction(svm-vcpu, 0, 0, 0) != EMULATE_DONE) - pr_unimpl(svm-vcpu, %s: failed\n, __func__); - return 1; + return emulate_instruction(svm-vcpu, 0, 0, 0) == EMULATE_DONE; } static int cr8_write_interception(struct vcpu_svm *svm) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 777e00d..e35c479 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -3074,7 +3074,7 @@ static int handle_io(struct kvm_vcpu *vcpu) ++vcpu-stat.io_exits; if (string || in) - return !(emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DO_MMIO); + return emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE; port = exit_qualification 16; size = (exit_qualification 7) + 1; @@ -3331,22 +3331,7 @@ static int handle_wbinvd(struct kvm_vcpu *vcpu) static int handle_apic_access(struct kvm_vcpu *vcpu) { - unsigned long exit_qualification; - enum emulation_result er; - unsigned long offset; - - exit_qualification = vmcs_readl(EXIT_QUALIFICATION); - offset = exit_qualification 0xffful; - - er = emulate_instruction(vcpu, 0, 0, 0); - - if (er != EMULATE_DONE) { - printk(KERN_ERR - Fail to handle apic access vmexit! Offset is 0x%lx\n, - offset); - return -ENOEXEC; - } - return 1; + return emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE; } static int handle_task_switch(struct kvm_vcpu *vcpu) @@ -3558,13 +3543,8 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu) goto out; } - if (err != EMULATE_DONE) { - vcpu-run-exit_reason = KVM_EXIT_INTERNAL_ERROR; - vcpu-run-internal.suberror = KVM_INTERNAL_ERROR_EMULATION; - vcpu-run-internal.ndata = 0; - ret = 0; - goto out; - } +
Re: [PATCHv2] Support for booting from virtio disks
On Mon, May 10, 2010 at 10:58:45AM -0500, Anthony Liguori wrote: On 05/10/2010 10:54 AM, Gleb Natapov wrote: On Mon, May 10, 2010 at 10:48:42AM -0500, Anthony Liguori wrote: On 05/10/2010 03:11 AM, Gleb Natapov wrote: This patch adds native support for booting from virtio disks to Seabios. Signed-off-by: Gleb Natapovg...@redhat.com A related problem that I think we need to think about how we solve is indicating to Seabios which device we want to boot from With your patch, a user can select a virtio device explicitly or if they use only one virtio device, it will Just Work. However, if a user uses IDE and virtio, or a user has multiple disks, they cannot select a device via -boot. Isn't this problem unrelated to this patch? I mean if I start qemu with two ide devices can I specify from qemu command line which one I want to boot from? That's sort of what I'm asking. If you compare this approach to extboot, extboot provided a capability to select a disk. I think it can be argued though that this isn't a necessary feature to carry over and I'm looking for additional opinions on that. Well, extboot is just a hack and shouldn't be used with ide disks at all. With extboot it is not possible to switch to another disk from F12 menu for instance (is it actually possible to read more then one disks with bios int13 when extboot is in use?). About specifying boot disk from qemu command like I think it will be very useful. It is not clear how to pass default boot device into Seabios though. We should pass a bus boot device is attached too (ide/virtio) and an unique id of the device on the bus. -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
On Mon, May 10, 2010 at 9:38 AM, Anthony Liguori anth...@codemonkey.ws wrote: On 05/10/2010 10:28 AM, Avi Kivity wrote: On 05/10/2010 06:22 PM, Cam Macdonell wrote: + + /* if the position is -1, then it's shared memory region fd */ + if (incoming_posn == -1) { + + s-num_eventfds = 0; + + if (check_shm_size(s, incoming_fd) == -1) { + exit(-1); + } + + /* creating a BAR in qemu_chr callback may be crazy */ + create_shared_memory_BAR(s, incoming_fd); It probably is... why can't you create it during initialization? This is for the shared memory server implementation, so the fd for the shared memory has to be received (over the qemu char device) from the server before the BAR can be created via qemu_ram_mmap() which adds the necessary memory We could do the handshake during initialization. I'm worried that the device will appear without the BAR, and strange things will happen. But the chardev API is probably not geared for passing data during init. Anthony, any ideas? Why can't we create the BAR with just normal RAM and then change it to a mmap()'d fd after initialization? This will be behavior would be important for live migration as it would let you quickly migrate preserving the memory contents without waiting for an external program to reconnect. Regards, Anthony Lioguori Otherwise, if the BAR is allocated during initialization, I would have to use MAP_FIXED to mmap the memory. This is what I did before the qemu_ram_mmap() function was added. What would happen to any data written to the BAR before the the handshake completed? I think it would disappear. You don't have to do MAP_FIXED. You can allocate a ram area and map that in when disconnected. When you connect, you create another ram area and memcpy() the previous ram area to the new one. You then map the second ram area in. the memcpy() would overwrite the contents of the shared memory each time a guest joins which would be dangerous. From the guest's perspective, it's totally transparent. For the backend, I'd suggest having an explicit initialized ack or something so that it knows that the data is now mapped to the guest. Yes, I think the ack is the way to go, so the guest has to be aware of it. Would setting a flag in the driver-specific config space be an acceptable ack that the shared region is now mapped? Cam -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
On 05/10/2010 06:41 PM, Cam Macdonell wrote: What would happen to any data written to the BAR before the the handshake completed? I think it would disappear. But, the BAR isn't there until the handshake is completed. Only after receiving the shared memory fd does my device call pci_register_bar() in the callback function. So there may be a case with BAR2 (the shared memory BAR) missing during initialization. FWIW, I haven't encountered this. Well, that violates PCI. You can't have a PCI device with no BAR, then have a BAR appear. It may work since the BAR is registered a lot faster than the BIOS is able to peek at it, but it's a race nevertheless. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv7] add mergeable buffers support to vhost_net
On Wed, Apr 28, 2010 at 01:57:12PM -0700, David L Stevens wrote: @@ -218,18 +248,19 @@ static void handle_rx(struct vhost_net * use_mm(net-dev.mm); mutex_lock(vq-mutex); vhost_disable_notify(vq); - hdr_size = vq-hdr_size; + vhost_hlen = vq-vhost_hlen; vq_log = unlikely(vhost_has_feature(net-dev, VHOST_F_LOG_ALL)) ? vq-log : NULL; - for (;;) { - head = vhost_get_vq_desc(net-dev, vq, vq-iov, - ARRAY_SIZE(vq-iov), - out, in, - vq_log, log); + while ((datalen = vhost_head_len(vq, sock-sk))) { + headcount = vhost_get_desc_n(vq, vq-heads, + datalen + vhost_hlen, + in, vq_log, log); + if (headcount 0) + break; /* OK, now we need to know about added descriptors. */ - if (head == vq-num) { + if (!headcount) { if (unlikely(vhost_enable_notify(vq))) { /* They have slipped one in as we were * doing that: check again. */ @@ -241,46 +272,53 @@ static void handle_rx(struct vhost_net * break; } /* We don't need to be notified again. */ - if (out) { - vq_err(vq, Unexpected descriptor format for RX: -out %d, int %d\n, -out, in); - break; - } - /* Skip header. TODO: support TSO/mergeable rx buffers. */ - s = move_iovec_hdr(vq-iov, vq-hdr, hdr_size, in); + if (vhost_hlen) + /* Skip header. TODO: support TSO. */ + s = move_iovec_hdr(vq-iov, vq-hdr, vhost_hlen, in); + else + s = copy_iovec_hdr(vq-iov, vq-hdr, vq-sock_hlen, in); msg.msg_iovlen = in; len = iov_length(vq-iov, in); /* Sanity check */ if (!len) { vq_err(vq, Unexpected header len for RX: %zd expected %zd\n, -iov_length(vq-hdr, s), hdr_size); +iov_length(vq-hdr, s), vhost_hlen); break; } err = sock-ops-recvmsg(NULL, sock, msg, len, MSG_DONTWAIT | MSG_TRUNC); /* TODO: Check specific error and bomb out unless EAGAIN? */ if (err 0) { - vhost_discard_vq_desc(vq); + vhost_discard_desc(vq, headcount); break; } - /* TODO: Should check and handle checksum. */ - if (err len) { - pr_err(Discarded truncated rx packet: - len %d %zd\n, err, len); - vhost_discard_vq_desc(vq); + if (err != datalen) { + pr_err(Discarded rx packet: + len %d, expected %zd\n, err, datalen); + vhost_discard_desc(vq, headcount); continue; } len = err; - err = memcpy_toiovec(vq-hdr, (unsigned char *)hdr, hdr_size); - if (err) { - vq_err(vq, Unable to write vnet_hdr at addr %p: %d\n, -vq-iov-iov_base, err); + if (vhost_hlen + memcpy_toiovecend(vq-hdr, (unsigned char *)hdr, 0, + vhost_hlen)) { + vq_err(vq, Unable to write vnet_hdr at addr %p\n, +vq-iov-iov_base); break; } - len += hdr_size; - vhost_add_used_and_signal(net-dev, vq, head, len); + /* TODO: Should check and handle checksum. */ + if (vhost_has_feature(net-dev, VIRTIO_NET_F_MRG_RXBUF) + memcpy_toiovecend(vq-hdr, (unsigned char *)headcount, + offsetof(typeof(hdr), num_buffers), + sizeof(hdr.num_buffers))) { + vq_err(vq, Failed num_buffers write); + vhost_discard_desc(vq, headcount); + break; + } + len += vhost_hlen; + vhost_add_used_and_signal_n(net-dev, vq, vq-heads, + headcount); if (unlikely(vq_log)) vhost_log_write(vq, vq_log, log, len); total_len += len; OK I think I see the bug here: vhost_add_used_and_signal_n does not get the actual length, it
Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
On Mon, May 10, 2010 at 10:40 AM, Avi Kivity a...@redhat.com wrote: On 05/10/2010 06:41 PM, Cam Macdonell wrote: What would happen to any data written to the BAR before the the handshake completed? I think it would disappear. But, the BAR isn't there until the handshake is completed. Only after receiving the shared memory fd does my device call pci_register_bar() in the callback function. So there may be a case with BAR2 (the shared memory BAR) missing during initialization. FWIW, I haven't encountered this. Well, that violates PCI. You can't have a PCI device with no BAR, then have a BAR appear. It may work since the BAR is registered a lot faster than the BIOS is able to peek at it, but it's a race nevertheless. Agreed. I'll get Anthony's idea up and running. It seems that is the way forward. Cam -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
On 05/10/2010 11:20 AM, Cam Macdonell wrote: On Mon, May 10, 2010 at 9:38 AM, Anthony Liguorianth...@codemonkey.ws wrote: On 05/10/2010 10:28 AM, Avi Kivity wrote: On 05/10/2010 06:22 PM, Cam Macdonell wrote: + +/* if the position is -1, then it's shared memory region fd */ +if (incoming_posn == -1) { + +s-num_eventfds = 0; + +if (check_shm_size(s, incoming_fd) == -1) { +exit(-1); +} + +/* creating a BAR in qemu_chr callback may be crazy */ +create_shared_memory_BAR(s, incoming_fd); It probably is... why can't you create it during initialization? This is for the shared memory server implementation, so the fd for the shared memory has to be received (over the qemu char device) from the server before the BAR can be created via qemu_ram_mmap() which adds the necessary memory We could do the handshake during initialization. I'm worried that the device will appear without the BAR, and strange things will happen. But the chardev API is probably not geared for passing data during init. Anthony, any ideas? Why can't we create the BAR with just normal RAM and then change it to a mmap()'d fd after initialization? This will be behavior would be important for live migration as it would let you quickly migrate preserving the memory contents without waiting for an external program to reconnect. Regards, Anthony Lioguori Otherwise, if the BAR is allocated during initialization, I would have to use MAP_FIXED to mmap the memory. This is what I did before the qemu_ram_mmap() function was added. What would happen to any data written to the BAR before the the handshake completed? I think it would disappear. You don't have to do MAP_FIXED. You can allocate a ram area and map that in when disconnected. When you connect, you create another ram area and memcpy() the previous ram area to the new one. You then map the second ram area in. the memcpy() would overwrite the contents of the shared memory each time a guest joins which would be dangerous. I think those are reasonable semantics and is really the only way to get guest-transparent reconnect. The later is pretty critical for guest transparent live migration. From the guest's perspective, it's totally transparent. For the backend, I'd suggest having an explicit initialized ack or something so that it knows that the data is now mapped to the guest. Yes, I think the ack is the way to go, so the guest has to be aware of it. Would setting a flag in the driver-specific config space be an acceptable ack that the shared region is now mapped? You know it's mapped because it's mapped when the pci map function returns. You don't need the guest to explicitly tell you. Regards, Anthony Liguori Cam -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
On 05/10/2010 06:38 PM, Anthony Liguori wrote: Otherwise, if the BAR is allocated during initialization, I would have to use MAP_FIXED to mmap the memory. This is what I did before the qemu_ram_mmap() function was added. What would happen to any data written to the BAR before the the handshake completed? I think it would disappear. You don't have to do MAP_FIXED. You can allocate a ram area and map that in when disconnected. When you connect, you create another ram area and memcpy() the previous ram area to the new one. You then map the second ram area in. But it's a shared memory area. Other peers could have connected and written some data in. The memcpy() would destroy their data. From the guest's perspective, it's totally transparent. For the backend, I'd suggest having an explicit initialized ack or something so that it knows that the data is now mapped to the guest. From the peers' perspective, it's non-transparent :( Also it doubles the transient memory requirement. If you're doing just a ring queue in shared memory, it should allow disconnect/reconnect during live migration asynchronously to the actual qemu live migration. Live migration of guests using shared memory is interesting. You'd need to freeze all peers on one node, disconnect, reconnect, and restart them on the other node. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv7] add mergeable buffers support to vhost_net
Since datalen carries the difference and will be negative by that amount from the original loop, what about just adding something like: } if (headcount) heads[headcount-1].len += datalen; [and really, headcount 0 since datalen 0, so just: heads[headcount-1].len += datalen; +-DLS kvm-ow...@vger.kernel.org wrote on 05/10/2010 09:43:03 AM: On Wed, Apr 28, 2010 at 01:57:12PM -0700, David L Stevens wrote: @@ -218,18 +248,19 @@ static void handle_rx(struct vhost_net * use_mm(net-dev.mm); mutex_lock(vq-mutex); vhost_disable_notify(vq); - hdr_size = vq-hdr_size; + vhost_hlen = vq-vhost_hlen; vq_log = unlikely(vhost_has_feature(net-dev, VHOST_F_LOG_ALL)) ? vq-log : NULL; - for (;;) { - head = vhost_get_vq_desc(net-dev, vq, vq-iov, -ARRAY_SIZE(vq-iov), -out, in, -vq_log, log); + while ((datalen = vhost_head_len(vq, sock-sk))) { + headcount = vhost_get_desc_n(vq, vq-heads, +datalen + vhost_hlen, +in, vq_log, log); + if (headcount 0) + break; /* OK, now we need to know about added descriptors. */ - if (head == vq-num) { + if (!headcount) { if (unlikely(vhost_enable_notify(vq))) { /* They have slipped one in as we were * doing that: check again. */ @@ -241,46 +272,53 @@ static void handle_rx(struct vhost_net * break; } /* We don't need to be notified again. */ - if (out) { - vq_err(vq, Unexpected descriptor format for RX: -out %d, int %d\n, -out, in); - break; - } - /* Skip header. TODO: support TSO/mergeable rx buffers. */ - s = move_iovec_hdr(vq-iov, vq-hdr, hdr_size, in); + if (vhost_hlen) + /* Skip header. TODO: support TSO. */ + s = move_iovec_hdr(vq-iov, vq-hdr, vhost_hlen, in); + else + s = copy_iovec_hdr(vq-iov, vq-hdr, vq-sock_hlen, in); msg.msg_iovlen = in; len = iov_length(vq-iov, in); /* Sanity check */ if (!len) { vq_err(vq, Unexpected header len for RX: %zd expected %zd\n, -iov_length(vq-hdr, s), hdr_size); +iov_length(vq-hdr, s), vhost_hlen); break; } err = sock-ops-recvmsg(NULL, sock, msg, len, MSG_DONTWAIT | MSG_TRUNC); /* TODO: Check specific error and bomb out unless EAGAIN? */ if (err 0) { - vhost_discard_vq_desc(vq); + vhost_discard_desc(vq, headcount); break; } - /* TODO: Should check and handle checksum. */ - if (err len) { - pr_err(Discarded truncated rx packet: - len %d %zd\n, err, len); - vhost_discard_vq_desc(vq); + if (err != datalen) { + pr_err(Discarded rx packet: + len %d, expected %zd\n, err, datalen); + vhost_discard_desc(vq, headcount); continue; } len = err; - err = memcpy_toiovec(vq-hdr, (unsigned char *)hdr, hdr_size); - if (err) { - vq_err(vq, Unable to write vnet_hdr at addr %p: %d\n, -vq-iov-iov_base, err); + if (vhost_hlen + memcpy_toiovecend(vq-hdr, (unsigned char *)hdr, 0, + vhost_hlen)) { + vq_err(vq, Unable to write vnet_hdr at addr %p\n, +vq-iov-iov_base); break; } - len += hdr_size; - vhost_add_used_and_signal(net-dev, vq, head, len); + /* TODO: Should check and handle checksum. */ + if (vhost_has_feature(net-dev, VIRTIO_NET_F_MRG_RXBUF) + memcpy_toiovecend(vq-hdr, (unsigned char *)headcount, + offsetof(typeof(hdr), num_buffers), + sizeof(hdr.num_buffers))) { + vq_err(vq, Failed num_buffers write); + vhost_discard_desc(vq, headcount); + break; + } + len += vhost_hlen; + vhost_add_used_and_signal_n(net-dev, vq, vq-heads, + headcount); if (unlikely(vq_log)) vhost_log_write(vq, vq_log, log, len); total_len += len; OK I think I see the bug here: vhost_add_used_and_signal_n does not get the actual length, it gets the iovec length from vhost. Guest virtio uses this as packet length, with bad results. So I have applied the follows and it seems to have fixed the problem: diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index c16db02..9d7496d 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -219,7 +219,7 @@ static int peek_head_len(struct vhost_virtqueue *vq, struct sock *sk) /* This is a multi-buffer
Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
On 05/10/2010 11:59 AM, Avi Kivity wrote: On 05/10/2010 06:38 PM, Anthony Liguori wrote: Otherwise, if the BAR is allocated during initialization, I would have to use MAP_FIXED to mmap the memory. This is what I did before the qemu_ram_mmap() function was added. What would happen to any data written to the BAR before the the handshake completed? I think it would disappear. You don't have to do MAP_FIXED. You can allocate a ram area and map that in when disconnected. When you connect, you create another ram area and memcpy() the previous ram area to the new one. You then map the second ram area in. But it's a shared memory area. Other peers could have connected and written some data in. The memcpy() would destroy their data. Why try to attempt to support multi-master shared memory? What's the use-case? Regards, Anthony Liguori From the guest's perspective, it's totally transparent. For the backend, I'd suggest having an explicit initialized ack or something so that it knows that the data is now mapped to the guest. From the peers' perspective, it's non-transparent :( Also it doubles the transient memory requirement. If you're doing just a ring queue in shared memory, it should allow disconnect/reconnect during live migration asynchronously to the actual qemu live migration. Live migration of guests using shared memory is interesting. You'd need to freeze all peers on one node, disconnect, reconnect, and restart them on the other node. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv7] add mergeable buffers support to vhost_net
On Mon, May 10, 2010 at 10:09:03AM -0700, David Stevens wrote: Since datalen carries the difference and will be negative by that amount from the original loop, what about just adding something like: } if (headcount) heads[headcount-1].len += datalen; [and really, headcount 0 since datalen 0, so just: heads[headcount-1].len += datalen; +-DLS This works too, just does more checks and comparisons. I am still surprised that you were unable to reproduce the problem. kvm-ow...@vger.kernel.org wrote on 05/10/2010 09:43:03 AM: On Wed, Apr 28, 2010 at 01:57:12PM -0700, David L Stevens wrote: @@ -218,18 +248,19 @@ static void handle_rx(struct vhost_net * use_mm(net-dev.mm); mutex_lock(vq-mutex); vhost_disable_notify(vq); - hdr_size = vq-hdr_size; + vhost_hlen = vq-vhost_hlen; vq_log = unlikely(vhost_has_feature(net-dev, VHOST_F_LOG_ALL)) ? vq-log : NULL; - for (;;) { - head = vhost_get_vq_desc(net-dev, vq, vq-iov, -ARRAY_SIZE(vq-iov), -out, in, -vq_log, log); + while ((datalen = vhost_head_len(vq, sock-sk))) { + headcount = vhost_get_desc_n(vq, vq-heads, +datalen + vhost_hlen, +in, vq_log, log); + if (headcount 0) + break; /* OK, now we need to know about added descriptors. */ - if (head == vq-num) { + if (!headcount) { if (unlikely(vhost_enable_notify(vq))) { /* They have slipped one in as we were * doing that: check again. */ @@ -241,46 +272,53 @@ static void handle_rx(struct vhost_net * break; } /* We don't need to be notified again. */ - if (out) { - vq_err(vq, Unexpected descriptor format for RX: -out %d, int %d\n, -out, in); - break; - } - /* Skip header. TODO: support TSO/mergeable rx buffers. */ - s = move_iovec_hdr(vq-iov, vq-hdr, hdr_size, in); + if (vhost_hlen) + /* Skip header. TODO: support TSO. */ + s = move_iovec_hdr(vq-iov, vq-hdr, vhost_hlen, in); + else + s = copy_iovec_hdr(vq-iov, vq-hdr, vq-sock_hlen, in); msg.msg_iovlen = in; len = iov_length(vq-iov, in); /* Sanity check */ if (!len) { vq_err(vq, Unexpected header len for RX: %zd expected %zd\n, -iov_length(vq-hdr, s), hdr_size); +iov_length(vq-hdr, s), vhost_hlen); break; } err = sock-ops-recvmsg(NULL, sock, msg, len, MSG_DONTWAIT | MSG_TRUNC); /* TODO: Check specific error and bomb out unless EAGAIN? */ if (err 0) { - vhost_discard_vq_desc(vq); + vhost_discard_desc(vq, headcount); break; } - /* TODO: Should check and handle checksum. */ - if (err len) { - pr_err(Discarded truncated rx packet: - len %d %zd\n, err, len); - vhost_discard_vq_desc(vq); + if (err != datalen) { + pr_err(Discarded rx packet: + len %d, expected %zd\n, err, datalen); + vhost_discard_desc(vq, headcount); continue; } len = err; - err = memcpy_toiovec(vq-hdr, (unsigned char *)hdr, hdr_size); - if (err) { - vq_err(vq, Unable to write vnet_hdr at addr %p: %d\n, -vq-iov-iov_base, err); + if (vhost_hlen + memcpy_toiovecend(vq-hdr, (unsigned char *)hdr, 0, + vhost_hlen)) { + vq_err(vq, Unable to write vnet_hdr at addr %p\n, +vq-iov-iov_base); break; } - len += hdr_size; - vhost_add_used_and_signal(net-dev, vq, head, len); + /* TODO: Should check and handle checksum. */ + if (vhost_has_feature(net-dev, VIRTIO_NET_F_MRG_RXBUF) + memcpy_toiovecend(vq-hdr, (unsigned char *)headcount, + offsetof(typeof(hdr), num_buffers), + sizeof(hdr.num_buffers))) { + vq_err(vq, Failed num_buffers write); + vhost_discard_desc(vq, headcount); + break; + } + len += vhost_hlen; + vhost_add_used_and_signal_n(net-dev, vq, vq-heads, + headcount); if (unlikely(vq_log)) vhost_log_write(vq, vq_log, log, len); total_len += len; OK I think I see the bug here: vhost_add_used_and_signal_n does not get the actual length, it gets the iovec length from vhost. Guest virtio uses this as packet length, with bad results. So I have applied the
Re: [PATCHv2] KVM: inject #UD if instruction emulation fails and exit to userspace
On Mon, May 10, 2010 at 07:06:05PM +0300, Mohammed Gamal wrote: On Mon, May 10, 2010 at 1:25 PM, Gleb Natapov g...@redhat.com wrote: On Mon, May 10, 2010 at 11:16:56AM +0300, Gleb Natapov wrote: Do not kill VM when instruction emulation fails. Inject #UD and report failure to userspace instead. Userspace may choose to reenter guest if vcpu is in userspace (cpl == 3) in which case guest OS will kill offending process and continue running. I am curious to know what'd happen in case the vcpu is in kernel space (cpl == 0). Is that case handled? Currently no matter where emulation fails VM is stopped and cpu state is printed on stderr. After that patch userspace may choose to continue VM execution after emulation error (#UD will be injected into VM though). The policy is in userspace, but I don't see the point to continue execution after emulation failed in kernel. How kernel can recover from the #UD? -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv7] add mergeable buffers support to vhost_net
On Wed, Apr 28, 2010 at 01:57:12PM -0700, David L Stevens wrote: @@ -218,18 +248,19 @@ static void handle_rx(struct vhost_net * use_mm(net-dev.mm); mutex_lock(vq-mutex); vhost_disable_notify(vq); - hdr_size = vq-hdr_size; + vhost_hlen = vq-vhost_hlen; vq_log = unlikely(vhost_has_feature(net-dev, VHOST_F_LOG_ALL)) ? vq-log : NULL; - for (;;) { - head = vhost_get_vq_desc(net-dev, vq, vq-iov, - ARRAY_SIZE(vq-iov), - out, in, - vq_log, log); + while ((datalen = vhost_head_len(vq, sock-sk))) { + headcount = vhost_get_desc_n(vq, vq-heads, + datalen + vhost_hlen, + in, vq_log, log); + if (headcount 0) + break; /* OK, now we need to know about added descriptors. */ - if (head == vq-num) { + if (!headcount) { if (unlikely(vhost_enable_notify(vq))) { /* They have slipped one in as we were * doing that: check again. */ So I think this breaks handling for a failure mode where we get an skb that is larger than the max packet guest can get. The right thing to do in this case is to drop the skb, we currently do this by passing truncate flag to recvmsg. In particular, with mergeable buffers off, if we get an skb that does not fit in a single packet, this code will spread it over multiple buffers. You should be able to reproduce this fairly easily by disabling both indirect buffers and mergeable buffers on qemu command line. With current code TCP still works by falling back on small packets. I think with your code it will get stuck forever once we get an skb that is too large for us to handle. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
On Mon, May 10, 2010 at 11:25 AM, Anthony Liguori anth...@codemonkey.ws wrote: On 05/10/2010 11:59 AM, Avi Kivity wrote: On 05/10/2010 06:38 PM, Anthony Liguori wrote: Otherwise, if the BAR is allocated during initialization, I would have to use MAP_FIXED to mmap the memory. This is what I did before the qemu_ram_mmap() function was added. What would happen to any data written to the BAR before the the handshake completed? I think it would disappear. You don't have to do MAP_FIXED. You can allocate a ram area and map that in when disconnected. When you connect, you create another ram area and memcpy() the previous ram area to the new one. You then map the second ram area in. But it's a shared memory area. Other peers could have connected and written some data in. The memcpy() would destroy their data. Why try to attempt to support multi-master shared memory? What's the use-case? I don't see it as multi-master, but that the latest guest to join shouldn't have their contents take precedence. In developing this patch, my motivation has been to let the guests decide. If the memcpy is always done, even when no data is written, a guest cannot join without overwriting everything. One use case we're looking at is having VMs using a map reduce framework like Hadoop or Phoenix running in VMs. However, if a workqueue is stored or data transfer passes through shared memory, a system can't scale up the number of workers because each new guest will erase the shared memory (and the workqueue or in progress data transfer). In cases where the latest guest to join wants to clear the memory, it can do so without the automatic memcpy. The guest can do a memset once it knows the memory is attached. My opinion is to leave it to the guests and the application that is using the shared memory to decide what to do on guest joins. Cam Regards, Anthony Liguori From the guest's perspective, it's totally transparent. For the backend, I'd suggest having an explicit initialized ack or something so that it knows that the data is now mapped to the guest. From the peers' perspective, it's non-transparent :( Also it doubles the transient memory requirement. If you're doing just a ring queue in shared memory, it should allow disconnect/reconnect during live migration asynchronously to the actual qemu live migration. Live migration of guests using shared memory is interesting. You'd need to freeze all peers on one node, disconnect, reconnect, and restart them on the other node. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv7] add mergeable buffers support to vhost_net
netdev-ow...@vger.kernel.org wrote on 05/10/2010 10:25:57 AM: On Mon, May 10, 2010 at 10:09:03AM -0700, David Stevens wrote: Since datalen carries the difference and will be negative by that amount from the original loop, what about just adding something like: } if (headcount) heads[headcount-1].len += datalen; [and really, headcount 0 since datalen 0, so just: heads[headcount-1].len += datalen; +-DLS This works too, just does more checks and comparisons. I am still surprised that you were unable to reproduce the problem. I'm sure it happened, and probably had a performance penalty on my systems too, but not as much as yours. I didn't see any obvious performance difference running with the patch, though; not sure why. I'll instrument to see how often it's happening, I think. But fixed now, good catch! +-DLS -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
On 05/10/2010 12:43 PM, Cam Macdonell wrote: On Mon, May 10, 2010 at 11:25 AM, Anthony Liguorianth...@codemonkey.ws wrote: On 05/10/2010 11:59 AM, Avi Kivity wrote: On 05/10/2010 06:38 PM, Anthony Liguori wrote: Otherwise, if the BAR is allocated during initialization, I would have to use MAP_FIXED to mmap the memory. This is what I did before the qemu_ram_mmap() function was added. What would happen to any data written to the BAR before the the handshake completed? I think it would disappear. You don't have to do MAP_FIXED. You can allocate a ram area and map that in when disconnected. When you connect, you create another ram area and memcpy() the previous ram area to the new one. You then map the second ram area in. But it's a shared memory area. Other peers could have connected and written some data in. The memcpy() would destroy their data. Why try to attempt to support multi-master shared memory? What's the use-case? I don't see it as multi-master, but that the latest guest to join shouldn't have their contents take precedence. In developing this patch, my motivation has been to let the guests decide. If the memcpy is always done, even when no data is written, a guest cannot join without overwriting everything. One use case we're looking at is having VMs using a map reduce framework like Hadoop or Phoenix running in VMs. However, if a workqueue is stored or data transfer passes through shared memory, a system can't scale up the number of workers because each new guest will erase the shared memory (and the workqueue or in progress data transfer). (Replying again to list) What data structure would you use? For a lockless ring queue, you can only support a single producer and consumer. To achieve bidirectional communication in virtio, we always use two queues. If you're adding additional queues to support other levels of communication, you can always use different areas of shared memory. I guess this is the point behind the doorbell mechanism? Regards, Anthony Liguori In cases where the latest guest to join wants to clear the memory, it can do so without the automatic memcpy. The guest can do a memset once it knows the memory is attached. My opinion is to leave it to the guests and the application that is using the shared memory to decide what to do on guest joins. Cam Regards, Anthony Liguori From the guest's perspective, it's totally transparent. For the backend, I'd suggest having an explicit initialized ack or something so that it knows that the data is now mapped to the guest. From the peers' perspective, it's non-transparent :( Also it doubles the transient memory requirement. If you're doing just a ring queue in shared memory, it should allow disconnect/reconnect during live migration asynchronously to the actual qemu live migration. Live migration of guests using shared memory is interesting. You'd need to freeze all peers on one node, disconnect, reconnect, and restart them on the other node. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Instant crash (invalid free()) at migrate for all 32bit qemu-kvm-0.12
Seeing an.. interesting thing here. Kvm crashes during migrate, instantly. Freshly-built 0.12.4, with a simple idle guest (linux 2.6.32). Starting TCP receiver on the same host, so using exactly the same kvm binary. And it drops into shell with large diagnostics output from glibc: (qemu) migrate tcp:localhost:3210 *** glibc detected *** /usr/bin/kvm: free(): invalid next size (fast): 0x084c7fd0 *** === Backtrace: = /lib/libc.so.6[0xf7ae1905] /lib/libc.so.6[0xf7ae31a3] /lib/libc.so.6(cfree+0x6d)[0xf7ae622d] /usr/bin/kvm[0x806ed6f] /usr/bin/kvm[0x806ee53] /usr/bin/kvm[0x8050fb6] /usr/bin/kvm[0x805114b] /usr/bin/kvm[0x810e090] /usr/bin/kvm[0x8105ee9] /usr/bin/kvm[0x8106cce] /usr/bin/kvm[0x8051d28] /usr/bin/kvm[0x806d3c4] /usr/bin/kvm[0x8054bbd] /lib/libc.so.6(__libc_start_main+0xe5)[0xf7a8cb55] /usr/bin/kvm[0x804e711] === Memory map: 08048000-08244000 r-xp 08:01 77392 /usr/bin/kvm 08244000-08256000 rw-p 001fc000 08:01 77392 /usr/bin/kvm 08256000-08506000 rw-p 00:00 0 08506000-08516000 rw-p 00:00 0 08516000-08517000 rw-p 00:00 0 08517000-0851f000 rw-p 00:00 0 0851f000-085f9000 rw-p 00:00 0 e8f4a000-e8f4b000 ---p 00:00 0 e8f4b000-e974a000 rw-p 00:00 0 e9f4a000-e9f4b000 ---p 00:00 0 e9f4b000-ea74a000 rw-p 00:00 0 ea74a000-ea74b000 ---p 00:00 0 ea74b000-eaf4a000 rw-p 00:00 0 eaf4a000-eaf4b000 ---p 00:00 0 eaf4b000-eb74a000 rw-p 00:00 0 eb74a000-eb74b000 ---p 00:00 0 eb74b000-ebf4a000 rw-p 00:00 0 ec60-ec621000 rw-p 00:00 0 ec621000-ec70 ---p 00:00 0 ec74a000-ec864000 rw-s 00:04 2293771 /SYSV (deleted) ec864000-ec89e000 rw-p 00:00 0 ec91b000-ec923000 r-xp 08:01 32779 /usr/lib/libXcursor.so.1.0.2 ec923000-ec924000 rw-p 7000 08:01 32779 /usr/lib/libXcursor.so.1.0.2 ec939000-eca27000 r--p 0018 08:01 106154 /usr/lib/locale/locale-archive eca27000-ecc27000 r--p 08:01 106154 /usr/lib/locale/locale-archive ecc27000-ecc2d000 r-xp 08:01 75964 /usr/lib/libXrandr.so.2.2.0 ecc2d000-ecc2e000 rw-p 5000 08:01 75964 /usr/lib/libXrandr.so.2.2.0 ecc43000-ecd28000 rw-p 00:00 0 ecded000-ece4e000 rw-p 00:00 0 ece4e000-ece55000 r--s 08:01 84015 /usr/lib/gconv/gconv-modules.cache ece55000-ece56000 rw-p 00:00 0 ece56000-ede56000 rw-p 00:00 0 ede56000-ede58000 rw-p 00:00 0 ede58000-ede78000 rw-p 00:00 0 ede78000-ede79000 rw-p 00:00 0 ede79000-ede83000 r-xp 08:01 188305 /lib/libnss_files-2.10.2.so ede83000-ede84000 r--p 9000 08:01 188305 /lib/libnss_files-2.10.2.so ede84000-ede85000 rw-p a000 08:01 188305 /lib/libnss_files-2.10.2.so ede9a000-ede9b000 rw-p 00:00 0 ede9b000-edebb000 rw-p 00:00 0 edebb000-edebd000 rw-p 00:00 0 edebd000-f5ebd000 rw-p 00:00 0 f5ebd000-f5ebe000 rw-p 00:00 0 f5ebe000-f5ebf000 ---p 00:00 0 f5ebf000-f66c2000 rw-p 00:00 0 f66c2000-f66c6000 r-xp 08:01 33553 /usr/lib/libXdmcp.so.6.0.0 f66c6000-f66c7000 rw-p 3000 08:01 33553 /usr/lib/libXdmcp.so.6.0.0 f66c7000-f66c9000 r-xp 08:01 33554 /usr/lib/libXau.so.6.0.0 f66c9000-f66ca000 rw-p 1000 08:01 33554 /usr/lib/libXau.so.6.0.0 f66ca000-f66cc000 r-xp 08:01 187523 /lib/libx86.so.1 f66cc000-f66cd000 rw-p 1000 08:01 187523 /lib/libx86.so.1 f66cd000-f66ce000 rw-p 00:00 0 f66ce000-f66d2000 r-xp 08:01 187387 /lib/libattr.so.1.1.0 f66d2000-f66d3000 rw-p 3000 08:01 187387 /lib/libattr.so.1.1.0 f66d3000-f66d7000 r-xp 08:01 33792 /usr/lib/libogg.so.0.5.3 f66d7000-f66d8000 rw-p 3000 08:01 33792 /usr/lib/libogg.so.0.5.3 f66d8000-f66ff000 r-xp 08:01 76459 /usr/lib/libvorbis.so.0.4.4 f66ff000-f670 rw-p 00026000 08:01 76459 /usr/lib/libvorbis.so.0.4.4 f670-f670b000 r-xp 08:01 73934 /usr/lib/libvorbisenc.so.2.0.3 f670b000-f67fb000 rw-p a000 08:01 73934 /usr/lib/libvorbisenc.so.2.0.3 f67fb000-f684d000 r-xp 08:01 33564 /usr/lib/libFLAC.so.8.2.0
Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
On Mon, May 10, 2010 at 11:52 AM, Anthony Liguori anth...@codemonkey.ws wrote: On 05/10/2010 12:43 PM, Cam Macdonell wrote: On Mon, May 10, 2010 at 11:25 AM, Anthony Liguorianth...@codemonkey.ws wrote: On 05/10/2010 11:59 AM, Avi Kivity wrote: On 05/10/2010 06:38 PM, Anthony Liguori wrote: Otherwise, if the BAR is allocated during initialization, I would have to use MAP_FIXED to mmap the memory. This is what I did before the qemu_ram_mmap() function was added. What would happen to any data written to the BAR before the the handshake completed? I think it would disappear. You don't have to do MAP_FIXED. You can allocate a ram area and map that in when disconnected. When you connect, you create another ram area and memcpy() the previous ram area to the new one. You then map the second ram area in. But it's a shared memory area. Other peers could have connected and written some data in. The memcpy() would destroy their data. Why try to attempt to support multi-master shared memory? What's the use-case? I don't see it as multi-master, but that the latest guest to join shouldn't have their contents take precedence. In developing this patch, my motivation has been to let the guests decide. If the memcpy is always done, even when no data is written, a guest cannot join without overwriting everything. One use case we're looking at is having VMs using a map reduce framework like Hadoop or Phoenix running in VMs. However, if a workqueue is stored or data transfer passes through shared memory, a system can't scale up the number of workers because each new guest will erase the shared memory (and the workqueue or in progress data transfer). (Replying again to list) Sorry about that. What data structure would you use? For a lockless ring queue, you can only support a single producer and consumer. To achieve bidirectional communication in virtio, we always use two queues. MCS locks can work with multiple producer/consumers, either with busy waiting or using the doorbell mechanism. If you're adding additional queues to support other levels of communication, you can always use different areas of shared memory. True, and my point is simply that the memcpy would wipe those all out. I guess this is the point behind the doorbell mechanism? Yes. Regards, Anthony Liguori In cases where the latest guest to join wants to clear the memory, it can do so without the automatic memcpy. The guest can do a memset once it knows the memory is attached. My opinion is to leave it to the guests and the application that is using the shared memory to decide what to do on guest joins. Cam Regards, Anthony Liguori From the guest's perspective, it's totally transparent. For the backend, I'd suggest having an explicit initialized ack or something so that it knows that the data is now mapped to the guest. From the peers' perspective, it's non-transparent :( Also it doubles the transient memory requirement. If you're doing just a ring queue in shared memory, it should allow disconnect/reconnect during live migration asynchronously to the actual qemu live migration. Live migration of guests using shared memory is interesting. You'd need to freeze all peers on one node, disconnect, reconnect, and restart them on the other node. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: network between host and guest
Probably it's slirp. The url you provided doesn't help much though. Thanks. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: migration fails with virtio-blk
10.05.2010 13:49, Michael Tokarev пишет: Apparently there's an issue with migration when virtio-blk is in use at the time migration occurs. See Debian bug #580649 for details: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=580649 In short: 2.6.26 (guest) + 0.12.3 + virtio-blk fails. 2.6.32 guest works, 0.11 works, non-virtio-blk works. It looks like a problem reported on qemu-devel list should fix the issue here: http://marc.info/?l=qemu-develm=127350821419989 Testing... It wont be fast because I want to hear from the OP. Thanks! /mjt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: migration fails with virtio-blk
10.05.2010 22:38, Anthony Liguori wrote: On 05/10/2010 01:34 PM, Michael Tokarev wrote: In short: 2.6.26 (guest) + 0.12.3 + virtio-blk fails. 2.6.32 guest works, 0.11 works, non-virtio-blk works. It looks like a problem reported on qemu-devel list should fix the issue here: http://marc.info/?l=qemu-develm=127350821419989 Can't be, that's a vhost patch and 0.12.3 doesn't support vhost. Well, that was really close. qemu-kvm-0.12 has the following code in that place: vdev-features = features; the patch mentioned has this: +if (vdev-set_features) +vdev-set_features(vdev, features); so basically kvm had the same code here, just not wrapped into a function ;) And sure thing you're right, it does not help any. :) Thanks! /mjt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: migration fails with virtio-blk
On 05/10/2010 01:34 PM, Michael Tokarev wrote: 10.05.2010 13:49, Michael Tokarev пишет: Apparently there's an issue with migration when virtio-blk is in use at the time migration occurs. See Debian bug #580649 for details: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=580649 In short: 2.6.26 (guest) + 0.12.3 + virtio-blk fails. 2.6.32 guest works, 0.11 works, non-virtio-blk works. It looks like a problem reported on qemu-devel list should fix the issue here: http://marc.info/?l=qemu-develm=127350821419989 Can't be, that's a vhost patch and 0.12.3 doesn't support vhost. Regards, Anthony Liguori Testing... It wont be fast because I want to hear from the OP. Thanks! /mjt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] pci: cleanly backout of pci_qdev_init()
If the init function of a device fails, as might happen with device assignment, we never undo the work done by do_pci_register_device(). This not only causes a bit of a memory leak, but also leaves a bogus pointer in the bus devices array that can cause a segfault or garbage data from 'info pci'. Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/pci.c | 17 - 1 files changed, 12 insertions(+), 5 deletions(-) diff --git a/hw/pci.c b/hw/pci.c index f167436..3d3560e 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -625,6 +625,14 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCIBus *bus, return pci_dev; } +static void do_pci_unregister_device(PCIDevice *pci_dev) +{ +qemu_free_irqs(pci_dev-irq); +pci_dev-bus-devices[pci_dev-devfn] = NULL; +pci_config_free(pci_dev); +return; +} + PCIDevice *pci_register_device(PCIBus *bus, const char *name, int instance_size, int devfn, PCIConfigReadFunc *config_read, @@ -680,10 +688,7 @@ static int pci_unregister_device(DeviceState *dev) return ret; pci_unregister_io_regions(pci_dev); - -qemu_free_irqs(pci_dev-irq); -pci_dev-bus-devices[pci_dev-devfn] = NULL; -pci_config_free(pci_dev); +do_pci_unregister_device(pci_dev); return 0; } @@ -1652,8 +1657,10 @@ static int pci_qdev_init(DeviceState *qdev, DeviceInfo *base) if (pci_dev == NULL) return -1; rc = info-init(pci_dev); -if (rc != 0) +if (rc != 0) { +do_pci_unregister_device(pci_dev); return rc; +} /* rom loading */ if (pci_dev-romfile == NULL info-romfile != NULL) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Keep index within boundaries in kvmppc_44x_emul_tlbwe()
On Sun, May 9, 2010 at 8:26 AM, Roel Kluin roel.kl...@gmail.com wrote: An index of KVM44x_GUEST_TLB_SIZE is already one too large. Signed-off-by: Roel Kluin roel.kl...@gmail.com Acked-by: Hollis Blanchard hol...@penguinppc.org Thanks Roel. -Hollis -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Keep index within boundaries in kvmppc_44x_emul_tlbwe()
On 11.05.2010, at 00:58, Hollis Blanchard wrote: On Sun, May 9, 2010 at 8:26 AM, Roel Kluin roel.kl...@gmail.com wrote: An index of KVM44x_GUEST_TLB_SIZE is already one too large. Signed-off-by: Roel Kluin roel.kl...@gmail.com Acked-by: Hollis Blanchard hol...@penguinppc.org Acked-by: Alexander Graf ag...@suse.de Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[69/98] KVM: remove unused load_segment_descriptor_to_kvm_desct
2.6.32-stable review patch. If anyone has any objections, please let us know. -- From: Marcelo Tosatti mtosa...@redhat.com Commit 78ce64a384 in v2.6.32.12 introduced a warning due to unused load_segment_descriptor_to_kvm_desct helper, which has been opencoded by this commit. On upstream, the helper was removed as part of a different commit. Remove the now unused function. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Signed-off-by: Greg Kroah-Hartman gre...@suse.de --- arch/x86/kvm/x86.c | 12 1 file changed, 12 deletions(-) --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4155,18 +4155,6 @@ static u16 get_segment_selector(struct k return kvm_seg.selector; } -static int load_segment_descriptor_to_kvm_desct(struct kvm_vcpu *vcpu, - u16 selector, - struct kvm_segment *kvm_seg) -{ - struct desc_struct seg_desc; - - if (load_guest_segment_descriptor(vcpu, selector, seg_desc)) - return 1; - seg_desct_to_kvm_desct(seg_desc, selector, kvm_seg); - return 0; -} - static int kvm_load_realmode_segment(struct kvm_vcpu *vcpu, u16 selector, int seg) { struct kvm_segment segvar = { -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
On Mon, May 10, 2010 at 5:59 AM, Avi Kivity a...@redhat.com wrote: On 04/21/2010 08:53 PM, Cam Macdonell wrote: Support an inter-vm shared memory device that maps a shared-memory object as a PCI device in the guest. This patch also supports interrupts between guest by communicating over a unix domain socket. This patch applies to the qemu-kvm repository. -device ivshmem,size=size in format accepted by -m[,shm=shm name] Interrupts are supported between multiple VMs by using a shared memory server by using a chardev socket. -device ivshmem,size=size in format accepted by -m[,shm=shm name] [,chardev=id][,msi=on][,irqfd=on][,vectors=n] -chardev socket,path=path,id=id (shared memory server is qemu.git/contrib/ivshmem-server) Sample programs and init scripts are in a git repo here: +typedef struct EventfdEntry { + PCIDevice *pdev; + int vector; +} EventfdEntry; + +typedef struct IVShmemState { + PCIDevice dev; + uint32_t intrmask; + uint32_t intrstatus; + uint32_t doorbell; + + CharDriverState * chr; + CharDriverState ** eventfd_chr; + int ivshmem_mmio_io_addr; + + pcibus_t mmio_addr; + unsigned long ivshmem_offset; + uint64_t ivshmem_size; /* size of shared memory region */ + int shm_fd; /* shared memory file descriptor */ + + int nr_allocated_vms; + /* array of eventfds for each guest */ + int ** eventfds; + /* keep track of # of eventfds for each guest*/ + int * eventfds_posn_count; More readable: typedef struct Peer { int nb_eventfds; int *eventfds; } Peer; int nb_peers; Peer *peers; Does eventfd_chr need to be there as well? No it does not, eventfd_chr store character devices for receiving interrupts when irqfd is not available, so we only them for this guest, not for our peers. I've switched over to this more readable naming you've suggested. + + int nr_alloc_guests; + int vm_id; + int num_eventfds; + uint32_t vectors; + uint32_t features; + EventfdEntry *eventfd_table; + + char * shmobj; + char * sizearg; Does this need to be part of the state? They are because they're passed in as qdev properties from the command-line so I thought they needed to be in the state struct to be assigned via DEFINE_PROP_... +} IVShmemState; + +/* registers for the Inter-VM shared memory device */ +enum ivshmem_registers { + IntrMask = 0, + IntrStatus = 4, + IVPosition = 8, + Doorbell = 12, +}; + +static inline uint32_t ivshmem_has_feature(IVShmemState *ivs, int feature) { + return (ivs-features (1 feature)); +} + +static inline int is_power_of_two(int x) { + return (x (x-1)) == 0; +} argument needs to be uint64_t to avoid overflow with large BARs. Return type can be bool. +static void ivshmem_io_writel(void *opaque, uint8_t addr, uint32_t val) +{ + IVShmemState *s = opaque; + + u_int64_t write_one = 1; + u_int16_t dest = val 16; + u_int16_t vector = val 0xff; + + addr= 0xfe; Why 0xfe? Can understand 0xfc or 0xff. Forgot to change to 0xfc when registers went from 16 to 32-bits. + + switch (addr) + { + case IntrMask: + ivshmem_IntrMask_write(s, val); + break; + + case IntrStatus: + ivshmem_IntrStatus_write(s, val); + break; + + case Doorbell: + /* check doorbell range */ + if ((vector= 0) (vector s-eventfds_posn_count[dest])) { What if dest is too big? We overflow s-eventfds_posn_count. added a check for that. Thanks, Cam -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [SeaBIOS] About cpu_set, CPU hotplug and related subjects
On 28.04.2010 12:45, Gleb Natapov wrote: On Wed, Apr 28, 2010 at 12:41:51PM +0200, Jes Sorensen wrote: The CPU declarations are particularly tricky as they get pretty big and complex and need to live in the DSDT, whereas a lot of other things we can shift off to separate SSDT tables and only put the minimum that needs to be generated dynamically in it's own table. We can generate complex code statically and call it from dynamically generated CPU declarations. There is some ACPI code generator in coreboot. Not sure if it is usable for those purposes. coreboot uses it to generate AMD CPU frequency tables. Regards, Carl-Daniel -- http://www.hailfinger.org/ -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv3] Support for booting from virtio disks
On Mon, May 10, 2010 at 11:36:37AM +0300, Gleb Natapov wrote: This patch adds native support for booting from virtio disks to Seabios. Signed-off-by: Gleb Natapov g...@redhat.com Thanks - commit 89acfa3f. The patch had some compile errors on gcc3.4 and gcc4.5 - I went ahead and committed an update to fix the errors (commit 7d09d0e3). -Kevin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2] KVM: inject #UD if instruction emulation fails and exit to userspace
(2010/05/11 2:33), Gleb Natapov wrote: On Mon, May 10, 2010 at 07:06:05PM +0300, Mohammed Gamal wrote: On Mon, May 10, 2010 at 1:25 PM, Gleb Natapovg...@redhat.com wrote: On Mon, May 10, 2010 at 11:16:56AM +0300, Gleb Natapov wrote: Do not kill VM when instruction emulation fails. Inject #UD and report failure to userspace instead. Userspace may choose to reenter guest if vcpu is in userspace (cpl == 3) in which case guest OS will kill offending process and continue running. I am curious to know what'd happen in case the vcpu is in kernel space (cpl == 0). Is that case handled? Currently no matter where emulation fails VM is stopped and cpu state is printed on stderr. After that patch userspace may choose to continue VM execution after emulation error (#UD will be injected into VM though). The policy is in userspace, but I don't see the point to continue execution after emulation failed in kernel. How kernel can recover from the #UD? I don't see what is the recommended(possible) way of trouble shooting yet. If the user is managing both the guest and the host, it's simple: no worth reentering the guest. The user will just see the stderr. But what about if the user can only see the guest? In the case of non-virt, usually the user sees oops log or something and calls to a support staff. Compared to that, if VM is silently stopped, what information can the user see? In such a case, catching up the trouble and calling to a support staff is host side management staff's job? If you give us preferred way of trouble shooting of KVM, from the developer's point of view, it will really help us to prepare for the future use of KVM. What we can do will depend on your development! And this is one of the reasons why I'm interested in the x86's emulation development. :-) Thanks, Takuya -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 0/4 v3] KVM: VMX: Support hosted VMM coexsitence.
Hi Avi, Do you have other comments on this version of patch? Thanks, Dongxiao Xu, Dongxiao wrote: Hi all, This is hosted VMM coexistence support v3. Main changes from v2: 1) Change vmm_coexistence to vmm_exclusive. 2) Some code structure changes. Split the original 3 patches to 4. 3) Address some comments from Avi. Main changes from v1: 1) Add an module option vmm_coexistence to decide whether to enable this feature. Currently it is off defaultly. 2) Each time when a KVM vcpu is scheduled in, it will invalidate EPT and VPID TLBs to avoid conflict between different VMMs. VMX: Support for coexistence of KVM and other hosted VMMs. The following NOTE is picked up from Intel SDM 3B 27.3 chapter, MANAGING VMCS REGIONS AND POINTERS. -- NOTE As noted in Section 21.1, the processor may optimize VMX operation by maintaining the state of an active VMCS (one for which VMPTRLD has been executed) on the processor. Before relinquishing control to other system software that may, without informing the VMM, remove power from the processor (e.g., for transitions to S3 or S4) or leave VMX operation, a VMM must VMCLEAR all active VMCSs. This ensures that all VMCS data cached by the processor are flushed to memory and that no other software can corrupt the current VMM's VMCS data. It is also recommended that the VMM execute VMXOFF after such executions of VMCLEAR. -- Currently, VMCLEAR is called at VCPU migration. To support hosted VMM coexistence, this patch modifies the VMCLEAR/VMPTRLD and VMXON/VMXOFF usages. VMCLEAR will be called when VCPU is scheduled out of a physical CPU, while VMPTRLD is called when VCPU is scheduled in a physical CPU. Also this approach could eliminates the IPI mechanism for original VMCLEAR. As suggested by SDM, VMXOFF will be called after VMCLEAR, and VMXON will be called before VMPTRLD. With this patchset, KVM and VMware Workstation 7 could launch serapate guests and they can work well with each other. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 11/12] KVM: introduce new API for getting/switching dirty bitmaps
On Tue, May 04, 2010 at 10:08:21PM +0900, Takuya Yoshikawa wrote: Now that dirty bitmaps are accessible from user space, we export the addresses of these to achieve zero-copy dirty log check: KVM_GET_USER_DIRTY_LOG_ADDR We also need an API for triggering dirty bitmap switch to take the full advantage of double buffered bitmaps. KVM_SWITCH_DIRTY_LOG See the documentation in this patch for precise explanations. About performance improvement: the most important feature of switch API is the lightness. In our test, this appeared in the form of improved responses for GUI manipulations. Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp Signed-off-by: Fernando Luis Vazquez Cao ferna...@oss.ntt.co.jp CC: Avi Kivity a...@redhat.com CC: Alexander Graf ag...@suse.de --- Documentation/kvm/api.txt | 87 + arch/ia64/kvm/kvm-ia64.c | 27 +- arch/powerpc/kvm/book3s.c | 18 +++-- arch/x86/kvm/x86.c| 44 --- include/linux/kvm.h | 11 ++ include/linux/kvm_host.h |4 ++- virt/kvm/kvm_main.c | 63 + 7 files changed, 220 insertions(+), 34 deletions(-) diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt index a237518..c106c83 100644 --- a/Documentation/kvm/api.txt +++ b/Documentation/kvm/api.txt @@ -892,6 +892,93 @@ arguments. This ioctl is only useful after KVM_CREATE_IRQCHIP. Without an in-kernel irqchip, the multiprocessing state must be maintained by userspace. +4.39 KVM_GET_USER_DIRTY_LOG_ADDR + +Capability: KVM_CAP_USER_DIRTY_LOG (=1 see below) +Architectures: all +Type: vm ioctl +Parameters: struct kvm_user_dirty_log (in/out) +Returns: 0 on success, -1 on error + +This ioctl makes it possible to use KVM_SWITCH_DIRTY_LOG (see 4.40) instead +of the old dirty log manipulation by KVM_GET_DIRTY_LOG. + +A bit about KVM_CAP_USER_DIRTY_LOG + +Before this ioctl was introduced, dirty bitmaps for dirty page logging were +allocated in the kernel's memory space. But we have now moved them to user +space to get more flexiblity and performance. To achive this move without +breaking the compatibility, we will split KVM_CAP_USER_DIRTY_LOG capability +into a few generations which can be identified by its check extension +return values. + +This KVM_GET_USER_DIRTY_LOG_ADDR belongs to the first generation with the +KVM_SWITCH_DIRTY_LOG (4.40) and must be supported by all generations. + +What you get + +By using this, you can get paired bitmap addresses which are accessible from +user space. See the explanation in 4.40 for the roles of these two bitmaps. + +How to Get + +Before calling this, you have to set the slot member of kvm_user_dirty_log +to indicate the target memory slot. + +struct kvm_user_dirty_log { + __u32 slot; + __u32 flags; + __u64 dirty_bitmap; + __u64 dirty_bitmap_old; +}; + +The addresses will be set in the paired members: dirty_bitmap and _old. Why not pass the bitmap address to the kernel, instead of having the kernel allocate it. Because apparently you plan to do that in a new generation anyway? Also, why does the kernel need to know about different bitmaps? One alternative would be: KVM_SWITCH_DIRTY_LOG passing the address of a bitmap. If the active bitmap was clean, it returns 0, no switch performed. If the active bitmap was dirty, the kernel switches to the new bitmap and returns 1. And the responsability of cleaning the new bitmap could also be left for userspace. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH RFC 10/12] KVM: move dirty bitmaps to user space
On Tue, May 04, 2010 at 10:07:02PM +0900, Takuya Yoshikawa wrote: We move dirty bitmaps to user space. - Allocation and destruction: we use do_mmap() and do_munmap(). The new bitmap space is twice longer than the original one and we use the additional space for double buffering: this makes it possible to update the active bitmap while letting the user space read the other one safely. For x86, we can also remove the vmalloc() in kvm_vm_ioctl_get_dirty_log(). - Bitmap manipulations: we replace all functions which access dirty bitmaps with *_user() functions. - For ia64: moving the dirty bitmaps of memory slots does not affect ia64 much because it's using a different place to store dirty logs rather than the dirty bitmaps of memory slots: all we have to change are sync and get of dirty log, so we don't need set_bit_user like functions for ia64. Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp Signed-off-by: Fernando Luis Vazquez Cao ferna...@oss.ntt.co.jp CC: Avi Kivity a...@redhat.com CC: Alexander Graf ag...@suse.de --- arch/ia64/kvm/kvm-ia64.c | 15 +- arch/powerpc/kvm/book3s.c |5 +++- arch/x86/kvm/x86.c| 25 -- include/linux/kvm_host.h |3 +- virt/kvm/kvm_main.c | 62 +--- 5 files changed, 82 insertions(+), 28 deletions(-) diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c index 17fd65c..03503e6 100644 --- a/arch/ia64/kvm/kvm-ia64.c +++ b/arch/ia64/kvm/kvm-ia64.c @@ -1823,11 +1823,19 @@ static int kvm_ia64_sync_dirty_log(struct kvm *kvm, n = kvm_dirty_bitmap_bytes(memslot); base = memslot-base_gfn / BITS_PER_LONG; + r = -EFAULT; + if (!access_ok(VERIFY_WRITE, memslot-dirty_bitmap, n)) + goto out; + for (i = 0; i n/sizeof(long); ++i) { if (dirty_bitmap[base + i]) memslot-is_dirty = true; - memslot-dirty_bitmap[i] = dirty_bitmap[base + i]; + if (__put_user(dirty_bitmap[base + i], +memslot-dirty_bitmap[i])) { + r = -EFAULT; + goto out; + } dirty_bitmap[base + i] = 0; } r = 0; @@ -1858,7 +1866,10 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, if (memslot-is_dirty) { kvm_flush_remote_tlbs(kvm); n = kvm_dirty_bitmap_bytes(memslot); - memset(memslot-dirty_bitmap, 0, n); + if (clear_user(memslot-dirty_bitmap, n)) { + r = -EFAULT; + goto out; + } memslot-is_dirty = false; } r = 0; diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c index 4b074f1..2a31d2f 100644 --- a/arch/powerpc/kvm/book3s.c +++ b/arch/powerpc/kvm/book3s.c @@ -1210,7 +1210,10 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, kvmppc_mmu_pte_pflush(vcpu, ga, ga_end); n = kvm_dirty_bitmap_bytes(memslot); - memset(memslot-dirty_bitmap, 0, n); + if (clear_user(memslot-dirty_bitmap, n)) { + r = -EFAULT; + goto out; + } memslot-is_dirty = false; } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 023c7f8..32a3d94 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2760,40 +2760,37 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, /* If nothing is dirty, don't bother messing with page tables. */ if (memslot-is_dirty) { struct kvm_memslots *slots, *old_slots; - unsigned long *dirty_bitmap; + unsigned long __user *dirty_bitmap; + unsigned long __user *dirty_bitmap_old; spin_lock(kvm-mmu_lock); kvm_mmu_slot_remove_write_access(kvm, log-slot); spin_unlock(kvm-mmu_lock); - r = -ENOMEM; - dirty_bitmap = vmalloc(n); - if (!dirty_bitmap) + dirty_bitmap = memslot-dirty_bitmap; + dirty_bitmap_old = memslot-dirty_bitmap_old; + r = -EFAULT; + if (clear_user(dirty_bitmap_old, n)) goto out; - memset(dirty_bitmap, 0, n); r = -ENOMEM; slots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL); - if (!slots) { - vfree(dirty_bitmap); + if (!slots) goto out; - } + memcpy(slots, kvm-memslots, sizeof(struct kvm_memslots)); - slots-memslots[log-slot].dirty_bitmap = dirty_bitmap; + slots-memslots[log-slot].dirty_bitmap = dirty_bitmap_old; + slots-memslots[log-slot].dirty_bitmap_old = dirty_bitmap;
[PATCH 2/4] KVM: Clean up duplicate assignment
mmu.free() already set root_hpa to INVALID_PAGE, no need to do it again in the destory_kvm_mmu(). kvm_x86_ops-set_cr4() and set_efer() already assign cr4/efer to vcpu-arch.cr4/efer, no need to do it again later. Signed-off-by: Sheng Yang sh...@linux.intel.com --- arch/x86/kvm/mmu.c |5 ++--- arch/x86/kvm/x86.c |4 +--- 2 files changed, 3 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 95bee9d..5412185 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2450,10 +2450,9 @@ static int init_kvm_mmu(struct kvm_vcpu *vcpu) static void destroy_kvm_mmu(struct kvm_vcpu *vcpu) { ASSERT(vcpu); - if (VALID_PAGE(vcpu-arch.mmu.root_hpa)) { + if (VALID_PAGE(vcpu-arch.mmu.root_hpa)) + /* mmu.free() should set root_hpa = INVALID_PAGE */ vcpu-arch.mmu.free(vcpu); - vcpu-arch.mmu.root_hpa = INVALID_PAGE; - } } int kvm_mmu_reset_context(struct kvm_vcpu *vcpu) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 2cae460..764f89b 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -486,7 +486,7 @@ int __kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) return 1; kvm_x86_ops-set_cr4(vcpu, cr4); - vcpu-arch.cr4 = cr4; + kvm_mmu_reset_context(vcpu); return 0; @@ -720,8 +720,6 @@ static int set_efer(struct kvm_vcpu *vcpu, u64 efer) kvm_x86_ops-set_efer(vcpu, efer); - vcpu-arch.efer = efer; - vcpu-arch.mmu.base_role.nxe = (efer EFER_NX) !tdp_enabled; kvm_mmu_reset_context(vcpu); -- 1.7.0.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/4] VMX: x86: Only reset MMU when necessary
Only modifying some bits of CR0/CR4 needs paging mode switch. Add update_rsvd_bits_mask() to address EFER.NX bit updating for reserved bits. Signed-off-by: Sheng Yang sh...@linux.intel.com --- arch/x86/include/asm/kvm_host.h |1 + arch/x86/kvm/mmu.c | 17 ++--- arch/x86/kvm/x86.c | 14 -- 3 files changed, 27 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index ed48904..c8c8a03 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -553,6 +553,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot); void kvm_mmu_zap_all(struct kvm *kvm); unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm); void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages); +void update_rsvd_bits_mask(struct kvm_vcpu *vcpu); int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3); diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 5412185..98abdcf 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2335,6 +2335,19 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, int level) } } +void update_rsvd_bits_mask(struct kvm_vcpu *vcpu) +{ + if (!is_paging(vcpu)) + return; + if (is_long_mode(vcpu)) + reset_rsvds_bits_mask(vcpu, PT64_ROOT_LEVEL); + else if (is_pae(vcpu)) + reset_rsvds_bits_mask(vcpu, PT32E_ROOT_LEVEL); + else + reset_rsvds_bits_mask(vcpu, PT32_ROOT_LEVEL); +} +EXPORT_SYMBOL_GPL(update_rsvd_bits_mask); + static int paging64_init_context_common(struct kvm_vcpu *vcpu, int level) { struct kvm_mmu *context = vcpu-arch.mmu; @@ -2400,18 +2413,16 @@ static int init_kvm_tdp_mmu(struct kvm_vcpu *vcpu) context-gva_to_gpa = nonpaging_gva_to_gpa; context-root_level = 0; } else if (is_long_mode(vcpu)) { - reset_rsvds_bits_mask(vcpu, PT64_ROOT_LEVEL); context-gva_to_gpa = paging64_gva_to_gpa; context-root_level = PT64_ROOT_LEVEL; } else if (is_pae(vcpu)) { - reset_rsvds_bits_mask(vcpu, PT32E_ROOT_LEVEL); context-gva_to_gpa = paging64_gva_to_gpa; context-root_level = PT32E_ROOT_LEVEL; } else { - reset_rsvds_bits_mask(vcpu, PT32_ROOT_LEVEL); context-gva_to_gpa = paging32_gva_to_gpa; context-root_level = PT32_ROOT_LEVEL; } + update_rsvd_bits_mask(vcpu); return 0; } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index b59fc67..1c76e08 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -416,6 +416,9 @@ out: static int __kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0) { + unsigned long old_cr0 = kvm_read_cr0(vcpu); + unsigned long update_bits = X86_CR0_PG | X86_CR0_PE; + cr0 |= X86_CR0_ET; #ifdef CONFIG_X86_64 @@ -449,7 +452,8 @@ static int __kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0) kvm_x86_ops-set_cr0(vcpu, cr0); - kvm_mmu_reset_context(vcpu); + if ((cr0 ^ old_cr0) update_bits) + kvm_mmu_reset_context(vcpu); return 0; } @@ -487,7 +491,8 @@ int __kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) kvm_x86_ops-set_cr4(vcpu, cr4); - kvm_mmu_reset_context(vcpu); + if ((cr4 ^ old_cr4) pdptr_bits) + kvm_mmu_reset_context(vcpu); return 0; } @@ -692,6 +697,8 @@ static u32 emulated_msrs[] = { static int set_efer(struct kvm_vcpu *vcpu, u64 efer) { + u64 old_efer = vcpu-arch.efer; + if (efer efer_reserved_bits) return 1; @@ -722,6 +729,9 @@ static int set_efer(struct kvm_vcpu *vcpu, u64 efer) vcpu-arch.mmu.base_role.nxe = (efer EFER_NX) !tdp_enabled; + if ((efer ^ old_efer) EFER_NX) + update_rsvd_bits_mask(vcpu); + return 0; } -- 1.7.0.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/4] KVM: x86: Remove kvm_mmu_reset_context() in kvm_set_efer()
Modify EFER won't result in mode switch directly. After EFER.LME set, the following set CR0.PG would result in mode switch to IA32e. And the later action already covered by kvm_set_cr0(). Signed-off-by: Sheng Yang sh...@linux.intel.com --- arch/x86/kvm/x86.c |1 - 1 files changed, 0 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 764f89b..b59fc67 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -721,7 +721,6 @@ static int set_efer(struct kvm_vcpu *vcpu, u64 efer) kvm_x86_ops-set_efer(vcpu, efer); vcpu-arch.mmu.base_role.nxe = (efer EFER_NX) !tdp_enabled; - kvm_mmu_reset_context(vcpu); return 0; } -- 1.7.0.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/4] KVM: x86: Check LMA bit before set_efer
kvm_x86_ops-set_efer() would execute vcpu-arch.efer = efer, so the checking of LMA bit didn't work. Signed-off-by: Sheng Yang sh...@linux.intel.com --- arch/x86/kvm/x86.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index cd8a606..2cae460 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -715,11 +715,11 @@ static int set_efer(struct kvm_vcpu *vcpu, u64 efer) return 1; } - kvm_x86_ops-set_efer(vcpu, efer); - efer = ~EFER_LMA; efer |= vcpu-arch.efer EFER_LMA; + kvm_x86_ops-set_efer(vcpu, efer); + vcpu-arch.efer = efer; vcpu-arch.mmu.base_role.nxe = (efer EFER_NX) !tdp_enabled; -- 1.7.0.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 11/12] KVM: introduce new API for getting/switching dirty bitmaps
(2010/05/11 12:43), Marcelo Tosatti wrote: On Tue, May 04, 2010 at 10:08:21PM +0900, Takuya Yoshikawa wrote: +How to Get + +Before calling this, you have to set the slot member of kvm_user_dirty_log +to indicate the target memory slot. + +struct kvm_user_dirty_log { + __u32 slot; + __u32 flags; + __u64 dirty_bitmap; + __u64 dirty_bitmap_old; +}; + +The addresses will be set in the paired members: dirty_bitmap and _old. Why not pass the bitmap address to the kernel, instead of having the kernel allocate it. Because apparently you plan to do that in a new generation anyway? Yes, we want to make qemu allocate and free bitmaps in the future. But currently, those are strictly tied with memory slot registration and changing it in one patch set is really difficult. Anyway, we need kernel side allocation mechanism to keep the current GET_DIRTY_LOG api. I don't mind not exporting kernel allocated bitmaps in this patch set and later introducing a bitmap registration mechanism in another patch set. As this RFC is suggesting, kernel side double buffering infrastructure is designed as general as possible and adding a new API like SWITCH can be done naturally. Also, why does the kernel need to know about different bitmaps? Because we need to support current GET API. We don't have any way to get a new bitmap in the case of GET and we don't want to do_mmap() every time for GET. One alternative would be: KVM_SWITCH_DIRTY_LOG passing the address of a bitmap. If the active bitmap was clean, it returns 0, no switch performed. If the active bitmap was dirty, the kernel switches to the new bitmap and returns 1. And the responsability of cleaning the new bitmap could also be left for userspace. That is a beautiful approach but we can do that only when we give up using GET api. I follow you and Avi's advice about that kind of maintenance policy! What do you think? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH resend 8/12] asm-generic: bitops: introduce le bit offset macro
Yes, I'm just using in kernel space: qemu has its own endian related helpers. So if you allow us to place this macro in asm-generic/bitops/* it will help us. No problem at all then. Thanks for the explanation. Acked-by: Arnd Bergmanna...@arndb.de Thanks you both. I will add your Acked-by from now on! Takuya -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space
On 05/04/2010 03:56 PM, Takuya Yoshikawa wrote: [Performance test] We measured the tsc needed to the ioctl()s for getting dirty logs in kernel. Test environment AMD Phenom(tm) 9850 Quad-Core Processor with 8GB memory 1. GUI test (running Ubuntu guest in graphical mode) sudo qemu-system-x86_64 -hda dirtylog_test.img -boot c -m 4192 -net ... We show a relatively stable part to compare how much time is needed for the basic parts of dirty log ioctl. get.org get.opt switch.opt slots[7].len=32768 278379 66398 64024 slots[8].len=32768 181246 270 160 slots[7].len=32768 263961 64673 64494 slots[8].len=32768 181655 265 160 slots[7].len=32768 263736 64701 64610 slots[8].len=32768 182785 267 160 slots[7].len=32768 260925 65360 65042 slots[8].len=32768 182579 264 160 slots[7].len=32768 267823 65915 65682 slots[8].len=32768 186350 271 160 At a glance, we know our optimization improved significantly compared to the original get dirty log ioctl. This is true for both get.opt and switch.opt. This has a really big impact for the personal KVM users who drive KVM in GUI mode on their usual PCs. Next, we notice that switch.opt improved a hundred nano seconds or so for these slots. Although this may sound a bit tiny improvement, we can feel this as a difference of GUI's responses like mouse reactions. 100 ns... this is a bit on the low side (and if you can measure it interactively you have much better reflexes than I). To feel the difference, please try GUI on your PC with our patch series! No doubt get.org - get.opt is measurable, but get.opt-switch.opt is problematic. Have you tried profiling to see where the time is spent (well I can guess, clearing the write access from the sptes). 2. Live-migration test (4GB guest, write loop with 1GB buf) We also did a live-migration test. get.org get.opt switch.opt slots[0].len=655360 797383261144222181 slots[1].len=37570478082186721 1965244 1842824 slots[2].len=637534208 1433562 1012723 1031213 slots[3].len=131072 216858 331 331 slots[4].len=131072 121635 225 164 slots[5].len=131072 120863 356 164 slots[6].len=16777216 121746 1133 156 slots[7].len=32768 120415 230 278 slots[8].len=32768 120368 216 149 slots[0].len=655360 806497194710223582 slots[1].len=37570478082142922 1878025 1895369 slots[2].len=637534208 1386512 1021309 1000345 slots[3].len=131072 221118 459 296 slots[4].len=131072 121516 272 166 slots[5].len=131072 122652 244 173 slots[6].len=16777216 123226 99185 149 slots[7].len=32768 121803 457 505 slots[8].len=32768 121586 216 155 slots[0].len=655360 766113211317213179 slots[1].len=37570478082155662 1974790 1842361 slots[2].len=637534208 1481411 1020004 1031352 slots[3].len=131072 223100 351 295 slots[4].len=131072 122982 436 164 slots[5].len=131072 122100 300 503 slots[6].len=16777216 123653 779 151 slots[7].len=32768 122617 284 157 slots[8].len=32768 122737 253 149 For slots other than 0,1,2 we can see the similar improvement. Considering the fact that switch.opt does not depend on the bitmap length except for kvm_mmu_slot_remove_write_access(), this is the cause of some usec to msec time consumption: there might be some context switches. But note that this was done with the workload which dirtied the memory endlessly during the live-migration. In usual workload, the number of dirty pages varies a lot for each iteration and we should gain really a lot for relatively clean cases. Can you post such a test, for an idle large guest? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space
get.org get.opt switch.opt slots[7].len=32768 278379 66398 64024 slots[8].len=32768 181246 270 160 slots[7].len=32768 263961 64673 64494 slots[8].len=32768 181655 265 160 slots[7].len=32768 263736 64701 64610 slots[8].len=32768 182785 267 160 slots[7].len=32768 260925 65360 65042 slots[8].len=32768 182579 264 160 slots[7].len=32768 267823 65915 65682 slots[8].len=32768 186350 271 160 At a glance, we know our optimization improved significantly compared to the original get dirty log ioctl. This is true for both get.opt and switch.opt. This has a really big impact for the personal KVM users who drive KVM in GUI mode on their usual PCs. Next, we notice that switch.opt improved a hundred nano seconds or so for these slots. Although this may sound a bit tiny improvement, we can feel this as a difference of GUI's responses like mouse reactions. 100 ns... this is a bit on the low side (and if you can measure it interactively you have much better reflexes than I). To feel the difference, please try GUI on your PC with our patch series! No doubt get.org - get.opt is measurable, but get.opt-switch.opt is problematic. Have you tried profiling to see where the time is spent (well I can guess, clearing the write access from the sptes). Sorry but no, and I agree with your guess. Anyway, I want to do some profiling to confirm this guess. BTW, If we only think about performance improvement of time, optimized get(get.opt) may be enough at this stage. But if we consider the future expansion like using user allocated bitmaps, new API's introduced for switch.opt won't become waste, I think, because we need a structure to get and export bitmap addresses. In usual workload, the number of dirty pages varies a lot for each iteration and we should gain really a lot for relatively clean cases. Can you post such a test, for an idle large guest? OK, I'll do! -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html