[COMMIT master] Merge branch 'upstream-merge'
From: Avi Kivity a...@redhat.com

* upstream-merge: (54 commits)
  Make the e1000 the default network adapter for the pc target.
  eliminate errors about unused results in block/vpc.c
  virtio-blk: add msi support.
  qdev/prop: convert isa-bus to helper macros.
  make pthreads mandatory
  qemu: move virtio-pci.o to near pci.o
  char: Emit 'CLOSED' events on char device close
  cleanup cpu-exec.c, part 0/N: consolidate handle_cpu_signal
  Unbreak large mem support by removing kqemu
  unify popen/fopen qemu wrappers
  Only build osdep once
  Route IOAPIC interrupts via ISA bus
  Add a configure switch to enable / disable all user targets. I felt compelled to do it for symmetry, mostly it is useful to disable user targets when you don't want to build them.
  Migration via unix sockets.
  Route PC irqs to ISA bus instead of i8259 directly
  Makefile: fixed rule TAGS
  QEMU set irq0override in fw_cfg
  SMART ATA Functionality
  Add missing linefeed in error message
  Clean up VGA type selection; far too many variables being used to track one state leads to confusion if new variables are added.
  ...

Signed-off-by: Avi Kivity a...@redhat.com
-- To unsubscribe from this list: send the line unsubscribe kvm-commits in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
IBM server-X3650 issue with kvm modules
Hi Everyone, I am currently porting the kvm-76 modules to an IBM x3650 server target. System Information: processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Xeon(R) CPU E5410 @ 2.33GHz stepping: 6 cpu MHz : 2327.501 cache size : 6144 KB physical id : 0 siblings: 4 core id : 0 cpu cores : 4 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm bogomips: 4658.35 clflush size: 64 cache_alignment : 64 address sizes : 38 bits physical, 48 bits virtual power management: Thanks for your patience. When I run the qemu utilities to install a Red Hat or Fedora guest on the MontaVista host running on this IBM server, I find that vmwrite errors occur on the following registers: GUEST_RFLAGS = 0x6820 TPR_THRESHOLD = 0x401c CPU_BASED_VM_EXEC_CONTROL = 0x4002 GUEST_RSP = 0x681c GUEST_RIP = 0x681e HOST_CR0 = 0x6c00 And then kvm-76 aborts. Any help will be appreciated. Thanks in advance. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: IBM server-X3650 issue with kvm modules
On 08/25/2009 09:00 AM, Haneef Syed wrote: when i run qemu utilities to install redhat or fedora guest on montavista host running in ibm server, i found that there is vmwrite error occurs on following You've already posted this before. It's impossible for us to debug kvm interactions with the montavista kernel. Please take it up with montavista's support. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Extending virtio_console to support multiple ports
Hello all,

Here is a new iteration of the patch series that implements a transport for guest and host communications. The code has been updated to reuse the virtio-console device instead of creating a new virtio-serial device. I've tested for compatibility (old qemu new kernel, new qemu old kernel, new qemu new kernel) and it all works fine. There are a few items on my todo list but this works well.

New since last send:
- connect/disconnect notifications on connections to ports

TODO:
- Look at converting to a tty interface instead of the current char interface
- Migration: save the extra state that's necessary
- Convert all config writes to little endian in qemu / convert from little endian to host endian in guest
- Address a few FIXMEs spread in the code
- Introduce a watermark to stop a rogue host process flooding guest with data
- Make connect/disconnect work for guest

Please review.

Amit
[PATCH] virtio_console: Add interface for guest and host communication
Expose multiple char devices (ports) for simple communication between the host userspace and guest. Sample offline usages can be: poking around in a guest to find out the file systems used, applications installed, etc. Online usages can be sharing of clipboard data between the guest and the host, sending information about logged-in users to the host, locking the screen or session when a vnc session is closed, and so on. The interface presented to guest userspace is of a simple char device, so it can be used like this: fd = open("/dev/vcon2", O_RDWR); ret = read(fd, buf, 100); ret = write(fd, string, strlen(string)); Each port is to be assigned a unique function, for example, the first 4 ports may be reserved for libvirt usage, the next 4 for generic streaming data and so on. This port-function mapping isn't finalised yet. For requirements, use-cases and some history see http://www.linux-kvm.org/page/VMchannel_Requirements Signed-off-by: Amit Shah amit.s...@redhat.com --- drivers/char/Kconfig |4 +- drivers/char/virtio_console.c | 871 +++- include/linux/virtio_console.h | 27 ++ 3 files changed, 791 insertions(+), 111 deletions(-) diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig index 6a06913..fe76627 100644 --- a/drivers/char/Kconfig +++ b/drivers/char/Kconfig @@ -678,7 +678,9 @@ config VIRTIO_CONSOLE select HVC_DRIVER help Virtio console for use with lguest and other hypervisors. - + Also serves as a general-purpose serial device for data transfer + between the guest and host. Character devices at /dev/vconNN will + be created when corresponding ports are found. config HVCS tristate "IBM Hypervisor Virtual Console Server support" diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c index c74dacf..3548bf6 100644 --- a/drivers/char/virtio_console.c +++ b/drivers/char/virtio_console.c @@ -13,6 +13,7 @@ * Host can send more. Buffering in the Host could alleviate this, but it is a * difficult problem in general. 
:*/ /* Copyright (C) 2006, 2007 Rusty Russell, IBM Corporation + * Copyright (C) 2009, Amit Shah, Red Hat, Inc. * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -28,100 +29,374 @@ * along with this program; if not, write to the Free Software * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ + +#include <linux/cdev.h> +#include <linux/device.h> #include <linux/err.h> +#include <linux/fs.h> #include <linux/init.h> +#include <linux/poll.h> #include <linux/virtio.h> #include <linux/virtio_console.h> +#include <linux/workqueue.h> #include "hvc_console.h" -/*D:340 These represent our input and output console queues, and the virtio +/*D:340 FIXME These represent our input and output console queues, and the virtio * operations for them. */ -static struct virtqueue *in_vq, *out_vq; -static struct virtio_device *vdev; +struct virtio_console_struct { + struct work_struct rx_work; + struct work_struct tx_work; + struct work_struct config_work; + + struct list_head port_head; + struct list_head unused_read_head; + + struct virtio_device *vdev; + struct class *class; + struct virtqueue *in_vq, *out_vq; + + struct virtio_console_config *config; +}; + +/* This struct holds individual buffers received for each port */ +struct virtio_console_port_buffer { + struct list_head next; -/* This is our input buffer, and how much data is left in it. 
*/ -static unsigned int in_len; -static char *in, *inbuf; + char *buf; + + size_t len; /* length of the buffer */ + size_t offset; /* offset in the buf from which to consume data */ +}; + +struct virtio_console_port { + /* Next port in the list */ + struct list_head next; + + /* Buffer management */ + struct virtio_console_port_buffer read_buf; + struct list_head readbuf_head; + wait_queue_head_t waitqueue; + + /* Each port associates with a separate char device */ + struct cdev cdev; + struct device *dev; + + bool host_connected; /* Is the host device open */ +}; + +static struct virtio_console_struct virtconsole; + +static int major = 60; /* from the experimental range */ /* The operations for our console. */ +/* FIXME: this should go into the per-port struct */ static struct hv_ops virtio_cons; /* The hvc device */ static struct hvc_struct *hvc; -/*D:310 The put_chars() callback is pretty straightforward. +static struct virtio_console_port *get_port_from_id(u32 id) +{ + struct virtio_console_port *port; + + list_for_each_entry(port, &virtconsole.port_head, next) { + if (MINOR(port->dev->devt) == id) + return port; + } + return NULL; +} + +static int get_id_from_port(struct
[PATCH 2/3] virtio-console: rename dvq to ovq
It isn't obvious what 'dvq' stands for. Since it's the output queue and the corresponding input queue is called 'ivq', call this 'ovq' Signed-off-by: Amit Shah amit.s...@redhat.com --- hw/virtio-console.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/hw/virtio-console.c b/hw/virtio-console.c index 663c8b9..92c953c 100644 --- a/hw/virtio-console.c +++ b/hw/virtio-console.c @@ -20,7 +20,7 @@ typedef struct VirtIOConsole { VirtIODevice vdev; -VirtQueue *ivq, *dvq; +VirtQueue *ivq, *ovq; CharDriverState *chr; } VirtIOConsole; @@ -135,7 +135,7 @@ VirtIODevice *virtio_console_init(DeviceState *dev) s->vdev.get_features = virtio_console_get_features; s->ivq = virtio_add_queue(&s->vdev, 128, virtio_console_handle_input); -s->dvq = virtio_add_queue(&s->vdev, 128, virtio_console_handle_output); +s->ovq = virtio_add_queue(&s->vdev, 128, virtio_console_handle_output); s->chr = qdev_init_chardev(dev); qemu_chr_add_handlers(s->chr, vcon_can_read, vcon_read, vcon_event, s); -- 1.6.2.5
[PATCH 1/3] char: Emit 'CLOSED' events on char device close
Notify users of the char interface whenever the file / connection is closed. Signed-off-by: Amit Shah amit.s...@redhat.com --- qemu-char.c | 10 ++ qemu-char.h |1 + 2 files changed, 11 insertions(+), 0 deletions(-) diff --git a/qemu-char.c b/qemu-char.c index be27994..c25ed1c 100644 --- a/qemu-char.c +++ b/qemu-char.c @@ -570,6 +570,7 @@ static void fd_chr_read(void *opaque) if (size == 0) { /* FD has been closed. Remove it from the active list. */ qemu_set_fd_handler2(s->fd_in, NULL, NULL, NULL, NULL); +qemu_chr_event(chr, CHR_EVENT_CLOSED); return; } if (size > 0) { @@ -602,6 +603,7 @@ static void fd_chr_close(struct CharDriverState *chr) } qemu_free(s); +qemu_chr_event(chr, CHR_EVENT_CLOSED); } /* open a character device to a unix fd */ @@ -690,6 +692,7 @@ static void stdio_read(void *opaque) if (size == 0) { /* stdin has been closed. Remove it from the active list. */ qemu_set_fd_handler2(0, NULL, NULL, NULL, NULL); +qemu_chr_event(chr, CHR_EVENT_CLOSED); return; } if (size > 0) { @@ -943,6 +946,7 @@ static void pty_chr_close(struct CharDriverState *chr) qemu_del_timer(s->timer); qemu_free_timer(s->timer); qemu_free(s); +qemu_chr_event(chr, CHR_EVENT_CLOSED); } static CharDriverState *qemu_chr_open_pty(void) @@ -1264,6 +1268,7 @@ static void pp_close(CharDriverState *chr) ioctl(fd, PPRELEASE); close(fd); qemu_free(drv); +qemu_chr_event(chr, CHR_EVENT_CLOSED); } static CharDriverState *qemu_chr_open_pp(const char *filename) @@ -1390,6 +1395,8 @@ static void win_chr_close(CharDriverState *chr) qemu_del_polling_cb(win_chr_pipe_poll, chr); else qemu_del_polling_cb(win_chr_poll, chr); + +qemu_chr_event(chr, CHR_EVENT_CLOSED); } static int win_chr_init(CharDriverState *chr, const char *filename) @@ -1779,6 +1786,7 @@ static void udp_chr_close(CharDriverState *chr) closesocket(s->fd); } qemu_free(s); +qemu_chr_event(chr, CHR_EVENT_CLOSED); } static CharDriverState *qemu_chr_open_udp(const char *def) @@ -1999,6 +2007,7 @@ static void tcp_chr_read(void *opaque) 
qemu_set_fd_handler(s->fd, NULL, NULL, NULL); closesocket(s->fd); s->fd = -1; +qemu_chr_event(chr, CHR_EVENT_CLOSED); } else if (size > 0) { if (s->do_telnetopt) tcp_chr_process_IAC_bytes(chr, s, buf, size); @@ -2095,6 +2104,7 @@ static void tcp_chr_close(CharDriverState *chr) closesocket(s->listen_fd); } qemu_free(s); +qemu_chr_event(chr, CHR_EVENT_CLOSED); } static CharDriverState *qemu_chr_open_tcp(const char *host_str, diff --git a/qemu-char.h b/qemu-char.h index 77d4eda..df620bc 100644 --- a/qemu-char.h +++ b/qemu-char.h @@ -11,6 +11,7 @@ #define CHR_EVENT_RESET 2 /* new connection established */ #define CHR_EVENT_MUX_IN 3 /* mux-focus was set to this terminal */ #define CHR_EVENT_MUX_OUT 4 /* mux-focus will move on */ +#define CHR_EVENT_CLOSED 5 /* connection closed */ #define CHR_IOCTL_SERIAL_SET_PARAMS 1 -- 1.6.2.5
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
There are userspace libraries that do almost everything, but you hardly see things like pthread_(EFD_STATE-like)_create() or similar system interfaces based on such abstraction. It actually seems as close to a condition variable as an eventfd can be. A pthread condition typical code usage maps to eventfd like: while (read(efd, ...) > 0) if (CONDITION) break; So a pthread condition is really a wakeup gate like eventfd is. EFD_STATE has nothing to do with a pthread condition. No, your code does not work for pthread_cond_broadcast (which arguably is the common case) because all the eventfd readers after the first would block. Paolo
Re: vhost net: performance with ping benchmark
On Tue, Aug 25, 2009 at 07:14:29AM +0300, Avi Kivity wrote: On 08/25/2009 05:22 AM, Anthony Liguori wrote: I think 2.6.32 is pushing it. 2.6.32 is pushing it, but we need to push it. I think some time is needed to flush out the userspace interface. In particular, I don't think Mark's comments have been adequately addressed. If a version were merged without GSO support, some mechanism to do feature detection would be needed in the userspace API. I don't see any point in merging without gso (unless it beats userspace with gso, which I don't think will happen). In any case we'll need feature negotiation. I think this is likely going to be needed regardless. I also think the tap compatibility suggestion would simplify the consumption of this in userspace. What about veth pairs? I'd like some time to look at get_state/set_state ioctl()s along with dirty tracking support. It's a much better model for live migration IMHO. My preference is ring proxying. Note we'll need ring proxying (or at least event proxying) for non-MSI guests. Exactly, that's what I meant earlier. That's enough, isn't it, Anthony? I think so. More thorough benchmarking would be good too. In particular, netperf/iperf runs would be nice. Definitely.
Re: vhost net: performance with ping benchmark
On Mon, Aug 24, 2009 at 09:22:47PM -0500, Anthony Liguori wrote: Michael S. Tsirkin wrote: On Mon, Aug 24, 2009 at 11:12:41AM +0300, Michael S. Tsirkin wrote: At Rusty's suggestion, I tested vhost base performance with ping. Results below, and seem to be what you'd expect. Rusty, any chance you could look at the code? Is it in reasonable shape? I think it makes sense to merge it through you. What do you think? One comment on file placement: I put files under a separate vhost directory to avoid confusion with virtio-net which runs in guest. Does this sound sane? Also, can a minimal version (without TSO, tap or any other features) be merged upstream first so that features can be added later? Or do we have to wait until it's more full featured? Finally, can it reasonably make 2.6.32, or you think it needs more time out of tree? I think 2.6.32 is pushing it. I think some time is needed to flush out the userspace interface. In particular, I don't think Mark's comments have been adequately addressed. Went over, and I thought they have. Mark, could you please comment? Are you ok with the interface? If a version were merged without GSO support, some mechanism to do feature detection would be needed in the userspace API. Correct. There's already GET_FEATURES/ACK_FEATURES in place for this. I think this is likely going to be needed regardless. I also think the tap compatibility suggestion would simplify the consumption of this in userspace. Yes. I'll post a patch to tap showing how this can be done without vhost changes. I'd like some time to look at get_state/set_state ioctl()s along with dirty tracking support. It's a much better model for live migration IMHO. That option is also available in my code, I just went for a simpler one in my qemu patch. I will outline how it works in a separate mail. I think so more thorough benchmarking would be good too. In particular, netperf/iperf runs would be nice. 
I don't expect the first version to perform well in all situations and for all users. But it can't regress, can it? One can always fall back to userspace ... Regards, Anthony Liguori Thanks very much,
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Mon, Aug 24, 2009 at 03:15:05PM -0700, Davide Libenzi wrote: On Tue, 25 Aug 2009, Michael S. Tsirkin wrote: On Mon, Aug 24, 2009 at 11:25:01AM -0700, Davide Libenzi wrote: On Sun, 23 Aug 2009, Michael S. Tsirkin wrote: On Sun, Aug 23, 2009 at 04:40:51PM +0300, Avi Kivity wrote: On 08/23/2009 04:36 PM, Michael S. Tsirkin wrote: More important here is realization that eventfd is a mutex/semaphore implementation, not a generic event reporting interface as we are trying to use it. Well it is a generic event reporting interface (for example, aio uses it). Davide, I think it's a valid point. For example, what read on eventfd does (zero a counter and return) is not like any semaphore I saw. Indeed, the default eventfd behaviour is like, well, an event. Signaling (kernel side) or writing (userspace side), signals the event. Waiting (reading) it, will reset the event. If you use EFD_SEMAPHORE, you get a semaphore-like behavior. Events and sempahores are two widely known and used abstractions. The EFD_STATE proposed one, well, no. Not at all. Hmm. All we try to do is, associate a small key with the event that we signal. Is it really that uncommon/KVM specific? All I'm trying to do, is to avoid that eventfd will become an horrible multiplexor for every freaky one-time-use behaviors arising inside kernel modules. Yes, we don't want that. The best thing is to try to restate the problem in a way that is generic, and then either solve or best use existing solution. Right? I thought I had that, but apparently not. The reason I'm Cc-ing you is not to try and spam you until you give up and accept the patch, it's hoping that you see the pattern behind our usage, and help generalize it. If I understand it correctly, you believe this is not possible and so any solution will have to be in KVM? Or maybe I didn't state the problem clearly enough and should restate it? 
- Davide -- MST
[PATCH 3/3] virtio-console: Add interface for generic guest-host communication
This interface extends the virtio-console device to handle multiple ports that present a char device from which bits can be sent and read. Sample uses for such a device can be obtaining info from the guest like the file systems used, apps installed, etc. for offline usage and logged-in users, clipboard copy-paste, etc. for online usage. Each port is to be assigned a unique function, for example, the first 4 ports may be reserved for libvirt usage, the next 4 for generic streaming data and so on. This port-function mapping isn't finalised yet. For requirements, use-cases and some history see http://www.linux-kvm.org/page/VMchannel_Requirements Signed-off-by: Amit Shah amit.s...@redhat.com --- hw/pc.c | 16 ++- hw/virtio-console.c | 423 +-- hw/virtio-console.h | 47 ++ monitor.c |7 + qemu-monitor.hx | 10 ++ qemu-options.hx |2 +- sysemu.h| 10 +- vl.c| 41 +++--- 8 files changed, 479 insertions(+), 77 deletions(-) diff --git a/hw/pc.c b/hw/pc.c index ca681b8..96eb7ac 100644 --- a/hw/pc.c +++ b/hw/pc.c @@ -1453,11 +1453,17 @@ static void pc_init1(ram_addr_t ram_size, } /* Add virtio console devices */ -if (pci_enabled) { -for(i = 0; i < MAX_VIRTIO_CONSOLES; i++) { -if (virtcon_hds[i]) { -pci_create_simple(pci_bus, -1, "virtio-console-pci"); -} +if (pci_enabled && virtcon_nr_ports) { +void *dev; + +dev = pci_create_simple(pci_bus, -1, "virtio-console-pci"); +if (!dev) { +fprintf(stderr, "qemu: could not create virtio console pci device\n"); +exit(1); +} + +for (i = 0; i < virtcon_nr_ports; i++) { +virtio_console_new_port(dev, virtcon_idx[i]); } } diff --git a/hw/virtio-console.c b/hw/virtio-console.c index 92c953c..f8f9866 100644 --- a/hw/virtio-console.c +++ b/hw/virtio-console.c @@ -2,9 +2,11 @@ * Virtio Console Device * * Copyright IBM, Corp. 2008 + * Copyright Red Hat, Inc. 2009 * * Authors: * Christian Ehrhardt ehrha...@linux.vnet.ibm.com + * Amit Shah amit.s...@redhat.com * * This work is licensed under the terms of the GNU GPL, version 2. 
See * the COPYING file in the top-level directory. @@ -12,39 +14,158 @@ */ #include "hw.h" +#include "pci.h" +#include "monitor.h" #include "qemu-char.h" #include "virtio.h" #include "virtio-console.h" - typedef struct VirtIOConsole { VirtIODevice vdev; +PCIDevice *dev; VirtQueue *ivq, *ovq; -CharDriverState *chr; +struct VirtIOConsolePort *ports; +struct virtio_console_config *config; +uint32_t guest_features; } VirtIOConsole; -static VirtIOConsole *to_virtio_console(VirtIODevice *vdev) +typedef struct VirtIOConsolePort { +VirtIOConsole *vcon; +CharDriverState *hd; +bool guest_connected; +} VirtIOConsolePort; + +static VirtIOConsole *virtio_console; +static struct virtio_console_config virtcon_config; + +static VirtIOConsolePort *get_port_from_id(uint32_t id) +{ +if (id >= MAX_VIRTIO_CONSOLE_PORTS) +return NULL; + +return &virtio_console->ports[id]; +} + +static int get_id_from_port(VirtIOConsolePort *port) { -return (VirtIOConsole *)vdev; +uint32_t i; + +for (i = 0; i < MAX_VIRTIO_CONSOLE_PORTS; i++) { +if (port == &virtio_console->ports[i]) { +return i; +} +} +return VIRTIO_CONSOLE_BAD_ID; +} + +static bool use_multiport(void) +{ +return virtio_console->guest_features & (1 << VIRTIO_CONSOLE_F_MULTIPORT); +} + +void virtio_console_monitor_command(Monitor *mon, +const char *command, const char *param) +{ +int ret; + +if(!strncmp(command, "add_port", 8)) { +if (!param) { +monitor_printf(mon, "Error: need port id to add new port\n"); +return; +} +ret = init_virtio_console_port(virtcon_nr_ports, param); +if (ret < 0) { +monitor_printf(mon, "Error: cannot add new port: %s\n", + strerror(-ret)); +return; +} +virtio_console_new_port(NULL, virtcon_idx[virtcon_nr_ports]); +virtcon_nr_ports++; +virtio_console->config->nr_active_ports = cpu_to_le32(virtcon_nr_ports); +return; +} +} + +static size_t flush_buf(uint32_t id, VirtIOConsolePort *port, +const uint8_t *buf, size_t len) +{ +size_t write_len = 0; + +if (port->hd) { +write_len = qemu_chr_write(port->hd, buf, len); +return write_len; +} +return 0; +} + +/* 
Guest wants to notify us of some event */ +static void handle_control_message(VirtIOConsolePort *port, + struct virtio_console_control *cpkt) +{ +switch(cpkt->event) { +case VIRTIO_CONSOLE_PORT_OPEN: +port->guest_connected = cpkt->value; +break; +} } +/* Guest
Re: [PATCHv4 0/2] vhost: a kernel-level virtio server
On Thu, 20 Aug 2009 12:30:29 am Michael S. Tsirkin wrote: Rusty, could you review and comment on the patches please? Since most of the code deals with virtio from host side, I think it will make sense to merge them through your tree. What do you think? Yep, I've been waiting for all the other comments, then I got ill. Now I'm all recovered, what better way to spend an evening than reviewing fresh code? New dir seems fine to me, too. Thanks, Rusty.
Re: Biweekly KVM Test report, kernel 7597f... qemu 1c45e...
On 08/21/2009 10:14 AM, Xu, Jiajun wrote: I found the migration failure is caused by a configuration mistake on our testing machine. Now 64-bit migration works well. But I found on PAE host, migration will cause host kernel call trace. Pid: 12053, comm: qemu-system-x86 Tainted: G D(2.6.31-rc2 #1) EIP: 0060:[c043e023] EFLAGS: 00210202 CPU: 0 EIP is at lock_hrtimer_base+0x11/0x33 EAX: f5d1541c EBX: 0010 ECX: 04a9 EDX: f5c1bc7c ESI: f5d1541c EDI: f5c1bc7c EBP: f5c1bc74 ESP: f5c1bc68 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 Process qemu-system-x86 (pid: 12053, ti=f5c1b000 task=f61cb410 task.ti=f5c1b000) Stack: f5d1541c 04a9 f5c1bc8c c043e097 f9b7f7cb f5d1541c 0 04a9 f5c1bc98 c043e0f0 f5d153d0 f5c1bcb0 f9b9b4df bfd8a102 0 f3c1e000 f5d15440 f5c1bcc0 f9b9b56d bfd8a10c f3c1e000 f5c1bda0 f9b8c26b Call Trace: [c043e097] ? hrtimer_try_to_cancel+0x16/0x62 [f9b7f7cb] ? kvm_flush_remote_tlbs+0xd/0x1a [kvm] [c043e0f0] ? hrtimer_cancel+0xd/0x18 [f9b9b4df] ? pit_load_count+0x98/0x9e [kvm] [f9b9b56d] ? kvm_pit_load_count+0x21/0x35 [kvm] Marcelo, any idea? Looks like the PIT was reloaded, but the hrtimer wasn't initialized? -- error compiling committee.c: too many arguments to function
Re: [RFC] defer skb allocation in virtio_net -- mergable buff part
Wanted to try this awhile now, good to see this worked on! Some comments inline. On Wed, Aug 12, 2009 at 11:33:51PM -0700, Shirley Ma wrote: Guest virtio_net receives packets from its pre-allocated vring buffers, then it delivers these packets to upper layer protocols as skb buffs. So it's not necessary to pre-allocate an skb for each mergeable buffer and then free it when it turns out to be unneeded. This patch defers skb allocation to the point where packets are received, which reduces skb pre-allocations and skb frees. It introduces two page lists: freed_pages and used_pages; used_pages tracks pre-allocated pages and is only needed when removing virtio_net. This patch has been tested and measured against 2.6.31-rc4 git. I thought this patch would improve large packet performance, but I saw netperf TCP_STREAM performance improve for small packets for both local guest to host and host to local guest cases. It also reduces the UDP packet drop rate from host to local guest. I do not fully understand why. The netperf results from my laptop are: mtu=1500 netperf -H xxx -l 120 w/o patch w/i patch (two runs) guest to host: 3336.84Mb/s 3730.14Mb/s ~ 3582.88Mb/s host to guest: 3165.10Mb/s 3370.39Mb/s ~ 3407.96Mb/s Here is the patch for your review. The same approach can apply to non-mergeable buffs too, so we can use code in common. Yes, we should do that. The way to test is to hack virtio to ignore host support for mergeable buffers. If there is no objection, I will submit the non-mergeable buffs patch later. The whole page cache management in virtio net is too complex. Since it seems you bypass it for some cases, this might be where savings come from? 
Signed-off-by: Shirley Ma x...@us.ibm.com --- diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 2a6e81d..e31ebc9 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -17,6 +17,7 @@ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ //#define DEBUG +#include <linux/list.h> #include <linux/netdevice.h> #include <linux/etherdevice.h> #include <linux/ethtool.h> @@ -39,6 +40,12 @@ module_param(gso, bool, 0444); #define VIRTNET_SEND_COMMAND_SG_MAX2 +struct page_list +{ Kernel style is struct page_list {. Also, prefix with virtnet_? + struct page *page; + struct list_head list; +}; + struct virtnet_info { struct virtio_device *vdev; @@ -72,6 +79,8 @@ struct virtnet_info /* Chain pages by the private ptr. */ struct page *pages; Do we need the pages list now? Can we do without? Pls document fields below. + struct list_head used_pages; Seems a waste to have this list just for dev down. Extend virtio to give us all buffers from vq on shutdown? + struct list_head freed_pages; }; static inline void *skb_vnet_hdr(struct sk_buff *skb) @@ -106,6 +115,26 @@ static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask) return p; } +static struct page_list *get_a_free_page(struct virtnet_info *vi, gfp_t gfp_mask) always called with gfp_mask GFP_ATOMIC? Also - line too long here and elsewhere. Run checkpatch? +{ + struct page_list *plist; + + if (list_empty(&vi->freed_pages)) { special handling for empty list looks a bit ugly really necessary? + plist = kmalloc(sizeof(struct page_list), gfp_mask); sizeof *plist + if (!plist) + return NULL; + list_add_tail(&plist->list, &vi->freed_pages); + plist->page = alloc_page(gfp_mask); + } else { + plist = list_first_entry(&vi->freed_pages, struct page_list, list); is the first entry special? pls document in struct definition + if (!plist->page) + plist->page = alloc_page(gfp_mask); what does it mean plist->page == NULL? Please document this in struct definition. 
+ } + if (plist->page) + list_move_tail(&plist->list, &vi->used_pages); + return plist; +} + static void skb_xmit_done(struct virtqueue *svq) { struct virtnet_info *vi = svq->vdev->priv; @@ -121,14 +150,14 @@ static void skb_xmit_done(struct virtqueue *svq) tasklet_schedule(&vi->tasklet); } -static void receive_skb(struct net_device *dev, struct sk_buff *skb, +static void receive_skb(struct net_device *dev, void *buf, what is buf, here? can it have proper type and not void*? unsigned len) { struct virtnet_info *vi = netdev_priv(dev); - struct virtio_net_hdr *hdr = skb_vnet_hdr(skb); int err; int i; - + struct sk_buff *skb = NULL; + struct virtio_net_hdr *hdr = NULL; do we have to init these to NULL? if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) { pr_debug("%s: short packet %i\n", dev->name, len); dev->stats.rx_length_errors++; @@ -136,15 +165,30 @@
Re: [PATCHv4 2/2] virtio: refactor find_vqs
Sorry it took a bit to respond to this. I kept hoping I'd have time to test it, but it looks like I won't before I'm on vacation. I agree it's good cleanup, and maybe we can go even further, see below.

On Mon, Aug 10, 2009 at 09:07:52AM +0930, Rusty Russell wrote: On Tue, 28 Jul 2009 06:03:08 pm Michael S. Tsirkin wrote: On Tue, Jul 28, 2009 at 12:44:31PM +0930, Rusty Russell wrote: On Mon, 27 Jul 2009 01:17:09 am Michael S. Tsirkin wrote:

This refactors find_vqs, making it more readable and robust, and fixing two regressions from 2.6.30: - double free_irq causing BUG_ON on device removal - probe failure when a vq can't be assigned to an MSI-X vector (reported on old host kernels). An older version of this patch was tested by Amit Shah.

OK, I've applied both of these; I'd like to see a new test by Amit to make sure, though. I really like this cleanup! I looked harder at this code, and my best attempts to untangle it further came to very little. This is what I ended up with, but it's all cosmetic and can wait until the next merge window. See what you think. Thanks! Rusty.

virtio_pci: minor MSI-X cleanups

1) Rename vp_request_vectors to vp_request_msix_vectors, and take the non-MSI-X case out to the caller.

I'm not sure this change was for the best: we still have a separate code path under if !use_msix, only in another place now. See below. And this seems to break the symmetry between request_ and free_vectors.

Yes, but unfortunately request_vectors was never really symmetrical :( That's because we didn't do the request_irq's for the per_vector case, because we don't have the names. This is what prevented me from doing a nice encapsulation.

Yes. But let's split free_vectors out into free_msix_vectors and free_intx as well?

-	err = vp_request_vectors(vdev, nvectors, per_vq_vectors);
+	if (!use_msix) {
+		/* Old style: one normal interrupt for change and all vqs. */
+		vp_dev->msix_vectors = 0;
+		vp_dev->per_vq_vectors = false;
+		err = request_irq(vp_dev->pci_dev->irq, vp_interrupt,
+				  IRQF_SHARED, dev_name(&vdev->dev), vp_dev);
+		if (!err)
+			vp_dev->intx_enabled = 1;

shorter as vp_dev->intx_enabled = !err

+		return err;

Is that all? Don't we need to create the vqs?

Oh, yeah :) This patch applies on top. Basically reverts that part, and renames vector to msix_vector, which I think makes the code more readable.

Yes, I agree, this is good cleanup; structure field names should be descriptive. Are you sure we want to rename all local variables named vector to msix_vector, though? CodingStyle says local var names should be short ... A couple of ideas below.

virtio: more PCI minor cleanups

Signed-off-by: Rusty Russell ru...@rustcorp.com.au
---
 drivers/virtio/virtio_pci.c | 73 +---
 1 file changed, 36 insertions(+), 37 deletions(-)

diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -84,7 +84,7 @@ struct virtio_pci_vq_info
 	struct list_head node;

 	/* MSI-X vector (or none) */
-	unsigned vector;
+	unsigned msix_vector;

Good. And now we don't need the comment.

 };

 /* Qumranet donated their vendor ID for devices 0x1000 thru 0x10FF. */
@@ -349,7 +349,7 @@ error:
 static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index,
 				  void (*callback)(struct virtqueue *vq),
 				  const char *name,
-				  u16 vector)
+				  u16 msix_vector)
 {
 	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
 	struct virtio_pci_vq_info *info;
@@ -374,7 +374,7 @@ static struct virtqueue *setup_vq(struct
 	info->queue_index = index;
 	info->num = num;
-	info->vector = vector;
+	info->msix_vector = msix_vector;

 	size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN));
 	info->queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO);
@@ -398,10 +398,10 @@ static struct virtqueue *setup_vq(struct
 	vq->priv = info;
 	info->vq = vq;

-	if (vector != VIRTIO_MSI_NO_VECTOR) {
-		iowrite16(vector, vp_dev->ioaddr + VIRTIO_MSI_QUEUE_VECTOR);
-		vector = ioread16(vp_dev->ioaddr + VIRTIO_MSI_QUEUE_VECTOR);
-		if (vector == VIRTIO_MSI_NO_VECTOR) {
+	if (msix_vector != VIRTIO_MSI_NO_VECTOR) {
+		iowrite16(msix_vector, vp_dev->ioaddr+VIRTIO_MSI_QUEUE_VECTOR);
+		msix_vector = ioread16(vp_dev->ioaddr+VIRTIO_MSI_QUEUE_VECTOR);

checkpatch will complain that space is lacking around +. We won't have a problem if we keep it named vector.

+		if (msix_vector == VIRTIO_MSI_NO_VECTOR) {
Re: [PATCHv4 2/2] vhost_net: a kernel-level virtio server
On Thu, 20 Aug 2009 12:33:09 am Michael S. Tsirkin wrote:

What it is: vhost net is a character device that can be used to reduce the number of system calls involved in virtio networking. Existing virtio net code is used in the guest without modification. ...

+config VHOST_NET
+	tristate "Host kernel accelerator for virtio net"
+	depends on NET && EVENTFD
+	---help---
+	  This kernel module can be loaded in host kernel to accelerate
+	  guest networking with virtio_net. Not to be confused with virtio_net
+	  module itself which needs to be loaded in guest kernel.
+
+	  To compile this driver as a module, choose M here: the module will
+	  be called vhost_net.

Just want to note that the patch explanation and the Kconfig help text are exceptional examples of reader-oriented text. Nice!

+	/* Sanity check */
+	if (vq->iov->iov_len != sizeof(struct virtio_net_hdr)) {
+		vq_err(vq, "Unexpected header len for TX: "
+		       "%ld expected %zd\n", vq->iov->iov_len,
+		       sizeof(struct virtio_net_hdr));
+		break;
+	}

OK, this check, which is in the qemu version, is *wrong*. There should be no assumption about sg boundaries. For example, the guest should be able to put the virtio_net_hdr at the front of the skbuf data if there is room. You should try to explicitly consume sizeof(struct virtio_net_hdr) of the iov, and if that fails, emit this message. You can skip the out = 1 test then, too. Anyway, I really prefer vq->iov[0]. to vq->iov-> here.

+	/* Sanity check */
+	if (vq->iov->iov_len != sizeof(struct virtio_net_hdr)) {
+		vq_err(vq, "Unexpected header len for RX: "
+		       "%ld expected %zd\n",
+		       vq->iov->iov_len, sizeof(struct virtio_net_hdr));
+		break;

Here too.
+	u32 __user *featurep = argp;
+	int __user *fdp = argp;
+	u32 features;
+	int fd, r;
+	switch (ioctl) {
+	case VHOST_NET_SET_SOCKET:
+		r = get_user(fd, fdp);
+		if (r < 0)
+			return r;
+		return vhost_net_set_socket(n, fd);
+	case VHOST_GET_FEATURES:
+		/* No features for now */
+		features = 0;
+		return put_user(features, featurep);

We may well get more than 32 feature bits, at least for virtio_net, which will force us to do some trickery in virtio_pci. I'd like to avoid that here, though it's kind of ugly. We'd need VHOST_GET_FEATURES (and ACK) to take a struct like:

	u32 feature_size;
	u32 features[];

+int vhost_net_init(void)

static?

+void vhost_net_exit(void)

static?

+/* Start polling a file. We add ourselves to file's wait queue. The user must
+ * keep a reference to a file until after vhost_poll_stop is called. */

I experienced minor confusion from the comments in this file. Where you said "user" I think "caller". No biggie though.

+	memory->nregions = 2;
+	memory->regions[0].guest_phys_addr = 1;
+	memory->regions[0].userspace_addr = 1;
+	memory->regions[0].memory_size = ~0ULL;
+	memory->regions[1].guest_phys_addr = 0;
+	memory->regions[1].userspace_addr = 0;
+	memory->regions[1].memory_size = 1;

Not sure I understand why there are two regions to start?

+	case VHOST_SET_VRING_BASE:
+		r = copy_from_user(&s, argp, sizeof s);
+		if (r < 0)
+			break;
+		if (s.num > 0xffff) {
+			r = -EINVAL;
+			break;
+		}
+		vq->last_avail_idx = s.num;
+		break;
+	case VHOST_GET_VRING_BASE:
+		s.index = idx;
+		s.num = vq->last_avail_idx;
+		r = copy_to_user(argp, &s, sizeof s);
+		break;

Ah, this is my fault. I didn't expose the last_avail_idx in the ring because the other side doesn't need it; but without it the ring state is not fully observable from outside (no external save / restore!). I have a patch which published these indices (we have room), see: http://ozlabs.org/~rusty/kernel/rr-2009-08-12-1/virtio:ring-publish-indices.patch Perhaps we should use that mechanism instead?
We don't actually have to offer the feature (we don't care about the guest state), but it's nice as documentation. I've been waiting for an excuse to use that patch.

+long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, unsigned long arg)
+{
+	void __user *argp = (void __user *)arg;
+	long r;
+
+	mutex_lock(&d->mutex);
+	if (ioctl == VHOST_SET_OWNER) {
+		r = vhost_dev_set_owner(d);
+		goto done;
+	}
+
+	r = vhost_dev_check_owner(d);

You can do a VHOST_SET_OWNER without being the owner? So really, the -EPERM from all the vhost_dev_check_owner() is not a security thing, but a I
KVM: PIT: fix pit_state copy in set_pit2/get_pit2
The kvm_pit_state2 structure contains extra space, so the memcpy in kvm_vm_ioctl_set_pit2 corrupts kvm->arch.vpit->pit_state. Fix it by memcpy'ing the channel information and assigning flags manually.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0f22f72..35e7fc5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2095,7 +2095,9 @@ static int kvm_vm_ioctl_get_pit2(struct kvm *kvm, struct kvm_pit_state2 *ps)
 	int r = 0;

 	mutex_lock(&kvm->arch.vpit->pit_state.lock);
-	memcpy(ps, &kvm->arch.vpit->pit_state, sizeof(struct kvm_pit_state2));
+	memcpy(ps->channels, &kvm->arch.vpit->pit_state.channels,
+		sizeof(ps->channels));
+	ps->flags = kvm->arch.vpit->pit_state.flags;
 	mutex_unlock(&kvm->arch.vpit->pit_state.lock);
 	return r;
 }
@@ -2109,7 +2111,9 @@ static int kvm_vm_ioctl_set_pit2(struct kvm *kvm, struct kvm_pit_state2 *ps)
 	cur_legacy = ps->flags & KVM_PIT_FLAGS_HPET_LEGACY;
 	if (!prev_legacy && cur_legacy)
 		start = 1;
-	memcpy(&kvm->arch.vpit->pit_state, ps, sizeof(struct kvm_pit_state2));
+	memcpy(&kvm->arch.vpit->pit_state.channels, &ps->channels,
+		sizeof(kvm->arch.vpit->pit_state.channels));
+	kvm->arch.vpit->pit_state.flags = ps->flags;
 	kvm_pit_load_count(kvm, 0, kvm->arch.vpit->pit_state.channels[0].count, start);
 	mutex_unlock(&kvm->arch.vpit->pit_state.lock);
 	return r;

-- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM: PIT: fix pit_state copy in set_pit2/get_pit2
On 08/25/2009 03:29 PM, Marcelo Tosatti wrote: The kvm_pit_state2 structure contains extra space, so the memcpy in kvm_vm_ioctl_set_pit2 corrupts kvm->arch.vpit->pit_state. Fix it by memcpy'ing the channel information and assigning flags manually. Good catch; applied. -- error compiling committee.c: too many arguments to function
Re: vhost net: performance with ping benchmark
On Tuesday 25 August 2009, Avi Kivity wrote: On 08/25/2009 05:22 AM, Anthony Liguori wrote: I think 2.6.32 is pushing it. 2.6.32 is pushing it, but we need to push it. Agreed. I think some time is needed to flush out the userspace interface. In particular, I don't think Mark's comments have been adequately addressed. If a version were merged without GSO support, some mechanism to do feature detection would be needed in the userspace API. I don't see any point in merging without gso (unless it beats userspace with gso, which I don't think will happen). In any case we'll need feature negotiation. The feature negotiation that Michael has put in seems sufficient for this. If you care more about latency than bandwidth, the current driver is an enormous improvement over the user space code, which I find is enough reason to have it now. I think this is likely going to be needed regardless. I also think the tap compatibility suggestion would simplify the consumption of this in userspace. What about veth pairs? I think you are talking about different issues. Veth pairs let you connect vhost to a bridge device like you do in the typical tun/tap setup, which addresses one problem. The other issue that came up is that raw sockets require root permissions, so you have to start qemu as root or with an open file descriptor for the socket (e.g. through libvirt). Permission handling on tap devices is ugly as well, but is a solved problem. The solution is to be able to pass a tap device (or a socket from a tap device) into vhost net. IMHO we need this eventually, but it's not a show stopper for 2.6.32. Along the same lines, I also think it should support TCP and UDP sockets so we can offload 'qemu -net socket,mcast' and 'qemu -net socket,connect' in addition to 'qemu -net socket,raw'. Arnd
Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication
On Thu, 20 Aug 2009 05:14:29 pm Gerd Hoffmann wrote: I think strings are better than numbers for identifying protocols, as you can work without a central registry for the numbers then. Yep, all you need is a central registry for names :) Rusty.
Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication
On 08/25/09 14:43, Rusty Russell wrote: On Thu, 20 Aug 2009 05:14:29 pm Gerd Hoffmann wrote: I think strings are better than numbers for identifying protocols, as you can work without a central registry for the numbers then. Yep, all you need is a central registry for names :) There are schemes to do without (reverse domain names, i.e. au.com.rustcorp.* is all yours and you don't have to register). Also, with names the namespace is much bigger and clashes are much less likely, so the chances that it works out with everybody simply picking a sane name are much higher. cheers, Gerd
Re: vhost net: performance with ping benchmark
Avi Kivity wrote: I think this is likely going to be needed regardless. I also think the tap compatibility suggestion would simplify the consumption of this in userspace. What about veth pairs? Does veth support GSO and checksum offload? I'd like some time to look at get_state/set_state ioctl()s along with dirty tracking support. It's a much better model for live migration IMHO. My preference is ring proxying. Note we'll need ring proxying (or at least event proxying) for non-MSI guests. I avoided suggesting ring proxying because I didn't want to suggest that merging should be contingent on it. Regards, Anthony Liguori
Re: vhost net: performance with ping benchmark
Michael S. Tsirkin wrote: My preference is ring proxying. Note we'll need ring proxying (or at least event proxying) for non-MSI guests. Exactly, that's what I meant earlier. That's enough, isn't it, Anthony? It is if we have a working implementation that demonstrates the userspace interface is sufficient. Once it goes into the upstream kernel, we need to have backwards compatibility code in QEMU forever to support that kernel version. Regards, Anthony Liguori
Re: [PATCHv4 2/2] vhost_net: a kernel-level virtio server
Thanks for the comments, I'll work on them ASAP. Answers to questions and more comments below.

On Tue, Aug 25, 2009 at 09:40:40PM +0930, Rusty Russell wrote: On Thu, 20 Aug 2009 12:33:09 am Michael S. Tsirkin wrote:

What it is: vhost net is a character device that can be used to reduce the number of system calls involved in virtio networking. Existing virtio net code is used in the guest without modification. ...

+config VHOST_NET
+	tristate "Host kernel accelerator for virtio net"
+	depends on NET && EVENTFD
+	---help---
+	  This kernel module can be loaded in host kernel to accelerate
+	  guest networking with virtio_net. Not to be confused with virtio_net
+	  module itself which needs to be loaded in guest kernel.
+
+	  To compile this driver as a module, choose M here: the module will
+	  be called vhost_net.

Just want to note that the patch explanation and the Kconfig help text are exceptional examples of reader-oriented text. Nice!

High praise indeed from the author of the lguest Quest :).

+	/* Sanity check */
+	if (vq->iov->iov_len != sizeof(struct virtio_net_hdr)) {
+		vq_err(vq, "Unexpected header len for TX: "
+		       "%ld expected %zd\n", vq->iov->iov_len,
+		       sizeof(struct virtio_net_hdr));
+		break;
+	}

OK, this check, which is in the qemu version, is *wrong*. There should be no assumption about sg boundaries. For example, the guest should be able to put the virtio_net_hdr at the front of the skbuf data if there is room. You should try to explicitly consume sizeof(struct virtio_net_hdr) of the iov, and if that fails, emit this message. You can skip the out = 1 test then, too. Anyway, I really prefer vq->iov[0]. to vq->iov-> here.

I'll fix that. Probably should fix qemu as well.

+	/* Sanity check */
+	if (vq->iov->iov_len != sizeof(struct virtio_net_hdr)) {
+		vq_err(vq, "Unexpected header len for RX: "
+		       "%ld expected %zd\n",
+		       vq->iov->iov_len, sizeof(struct virtio_net_hdr));
+		break;

Here too.
+	u32 __user *featurep = argp;
+	int __user *fdp = argp;
+	u32 features;
+	int fd, r;
+	switch (ioctl) {
+	case VHOST_NET_SET_SOCKET:
+		r = get_user(fd, fdp);
+		if (r < 0)
+			return r;
+		return vhost_net_set_socket(n, fd);
+	case VHOST_GET_FEATURES:
+		/* No features for now */
+		features = 0;
+		return put_user(features, featurep);

We may well get more than 32 feature bits, at least for virtio_net, which will force us to do some trickery in virtio_pci. I'd like to avoid that here, though it's kind of ugly. We'd need VHOST_GET_FEATURES (and ACK) to take a struct like:

	u32 feature_size;
	u32 features[];

Do you feel just making it 64 bit won't be enough? How about 128 bit?

+int vhost_net_init(void)

static?

+void vhost_net_exit(void)

static?

Good catch.

+/* Start polling a file. We add ourselves to file's wait queue. The user must
+ * keep a reference to a file until after vhost_poll_stop is called. */

I experienced minor confusion from the comments in this file. Where you said "user" I think "caller". No biggie though.

+	memory->nregions = 2;
+	memory->regions[0].guest_phys_addr = 1;
+	memory->regions[0].userspace_addr = 1;
+	memory->regions[0].memory_size = ~0ULL;
+	memory->regions[1].guest_phys_addr = 0;
+	memory->regions[1].userspace_addr = 0;
+	memory->regions[1].memory_size = 1;

Not sure I understand why there are two regions to start?

We are trying to cover the whole 2^64 range here. Its size does not fit in a single 64-bit length value. I could special-case 0 length to mean 2^64, but decided against it.

+	case VHOST_SET_VRING_BASE:
+		r = copy_from_user(&s, argp, sizeof s);
+		if (r < 0)
+			break;
+		if (s.num > 0xffff) {
+			r = -EINVAL;
+			break;
+		}
+		vq->last_avail_idx = s.num;
+		break;
+	case VHOST_GET_VRING_BASE:
+		s.index = idx;
+		s.num = vq->last_avail_idx;
+		r = copy_to_user(argp, &s, sizeof s);
+		break;

Ah, this is my fault.
I didn't expose the last_avail_idx in the ring because the other side doesn't need it; but without it the ring state is not fully observable from outside (no external save / restore!). I have a patch which published these indices (we have room), see: http://ozlabs.org/~rusty/kernel/rr-2009-08-12-1/virtio:ring-publish-indices.patch Perhaps we should use that mechanism instead? We don't actually have to offer the feature (we don't care about the guest state), but it's nice as documentation. I've been
Re: vhost net: performance with ping benchmark
Avi Kivity wrote: My preference is ring proxying. Note we'll need ring proxying (or at least event proxying) for non-MSI guests. Thinking about this more... How does the hand off work? Assuming you normally don't proxy ring entries and switch to proxying them when you want to migrate, do you have a set of ioctl()s that changes the semantics of the ring to be host virtual addresses instead of guest physical? If so, what do you do with in-flight requests? Does qemu have to buffer new requests and wait for old ones to complete? Unless you always do ring proxying. If that's the case, we don't need any of the slot management code in vhost. Regards, Anthony Liguori
Re: vhost net: performance with ping benchmark
On Tue, Aug 25, 2009 at 08:08:05AM -0500, Anthony Liguori wrote: Michael S. Tsirkin wrote: My preference is ring proxying. Note we'll need ring proxying (or at least event proxying) for non-MSI guests. Exactly, that's what I meant earlier. That's enough, isn't it, Anthony? It is if we have a working implementation that demonstrates the userspace interface is sufficient. The idea is trivial enough to be sure the interface is sufficient: we point the kernel at a used buffer at address X, and copy stuff from there to the guest buffer, then signal the guest. I'll post a code snippet to show how it's done if you like. Once it goes into the upstream kernel, we need to have backwards compatibility code in QEMU forever to support that kernel version. Don't worry: the kernel needs to handle old userspace as well, and neither I nor Rusty want to have a compatibility mess in the kernel. Regards, Anthony Liguori
[PATCH] fix apic id reading in x2apic mode
Format of the apic id register is different in x2apic mode. Return the correct apic id when the apic is in x2apic mode.

Signed-off-by: Gleb Natapov g...@redhat.com

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index e41e948..bd3a402 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -551,6 +551,12 @@ static u32 __apic_read(struct kvm_lapic *apic, unsigned int offset)
 		return 0;

 	switch (offset) {
+	case APIC_ID:
+		if (apic_x2apic_mode(apic))
+			val = kvm_apic_id(apic);
+		else
+			val = kvm_apic_id(apic) << 24;
+		break;
 	case APIC_ARBPRI:
 		printk(KERN_WARNING "Access APIC ARBPRI register which is for P6\n");

-- Gleb.
Re: vhost net: performance with ping benchmark
On Tue, Aug 25, 2009 at 08:24:07AM -0500, Anthony Liguori wrote: Avi Kivity wrote: My preference is ring proxying. Note we'll need ring proxying (or at least event proxying) for non-MSI guests. Thinking about this more... How does the hand off work? Assuming you normally don't proxy ring entries and switch to proxying them when you want to migrate, do you have a set of ioctl()s that changes the semantics of the ring to be host virtual addresses instead of guest physical? If so, what do you do with in-flight requests? Does qemu have to buffer new requests and wait for old ones to complete? Unless you always do ring proxying. If that's the case, we don't need any of the slot management code in vhost. Regards, Anthony Liguori

Here's how it works. It relies on the fact that in virtio, the guest can not assume that descriptors have been used unless they appeared in used buffers. When migration starts, we do this:

1. stop kernel (disable socket)
2. call VHOST_SET_VRING_USED: note it gets a virtual address, not a guest physical one. We point it at a buffer in qemu memory
3. call VHOST_SET_VRING_CALL, pass an eventfd created by qemu
4. copy over the existing used buffer
5. unstop kernel (reenable socket)

Then, while migration is in progress, we do this:

A. poll the eventfd from 3 above
B. when an event is seen, look at the used buffer that we gave to the kernel
C. parse descriptors and mark pages that the kernel wrote to as dirty
D. update the used buffer that the guest looks at
E. signal the eventfd for the guest

-- MST
Re: vhost net: performance with ping benchmark
On Tue, Aug 25, 2009 at 08:06:39AM -0500, Anthony Liguori wrote: Avi Kivity wrote: I think this is likely going to be needed regardless. I also think the tap compatibility suggestion would simplify the consumption of this in userspace. What about veth pairs? Does veth support GSO and checksum offload? AFAIK, no. But again, improving veth is a separate project :) I'd like some time to look at get_state/set_state ioctl()s along with dirty tracking support. It's a much better model for live migration IMHO. My preference is ring proxying. Note we'll need ring proxying (or at least event proxying) for non-MSI guests. I avoided suggesting ring proxying because I didn't want to suggest that merging should be contingent on it. Happily, the proposed interface supports it. Regards, Anthony Liguori
Re: vhost net: performance with ping benchmark
On Tue, Aug 25, 2009 at 08:08:05AM -0500, Anthony Liguori wrote: Once it goes into the upstream kernel, we need to have backwards compatibility code in QEMU forever to support that kernel version. BTW, qemu can keep doing the userspace thing if some capability it needs is missing. It won't be worse off than if the driver is not upstream at all :). -- MST
Re: [PATCH] virtio-blk: set QUEUE_ORDERED_DRAIN by default
On Fri, 21 Aug 2009 06:26:16 am Christoph Hellwig wrote: Currently virtio-blk doesn't set any QUEUE_ORDERED_ flag by default, which means it does not allow filesystems to use barriers. But the typical use case for virtio-blk is to use a backend that uses synchronous I/O. Really? Does qemu open with O_SYNC? I'm definitely no block expert, but this seems strange... Rusty.
Re: [PATCH] virtio-blk: set QUEUE_ORDERED_DRAIN by default
On Tue, Aug 25, 2009 at 11:41:37PM +0930, Rusty Russell wrote: On Fri, 21 Aug 2009 06:26:16 am Christoph Hellwig wrote: Currently virtio-blk doesn't set any QUEUE_ORDERED_ flag by default, which means it does not allow filesystems to use barriers. But the typical use case for virtio-blk is to use a backend that uses synchronous I/O. Really? Does qemu open with O_SYNC? I'm definitely no block expert, but this seems strange... Rusty. Qemu can open it various ways, but the only one that is fully safe is O_SYNC (cache=writethrough). The O_DIRECT (cache=none) option is also fully safe with the above patch under some limited circumstances (disk write caches off and using a host device or fully allocated file). Fixing the cache=writeback option and the majority case for cache=none requires implementing a cache flush command, and for the latter one also fixes to the host kernel I'm working on. You will get another patch implementing the proper cache controls in virtio-blk from me in a couple of days, too.
[ kvm-Bugs-2844286 ] kvm hangs when iommu used
Bugs item #2844286, was opened at 2009-08-25 10:59 Message generated for change (Tracker Item Submitted) made by mjcoss You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2844286&group_id=180599 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: kernel
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Mike Coss (mjcoss)
Assigned to: Nobody/Anonymous (nobody)
Summary: kvm hangs when iommu used

Initial Comment: CPU: Intel i7 975 Extreme Edition Host: Gentoo w/2.6.31-rc6 kernel 32bit Guest: Windows XP sp3 32bit

I'm trying to use VT-d to enable PCI pass-through of an Nvidia GTX 295 card, and I currently have 2 issues. The first is that when I shut down the KVM, the process hangs. It can't be killed, and it uses 100% of the CPU that it's running on. As expected, it prevents the system from shutting down and I have to reset the machine. The second issue is that when I start the KVM, I get an error message: Option ROM size fe00 exceeds available space. If I don't try to use pass-through, the KVM starts and stops fine. Under Windows, it sees the GTX 295 and allowed me to install the driver, but it fails to start with an error message that it doesn't have enough resources. I have tried kvm-88 and the latest qemu-kvm-0.11.0-rc1. I have also tried it with a Linux distribution, OpenSUSE, and the X server doesn't start; however, I do get a message from the driver that I'm still tracking down.

-- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2844286&group_id=180599
Re: vhost net: performance with ping benchmark
On 08/25/2009 04:08 PM, Anthony Liguori wrote: Michael S. Tsirkin wrote: My preference is ring proxying. Note we'll need ring proxying (or at least event proxying) for non-MSI guests. Exactly, that's what I meant earlier. That's enough, isn't it, Anthony? It is if we have a working implementation that demonstrates the userspace interface is sufficient. Once it goes into the upstream kernel, we need to have backwards compatibility code in QEMU forever to support that kernel version. Not at all. We still have pure userspace support, so if we don't like the first two versions of vhost, we can simply not support them. Of course I'm not advocating merging something known bad or untested, just pointing out that the cost of an error is not that bad. -- error compiling committee.c: too many arguments to function
Re: [PATCHv4 2/2] vhost_net: a kernel-level virtio server
On Tue, Aug 25, 2009 at 09:40:40PM +0930, Rusty Russell wrote:

+	u32 __user *featurep = argp;
+	int __user *fdp = argp;
+	u32 features;
+	int fd, r;
+	switch (ioctl) {
+	case VHOST_NET_SET_SOCKET:
+		r = get_user(fd, fdp);
+		if (r < 0)
+			return r;
+		return vhost_net_set_socket(n, fd);
+	case VHOST_GET_FEATURES:
+		/* No features for now */
+		features = 0;
+		return put_user(features, featurep);

We may well get more than 32 feature bits, at least for virtio_net, which will force us to do some trickery in virtio_pci.

Unlike PCI, if we ever run out of bits we can just add a FEATURES_EXTENDED ioctl, no need for trickery.

I'd like to avoid that here, though it's kind of ugly. We'd need VHOST_GET_FEATURES (and ACK) to take a struct like:

	u32 feature_size;
	u32 features[];

Thinking about this proposal some more, how will the guest determine the size to supply to the GET_FEATURES ioctl? Since we are a bit tight in 32-bit space already, let's just use a 64-bit integer and be done with it? Right?

-- MST
Notes on block I/O data integrity
As various people wanted to know how the various data integrity patches I've sent out recently play together, here's a small writeup on what issues we have in QEMU and how to fix them:

There are two major aspects of data integrity we need to care about in the QEMU block I/O code:

(1) stable data storage - we must be able to force data out of caches onto the stable media, and we must get completion notification for it.
(2) request ordering - we must be able to make sure some I/O requests do not get reordered with other in-flight requests before or after them.

Linux uses two related abstractions to implement this (other operating systems are probably similar):

(1) a cache flush request that flushes the whole volatile write cache to stable storage
(2) a barrier request, which (a) is guaranteed to actually go all the way to stable storage (b) is not reordered versus any requests before or after it

For disks not using volatile write caches, the cache flush is a no-op, and barrier requests are implemented by draining the queue of outstanding requests before the barrier request, and only allowing new requests to proceed after it has finished. Instead of the queue drain, tag ordering could be used, but at this point that is not the case in Linux.

For disks using volatile write caches, the cache flush is implemented by a protocol-specific request, and the barrier requests are implemented by performing cache flushes before and after the barrier request, in addition to the draining mentioned above. The second cache flush can be replaced by setting the Force Unit Access bit on the barrier request on modern disks.

The above is supported by the QEMU emulated disks in the following way:

- The IDE disk emulation implements the ATA WIN_FLUSH_CACHE/WIN_FLUSH_CACHE_EXT commands to flush the drive cache, but does not indicate a volatile write cache in the ATA IDENTIFY command. Because of that, guests do not actually send down cache flush requests.
   Linux guests do however drain the I/O queues to guarantee ordering in the absence of volatile write caches.

 - The SCSI disk emulation implements the SCSI SYNCHRONIZE_CACHE command, and also advertises the write cache enabled bit. This means Linux sends down cache flush requests to implement barriers, and provides sufficient queue draining.

 - The virtio-blk driver does not implement any cache flush command, and while there is a virtio-blk feature bit for barrier support, it is not supported by virtio-blk. Due to the lack of a cache flush command it also is insufficient to implement the required data integrity semantics. Currently the virtio-blk Linux driver does not advertise any form of barrier support, so we don't even get the queue draining required for proper operation in a cache-less environment.

The I/O from these front-end drivers maps to different host kernel I/O patterns depending on the cache= drive command line option. There are three choices for it:

 (a) cache=writethrough
 (b) cache=writeback
 (c) cache=none

(a) means all writes are synchronous (O_DSYNC), which means the host kernel guarantees us that data is on stable storage once the I/O request has completed. In cache=writethrough mode the IDE and SCSI drivers are safe because the queue is properly drained to guarantee I/O ordering. Virtio-blk is not safe due to the lack of queue draining.

(b) means we use regular buffered writes and need an fsync/fdatasync to actually guarantee that data is stable on disk. In cache=writeback mode only the SCSI emulation is safe, as all the others miss the cache flush requests.

(c) means we use direct I/O (O_DIRECT) to bypass the host cache and perform direct DMA to/from the I/O buffer in QEMU. While direct I/O bypasses the host cache, it does not guarantee flushing of volatile write caches in disks, nor completion of metadata operations in filesystems (e.g. block allocations). In cache=none only the SCSI emulation is entirely safe right now, due to the lack of cache flushes in the other drivers.
Action plan for the guest drivers:

 - virtio-blk needs to advertise an ordered queue by default. This makes cache=writethrough safe on virtio.

Action plan for QEMU:

 - IDE needs to set the write cache enabled bit
 - virtio needs to implement a cache flush command and advertise it (this also needs a small change to the host driver)
 - we need to implement an aio_fsync to not stall the vcpu on cache flushes
 - investigate only advertising a write cache when we really have one, to avoid the cache flush requests for cache=writethrough

Notes on disk cache flushes on Linux hosts:

 - barrier requests and cache flushes are supported by all local disk filesystems in popular use (btrfs, ext3, ext4, reiserfs, XFS). However, unlike the other filesystems, ext3 does _NOT_ enable barriers and cache flush requests by default.
 - currently O_SYNC
Re: Notes on block I/O data integrity
On Tue, Aug 25, 2009 at 1:11 PM, Christoph Hellwig h...@lst.de wrote:
> - barrier requests and cache flushes are supported by all local disk filesystems in popular use (btrfs, ext3, ext4, reiserfs, XFS). However, unlike the other filesystems, ext3 does _NOT_ enable barriers and cache flush requests by default.

What about LVM? I've read somewhere that it used to just eat barriers used by XFS, making it less safe than simple partitions.

-- Javier
Re: Notes on block I/O data integrity
On Tue, Aug 25, 2009 at 02:33:30PM -0500, Javier Guerra wrote:
> On Tue, Aug 25, 2009 at 1:11 PM, Christoph Hellwig h...@lst.de wrote:
> > - barrier requests and cache flushes are supported by all local disk filesystems in popular use (btrfs, ext3, ext4, reiserfs, XFS). However, unlike the other filesystems, ext3 does _NOT_ enable barriers and cache flush requests by default.
>
> What about LVM? I've read somewhere that it used to just eat barriers used by XFS, making it less safe than simple partitions.

Oh, any additional layers open another can of worms. On Linux, until very recently, using LVM or software RAID meant only disabled write caches were safe.
Re: Notes on block I/O data integrity
Hello Christoph, thanks a lot for this overview, it answers a lot of my questions! May I suggest you put it somewhere on the wiki so it doesn't get forgotten in the mailing list only? It also raises a few new questions, though. We have experienced PostgreSQL database corruption lately, two times to be exact. The first time I blamed a server crash, but lately the (freshly created) database got corrupted for the second time, and there have been no crashes since the initialisation. The server hardware is surely OK. I haven't had much time to look into this yet, but your mail just poked me to return to the subject. The situation is a bit more complex, as there are two additional layers of storage there: we're using SATA/SAS drives, network-mirrored by DRBD, clustered LVM on top of those, and finally qemu-kvm using virtio on top of the created logical volumes. So there are plenty of possible culprits, but your mention of virtio unsafeness while using cache=writethrough (which is the default for drive types other than qcow) leads me to suspect that this might be the cause of the problem. Databases are sensitive to request reordering, so I guess using virtio for the postgres storage was quite stupid of me :( So my question is: could you please advise me a bit on the storage configuration? virtio performed much better than SCSI, but of course data integrity is crucial, so would you suggest rather using SCSI? DRBD doesn't have a problem with barriers, and clustered LVM SHOULD not have problems with them, as we're using just striped volumes, but I'll check to be sure. So is it safe for me to keep cache=writethrough for the database volume? Thanks a lot in advance for any hints!
with best regards
nik

On Tue, Aug 25, 2009 at 08:11:20PM +0200, Christoph Hellwig wrote:
> As various people wanted to know how the various data integrity patches I've sent out recently play together, here's a small writeup on what issues we have in QEMU and how to fix them:
> [snip]
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Tue, 25 Aug 2009, Michael S. Tsirkin wrote:
> Yes, we don't want that. The best thing is to try to restate the problem in a way that is generic, and then either solve it or, better, use an existing solution. Right? I thought I had that, but apparently not. The reason I'm Cc-ing you is not to try and spam you until you give up and accept the patch; it's hoping that you see the pattern behind our usage and help generalize it. If I understand correctly, you believe this is not possible and so any solution will have to be in KVM? Or maybe I didn't state the problem clearly enough and should restate it?

Please do.

- Davide
ide vs. scsi read throughput
Hi, while using kvm-88 I tried comparing the read throughput between the PIIX IDE and the LSI 53c895a emulation. My test code is:

  while : ; do dd if=/dev/sda1 of=/dev/null bs=4096 count=1024000 >> results_file 2>&1 ; done

I averaged the results with:

  cat results_file | grep -v records | cut --delimiter=' ' -f 8 | awk '{n++; a[n]=$1; t+=$1; printf("cur=%d avg=%f n=%d\n", $1, t/n, n)} END {avg=t/n; for (i=1;i<=n;i++) sd+=(a[i] - avg) * (a[i] - avg); printf("\n\nstd_dev=%f\n", sqrt(sd/n))}'

I ran a hundred loops on:
 1. The native host's .img file
 2. A PIIX IDE virtual drive
 3. An LSI SCSI virtual drive

The average read throughput results I got are:
 - 100 MB/sec on the native host (standard deviation 3.7)
 - 55 MB/sec on the virtual IDE drive (standard deviation 21.8)
 - 60 MB/sec on the virtual SCSI drive (standard deviation 5.8)

Is there a way to make the virtual drives perform better and maybe get results that are closer to the host's? Another issue I noticed while looking at the results is large fluctuations in the IDE numbers; their standard deviation is very large. A large portion of the results are around 30 MB/sec and around 80 MB/sec. Is there any meaningful explanation for this large variation? Is there some issue in QEMU's IDE code path that causes this?

Thanks, S
syslog overflowing with latest kvm tree
Hello, I got lots of the following with the latest tree:

Aug 25 22:01:44 Athas kernel: emulation failed (unhandled instruction) rip 811b959a 66 8b 11 a8
Aug 25 22:01:44 Athas kernel: emulation failed (unhandled instruction) rip 811b959a 66 8b 11 a8
Aug 25 22:01:44 Athas kernel: emulation failed (unhandled instruction) rip 811b959a 66 8b 11 a8
Aug 25 22:01:44 Athas kernel: emulation failed (unhandled instruction) rip 811b959a 66 8b 11 a8
Aug 25 22:01:45 Athas kernel: emulation failed (unhandled instruction) rip 811b959a 66 8b 11 a8
Aug 25 22:01:45 Athas kernel: emulation failed (unhandled instruction) rip 811b959a 66 8b 11 a8
Aug 25 22:01:45 Athas kernel: emulation failed (unhandled instruction) rip 811b959a 66 8b 11 a8
Aug 25 22:01:45 Athas kernel: emulation failed (unhandled instruction) rip 811b959a 66 8b 11 a8
Aug 25 22:01:45 Athas kernel: emulation failed (unhandled instruction) rip 811b959a 66 8b 11 a8
Aug 25 22:01:45 Athas kernel: emulation failed (unhandled instruction) rip 811b959a 66 8b 11 a8
Aug 25 22:01:51 Athas kernel: __ratelimit: 50538 callbacks suppressed
Aug 25 22:01:51 Athas kernel: emulation failed (unhandled instruction) rip 811b959a 66 8b 11 a8
Aug 25 22:01:51 Athas kernel: emulation failed (unhandled instruction) rip 811b959a 66 8b 11 a8
Aug 25 22:01:51 Athas kernel: emulation failed (unhandled instruction) rip 811b959a 66 8b 11 a8
Aug 25 22:01:52 Athas kernel: emulation failed (unhandled instruction) rip 811b959a 66 8b 11 a8
Aug 25 22:01:52 Athas kernel: emulation failed (unhandled instruction) rip 811b959a 66 8b 11 a8
Aug 25 22:01:52 Athas kernel: emulation failed (unhandled instruction) rip 811b959a 66 8b 11 a8
Aug 25 22:01:52 Athas kernel: emulation failed (unhandled instruction) rip 811b959a 66 8b 11 a8
Aug 25 22:01:52 Athas kernel: emulation failed (unhandled instruction) rip 811b959a 66 8b 11 a8
Aug 25 22:01:52 Athas kernel: emulation failed (unhandled instruction) rip 811b959a 66 8b 11 a8
Aug 25 22:01:52 Athas kernel: emulation failed (unhandled instruction) rip 811b959a 66 8b 11 a8
Aug 25 22:01:56 Athas kernel: __ratelimit: 5766 callbacks suppressed

I wasn't doing anything fancy, just had a 64-bit guest compiling a kernel. Just letting you guys know; I'm not sure if it's because of something I've done, as I do merge the tree with Linus'.

Regards, Will Trives

This email was sent from Netspace Webmail: http://www.netspace.net.au
AlacrityVM benchmark numbers updated
We are pleased to announce the availability of the latest networking benchmark numbers for AlacrityVM. We've made several tweaks to the original v0.1 release to improve performance. The most notable is a switch from get_user_pages to switch_mm+copy_[to/from]_user, thanks to a review suggestion from Michael Tsirkin (as well as his patch to implement it). This change alone accounted for freeing up an additional 1.2Gbps, which is over a 25% improvement from v0.1. The previous numbers were 4560Mbps before the change and 5708Mbps after (for 1500 MTU over 10GE). This moves us ever closer to the goal of native performance under virtualization. These changes will be incorporated into the upcoming v0.2 release. For more details, please visit our project wiki: http://developer.novell.com/wiki/index.php/AlacrityVM

Kind Regards,
-Greg
Re: Page allocation failures in guest
On Fri, 14 Aug 2009 05:55:48 am Pierre Ossman wrote:
> On Wed, 12 Aug 2009 15:01:52 +0930 Rusty Russell ru...@rustcorp.com.au wrote:
> > Subject: virtio: net refill on out-of-memory ...
>
> Patch applied. Now we wait. :)

Any results? Thanks, Rusty.
[ kvm-Bugs-2843250 ] live migration fails with emulated scsi drive
Bugs item #2843250, was opened at 2009-08-24 05:24
Message generated for change (Comment added) made by jamessong
You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2843250&group_id=180599
Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Trent Johnson (tljohnsn)
Assigned to: Nobody/Anonymous (nobody)
Summary: live migration fails with emulated scsi drive

Initial Comment:
CPU model: Intel(R) Xeon(R) CPU L5335
KVM version: 88
Host kernel: CentOS 2.6.18-128.4.1.el5
Guest: tried both CentOS 4.8 (32-bit, latest kernel) and CentOS 5.3 (64-bit), same kernel as above
qemu command line: /usr/bin/qemu-kvm -M pc -m 1024 -smp 2 -name c5test-lvm -uuid a298100a-67af-8f80-d32c-de271b4cefb1 -monitor pty -boot c -drive file=,if=ide,media=cdrom,index=2 -drive file=/dev/drbd1,if=scsi,index=0,boot=on -net nic,macaddr=54:52:00:30:7f:ae,vlan=0,name=nic.0 -net tap,vlan=0,name=tap.0 -serial pty -parallel none -usb -curses -vga cirrus

Also tried -no-kvm and still get the errors:

sd 0:0:0:0: ABORT operation timed-out.
sd 0:0:0:0: DEVICE RESET operation started.
sd 0:0:0:0: DEVICE RESET operation timed-out.
sd 0:0:0:0: BUS RESET operation started.
sym0: suspicious SCSI data while resetting the BUS.
sym0: dp1,d15-8,dp0,d7-0,rst,req,ack,bsy,sel,atn,msg,c/d,i/o = 0x0, expecting 0x100
sd 0:0:0:0: BUS RESET operation timed-out.
sd 0:0:0:0: HOST RESET operation started.
sym0: suspicious SCSI data while resetting the BUS.
sym0: dp1,d15-8,dp0,d7-0,rst,req,ack,bsy,sel,atn,msg,c/d,i/o = 0x0, expecting 0x100
sym0: the chip cannot lock the frequency
sym0: SCSI BUS has been reset.
sd 0:0:0:0: HOST RESET operation timed-out.
sd 0:0:0:0: scsi: Device offlined - not ready after error recovery

Live migration will give errors and stop I/O to the disk when using an emulated SCSI drive on the guest. It does not fail when using an IDE drive. I get the errors above.

-- Comment By: James Song (jamessong)
Date: 2009-08-26 11:25
Message: Is the SCSI device also set up already on the destination box?

-- Comment By: Nolan Leake (kijiki)
Date: 2009-08-24 10:29
Message: This is fixed in qemu-kvm (and qemu upstream, for that matter).

-- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2843250&group_id=180599
Re: Page allocation failures in guest
On Wed, 26 Aug 2009 11:47:17 +0930 Rusty Russell ru...@rustcorp.com.au wrote:
> On Fri, 14 Aug 2009 05:55:48 am Pierre Ossman wrote:
> > On Wed, 12 Aug 2009 15:01:52 +0930 Rusty Russell ru...@rustcorp.com.au wrote:
> > > Subject: virtio: net refill on out-of-memory ...
> >
> > Patch applied. Now we wait. :)
>
> Any results?

It's been up for 12 days, so I'd say it works. But there is nothing in dmesg, which suggests I haven't triggered the condition yet. I wonder if there might be something broken with Fedora's kernel. :/ (I am running the same upstream version, and their config, for this test, but not all of their patches.)

Rgds
-- Pierre Ossman

WARNING: This correspondence is being monitored by the Swedish government. Make sure your server uses encryption for SMTP traffic and consider using PGP for end-to-end encryption.