Re: Slow disk IO on virtio kvm guests with Centos 5.5 as hypervisor
On Wed, Feb 16, 2011 at 12:50 PM, Thomas Broda tho...@bassfimass.de wrote: On Tue, 15 Feb 2011 15:50:00 +, Stefan Hajnoczi stefa...@gmail.com wrote: On Tue, Feb 15, 2011 at 10:15 AM, Thomas Broda tho...@bassfimass.de wrote: Using O_DIRECT, performance went down to 11 MB/s on the hypervisor... Hmm...can you restate that as: host X MB/s guest Y MB/s Trying dd with oflag=direct and of=/dev/vg0/lvtest (directly on the KVM hypervisor) yielded a result of 11MB/s. If I try this on the guest with /dev/vda1 as output device, results are between 1.9MB/s and 7.7MB/s, usually around 3.5MB/s. To sum it up: Host: 11 MB/s Guest: 3.5 MB/s I've checked the RAID controller in the meantime. It's an HP Smart Array P400. Write caching is switched off since the controller has no BBU (yet). Could it be related to this? The disabled write cache will result in slow writes, so your host benchmark result is low on an absolute scale. However, the relative guest/host performance is very poor here (3.5/11 = 31%). A number of performance improvements have been made to KVM, and CentOS 5.5 does not contain them because it is too old. If you want to see a more current reflection of KVM performance, you could try a Fedora 14 host and guest. The components that matter are: host kernel, qemu-kvm userspace, and guest kernel. Stefan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Slow disk IO on virtio kvm guests with Centos 5.5 as hypervisor
On Mon, Feb 14, 2011 at 6:15 PM, Thomas Broda tho...@bassfimass.de wrote: dd'ing /dev/zero to a testfile gives me a throughput of about 400MB/s when done directly on the hypervisor. If I try this from within a virtual guest, it's only 19MB/s to 24MB/s if the guest is on the LVM volume (raw device, not qcow2 or something, no filesystem on top of the LVM). Did you run dd with O_DIRECT? dd if=/dev/zero of=path-to-device oflag=direct bs=64k In order to exercise the disk and eliminate page cache effects you need to do this. Also, you are using oldish KVM packages. You could try a modern kernel and KVM userspace. Stefan
Re: Slow disk IO on virtio kvm guests with Centos 5.5 as hypervisor
On Tue, Feb 15, 2011 at 10:15 AM, Thomas Broda tho...@bassfimass.de wrote: On Tue, 15 Feb 2011 09:19:23 +, Stefan Hajnoczi stefa...@gmail.com wrote: Did you run dd with O_DIRECT? dd if=/dev/zero of=path-to-device oflag=direct bs=64k Using O_DIRECT, performance went down to 11 MB/s on the hypervisor... Hmm...can you restate that as: host X MB/s guest Y MB/s I don't understand from your answer which values you have found. Stefan
Re: [Qemu-devel] KVM call agenda for Feb 15
On Mon, Feb 14, 2011 at 10:18 PM, Anthony Liguori anth...@codemonkey.ws wrote: On 02/14/2011 11:56 AM, Chris Wright wrote: Please send in any agenda items you are interested in covering. -rc2 is tagged and waiting for announcement. Please take a look at -rc2 and make sure there is nothing critical missing. Will tag 0.14.0 very late tomorrow but unless there's something critical, it'll be 0.14.0-rc2 with an updated version. Most of my -rc2 testing is done now and has passed. http://wiki.qemu.org/Planning/0.14/Testing Stefan
Re: KVM call agenda for Feb 8
On Mon, Feb 7, 2011 at 10:40 PM, Chris Wright chr...@redhat.com wrote: Please send in any agenda items you are interested in covering. Automated builds and testing: maintainer trees, integrating KVM-Autotest, and QEMU tests we need but don't exist Stefan
Re: KVM call minutes for Feb 8
On Tue, Feb 8, 2011 at 3:55 PM, Chris Wright chr...@redhat.com wrote: Automated builds and testing - found broken 32-bit The broken build was found (and fixed?) before automated qemu.git builds. It's a good motivator though. Stefan
Re: Network performance with small packets
On Wed, Feb 9, 2011 at 1:55 AM, Michael S. Tsirkin m...@redhat.com wrote: On Wed, Feb 09, 2011 at 12:09:35PM +1030, Rusty Russell wrote: On Wed, 9 Feb 2011 11:23:45 am Michael S. Tsirkin wrote: On Wed, Feb 09, 2011 at 11:07:20AM +1030, Rusty Russell wrote: On Wed, 2 Feb 2011 03:12:22 pm Michael S. Tsirkin wrote: On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote: Michael S. Tsirkin m...@redhat.com 02/02/2011 03:11 AM On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote: Confused. We compare capacity to skb frags, no? That's sg I think ... Current guest kernel use indirect buffers, num_free returns how many available descriptors not skb frags. So it's wrong here. Shirley I see. Good point. In other words when we complete the buffer it was indirect, but when we add a new one we can not allocate indirect so we consume. And then we start the queue and add will fail. I guess we need some kind of API to figure out whether the buf we complete was indirect? I've finally read this thread... I think we need to get more serious with our stats gathering to diagnose these kind of performance issues. This is a start; it should tell us what is actually happening to the virtio ring(s) without significant performance impact... Subject: virtio: CONFIG_VIRTIO_STATS For performance problems we'd like to know exactly what the ring looks like. This patch adds stats indexed by how-full-ring-is; we could extend it to also record them by how-used-ring-is if we need. Signed-off-by: Rusty Russell ru...@rustcorp.com.au Not sure whether the intent is to merge this. If yes - would it make sense to use tracing for this instead? That's what kvm does. Intent wasn't; I've not used tracepoints before, but maybe we should consider a longer-term monitoring solution? Patch welcome! Cheers, Rusty. Sure, I'll look into this. 
There are several virtio trace events already in QEMU today (see the trace-events file):

virtqueue_fill(void *vq, const void *elem, unsigned int len, unsigned int idx) "vq %p elem %p len %u idx %u"
virtqueue_flush(void *vq, unsigned int count) "vq %p count %u"
virtqueue_pop(void *vq, void *elem, unsigned int in_num, unsigned int out_num) "vq %p elem %p in_num %u out_num %u"
virtio_queue_notify(void *vdev, int n, void *vq) "vdev %p n %d vq %p"
virtio_irq(void *vq) "vq %p"
virtio_notify(void *vdev, void *vq) "vdev %p vq %p"

These can be used by building QEMU with a suitable tracing backend like SystemTap (see docs/tracing.txt). Inside the guest I've used dynamic ftrace in the past, although static tracepoints would be nice. Stefan
Re: [Qemu-devel] Re: [PATCH v3 14/22] kvm: Fix race between timer signals and vcpu entry under !IOTHREAD
On Mon, Jan 31, 2011 at 11:27 AM, Jan Kiszka jan.kis...@siemens.com wrote: On 2011-01-31 11:03, Avi Kivity wrote: On 01/27/2011 04:33 PM, Jan Kiszka wrote: Found by Stefan Hajnoczi: There is a race in kvm_cpu_exec between checking for exit_request on vcpu entry and timer signals arriving before KVM starts to catch them. Plug it by blocking both timer related signals also on !CONFIG_IOTHREAD and process those via signalfd. As this fix depends on real signalfd support (otherwise the timer signals only kick the compat helper thread, and the main thread hangs), we need to detect the invalid constellation and abort configure. Signed-off-by: Jan Kiszka jan.kis...@siemens.com CC: Stefan Hajnoczi stefa...@linux.vnet.ibm.com --- I don't want to invest that much into !IOTHREAD anymore, so let's see if the proposed catch&abort is acceptable. I don't understand the dependency on signalfd. The normal way of doing things, either waiting for the signal in sigtimedwait() or in ioctl(KVM_RUN), works with SIGALRM just fine. And how would you be kicked out of the select() call if it is waiting with a timeout? We only have a single thread here. The only alternative is Stefan's original proposal. But that required fiddling with the signal mask twice per KVM_RUN. I think my original patch messed with the sigmask in the wrong place, as you mentioned doing it twice per KVM_RUN isn't a good idea. I wonder if we can enable SIGALRM only in blocking calls and guest code execution but without signalfd. It might be possible, I don't see an immediate problem with doing that, we might have to use pselect(2) or similar in a few places. Stefan
Re: [Qemu-devel] Re: [PATCH v3 14/22] kvm: Fix race between timer signals and vcpu entry under !IOTHREAD
On Mon, Jan 31, 2011 at 12:18 PM, Jan Kiszka jan.kis...@siemens.com wrote: On 2011-01-31 13:13, Stefan Hajnoczi wrote: On Mon, Jan 31, 2011 at 11:27 AM, Jan Kiszka jan.kis...@siemens.com wrote: On 2011-01-31 11:03, Avi Kivity wrote: On 01/27/2011 04:33 PM, Jan Kiszka wrote: Found by Stefan Hajnoczi: There is a race in kvm_cpu_exec between checking for exit_request on vcpu entry and timer signals arriving before KVM starts to catch them. Plug it by blocking both timer related signals also on !CONFIG_IOTHREAD and process those via signalfd. As this fix depends on real signalfd support (otherwise the timer signals only kick the compat helper thread, and the main thread hangs), we need to detect the invalid constellation and abort configure. Signed-off-by: Jan Kiszka jan.kis...@siemens.com CC: Stefan Hajnoczi stefa...@linux.vnet.ibm.com --- I don't want to invest that much into !IOTHREAD anymore, so let's see if the proposed catch&abort is acceptable. I don't understand the dependency on signalfd. The normal way of doing things, either waiting for the signal in sigtimedwait() or in ioctl(KVM_RUN), works with SIGALRM just fine. And how would you be kicked out of the select() call if it is waiting with a timeout? We only have a single thread here. The only alternative is Stefan's original proposal. But that required fiddling with the signal mask twice per KVM_RUN. I think my original patch messed with the sigmask in the wrong place, as you mentioned doing it twice per KVM_RUN isn't a good idea. I wonder if we can enable SIGALRM only in blocking calls and guest code execution but without signalfd. It might be possible, I don't see an immediate problem with doing that, we might have to use pselect(2) or similar in a few places. My main concern about alternative approaches is that IOTHREAD is about to become the default, and hardly anyone (of the few upstream KVM users) will run without it in the foreseeable future. 
The next step will be the removal of any !CONFIG_IOTHREAD section. So, how much do we want to invest here (provided my proposal has no remaining issues)? Yes, you're right. I'm not volunteering to dig more into this; the best case would be to switch to a non-I/O thread world that works for everybody. Stefan
Re: Zero-copy block driver?
2011/1/29 Darko Petrović darko.b.petro...@gmail.com: Could you please tell me if it is possible to use a block driver that completely avoids the guest kernel and copies block data directly to/from the given buffer in the guest userspace? If yes, how to activate it? If not... why not? :) Inside the guest, open files using the O_DIRECT flag. This tells the guest kernel to avoid the page cache when possible, enabling zero-copy. You need to use aligned memory buffers and perform I/O in multiples of the block size. See the open(2) man page for details. Make sure you really want to do this, most applications don't. Stefan
Re: Zero-copy block driver?
2011/1/29 Darko Petrović darko.b.petro...@gmail.com: Thanks for your help. Actually, I am more interested in doing it from the outside, if possible (I am not allowed to change the application code). Can the guest be tricked by KVM somehow, using the appropriate drivers? Just to clear it out, copying to/from a host buffer is fine, I just want to avoid having guest buffers. Not really. If the application is designed to use the page cache then it will use it. You might want to look at unmapped page cache control which is not in mainline Linux yet: http://lwn.net/Articles/419713/ Stefan
Re: Qemu-img create problem
On Fri, Jan 28, 2011 at 1:13 PM, Himanshu Chauhan hschau...@nulltrace.org wrote: I just cloned qemu-kvm, built and installed it. But qemu-img fails to create any disk image above 1G. The problem as I see it is the use of ssize_t for the image size. When the size is 2G, the check if (sval < 0) succeeds and I get the error: This is fixed in qemu.git commit 70b4f4bb05ff5e6812c6593eeefbd19bd61b517d (Make strtosz() return int64_t instead of ssize_t). Stefan
Re: [Qemu-devel] Re: KVM call agenda for Jan 25
On Tue, Jan 25, 2011 at 2:02 PM, Luiz Capitulino lcapitul...@redhat.com wrote: - Google summer of code 2011 is on, are we interested? (note: I just saw the news, I don't have any information yet) http://www.google-melange.com/document/show/gsoc_program/google/gsoc2011/timeline I'd like to see an in-place QCOW2 <-> QED image converter with tests. I'm interested in mentoring this year. Stefan
Re: KVM call agenda for Jan 25
On Tue, Jan 25, 2011 at 2:26 PM, Avi Kivity a...@redhat.com wrote: On 01/25/2011 12:06 AM, Anthony Liguori wrote: On 01/24/2011 07:25 AM, Chris Wright wrote: Please send in any agenda items you are interested in covering. - coroutines for the block layer I have a perpetually in progress branch for this, and would very much like to see this done. Seen this? http://repo.or.cz/w/qemu/stefanha.git/commit/8179e8ff20bb3f14f361109afe5b3bf2bac24f0d http://repo.or.cz/w/qemu/stefanha.git/shortlog/8179e8ff20bb3f14f361109afe5b3bf2bac24f0d And the qemu-devel thread: http://www.mail-archive.com/qemu-devel@nongnu.org/msg52522.html Stefan
Re: ATA Trim for qcow(2)
On Sun, Jan 23, 2011 at 9:35 PM, Emil Langrock emil.langr...@gmx.de wrote: there is support for ext4 to use the trim ATA command when a block is freed. I read that there should be an extra command which does that freeing afterwards. So is it possible to use that information inside the qcow to mark those sectors as free? This would make it possible to shrink the size of an image significantly using some offline (maybe also some online) tools. There is currently no TRIM support in qcow2. Christoph Hellwig recently added TRIM support to raw images on an XFS host file system. In the future we'll see wider support. Stefan
Re: [Qemu-devel] Re: KVM call agenda for Jan 11
On Mon, Jan 10, 2011 at 1:05 PM, Jes Sorensen jes.soren...@redhat.com wrote: On 01/10/11 12:59, Juan Quintela wrote: Juan Quintela quint...@redhat.com wrote: Now sent it to the right kvm list. Sorry for the second sent. Please send any agenda items you are interested in covering. - KVM Forum 2011 (Jes). Just to add a bit more background. Last year we discussed the issue of whether to aim for a KVM Forum in the same style as we had in 2010, or whether to try to aim for a broader multi-track Virtualization conference that covers the whole stack. Linux Foundation is happy to help host such an event, but they are asking for what our plans are. I posted a mock-proposal for tracks here: http://www.linux-kvm.org/page/KVM_Forum_2011 I thought having both KVM and Xen people at Linux Plumbers 2010 worked out well. Doing that with libvirt, OpenStack, etc has a lot of potential. Stefan
Re: qemu-kvm-0.13.0 - winsows 2008 - chkdisk too slow
On Thu, Jan 6, 2011 at 7:48 AM, Nikola Ciprich extmaill...@linuxbox.cz wrote: So windows started checking disk integrity, but the problem is, that it's waaay too slow - after ~12 hours, it's still running and seems like it'll take ages to finish. Please post your KVM command-line. Have you run storage benchmarks on the host to check what sort of maximum I/O performance you can expect? Do you have a RAID setup underneath LVM? Stefan
Re: FIXED: Re: [Qemu-devel] possible regression in qemu-kvm 0.13.0 (memtest)
On Wed, Jan 5, 2011 at 5:01 PM, Serge E. Hallyn se...@hallyn.com wrote: I don't see this patch in the git tree, nor a revert of the buggy commit. Was any decision made on this? Blue Swirl posted a patch a few days ago: [PATCH] pc: move port 92 stuff back to pc.c from pckbd.c It hasn't been merged yet but I don't see any objections to it on the email thread. Perhaps he's just busy. Stefan
Re: Disk activity issue...
On Thu, Dec 30, 2010 at 11:25 PM, Erich Weiler bitscrub...@gmail.com wrote: I've got this issue that I've been banging my head against a wall for a while over and I think another pair of eyes may help, if anyone has a moment. We have this new-ish KVM VM server (with the latest CentOS 5.5 updates, kmod-kvm-83-164.el5_5.25) that houses 3 VMs. It works mostly as expected except it has a very high load all the time, like 40-60, when the VMs are running. I suspect it has to do with memory management, because when all 3 VMs are online, they should consume 5GB RAM on the VM server and they only consume like 2GB, so I think the rest of the RAM is swapping or something, because the disks are spinning at 100% all the time (even when the VMs are doing nothing). Although, the VM server does not report any swapping happening. When I shut down the VMs one by one, the load drops and so does the disk activity. I don't think I set this server up with anything out of the ordinary... I've tried rebooting, but the same thing happens immediately upon reboot. Google searches, for me at least, yielded nothing useful. very high load all the time, like 40-60 Is this number the host CPU utilization or load average? Which guest OS (and versions) are you running? Can you paste the qemu-kvm command-line for the VMs? Can you send a few lines of vmstat 5 output on the host while running the 3 VMs? Stefan
Re: [PATCH 09/21] Introduce event-tap.
On Tue, Jan 4, 2011 at 11:02 AM, Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp wrote: After doing some heavy load tests, I realized that we have to take a hybrid approach to replay for now. This is because when a device moves to the next state (e.g. virtio decreases inuse) is different between net and block. For example, virtio-net decreases inuse upon returning from the net layer, but virtio-blk does that inside of the callback. If we only use pio/mmio replay, even though event-tap tries to replay net requests, some get lost because the state has proceeded already. This doesn't happen with block, because the state is still old enough to replay. Note that using hybrid approach won't cause duplicated requests on the secondary. Thanks Yoshi. I think I understand what you're saying. Stefan
Re: [Qemu-devel] possible regression in qemu-kvm 0.13.0 (memtest)
On Sat, Dec 25, 2010 at 7:02 PM, Peter Lieven p...@dlh.net wrote: this was the outcome of my bisect session: 956a3e6bb7386de48b642d4fee11f7f86a2fcf9a is first bad commit commit 956a3e6bb7386de48b642d4fee11f7f86a2fcf9a Author: Blue Swirl blauwir...@gmail.com Date: Sat May 22 07:59:01 2010 + Compile pckbd only once Use a qemu_irq to indicate A20 line changes. Move I/O port 92 to pckbd.c. Signed-off-by: Blue Swirl blauwir...@gmail.com :100644 100644 acbaf227455f931f3ef6dbe0bb4494c6b41f2cd9 1a33d4eb4a5624c55896871b5f4ecde78a49ff28 M Makefile.objs :100644 100644 a22484e1e98355a35deeb5038a45fb8fe8685a91 ba5147fbc48e4faef072a5be6b0d69d3201c1e18 M Makefile.target :04 04 dd03f81a42b5162c93c40c517f45eb9f7bece93c 309f472328632319a15128a59715aa63daf4d92c M default-configs :04 04 83201c4fcde2f592a771479246e0a33a8906515b b1192bce85f2a7129fb19cf2fe7462ef168165cb M hw bisect run success Nice job bisecting this! I can reproduce the Memtest86+ V4.10 system reset with qemu-kvm.git and qemu.git. The following code path is hit when val=0x2:

if (!(val & 1)) {
    qemu_system_reset_request();
}

I think unifying ioport 0x92 and KBD_CCMD_WRITE_OUTPORT was incorrect. ioport 0x92 is the System Control Port A and resets the system if bit 0 is 1. The keyboard outport seems to reset if bit 0 is 0. Here are the links I've found describing the i8042 keyboard controller and System Control Port A: http://www.computer-engineering.org/ps2keyboard/ http://www.win.tue.nl/~aeb/linux/kbd/A20.html Blue Swirl: Any thoughts on this? Stefan
Re: FIXED: Re: [Qemu-devel] possible regression in qemu-kvm 0.13.0 (memtest)
On Sun, Dec 26, 2010 at 9:21 PM, Peter Lieven p...@dlh.net wrote: On 25.12.2010 at 20:02, Peter Lieven wrote: On 23.12.2010 at 03:42, Stefan Hajnoczi wrote: On Wed, Dec 22, 2010 at 10:02 AM, Peter Lieven p...@dlh.net wrote: If I start a VM with the following parameters qemu-kvm-0.13.0 -m 2048 -smp 2 -monitor tcp:0:4014,server,nowait -vnc :14 -name 'ubuntu.test' -boot order=dc,menu=off -cdrom ubuntu-10.04.1-desktop-amd64.iso -k de and select memtest in the Ubuntu CD Boot Menu, the VM immediately resets. After this reset there happen several errors including graphic corruption or the qemu-kvm binary aborting with error 134. Exactly the same scenario on the same machine with qemu-kvm-0.12.5 works flawlessly. Any ideas? You could track down the commit which broke this using git-bisect(1). The steps are: $ git bisect start v0.13.0 v0.12.5 Then: $ ./configure [...] && make $ x86_64-softmmu/qemu-system-x86_64 -m 2048 -smp 2 -monitor tcp:0:4014,server,nowait -vnc :14 -name 'ubuntu.test' -boot order=dc,menu=off -cdrom ubuntu-10.04.1-desktop-amd64.iso -k de If memtest runs as expected: $ git bisect good otherwise: $ git bisect bad Keep repeating this and you should end up at the commit that introduced the bug. this was the outcome of my bisect session: 956a3e6bb7386de48b642d4fee11f7f86a2fcf9a is first bad commit commit 956a3e6bb7386de48b642d4fee11f7f86a2fcf9a Author: Blue Swirl blauwir...@gmail.com Date: Sat May 22 07:59:01 2010 + Compile pckbd only once Use a qemu_irq to indicate A20 line changes. Move I/O port 92 to pckbd.c. 
Signed-off-by: Blue Swirl blauwir...@gmail.com :100644 100644 acbaf227455f931f3ef6dbe0bb4494c6b41f2cd9 1a33d4eb4a5624c55896871b5f4ecde78a49ff28 M Makefile.objs :100644 100644 a22484e1e98355a35deeb5038a45fb8fe8685a91 ba5147fbc48e4faef072a5be6b0d69d3201c1e18 M Makefile.target :04 04 dd03f81a42b5162c93c40c517f45eb9f7bece93c 309f472328632319a15128a59715aa63daf4d92c M default-configs :04 04 83201c4fcde2f592a771479246e0a33a8906515b b1192bce85f2a7129fb19cf2fe7462ef168165cb M hw bisect run success I tracked down the regression to a bug in commit 956a3e6bb7386de48b642d4fee11f7f86a2fcf9a In the patch the outport of the keyboard controller and ioport 0x92 are made the same. this cannot work: a) both share bit 1 to enable a20_gate. 1=enable, 0=disable - ok so far b) both implement a fast reset option through bit 0, but with inverse logic!!! the keyboard controller resets if bit 0 is lowered, the ioport 0x92 resets if bit 0 is raised. c) all other bits have nothing in common at all. see: http://www.brokenthorn.com/Resources/OSDev9.html I have a proposed patch attached. Comments appreciated. The state of the A20 Gate is still shared between ioport 0x92 and outport of the keyboard controller, but all other bits are ignored. They might be used in the future to emulate e.g. hdd led activity or other usage of ioport 0x92. I have tested the attached patch. memtest works again as expected. I think it crashed because it uses ioport 0x92 directly to enable the a20 gate. 
Peter
---
--- qemu-0.13.0/hw/pckbd.c 2010-10-15 22:56:09.0 +0200
+++ qemu-0.13.0-fix/hw/pckbd.c 2010-12-26 19:38:35.835114033 +0100
@@ -212,13 +212,16 @@
 static void ioport92_write(void *opaque, uint32_t addr, uint32_t val)
 {
     KBDState *s = opaque;
-
-    DPRINTF("kbd: write outport=0x%02x\n", val);
-    s->outport = val;
-    if (s->a20_out) {
-        qemu_set_irq(*s->a20_out, (val >> 1) & 1);
+    if (val & 0x02) { // bit 1: enable/disable A20
+        if (s->a20_out) qemu_irq_raise(*s->a20_out);
+        s->outport |= KBD_OUT_A20;
+    }
+    else
+    {
+        if (s->a20_out) qemu_irq_lower(*s->a20_out);
+        s->outport &= ~KBD_OUT_A20;
     }
-    if (!(val & 1)) {
+    if ((val & 1)) { // bit 0: raised -> fast reset
         qemu_system_reset_request();
     }
 }
@@ -226,11 +229,8 @@
 static uint32_t ioport92_read(void *opaque, uint32_t addr)
 {
     KBDState *s = opaque;
-    uint32_t ret;
-
-    ret = s->outport;
-    DPRINTF("kbd: read outport=0x%02x\n", ret);
-    return ret;
+    return (s->outport & 0x02); // only bit 1 (KBD_OUT_A20) of port 0x92 is identical to s->outport
+    /* XXX: bit 0 is fast reset, bits 6-7 hdd activity */
 }

 static void kbd_write_command(void *opaque, uint32_t addr, uint32_t val)
@@ -340,7 +340,9 @@
         kbd_queue(s, val, 1);
         break;
     case KBD_CCMD_WRITE_OUTPORT:
-        ioport92_write(s, 0, val);
+        ioport92_write(s, 0, (ioport92_read(s, 0) & 0xfc) // copy bits 2-7 of 0x92
+                       | (val & 0x02) // bit 1 (enable a20)
+                       | (~val & 0x01)); // bit 0 (fast reset) of port 0x92 has inverse logic
         break;
     case KBD_CCMD_WRITE_MOUSE:
         ps2_write_mouse(&s->mouse, val);

I just replied to the original thread. I think we should separate 0x92
Re: [Qemu-devel] possible regression in qemu-kvm 0.13.0 (memtest)
On Wed, Dec 22, 2010 at 10:02 AM, Peter Lieven p...@dlh.net wrote: If I start a VM with the following parameters qemu-kvm-0.13.0 -m 2048 -smp 2 -monitor tcp:0:4014,server,nowait -vnc :14 -name 'ubuntu.test' -boot order=dc,menu=off -cdrom ubuntu-10.04.1-desktop-amd64.iso -k de and select memtest in the Ubuntu CD Boot Menu, the VM immediately resets. After this reset there happen several errors including graphic corruption or the qemu-kvm binary aborting with error 134. Exactly the same scenario on the same machine with qemu-kvm-0.12.5 works flawlessly. Any ideas? You could track down the commit which broke this using git-bisect(1). The steps are: $ git bisect start v0.13.0 v0.12.5 Then: $ ./configure [...] && make $ x86_64-softmmu/qemu-system-x86_64 -m 2048 -smp 2 -monitor tcp:0:4014,server,nowait -vnc :14 -name 'ubuntu.test' -boot order=dc,menu=off -cdrom ubuntu-10.04.1-desktop-amd64.iso -k de If memtest runs as expected: $ git bisect good otherwise: $ git bisect bad Keep repeating this and you should end up at the commit that introduced the bug. Stefan
Re: [PATCH 11/21] ioport: insert event_tap_ioport() to ioport_write().
On Fri, Dec 17, 2010 at 4:19 PM, Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp wrote: 2010/12/17 Stefan Hajnoczi stefa...@gmail.com: On Thu, Dec 16, 2010 at 9:50 AM, Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp wrote: 2010/12/16 Michael S. Tsirkin m...@redhat.com: On Thu, Dec 16, 2010 at 04:37:41PM +0900, Yoshiaki Tamura wrote: 2010/11/28 Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp: 2010/11/28 Michael S. Tsirkin m...@redhat.com: On Thu, Nov 25, 2010 at 03:06:50PM +0900, Yoshiaki Tamura wrote: Record ioport event to replay it upon failover. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp Interesting. This will have to be extended to support ioeventfd. Since each eventfd is really just a binary trigger it should be enough to read out the fd state. Haven't thought about eventfd yet. Will try doing it in the next spin. Hi Michael, I looked into eventfd and realized it's only used with vhost now. There are patches on list to use it for block/userspace net. Thanks. Now I understand. In that case, inserting an event-tap function into the following code should be appropriate? int event_notifier_test_and_clear(EventNotifier *e) { uint64_t value; int r = read(e->fd, &value, sizeof(value)); return r == sizeof(value); } However, I believe vhost bypasses the net layer in qemu, and there is no way for Kemari to detect the outputs. To me, it doesn't make sense to extend this patch to support eventfd... Here is the userspace ioeventfd patch series: http://www.mail-archive.com/qemu-devel@nongnu.org/msg49208.html Instead of switching to QEMU userspace to handle the virtqueue kick pio write, we signal the eventfd inside the kernel and resume guest code execution. The I/O thread can then process the virtqueue kick in parallel to guest code execution. I think this can still be tied into Kemari. If you are switching to a pure net/block-layer event tap instead of pio/mmio, then I think it should just work. 
That should take a while until we solve how to set correct callbacks to the secondary upon failover. BTW, do you have a plan to move the eventfd framework to the upper layer as pio/mmio? Not only would Kemari work for free, other emulators should be able to benefit from it. I'm not sure I understand the question but I have considered making ioeventfd a first-class interface like register_ioport_write(). In some ways that would be cleaner than the way we use ioeventfd in vhost and virtio-pci today. For vhost it would be more difficult to integrate with Kemari. At this point, it's impossible. As Michael said, I should prevent starting Kemari when vhost=on. If you add some functionality to vhost it might be possible, although that would slow it down. So perhaps for the near future using vhost with Kemari is pointless anyway since you won't be able to reach the performance that vhost-net can achieve. Stefan
Re: [PATCH 11/21] ioport: insert event_tap_ioport() to ioport_write().
On Thu, Dec 16, 2010 at 9:50 AM, Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp wrote: 2010/12/16 Michael S. Tsirkin m...@redhat.com: On Thu, Dec 16, 2010 at 04:37:41PM +0900, Yoshiaki Tamura wrote: 2010/11/28 Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp: 2010/11/28 Michael S. Tsirkin m...@redhat.com: On Thu, Nov 25, 2010 at 03:06:50PM +0900, Yoshiaki Tamura wrote: Record ioport event to replay it upon failover. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp Interesting. This will have to be extended to support ioeventfd. Since each eventfd is really just a binary trigger it should be enough to read out the fd state. Haven't thought about eventfd yet. Will try doing it in the next spin. Hi Michael, I looked into eventfd and realized it's only used with vhost now. There are patches on list to use it for block/userspace net. Thanks. Now I understand. In that case, would inserting an event-tap function into the following code be appropriate? int event_notifier_test_and_clear(EventNotifier *e) { uint64_t value; int r = read(e->fd, &value, sizeof(value)); return r == sizeof(value); } However, I believe vhost bypasses the net layer in qemu, and there is no way for Kemari to detect the outputs. To me, it doesn't make sense to extend this patch to support eventfd... Here is the userspace ioeventfd patch series: http://www.mail-archive.com/qemu-devel@nongnu.org/msg49208.html Instead of switching to QEMU userspace to handle the virtqueue kick pio write, we signal the eventfd inside the kernel and resume guest code execution. The I/O thread can then process the virtqueue kick in parallel to guest code execution. I think this can still be tied into Kemari. If you are switching to a pure net/block-layer event tap instead of pio/mmio, then I think it should just work. For vhost it would be more difficult to integrate with Kemari. 
Stefan
Re: ConVirt 2.0.1 Open Source released.
On Wed, Dec 15, 2010 at 8:28 PM, jd jdsw2...@yahoo.com wrote: We are pleased to announce availability of ConVirt 2.0.1 open source. We would like to thank the ConVirt user community for their continuing participation and support. This release incorporates feedback gathered from the community over the last few months. jd: A description of ConVirt would be nice. Here's what I've figured out from the links: It is a management tool for Xen, KVM, and others. Written in Python under the GPLv2 but developed as open core software (there's an open source edition and an enterprise edition). It talks to KVM using the QEMU (human) monitor. Stefan
Re: I/O Performance Tips
On Thu, Dec 9, 2010 at 12:52 PM, Sebastian Nickel - Hetzner Online AG sebastian.nic...@hetzner.de wrote: here is the qemu command line we are using (or which libvirt generates): /usr/bin/kvm -S -M pc-0.12 -enable-kvm -m 512 -smp 1,sockets=1,cores=1,threads=1 -name vm-933 -uuid 0d737610-e59b-012d-f453-32287f7402ab -nodefaults -chardev socket,id=monitor,path=/var/lib/libvirt/qemu/vm-933.monitor,server,nowait -mon chardev=monitor,mode=readline -rtc base=utc -boot nc -drive file=/dev/vg0/934,if=none,id=drive-ide0-0-0,boot=on,format=raw,cache=writeback -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -device rtl8139,vlan=0,id=net0,mac=00:1c:14:01:03:67,bus=pci.0,addr=0x3 -net tap,fd=23,vlan=0,name=hostnet0 -chardev pty,id=serial0 -device isa-serial,chardev=serial0 -usb -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 We are explicitly using writeback cache settings. The first backtrace is odd. You are using logical volumes for the guest but the backtrace shows kjournald is blocked. I believe logical volumes should not directly affect kjournald at all (they don't use journalling). Perhaps this is a deadlock. The dmesg output was just an example. Most of the time I can see the tasks kjournald and flush. I recently saw kvm, kthreadd, rsyslogd and others in such outputs. I thought that sometimes /proc/sys/vm/dirty_ratio gets exceeded and so all processes are blocked for writes to the cache (kvm processes too). Could this be the case? I set dirty_background_ratio to 5 to constantly flush the cache to the disk, but this did not help. About the flush-251:0:505 hang, please cat /proc/partitions on the host to see which block device has major number 251 and minor number 0. This is our logical volume root partition of the physical host. The fact that your host is having problems suggests the issue is not in qemu-kvm (it's just a userspace process). Are you sure disk I/O is working under load on this machine without KVM? 
I do not think that kvm generates this issue (as you said it is a normal user space process). I thought that perhaps somebody knows how to handle this situation, because the kvm developers have much more experience with kvm than I do. Perhaps there are some tuning tips for this or anybody knows why only OpenSuse sets the filesystem read only if there are disk timeouts in the guest? This behavior appeared on almost all hosts (20) so I can eliminate a single machine HW failure. Christoph any pointers on how to debug this? The backtraces from the original email are below: Am Donnerstag, den 09.12.2010, 10:30 + schrieb Stefan Hajnoczi: On Thu, Dec 9, 2010 at 8:10 AM, Sebastian Nickel - Hetzner Online AG sebastian.nic...@hetzner.de wrote: Hello, we have got some issues with I/O in our kvm environment. We are using kernel version 2.6.32 (Ubuntu 10.04 LTS) to virtualise our hosts and we are using ksm, too. Recently we noticed that sometimes the guest systems (mainly OpenSuse guest systems) suddenly have a read only filesystem. After some inspection we found out that the guest system generates some ata errors due to timeouts (mostly in flush cache situations). On the physical host there are always the same kernel messages when this happens: [1508127.195469] INFO: task kjournald:497 blocked for more than 120 seconds. [1508127.212828] echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. [1508127.246841] kjournald D 0 497 2 0x [1508127.246848] 88062128dba0 0046 00015bc0 00015bc0 [1508127.246855] 880621089ab0 88062128dfd8 00015bc0 8806210896f0 [1508127.246862] 00015bc0 88062128dfd8 00015bc0 880621089ab0 [1508127.246868] Call Trace: [1508127.246880] [8116e500] ? sync_buffer+0x0/0x50 [1508127.246889] [81557d87] io_schedule+0x47/0x70 [1508127.246893] [8116e545] sync_buffer+0x45/0x50 [1508127.246897] [8155825a] __wait_on_bit_lock+0x5a/0xc0 [1508127.246901] [8116e500] ? 
sync_buffer+0x0/0x50 [1508127.246905] [81558338] out_of_line_wait_on_bit_lock +0x78/0x90 [1508127.246911] [810850d0] ? wake_bit_function+0x0/0x40 [1508127.246915] [8116e6c6] __lock_buffer+0x36/0x40 [1508127.246920] [81213d11] journal_submit_data_buffers +0x311/0x320 [1508127.246924] [81213ff2] journal_commit_transaction +0x2d2/0xe40 [1508127.246931] [810397a9] ? default_spin_lock_flags +0x9/0x10 [1508127.246935] [81076c7c] ? lock_timer_base+0x3c/0x70 [1508127.246939] [81077719] ? try_to_del_timer_sync+0x79/0xd0 [1508127.246943] [81217f0d] kjournald+0xed/0x250 [1508127.246947] [81085090] ? autoremove_wake_function +0x0/0x40 [1508127.246951] [81217e20] ? kjournald+0x0/0x250
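The two diagnostics discussed above, mapping a flush-MAJOR:MINOR kernel thread back to a block device via /proc/partitions and lowering the dirty-writeback thresholds, can be scripted. This is only a sketch; the helper name and the sample table are illustrative, and the sysctl values are examples rather than recommendations.

```shell
# Map a "flush-MAJOR:MINOR" thread name back to a block device name.
# /proc/partitions columns are: major minor #blocks name.
find_dev() {
    awk -v maj="$1" -v min="$2" '$1 == maj && $2 == min { print $4 }' "${3:-/proc/partitions}"
}

# Demo against a sample table so this runs anywhere; on the real host,
# call find_dev 251 0 with no third argument.
cat > /tmp/partitions.sample <<'EOF'
major minor  #blocks  name

 251        0  104857600 dm-0
   8        0  976762584 sda
EOF
find_dev 251 0 /tmp/partitions.sample   # -> dm-0

# To make background writeback start earlier, as discussed above (needs root):
#   sysctl -w vm.dirty_background_ratio=5
#   sysctl -w vm.dirty_ratio=10
```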
Re: [Qemu-devel] Re: [PATCH 2/3] virtio-pci: Use ioeventfd for virtqueue notify
On Mon, Nov 15, 2010 at 11:20 AM, Stefan Hajnoczi stefa...@gmail.com wrote: On Sun, Nov 14, 2010 at 12:19 PM, Avi Kivity a...@redhat.com wrote: On 11/14/2010 01:05 PM, Avi Kivity wrote: I agree, but let's enable virtio-ioeventfd carefully because bad code is out there. Sure. Note as long as the thread waiting on ioeventfd doesn't consume too much cpu, it will awaken quickly and we won't have the transaction per timeslice effect. btw, what about virtio-blk with linux-aio? Have you benchmarked that with and without ioeventfd? And, what about efficiency? As in bits/cycle? We are running benchmarks with this latest patch and will report results. Full results here (thanks to Khoa Huynh): http://wiki.qemu.org/Features/VirtioIoeventfd The host CPU utilization is scaled to 16 CPUs so a 2-3% reduction is actually in the 32-48% range for a single CPU. The guest CPU utilization numbers include an efficiency metric: %vcpu per MB/sec. Here we see significant improvements too. Guests that previously couldn't get more CPU work done now have regained some breathing space. Stefan
Re: [Qemu-devel] Re: [PATCH 2/3] virtio-pci: Use ioeventfd for virtqueue notify
On Wed, Dec 1, 2010 at 12:30 PM, Avi Kivity a...@redhat.com wrote: On 12/01/2010 01:44 PM, Stefan Hajnoczi wrote: And, what about efficiency? As in bits/cycle? We are running benchmarks with this latest patch and will report results. Full results here (thanks to Khoa Huynh): http://wiki.qemu.org/Features/VirtioIoeventfd The host CPU utilization is scaled to 16 CPUs so a 2-3% reduction is actually in the 32-48% range for a single CPU. The guest CPU utilization numbers include an efficiency metric: %vcpu per MB/sec. Here we see significant improvements too. Guests that previously couldn't get more CPU work done now have regained some breathing space. Thanks for those numbers. The guest improvements were expected, but the host numbers surprised me. Do you have an explanation as to why total host load should decrease? The first vcpu does the virtqueue kick - it holds the guest driver's vblk->lock across the kick. Before this kick completes a second vcpu tries to acquire vblk->lock, finds it is contended, and spins. So we're burning CPU due to the long vblk->lock hold times. With virtio-ioeventfd those kick times are reduced and there is less contention on vblk->lock. Stefan
Re: [PATCH 09/21] Introduce event-tap.
On Tue, Nov 30, 2010 at 9:50 AM, Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp wrote: 2010/11/29 Stefan Hajnoczi stefa...@gmail.com: On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp wrote: event-tap controls when to start an FT transaction, and provides proxy functions to be called from net/block devices. During an FT transaction, it queues up net/block requests and flushes them when the transaction completes. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp --- Makefile.target | 1 + block.h | 9 + event-tap.c | 794 +++ event-tap.h | 34 +++ net.h | 4 + net/queue.c | 1 + 6 files changed, 843 insertions(+), 0 deletions(-) create mode 100644 event-tap.c create mode 100644 event-tap.h event_tap_state is checked at the beginning of several functions. If there is an unexpected state the function silently returns. Should these checks really be assert() so there is an abort and backtrace if the program ever reaches this state? Fancier error handling would work too. For example cleaning up, turning off Kemari, and producing an error message with error_report(). In that case we need to think through the state of the environment carefully and make sure we don't cause secondary failures (like memory leaks). BTW, I would like to ask a question regarding this. There is a callback which net/block calls after processing the requests, and is there a clean way to set this callback on the failed-over host upon replay? I think this is a limitation in the current design. If requests are re-issued by Kemari at the net/block level, how will the higher layers know about these requests? How will they be prepared to accept callbacks? Stefan
Re: KVM call agenda for Nov 30
On Tue, Nov 30, 2010 at 12:58 PM, Dor Laor dl...@redhat.com wrote: Please send in any agenda items you are interested in covering. Juan already has a thread for agenda items. It includes: As I forgot to put out the call for agenda before, Anthony already suggested: - 2011 kvm conference - 0.14.0 release plan - infrastructure changes (irc channel migration, git tree migration) Stefan
Re: limiting guest block i/o for qos
On Mon, Nov 29, 2010 at 2:00 AM, T Johnson tjohnso...@gmail.com wrote: Hello, On Thu, Nov 25, 2010 at 3:33 AM, Nikola Ciprich extmaill...@linuxbox.cz wrote: Hello Thomas, I think blkio-cgroup really can't help you here, but since NFS is a network protocol, why not just consider some kind of network shaping? n. I thought about this, but it's rather imprecise I imagine if I try to limit the number of packets per second and hope that matches reads or writes per second. Secondly, I have many guests running to the same NFS server, which makes limiting per KVM guest somewhat impossible when the network tools I know of would limit per NFS server. Perhaps iptables/tc can mark the stream based on the client process ID? Each VM has a qemu-kvm userspace process that will issue file I/O. Someone with more networking knowledge could confirm whether or not it is possible to mark based on the process ID using the in-kernel NFS client. You don't need to limit based on packets per second. You can do bandwidth-based traffic shaping with tc. Stefan
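As a sketch of the bandwidth-based shaping suggested above: cap egress toward the NFS server with an HTB class and a u32 filter. Treat this as a configuration fragment rather than a runnable script; it needs root, and the device name, rate, and server address (192.0.2.10 is a documentation address) are placeholders.

```shell
# Cap traffic to the NFS server at 20 Mbit/s on eth0 (placeholders throughout).
tc qdisc add dev eth0 root handle 1: htb default 20
tc class add dev eth0 parent 1: classid 1:10 htb rate 20mbit ceil 20mbit
tc class add dev eth0 parent 1: classid 1:20 htb rate 1000mbit
tc filter add dev eth0 protocol ip parent 1: prio 1 \
    u32 match ip dst 192.0.2.10/32 flowid 1:10
```

Note this shapes the whole host's traffic to that server; per-guest limits would still need per-VM marking (e.g. a tap device per guest), as discussed above.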
Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
On Sat, Nov 27, 2010 at 1:11 PM, Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp wrote: 2010/11/27 Stefan Hajnoczi stefa...@gmail.com: On Sat, Nov 27, 2010 at 8:53 AM, Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp wrote: 2010/11/27 Stefan Hajnoczi stefa...@gmail.com: On Sat, Nov 27, 2010 at 4:29 AM, Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp wrote: 2010/11/27 Blue Swirl blauwir...@gmail.com: On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp wrote: Somehow I find some similarities to instrumentation patches. Perhaps the instrumentation framework could be used (maybe with some changes) for Kemari as well? That could be beneficial to both. Yes. I had the same idea but I'm not sure how tracing works. I think Stefan Hajnoczi knows it better. Stefan, is it possible to call arbitrary functions from the trace points? Yes, if you add code to ./tracetool. I'm not sure I see the connection between Kemari and tracing though. The connection is that it may be possible to remove Kemari specific hook point like in ioport.c and exec.c, and let tracing notify Kemari instead. I actually think the other way. Tracing just instruments and stashes away values. It does not change inputs or outputs, it does not change control flow, it does not affect state. Going down the route of side-effects mixes two different things: hooking into a subsystem and instrumentation. For hooking into a subsystem we should define proper interfaces. That interface can explicitly support modifying inputs/outputs or changing control flow. Tracing is much more ad-hoc and not a clean interface. It's also based on a layer of indirection via the tracetool code generator. That's okay because it doesn't affect the code it is called from and you don't need to debug trace events (they are simple and have almost no behavior). Hooking via tracing is just taking advantage of the cheap layer of indirection in order to get at interesting events in a subsystem. 
It's easy to hook up and quick to develop, but it's not a proper interface and will be hard to understand for other developers. One question I have about Kemari is whether it adds new constraints to the QEMU codebase? Fault tolerance seems like a cross-cutting concern - everyone writing device emulation or core QEMU code may need to be aware of new constraints. For example, you are not allowed to release I/O operations to the outside world directly, instead you need to go through Kemari code which makes I/O transactional and communicates with the passive host. You have converted e1000, virtio-net, and virtio-blk. How do we make sure new devices that are merged into qemu.git don't break Kemari? How do we go about supporting the existing hw/* devices? As for whether Kemari adds constraints such as you mentioned: yes. If the devices (including existing ones) don't call Kemari code, they would certainly break Kemari. Although using proxies looks explicit, to make it transparent to people writing device emulation, it's possible to remove the proxies and put changes only into the block/net layer as Blue suggested. Anything that makes it hard to violate the constraints is good. Otherwise Kemari might get broken in the future and no one will know until a failover behaves incorrectly. Blue and Paul prefer to put it into the block/net layer, and you think it's better to provide an API. Sorry, I wasn't clear. I agree that event tap behavior should be in generic block and net layer code. That way we're guaranteeing that all net and block I/O goes through event tap. Could you formulate the constraints so developers are aware of them in the future and can protect the codebase? How about expanding the Kemari wiki pages? If you like the idea above, I'm happy to make the list also on the wiki page. Here's a different question: what requirements must an emulated device meet in order to be added to the Kemari supported whitelist? 
That's what I want to know so that I don't break existing devices and can add new devices that work with Kemari :). Stefan
Re: [PATCH 09/21] Introduce event-tap.
On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp wrote: event-tap controls when to start an FT transaction, and provides proxy functions to be called from net/block devices. During an FT transaction, it queues up net/block requests and flushes them when the transaction completes. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp --- Makefile.target | 1 + block.h | 9 + event-tap.c | 794 +++ event-tap.h | 34 +++ net.h | 4 + net/queue.c | 1 + 6 files changed, 843 insertions(+), 0 deletions(-) create mode 100644 event-tap.c create mode 100644 event-tap.h event_tap_state is checked at the beginning of several functions. If there is an unexpected state the function silently returns. Should these checks really be assert() so there is an abort and backtrace if the program ever reaches this state?

+typedef struct EventTapBlkReq {
+    char *device_name;
+    int num_reqs;
+    int num_cbs;
+    bool is_multiwrite;

Is multiwrite logging necessary? If event tap is called from within the block layer then multiwrite is turned into one or more bdrv_aio_writev() calls. 
+static void event_tap_replay(void *opaque, int running, int reason)
+{
+    EventTapLog *log, *next;
+
+    if (!running) {
+        return;
+    }
+
+    if (event_tap_state != EVENT_TAP_LOAD) {
+        return;
+    }
+
+    event_tap_state = EVENT_TAP_REPLAY;
+
+    QTAILQ_FOREACH(log, &event_list, node) {
+        EventTapBlkReq *blk_req;
+
+        /* event resume */
+        switch (log->mode & ~EVENT_TAP_TYPE_MASK) {
+        case EVENT_TAP_NET:
+            event_tap_net_flush(&log->net_req);
+            break;
+        case EVENT_TAP_BLK:
+            blk_req = &log->blk_req;
+            if ((log->mode & EVENT_TAP_TYPE_MASK) == EVENT_TAP_IOPORT) {
+                switch (log->ioport.index) {
+                case 0:
+                    cpu_outb(log->ioport.address, log->ioport.data);
+                    break;
+                case 1:
+                    cpu_outw(log->ioport.address, log->ioport.data);
+                    break;
+                case 2:
+                    cpu_outl(log->ioport.address, log->ioport.data);
+                    break;
+                }
+            } else {
+                /* EVENT_TAP_MMIO */
+                cpu_physical_memory_rw(log->mmio.address,
+                                       log->mmio.buf,
+                                       log->mmio.len, 1);
+            }
+            break;

Why are net tx packets replayed at the net level but blk requests are replayed at the pio/mmio level? I expected everything to replay either as pio/mmio or as net/block.

+static void event_tap_blk_load(QEMUFile *f, EventTapBlkReq *blk_req)
+{
+    BlockRequest *req;
+    ram_addr_t page_addr;
+    int i, j, len;
+
+    len = qemu_get_byte(f);
+    blk_req->device_name = qemu_malloc(len + 1);
+    qemu_get_buffer(f, (uint8_t *)blk_req->device_name, len);
+    blk_req->device_name[len] = '\0';
+    blk_req->num_reqs = qemu_get_byte(f);
+
+    for (i = 0; i < blk_req->num_reqs; i++) {
+        req = &blk_req->reqs[i];
+        req->sector = qemu_get_be64(f);
+        req->nb_sectors = qemu_get_be32(f);
+        req->qiov = qemu_malloc(sizeof(QEMUIOVector));

It would make sense to have common QEMUIOVector load/save functions instead of inlining this code here. 
+static int event_tap_load(QEMUFile *f, void *opaque, int version_id)
+{
+    EventTapLog *log, *next;
+    int mode;
+
+    event_tap_state = EVENT_TAP_LOAD;
+
+    QTAILQ_FOREACH_SAFE(log, &event_list, node, next) {
+        QTAILQ_REMOVE(&event_list, log, node);
+        event_tap_free_log(log);
+    }
+
+    /* loop until EOF */
+    while ((mode = qemu_get_byte(f)) != 0) {
+        EventTapLog *log = event_tap_alloc_log();
+
+        log->mode = mode;
+        switch (log->mode & EVENT_TAP_TYPE_MASK) {
+        case EVENT_TAP_IOPORT:
+            event_tap_ioport_load(f, &log->ioport);
+            break;
+        case EVENT_TAP_MMIO:
+            event_tap_mmio_load(f, &log->mmio);
+            break;
+        case 0:
+            DPRINTF("No event\n");
+            break;
+        default:
+            fprintf(stderr, "Unknown state %d\n", log->mode);
+            return -1;

log is leaked here...

+        }
+
+        switch (log->mode & ~EVENT_TAP_TYPE_MASK) {
+        case EVENT_TAP_NET:
+            event_tap_net_load(f, &log->net_req);
+            break;
+        case EVENT_TAP_BLK:
+            event_tap_blk_load(f, &log->blk_req);
+            break;
+        default:
+            fprintf(stderr, "Unknown state %d\n", log->mode);
+            return -1;

...and here. Stefan
Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
On Mon, Nov 29, 2010 at 3:00 PM, Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp wrote: 2010/11/29 Paul Brook p...@codesourcery.com: If devices incorrectly claim support for live migration, then that should also be fixed, either by removing the broken code or by making it work. I totally agree with you. AFAICT your current proposal is just feeding back the results of some fairly specific QA testing. I'd rather not get into that game. The correct response in the context of upstream development is to file a bug and/or fix the code. We already have config files that allow third party packagers to remove devices they don't want to support. Sorry, I didn't get what you're trying to tell me. My plan would be to initially start from a subset of devices, and gradually grow the number of devices that Kemari works with. During this process, it'll include what you said above: file a bug and/or fix the code. Am I missing what you're saying? My point is that the whitelist shouldn't exist at all. Devices either support migration or they don't. Having some sort of separate whitelist is the wrong way to determine which devices support migration. Alright! Then if a user encounters a problem with Kemari, we'll fix Kemari or the devices or both. Correct? Is this a fair summary: any device that supports live migration works under Kemari? (If such a device does not work under Kemari then this is a bug that needs to be fixed in live migration, Kemari, or the device.) Stefan
Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
On Sat, Nov 27, 2010 at 8:53 AM, Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp wrote: 2010/11/27 Stefan Hajnoczi stefa...@gmail.com: On Sat, Nov 27, 2010 at 4:29 AM, Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp wrote: 2010/11/27 Blue Swirl blauwir...@gmail.com: On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp wrote: Somehow I find some similarities to instrumentation patches. Perhaps the instrumentation framework could be used (maybe with some changes) for Kemari as well? That could be beneficial to both. Yes. I had the same idea but I'm not sure how tracing works. I think Stefan Hajnoczi knows it better. Stefan, is it possible to call arbitrary functions from the trace points? Yes, if you add code to ./tracetool. I'm not sure I see the connection between Kemari and tracing though. The connection is that it may be possible to remove Kemari specific hook point like in ioport.c and exec.c, and let tracing notify Kemari instead. I actually think the other way. Tracing just instruments and stashes away values. It does not change inputs or outputs, it does not change control flow, it does not affect state. Going down the route of side-effects mixes two different things: hooking into a subsystem and instrumentation. For hooking into a subsystem we should define proper interfaces. That interface can explicitly support modifying inputs/outputs or changing control flow. Tracing is much more ad-hoc and not a clean interface. It's also based on a layer of indirection via the tracetool code generator. That's okay because it doesn't affect the code it is called from and you don't need to debug trace events (they are simple and have almost no behavior). Hooking via tracing is just taking advantage of the cheap layer of indirection in order to get at interesting events in a subsystem. It's easy to hook up and quick to develop, but it's not a proper interface and will be hard to understand for other developers. 
One question I have about Kemari is whether it adds new constraints to the QEMU codebase? Fault tolerance seems like a cross-cutting concern - everyone writing device emulation or core QEMU code may need to be aware of new constraints. For example, you are not allowed to release I/O operations to the outside world directly, instead you need to go through Kemari code which makes I/O transactional and communicates with the passive host. You have converted e1000, virtio-net, and virtio-blk. How do we make sure new devices that are merged into qemu.git don't break Kemari? How do we go about supporting the existing hw/* devices? As for whether Kemari adds constraints such as you mentioned: yes. If the devices (including existing ones) don't call Kemari code, they would certainly break Kemari. Although using proxies looks explicit, to make it transparent to people writing device emulation, it's possible to remove the proxies and put changes only into the block/net layer as Blue suggested. Anything that makes it hard to violate the constraints is good. Otherwise Kemari might get broken in the future and no one will know until a failover behaves incorrectly. Could you formulate the constraints so developers are aware of them in the future and can protect the codebase? How about expanding the Kemari wiki pages? Stefan
Re: Loading snapshot with -loadvm?
On Fri, Nov 26, 2010 at 8:47 AM, Jun Koi junkoi2...@gmail.com wrote: this created a snapshot named test on my image. then i tried to start the snapshot VM, like below: qemu-system-x86_64 -m 2000 -vga std -usb -usbdevice tablet -localtime -loadvm 2 -hda img.qcow2.win7_x64 but then i have a problem: the Qemu window shows up, with [Stopped] at window caption. it stays forever there, and doesn't proceed. is this a bug, or did i do something wrong? Solution: Switch to the QEMU monitor (Ctrl+Alt+2) and type 'c' to continue the VM. To switch back to the VM's display use Ctrl+Alt+1. I checked that it is expected behavior: if (loadvm) { if (load_vmstate(loadvm) < 0) { autostart = 0; } } autostart = 0 means that your VM will be stopped. I'm not sure why -loadvm implies the VM will be stopped, there's already a different command-line option to keep the VM stopped (-S). Stefan
Re: Loading snapshot with -loadvm?
On Fri, Nov 26, 2010 at 9:26 AM, Jun Koi junkoi2...@gmail.com wrote: On Fri, Nov 26, 2010 at 5:18 PM, Stefan Hajnoczi stefa...@gmail.com wrote: On Fri, Nov 26, 2010 at 8:47 AM, Jun Koi junkoi2...@gmail.com wrote: this created a snapshot named test on my image. then i tried to start the snapshot VM, like below: qemu-system-x86_64 -m 2000 -vga std -usb -usbdevice tablet -localtime -loadvm 2 -hda img.qcow2.win7_x64 but then i have a problem: the Qemu window shows up, with [Stopped] at window caption. it stays forever there, and doesn't proceed. is this a bug, or did i do something wrong? Solution: Switch to the QEMU monitor (Ctrl+Alt+2) and type 'c' to continue the VM. To switch back to the VM's display use Ctrl+Alt+1. yes, i tried that, but the problem is that the Qemu window is not responsive. looks like it hangs up ... i even tried with -monitor stdio, then at console type c, but the monitor is not responsive, either. I checked that it is expected behavior: if (loadvm) { if (load_vmstate(loadvm) < 0) { autostart = 0; } } autostart = 0 means that your VM will be stopped. but that happens when load_vmstate() < 0, which means something is wrong? i must look at that code more closely. Me too, I missed the < 0. Stefan
Re: Memory leaks in virtio drivers?
On Fri, Nov 26, 2010 at 7:19 PM, Freddie Cash fjwc...@gmail.com wrote: Within 2 weeks of booting, the host machine is using 2 GB of swap, and disk I/O wait is through the roof. Restarting all of the VMs will free up RAM, but restarting the whole box is the only way to get performance back up. A guest configured to use 8 GB of RAM will have 9 GB virt and 7.5 GB res shown in top. In fact, every single VM shows virt above the limit set for the VM. Usually by close to 25%. Not sure about specific known issues with those Debian package versions, but... Virtual memory does not mean much. For example, a 64-bit process can map in 32 GB and never touch it. The virt number will be 32 GB but actually no RAM is being used. Or it could be a memory mapped file, which is backed by the disk and whose pages can be dropped if physical memory runs low. Looking at the virtual memory figure is not that useful. Also remember that qemu-kvm itself requires memory to perform the device emulation and virtualization. If you have an 8 GB VM, plan for more than 8 GB to be used. Clearly this memory overhead should be kept low, but is your 25% virtual memory overhead figure from a small VM? Because 9 GB virtual / 8 GB VM is 12.5%, not 25%. What is the sum of all VMs' RAM? I'm guessing you may have overcommitted resources (e.g. 2 x 8 GB VM on a 16 GB machine). If you don't leave the host Linux system some resources you will get bad VM performance. Stefan
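Comparing configured guest RAM against actual host usage, as suggested above, is a one-liner: sum the resident set size (which, unlike virt, reflects RAM actually in use; ps reports it in kB) across qemu-kvm processes. A sketch; the process-name match may need adjusting per distro.

```shell
# Total resident memory of all kvm processes, in MB.
ps -eo rss,comm | awk '/kvm/ { sum += $1 } END { printf "%.0f MB\n", sum / 1024 }'
```

If the total approaches or exceeds physical RAM minus what the host itself needs, the machine is effectively overcommitted even though the per-VM settings look fine.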
Re: Memory leaks in virtio drivers?
On Fri, Nov 26, 2010 at 8:16 PM, Freddie Cash fjwc...@gmail.com wrote: On Fri, Nov 26, 2010 at 12:04 PM, Stefan Hajnoczi stefa...@gmail.com wrote: On Fri, Nov 26, 2010 at 7:19 PM, Freddie Cash fjwc...@gmail.com wrote: Within 2 weeks of booting, the host machine is using 2 GB of swap, and disk I/O wait is through the roof. Restarting all of the VMs will free up RAM, but restarting the whole box is the only way to get performance back up. A guest configured to use 8 GB of RAM will have 9 GB virt and 7.5 GB res shown in top. In fact, every single VM shows virt above the limit set for the VM. Usually by close to 25%. Not sure about specific known issues with those Debian package versions, but... Virtual memory does not mean much. For example, a 64-bit process can map in 32 GB and never touch it. The virt number will be 32 GB but actually no RAM is being used. Or it could be a memory mapped file, which is backed by the disk and can pages can dropped if physical memory runs low. Looking at the virtual memory figure is not that useful. Also remember that qemu-kvm itself requires memory to perform the device emulation and virtualization. If you have an 8 GB VM, plan for more than 8 GB to be used. Clearly this memory overhead should be kept low, is your 25% virtual memory overhead figure from a small VM because 9 GB virtual / 8 GB VM is 12.5% not 25%? What is the sum of all VMs' RAM? I'm guessing you may have overcommitted resources (e.g. 2 x 8 GB VM on a 16 GB machine). If you don't leave host Linux system some resources you will get bad VM performance. Nope, not overcommitted. Sum of RAM for all VMs (in MB): 512 + 768 + 1024 + 512 + 512 + 1024 + 1024 + 768 + 8192 = 14226 Leaving a little under 2 GB for the host. How do those VM RAM numbers stack up with ps -eo rss,args | grep kvm? If the rss reveals the qemu-kvm processes are 15 GB RAM then it might be worth giving them more breathing room. Doing further googling, could it be a caching issue in the host? 
We currently have no cache= settings for any of our virtual disks. I believe the default is still writethrough, so the host is trying to cache everything. Yes, the default is writethrough. cache=none would reduce buffered file pages so it's worth a shot. Anyone know how to force libvirt to use cache='none' in the <driver> block? libvirt-bin 0.8.3 and virt-manager 0.8.4 ignore it if I edit the domain.xml file directly, and there's nowhere to set it in the virt-manager GUI. (Only 1 of the VMs is managed via libvirt currently.) A hack you can do if your libvirt does not support the <driver cache='none'/> attribute is to move /usr/bin/qemu-kvm out of the way and replace it with a shell script that does s/if=virtio/if=virtio,cache=none/ on its arguments before invoking the real /usr/bin/qemu-kvm. (Perhaps the cleaner way is editing the domain XML to use <emulator>/usr/bin/kvm_cache_none.sh</emulator>, but I haven't tested it.) Stefan
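The wrapper hack described above can be sketched as follows. This is a minimal illustration with hypothetical paths and a fake stand-in for the real binary; an actual deployment would move the distro's /usr/bin/qemu-kvm aside and install the wrapper in its place. Note that rewriting arguments through echo/sed loses quoting, so this only works when no argument contains whitespace:

```shell
dir=$(mktemp -d)

# Stand-in for the real emulator: just echoes its arguments.
cat > "$dir/qemu-kvm.real" <<'EOF'
#!/bin/sh
echo "$@"
EOF

# The wrapper: rewrite if=virtio drive options to add cache=none,
# then hand off to the real binary.
cat > "$dir/qemu-kvm" <<'EOF'
#!/bin/sh
args=$(echo "$@" | sed 's/if=virtio/if=virtio,cache=none/g')
exec "$(dirname "$0")/qemu-kvm.real" $args
EOF

chmod +x "$dir/qemu-kvm" "$dir/qemu-kvm.real"
out=$("$dir/qemu-kvm" -drive file=/dev/vg0/lv,if=virtio)
echo "$out"   # -drive file=/dev/vg0/lv,if=virtio,cache=none
```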
Re: [PATCH] ceph/rbd block driver for qemu-kvm (v8)
On Fri, Nov 26, 2010 at 9:59 PM, Christian Brunner c.m.brun...@gmail.com wrote: Thanks for the review. What am I supposed to do now? Kevin is the block maintainer. His review is the next step; I have CCed him. After that rbd would be ready to merge. Stefan
Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
On Sat, Nov 27, 2010 at 4:29 AM, Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp wrote: 2010/11/27 Blue Swirl blauwir...@gmail.com: On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp wrote: Hi, This patch series is a revised version of Kemari for KVM, which incorporates comments from the previous post and KVM Forum 2010. The current code is based on qemu.git f711df67d611e4762966a249742a5f7499e19f99. For general information about Kemari, I've made a wiki page at qemu.org. http://wiki.qemu.org/Features/FaultTolerance The changes from v0.1.1 -> v0.2 are: - Introduce a queue in event-tap to make VM sync live. - Change transaction receiver to a state machine for async receiving. - Replace net/block layer functions with event-tap proxy functions. - Remove dirty bitmap optimization for now. - convert DPRINTF() in ft_trans_file to trace functions. - convert fprintf() in ft_trans_file to error_report(). - improved error handling in ft_trans_file. - add a tmp pointer to qemu_del_vm_change_state_handler. The changes from v0.1 -> v0.1.1 are: - events are tapped in net/block layer instead of device emulation layer. - Introduce a new option for -incoming to accept FT transaction. - Removed writev() support to QEMUFile and FdMigrationState for now. I would post this work in a different series. - Modified virtio-blk save/load handler to send inuse variable to correctly replay. - Removed configure --enable-ft-mode. - Removed unnecessary check for qemu_realloc(). The first 6 patches modify several functions of qemu to prepare for introducing Kemari-specific components. The next 6 patches are the components of Kemari. They introduce event-tap and the FT transaction protocol file based on buffered file.
The design document of the FT transaction protocol can be found at http://wiki.qemu.org/images/b/b1/Kemari_sender_receiver_0.5a.pdf Then the following 4 patches modify dma-helpers, virtio-blk, virtio-net and e1000 to replace net/block layer functions with event-tap proxy functions. Please note that if Kemari is off, event-tap will just pass through, and there is almost no intrusion to existing functions, including normal live migration. Would it be possible to make the changes only in the block/net layer, so that the devices are not modified at all? That is, the proxy function would always replace the unproxied version. I understand the benefit of your suggestion. However, it seems a bit tricky. It's because event-tap uses functions of the emulators and net, but block.c is also linked for utilities like qemu-img that don't need emulators or net. In the previous version, I added function pointers to get around this. http://lists.nongnu.org/archive/html/qemu-devel/2010-05/msg02378.html I wasn't confident of this approach and discussed it at KVM Forum, and decided to try replacing emulator functions with proxies. Suggestions are welcome, of course. Somehow I find some similarities to instrumentation patches. Perhaps the instrumentation framework could be used (maybe with some changes) for Kemari as well? That could be beneficial to both. Yes. I had the same idea but I'm not sure how tracing works. I think Stefan Hajnoczi knows it better. Stefan, is it possible to call arbitrary functions from the trace points? Yes, if you add code to ./tracetool. I'm not sure I see the connection between Kemari and tracing though. One question I have about Kemari is whether it adds new constraints to the QEMU codebase? Fault tolerance seems like a cross-cutting concern - everyone writing device emulation or core QEMU code may need to be aware of new constraints.
For example, you are not allowed to release I/O operations to the outside world directly; instead, you need to go through Kemari code, which makes I/O transactional and communicates with the passive host. You have converted e1000, virtio-net, and virtio-blk. How do we make sure new devices that are merged into qemu.git don't break Kemari? How do we go about supporting the existing hw/* devices? Stefan
Re: KVM call agenda for Nov 23
On Tue, Nov 23, 2010 at 2:37 PM, Kevin Wolf kw...@redhat.com wrote: Am 22.11.2010 14:55, schrieb Stefan Hajnoczi: On Mon, Nov 22, 2010 at 1:38 PM, Juan Quintela quint...@redhat.com wrote: Please send in any agenda items you are interested in covering. QCOW2 performance roadmap: * What can be done to achieve near-raw image format performance? * Benchmark results from an ideal QCOW2 model. Performance figures from a series of I/O scenarios: http://wiki.qemu.org/Qcow2/PerformanceRoadmap Stefan
Re: KVM call agenda for Nov 23
On Mon, Nov 22, 2010 at 1:38 PM, Juan Quintela quint...@redhat.com wrote: Please send in any agenda items you are interested in covering. QCOW2 performance roadmap: * What can be done to achieve near-raw image format performance? * Benchmark results from an ideal QCOW2 model. Stefan
Re: Question on virtio frontend backend drivers
On Mon, Nov 22, 2010 at 2:05 PM, Prasad Joshi p.g.jo...@student.reading.ac.uk wrote: I was under the impression that each virtio driver will have a frontend and a backend part. The frontend part would be loaded in the Guest OS and the backend driver will be loaded in the Host OS. These two drivers will communicate with each other. The backend driver will then retransmit the actual request to the correct driver. But it seems my understanding is wrong. I attached a virtio disk to the Guest OS. When the Guest was booted, after creating a file system on the attached disk I mounted it. [pra...@prasad-fedora12-vm ~]$ lsmod | grep -i virtio virtio_blk 7352 1 virtio_pci 8680 0 virtio_ring 6080 1 virtio_pci virtio 5220 2 virtio_blk,virtio_pci But on the host machine no backend driver was loaded: r...@prasad-desktop:~/VMDisks# lsmod | grep -i virtio r...@prasad-desktop:~/VMDisks# Does this mean there is no explicit backend driver? A virtio device is a PCI adapter in the guest. That's why you see virtio_pci. The userspace QEMU process (called qemu-kvm or qemu) does device emulation and contains the virtio code you are looking for. See hw/virtio-blk.c in qemu-kvm.git. Stefan
Re: [PATCH] ceph/rbd block driver for qemu-kvm (v8)
Reviewed-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
Re: [PATCH v3 2/3] virtio-pci: Use ioeventfd for virtqueue notify
On Tue, Nov 16, 2010 at 4:02 PM, Michael S. Tsirkin m...@redhat.com wrote: On Fri, Nov 12, 2010 at 01:24:28PM +, Stefan Hajnoczi wrote: Virtqueue notify is currently handled synchronously in userspace virtio. This prevents the vcpu from executing guest code while hardware emulation code handles the notify. On systems that support KVM, the ioeventfd mechanism can be used to make virtqueue notify a lightweight exit by deferring hardware emulation to the iothread and allowing the VM to continue execution. This model is similar to how vhost receives virtqueue notifies. The result of this change is improved performance for userspace virtio devices. Virtio-blk throughput increases especially for multithreaded scenarios and virtio-net transmit throughput increases substantially. Some virtio devices are known to have guest drivers which expect a notify to be processed synchronously and spin waiting for completion. Only enable ioeventfd for virtio-blk and virtio-net for now. Care must be taken not to interfere with vhost-net, which already uses ioeventfd host notifiers. The following list shows the behavior implemented in this patch and is designed to take vhost-net into account: * VIRTIO_CONFIG_S_DRIVER_OK -> assign host notifiers, qemu_set_fd_handler(virtio_pci_host_notifier_read) * !VIRTIO_CONFIG_S_DRIVER_OK -> qemu_set_fd_handler(NULL), deassign host notifiers * virtio_pci_set_host_notifier(true) -> qemu_set_fd_handler(NULL) * virtio_pci_set_host_notifier(false) -> qemu_set_fd_handler(virtio_pci_host_notifier_read) Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com --- hw/virtio-pci.c | 152 ++ hw/virtio.c | 14 - hw/virtio.h | 13 + 3 files changed, 153 insertions(+), 26 deletions(-) Now toggles host notifiers based on VIRTIO_CONFIG_S_DRIVER_OK status changes. The cleanest way I could see was to introduce pre and post set_status() callbacks. They allow a binding to hook status changes, including the status change from virtio_reset().
diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index 549118d..117e855 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -83,6 +83,11 @@
 /* Flags track per-device state like workarounds for quirks in older guests. */
 #define VIRTIO_PCI_FLAG_BUS_MASTER_BUG (1 << 0)
 
+/* Performance improves when virtqueue kick processing is decoupled from the
+ * vcpu thread using ioeventfd for some devices. */
+#define VIRTIO_PCI_FLAG_USE_IOEVENTFD_BIT 1
+#define VIRTIO_PCI_FLAG_USE_IOEVENTFD (1 << VIRTIO_PCI_FLAG_USE_IOEVENTFD_BIT)
+
 /* QEMU doesn't strictly need write barriers since everything runs in
  * lock-step. We'll leave the calls to wmb() in though to make it obvious for
  * KVM or if kqemu gets SMP support.
@@ -179,12 +184,125 @@ static int virtio_pci_load_queue(void * opaque, int n, QEMUFile *f)
     return 0;
 }
 
+static int virtio_pci_set_host_notifier_ioeventfd(VirtIOPCIProxy *proxy,
+                                                  int n, bool assign)
+{
+    VirtQueue *vq = virtio_get_queue(proxy->vdev, n);
+    EventNotifier *notifier = virtio_queue_get_host_notifier(vq);
+    int r;
+    if (assign) {
+        r = event_notifier_init(notifier, 1);
+        if (r < 0) {
+            return r;
+        }
+        r = kvm_set_ioeventfd_pio_word(event_notifier_get_fd(notifier),
+                                       proxy->addr + VIRTIO_PCI_QUEUE_NOTIFY,
+                                       n, assign);
+        if (r < 0) {
+            event_notifier_cleanup(notifier);
+        }
+    } else {
+        r = kvm_set_ioeventfd_pio_word(event_notifier_get_fd(notifier),
+                                       proxy->addr + VIRTIO_PCI_QUEUE_NOTIFY,
+                                       n, assign);
+        if (r < 0) {
+            return r;
+        }
+        event_notifier_cleanup(notifier);
+    }
+    return r;
+}
+
+static void virtio_pci_host_notifier_read(void *opaque)
+{
+    VirtQueue *vq = opaque;
+    EventNotifier *n = virtio_queue_get_host_notifier(vq);
+    if (event_notifier_test_and_clear(n)) {
+        virtio_queue_notify_vq(vq);
+    }
+}
+
+static void virtio_pci_set_host_notifier_fd_handler(VirtIOPCIProxy *proxy,
+                                                    int n, bool assign)
+{
+    VirtQueue *vq = virtio_get_queue(proxy->vdev, n);
+    EventNotifier *notifier = virtio_queue_get_host_notifier(vq);
+    if (assign) {
+        qemu_set_fd_handler(event_notifier_get_fd(notifier),
+                            virtio_pci_host_notifier_read, NULL, vq);
+    } else {
+        qemu_set_fd_handler(event_notifier_get_fd(notifier),
+                            NULL, NULL, NULL);
+    }
+}
+
+static int virtio_pci_set_host_notifiers(VirtIOPCIProxy *proxy, bool assign)
+{
+    int n, r;
+
+    for (n = 0; n < VIRTIO_PCI_QUEUE_MAX
Re: [Qemu-devel] Re: [PATCH 2/3] virtio-pci: Use ioeventfd for virtqueue notify
On Sun, Nov 14, 2010 at 12:19 PM, Avi Kivity a...@redhat.com wrote: On 11/14/2010 01:05 PM, Avi Kivity wrote: I agree, but let's enable virtio-ioeventfd carefully because bad code is out there. Sure. Note as long as the thread waiting on ioeventfd doesn't consume too much cpu, it will awaken quickly and we won't have the transaction per timeslice effect. btw, what about virtio-blk with linux-aio? Have you benchmarked that with and without ioeventfd? And, what about efficiency? As in bits/cycle? We are running benchmarks with this latest patch and will report results. Stefan
Re: [Qemu-devel] Re: [PATCH 2/3] virtio-pci: Use ioeventfd for virtqueue notify
On Sun, Nov 14, 2010 at 10:34 AM, Avi Kivity a...@redhat.com wrote: On 11/12/2010 11:20 AM, Stefan Hajnoczi wrote: Who guarantees that less common virtio-blk and virtio-net guest drivers for non-Linux OSes are fine with it? Maybe you should add a feature flag that the guest has to ACK to enable it. Virtio-blk and virtio-net are fine. Both of those devices are expected to operate asynchronously. SeaBIOS and gPXE virtio-net drivers spin but they expect to and it is okay in those environments. They already burn CPU today. Virtio-console expects synchronous virtqueue kick. In Linux, virtio_console.c __send_control_msg() and send_buf() will spin. Qemu userspace is able to complete those requests synchronously so that the guest never actually burns CPU (e.g. hw/virtio-serial-bus.c:send_control_msg()). I don't want to burn CPU in places where we previously didn't. This is a horrible bug. virtio is an asynchronous API. Some hypervisor implementations cannot even provide synchronous notifications. It's good that QEMU can decide whether or not to handle virtqueue kick in the vcpu thread. For high performance asynchronous devices like virtio-net and virtio-blk it makes sense to use ioeventfd. For others it may not be useful. I'm not sure a feature bit that exposes this detail to the guest would be useful. The guest should always assume that virtio devices are asynchronous. I agree, but let's enable virtio-ioeventfd carefully because bad code is out there. Stefan
Re: [PATCH 2/3] virtio-pci: Use ioeventfd for virtqueue notify
On Thu, Nov 11, 2010 at 3:53 PM, Michael S. Tsirkin m...@redhat.com wrote: On Thu, Nov 11, 2010 at 01:47:21PM +, Stefan Hajnoczi wrote: Care must be taken not to interfere with vhost-net, which already uses ioeventfd host notifiers. The following list shows the behavior implemented in this patch and is designed to take vhost-net into account: * VIRTIO_CONFIG_S_DRIVER_OK -> assign host notifiers, qemu_set_fd_handler(virtio_pci_host_notifier_read) we should also deassign when VIRTIO_CONFIG_S_DRIVER_OK is cleared by io write or bus master bit? You're right, I'll fix the lifecycle to trigger symmetrically on status bit changes rather than VIRTIO_CONFIG_S_DRIVER_OK/reset. +static void virtio_pci_reset_vdev(VirtIOPCIProxy *proxy) +{ + /* Poke virtio device so it deassigns its host notifiers (if any) */ + virtio_set_status(proxy->vdev, 0); Hmm. virtio_reset already sets status to 0. I guess it should just be fixed to call virtio_set_status? This part is ugly. The problem is that virtio_reset() calls virtio_set_status(vdev, 0) but doesn't give the transport binding a chance to clean up after the virtio device has cleaned up. Since virtio-net will spot status=0 and deassign its host notifier, we need to perform our own clean up after vhost. What makes this slightly less of a hack is the fact that virtio-pci.c was already causing virtio_set_status(vdev, 0) to be invoked twice during reset. When 0 is written to the VIRTIO_PCI_STATUS register, we do virtio_set_status(proxy->vdev, val & 0xFF) and then virtio_reset(proxy->vdev). So the status byte callback already gets invoked twice today. I've just split this out into virtio_pci_reset_vdev() and (ab)used it to correctly clean up virtqueue ioeventfd. The alternative is to add another callback from virtio.c so we are notified after the vdev's reset callback has finished.
@@ -223,10 +322,16 @@ static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val) virtio_queue_notify(vdev, val); break; case VIRTIO_PCI_STATUS: - virtio_set_status(vdev, val & 0xFF); - if (vdev->status == 0) { - virtio_reset(proxy->vdev); - msix_unuse_all_vectors(&proxy->pci_dev); + if ((val & VIRTIO_CONFIG_S_DRIVER_OK) && + !(vdev->status & VIRTIO_CONFIG_S_DRIVER_OK) && + (proxy->flags & VIRTIO_PCI_FLAG_USE_IOEVENTFD)) { + virtio_pci_set_host_notifiers(proxy, true); + } So we set host notifiers to true from here, but to false only on reset? This seems strange. Should not we disable notifiers when the driver clears the OK status? How about on bus master disable? You're right, this needs to be fixed. @@ -714,6 +803,8 @@ static PCIDeviceInfo virtio_info[] = { .exit = virtio_net_exit_pci, .romfile = "pxe-virtio.bin", .qdev.props = (Property[]) { + DEFINE_PROP_UINT32("flags", VirtIOPCIProxy, flags, + VIRTIO_PCI_FLAG_USE_IOEVENTFD), DEFINE_PROP_UINT32("vectors", VirtIOPCIProxy, nvectors, 3), DEFINE_VIRTIO_NET_FEATURES(VirtIOPCIProxy, host_features), DEFINE_NIC_PROPERTIES(VirtIOPCIProxy, nic), This ties the interface to an internal macro value. Further, the user gets to tweak other fields in this integer which we don't want. Finally, the interface is extremely unfriendly. Please use a bit property instead: DEFINE_PROP_BIT. Will fix in v3. diff --git a/hw/virtio.c b/hw/virtio.c index a2a657e..f588e29 100644 --- a/hw/virtio.c +++ b/hw/virtio.c @@ -582,6 +582,11 @@ void virtio_queue_notify(VirtIODevice *vdev, int n) } } +void virtio_queue_notify_vq(VirtQueue *vq) +{ + virtio_queue_notify(vq->vdev, vq - vq->vdev->vq); Let's implement virtio_queue_notify in terms of virtio_queue_notify_vq, not the other way around. Will fix in v3. Stefan
Re: [Qemu-devel] Re: [PATCH 2/3] virtio-pci: Use ioeventfd for virtqueue notify
On Thu, Nov 11, 2010 at 4:45 PM, Christoph Hellwig h...@infradead.org wrote: On Thu, Nov 11, 2010 at 01:47:21PM +, Stefan Hajnoczi wrote: Some virtio devices are known to have guest drivers which expect a notify to be processed synchronously and spin waiting for completion. Only enable ioeventfd for virtio-blk and virtio-net for now. Who guarantees that less common virtio-blk and virtio-net guest drivers for non-Linux OSes are fine with it? Maybe you should add a feature flag that the guest has to ACK to enable it. Virtio-blk and virtio-net are fine. Both of those devices are expected to operate asynchronously. SeaBIOS and gPXE virtio-net drivers spin but they expect to and it is okay in those environments. They already burn CPU today. Virtio-console expects synchronous virtqueue kick. In Linux, virtio_console.c __send_control_msg() and send_buf() will spin. Qemu userspace is able to complete those requests synchronously so that the guest never actually burns CPU (e.g. hw/virtio-serial-bus.c:send_control_msg()). I don't want to burn CPU in places where we previously didn't. It's good that QEMU can decide whether or not to handle virtqueue kick in the vcpu thread. For high performance asynchronous devices like virtio-net and virtio-blk it makes sense to use ioeventfd. For others it may not be useful. I'm not sure a feature bit that exposes this detail to the guest would be useful. Stefan
Re: [PATCH 2/3] virtio-pci: Use ioeventfd for virtqueue notify
On Fri, Nov 12, 2010 at 9:25 AM, Michael S. Tsirkin m...@redhat.com wrote: On Fri, Nov 12, 2010 at 09:18:48AM +, Stefan Hajnoczi wrote: On Thu, Nov 11, 2010 at 3:53 PM, Michael S. Tsirkin m...@redhat.com wrote: On Thu, Nov 11, 2010 at 01:47:21PM +, Stefan Hajnoczi wrote: Care must be taken not to interfere with vhost-net, which already uses ioeventfd host notifiers. The following list shows the behavior implemented in this patch and is designed to take vhost-net into account: * VIRTIO_CONFIG_S_DRIVER_OK -> assign host notifiers, qemu_set_fd_handler(virtio_pci_host_notifier_read) we should also deassign when VIRTIO_CONFIG_S_DRIVER_OK is cleared by io write or bus master bit? You're right, I'll fix the lifecycle to trigger symmetrically on status bit changes rather than VIRTIO_CONFIG_S_DRIVER_OK/reset. +static void virtio_pci_reset_vdev(VirtIOPCIProxy *proxy) +{ + /* Poke virtio device so it deassigns its host notifiers (if any) */ + virtio_set_status(proxy->vdev, 0); Hmm. virtio_reset already sets status to 0. I guess it should just be fixed to call virtio_set_status? This part is ugly. The problem is that virtio_reset() calls virtio_set_status(vdev, 0) but doesn't give the transport binding a chance to clean up after the virtio device has cleaned up. Since virtio-net will spot status=0 and deassign its host notifier, we need to perform our own clean up after vhost. What makes this slightly less of a hack is the fact that virtio-pci.c was already causing virtio_set_status(vdev, 0) to be invoked twice during reset. When 0 is written to the VIRTIO_PCI_STATUS register, we do virtio_set_status(proxy->vdev, val & 0xFF) and then virtio_reset(proxy->vdev). So the status byte callback already gets invoked twice today. I've just split this out into virtio_pci_reset_vdev() and (ab)used it to correctly clean up virtqueue ioeventfd. The alternative is to add another callback from virtio.c so we are notified after the vdev's reset callback has finished. Oh, likely not worth it.
Maybe put the above explanation in the comment. Will this go away now that you move to set notifiers on status write? For v3 I have switched to a bindings callback. I wish it wasn't necessary but the only other ways I can think of catching status writes are hacks which depend on side-effects too much. Stefan
[PATCH v3 2/3] virtio-pci: Use ioeventfd for virtqueue notify
Virtqueue notify is currently handled synchronously in userspace virtio. This prevents the vcpu from executing guest code while hardware emulation code handles the notify. On systems that support KVM, the ioeventfd mechanism can be used to make virtqueue notify a lightweight exit by deferring hardware emulation to the iothread and allowing the VM to continue execution. This model is similar to how vhost receives virtqueue notifies. The result of this change is improved performance for userspace virtio devices. Virtio-blk throughput increases especially for multithreaded scenarios and virtio-net transmit throughput increases substantially. Some virtio devices are known to have guest drivers which expect a notify to be processed synchronously and spin waiting for completion. Only enable ioeventfd for virtio-blk and virtio-net for now. Care must be taken not to interfere with vhost-net, which already uses ioeventfd host notifiers. The following list shows the behavior implemented in this patch and is designed to take vhost-net into account: * VIRTIO_CONFIG_S_DRIVER_OK -> assign host notifiers, qemu_set_fd_handler(virtio_pci_host_notifier_read) * !VIRTIO_CONFIG_S_DRIVER_OK -> qemu_set_fd_handler(NULL), deassign host notifiers * virtio_pci_set_host_notifier(true) -> qemu_set_fd_handler(NULL) * virtio_pci_set_host_notifier(false) -> qemu_set_fd_handler(virtio_pci_host_notifier_read) Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com --- hw/virtio-pci.c | 152 ++ hw/virtio.c | 14 - hw/virtio.h | 13 + 3 files changed, 153 insertions(+), 26 deletions(-) Now toggles host notifiers based on VIRTIO_CONFIG_S_DRIVER_OK status changes. The cleanest way I could see was to introduce pre and post set_status() callbacks. They allow a binding to hook status changes, including the status change from virtio_reset().
diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index 549118d..117e855 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -83,6 +83,11 @@
 /* Flags track per-device state like workarounds for quirks in older guests. */
 #define VIRTIO_PCI_FLAG_BUS_MASTER_BUG (1 << 0)
 
+/* Performance improves when virtqueue kick processing is decoupled from the
+ * vcpu thread using ioeventfd for some devices. */
+#define VIRTIO_PCI_FLAG_USE_IOEVENTFD_BIT 1
+#define VIRTIO_PCI_FLAG_USE_IOEVENTFD (1 << VIRTIO_PCI_FLAG_USE_IOEVENTFD_BIT)
+
 /* QEMU doesn't strictly need write barriers since everything runs in
  * lock-step. We'll leave the calls to wmb() in though to make it obvious for
  * KVM or if kqemu gets SMP support.
@@ -179,12 +184,125 @@ static int virtio_pci_load_queue(void * opaque, int n, QEMUFile *f)
     return 0;
 }
 
+static int virtio_pci_set_host_notifier_ioeventfd(VirtIOPCIProxy *proxy,
+                                                  int n, bool assign)
+{
+    VirtQueue *vq = virtio_get_queue(proxy->vdev, n);
+    EventNotifier *notifier = virtio_queue_get_host_notifier(vq);
+    int r;
+    if (assign) {
+        r = event_notifier_init(notifier, 1);
+        if (r < 0) {
+            return r;
+        }
+        r = kvm_set_ioeventfd_pio_word(event_notifier_get_fd(notifier),
+                                       proxy->addr + VIRTIO_PCI_QUEUE_NOTIFY,
+                                       n, assign);
+        if (r < 0) {
+            event_notifier_cleanup(notifier);
+        }
+    } else {
+        r = kvm_set_ioeventfd_pio_word(event_notifier_get_fd(notifier),
+                                       proxy->addr + VIRTIO_PCI_QUEUE_NOTIFY,
+                                       n, assign);
+        if (r < 0) {
+            return r;
+        }
+        event_notifier_cleanup(notifier);
+    }
+    return r;
+}
+
+static void virtio_pci_host_notifier_read(void *opaque)
+{
+    VirtQueue *vq = opaque;
+    EventNotifier *n = virtio_queue_get_host_notifier(vq);
+    if (event_notifier_test_and_clear(n)) {
+        virtio_queue_notify_vq(vq);
+    }
+}
+
+static void virtio_pci_set_host_notifier_fd_handler(VirtIOPCIProxy *proxy,
+                                                    int n, bool assign)
+{
+    VirtQueue *vq = virtio_get_queue(proxy->vdev, n);
+    EventNotifier *notifier = virtio_queue_get_host_notifier(vq);
+    if (assign) {
+        qemu_set_fd_handler(event_notifier_get_fd(notifier),
+                            virtio_pci_host_notifier_read, NULL, vq);
+    } else {
+        qemu_set_fd_handler(event_notifier_get_fd(notifier),
+                            NULL, NULL, NULL);
+    }
+}
+
+static int virtio_pci_set_host_notifiers(VirtIOPCIProxy *proxy, bool assign)
+{
+    int n, r;
+
+    for (n = 0; n < VIRTIO_PCI_QUEUE_MAX; n++) {
+        if (!virtio_queue_get_num(proxy->vdev, n)) {
+            continue;
+        }
+
+        if (assign) {
+            r = virtio_pci_set_host_notifier_ioeventfd(proxy, n, true);
+            if (r < 0) {
+                goto assign_error
[PATCH v3 3/3] virtio-pci: Don't use ioeventfd on old kernels
There used to be a limit of 6 KVM io bus devices inside the kernel. On such a kernel, don't use ioeventfd for virtqueue host notification since the limit is reached too easily. This ensures that existing vhost-net setups (which always use ioeventfd) have ioeventfds available so they can continue to work. Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com ---
 hw/virtio-pci.c |    4 ++++
 kvm-all.c       |   46 ++++++++++++++++++++++++++++++++++++++++++++++
 kvm-stub.c      |    5 +++++
 kvm.h           |    1 +
 4 files changed, 56 insertions(+), 0 deletions(-)

diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index 117e855..d3a7a9c 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -661,6 +661,10 @@ static void virtio_init_pci(VirtIOPCIProxy *proxy, VirtIODevice *vdev,
     pci_register_bar(&proxy->pci_dev, 0, size, PCI_BASE_ADDRESS_SPACE_IO,
                      virtio_map);
 
+    if (!kvm_has_many_ioeventfds()) {
+        proxy->flags &= ~VIRTIO_PCI_FLAG_USE_IOEVENTFD;
+    }
+
     virtio_bind_device(vdev, &virtio_pci_bindings, proxy);
     proxy->host_features |= 0x1 << VIRTIO_F_NOTIFY_ON_EMPTY;
     proxy->host_features |= 0x1 << VIRTIO_F_BAD_FEATURE;
diff --git a/kvm-all.c b/kvm-all.c
index 37b99c7..ba302bc 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -28,6 +28,11 @@
 #include "kvm.h"
 #include "bswap.h"
 
+/* This check must be after config-host.h is included */
+#ifdef CONFIG_EVENTFD
+#include <sys/eventfd.h>
+#endif
+
 /* KVM uses PAGE_SIZE in it's definition of COALESCED_MMIO_MAX */
 #define PAGE_SIZE TARGET_PAGE_SIZE
@@ -72,6 +77,7 @@ struct KVMState
     int irqchip_in_kernel;
     int pit_in_kernel;
     int xsave, xcrs;
+    int many_ioeventfds;
 };
 
 static KVMState *kvm_state;
@@ -441,6 +447,39 @@ int kvm_check_extension(KVMState *s, unsigned int extension)
     return ret;
 }
 
+static int kvm_check_many_ioeventfds(void)
+{
+    /* Older kernels have a 6 device limit on the KVM io bus.  Find out so we
+     * can avoid creating too many ioeventfds.
+     */
+#ifdef CONFIG_EVENTFD
+    int ioeventfds[7];
+    int i, ret = 0;
+    for (i = 0; i < ARRAY_SIZE(ioeventfds); i++) {
+        ioeventfds[i] = eventfd(0, EFD_CLOEXEC);
+        if (ioeventfds[i] < 0) {
+            break;
+        }
+        ret = kvm_set_ioeventfd_pio_word(ioeventfds[i], 0, i, true);
+        if (ret < 0) {
+            close(ioeventfds[i]);
+            break;
+        }
+    }
+
+    /* Decide whether many devices are supported or not */
+    ret = i == ARRAY_SIZE(ioeventfds);
+
+    while (i-- > 0) {
+        kvm_set_ioeventfd_pio_word(ioeventfds[i], 0, i, false);
+        close(ioeventfds[i]);
+    }
+    return ret;
+#else
+    return 0;
+#endif
+}
+
 static void kvm_set_phys_mem(target_phys_addr_t start_addr,
                              ram_addr_t size,
                              ram_addr_t phys_offset)
@@ -717,6 +756,8 @@ int kvm_init(int smp_cpus)
     kvm_state = s;
     cpu_register_phys_memory_client(&kvm_cpu_phys_memory_client);
 
+    s->many_ioeventfds = kvm_check_many_ioeventfds();
+
     return 0;
 
 err:
@@ -1046,6 +1087,11 @@ int kvm_has_xcrs(void)
     return kvm_state->xcrs;
 }
 
+int kvm_has_many_ioeventfds(void)
+{
+    return kvm_state->many_ioeventfds;
+}
+
 void kvm_setup_guest_memory(void *start, size_t size)
 {
     if (!kvm_has_sync_mmu()) {
diff --git a/kvm-stub.c b/kvm-stub.c
index 5384a4b..33d4476 100644
--- a/kvm-stub.c
+++ b/kvm-stub.c
@@ -99,6 +99,11 @@ int kvm_has_robust_singlestep(void)
     return 0;
 }
 
+int kvm_has_many_ioeventfds(void)
+{
+    return 0;
+}
+
 void kvm_setup_guest_memory(void *start, size_t size)
 {
 }
diff --git a/kvm.h b/kvm.h
index 60a9b42..ce08d42 100644
--- a/kvm.h
+++ b/kvm.h
@@ -42,6 +42,7 @@ int kvm_has_robust_singlestep(void);
 int kvm_has_debugregs(void);
 int kvm_has_xsave(void);
 int kvm_has_xcrs(void);
+int kvm_has_many_ioeventfds(void);
 
 #ifdef NEED_CPU_H
 int kvm_init_vcpu(CPUState *env);
-- 
1.7.2.3
[PATCH v3 1/3] virtio-pci: Rename bugs field to flags
The VirtIOPCIProxy bugs field is currently used to enable workarounds for older guests. Rename it to flags so that other per-device behavior can be tracked. A later patch uses the flags field to remember whether ioeventfd should be used for virtqueue host notification. Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com ---
 hw/virtio-pci.c |   15 +++++++--------
 1 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index 729917d..549118d 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -80,9 +80,8 @@
  * 12 is historical, and due to x86 page size. */
 #define VIRTIO_PCI_QUEUE_ADDR_SHIFT    12
 
-/* We can catch some guest bugs inside here so we continue supporting older
-   guests. */
-#define VIRTIO_PCI_BUG_BUS_MASTER (1 << 0)
+/* Flags track per-device state like workarounds for quirks in older guests. */
+#define VIRTIO_PCI_FLAG_BUS_MASTER_BUG (1 << 0)
 
 /* QEMU doesn't strictly need write barriers since everything runs in
  * lock-step. We'll leave the calls to wmb() in though to make it obvious for
@@ -95,7 +94,7 @@
 typedef struct {
     PCIDevice pci_dev;
     VirtIODevice *vdev;
-    uint32_t bugs;
+    uint32_t flags;
     uint32_t addr;
     uint32_t class_code;
     uint32_t nvectors;
@@ -159,7 +158,7 @@ static int virtio_pci_load_config(void * opaque, QEMUFile *f)
        in ready state. Then we have a buggy guest OS. */
     if ((proxy->vdev->status & VIRTIO_CONFIG_S_DRIVER_OK) &&
         !(proxy->pci_dev.config[PCI_COMMAND] & PCI_COMMAND_MASTER)) {
-        proxy->bugs |= VIRTIO_PCI_BUG_BUS_MASTER;
+        proxy->flags |= VIRTIO_PCI_FLAG_BUS_MASTER_BUG;
     }
     return 0;
 }
@@ -185,7 +184,7 @@ static void virtio_pci_reset(DeviceState *d)
     VirtIOPCIProxy *proxy = container_of(d, VirtIOPCIProxy, pci_dev.qdev);
     virtio_reset(proxy->vdev);
     msix_reset(&proxy->pci_dev);
-    proxy->bugs = 0;
+    proxy->flags = 0;
 }
 
 static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val)
@@ -235,7 +234,7 @@ static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val)
            some safety checks. */
         if ((val & VIRTIO_CONFIG_S_DRIVER_OK) &&
             !(proxy->pci_dev.config[PCI_COMMAND] & PCI_COMMAND_MASTER)) {
-            proxy->bugs |= VIRTIO_PCI_BUG_BUS_MASTER;
+            proxy->flags |= VIRTIO_PCI_FLAG_BUS_MASTER_BUG;
         }
         break;
     case VIRTIO_MSI_CONFIG_VECTOR:
@@ -403,7 +402,7 @@ static void virtio_write_config(PCIDevice *pci_dev, uint32_t address,
     if (PCI_COMMAND == address) {
         if (!(val & PCI_COMMAND_MASTER)) {
-            if (!(proxy->bugs & VIRTIO_PCI_BUG_BUS_MASTER)) {
+            if (!(proxy->flags & VIRTIO_PCI_FLAG_BUS_MASTER_BUG)) {
                 virtio_set_status(proxy->vdev,
                                   proxy->vdev->status & ~VIRTIO_CONFIG_S_DRIVER_OK);
             }
-- 
1.7.2.3
Re: [PATCH v3 3/3] virtio-pci: Don't use ioeventfd on old kernels
On Fri, Nov 12, 2010 at 1:24 PM, Stefan Hajnoczi stefa...@linux.vnet.ibm.com wrote:

@@ -1046,6 +1087,11 @@ int kvm_has_xcrs(void)
     return kvm_state->xcrs;
 }

+int kvm_has_many_ioeventfds(void)
+{
+    return kvm_state->many_ioeventfds;
+}
+

Missing if (!kvm_enabled()) { return 0; }.

Will fix in next version, would still appreciate review comments on any other aspect of the patch.

Stefan
Re: Unable to start VM using COWed image
On Thu, Nov 11, 2010 at 12:17 PM, Prasad Joshi p.g.jo...@student.reading.ac.uk wrote: Though specifying the absolute path for the source image worked for me, can anyone please let me know the situation in which one would not want to specify the absolute path? How does a relative path help? What is the advantage of using a relative path rather than an absolute path? I think using the absolute path would always work.

Relative paths are useful when sharing images with other people. An absolute path won't work on another machine unless you use the same parent directory structure. If you send me an image file with an absolute path in your home directory, I won't be able to use it easily on my machine. (Actually the new qemu-img rebase -u command can be used to fix up the image file on the destination machine, but it's an extra step and not user-friendly.)

Stefan
[PATCH v2 0/3] virtio: Use ioeventfd for virtqueue notify
This is a rewrite of the virtio-ioeventfd patchset to work at the virtio-pci.c level instead of virtio.c. This results in better integration with the host/guest notifier code and makes the code simpler (no more state machine).

Virtqueue notify is currently handled synchronously in userspace virtio. This prevents the vcpu from executing guest code while hardware emulation code handles the notify. On systems that support KVM, the ioeventfd mechanism can be used to make virtqueue notify a lightweight exit by deferring hardware emulation to the iothread and allowing the VM to continue execution. This model is similar to how vhost receives virtqueue notifies.

The result of this change is improved performance for userspace virtio devices. Virtio-blk throughput increases especially for multithreaded scenarios and virtio-net transmit throughput increases substantially.

Now that this code is in virtio-pci.c it is possible to explicitly enable devices for which virtio-ioeventfd should be used. Only virtio-blk and virtio-net are enabled at this time.
[PATCH 2/3] virtio-pci: Use ioeventfd for virtqueue notify
Virtqueue notify is currently handled synchronously in userspace virtio. This prevents the vcpu from executing guest code while hardware emulation code handles the notify. On systems that support KVM, the ioeventfd mechanism can be used to make virtqueue notify a lightweight exit by deferring hardware emulation to the iothread and allowing the VM to continue execution. This model is similar to how vhost receives virtqueue notifies.

The result of this change is improved performance for userspace virtio devices. Virtio-blk throughput increases especially for multithreaded scenarios and virtio-net transmit throughput increases substantially.

Some virtio devices are known to have guest drivers which expect a notify to be processed synchronously and spin waiting for completion. Only enable ioeventfd for virtio-blk and virtio-net for now.

Care must be taken not to interfere with vhost-net, which already uses ioeventfd host notifiers. The following list shows the behavior implemented in this patch and is designed to take vhost-net into account:

 * VIRTIO_CONFIG_S_DRIVER_OK -> assign host notifiers, qemu_set_fd_handler(virtio_pci_host_notifier_read)
 * reset -> qemu_set_fd_handler(NULL), deassign host notifiers
 * virtio_pci_set_host_notifier(true) -> qemu_set_fd_handler(NULL)
 * virtio_pci_set_host_notifier(false) -> qemu_set_fd_handler(virtio_pci_host_notifier_read)

Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
---
 hw/virtio-pci.c |  155 +++---
 hw/virtio.c     |    5 ++
 hw/virtio.h     |    1 +
 3 files changed, 129 insertions(+), 32 deletions(-)

diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index 549118d..436fc59 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -83,6 +83,10 @@
 /* Flags track per-device state like workarounds for quirks in older guests. */
 #define VIRTIO_PCI_FLAG_BUS_MASTER_BUG (1 << 0)

+/* Performance improves when virtqueue kick processing is decoupled from the
+ * vcpu thread using ioeventfd for some devices.
+ */
+#define VIRTIO_PCI_FLAG_USE_IOEVENTFD  (1 << 1)
+
 /* QEMU doesn't strictly need write barriers since everything runs in
  * lock-step.  We'll leave the calls to wmb() in though to make it obvious for
  * KVM or if kqemu gets SMP support.
@@ -179,12 +183,108 @@ static int virtio_pci_load_queue(void * opaque, int n, QEMUFile *f)
     return 0;
 }

+static int virtio_pci_set_host_notifier_ioeventfd(VirtIOPCIProxy *proxy,
+                                                  int n, bool assign)
+{
+    VirtQueue *vq = virtio_get_queue(proxy->vdev, n);
+    EventNotifier *notifier = virtio_queue_get_host_notifier(vq);
+    int r;
+
+    if (assign) {
+        r = event_notifier_init(notifier, 1);
+        if (r < 0) {
+            return r;
+        }
+        r = kvm_set_ioeventfd_pio_word(event_notifier_get_fd(notifier),
+                                       proxy->addr + VIRTIO_PCI_QUEUE_NOTIFY,
+                                       n, assign);
+        if (r < 0) {
+            event_notifier_cleanup(notifier);
+        }
+    } else {
+        r = kvm_set_ioeventfd_pio_word(event_notifier_get_fd(notifier),
+                                       proxy->addr + VIRTIO_PCI_QUEUE_NOTIFY,
+                                       n, assign);
+        if (r < 0) {
+            return r;
+        }
+        event_notifier_cleanup(notifier);
+    }
+    return r;
+}
+
+static void virtio_pci_host_notifier_read(void *opaque)
+{
+    VirtQueue *vq = opaque;
+    EventNotifier *n = virtio_queue_get_host_notifier(vq);
+    if (event_notifier_test_and_clear(n)) {
+        virtio_queue_notify_vq(vq);
+    }
+}
+
+static void virtio_pci_set_host_notifier_fd_handler(VirtIOPCIProxy *proxy,
+                                                    int n, bool assign)
+{
+    VirtQueue *vq = virtio_get_queue(proxy->vdev, n);
+    EventNotifier *notifier = virtio_queue_get_host_notifier(vq);
+    if (assign) {
+        qemu_set_fd_handler(event_notifier_get_fd(notifier),
+                            virtio_pci_host_notifier_read, NULL, vq);
+    } else {
+        qemu_set_fd_handler(event_notifier_get_fd(notifier),
+                            NULL, NULL, NULL);
+    }
+}
+
+static int virtio_pci_set_host_notifiers(VirtIOPCIProxy *proxy, bool assign)
+{
+    int n, r;
+
+    for (n = 0; n < VIRTIO_PCI_QUEUE_MAX; n++) {
+        if (!virtio_queue_get_num(proxy->vdev, n)) {
+            continue;
+        }
+
+        if (assign) {
+            r = virtio_pci_set_host_notifier_ioeventfd(proxy, n, true);
+            if (r < 0) {
+                goto assign_error;
+            }
+
+            virtio_pci_set_host_notifier_fd_handler(proxy, n, true);
+        } else {
+            virtio_pci_set_host_notifier_fd_handler(proxy, n, false);
+            virtio_pci_set_host_notifier_ioeventfd(proxy, n, false);
+        }
+    }
+    return 0;
+
+assign_error:
+    proxy->flags &= ~VIRTIO_PCI_FLAG_USE_IOEVENTFD;
+    while (--n >= 0) {
+        virtio_pci_set_host_notifier_fd_handler(proxy, n, false);
+        virtio_pci_set_host_notifier_ioeventfd
[PATCH 3/3] virtio-pci: Don't use ioeventfd on old kernels
There used to be a limit of 6 KVM io bus devices inside the kernel. On such a kernel, don't use ioeventfd for virtqueue host notification since the limit is reached too easily. This ensures that existing vhost-net setups (which always use ioeventfd) have ioeventfds available so they can continue to work.

Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
---
 hw/virtio-pci.c |    4
 kvm-all.c       |   46 ++
 kvm-stub.c      |    5 +
 kvm.h           |    1 +
 4 files changed, 56 insertions(+), 0 deletions(-)

diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index 436fc59..365a26b 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -646,6 +646,10 @@ static void virtio_init_pci(VirtIOPCIProxy *proxy, VirtIODevice *vdev,
     pci_register_bar(&proxy->pci_dev, 0, size, PCI_BASE_ADDRESS_SPACE_IO,
                      virtio_map);

+    if (!kvm_has_many_ioeventfds()) {
+        proxy->flags &= ~VIRTIO_PCI_FLAG_USE_IOEVENTFD;
+    }
+
     virtio_bind_device(vdev, &virtio_pci_bindings, proxy);
     proxy->host_features |= 0x1 << VIRTIO_F_NOTIFY_ON_EMPTY;
     proxy->host_features |= 0x1 << VIRTIO_F_BAD_FEATURE;
diff --git a/kvm-all.c b/kvm-all.c
index 37b99c7..ba302bc 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -28,6 +28,11 @@
 #include "kvm.h"
 #include "bswap.h"

+/* This check must be after config-host.h is included */
+#ifdef CONFIG_EVENTFD
+#include <sys/eventfd.h>
+#endif
+
 /* KVM uses PAGE_SIZE in it's definition of COALESCED_MMIO_MAX */
 #define PAGE_SIZE TARGET_PAGE_SIZE

@@ -72,6 +77,7 @@ struct KVMState
     int irqchip_in_kernel;
     int pit_in_kernel;
     int xsave, xcrs;
+    int many_ioeventfds;
 };

 static KVMState *kvm_state;
@@ -441,6 +447,39 @@ int kvm_check_extension(KVMState *s, unsigned int extension)
     return ret;
 }

+static int kvm_check_many_ioeventfds(void)
+{
+    /* Older kernels have a 6 device limit on the KVM io bus.  Find out so we
+     * can avoid creating too many ioeventfds.
+     */
+#ifdef CONFIG_EVENTFD
+    int ioeventfds[7];
+    int i, ret = 0;
+    for (i = 0; i < ARRAY_SIZE(ioeventfds); i++) {
+        ioeventfds[i] = eventfd(0, EFD_CLOEXEC);
+        if (ioeventfds[i] < 0) {
+            break;
+        }
+        ret = kvm_set_ioeventfd_pio_word(ioeventfds[i], 0, i, true);
+        if (ret < 0) {
+            close(ioeventfds[i]);
+            break;
+        }
+    }
+
+    /* Decide whether many devices are supported or not */
+    ret = i == ARRAY_SIZE(ioeventfds);
+
+    while (i-- > 0) {
+        kvm_set_ioeventfd_pio_word(ioeventfds[i], 0, i, false);
+        close(ioeventfds[i]);
+    }
+    return ret;
+#else
+    return 0;
+#endif
+}
+
 static void kvm_set_phys_mem(target_phys_addr_t start_addr,
                              ram_addr_t size,
                              ram_addr_t phys_offset)
@@ -717,6 +756,8 @@ int kvm_init(int smp_cpus)
     kvm_state = s;
     cpu_register_phys_memory_client(&kvm_cpu_phys_memory_client);

+    s->many_ioeventfds = kvm_check_many_ioeventfds();
+
     return 0;

 err:
@@ -1046,6 +1087,11 @@ int kvm_has_xcrs(void)
     return kvm_state->xcrs;
 }

+int kvm_has_many_ioeventfds(void)
+{
+    return kvm_state->many_ioeventfds;
+}
+
 void kvm_setup_guest_memory(void *start, size_t size)
 {
     if (!kvm_has_sync_mmu()) {
diff --git a/kvm-stub.c b/kvm-stub.c
index 5384a4b..33d4476 100644
--- a/kvm-stub.c
+++ b/kvm-stub.c
@@ -99,6 +99,11 @@ int kvm_has_robust_singlestep(void)
     return 0;
 }

+int kvm_has_many_ioeventfds(void)
+{
+    return 0;
+}
+
 void kvm_setup_guest_memory(void *start, size_t size)
 {
 }
diff --git a/kvm.h b/kvm.h
index 60a9b42..ce08d42 100644
--- a/kvm.h
+++ b/kvm.h
@@ -42,6 +42,7 @@ int kvm_has_robust_singlestep(void);
 int kvm_has_debugregs(void);
 int kvm_has_xsave(void);
 int kvm_has_xcrs(void);
+int kvm_has_many_ioeventfds(void);

 #ifdef NEED_CPU_H
 int kvm_init_vcpu(CPUState *env);
-- 
1.7.2.3
Re: [PATCH] ceph/rbd block driver for qemu-kvm (v7)
On Fri, Oct 15, 2010 at 8:54 PM, Christian Brunner c...@muc.de wrote: Hi, once again, Yehuda committed fixes for all the suggestions made on the list (and more). Here is the next update for the ceph/rbd block driver. Please let us know if there are any pending issues. For those who didn't follow the previous postings: This is a block driver for the distributed file system Ceph (http://ceph.newdream.net/). This driver uses librados (which is part of the Ceph server) for direct access to the Ceph object store and runs entirely in userspace (Yehuda also wrote a driver for the Linux kernel that can be used to access rbd volumes as a block device). Kind Regards, Christian

Signed-off-by: Christian Brunner c...@muc.de
Signed-off-by: Yehuda Sadeh yeh...@hq.newdream.net
---
 Makefile.objs     |    1 +
 block/rbd.c       | 1059 +
 block/rbd_types.h |   71
 configure         |   31 ++
 4 files changed, 1162 insertions(+), 0 deletions(-)
 create mode 100644 block/rbd.c
 create mode 100644 block/rbd_types.h

This patch is close now. Just some minor issues below.

diff --git a/Makefile.objs b/Makefile.objs
index 6ee077c..56a13c1 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -19,6 +19,7 @@ block-nested-y += parallels.o nbd.o blkdebug.o
 block-nested-$(CONFIG_WIN32) += raw-win32.o
 block-nested-$(CONFIG_POSIX) += raw-posix.o
 block-nested-$(CONFIG_CURL) += curl.o
+block-nested-$(CONFIG_RBD) += rbd.o

 block-obj-y += $(addprefix block/, $(block-nested-y))
diff --git a/block/rbd.c b/block/rbd.c
new file mode 100644
index 000..fbfb93e
--- /dev/null
+++ b/block/rbd.c
@@ -0,0 +1,1059 @@
+/*
+ * QEMU Block driver for RADOS (Ceph)
+ *
+ * Copyright (C) 2010 Christian Brunner c...@muc.de
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu-common.h"
+#include "qemu-error.h"
+
+#include "rbd_types.h"
+#include "block_int.h"
+
+#include <rados/librados.h>
+
+/*
+ * When specifying the image filename use:
+ *
+ * rbd:poolname/devicename
+ *
+ * poolname must be the name of an existing rados pool
+ *
+ * devicename is the basename for all objects used to
+ * emulate the raw device.
+ *
+ * Metadata information (image size, ...) is stored in an
+ * object with the name devicename.rbd.
+ *
+ * The raw device is split into 4MB sized objects by default.
+ * The sequence number is encoded in a 12 byte long hex-string,
+ * and is attached to the devicename, separated by a dot.
+ * e.g. devicename.1234567890ab
+ *
+ */
+
+#define OBJ_MAX_SIZE (1UL << OBJ_DEFAULT_OBJ_ORDER)
+
+typedef struct RBDAIOCB {
+    BlockDriverAIOCB common;
+    QEMUBH *bh;
+    int ret;
+    QEMUIOVector *qiov;
+    char *bounce;
+    int write;
+    int64_t sector_num;
+    int aiocnt;
+    int error;
+    struct BDRVRBDState *s;
+    int cancelled;
+} RBDAIOCB;
+
+typedef struct RADOSCB {
+    int rcbid;
+    RBDAIOCB *acb;
+    struct BDRVRBDState *s;
+    int done;
+    int64_t segsize;
+    char *buf;
+    int ret;
+} RADOSCB;
+
+#define RBD_FD_READ 0
+#define RBD_FD_WRITE 1
+
+typedef struct BDRVRBDState {
+    int fds[2];
+    rados_pool_t pool;
+    rados_pool_t header_pool;
+    char name[RBD_MAX_OBJ_NAME_SIZE];
+    char block_name[RBD_MAX_BLOCK_NAME_SIZE];
+    uint64_t size;
+    uint64_t objsize;
+    int qemu_aio_count;
+    int read_only;
+    int event_reader_pos;
+    RADOSCB *event_rcb;
+} BDRVRBDState;
+
+typedef struct rbd_obj_header_ondisk RbdHeader1;
+
+static int rbd_next_tok(char *dst, int dst_len,
+                        char *src, char delim,
+                        const char *name,
+                        char **p)
+{
+    int l;
+    char *end;
+
+    *p = NULL;
+
+    if (delim != '\0') {
+        end = strchr(src, delim);
+        if (end) {
+            *p = end + 1;
+            *end = '\0';
+        }
+    }
+    l = strlen(src);
+    if (l >= dst_len) {
+        error_report("%s too long", name);
+        return -EINVAL;
+    } else if (l == 0) {
+        error_report("%s too short", name);
+        return -EINVAL;
+    }
+
+    pstrcpy(dst, dst_len, src);
+
+    return 0;
+}
+
+static int rbd_parsename(const char *filename,
+                         char *pool, int pool_len,
+                         char *snap, int snap_len,
+                         char *name, int name_len)
+{
+    const char *start;
+    char *p, *buf;
+    int ret;
+
+    if (!strstart(filename, "rbd:", &start)) {
+        return -EINVAL;
+    }
+
+    buf = qemu_strdup(start);
+    p = buf;
+
+    ret = rbd_next_tok(pool, pool_len, p, '/', "pool name", &p);
+    if (ret < 0 || !p) {
+        ret = -EINVAL;
+        goto done;
+    }
+    ret = rbd_next_tok(name, name_len, p, '@', "object name", &p);
+    if (ret < 0) {
Re: Unable to start VM using COWed image
On Wed, Nov 10, 2010 at 10:08 AM, Prasad Joshi p.g.jo...@student.reading.ac.uk wrote: Where can I get the code of the qemu-kvm program? I cloned the qemu-kvm git repository and compiled the code, but it looks like the qemu-kvm program is not part of this code.

qemu-kvm.git contains the qemu-kvm codebase but the binary is built in x86_64-softmmu/qemu-system-x86_64. Distro packages typically rename it to qemu-kvm.

Stefan
Re: KVM, windows 2000 and qcow2
On Tue, Nov 9, 2010 at 8:06 PM, RaSca ra...@miamammausalinux.org wrote: Today I saw this page: http://www.linux-kvm.org/page/Guest_Support_Status in which it is explained that it is better to run win2k on qcow2 images.

The page does not state that it is better to run Windows 2000 on qcow2; it's probably just that the tester decided to use qcow2 on his machine and noted his configuration. Raw should work just like qcow2 does, and you can expect better performance with raw. Although qcow2 does run on raw devices like drbd or LVM logical volumes, I suspect you'll see the same issue you're getting now. It's worth figuring out why your current configuration freezes. I am interested in the same questions Jernej asked about the freeze: is the guest crashing with a BSOD or is the KVM process consuming all CPU?

Stefan
Re: Unable to start VM using COWed image
On Wed, Nov 10, 2010 at 12:40 PM, Prasad Joshi p.g.jo...@student.reading.ac.uk wrote:

From: Stefan Hajnoczi [stefa...@gmail.com] Sent: 10 November 2010 11:12 To: Prasad Joshi Cc: Keqin Hong; kvm@vger.kernel.org Subject: Re: Unable to start VM using COWed image

On Wed, Nov 10, 2010 at 10:08 AM, Prasad Joshi p.g.jo...@student.reading.ac.uk wrote: Where can I get the code of the qemu-kvm program? I cloned the qemu-kvm git repository and compiled the code, but it looks like the qemu-kvm program is not part of this code.

qemu-kvm.git contains the qemu-kvm codebase but the binary is built in x86_64-softmmu/qemu-system-x86_64. Distro packages typically rename it to qemu-kvm.

Thanks Stefan for your reply. I guess you pointed out the problem in the first mail. QEMU places a restriction on the location of the COWed file: the source image and COWed image should be in the same directory. In my case the source image was in the directory /var/lib/libvirt/images/ and the COWed image was in the /home/prasad/Virtual directory. While debugging the source code using gdb I realized this limitation. It would be good to fix this problem. I will see if I can solve this problem.

This behavior is a feature. You chose to use a relative backing file path when you used qemu-img create -b relative-path. If you want an absolute path you need to use qemu-img create -b /home/prasad/Virtual/... (i.e. specify an absolute path instead of a relative path).

One more question on the same lines: how does QEMU detect that the file is COWed and the name of the file (not the whole path) from which it is COWed?

COW support comes from the image file format that you choose. A qcow file is not just a raw image file like the kind you can dd from a real disk. Instead it has its own file format including a header and metadata for tracking allocated space. The header contains the name of the backing file.
Stefan
Re: [PATCH] virtio: Use ioeventfd for virtqueue notify
Hi Michael, I have looked into the way irqfd with msix mask notifiers works. From what I can tell, the guest notifiers are enabled by vhost net in order to hook up irqfds for the virtqueues. MSIX allows vectors to be masked, so there is an mmio write notifier in qemu-kvm to toggle the irqfd and its QEMU fd handler when the guest toggles the MSIX mask. The irqfd is disabled but stays open as an eventfd while masked. That means masking/unmasking the vector does not close/open the eventfd file descriptor itself.

I'm having trouble finding a direct parallel to virtio-ioeventfd here. We always want to have an ioeventfd per virtqueue unless the host kernel does not support 6 ioeventfds per VM. When vhost sets the host notifier we want to remove the QEMU fd handler and allow vhost to use the event notifier's fd as it wants. When vhost clears the host notifier we want to add the QEMU fd handler again (unless the kernel does not support 6 ioeventfds per VM).

I think hooking in at the virtio-pci.c level instead of virtio.c is possible but we're still going to have the same state transitions. I hope it can be done without adding per-virtqueue variables that track state. Before I go down this route, is there something I've missed and do you think this approach will be better?

Stefan
Re: Unable to start VM using COWed image
On Tue, Nov 9, 2010 at 3:36 PM, Prasad Joshi p.g.jo...@student.reading.ac.uk wrote: Hello, I am trying to run a KVM machine from an image created as a COW copy of the original image, but it is not working.

Screenshot that shows KVM works with the original image:

[r...@prasad images]# qemu-kvm /var/lib/libvirt/images/Ubuntu.img -m 512
^Z
[1]+ Stopped qemu-kvm /var/lib/libvirt/images/Ubuntu.img -m 512
[r...@prasad images]# bg
[1]+ qemu-kvm /var/lib/libvirt/images/Ubuntu.img -m 512
[r...@prasad images]# pwd
/var/lib/libvirt/images
[r...@prasad images]# lsmod | grep -i kvm
kvm_intel 42122 3
kvm 257132 1 kvm_intel

1. Created a COW copy of the image after stopping the VM that was running:

[r...@prasad images]# pwd
/var/lib/libvirt/images
[r...@prasad images]# qemu-img create -b Ubuntu.img -f qcow /home/prasad/Virtual/Ubuntu_copy.ovl

2. Trying to run the VM using the copy created:

[pra...@prasad Virtual]$ ls -l
total 36
-rw-r--r--. 1 root root 32832 Nov 9 15:33 Ubuntu_copy.ovl

Why is Ubuntu.img not visible in the ls output? Ubuntu.img is a relative path to the backing file. It looks like QEMU will not be able to open the backing file. Also, is there a reason you're using qcow and not qcow2?

Stefan
Re: Disk I/O stuck with KVM - no clue how to solve that
On Sun, Nov 7, 2010 at 4:07 PM, Hermann Himmelbauer du...@qwer.tk wrote:

On Saturday, 6 November 2010 at 20:58:12, Stefan Hajnoczi wrote:

On Fri, Nov 5, 2010 at 5:16 PM, Hermann Himmelbauer du...@qwer.tk wrote: I experience strange disk I/O stalls on my Linux host + guest with KVM, which make the system (especially the guests) almost unusable. These stalls come periodically, e.g. every 2 to 10 seconds, and last between 3 and sometimes over 120 seconds, which triggers kernel messages like this (on host and/or guest): INFO: task postgres:2195 blocked for more than 120 seconds

The fact that this happens on the host too suggests there's an issue with the host software/hardware and the VM is triggering it but not the root cause. Does dmesg display any other suspicious messages?

No, there's nothing that can be seen via dmesg. I at first suspected the hardware, too. I can think of the following reasons:

1) Broken SATA cable / hard disks - I changed some cables, no change, thus this is probably ruled out. I also can't see anything via S.M.A.R.T. Moreover, the problem is not bound to a specific device; instead it happens on sda - sdd, so I doubt it's hard disk related.
2) Broken power supply / insufficient power - I'd expect either a complete crash or some error messages in this case, so I'd rather rule that out.
3) Broken SATA controller - I cannot think of any way to check that, but I'd also expect some crashes or kernel messages. I flashed the board to the latest BIOS version, no change either.

However, it seems no one except me has this problem, so I'll buy a new, similar but different mainboard (Intel instead of Asus); hopefully this solves the problem. What do you think, any better idea?

If you have the time, you can use perf probes to trace I/O requests in the host kernel. Perhaps completion interrupts are being dropped. You may wish to start by tracing requests issued and completed by the SATA driver.
Stefan
Re: Disk I/O stuck with KVM - no clue how to solve that
On Fri, Nov 5, 2010 at 5:16 PM, Hermann Himmelbauer du...@qwer.tk wrote: I experience strange disk I/O stalls on my Linux host + guest with KVM, which make the system (especially the guests) almost unusable. These stalls come periodically, e.g. every 2 to 10 seconds, and last between 3 and sometimes over 120 seconds, which triggers kernel messages like this (on host and/or guest): INFO: task postgres:2195 blocked for more than 120 seconds

The fact that this happens on the host too suggests there's an issue with the host software/hardware and the VM is triggering it but not the root cause. Does dmesg display any other suspicious messages?

Stefan
Re: [RFC PATCH 0/1] vhost: Reduce TX used buffer signal for performance
On Wed, Oct 27, 2010 at 10:05 PM, Shirley Ma mashi...@us.ibm.com wrote: This patch changes vhost TX used buffer signaling to the guest from one by one to up to 3/4 of the vring size. This change improves vhost TX performance for message sizes from 256 to 8K, for both bandwidth and CPU utilization, without inducing any regression.

Any concerns about introducing latency or does the guest not care when TX completions come in?

Signed-off-by: Shirley Ma x...@us.ibm.com
---
 drivers/vhost/net.c   |   19 ++-
 drivers/vhost/vhost.c |   31 +++
 drivers/vhost/vhost.h |    3 +++
 3 files changed, 52 insertions(+), 1 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 4b4da5b..bd1ba71 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -198,7 +198,24 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: len %d != %zd\n", err, len);
-		vhost_add_used_and_signal(net->dev, vq, head, 0);
+		/*
+		 * if no pending buffer size allocate, signal used buffer
+		 * one by one, otherwise, signal used buffer when reaching
+		 * 3/4 ring size to reduce CPU utilization.
+		 */
+		if (unlikely(vq->pend))
+			vhost_add_used_and_signal(net->dev, vq, head, 0);
+		else {
+			vq->pend[vq->num_pend].id = head;

I don't understand the logic here: if !vq->pend then we assign to vq->pend[vq->num_pend].

+			vq->pend[vq->num_pend].len = 0;
+			++vq->num_pend;
+			if (vq->num_pend == (vq->num - (vq->num >> 2))) {
+				vhost_add_used_and_signal_n(net->dev, vq,
+							    vq->pend,
+							    vq->num_pend);
+				vq->num_pend = 0;
+			}
+		}
 		total_len += len;
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 94701ff..47696d2 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -170,6 +170,16 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vq->call_ctx = NULL;
 	vq->call = NULL;
 	vq->log_ctx = NULL;
+	/* signal pending used buffers */
+	if (vq->pend) {
+		if (vq->num_pend != 0) {
+			vhost_add_used_and_signal_n(dev, vq, vq->pend,
+						    vq->num_pend);
+			vq->num_pend = 0;
+		}
+		kfree(vq->pend);
+	}
+	vq->pend = NULL;
 }

 static int vhost_worker(void *data)
@@ -273,7 +283,13 @@ long vhost_dev_init(struct vhost_dev *dev,
 		dev->vqs[i].heads = NULL;
 		dev->vqs[i].dev = dev;
 		mutex_init(&dev->vqs[i].mutex);
+		dev->vqs[i].num_pend = 0;
+		dev->vqs[i].pend = NULL;
 		vhost_vq_reset(dev, dev->vqs + i);
+		/* signal 3/4 of ring size used buffers */
+		dev->vqs[i].pend = kmalloc((dev->vqs[i].num -
+					   (dev->vqs[i].num >> 2)) *
+					   sizeof *vq->peed, GFP_KERNEL);

Has this patch been compile tested? vq->peed?

Stefan
Re: [RFC PATCH 0/1] vhost: Reduce TX used buffer signal for performance
On Thu, Oct 28, 2010 at 9:57 AM, Stefan Hajnoczi stefa...@gmail.com wrote: Just read the patch 1/1 discussion and it looks like you're already on it. Sorry for the noise.

Stefan
[PATCH] qcow2: Fix segfault when qcow2 preallocate fails
When an image is created with -o preallocate, ensure that we only call preallocate() if the image was indeed opened successfully. Also use bdrv_delete() instead of bdrv_close() to avoid leaking the BlockDriverState structure. This fixes the segfault reported at https://bugzilla.redhat.com/show_bug.cgi?id=646538.

Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
---
Here's a fix for the segfault.

 block/qcow2.c |    8 +---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index ee3481b..0fceb0d 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1059,9 +1059,11 @@ exit:
         BlockDriverState *bs;
         BlockDriver *drv = bdrv_find_format("qcow2");
         bs = bdrv_new("");
-        bdrv_open(bs, filename, BDRV_O_CACHE_WB | BDRV_O_RDWR, drv);
-        ret = preallocate(bs);
-        bdrv_close(bs);
+        ret = bdrv_open(bs, filename, BDRV_O_CACHE_WB | BDRV_O_RDWR, drv);
+        if (ret == 0) {
+            ret = preallocate(bs);
+        }
+        bdrv_delete(bs);
     }

     return ret;
-- 
1.7.1
Re: [PATCH] qcow2: Fix segfault when qcow2 preallocate fails
On Tue, Oct 26, 2010 at 2:48 PM, Kevin Wolf kw...@redhat.com wrote:
On 26.10.2010 15:23, Stefan Hajnoczi wrote:

When an image is created with -o preallocate, ensure that we only call preallocate() if the image was indeed opened successfully. Also use bdrv_delete() instead of bdrv_close() to avoid leaking the BlockDriverState structure. This fixes the segfault reported at https://bugzilla.redhat.com/show_bug.cgi?id=646538.

Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com

Looks good for stable-0.13. In master we'll have the new qcow_create2 implementation as soon as Anthony pulls, so it doesn't apply there.

I forgot about that :). Thanks Kevin.

Stefan
Re: [PATCH] virtio: Use ioeventfd for virtqueue notify
On Tue, Oct 19, 2010 at 03:33:41PM +0200, Michael S. Tsirkin wrote: Apologies if you receive this twice, the original message either disappeared or was delayed somehow. My main concern is with the fact that we add more state in notifiers that can easily get out of sync with users. If we absolutely need this state, let's try to at least document the state machine, and make the API for state transitions more transparent. I'll try to describe how it works. If you're happy with the design in principle then I can rework the code. Otherwise we can think about a different design. The goal is to use ioeventfd instead of the synchronous pio emulation path that userspace virtqueues use today. Both virtio-blk and virtio-net increase performance with this approach because it does not block the vcpu from executing guest code while the I/O operation is initiated. We want to automatically create an event notifier and setup ioeventfd for each initialized virtqueue. Vhost already uses ioeventfd so it is important not to interfere with devices that have enabled vhost. If vhost is enabled, then the device's virtqueues are off-limits and should not be tampered with. Furthermore, older kernels limit you to 6 ioeventfds per guest. On such systems it is risky to automatically use ioeventfd for userspace virtqueues, since that could take a precious ioeventfd away from another virtio device using vhost. Existing guest configurations would break so it is simplest to avoid using ioeventfd for userspace virtqueues on such hosts. The design adds logic into hw/virtio.c to automatically use ioeventfd for userspace virtqueues. Specific virtio devices like blk and net require no modification. The logic sits below the set_host_notifier() function that vhost uses. This design stays in sync because it speaks two interfaces that allow it to accurately track whether or not to use ioeventfd: 1. virtio_set_host_notifier() is used by vhost. When vhost enables the host notifier we stay out of the way. 2. 
virtio_reset()/virtio_set_status()/virtio_load() define the device life-cycle and transition the state machine appropriately. Migration is supported.

Here is the state machine that tracks a virtqueue. There are three states: deassigned (the starting state), assigned, and offlimits. The transitions are:

a. The virtqueue starts deassigned with no ioeventfd.
b. When the device status becomes VIRTIO_CONFIG_S_DRIVER_OK we try to assign an ioeventfd to each virtqueue, except if the 6 ioeventfd limitation is present.
c, d. The virtqueue becomes offlimits if vhost enables the host notifier.
e. The ioeventfd becomes assigned again when the host notifier is disabled by vhost.
f. Except when the 6 ioeventfd limitation is present, then the ioeventfd becomes unassigned because we want to avoid using ioeventfd.
g. When the device is reset its virtqueues become deassigned again.

Does this make sense?

Stefan
Re: VM with two interfaces
On Thu, Oct 21, 2010 at 11:49 PM, Nirmal Guhan vavat...@gmail.com wrote:
On Thu, Oct 21, 2010 at 9:23 AM, Nirmal Guhan vavat...@gmail.com wrote:

Hi, Am trying to create a VM using qemu-kvm with two interfaces (fedora12 is the host and vm) and running into an issue. Given below is the command:

qemu-kvm -net nic,macaddr=$macaddress,model=pcnet -net tap,script=/etc/qemu-ifup -net nic,model=pcnet -net tap,script=/etc/qemu-ifup -m 1024 -hda ./vdisk.img -kernel ./bzImage-1019 -append "ip=x.y.z.w:a.b.c.d:p.q.r.s:a.b.c.d ip=x.y.z.u:a.b.c.d:p.q.r.s:a.b.c.d root=/dev/nfs rw nfsroot=x.y.z.v:/blahblahblah"

On boot, both eth0 and eth1 come up but the vm tries to send dhcp and rarp requests instead of using the command line IP addresses. DHCP would fail in my case. With just one interface, dhcp is not attempted and nfs mount of root works fine. Any clue on what could be wrong here? Thanks, Nirmal

Can someone help please? Hard pressed on time... sorry

Try the #qemu or #kvm IRC channels on chat.freenode.net. Often people will respond and debug interactively with you there.

Your problem does not seem QEMU/KVM related. You'll need to debug the guest's boot process and network configuration just like a physical machine. In fact, I bet with an identical setup on a physical machine you'd see the same problem. Double-check the kernel parameters documentation. See if you have a network configuration in the initramfs or the root filesystem that would cause it to DHCP and how to pull the pre-configured settings from the kernel.

Stefan
Re: Build error with the io-thread option
2010/10/22 Jean-Philippe Menil jean-philippe.me...@univ-nantes.fr:

I encounter the following problem when I attempt to build qemu-kvm with the --enable-io-thread option:

--enable-io-thread doesn't build in qemu-kvm.git. qemu-kvm.git has an equivalent implemented and used automatically, so you don't need to set --enable-io-thread.

Stefan
Re: CD-ROM size not updated when switching CD-ROM images.
On Tue, Oct 19, 2010 at 4:43 AM, Alex Davis alex14...@yahoo.com wrote:

Steps to reproduce:
1) Download the first two Slackware-13.1 32-bit CD-ROM ISO images.
2) Start KVM with the following command:
   qemu-system-x86_64 -m 1024M \
     -cdrom <full path of 1st install disk> \
     -boot d
3) Hit return when prompted for extra boot parameters.
4) Hit return when asked to select a keyboard map.
5) Hit return at the login prompt.
6) cat /sys/block/sr0/size: this should return 1209360.
7) Press Alt-Ctrl-2 to access the monitor.
8) eject ide1-cd0
9) change ide1-cd0 <full path name of 2nd install disk>
10) Press Alt-Ctrl-1 to return to the guest.
11) dd if=/dev/sr0 of=/dev/null bs=512 skip=1209360 count=3: this should return "3+0 records in 3+0 records out"; instead it returns 0+0.
12) cat /sys/block/sr0/size: this still returns 1209360; it should return 1376736.

Oddly, when mount /dev/sr0 is executed in the guest, ls of the mounted directory shows the correct contents for the 2nd CD.

After changing the CD-ROM, does running blockdev --rereadpt /dev/sr0 update the size as expected? You ejected the CD-ROM on the QEMU side, the guest doesn't necessarily know about the medium change. What happens when you use eject /dev/sr0 inside the guest instead? I don't know how CD-ROM media change works on real hardware, but that is the behavior that QEMU should be following.

Stefan
Re: [PATCH] virtio: Use ioeventfd for virtqueue notify
On Thu, Sep 30, 2010 at 03:01:52PM +0100, Stefan Hajnoczi wrote:

Virtqueue notify is currently handled synchronously in userspace virtio. This prevents the vcpu from executing guest code while hardware emulation code handles the notify. On systems that support KVM, the ioeventfd mechanism can be used to make virtqueue notify a lightweight exit by deferring hardware emulation to the iothread and allowing the VM to continue execution. This model is similar to how vhost receives virtqueue notifies.

The result of this change is improved performance for userspace virtio devices. Virtio-blk throughput increases especially for multithreaded scenarios and virtio-net transmit throughput increases substantially. Full numbers are below.

This patch employs ioeventfd virtqueue notify for all virtio devices. Linux kernels pre-2.6.34 only allow for 6 ioeventfds per VM and care must be taken so that vhost-net, the other ioeventfd user in QEMU, is able to function. On such kernels ioeventfd virtqueue notify will not be used.

Khoa Huynh k...@us.ibm.com collected the following data for virtio-blk with cache=none,aio=native:

FFSB Test          Threads  Unmodified  Patched
                             (MB/s)      (MB/s)
Large file create    1         21.7       21.8
                     8        101.0      118.0
                    16        119.0      157.0
Sequential reads     1         21.9       23.2
                     8        114.0      139.0
                    16        143.0      178.0
Random reads         1          3.3        3.6
                     8         23.0       25.4
                    16         43.3       47.8
Random writes        1         22.2       23.0
                     8         93.1      111.6
                    16        110.5      132.0

Sridhar Samudrala s...@us.ibm.com collected the following data for virtio-net with 2.6.36-rc1 on the host and 2.6.34 on the guest.

Guest to Host TCP_STREAM throughput(Mb/sec)
-------------------------------------------
Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
65536       12755      6430        7590
16384        8499      3084        5764
 4096        4723      1578        3659
 1024        1827       981        2060

Host to Guest TCP_STREAM throughput(Mb/sec)
-------------------------------------------
Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
65536       11156      5790        5853
16384       10787      5575        5691
 4096       10452      5556        4277
 1024        4437      3671        5277

Guest to Host TCP_RR latency(transactions/sec)
----------------------------------------------
Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
    1        9903      3459        3425
 4096        7185      1931        1899
16384        6108      2102        1923
65536        3161      1610        1744

Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
---
Small changes are required for qemu-kvm.git. I will send them once qemu.git has virtio-ioeventfd support.

 hw/vhost.c  |    6 ++--
 hw/virtio.c |  106 +++
 hw/virtio.h |    9 +
 kvm-all.c   |   39 +
 kvm-stub.c  |    5 +++
 kvm.h       |    1 +
 6 files changed, 156 insertions(+), 10 deletions(-)

Is there anything stopping this patch from being merged?

Thanks,
Stefan
Re: [Qemu-devel] Re: KVM call agenda for Oct 19
On Tue, Oct 19, 2010 at 2:33 PM, Anthony Liguori anth...@codemonkey.ws wrote:
On 10/19/2010 08:27 AM, Avi Kivity wrote:
On 10/19/2010 03:22 PM, Anthony Liguori wrote:

I had assumed that this would involve:

qemu -hda windows.img
(qemu) snapshot ide0-disk0 snap0.img

1) create snap0.img internally by doing the equivalent of `qemu-img create -f qcow2 -b windows.img snap0.img'
2) bdrv_flush('ide0-disk0')
3) bdrv_open(snap0.img)
4) bdrv_close(windows.img)
5) rename('windows.img', 'windows.img.tmp')
6) rename('snap0.img', 'windows.img')
7) rename('windows.img.tmp', 'snap0.img')

Looks reasonable. Would be interesting to look at this as a use case for the threading work. We should eventually be able to create a snapshot without stalling vcpus (stalling I/O of course allowed).

If we had another block-level command, like bdrv_aio_freeze(), that queued all pending requests until the given callback completed, it would be very easy to do this entirely asynchronously. For instance:

bdrv_aio_freeze(create_snapshot)
create_snapshot():
    bdrv_aio_flush(done_flush)
done_flush():
    bdrv_open(...)
    bdrv_close(...)
    ...

Of course, closing a device while it's being frozen is probably a recipe for disaster but you get the idea :-)

bdrv_aio_freeze() or any mechanism to deal with pending requests in the generic block code would be a good step for future live support of other operations like truncate.

Stefan
Re: [PATCH] virtio: Use ioeventfd for virtqueue notify
On Tue, Oct 19, 2010 at 2:35 PM, Michael S. Tsirkin m...@redhat.com wrote: On Tue, Oct 19, 2010 at 08:12:42AM -0500, Anthony Liguori wrote: On 10/19/2010 08:07 AM, Stefan Hajnoczi wrote: Is there anything stopping this patch from being merged? Michael, any objections? If not, I'll merge it. I don't really understand what's going on there. The extra state in notifiers especially scares me. If you do and are comfortable with the code, go ahead :) I'm happy to address your comments. The state machine was a bit icky but I don't see a way around it. Will follow up to your review email. Stefan
Re: [Qemu-devel] [PATCH] ceph/rbd block driver for qemu-kvm (v6)
On Wed, Oct 13, 2010 at 12:18 AM, Christian Brunner c...@muc.de wrote:

+static int rbd_set_snapc(rados_pool_t pool, const char *snap, RbdHeader1 *header)
+{
+    uint32_t snap_count = header->snap_count;
+    rados_snap_t *snaps = NULL;
+    rados_snap_t seq;
+    uint32_t i;
+    uint64_t snap_names_len = header->snap_names_len;
+    int r;
+    rados_snap_t snapid = 0;
+
+    cpu_to_le32s(&snap_count);
+    cpu_to_le64s(&snap_names_len);

It is clearer to do byteswapping immediately, rather than having the variable take on different endianness at different times:

uint32_t snap_count = cpu_to_le32(header->snap_count);
uint64_t snap_names_len = cpu_to_le64(header->snap_names_len);

+    if (snap_count) {
+        const char *header_snap = (const char *)&header->snaps[snap_count];
+        const char *end = header_snap + snap_names_len;

snap_names_len is little-endian. This won't work on big-endian hosts. Did you mean le64_to_cpu() instead of cpu_to_le64()?

+        snaps = qemu_malloc(sizeof(rados_snap_t) * header->snap_count);

snaps is allocated here...

+
+        for (i = 0; i < snap_count; i++) {
+            snaps[i] = (uint64_t)header->snaps[i].id;
+            cpu_to_le64s(&snaps[i]);
+
+            if (snap && strcmp(snap, header_snap) == 0) {
+                snapid = snaps[i];
+            }
+
+            header_snap += strlen(header_snap) + 1;
+            if (header_snap > end) {
+                error_report("bad header, snapshot list broken");
+            }
+        }
+    }
+
+    if (snap && !snapid) {
+        error_report("snapshot not found");
+        return -ENOENT;

...but never freed here.

+    }
+    seq = header->snap_seq;
+    cpu_to_le32s((uint32_t *)&seq);
+
+    r = rados_set_snap_context(pool, seq, snaps, snap_count);
+
+    rados_set_snap(pool, snapid);
+
+    qemu_free(snaps);
+
+    return r;
+}

Stefan
Re: [Qemu-devel] [PATCH -v5] ceph/rbd block driver for qemu-kvm
On Fri, Oct 8, 2010 at 8:00 PM, Yehuda Sadeh yeh...@hq.newdream.net wrote:

No flush operation is supported. Can the guest be sure written data is on stable storage when it receives completion?

+/*
+ * This aio completion is being called from rbd_aio_event_reader() and
+ * runs in qemu context. It schedules a bh, but just in case the aio
+ * was not cancelled before.

Cancellation looks unsafe to me because acb is freed for cancel but then accessed here! Also see my comment on aio_cancel() below.

+/*
+ * Cancel aio. Since we don't reference acb in a non qemu threads,
+ * it is safe to access it here.
+ */
+static void rbd_aio_cancel(BlockDriverAIOCB *blockacb)
+{
+    RBDAIOCB *acb = (RBDAIOCB *) blockacb;
+    qemu_bh_delete(acb->bh);
+    acb->bh = NULL;
+    qemu_aio_release(acb);

Any pending librados completions are still running here and will then cause acb to be accessed after they complete. If there is no safe way to cancel then wait for the request to complete.

+}
+
+static AIOPool rbd_aio_pool = {
+    .aiocb_size = sizeof(RBDAIOCB),
+    .cancel = rbd_aio_cancel,
+};
+
+/*
+ * This is the callback function for rados_aio_read and _write
+ *
+ * Note: this function is being called from a non qemu thread so
+ * we need to be careful about what we do here. Generally we only
+ * write to the block notification pipe, and do the rest of the
+ * io completion handling from rbd_aio_event_reader() which
+ * runs in a qemu context.

Do librados threads have all signals blocked? QEMU uses signals so it is important that this signal not get sent to a librados thread and discarded. I have seen this issue in the past when using threaded libraries in QEMU.

+ */
+static void rbd_finish_aiocb(rados_completion_t c, RADOSCB *rcb)
+{
+    rcb->ret = rados_aio_get_return_value(c);
+    rados_aio_release(c);
+    if (write(rcb->s->fds[RBD_FD_WRITE], (void *)&rcb, sizeof(&rcb)) < 0) {

You are writing RADOSCB* so sizeof(rcb) should be used.

+        error_report("failed writing to acb->s->fds\n");
+        qemu_free(rcb);
+    }
+}
+
+/* Callback when all queued rados_aio requests are complete */
+
+static void rbd_aio_bh_cb(void *opaque)
+{
+    RBDAIOCB *acb = opaque;
+
+    if (!acb->write) {
+        qemu_iovec_from_buffer(acb->qiov, acb->bounce, acb->qiov->size);
+    }
+    qemu_vfree(acb->bounce);
+    acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
+    qemu_bh_delete(acb->bh);
+    acb->bh = NULL;
+
+    qemu_aio_release(acb);
+}
+
+static BlockDriverAIOCB *rbd_aio_rw_vector(BlockDriverState *bs,
+                                           int64_t sector_num,
+                                           QEMUIOVector *qiov,
+                                           int nb_sectors,
+                                           BlockDriverCompletionFunc *cb,
+                                           void *opaque, int write)
+{
+    RBDAIOCB *acb;
+    RADOSCB *rcb;
+    rados_completion_t c;
+    char n[RBD_MAX_SEG_NAME_SIZE];
+    int64_t segnr, segoffs, segsize, last_segnr;
+    int64_t off, size;
+    char *buf;
+
+    BDRVRBDState *s = bs->opaque;
+
+    acb = qemu_aio_get(&rbd_aio_pool, bs, cb, opaque);
+    acb->write = write;
+    acb->qiov = qiov;
+    acb->bounce = qemu_blockalign(bs, qiov->size);
+    acb->aiocnt = 0;
+    acb->ret = 0;
+    acb->error = 0;
+    acb->s = s;
+
+    if (!acb->bh) {
+        acb->bh = qemu_bh_new(rbd_aio_bh_cb, acb);
+    }

When do you expect acb->bh to be non-NULL?

+
+    if (write) {
+        qemu_iovec_to_buffer(acb->qiov, acb->bounce);
+    }
+
+    buf = acb->bounce;
+
+    off = sector_num * BDRV_SECTOR_SIZE;
+    size = nb_sectors * BDRV_SECTOR_SIZE;
+    segnr = off / s->objsize;
+    segoffs = off % s->objsize;
+    segsize = s->objsize - segoffs;
+
+    last_segnr = ((off + size - 1) / s->objsize);
+    acb->aiocnt = (last_segnr - segnr) + 1;
+
+    s->qemu_aio_count += acb->aiocnt; /* All the RADOSCB */
+
+    if (write && s->read_only) {
+        acb->ret = -EROFS;
+        return NULL;
+    }

block.c:bdrv_aio_writev() will reject writes to read-only block devices. This check can be eliminated and it also prevents leaking acb here.

Stefan
Re: [PATCH] virtio: Use ioeventfd for virtqueue notify
On Sun, Oct 3, 2010 at 12:01 PM, Avi Kivity a...@redhat.com wrote: On 09/30/2010 04:01 PM, Stefan Hajnoczi wrote: Virtqueue notify is currently handled synchronously in userspace virtio. This prevents the vcpu from executing guest code while hardware emulation code handles the notify. On systems that support KVM, the ioeventfd mechanism can be used to make virtqueue notify a lightweight exit by deferring hardware emulation to the iothread and allowing the VM to continue execution. This model is similar to how vhost receives virtqueue notifies. Note that this is a tradeoff. If an idle core is available and the scheduler places the iothread on that core, then the heavyweight exit is replaced by a lightweight exit + IPI. If the iothread is co-located with the vcpu, then we'll take a heavyweight exit in any case. The first case is very likely if the host cpu is undercommitted and there is heavy I/O activity. This is a typical subsystem benchmark scenario (as opposed to a system benchmark like specvirt). My feeling is that total system throughput will be decreased unless the scheduler is clever enough to place the iothread and vcpu on the same host cpu when the system is overcommitted. We can't balance feeling against numbers, especially when we have a precedent in vhost-net, so I think this should go in. But I think we should also try to understand the effects of the extra IPIs and cacheline bouncing that this creates. While virtio was designed to minimize this, we know it has severe problems in this area. Right, there is a danger of optimizing for subsystem benchmark cases rather than real world usage. I have posted some results that we've gathered but more scrutiny is welcome. 
Khoa Huynh k...@us.ibm.com collected the following data for virtio-blk with cache=none,aio=native:

FFSB Test          Threads  Unmodified  Patched
                             (MB/s)      (MB/s)
Large file create    1         21.7       21.8
                     8        101.0      118.0
                    16        119.0      157.0
Sequential reads     1         21.9       23.2
                     8        114.0      139.0
                    16        143.0      178.0
Random reads         1          3.3        3.6
                     8         23.0       25.4
                    16         43.3       47.8
Random writes        1         22.2       23.0
                     8         93.1      111.6
                    16        110.5      132.0

Impressive numbers. Can you also provide efficiency (bytes per host cpu seconds)?

Khoa, do you have the host CPU numbers for these benchmark runs?

How many guest vcpus were used with this? With enough vcpus, there is also a reduction in cacheline bouncing, since the virtio state in the host gets to stay on one cpu (especially with aio=native).

Guest: 2 vcpu, 4 GB RAM
Host: 16 cpus, 12 GB RAM

Khoa, is this correct?

Stefan
Re: disk image snapshot functionality
Try this: http://wiki.qemu.org/download/qemu-doc.html#vm_005fsnapshots

To list the snapshots in your QCOW2 image:
qemu-img snapshot -l myimage.qcow2

To revert the disk to a saved state:
qemu-img snapshot -a snapshot-name myimage.qcow2

Stefan
[PATCH] virtio: Use ioeventfd for virtqueue notify
Virtqueue notify is currently handled synchronously in userspace virtio. This prevents the vcpu from executing guest code while hardware emulation code handles the notify. On systems that support KVM, the ioeventfd mechanism can be used to make virtqueue notify a lightweight exit by deferring hardware emulation to the iothread and allowing the VM to continue execution. This model is similar to how vhost receives virtqueue notifies.

The result of this change is improved performance for userspace virtio devices. Virtio-blk throughput increases especially for multithreaded scenarios and virtio-net transmit throughput increases substantially. Full numbers are below.

This patch employs ioeventfd virtqueue notify for all virtio devices. Linux kernels pre-2.6.34 only allow for 6 ioeventfds per VM and care must be taken so that vhost-net, the other ioeventfd user in QEMU, is able to function. On such kernels ioeventfd virtqueue notify will not be used.

Khoa Huynh k...@us.ibm.com collected the following data for virtio-blk with cache=none,aio=native:

FFSB Test          Threads  Unmodified  Patched
                             (MB/s)      (MB/s)
Large file create    1         21.7       21.8
                     8        101.0      118.0
                    16        119.0      157.0
Sequential reads     1         21.9       23.2
                     8        114.0      139.0
                    16        143.0      178.0
Random reads         1          3.3        3.6
                     8         23.0       25.4
                    16         43.3       47.8
Random writes        1         22.2       23.0
                     8         93.1      111.6
                    16        110.5      132.0

Sridhar Samudrala s...@us.ibm.com collected the following data for virtio-net with 2.6.36-rc1 on the host and 2.6.34 on the guest.
Guest to Host TCP_STREAM throughput(Mb/sec)
-------------------------------------------
Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
65536       12755      6430        7590
16384        8499      3084        5764
 4096        4723      1578        3659
 1024        1827       981        2060

Host to Guest TCP_STREAM throughput(Mb/sec)
-------------------------------------------
Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
65536       11156      5790        5853
16384       10787      5575        5691
 4096       10452      5556        4277
 1024        4437      3671        5277

Guest to Host TCP_RR latency(transactions/sec)
----------------------------------------------
Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
    1        9903      3459        3425
 4096        7185      1931        1899
16384        6108      2102        1923
65536        3161      1610        1744

Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
---
Small changes are required for qemu-kvm.git. I will send them once qemu.git has virtio-ioeventfd support.

 hw/vhost.c  |    6 ++--
 hw/virtio.c |  106 +++
 hw/virtio.h |    9 +
 kvm-all.c   |   39 +
 kvm-stub.c  |    5 +++
 kvm.h       |    1 +
 6 files changed, 156 insertions(+), 10 deletions(-)

diff --git a/hw/vhost.c b/hw/vhost.c
index 1b8624d..f127a07 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -517,7 +517,7 @@ static int vhost_virtqueue_init(struct vhost_dev *dev,
         goto fail_guest_notifier;
     }

-    r = vdev->binding->set_host_notifier(vdev->binding_opaque, idx, true);
+    r = virtio_set_host_notifier(vdev, idx, true);
     if (r < 0) {
         fprintf(stderr, "Error binding host notifier: %d\n", -r);
         goto fail_host_notifier;
@@ -539,7 +539,7 @@ static int vhost_virtqueue_init(struct vhost_dev *dev,
 fail_call:
 fail_kick:
-    vdev->binding->set_host_notifier(vdev->binding_opaque, idx, false);
+    virtio_set_host_notifier(vdev, idx, false);
 fail_host_notifier:
     vdev->binding->set_guest_notifier(vdev->binding_opaque, idx, false);
 fail_guest_notifier:
@@ -575,7 +575,7 @@ static void vhost_virtqueue_cleanup(struct vhost_dev *dev,
     }
     assert(r >= 0);

-    r = vdev->binding->set_host_notifier(vdev->binding_opaque, idx, false);
+    r = virtio_set_host_notifier(vdev, idx, false);
     if (r < 0) {
         fprintf(stderr, "vhost VQ %d host cleanup failed: %d\n", idx, r);
         fflush(stderr);
diff --git a/hw/virtio.c b/hw/virtio.c
index fbef788..f075b3a 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -16,6 +16,7 @@
 #include "trace.h"
 #include "virtio.h"
 #include "sysemu.h"
+#include "kvm.h"

 /* The alignment to use between consumer and producer parts of vring.
  * x86 pagesize again. */
@@ -77,6 +78,11 @@ struct VirtQueue
     VirtIODevice *vdev;
     EventNotifier guest_notifier;
     EventNotifier host_notifier;
+    enum {
+        HOST_NOTIFIER_DEASSIGNED
Re: disk image snapshot functionality
On Thu, Sep 30, 2010 at 2:49 PM, Peter Doherty dohe...@hkl.hms.harvard.edu wrote:
On Sep 30, 2010, at 04:31, Stefan Hajnoczi wrote:

Try this: http://wiki.qemu.org/download/qemu-doc.html#vm_005fsnapshots

To list the snapshots in your QCOW2 image:
qemu-img snapshot -l myimage.qcow2

To revert the disk to a saved state:
qemu-img snapshot -a snapshot-name myimage.qcow2

Stefan

Thanks, It looks like the savevm and loadvm commands are pretty close to what I'm looking for. Any idea what version of qemu that's available in, and when we might be seeing that version available in RHEL/CentOS?

The snapshot command was added in January 2009. I don't have a RHEL 5 handy for checking, sorry.

Stefan
Re: disk image snapshot functionality
On Tue, Sep 28, 2010 at 9:24 PM, Peter Doherty dohe...@hkl.hms.harvard.edu wrote: I thought I could do this with the qcow2 images. I've used: qemu-img snapshot -c snapname disk_image.qcow2 to create the snapshot. It doesn't work. The snapshots claim to be created, but if I shut down the guest, apply the snapshot ( qemu-img snapshot -a snapname disk_image.qcow2 ) the guest either: a.) no longer boots (No bootable disk found) b.) boots, but is just how it was when I shut it down (it hasn't reverted back to what it was like when the snapshot was made) It is not possible to use qemu-img on the qcow2 image file of a running guest. It might work sometimes but chances are you'll corrupt the image or end up with random behavior. Compare this to mounting a filesystem from the guest and the host at the same time - it is not safe to do this! You can take snapshots using the savevm command inside QEMU but that will pause the guest. If you use a logical volume instead of a qcow2 file then you can create an LVM snapshot. Try to ensure that the guest is in a reasonably idle disk I/O state, otherwise the snapshot may catch the guest at a bad time. On boot up the filesystem may need to perform some recovery (e.g. journal rollback). Live disk snapshots could be supported at the QEMU block layer. Once a snapshot request is issued, all following I/O requests are queued and not started yet. Once existing requests have finished (the block device is quiesced), the snapshot can be taken. When the snapshot completes, the queued requests are started and operation resumes as normal. Qcow2 is unique because it supports internal snapshots. Disk snapshots are part of the image file itself and blocks are shared between snapshots using reference counting and copy-on-write. Other image formats only support external snapshots via backing files. Examples of this are QCOW1, VMDK, and LVM (using LVM snapshot commands). 
In order to take an external snapshot you create a new image file that uses the snapshot as a backing file. New writes after the snapshot go to the new file. The old file is the snapshot and should stay unmodified in the future. Stefan
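The external snapshot flow described above can be illustrated with qemu-img. The image names here are made up for the example, and the backing-file invocation is a sketch of the technique, not a recommended production workflow:

```shell
# Create a base image (hypothetical name).
qemu-img create -f qcow2 base.qcow2 1G

# Take an "external snapshot": a new qcow2 image backed by the old one.
# After this, base.qcow2 is the snapshot and must not be modified again;
# all new guest writes go to snap.qcow2 via copy-on-write.
qemu-img create -f qcow2 -b base.qcow2 snap.qcow2

# Run the guest against the new top image, e.g.:
#   qemu-system-x86_64 -hda snap.qcow2

# Inspect the backing-file relationship.
qemu-img info snap.qcow2
```

Reverting to the snapshot amounts to discarding snap.qcow2 and creating a fresh image backed by base.qcow2.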
Re: virtio-blk XFS corruption
On Sat, Sep 25, 2010 at 2:43 PM, Peter Lieven p...@dlh.net wrote: we experience filesystem corruption using virtio-blk on some guest systems together with XFS. We still use qemu-kvm 0.12.4. [...] It seems that 64-bit Ubuntu LTS 10.04.1 is affected, as well as an older openSuse 11.1 system with kernel 2.6.27.45-0.1-pae. Surprisingly, we have an openSuse 11.1 with 2.6.27.45-0.1-default that has been working absolutely stably for months.

Affected guests: 64-bit Ubuntu LTS 10.04.1, openSuse 11.1 2.6.27.45-0.1-pae
Unaffected guests: openSuse 11.1 2.6.27.45-0.1-default
qemu-kvm version: 0.12.4
qemu-kvm command-line: ?
Disk image format: ?
Steps to reproduce: ?

Can you please provide information on the unknown items above? Does this happen with IDE (-drive file=myimage.img,if=ide) or only with virtio-blk?

The only thing I have seen in the syslog of the 64-bit Ubuntu LTS 10.04.1 is the following:

[19001.346897] Filesystem vda1: XFS internal error xfs_trans_cancel at line 1162 of file /build/buildd/linux-2.6.32/fs/xfs/xfs_trans.c. Caller 0xa013091d
[19001.346897]
[19002.174492] Pid: 1210, comm: diablo Not tainted 2.6.32-24-server #43-Ubuntu
[19002.174492] Call Trace:
[19002.174492] [a010f403] xfs_error_report+0x43/0x50 [xfs]
[19002.174492] [a013091d] ? xfs_create+0x1dd/0x5f0 [xfs]
[19002.174492] [a012bb35] xfs_trans_cancel+0xf5/0x120 [xfs]
[19002.174492] [a013091d] xfs_create+0x1dd/0x5f0 [xfs]

XFS has given up because of an error in xfs_create(), but I don't think this output gives enough information to determine what the error was.

Stefan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Growing qcow2 files during block migration ?
On Thu, Sep 23, 2010 at 3:40 PM, Christoph Adomeit christoph.adom...@gatworks.de wrote: Lets say my source machine has a qcow2 file with virtual size of 60 GB but only 2 GB are in use. So the qcow2 file only has a size of 2 GB. After block migration the resulting qcow2 file on the target machine has a size of 60 GB. Also the migration is quite slow because it seems to transfer 60 GB instead of 2 GB. Are there any workarounds, ideas/plans to optimize this ? Yes. Although this isn't currently possible in mainline QEMU it is getting attention. There have been several recent threads on QED, image streaming, and block migration. Here is one which touches on zero sectors: http://www.spinics.net/linux/fedora/libvir/msg28144.html Stefan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
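The zero-sector optimization touched on in the linked thread amounts to checking whether a block is all zeroes before putting it on the wire, so the destination can simply leave that cluster unallocated. A rough sketch of the idea (not QEMU's actual block-migration code):

```python
BLOCK_SIZE = 4096

def is_zero(buf: bytes) -> bool:
    # An all-zero block need not be transferred at all.
    return buf.count(0) == len(buf)

def blocks_to_send(image: bytes, block_size: int = BLOCK_SIZE):
    # Yield only the (offset, data) pairs that actually contain data.
    for off in range(0, len(image), block_size):
        chunk = image[off:off + block_size]
        if not is_zero(chunk):
            yield off, chunk

# A sparse 12 KB "image": only the middle block holds data.
image = bytes(4096) + b'data'.ljust(4096, b'\0') + bytes(4096)
sent = list(blocks_to_send(image))
assert len(sent) == 1 and sent[0][0] == 4096
```

With a mechanism like this, migrating the 60 GB image from the example would transfer roughly the 2 GB that is actually allocated.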
Re: Tracing KVM with Systemtap
On Wed, Sep 22, 2010 at 1:11 PM, Rayson Ho r...@redhat.com wrote: On Tue, 2010-09-21 at 14:33 +0100, Stefan Hajnoczi wrote: I will see what other probes are useful for the end users. Also, is there developer documentation for KVM? (I googled but found a lot of presentations about KVM but not a lot of info about the internals.) Not really. I suggest grabbing the source and following vl.c:main() to the main KVM execution code. I was looking for the hardware interfacing code earlier this morning -- QEMU has the hardware-specific directories (e.g. target-i386/, target-ppc/), and I was trying to understand the execution environment when the host and guest are running on the same architecture. I believe cpu_gen_code() and other related functions are what I should dig into... KVM does not generate code. Almost all the emulation code in the source tree is part of the Tiny Code Generator (TCG) used when KVM is not enabled (e.g. to emulate an ARM board on an x86-64 host). If you follow the life-cycle in vl.c it will take you through cpus.c and into kvm-all.c:kvm_cpu_exec(). Note that the details differ slightly between qemu.git and qemu-kvm.git, and I have described qemu.git. Stefan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Tracing KVM with Systemtap
On Wed, Sep 22, 2010 at 1:42 PM, Rayson Ho r...@redhat.com wrote: On Wed, 2010-09-22 at 13:33 +0100, Stefan Hajnoczi wrote: KVM does not generate code. Almost all the emulation code in the source tree is part of the Tiny Code Generator (TCG) used when KVM is not enabled (e.g. to emulate an ARM board on an x86-64 host). Thanks, that's what I thought too. Otherwise it would be really slow to run KVM :) But if KVM is not used, and the QEMU host and guest are running on the same architecture, is TCG off? (Hmm, I guess I can find that answer myself by reading the code). TCG is unused when KVM is enabled. There has been discussion about building without it for KVM-only builds, and qemu-kvm.git can do that today with a ./configure option. Stefan, are you accepting patches? If so, I will create a patch with the Systemtap framework and other probes. I am not a qemu.git or qemu-kvm.git committer but I review patches in areas that I work in, like tracing. I'll be happy to give you feedback. Stefan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Tracing KVM with Systemtap
On Tue, Sep 21, 2010 at 1:58 PM, Rayson Ho r...@redhat.com wrote: On Mon, 2010-09-20 at 14:36 +0100, Stefan Hajnoczi wrote: Right now there are few pre-defined probes (trace events in QEMU tracing speak). As I develop I try to be mindful of new ones I create and whether they would be generally useful. I intend to contribute more probes and hope others will too! I am still looking at/hacking the QEMU code. I have looked at the following places in the code that I think could be useful for gathering statistics:

net.c: qemu_deliver_packet(), etc. - network statistics

Yes.

CPU Arch/op_helper.c: global_cpu_lock(), tlb_fill() - lock/unlock and TLB refill statistics

These are not relevant to KVM; they are only used when running with KVM disabled (TCG mode).

balloon.c, hw/virtio-balloon.c - ballooning information

Prerna added a balloon event which is in qemu.git trace-events. Does that one do what you need?

I will see what other probes are useful for the end users. Also, is there developer documentation for KVM? (I googled but found a lot of presentations about KVM but not a lot of info about the internals.)

Not really. I suggest grabbing the source and following vl.c:main() to the main KVM execution code. Stefan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Tracing KVM with Systemtap
On Mon, Sep 20, 2010 at 2:19 PM, Rayson Ho r...@redhat.com wrote: On Wed, 2010-09-08 at 15:08 +0100, Stefan Hajnoczi wrote: Hi Rayson, For the KVM kernel module Linux trace events are already used. For example, see arch/x86/kvm/trace.h and check out /sys/kernel/debug/tracing/events/kvm/*. There is a set of useful static trace points for vm_exit/vm_enter, pio, mmio, etc. For the KVM guest there is perf-kvm(1). This allows perf(1) to look up addresses inside the guest (kernel only?). It produces system-wide performance profiles including guests. Perhaps someone can comment on perf-kvm's full feature set and limitations? For QEMU userspace Prerna Saxena and I are proposing a static tracing patchset. It abstracts the trace backend (SystemTap, LTTng UST, DTrace, etc) from the actual tracepoints so that portability can be achieved. There is a built-in trace backend that has a basic feature set but isn't as fancy as SystemTap. I have implemented LTTng Userspace Tracer support, perhaps you'd like to add SystemTap/DTrace support with sdt.h? Thanks, Stefan, for the reply! I've looked at the tracing additions in QEMU, including the Simple trace backend (simpletrace.c) and the tracetool script, and I think the SystemTap version can be implemented in a straightforward way. One thing I was wondering: there don't seem to be many probes (except the examples?) in the QEMU code. Are we expected to see more probes in the next release, or will this work be a long-term project that will not be added to the official QEMU code in the near future? (I believe if we can get the tracing framework integrated, then specific probes can be added on-demand -- but of course that is just my own opinion :-D ) Right now there are few pre-defined probes (trace events in QEMU tracing speak). As I develop I try to be mindful of new ones I create and whether they would be generally useful. I intend to contribute more probes and hope others will too! Prerna is also looking at adding useful probes. 
Stefan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Virtio with Debian GNU/Linux Etch
On Fri, Sep 17, 2010 at 5:56 PM, Daniel Bareiro daniel-lis...@gmx.net wrote: I have some installations with Debian GNU/Linux Etch I'm migrating to KVM. I just installed a kernel 2.6.26 from backports to use Virtio. But when I try to boot the operating system, it cannot find the vd* device to mount the root filesystem. I made sure to change /etc/fstab to use the vd* partitions and to point the 'root' GRUB parameter at the corresponding vd partition. Just in case it were not loading the module, I added 'virtioi-blk' in /etc/modules, but I keep getting the same problem. What could be the problem?

1. Check that your VM has the virtio PCI adapters present. You can do this by running lspci from a livecd/installer or grep 1af4 /proc/bus/pci/devices. Look for PCI adapters with vendor ID 1af4 (Red Hat).

2. Check that the virtio kernel modules are loaded by your initramfs. Either check the kernel messages as it boots for virtio-pci or vd*-related messages, or get a debug shell in the initramfs and grep virtio-pci /proc/bus/pci/devices.

3. Did you really mean 'virtioi-blk' in /etc/modules? The module is named virtio_blk. You can gunzip and cpio-extract the initramfs to check that virtio, virtio_pci, virtio_ring, and virtio_blk are present.

Stefan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
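The grep 1af4 /proc/bus/pci/devices check works because that file packs the vendor and device IDs into a single hex field (vendor in the high 16 bits, device in the low 16 bits). A small decoder, using a made-up sample value for illustration:

```python
def decode_vendor_device(field: str):
    """Split the combined vendor/device hex field from
    /proc/bus/pci/devices into (vendor_id, device_id)."""
    value = int(field, 16)
    return value >> 16, value & 0xFFFF

# 0x1af4 is the Red Hat/virtio vendor ID; 0x1001 is the
# virtio-blk device ID, so "1af41001" is a virtio disk.
vendor, device = decode_vendor_device("1af41001")
assert vendor == 0x1af4
assert device == 0x1001
```

So any line in /proc/bus/pci/devices whose second field starts with 1af4 is a virtio device, which is exactly what the grep is looking for.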
Re: .img on nfs, relative on ram, consuming mass ram
2010/9/16 Andre Przywara andre.przyw...@amd.com: TOURNIER Frédéric wrote: Ok, thanks for taking time. I'll dig into your answers. So as I run relative.img on diskless systems with original.img on nfs, what are the best practices/tips I can use? I think it is -snapshot you are looking for. This will put the backing store into normal RAM, and you can later commit it to the original image if needed. See the qemu manpage for more details. In a nutshell you just specify the original image and add -snapshot to the command line. -snapshot creates a temporary qcow2 image in /tmp whose backing file is your original image. I'm not sure what you mean by "This will put the backing store into normal RAM"? Stefan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
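For reference, the -snapshot workflow discussed above looks roughly like this (the image path is hypothetical); the commit monitor command is what writes the accumulated changes back to the original image:

```
# Boot from the NFS image; all writes go to a throwaway
# qcow2 overlay that QEMU creates in /tmp:
qemu-kvm -drive file=/nfs/original.img -snapshot

# Later, in the QEMU monitor, flush the changes back
# to the original image (only if you want to keep them):
(qemu) commit all
```

If you never run commit, the temporary overlay is simply discarded when QEMU exits and original.img is left untouched.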
Re: [PATCH] [RFC] Add support for a USB audio device model
On Fri, Sep 10, 2010 at 10:47 PM, H. Peter Anvin h...@linux.intel.com wrote:

diff --git a/hw/usb-audio.c b/hw/usb-audio.c
new file mode 100644
index 000..d4cf488
--- /dev/null
+++ b/hw/usb-audio.c
@@ -0,0 +1,702 @@
+/*
+ * QEMU USB Net devices
+ *
+ * Copyright (c) 2006 Thomas Sailer
+ * Copyright (c) 2008 Andrzej Zaborowski

Want to update this for usb-audio?

Stefan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Tracing KVM with Systemtap
On Wed, Sep 8, 2010 at 2:20 PM, Rayson Ho r...@redhat.com wrote: Hi all, I am a developer of Systemtap. I am looking into tracing KVM (the kernel part and QEMU) and also the KVM guests with Systemtap. I googled and found references to Xenprobes and xdt+dtrace, and I was wondering if someone is working on the dynamic tracing interface for KVM? I've read the KVM kernel code and I think some expensive operations (things that need to be trapped back to the host kernel - eg. loading of control registers on x86/x64) can be interesting spots for adding an SDT (static marker), and I/O operations performed for the guests can be useful information to collect. I know that KVM guests run like a userspace process and thus techniques for tracing Xen might be overkill, and also gdb can be used to trace KVM guests. However, is there anything special I need to be aware of before I go further into the development of the Systemtap KVM probes? (Opinions / Suggestions / Criticisms welcome!) Hi Rayson, For the KVM kernel module Linux trace events are already used. For example, see arch/x86/kvm/trace.h and check out /sys/kernel/debug/tracing/events/kvm/*. There is a set of useful static trace points for vm_exit/vm_enter, pio, mmio, etc. For the KVM guest there is perf-kvm(1). This allows perf(1) to look up addresses inside the guest (kernel only?). It produces system-wide performance profiles including guests. Perhaps someone can comment on perf-kvm's full feature set and limitations? For QEMU userspace Prerna Saxena and I are proposing a static tracing patchset. It abstracts the trace backend (SystemTap, LTTng UST, DTrace, etc) from the actual tracepoints so that portability can be achieved. There is a built-in trace backend that has a basic feature set but isn't as fancy as SystemTap. I have implemented LTTng Userspace Tracer support, perhaps you'd like to add SystemTap/DTrace support with sdt.h? 
http://www.mail-archive.com/qemu-de...@nongnu.org/msg41323.html http://repo.or.cz/w/qemu/stefanha.git/shortlog/refs/heads/tracing_v3 Stefan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/5] virtio-net: Switch default to new bottom half TX handler for iothread
On Tue, Aug 31, 2010 at 11:32 PM, Alex Williamson alex.william...@redhat.com wrote: On Tue, 2010-08-31 at 23:25 +0300, Michael S. Tsirkin wrote: On Fri, Aug 27, 2010 at 04:37:45PM -0600, Alex Williamson wrote: The bottom half handler shows big improvements over the timer with few downsides; default to it when the iothread is enabled. Using the following tests, with the guest and host connected via tap+bridge:

guest: netperf -t TCP_STREAM -H $HOST
host: netperf -t TCP_STREAM -H $GUEST
guest: netperf -t UDP_STREAM -H $HOST
host: netperf -t UDP_STREAM -H $GUEST
guest: netperf -t TCP_RR -H $HOST

Results: base throughput, exits/throughput - patched throughput, exits/throughput

--enable-io-thread
TCP guest-host: 2737.77, 47.82 - 6767.09, 29.15 = 247%, 61%
TCP host-guest: 2231.33, 74.00 - 4125.80, 67.61 = 185%, 91%
UDP guest-host: 6281.68, 14.66 - 12569.27, 1.98 = 200%, 14%
UDP host-guest: 275.91, 289.22 - 264.80, 293.53 = 96%, 101%
iterations/s: 1949.65, 82.97 - 7417.56, 84.31 = 380%, 102%

No --enable-io-thread
TCP guest-host: 3041.57, 55.11 - 1038.93, 517.57 = 34%, 939%
TCP host-guest: 2416.03, 76.67 - 5655.92, 55.52 = 234%, 72%
UDP guest-host: 12255.82, 6.11 - 7775.87, 31.32 = 63%, 513%
UDP host-guest: 587.92, 245.95 - 611.88, 239.92 = 104%, 98%
iterations/s: 1975.59, 83.21 - 8935.50, 88.18 = 452%, 106%

Signed-off-by: Alex Williamson alex.william...@redhat.com

A parameter having different settings based on config options might surprise some users. I don't think we really need a parameter here ...

I'm not a big fan of this either, but I'd also prefer not to introduce a regression for a performance difference we know about in advance. It gets even more complicated when we factor in qemu-kvm, as it doesn't build with iothread enabled, but seems to get an even better boost in performance across the board, thanks largely to the kvm-irqchip. Should we instead make this a configure option? --enable-virtio-net-txbh? Thanks, Alex

qemu-kvm uses its own iothread implementation by default. 
It doesn't need --enable-io-thread because it already uses a similar model.

Stefan

 hw/s390-virtio-bus.c |    3 ++-
 hw/syborg_virtio.c   |    3 ++-
 hw/virtio-pci.c      |    3 ++-
 hw/virtio.h          |    6 ++++++
 4 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/hw/s390-virtio-bus.c b/hw/s390-virtio-bus.c
index 1483362..985f99a 100644
--- a/hw/s390-virtio-bus.c
+++ b/hw/s390-virtio-bus.c
@@ -328,7 +328,8 @@ static VirtIOS390DeviceInfo s390_virtio_net = {
     .qdev.size = sizeof(VirtIOS390Device),
     .qdev.props = (Property[]) {
         DEFINE_NIC_PROPERTIES(VirtIOS390Device, nic),
-        DEFINE_PROP_UINT32("txtimer", VirtIOS390Device, txtimer, 1),
+        DEFINE_PROP_UINT32("txtimer", VirtIOS390Device, txtimer,
+                           TXTIMER_DEFAULT),
         DEFINE_PROP_INT32("txburst", VirtIOS390Device, txburst, 256),
         DEFINE_PROP_END_OF_LIST(),
     },
diff --git a/hw/syborg_virtio.c b/hw/syborg_virtio.c
index 7b76972..ee5746d 100644
--- a/hw/syborg_virtio.c
+++ b/hw/syborg_virtio.c
@@ -300,7 +300,8 @@ static SysBusDeviceInfo syborg_virtio_net_info = {
     .qdev.props = (Property[]) {
         DEFINE_NIC_PROPERTIES(SyborgVirtIOProxy, nic),
         DEFINE_VIRTIO_NET_FEATURES(SyborgVirtIOProxy, host_features),
-        DEFINE_PROP_UINT32("txtimer", SyborgVirtIOProxy, txtimer, 1),
+        DEFINE_PROP_UINT32("txtimer", SyborgVirtIOProxy, txtimer,
+                           TXTIMER_DEFAULT),
         DEFINE_PROP_INT32("txburst", SyborgVirtIOProxy, txburst, 256),
         DEFINE_PROP_END_OF_LIST(),
     }
diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index e025c09..9740f57 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -695,7 +695,8 @@ static PCIDeviceInfo virtio_info[] = {
         DEFINE_PROP_UINT32("vectors", VirtIOPCIProxy, nvectors, 3),
         DEFINE_VIRTIO_NET_FEATURES(VirtIOPCIProxy, host_features),
         DEFINE_NIC_PROPERTIES(VirtIOPCIProxy, nic),
-        DEFINE_PROP_UINT32("txtimer", VirtIOPCIProxy, txtimer, 1),
+        DEFINE_PROP_UINT32("txtimer", VirtIOPCIProxy, txtimer,
+                           TXTIMER_DEFAULT),
         DEFINE_PROP_INT32("txburst", VirtIOPCIProxy, txburst, 256),
         DEFINE_PROP_END_OF_LIST(),
     },
diff --git a/hw/virtio.h b/hw/virtio.h
index 4051889..a1a17a2 100644
--- a/hw/virtio.h
+++ b/hw/virtio.h
@@ -183,6 +183,12 @@ void virtio_update_irq(VirtIODevice *vdev);
 void virtio_bind_device(VirtIODevice *vdev, const VirtIOBindings *binding,
                         void *opaque);
 
+#ifdef CONFIG_IOTHREAD
+ #define TXTIMER_DEFAULT 0
+#else
+ #define TXTIMER_DEFAULT 1
+#endif
+

Add a comment explaining that this is just a performance optimization?

 /* Base devices. */
 VirtIODevice
Re: Tracing KVM with LTTng
On Fri, Aug 6, 2010 at 4:42 PM, Julien Desfossez j...@klipix.org wrote: On [2] you can see a closer look at the state of the kvm threads (blue means syscall, red means running in VM mode (vm_entry), dark yellow means waiting for CPU). This is a great visualization. It shows the state of the entire system in a clear way and makes the wait times easier to understand. In the coming days, I will send my patches to the official LTTng git. My next step is to synchronise the traces collected from the host with the traces collected from the guest (by finding an efficient way to share the TSC_OFFSET) to have some info about what is happening during the time the VM has control. I'd like to try this patch out, will be checking the LTTng mailing list. The reason I post these screenshots now is that I will be at Linuxcon next week, and I would really appreciate some feedback and ideas for future improvements from the KVM community. So if you are interested, contact me directly and if you are there we'll try to meet. I'd like to find out more about what you're doing. Unfortunately I will not be at LinuxCon/KVM Forum this year. Prerna Saxena and I have been working on static trace events for QEMU userspace. Here is the commit to add LTTng Userspace Tracer support: http://repo.or.cz/w/qemu/stefanha.git/commitdiff/5560c202f4c5cc37692d35e53f784c74d65c Using the tracing branch above you can place static trace events in QEMU userspace and collect the trace with LTTng. See the trace-events file for sample trace event definitions. Stefan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
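For anyone trying the tracing branch mentioned above: declarations in the trace-events file are one-liners pairing a trace point name and its arguments with a printf-style format string, roughly like this (the event names here are illustrative examples, not necessarily the ones in the tree):

```
# Each line declares one static trace event; the tracetool script
# generates the per-backend glue (simple, ust, dtrace/systemtap, ...).
qemu_vmalloc(size_t size, void *ptr) "size %zu ptr %p"
virtio_queue_notify(void *vq) "vq %p"
```

Calls to the generated trace_* functions are then placed at the corresponding points in the QEMU source, and the chosen backend decides how the events are recorded.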