Re: Slow disk IO on virtio kvm guests with Centos 5.5 as hypervisor

2011-02-16 Thread Stefan Hajnoczi
On Wed, Feb 16, 2011 at 12:50 PM, Thomas Broda tho...@bassfimass.de wrote:
 On Tue, 15 Feb 2011 15:50:00 +, Stefan Hajnoczi
 stefa...@gmail.com wrote:

 On Tue, Feb 15, 2011 at 10:15 AM, Thomas Broda tho...@bassfimass.de
 wrote:
 Using O_DIRECT, performance went down to 11 MB/s on the hypervisor...


 Hmm...can you restate that as:

 host  X MB/s
 guest Y MB/s

 Trying dd with oflag=direct and of=/dev/vg0/lvtest (directly on the
 KVM hypervisor) yielded a result of 11MB/s.

 If I try this on the guest with /dev/vda1 as output device, results are
 between 1.9MB/s and 7.7MB/s, usually around 3.5MB/s.

 To sum it up:

 Host: 11 MB/s
 Guest: 3.5 MB/s

 I've checked the RAID controller in the meantime. It's a HP Smart Array
 P400. Write Caching is switched off since the controller has no BBU
 (yet).

 Could it be related to this?

The disabled write cache will result in slow writes, so your host
benchmark result is low on an absolute scale.  However, the relative
guest/host performance is very poor here (3.5/11 ≈ 32%).

A number of performance improvements have been made to KVM that CentOS
5.5 does not contain because it is too old.  If you want to see a more
current picture of KVM performance, you could try a Fedora 14 host and
guest.  The components that matter are the host kernel, qemu-kvm
userspace, and guest kernel.

Stefan


Re: Slow disk IO on virtio kvm guests with Centos 5.5 as hypervisor

2011-02-15 Thread Stefan Hajnoczi
On Mon, Feb 14, 2011 at 6:15 PM, Thomas Broda tho...@bassfimass.de wrote:
 dd'ing /dev/zero to a testfile gives me a throughput of about 400MB/s when
 done directly on the hypervisor. If I try this from within a virtual guest,
 it's only 19MB/s to 24MB/s if the guest is on the LVM volume (raw device,
 not qcow2 or something, no filesystem on top of the LVM).

Did you run dd with O_DIRECT?

dd if=/dev/zero of=path-to-device oflag=direct bs=64k

You need to do this in order to exercise the disk and eliminate page
cache effects.

Also, you are using oldish KVM packages.  You could try a modern
kernel and KVM userspace.

Stefan


Re: Slow disk IO on virtio kvm guests with Centos 5.5 as hypervisor

2011-02-15 Thread Stefan Hajnoczi
On Tue, Feb 15, 2011 at 10:15 AM, Thomas Broda tho...@bassfimass.de wrote:
 On Tue, 15 Feb 2011 09:19:23 +, Stefan Hajnoczi
 stefa...@gmail.com wrote:

 Did you run dd with O_DIRECT?

 dd if=/dev/zero of=path-to-device oflag=direct bs=64k

 Using O_DIRECT, performance went down to 11 MB/s on the hypervisor...

Hmm...can you restate that as:

host  X MB/s
guest Y MB/s

I can't tell from your answer which value is which.

Stefan


Re: [Qemu-devel] KVM call agenda for Feb 15

2011-02-15 Thread Stefan Hajnoczi
On Mon, Feb 14, 2011 at 10:18 PM, Anthony Liguori anth...@codemonkey.ws wrote:
 On 02/14/2011 11:56 AM, Chris Wright wrote:

 Please send in any agenda items you are interested in covering.


 -rc2 is tagged and waiting for announcement.  Please take a look at -rc2 and
 make sure there is nothing critical missing.  Will tag 0.14.0 very late
 tomorrow but unless there's something critical, it'll be 0.14.0-rc2 with an
 updated version.

Most of my -rc2 testing is done now and has passed.

http://wiki.qemu.org/Planning/0.14/Testing

Stefan


Re: KVM call agenda for Feb 8

2011-02-08 Thread Stefan Hajnoczi
On Mon, Feb 7, 2011 at 10:40 PM, Chris Wright chr...@redhat.com wrote:
 Please send in any agenda items you are interested in covering.

Automated builds and testing: maintainer trees, integrating
KVM-Autotest, and QEMU tests we need but don't exist

Stefan


Re: KVM call minutes for Feb 8

2011-02-08 Thread Stefan Hajnoczi
On Tue, Feb 8, 2011 at 3:55 PM, Chris Wright chr...@redhat.com wrote:
 Automated builds and testing
 - found broken 32-bit

The broken build was found (and fixed?) before automated qemu.git
builds were in place.  It's a good motivator though.

Stefan


Re: Network performance with small packets

2011-02-08 Thread Stefan Hajnoczi
On Wed, Feb 9, 2011 at 1:55 AM, Michael S. Tsirkin m...@redhat.com wrote:
 On Wed, Feb 09, 2011 at 12:09:35PM +1030, Rusty Russell wrote:
 On Wed, 9 Feb 2011 11:23:45 am Michael S. Tsirkin wrote:
  On Wed, Feb 09, 2011 at 11:07:20AM +1030, Rusty Russell wrote:
   On Wed, 2 Feb 2011 03:12:22 pm Michael S. Tsirkin wrote:
On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote:
  Michael S. Tsirkin m...@redhat.com 02/02/2011 03:11 AM
 
  On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote:
   On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote:
Confused. We compare capacity to skb frags, no?
That's sg I think ...
  
   Current guest kernel use indirect buffers, num_free returns how 
   many
   available descriptors not skb frags. So it's wrong here.
  
   Shirley
 
  I see. Good point. In other words when we complete the buffer
  it was indirect, but when we add a new one we
  can not allocate indirect so we consume.
  And then we start the queue and add will fail.
  I guess we need some kind of API to figure out
  whether the buf we complete was indirect?
  
   I've finally read this thread... I think we need to get more serious
   with our stats gathering to diagnose these kind of performance issues.
  
   This is a start; it should tell us what is actually happening to the
   virtio ring(s) without significant performance impact...
  
   Subject: virtio: CONFIG_VIRTIO_STATS
  
   For performance problems we'd like to know exactly what the ring looks
   like.  This patch adds stats indexed by how-full-ring-is; we could extend
   it to also record them by how-used-ring-is if we need.
  
   Signed-off-by: Rusty Russell ru...@rustcorp.com.au
 
  Not sure whether the intent is to merge this. If yes -
  would it make sense to use tracing for this instead?
  That's what kvm does.

 Intent wasn't; I've not used tracepoints before, but maybe we should
 consider a longer-term monitoring solution?

 Patch welcome!

 Cheers,
 Rusty.

 Sure, I'll look into this.

There are several virtio trace events already in QEMU today (see the
trace-events file):
virtqueue_fill(void *vq, const void *elem, unsigned int len, unsigned int idx) "vq %p elem %p len %u idx %u"
virtqueue_flush(void *vq, unsigned int count) "vq %p count %u"
virtqueue_pop(void *vq, void *elem, unsigned int in_num, unsigned int out_num) "vq %p elem %p in_num %u out_num %u"
virtio_queue_notify(void *vdev, int n, void *vq) "vdev %p n %d vq %p"
virtio_irq(void *vq) "vq %p"
virtio_notify(void *vdev, void *vq) "vdev %p vq %p"

These can be used by building QEMU with a suitable tracing backend
like SystemTap (see docs/tracing.txt).
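
As a rough illustration (a sketch, not the exact hw/virtio.c call sites),
device emulation code calls the trace_*() wrappers that tracetool
generates from those declarations:

/* Sketch only: invoking a generated trace point from device code.  The
 * surrounding function is simplified, not the real QEMU source. */
#include "trace.h"

static void virtqueue_fill_example(VirtQueue *vq, const VirtQueueElement *elem,
                                   unsigned int len, unsigned int idx)
{
    trace_virtqueue_fill(vq, elem, len, idx); /* no-op unless a backend is enabled */
    /* ... copy data back to the guest and update the used ring ... */
}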

Inside the guest I've used dynamic ftrace in the past, although static
tracepoints would be nice.

Stefan


Re: [Qemu-devel] Re: [PATCH v3 14/22] kvm: Fix race between timer signals and vcpu entry under !IOTHREAD

2011-01-31 Thread Stefan Hajnoczi
On Mon, Jan 31, 2011 at 11:27 AM, Jan Kiszka jan.kis...@siemens.com wrote:
 On 2011-01-31 11:03, Avi Kivity wrote:
 On 01/27/2011 04:33 PM, Jan Kiszka wrote:
 Found by Stefan Hajnoczi: There is a race in kvm_cpu_exec between
 checking for exit_request on vcpu entry and timer signals arriving
 before KVM starts to catch them. Plug it by blocking both timer related
 signals also on !CONFIG_IOTHREAD and process those via signalfd.

 As this fix depends on real signalfd support (otherwise the timer
 signals only kick the compat helper thread, and the main thread hangs),
 we need to detect the invalid constellation and abort configure.

 Signed-off-by: Jan Kiszkajan.kis...@siemens.com
 CC: Stefan Hajnoczistefa...@linux.vnet.ibm.com
 ---

 I don't want to invest that much into !IOTHREAD anymore, so let's see if
 the proposed catchabort is acceptable.


 I don't understand the dependency on signalfd.  The normal way of doing
 things, either waiting for the signal in sigtimedwait() or in
 ioctl(KVM_RUN), works with SIGALRM just fine.

 And how would you be kicked out of the select() call if it is waiting
 with a timeout? We only have a single thread here.

 The only alternative is Stefan's original proposal. But that required
 fiddling with the signal mask twice per KVM_RUN.

I think my original patch messed with the sigmask in the wrong place;
as you mentioned, doing it twice per KVM_RUN isn't a good idea.  I
wonder if we can enable SIGALRM only in blocking calls and guest code
execution but without signalfd.  It might be possible; I don't see an
immediate problem with doing that, although we might have to use
pselect(2) or similar in a few places.
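
A minimal sketch of that idea (plain POSIX, not QEMU code): keep SIGALRM
blocked normally and let pselect(2) atomically unblock it only while
waiting:

/* Sketch, assuming SIGALRM is blocked in the normal signal mask: pselect()
 * swaps in a mask with SIGALRM unblocked for the duration of the wait, so
 * the signal can only interrupt us while we are sleeping. */
#include <signal.h>
#include <sys/select.h>

static int wait_for_fd(int fd)
{
    sigset_t blocked, during_wait;
    fd_set rfds;

    sigprocmask(SIG_SETMASK, NULL, &blocked);  /* fetch current mask */
    during_wait = blocked;
    sigdelset(&during_wait, SIGALRM);          /* allow SIGALRM only in pselect() */

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);

    /* returns -1 with errno == EINTR if SIGALRM arrives during the wait */
    return pselect(fd + 1, &rfds, NULL, NULL, NULL, &during_wait);
}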

Stefan


Re: [Qemu-devel] Re: [PATCH v3 14/22] kvm: Fix race between timer signals and vcpu entry under !IOTHREAD

2011-01-31 Thread Stefan Hajnoczi
On Mon, Jan 31, 2011 at 12:18 PM, Jan Kiszka jan.kis...@siemens.com wrote:
 On 2011-01-31 13:13, Stefan Hajnoczi wrote:
 On Mon, Jan 31, 2011 at 11:27 AM, Jan Kiszka jan.kis...@siemens.com wrote:
 On 2011-01-31 11:03, Avi Kivity wrote:
 On 01/27/2011 04:33 PM, Jan Kiszka wrote:
 Found by Stefan Hajnoczi: There is a race in kvm_cpu_exec between
 checking for exit_request on vcpu entry and timer signals arriving
 before KVM starts to catch them. Plug it by blocking both timer related
 signals also on !CONFIG_IOTHREAD and process those via signalfd.

 As this fix depends on real signalfd support (otherwise the timer
 signals only kick the compat helper thread, and the main thread hangs),
 we need to detect the invalid constellation and abort configure.

 Signed-off-by: Jan Kiszkajan.kis...@siemens.com
 CC: Stefan Hajnoczistefa...@linux.vnet.ibm.com
 ---

 I don't want to invest that much into !IOTHREAD anymore, so let's see if
 the proposed catchabort is acceptable.


 I don't understand the dependency on signalfd.  The normal way of doing
 things, either waiting for the signal in sigtimedwait() or in
 ioctl(KVM_RUN), works with SIGALRM just fine.

 And how would you be kicked out of the select() call if it is waiting
 with a timeout? We only have a single thread here.

 The only alternative is Stefan's original proposal. But that required
 fiddling with the signal mask twice per KVM_RUN.

 I think my original patch messed with the sigmask in the wrong place;
 as you mentioned, doing it twice per KVM_RUN isn't a good idea.  I
 wonder if we can enable SIGALRM only in blocking calls and guest code
 execution but without signalfd.  It might be possible; I don't see an
 immediate problem with doing that, although we might have to use
 pselect(2) or similar in a few places.

 My main concern about alternative approaches is that IOTHREAD is about
 to become the default, and hardly anyone (of the few upstream KVM users)
 will run without it in the foreseeable future. The next step will be the
 removal of any !CONFIG_IOTHREAD section. So, how much do we want to
 invest here (provided my proposal has not remaining issues)?

Yes, you're right.  I'm not volunteering to dig more into this; the
best case would be to switch to a non-I/O thread world that works for
everybody.

Stefan


Re: Zero-copy block driver?

2011-01-29 Thread Stefan Hajnoczi
2011/1/29 Darko Petrović darko.b.petro...@gmail.com:
 Could you please tell me if it is possible to use a block driver that
 completely avoids the guest kernel and copies block data directly to/from
 the given buffer in the guest userspace?
 If yes, how to activate it? If not... why not? :)

Inside the guest, open files using the O_DIRECT flag.  This tells the
guest kernel to avoid the page cache when possible, enabling
zero-copy.  You need to use aligned memory buffers and perform I/O in
multiples of the block size.

See the open(2) man page for details.  Make sure you really want to do
this; most applications don't.
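
For reference, a minimal sketch of that inside the guest (512 bytes is an
assumed logical block size; query the real value in practice):

/* Sketch: O_DIRECT read with an aligned buffer and a length that is a
 * multiple of the block size.  Error handling is minimal. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK_SIZE 512

int read_direct(const char *path)
{
    void *buf;
    int fd;
    ssize_t ret;

    if (posix_memalign(&buf, BLOCK_SIZE, BLOCK_SIZE * 8) != 0) {
        return -1;
    }

    fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) {
        free(buf);
        return -1;
    }

    ret = read(fd, buf, BLOCK_SIZE * 8);  /* aligned length and buffer */

    close(fd);
    free(buf);
    return ret < 0 ? -1 : 0;
}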

Stefan


Re: Zero-copy block driver?

2011-01-29 Thread Stefan Hajnoczi
2011/1/29 Darko Petrović darko.b.petro...@gmail.com:
 Thanks for your help. Actually, I am more interested in doing it from the
 outside, if possible (I am not allowed to change the application code). Can
 the guest be tricked by KVM somehow, using the appropriate drivers? Just to
 clear it out, copying to/from a host buffer is fine, I just want to avoid
 having guest buffers.

Not really.  If the application is designed to use the page cache then
it will use it.

You might want to look at unmapped page cache control which is not in
mainline Linux yet:
http://lwn.net/Articles/419713/

Stefan


Re: Qemu-img create problem

2011-01-28 Thread Stefan Hajnoczi
On Fri, Jan 28, 2011 at 1:13 PM, Himanshu Chauhan
hschau...@nulltrace.org wrote:
 I just cloned qemu-kvm, built and installed it. But the qemu-img fails
 to create any disk image above 1G. The problem as I see is use of
 ssize_t for image size. When size is 2G, the check if (sval < 0)
 succeeds and I get the error:

This is fixed in qemu.git commit 70b4f4bb05ff5e6812c6593eeefbd19bd61b517d
"Make strtosz() return int64_t instead of ssize_t".
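
For illustration only (a hypothetical standalone program, not the qemu-img
code), the overflow looks like this on a host where ssize_t is 32 bits:

/* On a 32-bit ssize_t, 2G wraps to a negative value on typical two's
 * complement systems, so a "sval < 0" sanity check rejects a valid size.
 * Returning int64_t avoids the wrap. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t requested = 2ULL * 1024 * 1024 * 1024;  /* 2G */
    int32_t sval = (int32_t)requested;               /* 32-bit ssize_t stand-in */
    int64_t val64 = (int64_t)requested;              /* after the fix */

    printf("32-bit: %d (negative, rejected)\n", sval);
    printf("64-bit: %lld (accepted)\n", (long long)val64);
    return 0;
}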

Stefan


Re: [Qemu-devel] Re: KVM call agenda for Jan 25

2011-01-25 Thread Stefan Hajnoczi
On Tue, Jan 25, 2011 at 2:02 PM, Luiz Capitulino lcapitul...@redhat.com wrote:
  - Google summer of code 2011 is on, are we interested? (note: I just saw the
   news, I don't have any information yet)

http://www.google-melange.com/document/show/gsoc_program/google/gsoc2011/timeline

I'd like to see an in-place QCOW2 <-> QED image converter with tests.
I'm interested in mentoring this year.

Stefan


Re: KVM call agenda for Jan 25

2011-01-25 Thread Stefan Hajnoczi
On Tue, Jan 25, 2011 at 2:26 PM, Avi Kivity a...@redhat.com wrote:
 On 01/25/2011 12:06 AM, Anthony Liguori wrote:

 On 01/24/2011 07:25 AM, Chris Wright wrote:

 Please send in any agenda items you are interested in covering.

 - coroutines for the block layer

 I have a perpetually in progress branch for this, and would very much like
 to see this done.

Seen this?
http://repo.or.cz/w/qemu/stefanha.git/commit/8179e8ff20bb3f14f361109afe5b3bf2bac24f0d
http://repo.or.cz/w/qemu/stefanha.git/shortlog/8179e8ff20bb3f14f361109afe5b3bf2bac24f0d

And the qemu-devel thread:
http://www.mail-archive.com/qemu-devel@nongnu.org/msg52522.html

Stefan


Re: ATA Trim for qcow(2)

2011-01-23 Thread Stefan Hajnoczi
On Sun, Jan 23, 2011 at 9:35 PM, Emil Langrock emil.langr...@gmx.de wrote:
 there is support for ext4 to use the trim ATA command when a block is freed. I
 read that there should be an extra command which does that freeing afterwards.
 So is it possible to use that information inside the qcow to mark those
 sectors as free? This would make it possible to shrink the size of an image
 significantly using some offline (maybe also some online) tools.

There is currently no TRIM support in qcow2.  Christoph Hellwig
recently added TRIM support to raw images on an XFS host file system.
In the future we'll see wider support.

Stefan


Re: [Qemu-devel] Re: KVM call agenda for Jan 11

2011-01-10 Thread Stefan Hajnoczi
On Mon, Jan 10, 2011 at 1:05 PM, Jes Sorensen jes.soren...@redhat.com wrote:
 On 01/10/11 12:59, Juan Quintela wrote:
 Juan Quintela quint...@redhat.com wrote:

 Now sent it to the right kvm list.  Sorry for the second sent.

 Please send any agenda items you are interested in covering.

 - KVM Forum 2011 (Jes).

 Just to add a bit more background. Last year we discussed the issue of
 whether to aim for a KVM Forum in the same style as we had in 2010, or
 whether to try to aim for a broader multi-track Virtualization
 conference that covers the whole stack.

 Linux Foundation is happy to help host such an event, but they are
 asking for what our plans are. I posted a mock-proposal for tracks here:
 http://www.linux-kvm.org/page/KVM_Forum_2011

I thought having both KVM and Xen people at Linux Plumbers 2010 worked
out well.  Doing that with libvirt, OpenStack, etc has a lot of
potential.

Stefan


Re: qemu-kvm-0.13.0 - winsows 2008 - chkdisk too slow

2011-01-06 Thread Stefan Hajnoczi
On Thu, Jan 6, 2011 at 7:48 AM, Nikola Ciprich extmaill...@linuxbox.cz wrote:
 So windows started checking disk integrity, but the problem is, that
 it's waaay too slow - after ~12 hours, it's still running and seeems
 like it'll take ages to finish.

Please post your KVM command-line.

Have you run storage benchmarks on the host to check what sort of
maximum I/O performance you can expect?  Do you have a RAID setup
underneath LVM?

Stefan


Re: FIXED: Re: [Qemu-devel] possible regression in qemu-kvm 0.13.0 (memtest)

2011-01-06 Thread Stefan Hajnoczi
On Wed, Jan 5, 2011 at 5:01 PM, Serge E. Hallyn se...@hallyn.com wrote:
 I don't see this patch in the git tree, nor a revert of the buggy
 commit.  Was any decision made on this?

Blue Swirl posted a patch a few days ago:
[PATCH] pc: move port 92 stuff back to pc.c from pckbd.c

It hasn't been merged yet but I don't see any objections to it on the
email thread.  Perhaps he's just busy.

Stefan


Re: Disk activity issue...

2011-01-04 Thread Stefan Hajnoczi
On Thu, Dec 30, 2010 at 11:25 PM, Erich Weiler bitscrub...@gmail.com wrote:
 I've got this issue that I've been banging my head against a wall for a
 while over and I think another pair of eyes may help, if anyone have a
 moment.  We have this new-ish KVM VM server (with the latest CentOS 5.5
 updates, kmod-kvm-83-164.el5_5.25) that houses 3 VMs.  It works mostly as
 expected except it has a very high load all the time, like 40-60, when the
 VMs are running.  I suspect it has to do with memory management, because
 when all 3 VMs are online, they should consume 5GB RAM on the VM server and
 they only consume like 2GB, so I think the rest of the RAM is swapping or
 something, because the disks are spinning at 100% all the time (even when
 the VMs are doing nothing).  Although, the VM server does not report any
 swapping happening. When I shut down the VMs one by one, the load drops and
 so does the disk activity.  I don't think I set this server up with anything
 out of the ordinary...  I've tried rebooting, but the same thing happens
 immediately upon reboot. Google searches, for me at least, yielded nothing
 useful.

"very high load all the time, like 40-60"
Is this number the host CPU utilization or load average?

Which guest OS (and versions) are you running?

Can you paste the qemu-kvm command-line for the VMs?

Can you send a few lines of vmstat 5 output on the host while
running the 3 VMs?

Stefan


Re: [PATCH 09/21] Introduce event-tap.

2011-01-04 Thread Stefan Hajnoczi
On Tue, Jan 4, 2011 at 11:02 AM, Yoshiaki Tamura
tamura.yoshi...@lab.ntt.co.jp wrote:
 After doing some heavy load tests, I realized that we have to
 take a hybrid approach to replay for now.  This is because when a
 device moves to the next state (e.g. virtio decreases inuse) is
 different between net and block.  For example, virtio-net
 decreases inuse upon returning from the net layer, but virtio-blk
 does that inside of the callback.  If we only use pio/mmio
 replay, even though event-tap tries to replay net requests, some
 get lost because the state has proceeded already.  This doesn't
 happen with block, because the state is still old enough to
 replay.  Note that using hybrid approach won't cause duplicated
 requests on the secondary.

Thanks Yoshi.  I think I understand what you're saying.

Stefan


Re: [Qemu-devel] possible regression in qemu-kvm 0.13.0 (memtest)

2010-12-26 Thread Stefan Hajnoczi
On Sat, Dec 25, 2010 at 7:02 PM, Peter Lieven p...@dlh.net wrote:
 this was the outcome of my bisect session:

 956a3e6bb7386de48b642d4fee11f7f86a2fcf9a is first bad commit
 commit 956a3e6bb7386de48b642d4fee11f7f86a2fcf9a
 Author: Blue Swirl blauwir...@gmail.com
 Date:   Sat May 22 07:59:01 2010 +

    Compile pckbd only once

    Use a qemu_irq to indicate A20 line changes. Move I/O port 92
    to pckbd.c.

    Signed-off-by: Blue Swirl blauwir...@gmail.com

 :100644 100644 acbaf227455f931f3ef6dbe0bb4494c6b41f2cd9 
 1a33d4eb4a5624c55896871b5f4ecde78a49ff28 M      Makefile.objs
 :100644 100644 a22484e1e98355a35deeb5038a45fb8fe8685a91 
 ba5147fbc48e4faef072a5be6b0d69d3201c1e18 M      Makefile.target
 :04 04 dd03f81a42b5162c93c40c517f45eb9f7bece93c 
 309f472328632319a15128a59715aa63daf4d92c M      default-configs
 :04 04 83201c4fcde2f592a771479246e0a33a8906515b 
 b1192bce85f2a7129fb19cf2fe7462ef168165cb M      hw
 bisect run success

Nice job bisecting this!  I can reproduce the Memtest86+ V4.10 system
reset with qemu-kvm.git and qemu.git.

The following code path is hit when val=0x2:
if (!(val & 1)) {
qemu_system_reset_request();
}

I think unifying ioport 0x92 and KBD_CCMD_WRITE_OUTPORT was incorrect.
 ioport 0x92 is the System Control Port A and resets the system if bit
0 is 1.  The keyboard outport seems to reset if bit 0 is 0.
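
For illustration (helper names are made up, this is not the QEMU code),
the two reset bits behave like this:

/* Sketch of the opposite reset polarity on the two interfaces. */
static void system_control_port_a_write(uint32_t val)  /* I/O port 0x92 */
{
    if (val & 1) {        /* bit 0 set -> fast reset */
        qemu_system_reset_request();
    }
}

static void kbd_output_port_write(uint32_t val)        /* i8042 output port */
{
    if (!(val & 1)) {     /* bit 0 cleared -> reset */
        qemu_system_reset_request();
    }
}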

Here are the links I've found describing the i8042 keyboard controller
and System Control Port A:
http://www.computer-engineering.org/ps2keyboard/
http://www.win.tue.nl/~aeb/linux/kbd/A20.html

Blue Swirl: Any thoughts on this?

Stefan


Re: FIXED: Re: [Qemu-devel] possible regression in qemu-kvm 0.13.0 (memtest)

2010-12-26 Thread Stefan Hajnoczi
On Sun, Dec 26, 2010 at 9:21 PM, Peter Lieven p...@dlh.net wrote:

 Am 25.12.2010 um 20:02 schrieb Peter Lieven:


 Am 23.12.2010 um 03:42 schrieb Stefan Hajnoczi:

 On Wed, Dec 22, 2010 at 10:02 AM, Peter Lieven p...@dlh.net wrote:
 If I start a VM with the following parameters
 qemu-kvm-0.13.0 -m 2048 -smp 2 -monitor tcp:0:4014,server,nowait -vnc :14 
 -name 'ubuntu.test'  -boot order=dc,menu=off  -cdrom 
 ubuntu-10.04.1-desktop-amd64.iso -k de

 and select memtest in the Ubuntu CD Boot Menu, the VM immediately resets. 
 After this reset there happen several errors including graphic corruption 
 or the qemu-kvm binary
 aborting with error 134.

 Exactly the same scenario on the same machine with qemu-kvm-0.12.5 works 
 flawlessly.

 Any ideas?

 You could track down the commit which broke this using git-bisect(1).
 The steps are:

 $ git bisect start v0.13.0 v0.12.5

 Then:

 $ ./configure [...] && make
 $ x86_64-softmmu/qemu-system-x86_64 -m 2048 -smp 2 -monitor
 tcp:0:4014,server,nowait -vnc :14 -name 'ubuntu.test'  -boot
 order=dc,menu=off  -cdrom ubuntu-10.04.1-desktop-amd64.iso -k de

 If memtest runs as expected:
 $ git bisect good
 otherwise:
 $ git bisect bad

 Keep repeating this and you should end up at the commit that introduced the 
 bug.

 this was the outcome of my bisect session:

 956a3e6bb7386de48b642d4fee11f7f86a2fcf9a is first bad commit
 commit 956a3e6bb7386de48b642d4fee11f7f86a2fcf9a
 Author: Blue Swirl blauwir...@gmail.com
 Date:   Sat May 22 07:59:01 2010 +

    Compile pckbd only once

    Use a qemu_irq to indicate A20 line changes. Move I/O port 92
    to pckbd.c.

    Signed-off-by: Blue Swirl blauwir...@gmail.com

 :100644 100644 acbaf227455f931f3ef6dbe0bb4494c6b41f2cd9 
 1a33d4eb4a5624c55896871b5f4ecde78a49ff28 M      Makefile.objs
 :100644 100644 a22484e1e98355a35deeb5038a45fb8fe8685a91 
 ba5147fbc48e4faef072a5be6b0d69d3201c1e18 M      Makefile.target
 :04 04 dd03f81a42b5162c93c40c517f45eb9f7bece93c 
 309f472328632319a15128a59715aa63daf4d92c M      default-configs
 :04 04 83201c4fcde2f592a771479246e0a33a8906515b 
 b1192bce85f2a7129fb19cf2fe7462ef168165cb M      hw
 bisect run success

 I tracked down the regression to a bug in commit 
 956a3e6bb7386de48b642d4fee11f7f86a2fcf9a

 In the patch the outport of the keyboard controller and ioport 0x92 are made 
 the same.

 this cannot work:

 a) both share bit 1 to enable a20_gate. 1=enable, 0=disable - ok so far
 b) both implement a fast reset option through bit 0, but with inverse logic!!!
 the keyboard controller resets if bit 0 is lowered, the ioport 0x92 resets if 
 bit 0 is raised.
 c) all other bits have nothing in common at all.

 see: http://www.brokenthorn.com/Resources/OSDev9.html

 I have a proposed patch attached. Comments appreciated. The state of the A20 
 Gate is still
 shared between ioport 0x92 and outport of the keyboard controller, but all 
 other bits are ignored.
 They might be used in the future to emulate e.g. hdd led activity or other 
 usage of ioport 0x92.

 I have tested the attached patch. memtest works again as expected. I think it 
 crashed because it uses
 ioport 0x92 directly to enable the a20 gate.

 Peter

 ---

 --- qemu-0.13.0/hw/pckbd.c      2010-10-15 22:56:09.0 +0200
 +++ qemu-0.13.0-fix/hw/pckbd.c  2010-12-26 19:38:35.835114033 +0100
 @@ -212,13 +212,16 @@
 static void ioport92_write(void *opaque, uint32_t addr, uint32_t val)
 {
    KBDState *s = opaque;
 -
 -    DPRINTF("kbd: write outport=0x%02x\n", val);
 -    s->outport = val;
 -    if (s->a20_out) {
 -        qemu_set_irq(*s->a20_out, (val >> 1) & 1);
 +    if (val & 0x02) { // bit 1: enable/disable A20
 +       if (s->a20_out) qemu_irq_raise(*s->a20_out);
 +       s->outport |= KBD_OUT_A20;
 +    }
 +    else
 +    {
 +       if (s->a20_out) qemu_irq_lower(*s->a20_out);
 +       s->outport &= ~KBD_OUT_A20;
    }
 -    if (!(val & 1)) {
 +    if ((val & 1)) { // bit 0: raised -> fast reset
        qemu_system_reset_request();
    }
 }
 @@ -226,11 +229,8 @@
 static uint32_t ioport92_read(void *opaque, uint32_t addr)
 {
    KBDState *s = opaque;
 -    uint32_t ret;
 -
 -    ret = s->outport;
 -    DPRINTF("kbd: read outport=0x%02x\n", ret);
 -    return ret;
 +    return (s->outport & 0x02); // only bit 1 (KBD_OUT_A20) of port 0x92 is identical to s->outport
 +    /* XXX: bit 0 is fast reset, bits 6-7 hdd activity */
 }

 static void kbd_write_command(void *opaque, uint32_t addr, uint32_t val)
 @@ -340,7 +340,9 @@
        kbd_queue(s, val, 1);
        break;
    case KBD_CCMD_WRITE_OUTPORT:
 -        ioport92_write(s, 0, val);
 +        ioport92_write(s, 0, (ioport92_read(s,0) & 0xfc)   // copy bits 2-7 of 0x92
 +                             | (val & 0x02)                // bit 1 (enable a20)
 +                             | (~val & 0x01));             // bit 0 (fast reset) of port 0x92 has inverse logic
        break;
    case KBD_CCMD_WRITE_MOUSE:
        ps2_write_mouse(&s->mouse, val);



I just replied to the original thread.  I think we should separate
0x92

Re: [Qemu-devel] possible regression in qemu-kvm 0.13.0 (memtest)

2010-12-22 Thread Stefan Hajnoczi
On Wed, Dec 22, 2010 at 10:02 AM, Peter Lieven p...@dlh.net wrote:
 If I start a VM with the following parameters
 qemu-kvm-0.13.0 -m 2048 -smp 2 -monitor tcp:0:4014,server,nowait -vnc :14 
 -name 'ubuntu.test'  -boot order=dc,menu=off  -cdrom 
 ubuntu-10.04.1-desktop-amd64.iso -k de

 and select memtest in the Ubuntu CD Boot Menu, the VM immediately resets. 
 After this reset there happen several errors including graphic corruption or 
 the qemu-kvm binary
 aborting with error 134.

 Exactly the same scenario on the same machine with qemu-kvm-0.12.5 works 
 flawlessly.

 Any ideas?

You could track down the commit which broke this using git-bisect(1).
The steps are:

$ git bisect start v0.13.0 v0.12.5

Then:

$ ./configure [...] && make
$ x86_64-softmmu/qemu-system-x86_64 -m 2048 -smp 2 -monitor
tcp:0:4014,server,nowait -vnc :14 -name 'ubuntu.test'  -boot
order=dc,menu=off  -cdrom ubuntu-10.04.1-desktop-amd64.iso -k de

If memtest runs as expected:
$ git bisect good
otherwise:
$ git bisect bad

Keep repeating this and you should end up at the commit that introduced the bug.

Stefan


Re: [PATCH 11/21] ioport: insert event_tap_ioport() to ioport_write().

2010-12-18 Thread Stefan Hajnoczi
On Fri, Dec 17, 2010 at 4:19 PM, Yoshiaki Tamura
tamura.yoshi...@lab.ntt.co.jp wrote:
 2010/12/17 Stefan Hajnoczi stefa...@gmail.com:
 On Thu, Dec 16, 2010 at 9:50 AM, Yoshiaki Tamura
 tamura.yoshi...@lab.ntt.co.jp wrote:
 2010/12/16 Michael S. Tsirkin m...@redhat.com:
 On Thu, Dec 16, 2010 at 04:37:41PM +0900, Yoshiaki Tamura wrote:
 2010/11/28 Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp:
  2010/11/28 Michael S. Tsirkin m...@redhat.com:
  On Thu, Nov 25, 2010 at 03:06:50PM +0900, Yoshiaki Tamura wrote:
  Record ioport event to replay it upon failover.
 
  Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
 
  Interesting. This will have to be extended to support ioeventfd.
  Since each eventfd is really just a binary trigger
  it should be enough to read out the fd state.
 
  Haven't thought about eventfd yet.  Will try doing it in the next
  spin.

 Hi Michael,

 I looked into eventfd and realized it's only used with vhost now.

 There are patches on list to use it for block/userspace net.

 Thanks.  Now I understand.
 In that case, inserting an even-tap function to the following code
 should be appropriate?

 int event_notifier_test_and_clear(EventNotifier *e)
 {
    uint64_t value;
     int r = read(e->fd, &value, sizeof(value));
    return r == sizeof(value);
 }


  However, I
 believe vhost bypass the net layer in qemu, and there is no way for 
 Kemari to
 detect the outputs.  To me, it doesn't make sense to extend this patch to
 support eventfd...

 Here is the userspace ioeventfd patch series:
 http://www.mail-archive.com/qemu-devel@nongnu.org/msg49208.html

 Instead of switching to QEMU userspace to handle the virtqueue kick
 pio write, we signal the eventfd inside the kernel and resume guest
 code execution.  The I/O thread can then process the virtqueue kick in
 parallel to guest code execution.

 I think this can still be tied into Kemari.  If you are switching to a
 pure net/block-layer event tap instead of pio/mmio, then I think it
 should just work.

 That should take a while until we solve how to set correct
 callbacks to the secondary upon failover.  BTW, do you have a
 plan to move the eventfd framework to the upper layer as
 pio/mmio.  Not only Kemari works for free, other emulators should
 be able to benefit from it.

I'm not sure I understand the question but I have considered making
ioeventfd a first-class interface like register_ioport_write().  In
some ways that would be cleaner than the way we use ioeventfd in vhost
and virtio-pci today.

 For vhost it would be more difficult to integrate with Kemari.

 At this point, it's impossible.  As Michael said, I should
 prevent starting Kemari when vhost=on.

If you add some functionality to vhost it might be possible, although
that would slow it down.  So perhaps for the near future using vhost
with Kemari is pointless anyway since you won't be able to reach the
performance that vhost-net can achieve.

Stefan


Re: [PATCH 11/21] ioport: insert event_tap_ioport() to ioport_write().

2010-12-16 Thread Stefan Hajnoczi
On Thu, Dec 16, 2010 at 9:50 AM, Yoshiaki Tamura
tamura.yoshi...@lab.ntt.co.jp wrote:
 2010/12/16 Michael S. Tsirkin m...@redhat.com:
 On Thu, Dec 16, 2010 at 04:37:41PM +0900, Yoshiaki Tamura wrote:
 2010/11/28 Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp:
  2010/11/28 Michael S. Tsirkin m...@redhat.com:
  On Thu, Nov 25, 2010 at 03:06:50PM +0900, Yoshiaki Tamura wrote:
  Record ioport event to replay it upon failover.
 
  Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
 
  Interesting. This will have to be extended to support ioeventfd.
  Since each eventfd is really just a binary trigger
  it should be enough to read out the fd state.
 
  Haven't thought about eventfd yet.  Will try doing it in the next
  spin.

 Hi Michael,

 I looked into eventfd and realized it's only used with vhost now.

 There are patches on list to use it for block/userspace net.

 Thanks.  Now I understand.
 In that case, inserting an even-tap function to the following code
 should be appropriate?

 int event_notifier_test_and_clear(EventNotifier *e)
 {
    uint64_t value;
     int r = read(e->fd, &value, sizeof(value));
    return r == sizeof(value);
 }


  However, I
 believe vhost bypass the net layer in qemu, and there is no way for Kemari 
 to
 detect the outputs.  To me, it doesn't make sense to extend this patch to
 support eventfd...

Here is the userspace ioeventfd patch series:
http://www.mail-archive.com/qemu-devel@nongnu.org/msg49208.html

Instead of switching to QEMU userspace to handle the virtqueue kick
pio write, we signal the eventfd inside the kernel and resume guest
code execution.  The I/O thread can then process the virtqueue kick in
parallel to guest code execution.
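
Roughly, the consumer side looks like this (a simplified sketch, not the
actual QEMU I/O thread):

/* Sketch: the I/O thread blocks on the eventfd that KVM signals when the
 * guest writes the virtqueue notify register, then processes the kick
 * while the vcpu keeps running guest code. */
#include <stdint.h>
#include <unistd.h>

static void process_virtqueue(void)
{
    /* pop requests from the vring and submit them */
}

static void io_thread_wait_for_kick(int efd)
{
    uint64_t count;

    /* read() blocks until the guest kicks, then clears the counter */
    if (read(efd, &count, sizeof(count)) == sizeof(count)) {
        process_virtqueue();
    }
}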

I think this can still be tied into Kemari.  If you are switching to a
pure net/block-layer event tap instead of pio/mmio, then I think it
should just work.

For vhost it would be more difficult to integrate with Kemari.

Stefan


Re: ConVirt 2.0.1 Open Source released.

2010-12-15 Thread Stefan Hajnoczi
On Wed, Dec 15, 2010 at 8:28 PM, jd jdsw2...@yahoo.com wrote:
 We are pleased to announce availability of ConVirt 2.0.1 open
 source. We would like to thank ConVirt user community for their
 continuing participation and support. This release incorporates
 feedback gathered from the community over last few months.

jd: A description of ConVirt would be nice.

Here's what I've figured out from the links:

It is a management tool for Xen, KVM, and others.  Written in Python
under the GPLv2 but developed as open core software (there's an
open source edition and an enterprise edition).  It talks to KVM
using the QEMU (human) monitor.

Stefan


Re: I/O Performance Tips

2010-12-09 Thread Stefan Hajnoczi
On Thu, Dec 9, 2010 at 12:52 PM, Sebastian Nickel - Hetzner Online AG
sebastian.nic...@hetzner.de wrote:
 here is the qemu command line we are using (or which libvirt generates):

 /usr/bin/kvm -S -M pc-0.12 -enable-kvm -m 512 -smp
 1,sockets=1,cores=1,threads=1 -name vm-933 -uuid
 0d737610-e59b-012d-f453-32287f7402ab -nodefaults -chardev
 socket,id=monitor,path=/var/lib/libvirt/qemu/vm-933.monitor,server,nowait 
 -mon chardev=monitor,mode=readline -rtc base=utc -boot nc -drive 
 file=/dev/vg0/934,if=none,id=drive-ide0-0-0,boot=on,format=raw,cache=writeback
  -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -device 
 rtl8139,vlan=0,id=net0,mac=00:1c:14:01:03:67,bus=pci.0,addr=0x3 -net 
 tap,fd=23,vlan=0,name=hostnet0 -chardev pty,id=serial0 -device 
 isa-serial,chardev=serial0 -usb -vga cirrus -device 
 virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4

 We are explicit using writeback in cache settings.

 The first backtrace is odd.  You are using logical volumes for the
 guest but the backtrace shows kjournald is blocked.  I believe logical
 volumes should not directly affect kjournald at all (they don't use
 journalling).  Perhaps this is a deadlock.
 The dmesg output was just an example. Most of the time I can see the tasks 
 kjournald
 and flush.
 I recently saw kvm,kthreadd,rsyslogd and others in such outputs. I 
 thought
 that sometimes /proc/sys/vm/dirty_ratio gets exceeded and so all processes 
 are blocked
 for writes to the cache (kvm processes too). Could this be the case?
 I set dirty_background_ratio to 5 to constantly flush the cache to the 
 disk, but this
 did not help.

 About the flush-251:0:505 hang, please cat /proc/partitions on the
 host to see which block device has major number 251 and minor number 0.
 This is our logical volume root partition of the physical host.

 The fact that your host is having problems suggests the issue is not
 in qemu-kvm (it's just a userspace process).  Are you sure disk I/O is
 working under load on this machine without KVM?
 I do not think that kvm generates this issue (as you said it is a normal
 user space process). I thought that perhaps somebody knows how to handle
 this situation, because the kvm developers have much more experience with
 kvm than I do. Perhaps there are some tuning tips for this or anybody knows
 why only OpenSuse sets the filesystem read only if there are disk timeouts
 in the guest? This behavior appeared on almost all hosts (20) so I can
 eliminate a single machine HW failure.

Christoph, any pointers on how to debug this?

The backtraces from the original email are below:

 Am Donnerstag, den 09.12.2010, 10:30 + schrieb Stefan Hajnoczi:
 On Thu, Dec 9, 2010 at 8:10 AM, Sebastian Nickel - Hetzner Online AG
 sebastian.nic...@hetzner.de wrote:
  Hello,
  we have got some issues with I/O in our kvm environment. We are using
  kernel version 2.6.32 (Ubuntu 10.04 LTS) to virtualise our hosts and we
  are using ksm, too. Recently we noticed that sometimes the guest systems
  (mainly OpenSuse guest systems) suddenly have a read only filesystem.
  After some inspection we found out that the guest system generates some
  ata errors due to timeouts (mostly in flush cache situations). On the
  physical host there are always the same kernel messages when this
  happens:
 
  
  [1508127.195469] INFO: task kjournald:497 blocked for more than 120
  seconds.
  [1508127.212828] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
  disables this message.
  [1508127.246841] kjournald     D      0   497      2
  0x
  [1508127.246848]  88062128dba0 0046 00015bc0
  00015bc0
  [1508127.246855]  880621089ab0 88062128dfd8 00015bc0
  8806210896f0
  [1508127.246862]  00015bc0 88062128dfd8 00015bc0
  880621089ab0
  [1508127.246868] Call Trace:
  [1508127.246880]  [8116e500] ? sync_buffer+0x0/0x50
  [1508127.246889]  [81557d87] io_schedule+0x47/0x70
  [1508127.246893]  [8116e545] sync_buffer+0x45/0x50
  [1508127.246897]  [8155825a] __wait_on_bit_lock+0x5a/0xc0
  [1508127.246901]  [8116e500] ? sync_buffer+0x0/0x50
  [1508127.246905]  [81558338] out_of_line_wait_on_bit_lock
  +0x78/0x90
  [1508127.246911]  [810850d0] ? wake_bit_function+0x0/0x40
  [1508127.246915]  [8116e6c6] __lock_buffer+0x36/0x40
  [1508127.246920]  [81213d11] journal_submit_data_buffers
  +0x311/0x320
  [1508127.246924]  [81213ff2] journal_commit_transaction
  +0x2d2/0xe40
  [1508127.246931]  [810397a9] ? default_spin_lock_flags
  +0x9/0x10
  [1508127.246935]  [81076c7c] ? lock_timer_base+0x3c/0x70
  [1508127.246939]  [81077719] ? try_to_del_timer_sync+0x79/0xd0
  [1508127.246943]  [81217f0d] kjournald+0xed/0x250
  [1508127.246947]  [81085090] ? autoremove_wake_function
  +0x0/0x40
  [1508127.246951]  [81217e20] ? kjournald+0x0/0x250

Re: [Qemu-devel] Re: [PATCH 2/3] virtio-pci: Use ioeventfd for virtqueue notify

2010-12-01 Thread Stefan Hajnoczi
On Mon, Nov 15, 2010 at 11:20 AM, Stefan Hajnoczi stefa...@gmail.com wrote:
 On Sun, Nov 14, 2010 at 12:19 PM, Avi Kivity a...@redhat.com wrote:
 On 11/14/2010 01:05 PM, Avi Kivity wrote:

 I agree, but let's enable virtio-ioeventfd carefully because bad code
 is out there.


 Sure.  Note as long as the thread waiting on ioeventfd doesn't consume too
 much cpu, it will awaken quickly and we won't have the transaction per
 timeslice effect.

 btw, what about virtio-blk with linux-aio?  Have you benchmarked that with
 and without ioeventfd?


 And, what about efficiency?  As in bits/cycle?

 We are running benchmarks with this latest patch and will report results.

Full results here (thanks to Khoa Huynh):

http://wiki.qemu.org/Features/VirtioIoeventfd

The host CPU utilization is scaled to 16 CPUs so a 2-3% reduction is
actually in the 32-48% range for a single CPU.

The guest CPU utilization numbers include an efficiency metric: %vcpu
per MB/sec.  Here we see significant improvements too.  Guests that
previously couldn't get more CPU work done now have regained some
breathing space.

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Re: [PATCH 2/3] virtio-pci: Use ioeventfd for virtqueue notify

2010-12-01 Thread Stefan Hajnoczi
On Wed, Dec 1, 2010 at 12:30 PM, Avi Kivity a...@redhat.com wrote:
 On 12/01/2010 01:44 PM, Stefan Hajnoczi wrote:

 
   And, what about efficiency?  As in bits/cycle?
 
   We are running benchmarks with this latest patch and will report
  results.

 Full results here (thanks to Khoa Huynh):

 http://wiki.qemu.org/Features/VirtioIoeventfd

 The host CPU utilization is scaled to 16 CPUs so a 2-3% reduction is
 actually in the 32-48% range for a single CPU.

 The guest CPU utilization numbers include an efficiency metric: %vcpu
 per MB/sec.  Here we see significant improvements too.  Guests that
 previously couldn't get more CPU work done now have regained some
 breathing space.

 Thanks for those numbers.  The guest improvements were expected, but the
 host numbers surprised me.  Do you have an explanation as to why total host
 load should decrease?

The first vcpu does the virtqueue kick; it holds the guest driver's
vblk->lock across the kick.  Before this kick completes a second vcpu
tries to acquire vblk->lock, finds it is contended, and spins.  So
we're burning CPU due to the long vblk->lock hold times.

With virtio-ioeventfd those kick times are reduced and there is less
contention on vblk->lock.
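
A simplified sketch of the guest-side pattern (the types and helpers are
stand-ins, not the Linux virtio-blk code):

/* The request lock is held across the virtqueue kick, so a long kick
 * (a vmexit handled synchronously) keeps the other vcpus spinning. */
#include <pthread.h>

struct vblk_sketch {
    pthread_spinlock_t lock;
};

static void add_buf_to_vring(struct vblk_sketch *vblk) { /* queue descriptors */ }
static void virtqueue_kick_sketch(struct vblk_sketch *vblk) { /* notify the host */ }

static void virtblk_do_request(struct vblk_sketch *vblk)
{
    pthread_spin_lock(&vblk->lock);
    add_buf_to_vring(vblk);
    virtqueue_kick_sketch(vblk);       /* slow: the guest exits to the host here */
    pthread_spin_unlock(&vblk->lock);  /* other vcpus contend until here */
}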

Stefan


Re: [PATCH 09/21] Introduce event-tap.

2010-11-30 Thread Stefan Hajnoczi
On Tue, Nov 30, 2010 at 9:50 AM, Yoshiaki Tamura
tamura.yoshi...@lab.ntt.co.jp wrote:
 2010/11/29 Stefan Hajnoczi stefa...@gmail.com:
 On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura
 tamura.yoshi...@lab.ntt.co.jp wrote:
 event-tap controls when to start FT transaction, and provides proxy
 functions to called from net/block devices.  While FT transaction, it
 queues up net/block requests, and flush them when the transaction gets
 completed.

 Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
 Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp
 ---
  Makefile.target |    1 +
  block.h         |    9 +
  event-tap.c     |  794 
 +++
  event-tap.h     |   34 +++
  net.h           |    4 +
  net/queue.c     |    1 +
  6 files changed, 843 insertions(+), 0 deletions(-)
  create mode 100644 event-tap.c
  create mode 100644 event-tap.h

 event_tap_state is checked at the beginning of several functions.  If
 there is an unexpected state the function silently returns.  Should
 these checks really be assert() so there is an abort and backtrace if
 the program ever reaches this state?

Fancier error handling would work too.  For example cleaning up,
turning off Kemari, and producing an error message with
error_report().  In that case we need to think through the state of
the environment carefully and make sure we don't cause secondary
failures (like memory leaks).

 BTW, I would like to ask a question regarding this.  There is a
 callback which net/block calls after processing the requests, and
 is there a clean way to set this callback on the failovered
 host upon replay?

I think this is a limitation in the current design.  If requests are
re-issued by Kemari at the net/block level, how will the higher layers
know about these requests?  How will they be prepared to accept
callbacks?

Stefan


Re: KVM call agenda for Nov 30

2010-11-30 Thread Stefan Hajnoczi
On Tue, Nov 30, 2010 at 12:58 PM, Dor Laor dl...@redhat.com wrote:
 Please send in any agenda items you are interested in covering.

Juan already has a thread for agenda items.  It includes:

As I forgot to put the call for agenda befor, Anthony already suggested:
- 2011 kvm conference
- 0.14.0 release plan
- infrastructure changes (irc channel migration, git tree migration)

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: limiting guest block i/o for qos

2010-11-29 Thread Stefan Hajnoczi
On Mon, Nov 29, 2010 at 2:00 AM, T Johnson tjohnso...@gmail.com wrote:
 Hello,

 On Thu, Nov 25, 2010 at 3:33 AM, Nikola Ciprich extmaill...@linuxbox.cz 
 wrote:
 Hello Thomas,
 I t hink blkio-cgroup really can't help You here, but since NFS is
 network protocol,
 why not just consider some kind of network shaping?
 n.

 I thought about this, but it's rather imprecise I imagine if I try to
 limit the number of packets per second and hope that matches reads or
 writes per second. Secondly, I have many guests running to the same
 NFS server which makes limiting per kvm guest somewhat impossible when
 the network tools I know if would limit per NFS server.

Perhaps iptables/tc can mark the stream based on the client process
ID?  Each VM has a qemu-kvm userspace process that will issue file
I/O.  Someone with more networking knowledge could confirm whether or
not it is possible to mark based on the process ID using the in-kernel
NFS client.

You don't need to limit based on packets per second.  You can do
bandwidth-based traffic shaping with tc.

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2

2010-11-29 Thread Stefan Hajnoczi
On Sat, Nov 27, 2010 at 1:11 PM, Yoshiaki Tamura
tamura.yoshi...@lab.ntt.co.jp wrote:
 2010/11/27 Stefan Hajnoczi stefa...@gmail.com:
 On Sat, Nov 27, 2010 at 8:53 AM, Yoshiaki Tamura
 tamura.yoshi...@lab.ntt.co.jp wrote:
 2010/11/27 Stefan Hajnoczi stefa...@gmail.com:
 On Sat, Nov 27, 2010 at 4:29 AM, Yoshiaki Tamura
 tamura.yoshi...@lab.ntt.co.jp wrote:
 2010/11/27 Blue Swirl blauwir...@gmail.com:
 On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura
 tamura.yoshi...@lab.ntt.co.jp wrote:
 Somehow I find some similarities to instrumentation patches. Perhaps
 the instrumentation framework could be used (maybe with some changes)
 for Kemari as well? That could be beneficial to both.

 Yes.  I had the same idea but I'm not sure how tracing works.  I think
 Stefan Hajnoczi knows it better.

 Stefan, is it possible to call arbitrary functions from the trace
 points?

 Yes, if you add code to ./tracetool.  I'm not sure I see the
 connection between Kemari and tracing though.

 The connection is that it may be possible to remove Kemari
 specific hook point like in ioport.c and exec.c, and let tracing
 notify Kemari instead.

 I actually think the other way.  Tracing just instruments and stashes
 away values.  It does not change inputs or outputs, it does not change
 control flow, it does not affect state.

 Going down the route of side-effects mixes two different things:
 hooking into a subsystem and instrumentation.  For hooking into a
 subsystem we should define proper interfaces.  That interface can
 explicitly support modifying inputs/outputs or changing control flow.

 Tracing is much more ad-hoc and not a clean interface.  It's also
 based on a layer of indirection via the tracetool code generator.
 That's okay because it doesn't affect the code it is called from and
 you don't need to debug trace events (they are simple and have almost
 no behavior).

 Hooking via tracing is just taking advantage of the cheap layer of
 indirection in order to get at interesting events in a subsystem.
 It's easy to hook up and quick to develop, but it's not a proper
 interface and will be hard to understand for other developers.

 One question I have about Kemari is whether it adds new constraints to
 the QEMU codebase?  Fault tolerance seems like a cross-cutting concern
 - everyone writing device emulation or core QEMU code may need to be
 aware of new constraints.  For example, you are not allowed to
 release I/O operations to the outside world directly, instead you need
 to go through Kemari code which makes I/O transactional and
 communicates with the passive host.  You have converted e1000,
 virtio-net, and virtio-blk.  How do we make sure new devices that are
 merged into qemu.git don't break Kemari?  How do we go about
 supporting the existing hw/* devices?

 Whether Kemari adds constraints such as you mentioned, yes.  If
 the devices (including existing ones) don't call Kemari code,
 they would certainly break Kemari.  Altough using proxies looks
 explicit, to make it unaware from people writing device
 emulation, it's possible to remove proxies and put changes only
 into the block/net layer as Blue suggested.

 Anything that makes it hard to violate the constraints is good.
 Otherwise Kemari might get broken in the future and no one will know
 until a failover behaves incorrectly.

 Blue and Paul prefer to put it into block/net layer, and you
 think it's better to provide API.

Sorry, I wasn't clear.  I agree that event tap behavior should be in
generic block and net layer code.  That way we're guaranteeing that
all net and block I/O goes through event tap.

 Could you formulate the constraints so developers are aware of them in
 the future and can protect the codebase.  How about expanding the
 Kemari wiki pages?

 If you like the idea above, I'm happy to make the list also on
 the wiki page.

Here's a different question: what requirements must an emulated device
meet in order to be added to the Kemari supported whitelist?  That's
what I want to know so that I don't break existing devices and can add
new devices that work with Kemari :).

Stefan


Re: [PATCH 09/21] Introduce event-tap.

2010-11-29 Thread Stefan Hajnoczi
On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura
tamura.yoshi...@lab.ntt.co.jp wrote:
 event-tap controls when to start FT transaction, and provides proxy
 functions to called from net/block devices.  While FT transaction, it
 queues up net/block requests, and flush them when the transaction gets
 completed.

 Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
 Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp
 ---
  Makefile.target |    1 +
  block.h         |    9 +
  event-tap.c     |  794 
 +++
  event-tap.h     |   34 +++
  net.h           |    4 +
  net/queue.c     |    1 +
  6 files changed, 843 insertions(+), 0 deletions(-)
  create mode 100644 event-tap.c
  create mode 100644 event-tap.h

event_tap_state is checked at the beginning of several functions.  If
there is an unexpected state the function silently returns.  Should
these checks really be assert() so there is an abort and backtrace if
the program ever reaches this state?

 +typedef struct EventTapBlkReq {
 +    char *device_name;
 +    int num_reqs;
 +    int num_cbs;
 +    bool is_multiwrite;

Is multiwrite logging necessary?  If event tap is called from within
the block layer then multiwrite is turned into one or more
bdrv_aio_writev() calls.

 +static void event_tap_replay(void *opaque, int running, int reason)
 +{
 +    EventTapLog *log, *next;
 +
 +    if (!running) {
 +        return;
 +    }
 +
 +    if (event_tap_state != EVENT_TAP_LOAD) {
 +        return;
 +    }
 +
 +    event_tap_state = EVENT_TAP_REPLAY;
 +
 +    QTAILQ_FOREACH(log, event_list, node) {
 +        EventTapBlkReq *blk_req;
 +
 +        /* event resume */
 +        switch (log->mode & ~EVENT_TAP_TYPE_MASK) {
 +        case EVENT_TAP_NET:
 +            event_tap_net_flush(log->net_req);
 +            break;
 +        case EVENT_TAP_BLK:
 +            blk_req = log->blk_req;
 +            if ((log->mode & EVENT_TAP_TYPE_MASK) == EVENT_TAP_IOPORT) {
 +                switch (log->ioport.index) {
 +                case 0:
 +                    cpu_outb(log->ioport.address, log->ioport.data);
 +                    break;
 +                case 1:
 +                    cpu_outw(log->ioport.address, log->ioport.data);
 +                    break;
 +                case 2:
 +                    cpu_outl(log->ioport.address, log->ioport.data);
 +                    break;
 +                }
 +            } else {
 +                /* EVENT_TAP_MMIO */
 +                cpu_physical_memory_rw(log->mmio.address,
 +                                       log->mmio.buf,
 +                                       log->mmio.len, 1);
 +            }
 +            break;

Why are net tx packets replayed at the net level but blk requests are
replayed at the pio/mmio level?

I expected everything to replay either as pio/mmio or as net/block.

 +static void event_tap_blk_load(QEMUFile *f, EventTapBlkReq *blk_req)
 +{
 +    BlockRequest *req;
 +    ram_addr_t page_addr;
 +    int i, j, len;
 +
 +    len = qemu_get_byte(f);
 +    blk_req->device_name = qemu_malloc(len + 1);
 +    qemu_get_buffer(f, (uint8_t *)blk_req->device_name, len);
 +    blk_req->device_name[len] = '\0';
 +    blk_req->num_reqs = qemu_get_byte(f);
 +
 +    for (i = 0; i < blk_req->num_reqs; i++) {
 +        req = blk_req->reqs[i];
 +        req->sector = qemu_get_be64(f);
 +        req->nb_sectors = qemu_get_be32(f);
 +        req->qiov = qemu_malloc(sizeof(QEMUIOVector));

It would make sense to have common QEMUIOVector load/save functions
instead of inlining this code here.
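
Something along these lines, for example (qemu_put_qiov() is a hypothetical
name and this is only a sketch -- the matching load side and error handling
are omitted):

    static void qemu_put_qiov(QEMUFile *f, QEMUIOVector *qiov)
    {
        int i;

        qemu_put_be32(f, qiov->niov);
        for (i = 0; i < qiov->niov; i++) {
            qemu_put_be64(f, qiov->iov[i].iov_len);
            qemu_put_buffer(f, qiov->iov[i].iov_base, qiov->iov[i].iov_len);
        }
    }

Then the block and net save/load paths could share the helpers instead of
open-coding the iovec walk.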

 +static int event_tap_load(QEMUFile *f, void *opaque, int version_id)
 +{
 +    EventTapLog *log, *next;
 +    int mode;
 +
 +    event_tap_state = EVENT_TAP_LOAD;
 +
 +    QTAILQ_FOREACH_SAFE(log, event_list, node, next) {
 +        QTAILQ_REMOVE(event_list, log, node);
 +        event_tap_free_log(log);
 +    }
 +
 +    /* loop until EOF */
 +    while ((mode = qemu_get_byte(f)) != 0) {
 +        EventTapLog *log = event_tap_alloc_log();
 +
 +        log->mode = mode;
 +        switch (log->mode & EVENT_TAP_TYPE_MASK) {
 +        case EVENT_TAP_IOPORT:
 +            event_tap_ioport_load(f, log->ioport);
 +            break;
 +        case EVENT_TAP_MMIO:
 +            event_tap_mmio_load(f, log->mmio);
 +            break;
 +        case 0:
 +            DPRINTF("No event\n");
 +            break;
 +        default:
 +            fprintf(stderr, "Unknown state %d\n", log->mode);
 +            return -1;

log is leaked here...

 +        }
 +
 +        switch (log->mode & ~EVENT_TAP_TYPE_MASK) {
 +        case EVENT_TAP_NET:
 +            event_tap_net_load(f, log->net_req);
 +            break;
 +        case EVENT_TAP_BLK:
 +            event_tap_blk_load(f, log->blk_req);
 +            break;
 +        default:
 +            fprintf(stderr, "Unknown state %d\n", log->mode);
 +            return -1;

...and here.
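
In both cases something like this would do (or a goto to a common error label
that frees the log entry):

            fprintf(stderr, "Unknown state %d\n", log->mode);
            event_tap_free_log(log);
            return -1;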

Stefan

Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2

2010-11-29 Thread Stefan Hajnoczi
On Mon, Nov 29, 2010 at 3:00 PM, Yoshiaki Tamura
tamura.yoshi...@lab.ntt.co.jp wrote:
 2010/11/29 Paul Brook p...@codesourcery.com:
  If devices incorrectly claim support for live migration, then that should
  also be fixed, either by removing the broken code or by making it work.

 I totally agree with you.

  AFAICT your current proposal is just feeding back the results of some
  fairly specific QA testing.  I'd rather not get into that game.  The
  correct response in the context of upstream development is to file a bug
  and/or fix the code. We already have config files that allow third party
  packagers to remove devices they don't want to support.

 Sorry, I didn't get what you're trying to tell me.  My plan would
 be to initially start from a subset of devices, and gradually
 grow the number of devices that Kemari works with.  During this
 process, it'll include what you said above: file a bug and/or fix
 the code.  Am I missing what you're saying?

 My point is that the whitelist shouldn't exist at all.  Devices either 
 support
 migration or they don't.  Having some sort of separate whitelist is the wrong
 way to determine which devices support migration.

 Alright!

 Then if a user encounters a problem with Kemari, we'll fix Kemari
 or the devices or both. Correct?

Is this a fair summary: any device that supports live migration works
under Kemari?

(If such a device does not work under Kemari then this is a bug that
needs to be fixed in live migration, Kemari, or the device.)

Stefan


Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2

2010-11-27 Thread Stefan Hajnoczi
On Sat, Nov 27, 2010 at 8:53 AM, Yoshiaki Tamura
tamura.yoshi...@lab.ntt.co.jp wrote:
 2010/11/27 Stefan Hajnoczi stefa...@gmail.com:
 On Sat, Nov 27, 2010 at 4:29 AM, Yoshiaki Tamura
 tamura.yoshi...@lab.ntt.co.jp wrote:
 2010/11/27 Blue Swirl blauwir...@gmail.com:
 On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura
 tamura.yoshi...@lab.ntt.co.jp wrote:
 Somehow I find some similarities to instrumentation patches. Perhaps
 the instrumentation framework could be used (maybe with some changes)
 for Kemari as well? That could be beneficial to both.

 Yes.  I had the same idea but I'm not sure how tracing works.  I think
 Stefan Hajnoczi knows it better.

 Stefan, is it possible to call arbitrary functions from the trace
 points?

 Yes, if you add code to ./tracetool.  I'm not sure I see the
 connection between Kemari and tracing though.

 The connection is that it may be possible to remove Kemari
 specific hook point like in ioport.c and exec.c, and let tracing
 notify Kemari instead.

I actually think the other way.  Tracing just instruments and stashes
away values.  It does not change inputs or outputs, it does not change
control flow, it does not affect state.

Going down the route of side-effects mixes two different things:
hooking into a subsystem and instrumentation.  For hooking into a
subsystem we should define proper interfaces.  That interface can
explicitly support modifying inputs/outputs or changing control flow.

Tracing is much more ad-hoc and not a clean interface.  It's also
based on a layer of indirection via the tracetool code generator.
That's okay because it doesn't affect the code it is called from and
you don't need to debug trace events (they are simple and have almost
no behavior).

Hooking via tracing is just taking advantage of the cheap layer of
indirection in order to get at interesting events in a subsystem.
It's easy to hook up and quick to develop, but it's not a proper
interface and will be hard to understand for other developers.

 One question I have about Kemari is whether it adds new constraints to
 the QEMU codebase?  Fault tolerance seems like a cross-cutting concern
 - everyone writing device emulation or core QEMU code may need to be
 aware of new constraints.  For example, you are not allowed to
 release I/O operations to the outside world directly, instead you need
 to go through Kemari code which makes I/O transactional and
 communicates with the passive host.  You have converted e1000,
 virtio-net, and virtio-blk.  How do we make sure new devices that are
 merged into qemu.git don't break Kemari?  How do we go about
 supporting the existing hw/* devices?

 Whether Kemari adds constraints such as you mentioned, yes.  If
 the devices (including existing ones) don't call Kemari code,
 they would certainly break Kemari.  Although using proxies looks
 explicit, to make it transparent to people writing device
 emulation, it's possible to remove proxies and put changes only
 into the block/net layer as Blue suggested.

Anything that makes it hard to violate the constraints is good.
Otherwise Kemari might get broken in the future and no one will know
until a failover behaves incorrectly.

Could you formulate the constraints so developers are aware of them in
the future and can protect the codebase?  How about expanding the
Kemari wiki pages?

Stefan


Re: Loading snapshot with -loadvm?

2010-11-26 Thread Stefan Hajnoczi
On Fri, Nov 26, 2010 at 8:47 AM, Jun Koi junkoi2...@gmail.com wrote:
 this created a snapshot named test on my image. then i tried to start
 the snapshot VM, like below:

 qemu-system-x86_64 -m 2000 -vga std -usb -usbdevice tablet -localtime
 -loadvm 2 -hda img.qcow2.win7_x64

 but then i have a problem: the Qemu window shows up, with [Stopped]
 at window caption. it stays forever there, and doesnt proceed.

 is this a bug, or did i do something wrong?

Solution: Switch to the QEMU monitor (Ctrl+Alt+2) and type 'c' to
continue the VM.  To switch back to the VM's display use Ctrl+Alt+1.

I checked that it is expected behavior:
    if (loadvm) {
        if (load_vmstate(loadvm) < 0) {
            autostart = 0;
        }
    }

autostart = 0 means that your VM will be stopped.

I'm not sure why -loadvm implies the VM will be stopped, there's
already a different command-line option to keep the VM stopped (-S).

Stefan


Re: Loading snapshot with -loadvm?

2010-11-26 Thread Stefan Hajnoczi
On Fri, Nov 26, 2010 at 9:26 AM, Jun Koi junkoi2...@gmail.com wrote:
 On Fri, Nov 26, 2010 at 5:18 PM, Stefan Hajnoczi stefa...@gmail.com wrote:
 On Fri, Nov 26, 2010 at 8:47 AM, Jun Koi junkoi2...@gmail.com wrote:
 this created a snapshot named test on my image. then i tried to start
 the snapshot VM, like below:

 qemu-system-x86_64 -m 2000 -vga std -usb -usbdevice tablet -localtime
 -loadvm 2 -hda img.qcow2.win7_x64

 but then i have a problem: the Qemu window shows up, with [Stopped]
 at window caption. it stays forever there, and doesnt proceed.

 is this a bug, or did i do something wrong?

 Solution: Switch to the QEMU monitor (Ctrl+Alt+2) and type 'c' to
 continue the VM.  To switch back to the VM's display use Ctrl+Alt+1.


 yes, i tried that, but the problem is that the Qemu window is not
 responsive. looks like it hangs up ...

 i even tried with -monitor stdio, then at console type c, but the
 monitor is not responsive, either.

 I checked that it is expected behavior:
    if (loadvm) {
         if (load_vmstate(loadvm) < 0) {
            autostart = 0;
        }
    }


 autostart = 0 means that your VM will be stopped.

 but that happens when load_vmstate() < 0, which means something is wrong?

 i must look at that code more closely.

Me too, I missed the < 0.

Stefan


Re: Memory leaks in virtio drivers?

2010-11-26 Thread Stefan Hajnoczi
On Fri, Nov 26, 2010 at 7:19 PM, Freddie Cash fjwc...@gmail.com wrote:
 Within 2 weeks of booting, the host machine is using 2 GB of swap, and
 disk I/O wait is through the roof.  Restarting all of the VMs will
 free up RAM, but restarting the whole box is the only way to get
 performance back up.

 A guest configured to use 8 GB of RAM will have 9 GB virt and 7.5 GB
 res shown in top.  In fact, every single VM shows virt above the limit
 set for the VM.  Usually by close to 25%.

Not sure about specific known issues with those Debian package versions, but...

Virtual memory does not mean much.  For example, a 64-bit process can
map in 32 GB and never touch it.  The virt number will be 32 GB but
actually no RAM is being used.  Or it could be a memory-mapped file,
which is backed by the disk and whose pages can be dropped if physical
memory runs low.  Looking at the virtual memory figure is not that
useful.

Also remember that qemu-kvm itself requires memory to perform the
device emulation and virtualization.  If you have an 8 GB VM, plan for
more than 8 GB to be used.  Clearly this memory overhead should be
kept low.  Is your 25% virtual memory overhead figure from a small VM?
For the 8 GB guest, 9 GB virtual / 8 GB VM is 12.5%, not 25%.

What is the sum of all VMs' RAM?  I'm guessing you may have
overcommitted resources (e.g. 2 x 8 GB VM on a 16 GB machine).  If you
don't leave host Linux system some resources you will get bad VM
performance.

Stefan


Re: Memory leaks in virtio drivers?

2010-11-26 Thread Stefan Hajnoczi
On Fri, Nov 26, 2010 at 8:16 PM, Freddie Cash fjwc...@gmail.com wrote:
 On Fri, Nov 26, 2010 at 12:04 PM, Stefan Hajnoczi stefa...@gmail.com wrote:
 On Fri, Nov 26, 2010 at 7:19 PM, Freddie Cash fjwc...@gmail.com wrote:
 Within 2 weeks of booting, the host machine is using 2 GB of swap, and
 disk I/O wait is through the roof.  Restarting all of the VMs will
 free up RAM, but restarting the whole box is the only way to get
 performance back up.

 A guest configured to use 8 GB of RAM will have 9 GB virt and 7.5 GB
 res shown in top.  In fact, every single VM shows virt above the limit
 set for the VM.  Usually by close to 25%.

 Not sure about specific known issues with those Debian package versions, 
 but...

 Virtual memory does not mean much.  For example, a 64-bit process can
 map in 32 GB and never touch it.  The virt number will be 32 GB but
 actually no RAM is being used.  Or it could be a memory mapped file,
 which is backed by the disk and whose pages can be dropped if physical
 memory runs low.  Looking at the virtual memory figure is not that
 useful.

 Also remember that qemu-kvm itself requires memory to perform the
 device emulation and virtualization.  If you have an 8 GB VM, plan for
 more than 8 GB to be used.  Clearly this memory overhead should be
 kept low, is your 25% virtual memory overhead figure from a small VM
 because 9 GB virtual / 8 GB VM is 12.5% not 25%?

 What is the sum of all VMs' RAM?  I'm guessing you may have
 overcommitted resources (e.g. 2 x 8 GB VM on a 16 GB machine).  If you
 don't leave host Linux system some resources you will get bad VM
 performance.

 Nope, not overcommitted.  Sum of RAM for all VMs (in MB):
 512 + 768 + 1024 + 512 + 512 + 1024 +  1024 + 768 + 8192 = 14226
 Leaving a little under 2 GB for the host.

How do those VM RAM numbers stack up with ps -eo rss,args | grep kvm?

If the rss reveals the qemu-kvm processes are 15 GB RAM then it might
be worth giving them more breathing room.

 Doing further googling, could it be a caching issue in the host?  We
 currently have no cache= settings for any of our virtual disks.  I
 believe the default is still write-through? so the host is trying to
 cache everything.

Yes, the default is writethrough.  cache=none would reduce buffered
file pages so it's worth a shot.
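
On a plain qemu-kvm command line that is just the cache suboption of -drive,
e.g. -drive file=disk.img,if=virtio,cache=none (disk.img is a placeholder).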

 Anyone know how to force libvirt to use cache='none' in the driver
 block?  libvirt-bin 0.8.3 and virt-manager 0.8.4 ignore it if I edit
 the domain.xml file directly, and there's nowhere to set it in the
 virt-manager GUI.  (Only 1 of the VMs is managed via libvirt
 currently.)

A hack you can do if your libvirt does not support the <driver
cache='none'/> attribute is to move /usr/bin/qemu-kvm out of the way
and replace it with a shell script that does
s/if=virtio/if=virtio,cache=none/ on its arguments before invoking the
real /usr/bin/qemu-kvm.  (Perhaps the cleaner way is editing the
domain XML for <emulator>/usr/bin/kvm_cache_none.sh</emulator> but I
haven't tested it.)

Stefan


Re: [PATCH] ceph/rbd block driver for qemu-kvm (v8)

2010-11-26 Thread Stefan Hajnoczi
On Fri, Nov 26, 2010 at 9:59 PM, Christian Brunner
c.m.brun...@gmail.com wrote:
 Thanks for the review. What am I supposed to do now?

Kevin is the block maintainer.  His review is the next step, I have
CCed him.  After that rbd would be ready to merge.

Stefan


Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2

2010-11-26 Thread Stefan Hajnoczi
On Sat, Nov 27, 2010 at 4:29 AM, Yoshiaki Tamura
tamura.yoshi...@lab.ntt.co.jp wrote:
 2010/11/27 Blue Swirl blauwir...@gmail.com:
 On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura
 tamura.yoshi...@lab.ntt.co.jp wrote:
 Hi,

 This patch series is a revised version of Kemari for KVM, which
 applied comments for the previous post and KVM Forum 2010.  The
 current code is based on qemu.git
 f711df67d611e4762966a249742a5f7499e19f99.

 For general information about Kemari, I've made a wiki page at
 qemu.org.

 http://wiki.qemu.org/Features/FaultTolerance

 The changes from v0.1.1 - v0.2 are:

 - Introduce a queue in event-tap to make VM sync live.
 - Change transaction receiver to a state machine for async receiving.
 - Replace net/block layer functions with event-tap proxy functions.
 - Remove dirty bitmap optimization for now.
 - convert DPRINTF() in ft_trans_file to trace functions.
 - convert fprintf() in ft_trans_file to error_report().
 - improved error handling in ft_trans_file.
 - add a tmp pointer to qemu_del_vm_change_state_handler.

 The changes from v0.1 - v0.1.1 are:

 - events are tapped in net/block layer instead of device emulation layer.
 - Introduce a new option for -incoming to accept FT transaction.
 - Removed writev() support to QEMUFile and FdMigrationState for now.  I 
 would
  post this work in a different series.
 - Modified virtio-blk save/load handler to send inuse variable to
  correctly replay.
 - Removed configure --enable-ft-mode.
 - Removed unnecessary check for qemu_realloc().

 The first 6 patches modify several functions of qemu to prepare
 introducing Kemari specific components.

 The next 6 patches are the components of Kemari.  They introduce
 event-tap and the FT transaction protocol file based on buffered file.
 The design document of FT transaction protocol can be found at,
 http://wiki.qemu.org/images/b/b1/Kemari_sender_receiver_0.5a.pdf

 Then the following 4 patches modifies dma-helpers, virtio-blk
 virtio-net and e1000 to replace net/block layer functions with
 event-tap proxy functions.  Please note that if Kemari is off,
 event-tap will just pass through, and there is almost no intrusion to
 existing functions including normal live migration.

 Would it be possible to make the changes only in the block/net layer,
 so that the devices are not modified at all? That is, the proxy
 function would always replace the unproxied version.

 I understand the benefit of your suggestion.  However it seems a bit
 tricky.  It's because event-tap uses functions of emulators and net,
 but block.c is also linked for utilities like qemu-img that doesn't
 need emulators or net.  In the previous version, I added function
 pointers to get around.

 http://lists.nongnu.org/archive/html/qemu-devel/2010-05/msg02378.html

 I wasn't confident of this approach and discussed it at KVM Forum, and
 decided to give a try to replace emulator functions with proxies.
 Suggestions are welcomed of course.

 Somehow I find some similarities to instrumentation patches. Perhaps
 the instrumentation framework could be used (maybe with some changes)
 for Kemari as well? That could be beneficial to both.

 Yes.  I had the same idea but I'm not sure how tracing works.  I think
 Stefan Hajnoczi knows it better.

 Stefan, is it possible to call arbitrary functions from the trace
 points?

Yes, if you add code to ./tracetool.  I'm not sure I see the
connection between Kemari and tracing though.

One question I have about Kemari is whether it adds new constraints to
the QEMU codebase?  Fault tolerance seems like a cross-cutting concern
- everyone writing device emulation or core QEMU code may need to be
aware of new constraints.  For example, you are not allowed to
release I/O operations to the outside world directly, instead you need
to go through Kemari code which makes I/O transactional and
communicates with the passive host.  You have converted e1000,
virtio-net, and virtio-blk.  How do we make sure new devices that are
merged into qemu.git don't break Kemari?  How do we go about
supporting the existing hw/* devices?

Stefan


Re: KVM call agenda for Nov 23

2010-11-23 Thread Stefan Hajnoczi
On Tue, Nov 23, 2010 at 2:37 PM, Kevin Wolf kw...@redhat.com wrote:
 Am 22.11.2010 14:55, schrieb Stefan Hajnoczi:
 On Mon, Nov 22, 2010 at 1:38 PM, Juan Quintela quint...@redhat.com wrote:

 Please send in any agenda items you are interested in covering.

 QCOW2 performance roadmap:
 * What can be done to achieve near-raw image format performance?
 * Benchmark results from an ideal QCOW2 model.

Performance figures from a series of I/O scenarios:
http://wiki.qemu.org/Qcow2/PerformanceRoadmap

Stefan


Re: KVM call agenda for Nov 23

2010-11-22 Thread Stefan Hajnoczi
On Mon, Nov 22, 2010 at 1:38 PM, Juan Quintela quint...@redhat.com wrote:

 Please send in any agenda items you are interested in covering.

QCOW2 performance roadmap:
* What can be done to achieve near-raw image format performance?
* Benchmark results from an ideal QCOW2 model.

Stefan


Re: Question on virtio frontend backend drivers

2010-11-22 Thread Stefan Hajnoczi
On Mon, Nov 22, 2010 at 2:05 PM, Prasad Joshi
p.g.jo...@student.reading.ac.uk wrote:
 I was under the impression that the each virtio driver will have a frontend 
 and backend part. The frontend part would be loaded in the Guest OS and the 
 backend driver will be loaded in the Host OS. These two drivers will 
 communicate with each other. The backend driver will then retransmit the 
 actual request to correct driver.

 But seems like my understanding is wrong.
 I attached a virtio disk to the Guest OS. When the Guest was booted, after 
 creating a file system on the attached disk I mounted it.

 [pra...@prasad-fedora12-vm ~]$ lsmod | grep -i virtio
 virtio_blk              7352  1
 virtio_pci              8680  0
 virtio_ring             6080  1 virtio_pci
 virtio                  5220  2 virtio_blk,virtio_pci

 But on the host machine no backend driver was loaded

 r...@prasad-desktop:~/VMDisks# lsmod | grep -i virtio
 r...@prasad-desktop:~/VMDisks#

 Does this mean there is no explicit backend driver?

A virtio device is a PCI adapter in the guest.  That's why you see virtio_pci.

The userspace QEMU process (called qemu-kvm or qemu) does device
emulation and contains the virtio code you are looking for.  See
hw/virtio-blk.c in qemu-kvm.git.
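
Very roughly, the virtio-blk backend registers a virtqueue handler when the
device is created and services requests entirely in userspace (simplified
from memory, see the real code for details):

    /* hw/virtio-blk.c, approximately: */
    s->vq = virtio_add_queue(&s->vdev, 128, virtio_blk_handle_output);

virtio_blk_handle_output() pops requests off the vring and submits them to
the host block layer via bdrv_aio_readv()/bdrv_aio_writev(), which is why no
separate kernel module shows up on the host for virtio-blk.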

Stefan


Re: [PATCH] ceph/rbd block driver for qemu-kvm (v8)

2010-11-18 Thread Stefan Hajnoczi
Reviewed-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com


Re: [PATCH v3 2/3] virtio-pci: Use ioeventfd for virtqueue notify

2010-11-16 Thread Stefan Hajnoczi
On Tue, Nov 16, 2010 at 4:02 PM, Michael S. Tsirkin m...@redhat.com wrote:
 On Fri, Nov 12, 2010 at 01:24:28PM +, Stefan Hajnoczi wrote:
 Virtqueue notify is currently handled synchronously in userspace virtio.  
 This
 prevents the vcpu from executing guest code while hardware emulation code
 handles the notify.

 On systems that support KVM, the ioeventfd mechanism can be used to make
 virtqueue notify a lightweight exit by deferring hardware emulation to the
 iothread and allowing the VM to continue execution.  This model is similar to
 how vhost receives virtqueue notifies.

 The result of this change is improved performance for userspace virtio 
 devices.
 Virtio-blk throughput increases especially for multithreaded scenarios and
 virtio-net transmit throughput increases substantially.

 Some virtio devices are known to have guest drivers which expect a notify to 
 be
 processed synchronously and spin waiting for completion.  Only enable 
 ioeventfd
 for virtio-blk and virtio-net for now.

 Care must be taken not to interfere with vhost-net, which already uses
 ioeventfd host notifiers.  The following list shows the behavior implemented 
 in
 this patch and is designed to take vhost-net into account:

  * VIRTIO_CONFIG_S_DRIVER_OK - assign host notifiers, 
 qemu_set_fd_handler(virtio_pci_host_notifier_read)
  * !VIRTIO_CONFIG_S_DRIVER_OK - qemu_set_fd_handler(NULL), deassign host 
 notifiers
  * virtio_pci_set_host_notifier(true) - qemu_set_fd_handler(NULL)
  * virtio_pci_set_host_notifier(false) - 
 qemu_set_fd_handler(virtio_pci_host_notifier_read)

 Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
 ---
  hw/virtio-pci.c |  152 
 ++
  hw/virtio.c     |   14 -
  hw/virtio.h     |   13 +
  3 files changed, 153 insertions(+), 26 deletions(-)

 Now toggles host notifiers based on VIRTIO_CONFIG_S_DRIVER_OK status changes.
 The cleanest way I could see was to introduce pre and a post set_status()
 callbacks.  They allow a binding to hook status changes, including the status
 change from virtio_reset().

 diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
 index 549118d..117e855 100644
 --- a/hw/virtio-pci.c
 +++ b/hw/virtio-pci.c
 @@ -83,6 +83,11 @@
  /* Flags track per-device state like workarounds for quirks in older 
 guests. */
  #define VIRTIO_PCI_FLAG_BUS_MASTER_BUG  (1  0)

 +/* Performance improves when virtqueue kick processing is decoupled from the
 + * vcpu thread using ioeventfd for some devices. */
 +#define VIRTIO_PCI_FLAG_USE_IOEVENTFD_BIT 1
 +#define VIRTIO_PCI_FLAG_USE_IOEVENTFD   (1  
 VIRTIO_PCI_FLAG_USE_IOEVENTFD_BIT)
 +
  /* QEMU doesn't strictly need write barriers since everything runs in
   * lock-step.  We'll leave the calls to wmb() in though to make it obvious 
 for
   * KVM or if kqemu gets SMP support.
 @@ -179,12 +184,125 @@ static int virtio_pci_load_queue(void * opaque, int 
 n, QEMUFile *f)
      return 0;
  }

 +static int virtio_pci_set_host_notifier_ioeventfd(VirtIOPCIProxy *proxy,
 +                                                  int n, bool assign)
 +{
 +    VirtQueue *vq = virtio_get_queue(proxy-vdev, n);
 +    EventNotifier *notifier = virtio_queue_get_host_notifier(vq);
 +    int r;
 +    if (assign) {
 +        r = event_notifier_init(notifier, 1);
 +        if (r  0) {
 +            return r;
 +        }
 +        r = kvm_set_ioeventfd_pio_word(event_notifier_get_fd(notifier),
 +                                       proxy-addr + 
 VIRTIO_PCI_QUEUE_NOTIFY,
 +                                       n, assign);
 +        if (r  0) {
 +            event_notifier_cleanup(notifier);
 +        }
 +    } else {
 +        r = kvm_set_ioeventfd_pio_word(event_notifier_get_fd(notifier),
 +                                       proxy-addr + 
 VIRTIO_PCI_QUEUE_NOTIFY,
 +                                       n, assign);
 +        if (r  0) {
 +            return r;
 +        }
 +        event_notifier_cleanup(notifier);
 +    }
 +    return r;
 +}
 +
 +static void virtio_pci_host_notifier_read(void *opaque)
 +{
 +    VirtQueue *vq = opaque;
 +    EventNotifier *n = virtio_queue_get_host_notifier(vq);
 +    if (event_notifier_test_and_clear(n)) {
 +        virtio_queue_notify_vq(vq);
 +    }
 +}
 +
 +static void virtio_pci_set_host_notifier_fd_handler(VirtIOPCIProxy *proxy,
 +                                                    int n, bool assign)
 +{
 +    VirtQueue *vq = virtio_get_queue(proxy-vdev, n);
 +    EventNotifier *notifier = virtio_queue_get_host_notifier(vq);
 +    if (assign) {
 +        qemu_set_fd_handler(event_notifier_get_fd(notifier),
 +                            virtio_pci_host_notifier_read, NULL, vq);
 +    } else {
 +        qemu_set_fd_handler(event_notifier_get_fd(notifier),
 +                            NULL, NULL, NULL);
 +    }
 +}
 +
 +static int virtio_pci_set_host_notifiers(VirtIOPCIProxy *proxy, bool assign)
 +{
 +    int n, r;
 +
 +    for (n = 0; n  VIRTIO_PCI_QUEUE_MAX

Re: [Qemu-devel] Re: [PATCH 2/3] virtio-pci: Use ioeventfd for virtqueue notify

2010-11-15 Thread Stefan Hajnoczi
On Sun, Nov 14, 2010 at 12:19 PM, Avi Kivity a...@redhat.com wrote:
 On 11/14/2010 01:05 PM, Avi Kivity wrote:

 I agree, but let's enable virtio-ioeventfd carefully because bad code
 is out there.


 Sure.  Note as long as the thread waiting on ioeventfd doesn't consume too
 much cpu, it will awaken quickly and we won't have the transaction per
 timeslice effect.

 btw, what about virtio-blk with linux-aio?  Have you benchmarked that with
 and without ioeventfd?


 And, what about efficiency?  As in bits/cycle?

We are running benchmarks with this latest patch and will report results.

Stefan


Re: [Qemu-devel] Re: [PATCH 2/3] virtio-pci: Use ioeventfd for virtqueue notify

2010-11-14 Thread Stefan Hajnoczi
On Sun, Nov 14, 2010 at 10:34 AM, Avi Kivity a...@redhat.com wrote:
 On 11/12/2010 11:20 AM, Stefan Hajnoczi wrote:

   Who guarantees that less common virtio-blk and virtio-net guest drivers
   for non-Linux OSes are fine with it?  Maybe you should add a feature
  flag
   that the guest has to ACK to enable it.

 Virtio-blk and virtio-net are fine.  Both of those devices are
 expected to operate asynchronously.  SeaBIOS and gPXE virtio-net
 drivers spin but they expect to and it is okay in those environments.
 They already burn CPU today.

 Virtio-console expects synchronous virtqueue kick.  In Linux,
 virtio_console.c __send_control_msg() and send_buf() will spin.  Qemu
 userspace is able to complete those requests synchronously so that the
 guest never actually burns CPU (e.g.
 hw/virtio-serial-bus.c:send_control_msg()).  I don't want to burn CPU
 in places where we previously didn't.

 This is a horrible bug.  virtio is an asynchronous API.  Some hypervisor
 implementations cannot even provide synchronous notifications.

 It's good that QEMU can decide whether or not to handle virtqueue kick
 in the vcpu thread.  For high performance asynchronous devices like
 virtio-net and virtio-blk it makes sense to use ioeventfd.  For others
 it may not be useful.  I'm not sure a feature bit that exposes this
 detail to the guest would be useful.

 The guest should always assume that virtio devices are asynchronous.

I agree, but let's enable virtio-ioeventfd carefully because bad code
is out there.

Stefan


Re: [PATCH 2/3] virtio-pci: Use ioeventfd for virtqueue notify

2010-11-12 Thread Stefan Hajnoczi
On Thu, Nov 11, 2010 at 3:53 PM, Michael S. Tsirkin m...@redhat.com wrote:
 On Thu, Nov 11, 2010 at 01:47:21PM +, Stefan Hajnoczi wrote:
 Care must be taken not to interfere with vhost-net, which already uses
 ioeventfd host notifiers.  The following list shows the behavior implemented 
 in
 this patch and is designed to take vhost-net into account:

  * VIRTIO_CONFIG_S_DRIVER_OK - assign host notifiers, 
 qemu_set_fd_handler(virtio_pci_host_notifier_read)

 we should also deassign when VIRTIO_CONFIG_S_DRIVER_OK is cleared
 by io write or bus master bit?

You're right, I'll fix the lifecycle to trigger symmetrically on
status bit changes rather than VIRTIO_CONFIG_S_DRIVER_OK/reset.

 +static void virtio_pci_reset_vdev(VirtIOPCIProxy *proxy)
 +{
 +    /* Poke virtio device so it deassigns its host notifiers (if any) */
 +    virtio_set_status(proxy->vdev, 0);

 Hmm. virtio_reset already sets status to 0.
 I guess it should just be fixed to call virtio_set_status?

This part is ugly.  The problem is that virtio_reset() calls
virtio_set_status(vdev, 0) but doesn't give the transport binding a
chance to clean up after the virtio device has cleaned up.  Since
virtio-net will spot status=0 and deassign its host notifier, we need
to perform our own clean up after vhost.

What makes this slightly less of a hack is the fact that virtio-pci.c
was already causing virtio_set_status(vdev, 0) to be invoked twice
during reset.  When 0 is written to the VIRTIO_PCI_STATUS register, we
do virtio_set_status(proxy->vdev, val & 0xFF) and then
virtio_reset(proxy->vdev).  So the status byte callback already gets
invoked twice today.

I've just split this out into virtio_pci_reset_vdev() and (ab)used it
to correctly clean up virtqueue ioeventfd.

The alternative is to add another callback from virtio.c so we are
notified after the vdev's reset callback has finished.
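
Roughly this shape, for illustration only (pre_set_status/post_set_status are
hypothetical callback names, not the final v3 code):

    void virtio_set_status(VirtIODevice *vdev, uint8_t val)
    {
        const VirtIOBindings *binding = vdev->binding;

        if (binding->pre_set_status) {
            binding->pre_set_status(vdev->binding_opaque, val);
        }
        if (vdev->set_status) {
            vdev->set_status(vdev, val);
        }
        vdev->status = val;
        if (binding->post_set_status) {
            binding->post_set_status(vdev->binding_opaque, val);
        }
    }

That would let virtio-pci.c toggle the ioeventfd handlers after the device's
own status/reset handling has run, without relying on the double set_status
side-effect.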

 @@ -223,10 +322,16 @@ static void virtio_ioport_write(void *opaque, uint32_t 
 addr, uint32_t val)
          virtio_queue_notify(vdev, val);
          break;
      case VIRTIO_PCI_STATUS:
 -        virtio_set_status(vdev, val & 0xFF);
 -        if (vdev->status == 0) {
 -            virtio_reset(proxy->vdev);
 -            msix_unuse_all_vectors(proxy->pci_dev);
 +        if ((val & VIRTIO_CONFIG_S_DRIVER_OK) &&
 +            !(vdev->status & VIRTIO_CONFIG_S_DRIVER_OK) &&
 +            (proxy->flags & VIRTIO_PCI_FLAG_USE_IOEVENTFD)) {
 +            virtio_pci_set_host_notifiers(proxy, true);
 +        }

 So we set host notifiers to true from here, but to false
 only on reset? This seems strange. Should not we disable
 notifiers when driver clears OK status?
 How about on bus master disable?

You're right, this needs to be fixed.
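
Concretely, something like this in virtio_write_config(), next to the
existing bus master workaround, should cover that case (sketch only):

    if (PCI_COMMAND == address && !(val & PCI_COMMAND_MASTER) &&
        (proxy->flags & VIRTIO_PCI_FLAG_USE_IOEVENTFD)) {
        virtio_pci_set_host_notifiers(proxy, false);
    }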

 @@ -714,6 +803,8 @@ static PCIDeviceInfo virtio_info[] = {
          .exit       = virtio_net_exit_pci,
          .romfile    = "pxe-virtio.bin",
          .qdev.props = (Property[]) {
 +            DEFINE_PROP_UINT32("flags", VirtIOPCIProxy, flags,
 +                               VIRTIO_PCI_FLAG_USE_IOEVENTFD),
              DEFINE_PROP_UINT32("vectors", VirtIOPCIProxy, nvectors, 3),
              DEFINE_VIRTIO_NET_FEATURES(VirtIOPCIProxy, host_features),
              DEFINE_NIC_PROPERTIES(VirtIOPCIProxy, nic),

 This ties interface to an internal macro value.  Further, user gets to
 tweak other fields in this integer which we don't want.  Finally, the
 interface is extremely unfriendly.
 Please use a bit property instead: DEFINE_PROP_BIT.

Will fix in v3.

 diff --git a/hw/virtio.c b/hw/virtio.c
 index a2a657e..f588e29 100644
 --- a/hw/virtio.c
 +++ b/hw/virtio.c
 @@ -582,6 +582,11 @@ void virtio_queue_notify(VirtIODevice *vdev, int n)
      }
  }

 +void virtio_queue_notify_vq(VirtQueue *vq)
 +{
 +    virtio_queue_notify(vq->vdev, vq - vq->vdev->vq);

 Let's implement virtio_queue_notify in terms of virtio_queue_notify_vq.
 Not the other way around.

Will fix in v3.
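
Roughly like this (sketch):

    void virtio_queue_notify_vq(VirtQueue *vq)
    {
        if (vq->vring.desc) {
            vq->handle_output(vq->vdev, vq);
        }
    }

    void virtio_queue_notify(VirtIODevice *vdev, int n)
    {
        if (n < VIRTIO_PCI_QUEUE_MAX) {
            virtio_queue_notify_vq(&vdev->vq[n]);
        }
    }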

Stefan


Re: [Qemu-devel] Re: [PATCH 2/3] virtio-pci: Use ioeventfd for virtqueue notify

2010-11-12 Thread Stefan Hajnoczi
On Thu, Nov 11, 2010 at 4:45 PM, Christoph Hellwig h...@infradead.org wrote:
 On Thu, Nov 11, 2010 at 01:47:21PM +, Stefan Hajnoczi wrote:
 Some virtio devices are known to have guest drivers which expect a notify to 
 be
 processed synchronously and spin waiting for completion.  Only enable 
 ioeventfd
 for virtio-blk and virtio-net for now.

 Who guarantees that less common virtio-blk and virtio-net guest drivers
 for non-Linux OSes are fine with it?  Maybe you should add a feature flag
 that the guest has to ACK to enable it.

Virtio-blk and virtio-net are fine.  Both of those devices are
expected to operate asynchronously.  SeaBIOS and gPXE virtio-net
drivers spin but they expect to and it is okay in those environments.
They already burn CPU today.

Virtio-console expects synchronous virtqueue kick.  In Linux,
virtio_console.c __send_control_msg() and send_buf() will spin.  Qemu
userspace is able to complete those requests synchronously so that the
guest never actually burns CPU (e.g.
hw/virtio-serial-bus.c:send_control_msg()).  I don't want to burn CPU
in places where we previously didn't.

It's good that QEMU can decide whether or not to handle virtqueue kick
in the vcpu thread.  For high performance asynchronous devices like
virtio-net and virtio-blk it makes sense to use ioeventfd.  For others
it may not be useful.  I'm not sure a feature bit that exposes this
detail to the guest would be useful.

Stefan


Re: [PATCH 2/3] virtio-pci: Use ioeventfd for virtqueue notify

2010-11-12 Thread Stefan Hajnoczi
On Fri, Nov 12, 2010 at 9:25 AM, Michael S. Tsirkin m...@redhat.com wrote:
 On Fri, Nov 12, 2010 at 09:18:48AM +, Stefan Hajnoczi wrote:
 On Thu, Nov 11, 2010 at 3:53 PM, Michael S. Tsirkin m...@redhat.com wrote:
  On Thu, Nov 11, 2010 at 01:47:21PM +, Stefan Hajnoczi wrote:
  Care must be taken not to interfere with vhost-net, which already uses
  ioeventfd host notifiers.  The following list shows the behavior 
  implemented in
  this patch and is designed to take vhost-net into account:
 
   * VIRTIO_CONFIG_S_DRIVER_OK - assign host notifiers, 
  qemu_set_fd_handler(virtio_pci_host_notifier_read)
 
  we should also deassign when VIRTIO_CONFIG_S_DRIVER_OK is cleared
  by io write or bus master bit?

 You're right, I'll fix the lifecycle to trigger symmetrically on
 status bit changes rather than VIRTIO_CONFIG_S_DRIVER_OK/reset.

  +static void virtio_pci_reset_vdev(VirtIOPCIProxy *proxy)
  +{
  +    /* Poke virtio device so it deassigns its host notifiers (if any) */
  +    virtio_set_status(proxy->vdev, 0);
 
  Hmm. virtio_reset already sets status to 0.
  I guess it should just be fixed to call virtio_set_status?

 This part is ugly.  The problem is that virtio_reset() calls
 virtio_set_status(vdev, 0) but doesn't give the transport binding a
 chance to clean up after the virtio device has cleaned up.  Since
 virtio-net will spot status=0 and deassign its host notifier, we need
 to perform our own clean up after vhost.

 What makes this slightly less of a hack is the fact that virtio-pci.c
 was already causing virtio_set_status(vdev, 0) to be invoked twice
 during reset.  When 0 is written to the VIRTIO_PCI_STATUS register, we
 do virtio_set_status(proxy->vdev, val & 0xFF) and then
 virtio_reset(proxy->vdev).  So the status byte callback already gets
 invoked twice today.

 I've just split this out into virtio_pci_reset_vdev() and (ab)used it
 to correctly clean up virtqueue ioeventfd.

 The alternative is to add another callback from virtio.c so we are
 notified after the vdev's reset callback has finished.

 Oh, likely not worth it. Maybe put the above explanation in the comment.
 Will this go away now that you move to set notifiers on status write?

For v3 I have switched to a bindings callback.  I wish it wasn't
necessary but the only other ways I can think of catching status
writes are hacks which depend on side-effects too much.

Stefan


[PATCH v3 2/3] virtio-pci: Use ioeventfd for virtqueue notify

2010-11-12 Thread Stefan Hajnoczi
Virtqueue notify is currently handled synchronously in userspace virtio.  This
prevents the vcpu from executing guest code while hardware emulation code
handles the notify.

On systems that support KVM, the ioeventfd mechanism can be used to make
virtqueue notify a lightweight exit by deferring hardware emulation to the
iothread and allowing the VM to continue execution.  This model is similar to
how vhost receives virtqueue notifies.

The result of this change is improved performance for userspace virtio devices.
Virtio-blk throughput increases especially for multithreaded scenarios and
virtio-net transmit throughput increases substantially.

Some virtio devices are known to have guest drivers which expect a notify to be
processed synchronously and spin waiting for completion.  Only enable ioeventfd
for virtio-blk and virtio-net for now.

Care must be taken not to interfere with vhost-net, which already uses
ioeventfd host notifiers.  The following list shows the behavior implemented in
this patch and is designed to take vhost-net into account:

 * VIRTIO_CONFIG_S_DRIVER_OK -> assign host notifiers, qemu_set_fd_handler(virtio_pci_host_notifier_read)
 * !VIRTIO_CONFIG_S_DRIVER_OK -> qemu_set_fd_handler(NULL), deassign host notifiers
 * virtio_pci_set_host_notifier(true) -> qemu_set_fd_handler(NULL)
 * virtio_pci_set_host_notifier(false) -> qemu_set_fd_handler(virtio_pci_host_notifier_read)

Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
---
 hw/virtio-pci.c |  152 ++
 hw/virtio.c |   14 -
 hw/virtio.h |   13 +
 3 files changed, 153 insertions(+), 26 deletions(-)

Now toggles host notifiers based on VIRTIO_CONFIG_S_DRIVER_OK status changes.
The cleanest way I could see was to introduce pre and a post set_status()
callbacks.  They allow a binding to hook status changes, including the status
change from virtio_reset().

diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index 549118d..117e855 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -83,6 +83,11 @@
 /* Flags track per-device state like workarounds for quirks in older guests. */
 #define VIRTIO_PCI_FLAG_BUS_MASTER_BUG  (1 << 0)
 
+/* Performance improves when virtqueue kick processing is decoupled from the
+ * vcpu thread using ioeventfd for some devices. */
+#define VIRTIO_PCI_FLAG_USE_IOEVENTFD_BIT 1
+#define VIRTIO_PCI_FLAG_USE_IOEVENTFD   (1 << VIRTIO_PCI_FLAG_USE_IOEVENTFD_BIT)
+
 /* QEMU doesn't strictly need write barriers since everything runs in
  * lock-step.  We'll leave the calls to wmb() in though to make it obvious for
  * KVM or if kqemu gets SMP support.
@@ -179,12 +184,125 @@ static int virtio_pci_load_queue(void * opaque, int n, 
QEMUFile *f)
 return 0;
 }
 
+static int virtio_pci_set_host_notifier_ioeventfd(VirtIOPCIProxy *proxy,
+                                                  int n, bool assign)
+{
+    VirtQueue *vq = virtio_get_queue(proxy->vdev, n);
+    EventNotifier *notifier = virtio_queue_get_host_notifier(vq);
+    int r;
+    if (assign) {
+        r = event_notifier_init(notifier, 1);
+        if (r < 0) {
+            return r;
+        }
+        r = kvm_set_ioeventfd_pio_word(event_notifier_get_fd(notifier),
+                                       proxy->addr + VIRTIO_PCI_QUEUE_NOTIFY,
+                                       n, assign);
+        if (r < 0) {
+            event_notifier_cleanup(notifier);
+        }
+    } else {
+        r = kvm_set_ioeventfd_pio_word(event_notifier_get_fd(notifier),
+                                       proxy->addr + VIRTIO_PCI_QUEUE_NOTIFY,
+                                       n, assign);
+        if (r < 0) {
+            return r;
+        }
+        event_notifier_cleanup(notifier);
+    }
+    return r;
+}
+
+static void virtio_pci_host_notifier_read(void *opaque)
+{
+    VirtQueue *vq = opaque;
+    EventNotifier *n = virtio_queue_get_host_notifier(vq);
+    if (event_notifier_test_and_clear(n)) {
+        virtio_queue_notify_vq(vq);
+    }
+}
+
+static void virtio_pci_set_host_notifier_fd_handler(VirtIOPCIProxy *proxy,
+                                                    int n, bool assign)
+{
+    VirtQueue *vq = virtio_get_queue(proxy->vdev, n);
+    EventNotifier *notifier = virtio_queue_get_host_notifier(vq);
+    if (assign) {
+        qemu_set_fd_handler(event_notifier_get_fd(notifier),
+                            virtio_pci_host_notifier_read, NULL, vq);
+    } else {
+        qemu_set_fd_handler(event_notifier_get_fd(notifier),
+                            NULL, NULL, NULL);
+    }
+}
+
+static int virtio_pci_set_host_notifiers(VirtIOPCIProxy *proxy, bool assign)
+{
+    int n, r;
+
+    for (n = 0; n < VIRTIO_PCI_QUEUE_MAX; n++) {
+        if (!virtio_queue_get_num(proxy->vdev, n)) {
+            continue;
+        }
+
+        if (assign) {
+            r = virtio_pci_set_host_notifier_ioeventfd(proxy, n, true);
+            if (r < 0) {
+                goto assign_error

[PATCH v3 3/3] virtio-pci: Don't use ioeventfd on old kernels

2010-11-12 Thread Stefan Hajnoczi
There used to be a limit of 6 KVM io bus devices inside the kernel.  On
such a kernel, don't use ioeventfd for virtqueue host notification since
the limit is reached too easily.  This ensures that existing vhost-net
setups (which always use ioeventfd) have ioeventfds available so they
can continue to work.

Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
---
 hw/virtio-pci.c |4 
 kvm-all.c   |   46 ++
 kvm-stub.c  |5 +
 kvm.h   |1 +
 4 files changed, 56 insertions(+), 0 deletions(-)

diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index 117e855..d3a7a9c 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -661,6 +661,10 @@ static void virtio_init_pci(VirtIOPCIProxy *proxy, 
VirtIODevice *vdev,
 pci_register_bar(proxy-pci_dev, 0, size, PCI_BASE_ADDRESS_SPACE_IO,
virtio_map);
 
+if (!kvm_has_many_ioeventfds()) {
+proxy-flags = ~VIRTIO_PCI_FLAG_USE_IOEVENTFD;
+}
+
 virtio_bind_device(vdev, virtio_pci_bindings, proxy);
 proxy-host_features |= 0x1  VIRTIO_F_NOTIFY_ON_EMPTY;
 proxy-host_features |= 0x1  VIRTIO_F_BAD_FEATURE;
diff --git a/kvm-all.c b/kvm-all.c
index 37b99c7..ba302bc 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -28,6 +28,11 @@
 #include "kvm.h"
 #include "bswap.h"
 
+/* This check must be after config-host.h is included */
+#ifdef CONFIG_EVENTFD
+#include <sys/eventfd.h>
+#endif
+
 /* KVM uses PAGE_SIZE in it's definition of COALESCED_MMIO_MAX */
 #define PAGE_SIZE TARGET_PAGE_SIZE
 
@@ -72,6 +77,7 @@ struct KVMState
 int irqchip_in_kernel;
 int pit_in_kernel;
 int xsave, xcrs;
+int many_ioeventfds;
 };
 
 static KVMState *kvm_state;
@@ -441,6 +447,39 @@ int kvm_check_extension(KVMState *s, unsigned int 
extension)
 return ret;
 }
 
+static int kvm_check_many_ioeventfds(void)
+{
+/* Older kernels have a 6 device limit on the KVM io bus.  Find out so we
+ * can avoid creating too many ioeventfds.
+ */
+#ifdef CONFIG_EVENTFD
+    int ioeventfds[7];
+    int i, ret = 0;
+    for (i = 0; i < ARRAY_SIZE(ioeventfds); i++) {
+        ioeventfds[i] = eventfd(0, EFD_CLOEXEC);
+        if (ioeventfds[i] < 0) {
+            break;
+        }
+        ret = kvm_set_ioeventfd_pio_word(ioeventfds[i], 0, i, true);
+        if (ret < 0) {
+            close(ioeventfds[i]);
+            break;
+        }
+    }
+
+    /* Decide whether many devices are supported or not */
+    ret = i == ARRAY_SIZE(ioeventfds);
+
+    while (i-- > 0) {
+        kvm_set_ioeventfd_pio_word(ioeventfds[i], 0, i, false);
+        close(ioeventfds[i]);
+    }
+    return ret;
+#else
+    return 0;
+#endif
+}
+
 static void kvm_set_phys_mem(target_phys_addr_t start_addr,
 ram_addr_t size,
 ram_addr_t phys_offset)
@@ -717,6 +756,8 @@ int kvm_init(int smp_cpus)
 kvm_state = s;
 cpu_register_phys_memory_client(kvm_cpu_phys_memory_client);
 
+    s->many_ioeventfds = kvm_check_many_ioeventfds();
+
 return 0;
 
 err:
@@ -1046,6 +1087,11 @@ int kvm_has_xcrs(void)
     return kvm_state->xcrs;
 }
 
+int kvm_has_many_ioeventfds(void)
+{
+    return kvm_state->many_ioeventfds;
+}
+
 void kvm_setup_guest_memory(void *start, size_t size)
 {
 if (!kvm_has_sync_mmu()) {
diff --git a/kvm-stub.c b/kvm-stub.c
index 5384a4b..33d4476 100644
--- a/kvm-stub.c
+++ b/kvm-stub.c
@@ -99,6 +99,11 @@ int kvm_has_robust_singlestep(void)
 return 0;
 }
 
+int kvm_has_many_ioeventfds(void)
+{
+return 0;
+}
+
 void kvm_setup_guest_memory(void *start, size_t size)
 {
 }
diff --git a/kvm.h b/kvm.h
index 60a9b42..ce08d42 100644
--- a/kvm.h
+++ b/kvm.h
@@ -42,6 +42,7 @@ int kvm_has_robust_singlestep(void);
 int kvm_has_debugregs(void);
 int kvm_has_xsave(void);
 int kvm_has_xcrs(void);
+int kvm_has_many_ioeventfds(void);
 
 #ifdef NEED_CPU_H
 int kvm_init_vcpu(CPUState *env);
-- 
1.7.2.3



[PATCH v3 1/3] virtio-pci: Rename bugs field to flags

2010-11-12 Thread Stefan Hajnoczi
The VirtIOPCIProxy bugs field is currently used to enable workarounds
for older guests.  Rename it to flags so that other per-device behavior
can be tracked.

A later patch uses the flags field to remember whether ioeventfd should
be used for virtqueue host notification.

Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
---
 hw/virtio-pci.c |   15 +++
 1 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index 729917d..549118d 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -80,9 +80,8 @@
  * 12 is historical, and due to x86 page size. */
 #define VIRTIO_PCI_QUEUE_ADDR_SHIFT12
 
-/* We can catch some guest bugs inside here so we continue supporting older
-   guests. */
-#define VIRTIO_PCI_BUG_BUS_MASTER  (1  0)
+/* Flags track per-device state like workarounds for quirks in older guests. */
+#define VIRTIO_PCI_FLAG_BUS_MASTER_BUG  (1  0)
 
 /* QEMU doesn't strictly need write barriers since everything runs in
  * lock-step.  We'll leave the calls to wmb() in though to make it obvious for
@@ -95,7 +94,7 @@
 typedef struct {
 PCIDevice pci_dev;
 VirtIODevice *vdev;
-uint32_t bugs;
+uint32_t flags;
 uint32_t addr;
 uint32_t class_code;
 uint32_t nvectors;
@@ -159,7 +158,7 @@ static int virtio_pci_load_config(void * opaque, QEMUFile 
*f)
in ready state. Then we have a buggy guest OS. */
 if ((proxy-vdev-status  VIRTIO_CONFIG_S_DRIVER_OK) 
 !(proxy-pci_dev.config[PCI_COMMAND]  PCI_COMMAND_MASTER)) {
-proxy-bugs |= VIRTIO_PCI_BUG_BUS_MASTER;
+proxy-flags |= VIRTIO_PCI_FLAG_BUS_MASTER_BUG;
 }
 return 0;
 }
@@ -185,7 +184,7 @@ static void virtio_pci_reset(DeviceState *d)
 VirtIOPCIProxy *proxy = container_of(d, VirtIOPCIProxy, pci_dev.qdev);
 virtio_reset(proxy-vdev);
 msix_reset(proxy-pci_dev);
-proxy-bugs = 0;
+proxy-flags = 0;
 }
 
 static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val)
@@ -235,7 +234,7 @@ static void virtio_ioport_write(void *opaque, uint32_t 
addr, uint32_t val)
some safety checks. */
 if ((val  VIRTIO_CONFIG_S_DRIVER_OK) 
 !(proxy-pci_dev.config[PCI_COMMAND]  PCI_COMMAND_MASTER)) {
-proxy-bugs |= VIRTIO_PCI_BUG_BUS_MASTER;
+proxy-flags |= VIRTIO_PCI_FLAG_BUS_MASTER_BUG;
 }
 break;
 case VIRTIO_MSI_CONFIG_VECTOR:
@@ -403,7 +402,7 @@ static void virtio_write_config(PCIDevice *pci_dev, 
uint32_t address,
 
 if (PCI_COMMAND == address) {
 if (!(val  PCI_COMMAND_MASTER)) {
-if (!(proxy-bugs  VIRTIO_PCI_BUG_BUS_MASTER)) {
+if (!(proxy-flags  VIRTIO_PCI_FLAG_BUS_MASTER_BUG)) {
 virtio_set_status(proxy-vdev,
   proxy-vdev-status  
~VIRTIO_CONFIG_S_DRIVER_OK);
 }
-- 
1.7.2.3



Re: [PATCH v3 3/3] virtio-pci: Don't use ioeventfd on old kernels

2010-11-12 Thread Stefan Hajnoczi
On Fri, Nov 12, 2010 at 1:24 PM, Stefan Hajnoczi
stefa...@linux.vnet.ibm.com wrote:
 @@ -1046,6 +1087,11 @@ int kvm_has_xcrs(void)
     return kvm_state->xcrs;
  }

 +int kvm_has_many_ioeventfds(void)
 +{
 +    return kvm_state->many_ioeventfds;
 +}
 +

Missing if (!kvm_enabled()) { return 0; }.  Will fix in next version,
would still appreciate review comments on any other aspect of the
patch.
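
i.e.:

    int kvm_has_many_ioeventfds(void)
    {
        if (!kvm_enabled()) {
            return 0;
        }
        return kvm_state->many_ioeventfds;
    }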

Stefan


Re: Unable to start VM using COWed image

2010-11-11 Thread Stefan Hajnoczi
On Thu, Nov 11, 2010 at 12:17 PM, Prasad Joshi
p.g.jo...@student.reading.ac.uk wrote:
 Though specifying the absolute path for source image worked for me.
 Can any one please let me know the situation in which one would not want to 
 specify the absolute path?
 How does relative path help?
 Advantage of using relative path rather than absolute path.

 I think using the absolute path would always work.

Relative paths are useful when sharing images with other people.  An
absolute path won't work on another machine unless you use the same
parent directory structure.  If you send me an image file with an
absolute path in your home directory, I won't be able to use it easily
on my machine.

(Actually the new qemu-img rebase -u command can be used to fix up the
image file on the destination machine but it's an extra step and not
user-friendly.)
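
For example, qemu-img rebase -u -b new-backing.qcow2 image.qcow2 (filenames
made up) only rewrites the backing file reference in the image header; no
data is copied.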

Stefan


[PATCH v2 0/3] virtio: Use ioeventfd for virtqueue notify

2010-11-11 Thread Stefan Hajnoczi
This is a rewrite of the virtio-ioeventfd patchset to work at the virtio-pci.c
level instead of virtio.c.  This results in better integration with the
host/guest notifier code and makes the code simpler (no more state machine).

Virtqueue notify is currently handled synchronously in userspace virtio.  This
prevents the vcpu from executing guest code while hardware emulation code
handles the notify.

On systems that support KVM, the ioeventfd mechanism can be used to make
virtqueue notify a lightweight exit by deferring hardware emulation to the
iothread and allowing the VM to continue execution.  This model is similar to
how vhost receives virtqueue notifies.

The result of this change is improved performance for userspace virtio devices.
Virtio-blk throughput increases especially for multithreaded scenarios and
virtio-net transmit throughput increases substantially.

Now that this code is in virtio-pci.c it is possible to explicitly enable
devices for which virtio-ioeventfd should be used.  Only virtio-blk and
virtio-net are enabled at this time.



[PATCH 2/3] virtio-pci: Use ioeventfd for virtqueue notify

2010-11-11 Thread Stefan Hajnoczi
Virtqueue notify is currently handled synchronously in userspace virtio.  This
prevents the vcpu from executing guest code while hardware emulation code
handles the notify.

On systems that support KVM, the ioeventfd mechanism can be used to make
virtqueue notify a lightweight exit by deferring hardware emulation to the
iothread and allowing the VM to continue execution.  This model is similar to
how vhost receives virtqueue notifies.

The result of this change is improved performance for userspace virtio devices.
Virtio-blk throughput increases especially for multithreaded scenarios and
virtio-net transmit throughput increases substantially.

Some virtio devices are known to have guest drivers which expect a notify to be
processed synchronously and spin waiting for completion.  Only enable ioeventfd
for virtio-blk and virtio-net for now.

Care must be taken not to interfere with vhost-net, which already uses
ioeventfd host notifiers.  The following list shows the behavior implemented in
this patch and is designed to take vhost-net into account:

 * VIRTIO_CONFIG_S_DRIVER_OK -> assign host notifiers, qemu_set_fd_handler(virtio_pci_host_notifier_read)
 * reset -> qemu_set_fd_handler(NULL), deassign host notifiers
 * virtio_pci_set_host_notifier(true) -> qemu_set_fd_handler(NULL)
 * virtio_pci_set_host_notifier(false) -> qemu_set_fd_handler(virtio_pci_host_notifier_read)

Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
---
 hw/virtio-pci.c |  155 +++---
 hw/virtio.c |5 ++
 hw/virtio.h |1 +
 3 files changed, 129 insertions(+), 32 deletions(-)

diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index 549118d..436fc59 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -83,6 +83,10 @@
 /* Flags track per-device state like workarounds for quirks in older guests. */
 #define VIRTIO_PCI_FLAG_BUS_MASTER_BUG  (1  0)
 
+/* Performance improves when virtqueue kick processing is decoupled from the
+ * vcpu thread using ioeventfd for some devices. */
+#define VIRTIO_PCI_FLAG_USE_IOEVENTFD   (1  1)
+
 /* QEMU doesn't strictly need write barriers since everything runs in
  * lock-step.  We'll leave the calls to wmb() in though to make it obvious for
  * KVM or if kqemu gets SMP support.
@@ -179,12 +183,108 @@ static int virtio_pci_load_queue(void * opaque, int n, 
QEMUFile *f)
 return 0;
 }
 
+static int virtio_pci_set_host_notifier_ioeventfd(VirtIOPCIProxy *proxy, int 
n, bool assign)
+{
+VirtQueue *vq = virtio_get_queue(proxy-vdev, n);
+EventNotifier *notifier = virtio_queue_get_host_notifier(vq);
+int r;
+if (assign) {
+r = event_notifier_init(notifier, 1);
+if (r  0) {
+return r;
+}
+r = kvm_set_ioeventfd_pio_word(event_notifier_get_fd(notifier),
+   proxy-addr + VIRTIO_PCI_QUEUE_NOTIFY,
+   n, assign);
+if (r  0) {
+event_notifier_cleanup(notifier);
+}
+} else {
+r = kvm_set_ioeventfd_pio_word(event_notifier_get_fd(notifier),
+   proxy-addr + VIRTIO_PCI_QUEUE_NOTIFY,
+   n, assign);
+if (r  0) {
+return r;
+}
+event_notifier_cleanup(notifier);
+}
+return r;
+}
+
+static void virtio_pci_host_notifier_read(void *opaque)
+{
+    VirtQueue *vq = opaque;
+    EventNotifier *n = virtio_queue_get_host_notifier(vq);
+    if (event_notifier_test_and_clear(n)) {
+        virtio_queue_notify_vq(vq);
+    }
+}
+
+static void virtio_pci_set_host_notifier_fd_handler(VirtIOPCIProxy *proxy, int n, bool assign)
+{
+    VirtQueue *vq = virtio_get_queue(proxy->vdev, n);
+    EventNotifier *notifier = virtio_queue_get_host_notifier(vq);
+    if (assign) {
+        qemu_set_fd_handler(event_notifier_get_fd(notifier),
+                            virtio_pci_host_notifier_read, NULL, vq);
+    } else {
+        qemu_set_fd_handler(event_notifier_get_fd(notifier),
+                            NULL, NULL, NULL);
+    }
+}
+
+static int virtio_pci_set_host_notifiers(VirtIOPCIProxy *proxy, bool assign)
+{
+    int n, r;
+
+    for (n = 0; n < VIRTIO_PCI_QUEUE_MAX; n++) {
+        if (!virtio_queue_get_num(proxy->vdev, n)) {
+            continue;
+        }
+
+        if (assign) {
+            r = virtio_pci_set_host_notifier_ioeventfd(proxy, n, true);
+            if (r < 0) {
+                goto assign_error;
+            }
+
+            virtio_pci_set_host_notifier_fd_handler(proxy, n, true);
+        } else {
+            virtio_pci_set_host_notifier_fd_handler(proxy, n, false);
+            virtio_pci_set_host_notifier_ioeventfd(proxy, n, false);
+        }
+    }
+    return 0;
+
+assign_error:
+    proxy->flags &= ~VIRTIO_PCI_FLAG_USE_IOEVENTFD;
+    while (--n >= 0) {
+        virtio_pci_set_host_notifier_fd_handler(proxy, n, false);
+virtio_pci_set_host_notifier_ioeventfd

[PATCH 1/3] virtio-pci: Rename bugs field to flags

2010-11-11 Thread Stefan Hajnoczi
The VirtIOPCIProxy bugs field is currently used to enable workarounds
for older guests.  Rename it to flags so that other per-device behavior
can be tracked.

A later patch uses the flags field to remember whether ioeventfd should
be used for virtqueue host notification.

Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
---
 hw/virtio-pci.c |   15 +++
 1 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index 729917d..549118d 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -80,9 +80,8 @@
  * 12 is historical, and due to x86 page size. */
 #define VIRTIO_PCI_QUEUE_ADDR_SHIFT12
 
-/* We can catch some guest bugs inside here so we continue supporting older
-   guests. */
-#define VIRTIO_PCI_BUG_BUS_MASTER  (1 << 0)
+/* Flags track per-device state like workarounds for quirks in older guests. */
+#define VIRTIO_PCI_FLAG_BUS_MASTER_BUG  (1 << 0)
 
 /* QEMU doesn't strictly need write barriers since everything runs in
  * lock-step.  We'll leave the calls to wmb() in though to make it obvious for
@@ -95,7 +94,7 @@
 typedef struct {
 PCIDevice pci_dev;
 VirtIODevice *vdev;
-uint32_t bugs;
+uint32_t flags;
 uint32_t addr;
 uint32_t class_code;
 uint32_t nvectors;
@@ -159,7 +158,7 @@ static int virtio_pci_load_config(void * opaque, QEMUFile *f)
        in ready state. Then we have a buggy guest OS. */
     if ((proxy->vdev->status & VIRTIO_CONFIG_S_DRIVER_OK) &&
         !(proxy->pci_dev.config[PCI_COMMAND] & PCI_COMMAND_MASTER)) {
-        proxy->bugs |= VIRTIO_PCI_BUG_BUS_MASTER;
+        proxy->flags |= VIRTIO_PCI_FLAG_BUS_MASTER_BUG;
     }
 return 0;
 }
@@ -185,7 +184,7 @@ static void virtio_pci_reset(DeviceState *d)
     VirtIOPCIProxy *proxy = container_of(d, VirtIOPCIProxy, pci_dev.qdev);
     virtio_reset(proxy->vdev);
     msix_reset(&proxy->pci_dev);
-    proxy->bugs = 0;
+    proxy->flags = 0;
 }
 
 static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val)
@@ -235,7 +234,7 @@ static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val)
            some safety checks. */
         if ((val & VIRTIO_CONFIG_S_DRIVER_OK) &&
             !(proxy->pci_dev.config[PCI_COMMAND] & PCI_COMMAND_MASTER)) {
-            proxy->bugs |= VIRTIO_PCI_BUG_BUS_MASTER;
+            proxy->flags |= VIRTIO_PCI_FLAG_BUS_MASTER_BUG;
         }
 break;
 case VIRTIO_MSI_CONFIG_VECTOR:
@@ -403,7 +402,7 @@ static void virtio_write_config(PCIDevice *pci_dev, uint32_t address,
 
     if (PCI_COMMAND == address) {
         if (!(val & PCI_COMMAND_MASTER)) {
-            if (!(proxy->bugs & VIRTIO_PCI_BUG_BUS_MASTER)) {
+            if (!(proxy->flags & VIRTIO_PCI_FLAG_BUS_MASTER_BUG)) {
                 virtio_set_status(proxy->vdev,
                                   proxy->vdev->status & ~VIRTIO_CONFIG_S_DRIVER_OK);
             }
-- 
1.7.2.3

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] virtio-pci: Don't use ioeventfd on old kernels

2010-11-11 Thread Stefan Hajnoczi
There used to be a limit of 6 KVM io bus devices inside the kernel.  On
such a kernel, don't use ioeventfd for virtqueue host notification since
the limit is reached too easily.  This ensures that existing vhost-net
setups (which always use ioeventfd) have ioeventfds available so they
can continue to work.

Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
---
 hw/virtio-pci.c |4 
 kvm-all.c   |   46 ++
 kvm-stub.c  |5 +
 kvm.h   |1 +
 4 files changed, 56 insertions(+), 0 deletions(-)

diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index 436fc59..365a26b 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -646,6 +646,10 @@ static void virtio_init_pci(VirtIOPCIProxy *proxy, VirtIODevice *vdev,
     pci_register_bar(&proxy->pci_dev, 0, size, PCI_BASE_ADDRESS_SPACE_IO,
                      virtio_map);
 
+    if (!kvm_has_many_ioeventfds()) {
+        proxy->flags &= ~VIRTIO_PCI_FLAG_USE_IOEVENTFD;
+    }
+
     virtio_bind_device(vdev, &virtio_pci_bindings, proxy);
     proxy->host_features |= 0x1 << VIRTIO_F_NOTIFY_ON_EMPTY;
     proxy->host_features |= 0x1 << VIRTIO_F_BAD_FEATURE;
diff --git a/kvm-all.c b/kvm-all.c
index 37b99c7..ba302bc 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -28,6 +28,11 @@
 #include "kvm.h"
 #include "bswap.h"
 
+/* This check must be after config-host.h is included */
+#ifdef CONFIG_EVENTFD
+#include <sys/eventfd.h>
+#endif
+
 /* KVM uses PAGE_SIZE in it's definition of COALESCED_MMIO_MAX */
 #define PAGE_SIZE TARGET_PAGE_SIZE
 
@@ -72,6 +77,7 @@ struct KVMState
 int irqchip_in_kernel;
 int pit_in_kernel;
 int xsave, xcrs;
+int many_ioeventfds;
 };
 
 static KVMState *kvm_state;
@@ -441,6 +447,39 @@ int kvm_check_extension(KVMState *s, unsigned int 
extension)
 return ret;
 }
 
+static int kvm_check_many_ioeventfds(void)
+{
+/* Older kernels have a 6 device limit on the KVM io bus.  Find out so we
+ * can avoid creating too many ioeventfds.
+ */
+#ifdef CONFIG_EVENTFD
+int ioeventfds[7];
+int i, ret = 0;
+    for (i = 0; i < ARRAY_SIZE(ioeventfds); i++) {
+        ioeventfds[i] = eventfd(0, EFD_CLOEXEC);
+        if (ioeventfds[i] < 0) {
+            break;
+        }
+        ret = kvm_set_ioeventfd_pio_word(ioeventfds[i], 0, i, true);
+        if (ret < 0) {
+            close(ioeventfds[i]);
+            break;
+        }
+}
+
+/* Decide whether many devices are supported or not */
+ret = i == ARRAY_SIZE(ioeventfds);
+
+    while (i-- > 0) {
+kvm_set_ioeventfd_pio_word(ioeventfds[i], 0, i, false);
+close(ioeventfds[i]);
+}
+return ret;
+#else
+return 0;
+#endif
+}
+
 static void kvm_set_phys_mem(target_phys_addr_t start_addr,
 ram_addr_t size,
 ram_addr_t phys_offset)
@@ -717,6 +756,8 @@ int kvm_init(int smp_cpus)
 kvm_state = s;
 cpu_register_phys_memory_client(kvm_cpu_phys_memory_client);
 
+    s->many_ioeventfds = kvm_check_many_ioeventfds();
+
 return 0;
 
 err:
@@ -1046,6 +1087,11 @@ int kvm_has_xcrs(void)
     return kvm_state->xcrs;
 }
 
+int kvm_has_many_ioeventfds(void)
+{
+    return kvm_state->many_ioeventfds;
+}
+
 void kvm_setup_guest_memory(void *start, size_t size)
 {
 if (!kvm_has_sync_mmu()) {
diff --git a/kvm-stub.c b/kvm-stub.c
index 5384a4b..33d4476 100644
--- a/kvm-stub.c
+++ b/kvm-stub.c
@@ -99,6 +99,11 @@ int kvm_has_robust_singlestep(void)
 return 0;
 }
 
+int kvm_has_many_ioeventfds(void)
+{
+return 0;
+}
+
 void kvm_setup_guest_memory(void *start, size_t size)
 {
 }
diff --git a/kvm.h b/kvm.h
index 60a9b42..ce08d42 100644
--- a/kvm.h
+++ b/kvm.h
@@ -42,6 +42,7 @@ int kvm_has_robust_singlestep(void);
 int kvm_has_debugregs(void);
 int kvm_has_xsave(void);
 int kvm_has_xcrs(void);
+int kvm_has_many_ioeventfds(void);
 
 #ifdef NEED_CPU_H
 int kvm_init_vcpu(CPUState *env);
-- 
1.7.2.3

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ceph/rbd block driver for qemu-kvm (v7)

2010-11-11 Thread Stefan Hajnoczi
On Fri, Oct 15, 2010 at 8:54 PM, Christian Brunner c...@muc.de wrote:
 Hi,

 once again, Yehuda committed fixes for all the suggestions made on the
 list (and more). Here is the next update for the ceph/rbd block driver.

 Please let us know if there are any pending issues.

 For those who didn't follow the previous postings:

 This is an block driver for the distributed file system Ceph
 (http://ceph.newdream.net/). This driver uses librados (which
 is part of the Ceph server) for direct access to the Ceph object
 store and is running entirely in userspace (Yehuda also
 wrote a driver for the linux kernel, that can be used to access
 rbd volumes as a block device).

 Kind Regards,
 Christian

 Signed-off-by: Christian Brunner c...@muc.de
 Signed-off-by: Yehuda Sadeh yeh...@hq.newdream.net
 ---
  Makefile.objs |1 +
  block/rbd.c   | 1059 
 +
  block/rbd_types.h |   71 
  configure |   31 ++
  4 files changed, 1162 insertions(+), 0 deletions(-)
  create mode 100644 block/rbd.c
  create mode 100644 block/rbd_types.h

This patch is close now.  Just some minor issues below.


 diff --git a/Makefile.objs b/Makefile.objs
 index 6ee077c..56a13c1 100644
 --- a/Makefile.objs
 +++ b/Makefile.objs
 @@ -19,6 +19,7 @@ block-nested-y += parallels.o nbd.o blkdebug.o
  block-nested-$(CONFIG_WIN32) += raw-win32.o
  block-nested-$(CONFIG_POSIX) += raw-posix.o
  block-nested-$(CONFIG_CURL) += curl.o
 +block-nested-$(CONFIG_RBD) += rbd.o

  block-obj-y +=  $(addprefix block/, $(block-nested-y))

 diff --git a/block/rbd.c b/block/rbd.c
 new file mode 100644
 index 000..fbfb93e
 --- /dev/null
 +++ b/block/rbd.c
 @@ -0,0 +1,1059 @@
 +/*
 + * QEMU Block driver for RADOS (Ceph)
 + *
 + * Copyright (C) 2010 Christian Brunner c...@muc.de
 + *
 + * This work is licensed under the terms of the GNU GPL, version 2.  See
 + * the COPYING file in the top-level directory.
 + *
 + */
 +
 +#include "qemu-common.h"
 +#include "qemu-error.h"
 +
 +#include "rbd_types.h"
 +#include "block_int.h"
 +
 +#include <rados/librados.h>
 +
 +
 +
 +/*
 + * When specifying the image filename use:
 + *
 + * rbd:poolname/devicename
 + *
 + * poolname must be the name of an existing rados pool
 + *
 + * devicename is the basename for all objects used to
 + * emulate the raw device.
 + *
 + * Metadata information (image size, ...) is stored in an
 + * object with the name devicename.rbd.
 + *
 + * The raw device is split into 4MB sized objects by default.
 + * The sequencenumber is encoded in a 12 byte long hex-string,
 + * and is attached to the devicename, separated by a dot.
 + * e.g. devicename.1234567890ab
 + *
 + */
 +
 +#define OBJ_MAX_SIZE (1UL << OBJ_DEFAULT_OBJ_ORDER)
 +
 +typedef struct RBDAIOCB {
 +BlockDriverAIOCB common;
 +QEMUBH *bh;
 +int ret;
 +QEMUIOVector *qiov;
 +char *bounce;
 +int write;
 +int64_t sector_num;
 +int aiocnt;
 +int error;
 +struct BDRVRBDState *s;
 +int cancelled;
 +} RBDAIOCB;
 +
 +typedef struct RADOSCB {
 +int rcbid;
 +RBDAIOCB *acb;
 +struct BDRVRBDState *s;
 +int done;
 +int64_t segsize;
 +char *buf;
 +int ret;
 +} RADOSCB;
 +
 +#define RBD_FD_READ 0
 +#define RBD_FD_WRITE 1
 +
 +typedef struct BDRVRBDState {
 +int fds[2];
 +rados_pool_t pool;
 +rados_pool_t header_pool;
 +char name[RBD_MAX_OBJ_NAME_SIZE];
 +char block_name[RBD_MAX_BLOCK_NAME_SIZE];
 +uint64_t size;
 +uint64_t objsize;
 +int qemu_aio_count;
 +int read_only;
 +int event_reader_pos;
 +RADOSCB *event_rcb;
 +} BDRVRBDState;
 +
 +typedef struct rbd_obj_header_ondisk RbdHeader1;
 +
 +static int rbd_next_tok(char *dst, int dst_len,
 +char *src, char delim,
 +const char *name,
 +char **p)
 +{
 +int l;
 +char *end;
 +
 +*p = NULL;
 +
 +if (delim != '\0') {
 +end = strchr(src, delim);
 +if (end) {
 +*p = end + 1;
 +*end = '\0';
 +}
 +}
 +l = strlen(src);
 +    if (l >= dst_len) {
 +        error_report("%s too long", name);
 +        return -EINVAL;
 +    } else if (l == 0) {
 +        error_report("%s too short", name);
 +        return -EINVAL;
 +}
 +
 +pstrcpy(dst, dst_len, src);
 +
 +return 0;
 +}
 +
 +static int rbd_parsename(const char *filename,
 + char *pool, int pool_len,
 + char *snap, int snap_len,
 + char *name, int name_len)
 +{
 +const char *start;
 +char *p, *buf;
 +int ret;
 +
 +    if (!strstart(filename, "rbd:", &start)) {
 +return -EINVAL;
 +}
 +
 +buf = qemu_strdup(start);
 +p = buf;
 +
 +    ret = rbd_next_tok(pool, pool_len, p, '/', "pool name", &p);
 +    if (ret < 0 || !p) {
 +        ret = -EINVAL;
 +        goto done;
 +    }
 +    ret = rbd_next_tok(name, name_len, p, '@', "object name", &p);
 +    if (ret < 0) {
 + 

Re: Unable to start VM using COWed image

2010-11-10 Thread Stefan Hajnoczi
On Wed, Nov 10, 2010 at 10:08 AM, Prasad Joshi
p.g.jo...@student.reading.ac.uk wrote:
 Where can I get the code of the qemu-kvm program?
 I cloned the qemu-lvm git repository and compiled the code. But it looks like 
 qemu-kvm program is not part of this code.--

qemu-kvm.git contains the qemu-kvm codebase but the binary is built in
x86_64-softmmu/qemu-system-x86_64.  Distro packages typically rename
it to qemu-kvm.

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM, windows 2000 and qcow2

2010-11-10 Thread Stefan Hajnoczi
On Tue, Nov 9, 2010 at 8:06 PM, RaSca ra...@miamammausalinux.org wrote:
 Today i saw this page:

 http://www.linux-kvm.org/page/Guest_Support_Status

 in which is explained how is better to run win2k on qcow2 images.

The page does not state that it is better to run Windows 2000 on
qcow2, it's probably just that the tester decided to use qcow2 on his
machine and noted his configuration.

Raw should work just like qcow2 does and you can expect better
performance with raw.  Although qcow2 does run on raw devices like
drbd or lvm logical volumes, I suspect you'll see the same issue
you're getting now.

It's worth figuring out why your current configuration freezes.  I am
interested in the same questions Jernej asked about the freeze: is the
guest crashing with a BSOD or is the KVM process consuming all CPU?

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Unable to start VM using COWed image

2010-11-10 Thread Stefan Hajnoczi
On Wed, Nov 10, 2010 at 12:40 PM, Prasad Joshi
p.g.jo...@student.reading.ac.uk wrote:
 From: Stefan Hajnoczi [stefa...@gmail.com]
 Sent: 10 November 2010 11:12
 To: Prasad Joshi
 Cc: Keqin Hong; kvm@vger.kernel.org
 Subject: Re: Unable to start VM using COWed image

 On Wed, Nov 10, 2010 at 10:08 AM, Prasad Joshi
 p.g.jo...@student.reading.ac.uk wrote:
 Where can I get the code of the qemu-kvm program?
 I cloned the qemu-lvm git repository and compiled the code. But it looks 
 like qemu-kvm program is not part of this code.--

 qemu-kvm.git contains the qemu-kvm codebase but the binary is built in
 x86_64-softmmu/qemu-system-x86_64.  Distro packages typically rename
 it to qemu-kvm.

 Thanks Stefan for your reply.

 I guess you pointed out the problem in the first mail. QEMU places a 
 restriction on location of the COWed file. The source image and COWed image 
 should be in the same directory.

 In my case the source image was in directory /var/lib/libvirt/images/ and the 
 COWed image was in /home/prasad/Virtual directory.
 While debugging the source code using gdb I realized this limitation. It would 
 be good to fix this problem. I will see if I can solve this problem.

This behavior is a feature.  You chose to use a relative backing file
path when you used qemu-img create -b relative-path.

If you want an absolute path you need to use qemu-img create -b
/home/prasad/Virtual/... (i.e. specify an absolute path instead of a
relative path).

 One more question on the same lines,
 How does QEMU detect the file is COWed and the name of the file (not whole 
 path) from it is COWed?

COW support comes from the image file format that you choose.  A qcow
file is not just a raw image file like the kind you can dd from a real
disk.  Instead it has its own file format including a header and
metadata for tracking allocated space.  The header contains the name
of the backing file.
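
To make that concrete, here is a rough standalone sketch (not QEMU code) that
prints the backing file name recorded in a qcow2 header.  It assumes the layout
from the qcow2 specification: all fields big-endian, with backing_file_offset
at byte 8 and backing_file_size at byte 16 of the header:

#include <endian.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    uint8_t hdr[24];
    uint64_t off;
    uint32_t len;
    FILE *f;

    if (argc < 2 || !(f = fopen(argv[1], "rb")) || fread(hdr, sizeof(hdr), 1, f) != 1) {
        return 1;
    }
    memcpy(&off, hdr + 8, 8);   /* backing_file_offset, stored big-endian */
    memcpy(&len, hdr + 16, 4);  /* backing_file_size, stored big-endian */
    off = be64toh(off);
    len = be32toh(len);
    if (len == 0) {
        printf("no backing file\n");
        return 0;
    }
    char *name = calloc(1, len + 1);
    fseek(f, (long)off, SEEK_SET);
    if (fread(name, len, 1, f) != 1) {
        return 1;
    }
    printf("backing file: %s\n", name);
    return 0;
}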

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] virtio: Use ioeventfd for virtqueue notify

2010-11-10 Thread Stefan Hajnoczi
Hi Michael,
I have looked into the way irqfd with msix mask notifiers works.  From
what I can tell, the guest notifiers are enabled by vhost net in order
to hook up irqfds for the virtqueues.  MSIX allows vectors to be
masked so there is a mmio write notifier in qemu-kvm to toggle the
irqfd and its QEMU fd handler when the guest toggles the MSIX mask.

The irqfd is disabled but stays open as an eventfd while masked.  That
means masking/unmasking the vector does not close/open the eventfd
file descriptor itself.
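
In other words, masking only changes who consumes the already-open eventfd.
Roughly (illustrative sketch, not the actual qemu-kvm code; the callback name
is hypothetical):

static void handle_masked_vector(void *opaque)
{
    EventNotifier *n = opaque;
    if (event_notifier_test_and_clear(n)) {
        /* inject the interrupt from userspace while the vector is masked */
    }
}

static void vector_mask_changed(EventNotifier *n, bool masked)
{
    int fd = event_notifier_get_fd(n);
    if (masked) {
        /* KVM_IRQFD deassign ioctl would go here */
        qemu_set_fd_handler(fd, handle_masked_vector, NULL, n);
    } else {
        qemu_set_fd_handler(fd, NULL, NULL, NULL);
        /* KVM_IRQFD assign ioctl would go here */
    }
}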

I'm having trouble finding a direct parallel to virtio-ioeventfd here.
 We always want to have an ioeventfd per virtqueue unless the host
kernel does not support 6 ioeventfds per VM.

When vhost sets the host notifier we want to remove the QEMU fd
handler and allow vhost to use the event notifier's fd as it wants.
When vhost clears the host notifier we want to add the QEMU fd handler
again (unless the kernel does not support 6 ioeventfds per VM).

I think hooking in at the virtio-pci.c level instead of virtio.c is
possible but we're still going to have the same state transitions.  I
hope it can be done without adding per-virtqueue variables that track
state.

Before I go down this route, is there something I've missed and do you
think this approach will be better?

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Unable to start VM using COWed image

2010-11-09 Thread Stefan Hajnoczi
On Tue, Nov 9, 2010 at 3:36 PM, Prasad Joshi
p.g.jo...@student.reading.ac.uk wrote:
 Hello,

 I am trying to run KVM machine from the image created as COW from the 
 original image. But it not working.

 Screenshot that shows the KVM works with the original image

  [r...@prasad images]# qemu-kvm /var/lib/libvirt/images/Ubuntu.img -m 512
 ^Z
 [1]+  Stopped                 qemu-kvm /var/lib/libvirt/images/Ubuntu.img -m 
 512
 [r...@prasad images]# bg
 [1]+ qemu-kvm /var/lib/libvirt/images/Ubuntu.img -m 512 
 [r...@prasad images]# pwd
 /var/lib/libvirt/images
 [r...@prasad images]# lsmod | grep -i kvm
 kvm_intel              42122  3
 kvm                   257132  1 kvm_intel

 1. Created COW copy of the image after stoping the VM that was running

 [r...@prasad images]# pwd
 /var/lib/libvirt/images

 [r...@prasad images]# qemu-img create -b Ubuntu.img -f qcow 
 /home/prasad/Virtual/Ubuntu_copy.ovl

 2. Trying to run VM using the copy created

 [pra...@prasad Virtual]$ ls -l
 total 36
 -rw-r--r--. 1 root root 32832 Nov  9 15:33 Ubuntu_copy.ovl

Why is Ubuntu.img not visible in the ls output?

Ubuntu.img is a relative path to the backing file.  It looks like QEMU
will not be able to open the backing file.

Also, is there a reason you're using qcow and not qcow2?

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Disk I/O stuck with KVM - no clue how to solve that

2010-11-07 Thread Stefan Hajnoczi
On Sun, Nov 7, 2010 at 4:07 PM, Hermann Himmelbauer du...@qwer.tk wrote:
 Am Samstag 06 November 2010 20:58:12 schrieb Stefan Hajnoczi:
 On Fri, Nov 5, 2010 at 5:16 PM, Hermann Himmelbauer du...@qwer.tk wrote:
  I experience strange disk I/O stucks on my Linux Host + Guest with KVM,
  which make the system (especially the guests) almost unusable. These
  stucks come periodically, e.g. every 2 to 10 seconds and last between 3
  and sometimes over 120 seconds, which trigger kernel messages like this
  (on host and/or guest):
 
  INFO: task postgres:2195 blocked for more than 120 seconds

 The fact that this happens on the host too suggests there's an issue
 with the host software/hardware and the VM is triggering it but not
 the root cause.

 Does dmesg display any other suspicious messages?

 No, there's nothing that can be seen via dmesg. I at first suspected the
 hardware, too. I can think of the following reasons:

 1) Broken SATA cable / Harddisks - I changed some cables, no change, thus this
 is probably ruled out. I also can't see anything via S.M.A.R.T. Moreover, the
 problem is not bound to a specific device, instead it happens on sda - sdd,
 so I doubt it's harddisk related.

 2) Broken Power Supply / Insufficient Power - I'd expect either a complete
 crash or some error messages in this case, so I'd rather rule that out.

 3) Broken SATA-Controller - I cannot think of any way to check that, but I'd
 also expect some crashes or kernel messages. I flashed the board to the
 latest BIOS version, no change either.

 However, it seems no one except me seems to have this problem, so I'll buy a
 new, similar but different mainboard (Intel instead of Asus), hopefully this
 solves the problem.

 What do you think, any better idea?

If you have the time, you can use perf probes to trace I/O requests in
the host kernel.  Perhaps completion interrupts are being dropped.
You may wish to start by tracing requests issued and completed by the
SATA driver.

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Disk I/O stuck with KVM - no clue how to solve that

2010-11-06 Thread Stefan Hajnoczi
On Fri, Nov 5, 2010 at 5:16 PM, Hermann Himmelbauer du...@qwer.tk wrote:
 I experience strange disk I/O stucks on my Linux Host + Guest with KVM, which
 make the system (especially the guests) almost unusable. These stucks come
 periodically, e.g. every 2 to 10 seconds and last between 3 and sometimes
 over 120 seconds, which trigger kernel messages like this (on host and/or
 guest):

 INFO: task postgres:2195 blocked for more than 120 seconds

The fact that this happens on the host too suggests there's an issue
with the host software/hardware and the VM is triggering it but not
the root cause.

Does dmesg display any other suspicious messages?

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/1] vhost: Reduce TX used buffer signal for performance

2010-10-28 Thread Stefan Hajnoczi
On Wed, Oct 27, 2010 at 10:05 PM, Shirley Ma mashi...@us.ibm.com wrote:
 This patch changes vhost TX used buffer signal to guest from one by
 one to up to 3/4 of vring size. This change improves vhost TX message
 size from 256 to 8K performance for both bandwidth and CPU utilization
 without inducing any regression.

Any concerns about introducing latency or does the guest not care when
TX completions come in?

 Signed-off-by: Shirley Ma x...@us.ibm.com
 ---

  drivers/vhost/net.c   |   19 ++-
  drivers/vhost/vhost.c |   31 +++
  drivers/vhost/vhost.h |    3 +++
  3 files changed, 52 insertions(+), 1 deletions(-)

 diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 index 4b4da5b..bd1ba71 100644
 --- a/drivers/vhost/net.c
 +++ b/drivers/vhost/net.c
 @@ -198,7 +198,24 @@ static void handle_tx(struct vhost_net *net)
                if (err != len)
                        pr_debug(Truncated TX packet: 
                                  len %d != %zd\n, err, len);
 -               vhost_add_used_and_signal(net->dev, vq, head, 0);
 +               /*
 +                * if no pending buffer size allocate, signal used buffer
 +                * one by one, otherwise, signal used buffer when reaching
 +                * 3/4 ring size to reduce CPU utilization.
 +                */
 +               if (unlikely(vq->pend))
 +                       vhost_add_used_and_signal(net->dev, vq, head, 0);
 +               else {
 +                       vq->pend[vq->num_pend].id = head;

I don't understand the logic here: if !vq->pend then we assign to
vq->pend[vq->num_pend].

 +                       vq->pend[vq->num_pend].len = 0;
 +                       ++vq->num_pend;
 +                       if (vq->num_pend == (vq->num - (vq->num >> 2))) {
 +                               vhost_add_used_and_signal_n(net->dev, vq,
 +                                                           vq->pend,
 +                                                           vq->num_pend);
 +                               vq->num_pend = 0;
 +                       }
 +               }
                 total_len += len;
                 if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
                         vhost_poll_queue(&vq->poll);
 diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
 index 94701ff..47696d2 100644
 --- a/drivers/vhost/vhost.c
 +++ b/drivers/vhost/vhost.c
 @@ -170,6 +170,16 @@ static void vhost_vq_reset(struct vhost_dev *dev,
        vq->call_ctx = NULL;
        vq->call = NULL;
        vq->log_ctx = NULL;
 +       /* signal pending used buffers */
 +       if (vq->pend) {
 +               if (vq->num_pend != 0) {
 +                       vhost_add_used_and_signal_n(dev, vq, vq->pend,
 +                                                   vq->num_pend);
 +                       vq->num_pend = 0;
 +               }
 +               kfree(vq->pend);
 +       }
 +       vq->pend = NULL;
  }
 
  static int vhost_worker(void *data)
 @@ -273,7 +283,13 @@ long vhost_dev_init(struct vhost_dev *dev,
                dev->vqs[i].heads = NULL;
                dev->vqs[i].dev = dev;
                mutex_init(&dev->vqs[i].mutex);
 +               dev->vqs[i].num_pend = 0;
 +               dev->vqs[i].pend = NULL;
                vhost_vq_reset(dev, dev->vqs + i);
 +               /* signal 3/4 of ring size used buffers */
 +               dev->vqs[i].pend = kmalloc((dev->vqs[i].num -
 +                                          (dev->vqs[i].num >> 2)) *
 +                                          sizeof *vq->peed, GFP_KERNEL);

Has this patch been compile tested?  vq->peed?

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/1] vhost: Reduce TX used buffer signal for performance

2010-10-28 Thread Stefan Hajnoczi
On Thu, Oct 28, 2010 at 9:57 AM, Stefan Hajnoczi stefa...@gmail.com wrote:
Just read the patch 1/1 discussion and it looks like you're already on
it.  Sorry for the noise.

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] qcow2: Fix segfault when qcow2 preallocate fails

2010-10-26 Thread Stefan Hajnoczi
When an image is created with -o preallocate, ensure that we only call
preallocate() if the image was indeed opened successfully.  Also use
bdrv_delete() instead of bdrv_close() to avoid leaking the
BlockDriverState structure.

This fixes the segfault reported at
https://bugzilla.redhat.com/show_bug.cgi?id=646538.

Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
---
Here's a fix for the segfault.

 block/qcow2.c |8 +---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index ee3481b..0fceb0d 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1059,9 +1059,11 @@ exit:
 BlockDriverState *bs;
 BlockDriver *drv = bdrv_find_format(qcow2);
 bs = bdrv_new();
-bdrv_open(bs, filename, BDRV_O_CACHE_WB | BDRV_O_RDWR, drv);
-ret = preallocate(bs);
-bdrv_close(bs);
+ret = bdrv_open(bs, filename, BDRV_O_CACHE_WB | BDRV_O_RDWR, drv);
+if (ret == 0) {
+ret = preallocate(bs);
+}
+bdrv_delete(bs);
 }
 
 return ret;
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] qcow2: Fix segfault when qcow2 preallocate fails

2010-10-26 Thread Stefan Hajnoczi
On Tue, Oct 26, 2010 at 2:48 PM, Kevin Wolf kw...@redhat.com wrote:
 Am 26.10.2010 15:23, schrieb Stefan Hajnoczi:
 When an image is created with -o preallocate, ensure that we only call
 preallocate() if the image was indeed opened successfully.  Also use
 bdrv_delete() instead of bdrv_close() to avoid leaking the
 BlockDriverState structure.

 This fixes the segfault reported at
 https://bugzilla.redhat.com/show_bug.cgi?id=646538.

 Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com

 Looks good for stable-0.13. In master we'll have the new qcow_create2
 implementation as soon as Anthony pulls, so it doesn't apply there.

I forgot about that :).  Thanks Kevin.

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] virtio: Use ioeventfd for virtqueue notify

2010-10-25 Thread Stefan Hajnoczi
On Tue, Oct 19, 2010 at 03:33:41PM +0200, Michael S. Tsirkin wrote:
Apologies if you receive this twice, the original message either
disappeared or was delayed somehow.

 My main concern is with the fact that we add more state
 in notifiers that can easily get out of sync with users.
 If we absolutely need this state, let's try to at least
 document the state machine, and make the API
 for state transitions more transparent.

I'll try to describe how it works.  If you're happy with the design in
principle then I can rework the code.  Otherwise we can think about a
different design.

The goal is to use ioeventfd instead of the synchronous pio emulation
path that userspace virtqueues use today.  Both virtio-blk and
virtio-net increase performance with this approach because it does not
block the vcpu from executing guest code while the I/O operation is
initiated.

We want to automatically create an event notifier and setup ioeventfd
for each initialized virtqueue.

Vhost already uses ioeventfd so it is important not to interfere with
devices that have enabled vhost.  If vhost is enabled, then the device's
virtqueues are off-limits and should not be tampered with.

Furthermore, older kernels limit you to 6 ioeventfds per guest.  On such
systems it is risky to automatically use ioeventfd for userspace
virtqueues, since that could take a precious ioeventfd away from another
virtio device using vhost.  Existing guest configurations would break so
it is simplest to avoid using ioeventfd for userspace virtqueues on such
hosts.

The design adds logic into hw/virtio.c to automatically use ioeventfd
for userspace virtqueues.  Specific virtio devices like blk and net
require no modification.  The logic sits below the set_host_notifier()
function that vhost uses.

This design stays in sync because it speaks two interfaces that allow it
to accurately track whether or not to use ioeventfd:
1. virtio_set_host_notifier() is used by vhost.  When vhost enables the
   host notifier we stay out of the way.
2. virtio_reset()/virtio_set_status()/virtio_load() define the device
   life-cycle and transition the state machine appropriately.  Migration
   is supported.

Here is the state machine that tracks a virtqueue:

                assigned
               ^ /    \ ^
           e. / / c.  g. \ \ b.
             / /          \ \
            / v     f.     v \     a.
     offlimits    --->  deassigned  <--  start
                  <---
                   d.

a. The virtqueue starts deassigned with no ioeventfd.

b. When the device status becomes VIRTIO_CONFIG_S_DRIVER_OK we try to
assign an ioeventfd to each virtqueue, except if the 6 ioeventfd
limitation is present.

c, d. The virtqueue becomes offlimits if vhost enables the host notifier.

e. The ioeventfd becomes assigned again when the host notifier is
disabled by vhost.
f. Except when the 6 ioeventfd limitation is present, then the ioeventfd
becomes unassigned because we want to avoid using ioeventfd.

g. When the device is reset its virtqueues become deassigned again.
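
A compact way to read the diagram is as a single per-virtqueue state variable
(names here are illustrative only, not the identifiers used in the patch):

enum {
    VQ_HOST_NOTIFIER_DEASSIGNED, /* no ioeventfd: start (a), reset (g), 6-fd limit (f) */
    VQ_HOST_NOTIFIER_ASSIGNED,   /* ioeventfd serviced by the QEMU iothread (b, e) */
    VQ_HOST_NOTIFIER_OFFLIMITS,  /* vhost owns the host notifier, leave it alone (c, d) */
};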

Does this make sense?

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: VM with two interfaces

2010-10-22 Thread Stefan Hajnoczi
On Thu, Oct 21, 2010 at 11:49 PM, Nirmal Guhan vavat...@gmail.com wrote:
 On Thu, Oct 21, 2010 at 9:23 AM, Nirmal Guhan vavat...@gmail.com wrote:
 Hi,

 Am trying to create a VM using qemu-kvm with two interfaces(fedora12
 is the host and vm) and running into an issue. Given below is the
 command :

 qemu-kvm -net nic,macaddr=$macaddress,model=pcnet -net
 tap,script=/etc/qemu-ifup -net nic,model=pcnet -net
 tap,script=/etc/qemu-ifup -m 1024 -hda ./vdisk.img -kernel
 ./bzImage-1019 -append ip=x.y.z.w:a.b.c.d:p.q.r.s:a.b.c.d
 ip=x.y.z.u:a.b.c.d:p.q.r.s:a.b.c.d root=/dev/nfs rw
 nfsroot=x.y.z.v:/blahblahblah

 On boot, both eth0 and eth1 come up but the vm tries to send dhcp and
 rarp requests instead of using the command line IP addresses. DHCP
 would fail in my case.

 With just one interface, dhcp is not attempted and nfs mount of root works 
 fine.

 Any clue on what could be wrong here?

 Thanks,
 Nirmal

 Can someone help please? Hard pressed on time... sorry

Try the #qemu or #kvm IRC channels on chat.freenode.net.  Often people
will respond and debug interactively with you there.

Your problem does not seem QEMU/KVM related.  You'll need to debug the
guest's boot process and network configuration just like a physical
machine.  In fact, I bet with an identical setup on a physical machine
you'd see the same problem.

Double-check the kernel parameters documentation.
See if you have a network configuration in the initramfs or the root
filesystem that would cause it to DHCP and how to pull the
pre-configured settings from the kernel.

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Build error with the io-thread option

2010-10-22 Thread Stefan Hajnoczi
2010/10/22 Jean-Philippe Menil jean-philippe.me...@univ-nantes.fr:
 i encounter the following problem, when i attempt to build qemu-kvm with the
 --enable-io-thread option:

--enable-io-thread doesn't build in qemu-kvm.git.  qemu-kvm.git has an
equivalent implemented and used automatically so you don't need to set
--enable-io-thread.

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: CD-ROM size not updated when switching CD-ROM images.

2010-10-19 Thread Stefan Hajnoczi
On Tue, Oct 19, 2010 at 4:43 AM, Alex Davis alex14...@yahoo.com wrote:
 Steps to reproduce:
 1) Download the first two Slackware-13.1 32-bit CD-ROM ISO images.
 2) Start KVM with the following command
    qemu-system-x86_64 -m 1024M \
      -cdrom <full path of 1st install disk> \
      -boot d
 3) Hit return when prompted for extra boot parameters.
 4) Hit return when asked to select a keyboard map.
 5) Hit return at the login prompt.
 6) cat /sys/block/sr0/size: this should return 1209360.
 7) Press Alt-Ctrl-2 to access the monitor
 8) eject ide1-cd0
 9) change ide1-cd0 <full path name of 2nd install disk>
 10) Press Alt-Ctrl-1 to return to the guest.
 11) dd if=/dev/sr0 of=/dev/null bs=512 skip=1209360 count=3
    this should return
     3+0 records in
     3+0 records out.
    instead it returns 0+0
 12) cat /sys/block/sr0/size: this still returns 1209360; it should return
     1376736.

 Oddly, when mount /dev/sr0  is  executed in the guest, ls of the 
 mounted directory shows the correct contents for the 2nd CD.

After changing the CD-ROM, does running blockdev --rereadpt /dev/sr0
update the size as expected?

You ejected the CD-ROM on the QEMU side, the guest doesn't necessarily
know about the medium change.  What happens when you use eject
/dev/sr0 inside the guest instead?

I don't know how CD-ROM media change works on real hardware, but that
is the behavior that QEMU should be following.

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] virtio: Use ioeventfd for virtqueue notify

2010-10-19 Thread Stefan Hajnoczi
On Thu, Sep 30, 2010 at 03:01:52PM +0100, Stefan Hajnoczi wrote:
 Virtqueue notify is currently handled synchronously in userspace virtio.
 This prevents the vcpu from executing guest code while hardware
 emulation code handles the notify.
 
 On systems that support KVM, the ioeventfd mechanism can be used to make
 virtqueue notify a lightweight exit by deferring hardware emulation to
 the iothread and allowing the VM to continue execution.  This model is
 similar to how vhost receives virtqueue notifies.
 
 The result of this change is improved performance for userspace virtio
 devices.  Virtio-blk throughput increases especially for multithreaded
 scenarios and virtio-net transmit throughput increases substantially.
 Full numbers are below.
 
 This patch employs ioeventfd virtqueue notify for all virtio devices.
 Linux kernels pre-2.6.34 only allow for 6 ioeventfds per VM and care
 must be taken so that vhost-net, the other ioeventfd user in QEMU, is
 able to function.  On such kernels ioeventfd virtqueue notify will not
 be used.
 
 Khoa Huynh k...@us.ibm.com collected the following data for
 virtio-blk with cache=none,aio=native:
 
  FFSB Test          Threads  Unmodified  Patched
                              (MB/s)      (MB/s)
  Large file create  1        21.7        21.8
                     8        101.0       118.0
                     16       119.0       157.0
  
  Sequential reads   1        21.9        23.2
                     8        114.0       139.0
                     16       143.0       178.0
  
  Random reads       1        3.3         3.6
                     8        23.0        25.4
                     16       43.3        47.8
  
  Random writes      1        22.2        23.0
                     8        93.1        111.6
                     16       110.5       132.0
 
 Sridhar Samudrala s...@us.ibm.com collected the following data for
 virtio-net with 2.6.36-rc1 on the host and 2.6.34 on the guest.
 
 Guest to Host TCP_STREAM throughput(Mb/sec)
 ---
 Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
  65536     12755      6430        7590
  16384      8499      3084        5764
   4096      4723      1578        3659
   1024      1827       981        2060
 
 Host to Guest TCP_STREAM throughput(Mb/sec)
 ---
 Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
  65536     11156      5790        5853
  16384     10787      5575        5691
   4096     10452      5556        4277
   1024      4437      3671        5277
 
 Guest to Host TCP_RR latency(transactions/sec)
 --
 
 Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
  1          9903      3459        3425
   4096      7185      1931        1899
  16384      6108      2102        1923
  65536      3161      1610        1744
 
 Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
 ---
 Small changes are required for qemu-kvm.git.  I will send them once qemu.git
 has virtio-ioeventfd support.
 
  hw/vhost.c  |6 ++--
  hw/virtio.c |  106 
 +++
  hw/virtio.h |9 +
  kvm-all.c   |   39 +
  kvm-stub.c  |5 +++
  kvm.h   |1 +
  6 files changed, 156 insertions(+), 10 deletions(-)

Is there anything stopping this patch from being merged?

Thanks,
Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Re: KVM call agenda for Oct 19

2010-10-19 Thread Stefan Hajnoczi
On Tue, Oct 19, 2010 at 2:33 PM, Anthony Liguori anth...@codemonkey.ws wrote:
 On 10/19/2010 08:27 AM, Avi Kivity wrote:

  On 10/19/2010 03:22 PM, Anthony Liguori wrote:

 I had assumed that this would involve:

 qemu -hda windows.img

 (qemu) snapshot ide0-disk0 snap0.img

 1) create snap0.img internally by doing the equivalent of `qemu-img
 create -f qcow2 -b windows.img snap0.img'
 2) bdrv_flush('ide0-disk0')
 3) bdrv_open(snap0.img)
 4) bdrv_close(windows.img)
 5) rename('windows.img', 'windows.img.tmp')
 6) rename('snap0.img', 'windows.img')
 7) rename('windows.img.tmp', 'snap0.img')


 Looks reasonable.

 Would be interesting to look at this as a use case for the threading work.
  We should eventually be able to create a snapshot without stalling vcpus
 (stalling I/O of course allowed).

 If we had another block-level command, like bdrv_aio_freeze(), that queued
 all pending requests until the given callback completed, it would be very
 easy to do this entirely asynchronously.  For instance:

 bdrv_aio_freeze(create_snapshot)

 create_snapshot():
  bdrv_aio_flush(done_flush)

 done_flush():
  bdrv_open(...)
  bdrv_close(...)
  ...

 Of course, closing a device while it's being frozen is probably a recipe for
 disaster but you get the idea :-)

bdrv_aio_freeze() or any mechanism to deal with pending requests in
the generic block code would be a good step for future live support
of other operations like truncate.

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] virtio: Use ioeventfd for virtqueue notify

2010-10-19 Thread Stefan Hajnoczi
On Tue, Oct 19, 2010 at 2:35 PM, Michael S. Tsirkin m...@redhat.com wrote:
 On Tue, Oct 19, 2010 at 08:12:42AM -0500, Anthony Liguori wrote:
 On 10/19/2010 08:07 AM, Stefan Hajnoczi wrote:
 Is there anything stopping this patch from being merged?

 Michael, any objections?  If not, I'll merge it.

 I don't really understand what's going on there.  The extra state in
 notifiers especially scares me. If you do and are comfortable with the
 code, go ahead :)

I'm happy to address your comments.  The state machine was a bit icky
but I don't see a way around it.  Will follow up to your review email.

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH] ceph/rbd block driver for qemu-kvm (v6)

2010-10-13 Thread Stefan Hajnoczi
On Wed, Oct 13, 2010 at 12:18 AM, Christian Brunner c...@muc.de wrote:
 +static int rbd_set_snapc(rados_pool_t pool, const char *snap, RbdHeader1 
 *header)
 +{
 +    uint32_t snap_count = header->snap_count;
 +    rados_snap_t *snaps = NULL;
 +    rados_snap_t seq;
 +    uint32_t i;
 +    uint64_t snap_names_len = header->snap_names_len;
 +    int r;
 +    rados_snap_t snapid = 0;
 +
 +    cpu_to_le32s(&snap_count);
 +    cpu_to_le64s(&snap_names_len);

It is clearer to do byteswapping immediately, rather than having the
variable take on different endianness at different times:
uint32_t snap_count = cpu_to_le32(header->snap_count);
uint64_t snap_names_len = cpu_to_le64(header->snap_names_len);

 +    if (snap_count) {
 +        const char *header_snap = (const char *)&header->snaps[snap_count];
 +        const char *end = header_snap + snap_names_len;

snap_names_len is little-endian.  This won't work on big-endian hosts.
 Did you mean le64_to_cpu() instead of cpu_to_le64()?
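
For illustration, the usual pattern for an on-disk little-endian field looks
like this (sketch only):

/* header->snap_names_len is stored little-endian on disk; convert it to host
 * endianness before using it in pointer arithmetic: */
uint64_t snap_names_len = le64_to_cpu(header->snap_names_len);
const char *end = header_snap + snap_names_len;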

 +        snaps = qemu_malloc(sizeof(rados_snap_t) * header->snap_count);

snaps is allocated here...

 +
 +        for (i=0; i < snap_count; i++) {
 +            snaps[i] = (uint64_t)header->snaps[i].id;
 +            cpu_to_le64s(&snaps[i]);
 +
 +            if (snap && strcmp(snap, header_snap) == 0) {
 +                snapid = snaps[i];
 +            }
 +
 +            header_snap += strlen(header_snap) + 1;
 +            if (header_snap > end) {
 +                error_report("bad header, snapshot list broken");
 +            }
 +        }
 +    }
 +
 +    if (snap && !snapid) {
 +        error_report(snapshot not found);
 +        return -ENOENT;

...but never freed here.

 +    }
 +    seq = header->snap_seq;
 +    cpu_to_le32s((uint32_t *)&seq);
 +
 +    r = rados_set_snap_context(pool, seq, snaps, snap_count);
 +
 +    rados_set_snap(pool, snapid);
 +
 +    qemu_free(snaps);
 +
 +    return r;
 +}

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH -v5] ceph/rbd block driver for qemu-kvm

2010-10-09 Thread Stefan Hajnoczi
On Fri, Oct 8, 2010 at 8:00 PM, Yehuda Sadeh yeh...@hq.newdream.net wrote:
No flush operation is supported.  Can the guest be sure written data
is on stable storage when it receives completion?

 +/*
 + * This aio completion is being called from rbd_aio_event_reader() and
 + * runs in qemu context. It schedules a bh, but just in case the aio
 + * was not cancelled before.

Cancellation looks unsafe to me because acb is freed for cancel but
then accessed here!  Also see my comment on aio_cancel() below.

 +/*
 + * Cancel aio. Since we don't reference acb in a non qemu threads,
 + * it is safe to access it here.
 + */
 +static void rbd_aio_cancel(BlockDriverAIOCB *blockacb)
 +{
 +    RBDAIOCB *acb = (RBDAIOCB *) blockacb;
 +    qemu_bh_delete(acb->bh);
 +    acb->bh = NULL;
 +    qemu_aio_release(acb);

Any pending librados completions are still running here and will then
cause acb to be accessed after they complete.  If there is no safe way
to cancel then wait for the request to complete.

 +}
 +
 +static AIOPool rbd_aio_pool = {
 +    .aiocb_size = sizeof(RBDAIOCB),
 +    .cancel = rbd_aio_cancel,
 +};
 +
 +/*
 + * This is the callback function for rados_aio_read and _write
 + *
 + * Note: this function is being called from a non qemu thread so
 + * we need to be careful about what we do here. Generally we only
 + * write to the block notification pipe, and do the rest of the
 + * io completion handling from rbd_aio_event_reader() which
 + * runs in a qemu context.

Do librados threads have all signals blocked?  QEMU uses signals so it
is important that this signal not get sent to a librados thread and
discarded.  I have seen this issue in the past when using threaded
libraries in QEMU.
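
A common workaround (sketch, assuming the library spawns its worker threads
from its initialization call) is to block all signals around that call so the
new threads inherit a fully-blocked mask:

#include <signal.h>
#include <pthread.h>

static void init_library_with_signals_blocked(void)
{
    sigset_t set, oldset;

    sigfillset(&set);
    pthread_sigmask(SIG_BLOCK, &set, &oldset);   /* threads created now inherit this */
    /* call the library init that creates its threads here */
    pthread_sigmask(SIG_SETMASK, &oldset, NULL); /* restore the caller's mask */
}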

 + */
 +static void rbd_finish_aiocb(rados_completion_t c, RADOSCB *rcb)
 +{
 +    rcb->ret = rados_aio_get_return_value(c);
 +    rados_aio_release(c);
 +    if (write(rcb->s->fds[RBD_FD_WRITE], (void *)&rcb, sizeof(rcb)) < 0) {

You are writing a RADOSCB* so sizeof(rcb) should be used.

 +        error_report("failed writing to acb->s->fds\n");
 +        qemu_free(rcb);
 +    }
 +}
 +
 +/* Callback when all queued rados_aio requests are complete */
 +
 +static void rbd_aio_bh_cb(void *opaque)
 +{
 +    RBDAIOCB *acb = opaque;
 +
 +    if (!acb->write) {
 +        qemu_iovec_from_buffer(acb->qiov, acb->bounce, acb->qiov->size);
 +    }
 +    qemu_vfree(acb->bounce);
 +    acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
 +    qemu_bh_delete(acb->bh);
 +    acb->bh = NULL;
 +
 +    qemu_aio_release(acb);
 +}
 +
 +static BlockDriverAIOCB *rbd_aio_rw_vector(BlockDriverState *bs,
 +                                           int64_t sector_num,
 +                                           QEMUIOVector *qiov,
 +                                           int nb_sectors,
 +                                           BlockDriverCompletionFunc *cb,
 +                                           void *opaque, int write)
 +{
 +    RBDAIOCB *acb;
 +    RADOSCB *rcb;
 +    rados_completion_t c;
 +    char n[RBD_MAX_SEG_NAME_SIZE];
 +    int64_t segnr, segoffs, segsize, last_segnr;
 +    int64_t off, size;
 +    char *buf;
 +
 +    BDRVRBDState *s = bs->opaque;
 +
 +    acb = qemu_aio_get(&rbd_aio_pool, bs, cb, opaque);
 +    acb->write = write;
 +    acb->qiov = qiov;
 +    acb->bounce = qemu_blockalign(bs, qiov->size);
 +    acb->aiocnt = 0;
 +    acb->ret = 0;
 +    acb->error = 0;
 +    acb->s = s;
 +
 +    if (!acb->bh) {
 +        acb->bh = qemu_bh_new(rbd_aio_bh_cb, acb);
 +    }

When do you expect acb-bh to be non-NULL?

 +
 +    if (write) {
 +        qemu_iovec_to_buffer(acb->qiov, acb->bounce);
 +    }
 +
 +    buf = acb->bounce;
 +
 +    off = sector_num * BDRV_SECTOR_SIZE;
 +    size = nb_sectors * BDRV_SECTOR_SIZE;
 +    segnr = off / s->objsize;
 +    segoffs = off % s->objsize;
 +    segsize = s->objsize - segoffs;
 +
 +    last_segnr = ((off + size - 1) / s->objsize);
 +    acb->aiocnt = (last_segnr - segnr) + 1;
 +
 +    s->qemu_aio_count += acb->aiocnt; /* All the RADOSCB */
 +
 +    if (write && s->read_only) {
 +        acb->ret = -EROFS;
 +        return NULL;
 +    }

block.c:bdrv_aio_writev() will reject writes to read-only block
devices.  This check can be eliminated and it also prevents leaking
acb here.

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] virtio: Use ioeventfd for virtqueue notify

2010-10-04 Thread Stefan Hajnoczi
On Sun, Oct 3, 2010 at 12:01 PM, Avi Kivity a...@redhat.com wrote:
  On 09/30/2010 04:01 PM, Stefan Hajnoczi wrote:

 Virtqueue notify is currently handled synchronously in userspace virtio.
 This prevents the vcpu from executing guest code while hardware
 emulation code handles the notify.

 On systems that support KVM, the ioeventfd mechanism can be used to make
 virtqueue notify a lightweight exit by deferring hardware emulation to
 the iothread and allowing the VM to continue execution.  This model is
 similar to how vhost receives virtqueue notifies.

 Note that this is a tradeoff.  If an idle core is available and the
 scheduler places the iothread on that core, then the heavyweight exit is
 replaced by a lightweight exit + IPI.  If the iothread is co-located with
 the vcpu, then we'll take a heavyweight exit in any case.

 The first case is very likely if the host cpu is undercommitted and there is
 heavy I/O activity.  This is a typical subsystem benchmark scenario (as
 opposed to a system benchmark like specvirt).  My feeling is that total
 system throughput will be decreased unless the scheduler is clever enough to
 place the iothread and vcpu on the same host cpu when the system is
 overcommitted.

 We can't balance feeling against numbers, especially when we have a
 precedent in vhost-net, so I think this should go in.  But I think we should
 also try to understand the effects of the extra IPIs and cacheline bouncing
 that this creates.  While virtio was designed to minimize this, we know it
 has severe problems in this area.

Right, there is a danger of optimizing for subsystem benchmark cases
rather than real world usage.  I have posted some results that we've
gathered but more scrutiny is welcome.

 Khoa Huynhk...@us.ibm.com  collected the following data for
 virtio-blk with cache=none,aio=native:

 FFSB Test          Threads  Unmodified  Patched
                             (MB/s)      (MB/s)
 Large file create  1        21.7        21.8
                    8        101.0       118.0
                    16       119.0       157.0

 Sequential reads   1        21.9        23.2
                    8        114.0       139.0
                    16       143.0       178.0

 Random reads       1        3.3         3.6
                    8        23.0        25.4
                    16       43.3        47.8

 Random writes      1        22.2        23.0
                    8        93.1        111.6
                    16       110.5       132.0

 Impressive numbers.  Can you also provide efficiency (bytes per host cpu
 seconds)?

Khoa, do you have the host CPU numbers for these benchmark runs?

 How many guest vcpus were used with this?  With enough vcpus, there is also
 a reduction in cacheline bouncing, since the virtio state in the host gets
 to stay on one cpu (especially with aio=native).

Guest: 2 vcpu, 4 GB RAM
Host: 16 cpus, 12 GB RAM

Khoa, is this correct?

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: disk image snapshot functionality

2010-09-30 Thread Stefan Hajnoczi
Try this:
http://wiki.qemu.org/download/qemu-doc.html#vm_005fsnapshots

To list the snapshots in your QCOW2 image:
qemu-img snapshot -l myimage.qcow2

To revert the disk to a saved state:
qemu-img snapshot -a snapshot-name myimage.qcow2

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] virtio: Use ioeventfd for virtqueue notify

2010-09-30 Thread Stefan Hajnoczi
Virtqueue notify is currently handled synchronously in userspace virtio.
This prevents the vcpu from executing guest code while hardware
emulation code handles the notify.

On systems that support KVM, the ioeventfd mechanism can be used to make
virtqueue notify a lightweight exit by deferring hardware emulation to
the iothread and allowing the VM to continue execution.  This model is
similar to how vhost receives virtqueue notifies.

The result of this change is improved performance for userspace virtio
devices.  Virtio-blk throughput increases especially for multithreaded
scenarios and virtio-net transmit throughput increases substantially.
Full numbers are below.

This patch employs ioeventfd virtqueue notify for all virtio devices.
Linux kernels pre-2.6.34 only allow for 6 ioeventfds per VM and care
must be taken so that vhost-net, the other ioeventfd user in QEMU, is
able to function.  On such kernels ioeventfd virtqueue notify will not
be used.

Khoa Huynh k...@us.ibm.com collected the following data for
virtio-blk with cache=none,aio=native:

FFSB Test          Threads  Unmodified  Patched
                            (MB/s)      (MB/s)
Large file create  1        21.7        21.8
                   8        101.0       118.0
                   16       119.0       157.0

Sequential reads   1        21.9        23.2
                   8        114.0       139.0
                   16       143.0       178.0

Random reads       1        3.3         3.6
                   8        23.0        25.4
                   16       43.3        47.8

Random writes      1        22.2        23.0
                   8        93.1        111.6
                   16       110.5       132.0

Sridhar Samudrala s...@us.ibm.com collected the following data for
virtio-net with 2.6.36-rc1 on the host and 2.6.34 on the guest.

Guest to Host TCP_STREAM throughput(Mb/sec)
-------------------------------------------
Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
65536     12755      6430        7590
16384     8499       3084        5764
4096      4723       1578        3659
1024      1827       981         2060

Host to Guest TCP_STREAM throughput(Mb/sec)
-------------------------------------------
Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
65536     11156      5790        5853
16384     10787      5575        5691
4096      10452      5556        4277
1024      4437       3671        5277

Guest to Host TCP_RR latency(transactions/sec)
----------------------------------------------

Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
1         9903       3459        3425
4096      7185       1931        1899
16384     6108       2102        1923
65536     3161       1610        1744

Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
---
Small changes are required for qemu-kvm.git.  I will send them once qemu.git
has virtio-ioeventfd support.

 hw/vhost.c  |6 ++--
 hw/virtio.c |  106 +++
 hw/virtio.h |9 +
 kvm-all.c   |   39 +
 kvm-stub.c  |5 +++
 kvm.h   |1 +
 6 files changed, 156 insertions(+), 10 deletions(-)

diff --git a/hw/vhost.c b/hw/vhost.c
index 1b8624d..f127a07 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -517,7 +517,7 @@ static int vhost_virtqueue_init(struct vhost_dev *dev,
 goto fail_guest_notifier;
 }
 
-r = vdev->binding->set_host_notifier(vdev->binding_opaque, idx, true);
+r = virtio_set_host_notifier(vdev, idx, true);
 if (r < 0) {
 fprintf(stderr, "Error binding host notifier: %d\n", -r);
 goto fail_host_notifier;
@@ -539,7 +539,7 @@ static int vhost_virtqueue_init(struct vhost_dev *dev,
 
 fail_call:
 fail_kick:
-vdev->binding->set_host_notifier(vdev->binding_opaque, idx, false);
+virtio_set_host_notifier(vdev, idx, false);
 fail_host_notifier:
 vdev->binding->set_guest_notifier(vdev->binding_opaque, idx, false);
 fail_guest_notifier:
@@ -575,7 +575,7 @@ static void vhost_virtqueue_cleanup(struct vhost_dev *dev,
 }
 assert (r >= 0);
 
-r = vdev->binding->set_host_notifier(vdev->binding_opaque, idx, false);
+r = virtio_set_host_notifier(vdev, idx, false);
 if (r < 0) {
 fprintf(stderr, "vhost VQ %d host cleanup failed: %d\n", idx, r);
 fflush(stderr);
diff --git a/hw/virtio.c b/hw/virtio.c
index fbef788..f075b3a 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -16,6 +16,7 @@
 #include "trace.h"
 #include "virtio.h"
 #include "sysemu.h"
+#include "kvm.h"
 
 /* The alignment to use between consumer and producer parts of vring.
  * x86 pagesize again. */
@@ -77,6 +78,11 @@ struct VirtQueue
 VirtIODevice *vdev;
 EventNotifier guest_notifier;
 EventNotifier host_notifier;
+enum {
+HOST_NOTIFIER_DEASSIGNED

Re: disk image snapshot functionality

2010-09-30 Thread Stefan Hajnoczi
On Thu, Sep 30, 2010 at 2:49 PM, Peter Doherty
dohe...@hkl.hms.harvard.edu wrote:
 On Sep 30, 2010, at 04:31 , Stefan Hajnoczi wrote:

 Try this:
 http://wiki.qemu.org/download/qemu-doc.html#vm_005fsnapshots

 To list the snapshots in your QCOW2 image:
 qemu-img snapshot -l myimage.qcow2

 To revert the disk to a saved state:
 qemu-img snapshot -a snapshot-name myimage.qcow2

 Stefan


 Thanks,

 It looks like the savevm and loadvm commands are pretty close to what I'm
 looking for.
 Any idea what version of qemu that's available in, and when we might be
 seeing that version available in RHEL/CentOS?

The snapshot command was added in January 2009.  I don't have a RHEL 5
handy for checking, sorry.

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: disk image snapshot functionality

2010-09-29 Thread Stefan Hajnoczi
On Tue, Sep 28, 2010 at 9:24 PM, Peter Doherty
dohe...@hkl.hms.harvard.edu wrote:
 I thought I could do this with the qcow2 images.
 I've used:
 qemu-img snapshot -c snapname disk_image.qcow2
 to create the snapshot.

 It doesn't work.  The snapshots claim to be created, but if I shut down the
 guest, apply the snapshot
 ( qemu-img snapshot -a snapname disk_image.qcow2 )
 the guest either:
 a.) no longer boots (No bootable disk found)
 b.) boots, but is just how it was when I shut it down (it hasn't reverted
 back to what it was like when the snapshot was made)

It is not possible to use qemu-img on the qcow2 image file of a
running guest.  It might work sometimes but chances are you'll corrupt
the image or end up with random behavior.  Compare this to mounting a
filesystem from the guest and the host at the same time - it is not
safe to do this!

You can take snapshots using the savevm command inside QEMU but that
will pause the guest.
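
For example, from the QEMU monitor (the snapshot name is arbitrary):

(qemu) savevm before-upgrade
(qemu) info snapshots
(qemu) loadvm before-upgrade

savevm stores the VM state together with a disk snapshot, so loadvm resumes
exactly where the snapshot was taken.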

If you use a logical volume instead of a qcow2 file then you can
create an LVM snapshot.  Try to ensure that the guest is in a
reasonably idle disk I/O state, otherwise the snapshot may catch the
guest at a bad time.  On boot up the filesystem may need to perform
some recovery (e.g. journal rollback).
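
A minimal example (volume names are made up; the snapshot only needs enough
space to hold blocks that change after it is taken):

lvcreate --snapshot --size 1G --name guest-snap /dev/vg0/guest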

Live disk snapshots could be supported at the QEMU block layer.  Once
a snapshot request is issued, all following I/O requests are queued
and not started yet.  Once existing requests have finished (the block
device is quiesced), the snapshot can be taken.  When the snapshot
completes, the queued requests are started and operation resumes as
normal.
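
A rough sketch of that quiesce-and-queue idea in C (this is not QEMU code;
every name here is invented and error handling is omitted):

#include <stddef.h>
#include <stdbool.h>

/* Hypothetical request type -- illustrative only. */
struct blk_request { struct blk_request *next; };

static struct blk_request *pending;   /* requests queued while snapshotting */
static unsigned int in_flight;        /* requests already submitted */
static bool snapshot_requested;

void submit_request(struct blk_request *req)
{
    if (snapshot_requested) {
        /* Queue instead of starting; replayed once the snapshot is done. */
        req->next = pending;
        pending = req;
        return;
    }
    in_flight++;
    /* ...actually start the request against the image... */
}

void request_completed(void)
{
    in_flight--;
    if (snapshot_requested && in_flight == 0) {
        /* Block device is quiesced: take the snapshot now. */
        /* ...write snapshot metadata, set up copy-on-write tables... */
        snapshot_requested = false;
        while (pending) {             /* restart the queued requests */
            struct blk_request *req = pending;
            pending = req->next;
            submit_request(req);
        }
    }
}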

Qcow2 is unique because it supports internal snapshots.  Disk
snapshots are part of the image file itself and blocks are shared
between snapshots using reference counting and copy-on-write.  Other
image formats only support external snapshots via backing files.
Examples of this are QCOW1, VMDK, and LVM (using LVM snapshot
commands).  In order to take an external snapshot you create a new
image file that uses the snapshot as a backing file.  New writes after
the snapshot go to the new file.  The old file is the snapshot and
should stay unmodified in the future.
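
For example, an external snapshot created with qemu-img (file names are
illustrative):

qemu-img create -f qcow2 -b original.qcow2 overlay.qcow2

The guest is then pointed at overlay.qcow2; original.qcow2 holds the snapshot
and must not be written to while the overlay refers to it.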

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: virtio-blk XFS corruption

2010-09-25 Thread Stefan Hajnoczi
On Sat, Sep 25, 2010 at 2:43 PM, Peter Lieven p...@dlh.net wrote:
 we experience filesystem corruption using virtio-blk on some guest systems 
 together with XFS. We still use qemu-kvm 0.12.4.
[...]
 It seems that 64-bit Ubuntu LTS 10.04.1 is affected as well as an older 
 openSuse 11.1 system with kernel 2.6.27.45-0.1-pae. Surprisingly we have
 an openSuse 11.1 with 2.6.27.45-0.1-default working absolutely stable for 
 months.

Affected guests: 64-bit Ubuntu LTS 10.04.1, openSuse 11.1 2.6.27.45-0.1-pae
Unaffected guests: openSuse 11.1 2.6.27.45-0.1-default
qemu-kvm version: 0.12.4
qemu-kvm command-line: ?
Disk image format: ?
Steps to reproduce: ?

Can you please provide information on the unknown items above?

Does this happen with IDE (-drive file=myimage.img,if=ide) or only
with virtio-blk?

 The only thing I have seen in the syslog of the 64-bit Ubuntu LTS 10.04.1 is 
 the following:

 [19001.346897] Filesystem vda1: XFS internal error xfs_trans_cancel at line 
 1162 of file /build/buildd/linux-2.6.32/fs/xfs/xfs_trans.c.  Caller 
 0xa013091d
 [19001.346897]
 [19002.174492] Pid: 1210, comm: diablo Not tainted 2.6.32-24-server #43-Ubuntu
 [19002.174492] Call Trace:
 [19002.174492]  [a010f403] xfs_error_report+0x43/0x50 [xfs]
 [19002.174492]  [a013091d] ? xfs_create+0x1dd/0x5f0 [xfs]
 [19002.174492]  [a012bb35] xfs_trans_cancel+0xf5/0x120 [xfs]
 [19002.174492]  [a013091d] xfs_create+0x1dd/0x5f0 [xfs]

XFS has given up because of an error in xfs_create() but I don't think
this output gives enough information to determine what the error was.

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Growing qcow2 files during block migration ?

2010-09-23 Thread Stefan Hajnoczi
On Thu, Sep 23, 2010 at 3:40 PM, Christoph Adomeit
christoph.adom...@gatworks.de wrote:
 Lets say my source machine has a qcow2 file with virtual size of 60 GB but 
 only 2 GB are in use. So the qcow2 file only has a size of 2 GB.

 After block migration the resulting qcow2 file on the target machine has a 
 size of 60 GB.

 Also the migration is quite slow because it seems to transfer 60 GB instead 
 of 2 GB.

 Are there any workarounds, ideas/plans to optimize this ?

Yes.  Although this isn't currently possible in mainline QEMU it is
getting attention.  There have been several recent threads on QED,
image streaming, and block migration.  Here is one which touches on
zero sectors:
http://www.spinics.net/linux/fedora/libvir/msg28144.html

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Tracing KVM with Systemtap

2010-09-22 Thread Stefan Hajnoczi
On Wed, Sep 22, 2010 at 1:11 PM, Rayson Ho r...@redhat.com wrote:
 On Tue, 2010-09-21 at 14:33 +0100, Stefan Hajnoczi wrote:
  I will see what other probes are useful for the end users. Also, are
  there developer documentations for KVM? (I googled but found a lot of
  presentations about KVM but not a lot of info about the internals.)

 Not really.  I suggest grabbing the source and following vl.c:main()
 to the main KVM execution code.

 I was looking for the hardware interfacing code earlier this morning --
 QEMU has the hardware specific directories (e.g. target-i386/ ,
 target-ppc/ ), and I was trying to understand the execution environment
 when the host and guest are running on the same architecture.

 I believe cpu_gen_code() and other related functions are what I should
 dig into...

KVM does not generate code.  Almost all the emulation code in the
source tree is part of the Tiny Code Generator (TCG) used when KVM is
not enabled (e.g. to emulate an ARM board on an x86-64 host).

If you follow the life-cycle in vl.c it will take you through cpus.c
and into kvm-all.c:kvm_cpu_exec().  Note that the details differ
slightly between qemu.git and qemu-kvm.git, and I have described
qemu.git.

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Tracing KVM with Systemtap

2010-09-22 Thread Stefan Hajnoczi
On Wed, Sep 22, 2010 at 1:42 PM, Rayson Ho r...@redhat.com wrote:
 On Wed, 2010-09-22 at 13:33 +0100, Stefan Hajnoczi wrote:
 KVM does not generate code.  Almost all the emulation code in the
 source tree is part of the Tiny Code Generator (TCG) used when KVM is
 not enabled (e.g. to emulate an ARM board on an x86-64 host).

 Thanks, that's what I thought too. Otherwise it would be really slow to
 run KVM :)

 But if KVM is not used, and QEMU host  guest are running on the same
 architecture, is TCG off? (Hmm, I guess I can find that answer myself by
 reading the code).

TCG is unused when KVM is enabled.  There has been discussion about
building without it for KVM-only builds and qemu-kvm.git can do that
today with a ./configure option.

 Stefan, are you accepting patches? If so, I will create a patch with the
 Systemtap framework & other probes.

I am not a qemu.git or qemu-kvm.git committer but I review patches in
areas that I work in, like tracing.  I'll be happy to give you
feedback.

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Tracing KVM with Systemtap

2010-09-21 Thread Stefan Hajnoczi
On Tue, Sep 21, 2010 at 1:58 PM, Rayson Ho r...@redhat.com wrote:
 On Mon, 2010-09-20 at 14:36 +0100, Stefan Hajnoczi wrote:
 Right now there are few pre-defined probes (trace events in QEMU
 tracing speak).  As I develop I try to be mindful of new ones I create
 and whether they would be generally useful.  I intend to contribute
 more probes and hope others will too!

 I am still looking at/hacking the QEMU code. I have looked at the
 following places in the code that I think can be useful to have
 statistics gathered:

 net.c qemu_deliver_packet(), etc - network statistics

Yes.

 CPU Arch/op_helper.c global_cpu_lock(), tlb_fill() - lock & unlock,
 and TLB refill statistics

These are not relevant to KVM, they are only used when running with
KVM disabled (TCG mode).

 balloon.c, hw/virtio-balloon.c - ballooning information.

Prerna added a balloon event which is in qemu.git trace-events.  Does
that one do what you need?

 I will see what other probes are useful for the end users. Also, are
 there developer documentations for KVM? (I googled but found a lot of
 presentations about KVM but not a lot of info about the internals.)

Not really.  I suggest grabbing the source and following vl.c:main()
to the main KVM execution code.

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Tracing KVM with Systemtap

2010-09-20 Thread Stefan Hajnoczi
On Mon, Sep 20, 2010 at 2:19 PM, Rayson Ho r...@redhat.com wrote:
 On Wed, 2010-09-08 at 15:08 +0100, Stefan Hajnoczi wrote:
 Hi Rayson,
 For the KVM kernel module Linux trace events are already used.  For
 example, see arch/x86/kvm/trace.h and check out
 /sys/kernel/debug/tracing/events/kvm/*.  There is a set of useful
 static trace points for vm_exit/vm_enter, pio, mmio, etc.

 For the KVM guest there is perf-kvm(1).  This allows perf(1) to look
 up addresses inside the guest (kernel only?).  It produces system-wide
 performance profiles including guests.  Perhaps someone can comment on
 perf-kvm's full feature set and limitations?

 For QEMU userspace Prerna Saxena and I are proposing a static tracing
 patchset.  It abstracts the trace backend (SystemTap, LTTng UST,
 DTrace, etc) from the actual tracepoints so that portability can be
 achieved.  There is a built-in trace backend that has a basic feature
 set but isn't as fancy as SystemTap.  I have implemented LTTng
 Userspace Tracer support, perhaps you'd like to add SystemTap/DTrace
 support with sdt.h?

 Thanks Stefan for the reply!

 I've looked at the tracing additions in QEMU, including the Simple
 trace backend (simpletrace.c) and the tracetool script, and I think
 the SystemTap version can be implemented in a straightforward way.

 One thing I was wondering, there seems to be not a lot of probes (except
 the examples?) in the QEMU code, are we expected to see more probes in
 the next release, or this work will be a long-term project that will not
 be added to the official QEMU code in the near future?

 (I believe if we can get the tracing framework integrated, then specific
 probes can be added on-demand -- but of course that is just my own
 opinion :-D )

Right now there are few pre-defined probes (trace events in QEMU
tracing speak).  As I develop I try to be mindful of new ones I create
and whether they would be generally useful.  I intend to contribute
more probes and hope others will too!

Prerna is also looking at adding useful probes.

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Virtio with Debian GNU/Linux Etch

2010-09-17 Thread Stefan Hajnoczi
On Fri, Sep 17, 2010 at 5:56 PM, Daniel Bareiro daniel-lis...@gmx.net wrote:
 I have some installations with Debian GNU/Linux Etch I'm migrating to
 KVM. I just installed a kernel 2.6.26 from backports to use Virtio.

 But when I try to boot the operating system, it can not find the vd*
 device to mount the root filesystem. I made sure to change the
 /etc/fstab using partitions vd* and using in the 'root' GRUB parameter
 the vd corresponding partition. Just in case it were not loading the
 module, I added 'virtioi-blk' in /etc/modules, but I keep getting the
 same problem.

 What could be the problem?

1. Check that your VM has the virtio PCI adapters present.  You can do
this by running lspci from a livecd/installer or grep 1af4
/proc/bus/pci/devices.  Look for PCI adapters with vendor ID 1af4 (Red
Hat).
2. Check that the virtio kernel modules are loaded by your initramfs.
Either check the kernel messages as it boots for virtio-pci or vd*
related messages.  Or get a debug shell in the initramfs and grep
virtio-pci /proc/bus/pci/devices.
3. Did you really mean 'virtoi-blk' in /etc/modules?  You can gunzip
and cpio extract the initramfs to check that virtio, virtio_pci,
virtio_ring, virtio_blk are present.
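
For example (adjust the initramfs path for your kernel version):

zcat /boot/initrd.img-2.6.26-2-amd64 | cpio -it | grep virtio

should list virtio.ko, virtio_ring.ko, virtio_pci.ko and virtio_blk.ko if
they were included.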

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: .img on nfs, relative on ram, consuming mass ram

2010-09-16 Thread Stefan Hajnoczi
2010/9/16 Andre Przywara andre.przyw...@amd.com:
 TOURNIER Frédéric wrote:

 Ok, thanks for taking time.
 I'll dig into your answers.

 So as i run relative.img on diskless systems with original.img on nfs,
 what are the best practice/tips i can use ?

 I think it is -snapshot you are looking for.
 This will put the backing store into normal RAM, and you can later commit
 it to the original image if needed. See the qemu manpage for more details.
 In a nutshell you just specify the original image and add -snapshot to the
 command line.

-snapshot creates a temporary qcow2 image in /tmp whose backing file
is your original image.  I'm not sure what you mean by "This will put
the backing store into normal RAM"?
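
For reference, a typical -snapshot invocation looks like this (binary and
image names are illustrative):

qemu-system-x86_64 -drive file=original.img,if=virtio -snapshot

Changes accumulate in the temporary overlay under /tmp and can be flushed
back into original.img with the monitor's commit command if you want to keep
them.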

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [RFC] Add support for a USB audio device model

2010-09-11 Thread Stefan Hajnoczi
On Fri, Sep 10, 2010 at 10:47 PM, H. Peter Anvin h...@linux.intel.com wrote:
 diff --git a/hw/usb-audio.c b/hw/usb-audio.c
 new file mode 100644
 index 000..d4cf488
 --- /dev/null
 +++ b/hw/usb-audio.c
 @@ -0,0 +1,702 @@
 +/*
 + * QEMU USB Net devices
 + *
 + * Copyright (c) 2006 Thomas Sailer
 + * Copyright (c) 2008 Andrzej Zaborowski

Want to update this for usb-audio?

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Tracing KVM with Systemtap

2010-09-08 Thread Stefan Hajnoczi
On Wed, Sep 8, 2010 at 2:20 PM, Rayson Ho r...@redhat.com wrote:
 Hi all,

 I am a developer of Systemtap. I am looking into tracing KVM (the kernel
 part and QEMU) and also the KVM guests with Systemtap. I googled and
 found references to Xenprobes and xdt+dtrace, and I was wondering if
 someone is working on the dynamic tracing interface for KVM?

 I've read the KVM kernel code and I think some expensive operations
 (things that need to be trapped back to the host kernel - eg. loading of
 control registers on x86/x64) can be interesting spots for adding an SDT
 (static marker), and I/O operations performed for the guests can be
 useful information to collect.

 I know that KVM guests run like a userspace process and thus techniques
 for tracing Xen might be overkill, and also gdb can be used to trace
 KVM guests. However, is there anything special I need to be aware of
 before I go further into the development of the Systemtap KVM probes?

 (Opinions / Suggestions / Criticisms welcome!)

Hi Rayson,
For the KVM kernel module Linux trace events are already used.  For
example, see arch/x86/kvm/trace.h and check out
/sys/kernel/debug/tracing/events/kvm/*.  There is a set of useful
static trace points for vm_exit/vm_enter, pio, mmio, etc.
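
For example, to watch them system-wide through debugfs:

echo 1 > /sys/kernel/debug/tracing/events/kvm/enable
cat /sys/kernel/debug/tracing/trace_pipe

(This requires debugfs mounted at /sys/kernel/debug and root privileges.)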

For the KVM guest there is perf-kvm(1).  This allows perf(1) to look
up addresses inside the guest (kernel only?).  It produces system-wide
performance profiles including guests.  Perhaps someone can comment on
perf-kvm's full feature set and limitations?

For QEMU userspace Prerna Saxena and I are proposing a static tracing
patchset.  It abstracts the trace backend (SystemTap, LTTng UST,
DTrace, etc) from the actual tracepoints so that portability can be
achieved.  There is a built-in trace backend that has a basic feature
set but isn't as fancy as SystemTap.  I have implemented LTTng
Userspace Tracer support, perhaps you'd like to add SystemTap/DTrace
support with sdt.h?

http://www.mail-archive.com/qemu-de...@nongnu.org/msg41323.html
http://repo.or.cz/w/qemu/stefanha.git/shortlog/refs/heads/tracing_v3
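
For a feel of what that would look like, a probe site added via sdt.h can be
as simple as the following (the provider, probe and argument names are made
up; DTRACE_PROBE2 comes from SystemTap's <sys/sdt.h>):

#include <sys/sdt.h>

    DTRACE_PROBE2(qemu, virtqueue_notify, vdev, vq);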

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5] virtio-net: Switch default to new bottom half TX handler for iothread

2010-09-01 Thread Stefan Hajnoczi
On Tue, Aug 31, 2010 at 11:32 PM, Alex Williamson
alex.william...@redhat.com wrote:
 On Tue, 2010-08-31 at 23:25 +0300, Michael S. Tsirkin wrote:
 On Fri, Aug 27, 2010 at 04:37:45PM -0600, Alex Williamson wrote:
  The bottom half handler shows big improvements over the timer
  with few downsides, default to it when the iothread is enabled.
 
  Using the following tests, with the guest and host connected
  via tap+bridge:
 
  guest netperf -t TCP_STREAM -H $HOST
  host netperf -t TCP_STREAM -H $GUEST
  guest netperf -t UDP_STREAM -H $HOST
  host netperf -t UDP_STREAM -H $GUEST
  guest netperf -t TCP_RR -H $HOST
 
  Results: base throughput, exits/throughput ->
                     patched throughput, exits/throughput
 
  --enable-io-thread
  TCP guest-host 2737.77, 47.82  -> 6767.09, 29.15 = 247%, 61%
  TCP host-guest 2231.33, 74.00  -> 4125.80, 67.61 = 185%, 91%
  UDP guest-host 6281.68, 14.66  -> 12569.27, 1.98 = 200%, 14%
  UDP host-guest 275.91,  289.22 -> 264.80, 293.53 = 96%, 101%
  iterations/s   1949.65, 82.97  -> 7417.56, 84.31 = 380%, 102%
 
  No --enable-io-thread
  TCP guest-host 3041.57, 55.11 -> 1038.93, 517.57 = 34%, 939%
  TCP host-guest 2416.03, 76.67 -> 5655.92, 55.52  = 234%, 72%
  UDP guest-host 12255.82, 6.11 -> 7775.87, 31.32  = 63%, 513%
  UDP host-guest 587.92, 245.95 -> 611.88, 239.92  = 104%, 98%
  iterations/s   1975.59, 83.21 -> 8935.50, 88.18  = 452%, 106%
 
  Signed-off-by: Alex Williamson alex.william...@redhat.com

 parameter having different settings based on config
 options might surprise some users. I don't think
 we really need a parameter here ...

 I'm not a big fan of this either, but I'd also prefer not to introduce a
 regression for a performance difference we know about in advance.  It
 gets even more complicated when we factor in qemu-kvm, as it doesn't
 build with iothread enabled, but seems to get an even better boost in
 performance across the board thanks largely to the kvm-irqchip.  Should
 we instead make this a configure option?  --enable-virtio-net-txbh?
 Thanks,

 Alex

qemu-kvm uses its own iothread implementation by default.  It doesn't
need --enable-io-thread because it already uses a similar model.

Stefan


  ---
 
   hw/s390-virtio-bus.c |    3 ++-
   hw/syborg_virtio.c   |    3 ++-
   hw/virtio-pci.c      |    3 ++-
   hw/virtio.h          |    6 ++
   4 files changed, 12 insertions(+), 3 deletions(-)
 
  diff --git a/hw/s390-virtio-bus.c b/hw/s390-virtio-bus.c
  index 1483362..985f99a 100644
  --- a/hw/s390-virtio-bus.c
  +++ b/hw/s390-virtio-bus.c
  @@ -328,7 +328,8 @@ static VirtIOS390DeviceInfo s390_virtio_net = {
       .qdev.size = sizeof(VirtIOS390Device),
       .qdev.props = (Property[]) {
           DEFINE_NIC_PROPERTIES(VirtIOS390Device, nic),
  -        DEFINE_PROP_UINT32(txtimer, VirtIOS390Device, txtimer, 1),
  +        DEFINE_PROP_UINT32(txtimer, VirtIOS390Device, txtimer,
  +                           TXTIMER_DEFAULT),
           DEFINE_PROP_INT32(txburst, VirtIOS390Device, txburst, 256),
           DEFINE_PROP_END_OF_LIST(),
       },
  diff --git a/hw/syborg_virtio.c b/hw/syborg_virtio.c
  index 7b76972..ee5746d 100644
  --- a/hw/syborg_virtio.c
  +++ b/hw/syborg_virtio.c
  @@ -300,7 +300,8 @@ static SysBusDeviceInfo syborg_virtio_net_info = {
       .qdev.props = (Property[]) {
           DEFINE_NIC_PROPERTIES(SyborgVirtIOProxy, nic),
           DEFINE_VIRTIO_NET_FEATURES(SyborgVirtIOProxy, host_features),
  -        DEFINE_PROP_UINT32(txtimer, SyborgVirtIOProxy, txtimer, 1),
  +        DEFINE_PROP_UINT32(txtimer, SyborgVirtIOProxy, txtimer,
  +                           TXTIMER_DEFAULT),
           DEFINE_PROP_INT32(txburst, SyborgVirtIOProxy, txburst, 256),
           DEFINE_PROP_END_OF_LIST(),
       }
  diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
  index e025c09..9740f57 100644
  --- a/hw/virtio-pci.c
  +++ b/hw/virtio-pci.c
  @@ -695,7 +695,8 @@ static PCIDeviceInfo virtio_info[] = {
               DEFINE_PROP_UINT32(vectors, VirtIOPCIProxy, nvectors, 3),
               DEFINE_VIRTIO_NET_FEATURES(VirtIOPCIProxy, host_features),
               DEFINE_NIC_PROPERTIES(VirtIOPCIProxy, nic),
  -            DEFINE_PROP_UINT32(txtimer, VirtIOPCIProxy, txtimer, 1),
  +            DEFINE_PROP_UINT32(txtimer, VirtIOPCIProxy, txtimer,
  +                               TXTIMER_DEFAULT),
               DEFINE_PROP_INT32(txburst, VirtIOPCIProxy, txburst, 256),
               DEFINE_PROP_END_OF_LIST(),
           },
  diff --git a/hw/virtio.h b/hw/virtio.h
  index 4051889..a1a17a2 100644
  --- a/hw/virtio.h
  +++ b/hw/virtio.h
  @@ -183,6 +183,12 @@ void virtio_update_irq(VirtIODevice *vdev);
   void virtio_bind_device(VirtIODevice *vdev, const VirtIOBindings *binding,
                           void *opaque);
 
  +#ifdef CONFIG_IOTHREAD
  + #define TXTIMER_DEFAULT 0
  +#else
  + #define TXTIMER_DEFAULT 1
  +#endif
  +

 Add a comment explaining that this is just a performance optimization?

   /* Base devices.  */
   VirtIODevice 

Re: Tracing KVM with LTTng

2010-08-07 Thread Stefan Hajnoczi
On Fri, Aug 6, 2010 at 4:42 PM, Julien Desfossez j...@klipix.org wrote:
 On [2] you can see a closer look of the state of the kvm threads (blue
 means syscall, red means running in VM mode (vm_entry), dark yellow
 means waiting for CPU).

This is a great visualization.  It shows the state of the entire
system in a clear way and makes the wait times easier to understand.

 In the next days, I will send my patches to the official LTTng git.
 My next step is to synchronise the traces collected from the host with
 the traces collected from the guest (by finding an efficient way to
 share the TSC_OFFSET) to have some infos of what is happening during the
 time the VM has the control.

I'd like to try this patch out, will be checking the LTTng mailing list.

 The reason I post these screenshots now is that I will be at Linuxcon
 next week, and I would really appreciate some feedbacks and ideas for
 future improvements from the KVM community.
 So if you are interested, contact me directly and if you are there we'll
 try to meet.

I'd like to find out more about what you're doing.  Unfortunately I
will not be at LinuxCon/KVM Forum this year.

Prerna Saxena and I have been working on static trace events for QEMU
userspace.  Here is the commit to add LTTng Userspace Tracer support:

http://repo.or.cz/w/qemu/stefanha.git/commitdiff/5560c202f4c5cc37692d35e53f784c74d65c

Using the tracing branch above you can place static trace events in
QEMU userspace and collect the trace with LTTng.  See the trace-events
file for sample trace event definitions.
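
For instance, a definition in the trace-events file looks roughly like this
(the event name and arguments are made up):

virtqueue_notify(void *vq, unsigned int idx) "vq %p idx %u"

tracetool then generates a trace_virtqueue_notify(vq, idx) function that you
call at the matching spot in the C code, and the configured backend (LTTng
UST here) records the event.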

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

