Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature?

2013-08-15 Thread Stefan Hajnoczi
On Thu, Aug 15, 2013 at 10:26:36AM +0800, Wenchao Xia wrote:
 On 2013-8-14 15:53, Stefan Hajnoczi wrote:
  On Wed, Aug 14, 2013 at 3:54 AM, Wenchao Xia xiaw...@linux.vnet.ibm.com 
  wrote:
  On 2013-8-13 16:21, Stefan Hajnoczi wrote:
 
  On Tue, Aug 13, 2013 at 4:53 AM, Wenchao Xia xiaw...@linux.vnet.ibm.com
  wrote:
 
  On 2013-8-12 19:33, Stefan Hajnoczi wrote:
 
  On Mon, Aug 12, 2013 at 12:26 PM, Alex Bligh a...@alex.org.uk wrote:
 
 
  --On 12 August 2013 11:59:03 +0200 Stefan Hajnoczi stefa...@gmail.com
  wrote:
 
  The idea that was discussed on qemu-de...@nongnu.org uses fork(2) to
  capture the state of guest RAM and then send it back to the parent
  process.  The guest is only paused for a brief instant during fork(2)
  and can continue to run afterwards.
 
 
 
 
  How would you capture the state of emulated hardware which might not
  be in the guest RAM?
 
 
 
  Exactly the same way vmsave works today.  It calls the device's save
  functions which serialize state to file.
 
  The difference between today's vmsave and the fork(2) approach is that
  QEMU does not need to wait for guest RAM to be written to file before
  resuming the guest.
 
  Stefan
 
  I have a worry about what glib says:
 
  On Unix, the GLib mainloop is incompatible with fork(). Any program
  using the mainloop must either exec() or exit() from the child without
  returning to the mainloop. 
 
 
  This is fine, the child just writes out the memory pages and exits.
  It never returns to the glib mainloop.
 
   There is another way to do it: intercept the write in kvm.ko (or other
   kernel code). Since the key is intercepting the memory change, we can
   already do it in userspace in TCG mode, so we would only be adding the
   missing part for KVM mode. Another benefit of this approach is that the
   memory used can be controlled. For example, with an ioctl(), set a
   fixed-size buffer in which the kernel code keeps the intercepted write
   data; this avoids frequent switches back to userspace QEMU code. When
   the buffer is full, return to userspace QEMU code and let it save the
   data to disk. I haven't checked the exact behavior of Intel guest mode
   for handling page faults, so I can't estimate the performance cost of
   switching between guest mode and root mode, but it should not be worse
   than fork().
 
 
  The fork(2) approach is portable, covers both KVM and TCG, and doesn't
  require kernel changes.  A kvm.ko kernel change also won't be
  supported on existing KVM hosts.  These are big drawbacks and the
  kernel approach would need to be significantly better than plain old
  fork(2) to make it worthwhile.
 
  Stefan
 
  I think the advantage is that memory usage is predictable, so a memory
   usage peak can be avoided by always saving the changed pages first.
   fork() does not know which pages have changed. I am not sure whether
   this would be a serious issue when the server's memory is heavily
   consumed, for example, a 24G host emulating two 11G guests to provide
   a powerful virtual server.
  
  Memory usage is predictable but guest uptime is unpredictable because
  it waits until memory is written out.  This defeats the point of
  live savevm.  The guest may be stalled arbitrarily.
  
    I think it is adjustable. There is not much difference from
  fork(), except that we get more precise control over the changed pages.
    The kernel intercepts the change and stores the changed page in
  another page, similar to fork(). When the userspace QEMU code executes,
  it saves some pages to disk. The buffer acts like a lubricant: when
  Buffer = MAX, it equals fork() and the guest runs more lively; when
  Buffer = 0, the guest runs less lively. I think a parameter would let
  the user find a good balance point.
    It is harder to implement; I just want to show the idea.

You are right.  You could set a bigger buffer size to increase guest
uptime.

  The fork child can minimize the chance of out-of-memory by using
  madvise(MADV_DONTNEED) after pages have been written out.
    It seems there is no way to make sure the written-out pages are the
  changed pages, so there is a good chance that a written page is unchanged
  and still in use by the other QEMU process.

The KVM dirty log tells you which pages were touched.  The fork child
process could give priority to the pages which have been touched by the
guest.  They must be written out and marked madvise(MADV_DONTNEED) as
soon as possible.

I haven't looked at the vmsave data format yet to see if memory pages
can be saved in random order, but this might work.  It reduces the
likelihood of copy-on-write memory growth.

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature?

2013-08-14 Thread Stefan Hajnoczi
On Wed, Aug 14, 2013 at 3:54 AM, Wenchao Xia xiaw...@linux.vnet.ibm.com wrote:
 On 2013-8-13 16:21, Stefan Hajnoczi wrote:

 On Tue, Aug 13, 2013 at 4:53 AM, Wenchao Xia xiaw...@linux.vnet.ibm.com
 wrote:

 On 2013-8-12 19:33, Stefan Hajnoczi wrote:

 On Mon, Aug 12, 2013 at 12:26 PM, Alex Bligh a...@alex.org.uk wrote:


 --On 12 August 2013 11:59:03 +0200 Stefan Hajnoczi stefa...@gmail.com
 wrote:

 The idea that was discussed on qemu-de...@nongnu.org uses fork(2) to
 capture the state of guest RAM and then send it back to the parent
 process.  The guest is only paused for a brief instant during fork(2)
 and can continue to run afterwards.




 How would you capture the state of emulated hardware which might not
 be in the guest RAM?



 Exactly the same way vmsave works today.  It calls the device's save
 functions which serialize state to file.

 The difference between today's vmsave and the fork(2) approach is that
 QEMU does not need to wait for guest RAM to be written to file before
 resuming the guest.

 Stefan

I have a worry about what glib says:

 On Unix, the GLib mainloop is incompatible with fork(). Any program
 using the mainloop must either exec() or exit() from the child without
 returning to the mainloop. 


 This is fine, the child just writes out the memory pages and exits.
 It never returns to the glib mainloop.

There is another way to do it: intercept the write in kvm.ko (or other
 kernel code). Since the key is intercepting the memory change, we can
 already do it in userspace in TCG mode, so we would only be adding the
 missing part for KVM mode. Another benefit of this approach is that the
 memory used can be controlled. For example, with an ioctl(), set a
 fixed-size buffer in which the kernel code keeps the intercepted write
 data; this avoids frequent switches back to userspace QEMU code. When
 the buffer is full, return to userspace QEMU code and let it save the
 data to disk. I haven't checked the exact behavior of Intel guest mode
 for handling page faults, so I can't estimate the performance cost of
 switching between guest mode and root mode, but it should not be worse
 than fork().


 The fork(2) approach is portable, covers both KVM and TCG, and doesn't
 require kernel changes.  A kvm.ko kernel change also won't be
 supported on existing KVM hosts.  These are big drawbacks and the
 kernel approach would need to be significantly better than plain old
 fork(2) to make it worthwhile.

 Stefan

   I think the advantage is that memory usage is predictable, so a memory
 usage peak can be avoided by always saving the changed pages first.
 fork() does not know which pages have changed. I am not sure whether
 this would be a serious issue when the server's memory is heavily
 consumed, for example, a 24G host emulating two 11G guests to provide
 a powerful virtual server.

Memory usage is predictable but guest uptime is unpredictable because
it waits until memory is written out.  This defeats the point of
live savevm.  The guest may be stalled arbitrarily.

The fork child can minimize the chance of out-of-memory by using
madvise(MADV_DONTNEED) after pages have been written out.

The way fork handles memory overcommit on Linux is configurable, but I
guess in a situation where memory runs out the Out-of-Memory Killer
will kill a process (probably QEMU since it is hogging so much
memory).

The risk of OOM can be avoided by running the traditional vmsave which
stops the guest instead of using live vmsave.

The other option is live migration to file, but the disadvantage there
is that you cannot choose exactly when the state is saved; it happens
sometime after live migration is initiated.

There are trade-offs with all these approaches; it depends on what is
most important to you.

Stefan


Re: KVM Block Device Driver

2013-08-14 Thread Stefan Hajnoczi
On Wed, Aug 14, 2013 at 10:40:06AM +0800, Fam Zheng wrote:
 On Tue, 08/13 16:13, Spensky, Chad - 0559 - MITLL wrote:
  Hi All,
  
  I'm working with some disk introspection on KVM, and we are trying to
  create a shadow image of the disk.  We've hooked the functions in block.c,
  in particular bdrv_aio_writev.  However, after letting writes go through,
  pausing the VM, and comparing our shadow image with the actual VM image,
  they aren't 100% synced up.  The first 1-2 sectors always appear to be
  correct; however, after that, there are sometimes discrepancies.  I believe
  we have exhausted the most obvious bugs (malloc bugs, incorrect size
  calculations, etc.).  Has anyone had any experience with this or have any
  insights?
  
  Our methodology is as follows:
   1. Boot the VM.
   2. Pause VM.
   3. Copy the disk to our shadow image.
 
 How do you copy the disk, from guest or host?
 
   4. Perform very few reads/writes.
 
 Did you flush to disk?
 
   5. Pause VM.
   6. Compare shadow copy with active vm disk.
  
   And this is where we are seeing discrepancies.  Any help is much
  appreciated!  We are running on Ubuntu 12.04 with a modified Debian build.
  
   - Chad
  
  -- 
  Chad S. Spensky
  
 
 I think the drive-backup command does just what you want: it creates an
 image and copies data copy-on-write from the guest disk to the target,
 without pausing the VM.

Or perhaps drive-mirror.

Maybe Chad can explain what the use case is.  There is probably an
existing command that does this or that could be extended to do this
safely.
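
For reference, drive-backup is issued over QMP; a minimal invocation,
assuming a QEMU new enough to provide the command and with a hypothetical
device name and target path, looks roughly like:

```json
{ "execute": "drive-backup",
  "arguments": { "device": "drive-virtio-disk0",
                 "sync": "full",
                 "target": "/tmp/shadow.qcow2" } }
```

The target receives a point-in-time copy of the disk as of when the command
was issued, while the guest keeps running.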

Stefan


Re: KVM Block Device Driver

2013-08-14 Thread Stefan Hajnoczi
On Wed, Aug 14, 2013 at 07:29:53AM -0400, Spensky, Chad - 0559 - MITLL wrote:
   We are trying to keep an active shadow copy while the system is running
 without any need for pausing.  More precisely we want to log every
 individual access to the drive into a database so that the entire stream
 of accesses could be replayed at a later time.

CCing Wolfgang Richter who was previously interested in block I/O
tracing:
https://lists.nongnu.org/archive/html/qemu-devel/2013-05/msg01725.html

Stefan


Re: Oracle RAC in libvirt+KVM environment

2013-08-14 Thread Stefan Hajnoczi
On Wed, Aug 14, 2013 at 04:40:44PM +0800, Timon Wang wrote:
 I found an article about Hyper-V virtual Fibre Channel; I think this
 will make Failover Cluster work if KVM has the same feature.
 http://technet.microsoft.com/en-us/library/hh831413.aspx
 
 Hyper-V uses NPIV for virtual Fibre Channel. I have read some articles
 about KVM NPIV, but how can I configure it with libvirt? Can anybody
 show me an example?

A web search turns up this:

https://docs.fedoraproject.org/en-US/Fedora/18/html/Virtualization_Administration_Guide/sect-Technical_Papers-Identifying_HBAs_in_a_Host_System-Confirming_That_IO_Traffic_is_Going_through_an_NPIV_HBA.html

You can use this if the host has a supported Fibre Channel HBA and your
image is on a SAN LUN.

From my limited knowledge about this, NPIV itself won't make clustering
possible.  RAC or Failover Clustering probably still requires specific SCSI
commands in order to work (like persistent reservations), and that's what
needs to be investigated in order to figure out a solution.
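
For what it's worth, libvirt can create an NPIV vHBA from a node-device XML
definition passed to virsh nodedev-create.  A sketch, where the parent HBA
name (scsi_host5) and the WWNN/WWPN values are placeholders you would
replace with your own:

```xml
<device>
  <parent>scsi_host5</parent>
  <capability type='scsi_host'>
    <capability type='fc_host'>
      <wwnn>2001001b32a9da4e</wwnn>
      <wwpn>2101001b32a9da4e</wwpn>
    </capability>
  </capability>
</device>
```

Saved as vhba.xml, this would be applied with `virsh nodedev-create
vhba.xml`, after which LUNs visible through the vHBA can be assigned to
guests.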

Stefan


Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature?

2013-08-13 Thread Stefan Hajnoczi
On Tue, Aug 13, 2013 at 4:53 AM, Wenchao Xia xiaw...@linux.vnet.ibm.com wrote:
 On 2013-8-12 19:33, Stefan Hajnoczi wrote:

 On Mon, Aug 12, 2013 at 12:26 PM, Alex Bligh a...@alex.org.uk wrote:

 --On 12 August 2013 11:59:03 +0200 Stefan Hajnoczi stefa...@gmail.com
 wrote:

 The idea that was discussed on qemu-de...@nongnu.org uses fork(2) to
 capture the state of guest RAM and then send it back to the parent
 process.  The guest is only paused for a brief instant during fork(2)
 and can continue to run afterwards.



 How would you capture the state of emulated hardware which might not
 be in the guest RAM?


 Exactly the same way vmsave works today.  It calls the device's save
 functions which serialize state to file.

 The difference between today's vmsave and the fork(2) approach is that
 QEMU does not need to wait for guest RAM to be written to file before
 resuming the guest.

 Stefan

   I have a worry about what glib says:

 On Unix, the GLib mainloop is incompatible with fork(). Any program
 using the mainloop must either exec() or exit() from the child without
 returning to the mainloop. 

This is fine, the child just writes out the memory pages and exits.
It never returns to the glib mainloop.

   There is another way to do it: intercept the write in kvm.ko (or other
 kernel code). Since the key is intercepting the memory change, we can
 already do it in userspace in TCG mode, so we would only be adding the
 missing part for KVM mode. Another benefit of this approach is that the
 memory used can be controlled. For example, with an ioctl(), set a
 fixed-size buffer in which the kernel code keeps the intercepted write
 data; this avoids frequent switches back to userspace QEMU code. When
 the buffer is full, return to userspace QEMU code and let it save the
 data to disk. I haven't checked the exact behavior of Intel guest mode
 for handling page faults, so I can't estimate the performance cost of
 switching between guest mode and root mode, but it should not be worse
 than fork().

The fork(2) approach is portable, covers both KVM and TCG, and doesn't
require kernel changes.  A kvm.ko kernel change also won't be
supported on existing KVM hosts.  These are big drawbacks and the
kernel approach would need to be significantly better than plain old
fork(2) to make it worthwhile.

Stefan


Re: Oracle RAC in libvirt+KVM environment

2013-08-13 Thread Stefan Hajnoczi
On Mon, Aug 12, 2013 at 06:17:51PM +0800, Timon Wang wrote:
 Yes, the SCSI bus looks like passing a shared LUN through to the VM, and
 I am using a shared LUN for the 'share' purpose.
 
 I found a post saying that VMware uses the lsilogic bus for the shared
 disk, but my qemu/kvm version can't support the lsilogic bus.
 
 I'm trying to update my qemu/kvm version for lsilogic bus support.

Use virtio-scsi.  The emulated LSI SCSI controller has known bugs and is
not actively developed - don't be surprised if you hit issues with it.

The question is still what commands RAC or Failover Clustering use.  If
you find that the software refuses to run, it could be because
additional work is required to make it work on KVM.

Stefan


Re: Oracle RAC in libvirt+KVM environment

2013-08-12 Thread Stefan Hajnoczi
On Fri, Aug 02, 2013 at 01:58:24PM +0800, Timon Wang wrote:
 We want to set up two Oracle instances and make RAC work on them.
 Both VMs are set up with libvirt + KVM; we use an LVM LUN formatted
 as qcow2 and set the shareable property in the disk driver like this:
 
 <disk type='block' device='disk'>
   <driver name='qemu' type='qcow2' cache='none'/>

qcow2 is not cluster-aware, it cannot be opened by multiple VMs at the
same time.

You must use raw.
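
A sketch of a corrected disk element for this setup, with raw format,
cache='none', and the shareable flag; the source path and target device
name are placeholders for your own configuration:

```xml
<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source dev='/dev/vg0/oracle-shared'/>
  <target dev='vdb' bus='virtio'/>
  <shareable/>
</disk>
```

The shareable flag tells libvirt not to apply exclusive locking or
per-guest caching assumptions to the disk.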

Stefan


Re: Oracle RAC in libvirt+KVM environment

2013-08-12 Thread Stefan Hajnoczi
On Sat, Aug 10, 2013 at 11:14:39AM +0800, Timon Wang wrote:
 I have tried changing the disk bus to SCSI and adding a SCSI controller
 whose model is virtio-scsi, but I still can't set up the RAC instance.
 
 I tried to use the Windows 2008 Failover Cluster feature to set up a
 Failover Cluster instead, and I can't find any cluster disk to share
 between the two nodes. So when the Failover Cluster is set up, I can't
 add any cluster disk to it.
 
 Have I missed something?

I'm not sure what SCSI-level requirements RAC or Failover Cluster have.

If anyone knows which features are needed it would be possible to
confirm whether they are supported under KVM.

I expect this can only work if you are passing through a shared LUN.
Can you describe your configuration?
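
If it helps, shared-LUN passthrough with virtio-scsi in libvirt looks
roughly like the following sketch (the source device path and target name
are placeholders); device='lun' passes SCSI commands through to the
physical LUN rather than emulating a plain disk:

```xml
<controller type='scsi' index='0' model='virtio-scsi'/>
<disk type='block' device='lun'>
  <driver name='qemu' type='raw' cache='none'/>
  <source dev='/dev/mapper/shared-lun'/>
  <target dev='sda' bus='scsi'/>
  <shareable/>
</disk>
```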

Stefan


Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature?

2013-08-12 Thread Stefan Hajnoczi
On Fri, Aug 09, 2013 at 10:20:49AM +, Chijianchun wrote:
 Now in KVM, when taking a RAM snapshot, the vCPUs need to be stopped,
 which is an unfriendly restriction for users.
 
 Are there plans to implement a live RAM snapshot feature?
 
 In my mind, snapshots cannot occupy too much additional memory, so when a
 memory page needs to be changed, the old page must first be flushed to the
 file.  But flushing to file is much slower than memory, and while flushing,
 the vCPU or VM needs to be paused until the flush finishes, so it goes
 pause...resume...pause...resume, getting slower and slower.
 
 Is this idea feasible? Are there any other thoughts?

A few people have looked at live vmsave or guest RAM snapshots.

The idea that was discussed on qemu-de...@nongnu.org uses fork(2) to
capture the state of guest RAM and then send it back to the parent
process.  The guest is only paused for a brief instant during fork(2)
and can continue to run afterwards.

The child process is a simple loop that sends the contents of guest RAM
back to the parent process over a pipe or writes the memory pages to the
save file on disk.  It performs no logic besides writing out guest RAM.

Stefan


Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature?

2013-08-12 Thread Stefan Hajnoczi
On Mon, Aug 12, 2013 at 12:26 PM, Alex Bligh a...@alex.org.uk wrote:
 --On 12 August 2013 11:59:03 +0200 Stefan Hajnoczi stefa...@gmail.com
 wrote:

 The idea that was discussed on qemu-de...@nongnu.org uses fork(2) to
 capture the state of guest RAM and then send it back to the parent
 process.  The guest is only paused for a brief instant during fork(2)
 and can continue to run afterwards.


 How would you capture the state of emulated hardware which might not
 be in the guest RAM?

Exactly the same way vmsave works today.  It calls the device's save
functions which serialize state to file.

The difference between today's vmsave and the fork(2) approach is that
QEMU does not need to wait for guest RAM to be written to file before
resuming the guest.

Stefan


Re: FAQ on linux-kvm.org has broken link

2013-08-06 Thread Stefan Hajnoczi
On Mon, Aug 05, 2013 at 10:59:45PM +0200, folkert wrote:
  Two approaches to get closer to the source of the problem:
  1. Try the latest vanilla kernel on the host (Linux 3.10.5).  This way
 you can rule out fixed bugs in vhost_net or tap.
  2. Get the system into the bad state and then dig deeper.  Start
 with outgoing ping, instrument guest driver and host vhost_net
 functions to see what the drivers are doing, inspect the transmit
 vring, etc.
  
  #1 is probably the best next step.  If it fails and you still have time
  to work on a solution we can start digging deeper with #2.
 
 I can upgrade now to 3.10.3 as that is the current version in debian.

Sounds good.  That way you'll also have access to the latest perf for
instrumenting vhost_net if it still fails.

Stefan


Re: FAQ on linux-kvm.org has broken link

2013-08-05 Thread Stefan Hajnoczi
On Fri, Aug 02, 2013 at 08:06:58PM +0200, folkert wrote:
  A couple of questions:
  Please post the QEMU command-line from the host (ps aux | grep qemu).
 
 I'll post them all:
 - UMTS-clone: this one works fine since it was created a week ago
 - belle: this one was fine but suddenly also showed the problem
 - mauer: the problem one
 
 112   4819 1  4 Jul30 ?03:29:39 /usr/bin/kvm -S -M pc-1.1 
 -enable-kvm -m 1024 -smp 1,sockets=1,cores=1,threads=1 -name UMTS-clone -uuid 
 e49502f1-0c74-2a60-99dc-7602da5ee640 -no-user-config -nodefaults -chardev 
 socket,id=charmonitor,path=/var/lib/libvirt/qemu/UMTS-clone.monitor,server,nowait
  -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown 
 -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive 
 file=/dev/VGNEO/LV_V_UMTS-clone,if=none,id=drive-virtio-disk0,format=raw,cache=writeback
  -device 
 virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
  -drive 
 file=/home/folkert/ISOs/wheezy.iso,if=none,id=drive-ide0-1-0,readonly=on,format=raw
  -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev 
 tap,fd=20,id=hostnet0,vhost=on,vhostfd=21 -device 
 virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:09:3b:b6,bus=pci.0,addr=0x3
  -chardev pty,id=charserial0 -device 
 isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:0,password -vga 
 cirrus -device usb-host,hostbus=6,hostaddr=5,id=hostdev0 -device 
 virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
 112  10065 1 11 Jul30 ?07:46:16 /usr/bin/kvm -S -M pc-1.1 
 -enable-kvm -m 8192 -smp 12,sockets=12,cores=1,threads=1 -name belle -uuid 
 16b704d7-5fbd-d67b-71e6-0d6b43f1bc0a -no-user-config -nodefaults -chardev 
 socket,id=charmonitor,path=/var/lib/libvirt/qemu/belle.monitor,server,nowait 
 -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime 
 -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive 
 file=/dev/VGNEO/LV_V_BELLE,if=none,id=drive-virtio-disk0,format=raw -device 
 virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
  -drive 
 file=/dev/VGNEO/LV_V_BELLE_OS,if=none,id=drive-virtio-disk1,format=raw,cache=writeback
  -device 
 virtio-blk-pci,scsi=off,bus=pci.0,addr=0x8,drive=drive-virtio-disk1,id=virtio-disk1
  -drive 
 file=/dev/VGJOURNAL/LV_J_BELLE,if=none,id=drive-ide0-0-0,format=raw,cache=writeback
  -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -drive 
 if=none,id=drive-ide0-1-0,readonly=on,format=raw,cache=none -device 
 ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev 
 tap,fd=26,id=hostnet0,vhost=on,vhostfd=27 -device 
 virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:75:4a:6f,bus=pci.0,addr=0x3
  -netdev tap,fd=28,id=hostnet1,vhost=on,vhostfd=29 -device 
 virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:0a:6e:de,bus=pci.0,addr=0x7
  -chardev pty,id=charserial0 -device 
 isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 -vnc 
 127.0.0.1:1,password -vga cirrus -device 
 intel-hda,id=sound0,bus=pci.0,addr=0x4 -device 
 hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -device 
 virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
 root 13116 12830  0 19:54 pts/800:00:00 grep qemu
 112  23453 1 57 13:16 ?03:46:51 /usr/bin/kvm -S -M pc-1.1 
 -enable-kvm -m 8192 -smp 8,maxcpus=12,sockets=12,cores=1,threads=1 -name 
 mauer -uuid 3a8452e6-81af-b185-63b6-2b32be17ed87 -no-user-config -nodefaults 
 -chardev 
 socket,id=charmonitor,path=/var/lib/libvirt/qemu/mauer.monitor,server,nowait 
 -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown 
 -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive 
 file=/dev/VGNEO/LV_V_MAUER,if=none,id=drive-virtio-disk0,format=raw,cache=writeback
  -device 
 virtio-blk-pci,scsi=off,bus=pci.0,addr=0x8,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
  -drive 
 file=/dev/VGJOURNAL/LV_J_MAUER,if=none,id=drive-virtio-disk1,format=raw,cache=writethrough
  -device 
 virtio-blk-pci,scsi=off,bus=pci.0,addr=0xa,drive=drive-virtio-disk1,id=virtio-disk1
  -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw -device 
 ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev 
 tap,fd=26,id=hostnet0,vhost=on,vhostfd=27 -device 
 virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:86:d9:1f,bus=pci.0,addr=0x3
  -netdev tap,fd=28,id=hostnet1,vhost=on,vhostfd=29 -device 
 virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:a3:12:8a,bus=pci.0,addr=0x4
  -netdev tap,fd=30,id=hostnet2,vhost=on,vhostfd=31 -device 
 virtio-net-pci,netdev=hostnet2,id=net2,mac=52:54:00:0f:54:c2,bus=pci.0,addr=0x5
  -chardev pty,id=charserial0 -device 
 isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 -vnc 
 127.0.0.1:2,password -vga cirrus -device 
 intel-hda,id=sound0,bus=pci.0,addr=0x7 -device 
 hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -device 
 

Re: [Bug 60620] guest loses frequently (multiple times per day!) connectivity to network device

2013-08-02 Thread Stefan Hajnoczi
On Fri, Aug 02, 2013 at 11:28:45AM +, bugzilla-dae...@bugzilla.kernel.org 
wrote:
 https://bugzilla.kernel.org/show_bug.cgi?id=60620
 
 --- Comment #9 from Folkert van Heusden folk...@vanheusden.com ---
 Good news!
 If I
 
 - bring down all interfaces in the guest (ifdown eth0...)
 - rmmod virtio_net
 - modprobe virtio_net
 - bring up the interfaces again
 + it all works again!
 
 So hopefully this helps the bug hunt?

Hi Folkert,
Please post the QEMU command-line on the host (ps aux | grep qemu) and
the output of lsmod | grep vhost_net.

Since reinitializing the guest driver fixes the issue, we now need to
find out whether the guest or the host side got stuck.

I think I asked before, but please also post any relevant lines from
dmesg on the host and from the guest.  Examples would include networking
error messages, kernel backtraces, or out-of-memory errors.

Thanks,
Stefan


Re: FAQ on linux-kvm.org has broken link

2013-08-02 Thread Stefan Hajnoczi
On Fri, Aug 2, 2013 at 1:37 PM, folkert folk...@vanheusden.com wrote:
 If the result is #2, check firewalls on host and guest.  Also try the
 following inside the guest: disable the network interface, rmmod
 virtio_net, modprobe virtio_net again, and bring the network up.

 I pinged, I sniffed, I updated the bug report (it also happens with
 other guests now!).

 And the bring down interfaces / rmmod / modprobe / ifup works!
 So I think something is wrong with virtio_net!

 What shall I do now?

Hi Folkert,
I wrote a reply earlier today but it was rejected because I do not have
a kernel.org bugzilla account.  If you don't mind, let's continue
discussing on this mailing list - we don't know whether this is a
kernel bug yet anyway.

A couple of questions:

Please post the QEMU command-line from the host (ps aux | grep qemu).

Please confirm that vhost_net is being used on the host (lsmod | grep
vhost_net).

Please double-check both guest and host dmesg for any suspicious
messages.  It could be about networking, out-of-memory, or kernel
backtraces.

Stefan


Re: FAQ on linux-kvm.org has broken link

2013-07-31 Thread Stefan Hajnoczi
On Tue, Jul 30, 2013 at 10:45:20PM +0200, folkert wrote:
  If you keep losing network connectivity you may have a MAC or IP address
  conflict.  The symptom is that network traffic is intermittent - for
  example, ping might work but a full TCP connection does not.
 
 I submitted a bug at bugzilla a while ago which I updated today with new
 findings: https://bugzilla.kernel.org/show_bug.cgi?id=60620
 This week the system ran a couple of times for 1-2 days but tonight was
 a bit of a disaster: I had to reboot the system 18 times. Sometimes it
 was fine for half an hour but most of the times after a couple of
 minutes (sometimes even during boot) the networking on that one guest
 failed.

I can't add anything besides suggesting slightly more verbose
troubleshooting steps:

1. Wait until the guest suffers from lost network connectivity.

2. Confirm the MAC/IP addresses and run tcpdump -ni $IFACE inside the
   guest.  Ping the guest from the host and check whether tcpdump
   reports ICMP ping packets.

3. Now try pinging the host from the guest and run tcpdump -ni $IFACE on
   the host.  To determine the host-side tap interface, run the
   following:

   $ virsh domiflist mauer
   Interface  Type   Source Model   MAC
   ---
   vnet0  networkdefaultvirtio  52:54:00:b9:c8:4d

Now you have verified tap connectivity with the guest.  We now know:

1. Tap connectivity is fine (both transmit and receive are working)
2. Either transmit or receive are broken (ping doesn't work but tcpdump
   does show incoming packets on one side).
3. Tap connectivity is broken (ping fails and tcpdump shows no ICMP
   packets).

If the result is #1 then you can continue troubleshooting the next step:
the bridge or NAT configuration on the host.

If the result is #2, check firewalls on host and guest.  Also try the
following inside the guest: disable the network interface, rmmod
virtio_net, modprobe virtio_net again, and bring the network up.

If the result is #3, check firewalls on host and guest as well as dmesg
output in host and guest.

Stefan


Re: FAQ on linux-kvm.org has broken link

2013-07-30 Thread Stefan Hajnoczi
On Tue, Jul 30, 2013 at 03:18:53AM +0200, folkert wrote:
 The link at:
 http://www.linux-kvm.org/page/FAQ#My_guest_network_is_stuck_what_should_I_do.3F
 pointing to:
 http://qemu-buch.de/cgi-bin/moin.cgi/QemuNetwork
 is broken: it gives an Internal Server Error message.
 
 Could someone please point me to the correct location, as I'm struggling
 with a VM that keeps losing connectivity.

Hi Folkert,
I have updated the wiki to point to
http://qemu-project.org/Documentation/Networking.  The original link
seems to be down.

If you keep losing network connectivity you may have a MAC or IP address
conflict.  The symptom is that network traffic is intermittent - for
example, ping might work but a full TCP connection does not.

This happens when two guests are configured with identical MAC or IP
addresses on the same bridge or subnet.  They will fight over the MAC
or IP address and you will not be able to reliably communicate with
those guests.

The tool for solving networking issues is often tcpdump.  Run tcpdump
inside the guest to verify it is receiving traffic or investigate a
failed connection.

Run tcpdump on the host - especially if you are using -netdev tap - to
inspect the traffic being forwarded on behalf of the guest.

If you let libvirt set up networking for you all should be fine.  If you
run qemu manually or customized the domain XML, then it's possible you
have a misconfiguration.  Feel free to post the details so someone can
help you.

Stefan


Re: VU#976534 - How to submit security bugs?

2013-07-24 Thread Stefan Hajnoczi
On Mon, Jul 22, 2013 at 02:49:50PM -0400, CERT(R) Coordination Center wrote:
   My name is Adam Rauf and I work for the CERT Coordination Center.  We
 have a report that may affect KVM/QEMU.  How can we securely send it over to
 you?  Thanks so much!

Paolo, Gleb, Anthony: Is this already being discussed off-list?

Adam: Paolo Bonzini and Gleb Natapov are the KVM kernel maintainers and
Anthony Liguori is the QEMU maintainer.  You can verify this by checking
linux.git ./MAINTAINERS and qemu.git ./MAINTAINERS.  I suggest getting
in touch with them.

Stefan


Re: disk corruption after virsh destroy

2013-07-03 Thread Stefan Hajnoczi
On Tue, Jul 02, 2013 at 10:40:11AM -0400, Brian J. Murrell wrote:
 I have a cluster of VMs set up with shared virtio-scsi disks.  The
 purpose of sharing a disk is that if a VM goes down, another can
 pick up and mount the (ext4) filesystem on the shared disk and provide
 service to it.
 
 But just to be super clear, only one VM ever has a filesystem
 mounted at a time even though multiple VMs technically can access
 the device at the same time.  A VM mounting a filesystem ensures
 absolutely that no other node has it mounted before mounting it.
 
 That said, what I am finding is that when a node dies and
 another node tries to mount the (ext4) filesystem, it is found dirty
 and needs an fsck.
 
 My understanding is that with ext{3,4} this should not happen, and
 indeed in my experience on real hardware with coherent disk caching
 (i.e. no non-battery-backed caching disk controllers lying to the O/S
 about what has been written to physical disk) it does not: a node
 failing does not leave an ext{3,4} filesystem dirty such that it
 needs an fsck.
 
 So, clearly, somewhere between the KVM VM and the physical disk,
 there is a cache that is resulting in the guest O/S believing data
 is being written to physical disk that is not actually being written
 there.  To that end, I have ensured that on these shared disks I set
 cache=none, but this does not seem to have fixed the problem.

I expect journal replay and possibly fsck when an ext4 file system was
left in a mounted state and with I/O pending (e.g. due to power
failure).

A few questions:

1. Is the guest mounting the file system with barrier=0?  barrier=1 is
   the default.

2. Do the physical disks have a volatile write cache enabled?  If yes,
   the guest should use barrier=1.  If the physical disks have a
   non-volatile write cache or the write cache is disabled, then
   barrier=0 is okay.

3. Have you tested without the cluster?  Run a single VM and kill it
   while it is busy.  Then start it up again and see if there is fsck.

4. Is it possible that your previous cluster setup used tune2fs(8) to
   disable fsck in some cases?  That could explain why you didn't see
   fsck before but do now.

Stefan


Re: i/o threads

2013-06-27 Thread Stefan Hajnoczi
On Wed, Jun 26, 2013 at 03:53:21PM +0200, folkert wrote:
 I noticed that on my server running 3 VMs there are 10-20 threads
 doing i/o. As the VMs are running on HDDs and not SSDs, I think that is
 counterproductive: won't these threads make the HDDs seek back and forth
 constantly?

The worker threads are doing preadv()/pwritev()/fdatasync().  It's up to
the host kernel to schedule that I/O efficiently.

Exposing more I/O to the host gives it a chance to merge or reorder I/O
for optimal performance, so it's a good thing.

On the other hand, if QEMU only did 1 or 2 I/O requests at a time then
the host kernel could do nothing to improve the I/O pattern and the
disks would indeed seek back and forth constantly.

Stefan


Google Summer of Code 2013 has started

2013-06-26 Thread Stefan Hajnoczi
It is a pleasure to welcome the following GSoC 2013 students to the
QEMU, KVM, and libvirt communities:

Libvirt Wireshark Dissector - Yuto KAWAMURA (kawamuray)
http://qemu-project.org/Features/LibvirtWiresharkDissector

Libvirt Introduce API to query IP addresses for given domain - Nehal
J. Wani (nehaljwani)
http://www.google-melange.com/gsoc/project/google/gsoc2013/nehaljwani/51001

Libvirt More Intelligent virsh auto-completion - Tomas Meszaros
http://www.google-melange.com/gsoc/project/google/gsoc2013/examon/13001

QEMU Integrated Copy-Paste - Ozan Çağlayan and Pallav Agrawal (pallav)
http://qemu-project.org/Features/IntegratedCopyPaste

QEMU Continuation Passing C - Charlie Shepherd (cs648)
http://qemu-project.org/Features/Continuation-Passing_C

QEMU Kconfig - Ákos Kovács
http://qemu-project.org/Features/Kconfig

QEMU USB Media Transfer Protocol emulation - a|mond
http://www.google-melange.com/gsoc/project/google/gsoc2013/almond/1001

KVM Nested Virtualization Testsuite - Arthur Chunqi Li (xelatex)
http://www.google-melange.com/gsoc/project/google/gsoc2013/xelatex/19001

Coding started on Monday, 17th of June and ends Monday, 23rd of September.

Feel free to follow these projects - feature pages are being created
with git repo and blog links.

Stefan


Re: Would a DOS on dovecot running under a VM cause host to crash?

2013-06-24 Thread Stefan Hajnoczi
On Fri, Jun 21, 2013 at 10:27:07AM +1200, Hugh Davenport wrote:
 The attack lasted around 4 minutes, in which there were 1161 lines in
 the log for a single attacker IP, and no other similar logs previously.
 
 Would this be enough to kill not only the VM running dovecot, but
 the underlying host
 machine?

Have you checked logs on the host?  Specifically /var/log/messages for
seg fault messages or Out-of-Memory Killer messages.

It's also worth checking /var/log/libvirt/qemu/domain.log if you are
using libvirt.  That file contains the QEMU stderr output.

Stefan


Re: cache write back barriers

2013-06-14 Thread Stefan Hajnoczi
On Thu, Jun 13, 2013 at 10:47:32AM +0200, folkert wrote:
 Hi,
 
   In virt-manager I saw that there's the option for cache writeback for
   storage devices.
   I'm wondering: does this also make kvm to ignore write barriers invoked
   by the virtual machine?
  
  No, that would be unsafe.  When the guest issues a flush then QEMU will
  ensure that data reaches the disk with -drive cache=writeback.
 
 Aha, so writeback behaves like consumer hard disks with a write cache
 on them.
 In that case maybe an extra note could be added to virt-manager
 (excellent software by the way!) saying that if the guest VM supports
 barriers, write-back is safe in that case. Agree?

CCed virt-manager mailing list so they can see your request.

Stefan


Re: cache write back barriers

2013-06-13 Thread Stefan Hajnoczi
On Wed, Jun 12, 2013 at 10:03:10AM +0200, folkert wrote:
 In virt-manager I saw that there's the option for cache writeback for
 storage devices.
 I'm wondering: does this also make kvm to ignore write barriers invoked
 by the virtual machine?

No, that would be unsafe.  When the guest issues a flush then QEMU will
ensure that data reaches the disk with -drive cache=writeback.

Stefan


Re: VirtIO and BSOD On Windows Server 2003

2013-06-04 Thread Stefan Hajnoczi
On Mon, Jun 03, 2013 at 09:56:41AM -0700, Aaron Clausen wrote:
 I recently built a new kvm server with Debian Wheezy which comes with
 KVM 1.1.2 and when I moved this guest over, I immediately started
 getting BSODs (0x007). I disabled virtio block driver and then
 attempted to upgrade to the latest with no luck.

Stop code 0x7B "Inaccessible boot device"?

How did you create the guest on the new server?  Perhaps the hardware
configuration changed - I suggest trying to make it as close to the
original guest as possible (including the same PCI slots).

Stefan


Re: Redirections from virtual interfaces.

2013-06-03 Thread Stefan Hajnoczi
On Fri, May 31, 2013 at 11:10:24AM -0300, Targino SIlveira wrote:
 I have a server with only one NIC.  This NIC has a public IP, and the
 server is located in a data center.  I can't have more than one NIC,
 but I can have many IPs, so I would like to know if I can redirect
 packets from virtual interfaces to my VMs?
 
 Examples:
 
 eth0:1 xxx.xx.xxx.xxx redirect all traffic to 192.168.122.200
 eth0:2 xxx.xx.xxx.xxy redirect all traffic to 192.168.122.150
 eth0:3 xxx.xx.xxx.xxz redirect all traffic to 192.168.122.180
 
 I'm using /etc/libvirt/hooks/qemu to write iptables rules.

Yes, look at NAT.  A lot of material covers NAT behind one public IP;
in this case you actually need to map public addresses onto private
addresses 1:1.

A web search for linux nat should turn up howtos.  Or check on
libvirt.org if there is libvirt configuration that automatically sets
this up for you.

Stefan


Re: updated: kvm networking todo wiki

2013-05-30 Thread Stefan Hajnoczi
On Thu, May 30, 2013 at 7:23 AM, Rusty Russell ru...@rustcorp.com.au wrote:
 Anthony Liguori anth...@codemonkey.ws writes:
 Rusty Russell ru...@rustcorp.com.au writes:
 On Fri, May 24, 2013 at 08:47:58AM -0500, Anthony Liguori wrote:
 FWIW, I think what's more interesting is using vhost-net as a networking
 backend with virtio-net in QEMU being what's guest facing.

 In theory, this gives you the best of both worlds: QEMU acts as a first
 line of defense against a malicious guest while still getting the
 performance advantages of vhost-net (zero-copy).

 It would be an interesting idea if we didn't already have the vhost
 model where we don't need the userspace bounce.

 The model is very interesting for QEMU because then we can use vhost as
 a backend for other types of network adapters (like vmxnet3 or even
 e1000).

 It also helps for things like fault tolerance where we need to be able
 to control packet flow within QEMU.

 (CC's reduced, context added, Dmitry Fleytman added for vmxnet3 thoughts).

 Then I'm really confused as to what this would look like.  A zero copy
 sendmsg?  We should be able to implement that today.

 On the receive side, what can we do better than readv?  If we need to
 return to userspace to tell the guest that we've got a new packet, we
 don't win on latency.  We might reduce syscall overhead with a
 multi-dimensional readv to read multiple packets at once?

Sounds like recvmmsg(2).

Stefan


Re: [PATCH] kvm: exclude ioeventfd from counting kvm_io_range limit

2013-05-27 Thread Stefan Hajnoczi
On Sat, May 25, 2013 at 06:44:15AM +0800, Amos Kong wrote:
 We can easily reach the 1000 limit by starting a VM with a couple
 hundred I/O devices (multifunction=on). The hardcoded limit has
 already been adjusted 3 times (6 ~ 200 ~ 300 ~ 1000).
 
 In userspace, we already have maximum file descriptor to
 limit ioeventfd count. But kvm_io_bus devices also are used
 for pit, pic, ioapic, coalesced_mmio. They couldn't be limited
 by maximum file descriptor.
 
 Currently only ioeventfds use up too many kvm_io_bus devices, so just
 exclude them from counting against the kvm_io_range limit.
 
 Also fixed one indent issue in kvm_host.h
 
 Signed-off-by: Amos Kong ak...@redhat.com
 ---
  include/linux/kvm_host.h | 3 ++-
  virt/kvm/eventfd.c   | 2 ++
  virt/kvm/kvm_main.c  | 3 ++-
  3 files changed, 6 insertions(+), 2 deletions(-)

Reviewed-by: Stefan Hajnoczi stefa...@redhat.com


Re: [PATCH] kvm: add detail error message when fail to add ioeventfd

2013-05-23 Thread Stefan Hajnoczi
On Wed, May 22, 2013 at 09:48:21PM +0800, Amos Kong wrote:
 On Wed, May 22, 2013 at 11:32:27AM +0200, Stefan Hajnoczi wrote:
  On Wed, May 22, 2013 at 12:57:35PM +0800, Amos Kong wrote:
   When I try to hotplug 28 * 8 multifunction devices into a guest with
   an old host kernel, ioeventfds in the host kernel are exhausted and
   qemu fails to allocate ioeventfds for blk/nic devices.
   
   It's better to add a detailed error message here.
   
   Signed-off-by: Amos Kong ak...@redhat.com
   ---
kvm-all.c |4 
1 files changed, 4 insertions(+), 0 deletions(-)
  
  It would be nice to make kvm bus scalable so that the hardcoded
  in-kernel I/O device limit can be lifted.
 
 I increased the kernel's NR_IOBUS_DEVS to 1000 (a limit is needed for
 security) last March, and made resizing of the kvm_io_range array dynamic.

The maximum should not be hardcoded.  File descriptor, maximum memory,
etc are all controlled by rlimits.  And since ioeventfds are file
descriptors they are already limited by the maximum number of file
descriptors.

Why is there a need to impose a hardcoded limit?

Stefan


Re: [PATCH] kvm: add detail error message when fail to add ioeventfd

2013-05-22 Thread Stefan Hajnoczi
On Wed, May 22, 2013 at 12:57:35PM +0800, Amos Kong wrote:
 When I try to hotplug 28 * 8 multifunction devices into a guest with
 an old host kernel, ioeventfds in the host kernel are exhausted and
 qemu fails to allocate ioeventfds for blk/nic devices.
 
 It's better to add a detailed error message here.
 
 Signed-off-by: Amos Kong ak...@redhat.com
 ---
  kvm-all.c |4 
  1 files changed, 4 insertions(+), 0 deletions(-)

It would be nice to make kvm bus scalable so that the hardcoded
in-kernel I/O device limit can be lifted.

Reviewed-by: Stefan Hajnoczi stefa...@redhat.com


Re: 2013 Linux Plumbers Virtualization Microconference proposal call for participation

2013-05-17 Thread Stefan Hajnoczi
On Thu, May 16, 2013 at 02:32:30PM -0600, Alex Williamson wrote:
 We'd like to hold another virtualization microconference as part of this
 year's Linux Plumbers Conference.  To do so, we need to show that
 there's enough interest, materials, and people willing to attend. 

Convenience info:

September 18-20, 2013
New Orleans, Louisiana


Re: how emulated disk IO translated to physical disk IO on host side

2013-05-10 Thread Stefan Hajnoczi
On Tue, May 07, 2013 at 09:58:52AM -0500, sheng qiu wrote:
 i am trying to figure out the code path which translate the emulated
 disk IO issued by VM to actual physical disk IO on host side. Can
 anyone give me a clear view about this?

For an overview of the stack:
http://events.linuxfoundation.org/slides/2011/linuxcon-japan/lcj2011_hajnoczi.pdf

 I read the kvm-side code about VM-exit handling; handle_io() will be
 called for IO exits.  For an IO job that cannot be handled inside the
 hypervisor, it will switch to the qemu-kvm process and handle_io() on
 the qemu side will be called.  Finally it seems to invoke
 ioport_read()/ioport_write(), which invoke the actual registered
 read/write operators.  Then I get lost: I do not know, for emulated
 disk IO, which function is responsible for the remaining job, i.e.
 catching the command for accessing the virtual disk, translating it
 into reads/writes at offsets of the disk img file (assuming we use a
 file for the virtual disk), and then issuing the system call on the
 host to send the real IO command to the physical disk.

If you are running with a virtio-blk PCI adapter, then QEMU's
virtio_queue_host_notifier_read() or virtio_ioport_write() is invoked
(it depends whether ioeventfd is being used or regular I/O dispatch).

Then QEMU's hw/block/virtio-blk.c will call
bdrv_aio_readv()/bdrv_aio_writev()/bdrv_aio_flush().  This enters the
QEMU block layer (see block.c and block/) where image file formats are
handled.

Eventually you get to block/raw-posix.c which issues either a
preadv()/pwritev()/fdatasync() in a worker thread or a Linux AIO
io_submit() (if -drive aio=native was used).

Stefan


Re: [Qemu-devel] KVM call minutes for 2013-04-23

2013-04-24 Thread Stefan Hajnoczi
On Tue, Apr 23, 2013 at 10:06:41AM -0600, Eric Blake wrote:
 On 04/23/2013 08:45 AM, Juan Quintela wrote:
we can change drive_mirror to use a new command to see if there
are the new features.
 
 drive-mirror changed in 1.4 to add optional buf-size parameter; right
 now, libvirt is forced to limit itself to 1.3 interface (no buf-size or
 granularity) because there is no introspection and no query-* command
 that witnesses that the feature is present.  Idea was that we need to
 add a new query-drive-mirror-capabilities (name subject to bikeshedding)
 command into 1.5 that would let libvirt know that buf-size/granularity
 is usable (done right, it would also prevent the situation of buf-size
 being a write-only interface where it is set when starting the mirror
 but can not be queried later to see what size is in use).
 
 Unclear whether anyone was signing up to tackle the addition of a query
 command counterpart for drive-mirror in time for 1.5.

Seems like the trivial solution is a query-command-capabilities QMP
command.

  query-command-capabilities drive-mirror
  = ['buf-size']

It should only be a few lines of code and can be used for other commands
that add optional parameters in the future.  In other words:

typedef struct mon_cmd_t {
...
const char **capabilities; /* drive-mirror uses [buf-size, NULL] */
};

  
if we have a stable c-api we can do test cases that work. 
 
 Having such a testsuite would make a stable C API more important.

Writing tests in Python has been productive, see qemu-iotests 041 and
friends.  The tests spawn QEMU guests and use QMP to interact:

  result = self.vm.qmp('query-block')
  self.assert_qmp(result, 'return[0]/inserted/file', target_img)

Using this XPath-style syntax it's very easy to access the JSON.

QEMU users tend not to use C, except libvirt.  Even libvirt implements
the QMP protocol dynamically and can handle optional arguments well.

I don't think a static C API makes sense when we have an extensible JSON
protocol.  Let's use the extensibility to our advantage.

Stefan


Re: Fwd: kvm

2013-04-23 Thread Stefan Hajnoczi
On Mon, Apr 22, 2013 at 10:59:25AM +0100, Gary Lloyd wrote:
 I was wondering if anyone could help me with an issue with KVM and ISCSI.
 
 If we restart a controller on our EqualLogic SAN or there are any
 network interruptions on the storage network, KVM guests throw a
 wobbler and their files systems go into read only(centos 5.9 guest
 with virtio driver).
 
 I have read a few forums that indicate you can set disk timeout values
 on the guests themselves but this is not possible using the virtio
 driver, which is what we are currently using.
 
 Is there any way we can instruct KVM to pause the vm's if there is a
 storage failure and resume them when the storage comes back online ?
 
 We are currently running CentOS 6.4. There seem to be werror='stop'
 and rerror='stop' options to achieve this, but if I try to put these
 options in the libvirt xml file for a vm, libvirt appears to be
 removing them.

Please email libvirt-us...@redhat.com for questions about libvirt in the
future.

This is a question about libvirt domain XML.  The documentation is here:
http://libvirt.org/formatdomain.html#elementsDisks

The attribute is called error_policy.  The documentation says:

  The optional error_policy attribute controls how the hypervisor will
  behave on a disk read or write error, possible values are stop,
  report, ignore, and enospace.Since 0.8.0, report since 0.9.7 The
  default setting of error_policy is report. There is also an optional
  rerror_policy that controls behavior for read errors only. Since 0.9.7.
  If no rerror_policy is given, error_policy is used for both read and
  write errors. If rerror_policy is given, it overrides the error_policy
  for read errors. Also note that enospace is not a valid policy for
  read errors, so if error_policy is set to enospace and no
  rerror_policy is given, the read error policy will be left at its
  default, which is report.
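For example, a hypothetical disk element using these attributes might look like the following (the device path and target name are placeholders, not taken from Gary's setup):

```xml
<disk type='block' device='disk'>
  <driver name='qemu' type='raw' error_policy='stop' rerror_policy='stop'/>
  <source dev='/dev/mapper/iscsi-lun'/>
  <target dev='vda' bus='virtio'/>
</disk>
```

With error_policy='stop' the hypervisor pauses the guest on I/O errors instead of surfacing them, so the VM can be resumed once the storage comes back.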

Stefan


Re: [User Question] Repeated severe performance problems on guest

2013-04-18 Thread Stefan Hajnoczi
On Wed, Apr 17, 2013 at 09:52:39PM +0200, Martin Wawro wrote:
 Hi Stefan,
 
  The host is interesting too if you suspect KVM is involved in the
  performance issue (rather than it being purely an application issue
  inside the guest).  For example, pidstat (from the sysstat package) on
  the host can tell you the guest mode CPU utilization percentage.  That's
  useful for double-checking that the guest is indeed using up a lot of
  CPU time (the guest data you posted suggests it is).
 
 I added it to the host logging to have more information next time something
 goes haywire.
 
  
  What does top or ps say about the 79% userspace CPU utilization?
  Perhaps this is unrelated to KVM and simply a buggy application going
  nuts.
  
 
 In this case, it was postgres (we have a couple of instances running
 on the guest). But it can also be another daemon process that usually
 behaves very well, so no real culprit to pinpoint it to.
 We have the same setup (including OS versions and binary versions) in
 other locations (on physical machines) running for years without any
 problems, so I doubt that this is an application issue. Another hint
 that it is not an application issue is the fact that when we shut down
 the processes that generate the load, the load average goes down for a
 couple of seconds and then rises again to sky-high values with another
 process consuming the load (until nothing is left running on the
 machine except for syslogd :-) ).

I see.  That's a good reason to carefully monitor the host for things
that could interfere with guest performance.

Stefan


Re: [User Question] Repeated severe performance problems on guest

2013-04-18 Thread Stefan Hajnoczi
On Thu, Apr 18, 2013 at 12:00 PM, Martin Wawro martin.wa...@gmail.com wrote:
 On 04/18/2013 09:25 AM, Stefan Hajnoczi wrote:
 I see.  That's a good reason to carefully monitor the host for things
 that could interfere with guest performance.

 Stefan
 Seems that today is a bad day for our server. We had to give him the
 boot (again).
 Also the results of the pidstat output do not seem to yield much
 additional information on what could be the problem.

 In order to avoid spilling this mailing list, here is some data gathered
 on the host:
 http://pastebin.com/8q7UgXkJ

 ...and this is the data from the guest:
 http://pastebin.com/xLTYZjGp

No answer but some more questions.

Regarding the kvm_stat output, the exits are caused by 68,000
pagefaults/second (pf_fixed).  Perhaps someone can explain what this
means?

The host has 8 cores, the guest has 7.  Host pidstat shows qemu-kvm
consuming 263.9% CPU:

11:25:27 4017   11.13   34.65  218.12  263.90 7  qemu-kvm

Why is the guest not getting more than 3 CPUs since the host is otherwise idle?

You may want to disable ksmd on the host since you only have 1 guest,
but I doubt that will fix the main problem:

11:25:27   100   0.00   7.89   0.00   7.89  7  ksmd

For details, see https://www.kernel.org/doc/Documentation/vm/ksm.txt.

What is the python process on the host doing?  Is it poking libvirt?

11:25:27  4558   4.66   3.55   0.00   8.21  7  python
11:25:27  3659   3.99   4.55   0.00   8.54  7  libvirtd

Stefan


Re: [User Question] Repeated severe performance problems on guest

2013-04-18 Thread Stefan Hajnoczi
On Thu, Apr 18, 2013 at 03:27:45PM +0200, Martin Wawro wrote:
 On 04/18/2013 03:14 PM, Stefan Hajnoczi wrote:
  No answer but some more questions.
 
  Regarding the kvm_stat output, the exits are caused by 68,000
  pagefaults/second (pf_fixed).  Perhaps someone can explain what this
  means?
 
  The host has 8 cores, the guest has 7.  Host pidstat shows qemu-kvm
  consuming 263.9% CPU:
 
  11:25:27 4017   11.13   34.65  218.12  263.90 7  qemu-kvm
 
  Why is the guest not getting more than 3 CPUs since the host is otherwise 
  idle?
 
 If one waits a little longer, top shows all 7 cores under utilization
 (700%).  Unfortunately we have to be quick with the reboots during
 daytime, because the system is in production use and we have not
 decided yet to completely replace it.

BTW does the host CPU support Intel Extended Page Tables or AMD Nested
Page Tables?  grep 'npt\|ept' /proc/cpuinfo

(I think the kvm_stat is saying EPT/NPT are not in use)

 
  For details, see https://www.kernel.org/doc/Documentation/vm/ksm.txt.
 
  What is the python process on the host doing?  Is it poking libvirt?
 
  11:25:27  4558   4.66   3.55   0.00   8.21  7  python
  11:25:27  3659   3.99   4.55   0.00   8.54  7  libvirtd
 
 That is virt-manager.py, exactly doing that.

Okay, I was wondering if something is causing libvirt and maybe QEMU to
act strangely.  If its just virt-manager then it's probably not the
issue.

Stefan


Re: [User Question] Repeated severe performance problems on guest

2013-04-17 Thread Stefan Hajnoczi
On Tue, Apr 16, 2013 at 09:49:20AM +0200, Martin Wawro wrote:
 On 04/16/2013 07:49 AM, Stefan Hajnoczi wrote:
  Besides the kvm_stat, general performance data from the host is useful
  when dealing with high load averages.
 
  Do you have vmstat or sar data for periods of time when the machine was
  slow?
 
  Stefan
 
 We do have a rather exhaustive log on the guest.  As for the host, we
 did not find anything suspicious except for the kvm_stat output, so we
 did not log any more than that.

The host is interesting too if you suspect KVM is involved in the
performance issue (rather than it being purely an application issue
inside the guest).  For example, pidstat (from the sysstat package) on
the host can tell you the guest mode CPU utilization percentage.  That's
useful for double-checking that the guest is indeed using up a lot of
CPU time (the guest data you posted suggests it is).

 Here is the output of vmstat 5 5 on the guest:
 
 procs ---memory-- ---swap-- -io -system-- cpu
  r  b   swpd   free   buff  cache   si   so   bi   bo   in   cs us sy id wa
 84  0  19596 104404 60 2193261600   232   11092  7  2 90  1
 80  0  19596  98100 60 2193392000   106   119  854  912 79 21  0  0
 89  0  19596  94216 60 2193276400   106   223  864  886 79 21  0  0
 87  0  19596  95848 60 21927612008247  856  906 79 21  0  0
 
 Load average at that time: 75 (1:20 AM)

What does top or ps say about the 79% userspace CPU utilization?
Perhaps this is unrelated to KVM and simply a buggy application going
nuts.

Stefan


Re: [RFC] provide an API to userspace doing memory snapshot

2013-04-17 Thread Stefan Hajnoczi
On Tue, Apr 16, 2013 at 03:54:15PM +0800, Wenchao Xia wrote:
 于 2013-4-16 13:51, Stefan Hajnoczi 写道:
 On Mon, Apr 15, 2013 at 09:03:36PM +0800, Wenchao Xia wrote:
   I'd like to add/export a function which allows a userspace program
 to take a snapshot of a region of memory. Since it is not implemented
 yet I will describe it as C APIs; it is quite simple now, and if it
 is worthwhile I'll improve the interface later:
 
 We talked about a simple approach using fork(2) on IRC yesterday.
 
 Is this email outdated?
 
 Stefan
 
   No, after the discussion on IRC, I agree that fork() is a simpler
 method, which can come to qemu quickly, since users want it.
   On further consideration, I still think a KVM memory snapshot would
 be a long-term solution:
   The source of the problem comes from the acceleration module, kvm.ko;
 when qemu does not use it, there is no trouble.  This means the
 acceleration module misses a function that the caller requires.  My
 instinct is: when an acceleration module replaces a pure software one,
 it should try to provide all parts rather than stop software from
 filling the gap, and doing so brings benefits, so I hope to add it.
   My API description is old; the core is COW pages, which I may
 redesign if reasonable.

QEMU is a userspace process that has guest RAM mmapped.  You want to
snapshot that mmap region but there is no Linux system call to do that.
Maybe a new mremap(2) flag is what you want.

But I don't see the connection to kvm.ko which you mention.  The feature
you're wishing for has nothing to do with kvm.ko.

Stefan


Re: Perf tuning help?

2013-04-17 Thread Stefan Hajnoczi
On Tue, Apr 16, 2013 at 04:30:17PM -0400, Mason Turner wrote:
 We have an in-house app, written in c, that is not performing as well as we'd 
 hoped it would when moving to a VM. We've tried all the common tuning 
 recommendations (virtio, tap interface, cpu pinning), without any change in 
 performance. Even terminating all of the other VMs on the host doesn't make a 
 difference. The VM doesn't appear to be CPU, memory or IO bound. We are 
 trying to maximize UDP-based QPS against the in-house app.
 
 I've been running strace against the app and perf kvm against the VM to try 
 to identify any bottlenecks. I would say there are a lot of kvm_exits, but 
 I'm not sure how to quantify what is acceptable and what is not.
 
 We are trying to maximize UDP queries against the app. I've read a few times 
 that the virtio network stack results in a lot of vm_exits. Unfortunately, we 
 can't use the direct PCI access with our hardware.

Can you explain the traffic characteristics more?
 * UDP packet size
 * Pattern: 1 query packet, 1 response packet or something more exotic
 * Bare metal QPS (the goal)
 * Guest QPS (what you're seeing)
 * Benchmark configuration: are packets going across a physical network?

 Is there a good resource on inefficient system calls? Things that result in 
 higher than normal kvm_exits, or other performance killers?
 
 Thanks for the help.
 
 Our hypervisor is running on
 CentOS 6.3: 2.6.32-279.22.1.el6.x86_64
 qemu-kvm 0.12.1.2  
 libvirt 0.9.10
 
 Our app is running on
 Centos 6.1: 2.6.32-131.0.15.el6.x86_64

Slightly outdated guest and host.  It might be worth trying upstream
kernels and QEMU (built from source).

 <domain type='kvm'>
   <name>thing1</name>
   <uuid>abe76ce9-60a0-4727-a7ae-cf572e5c3f21</uuid>
   <memory unit='KiB'>16384000</memory>
   <currentMemory unit='KiB'>16384000</currentMemory>
   <vcpu placement='static'>6</vcpu>
   <cputune>
     <vcpupin vcpu='0' cpuset='0'/>
     <vcpupin vcpu='1' cpuset='2'/>
     <vcpupin vcpu='2' cpuset='4'/>
     <vcpupin vcpu='3' cpuset='6'/>
     <vcpupin vcpu='4' cpuset='8'/>
     <vcpupin vcpu='5' cpuset='10'/>
   </cputune>
   <numatune>
     <memory mode='interleave' nodeset='0,2,4,6,8,10'/>
   </numatune>
   <os>
     <type arch='x86_64' machine='rhel6.0.0'>hvm</type>
     <boot dev='hd'/>
   </os>
   <features>
     <acpi/>
     <apic/>
     <pae/>
   </features>
   <clock offset='utc'/>
   <on_poweroff>destroy</on_poweroff>
   <on_reboot>restart</on_reboot>
   <on_crash>restart</on_crash>
   <devices>
     <emulator>/usr/libexec/qemu-kvm</emulator>
     <disk type='file' device='disk'>
       <driver name='qemu' type='raw' cache='none'/>
       <source file='/var/lib/libvirt/images/thing1-disk0'/>
       <target dev='vda' bus='virtio'/>
       <address type='pci' domain='0x' bus='0x00' slot='0x05' function='0x0'/>
     </disk>
     <controller type='usb' index='0'>
       <address type='pci' domain='0x' bus='0x00' slot='0x01' function='0x2'/>
     </controller>
     <interface type='bridge'>
       <mac address='00:5e:e3:e1:8a:aa'/>
       <source bridge='virbr0'/>
       <model type='virtio'/>
       <address type='pci' domain='0x' bus='0x00' slot='0x04' function='0x0'/>
     </interface>

Please double-check that vhost-net is being used:
http://pic.dhe.ibm.com/infocenter/lnxinfo/v3r0m0/topic/liaat/liaatbpvhostnet.htm

     <serial type='pty'>
       <target port='0'/>
     </serial>
     <console type='pty'>
       <target type='serial' port='0'/>
     </console>
     <input type='tablet' bus='usb'/>
     <input type='mouse' bus='ps2'/>
     <graphics type='vnc' port='-1' autoport='yes' keymap='en-us'/>
     <video>
       <model type='cirrus' vram='9216' heads='1'/>
       <address type='pci' domain='0x' bus='0x00' slot='0x02' function='0x0'/>
     </video>
     <memballoon model='virtio'>
       <address type='pci' domain='0x' bus='0x00' slot='0x06' function='0x0'/>
     </memballoon>
   </devices>
 </domain>


Re: [User Question] Repeated severe performance problems on guest

2013-04-15 Thread Stefan Hajnoczi
On Fri, Apr 12, 2013 at 05:04:27PM +0200, Martin Wawro wrote:
 Logging the kvm_stat on the host, we obtained the following output during

Besides the kvm_stat, general performance data from the host is useful
when dealing with high load averages.

Do you have vmstat or sar data for periods of time when the machine was
slow?

Stefan


Re: [RFC] provide an API to userspace doing memory snapshot

2013-04-15 Thread Stefan Hajnoczi
On Mon, Apr 15, 2013 at 09:03:36PM +0800, Wenchao Xia wrote:
   I'd like to add/export an function which allow userspace program
 to take snapshot for a region of memory. Since it is not implemented yet
 I will describe it as C APIs, it is quite simple now and if it is worthy
 I'll improve the interface later:

We talked about a simple approach using fork(2) on IRC yesterday.

Is this email outdated?

Stefan


Re: [Qemu-devel] reply: reply: qemu crashed when starting vm(kvm) with vnc connect

2013-04-11 Thread Stefan Hajnoczi
On Mon, Apr 08, 2013 at 12:27:06PM +, Zhanghaoyu (A) wrote:
 On Sun, Apr 07, 2013 at 04:58:07AM +, Zhanghaoyu (A) wrote:
   I start a kvm VM with vnc(using the zrle protocol) connect, sometimes 
   qemu program crashed during starting period, received signal SIGABRT.
   Trying about 20 times, this crash may be reproduced.
   I guess the cause is memory corruption or a double free.
  
   Which version of QEMU are you running?
   
   Please try qemu.git/master.
   
   Stefan
  
  I used the QEMU download from qemu.git (http://git.qemu.org/git/qemu.git).
 
  Great, thanks!  Can you please post a backtrace?
  
  The easiest way is:
  
   $ ulimit -c unlimited
   $ qemu-system-x86_64 -enable-kvm -m 1024 ...
   ...crash...
   $ gdb -c qemu-system-x86_64.core
   (gdb) bt
  
  Depending on how your system is configured the core file might have a 
  different filename, but there should be a file named *core* in the current 
  working directory after the crash.
  
  The backtrace will make it possible to find out where the crash occurred.
  
  Thanks,
  Stefan
 
 backtrace from core file is shown as below:
 
 Program received signal SIGABRT, Aborted.
 0x7f32eda3dd95 in raise () from /lib64/libc.so.6
 (gdb) bt
 #0  0x7f32eda3dd95 in raise () from /lib64/libc.so.6
 #1  0x7f32eda3f2ab in abort () from /lib64/libc.so.6
 #2  0x7f32eda77ece in __libc_message () from /lib64/libc.so.6
 #3  0x7f32eda7dc06 in malloc_printerr () from /lib64/libc.so.6
 #4  0x7f32eda7ecda in _int_free () from /lib64/libc.so.6
 #5  0x7f32efd3452c in free_and_trace (mem=0x7f329cd0) at vl.c:2880
 #6  0x7f32efd251a1 in buffer_free (buffer=0x7f32f0c82890) at ui/vnc.c:505
 #7  0x7f32efd20c56 in vnc_zrle_clear (vs=0x7f32f0c762d0)
 at ui/vnc-enc-zrle.c:364
 #8  0x7f32efd26d07 in vnc_disconnect_finish (vs=0x7f32f0c762d0)
 at ui/vnc.c:1050
 #9  0x7f32efd275c5 in vnc_client_read (opaque=0x7f32f0c762d0)
 at ui/vnc.c:1349
 #10 0x7f32efcb397c in qemu_iohandler_poll (readfds=0x7f32f074d020,
 writefds=0x7f32f074d0a0, xfds=0x7f32f074d120, ret=1) at iohandler.c:124
 #11 0x7f32efcb46e8 in main_loop_wait (nonblocking=0) at main-loop.c:417
 #12 0x7f32efd31159 in main_loop () at vl.c:2133
 #13 0x7f32efd38070 in main (argc=46, argv=0x7fff7f5df178,
 envp=0x7fff7f5df2f0) at vl.c:4481

CCing Corentin and Gerd who are more familiar with the VNC code than me.

Stefan


Re: [Qemu-devel] KVM call agenda for 2013-04-09

2013-04-09 Thread Stefan Hajnoczi
Meeting notes on Abel's presentation:

Aim: improve vhost scalability

Shared vhost thread
===================
Problem: Linux scheduler does not see state of virtqueues, cannot make
good scheduling decisions
Solution: Shared thread serves multiple VMs and therefore influences
I/O scheduling instead of kernel thread per vhost device

Exitless communication
======================
 * Polling on host to notice guest vring updates without guest pio instruction
   * Use CPU affinity to bind vcpus to separate cores and let polling
run on dedicated cores
 * Exitless Interrupts (ELI) or future hardware APIC virtualization
   feature to inject virtual interrupts without vmexit and EOI

See paper for performance results (impressive numbers):
http://domino.research.ibm.com/library/cyberdig.nsf/papers/479E3578ED05BFAC85257B4200427735/$File/h-0319.pdf

Abel will publish rebased code on GitHub but does not have time to
upstream it.

The next step: QEMU/KVM community can digest the paper + patches and
decide on ideas to upstream.


Re: Virtualbox svga card in KVM

2013-04-08 Thread Stefan Hajnoczi
On Fri, Apr 05, 2013 at 04:52:05PM -0700, Sriram Murthy wrote:
 For starters, virtual box has better SVGA WDDM drivers that allows for a much 
 richer display when the VM display is local.

What does "much richer display" mean?

Stefan


Re: 答复: [Qemu-devel] qemu crashed when starting vm(kvm) with vnc connect

2013-04-08 Thread Stefan Hajnoczi
On Sun, Apr 07, 2013 at 04:58:07AM +, Zhanghaoyu (A) wrote:
  I start a kvm VM with vnc(using the zrle protocol) connect, sometimes qemu 
  program crashed during starting period, received signal SIGABRT.
  Trying about 20 times, this crash may be reproduced.
   I guess the cause is memory corruption or a double free.
 
  Which version of QEMU are you running?
  
  Please try qemu.git/master.
  
  Stefan
 
 I used the QEMU download from qemu.git (http://git.qemu.org/git/qemu.git).

Great, thanks!  Can you please post a backtrace?

The easiest way is:

  $ ulimit -c unlimited
  $ qemu-system-x86_64 -enable-kvm -m 1024 ...
  ...crash...
  $ gdb -c qemu-system-x86_64.core
  (gdb) bt

Depending on how your system is configured the core file might have a
different filename, but there should be a file named *core* in the
current working directory after the crash.

The backtrace will make it possible to find out where the crash
occurred.

Thanks,
Stefan


Re: [Qemu-devel] [PATCH uq/master v2 0/2] Add some tracepoints for clarification of the cause of troubles

2013-04-08 Thread Stefan Hajnoczi
On Fri, Mar 29, 2013 at 01:24:25PM +0900, Kazuya Saito wrote:
 This series adds tracepoints to help clarify the cause of troubles.
 Virtualization on Linux is composed of several components such as
 qemu, kvm, libvirt, and so on, so it is very important to determine
 quickly which of these components a problem lies in. Although qemu has
 useful information for this, because it sits between kvm, libvirt and
 the guest, it doesn't output that information via a trace or log system.
 These patches add tracepoints which reduce the time needed for that
 determination. We'd like to add these tracepoints as a first set
 because, based on our experience, we've found they will be useful
 for investigations in the future. Without those tracepoints,
 we had a really hard time investigating a problem, since the problem's
 reproducibility was quite low and there was no clue in the dump of
 qemu.
 
 Changes from v1:
 Add arg to kvm_ioctl, kvm_vm_ioctl, kvm_vcpu_ioctl tracepoints.
 Add cpu_index to kvm_vcpu_ioctl, kvm_run_exit tracepoints.
 
 Kazuya Saito (2):
   kvm-all: add kvm_ioctl, kvm_vm_ioctl, kvm_vcpu_ioctl tracepoints
   kvm-all: add kvm_run_exit tracepoint
 
   kvm-all.c    |  5 +
   trace-events |  7 +++
   2 files changed, 12 insertions(+), 0 deletions(-)
 
 
 

Thanks, applied to my tracing tree:
https://github.com/stefanha/qemu/commits/tracing

Stefan


We've been accepted to Google Summer of Code 2013

2013-04-08 Thread Stefan Hajnoczi
Good news!  QEMU.org has been accepted to Google Summer of Code 2013.

This means students can begin considering our list of QEMU, kvm kernel
module, and libvirt project ideas:

http://qemu-project.org/Google_Summer_of_Code_2013

Student applications open April 22 at 19:00 UTC.  You can already view
the application template here:

http://www.google-melange.com/gsoc/org/google/gsoc2013/qemu

If you are an interested student, please take a look at the project
ideas and get in touch with the mentor for that project.  They can
help clarify the scope of the project and what skills are necessary.

You are invited to join the #qemu-gsoc IRC channel on irc.oftc.net
where questions about Google Summer of Code with QEMU.org are welcome.

Stefan


Re: [Qemu-devel] qemu crashed when starting vm(kvm) with vnc connect

2013-04-05 Thread Stefan Hajnoczi
On Tue, Apr 02, 2013 at 09:02:02AM +, Zhanghaoyu (A) wrote:
 I start a kvm VM with vnc(using the zrle protocol) connect, sometimes qemu 
 program crashed during starting period, received signal SIGABRT.
 Trying about 20 times, this crash may be reproduced.
 I guess the cause is memory corruption or a double free.

Which version of QEMU are you running?

Please try qemu.git/master.

Stefan


Re: Virtualbox svga card in KVM

2013-04-05 Thread Stefan Hajnoczi
On Thu, Mar 21, 2013 at 10:53:21AM -0400, Alon Levy wrote:
   I am planning on bringing in the virtualbox svga card into kvm
   as a new svga card type (vbox probably?) so that we can load
   the VirtualBox SVGA card drivers in the guest.

I'm curious if the vbox SVGA card has features that existing QEMU
graphics cards do not provide?

Stefan


Re: [Qemu-devel] KVM call agenda for 2013-03-26

2013-03-26 Thread Stefan Hajnoczi
On Mon, Mar 25, 2013 at 08:13:34PM -0500, Rob Landley wrote:
 On 03/25/2013 08:17:44 AM, Juan Quintela wrote:
 
 Hi
 
 Please send in any agenda topics you are interested in.
 
 Later, Juan.
 
 If Google summer of code is still open:
 
   http://qemu-project.org/Google_Summer_of_Code_2013

Project ideas can still be added to the wiki.  They must have a mentor
who is able to commit around 5 hours per week this summer.

I'm not sure about the status of the todo list items you mentioned,
hopefully others can help.

Stefan


QEMU has applied for Google Summer of Code 2013

2013-03-22 Thread Stefan Hajnoczi
QEMU.org has applied for Google Summer of Code 2013 and also aims to
be an umbrella organization for libvirt and the KVM kernel module.

Accepted mentoring organizations will be announced on April 8 at 19:00
UTC at http://google-melange.com/.

This year we have proposed 5 QEMU project ideas, 1 KVM kernel module
project idea, and 4 libvirt project ideas:

http://qemu-project.org/Google_Summer_of_Code_2013

Thanks to everyone who has volunteered to be a mentor!  Also thanks to
Anthony Liguori for being backup org admin.

Fingers crossed,
Stefan


Re: [PATCH] virtio-blk: Set default serial id

2013-03-20 Thread Stefan Hajnoczi
On Wed, Mar 20, 2013 at 01:56:08PM +0800, Asias He wrote:
 If user does not specify a serial id, e.g.
 
-device virtio-blk-pci,serial=serial_id
 or
-drive serial=serial_id
 
 no serial id will be assigned.
 
 Add a default serial id in this case to help identifying
 the disk in guest.
 
 Signed-off-by: Asias He as...@redhat.com
 ---
  hw/virtio-blk.c | 7 +++
  1 file changed, 7 insertions(+)

Autogenerated IDs have been proposed (for other devices?) before and I
think we should avoid them.

The serial in this patch depends on the internal counter we use for
savevm.  It is not a well-defined value that guests can depend on
remaining the same.

It can change between QEMU invocations - due to internal changes in QEMU
or because the management tool reordered -device options.

Users will be confused and their guests may stop working if they depend
on an ID like this.

The solution is to do persistent naming either by really passing -device
virtio-blk-pci,serial= or with udev inside the guest using the bus
address (PCI devfn) like the new persistent network interface naming for
Linux.

Stefan


Re: [PATCH V3 WIP 2/3] vhost-scsi: new device supporting the tcm_vhost Linux kernel module

2013-03-19 Thread Stefan Hajnoczi
On Tue, Mar 19, 2013 at 08:34:44AM +0800, Asias He wrote:
 +static void vhost_scsi_stop(VHostSCSI *vs, VirtIODevice *vdev)
 +{
 +    int ret = 0;
 +
 +    if (!vdev->binding->set_guest_notifiers) {
 +        ret = vdev->binding->set_guest_notifiers(vdev->binding_opaque,
 +                                                 vs->dev.nvqs, false);
 +        if (ret < 0) {
 +            error_report("vhost guest notifier cleanup failed: %d\n", ret);

Indentation.  scripts/checkpatch.pl should catch this.

 +        }
 +    }
 +    assert(ret >= 0);
 +
 +    vhost_scsi_clear_endpoint(vdev);
 +    vhost_dev_stop(&vs->dev, vdev);
 +    vhost_dev_disable_notifiers(&vs->dev, vdev);
 +}
 +
 +static void vhost_scsi_set_config(VirtIODevice *vdev,
 +                                  const uint8_t *config)
 +{
 +    VirtIOSCSIConfig *scsiconf = (VirtIOSCSIConfig *)config;
 +    VHostSCSI *vs = (VHostSCSI *)vdev;
 +
 +    if ((uint32_t) ldl_raw(&scsiconf->sense_size) != vs->vs.sense_size ||
 +        (uint32_t) ldl_raw(&scsiconf->cdb_size) != vs->vs.cdb_size) {
 +        error_report("vhost-scsi does not support changing the sense data "
 +                     "and CDB sizes");
 +        exit(1);

Guest-triggerable exits can be used as a denial of service - especially
under nested virtualization where killing the L1 hypervisor would kill
all L2 guests!

I would just log a warning here.

 +}
 +}
 +
 +static void vhost_scsi_set_status(VirtIODevice *vdev, uint8_t val)
 +{
 +    VHostSCSI *vs = (VHostSCSI *)vdev;
 +    bool start = (val & VIRTIO_CONFIG_S_DRIVER_OK);
 +
 +    if (vs->dev.started == start) {
 +        return;
 +    }
 +
 +    if (start) {
 +        int ret;
 +
 +        ret = vhost_scsi_start(vs, vdev);
 +        if (ret < 0) {
 +            error_report("virtio-scsi: unable to start vhost: %s\n",
 +                         strerror(-ret));
 +
 +            /* There is no userspace virtio-scsi fallback so exit */
 +            exit(1);

It's questionable whether to kill the guest or simply disable this
virtio-scsi-pci adapter.  Fine for now but we may want to allow a policy
here in the future.

 diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
 index 39c1966..281a7e2 100644
 --- a/hw/virtio-pci.c
 +++ b/hw/virtio-pci.c
 @@ -22,6 +22,7 @@
  #include "hw/virtio-net.h"
  #include "hw/virtio-serial.h"
  #include "hw/virtio-scsi.h"
 +#include "hw/vhost-scsi.h"

Can this header be included unconditionally?  It uses _IOW() which may
not be available on all host platforms.


Re: [PATCH V3 WIP 3/3] disable vhost_verify_ring_mappings check

2013-03-19 Thread Stefan Hajnoczi
On Tue, Mar 19, 2013 at 08:34:45AM +0800, Asias He wrote:
 ---
  hw/vhost.c | 2 ++
  1 file changed, 2 insertions(+)
 
 diff --git a/hw/vhost.c b/hw/vhost.c
 index 4d6aee3..0c52ec4 100644
 --- a/hw/vhost.c
 +++ b/hw/vhost.c
 @@ -421,10 +421,12 @@ static void vhost_set_memory(MemoryListener *listener,
          return;
      }
  
 +#if 0
      if (dev->started) {
          r = vhost_verify_ring_mappings(dev, start_addr, size);
          assert(r >= 0);
      }
 +#endif

Please add a comment to explain why.

Stefan


Re: Can I bridge the loopback?

2013-03-18 Thread Stefan Hajnoczi
On Sat, Mar 16, 2013 at 12:06:30AM -0500, Steve wrote:
 Here's the issue. I want to communicate between virtual machines, second 
 Ethernet virtual port. But I would like to use the host loopback for that so 
 as to not be limited to Ethernet port speeds, for large copies, etc. Right 
 now, the machine is connected to a 10mbps switch on port 2 and would like to 
 get far faster transfer speeds when using the so called private LAN. Bridging 
 eth1 merely limits the speed to 10 Mbps. If it was bridged to the host 
 loopback, I was hoping it could achieve far faster speeds and also not 
 saturate the switch.
 
 So, could I make a br1 that is assigned to 127.0.0.1 and then each host can 
 use that as eth1?--

Guest-guest communication is not affected by physical NIC link speed.
A software bridge with the guest tap interfaces and the host's physical
interface should allow guests to communicate at well over 10 Mbps.

Have you measured the speed of guest-guest networking and found it is
limited to 10 Mbps?

If you still experience poor performance, please post your QEMU
command-line, ifconfig -a (on host), and brctl show (on host) output.

Stefan


Re: [PATCH V2 WIP 2/2] vhost-scsi: new device supporting the tcm_vhost Linux kernel module

2013-03-12 Thread Stefan Hajnoczi
On Tue, Mar 12, 2013 at 02:29:42PM +0800, Asias He wrote:
 diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
 index 39c1966..4a97ca1 100644
 --- a/hw/virtio-pci.c
 +++ b/hw/virtio-pci.c

These changes break the build for non-Linux hosts.  Please introduce a
CONFIG_VHOST_SCSI and #ifdef appropriate sections in hw/virtio-pci.c.

CONFIG_VIRTFS does the same thing.

 +static Property vhost_scsi_properties[] = {
 +    DEFINE_PROP_BIT("ioeventfd", VirtIOPCIProxy, flags,
 +                    VIRTIO_PCI_FLAG_USE_IOEVENTFD_BIT, true),

This flag makes QEMU's virtqueue handling use ioeventfd.  Since the
vhost-scsi.c takes over the guest/host notifiers, we never do QEMU
virtqueue processing and the ioeventfd flag has no real meaning.  You
can drop it.


Re: win2k guest vm won't boot under Fedora 18 KVM

2013-03-11 Thread Stefan Hajnoczi
On Sat, Mar 09, 2013 at 12:43:32PM -0700, Earl Marwil wrote:
 Hi,
 
 I'm looking for some guidance on how to get to the root cause of an
 issue that I am observing with a win2k guest that won't boot under
 Fedora 18 on one system but will boot on another. A few days ago I
 posted on the fedora forum:
 
 http://forums.fedoraforum.org/showthread.php?t=289401
 
 I can repeat the details in this thread if requested. The issue is that,
 with a fresh build of Fedora 18, updated to the most recent kernel and
 packages on an external USB ssd, my win2k VM boots on my laptop (Core 
 i7-3720QM processor) but does not boot on my desktop system (Core i7-870 
 processor).
 
 I'm not sure whether this is a kvm issue or a kernel issue. I'll be glad
 to dig deeper, just let me know what information is needed.

Hi Earl,
From your forum post:

  KVM internal error. Suberror: 1
  emulation failure
  EAX=63700200 EBX=e6f5 ECX=000f EDX=0936
  ...
  Code=74 1d b0 37 e6 70 eb 00 e4 71 eb 00 32 e4 c1 c0 04 c0 c8 04 d5 0a 3d 
13 0
  0 75 04 b8 7a 15 c3 b8 00 00 c3 55 8b ec 1e 06 56 57 8b 46 04 8e d8 8b 76 06

Here is my guess:

Laptop has a CPU from 2012.  Desktop has a CPU from 2009.

Intel added unrestricted guest support to VMX.  This feature allows
the CPU to run real mode code in guest mode.

CPUs that do not support unrestricted guest (your desktop?) use an
emulator implemented in software inside the kvm.ko kernel module.

The emulator may be unable to handle the real mode instruction in the
particular kernel version you are running.

The laptop doesn't hit this issue because it supports unrestricted
guest while the desktop falls back to the emulator inside kvm.ko where
it hits the bug.

You may find that changing kernel versions on the desktop will make it
work.

The best would be to compile a vanilla Linux kernel for the desktop
machine to verify that this issue still happens.  If so, please post the
full KVM internal error output to this mailing list and hopefully
someone can fix the emulator.

Problem with my theory: I haven't figured out how to check which Intel
CPU models support unrestricted guest, so I'm not 100% sure this is
the issue.

Stefan


Re: KVM call agenda for 2013-03-12

2013-03-11 Thread Stefan Hajnoczi
On Mon, Mar 11, 2013 at 4:42 PM, Juan Quintela quint...@redhat.com wrote:

 Please send in any agenda topics you are interested in.

Overview of mentoring for Google Summer of Code 2013:

 * Post project ideas here: http://wiki.qemu.org/Google_Summer_of_Code_2013
 * Who can be a mentor?
 * What's in it for the mentor?
 * What does a mentor do?
 * How does a mentor select a student to work with?

Open discussion - any questions about Google Summer of Code.

Stefan


Re: [PATCH V2 0/6] tcm_vhost hotplug/hotunplug support and locking/flushing fix

2013-03-08 Thread Stefan Hajnoczi
On Fri, Mar 08, 2013 at 10:21:41AM +0800, Asias He wrote:
 Changes in v2:
 - Remove code duplication in tcm_vhost_{hotplug,hotunplug}
 - Fix racing of vs_events_nr
 - Add flush fix patch to this series
 
 Asias He (6):
   tcm_vhost: Add missed lock in vhost_scsi_clear_endpoint()
   tcm_vhost: Introduce tcm_vhost_check_feature()
   tcm_vhost: Introduce tcm_vhost_check_endpoint()
   tcm_vhost: Fix vs-vs_endpoint checking in vhost_scsi_handle_vq()
   tcm_vhost: Add hotplug/hotunplug support
   tcm_vhost: Flush vhost_work in vhost_scsi_flush()
 
   drivers/vhost/tcm_vhost.c | 243 --
   drivers/vhost/tcm_vhost.h |  10 ++
   2 files changed, 247 insertions(+), 6 deletions(-)
 
 -- 
 1.8.1.4
 

Reviewed-by: Stefan Hajnoczi stefa...@redhat.com


Re: kvm + ceph performance issues

2013-03-08 Thread Stefan Hajnoczi
On Thu, Mar 07, 2013 at 12:57:55PM +0100, Wolfgang Hennerbichler wrote:
 I'm running a virtual machine with the following command:
 
 LC_ALL=C
 PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin
 QEMU_AUDIO_DRV=none /usr/bin/kvm -S -M pc-1.0 -enable-kvm -m 4096 -smp
 2,sockets=2,cores=1,threads=1 -name korfu_ceph -uuid
 a9131b8f-d087-26f4-2ca9-018505f11838 -nodefconfig -nodefaults -chardev
 socket,id=charmonitor,path=/var/lib/libvirt/qemu/korfu_ceph.monitor,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime
 -no-shutdown -device lsi,id=scsi0,bus=pci.0,addr=0x4 -drive
 file=rbd:rd/korfu:rbd_cache=1:mon_host=rd-clusternode21\:6789\;rd-clusternode22\:6789,if=none,id=drive-ide0-0-0,format=raw
 -device
 ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1
 -usb -vnc 127.0.0.1:0 -vga std -device
 virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5
 
 kvm-version is that from ubuntu LTS: 1.0+noroms-0ubuntu14.7
 
  When I read or write big files the system basically gets unusable
  (the mouse cursor in VNC jerks across the screen, I/O is mostly
  blocking). I know it is related to ceph in a way, but also to KVM, as it
  seems that there are a lot of IRQs happening (or how else do you explain
  the mouse cursor in VNC jerking and lagging behind?). The Ceph mailing
  list doesn't really help. High CPU load doesn't hurt the machine, it's
  only harddisk I/O. Oh, and the main host running kvm doesn't really
  suffer either: some I/O waiting, but no swapping or anything.
 
 Here's my libvirt-config if it is of any help:
 http://pastie.org/6411055
 
 any hints would REALLY be appreciated...

Please try using virtio-blk instead of IDE.

If the guest still jerks try using the Linux rbd block driver instead of
QEMU -drive rbd:.  I haven't used Ceph much but there should be
documentation on attaching a RADOS block device to your Linux host.
Tell QEMU to use the RADOS block device like a regular file (you are now
using the kernel driver instead of QEMU code to talk to the Ceph
cluster).

Please let us know the outcome.  If you find that virtio-blk does not
make much difference but using the kernel rbd driver does, then this
suggests there is a bug in QEMU's block/rbd.c.

Stefan


Re: [PATCH 5/5] tcm_vhost: Add hotplug/hotunplug support

2013-03-07 Thread Stefan Hajnoczi
On Thu, Mar 07, 2013 at 08:26:20AM +0800, Asias He wrote:
 On Wed, Mar 06, 2013 at 10:21:09AM +0100, Stefan Hajnoczi wrote:
  On Wed, Mar 06, 2013 at 02:16:30PM +0800, Asias He wrote:
    +static struct tcm_vhost_evt *tcm_vhost_allocate_evt(struct vhost_scsi *vs,
    +    u32 event, u32 reason)
    +{
    +    struct tcm_vhost_evt *evt;
    +
    +    if (atomic_read(&vs->vs_events_nr) > VHOST_SCSI_MAX_EVENT)
    +        return NULL;
    +
    +    evt = kzalloc(sizeof(*evt), GFP_KERNEL);
    +
    +    if (evt) {
    +        atomic_inc(&vs->vs_events_nr);
  
   This looks suspicious: checking vs_events_nr against
   VHOST_SCSI_MAX_EVENT first and then incrementing later isn't atomic!
 
  This does not matter. (1) and (2) are okay. In case (3), the other side
  can only decrease the number of events, so the limit will not be exceeded.
 
 (1) 
  atomic_dec()
atomic_read() 
atomic_inc()
 (2)
atomic_read() 
atomic_inc()
  atomic_dec()
 
 (3)
atomic_read() 
  atomic_dec()
atomic_inc()

The cases you listed are fine but I'm actually concerned about
tcm_vhost_allocate_evt() racing with itself.  There are 3 callers and
I'm not sure which lock prevents them from executing at the same time.

   +static int tcm_vhost_hotunplug(struct tcm_vhost_tpg *tpg, struct se_lun 
   *lun)
   +{
    + struct vhost_scsi *vs = tpg->vhost_scsi;
   +
    + mutex_lock(&tpg->tv_tpg_mutex);
    + vs = tpg->vhost_scsi;
    + mutex_unlock(&tpg->tv_tpg_mutex);
   + if (!vs)
   + return -EOPNOTSUPP;
   +
   + if (!tcm_vhost_check_feature(vs, VIRTIO_SCSI_F_HOTPLUG))
   + return -EOPNOTSUPP;
   +
   + return tcm_vhost_send_evt(vs, tpg, lun,
   + VIRTIO_SCSI_T_TRANSPORT_RESET,
   + VIRTIO_SCSI_EVT_RESET_REMOVED);
   +}
  
  tcm_vhost_hotplug() and tcm_vhost_hotunplug() are the same function
  except for VIRTIO_SCSI_EVT_RESET_RESCAN vs
  VIRTIO_SCSI_EVT_RESET_REMOVED.  That can be passed in as an argument and
  the code duplication can be eliminated.
 
 I thought about this also. We can have a tcm_vhost_do_hotplug() helper.
 
tcm_vhost_do_hotplug(tpg, lun, plug)

tcm_vhost_hotplug() {
   tcm_vhost_do_hotplug(tpg, lun, true)
}

tcm_vhost_hotunplug() {
   tcm_vhost_do_hotplug(tpg, lun, false)
}
 
 The reason I did not do that is I do not like the true/false argument
 but anyway this could remove duplication. I will do it.

true/false makes the calling code hard to read, I suggest passing in
VIRTIO_SCSI_EVT_RESET_RESCAN or VIRTIO_SCSI_EVT_RESET_REMOVED as the
argument.
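A compile-checked userspace sketch of the suggested refactor.  The struct
definitions, constant values, and the send_evt stub below are stand-ins for
illustration only, not the kernel's actual definitions:

```c
#include <stdio.h>

/* Illustrative values only -- the real ones live in virtio_scsi.h. */
#define VIRTIO_SCSI_T_TRANSPORT_RESET  1
#define VIRTIO_SCSI_EVT_RESET_RESCAN   1
#define VIRTIO_SCSI_EVT_RESET_REMOVED  2

/* Stand-ins for the kernel types used in the patch. */
struct tcm_vhost_tpg { int dummy; };
struct se_lun { int dummy; };

static int last_reason;  /* records what the stub "sent" */

static int tcm_vhost_send_evt_stub(struct tcm_vhost_tpg *tpg,
                                   struct se_lun *lun,
                                   int event, int reason)
{
    (void)tpg; (void)lun; (void)event;
    last_reason = reason;
    return 0;
}

/* One helper carries the virtio reason code... */
static int tcm_vhost_do_hotplug(struct tcm_vhost_tpg *tpg,
                                struct se_lun *lun, int reason)
{
    return tcm_vhost_send_evt_stub(tpg, lun,
                                   VIRTIO_SCSI_T_TRANSPORT_RESET, reason);
}

/* ...and the two entry points stay self-documenting at the call site. */
static int tcm_vhost_hotplug(struct tcm_vhost_tpg *tpg, struct se_lun *lun)
{
    return tcm_vhost_do_hotplug(tpg, lun, VIRTIO_SCSI_EVT_RESET_RESCAN);
}

static int tcm_vhost_hotunplug(struct tcm_vhost_tpg *tpg, struct se_lun *lun)
{
    return tcm_vhost_do_hotplug(tpg, lun, VIRTIO_SCSI_EVT_RESET_REMOVED);
}
```

The duplication collapses into tcm_vhost_do_hotplug() while the virtio
constant, rather than a bare true/false, documents intent at the call site.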

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5] tcm_vhost: Add hotplug/hotunplug support

2013-03-07 Thread Stefan Hajnoczi
On Thu, Mar 07, 2013 at 05:47:26PM +0800, Asias He wrote:
 On Thu, Mar 07, 2013 at 09:58:04AM +0100, Stefan Hajnoczi wrote:
  On Thu, Mar 07, 2013 at 08:26:20AM +0800, Asias He wrote:
   On Wed, Mar 06, 2013 at 10:21:09AM +0100, Stefan Hajnoczi wrote:
On Wed, Mar 06, 2013 at 02:16:30PM +0800, Asias He wrote:
 +static struct tcm_vhost_evt *tcm_vhost_allocate_evt(struct 
 vhost_scsi *vs,
 + u32 event, u32 reason)
 +{
 + struct tcm_vhost_evt *evt;
 +
  + if (atomic_read(&vs->vs_events_nr) > VHOST_SCSI_MAX_EVENT)
 + return NULL;
 +
 + evt = kzalloc(sizeof(*evt), GFP_KERNEL);
 +
 + if (evt) {
  + atomic_inc(&vs->vs_events_nr);

This looks suspicious: checking vs_events_nr against VHOST_SCSI_MAX_EVENT
first and then incrementing later isn't atomic!
   
    This does not matter. (1) and (2) are okay. In case (3), the other side
    can only decrease the number of events, so the limit will not be exceeded.
   
   (1) 
atomic_dec()
  atomic_read() 
  atomic_inc()
   (2)
  atomic_read() 
  atomic_inc()
atomic_dec()
   
   (3)
  atomic_read() 
atomic_dec()
  atomic_inc()
  
  The cases you listed are fine but I'm actually concerned about
  tcm_vhost_allocate_evt() racing with itself.  There are 3 callers and
  I'm not sure which lock prevents them from executing at the same time.
 
 No lock prevents it. But what is the race when
 tcm_vhost_allocate_evt() executes concurrently?

atomic_read() <= VHOST_SCSI_MAX_EVENT
   atomic_read() <= VHOST_SCSI_MAX_EVENT
atomic_inc()
   atomic_inc()

Now vs->vs_events_nr == VHOST_SCSI_MAX_EVENT + 1 which the if statement
was supposed to prevent.
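One lock-free way to close that window is to fold the limit check and the
increment into a single atomic step (the in-kernel analogue would be
something like atomic_add_unless()).  Below is a userspace sketch using C11
atomics; the names and the limit value are illustrative, not the patch's:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define VHOST_SCSI_MAX_EVENT 128  /* illustrative limit */

/* Shared event counter; in the kernel this would be vs->vs_events_nr. */
static atomic_int events_nr = 0;

/*
 * Check the limit and increment in one atomic step via compare-and-swap,
 * so two concurrent callers can never both pass the check at the limit.
 * Returns false when the limit (treated as exclusive here) is reached.
 */
static bool event_reserve(void)
{
    int old = atomic_load(&events_nr);

    do {
        if (old >= VHOST_SCSI_MAX_EVENT)
            return false;
    } while (!atomic_compare_exchange_weak(&events_nr, &old, old + 1));
    return true;
}

/* The release side stays a plain atomic decrement. */
static void event_release(void)
{
    atomic_fetch_sub(&events_nr, 1);
}
```

With this shape the counter can never exceed the limit regardless of how
many threads call event_reserve() simultaneously.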

 +static int tcm_vhost_hotunplug(struct tcm_vhost_tpg *tpg, struct 
 se_lun *lun)
 +{
  + struct vhost_scsi *vs = tpg->vhost_scsi;
 +
  + mutex_lock(&tpg->tv_tpg_mutex);
  + vs = tpg->vhost_scsi;
  + mutex_unlock(&tpg->tv_tpg_mutex);
 + if (!vs)
 + return -EOPNOTSUPP;
 +
 + if (!tcm_vhost_check_feature(vs, VIRTIO_SCSI_F_HOTPLUG))
 + return -EOPNOTSUPP;
 +
 + return tcm_vhost_send_evt(vs, tpg, lun,
 + VIRTIO_SCSI_T_TRANSPORT_RESET,
 + VIRTIO_SCSI_EVT_RESET_REMOVED);
 +}

tcm_vhost_hotplug() and tcm_vhost_hotunplug() are the same function
except for VIRTIO_SCSI_EVT_RESET_RESCAN vs
VIRTIO_SCSI_EVT_RESET_REMOVED.  That can be passed in as an argument and
the code duplication can be eliminated.
   
   I thought about this also. We can have a tcm_vhost_do_hotplug() helper.
   
  tcm_vhost_do_hotplug(tpg, lun, plug)
  
  tcm_vhost_hotplug() {
 tcm_vhost_do_hotplug(tpg, lun, true)
  }
  
  tcm_vhost_hotunplug() {
 tcm_vhost_do_hotplug(tpg, lun, false)
  }
   
   The reason I did not do that is I do not like the true/false argument
   but anyway this could remove duplication. I will do it.
  
  true/false makes the calling code hard to read, I suggest passing in
  VIRTIO_SCSI_EVT_RESET_RESCAN or VIRTIO_SCSI_EVT_RESET_REMOVED as the
  argument.
 
 Yes. However, I think passing VIRTIO_SCSI_EVT_RESET_* is even worse.
 
 1) Having VIRTIO_SCSI_EVT_RESET_RESCAN or VIRTIO_SCSI_EVT_RESET_REMOVED
 around VIRTIO_SCSI_T_TRANSPORT_RESET would be nicer.
 
 2) tcm_vhost_do_hotplug(tpg, lun, VIRTIO_SCSI_EVT_RESET_*)
 does not make much sense. What the hell is VIRTIO_SCSI_EVT_RESET_* when
 you do hotplug or hotunplug. In contrast, if we have
 tcm_vhost_do_hotplug(tpg, lun, plug), plug means doing hotplug or
 hotunplug.

The VIRTIO_SCSI_EVT_RESET_REMOVED constant is pretty clear (removed
means unplug).  The VIRTIO_SCSI_EVT_RESET_RESCAN is less clear, but this
code is in drivers/vhost/tcm_vhost.c so you can expect the reader to
know the device specification :).

Anyway, it's not the end of the world if you leave the duplicated code
in, use a boolean parameter, or use the virtio event constant.

Stefan


Re: [PATCH 2/5] tcm_vhost: Introduce tcm_vhost_check_feature()

2013-03-06 Thread Stefan Hajnoczi
On Wed, Mar 06, 2013 at 02:16:27PM +0800, Asias He wrote:
 This helper is useful to check if a feature is supported.
 
 Signed-off-by: Asias He as...@redhat.com
 ---
  drivers/vhost/tcm_vhost.c | 14 ++
  1 file changed, 14 insertions(+)
 
 diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c
 index b3e50d7..fdbf986 100644
 --- a/drivers/vhost/tcm_vhost.c
 +++ b/drivers/vhost/tcm_vhost.c
 @@ -91,6 +91,20 @@ static int iov_num_pages(struct iovec *iov)
   ((unsigned long)iov->iov_base & PAGE_MASK)) >> PAGE_SHIFT;
  }
  
 +static bool tcm_vhost_check_feature(struct vhost_scsi *vs, u64 feature)
 +{
 + u64 acked_features;
 + bool ret = false;
 +
  + mutex_lock(&vs->dev.mutex);
  + acked_features = vs->dev.acked_features;
  + if (acked_features & (1ULL << feature))
  + ret = true;
  + mutex_unlock(&vs->dev.mutex);
 +
 + return ret;
 +}

This is like vhost_has_feature() except it acquires dev.mutex?

In any case it isn't tcm_vhost-specific and could be in vhost.c.

Stefan


Re: Tracing kvm: kvm_entry and kvm_exit

2013-02-28 Thread Stefan Hajnoczi
On Thu, Feb 28, 2013 at 5:49 AM, David Ahern dsah...@gmail.com wrote:
 On 2/27/13 9:39 AM, David Ahern wrote:

 I have been playing with the live mode a bit lately. I'll add a debug to
 note 2 consecutive entry events without an exit -- see if it sheds some
 light on it.


 If you feel game take this for a spin:
   https://github.com/dsahern/linux/commits/perf-kvm-live-3.8

This is very cool, thanks for sharing.

Next time I'm profiling vmexit latencies I'll give it a try.

Stefan


Re: [Qemu-devel] [PATCH v3 4/5] KVM: ioeventfd for virtio-ccw devices.

2013-02-26 Thread Stefan Hajnoczi
On Tue, Feb 26, 2013 at 12:55:36PM +0200, Michael S. Tsirkin wrote:
 On Mon, Feb 25, 2013 at 04:27:49PM +0100, Cornelia Huck wrote:
  diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
  index f0ced1a..8de3cd7 100644
  --- a/virt/kvm/eventfd.c
  +++ b/virt/kvm/eventfd.c
  @@ -679,11 +679,16 @@ static int
   kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
   {
   int   pio = args->flags & KVM_IOEVENTFD_FLAG_PIO;
  -   enum kvm_bus  bus_idx = pio ? KVM_PIO_BUS : KVM_MMIO_BUS;
  +   int   ccw;
  +   enum kvm_bus  bus_idx;
  struct _ioeventfd*p;
  struct eventfd_ctx   *eventfd;
  int   ret;
   
   +   ccw = args->flags & KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY;
  +   bus_idx = pio ? KVM_PIO_BUS :
  +   ccw ? KVM_VIRTIO_CCW_NOTIFY_BUS :
  +   KVM_MMIO_BUS;
 
 May be better to rewrite using if/else.

Saw this after sending my comment.  I agree with Michael, an if
statement allows you to drop the locals and capture the bus_idx
conversion in a single place (it could even be a static function to save
duplicating the code in both functions that use it).
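A sketch of what that shared static helper could look like.  The flag and
enum values below are illustrative stand-ins for the real KVM header
definitions, and the helper name is made up:

```c
/* Illustrative values -- the real definitions live in the KVM headers. */
#define KVM_IOEVENTFD_FLAG_PIO                (1u << 1)
#define KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY  (1u << 3)

enum kvm_bus { KVM_MMIO_BUS, KVM_PIO_BUS, KVM_VIRTIO_CCW_NOTIFY_BUS };

/*
 * Map ioeventfd flags to a bus index in one place, to be shared by
 * kvm_assign_ioeventfd() and kvm_deassign_ioeventfd().
 */
static enum kvm_bus ioeventfd_bus_from_flags(unsigned int flags)
{
    if (flags & KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY)
        return KVM_VIRTIO_CCW_NOTIFY_BUS;
    if (flags & KVM_IOEVENTFD_FLAG_PIO)
        return KVM_PIO_BUS;
    return KVM_MMIO_BUS;
}
```

Both callers then reduce to `bus_idx = ioeventfd_bus_from_flags(args->flags);`
and the pio/ccw locals disappear.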

Stefan


Re: [PATCH v3 4/5] KVM: ioeventfd for virtio-ccw devices.

2013-02-26 Thread Stefan Hajnoczi
On Mon, Feb 25, 2013 at 04:27:49PM +0100, Cornelia Huck wrote:
 diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
 index f0ced1a..8de3cd7 100644
 --- a/virt/kvm/eventfd.c
 +++ b/virt/kvm/eventfd.c
 @@ -679,11 +679,16 @@ static int
  kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
  {
   int   pio = args->flags & KVM_IOEVENTFD_FLAG_PIO;
 - enum kvm_bus  bus_idx = pio ? KVM_PIO_BUS : KVM_MMIO_BUS;
 + int   ccw;
 + enum kvm_bus  bus_idx;
   struct _ioeventfd*p;
   struct eventfd_ctx   *eventfd;
   int   ret;
  
  + ccw = args->flags & KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY;
 + bus_idx = pio ? KVM_PIO_BUS :
 + ccw ? KVM_VIRTIO_CCW_NOTIFY_BUS :
 + KVM_MMIO_BUS;
   /* must be natural-word sized */
   switch (args-len) {
   case 1:
 @@ -759,11 +764,16 @@ static int
  kvm_deassign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
  {
   int   pio = args->flags & KVM_IOEVENTFD_FLAG_PIO;
 - enum kvm_bus  bus_idx = pio ? KVM_PIO_BUS : KVM_MMIO_BUS;
 + int   ccw;
 + enum kvm_bus  bus_idx;
   struct _ioeventfd*p, *tmp;
   struct eventfd_ctx   *eventfd;
   int   ret = -ENOENT;
  
  + ccw = args->flags & KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY;
 + bus_idx = pio ? KVM_PIO_BUS :
 + ccw ? KVM_VIRTIO_CCW_NOTIFY_BUS :
 + KVM_MMIO_BUS;

This is getting pretty convoluted.  Drop the pio and ccw local variables
and replace ?: with an if statement:

if (args->flags & KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY)
bus_idx = KVM_VIRTIO_CCW_NOTIFY_BUS;
else if (args->flags & KVM_IOEVENTFD_FLAG_PIO)
bus_idx = KVM_PIO_BUS;
else
bus_idx = KVM_MMIO_BUS;


Re: Is there any solution in KVM that works like VAAI does in ESXi

2013-02-26 Thread Stefan Hajnoczi
On Tue, Feb 26, 2013 at 01:49:42PM +0800, Timon Wang wrote:
 Is there any solution in KVM that works like VAAI does in ESXi? I
 found a PPT posted in Sep. 2012 which said that storage offload
 will be considered in the future.
 Does anybody know about this, or can anyone provide some information
 about it?

Thin Provisioning support is being added to QEMU.  Some configurations
already work - virtio-scsi on a block device or raw file supports
discard, for example.

Linux recently got Zero Blocks support in the form of the BLKZEROOUT
ioctl.  It is not being exploited by QEMU or libvirt yet.
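For reference, a minimal sketch of how a userspace tool could drive that
ioctl once support is added.  BLKZEROOUT takes a {start, length} pair in
bytes; the helper name here is made up for illustration:

```c
#include <sys/ioctl.h>
#include <linux/fs.h>   /* BLKZEROOUT (block-device-only ioctl) */
#include <stdint.h>
#include <errno.h>

/*
 * Ask the kernel to zero the byte range [offset, offset + len) on an
 * open block device, without transferring zero pages from userspace.
 * Returns 0 on success, -errno on failure.
 */
static int zero_out(int fd, uint64_t offset, uint64_t len)
{
    uint64_t range[2] = { offset, len };  /* {start, length} in bytes */

    if (ioctl(fd, BLKZEROOUT, range) < 0)
        return -errno;
    return 0;
}
```

Only block devices implement this ioctl; on a regular file it fails with
-ENOTTY, so a caller would typically fall back to writing zeroes by hand.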

Copy Offload, not aware of active development.  Perhaps libvirt or
libstoragemgmt will support it.

Stefan


Re: Tracing kvm: kvm_entry and kvm_exit

2013-02-25 Thread Stefan Hajnoczi
On Fri, Feb 22, 2013 at 11:34:27AM -0500, Mohamad Gebai wrote:
 I am tracing kvm using perf and I am analyzing the sequences of kvm_entry and
 kvm_exit tracepoints.
 I noticed that during the boot process of a VM, there are a lot more
 (2 to 3 times as many) kvm_entry events than kvm_exit events. I tried
 looking around but didn't find anything that explains this. Is this
 missing instrumentation? Or what other path does kvm take that doesn't
 generate a kvm_exit event?

Gleb Natapov noticed something similar when playing with the perf script
I posted here:

http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/104181

Perhaps there is a code path that is missing trace_kvm_exit().

We didn't investigate why it happens but the unexplained kvm_entry
events only appeared at the beginning of the trace, so the theory was
that events are not activated atomically by perf(1).

CCing perf mailing list.

It would be interesting if someone knows the answer.

Stefan


Re: qemu help documentation

2013-02-15 Thread Stefan Hajnoczi
On Thu, Feb 14, 2013 at 02:22:51PM +0100, Paolo Pedaletti wrote:
 I have trouble getting the full list of the output of the qemu help
 command inside KVM when I switch to the second console with Ctrl-Alt-2.
 
 I can't find the full list even inside the source code (apt-get source
 qemu-kvm) nor inside the binary file (grep blockarg qemu-*).
 
 Is it possible to redirect the output of help to an external
 file? Or page it?
 
 This is because (the main problem is) I'm trying to get the full kernel
 message log at boot, but inside the KVM window it's not possible to
 scroll up (goal:
 http://pedalinux.blogspot.it/2013/02/physical-to-virtual-step-by-step.html
 ) or to dump the terminal output.

Try Ctrl+PageUp.

If that doesn't work you can put the monitor on stdio like this:

  $ qemu-system-x86_64 -monitor stdio ...

Then you can interact from your shell and scroll back up as usual.

Stefan


Re: Win2003 disk corruption with kvm-1.0. and virtio

2013-02-14 Thread Stefan Hajnoczi
On Wed, Feb 13, 2013 at 10:53:14AM +0100, Sylvain Bauza wrote:
 As per documentation, Nova (Openstack Compute layer) is doing a
 'qemu-img convert -s' against a running instance.
 http://docs.openstack.org/trunk/openstack-compute/admin/content/creating-images-from-running-instances.html

That command will not corrupt the running instance because it opens the
image read-only.

It is possible that the new image is corrupted since qemu-img is reading
from a qcow2 file that is changing underneath it.  However, the chance
is small as long as the snapshot isn't deleted while qemu-img convert is
running.

So this doesn't sound like the cause of the problems you are seeing.

Stefan


Re: Win2003 disk corruption with kvm-1.0. and virtio

2013-02-14 Thread Stefan Hajnoczi
On Tue, Feb 12, 2013 at 03:30:37PM +0100, Sylvain Bauza wrote:
 We currently run Openstack Essex hosts with KVM-1.0 (Ubuntu 12.04)
 instances with qcow2,virtio,cache=none
 
 For Linux VMs, no trouble at all but we do observe filesystem
 corruption and inconsistency (missing DLLs, CHKDSK asked by
 EventViewer, failure at reboot) with some of our Windows 2003 SP2
 64b images.
 
 At first boot, stress tests (CrystalDiskMark 3.0.2 and intensive
 CHKDSK) don't show up problems. It is only appearing 6 or 12h later.

Are you running the latest virtio-win drivers?  See
http://www.linux-kvm.org/page/WindowsGuestDrivers/Download_Drivers.

Have you tested with IDE instead of virtio on the Windows guests?

Stefan


Re: Google Summer of Code 2013 ideas wiki open

2013-02-14 Thread Stefan Hajnoczi
On Thu, Feb 14, 2013 at 11:39 AM, harryxiyou harryxi...@gmail.com wrote:
 On Tue, Feb 12, 2013 at 5:21 AM, Stefan Hajnoczi stefa...@gmail.com wrote:
 On Thu, Feb 7, 2013 at 4:19 PM, Stefan Hajnoczi stefa...@gmail.com wrote:
 I believe Google will announce GSoC again this year (there is
 no guarantee though) and I have created the wiki page so we can begin
 organizing project ideas that students can choose from.

 Google Summer of Code 2013 has just been announced!

 http://google-opensource.blogspot.de/2013/02/flip-bits-not-burgers-google-summer-of.html

 Some project ideas have already been discussed on IRC or private
 emails.  Please go ahead and put them on the project ideas wiki page:

 http://wiki.qemu.org/Google_Summer_of_Code_2013


 I am a senior student and want to work on storage in libvirt for
 GSoC 2013. I wonder whether libvirt and QEMU will apply to GSoC 2013
 together. If so, I will focus on
 http://wiki.qemu.org/Google_Summer_of_Code_2013 and add my introduction
 to the QEMU page mentioned by Stefan Hajnoczi. Could anyone give me
 some suggestions?  Thanks in advance.

Hi Harry,
Thanks for your interest.  You can begin thinking about ideas but
please keep in mind that we are still in the very early stages of GSoC
preparation.

Google will publish the list of accepted organizations on April 8th.
Then there is a period of over 3 weeks to discuss your project idea
with the organization.

In the meantime, the best thing to do is to get familiar with the code
bases and see if you can find/fix a bug.  Contributing patches is a
great way to get noticed.

There is always a chance that QEMU and/or libvirt may not be among the
list of accepted organizations, so don't put all your eggs in one
basket :).

Stefan


Re: Win2003 disk corruption with kvm-1.0. and virtio

2013-02-13 Thread Stefan Hajnoczi
On Tue, Feb 12, 2013 at 03:30:37PM +0100, Sylvain Bauza wrote:
 We currently run Openstack Essex hosts with KVM-1.0 (Ubuntu 12.04)
 instances with qcow2,virtio,cache=none
 
 For Linux VMs, no trouble at all but we do observe filesystem
 corruption and inconsistency (missing DLLs, CHKDSK asked by
 EventViewer, failure at reboot) with some of our Windows 2003 SP2
 64b images.
 
 At first boot, stress tests (CrystalDiskMark 3.0.2 and intensive
 CHKDSK) don't show up problems. It is only appearing 6 or 12h later.
 
 Do you have any idea on how to prevent it ? Is cache=writethrough an
 acceptable solution ? We don't want to leave qcow2 image format as
 it does allow to do live snapshots et al.

How are you taking live snapshots?  qemu-img should not be used on a
disk image that is currently open by a running guest, it may lead to
corruption.

Stefan


Re: Google Summer of Code 2013 ideas wiki open

2013-02-11 Thread Stefan Hajnoczi
On Thu, Feb 7, 2013 at 4:19 PM, Stefan Hajnoczi stefa...@gmail.com wrote:
 I believe Google will announce GSoC again this year (there is
 no guarantee though) and I have created the wiki page so we can begin
 organizing project ideas that students can choose from.

Google Summer of Code 2013 has just been announced!

http://google-opensource.blogspot.de/2013/02/flip-bits-not-burgers-google-summer-of.html

Some project ideas have already been discussed on IRC or private
emails.  Please go ahead and put them on the project ideas wiki page:

http://wiki.qemu.org/Google_Summer_of_Code_2013

Stefan


Re: Google Summer of Code 2013 ideas wiki open

2013-02-07 Thread Stefan Hajnoczi
On Thu, Feb 7, 2013 at 4:19 PM, Stefan Hajnoczi stefa...@gmail.com wrote:
CCed libvir-list to see if libvirt would like to do a joint
application with QEMU.

As mentioned, it's early days and GSoC 2013 has not been announced
yet.  I just want to start gathering ideas and seeing who is willing
to mentor this year.

Stefan


Re: Investigating abnormal stealtimes

2013-02-05 Thread Stefan Hajnoczi
On Tue, Feb 5, 2013 at 1:26 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
 - 'Steal time' is the amount of time during which the vcpu is able to
 run but is not running. Maybe 'vmexit latency' is a better name.

You are right, 'vmexit latency' is a better name.

 - Perhaps it would be good to subtract the time the thread was
 involuntarily scheduled out due to 'timeslice' expiration. Otherwise,
 running a CPU intensive task returns false positives (that is, long
 delays due to reschedules caused by the 'timeslice' being exhausted by
 guest CPU activity, not due to KVM or QEMU issues such as voluntarily
 scheduling in pthread_mutex_lock).

 Alternatively you can raise the priority of the vcpu threads (to get rid
 of the false positives).

I think this depends on the use-case.  If the aim is to find out why
the guest has poor response times then timeslice expiration is
interesting.  If the aim is to optimize QEMU or kvm.ko then timeslice
expiration is a nuisance :).

Your idea to raise the vcpu thread priority sounds good to me.

 - Idea: Would be handy to extract trace events in the offending
 'latency above threshold' vmexit/vmentry region.
 Say that you enable other trace events (unrelated to kvm) which can
 help identify the culprit. Instead of scanning the file manually
 searching for 100466.1062486786 save one vmexit/vmentry cycle,
 along with other trace events in that period, in a separate file.

Good idea.

Stefan


Re: How to limit upload bandwidth for a guest server?

2013-02-05 Thread Stefan Hajnoczi
On Sun, Feb 03, 2013 at 07:59:07PM -0600, Neil Aggarwal wrote:
 I have a CentOS server using KVM to host guest servers.
 I am trying to limit the bandwidth usable by a guest server.
 
 I tried to use tc, but that is only limiting the download bandwidth
 to a server.  It does not seem to filter packets uploaded by the
 server.
 
 Is there a tool to limit upload traffic for a guest server?

Consider using management tools like libvirt that handle tc and friends
for you.  The domain XML is the <interface> element's <bandwidth> child
with <inbound> and <outbound>.
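For reference, a sketch of that domain XML.  The element names follow
libvirt's schema; the numbers are illustrative (average/peak are in KiB/s,
burst in KiB):

```xml
<interface type='network'>
  <source network='default'/>
  <model type='virtio'/>
  <bandwidth>
    <inbound average='1024' peak='2048' burst='1024'/>
    <outbound average='512' peak='1024' burst='512'/>
  </bandwidth>
</interface>
```

libvirt then sets up the tc qdiscs and filters on the host side for you.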

Back to the question, are you looking for ingress qdiscs?

http://www.lartc.org/howto/lartc.adv-qdisc.ingress.html

This is a standard tc question, not related to virtualization.  You may
get better help from Linux networking mailing lists or IRC channels.

Stefan


Re: [Qemu-devel] QEMU buildbot maintenance state

2013-01-31 Thread Stefan Hajnoczi
On Wed, Jan 30, 2013 at 10:31:22AM +0100, Gerd Hoffmann wrote:
   Hi,
 
  Gerd: Are you willing to co-maintain the QEMU buildmaster with Daniel
  and Christian?  It would be awesome if you could do this given your
  experience running and customizing buildbot.
 
 I'll try to set aside some time for that.  Christian's idea to host the
 config at GitHub is good; that certainly makes it easier to spread the
 work across more people.
 
 Another thing which would be helpful: any chance we can set up a
 maintainer tree mirror at git.qemu.org?  A single repository where each
 maintainer tree shows up as a branch?
 
 This would make the buildbot setup *a lot* easier.  We could go for an
 AnyBranchScheduler then, with BuildFactory and BuildConfig shared,
 instead of needing one BuildFactory and BuildConfig per branch.  It also
 makes the buildbot web interface less cluttered since we don't have an
 insane number of BuildConfigs any more.  And it saves some resources
 (bandwidth + disk space) for the buildslaves.
 
 For people who want to look at what is coming, or who want to test stuff
 that is cooking, it would be a nice service too if they have a one-stop
 shop where they can get everything.

I sent a pull request that makes the BuildFactory definitions simpler
using a single create_build_factory() function:

https://github.com/b1-systems/buildbot/pull/1

Keep in mind that BuildFactories differ not just by repo/branch but
also:
 * in-tree or out-of-tree
 * extra ./configure arguments
 * gmake instead of make

I think this means it is not as simple as defining a single
BuildFactory.

Stefan


Re: [Qemu-devel] QEMU buildbot maintenance state

2013-01-31 Thread Stefan Hajnoczi
On Wed, Jan 30, 2013 at 10:31:22AM +0100, Gerd Hoffmann wrote:
   Hi,
 
  Gerd: Are you willing to co-maintain the QEMU buildmaster with Daniel
  and Christian?  It would be awesome if you could do this given your
  experience running and customizing buildbot.
 
 I'll try to set aside some time for that.

Excellent, thank you!

Stefan


Re: QEMU buildbot maintenance state

2013-01-30 Thread Stefan Hajnoczi
On Tue, Jan 29, 2013 at 04:04:39PM +0100, Christian Berendt wrote:
 On 01/28/2013 03:29 PM, Daniel Gollub wrote:
 JFYI, the main buildbot configuration which controls everything (beside
 buildslave credentials) is accessible to everyone:
 http://people.b1-systems.de/~gollub/buildbot/
 
 If you are familiar with buildbot feel free to incorporate your suggested
 changes directly on a copy and send me or Christian the diff so we just have
 to review and apply it.
 
 I moved the configuration on GitHub
 (https://github.com/b1-systems/buildbot). I'll add a cron job to the
 buildbot system to regularly pull and apply the latest configuration.
 Simply open a pull request to modify the configuration.

Thanks Christian!  I have updated the QEMU wiki page:

http://wiki.qemu.org/ContinuousIntegration

Stefan


Investigating abnormal stealtimes

2013-01-29 Thread Stefan Hajnoczi
Khoa and I have been discussing a workload that triggers softlockups and
hung task warnings inside the guest.  These warnings can pop up due to
bugs in the guest Linux kernel but they can also be triggered by the
hypervisor if vcpus are not being scheduled at reasonable times.

I've wanted a tool that reports high stealtimes and includes the last
vmexit reason.  This allows us to figure out if specific I/O device
emulation is taking too long or if other factors like host memory
pressure are degrading guest performance.

Here is a first sketch of such a tool.  It's a perf-script(1) Python
script which can be used to analyze perf.data files recorded with
kvm:kvm_entry and kvm:kvm_exit events.

Stealtimes exceeding a threshold will be flagged up:

  $ perf script -s /absolute/path/to/stealtime.py
  100466.1062486786 9690: steal time 0.029318914 secs,
  exit_reason IO_INSTRUCTION,
  guest_rip 0x81278f02,
  isa 1, info1 0xcf80003, info2 0x0

The example above shows an I/O access to 0xcf8 (PCI Configuration Space
Address port) that took about 28 milliseconds.  The host pid was 9690; this
can be used to investigate the QEMU vcpu thread.  The guest rip can be used
to investigate guest code that triggered this vmexit.

Given this information, it becomes possible to debug QEMU to figure out
why vmexit handling is taking too long.  It might be due to global mutex
contention if another thread holds the global mutex while blocking.
This sort of investigation needs to be done manually today but it might
be possible to add perf event handlers to watch for global mutex
contention inside QEMU and automatically identify the culprit.

Stalls inside the kvm kernel module can also be investigated since
kvm:kvm_exit events are triggered when they happen too.

I wanted to share in case it is useful for others.  Suggestions for
better approaches welcome!

Signed-off-by: Stefan Hajnoczi stefa...@redhat.com

---

#!/usr/bin/env python
# perf script event handlers, generated by perf script -g python
# Licensed under the terms of the GNU GPL License version 2

# Script to print steal times longer than a given threshold
#
# To collect trace data:
# $ perf record -a -e kvm:kvm_entry -e kvm:kvm_exit
#
# To print results from trace data:
#
# $ perf script -s /absolute/path/to/stealtime.py
# 100466.1062486786 9690: steal time 0.029318914 secs,
# exit_reason IO_INSTRUCTION,
# guest_rip 0x81278f02,
# isa 1, info1 0xcf80003, info2 0x0
#
# The example above shows an I/O access to 0xcf8 (PCI Configuration Space
# Address port) that took about 28 milliseconds.  The host pid was 9690; this
# can be used to investigate the QEMU vcpu thread.  The guest rip can be used
# to investigate guest code that triggered this vmexit.

# Print steal times longer than this threshold in milliseconds
THRESHOLD_MS = 100

import os
import sys

sys.path.append(os.environ['PERF_EXEC_PATH'] + \
'/scripts/python/Perf-Trace-Util/lib/Perf/Trace')

from perf_trace_context import *
from Core import *

vcpu_threads = {}

def trace_begin():
    print 'argv:', str(sys.argv)

def trace_end():
    pass

def kvm__kvm_exit(event_name, context, common_cpu,
                  common_secs, common_nsecs, common_pid, common_comm,
                  exit_reason, guest_rip, isa, info1, info2):
    if common_pid in vcpu_threads:
        last = vcpu_threads[common_pid]
        assert last[0] == 'kvm__kvm_entry'
        while last[2] > common_nsecs:
            common_secs -= 1
            common_nsecs += 1 * 1000 * 1000 * 1000
        delta_secs = common_secs - last[1]
        delta_nsecs = common_nsecs - last[2]

    vcpu_threads[common_pid] = (event_name, common_secs, common_nsecs,
                                exit_reason, guest_rip, isa, info1, info2)

def kvm__kvm_entry(event_name, context, common_cpu,
                   common_secs, common_nsecs, common_pid, common_comm,
                   vcpu_id):
    if common_pid in vcpu_threads:
        last = vcpu_threads[common_pid]
        assert last[0] == 'kvm__kvm_exit'
        while last[2] > common_nsecs:
            common_secs -= 1
            common_nsecs += 1 * 1000 * 1000 * 1000
        delta_secs = common_secs - last[1]
        delta_nsecs = common_nsecs - last[2]
        if delta_secs > 0 or delta_nsecs > THRESHOLD_MS * 1000 * 1000:
            print '%05u.%09u %u: steal time %05u.%09u secs, exit_reason %s, guest_rip %#x, isa %d, info1 %#x, info2 %#x' % (
                last[1], last[2],
                common_pid,
                delta_secs, delta_nsecs,
                symbol_str("kvm__kvm_exit", "exit_reason", last[3]),
                last[4],
                last[5],
                last[6],
                last[7])

    vcpu_threads[common_pid] = (event_name, common_secs, common_nsecs)

def trace_unhandled(event_name, context, event_fields_dict):
    print ' '.join(['%s=%s' % (k, str(v)) for k, v in sorted(event_fields_dict.items())])

Re: [Qemu-devel] QEMU buildbot maintenance state (was: Re: KVM call agenda for 2013-01-29)

2013-01-29 Thread Stefan Hajnoczi
On Mon, Jan 28, 2013 at 03:29:16PM +0100, Daniel Gollub wrote:
  If Daniel does not have sufficient time to administer it, can we maybe
  have that set up on qemu.org instead, with more than one person that has
  access to it?
 
 JFYI, I just requested if I am allowed to grant Stefan root access to our 
 box. 
 I would not mind to give him access - but need to check back with our IT 
 first.

Thanks for offering this.  Unfortunately I can't accept because I'm at
the limit of keeping up with my other QEMU responsibilities.  I don't
have enough time to do this job well.

Gerd: Are you willing to co-maintain the QEMU buildmaster with Daniel
and Christian?  It would be awesome if you could do this given your
experience running and customizing buildbot.

Stefan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM call agenda for 2013-01-29

2013-01-28 Thread Stefan Hajnoczi
On Mon, Jan 28, 2013 at 11:59:40AM +0100, Juan Quintela wrote:
 Please send in any agenda topics you are interested in.

Replacing select(2) so that we will not hit the 1024 fd_set limit in the
future.

Stefan


Re: [QEMU PATCH v5 0/3] virtio-net: fix of ctrl commands

2013-01-23 Thread Stefan Hajnoczi
On Tue, Jan 22, 2013 at 11:44:43PM +0800, Amos Kong wrote:
 Currently virtio-net code relies on the layout of descriptors;
 this patchset removes the assumptions and introduces a control
 command to set the mac address. The last patch is a trivial renaming.
 
 V2: check guest's iov_len
 V3: fix of migration compatibility
 make mac field in config space read-only when new feature is acked
 V4: add fix of descriptor layout assumptions, trivial rename
 V5: fix endianness after iov_to_buf copy
 
 Amos Kong (2):
   virtio-net: introduce a new macaddr control
   virtio-net: rename ctrl rx commands
 
 Michael S. Tsirkin (1):
   virtio-net: remove layout assumptions for ctrl vq
 
  hw/pc_piix.c|4 ++
  hw/virtio-net.c |  142 +-
  hw/virtio-net.h |   26 +++
  3 files changed, 108 insertions(+), 64 deletions(-)
 

Reviewed-by: Stefan Hajnoczi stefa...@redhat.com


Re: [QEMU PATCH v4 1/3] virtio-net: remove layout assumptions for ctrl vq

2013-01-22 Thread Stefan Hajnoczi
On Tue, Jan 22, 2013 at 10:38:14PM +0800, Amos Kong wrote:
 On Mon, Jan 21, 2013 at 05:03:30PM +0100, Stefan Hajnoczi wrote:
  On Sat, Jan 19, 2013 at 09:54:26AM +0800, ak...@redhat.com wrote:
   From: Michael S. Tsirkin m...@redhat.com
   
   Virtio-net code makes assumption about virtqueue descriptor layout
   (e.g. sg[0] is the header, sg[1] is the data buffer).
   
   This patch makes code not rely on the layout of descriptors.
   
   Signed-off-by: Michael S. Tsirkin m...@redhat.com
   Signed-off-by: Amos Kong ak...@redhat.com
   ---
hw/virtio-net.c | 128 
   
1 file changed, 74 insertions(+), 54 deletions(-)
   
   diff --git a/hw/virtio-net.c b/hw/virtio-net.c
   index 3bb01b1..113e194 100644
   --- a/hw/virtio-net.c
   +++ b/hw/virtio-net.c
    @@ -315,44 +315,44 @@ static void virtio_net_set_features(VirtIODevice *vdev, uint32_t features)
     }
    
     static int virtio_net_handle_rx_mode(VirtIONet *n, uint8_t cmd,
    -                                     VirtQueueElement *elem)
    +                                     struct iovec *iov, unsigned int iov_cnt)
     {
         uint8_t on;
    +    size_t s;
    
    -    if (elem->out_num != 2 || elem->out_sg[1].iov_len != sizeof(on)) {
    -        error_report("virtio-net ctrl invalid rx mode command");
    -        exit(1);
    +    s = iov_to_buf(iov, iov_cnt, 0, &on, sizeof(on));
    +    if (s != sizeof(on)) {
    +        return VIRTIO_NET_ERR;
         }
    
    -    on = ldub_p(elem->out_sg[1].iov_base);
    -
    -    if (cmd == VIRTIO_NET_CTRL_RX_MODE_PROMISC)
    +    if (cmd == VIRTIO_NET_CTRL_RX_MODE_PROMISC) {
             n->promisc = on;
    -    else if (cmd == VIRTIO_NET_CTRL_RX_MODE_ALLMULTI)
    +    } else if (cmd == VIRTIO_NET_CTRL_RX_MODE_ALLMULTI) {
             n->allmulti = on;
    -    else if (cmd == VIRTIO_NET_CTRL_RX_MODE_ALLUNI)
    +    } else if (cmd == VIRTIO_NET_CTRL_RX_MODE_ALLUNI) {
             n->alluni = on;
    -    else if (cmd == VIRTIO_NET_CTRL_RX_MODE_NOMULTI)
    +    } else if (cmd == VIRTIO_NET_CTRL_RX_MODE_NOMULTI) {
             n->nomulti = on;
    -    else if (cmd == VIRTIO_NET_CTRL_RX_MODE_NOUNI)
    +    } else if (cmd == VIRTIO_NET_CTRL_RX_MODE_NOUNI) {
             n->nouni = on;
    -    else if (cmd == VIRTIO_NET_CTRL_RX_MODE_NOBCAST)
    +    } else if (cmd == VIRTIO_NET_CTRL_RX_MODE_NOBCAST) {
             n->nobcast = on;
    -    else
    +    } else {
             return VIRTIO_NET_ERR;
    +    }
    
         return VIRTIO_NET_OK;
     }
    
     static int virtio_net_handle_mac(VirtIONet *n, uint8_t cmd,
    -                                 VirtQueueElement *elem)
    +                                 struct iovec *iov, unsigned int iov_cnt)
     {
         struct virtio_net_ctrl_mac mac_data;
    +    size_t s;
    
    -    if (cmd != VIRTIO_NET_CTRL_MAC_TABLE_SET || elem->out_num != 3 ||
    -        elem->out_sg[1].iov_len < sizeof(mac_data) ||
    -        elem->out_sg[2].iov_len < sizeof(mac_data))
    +    if (cmd != VIRTIO_NET_CTRL_MAC_TABLE_SET) {
             return VIRTIO_NET_ERR;
    +    }
    
         n->mac_table.in_use = 0;
         n->mac_table.first_multi = 0;
    @@ -360,54 +360,71 @@ static int virtio_net_handle_mac(VirtIONet *n, uint8_t cmd,
         n->mac_table.multi_overflow = 0;
         memset(n->mac_table.macs, 0, MAC_TABLE_ENTRIES * ETH_ALEN);
    
    -    mac_data.entries = ldl_p(elem->out_sg[1].iov_base);
    +    s = iov_to_buf(iov, iov_cnt, 0, &mac_data.entries,
    +                   sizeof(mac_data.entries));
 
 Hi Stefan, can we adjust the endianness after each iov_to_buf() copy?

Yes.

It's only necessary for uint16_t and larger types since a single byte
cannot be swapped (so ldub_p() is not needed).

Stefan


Re: [QEMU PATCH v4 1/3] virtio-net: remove layout assumptions for ctrl vq

2013-01-21 Thread Stefan Hajnoczi
On Sat, Jan 19, 2013 at 09:54:26AM +0800, ak...@redhat.com wrote:
 From: Michael S. Tsirkin m...@redhat.com
 
 Virtio-net code makes assumption about virtqueue descriptor layout
 (e.g. sg[0] is the header, sg[1] is the data buffer).
 
 This patch makes code not rely on the layout of descriptors.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 Signed-off-by: Amos Kong ak...@redhat.com
 ---
  hw/virtio-net.c | 128 
 
  1 file changed, 74 insertions(+), 54 deletions(-)
 
 diff --git a/hw/virtio-net.c b/hw/virtio-net.c
 index 3bb01b1..113e194 100644
 --- a/hw/virtio-net.c
 +++ b/hw/virtio-net.c
 @@ -315,44 +315,44 @@ static void virtio_net_set_features(VirtIODevice *vdev, uint32_t features)
  }
  
  static int virtio_net_handle_rx_mode(VirtIONet *n, uint8_t cmd,
 -                                     VirtQueueElement *elem)
 +                                     struct iovec *iov, unsigned int iov_cnt)
  {
      uint8_t on;
 +    size_t s;
  
 -    if (elem->out_num != 2 || elem->out_sg[1].iov_len != sizeof(on)) {
 -        error_report("virtio-net ctrl invalid rx mode command");
 -        exit(1);
 +    s = iov_to_buf(iov, iov_cnt, 0, &on, sizeof(on));
 +    if (s != sizeof(on)) {
 +        return VIRTIO_NET_ERR;
      }
  
 -    on = ldub_p(elem->out_sg[1].iov_base);
 -
 -    if (cmd == VIRTIO_NET_CTRL_RX_MODE_PROMISC)
 +    if (cmd == VIRTIO_NET_CTRL_RX_MODE_PROMISC) {
          n->promisc = on;
 -    else if (cmd == VIRTIO_NET_CTRL_RX_MODE_ALLMULTI)
 +    } else if (cmd == VIRTIO_NET_CTRL_RX_MODE_ALLMULTI) {
          n->allmulti = on;
 -    else if (cmd == VIRTIO_NET_CTRL_RX_MODE_ALLUNI)
 +    } else if (cmd == VIRTIO_NET_CTRL_RX_MODE_ALLUNI) {
          n->alluni = on;
 -    else if (cmd == VIRTIO_NET_CTRL_RX_MODE_NOMULTI)
 +    } else if (cmd == VIRTIO_NET_CTRL_RX_MODE_NOMULTI) {
          n->nomulti = on;
 -    else if (cmd == VIRTIO_NET_CTRL_RX_MODE_NOUNI)
 +    } else if (cmd == VIRTIO_NET_CTRL_RX_MODE_NOUNI) {
          n->nouni = on;
 -    else if (cmd == VIRTIO_NET_CTRL_RX_MODE_NOBCAST)
 +    } else if (cmd == VIRTIO_NET_CTRL_RX_MODE_NOBCAST) {
          n->nobcast = on;
 -    else
 +    } else {
          return VIRTIO_NET_ERR;
 +    }
  
      return VIRTIO_NET_OK;
  }
  
  static int virtio_net_handle_mac(VirtIONet *n, uint8_t cmd,
 -                                 VirtQueueElement *elem)
 +                                 struct iovec *iov, unsigned int iov_cnt)
  {
      struct virtio_net_ctrl_mac mac_data;
 +    size_t s;
  
 -    if (cmd != VIRTIO_NET_CTRL_MAC_TABLE_SET || elem->out_num != 3 ||
 -        elem->out_sg[1].iov_len < sizeof(mac_data) ||
 -        elem->out_sg[2].iov_len < sizeof(mac_data))
 +    if (cmd != VIRTIO_NET_CTRL_MAC_TABLE_SET) {
          return VIRTIO_NET_ERR;
 +    }
  
      n->mac_table.in_use = 0;
      n->mac_table.first_multi = 0;
 @@ -360,54 +360,71 @@ static int virtio_net_handle_mac(VirtIONet *n, uint8_t cmd,
      n->mac_table.multi_overflow = 0;
      memset(n->mac_table.macs, 0, MAC_TABLE_ENTRIES * ETH_ALEN);
  
 -    mac_data.entries = ldl_p(elem->out_sg[1].iov_base);
 +    s = iov_to_buf(iov, iov_cnt, 0, &mac_data.entries,
 +                   sizeof(mac_data.entries));
  
 -    if (sizeof(mac_data.entries) +
 -        (mac_data.entries * ETH_ALEN) > elem->out_sg[1].iov_len)
 +    if (s != sizeof(mac_data.entries)) {
          return VIRTIO_NET_ERR;
 +    }
 +    iov_discard_front(&iov, &iov_cnt, s);
 +
 +    if (mac_data.entries * ETH_ALEN > iov_size(iov, iov_cnt)) {

The (possible) byteswap was lost.  ldl_p() copies from target endianness
to host endianness.

 +        return VIRTIO_NET_ERR;
 +    }
  
      if (mac_data.entries <= MAC_TABLE_ENTRIES) {
 -        memcpy(n->mac_table.macs, elem->out_sg[1].iov_base + sizeof(mac_data),
 -               mac_data.entries * ETH_ALEN);
 +        s = iov_to_buf(iov, iov_cnt, 0, n->mac_table.macs,
 +                       mac_data.entries * ETH_ALEN);
 +        if (s != mac_data.entries * ETH_ALEN) {
 +            return VIRTIO_NET_OK;

s/VIRTIO_NET_OK/VIRTIO_NET_ERR/

 +        }
          n->mac_table.in_use += mac_data.entries;
      } else {
          n->mac_table.uni_overflow = 1;
      }
  
 +    iov_discard_front(&iov, &iov_cnt, mac_data.entries * ETH_ALEN);
 +
      n->mac_table.first_multi = n->mac_table.in_use;
  
 -    mac_data.entries = ldl_p(elem->out_sg[2].iov_base);
 +    s = iov_to_buf(iov, iov_cnt, 0, &mac_data.entries,
 +                   sizeof(mac_data.entries));

Same deal with mac_data.entries byteswap.

  
 -    if (sizeof(mac_data.entries) +
 -        (mac_data.entries * ETH_ALEN) > elem->out_sg[2].iov_len)
 +    if (s != sizeof(mac_data.entries)) {
          return VIRTIO_NET_ERR;
 +    }
  
 -    if (mac_data.entries) {
 -        if (n->mac_table.in_use + mac_data.entries <= MAC_TABLE_ENTRIES) {
 -            memcpy(n->mac_table.macs + (n->mac_table.in_use * ETH_ALEN),
 -                   elem->out_sg[2].iov_base + sizeof(mac_data),
 -                   

Re: [QEMU PATCH v4 2/3] virtio-net: introduce a new macaddr control

2013-01-21 Thread Stefan Hajnoczi
On Sat, Jan 19, 2013 at 09:54:27AM +0800, ak...@redhat.com wrote:
 @@ -350,6 +351,18 @@ static int virtio_net_handle_mac(VirtIONet *n, uint8_t cmd,
      struct virtio_net_ctrl_mac mac_data;
      size_t s;
  
 +    if (cmd == VIRTIO_NET_CTRL_MAC_ADDR_SET) {
 +        if (iov_size(iov, iov_cnt) != ETH_ALEN) {
 +            return VIRTIO_NET_ERR;
 +        }
 +        s = iov_to_buf(iov, iov_cnt, 0, &n->mac, sizeof(n->mac));
 +        if (s != sizeof(n->mac)) {
 +            return VIRTIO_NET_ERR;
 +        }

Since iov_size() was checked before iov_to_buf(), we never hit this
error.  And if we did, n->mac would be trashed (i.e. error handling is
not complete).

I think assert(s == sizeof(n->mac)) is more appropriate.
Also, please change ETH_ALEN to sizeof(n->mac) to make the relationship
between the check and the copy clear.

Stefan


Re: [PATCH v4 2/3] net: split eth_mac_addr for better error handling

2013-01-21 Thread Stefan Hajnoczi
On Sun, Jan 20, 2013 at 10:43:08AM +0800, ak...@redhat.com wrote:
 From: Stefan Hajnoczi stefa...@gmail.com
 
 When we set the mac address, the software mac address in the system and
 the hardware mac address both need to be updated. The current
 eth_mac_addr() doesn't allow callers to implement error handling nicely.
 
 This patch splits eth_mac_addr() into a prepare part and a real commit
 part, so we can prepare first, try to change the hardware address, and
 then do the real commit if the hardware address is set successfully.
 
 Signed-off-by: Stefan Hajnoczi stefa...@gmail.com
 Signed-off-by: Amos Kong ak...@redhat.com
 ---
  include/linux/etherdevice.h |  2 ++
  net/ethernet/eth.c  | 43 ---
  2 files changed, 38 insertions(+), 7 deletions(-)

Feel free to make yourself author and put me just as Suggested-by:.  I
posted pseudo-code but didn't write the patch or test it, so it's fair
to say that credit goes to you. :)

Stefan


Re: [PATCH v2] virtio-spec: set mac address by a new vq command

2013-01-18 Thread Stefan Hajnoczi
On Thu, Jan 17, 2013 at 06:25:47PM +0800, ak...@redhat.com wrote:
 From: Amos Kong ak...@redhat.com
 
 Virtio-net driver currently programs MAC address byte by byte,
 this means that we have an intermediate step where mac is wrong.
 This patch introduced a new control command to set MAC address
 in one time, and added a new feature flag VIRTIO_NET_F_MAC_ADDR
 for this feature.
 
 Signed-off-by: Amos Kong ak...@redhat.com
 ---
 v2: add more detail about new command (Stefan)
 ---
  virtio-spec.lyx | 58 
 -
  1 file changed, 57 insertions(+), 1 deletion(-)
 
 diff --git a/virtio-spec.lyx b/virtio-spec.lyx
 index 1ba9992..1ec0cd4 100644
 --- a/virtio-spec.lyx
 +++ b/virtio-spec.lyx
 @@ -56,6 +56,7 @@
  \html_math_output 0
  \html_css_as_file 0
  \html_be_strict false
 +\author -1930653948 Amos Kong 
  \author -608949062 Rusty Russell,,, 
  \author -385801441 Cornelia Huck cornelia.h...@de.ibm.com
  \author 1112500848 Rusty Russell ru...@rustcorp.com.au
 @@ -4391,6 +4392,14 @@ VIRTIO_NET_F_GUEST_ANNOUNCE(21) Guest can send 
 gratuitous packets.
  
  \change_inserted 1986246365 1352742808
  VIRTIO_NET_F_MQ(22) Device supports multiqueue with automatic receive 
 steering.
 +\change_inserted -1930653948 1358319033
 +
 +\end_layout
 +
 +\begin_layout Description
 +
 +\change_inserted -1930653948 1358319080
 +VIRTIO_NET_F_CTRL_MAC_ADDR(23) Set MAC address.
  \change_unchanged
  
  \end_layout
 @@ -5284,7 +5293,11 @@ The class VIRTIO_NET_CTRL_RX has two commands: 
 VIRTIO_NET_CTRL_RX_PROMISC
  \end_layout
  
  \begin_layout Subsubsection*
 -Setting MAC Address Filtering
 +Setting MAC Address
 +\change_deleted -1930653948 1358318470
 + Filtering
 +\change_unchanged
 +
  \end_layout
  
  \begin_layout Standard
 @@ -5324,6 +5337,17 @@ struct virtio_net_ctrl_mac {
  \begin_layout Plain Layout
  
   #define VIRTIO_NET_CTRL_MAC_TABLE_SET0 
 +\change_inserted -1930653948 1358318313
 +
 +\end_layout
 +
 +\begin_layout Plain Layout
 +
 +\change_inserted -1930653948 1358318331
 +
 + #define VIRTIO_NET_CTRL_MAC_ADDR_SET 1
 +\change_unchanged
 +
  \end_layout
  
  \end_inset
 @@ -5349,6 +5373,38 @@ T_CTRL_MAC_TABLE_SET.
   The command-specific-data is two variable length tables of 6-byte MAC 
 addresses.
   The first table contains unicast addresses, and the second contains 
 multicast
   addresses.
 +\change_inserted -1930653948 1358318545
 +
 +\end_layout
 +
 +\begin_layout Standard
 +
 +\change_inserted -1930653948 1358418243
 +The config space 
 +\begin_inset Quotes eld
 +\end_inset
 +
 +mac
 +\begin_inset Quotes erd
 +\end_inset
 +
 + field and the command VIRTIO_NET_CTRL_MAC_ADDR_SET both set the default
 + MAC address which rx filtering accepts.
 + The command VIRTIO_NET_CTRL_MAC_ADDR_SET is atomic whereas the config space
 + 
 +\begin_inset Quotes eld
 +\end_inset
 +
 +mac
 +\begin_inset Quotes erd
 +\end_inset
 +
 + field is not.
 + Therefore, VIRTIO_NET_CTRL_MAC_ADDR_SET is preferred, especially while
 + the NIC is up.
 + The command-specific-data is a 6-byte MAC address.
 +\change_unchanged

The specification must also say that the mac field is read-only when
the VIRTIO_NET_CTRL_MAC_ADDR_SET command is supported.

(I think you added this behavior to your patch.)

Stefan


Re: [PATCH v3 2/2] virtio-net: introduce a new control to set macaddr

2013-01-18 Thread Stefan Hajnoczi
On Thu, Jan 17, 2013 at 06:40:12PM +0800, ak...@redhat.com wrote:
 diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
 index 395ab4f..837c978 100644
 --- a/drivers/net/virtio_net.c
 +++ b/drivers/net/virtio_net.c
 @@ -802,14 +802,32 @@ static int virtnet_set_mac_address(struct net_device *dev, void *p)
      struct virtnet_info *vi = netdev_priv(dev);
      struct virtio_device *vdev = vi->vdev;
      int ret;
 +    struct scatterlist sg;
 +    char save_addr[ETH_ALEN];
 +    unsigned char save_aatype;
 +
 +    memcpy(save_addr, dev->dev_addr, ETH_ALEN);
 +    save_aatype = dev->addr_assign_type;
  
      ret = eth_mac_addr(dev, p);
      if (ret)
          return ret;
  
 -    if (virtio_has_feature(vdev, VIRTIO_NET_F_MAC))
 +    if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_MAC_ADDR)) {
 +        sg_init_one(&sg, dev->dev_addr, dev->addr_len);
 +        if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_MAC,
 +                                  VIRTIO_NET_CTRL_MAC_ADDR_SET,
 +                                  &sg, 1, 0)) {
 +            dev_warn(&vdev->dev,
 +                     "Failed to set mac address by vq command.\n");
 +            memcpy(dev->dev_addr, save_addr, ETH_ALEN);
 +            dev->addr_assign_type = save_aatype;
 +            return -EINVAL;
 +        }

eth_mac_addr() doesn't allow callers to implement error handling nicely.
Although you didn't duplicate its code directly, this patch still leaks
internals of eth_mac_addr().

How about splitting eth_mac_addr() in a separate patch:

int eth_prepare_mac_addr_change(struct net_device *dev, void *p)
{
    struct sockaddr *addr = p;

    if (!(dev->priv_flags & IFF_LIVE_ADDR_CHANGE) && netif_running(dev))
        return -EBUSY;
    if (!is_valid_ether_addr(addr->sa_data))
        return -EADDRNOTAVAIL;
    return 0;
}

void eth_commit_mac_addr_change(struct net_device *dev, void *p)
{
    struct sockaddr *addr = p;

    memcpy(dev->dev_addr, addr->sa_data, ETH_ALEN);
    /* if device marked as NET_ADDR_RANDOM, reset it */
    dev->addr_assign_type &= ~NET_ADDR_RANDOM;
}

/* Default implementation of MAC address changing */
int eth_mac_addr(struct net_device *dev, void *p)
{
    int ret;

    ret = eth_prepare_mac_addr_change(dev, p);
    if (ret < 0)
        return ret;
    eth_commit_mac_addr_change(dev, p);
    return 0;
}

Now virtio_net.c does:

ret = eth_prepare_mac_addr_change(dev, p);
if (ret < 0)
    return ret;

if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_MAC_ADDR)) {
    sg_init_one(&sg, dev->dev_addr, dev->addr_len);
    if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_MAC,
                              VIRTIO_NET_CTRL_MAC_ADDR_SET,
                              &sg, 1, 0)) {
        dev_warn(&vdev->dev,
                 "Failed to set mac address by vq command.\n");
        return -EINVAL;
    }
} ...

eth_commit_mac_addr_change(dev, p);
return 0;

Stefan


Re: [Qemu-devel] [QEMU PATCH v2] virtio-net: introduce a new macaddr control

2013-01-17 Thread Stefan Hajnoczi
On Thu, Jan 17, 2013 at 01:45:11PM +0800, Amos Kong wrote:
 On Thu, Jan 17, 2013 at 11:49:20AM +1030, Rusty Russell wrote:
  ak...@redhat.com writes:
   @@ -349,6 +351,14 @@ static int virtio_net_handle_mac(VirtIONet *n, 
   uint8_t cmd,
{
struct virtio_net_ctrl_mac mac_data;

    +    if (cmd == VIRTIO_NET_CTRL_MAC_ADDR_SET && elem->out_num == 2 &&
    +        elem->out_sg[1].iov_len == ETH_ALEN) {
    +        /* Set MAC address */
    +        memcpy(n->mac, elem->out_sg[1].iov_base, elem->out_sg[1].iov_len);
    +        qemu_format_nic_info_str(&n->nic->nc, n->mac);
    +        return VIRTIO_NET_OK;
    +    }
  
  Does the rest of the net device still rely on the layout of descriptors?
 
 No, only the info string of the net client relies on n->mac

I think the question is whether the hw/virtio-net.c code makes
assumptions about virtqueue descriptor layout (e.g. sg[0] is the header,
sg[1] is the data buffer).

The answer is yes, the control virtqueue function directly accesses
iov[n].

Additional patches would be required to convert the existing
hw/virtio-net.c code to make no assumptions about virtqueue descriptor
layout.  It's outside the scope of this series.

Stefan



Re: VirtIO id X is not a head!

2013-01-17 Thread Stefan Hajnoczi
On Wed, Jan 16, 2013 at 08:58:50PM +0100, Matthias Leinweber wrote:
 i try to implement a virtual device/driver, but i ran into some
 trouble using the virtio api.
 My implementation looks as follows:
 A kthread exposes memory via add_buf, kicks and sleeps. If a callback
 is issued he is woken up and takes the filled buffer back via get_buf.
 (No other kthread, process or whatever works on this vq in the
 kernel).
 In qemu a qemu_thread waits for some shared memory and tries to pop
 elements from the vq and copies some data into the guest accessible
 memory. Not all elements are necessarily popped before fill, flush and
 notify are called. If a pop returns with  0 the thread goes to sleep
 until the handler routine for this vq wakes the thread up again.
 
 from time to time (after several 100k gets,adds and pops) i get: id %u
 is not a head!.
 virtio_ring.c:
 if (unlikely(i >= vq->vring.num)) {
     BAD_RING(vq, "id %u out of range\n", i);
     return NULL;
 
 I have no idea what i am doing wrong. Is synchronization needed
 between add pop and get or am i not allowed to use a qemu_thread when
 working on a vq?

Hard to tell exactly what is going on without seeing the code.

QEMU has a global mutex and therefore does not need to do much explicit
locking...except if you spawn your own thread.  The hw/virtio.c code in
QEMU is not thread-safe.  You cannot use it from a thread without
holding the QEMU global mutex.

It's fine to do I/O handling in worker threads, but you must use a BH,
event notifier, or some other mechanism of kicking the QEMU iothread and
process the virtqueue completion in a callback there.

Stefan


Re: [QEMU PATCH v3] virtio-net: introduce a new macaddr control

2013-01-17 Thread Stefan Hajnoczi
On Thu, Jan 17, 2013 at 06:30:46PM +0800, ak...@redhat.com wrote:
 From: Amos Kong ak...@redhat.com
 
 In virtio-net guest driver, currently we write MAC address to
 pci config space byte by byte, this means that we have an
 intermediate step where mac is wrong. This patch introduced
 a new control command to set MAC address, it's atomic.
 
 VIRTIO_NET_F_CTRL_MAC_ADDR is a new feature bit for compatibility.
 
 mac field will be set to read-only when VIRTIO_NET_F_CTRL_MAC_ADDR
 is acked.
 
 Signed-off-by: Amos Kong ak...@redhat.com
 ---
 V2: check guest's iov_len
 V3: fix of migration compatibility
 make mac field in config space read-only when new feature is acked
 ---
  hw/pc_piix.c|  4 
  hw/virtio-net.c | 10 +-
  hw/virtio-net.h | 12 ++--
  3 files changed, 23 insertions(+), 3 deletions(-)

Reviewed-by: Stefan Hajnoczi stefa...@redhat.com


Re: [QEMU PATCH v2] virtio-net: introduce a new macaddr control

2013-01-16 Thread Stefan Hajnoczi
On Wed, Jan 16, 2013 at 02:37:34PM +0800, Jason Wang wrote:
 On Wednesday, January 16, 2013 02:16:47 PM ak...@redhat.com wrote:
  From: Amos Kong ak...@redhat.com
  
  In virtio-net guest driver, currently we write MAC address to
  pci config space byte by byte, this means that we have an
  intermediate step where mac is wrong. This patch introduced
  a new control command to set MAC address in one time.
  
  VIRTIO_NET_F_CTRL_MAC_ADDR is a new feature bit for compatibility.
  
  Signed-off-by: Amos Kong ak...@redhat.com
  ---
  V2: check guest's iov_len before memcpy
  ---
   hw/virtio-net.c | 10 ++
   hw/virtio-net.h |  9 -
   2 files changed, 18 insertions(+), 1 deletion(-)
  
  diff --git a/hw/virtio-net.c b/hw/virtio-net.c
  index dc7c6d6..d05f98f 100644
  --- a/hw/virtio-net.c
  +++ b/hw/virtio-net.c
   @@ -247,6 +247,7 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev, uint32_t features)
       VirtIONet *n = to_virtio_net(vdev);
   
       features |= (1 << VIRTIO_NET_F_MAC);
   +    features |= (1 << VIRTIO_NET_F_CTRL_MAC_ADDR);
   
       if (!peer_has_vnet_hdr(n)) {
           features &= ~(0x1 << VIRTIO_NET_F_CSUM);
   @@ -282,6 +283,7 @@ static uint32_t virtio_net_bad_features(VirtIODevice *vdev)
       /* Linux kernel 2.6.25.  It understood MAC (as everyone must),
        * but also these: */
       features |= (1 << VIRTIO_NET_F_MAC);
   +    features |= (1 << VIRTIO_NET_F_CTRL_MAC_ADDR);
       features |= (1 << VIRTIO_NET_F_CSUM);
       features |= (1 << VIRTIO_NET_F_HOST_TSO4);
       features |= (1 << VIRTIO_NET_F_HOST_TSO6);
   @@ -349,6 +351,14 @@ static int virtio_net_handle_mac(VirtIONet *n, uint8_t cmd,
    {
       struct virtio_net_ctrl_mac mac_data;
   
   +    if (cmd == VIRTIO_NET_CTRL_MAC_ADDR_SET && elem->out_num == 2 &&
   +        elem->out_sg[1].iov_len == ETH_ALEN) {
   +        /* Set MAC address */
   +        memcpy(n->mac, elem->out_sg[1].iov_base, elem->out_sg[1].iov_len);
   +        qemu_format_nic_info_str(&n->nic->nc, n->mac);
   +        return VIRTIO_NET_OK;
   +    }
   +
       if (cmd != VIRTIO_NET_CTRL_MAC_TABLE_SET || elem->out_num != 3 ||
           elem->out_sg[1].iov_len < sizeof(mac_data) ||
           elem->out_sg[2].iov_len < sizeof(mac_data))
   diff --git a/hw/virtio-net.h b/hw/virtio-net.h
   index d46fb98..9394cc0 100644
   --- a/hw/virtio-net.h
   +++ b/hw/virtio-net.h
   @@ -44,6 +44,8 @@
    #define VIRTIO_NET_F_CTRL_VLAN  19      /* Control channel VLAN filtering */
    #define VIRTIO_NET_F_CTRL_RX_EXTRA 20   /* Extra RX mode control support */
   
   +#define VIRTIO_NET_F_CTRL_MAC_ADDR 23   /* Set MAC address */
   +
 
 I wonder whether we need a DEFINE_PROP_BIT to disable and compat this 
 feature. 
 Consider we may migrate from a new version to an old version.

I agree, migration needs to be handled.  The bit should never change
while the device is initialized and running.  We should also never start
rejecting or ignoring the command if it was available before.

Stefan


Re: [PATCH] virtio-spec: set mac address by a new vq command

2013-01-16 Thread Stefan Hajnoczi
On Wed, Jan 16, 2013 at 03:33:24PM +0800, ak...@redhat.com wrote:
 +\change_inserted -1930653948 1358320004
 +The command VIRTIO_NET_CTRL_MAC_ADDR_SET is used to set 
 +\begin_inset Quotes eld
 +\end_inset
 +
 +physical
 +\begin_inset Quotes erd
 +\end_inset
 +
 + address of the network card.

The physical address of the network card?  That term is not defined
anywhere in the specification.

Perhaps it's best to explain that the config space mac field and
VIRTIO_NET_CTRL_MAC_ADDR_SET both set the default MAC address which rx
filtering accepts.  (The MAC table is an additional set of MAC addresses
which rx filtering accepts.)

It would also be worth explaining that VIRTIO_NET_CTRL_MAC_ADDR_SET is
atomic whereas the config space mac field is not.  Therefore,
VIRTIO_NET_CTRL_MAC_ADDR_SET is preferred, especially while the NIC is
up.

Stefan


Re: [RFC PATCH] virtio-net: introduce a new macaddr control

2013-01-11 Thread Stefan Hajnoczi
On Thu, Jan 10, 2013 at 10:51:57PM +0800, ak...@redhat.com wrote:
 @@ -349,6 +351,13 @@ static int virtio_net_handle_mac(VirtIONet *n, uint8_t 
 cmd,
  {
  struct virtio_net_ctrl_mac mac_data;
  
 +    if (cmd == VIRTIO_NET_CTRL_MAC_ADDR_SET && elem->out_num == 2) {
 +        /* Set MAC address */
 +        memcpy(n->mac, elem->out_sg[1].iov_base, elem->out_sg[1].iov_len);

We cannot trust the guest's iov_len; it could overflow n->mac.


Re: [PATCH 01/12] tap: multiqueue support

2013-01-10 Thread Stefan Hajnoczi
On Wed, Jan 09, 2013 at 11:25:24PM +0800, Jason Wang wrote:
 On 01/09/2013 05:56 PM, Stefan Hajnoczi wrote:
  On Fri, Dec 28, 2012 at 06:31:53PM +0800, Jason Wang wrote:
  diff --git a/qapi-schema.json b/qapi-schema.json
  index 5dfa052..583eb7c 100644
  --- a/qapi-schema.json
  +++ b/qapi-schema.json
  @@ -2465,7 +2465,7 @@
   { 'type': 'NetdevTapOptions',
 'data': {
   '*ifname': 'str',
  -'*fd': 'str',
  +'*fd': ['String'],
  This change is not backwards-compatible.  You need to add a '*fds':
  ['String'] field instead.
 
 I don't quite understand this case; I think it still works when we
 just specify one fd.

You are right, the QemuOpts visitor shows no incompatibility.

But there is also a QMP interface: netdev_add.  I think changing the
type to a string list breaks compatibility there.

Stefan


Re: [Qemu-devel] [PATCH 00/12] Multiqueue virtio-net

2013-01-10 Thread Stefan Hajnoczi
On Wed, Jan 09, 2013 at 11:33:25PM +0800, Jason Wang wrote:
 On 01/09/2013 11:32 PM, Michael S. Tsirkin wrote:
  On Wed, Jan 09, 2013 at 03:29:24PM +0100, Stefan Hajnoczi wrote:
  On Fri, Dec 28, 2012 at 06:31:52PM +0800, Jason Wang wrote:
  Perf Numbers:
 
  Two Intel Xeon 5620 with direct connected intel 82599EB
  Host/Guest kernel: David net tree
  vhost enabled
 
   - lots of improvements in both latency and cpu utilization in the
     request-response test
   - a regression when the guest sends small packets, because TCP tends
     to batch less when latency is improved
 
  1q/2q/4q
  TCP_RR
   size #sessions trans.rate  norm trans.rate  norm trans.rate  norm
  1 1 9393.26   595.64  9408.18   597.34  9375.19   584.12
  1 2072162.1   2214.24 129880.22 2456.13 196949.81 2298.13
  1 50107513.38 2653.99 139721.93 2490.58 259713.82 2873.57
  1 100   126734.63 2676.54 145553.5  2406.63 265252.68 2943
  64 19453.42   632.33  9371.37   616.13  9338.19   615.97
  64 20   70620.03  2093.68 125155.75 2409.15 191239.91 2253.32
  64 50   1069662448.29 146518.67 2514.47 242134.07 2720.91
  64 100  117046.35 2394.56 190153.09 2696.82 238881.29 2704.41
  256 1   8733.29   736.36  8701.07   680.83  8608.92   530.1
  256 20  69279.89  2274.45 115103.07 2299.76 144555.16 1963.53
  256 50  97676.02  2296.09 150719.57 2522.92 254510.5  3028.44
  256 100 150221.55 2949.56 197569.3  2790.92 300695.78 3494.83
  TCP_CRR
   size #sessions trans.rate  norm trans.rate  norm trans.rate  norm
  1 1 2848.37  163.41 2230.39  130.89 2013.09  120.47
  1 2023434.5  562.11 31057.43 531.07 49488.28 564.41
  1 5028514.88 582.17 40494.23 605.92 60113.35 654.97
  1 100   28827.22 584.73 48813.25 661.6  61783.62 676.56
  64 12780.08  159.4  2201.07  127.96 2006.8   117.63
  64 20   23318.51 564.47 30982.44 530.24 49734.95 566.13
  64 50   28585.72 582.54 40576.7  610.08 60167.89 656.56
  64 100  28747.37 584.17 49081.87 667.87 60612.94 662
  256 1   2772.08  160.51 2231.84  131.05 2003.62  113.45
  256 20  23086.35 559.8  30929.09 528.16 48454.9  555.22
  256 50  28354.7  579.85 40578.31 60760261.71 657.87
  256 100 28844.55 585.67 48541.86 659.08 61941.07 676.72
  TCP_STREAM guest receiving
   size #sessions throughput  norm throughput  norm throughput  norm
  1 1     16.27   1.33   16.1    1.12   16.13   0.99
  1 2     33.04   2.08   32.96   2.19   32.75   1.98
  1 4     66.62   6.83   68.3    5.56   66.14   2.65
  64 1    896.55  56.67  914.02  58.14  898.9   61.56
  64 2    1830.46 91.02  1812.02 64.59  1835.57 66.26
  64 4    3626.61 142.55 3636.25 100.64 3607.46 75.03
  256 1   2619.49 131.23 2543.19 129.03 2618.69 132.39
  256 2   5136.58 203.02 5163.31 141.11 5236.51 149.4
  256 4   7063.99 242.83 9365.4  208.49 9421.03 159.94
  512 1   3592.43 165.24 3603.12 167.19 3552.5  169.57
  512 2   7042.62 246.59 7068.46 180.87 7258.52 186.3
  512 4   6996.08 241.49 9298.34 206.12 9418.52 159.33
  1024 1  4339.54 192.95 4370.2  191.92 4211.72 192.49
  1024 2  7439.45 254.77 9403.99 215.24 9120.82 222.67
  1024 4  7953.86 272.11 9403.87 208.23 9366.98 159.49
  4096 1  7696.28 272.04 7611.41 270.38 7778.71 267.76
  4096 2  7530.35 261.1  8905.43 246.27 8990.18 267.57
  4096 4  7121.6  247.02 9411.75 206.71 9654.96 184.67
  16384 1 7795.73 268.54 7780.94 267.2  7634.26 260.73
  16384 2 7436.57 255.81 9381.86 220.85 9392    220.36
  16384 4 7199.07 247.81 9420.96 205.87 9373.69 159.57
  TCP_MAERTS guest sending
   size #sessions throughput  norm throughput  norm throughput  norm
  1 1     15.94   0.62   15.55   0.61   15.13   0.59
  1 2     36.11   0.83   32.46   0.69   32.28   0.69
  1 4     71.59   1      68.91   0.94   61.52   0.77
  64 1    630.71  22.52  622.11  22.35  605.09  21.84
  64 2    1442.36 30.57  1292.15 25.82  1282.67 25.55
  64 4    3186.79 42.59  2844.96 36.03  2529.69 30.06
  256 1   1760.96 58.07  1738.44 57.43  1695.99 56.19
  256 2   4834.23 95.19  3524.85 64.21  3511.94 64.45
  256 4   9324.63 145.74 8956.49 116.39 6720.17 73.86
  512 1   2678.03 84.1   2630.68 82.93  2636.54 82.57
  512 2   9368.17 195.61 9408.82 204.53 5316.3  92.99
  512 4   9186.34 209.68 9358.72 183.82 9489.29 160.42
  1024 1  3620.71 109.88 3625.54 109.83 3606.61 112.35
  1024 2  9429    258.32 7082.79 120.55 7403.53 134.78
  1024 4  9430.66 290.44 9499.29 232.31 9414.6  190.92
  4096 1  9339.28 296.48 9374.23 372.88 9348.76 298.49
  4096 2  9410.53 378.69 9412.61 286.18 9409.75 278.31
  4096 4  9487.35 374.1  9556.91 288.81 9441.94 221.64
  16384 1 9380.43 403.8  9379.78 399.13 9382.42 393.55
  16384 2 9367.69 406.93 9415.04 312.68 9409.29 300.9
  16384 4 9391.96 405.17 9695.12 310.54 9423.76 223.47
  Trying to understand the performance results:
 
  What is the host device configuration?  tap + bridge?
 
 Yes.
 
  Did you use host CPU affinity for the vhost threads?
 
 I use numactl to pin the CPU threads and the vhost threads to the same NUMA node.
  Can multiqueue tap take advantage of multiqueue host NICs

Re: [PATCH 01/12] tap: multiqueue support

2013-01-10 Thread Stefan Hajnoczi
On Fri, Dec 28, 2012 at 06:31:53PM +0800, Jason Wang wrote:

Mainly suggestions to make the code easier to understand, but see the
comment about the 1:1 queue/NetClientState model for a general issue
with this approach.

 Recently, linux support multiqueue tap which could let userspace call 
 TUNSETIFF
 for a signle device many times to create multiple file descriptors as

s/signle/single/

(Noting these if you respin.)

 independent queues. User could also enable/disabe a specific queue through

s/disabe/disable/

 TUNSETQUEUE.
 
 The patch adds the generic infrastructure to create multiqueue taps. To
 achieve this, a new parameter, queues, was introduced to specify how many
 queues are expected to be created for the tap. The fd parameter was also
 changed to support a list of file descriptors, which can be used by
 management software (such as libvirt) to pass pre-created file descriptors
 (queues) to qemu.
 
 Each TAPState is still associated with one tap fd, which means multiple
 TAPStates are created when the user needs a multiqueue tap.
 
 Only the Linux part is implemented now, since it's the only OS that supports
 multiqueue tap.
 
 Signed-off-by: Jason Wang jasow...@redhat.com
 ---
  net/tap-aix.c |   18 -
  net/tap-bsd.c |   18 -
  net/tap-haiku.c   |   18 -
  net/tap-linux.c   |   70 +++-
  net/tap-linux.h   |4 +
  net/tap-solaris.c |   18 -
  net/tap-win32.c   |   10 ++
  net/tap.c |  248 
 +
  net/tap.h |8 ++-
  qapi-schema.json  |5 +-
  10 files changed, 335 insertions(+), 82 deletions(-)

This patch should be split up:
1. linux-headers: import linux/if_tun.h multiqueue constants
2. tap: add Linux multiqueue support (tap_open(), tap_fd_attach(), 
tap_fd_detach())
3. tap: queue attach/detach (tap_attach(), tap_detach())
4. tap: split out net_init_one_tap() function (pure code motion, to make later 
diffs easy to review)
5. tap: add queues and multi-fd options (net_init_tap()/net_init_one_tap() 
changes)

Each commit description can explain how this works in more detail.  I
think I've figured it out now but it would have helped to separate
things out from the start.

 diff --git a/net/tap-aix.c b/net/tap-aix.c
 index f27c177..f931ef3 100644
 --- a/net/tap-aix.c
 +++ b/net/tap-aix.c
 @@ -25,7 +25,8 @@
   #include "net/tap.h"
   #include <stdio.h>
  
 -int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int 
 vnet_hdr_required)
 +int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
 + int vnet_hdr_required, int mq_required)
  {
   fprintf(stderr, "no tap on AIX\n");
  return -1;
 @@ -59,3 +60,18 @@ void tap_fd_set_offload(int fd, int csum, int tso4,
  int tso6, int ecn, int ufo)
  {
  }
 +
 +int tap_fd_attach(int fd)
 +{
 +return -1;
 +}
 +
 +int tap_fd_detach(int fd)
 +{
 +return -1;
 +}
 +
 +int tap_fd_ifname(int fd, char *ifname)
 +{
 +return -1;
 +}
 diff --git a/net/tap-bsd.c b/net/tap-bsd.c
 index a3b717d..07c287d 100644
 --- a/net/tap-bsd.c
 +++ b/net/tap-bsd.c
 @@ -33,7 +33,8 @@
   #include <net/if_tap.h>
  #endif
  
 -int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int 
 vnet_hdr_required)
 +int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
 + int vnet_hdr_required, int mq_required)
  {
  int fd;
  #ifdef TAPGIFNAME
 @@ -145,3 +146,18 @@ void tap_fd_set_offload(int fd, int csum, int tso4,
  int tso6, int ecn, int ufo)
  {
  }
 +
 +int tap_fd_attach(int fd)
 +{
 +return -1;
 +}
 +
 +int tap_fd_detach(int fd)
 +{
 +return -1;
 +}
 +
 +int tap_fd_ifname(int fd, char *ifname)
 +{
 +return -1;
 +}
 diff --git a/net/tap-haiku.c b/net/tap-haiku.c
 index 34739d1..62ab423 100644
 --- a/net/tap-haiku.c
 +++ b/net/tap-haiku.c
 @@ -25,7 +25,8 @@
   #include "net/tap.h"
   #include <stdio.h>
  
 -int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int 
 vnet_hdr_required)
 +int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
 + int vnet_hdr_required, int mq_required)
  {
   fprintf(stderr, "no tap on Haiku\n");
  return -1;
 @@ -59,3 +60,18 @@ void tap_fd_set_offload(int fd, int csum, int tso4,
  int tso6, int ecn, int ufo)
  {
  }
 +
 +int tap_fd_attach(int fd)
 +{
 +return -1;
 +}
 +
 +int tap_fd_detach(int fd)
 +{
 +return -1;
 +}
 +
 +int tap_fd_ifname(int fd, char *ifname)
 +{
 +return -1;
 +}
 diff --git a/net/tap-linux.c b/net/tap-linux.c
 index c6521be..0854ef5 100644
 --- a/net/tap-linux.c
 +++ b/net/tap-linux.c
 @@ -35,7 +35,8 @@
  
   #define PATH_NET_TUN "/dev/net/tun"
  
 -int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int 
 vnet_hdr_required)
 +int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
 + int vnet_hdr_required, int mq_required)
  {
  struct ifreq ifr;
  int fd, ret;
 @@ -67,6 +68,20 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr, 
 int 

<    1   2   3   4   5   6   7   8   9   >