[COMMIT master] Merge branch 'upstream-merge'

2009-08-25 Thread Avi Kivity
From: Avi Kivity a...@redhat.com

* upstream-merge: (54 commits)
  Make the e1000 the default network adapter for the pc target.
  eliminate errors about unused results in block/vpc.c
  virtio-blk: add msi support.
  qdev/prop: convert isa-bus to helper macros.
  make pthreads mandatory
  qemu: move virtio-pci.o to near pci.o
  char: Emit 'CLOSED' events on char device close
  cleanup cpu-exec.c, part 0/N: consolidate handle_cpu_signal
  Unbreak large mem support by removing kqemu
  unify popen/fopen qemu wrappers
  Only build osdep once
  Route IOAPIC interrupts via ISA bus
  Add a configure switch to enable / disable all user targets. I felt compelled 
to do it for symmetry, mostly it is useful to disable user targets when you 
don't want to build them.
  Migration via unix sockets.
  Route PC irqs to ISA bus instead of i8259 directly
  Makefile: fixed rule TAGS
  QEMU set irq0override in fw_cfg
  SMART ATA Functionality
  Add missing linefeed in error message
  Clean up VGA type selection; far too many variables being used to track one 
state leads to confusion if new variables are added.
  ...

Signed-off-by: Avi Kivity a...@redhat.com


IBM server-X3650 issue with kvm modules

2009-08-25 Thread Haneef Syed
Hi Everyone,

Presently I am porting the kvm-76 modules to an IBM server x3650 target.

System Information:

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 23
model name  : Intel(R) Xeon(R) CPU   E5410  @ 2.33GHz
stepping: 6
cpu MHz : 2327.501
cache size  : 6144 KB
physical id : 0
siblings: 4
core id : 0
cpu cores   : 4
fpu : yes
fpu_exception   : yes
cpuid level : 10
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm 
constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips: 4658.35
clflush size: 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

Thanks for your patience.

When I run the qemu utilities to install a Red Hat or Fedora guest on the 
MontaVista host running on the IBM server, I find that vmwrite errors occur 
on the following registers:

GUEST_RFLAGS = 0x6820
TPR_THRESHOLD  = 0x401c
CPU_BASED_VM_EXEC_CONTROL= 0x4002
GUEST_RSP = 0x681c
GUEST_RIP  = 0x681e
HOST_CR0= 0x6c00

And then kvm-76 aborts. Any help will be appreciated. Thanks in advance.





Re: IBM server-X3650 issue with kvm modules

2009-08-25 Thread Avi Kivity

On 08/25/2009 09:00 AM, Haneef Syed wrote:

When I run the qemu utilities to install a Red Hat or Fedora guest on the
MontaVista host running on the IBM server, I find that vmwrite errors occur
on the following

   


You've already posted this before.  It's impossible for us to debug kvm 
interactions with the montavista kernel.  Please take it up with 
montavista's support.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Extending virtio_console to support multiple ports

2009-08-25 Thread Amit Shah

Hello all,

Here is a new iteration of the patch series that implements a
transport for guest and host communications.

The code has been updated to reuse the virtio-console device instead
of creating a new virtio-serial device.

I've tested for compatibility (old qemu & new kernel, new qemu & old
kernel, new qemu & new kernel) and it all works fine.

There are a few items on my todo list but this works well.

New since last send:
- connect/disconnect notifications on connections to ports

TODO:
- Look at converting to a tty interface instead of the current char
interface
- Migration: save the extra state that's necessary
- Convert all config writes to little endian in qemu / convert from
little endian to host endian in guest
- Address a few FIXMEs spread in the code
- Introduce a watermark to stop a rogue host process flooding guest
with data
- Make connect/disconnect work for guest

Please review.
Amit


[PATCH] virtio_console: Add interface for guest and host communication

2009-08-25 Thread Amit Shah
Expose multiple char devices (ports) for simple communication
between the host userspace and guest.

Sample offline usages can be: poking around in a guest to find out
the file systems used, applications installed, etc. Online usages
can be sharing of clipboard data between the guest and the host,
sending information about logged-in users to the host, locking the
screen or session when a vnc session is closed, and so on.

The interface presented to guest userspace is of a simple char
device, so it can be used like this:

fd = open("/dev/vcon2", O_RDWR);
ret = read(fd, buf, 100);
ret = write(fd, string, strlen(string));
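
For illustration, a slightly fuller guest-side sketch of the same
interface; the /dev/vcon2 name comes from the example above, while the
poll()-based wait is only an assumption about how a consumer might block
until the host writes:

/* Hypothetical consumer of a virtio-console port char device. */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    struct pollfd pfd;
    char buf[100];
    ssize_t ret;
    int fd;

    fd = open("/dev/vcon2", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Block until the host has written something to this port. */
    pfd.fd = fd;
    pfd.events = POLLIN;
    if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN)) {
        ret = read(fd, buf, sizeof(buf));
        if (ret > 0)
            printf("got %zd bytes from the host\n", ret);
    }

    ret = write(fd, "hello", strlen("hello"));
    close(fd);
    return ret < 0 ? 1 : 0;
}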

Each port is to be assigned a unique function, for example, the
first 4 ports may be reserved for libvirt usage, the next 4 for
generic streaming data and so on. This port-function mapping
isn't finalised yet.

For requirements, use-cases and some history see

http://www.linux-kvm.org/page/VMchannel_Requirements

Signed-off-by: Amit Shah amit.s...@redhat.com
---
 drivers/char/Kconfig   |4 +-
 drivers/char/virtio_console.c  |  871 +++-
 include/linux/virtio_console.h |   27 ++
 3 files changed, 791 insertions(+), 111 deletions(-)

diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index 6a06913..fe76627 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -678,7 +678,9 @@ config VIRTIO_CONSOLE
select HVC_DRIVER
help
  Virtio console for use with lguest and other hypervisors.
-
+ Also serves as a general-purpose serial device for data transfer
+ between the guest and host. Character devices at /dev/vconNN will
+ be created when corresponding ports are found.
 
 config HVCS
tristate "IBM Hypervisor Virtual Console Server support"
diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index c74dacf..3548bf6 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -13,6 +13,7 @@
  * Host can send more.  Buffering in the Host could alleviate this, but it is a
  * difficult problem in general. :*/
 /* Copyright (C) 2006, 2007 Rusty Russell, IBM Corporation
+ * Copyright (C) 2009, Amit Shah, Red Hat, Inc.
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -28,100 +29,374 @@
  * along with this program; if not, write to the Free Software
  * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
  */
+
+#include <linux/cdev.h>
+#include <linux/device.h>
 #include <linux/err.h>
+#include <linux/fs.h>
 #include <linux/init.h>
+#include <linux/poll.h>
 #include <linux/virtio.h>
 #include <linux/virtio_console.h>
+#include <linux/workqueue.h>
 #include "hvc_console.h"
 
-/*D:340 These represent our input and output console queues, and the virtio
+/*D:340 FIXME These represent our input and output console queues, and the virtio
  * operations for them. */
-static struct virtqueue *in_vq, *out_vq;
-static struct virtio_device *vdev;
+struct virtio_console_struct {
+   struct work_struct rx_work;
+   struct work_struct tx_work;
+   struct work_struct config_work;
+
+   struct list_head port_head;
+   struct list_head unused_read_head;
+
+   struct virtio_device *vdev;
+   struct class *class;
+   struct virtqueue *in_vq, *out_vq;
+
+   struct virtio_console_config *config;
+};
+
+/* This struct holds individual buffers received for each port */
+struct virtio_console_port_buffer {
+   struct list_head next;
 
-/* This is our input buffer, and how much data is left in it. */
-static unsigned int in_len;
-static char *in, *inbuf;
+   char *buf;
+
+   size_t len; /* length of the buffer */
+   size_t offset; /* offset in the buf from which to consume data */
+};
+
+struct virtio_console_port {
+   /* Next port in the list */
+   struct list_head next;
+
+   /* Buffer management */
+   struct virtio_console_port_buffer read_buf;
+   struct list_head readbuf_head;
+   wait_queue_head_t waitqueue;
+
+   /* Each port associates with a separate char device */
+   struct cdev cdev;
+   struct device *dev;
+
+   bool host_connected;  /* Is the host device open */
+};
+
+static struct virtio_console_struct virtconsole;
+
+static int major = 60; /* from the experimental range */
 
 /* The operations for our console. */
+/* FIXME: this should go into the per-port struct */
 static struct hv_ops virtio_cons;
 
 /* The hvc device */
 static struct hvc_struct *hvc;
 
-/*D:310 The put_chars() callback is pretty straightforward.
+static struct virtio_console_port *get_port_from_id(u32 id)
+{
+   struct virtio_console_port *port;
+
+   list_for_each_entry(port, &virtconsole.port_head, next) {
+   if (MINOR(port->dev->devt) == id)
+   return port;
+   }
+   return NULL;
+}
+
+static int get_id_from_port(struct 

[PATCH 2/3] virtio-console: rename dvq to ovq

2009-08-25 Thread Amit Shah
It isn't obvious what 'dvq' stands for. Since it's the output queue and
the corresponding input queue is called 'ivq', call this 'ovq'.

Signed-off-by: Amit Shah amit.s...@redhat.com
---
 hw/virtio-console.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/virtio-console.c b/hw/virtio-console.c
index 663c8b9..92c953c 100644
--- a/hw/virtio-console.c
+++ b/hw/virtio-console.c
@@ -20,7 +20,7 @@
 typedef struct VirtIOConsole
 {
 VirtIODevice vdev;
-VirtQueue *ivq, *dvq;
+VirtQueue *ivq, *ovq;
 CharDriverState *chr;
 } VirtIOConsole;
 
@@ -135,7 +135,7 @@ VirtIODevice *virtio_console_init(DeviceState *dev)
 s->vdev.get_features = virtio_console_get_features;
 
 s->ivq = virtio_add_queue(&s->vdev, 128, virtio_console_handle_input);
-s->dvq = virtio_add_queue(&s->vdev, 128, virtio_console_handle_output);
+s->ovq = virtio_add_queue(&s->vdev, 128, virtio_console_handle_output);
 
 s->chr = qdev_init_chardev(dev);
 qemu_chr_add_handlers(s->chr, vcon_can_read, vcon_read, vcon_event, s);
-- 
1.6.2.5



[PATCH 1/3] char: Emit 'CLOSED' events on char device close

2009-08-25 Thread Amit Shah
Notify users of the char interface whenever the file / connection is
closed.

Signed-off-by: Amit Shah amit.s...@redhat.com
---
 qemu-char.c |   10 ++
 qemu-char.h |1 +
 2 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/qemu-char.c b/qemu-char.c
index be27994..c25ed1c 100644
--- a/qemu-char.c
+++ b/qemu-char.c
@@ -570,6 +570,7 @@ static void fd_chr_read(void *opaque)
 if (size == 0) {
 /* FD has been closed. Remove it from the active list.  */
 qemu_set_fd_handler2(s->fd_in, NULL, NULL, NULL, NULL);
+qemu_chr_event(chr, CHR_EVENT_CLOSED);
 return;
 }
 if (size > 0) {
@@ -602,6 +603,7 @@ static void fd_chr_close(struct CharDriverState *chr)
 }
 
 qemu_free(s);
+qemu_chr_event(chr, CHR_EVENT_CLOSED);
 }
 
 /* open a character device to a unix fd */
@@ -690,6 +692,7 @@ static void stdio_read(void *opaque)
 if (size == 0) {
 /* stdin has been closed. Remove it from the active list.  */
 qemu_set_fd_handler2(0, NULL, NULL, NULL, NULL);
+qemu_chr_event(chr, CHR_EVENT_CLOSED);
 return;
 }
 if (size > 0) {
@@ -943,6 +946,7 @@ static void pty_chr_close(struct CharDriverState *chr)
 qemu_del_timer(s->timer);
 qemu_free_timer(s->timer);
 qemu_free(s);
+qemu_chr_event(chr, CHR_EVENT_CLOSED);
 }
 
 static CharDriverState *qemu_chr_open_pty(void)
@@ -1264,6 +1268,7 @@ static void pp_close(CharDriverState *chr)
 ioctl(fd, PPRELEASE);
 close(fd);
 qemu_free(drv);
+qemu_chr_event(chr, CHR_EVENT_CLOSED);
 }
 
 static CharDriverState *qemu_chr_open_pp(const char *filename)
@@ -1390,6 +1395,8 @@ static void win_chr_close(CharDriverState *chr)
 qemu_del_polling_cb(win_chr_pipe_poll, chr);
 else
 qemu_del_polling_cb(win_chr_poll, chr);
+
+qemu_chr_event(chr, CHR_EVENT_CLOSED);
 }
 
 static int win_chr_init(CharDriverState *chr, const char *filename)
@@ -1779,6 +1786,7 @@ static void udp_chr_close(CharDriverState *chr)
 closesocket(s->fd);
 }
 qemu_free(s);
+qemu_chr_event(chr, CHR_EVENT_CLOSED);
 }
 
 static CharDriverState *qemu_chr_open_udp(const char *def)
@@ -1999,6 +2007,7 @@ static void tcp_chr_read(void *opaque)
 qemu_set_fd_handler(s->fd, NULL, NULL, NULL);
 closesocket(s->fd);
 s->fd = -1;
+qemu_chr_event(chr, CHR_EVENT_CLOSED);
 } else if (size > 0) {
 if (s->do_telnetopt)
 tcp_chr_process_IAC_bytes(chr, s, buf, size);
@@ -2095,6 +2104,7 @@ static void tcp_chr_close(CharDriverState *chr)
 closesocket(s->listen_fd);
 }
 qemu_free(s);
+qemu_chr_event(chr, CHR_EVENT_CLOSED);
 }
 
 static CharDriverState *qemu_chr_open_tcp(const char *host_str,
diff --git a/qemu-char.h b/qemu-char.h
index 77d4eda..df620bc 100644
--- a/qemu-char.h
+++ b/qemu-char.h
@@ -11,6 +11,7 @@
 #define CHR_EVENT_RESET   2 /* new connection established */
 #define CHR_EVENT_MUX_IN  3 /* mux-focus was set to this terminal */
 #define CHR_EVENT_MUX_OUT 4 /* mux-focus will move on */
+#define CHR_EVENT_CLOSED  5 /* connection closed */
 
 
 #define CHR_IOCTL_SERIAL_SET_PARAMS   1
-- 
1.6.2.5



Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-25 Thread Paolo Bonzini



There are userspace libraries that do almost everything, but you hardly
see things like pthread_(EFD_STATE-like)_create() or similar system
interfaces based on such an abstraction.


It actually seems as close to a condition variable as an eventfd can be.


Typical pthread condition code usage maps to eventfd like:

	while (read(efd, ...) > 0)
		if (CONDITION)
			break;

So a pthread condition is really a wakeup gate like eventfd is.
EFD_STATE has nothing to do with a pthread condition.


No, your code does not work for pthread_cond_broadcast (which arguably 
is the common case) because all the eventfd readers after the first 
would block.
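
A minimal sketch of that point, using only the documented eventfd(2)
semantics (single process, with EFD_NONBLOCK so the second read fails
with EAGAIN instead of blocking):

/* Why a plain eventfd cannot model pthread_cond_broadcast: the first
 * read() returns the whole counter and zeroes it, so a second waiter
 * sees nothing. */
#include <sys/eventfd.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    uint64_t val = 1;
    int efd = eventfd(0, EFD_NONBLOCK);

    write(efd, &val, sizeof(val));   /* the "broadcast" */

    read(efd, &val, sizeof(val));    /* first waiter consumes it all */
    printf("first reader got %llu\n", (unsigned long long)val);

    if (read(efd, &val, sizeof(val)) < 0 && errno == EAGAIN)
        printf("second reader would have blocked\n");

    close(efd);
    return 0;
}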


Paolo


Re: vhost net: performance with ping benchmark

2009-08-25 Thread Michael S. Tsirkin
On Tue, Aug 25, 2009 at 07:14:29AM +0300, Avi Kivity wrote:
 On 08/25/2009 05:22 AM, Anthony Liguori wrote:

 I think 2.6.32 is pushing it. 

 2.6.32 is pushing it, but we need to push it.

 I think some time is needed to flush out the userspace interface.  In  
 particular, I don't think Mark's comments have been adequately  
 addressed.  If a version were merged without GSO support, some  
 mechanism to do feature detection would be needed in the userspace API. 
 

 I don't see any point  in merging without gso (unless it beats userspace  
 with gso, which I don't think will happen).  In any case we'll need  
 feature negotiation.

 I think this is likely going to be needed regardless.  I also think  
 the tap compatibility suggestion would simplify the consumption of  
 this in userspace.

 What about veth pairs?

 I'd like some time to look at get_state/set_state ioctl()s along with  
 dirty tracking support.  It's a much better model for live migration  
 IMHO.

 My preference is ring proxying.  Note we'll need ring proxying (or at  
 least event proxying) for non-MSI guests.

Exactly, that's what I meant earlier. That's enough, isn't it, Anthony?

 I think some more thorough benchmarking would be good too.  In  
 particular, netperf/iperf runs would be nice.

 Definitely.

 -- 
 I have a truly marvellous patch that fixes the bug which this
 signature is too narrow to contain.


Re: vhost net: performance with ping benchmark

2009-08-25 Thread Michael S. Tsirkin
On Mon, Aug 24, 2009 at 09:22:47PM -0500, Anthony Liguori wrote:
 Michael S. Tsirkin wrote:
 On Mon, Aug 24, 2009 at 11:12:41AM +0300, Michael S. Tsirkin wrote:
   
 At Rusty's suggestion, I tested vhost base performance with ping.
 Results below, and seem to be what you'd expect.
 

 Rusty, any chance you could look at the code?  Is it in reasonable
 shape? I think it makes sense to merge it through you. What do you
 think?  One comment on file placement: I put files under a separate
 vhost directory to avoid confusion with virtio-net which runs in guest.
 Does this sound sane?  Also, can a minimal version (without TSO, tap or
 any other features) be merged upstream first so that features can be
 added later? Or do we have to wait until it's more full featured?
 Finally, can it reasonably make 2.6.32, or you think it needs more time
 out of tree?
   

 I think 2.6.32 is pushing it.  I think some time is needed to flush out  
 the userspace interface.  In particular, I don't think Mark's comments  
 have been adequately addressed.

I went over them, and I thought they had been. Mark, could you please comment?
Are you ok with the interface?

  If a version were merged without GSO  
 support, some mechanism to do feature detection would be needed in the  
 userspace API.

Correct. There's already GET_FEATURES/ACK_FEATURES in place for this.
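
For illustration, userspace consumption of that pair could look like the
sketch below.  VHOST_GET_FEATURES appears in the posted patch;
VHOST_ACK_FEATURES, the <linux/vhost.h> header name and the
VHOST_NET_F_GSO bit are assumptions made for the example:

/* Hedged sketch of vhost feature negotiation from qemu's side. */
#include <linux/vhost.h>    /* header the patch series would install */
#include <sys/ioctl.h>
#include <stdint.h>

#define VHOST_NET_F_GSO (1u << 0)   /* assumed flag, not from the patch */

static int negotiate_features(int vhost_fd, int want_gso)
{
    uint32_t features;

    if (ioctl(vhost_fd, VHOST_GET_FEATURES, &features) < 0)
        return -1;          /* old kernel: fall back to userspace I/O */

    if (!want_gso)
        features &= ~VHOST_NET_F_GSO;

    /* Acknowledge only the subset we intend to use. */
    return ioctl(vhost_fd, VHOST_ACK_FEATURES, &features);
}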

 I think this is likely going to be needed regardless.  I  
 also think the tap compatibility suggestion would simplify the  
 consumption of this in userspace.

Yes. I'll post a patch to tap showing how this can be done
without vhost changes.

 I'd like some time to look at get_state/set_state ioctl()s along with  
 dirty tracking support.  It's a much better model for live migration 
 IMHO.

That option is also available in my code, I just went for a simpler one
in my qemu patch.  I will outline how it works in a separate mail.

 I think some more thorough benchmarking would be good too.  In particular,  
 netperf/iperf runs would be nice.

I don't expect the first version to perform well in all situations and
for all users.  But it can't regress, can it? One can always fall back
to userspace ...

 Regards,

 Anthony Liguori

 Thanks very much,

   


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-25 Thread Michael S. Tsirkin
On Mon, Aug 24, 2009 at 03:15:05PM -0700, Davide Libenzi wrote:
 On Tue, 25 Aug 2009, Michael S. Tsirkin wrote:
 
  On Mon, Aug 24, 2009 at 11:25:01AM -0700, Davide Libenzi wrote:
   On Sun, 23 Aug 2009, Michael S. Tsirkin wrote:
   
On Sun, Aug 23, 2009 at 04:40:51PM +0300, Avi Kivity wrote:
 On 08/23/2009 04:36 PM, Michael S. Tsirkin wrote:
 More important here is realization that eventfd is a mutex/semaphore
 implementation, not a generic event reporting interface as we are 
 trying
 to use it.


 Well it is a generic event reporting interface (for example, aio uses 
 it).

Davide, I think it's a valid point.  For example, what read on eventfd
does (zero a counter and return) is not like any semaphore I saw.
   
   
   Indeed, the default eventfd behaviour is like, well, an event. Signaling 
   (kernel side) or writing (userspace side) signals the event.
   Waiting on (reading) it will reset the event.
   If you use EFD_SEMAPHORE, you get a semaphore-like behavior.
   Events and semaphores are two widely known and used abstractions.
   The EFD_STATE proposed one, well, no. Not at all.
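
The difference between those two read semantics is easy to demonstrate;
this snippet assumes nothing beyond the eventfd(2) behaviour described
above:

#include <sys/eventfd.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    uint64_t v = 3;
    int ev  = eventfd(0, 0);              /* event-like */
    int sem = eventfd(0, EFD_SEMAPHORE);  /* semaphore-like */

    write(ev, &v, sizeof(v));
    write(sem, &v, sizeof(v));

    read(ev, &v, sizeof(v));    /* returns 3, counter reset to 0 */
    printf("event read: %llu\n", (unsigned long long)v);

    read(sem, &v, sizeof(v));   /* returns 1, counter decremented to 2 */
    printf("semaphore read: %llu\n", (unsigned long long)v);

    close(ev);
    close(sem);
    return 0;
}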
  
  Hmm. All we try to do is, associate a small key with the event
  that we signal. Is it really that uncommon/KVM specific?
 
 All I'm trying to do is to avoid eventfd becoming a horrible 
 multiplexer for every freaky one-time-use behavior arising inside kernel 
 modules.

Yes, we don't want that. The best thing is to try to restate the problem
in a generic way, and then either solve it or, better, use an existing
solution. Right?

I thought I had that, but apparently not.  The reason I'm Cc-ing you is
not to try and spam you until you give up and accept the patch, it's
hoping that you see the pattern behind our usage, and help generalize
it.

If I understand it correctly, you believe this is not possible and so
any solution will have to be in KVM? Or maybe I didn't state the problem
clearly enough and should restate it?


 
 
 - Davide

-- 
MST


[PATCH 3/3] virtio-console: Add interface for generic guest-host communication

2009-08-25 Thread Amit Shah
This interface extends the virtio-console device to handle
multiple ports that present a char device from which bits can
be sent and read.

Sample uses for such a device can be obtaining info from the
guest like the file systems used, apps installed, etc. for
offline usage and logged-in users, clipboard copy-paste, etc.
for online usage.

Each port is to be assigned a unique function, for example, the
first 4 ports may be reserved for libvirt usage, the next 4 for
generic streaming data and so on. This port-function mapping
isn't finalised yet.

For requirements, use-cases and some history see

http://www.linux-kvm.org/page/VMchannel_Requirements

Signed-off-by: Amit Shah amit.s...@redhat.com
---
 hw/pc.c |   16 ++-
 hw/virtio-console.c |  423 +--
 hw/virtio-console.h |   47 ++
 monitor.c   |7 +
 qemu-monitor.hx |   10 ++
 qemu-options.hx |2 +-
 sysemu.h|   10 +-
 vl.c|   41 +++---
 8 files changed, 479 insertions(+), 77 deletions(-)

diff --git a/hw/pc.c b/hw/pc.c
index ca681b8..96eb7ac 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -1453,11 +1453,17 @@ static void pc_init1(ram_addr_t ram_size,
 }
 
 /* Add virtio console devices */
-if (pci_enabled) {
-for(i = 0; i < MAX_VIRTIO_CONSOLES; i++) {
-if (virtcon_hds[i]) {
-pci_create_simple(pci_bus, -1, "virtio-console-pci");
-}
+if (pci_enabled && virtcon_nr_ports) {
+void *dev;
+
+dev = pci_create_simple(pci_bus, -1, "virtio-console-pci");
+if (!dev) {
+fprintf(stderr, "qemu: could not create virtio console pci device\n");
+exit(1);
+}
+
+for (i = 0; i < virtcon_nr_ports; i++) {
+virtio_console_new_port(dev, virtcon_idx[i]);
 }
 }
 
diff --git a/hw/virtio-console.c b/hw/virtio-console.c
index 92c953c..f8f9866 100644
--- a/hw/virtio-console.c
+++ b/hw/virtio-console.c
@@ -2,9 +2,11 @@
  * Virtio Console Device
  *
  * Copyright IBM, Corp. 2008
+ * Copyright Red Hat, Inc. 2009
  *
  * Authors:
  *  Christian Ehrhardt ehrha...@linux.vnet.ibm.com
+ *  Amit Shah amit.s...@redhat.com
  *
  * This work is licensed under the terms of the GNU GPL, version 2.  See
  * the COPYING file in the top-level directory.
@@ -12,39 +14,158 @@
  */
 
 #include "hw.h"
+#include "pci.h"
+#include "monitor.h"
 #include "qemu-char.h"
 #include "virtio.h"
 #include "virtio-console.h"
 
-
 typedef struct VirtIOConsole
 {
 VirtIODevice vdev;
+PCIDevice *dev;
 VirtQueue *ivq, *ovq;
-CharDriverState *chr;
+struct VirtIOConsolePort *ports;
+struct virtio_console_config *config;
+uint32_t guest_features;
 } VirtIOConsole;
 
-static VirtIOConsole *to_virtio_console(VirtIODevice *vdev)
+typedef struct VirtIOConsolePort {
+VirtIOConsole *vcon;
+CharDriverState *hd;
+bool guest_connected;
+} VirtIOConsolePort;
+
+static VirtIOConsole *virtio_console;
+static struct virtio_console_config virtcon_config;
+
+static VirtIOConsolePort *get_port_from_id(uint32_t id)
+{
+if (id >= MAX_VIRTIO_CONSOLE_PORTS)
+return NULL;
+
+return &virtio_console->ports[id];
+}
+
+static int get_id_from_port(VirtIOConsolePort *port)
 {
-return (VirtIOConsole *)vdev;
+uint32_t i;
+
+for (i = 0; i < MAX_VIRTIO_CONSOLE_PORTS; i++) {
+if (port == &virtio_console->ports[i]) {
+return i;
+}
+}
+return VIRTIO_CONSOLE_BAD_ID;
+}
+
+static bool use_multiport(void)
+{
+return virtio_console->guest_features & (1 << VIRTIO_CONSOLE_F_MULTIPORT);
+}
+
+void virtio_console_monitor_command(Monitor *mon,
+const char *command, const char *param)
+{
+int ret;
+
+if (!strncmp(command, "add_port", 8)) {
+if (!param) {
+monitor_printf(mon, "Error: need port id to add new port\n");
+return;
+}
+ret = init_virtio_console_port(virtcon_nr_ports, param);
+if (ret < 0) {
+monitor_printf(mon, "Error: cannot add new port: %s\n",
+   strerror(-ret));
+return;
+}
+virtio_console_new_port(NULL, virtcon_idx[virtcon_nr_ports]);
+virtcon_nr_ports++;
+virtio_console->config->nr_active_ports = cpu_to_le32(virtcon_nr_ports);
+return;
+}
+}
+
+static size_t flush_buf(uint32_t id, VirtIOConsolePort *port,
+const uint8_t *buf, size_t len)
+{
+size_t write_len = 0;
+
+if (port->hd) {
+write_len = qemu_chr_write(port->hd, buf, len);
+return write_len;
+}
+return 0;
+}
+
+/* Guest wants to notify us of some event */
+static void handle_control_message(VirtIOConsolePort *port,
+   struct virtio_console_control *cpkt)
+{
+switch (cpkt->event) {
+case VIRTIO_CONSOLE_PORT_OPEN:
+port->guest_connected = cpkt->value;
+break;
+}
 }
 
+/* Guest 

Re: [PATCHv4 0/2] vhost: a kernel-level virtio server

2009-08-25 Thread Rusty Russell
On Thu, 20 Aug 2009 12:30:29 am Michael S. Tsirkin wrote:
 Rusty, could you review and comment on the patches please?  Since most
 of the code deals with virtio from host side, I think it will make sense
 to merge them through your tree. What do you think?

Yep, I've been waiting for all the other comments, then I got ill.

Now I'm all recovered, what better way to spend an evening than reviewing
fresh code?

New dir seems fine to me, too.

Thanks,
Rusty.


Re: Biweekly KVM Test report, kernel 7597f... qemu 1c45e...

2009-08-25 Thread Avi Kivity

On 08/21/2009 10:14 AM, Xu, Jiajun wrote:

I found the migration failure is caused by a configuration mistake on our 
testing machine. Now 64-bit migration works well.
But I found on PAE host, migration will cause host kernel call trace.

   




Pid: 12053, comm: qemu-system-x86 Tainted: G  D(2.6.31-rc2 #1)
EIP: 0060:[c043e023] EFLAGS: 00210202 CPU: 0
EIP is at lock_hrtimer_base+0x11/0x33
EAX: f5d1541c EBX: 0010 ECX: 04a9 EDX: f5c1bc7c
ESI: f5d1541c EDI: f5c1bc7c EBP: f5c1bc74 ESP: f5c1bc68
  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process qemu-system-x86 (pid: 12053, ti=f5c1b000 task=f61cb410
task.ti=f5c1b000)
Stack:
  f5d1541c  04a9 f5c1bc8c c043e097 f9b7f7cb f5d1541c 
0  04a9 f5c1bc98 c043e0f0 f5d153d0 f5c1bcb0 f9b9b4df  bfd8a102
0  f3c1e000 f5d15440 f5c1bcc0 f9b9b56d bfd8a10c f3c1e000 f5c1bda0 f9b8c26b
Call Trace:
  [c043e097] ? hrtimer_try_to_cancel+0x16/0x62
  [f9b7f7cb] ? kvm_flush_remote_tlbs+0xd/0x1a [kvm]
  [c043e0f0] ? hrtimer_cancel+0xd/0x18
  [f9b9b4df] ? pit_load_count+0x98/0x9e [kvm]
  [f9b9b56d] ? kvm_pit_load_count+0x21/0x35 [kvm]
   


Marcelo, any idea? Looks like the PIT was reloaded, but the hrtimer 
wasn't initialized?


--
error compiling committee.c: too many arguments to function



Re: [RFC] defer skb allocation in virtio_net -- mergable buff part

2009-08-25 Thread Michael S. Tsirkin
I've wanted to try this for a while now; good to see it being worked on!
Some comments inline.

On Wed, Aug 12, 2009 at 11:33:51PM -0700, Shirley Ma wrote:
 Guest virtio_net receives packets from its pre-allocated vring 
 buffers, then it delivers these packets to upper layer protocols
 as skb buffs. So it's not necessary to pre-allocate an skb for each
 mergable buffer and then free it when it's no longer needed. 
 
 This patch defers skb allocation to when packets are received; 
 it reduces skb pre-allocations and skb frees. And it introduces two 
 page lists: freed_pages and used_pages; used_pages is used to 
 track pre-allocated pages, and is only useful when removing virtio_net.
 
 This patch has been tested and measured against 2.6.31-rc4 git.
 I thought this patch would improve large packet performance, but I saw
 netperf TCP_STREAM performance improved for small packets for both 
 local guest to host and host to local guest cases. It also reduces 
 the UDP packet drop rate from host to local guest. I do not fully 
 understand why.
 
 The netperf results from my laptop are:
 
 mtu=1500
 netperf -H xxx -l 120
 
   w/o patch   w/i patch (two runs)
 guest to host:  3336.84Mb/s   3730.14Mb/s ~ 3582.88Mb/s
 
 host to guest:  3165.10Mb/s   3370.39Mb/s ~ 3407.96Mb/s
 
 Here is the patch for your review. The same approach can apply to non-mergable
 buffs too, so we can use code in common.

Yes, we should do that. The way to test is to hack virtio to ignore
host support for mergeable buffers.

 If there is no objection, I will 
 submit the non-mergable buffs patch later.
 

The whole page cache management in virtio net is too complex.
Since it seems you bypass it for some cases, this might be where
savings come from?

 Signed-off-by: Shirley Ma x...@us.ibm.com
 ---
 
 diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
 index 2a6e81d..e31ebc9 100644
 --- a/drivers/net/virtio_net.c
 +++ b/drivers/net/virtio_net.c
 @@ -17,6 +17,7 @@
   * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
   */
  //#define DEBUG
 +#include <linux/list.h>
  #include <linux/netdevice.h>
  #include <linux/etherdevice.h>
  #include <linux/ethtool.h>
 @@ -39,6 +40,12 @@ module_param(gso, bool, 0444);
  
  #define VIRTNET_SEND_COMMAND_SG_MAX2
  
 +struct page_list
 +{

Kernel style is struct page_list {.
Also, prefix with virtnet_?

 + struct page *page;
 + struct list_head list;
 +};
 +
  struct virtnet_info
  {
   struct virtio_device *vdev;
 @@ -72,6 +79,8 @@ struct virtnet_info
  
   /* Chain pages by the private ptr. */
   struct page *pages;

Do we need the pages list now? Can we do without?

Pls document fields below.

 + struct list_head used_pages;

Seems a waste to have this list just for dev down.
Extend virtio to give us all buffers from vq
on shutdown?

 + struct list_head freed_pages;
  };
  
  static inline void *skb_vnet_hdr(struct sk_buff *skb)
 @@ -106,6 +115,26 @@ static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
   return p;
  }
  
 +static struct page_list *get_a_free_page(struct virtnet_info *vi, gfp_t gfp_mask)

always called with gfp_mask GFP_ATOMIC?
Also - line too long here and elsewhere. Run checkpatch?

 +{
 + struct page_list *plist;
 +
 + if (list_empty(&vi->freed_pages)) {

special handling for the empty list looks a bit ugly;
is it really necessary?

 + plist = kmalloc(sizeof(struct page_list), gfp_mask);

sizeof *plist

 + if (!plist)
 + return NULL;
 + list_add_tail(&plist->list, &vi->freed_pages);
 + plist->page = alloc_page(gfp_mask);
 + } else {
 + plist = list_first_entry(&vi->freed_pages, struct page_list, list);

is the first entry special? pls document in struct definition

 + if (!plist->page)
 + plist->page = alloc_page(gfp_mask);

what does it mean if plist->page == NULL? Please document this
in struct definition.

 + }
 + if (plist->page)
 + list_move_tail(&plist->list, &vi->used_pages);
 + return plist;
 +}
 +
  static void skb_xmit_done(struct virtqueue *svq)
  {
  struct virtnet_info *vi = svq->vdev->priv;
 @@ -121,14 +150,14 @@ static void skb_xmit_done(struct virtqueue *svq)
  tasklet_schedule(&vi->tasklet);
  }
  
 -static void receive_skb(struct net_device *dev, struct sk_buff *skb,
 +static void receive_skb(struct net_device *dev, void *buf,

what is buf, here? can it have proper type and not void*?

   unsigned len)
  {
   struct virtnet_info *vi = netdev_priv(dev);
 - struct virtio_net_hdr *hdr = skb_vnet_hdr(skb);
   int err;
   int i;
 -
 + struct sk_buff *skb = NULL;
 + struct virtio_net_hdr *hdr = NULL;

do we have to init these to NULL?

  if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) {
  pr_debug("%s: short packet %i\n", dev->name, len);
  dev->stats.rx_length_errors++;
 @@ -136,15 +165,30 @@ 

Re: [PATCHv4 2/2] virtio: refactor find_vqs

2009-08-25 Thread Michael S. Tsirkin
Sorry it took a bit to respond to this.  I kept hoping I'll have time to
test it but looks like I won't before I'm on vacation.  I agree it's
good cleanup, and maybe we can go even further, see below.

On Mon, Aug 10, 2009 at 09:07:52AM +0930, Rusty Russell wrote:
 On Tue, 28 Jul 2009 06:03:08 pm Michael S. Tsirkin wrote:
  On Tue, Jul 28, 2009 at 12:44:31PM +0930, Rusty Russell wrote:
   On Mon, 27 Jul 2009 01:17:09 am Michael S. Tsirkin wrote:
This refactors find_vqs, making it more readable and robust, and fixing
two regressions from 2.6.30:
- double free_irq causing BUG_ON on device removal
- probe failure when vq can't be assigned to msi-x vector
  (reported on old host kernels)

An older version of this patch was tested by Amit Shah.
   
   OK, I've applied both of these; I'd like to see a new test by Amit to
   make sure tho.
   
   I really like this cleanup!  I looked harder at this code, and my best
   attempts to untangle it further came to very little.  This is what I
   ended up with, but it's all cosmetic and can wait until next merge window.
   See what you think.
   
   Thanks!
   Rusty.
   
   virtio_pci: minor MSI-X cleanups
   
   1) Rename vp_request_vectors to vp_request_msix_vectors, and take
  non-MSI-X case out to caller.
  
  I'm not sure this change was for the best: we still have a separate code
  path under if !use_msix, only in another place now.  See below.
  And this seems to break the symmetry between request_ and free_vectors.
 
 Yes, but unfortunately request_vectors was never really symmetrical :(
 
 That's because we didn't do the request_irq's for the per_vector case, because
 we don't have the names.  This is what prevented me from doing a nice
 encapsulation.

Yes. But let's split free_vectors out into free_msix_vectors and
free_intx as well?

   - err = vp_request_vectors(vdev, nvectors, per_vq_vectors);
   + if (!use_msix) {
   + /* Old style: one normal interrupt for change and all vqs. */
   + vp_dev->msix_vectors = 0;
   + vp_dev->per_vq_vectors = false;
   + err = request_irq(vp_dev->pci_dev->irq, vp_interrupt,
   +   IRQF_SHARED, dev_name(&vdev->dev), vp_dev);
   + if (!err)
   + vp_dev->intx_enabled = 1;
  
  shorter as vp_dev->intx_enabled = !err
  
   + return err;
  
  Is that all? Don't we need to create the vqs?
 
 Oh, yeah :)
 
 This patch applies on top.  Basically reverts that part, and
 renames vector to msix_vector, which I think makes the code more
 readable.

Yes, I agree, this is good cleanup, structure field names should be
descriptive.  Are you sure we want to make all local variables named
vector renamed to msix_vector though? CodingStyle says local var names
should be short ...  A couple of ideas below.

 virtio: more PCI minor cleanups
 
 Signed-off-by: Rusty Russell ru...@rustcorp.com.au
 ---
  drivers/virtio/virtio_pci.c |   73 
 +---
  1 file changed, 36 insertions(+), 37 deletions(-)
 
 diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
 --- a/drivers/virtio/virtio_pci.c
 +++ b/drivers/virtio/virtio_pci.c
 @@ -84,7 +84,7 @@ struct virtio_pci_vq_info
   struct list_head node;
  
   /* MSI-X vector (or none) */
 - unsigned vector;
 + unsigned msix_vector;

Good. And now we don't need the comment.

  };
  
  /* Qumranet donated their vendor ID for devices 0x1000 thru 0x10FF. */
 @@ -349,7 +349,7 @@ error:
  static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index,
 void (*callback)(struct virtqueue *vq),
 const char *name,
 -   u16 vector)
 +   u16 msix_vector)
  {
   struct virtio_pci_device *vp_dev = to_vp_device(vdev);
   struct virtio_pci_vq_info *info;
 @@ -374,7 +374,7 @@ static struct virtqueue *setup_vq(struct
  
   info->queue_index = index;
   info->num = num;
 - info->vector = vector;
 + info->msix_vector = msix_vector;
  
   size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN));
  info->queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO);
 @@ -398,10 +398,10 @@ static struct virtqueue *setup_vq(struct
  vq->priv = info;
  info->vq = vq;
  
 - if (vector != VIRTIO_MSI_NO_VECTOR) {
 - iowrite16(vector, vp_dev->ioaddr + VIRTIO_MSI_QUEUE_VECTOR);
 - vector = ioread16(vp_dev->ioaddr + VIRTIO_MSI_QUEUE_VECTOR);
 - if (vector == VIRTIO_MSI_NO_VECTOR) {
 + if (msix_vector != VIRTIO_MSI_NO_VECTOR) {
 + iowrite16(msix_vector, vp_dev->ioaddr+VIRTIO_MSI_QUEUE_VECTOR);
 + msix_vector = ioread16(vp_dev->ioaddr+VIRTIO_MSI_QUEUE_VECTOR);

checkpatch will complain that space is lacking around +.
We won't have a problem if we keep it named vector.

 + if (msix_vector == VIRTIO_MSI_NO_VECTOR) {
   

Re: [PATCHv4 2/2] vhost_net: a kernel-level virtio server

2009-08-25 Thread Rusty Russell
On Thu, 20 Aug 2009 12:33:09 am Michael S. Tsirkin wrote:
 What it is: vhost net is a character device that can be used to reduce
 the number of system calls involved in virtio networking.
 Existing virtio net code is used in the guest without modification.

...

 +config VHOST_NET
 + tristate "Host kernel accelerator for virtio net"
 + depends on NET && EVENTFD
 + ---help---
 +   This kernel module can be loaded in host kernel to accelerate
 +   guest networking with virtio_net. Not to be confused with virtio_net
 +   module itself which needs to be loaded in guest kernel.
 +
 +   To compile this driver as a module, choose M here: the module will
 +   be called vhost_net.

Just want to note that the patch explanation and the Kconfig help text are
exceptional examples of reader-oriented text.  Nice!

 + /* Sanity check */
 + if (vq->iov->iov_len != sizeof(struct virtio_net_hdr)) {
 + vq_err(vq, "Unexpected header len for TX: "
 +"%ld expected %zd\n", vq->iov->iov_len,
 +sizeof(struct virtio_net_hdr));
 + break;
 + }

OK, this check, which is in the qemu version, is *wrong*.  There should be
no assumption on sg boundaries.  For example, the guest should be able to
put the virtio_net_hdr in the front of the skbuf data if there is room.

You should try to explicitly consume sizeof(struct virtio_net_hdr) of the
iov, and if fail, do this message.  You can skip the out = 1 test then,
too.

Anyway, I really prefer vq->iov[0]. to vq->iov-> here.
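
A sketch of the suggested approach: consume the header from wherever it
sits in the iovec instead of assuming iov[0] is exactly the header.  This
helper is illustrative, not code from the patch; in-kernel it would use
copy_from_user() rather than memcpy():

#include <string.h>
#include <sys/uio.h>

/* Pull len bytes off the front of the iovec, advancing it in place.
 * Returns 0 on success, -1 if the iovec holds fewer than len bytes. */
static int consume_from_iov(struct iovec *iov, unsigned int iov_cnt,
                            void *dst, size_t len)
{
    unsigned int i;

    for (i = 0; i < iov_cnt && len > 0; i++) {
        size_t part = iov[i].iov_len < len ? iov[i].iov_len : len;

        memcpy(dst, iov[i].iov_base, part);
        dst = (char *)dst + part;
        iov[i].iov_base = (char *)iov[i].iov_base + part;
        iov[i].iov_len -= part;
        len -= part;
    }
    return len ? -1 : 0;
}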

 + /* Sanity check */
 + if (vq->iov->iov_len != sizeof(struct virtio_net_hdr)) {
 + vq_err(vq, "Unexpected header len for RX: "
 +"%ld expected %zd\n",
 +vq->iov->iov_len, sizeof(struct virtio_net_hdr));
 + break;

Here too.

 + u32 __user *featurep = argp;
 + int __user *fdp = argp;
 + u32 features;
 + int fd, r;
 + switch (ioctl) {
 + case VHOST_NET_SET_SOCKET:
 + r = get_user(fd, fdp);
 + if (r < 0)
 + return r;
 + return vhost_net_set_socket(n, fd);
 + case VHOST_GET_FEATURES:
 + /* No features for now */
 + features = 0;
 + return put_user(features, featurep);

We may well get more than 32 feature bits, at least for virtio_net, which will
force us to do some trickery in virtio_pci.  I'd like to avoid that here,
though it's kind of ugly.  We'd need VHOST_GET_FEATURES (and ACK) to take a
struct like:

u32 feature_size;
u32 features[];
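
A sketch of what serving that structure could look like on the kernel
side; the struct layout is the proposal above, while the handler name,
the vhost_feature_words[] array and the header choices are assumptions:

#include <linux/kernel.h>
#include <linux/types.h>
#include <linux/uaccess.h>

struct vhost_features {
	u32 feature_size;	/* number of u32 words that follow */
	u32 features[];
};

static const u32 vhost_feature_words[1] = { 0 };	/* no features for now */

static long get_features(void __user *argp)
{
	struct vhost_features __user *ufeat = argp;
	u32 n;

	if (get_user(n, &ufeat->feature_size))
		return -EFAULT;
	/* Copy out as many words as userspace has room for and we have. */
	if (n > ARRAY_SIZE(vhost_feature_words))
		n = ARRAY_SIZE(vhost_feature_words);
	if (put_user(n, &ufeat->feature_size) ||
	    copy_to_user(ufeat->features, vhost_feature_words,
			 n * sizeof(u32)))
		return -EFAULT;
	return 0;
}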

 +int vhost_net_init(void)

static?

 +void vhost_net_exit(void)

static?

 +/* Start polling a file. We add ourselves to file's wait queue. The user must
 + * keep a reference to a file until after vhost_poll_stop is called. */

I experienced minor confusion from the comments in this file.  Where you said
"user" I think "caller".  No biggie though.

 + memory->nregions = 2;
 + memory->regions[0].guest_phys_addr = 1;
 + memory->regions[0].userspace_addr = 1;
 + memory->regions[0].memory_size = ~0ULL;
 + memory->regions[1].guest_phys_addr = 0;
 + memory->regions[1].userspace_addr = 0;
 + memory->regions[1].memory_size = 1;

Not sure I understand why there are two regions to start?

 + case VHOST_SET_VRING_BASE:
 + r = copy_from_user(&s, argp, sizeof s);
 + if (r < 0)
 + break;
 + if (s.num > 0xffff) {
 + r = -EINVAL;
 + break;
 + }
 + vq->last_avail_idx = s.num;
 + break;
 + case VHOST_GET_VRING_BASE:
 + s.index = idx;
 + s.num = vq->last_avail_idx;
 + r = copy_to_user(argp, &s, sizeof s);
 + break;

Ah, this is my fault.  I didn't expose the last_avail_idx in the ring
because the other side doesn't need it; but without it the ring state is not
fully observable from outside (no external save / restore!).

I have a patch which published these indices (we have room), see:

http://ozlabs.org/~rusty/kernel/rr-2009-08-12-1/virtio:ring-publish-indices.patch

Perhaps we should use that mechanism instead?  We don't actually have to
offer the feature (we don't care about the guest state), but it's nice as
documentation.  I've been waiting for an excuse to use that patch.

 +long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, unsigned long arg)
 +{
 + void __user *argp = (void __user *)arg;
 + long r;
 +
 + mutex_lock(&d->mutex);
 + if (ioctl == VHOST_SET_OWNER) {
 + r = vhost_dev_set_owner(d);
 + goto done;
 + }
 +
 + r = vhost_dev_check_owner(d);

You can do a VHOST_SET_OWNER without being the owner?  So really,
the -EPERM from all the vhost_dev_check_owner() is not a security thing,
but a I 

KVM: PIT: fix pit_state copy in set_pit2/get_pit2

2009-08-25 Thread Marcelo Tosatti

The kvm_pit_state2 structure contains extra space, so the memcpy
in kvm_vm_ioctl_set_pit2 corrupts kvm->arch.vpit->pit_state.

Fix it by memcpy'ing the channel information and assigning flags
manually.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0f22f72..35e7fc5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2095,7 +2095,9 @@ static int kvm_vm_ioctl_get_pit2(struct kvm *kvm, struct kvm_pit_state2 *ps)
	int r = 0;
 
	mutex_lock(&kvm->arch.vpit->pit_state.lock);
-	memcpy(ps, &kvm->arch.vpit->pit_state, sizeof(struct kvm_pit_state2));
+	memcpy(ps->channels, &kvm->arch.vpit->pit_state.channels,
+		sizeof(ps->channels));
+	ps->flags = kvm->arch.vpit->pit_state.flags;
	mutex_unlock(&kvm->arch.vpit->pit_state.lock);
return r;
 }
@@ -2109,7 +2111,9 @@ static int kvm_vm_ioctl_set_pit2(struct kvm *kvm, struct kvm_pit_state2 *ps)
	cur_legacy = ps->flags & KVM_PIT_FLAGS_HPET_LEGACY;
	if (!prev_legacy && cur_legacy)
		start = 1;
-	memcpy(&kvm->arch.vpit->pit_state, ps, sizeof(struct kvm_pit_state2));
+	memcpy(&kvm->arch.vpit->pit_state.channels, ps->channels,
+	       sizeof(kvm->arch.vpit->pit_state.channels));
+	kvm->arch.vpit->pit_state.flags = ps->flags;
	kvm_pit_load_count(kvm, 0, kvm->arch.vpit->pit_state.channels[0].count, start);
	mutex_unlock(&kvm->arch.vpit->pit_state.lock);
return r;


Re: KVM: PIT: fix pit_state copy in set_pit2/get_pit2

2009-08-25 Thread Avi Kivity

On 08/25/2009 03:29 PM, Marcelo Tosatti wrote:

The kvm_pit_state2 structure contains extra space, so the memcpy
in kvm_vm_ioctl_set_pit2 corrupts kvm->arch.vpit->pit_state.

Fix it by memcpy'ing the channel information and assigning flags
manually.
   


Good catch; applied.

--
error compiling committee.c: too many arguments to function



Re: vhost net: performance with ping benchmark

2009-08-25 Thread Arnd Bergmann
On Tuesday 25 August 2009, Avi Kivity wrote:
 On 08/25/2009 05:22 AM, Anthony Liguori wrote:
 
  I think 2.6.32 is pushing it. 
 
 2.6.32 is pushing it, but we need to push it.

Agreed.

  I think some time is needed to flush out the userspace interface.  In 
  particular, I don't think Mark's comments have been adequately 
  addressed.  If a version were merged without GSO support, some 
  mechanism to do feature detection would be needed in the userspace API. 
 
 I don't see any point  in merging without gso (unless it beats userspace 
 with gso, which I don't think will happen).  In any case we'll need 
 feature negotiation.

The feature negotiation that Michael has put in seems sufficient for this.
If you care more about latency than bandwidth, the current driver is
an enormous improvement over the user space code, which I find is enough
reason to have it now.

  I think this is likely going to be needed regardless.  I also think 
  the tap compatibility suggestion would simplify the consumption of 
  this in userspace.
 
 What about veth pairs?

I think you are talking about different issues. Veth pairs let you connect
vhost to a bridge device like you do in the typical tun/tap setup, which
addresses one problem.

The other issue that came up is that raw sockets require root permissions,
so you have to start qemu as root or with an open file descriptor for the
socket (e.g. through libvirt). Permission handling on tap devices is ugly
as well, but is a solved problem. The solution is to be able to pass
a tap device (or a socket from a tap device) into vhost net. IMHO we
need this eventually, but it's not a show stopper for 2.6.32.

Along the same lines, I also think it should support TCP and UDP sockets
so we can offload 'qemu -net socket,mcast' and 'qemu -net socket,connect'
in addition to 'qemu -net socket,raw'.

Arnd 


Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication

2009-08-25 Thread Rusty Russell
On Thu, 20 Aug 2009 05:14:29 pm Gerd Hoffmann wrote:
 I think strings are better than numbers for identifying protocols, as you 
 can then work without a central registry for the numbers.

Yep, all you need is a central registry for names :)

Rusty.


Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication

2009-08-25 Thread Gerd Hoffmann

On 08/25/09 14:43, Rusty Russell wrote:

On Thu, 20 Aug 2009 05:14:29 pm Gerd Hoffmann wrote:

I think strings are better than numbers for identifying protocols, as you
can then work without a central registry for the numbers.


Yep, all you need is a central registry for names :)


There are schemes to do without (reverse domain names, i.e. 
au.com.rustcorp.* is all yours and you don't have to register).  Also 
with names the namespace is much bigger and clashes are much less 
likely, so chances that it works out with everybody simply picking a 
sane name are much higher.


cheers,
  Gerd


Re: vhost net: performance with ping benchmark

2009-08-25 Thread Anthony Liguori

Avi Kivity wrote:
I think this is likely going to be needed regardless.  I also think 
the tap compatibility suggestion would simplify the consumption of 
this in userspace.


What about veth pairs?


Does veth support GSO and checksum offload?

I'd like some time to look at get_state/set_state ioctl()s along with 
dirty tracking support.  It's a much better model for live migration 
IMHO.


My preference is ring proxying.  Note we'll need ring proxying (or at 
least event proxying) for non-MSI guests.


I avoided suggesting ring proxying because I didn't want to suggest that 
merging should be contingent on it.


Regards,

Anthony Liguori


Re: vhost net: performance with ping benchmark

2009-08-25 Thread Anthony Liguori

Michael S. Tsirkin wrote:
My preference is ring proxying.  Note we'll need ring proxying (or at  
least event proxying) for non-MSI guests.



Exactly, that's what I meant earlier. That's enough, isn't it, Anthony?
  


It is if we have a working implementation that demonstrates the 
userspace interface is sufficient.  Once it goes into the upstream 
kernel, we need to have backwards compatibility code in QEMU forever to 
support that kernel version.


Regards,

Anthony Liguori


Re: [PATCHv4 2/2] vhost_net: a kernel-level virtio server

2009-08-25 Thread Michael S. Tsirkin
Thanks for the comments, I'll work on them ASAP.
Answers to questions and more comments below.

On Tue, Aug 25, 2009 at 09:40:40PM +0930, Rusty Russell wrote:
 On Thu, 20 Aug 2009 12:33:09 am Michael S. Tsirkin wrote:
  What it is: vhost net is a character device that can be used to reduce
  the number of system calls involved in virtio networking.
  Existing virtio net code is used in the guest without modification.
 
 ...
 
  +config VHOST_NET
  +   tristate "Host kernel accelerator for virtio net"
  +   depends on NET && EVENTFD
  +   ---help---
  + This kernel module can be loaded in host kernel to accelerate
  + guest networking with virtio_net. Not to be confused with virtio_net
  + module itself which needs to be loaded in guest kernel.
  +
  + To compile this driver as a module, choose M here: the module will
  + be called vhost_net.
 
 Just want to note that the patch explanation and the Kconfig help text are
 exceptional examples of reader-oriented text.  Nice!

High praise indeed from the author of the lguest Quest :).

  +   /* Sanity check */
  +   if (vq->iov->iov_len != sizeof(struct virtio_net_hdr)) {
  +   vq_err(vq, "Unexpected header len for TX: "
  +  "%ld expected %zd\n", vq->iov->iov_len,
  +  sizeof(struct virtio_net_hdr));
  +   break;
  +   }
 
 OK, this check, which is in the qemu version, is *wrong*.  There should be
 no assumption on sg boundaries.  For example, the guest should be able to
 put the virtio_net_hdr in the front of the skbuf data if there is room.
 
 You should try to explicitly consume sizeof(struct virtio_net_hdr) of the
 iov, and if fail, do this message.  You can skip the out = 1 test then,
 too.
 
 Anyway, I really prefer vq->iov[0]. to vq->iov-> here.

I'll fix that. Probably should fix qemu as well.

  +   /* Sanity check */
  +   if (vq->iov->iov_len != sizeof(struct virtio_net_hdr)) {
  +   vq_err(vq, "Unexpected header len for RX: "
  +  "%ld expected %zd\n",
  +  vq->iov->iov_len, sizeof(struct virtio_net_hdr));
  +   break;
 
 Here too.
 
  +   u32 __user *featurep = argp;
  +   int __user *fdp = argp;
  +   u32 features;
  +   int fd, r;
  +   switch (ioctl) {
  +   case VHOST_NET_SET_SOCKET:
  +   r = get_user(fd, fdp);
  +   if (r < 0)
  +   return r;
  +   return vhost_net_set_socket(n, fd);
  +   case VHOST_GET_FEATURES:
  +   /* No features for now */
  +   features = 0;
  +   return put_user(features, featurep);
 
 We may well get more than 32 feature bits, at least for virtio_net, which will
 force us to do some trickery in virtio_pci.  I'd like to avoid that here,
 though it's kind of ugly.  We'd need VHOST_GET_FEATURES (and ACK) to take a
 struct like:
 
   u32 feature_size;
   u32 features[];

Do you feel just making it 64 bit won't be enough?  How about 128 bit?

  +int vhost_net_init(void)
 
 static?
 
  +void vhost_net_exit(void)
 
 static?

Good catch.

  +/* Start polling a file. We add ourselves to file's wait queue. The user 
  must
  + * keep a reference to a file until after vhost_poll_stop is called. */
 
 I experienced minor confusion from the comments in this file.  Where you said
 user I think caller.  No biggie though.
 
  +   memory->nregions = 2;
  +   memory->regions[0].guest_phys_addr = 1;
  +   memory->regions[0].userspace_addr = 1;
  +   memory->regions[0].memory_size = ~0ULL;
  +   memory->regions[1].guest_phys_addr = 0;
  +   memory->regions[1].userspace_addr = 0;
  +   memory->regions[1].memory_size = 1;
 
 Not sure I understand why there are two regions to start?

We are trying to cover a whole 2^64 range here.  Its size does not fit
in a single 64 bit length value.  I could special case 0 length to mean
2^64, but decided against it.
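
In other words (field names follow the patch; this is just the arithmetic
spelled out):

/* One region's memory_size is a u64, so a single region can describe at
 * most 2^64 - 1 bytes.  Splitting the identity map covers all of 2^64: */
struct region { unsigned long long guest_phys_addr, userspace_addr, memory_size; };

static const struct region all_of_memory[2] = {
	{ .guest_phys_addr = 1, .userspace_addr = 1, .memory_size = ~0ULL }, /* [1, 2^64) */
	{ .guest_phys_addr = 0, .userspace_addr = 0, .memory_size = 1 },     /* [0, 1)    */
};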

  +   case VHOST_SET_VRING_BASE:
  +   r = copy_from_user(&s, argp, sizeof s);
  +   if (r < 0)
  +   break;
  +   if (s.num > 0xffff) {
  +   r = -EINVAL;
  +   break;
  +   }
  +   vq->last_avail_idx = s.num;
  +   break;
  +   case VHOST_GET_VRING_BASE:
  +   s.index = idx;
  +   s.num = vq->last_avail_idx;
  +   r = copy_to_user(argp, &s, sizeof s);
  +   break;
 
 Ah, this is my fault.  I didn't expose the last_avail_idx in the ring
 because the other side doesn't need it; but without it the ring state is not
 fully observable from outside (no external save / restore!).
 
 I have a patch which published these indices (we have room), see:
   
 http://ozlabs.org/~rusty/kernel/rr-2009-08-12-1/virtio:ring-publish-indices.patch
 
 Perhaps we should use that mechanism instead?  We don't actually have to
 offer the feature (we don't care about the guest state), but it's nice as
 documentation.  I've been 

Re: vhost net: performance with ping benchmark

2009-08-25 Thread Anthony Liguori

Avi Kivity wrote:
My preference is ring proxying.  Note we'll need ring proxying (or at 
least event proxying) for non-MSI guests.


Thinking about this more...

How does the hand off work?  Assuming you normally don't proxy ring 
entries and switch to proxying them when you want to migration, do you 
have a set of ioctl()s that changes the semantics of the ring to be host 
virtual addresses instead of guest physical?  If so, what do you do with 
in flight requests?  Does qemu have to buffer new requests and wait for 
old ones to complete?


Unless you always do ring proxying.  If that's the case, we don't need 
any of the slot management code in vhost.


Regards,

Anthony Liguori


Re: vhost net: performance with ping benchmark

2009-08-25 Thread Michael S. Tsirkin
On Tue, Aug 25, 2009 at 08:08:05AM -0500, Anthony Liguori wrote:
 Michael S. Tsirkin wrote:
 My preference is ring proxying.  Note we'll need ring proxying (or at  
 least event proxying) for non-MSI guests.
 

 Exactly, that's what I meant earlier. That's enough, isn't it, Anthony?
   

 It is if we have a working implementation that demonstrates the
 userspace interface is sufficient.

The idea is trivial enough to be sure the interface is sufficient:
we point kernel at used buffer at address X, and
copy stuff from there to guest buffer, then signal guest.
I'll post a code snippet to show how it's done if you like.

 Once it goes into the upstream
 kernel, we need to have backwards compatibility code in QEMU forever
 to  support that kernel version.

Don't worry: kernel needs to handle old userspace as well, and neither I
nor Rusty want to have a compatibility mess in kernel.

 Regards,

 Anthony Liguori


[PATCH] fix apic id reading in x2apic mode

2009-08-25 Thread Gleb Natapov
Format of apic id register is different in x2apic mode.
Return correct apic id when apic is in x2apic mode.

Signed-off-by: Gleb Natapov g...@redhat.com
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index e41e948..bd3a402 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -551,6 +551,12 @@ static u32 __apic_read(struct kvm_lapic *apic, unsigned int offset)
return 0;
 
switch (offset) {
+   case APIC_ID:
+   if (apic_x2apic_mode(apic))
+   val = kvm_apic_id(apic);
+   else
+   val = kvm_apic_id(apic) << 24;
+   break;
case APIC_ARBPRI:
printk(KERN_WARNING "Access APIC ARBPRI register "
   "which is for P6\n");
--
Gleb.


Re: vhost net: performance with ping benchmark

2009-08-25 Thread Michael S. Tsirkin
On Tue, Aug 25, 2009 at 08:24:07AM -0500, Anthony Liguori wrote:
 Avi Kivity wrote:
 My preference is ring proxying.  Note we'll need ring proxying (or at  
 least event proxying) for non-MSI guests.

 Thinking about this more...

 How does the hand off work?  Assuming you normally don't proxy ring  
 entries and switch to proxying them when you want to migration, do you  
 have a set of ioctl()s that changes the semantics of the ring to be host  
 virtual addresses instead of guest physical?  If so, what do you do with  
 in flight requests?  Does qemu have to buffer new requests and wait for  
 old ones to complete?

 Unless you always do ring proxying.  If that's the case, we don't need  
 any of the slot management code in vhost.

 Regards,

 Anthony Liguori

Here's how it works. It relies on the fact that in virtio, guest can not
assume that descriptors have been used unless they appeared in used
buffers.

When migration starts, we do this:

1. stop kernel (disable socket)
2. call VHOST_SET_VRING_USED: note it gets a virtual address, not guest
   physical. We point it at a buffer in qemu memory
3. call VHOST_SET_VRING_CALL, pass eventfd created by qemu
4. copy over the existing used buffer
5. unstop kernel (reenable socket)

Now when migration is in progress, we do this:
A. poll eventfd in 3 above
B. When event is seen, look at used buffer that we gave to kernel
C. Parse descriptors and mark pages that kernel wrote to
   as dirty
D. update used buffer that guest looks at
E. signal eventfd for guest
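
In (rough, untested) code, with the helper names invented and error
handling omitted:

	#include <stdint.h>
	#include <unistd.h>

	void scan_used_and_mark_dirty(void);	/* B/C: walk new entries in
						 * the qemu-owned used ring,
						 * resolve descriptors, mark
						 * written pages dirty */
	void publish_used_to_guest(void);	/* D: copy entries into the
						 * ring the guest observes,
						 * bump its used index */

	void migration_proxy_loop(int kernel_eventfd, int guest_eventfd)
	{
		uint64_t n, one = 1;

		for (;;) {
			read(kernel_eventfd, &n, sizeof n);	/* A: wait */
			scan_used_and_mark_dirty();		/* B, C */
			publish_used_to_guest();		/* D */
			write(guest_eventfd, &one, sizeof one);	/* E: kick */
		}
	}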

-- 
MST


Re: vhost net: performance with ping benchmark

2009-08-25 Thread Michael S. Tsirkin
On Tue, Aug 25, 2009 at 08:06:39AM -0500, Anthony Liguori wrote:
 Avi Kivity wrote:
 I think this is likely going to be needed regardless.  I also think  
 the tap compatibility suggestion would simplify the consumption of  
 this in userspace.

 What about veth pairs?

 Does veth support GSO and checksum offload?

AFAIK, no. But again, improving veth is a separate project :)

 I'd like some time to look at get_state/set_state ioctl()s along with 
 dirty tracking support.  It's a much better model for live migration  
 IMHO.

 My preference is ring proxying.  Note we'll need ring proxying (or at  
 least event proxying) for non-MSI guests.

 I avoided suggesting ring proxying because I didn't want to suggest that  
 merging should be contingent on it.

Happily, the proposed interface supports it.

 Regards,

 Anthony Liguori


Re: vhost net: performance with ping benchmark

2009-08-25 Thread Michael S. Tsirkin
On Tue, Aug 25, 2009 at 08:08:05AM -0500, Anthony Liguori wrote:
 Once it goes into the upstream  
 kernel, we need to have backwards compatibility code in QEMU forever to  
 support that kernel version.

BTW, qemu can keep doing the userspace thing if some capability it needs
is missing. It won't be worse off than if the driver is not upstream at
all :).

-- 
MST


Re: [PATCH] virtio-blk: set QUEUE_ORDERED_DRAIN by default

2009-08-25 Thread Rusty Russell
On Fri, 21 Aug 2009 06:26:16 am Christoph Hellwig wrote:
 Currently virtio-blk doesn't set any QUEUE_ORDERED_ flag by default, which
 means it does not allow filesystems to use barriers.  But the typical use
 case for virtio-blk is to use a backend that uses synchronous I/O

Really?  Does qemu open with O_SYNC?

I'm definitely no block expert, but this seems strange...
Rusty.


Re: [PATCH] virtio-blk: set QUEUE_ORDERED_DRAIN by default

2009-08-25 Thread Christoph Hellwig
On Tue, Aug 25, 2009 at 11:41:37PM +0930, Rusty Russell wrote:
 On Fri, 21 Aug 2009 06:26:16 am Christoph Hellwig wrote:
  Currently virtio-blk doesn't set any QUEUE_ORDERED_ flag by default, which
  means it does not allow filesystems to use barriers.  But the typical use
  case for virtio-blk is to use a backend that uses synchronous I/O
 
 Really?  Does qemu open with O_SYNC?
 
 I'm definitely no block expert, but this seems strange...
 Rusty.

Qemu can open it various ways, but the only one that is fully safe
is O_SYNC (cache=writethrough).  The O_DIRECT (cache=none) option is also
fully safe with the above patch under some limited circumstances 
(disk write caches off and using a host device or fully allocated file).

Fixing the cache=writeback option and the majority case for cache=none
requires implementing a cache flush command, and for the latter also
fixes to the host kernel, which I'm working on.  You will get another
patch to implement the proper cache controls in virtio-blk from me in
a couple of days, too.


[ kvm-Bugs-2844286 ] kvm hangs when iommu used

2009-08-25 Thread SourceForge.net
Bugs item #2844286, was opened at 2009-08-25 10:59
Message generated for change (Tracker Item Submitted) made by mjcoss
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2844286&group_id=180599

Category: kernel
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Mike Coss (mjcoss)
Assigned to: Nobody/Anonymous (nobody)
Summary: kvm hangs when iommu used

Initial Comment:
CPU: Intel i7 975 Extreme Edition
Host:  Gentoo w/2.6.31-rc6 kernel 32bit
Guest: Windows XP sp3 32bit

I'm trying to use Vt-d to enable pci pass through of a Nvidia  GTX 295 card, 
and I currently have 2 issues.  The first is that when I shutdown the kvm, the 
process hangs.  It can't be killed, and it uses 100% utilization of the cpu 
that it's running on.  As expected, it prevents the system from shutting down 
and I have to reset the machine.  The second issue is that when I start the 
kvm, I get an error message 

Option ROM size fe00 exceeds available space
Option ROM size fe00 exceeds available space

If I don't try to use pass through the KVM starts and stops fine.  Under 
windows, it sees the GTX 295, and allowed me to install the driver, but it 
fails to start with error message that it doesn't have enough resources to 
start.  I have tried kvm-88, and the latest qemu-kvm-0.11.0-rc1.  I have also 
tried it with a Linux distribution, OpenSUSE, and the X server doesn't start, 
however I do get a message from the driver that I'm still tracking down. 

--



Re: vhost net: performance with ping benchmark

2009-08-25 Thread Avi Kivity

On 08/25/2009 04:08 PM, Anthony Liguori wrote:

Michael S. Tsirkin wrote:
My preference is ring proxying.  Note we'll need ring proxying (or 
at  least event proxying) for non-MSI guests.


Exactly, that's what I meant earlier. That's enough, isn't it, Anthony?


It is if we have a working implementation that demonstrates the 
userspace interface is sufficient.  Once it goes into the upstream 
kernel, we need to have backwards compatibility code in QEMU forever 
to support that kernel version.


Not at all.  We still have pure userspace support, so if we don't like 
the first two versions of vhost, we can simply not support them.  Of 
course I'm not advocating merging something known bad or untested, just 
pointing out that the cost of an error is not that bad.


--
error compiling committee.c: too many arguments to function



Re: [PATCHv4 2/2] vhost_net: a kernel-level virtio server

2009-08-25 Thread Michael S. Tsirkin
On Tue, Aug 25, 2009 at 09:40:40PM +0930, Rusty Russell wrote:
  +   u32 __user *featurep = argp;
  +   int __user *fdp = argp;
  +   u32 features;
  +   int fd, r;
  +   switch (ioctl) {
  +   case VHOST_NET_SET_SOCKET:
  +   r = get_user(fd, fdp);
  +   if (r < 0)
  +   return r;
  +   return vhost_net_set_socket(n, fd);
  +   case VHOST_GET_FEATURES:
  +   /* No features for now */
  +   features = 0;
  +   return put_user(features, featurep);
 
 We may well get more than 32 feature bits, at least for virtio_net, which will
 force us to do some trickery in virtio_pci.

Unlike PCI, if we ever run out of bits we can just
add FEATURES_EXTENDED ioctl, no need for trickery.

  I'd like to avoid that here,
 though it's kind of ugly.  We'd need VHOST_GET_FEATURES (and ACK) to take a
 struct like:
 
   u32 feature_size;
   u32 features[];


Thinking about this proposal some more, how will the guest
determine the size to supply to the GET_FEATURES ioctl?

Since we are a bit tight in 32 bit space already,
let's just use a 64 bit integer and be done with it?
Right?
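
Something like this (sketch -- the magic and ioctl numbers here are
invented, not a final ABI):

	#include <linux/ioctl.h>
	#include <linux/types.h>

	#define VHOST_EXAMPLE_MAGIC	0xAF

	/* A fixed 64-bit feature word keeps the interface simple; if we
	 * ever run out of bits, a separate FEATURES_EXTENDED ioctl can
	 * be added later. */
	#define VHOST_GET_FEATURES	_IOR(VHOST_EXAMPLE_MAGIC, 0x00, __u64)
	#define VHOST_ACK_FEATURES	_IOW(VHOST_EXAMPLE_MAGIC, 0x01, __u64)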

-- 
MST


Notes on block I/O data integrity

2009-08-25 Thread Christoph Hellwig
As various people wanted to know how the various data integrity patches
I've send out recently play together here's a small writeup on what
issues we have in QEMU and how to fix it:

There are two major aspects of data integrity we need to care in the
QEMU block I/O code:

 (1) stable data storage - we must be able to force data out of caches
 onto the stable media, and we must get completion notification for it.
 (2) request ordering - we must be able to make sure some I/O request
 do not get reordered with other in-flight requests before or after
 it.

Linux uses two related abstractions to implement this (other operating
systems are probably similar)

 (1) a cache flush request that flushes the whole volatile write cache to
 stable storage
 (2) a barrier request, which
  (a) is guaranteed to actually go all the way to stable storage
  (b) is not reordered versus any requests before or after it

For disks not using volatile write caches the cache flush is a no-op,
and barrier requests are implemented by draining the queue of
outstanding requests before the barrier request, and only allowing new
requests to proceed after it has finished.  Instead of the queue drain
tag ordering could be used, but at this point that is not the case in
Linux.

For disks using volatile write caches, the cache flush is implemented by
a protocol specific request, and the barrier requests are implemented
by performing cache flushes before and after the barrier request, in
addition to the draining mentioned above.  The second cache flush can be
replaced by setting the Force Unit Access bit on the barrier request 
on modern disks.


The above is supported by the QEMU emulated disks in the following way:

  - The IDE disk emulation implements the ATA WIN_FLUSH_CACHE/
WIN_FLUSH_CACHE_EXT commands to flush the drive cache, but does not
indicate a volatile write cache in the ATA IDENTIFY command.  Because
of that guests do not actually send down cache flush requests.  Linux
guests do however drain the I/O queues to guarantee ordering in absence
of volatile write caches.
  - The SCSI disk emulation implements the SCSI SYNCHRONIZE_CACHE command,
and also advertises the write cache enabled bit.  This means Linux
sends down cache flush requests to implement barriers, and provides
sufficient queue draining.
  - The virtio-blk driver does not implement any cache flush command.
And while there is a virtio-blk feature bit for barrier support
it is not supported by virtio-blk.  Due to the lack of a cache flush
command it also is insufficient to implement the required data
integrity semantics.  Currently the Linux virtio-blk driver does not
advertise any form of barrier support, and we don't even get the
queue draining required for proper operation in a cache-less
environment.

The I/O from these front end drivers maps to different host kernel I/O
patterns depending on the cache= drive command line option.  There are three
choices for it:

 (a) cache=writethrough
 (b) cache=writeback
 (c) cache=none

(a) means all writes are synchronous (O_DSYNC), which means the host
kernel guarantees us that data is on stable storage once the I/O
request has completed.
In cache=writethrough mode the IDE and SCSI drivers are safe because
the queue is properly drained to guarantee I/O ordering.  Virtio-blk
is not safe due to the lack of queue draining.
(b) means we use regular buffered writes and need a fsync/fdatasync to
actually guarantee that data is stable on disk.
In cache=writeback mode only the SCSI emulation is safe, as all others
miss the cache flush requests.
(c) means we use direct I/O (O_DIRECT) to bypass the host cache and
perform direct dma to/from the I/O buffer in QEMU.  While direct I/O
bypasses the host cache it does not guarantee flushing of volatile
write caches in disks, nor completion of metadata operations in
filesystems (e.g. block allocations).
In cache=none only the SCSI emulation is entirely safe right now
due to the lack of cache flushes in the other drivers.
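
In terms of open(2) flags this maps roughly to the following (sketch;
the enum names are invented, and cache=writeback additionally needs an
fsync()/fdatasync() whenever the guest asks for a flush):

	#define _GNU_SOURCE		/* for O_DIRECT */
	#include <fcntl.h>

	enum cache_mode { CACHE_WRITETHROUGH, CACHE_WRITEBACK, CACHE_NONE };

	int open_backend(const char *path, enum cache_mode mode)
	{
		int flags = O_RDWR;

		if (mode == CACHE_WRITETHROUGH)
			flags |= O_DSYNC;	/* (a) every write synchronous */
		else if (mode == CACHE_NONE)
			flags |= O_DIRECT;	/* (c) bypass the host page cache */
		/* (b) CACHE_WRITEBACK: plain buffered I/O */

		return open(path, flags);
	}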


Action plan for the guest drivers:

 - virtio-blk needs to advertise ordered queue by default.
   This makes cache=writethrough safe on virtio.
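
   (In the guest driver this should amount to a one-liner along these
   lines -- a sketch against the current block API, with vblk being the
   driver's device structure:)

	/* in virtblk_probe(), once the request queue exists: drain all
	 * in-flight requests around a barrier, no cache flush needed */
	blk_queue_ordered(vblk->disk->queue, QUEUE_ORDERED_DRAIN, NULL);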

Action plan for QEMU:

 - IDE needs to set the write cache enabled bit
 - virtio needs to implement a cache flush command and advertise it
   (also needs a small change to the host driver)
 - we need to implement an aio_fsync to not stall the vpu on cache
   flushes
 - investigate only advertising a write cache when we really have one
   to avoid the cache flush requests for cache=writethrough

Notes on disk cache flushes on Linux hosts:

 - barrier requests and cache flushes are supported by all local
   disk filesystems in popular use (btrfs, ext3, ext4, reiserfs, XFS).
   However unlike the other filesystems ext3 does _NOT_ enable barriers
   and cache flush requests by default.
 - currently O_SYNC 

Re: Notes on block I/O data integrity

2009-08-25 Thread Javier Guerra
On Tue, Aug 25, 2009 at 1:11 PM, Christoph Hellwigh...@lst.de wrote:
  - barrier requests and cache flushes are supported by all local
   disk filesystems in popular use (btrfs, ext3, ext4, reiserfs, XFS).
   However unlike the other filesystems ext3 does _NOT_ enable barriers
   and cache flush requests by default.

what about LVM? I've read somewhere that it used to just eat barriers
used by XFS, making it less safe than simple partitions.

-- 
Javier


Re: Notes on block I/O data integrity

2009-08-25 Thread Christoph Hellwig
On Tue, Aug 25, 2009 at 02:33:30PM -0500, Javier Guerra wrote:
 On Tue, Aug 25, 2009 at 1:11 PM, Christoph Hellwigh...@lst.de wrote:
   - barrier requests and cache flushes are supported by all local
     disk filesystems in popular use (btrfs, ext3, ext4, reiserfs, XFS).
     However unlike the other filesystems ext3 does _NOT_ enable barriers
     and cache flush requests by default.
 
 what about LVM? iv'e read somewhere that it used to just eat barriers
 used by XFS, making it less safe than simple partitions.

Oh, any additional layers open another can of worms.  On Linux until
very recently using LVM or software raid meant that only disabled
write caches were safe.



Re: Notes on block I/O data integrity

2009-08-25 Thread Nikola Ciprich
Hello Christoph,

thanks a lot for this overview, it answers a lot of my questions!
May I suggest you put it somewhere on the wiki so it doesn't get
forgotten in the mailing list archives only?
It also raises a few new questions though. We have experienced postgresql
database corruptions lately, two times to be exact. The first time I blamed
a server crash, but lately a (freshly created) database got corrupted for the
second time and there were no crashes since the initialisation. The server
hardware is surely OK. I didn't have much time to look into this
yet, but your mail just poked me to return to the subject. The situation
is a bit more complex, as there are two additional layers of storage there:
we're using SATA/SAS drives, network-mirrored by DRBD, clustered LVM on top
of those, and finally qemu-kvm using virtio on top of the created logical
volumes. So there are plenty of possible culprits, but your mention of virtio
being unsafe with cache=writethrough (which is the default for drive
types other than qcow) leads me to suspect that this might be the cause of
the problem. Databases are sensitive to request reordering, so I guess
using virtio for the postgres storage was quite stupid of me :(
So my question is, could you please advise me a bit on the storage
configuration? virtio performed much better than SCSI, but of course
data integrity is crucial, so would you suggest using SCSI instead?
DRBD doesn't have a problem with barriers, and clustered LVM SHOULD not
have problems with them either, as we're using just striped volumes, but I'll
check it to be sure. So is it safe for me to keep cache=writethrough
for the database volume?

thanks a lot in advance for any hints!

with best regards

nik

On Tue, Aug 25, 2009 at 08:11:20PM +0200, Christoph Hellwig wrote:
 As various people wanted to know how the various data integrity patches
 I've send out recently play together here's a small writeup on what
 issues we have in QEMU and how to fix it:
 
 There are two major aspects of data integrity we need to care in the
 QEMU block I/O code:
 
  (1) stable data storage - we must be able to force data out of caches
  onto the stable media, and we must get completion notification for it.
  (2) request ordering - we must be able to make sure some I/O request
  do not get reordered with other in-flight requests before or after
  it.
 
 Linux uses two related abstractions to implement this (other operating
 systems are probably similar)
 
  (1) a cache flush request that flushes the whole volatile write cache to
  stable storage
  (2) a barrier request, which
   (a) is guaranteed to actually go all the way to stable storage
   (b) is not reordered versus any requests before or after it
 
 For disks not using volatile write caches the cache flush is a no-op,
 and barrier requests are implemented by draining the queue of
 outstanding requests before the barrier request, and only allowing new
 requests to proceed after it has finished.  Instead of the queue drain
 tag ordering could be used, but at this point that is not the case in
 Linux.
 
 For disks using volatile write caches, the cache flush is implemented by
 a protocol specific request, and the barrier requests are implemented
 by performing cache flushes before and after the barrier request, in
 addition to the draining mentioned above.  The second cache flush can be
 replaced by setting the Force Unit Access bit on the barrier request 
 on modern disks.
 
 
 The above is supported by the QEMU emulated disks in the following way:
 
   - The IDE disk emulation implements the ATA WIN_FLUSH_CACHE/
 WIN_FLUSH_CACHE_EXT commands to flush the drive cache, but does not
 indicate a volatile write cache in the ATA IDENTIFY command.  Because
 of that guests do not actually send down cache flush requests.  Linux
 guests do however drain the I/O queues to guarantee ordering in absence
 of volatile write caches.
   - The SCSI disk emulation implements the SCSI SYNCHRONIZE_CACHE command,
 and also advertises the write cache enabled bit.  This means Linux
 sends down cache flush requests to implement barriers, and provides
 sufficient queue draining.
   - The virtio-blk driver does not implement any cache flush command.
 And while there is a virtio-blk feature bit for barrier support
 it is not supported by virtio-blk.  Due to the lack of a cache flush
 command it also is insufficient to implement the required data
 integrity semantics.  Currently the Linux virtio-blk driver does not
 advertise any form of barrier support, and we don't even get the
 queue draining required for proper operation in a cache-less
 environment.
 
 The I/O from these front end drivers maps to different host kernel I/O
 patterns depending on the cache= drive command line option.  There are three
 choices for it:
 
  (a) cache=writethrough
  (b) cache=writeback
  (c) cache=none
 
 (a) means all writes are synchronous (O_DSYNC), which means the 

Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-25 Thread Davide Libenzi
On Tue, 25 Aug 2009, Michael S. Tsirkin wrote:

 Yes, we don't want that. The best thing is to try to restate the problem
 in a way that is generic, and then either solve it or, better, use an
 existing solution. Right?
 
 I thought I had that, but apparently not.  The reason I'm Cc-ing you is
 not to try and spam you until you give up and accept the patch; it's in
 the hope that you will see the pattern behind our usage and help
 generalize it.
 
 If I understand it correctly, you believe this is not possible and so
 any solution will have to be in KVM? Or maybe I didn't state the problem
 clearly enough and should restate it?

Please do.



- Davide




ide vs. scsi read throughput

2009-08-25 Thread Saul Tamari
Hi,


While using kvm88, I tried comparing the read throughput between PIIX
IDE and the LSI 53c895a emulation.

My test code is:
while : ; do dd if=/dev/sda1 of=/dev/null bs=4096 count=1024000 >> results_file 2>&1; done

I averaged the results with:
cat results_file | grep -v records | cut --delimiter=' ' -f 8 | awk
'{n++; a[n]=$1; t+=$1; printf("cur=%d   avg=%f   n=%d\n", $1, t/n, n)}
 END {avg=t/n; for (i=1;i<=n;i++) sd+=(a[i] - avg) * (a[i] - avg);
printf("\n\nstd_dev=%f\n", sqrt(sd/n))}'


I ran a hundred loops on:
1. The native host's .img file
2. PIIX ide virtual drive
3. LSI scsi virtual drive

The average read throughput results I got are:
100 GB/sec on the native host (standard deviation is 3.7)
55 GB/sec on the virtual ide drive (standard deviation is 21.8)
60 GB/sec on the virtual scsi drive (standard deviation is 5.8)


Is there a way to make the virtual drives perform better and maybe get
results that are closer to the host's?


Another issue I noticed while looking at the results is the large
fluctuation in the ide results. The standard deviation of
the ide results is very large. A large portion of the results cluster
around 30GB/sec, with another portion around 80GB/sec.
Is there any meaningful explanation for this large variation? Is there
some issue in QEMU's ide code path that causes this?


Thanks,
S


syslog overflowing with latest kvm tree

2009-08-25 Thread renevant


Hello,

I got lots of the following with the latest tree:

Aug 25 22:01:44 Athas kernel: emulation failed (unhandled instruction) rip
811b959a 66 8b 11 a8
Aug 25 22:01:44 Athas kernel: emulation failed (unhandled instruction) rip
811b959a 66 8b 11 a8
Aug 25 22:01:44 Athas kernel: emulation failed (unhandled instruction) rip
811b959a 66 8b 11 a8
Aug 25 22:01:44 Athas kernel: emulation failed (unhandled instruction) rip
811b959a 66 8b 11 a8
Aug 25 22:01:45 Athas kernel: emulation failed (unhandled instruction) rip
811b959a 66 8b 11 a8
Aug 25 22:01:45 Athas kernel: emulation failed (unhandled instruction) rip
811b959a 66 8b 11 a8
Aug 25 22:01:45 Athas kernel: emulation failed (unhandled instruction) rip
811b959a 66 8b 11 a8
Aug 25 22:01:45 Athas kernel: emulation failed (unhandled instruction) rip
811b959a 66 8b 11 a8
Aug 25 22:01:45 Athas kernel: emulation failed (unhandled instruction) rip
811b959a 66 8b 11 a8
Aug 25 22:01:45 Athas kernel: emulation failed (unhandled instruction) rip
811b959a 66 8b 11 a8
Aug 25 22:01:51 Athas kernel: __ratelimit: 50538 callbacks suppressed
Aug 25 22:01:51 Athas kernel: emulation failed (unhandled instruction) rip
811b959a 66 8b 11 a8
Aug 25 22:01:51 Athas kernel: emulation failed (unhandled instruction) rip
811b959a 66 8b 11 a8
Aug 25 22:01:51 Athas kernel: emulation failed (unhandled instruction) rip
811b959a 66 8b 11 a8
Aug 25 22:01:52 Athas kernel: emulation failed (unhandled instruction) rip
811b959a 66 8b 11 a8
Aug 25 22:01:52 Athas kernel: emulation failed (unhandled instruction) rip
811b959a 66 8b 11 a8
Aug 25 22:01:52 Athas kernel: emulation failed (unhandled instruction) rip
811b959a 66 8b 11 a8
Aug 25 22:01:52 Athas kernel: emulation failed (unhandled instruction) rip
811b959a 66 8b 11 a8
Aug 25 22:01:52 Athas kernel: emulation failed (unhandled instruction) rip
811b959a 66 8b 11 a8
Aug 25 22:01:52 Athas kernel: emulation failed (unhandled instruction) rip
811b959a 66 8b 11 a8
Aug 25 22:01:52 Athas kernel: emulation failed (unhandled instruction) rip
811b959a 66 8b 11 a8
Aug 25 22:01:56 Athas kernel: __ratelimit: 5766 callbacks suppressed


Wasn't doing anything fancy, just had a 64 bit guest compiling a kernel. Just
letting you guys know; I'm not sure if it's because of something I've done, as I
do merge the tree with Linus'.

Regards,

Will Trives






AlacrityVM benchmark numbers updated

2009-08-25 Thread Gregory Haskins
We are pleased to announce the availability of the latest networking
benchmark numbers for AlacrityVM.  We've made several tweaks to the
original v0.1 release to improve performance.  The most notable is a
switch from get_user_pages to switch_mm+copy_[to/from]_user thanks to a
review suggestion from Michael Tsirkin (as well as his patch to
implement it).
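
For the curious, the shape of the change is roughly this (a sketch; it
uses the use_mm()/unuse_mm() helpers, which wrap the same switch_mm
idea -- header location and exact availability assumed):

	#include <linux/mmu_context.h>	/* use_mm()/unuse_mm(), assumed */
	#include <linux/uaccess.h>

	/* Instead of pinning pages with get_user_pages(), the worker
	 * thread temporarily adopts the guest process's mm and then
	 * uses plain copy_to_user()/copy_from_user(). */
	static int copy_to_guest(struct mm_struct *mm, void __user *uaddr,
				 const void *buf, size_t len)
	{
		int ret = 0;

		use_mm(mm);		/* adopt the guest's address space */
		if (copy_to_user(uaddr, buf, len))
			ret = -EFAULT;
		unuse_mm(mm);		/* back to the kernel thread's mm */
		return ret;
	}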

This change alone accounted for freeing up an additional 1.2Gbps, which
is over a 25% improvement from v0.1.  The previous numbers were 4560Mbps
before the change, and 5708Mbps after (for 1500mtu over 10GE).  This
moves us ever closer to the goal of native performance under virtualization.

These changes will be incorporated into the upcoming v0.2 release.  For
more details, please visit our project wiki:

http://developer.novell.com/wiki/index.php/AlacrityVM

Kind Regards,
-Greg





Re: Page allocation failures in guest

2009-08-25 Thread Rusty Russell
On Fri, 14 Aug 2009 05:55:48 am Pierre Ossman wrote:
 On Wed, 12 Aug 2009 15:01:52 +0930
 Rusty Russell ru...@rustcorp.com.au wrote:
  Subject: virtio: net refill on out-of-memory
... 
 Patch applied. Now we wait. :)

Any results?

Thanks,
Rusty.


[ kvm-Bugs-2843250 ] live migration fails with emulated scsi drive

2009-08-25 Thread SourceForge.net
Bugs item #2843250, was opened at 2009-08-24 05:24
Message generated for change (Comment added) made by jamessong
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2843250&group_id=180599

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Trent Johnson (tljohnsn)
Assigned to: Nobody/Anonymous (nobody)
Summary: live migration fails with emulated scsi drive

Initial Comment:
CPU model: Intel(R) Xeon(R) CPU   L5335
KVM version 88
host kernel: centos 2.6.18-128.4.1.el5
guest: Tried both centos 4.8 - 32bit latest kernel and centos 5.3 64 bit, same 
kernel as above
qemu command line:
/usr/bin/qemu-kvm  -M pc -m 1024 -smp 2 -name c5test-lvm -uuid 
a298100a-67af-8f80-d32c-de271b4cefb1 -monitor pty -boot c -drive 
file=,if=ide,media=cdrom,index=2 -drive file=/dev/drbd1,if=scsi,index=0,boot=on 
-net nic,macaddr=54:52:00:30:7f:ae,vlan=0,name=nic.0 -net tap,vlan=0,name=tap.0 
-serial pty -parallel none -usb -curses -vga cirrus

Also tried -no-kvm and still get the errors
sd 0:0:0:0: ABORT operation timed-out.
sd 0:0:0:0: DEVICE RESET operation started.
sd 0:0:0:0: DEVICE RESET operation timed-out.
sd 0:0:0:0: BUS RESET operation started.
sym0: suspicious SCSI data while resetting the BUS.
sym0: dp1,d15-8,dp0,d7-0,rst,req,ack,bsy,sel,atn,msg,c/d,i/o = 0x0, expecting 
0x100
sd 0:0:0:0: BUS RESET operation timed-out.
sd 0:0:0:0: HOST RESET operation started.
sym0: suspicious SCSI data while resetting the BUS.
sym0: dp1,d15-8,dp0,d7-0,rst,req,ack,bsy,sel,atn,msg,c/d,i/o = 0x0, expecting 
0x100
sym0: the chip cannot lock the frequency
sym0: SCSI BUS has been reset.
sd 0:0:0:0: HOST RESET operation timed-out.
sd 0:0:0:0: scsi: Device offlined - not ready after error recovery


Live migration will give errors and stop io to the disk when using an emulated 
scsi drive on the guest.  It does not fail when using an ide drive.  The 
errors I get are listed above.



--

Comment By: James Song (jamessong)
Date: 2009-08-26 11:25

Message:
Is the scsi drive also already set up on the destination box?

--

Comment By: Nolan Leake (kijiki)
Date: 2009-08-24 10:29

Message:
This is fixed in qemu-kvm (and qemu upstream, for that matter).

--



Re: Page allocation failures in guest

2009-08-25 Thread Pierre Ossman
On Wed, 26 Aug 2009 11:47:17 +0930
Rusty Russell ru...@rustcorp.com.au wrote:

 On Fri, 14 Aug 2009 05:55:48 am Pierre Ossman wrote:
  On Wed, 12 Aug 2009 15:01:52 +0930
  Rusty Russell ru...@rustcorp.com.au wrote:
   Subject: virtio: net refill on out-of-memory
 ... 
  Patch applied. Now we wait. :)
 
 Any results?
 

It's been up for 12 days, so I'd say it works. But there is nothing in
dmesg, which suggests I haven't triggered the condition yet.

I wonder if there might be something broken with Fedora's kernel. :/
(I am running the same upstream version, and their conf, for this test,
but not all of their patches)

Rgds
-- 
 -- Pierre Ossman

  WARNING: This correspondence is being monitored by the
  Swedish government. Make sure your server uses encryption
  for SMTP traffic and consider using PGP for end-to-end
  encryption.

