Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server

2009-11-04 Thread Andi Kleen
Michael S. Tsirkin m...@redhat.com writes:

Haven't really read the whole thing, just noticed something at a glance.

 +/* Expects to be always run from workqueue - which acts as
 + * read-side critical section for our kind of RCU. */
 +static void handle_tx(struct vhost_net *net)
 +{
 + struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
 + unsigned head, out, in, s;
 + struct msghdr msg = {
 + .msg_name = NULL,
 + .msg_namelen = 0,
 + .msg_control = NULL,
 + .msg_controllen = 0,
 + .msg_iov = vq->iov,
 + .msg_flags = MSG_DONTWAIT,
 + };
 + size_t len, total_len = 0;
 + int err, wmem;
 + size_t hdr_size;
 + struct socket *sock = rcu_dereference(vq->private_data);
 + if (!sock)
 + return;
 +
 + wmem = atomic_read(&sock->sk->sk_wmem_alloc);
 + if (wmem >= sock->sk->sk_sndbuf)
 + return;
 +
 + use_mm(net->dev.mm);

I haven't gone over all this code in detail, but that isolated reference count
use looks suspicious. What prevents the mm from going away before
you increment, if it's not the current one?

-Andi 

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server

2009-11-04 Thread Michael S. Tsirkin
On Tue, Nov 03, 2009 at 10:11:12PM +0100, Eric Dumazet wrote:
 Michael S. Tsirkin wrote:
  
  Paul, you acked this previously. Should I add your Acked-by line so
  people calm down?  If you would rather I replace
  rcu_dereference/rcu_assign_pointer with rmb/wmb, I can do this.
  Or maybe patch Documentation to explain this RCU usage?
  
 
 So you believe I am over-reacting to this dubious use of RCU?
 
 RCU documentation is already very complex; we don't need to add yet another
 subtle use, which makes it less readable.
 
 It seems you use the 'RCU api' in drivers/vhost/net.c as convenient macros:
 
 #define rcu_dereference(p) ({ \
 typeof(p) _p1 = ACCESS_ONCE(p); \
 smp_read_barrier_depends(); \
 (_p1); \
 })
 
 #define rcu_assign_pointer(p, v) \
 ({ \
 if (!__builtin_constant_p(v) || \
 ((v) != NULL)) \
 smp_wmb(); \
 (p) = (v); \
 })
 
 
 There are plenty of regular uses of smp_wmb() in the kernel that are not
 related to Read-Copy-Update;
 there is nothing wrong with using barriers with appropriate comments.

Well, what I do has classic RCU characteristics: readers do not take
locks, writers take a lock and flush after update. This is why I believe
rcu_dereference and rcu_assign_pointer are more appropriate here than
open-coding barriers.
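
Roughly, the pattern is (an illustrative sketch only -- the function name
and the workqueue variable below are made up, this is not the actual
vhost code):

	/* Writer side: publish the new backend under the lock, then flush
	 * the workqueue so any reader item still executing completes. */
	static void set_backend(struct vhost_virtqueue *vq, struct socket *sock)
	{
		mutex_lock(&vq->mutex);
		rcu_assign_pointer(vq->private_data, sock);
		mutex_unlock(&vq->mutex);
		flush_workqueue(vhost_workqueue); /* plays the synchronize_rcu() role */
	}

Readers are the workqueue items themselves; they just do
rcu_dereference(vq->private_data), as in handle_tx() above.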

Before deciding whether it's a good idea to open-code barriers
instead, I would like to hear Paul's opinion.

 
 (And you already use mb(), wmb(), rmb(), smp_wmb() in your patch)

Yes, the virtio guest pretty much forces this; there's no way to share
a lock with the guest.

 BTW there is at least one locking bug in vhost_net_set_features()
 
 Apparently, mutex_unlock() doesn't trigger a fault if the mutex is not locked
 by the current thread... even with DEBUG_MUTEXES / DEBUG_LOCK_ALLOC
 
 
 static void vhost_net_set_features(struct vhost_net *n, u64 features)
 {
size_t hdr_size = features & (1 << VHOST_NET_F_VIRTIO_NET_HDR) ?
sizeof(struct virtio_net_hdr) : 0;
int i;
 !  mutex_unlock(&n->dev.mutex);
n->dev.acked_features = features;
smp_wmb();
for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
mutex_lock(&n->vqs[i].mutex);
n->vqs[i].hdr_size = hdr_size;
mutex_unlock(&n->vqs[i].mutex);
}
mutex_unlock(&n->dev.mutex);
vhost_net_flush(n);
 }

Thanks very much for spotting this! Will fix.
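
Presumably the fix is simply to take the lock instead of dropping it at
the top of the function; a sketch of what I expect the corrected version
to look like (not the final patch):

	static void vhost_net_set_features(struct vhost_net *n, u64 features)
	{
		size_t hdr_size = features & (1 << VHOST_NET_F_VIRTIO_NET_HDR) ?
			sizeof(struct virtio_net_hdr) : 0;
		int i;
		mutex_lock(&n->dev.mutex);	/* was erroneously mutex_unlock() */
		n->dev.acked_features = features;
		smp_wmb();
		for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
			mutex_lock(&n->vqs[i].mutex);
			n->vqs[i].hdr_size = hdr_size;
			mutex_unlock(&n->vqs[i].mutex);
		}
		mutex_unlock(&n->dev.mutex);
		vhost_net_flush(n);
	}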

-- 
MST


Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server

2009-11-04 Thread Michael S. Tsirkin
On Wed, Nov 04, 2009 at 12:08:47PM +0100, Andi Kleen wrote:
 Michael S. Tsirkin m...@redhat.com writes:
 
 Haven't really read the whole thing, just noticed something at a glance.
 
  +/* Expects to be always run from workqueue - which acts as
  + * read-side critical section for our kind of RCU. */
  +static void handle_tx(struct vhost_net *net)
  +{
   +   struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
   +   unsigned head, out, in, s;
   +   struct msghdr msg = {
   +   .msg_name = NULL,
   +   .msg_namelen = 0,
   +   .msg_control = NULL,
   +   .msg_controllen = 0,
   +   .msg_iov = vq->iov,
   +   .msg_flags = MSG_DONTWAIT,
   +   };
   +   size_t len, total_len = 0;
   +   int err, wmem;
   +   size_t hdr_size;
   +   struct socket *sock = rcu_dereference(vq->private_data);
   +   if (!sock)
   +   return;
   +
   +   wmem = atomic_read(&sock->sk->sk_wmem_alloc);
   +   if (wmem >= sock->sk->sk_sndbuf)
   +   return;
   +
   +   use_mm(net->dev.mm);
 
 I haven't gone over all this code in detail, but that isolated reference count
 use looks suspicious. What prevents the mm from going away before
 you increment, if it's not the current one?

We take a reference to it before we start any virtqueues,
and stop all virtqueues before we drop the reference:
/* Caller should have device mutex */
static long vhost_dev_set_owner(struct vhost_dev *dev)
{
/* Is there an owner already? */
if (dev->mm)
return -EBUSY;
/* No owner, become one */
dev->mm = get_task_mm(current);
return 0;
}

And vhost_dev_cleanup:


if (dev->mm)
mmput(dev->mm);
dev->mm = NULL;
}


Fine?

 -Andi 
 
 -- 
 a...@linux.intel.com -- Speaking for myself only.


Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server

2009-11-04 Thread Andi Kleen
 Fine?

I cannot say -- are there paths that could drop the device beforehand?
(as in do you hold a reference to it?)

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server

2009-11-04 Thread Michael S. Tsirkin
On Wed, Nov 04, 2009 at 02:15:33PM +0100, Andi Kleen wrote:
 On Wed, Nov 04, 2009 at 03:08:28PM +0200, Michael S. Tsirkin wrote:
  On Wed, Nov 04, 2009 at 01:59:57PM +0100, Andi Kleen wrote:
Fine?
   
   I cannot say -- are there paths that could drop the device beforehand?
  
  Do you mean drop the mm reference?
 
 No the reference to the device, which owns the mm for you.

The device is created when the file is opened and destroyed
when the file is closed. So I think the fs code handles the
reference counting for me: it won't call the file cleanup
callback while some userspace process still has the file open.
Right?
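
(In other words, the usual character-device open/release pairing. A
rough sketch of the lifetime rule I am relying on -- the names and
details here are illustrative, not the actual code:)

	static int vhost_net_open(struct inode *inode, struct file *f)
	{
		struct vhost_net *n = kzalloc(sizeof *n, GFP_KERNEL);
		if (!n)
			return -ENOMEM;
		/* ... initialize the device ... */
		f->private_data = n;	/* device lives as long as the file */
		return 0;
	}

	static int vhost_net_release(struct inode *inode, struct file *f)
	{
		struct vhost_net *n = f->private_data;
		/* The VFS calls this only after the last reference to the
		 * file is gone, so no process can still be using it. */
		vhost_dev_cleanup(&n->dev);	/* stops vqs, drops the mm */
		kfree(n);
		return 0;
	}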

  
   (as in do you hold a reference to it?)
  
  By design I think I always have a reference to mm before I use it.
  
  This works like this:
  ioctl SET_OWNER - calls get_task_mm, I think this gets a reference to mm
  ioctl SET_BACKEND - checks that SET_OWNER was run, starts virtqueue
  ioctl RESET_OWNER - stops virtqueues, drops the reference to mm
  file close - stops virtqueues, if we still have it then drops mm
  
  This is why I think I can call use_mm/unuse_mm while virtqueue is running,
  safely.
  Makes sense?
 
 Do you protect against another thread doing RESET_OWNER in parallel while
 RESET_OWNER runs?

Yes, I have a mutex in the device for that. Same with SET_BACKEND.

 -Andi
 
 -- 
 a...@linux.intel.com -- Speaking for myself only.


Re: [PATCHv7 1/3] tun: export underlying socket

2009-11-04 Thread David Miller
From: Michael S. Tsirkin m...@redhat.com
Date: Tue, 3 Nov 2009 19:24:00 +0200

 Assuming it's okay with davem, I think it makes sense to merge this
 patch through Rusty's tree because vhost is the first user of the new
 interface.  Posted here for completeness.

I'm fine with that, please add my:

Acked-by: David S. Miller da...@davemloft.net


Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server

2009-11-04 Thread Andi Kleen
On Wed, Nov 04, 2009 at 03:17:36PM +0200, Michael S. Tsirkin wrote:
 On Wed, Nov 04, 2009 at 02:15:33PM +0100, Andi Kleen wrote:
  On Wed, Nov 04, 2009 at 03:08:28PM +0200, Michael S. Tsirkin wrote:
   On Wed, Nov 04, 2009 at 01:59:57PM +0100, Andi Kleen wrote:
 Fine?

I cannot say -- are there paths that could drop the device beforehand?
   
   Do you mean drop the mm reference?
  
  No the reference to the device, which owns the mm for you.
 
 The device is created when the file is opened and destroyed
 when the file is closed. So I think the fs code handles the
 reference counting for me: it won't call the file cleanup
 callback while some userspace process still has the file open.
 Right?

Yes.

But the semantics when someone inherits such an fd through exec
or through file descriptor passing would surely be interesting.
You would still do IO on the old VM.

I guess it would be a good way to confuse memory accounting schemes 
or administrators @)

It would all be saner if this were a single atomic step.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server

2009-11-04 Thread Michael S. Tsirkin
On Wed, Nov 04, 2009 at 02:37:28PM +0100, Andi Kleen wrote:
 On Wed, Nov 04, 2009 at 03:17:36PM +0200, Michael S. Tsirkin wrote:
  On Wed, Nov 04, 2009 at 02:15:33PM +0100, Andi Kleen wrote:
   On Wed, Nov 04, 2009 at 03:08:28PM +0200, Michael S. Tsirkin wrote:
On Wed, Nov 04, 2009 at 01:59:57PM +0100, Andi Kleen wrote:
  Fine?
 
 I cannot say -- are there paths that could drop the device beforehand?

Do you mean drop the mm reference?
   
   No the reference to the device, which owns the mm for you.
  
   The device is created when the file is opened and destroyed
   when the file is closed. So I think the fs code handles the
   reference counting for me: it won't call the file cleanup
   callback while some userspace process still has the file open.
   Right?
 
 Yes.
 
  But the semantics when someone inherits such an fd through exec
  or through file descriptor passing would surely be interesting.
  You would still do IO on the old VM.
 
 I guess it would be a good way to confuse memory accounting schemes 
 or administrators @)
  It would all be saner if this were a single atomic step.
 
 -Andi

Actually, I have this atomic. The first thing a child process will
do is SET_OWNER: this is required before any other operation.

SET_OWNER atomically (under mutex) does two things:
- check that there is no other owner
- get mm and set current process as owner

I hope this addresses your concern?
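
Schematically, the ioctl path looks like this (a sketch based on the
vhost_dev_set_owner() quoted earlier in the thread, with the dispatch
details elided):

	case VHOST_SET_OWNER:
		mutex_lock(&dev->mutex);
		/* Fails with -EBUSY if some process already owns the device;
		 * otherwise takes a reference to current's mm. */
		r = vhost_dev_set_owner(dev);
		mutex_unlock(&dev->mutex);
		break;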

-- 
MST


[PATCHv8 0/3] vhost: a kernel-level virtio server

2009-11-04 Thread Michael S. Tsirkin
Ok, I think I've addressed all comments so far here.
Rusty, I'd like this to go into linux-next, through your tree, and
hopefully 2.6.33.  What do you think?

---

This implements vhost: a kernel-level backend for virtio.
The main motivation for this work is to reduce virtualization
overhead for virtio by removing system calls on data path,
without guest changes. For virtio-net, this removes up to
4 system calls per packet: vm exit for kick, reentry for kick,
iothread wakeup for packet, interrupt injection for packet.

This driver is pretty minimal, but it's fully functional (including
migration support interfaces), and already shows performance (especially
latency) improvement over userspace.

Some more detailed description attached to the patch itself.

The patches apply to both 2.6.32-rc6 and kvm.git.  I'd like them to go
into linux-next if possible.  Please comment.

Changelog from v7:
- Add note on RCU usage, mirroring this in vhost/vhost.h
- Fix locking typo noted by Eric Dumazet
- Fix warnings on 32 bit

Changelog from v6:
- review comments by Daniel Walker addressed
- checkpatch cleanup
- fix build on 32 bit
- maintainers entry corrected

Changelog from v5:
- tun support
- backends with virtio net header support (enables GSO, checksum etc)
- 32 bit compat fixed
- support indirect buffers, tx exit mitigation,
  tx interrupt mitigation
- support write logging (allows migration without virtio ring code in userspace)

Changelog from v4:
- disable rx notification when have rx buffers
- addressed all comments from Rusty's review
- copy bugfixes from lguest commits:
ebf9a5a99c1a464afe0b4dfa64416fc8b273bc5c
e606490c440900e50ccf73a54f6fc6150ff40815

Changelog from v3:
- checkpatch fixes

Changelog from v2:
- Comments on RCU usage
- Compat ioctl support
- Make variable static
- Copied more idiomatic English from Rusty

Changes from v1:
- Move use_mm/unuse_mm from fs/aio.c to mm instead of copying.
- Reorder code to avoid need for forward declarations
- Kill a couple of debugging printks

Michael S. Tsirkin (3):
  tun: export underlying socket
  mm: export use_mm/unuse_mm to modules
  vhost_net: a kernel-level virtio server

 MAINTAINERS                |    9 +
 arch/x86/kvm/Kconfig       |    1 +
 drivers/Makefile           |    1 +
 drivers/net/tun.c          |  101 +++-
 drivers/vhost/Kconfig      |   11 +
 drivers/vhost/Makefile     |    2 +
 drivers/vhost/net.c        |  633 +
 drivers/vhost/vhost.c      |  970 +
 drivers/vhost/vhost.h      |  158 +++
 include/linux/Kbuild       |    1 +
 include/linux/if_tun.h     |   14 +
 include/linux/miscdevice.h |    1 +
 include/linux/vhost.h      |  126 ++
 mm/mmu_context.c           |    3 +
 14 files changed, 2012 insertions(+), 19 deletions(-)
 create mode 100644 drivers/vhost/Kconfig
 create mode 100644 drivers/vhost/Makefile
 create mode 100644 drivers/vhost/net.c
 create mode 100644 drivers/vhost/vhost.c
 create mode 100644 drivers/vhost/vhost.h
 create mode 100644 include/linux/vhost.h


[PATCHv8 1/3] tun: export underlying socket

2009-11-04 Thread Michael S. Tsirkin
The tun device looks similar to a packet socket
in that both pass complete frames from/to userspace.

This patch fills in enough fields in the socket underlying the tun driver
to support sendmsg/recvmsg operations, and the message flags
MSG_TRUNC and MSG_DONTWAIT, and exports access to this socket
to modules.  Regular read/write behaviour is unchanged.

This way, code that uses raw sockets to inject packets
into a physical device can support injecting
packets into the host network stack almost without modification.

The first user of this interface will be the vhost virtualization
accelerator.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
Acked-by: Herbert Xu herb...@gondor.apana.org.au
Acked-by: David S. Miller da...@davemloft.net
---

DaveM agreed that this will go in through Rusty's tree together
with vhost patch, because vhost is the first user of the new
interface.
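
For reference, a module consuming this interface would drive the tun
queue roughly as follows (a sketch with msghdr setup and error handling
elided; only tun_get_socket() itself comes from this patch):

	struct socket *sock = tun_get_socket(tunfile); /* tunfile: the tun fd's struct file */
	if (IS_ERR(sock))
		return PTR_ERR(sock);
	/* inject a frame into the host network stack */
	err = sock->ops->sendmsg(NULL, sock, &msg, len);
	/* ... or pull one out without blocking, truncating if too long */
	err = sock->ops->recvmsg(NULL, sock, &msg, len,
				 MSG_DONTWAIT | MSG_TRUNC);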

 drivers/net/tun.c      |  101 +++-
 include/linux/if_tun.h |   14 +
 2 files changed, 96 insertions(+), 19 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 4fdfa2a..18f8876 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -144,6 +144,7 @@ static int tun_attach(struct tun_struct *tun, struct file *file)
 	err = 0;
 	tfile->tun = tun;
 	tun->tfile = tfile;
+	tun->socket.file = file;
 	dev_hold(tun->dev);
 	sock_hold(tun->socket.sk);
 	atomic_inc(&tfile->count);
@@ -158,6 +159,7 @@ static void __tun_detach(struct tun_struct *tun)
/* Detach from net device */
 	netif_tx_lock_bh(tun->dev);
 	tun->tfile = NULL;
+	tun->socket.file = NULL;
 	netif_tx_unlock_bh(tun->dev);
 
/* Drop read queue */
@@ -387,7 +389,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 	/* Notify and wake up reader process */
 	if (tun->flags & TUN_FASYNC)
 		kill_fasync(&tun->fasync, SIGIO, POLL_IN);
-	wake_up_interruptible(&tun->socket.wait);
+	wake_up_interruptible_poll(&tun->socket.wait, POLLIN |
+				   POLLRDNORM | POLLRDBAND);
return NETDEV_TX_OK;
 
 drop:
@@ -743,7 +746,7 @@ static __inline__ ssize_t tun_put_user(struct tun_struct *tun,
 	len = min_t(int, skb->len, len);
 
 	skb_copy_datagram_const_iovec(skb, 0, iv, total, len);
-	total += len;
+	total += skb->len;
 
 	tun->dev->stats.tx_packets++;
 	tun->dev->stats.tx_bytes += len;
@@ -751,34 +754,23 @@ static __inline__ ssize_t tun_put_user(struct tun_struct *tun,
 	return total;
 }
 
-static ssize_t tun_chr_aio_read(struct kiocb *iocb, const struct iovec *iv,
-			    unsigned long count, loff_t pos)
+static ssize_t tun_do_read(struct tun_struct *tun,
+			   struct kiocb *iocb, const struct iovec *iv,
+			   ssize_t len, int noblock)
 {
-	struct file *file = iocb->ki_filp;
-	struct tun_file *tfile = file->private_data;
-	struct tun_struct *tun = __tun_get(tfile);
 	DECLARE_WAITQUEUE(wait, current);
 	struct sk_buff *skb;
-	ssize_t len, ret = 0;
-
-	if (!tun)
-		return -EBADFD;
+	ssize_t ret = 0;
 
 	DBG(KERN_INFO "%s: tun_chr_read\n", tun->dev->name);
 
-	len = iov_length(iv, count);
-	if (len < 0) {
-		ret = -EINVAL;
-		goto out;
-	}
-
 	add_wait_queue(&tun->socket.wait, &wait);
 	while (len) {
 		current->state = TASK_INTERRUPTIBLE;
 
 		/* Read frames from the queue */
 		if (!(skb = skb_dequeue(&tun->socket.sk->sk_receive_queue))) {
-			if (file->f_flags & O_NONBLOCK) {
+			if (noblock) {
 				ret = -EAGAIN;
 				break;
 			}
@@ -805,6 +797,27 @@ static ssize_t tun_chr_aio_read(struct kiocb *iocb, const struct iovec *iv,
 	current->state = TASK_RUNNING;
 	remove_wait_queue(&tun->socket.wait, &wait);
 
+	return ret;
+}
+
+static ssize_t tun_chr_aio_read(struct kiocb *iocb, const struct iovec *iv,
+			    unsigned long count, loff_t pos)
+{
+	struct file *file = iocb->ki_filp;
+	struct tun_file *tfile = file->private_data;
+	struct tun_struct *tun = __tun_get(tfile);
+	ssize_t len, ret;
+
+	if (!tun)
+		return -EBADFD;
+	len = iov_length(iv, count);
+	if (len < 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = tun_do_read(tun, iocb, iv, len, file->f_flags & O_NONBLOCK);
+	ret = min_t(ssize_t, ret, len);
 out:
 	tun_put(tun);
 	return ret;
@@ -847,7 +860,8 @@ static void tun_sock_write_space(struct sock *sk)
 	return;
 
 	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
-		wake_up_interruptible_sync(sk->sk_sleep);
+

[PATCHv8 2/3] mm: export use_mm/unuse_mm to modules

2009-11-04 Thread Michael S. Tsirkin
The vhost net module wants to copy to/from user memory from a kernel thread,
which needs use_mm. Export it to modules.

Acked-by: Andrea Arcangeli aarca...@redhat.com
Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 mm/mmu_context.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/mm/mmu_context.c b/mm/mmu_context.c
index ded9081..0777654 100644
--- a/mm/mmu_context.c
+++ b/mm/mmu_context.c
@@ -5,6 +5,7 @@
 
 #include <linux/mm.h>
 #include <linux/mmu_context.h>
+#include <linux/module.h>
 #include <linux/sched.h>
 
 #include <asm/mmu_context.h>
@@ -37,6 +38,7 @@ void use_mm(struct mm_struct *mm)
if (active_mm != mm)
mmdrop(active_mm);
 }
+EXPORT_SYMBOL_GPL(use_mm);
 
 /*
  * unuse_mm
@@ -56,3 +58,4 @@ void unuse_mm(struct mm_struct *mm)
enter_lazy_tlb(mm, tsk);
task_unlock(tsk);
 }
+EXPORT_SYMBOL_GPL(unuse_mm);
-- 
1.6.5.2.143.g8cc62



[PATCHv8 3/3] vhost_net: a kernel-level virtio server

2009-11-04 Thread Michael S. Tsirkin
What it is: vhost net is a character device that can be used to reduce
the number of system calls involved in virtio networking.
Existing virtio net code is used in the guest without modification.

There's similarity with vringfd, with some differences and reduced scope
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for
  migration, bug work-arounds in userspace)
- write logging is supported (good for migration)
- supports a memory table and not just an offset (needed for kvm)

Common virtio-related code has been put in a separate file, vhost.c, and
can be made into a separate module if/when more backends appear.  I used
Rusty's lguest.c as the source for developing this part: it supplied
me with witty comments I wouldn't be able to write myself.

What it is not: vhost net is not a bus, and not a generic new system
call. No assumptions are made on how the guest performs hypercalls.
Userspace hypervisors are supported as well as kvm.

How it works: Basically, we connect a virtio frontend (configured by
userspace) to a backend. The backend could be a network device or a tap
device.  The backend is also configured by userspace, including vlan/mac,
etc.

Status: This works for me, and I haven't seen any crashes.
Compared to userspace, people reported improved latency (as I save up to
4 system calls per packet), as well as better bandwidth and CPU
utilization.

Features that I plan to look at in the future:
- mergeable buffers
- zero copy
- scalability tuning: figure out the best threading model to use

Note on RCU usage (this is also documented in vhost.h, near
private_pointer which is the value protected by this variant of RCU):
what is happening is that the rcu_dereference() is being used
in a workqueue item.  The role of rcu_read_lock() is taken on by the
start of execution of the workqueue item, of rcu_read_unlock() by the
end of execution of the workqueue item, and of synchronize_rcu() by
flush_workqueue()/flush_work().
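
Schematically (an illustrative sketch of the substitution, not code
taken verbatim from the patch):

	/* A workqueue item plays the read-side critical section: its start
	 * stands in for rcu_read_lock(), its end for rcu_read_unlock(). */
	static void handle_tx_kick(struct work_struct *work)
	{
		struct vhost_virtqueue *vq =
			container_of(work, struct vhost_virtqueue, poll.work);
		struct socket *sock = rcu_dereference(vq->private_data);
		if (!sock)
			return;
		/* ... hand frames to sock ... */
	}

	/* The writer publishes a new pointer, then waits for every item
	 * already executing to finish -- the synchronize_rcu() equivalent. */
	rcu_assign_pointer(vq->private_data, newsock);
	flush_workqueue(workqueue);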

Acked-by: Arnd Bergmann a...@arndb.de
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Signed-off-by: Michael S. Tsirkin m...@redhat.com
---

Paul, I'm Cc'ing you so that you can verify that nothing significant changed
in vhost/net.c since you approved it the previous time.  Could you
please Ack the use of rcu_dereference/rcu_assign_pointer there?  Eric
Dumazet proposed that I open-code these; he suggested that this is an
abuse of these macros and not enough like RCU. What is your opinion?

 MAINTAINERS                |    9 +
 arch/x86/kvm/Kconfig       |    1 +
 drivers/Makefile           |    1 +
 drivers/vhost/Kconfig      |   11 +
 drivers/vhost/Makefile     |    2 +
 drivers/vhost/net.c        |  633 +
 drivers/vhost/vhost.c      |  970 +
 drivers/vhost/vhost.h      |  158 +++
 include/linux/Kbuild       |    1 +
 include/linux/miscdevice.h |    1 +
 include/linux/vhost.h      |  126 ++
 11 files changed, 1913 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/Kconfig
 create mode 100644 drivers/vhost/Makefile
 create mode 100644 drivers/vhost/net.c
 create mode 100644 drivers/vhost/vhost.c
 create mode 100644 drivers/vhost/vhost.h
 create mode 100644 include/linux/vhost.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 8824115..980a69b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5619,6 +5619,15 @@ S:   Maintained
 F: Documentation/filesystems/vfat.txt
 F: fs/fat/
 
+VIRTIO HOST (VHOST)
+M: Michael S. Tsirkin m...@redhat.com
+L: k...@vger.kernel.org
+L: virtualizat...@lists.osdl.org
+L: net...@vger.kernel.org
+S: Maintained
+F: drivers/vhost/
+F: include/linux/vhost.h
+
 VIA RHINE NETWORK DRIVER
 M: Roger Luethi r...@hellgate.ch
 S: Maintained
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index b84e571..94f44d9 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -64,6 +64,7 @@ config KVM_AMD
 
 # OK, it's a little counter-intuitive to do this, but it puts it neatly under
 # the virtualization menu.
+source drivers/vhost/Kconfig
 source drivers/lguest/Kconfig
 source drivers/virtio/Kconfig
 
diff --git a/drivers/Makefile b/drivers/Makefile
index 6ee53c7..81e3659 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -106,6 +106,7 @@ obj-$(CONFIG_HID)   += hid/
 obj-$(CONFIG_PPC_PS3)  += ps3/
 obj-$(CONFIG_OF)   += of/
 obj-$(CONFIG_SSB)  += ssb/
+obj-$(CONFIG_VHOST_NET)+= vhost/
 obj-$(CONFIG_VIRTIO)   += virtio/
 obj-$(CONFIG_VLYNQ)+= vlynq/
 obj-$(CONFIG_STAGING)  += staging/
diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
new file mode 100644
index 000..9f409f4
--- /dev/null
+++ b/drivers/vhost/Kconfig
@@ -0,0 +1,11 @@
+config VHOST_NET
+	tristate "Host kernel accelerator for virtio net (EXPERIMENTAL)"
+	depends on NET && EVENTFD && EXPERIMENTAL
+   ---help---
+ This kernel module 

Re: [PATCHv8 0/3] vhost: a kernel-level virtio server

2009-11-04 Thread Gregory Haskins
Michael S. Tsirkin wrote:
 Ok, I think I've addressed all comments so far here.
 Rusty, I'd like this to go into linux-next, through your tree, and
 hopefully 2.6.33.  What do you think?

I think the benchmark data is a prerequisite for merge consideration, IMO.

Do you have anything for us to look at?  I think comparison that show
the following are of interest:

throughput (e.g. netperf::TCP_STREAM): guest->host, guest->host->guest,
guest->host->remote, host->remote, remote->host->guest

latency (e.g. netperf::UDP_RR): same conditions as throughput

cpu-utilization

others?

Ideally, this should be at least between upstream virtio and vhost.
Bonus points if you include venet as well.

Kind regards,
-Greg




Re: [PATCHv8 0/3] vhost: a kernel-level virtio server

2009-11-04 Thread Michael S. Tsirkin
On Wed, Nov 04, 2009 at 11:02:15AM -0500, Gregory Haskins wrote:
 Michael S. Tsirkin wrote:
  Ok, I think I've addressed all comments so far here.
  Rusty, I'd like this to go into linux-next, through your tree, and
  hopefully 2.6.33.  What do you think?
 
 I think the benchmark data is a prerequisite for merge consideration, IMO.

Shirley Ma was kind enough to send me some measurement results showing
how kernel-level acceleration helps speed things up; you can find them here:
http://www.linux-kvm.org/page/VhostNet

Generally, I think that merging should happen *before* aggressive
benchmarking/performance tuning: otherwise there is a very substantial
risk that what is an optimization in one setup hurts performance in
another one. When the code is upstream, people can bisect to debug
regressions. Another good reason is that I can stop spending time
rebasing and start profiling.

 Do you have anything for us to look at?

For guest to host, compared to latest qemu with userspace virtio
backend, latency drops by a factor of 6, bandwidth doubles, cpu
utilization drops slightly :)

 I think comparison that show the following are of interest:
 
 throughput (e.g. netperf::TCP_STREAM): guest->host, guest->host->guest,
 guest->host->remote, host->remote, remote->host->guest
 
 latency (e.g. netperf::UDP_RR): same conditions as throughput
 
 cpu-utilization
 
 others?
 
 Ideally, this should be at least between upstream virtio and vhost.
 Bonus points if you include venet as well.

And vmxnet3 :)

 Kind regards,
 -Greg
 
-- 
MST


Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server

2009-11-04 Thread Michael S. Tsirkin
On Wed, Nov 04, 2009 at 03:41:47PM +0200, Michael S. Tsirkin wrote:
 On Wed, Nov 04, 2009 at 02:37:28PM +0100, Andi Kleen wrote:
  On Wed, Nov 04, 2009 at 03:17:36PM +0200, Michael S. Tsirkin wrote:
   On Wed, Nov 04, 2009 at 02:15:33PM +0100, Andi Kleen wrote:
On Wed, Nov 04, 2009 at 03:08:28PM +0200, Michael S. Tsirkin wrote:
 On Wed, Nov 04, 2009 at 01:59:57PM +0100, Andi Kleen wrote:
   Fine?
  
  I cannot say -- are there paths that could drop the device 
  beforehand?
 
 Do you mean drop the mm reference?

No the reference to the device, which owns the mm for you.
   
    The device is created when the file is opened and destroyed
    when the file is closed. So I think the fs code handles the
    reference counting for me: it won't call the file cleanup
    callback while some userspace process still has the file open.
    Right?
  
  Yes.
  
  But the semantics when someone inherits such an fd through exec
  or through file descriptor passing would surely be interesting.
  You would still do IO on the old VM.
  
  I guess it would be a good way to confuse memory accounting schemes 
  or administrators @)
  It would all be saner if this were a single atomic step.
  
  -Andi
 
 Actually, I have this atomic. The first thing a child process will
 do is SET_OWNER: this is required before any other operation.
 
 SET_OWNER atomically (under mutex) does two things:
 - check that there is no other owner
 - get mm and set current process as owner
 
 I hope this addresses your concern?

Andrea, since you looked at this design at the early stages,
maybe you can provide feedback on the following question:

vhost has an ioctl to do get_task_mm and store the mm in a per-file device
structure.  mmput is called when the file is closed.  vhost is careful not
to reference the mm after it has been put. There is also an atomic
mutual exclusion mechanism to ensure that vhost does not allow one
process to access another's mm, even if they share a vhost file
descriptor.  But this still means that the mm structure can outlive the
task if the file descriptor is shared with another process.

Other drivers, such as kvm, have the same property.

Do you think this is OK?

Thanks!

-- 
MST


Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server

2009-11-04 Thread Paul E. McKenney
On Wed, Nov 04, 2009 at 01:57:29PM +0200, Michael S. Tsirkin wrote:
 On Tue, Nov 03, 2009 at 03:57:44PM -0800, Paul E. McKenney wrote:
  On Tue, Nov 03, 2009 at 01:14:06PM -0500, Gregory Haskins wrote:
   Gregory Haskins wrote:
Eric Dumazet wrote:
Michael S. Tsirkin wrote:
+static void handle_tx(struct vhost_net *net)
+{
+ struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
+ unsigned head, out, in, s;
+ struct msghdr msg = {
+ .msg_name = NULL,
+ .msg_namelen = 0,
+ .msg_control = NULL,
+ .msg_controllen = 0,
+ .msg_iov = vq->iov,
+ .msg_flags = MSG_DONTWAIT,
+ };
+ size_t len, total_len = 0;
+ int err, wmem;
+ size_t hdr_size;
+ struct socket *sock = rcu_dereference(vq->private_data);
+ if (!sock)
+ return;
+
+ wmem = atomic_read(&sock->sk->sk_wmem_alloc);
+ if (wmem >= sock->sk->sk_sndbuf)
+ return;
+
+ use_mm(net->dev.mm);
+ mutex_lock(&vq->mutex);
+ vhost_no_notify(vq);
+
using rcu_dereference() and mutex_lock() at the same time seems wrong, 
I suspect
that your use of RCU is not correct.
   
1) rcu_dereference() should be done inside an rcu_read_lock() section,
and
   we are not allowed to sleep in such a section.
   (Quoting Documentation/RCU/whatisRCU.txt:
 "It is illegal to block while in an RCU read-side critical
section.")
   
2) mutex_lock() can sleep (ie block)
   


Michael,
  I warned you that this needed better documentation ;)

Eric,
  I think I flagged this once before, but Michael convinced me that it
was indeed ok, if perhaps a bit unconventional.  I will try to
find the thread.

Kind Regards,
-Greg

   
   Here it is:
   
   http://lkml.org/lkml/2009/8/12/173
  
  What was happening in that case was that the rcu_dereference()
  was being used in a workqueue item.  The role of rcu_read_lock()
  was taken on by the start of execution of the workqueue item, of
  rcu_read_unlock() by the end of execution of the workqueue item, and
  of synchronize_rcu() by flush_workqueue().  This does work, at least
  assuming that flush_workqueue() operates as advertised, which it appears
  to at first glance.
  
  The above code looks somewhat different, however -- I don't see
  handle_tx() being executed in the context of a work queue.  Instead
  it appears to be in an interrupt handler.
  So what is the story?  Using synchronize_irq() or some such?
  
  Thanx, Paul
 
 No, there has been no change (I won't be able to use a mutex in an
 interrupt handler, will I?).  handle_tx is still called in the context
 of a work queue: either from handle_tx_kick or from handle_tx_net, which
 are work queue items.

Ah, my mistake -- I was looking at 2.6.31 rather than latest git with
your patches.

 Can you ack this usage please?

I thought I had done so in my paragraph above, but if you would like
something a bit more formal...

I, Paul E. McKenney, maintainer of the RCU implementation
embodied in the Linux kernel and co-inventor of RCU, being of
sound mind and body, notwithstanding the wear and tear inherent
in my numerous decades' sojourn on this planet, hereby declare
that the following usage of work queues constitutes a valid
RCU implementation:

1.  Execution of a full workqueue item being substituted
for a conventional RCU read-side critical section, so
that the start of execution of the function specified to
INIT_WORK() corresponds to rcu_read_lock(), and the end of
this self-same function corresponds to rcu_read_unlock().

2.  Execution of flush_workqueue() being substituted for
the conventional synchronize_rcu().

The kernel developer availing himself or herself of this
declaration must observe the following caveats:

a.  The function specified to INIT_WORK() may only be
invoked via the workqueue mechanism.  Invoking said
function directly renders this declaration null
and void, as it prevents the flush_workqueue() function
from delivering the fundamental guarantee inherent in RCU.

b.  At some point in the future, said developer may be
required to apply some gcc attribute or sparse annotation
to the function passed to INIT_WORK().  Beyond that
point, failure to comply will render this declaration
null and void, as such failure would render inoperative
some potential RCU-validation tools, as duly noted by
Eric Dumazet.

c.  This 

Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server

2009-11-04 Thread Eric Dumazet
Paul E. McKenney wrote:
 
 (Sorry, but, as always, I could not resist!)

Yes :)

Thanks Paul for this masterpiece of diplomatic Acked-by ;)





Re: [PATCHv6 1/3] tun: export underlying socket

2009-11-04 Thread Arnd Bergmann
On Tuesday 03 November 2009, Arnd Bergmann wrote:
  index 3f5fd52..404abe0 100644
  --- a/include/linux/if_tun.h
  +++ b/include/linux/if_tun.h
  @@ -86,4 +86,18 @@ struct tun_filter {
  __u8   addr[0][ETH_ALEN];
   };
   
  +#ifdef __KERNEL__
  +#if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
  +struct socket *tun_get_socket(struct file *);
  +#else
   +#include <linux/err.h>
   +#include <linux/errno.h>
  +struct file;
  +struct socket;
  +static inline struct socket *tun_get_socket(struct file *f)
  +{
  +   return ERR_PTR(-EINVAL);
  +}
  +#endif /* CONFIG_TUN */
  +#endif /* __KERNEL__ */
   #endif /* __IF_TUN_H */
 
 Is this a leftover from testing? Exporting the function for !__KERNEL__
 seems pointless.
 

Michael, you didn't reply to this comment and the code is still there in v8.
Do you actually need this? What for?

Arnd 


Re: [PATCHv6 1/3] tun: export underlying socket

2009-11-04 Thread Michael S. Tsirkin
On Wed, Nov 04, 2009 at 07:09:05PM +0100, Arnd Bergmann wrote:
 On Tuesday 03 November 2009, Arnd Bergmann wrote:
   index 3f5fd52..404abe0 100644
   --- a/include/linux/if_tun.h
   +++ b/include/linux/if_tun.h
   @@ -86,4 +86,18 @@ struct tun_filter {
   __u8   addr[0][ETH_ALEN];
};

   +#ifdef __KERNEL__
   +#if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
   +struct socket *tun_get_socket(struct file *);
   +#else
    +#include <linux/err.h>
    +#include <linux/errno.h>
   +struct file;
   +struct socket;
   +static inline struct socket *tun_get_socket(struct file *f)
   +{
   +   return ERR_PTR(-EINVAL);
   +}
   +#endif /* CONFIG_TUN */
   +#endif /* __KERNEL__ */
#endif /* __IF_TUN_H */
  
  Is this a leftover from testing? Exporting the function for !__KERNEL__
  seems pointless.
  
 
 Michael, you didn't reply to this comment and the code is still there in v8.
 Do you actually need this? What for?
 
   Arnd 

Sorry, missed the question. If you look closely it is not exported for
!__KERNEL__ at all.  The stub is for when CONFIG_TUN is undefined.
Maybe I'll add a comment near #else, even though this is a bit strange
since the #if is just 2 lines above it.

-- 
MST


Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server

2009-11-04 Thread Michael S. Tsirkin
On Wed, Nov 04, 2009 at 09:25:42AM -0800, Paul E. McKenney wrote:
 (Sorry, but, as always, I could not resist!)
 
   Thanx, Paul

Thanks Paul!
Jonathan: are you reading this?
Another one for your quotes of the week collection :)

-- 
MST


Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server

2009-11-04 Thread Gregory Haskins
Michael S. Tsirkin wrote:
 On Wed, Nov 04, 2009 at 09:25:42AM -0800, Paul E. McKenney wrote:
 (Sorry, but, as always, I could not resist!)

  Thanx, Paul
 
 Thanks Paul!
 Jonathan: are you reading this?
 Another one for your quotes of the week collection :)
 

I second that.




Re: [PATCHv8 0/3] vhost: a kernel-level virtio server

2009-11-04 Thread Michael S. Tsirkin
On Wed, Nov 04, 2009 at 02:15:42PM -0500, Gregory Haskins wrote:
 Michael S. Tsirkin wrote:
  On Wed, Nov 04, 2009 at 11:02:15AM -0500, Gregory Haskins wrote:
  Michael S. Tsirkin wrote:
  Ok, I think I've addressed all comments so far here.
  Rusty, I'd like this to go into linux-next, through your tree, and
  hopefully 2.6.33.  What do you think?
  I think the benchmark data is a prerequisite for merge consideration, IMO.
  
  Shirley Ma was kind enough to send me some measurement results showing
  how kernel level acceleration helps speed up you can find them here:
  http://www.linux-kvm.org/page/VhostNet
 
 Thanks for the pointers.  I will roll your latest v8 code into our test
 matrix.  What kernel/qemu trees do they apply to?
 
 -Greg
 


kernel 2.6.32-rc6, qemu-kvm 47e465f031fc43c53ea8f08fa55cc3482c6435c8.
You can also use my development git trees if you like.

kernel:
git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost
userspace:
git://git.kernel.org/pub/scm/linux/kernel/git/mst/qemu-kvm.git vhost

Please note that I rebase these trees, especially the userspace one, now and then.

-- 
MST