Re: [PATCHv7] add mergeable buffers support to vhost_net

2010-05-10 Thread David Stevens
Since datalen carries the difference and will be negative by that amount
from the original loop, what about just adding something like:

}
if (headcount)
heads[headcount-1].len += datalen;
[and really, headcount > 0 since datalen > 0, so just:

heads[headcount-1].len += datalen;
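A rough sketch (not from any posted patch) of where that one-liner would sit,
at the end of the descriptor-gathering loop in vhost_get_desc_n()/get_rx_bufs()
quoted elsewhere in this thread; after the loop, datalen holds the non-positive
difference between the requested length and the iovec space gathered, so adding
it trims the last head back to the real data length:

	while (datalen > 0) {
		/* ... fetch a descriptor, set heads[headcount].id/.len,
		 *     datalen -= heads[headcount].len, ++headcount, seg += in ... */
	}
	/* datalen <= 0 here; fold the overshoot back into the last head */
	heads[headcount - 1].len += datalen;
	*iovcount = seg;
	return headcount;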

+-DLS


kvm-ow...@vger.kernel.org wrote on 05/10/2010 09:43:03 AM:

 On Wed, Apr 28, 2010 at 01:57:12PM -0700, David L Stevens wrote:
  @@ -218,18 +248,19 @@ static void handle_rx(struct vhost_net *
  use_mm(net->dev.mm);
  mutex_lock(&vq->mutex);
  vhost_disable_notify(vq);
  -   hdr_size = vq->hdr_size;
  +   vhost_hlen = vq->vhost_hlen;
  
  vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 vq->log : NULL;
  
  -   for (;;) {
  -  head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
  -ARRAY_SIZE(vq->iov),
  -&out, &in,
  -vq_log, &log);
  +   while ((datalen = vhost_head_len(vq, sock->sk))) {
  +  headcount = vhost_get_desc_n(vq, vq->heads,
  +datalen + vhost_hlen,
  +&in, vq_log, &log);
  +  if (headcount < 0)
  + break;
 /* OK, now we need to know about added descriptors. */
  -  if (head == vq->num) {
  +  if (!headcount) {
if (unlikely(vhost_enable_notify(vq))) {
   /* They have slipped one in as we were
* doing that: check again. */
  @@ -241,46 +272,53 @@ static void handle_rx(struct vhost_net *
break;
 }
 /* We don't need to be notified again. */
  -  if (out) {
  - vq_err(vq, "Unexpected descriptor format for RX: "
  -"out %d, int %d\n",
  -out, in);
  - break;
  -  }
  -  /* Skip header. TODO: support TSO/mergeable rx buffers. */
  -  s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
  +  if (vhost_hlen)
  + /* Skip header. TODO: support TSO. */
  + s = move_iovec_hdr(vq->iov, vq->hdr, vhost_hlen, in);
  +  else
  + s = copy_iovec_hdr(vq->iov, vq->hdr, vq->sock_hlen, in);
 msg.msg_iovlen = in;
 len = iov_length(vq->iov, in);
 /* Sanity check */
 if (!len) {
vq_err(vq, "Unexpected header len for RX: "
   "%zd expected %zd\n",
  -iov_length(vq->hdr, s), hdr_size);
  +iov_length(vq->hdr, s), vhost_hlen);
break;
 }
 err = sock->ops->recvmsg(NULL, sock, &msg,
   len, MSG_DONTWAIT | MSG_TRUNC);
 /* TODO: Check specific error and bomb out unless EAGAIN? */
 if (err < 0) {
  - vhost_discard_vq_desc(vq);
  + vhost_discard_desc(vq, headcount);
break;
 }
  -  /* TODO: Should check and handle checksum. */
  -  if (err > len) {
  - pr_err("Discarded truncated rx packet: "
  - " len %d > %zd\n", err, len);
  - vhost_discard_vq_desc(vq);
  +  if (err != datalen) {
  + pr_err("Discarded rx packet: "
  + " len %d, expected %zd\n", err, datalen);
  + vhost_discard_desc(vq, headcount);
continue;
 }
 len = err;
  -  err = memcpy_toiovec(vq->hdr, (unsigned char *)&hdr, hdr_size);
  -  if (err) {
  - vq_err(vq, "Unable to write vnet_hdr at addr %p: %d\n",
  -vq->iov->iov_base, err);
  +  if (vhost_hlen &&
  +  memcpy_toiovecend(vq->hdr, (unsigned char *)&hdr, 0,
  +  vhost_hlen)) {
  + vq_err(vq, "Unable to write vnet_hdr at addr %p\n",
  +vq->iov->iov_base);
break;
 }
  -  len += hdr_size;
  -  vhost_add_used_and_signal(&net->dev, vq, head, len);
  +  /* TODO: Should check and handle checksum. */
  +  if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF) &&
  +  memcpy_toiovecend(vq->hdr, (unsigned char *)&headcount,
  +  offsetof(typeof(hdr), num_buffers),
  +  sizeof(hdr.num_buffers))) {
  + vq_err(vq, "Failed num_buffers write");
  + vhost_discard_desc(vq, headcount);
  + break;
  +  }
  +  len += vhost_hlen;
  +  vhost_add_used_and_signal_n(&net->dev, vq, vq->heads,
  +   headcount);
 if (unlikely(vq_log))
vhost_log_write(vq, vq_log, log, len);
 total_len += len;
 
 OK I think I see the bug here: vhost_add_used_and_signal_n
 does not get the actual length, it gets the iovec length from vhost.
 Guest virtio uses this as packet length, with bad results.
 
 So I have applied the following and it seems to have fixed the problem:
 
 diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 index c16db02..9d7496d 100644
 --- a/drivers/vhost/net.c
 +++ b/drivers/vhost/net.c
 @@ -219,7 +219,7 @@ static int peek_head_len(struct vhost_virtqueue *vq, struct sock *sk)
  /* This is a multi-buffer 

Re: [PATCHv7] add mergeable buffers support to vhost_net

2010-05-10 Thread David Stevens
netdev-ow...@vger.kernel.org wrote on 05/10/2010 10:25:57 AM:

 On Mon, May 10, 2010 at 10:09:03AM -0700, David Stevens wrote:
  Since datalen carries the difference and will be negative by that 
amount
  from the original loop, what about just adding something like:
  
  }
  if (headcount)
  heads[headcount-1].len += datalen;
  [and really, headcount > 0 since datalen > 0, so just:
  
  heads[headcount-1].len += datalen;
  
  +-DLS
 
 This works too, just does more checks and comparisons.
 I am still surprised that you were unable to reproduce the problem.
 

I'm sure it happened, and probably had a performance
penalty on my systems too, but not as much as yours.
I didn't see any obvious performance difference running
with the patch, though; not sure why. I'll instrument to
see how often it's happening, I think.
But fixed now, good catch!

+-DLS



Re: [PATCHv7] add mergeable buffers support to vhost_net

2010-05-03 Thread David Stevens
Michael S. Tsirkin m...@redhat.com wrote on 05/03/2010 03:34:11 AM:

 On Wed, Apr 28, 2010 at 01:57:12PM -0700, David L Stevens wrote:
  This patch adds mergeable receive buffer support to vhost_net.
  
  Signed-off-by: David L Stevens dlstev...@us.ibm.com
 
 I've been doing some more testing before sending out a pull
 request, and I see a drastic performance degradation in guest to host
 traffic when this is applied but mergeable buffers are not in use
 by userspace (existing qemu-kvm userspace).

Actually, I wouldn't expect it to work at all; the qemu-kvm
patch (particularly the feature bit setting bug fix) is required.
Without it, I think the existing code will tell the guest to use
mergeable buffers while turning it off in vhost. That was completely
non-functional for me -- what version of qemu-kvm are you using?
What I did to test w/o mergeable buffers is turn off the
bit in VHOST_FEATURES. I'll recheck these, but qemu-kvm definitely
must be updated; the original doesn't correctly handle feature bits.

+-DLS



Re: [PATCHv7] add mergeable buffers support to vhost_net

2010-05-03 Thread David Stevens
Michael S. Tsirkin m...@redhat.com wrote on 05/03/2010 08:56:14 AM:

 On Mon, May 03, 2010 at 08:39:08AM -0700, David Stevens wrote:
  Michael S. Tsirkin m...@redhat.com wrote on 05/03/2010 03:34:11 AM:
  
   On Wed, Apr 28, 2010 at 01:57:12PM -0700, David L Stevens wrote:
This patch adds mergeable receive buffer support to vhost_net.

Signed-off-by: David L Stevens dlstev...@us.ibm.com
   
   I've been doing some more testing before sending out a pull
   request, and I see a drastic performance degradation in guest to 
host
   traffic when this is applied but mergeable buffers are not in use
   by userspace (existing qemu-kvm userspace).
  
  Actually, I wouldn't expect it to work at all;
  the qemu-kvm
  patch (particularly the feature bit setting bug fix) is required.
 
 Which bugfix is that?

Actually, I see you put that upstream already--
commit dc14a397812b91dd0d48b03d1b8f66a251542369  in
Avi's tree is the one I was talking about.
I'll look further.

+-DLS



Re: [PATCHv7] add mergeable buffers support to vhost_net

2010-04-30 Thread David Stevens
Michael S. Tsirkin m...@redhat.com wrote on 04/29/2010 06:45:15 AM:

 On Wed, Apr 28, 2010 at 01:57:12PM -0700, David L Stevens wrote:
  This patch adds mergeable receive buffer support to vhost_net.
  
  Signed-off-by: David L Stevens dlstev...@us.ibm.com
 
 I have applied this, thanks very much!
 I have also applied some tweaks on top,
 please take a look.
 
 Thanks,
 MSt
 

Looks fine to me.

Acked-by: David L Stevens dlstev...@us.ibm.com

 commit 2809e94f5f26d89dc5232aaec753ffda95c4d95e
 Author: Michael S. Tsirkin m...@redhat.com
 Date:   Thu Apr 29 16:18:08 2010 +0300
 
 vhost-net: minor tweaks in mergeable buffer code
 
 Applies the following tweaks on top of mergeable buffers patch:
 1. vhost_get_desc_n assumes that all descriptors are 'in' only.
It's also unlikely to be useful for any vhost frontend
besides vhost_net, so just move it to net.c, and rename
get_rx_bufs for brevity.
 
 2. for rx, we called iov_length within vhost_get_desc_n
(now get_rx_bufs) already, so we don't
need an extra call to iov_length to avoid overflow anymore.
Accordingly, copy_iovec_hdr can return void now.
 
 3. for rx, do some further code tweaks:
do not assign len = err as we check that err == len
handle data length in a way similar to how we handle
 header length: datalen -> sock_len, len -> vhost_len.
add sock_hlen as a local variable, for symmetry with vhost_hlen.
 
 4. add some likely/unlikely annotations
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 
 diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 index d61d945..85519b4 100644
 --- a/drivers/vhost/net.c
 +++ b/drivers/vhost/net.c
 @@ -74,9 +74,9 @@ static int move_iovec_hdr(struct iovec *from, struct iovec *to,
 }
 return seg;
  }
 -/* Copy iovec entries for len bytes from iovec. Return segments used. */
 -static int copy_iovec_hdr(const struct iovec *from, struct iovec *to,
 -   size_t len, int iovcount)
 +/* Copy iovec entries for len bytes from iovec. */
 +static void copy_iovec_hdr(const struct iovec *from, struct iovec *to,
 +size_t len, int iovcount)
  {
 int seg = 0;
 size_t size;
 @@ -89,7 +89,6 @@ static int copy_iovec_hdr(const struct iovec *from, struct iovec *to,
++to;
++seg;
 }
 -   return seg;
  }
 
  /* Caller must have TX VQ lock */
 @@ -204,7 +203,7 @@ static void handle_tx(struct vhost_net *net)
 unuse_mm(net->dev.mm);
  }
 
 -static int vhost_head_len(struct vhost_virtqueue *vq, struct sock *sk)
 +static int peek_head_len(struct vhost_virtqueue *vq, struct sock *sk)
  {
 struct sk_buff *head;
 int len = 0;
 @@ -212,17 +211,70 @@ static int vhost_head_len(struct vhost_virtqueue *vq, struct sock *sk)
 lock_sock(sk);
 head = skb_peek(&sk->sk_receive_queue);
 if (head)
 -  len = head->len + vq->sock_hlen;
 +  len = head->len;
 release_sock(sk);
 return len;
  }
 
 +/* This is a multi-buffer version of vhost_get_desc, that works if
 + *   vq has read descriptors only.
 + * @vq  - the relevant virtqueue
 + * @datalen   - data length we'll be reading
 + * @iovcount   - returned count of io vectors we fill
 + * @log  - vhost log
 + * @log_num   - log offset
 + *   returns number of buffer heads allocated, negative on error
 + */
 +static int get_rx_bufs(struct vhost_virtqueue *vq,
 + struct vring_used_elem *heads,
 + int datalen,
 + unsigned *iovcount,
 + struct vhost_log *log,
 + unsigned *log_num)
 +{
 +   unsigned int out, in;
 +   int seg = 0;
 +   int headcount = 0;
 +   unsigned d;
 +   int r;
 +
 +   while (datalen > 0) {
 +  if (unlikely(headcount >= VHOST_NET_MAX_SG)) {
 + r = -ENOBUFS;
 + goto err;
 +  }
 +  d = vhost_get_desc(vq->dev, vq, vq->iov + seg,
 +   ARRAY_SIZE(vq->iov) - seg, &out,
 +   &in, log, log_num);
 +  if (d == vq->num) {
 + r = 0;
 + goto err;
 +  }
 +  if (unlikely(out || in <= 0)) {
 + vq_err(vq, "unexpected descriptor format for RX: "
 +"out %d, in %d\n", out, in);
 + r = -EINVAL;
 + goto err;
 +  }
 +  heads[headcount].id = d;
 +  heads[headcount].len = iov_length(vq->iov + seg, in);
 +  datalen -= heads[headcount].len;
 +  ++headcount;
 +  seg += in;
 +   }
 +   *iovcount = seg;
 +   return headcount;
 +err:
 +   vhost_discard_desc(vq, headcount);
 +   return r;
 +}
 +
  /* Expects to be always run from workqueue - which acts as
   * read-size critical section for our kind of RCU. */
  static void handle_rx(struct vhost_net *net)
  {
 struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
 -   unsigned in, log, s;
 +   unsigned uninitialized_var(in), log;
 struct vhost_log *vq_log;
 struct msghdr msg = {
.msg_name = NULL,
 @@ -238,9 +290,10 @@ static void handle_rx(struct vhost_net *net)
   

Re: [PATCH v4] Add mergeable RX bufs support to vhost

2010-04-22 Thread David Stevens
Michael S. Tsirkin m...@redhat.com wrote on 04/22/2010 05:02:25 AM:

 On Mon, Apr 19, 2010 at 03:12:19PM -0700, David L Stevens wrote:
  This patch adds the mergeable RX buffers feature to vhost.
  
  Signed-off-by: David L Stevens dlstev...@us.ibm.com
 
 Looks pretty clean to me.
 Could you send a checkpatch-clean version please?

The original passes checkpatch already, but I guess I must
still be getting whitespace mangling if it didn't for you. (sigh)
Here it is as an attachment:



 We should also check performance implications.
 Do you have any data?

I'm getting on the order of 10-20% improvement in
throughput over stock vhost guest to host, but I do see
a lot of variability in the results, even with no KVM
and just over loopback. I don't know where that's coming
from, but I'll do some runs and post.

Thanks for all the reviews!

+-DLS



MRXBv4.patch
Description: Binary data

Re: [PATCH v3] Add Mergeable receive buffer support to vhost_net

2010-04-07 Thread David Stevens
 
 Thanks!
 There's some whitespace damage, are you sending with your new
 sendmail setup? It seems to have worked for qemu patches ...

Yes, I saw some line wraps in what I received, but I checked
the original draft to be sure and they weren't there. Possibly from
the relay; Sigh.


  @@ -167,8 +166,15 @@ static void handle_tx(struct vhost_net *
 /* TODO: Check specific error and bomb out unless ENOBUFS? */
 err = sock->ops->sendmsg(NULL, sock, &msg, len);
 if (unlikely(err < 0)) {
  - vhost_discard_vq_desc(vq);
  - tx_poll_start(net, sock);
  + if (err == -EAGAIN) {
  +vhost_discard_desc(vq, 1);
  +tx_poll_start(net, sock);
  + } else {
  +vq_err(vq, "sendmsg: errno %d\n", -err);
  +/* drop packet; do not discard/resend */
  +vhost_add_used_and_signal(&net->dev, vq, head,
  +   0);
 
 vhost does not currently have a consistent error handling strategy:
 if we drop packets, need to think which other errors should cause
 packet drops.  I prefer to just call vq_err for now,
 and have us look at handling segfaults etc in a consistent way
 separately.

I had to add this to avoid an infinite loop when I wrote a bad
packet on the socket. I agree error handling needs a better look,
but retrying a bad packet continuously while dumping in the log
is what it was doing when I hit an error before this code. Isn't
this better, at least until a second look?


  +}
  +
 
 I wonder whether it makes sense to check
 skb_queue_empty(&sk->sk_receive_queue)
 outside the lock, to reduce the cost of this call
 on an empty queue (we know that it happens at least once
 each time we exit the loop on rx)?

I was looking at alternatives to adding the lock in the
first place, but I found I couldn't measure a difference in the
cost with and without the lock.


  +   int retries = 0;
  size_t len, total_len = 0;
  -   int err;
  +   int err, headcount, datalen;
  size_t hdr_size;
  struct socket *sock = rcu_dereference(vq->private_data);
  if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
  @@ -222,31 +242,25 @@ static void handle_rx(struct vhost_net *
  vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 vq->log : NULL;
  
  -   for (;;) {
  -  head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
  -ARRAY_SIZE(vq->iov),
  -&out, &in,
  -vq_log, &log);
  +   while ((datalen = vhost_head_len(sock->sk))) {
  +  headcount = vhost_get_desc_n(vq, vq->heads, datalen, &in,
  +vq_log, &log);
 
 This looks like a bug, I think we need to pass
 datalen + header size to vhost_get_desc_n.
 Not sure how we know the header size that backend will use though.
 Maybe just look at our features.

Yes; we have hdr_size, so I can add it here. It'll be 0 for
the cases where the backend and guest both have vnet header (either
the regular or larger mergeable buffers one), but should be added
in for future raw socket support.
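In other words, a sketch of the adjusted call (this is essentially what the v7
code quoted at the top of this digest ends up doing, with hdr_size renamed to
vhost_hlen):

	/* reserve descriptor space for the packet data plus the vnet header */
	headcount = vhost_get_desc_n(vq, vq->heads,
				     datalen + hdr_size,
				     &in, vq_log, &log);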

 
 /* OK, now we need to know about added descriptors. */
  -  if (head == vq->num) {
  - if (unlikely(vhost_enable_notify(vq))) {
  +  if (!headcount) {
  + if (retries == 0 && unlikely(vhost_enable_notify(vq))) {
   /* They have slipped one in as we were
* doing that: check again. */
   vhost_disable_notify(vq);
  +retries++;
   continue;
}
 
 Hmm. The reason we have the code at all, as the comment says, is because
 guest could have added more buffers between the time we read last index
 and the time we enabled notification. So if we just break like this
 the race still exists. We could remember the
 last head value we observed, and have vhost_enable_notify check
 against this value?

This is to prevent a spin loop in the case where we have some
buffers available, but not enough for the current packet (ie, this
is the replacement code for the rxmaxheadcount business). If they
actually added something new, retrying once should see it, but what
vhost_enable_notify() returns non-zero on is not new buffers but
rather not empty. In the case mentioned, we aren't empty, so
vhost_enable_notify() returns nonzero every time, but the guest hasn't
given us enough buffers to proceed, so we continuously retry; this
code breaks the spin loop until we've really got new buffers from
the guest.

 
 Need to think about it.
 
 Another concern here is that on retries vhost_get_desc_n
 is doing extra work, rescanning the same descriptor
 again and again. Not sure how common this is, might be
 worthwhile to add a TODO to consider this at least.

I had a printk in there to test the code and with the
retries counter, it happens when we fill the ring (once,
because of the retries checks), and then proceeds as
desired when the guest gives us more buffers. Without the
check, it spews until we 

Re: [PATCH v2] Add Mergeable RX buffer feature to vhost_net

2010-03-31 Thread David Stevens
Michael S. Tsirkin m...@redhat.com wrote on 03/31/2010 05:02:28 AM:
 
 attached patch seems to be whiespace damaged as well.
 Does the origin pass checkpatch.pl for you?

Yes, but I probably broke it in the transfer -- will be more
careful with the next revision.



  +   head.iov_base = (void *)vhost_get_vq_desc(&net->dev, vq,
  +   vq->iov, ARRAY_SIZE(vq->iov), &out, &in, NULL, NULL);
 
 I find this casting confusing.
 Is it really expensive to add an array of heads so that
 we do not need to cast?

It needs the heads and the lengths, which looks a lot
like an iovec. I was trying to resist adding a new
struct XXX { unsigned head; unsigned len; } just for this,
but I could make these parallel arrays, one with head index and
the other with length.
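For what it's worth, the v7 code quoted earlier in this digest settles this by
reusing the existing virtio ring type instead of a new struct or parallel
arrays; struct vring_used_elem (from linux/virtio_ring.h) already pairs a head
with a length:

	struct vring_used_elem {
		__u32 id;	/* head of the used descriptor chain */
		__u32 len;	/* total bytes used in that chain */
	};

and the get_rx_bufs() loop quoted above fills an array of these, one entry per
buffer head consumed.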

  +   if (vq->rxmaxheadcount < headcount)
  +   vq->rxmaxheadcount = headcount;
 
 This seems the only place where we set the rxmaxheadcount
 value. Maybe it can be moved out of vhost.c to net.c?
 If vhost library needs this it can get it as function
 parameter.

I can move it to vhost_get_heads(), sure.
 
  +   /* Skip header. TODO: support TSO. */
 
 You seem to have removed the code that skips the header.
 Won't this break non-GSO backends such as raw?

From our prior discussion, what I had in mind was that
we'd remove all special-header processing by using the ioctl
you added for TUN and later, an equivalent ioctl for raw (when
supported in qemu-kvm). Qemu would arrange headers with tap
(and later raw) to get what the guest expects, and vhost then
just passes all data as-is between the socket and the ring.
That not only removes the different-header-len code
for mergeable buffers, but also eliminates making a copy of the
header for every packet as before. Vhost has no need to do
anything with the header, or even know its length. It also
doesn't have to care what the type of the backend is-- raw or
tap.
I think that simplifies everything, but it does mean that
raw socket support requires a header ioctl for the different
vnethdr sizes if the guest wants a vnethdr with and without
mergeable buffers. Actually, I thought this was your idea and
the point of the TUNSETVNETHDRSZ ioctl, but I guess you had
something else in mind?
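For reference, the tap ioctl being discussed is TUNSETVNETHDRSZ; a minimal
userspace sketch of how qemu (or any tap user) might pick the header size the
guest expects -- the function name and fd handling are illustrative only, not
from the thread:

	#include <sys/ioctl.h>
	#include <linux/if_tun.h>
	#include <linux/virtio_net.h>

	/* tap_fd: an already-opened, already-configured tap device */
	static int set_tap_vnet_hdr_len(int tap_fd, int mergeable_rx_bufs)
	{
		int len = mergeable_rx_bufs ?
			  sizeof(struct virtio_net_hdr_mrg_rxbuf) :	/* 12 bytes */
			  sizeof(struct virtio_net_hdr);		/* 10 bytes */

		/* tell tap how many vnet header bytes to prepend/strip, so
		 * vhost can pass the header through untouched */
		return ioctl(tap_fd, TUNSETVNETHDRSZ, &len);
	}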

  -   /* TODO: Check specific error and bomb out unless EAGAIN? */
 
 Do you think it's not a good idea?

EAGAIN is not possible after the change, because we don't
even enter the loop unless we have an skb on the read queue; the
other cases bomb out, so I figured the comment for future work is
now done. :-)

 
  if (err < 0) {
  -   vhost_discard_vq_desc(vq);
  +   vhost_discard_vq_desc(vq, headcount);
  break;
  }
 
 I think we should detect and discard truncated messages,
 since len might not be reliable if userspace pulls
 a packet from under us. Also, if new packet is
 shorter than the old one, there's no truncation but
 headcount is wrong.
 
 So simplest fix IMO would be to compare err with expected len.
 If there's a difference, we hit the race and so
 we would discard the packet.

Sure.

 
 
  /* TODO: Should check and handle checksum. */
  +   if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF)) {
  +   struct virtio_net_hdr_mrg_rxbuf *vhdr =
  +   (struct virtio_net_hdr_mrg_rxbuf *)
  +   vq->iov[0].iov_base;
  +   /* add num_bufs */
  +   if (put_user(headcount, &vhdr->num_buffers)) {
  +   vq_err(vq, "Failed to write num_buffers");
  +   vhost_discard_vq_desc(vq, headcount);
 
 Let's do memcpy_to_iovecend etc so that we do not assume layout.
 This is also why we need move_iovec: sendmsg might modify the iovec.
 It would also be nice not to corrupt memory and
 get a reasonable error if buffer size
 that we get is smaller than expected header size.

I wanted to avoid the header copy here, with the reasonable 
restriction
that the guest gives you a buffer at least big enough for the vnet_hdr. A
length check alone (on iov[0].iov_len) could enforce that without copying
headers back and forth to support silly cases like 8-byte buffers or
num_buffers spanning multiple iovecs, and I think paying the price of the
copy on every packet to support guests that broken isn't worth it.
So, my preference here would be to add:

 if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF))
 struct virtio_net_hdr_mrg_rxbuf *vhdr...

 if (vq->iov[0].iov_len < sizeof(*vhdr)) {
 vq_err(vq, "tiny buffers (len %d < %d) are not "
 "supported",
 vq->iov[0].iov_len, sizeof(*vhdr));
 break;
 }

Re: [RFC][ PATCH 1/3] vhost-net: support multiple buffer heads in receiver

2010-03-10 Thread David Stevens
Michael S. Tsirkin m...@redhat.com wrote on 03/07/2010 11:45:33 PM:

+static int skb_head_len(struct sk_buff_head *skq)
+{
+   struct sk_buff *head;
+
+   head = skb_peek(skq);
+   if (head)
+   return head->len;
+   return 0;
+}
+
   
   This is done without locking, which I think can crash
   if skb is consumed after we peek it but before we read the
   length.
  
  This thread is the only legitimate consumer, right? But
  qemu has the file descriptor and I guess we shouldn't trust
  that it won't give it to someone else; it'd break vhost, but
  a crash would be inappropriate, of course. I'd like to avoid
  the lock, but I have another idea here, so will investigate.

I looked at this some more and actually, I'm not sure I
see a crash here.
First, without qemu, or something it calls, being broken
as root, nothing else should ever read from the socket, in which
case the length will be exactly right for the next packet we
read. No problem.
But if by some error this skb is freed, we'll have valid
memory address that isn't the length field of the next packet
we'll read.
If the length is negative or more than available in the
vring, we'll fail the buffer allocation, exit the loop, and
get the new head length of the receive queue the next time
around -- no problem.
If the length field is 0, we'll exit the loop even
though we have data to read, but will get that packet the
next time we get in here, again, with the right length.
No problem.
If the length field is big enough to allocate buffer
space for it, but smaller than the new head we have to read,
the recvmsg will fail with EMSGSIZE, drop the packet, exit
the loop and be back in business with the next packet. No
problem.
Otherwise, the packet will fit and be delivered.

I don't much like the notion of using skb->head when
it's garbage, but that can only happen if qemu has broken,
and I don't see a crash unless the skb is not only freed
but no longer a valid memory address for reading at all,
and all within the race window.
Since the code handles other failure cases (not
enough ring buffers or packet not fitting in the allocated
buffers), the actual length value only matters in the
sense that it prevents us from using buffers unnecessarily--
something that isn't all that relevant if it's hosed enough
to have unauthorized readers on the socket.

Is this case worth the performance penalty we'll no
doubt pay for either locking the socket or always allocating
for a max-sized packet? I'll experiment with a couple
solutions here, but unless I've missed something, we might
be better off just leaving it as-is.
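For comparison, the locked peek that the v7 tweaks quoted earlier in this
digest ended up using (peek_head_len) closes exactly this window, at the cost
of taking the socket lock once per packet; roughly:

	static int peek_head_len(struct sock *sk)
	{
		struct sk_buff *head;
		int len = 0;

		/* the socket lock keeps the peeked skb from being consumed
		 * before its length is read */
		lock_sock(sk);
		head = skb_peek(&sk->sk_receive_queue);
		if (head)
			len = head->len;
		release_sock(sk);
		return len;
	}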

+-DLS




Re: [RFC][ PATCH 1/3] vhost-net: support multiple buffer heads in receiver

2010-03-07 Thread David Stevens
Michael S. Tsirkin m...@redhat.com wrote on 03/07/2010 07:31:30 AM:

 On Tue, Mar 02, 2010 at 05:20:15PM -0700, David Stevens wrote:
  This patch generalizes buffer handling functions to
  NULL, NULL);
  +   head.iov_base = (void *)vhost_get_vq_desc(&net->dev, vq,
  +   vq->iov, ARRAY_SIZE(vq->iov), &out, &in, NULL, NULL);
 
 Should type for head be changed so that we do not need to cast?
 
 I also prefer aligning descendants on the opening (.

Yes, that's probably better; the indentation with the cast would
still wrap a lot, but I'll see what I can do here.


  err = sock->ops->sendmsg(NULL, sock, &msg, len);
  if (unlikely(err < 0)) {
  -   vhost_discard_vq_desc(vq);
  +   vhost_discard(vq, 1);
 
 Isn't the original name a bit more descriptive?

During development, I had both and I generally like
shorter names, but I can change it back.

  +static int skb_head_len(struct sk_buff_head *skq)
  +{
  +   struct sk_buff *head;
  +
  +   head = skb_peek(skq);
  +   if (head)
+   return head->len;
  +   return 0;
  +}
  +
 
 This is done without locking, which I think can crash
 if skb is consumed after we peek it but before we read the
 length.

This thread is the only legitimate consumer, right? But
qemu has the file descriptor and I guess we shouldn't trust
that it won't give it to someone else; it'd break vhost, but
a crash would be inappropriate, of course. I'd like to avoid
the lock, but I have another idea here, so will investigate.

 
 
   /* Expects to be always run from workqueue - which acts as
* read-size critical section for our kind of RCU. */
   static void handle_rx(struct vhost_net *net)
   {
 struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
  -   unsigned head, out, in, log, s;
  +   unsigned in, log, s;
  struct vhost_log *vq_log;
  struct msghdr msg = {
  .msg_name = NULL,
  @@ -204,10 +213,11 @@
  };
  
  size_t len, total_len = 0;
  -   int err;
  +   int err, headcount, datalen;
  size_t hdr_size;
  struct socket *sock = rcu_dereference(vq->private_data);
  -   if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
  +
  +   if (!sock || !skb_head_len(&sock->sk->sk_receive_queue))
  return;
  
 
 Isn't this equivalent? Do you expect zero len skbs in socket?
 If yes, maybe we should discard these, not stop processing ...

A zero return means no skb's. They are equivalent; I
changed this call just to make it identical to the loop check,
but I don't have a strong attachment to this.


  vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
  vq->log : NULL;
  
  -   for (;;) {
  -   head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
  -ARRAY_SIZE(vq->iov),
  -&out, &in,
  -vq_log, &log);
  +   while ((datalen = skb_head_len(&sock->sk->sk_receive_queue))) {
 
 This peeks in the queue to figure out how much data we have.
 It's a neat trick, but it does introduce new failure modes
 where an skb is consumed after we did the peek:
 - new skb could be shorter, we should return the unconsumed part
 - new skb could be longer, this will drop a packet.
   maybe this last is not an issue as the race should be rare in practice

As before, this loop is the only legitimate consumer of skb's; if
anyone else is reading the socket, it's broken. But since the file
descriptor is available to qemu, it's probably trusting qemu too much.
I don't believe there is a race at all on a working system; the head
can't change until we read it after this test, but I'll see if I can
come up with something better here. Closing the fd for qemu so vhost
has exclusive access might do it, but otherwise we could just get a
max-sized packet worth of buffers and then return them after the read.
That'd force us to wait with small packets when the ring is nearly
full, where we don't have to now, but I expect locking to read the head
length will hurt performance in all cases. Will try these ideas out.

 
  +   headcount = vhost_get_heads(vq, datalen, &in, vq_log,
  &log);
  /* OK, now we need to know about added descriptors. */
  -   if (head == vq->num) {
  +   if (!headcount) {
  if (unlikely(vhost_enable_notify(vq))) {
  /* They have slipped one in as we were
   * doing that: check again. */
  @@ -235,13 +242,6 @@
   * they refilled. */
  break;
  }
  -   /* We don't need to be notified again. */
 
 you find this comment unhelpful

Re: [RFC][ PATCH 2/3] vhost-net: handle vnet_hdr processing for MRG_RX_BUF

2010-03-07 Thread David Stevens
Michael S. Tsirkin m...@redhat.com wrote on 03/07/2010 08:12:29 AM:

 On Tue, Mar 02, 2010 at 05:20:26PM -0700, David Stevens wrote:
  This patch adds vnet_hdr processing for mergeable buffer
  support to vhost-net.
  
  Signed-off-by: David L Stevens dlstev...@us.ibm.com
  
  diff -ruN net-next-p1/drivers/vhost/net.c 
net-next-p2/drivers/vhost/net.c
 
 Could you please add -p to diff flags so that it's easier
 to figure out which function is changes?

Sure.


  @@ -148,25 +146,45 @@
 "out %d, int %d\n", out, in);
  break;
  }
  +   if (vq->guest_hlen > vq->sock_hlen) {
  +   if (msg.msg_iov[0].iov_len == vq->guest_hlen)
  +   msg.msg_iov[0].iov_len = vq->sock_hlen;
  +   else if (out == ARRAY_SIZE(vq->iov))
  +   vq_err(vq, "handle_tx iov overflow!");
  +   else {
  +   int i;
  +
  +   /* give header its own iov */
  +   for (i=out; i > 0; ++i)
  +   msg.msg_iov[i+1] = msg.msg_iov[i];
  +   msg.msg_iov[0].iov_len = vq->sock_hlen;
  +   msg.msg_iov[1].iov_base += vq->guest_hlen;
  +   msg.msg_iov[1].iov_len -= vq->guest_hlen;
  +   out++;
 
 We seem to spend a fair bit of code here and elsewhere trying to cover
 the diff between header size in guest and host.  In hindsight, it was
 not a good idea to put new padding between data and the header:
 virtio-net should have added it *before*. But it is what it is.
 
 Wouldn't it be easier to just add an ioctl to tap so that
 vnet header size is made bigger by 4 bytes?

I'd be ok with that, but if we support raw sockets, we have
some of the issues (easier, since it's 0). num_buffers is only
16 bits, so just add 2 bytes, but yeah-- pushing it to tap would
be easier for vhost.


  /* TODO: Check specific error and bomb out unless ENOBUFS? */
  err = sock->ops->sendmsg(NULL, sock, &msg, len);
  if (unlikely(err < 0)) {
  -   vhost_discard(vq, 1);
  -   tx_poll_start(net, sock);
  +   if (err == -EAGAIN) {
 
 The comment mentions ENOBUFS. Are you sure it's EAGAIN?

The comment's from the original -- the code changes should do
the TODO, so probably should remove the comment.
 
  +   tx_poll_start(net, sock);
 
 Don't we need to call discard to move the last avail header back?

Yes, I think you're right. We might consider dropping the packet
in the EAGAIN case here instead, though. It's not clear retrying the
same packet with a delay here is better than getting new ones when the
socket buffer is full (esp. with queues on both sides already). Maybe
we should fold EAGAIN into the other error cases? Otherwise, you're
right, it looks like it should have a discard here.

 
  +   } else {
  +   vq_err(vq, "sendmsg: errno %d\n", -err);
  +   /* drop packet; do not discard/resend */
  + vhost_add_used_and_signal(&net->dev,vq,head,1);
  +   }
  break;
  -   }
  -   if (err != len)
  +   } else if (err != len)
 
 
 Previous if ends with break so no need for else here.

Yes.


  @@ -232,25 +243,18 @@
  headcount = vhost_get_heads(vq, datalen, &in, vq_log,
  &log);
  /* OK, now we need to know about added descriptors. */
  if (!headcount) {
  -   if (unlikely(vhost_enable_notify(vq))) {
  -   /* They have slipped one in as we were
  -* doing that: check again. */
  -   vhost_disable_notify(vq);
  -   continue;
  -   }
  -   /* Nothing new?  Wait for eventfd to tell us
  -* they refilled. */
  +   vhost_enable_notify(vq);
 
 
 You don't think this race can happen?

The notify case has to allow for multiple heads, not just the one,
so if he added just one head, it doesn't mean we're ok to continue-- the
packet may require multiple. Instead, the notifier now checks for enough
to handle a max-sized packet (whether or not NONOTIFY is set). I'll think
about this some more, but I concluded it wasn't needed at the time. :-)
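The "enough for a max-sized packet" notion he describes shows up in the patch
3/3 hunk quoted later in this digest roughly as:

	/* don't treat the ring as non-empty unless more buffers are available
	 * than the largest head count seen so far */
	if ((flags & VRING_AVAIL_F_NO_INTERRUPT) &&
	    (vhost_available(vq) > vq->maxheadcount ||
	     !vhost_has_feature(dev, VIRTIO_F_NOTIFY_ON_EMPTY)))
		return;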


  @@ -519,6 +524,20 @@
  
  vhost_net_disable_vq(n, vq);
  rcu_assign_pointer(vq->private_data, sock);
  +
  +   if (sock && sock->sk) {
  +   if (!vhost_sock_is_raw(sock) ||
 
 I dislike this backend-specific code, it ties us

Re: [RFC][ PATCH 3/3] vhost-net: Add mergeable RX buffer support to vhost-net

2010-03-07 Thread David Stevens
Michael S. Tsirkin m...@redhat.com wrote on 03/07/2010 08:26:33 AM:

 On Tue, Mar 02, 2010 at 05:20:34PM -0700, David Stevens wrote:
  This patch glues them all together and makes sure we
  notify whenever we don't have enough buffers to receive
  a max-sized packet, and adds the feature bit.
  
  Signed-off-by: David L Stevens dlstev...@us.ibm.com
 
 Maybe split this up?

I can. I was looking mostly at size (and this is the smallest
of the bunch). But the feature requires all of them together, of course.
This last one is just everything left over from the other two.

  @@ -110,6 +90,7 @@
  size_t len, total_len = 0;
  int err, wmem;
  struct socket *sock = rcu_dereference(vq->private_data);
  +
 
 I tend not to add empty lines if line below it is already short.

This leaves no blank line between the declarations and the start
of code. It's habit for me -- I'm not sure whether the kernel coding standards
address that, but I don't think I've seen it done differently anywhere else.

 
  if (!sock)
  return;
  
  @@ -166,11 +147,11 @@
  /* Skip header. TODO: support TSO. */
  msg.msg_iovlen = out;
  head.iov_len = len = iov_length(vq->iov, out);
  +
 
 I tend not to add empty lines if line below it is a comment.

I added this to separate the logical skip header block from
the next, unrelated piece. Not important to me, though.

 
  /* Sanity check */
  if (!len) {
  vq_err(vq, "Unexpected header len for TX: "
  -  "%zd expected %zd\n",
  -  len, vq->guest_hlen);
  +  "%zd expected %zd\n", len, vq->guest_hlen);
  break;
  }
  /* TODO: Check specific error and bomb out unless 
ENOBUFS? 
  */


  /* TODO: Should check and handle checksum. */
  +   if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF)) {
  +   struct virtio_net_hdr_mrg_rxbuf *vhdr =
  +   (struct virtio_net_hdr_mrg_rxbuf *)
  +   vq->iov[0].iov_base;
  +   /* add num_bufs */
  +   vq->iov[0].iov_len = vq->guest_hlen;
  +   vhdr->num_buffers = headcount;
 
 I don't understand this. iov_base is a userspace pointer, isn't it.
 How can you assign values to it like that?
 Rusty also commented earlier that it's not a good idea to assume
 specific layout, such as first chunk being large enough to
 include virtio_net_hdr_mrg_rxbuf.
 
 I think we need to use memcpy to/from iovec etc.

I guess you mean put_user() or copy_to_user(); yes, I suppose
it could be paged since we read it.
The code doesn't assume that it'll fit so much as arranges for
it to fit. We allocate guest_hlen bytes in the buffer, but set the
iovec to the (smaller) sock_hlen; do the read, then this code adds
back the 2 bytes in the middle that we didn't read into (where
num_buffers goes). But the allocator does require that guest_hlen
will fit in a single buffer (and reports error if it doesn't). The
alternative is significantly more complicated, and only fails if
the guest doesn't give us at least the buffer size the guest header
requires (a truly lame guest). I'm not sure it's worth a lot of
complexity in vhost to support the guest giving us 12 byte buffers;
those guests don't exist now and maybe they never should?
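For readers following along, the two header layouts being juggled here (from
linux/virtio_net.h) are, roughly:

	struct virtio_net_hdr {			/* 10 bytes: what tap/raw provides */
		__u8  flags;
		__u8  gso_type;
		__u16 hdr_len;
		__u16 gso_size;
		__u16 csum_start;
		__u16 csum_offset;
	};

	struct virtio_net_hdr_mrg_rxbuf {	/* 12 bytes: what the guest sees with MRG_RXBUF */
		struct virtio_net_hdr hdr;
		__u16 num_buffers;	/* the 2 extra bytes vhost fills with headcount */
	};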


   /* This actually signals the guest, using eventfd. */
   void vhost_signal(struct vhost_dev *dev, struct vhost_virtqueue *vq)
   {
  __u16 flags = 0;
  +
 
 I tend not to add empty lines if a line above it is already short.

Again, separating declarations from code -- I've never seen it done
differently in any other kernel code.

 
  if (get_user(flags, &vq->avail->flags)) {
  vq_err(vq, "Failed to get flags");
  return;
  @@ -1125,7 +1140,7 @@
  
  /* If they don't want an interrupt, don't signal, unless empty. */
  if ((flags & VRING_AVAIL_F_NO_INTERRUPT) &&
  -   (vq->avail_idx != vq->last_avail_idx ||
  +   (vhost_available(vq) > vq->maxheadcount ||
 
 I don't understand this change. It seems to make
 code not match the comments.

It redefines empty. Without mergeable buffers, we can empty
the ring down to nothing before we require notification. With
mergeable buffers, if the packet requires, say, 3 buffers, and we
have only 2 left, we are empty and require notification and new
buffers to read anything. In both cases, we notify when we can't
read another packet without more buffers.
I can think about changing the comment to reflect this more
clearly.

 
   !vhost_has_feature(dev, VIRTIO_F_NOTIFY_ON_EMPTY)))
  return;
  
  diff -ruN net-next-p2/drivers/vhost/vhost.h 
  net-next-p3/drivers/vhost/vhost.h

[RFC][ PATCH 3/3] vhost-net: Add mergeable RX buffer support to vhost-net

2010-03-03 Thread David Stevens
This patch glues them all together and makes sure we
notify whenever we don't have enough buffers to receive
a max-sized packet, and adds the feature bit.

Signed-off-by: David L Stevens dlstev...@us.ibm.com

diff -ruN net-next-p2/drivers/vhost/net.c net-next-p3/drivers/vhost/net.c
--- net-next-p2/drivers/vhost/net.c 2010-03-02 13:01:34.0 
-0800
+++ net-next-p3/drivers/vhost/net.c 2010-03-02 15:25:15.0 
-0800
@@ -54,26 +54,6 @@
enum vhost_net_poll_state tx_poll_state;
 };
 
-/* Pop first len bytes from iovec. Return number of segments used. */
-static int move_iovec_hdr(struct iovec *from, struct iovec *to,
- size_t len, int iov_count)
-{
-   int seg = 0;
-   size_t size;
-   while (len && seg < iov_count) {
-   size = min(from->iov_len, len);
-   to->iov_base = from->iov_base;
-   to->iov_len = size;
-   from->iov_len -= size;
-   from->iov_base += size;
-   len -= size;
-   ++from;
-   ++to;
-   ++seg;
-   }
-   return seg;
-}
-
 /* Caller must have TX VQ lock */
 static void tx_poll_stop(struct vhost_net *net)
 {
@@ -97,7 +77,7 @@
 static void handle_tx(struct vhost_net *net)
 {
 struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
-   unsigned out, in, s;
+   unsigned out, in;
struct iovec head;
struct msghdr msg = {
.msg_name = NULL,
@@ -110,6 +90,7 @@
size_t len, total_len = 0;
int err, wmem;
 struct socket *sock = rcu_dereference(vq->private_data);
+
if (!sock)
return;
 
@@ -166,11 +147,11 @@
/* Skip header. TODO: support TSO. */
msg.msg_iovlen = out;
 head.iov_len = len = iov_length(vq->iov, out);
 +
 /* Sanity check */
 if (!len) {
 vq_err(vq, "Unexpected header len for TX: "
 -  "%zd expected %zd\n",
 -  len, vq->guest_hlen);
 +  "%zd expected %zd\n", len, vq->guest_hlen);
 break;
 }
 /* TODO: Check specific error and bomb out unless ENOBUFS? */
@@ -214,7 +195,7 @@
 static void handle_rx(struct vhost_net *net)
 {
 struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
-   unsigned in, log, s;
+   unsigned in, log;
struct vhost_log *vq_log;
struct msghdr msg = {
.msg_name = NULL,
@@ -245,30 +226,36 @@
if (!headcount) {
vhost_enable_notify(vq);
break;
-   }
 +   } else if (vq->maxheadcount < headcount)
 +   vq->maxheadcount = headcount;
 /* Skip header. TODO: support TSO/mergeable rx buffers. */
 msg.msg_iovlen = in;
 len = iov_length(vq->iov, in);
 -
 /* Sanity check */
 if (!len) {
 vq_err(vq, "Unexpected header len for RX: "
 -  "%zd expected %zd\n",
 -  len, vq->guest_hlen);
 +  "%zd expected %zd\n", len, vq->guest_hlen);
 break;
 }
 err = sock->ops->recvmsg(NULL, sock, &msg,
  len, MSG_DONTWAIT | MSG_TRUNC);
 -   /* TODO: Check specific error and bomb out unless EAGAIN? */
 if (err < 0) {
 -   vhost_discard(vq, 1);
 +   vhost_discard(vq, headcount);
 break;
 }
 /* TODO: Should check and handle checksum. */
 +   if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF)) {
 +   struct virtio_net_hdr_mrg_rxbuf *vhdr =
 +   (struct virtio_net_hdr_mrg_rxbuf *)
 +   vq->iov[0].iov_base;
 +   /* add num_bufs */
 +   vq->iov[0].iov_len = vq->guest_hlen;
 +   vhdr->num_buffers = headcount;
 +   }
 if (err > len) {
 pr_err("Discarded truncated rx packet: "
 "len %d > %zd\n", err, len);
-   vhost_discard(vq, 1);
+   vhost_discard(vq, headcount);
continue;
}
len = err;
@@ -573,8 +560,6 @@
 
 static int vhost_net_set_features(struct vhost_net *n, u64 features)
 {
-   size_t hdr_size = features & (1 << VHOST_NET_F_VIRTIO_NET_HDR) ?
-   sizeof(struct virtio_net_hdr) : 0;
 int i;
 mutex_lock(&n->dev.mutex);
 if ((features & (1 << VHOST_F_LOG_ALL)) &&
diff -ruN net-next-p2/drivers/vhost/vhost.c 
net-next-p3/drivers/vhost/vhost.c
--- net-next-p2/drivers/vhost/vhost.c   

[RFC][ PATCH 2/3] vhost-net: handle vnet_hdr processing for MRG_RX_BUF

2010-03-03 Thread David Stevens
This patch adds vnet_hdr processing for mergeable buffer
support to vhost-net.

Signed-off-by: David L Stevens dlstev...@us.ibm.com

diff -ruN net-next-p1/drivers/vhost/net.c net-next-p2/drivers/vhost/net.c
--- net-next-p1/drivers/vhost/net.c 2010-03-01 11:44:22.0 
-0800
+++ net-next-p2/drivers/vhost/net.c 2010-03-02 13:01:34.0 
-0800
@@ -109,7 +109,6 @@
};
size_t len, total_len = 0;
int err, wmem;
-   size_t hdr_size;
 struct socket *sock = rcu_dereference(vq->private_data);
 if (!sock)
 return;
 @@ -124,7 +123,6 @@
  
 if (wmem < sock->sk->sk_sndbuf * 2)
 tx_poll_stop(net);
 -   hdr_size = vq->hdr_size;
  
 for (;;) {
 head.iov_base = (void *)vhost_get_vq_desc(&net->dev, vq,
@@ -148,25 +146,45 @@
   "out %d, int %d\n", out, in);
 break;
 }
 +   if (vq->guest_hlen > vq->sock_hlen) {
 +   if (msg.msg_iov[0].iov_len == vq->guest_hlen)
 +   msg.msg_iov[0].iov_len = vq->sock_hlen;
 +   else if (out == ARRAY_SIZE(vq->iov))
 +   vq_err(vq, "handle_tx iov overflow!");
 +   else {
 +   int i;
 +
 +   /* give header its own iov */
 +   for (i=out; i > 0; ++i)
 +   msg.msg_iov[i+1] = msg.msg_iov[i];
 +   msg.msg_iov[0].iov_len = vq->sock_hlen;
 +   msg.msg_iov[1].iov_base += vq->guest_hlen;
 +   msg.msg_iov[1].iov_len -= vq->guest_hlen;
 +   out++;
 +   }
 +   }
 /* Skip header. TODO: support TSO. */
 -   s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
 msg.msg_iovlen = out;
 head.iov_len = len = iov_length(vq->iov, out);
 /* Sanity check */
 if (!len) {
 vq_err(vq, "Unexpected header len for TX: "
   "%zd expected %zd\n",
 -  iov_length(vq->hdr, s), hdr_size);
 +  len, vq->guest_hlen);
 break;
 }
 /* TODO: Check specific error and bomb out unless ENOBUFS? */
 err = sock->ops->sendmsg(NULL, sock, &msg, len);
 if (unlikely(err < 0)) {
 -   vhost_discard(vq, 1);
 -   tx_poll_start(net, sock);
 +   if (err == -EAGAIN) {
 +   tx_poll_start(net, sock);
 +   } else {
 +   vq_err(vq, "sendmsg: errno %d\n", -err);
 +   /* drop packet; do not discard/resend */
 + vhost_add_used_and_signal(&net->dev,vq,head,1);
 +   }
 break;
 -   }
 -   if (err != len)
 +   } else if (err != len)
 pr_err("Truncated TX packet: "
 "len %d != %zd\n", err, len);
 vhost_add_used_and_signal(&net->dev, vq, head, 1);
@@ -207,14 +225,8 @@
.msg_flags = MSG_DONTWAIT,
};
 
-   struct virtio_net_hdr hdr = {
-   .flags = 0,
-   .gso_type = VIRTIO_NET_HDR_GSO_NONE
-   };
-
size_t len, total_len = 0;
int err, headcount, datalen;
-   size_t hdr_size;
 struct socket *sock = rcu_dereference(vq->private_data);
  
 if (!sock || !skb_head_len(&sock->sk->sk_receive_queue))
@@ -223,7 +235,6 @@
 use_mm(net->dev.mm);
 mutex_lock(&vq->mutex);
 vhost_disable_notify(vq);
 -   hdr_size = vq->hdr_size;
  
 vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 vq->log : NULL;
@@ -232,25 +243,18 @@
 headcount = vhost_get_heads(vq, datalen, &in, vq_log,
 &log);
/* OK, now we need to know about added descriptors. */
if (!headcount) {
-   if (unlikely(vhost_enable_notify(vq))) {
-   /* They have slipped one in as we were
-* doing that: check again. */
-   vhost_disable_notify(vq);
-   continue;
-   }
-   /* Nothing new?  Wait for eventfd to tell us
-* they refilled. */
+   vhost_enable_notify(vq);
break;
}
/* Skip header. TODO: support TSO/mergeable rx buffers. */
 -   s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
 msg.msg_iovlen = in;
 len = iov_length(vq->iov, in);
+
/* Sanity check */
 

[RFC][ PATCH 0/3] vhost-net: Add mergeable RX buffer support to vhost-net

2010-03-03 Thread David Stevens
These patches add support for mergeable receive buffers to
vhost-net, allowing it to use multiple virtio buffer heads for a single
receive packet.
+-DLS


Signed-off-by: David L Stevens dlstev...@us.ibm.com


[RFC][ PATCH 1/3] vhost-net: support multiple buffer heads in receiver

2010-03-03 Thread David Stevens
This patch generalizes buffer handling functions to
support multiple buffer heads.

In-line for viewing, attached for applying.

Signed-off-by: David L Stevens dlstev...@us.ibm.com

diff -ruN net-next-p0/drivers/vhost/net.c net-next-p1/drivers/vhost/net.c
--- net-next-p0/drivers/vhost/net.c 2010-02-24 12:59:24.0 
-0800
+++ net-next-p1/drivers/vhost/net.c 2010-03-01 11:44:22.0 
-0800
@@ -97,7 +97,8 @@
 static void handle_tx(struct vhost_net *net)
 {
 struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
-   unsigned head, out, in, s;
+   unsigned out, in, s;
+   struct iovec head;
struct msghdr msg = {
.msg_name = NULL,
.msg_namelen = 0,
@@ -126,12 +127,10 @@
 hdr_size = vq->hdr_size;
  
 for (;;) {
 -   head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 -ARRAY_SIZE(vq->iov),
 -&out, &in,
 -NULL, NULL);
 +   head.iov_base = (void *)vhost_get_vq_desc(&net->dev, vq,
 +   vq->iov, ARRAY_SIZE(vq->iov), &out, &in, NULL, NULL);
 /* Nothing new?  Wait for eventfd to tell us they 
 refilled. */
 -   if (head == vq->num) {
 +   if (head.iov_base == (void *)vq->num) {
 wmem = atomic_read(&sock->sk->sk_wmem_alloc);
 if (wmem >= sock->sk->sk_sndbuf * 3 / 4) {
tx_poll_start(net, sock);
@@ -152,7 +151,7 @@
/* Skip header. TODO: support TSO. */
 s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
 msg.msg_iovlen = out;
 -   len = iov_length(vq->iov, out);
 +   head.iov_len = len = iov_length(vq->iov, out);
 /* Sanity check */
 if (!len) {
 vq_err(vq, "Unexpected header len for TX: "
@@ -163,14 +162,14 @@
 /* TODO: Check specific error and bomb out unless ENOBUFS? */
 err = sock->ops->sendmsg(NULL, sock, &msg, len);
 if (unlikely(err < 0)) {
 -   vhost_discard_vq_desc(vq);
 +   vhost_discard(vq, 1);
 tx_poll_start(net, sock);
 break;
 }
 if (err != len)
 pr_err("Truncated TX packet: "
 "len %d != %zd\n", err, len);
 -   vhost_add_used_and_signal(&net->dev, vq, head, 0);
 +   vhost_add_used_and_signal(&net->dev, vq, head, 1);
 total_len += len;
 if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 vhost_poll_queue(&vq->poll);
@@ -182,12 +181,22 @@
 unuse_mm(net->dev.mm);
 }
 
+static int skb_head_len(struct sk_buff_head *skq)
+{
+   struct sk_buff *head;
+
+   head = skb_peek(skq);
+   if (head)
+   return head->len;
+   return 0;
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_rx(struct vhost_net *net)
 {
 struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
-   unsigned head, out, in, log, s;
+   unsigned in, log, s;
struct vhost_log *vq_log;
struct msghdr msg = {
.msg_name = NULL,
@@ -204,10 +213,11 @@
};
 
size_t len, total_len = 0;
-   int err;
+   int err, headcount, datalen;
size_t hdr_size;
 struct socket *sock = rcu_dereference(vq->private_data);
 -   if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
 +
 +   if (!sock || !skb_head_len(&sock->sk->sk_receive_queue))
return;
 
 use_mm(net->dev.mm);
 @@ -218,13 +228,10 @@
 vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 vq->log : NULL;
 
-   for (;;) {
 -   head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 -ARRAY_SIZE(vq->iov),
 -&out, &in,
 -vq_log, &log);
 +   while ((datalen = skb_head_len(&sock->sk->sk_receive_queue))) {
 +   headcount = vhost_get_heads(vq, datalen, &in, vq_log,
 &log);
 /* OK, now we need to know about added descriptors. */
 -   if (head == vq->num) {
+   if (!headcount) {
if (unlikely(vhost_enable_notify(vq))) {
/* They have slipped one in as we were
 * doing that: check again. */
@@ -235,13 +242,6 @@
 * they refilled. */
break;
}
-   /* We don't need to be notified again. */
-   if (out) {
 -   vq_err(vq, "Unexpected descriptor format for RX: "
 -  "out %d, int 

Re: [RFC][ PATCH 0/3] vhost-net: Add mergeable RX buffer support to vhost-net

2010-03-03 Thread David Stevens
Michael S. Tsirkin m...@redhat.com wrote on 03/02/2010 11:54:32 PM:

 On Tue, Mar 02, 2010 at 04:20:03PM -0800, David Stevens wrote:
  These patches add support for mergeable receive buffers to
  vhost-net, allowing it to use multiple virtio buffer heads for a 
single
  receive packet.
  +-DLS
  
  
  Signed-off-by: David L Stevens dlstev...@us.ibm.com
 
 Do you have performance numbers (both with and without mergeable buffers
 in guest)?

Michael,
Nothing formal. I did some TCP single-stream throughput tests
and was seeing 20-25% improvement on a laptop (ie, low-end hardware).
That actually surprised me; I'd think it'd be about the same, except
maybe in a test that has mixed packet sizes. Comparisons with the
net-next kernel these patches are for showed only ~10% improvement.
But I also see a lot of variability both among different
configurations and with the same configuration on different runs.
So, I don't feel like those numbers are very solid, and I haven't
yet done any tests on bigger hardware.

2 notes: I have a modified version of qemu to get the VHOST_FEATURES
flags, including the mergeable RX bufs flag, passed to the guest; I'll
be working with your current qemu git trees next, if any changes are
needed to support it there.
Second, I've found a missing initialization in the patches I
sent on the list, so I'll send an updated patch 2 with the fix, and
qemu patches when they are ready (plus any code-review comments
incorporated).

+-DLS



Re: [RFC][ PATCH 0/3] vhost-net: Add mergeable RX buffer support to vhost-net

2010-03-03 Thread David Stevens
 Interesting. Since the feature in question is billed first of all as a
 performance optimization...

By whom? Although I see some improved performance, I think its real
benefit is improving memory utilization on the guest. Instead of using
75K for an ARP packet, mergeable RX buffers use only 4K. :-)

 Since the patches affect code paths when mergeable RX buffers are
 disabled as well, I guess the most important point would be to verify
 whether there's increase in latency and/or CPU utilization, or bandwidth
 cost when the feature bit is *disabled*.

Actually, when the feature bit is disabled, it'll only get a single
head, doesn't use the special vnet_hdr, and the codepath reduces
essentially to the original. But the answer is no; I saw no regressions
when using it without the feature bit. The only substantive difference in
that case is that the new code avoids copying the vnet header as the
original does, so it should actually be faster, but I don't think that's
measurable above the variability I already see.

 
  2 notes: I have a modified version of qemu to get the VHOST_FEATURES
  flags, including the mergeable RX bufs flag, passed to the guest; I'll
  be working with your current qemu git trees next, if any changes are
  needed to support it there.
 
 This feature also seems to conflict with zero-copy rx patches from Xin
 Xiaohui (subject: Provide a zero-copy method on KVM virtio-net) these
 are not in a mergeable shape yet, so this is not a blocker, but I wonder
 what your thoughts on the subject are: how will we do feature
 negotiation if some backends don't support some features?

The qemu code I have basically sends the set features and get
features all the way to vhost (ie, it's the guest negotiating with
vhost), except, of course, for the magic qemu-only bits. I think that's
the right model. I'll definitely take a look at the patch you mention
and maybe comment further.

+-DLS
