Re: [PATCHv7] add mergeable buffers support to vhost_net
Since datalen carries the difference and will be negative by that amount from the original loop, what about just adding something like:

	}
	if (headcount)
		heads[headcount-1].len += datalen;

[and really, headcount > 0 since datalen > 0, so just:

	heads[headcount-1].len += datalen;

+-DLS

kvm-ow...@vger.kernel.org wrote on 05/10/2010 09:43:03 AM:

On Wed, Apr 28, 2010 at 01:57:12PM -0700, David L Stevens wrote:

@@ -218,18 +248,19 @@ static void handle_rx(struct vhost_net *
 	use_mm(net->dev.mm);
 	mutex_lock(&vq->mutex);
 	vhost_disable_notify(vq);
-	hdr_size = vq->hdr_size;
+	vhost_hlen = vq->vhost_hlen;

 	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 		vq->log : NULL;

-	for (;;) {
-		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
-					 ARRAY_SIZE(vq->iov),
-					 &out, &in,
-					 vq_log, &log);
+	while ((datalen = vhost_head_len(vq, sock->sk))) {
+		headcount = vhost_get_desc_n(vq, vq->heads,
+					     datalen + vhost_hlen,
+					     &in, vq_log, &log);
+		if (headcount < 0)
+			break;
 		/* OK, now we need to know about added descriptors. */
-		if (head == vq->num) {
+		if (!headcount) {
 			if (unlikely(vhost_enable_notify(vq))) {
 				/* They have slipped one in as we were
 				 * doing that: check again. */

@@ -241,46 +272,53 @@ static void handle_rx(struct vhost_net *
 			break;
 		}
 		/* We don't need to be notified again. */
-		if (out) {
-			vq_err(vq, "Unexpected descriptor format for RX: "
-			       "out %d, int %d\n", out, in);
-			break;
-		}
-		/* Skip header. TODO: support TSO/mergeable rx buffers. */
-		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
+		if (vhost_hlen)
+			/* Skip header. TODO: support TSO. */
+			s = move_iovec_hdr(vq->iov, vq->hdr, vhost_hlen, in);
+		else
+			s = copy_iovec_hdr(vq->iov, vq->hdr, vq->sock_hlen, in);
 		msg.msg_iovlen = in;
 		len = iov_length(vq->iov, in);
 		/* Sanity check */
 		if (!len) {
 			vq_err(vq, "Unexpected header len for RX: "
			       "%zd expected %zd\n",
-			       iov_length(vq->hdr, s), hdr_size);
+			       iov_length(vq->hdr, s), vhost_hlen);
 			break;
 		}
 		err = sock->ops->recvmsg(NULL, sock, &msg, len,
					 MSG_DONTWAIT | MSG_TRUNC);
 		/* TODO: Check specific error and bomb out unless EAGAIN? */
 		if (err < 0) {
-			vhost_discard_vq_desc(vq);
+			vhost_discard_desc(vq, headcount);
 			break;
 		}
-		/* TODO: Should check and handle checksum. */
-		if (err > len) {
-			pr_err("Discarded truncated rx packet: "
-			       " len %d > %zd\n", err, len);
-			vhost_discard_vq_desc(vq);
+		if (err != datalen) {
+			pr_err("Discarded rx packet: "
+			       " len %d, expected %zd\n", err, datalen);
+			vhost_discard_desc(vq, headcount);
 			continue;
 		}
 		len = err;
-		err = memcpy_toiovec(vq->hdr, (unsigned char *)&hdr, hdr_size);
-		if (err) {
-			vq_err(vq, "Unable to write vnet_hdr at addr %p: %d\n",
-			       vq->iov->iov_base, err);
+		if (vhost_hlen &&
+		    memcpy_toiovecend(vq->hdr, (unsigned char *)&hdr, 0,
+				      vhost_hlen)) {
+			vq_err(vq, "Unable to write vnet_hdr at addr %p\n",
+			       vq->iov->iov_base);
 			break;
 		}
-		len += hdr_size;
-		vhost_add_used_and_signal(&net->dev, vq, head, len);
+		/* TODO: Should check and handle checksum. */
+		if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF) &&
+		    memcpy_toiovecend(vq->hdr, (unsigned char *)&headcount,
+				      offsetof(typeof(hdr), num_buffers),
+				      sizeof(hdr.num_buffers))) {
+			vq_err(vq, "Failed num_buffers write");
+			vhost_discard_desc(vq, headcount);
+			break;
+		}
+		len += vhost_hlen;
+		vhost_add_used_and_signal_n(&net->dev, vq, vq->heads,
+					    headcount);
 		if (unlikely(vq_log))
 			vhost_log_write(vq, vq_log, log, len);
 		total_len += len;

OK I think I see the bug here: vhost_add_used_and_signal_n does not get the actual length, it gets the iovec length from vhost. Guest virtio uses this as packet length, with bad results.
So I have applied the following and it seems to have fixed the problem:

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index c16db02..9d7496d 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -219,7 +219,7 @@ static int peek_head_len(struct vhost_virtqueue *vq, struct sock *sk)
 /* This is a multi-buffer
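The fix under discussion — trimming the last head's reported length by the (negative) datalen remainder — can be modeled in plain userspace C. This is an illustrative sketch, not the kernel code; `fill_heads`, `struct used_elem`, and the capacity array are hypothetical stand-ins for the vhost_get_desc_n allocation loop:

```c
#include <assert.h>

/* Hypothetical stand-in for struct vring_used_elem. */
struct used_elem {
	unsigned id;
	unsigned len;
};

/*
 * Model of the allocation loop: datalen is decremented by each
 * buffer's full capacity, so after the loop it is negative by the
 * amount of unused space in the last buffer.  Adding that negative
 * remainder back trims the last head so the lengths reported to the
 * guest sum to the actual packet length, not the iovec capacity.
 */
static int fill_heads(struct used_elem *heads, const int *cap,
		      int nbuf, int datalen)
{
	int headcount = 0;

	while (datalen > 0 && headcount < nbuf) {
		heads[headcount].id = headcount;
		heads[headcount].len = cap[headcount];
		datalen -= cap[headcount];
		++headcount;
	}
	/* datalen <= 0 here: the fix being discussed */
	if (headcount)
		heads[headcount - 1].len += datalen;
	return headcount;
}
```

With two 1500-byte buffers and a 2000-byte packet, the second head is trimmed from 1500 to 500, so the guest sees the true packet length.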
Re: [PATCHv7] add mergeable buffers support to vhost_net
netdev-ow...@vger.kernel.org wrote on 05/10/2010 10:25:57 AM:

On Mon, May 10, 2010 at 10:09:03AM -0700, David Stevens wrote:

Since datalen carries the difference and will be negative by that amount from the original loop, what about just adding something like:

	}
	if (headcount)
		heads[headcount-1].len += datalen;

[and really, headcount > 0 since datalen > 0, so just:

	heads[headcount-1].len += datalen;

+-DLS

This works too, just does more checks and comparisons.

I am still surprised that you were unable to reproduce the problem.

I'm sure it happened, and probably had a performance penalty on my systems too, but not as much as yours. I didn't see any obvious performance difference running with the patch, though; not sure why. I'll instrument to see how often it's happening, I think. But fixed now, good catch!

+-DLS

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCHv7] add mergeable buffers support to vhost_net
Michael S. Tsirkin m...@redhat.com wrote on 05/03/2010 03:34:11 AM:

On Wed, Apr 28, 2010 at 01:57:12PM -0700, David L Stevens wrote:

This patch adds mergeable receive buffer support to vhost_net.

Signed-off-by: David L Stevens dlstev...@us.ibm.com

I've been doing some more testing before sending out a pull request, and I see a drastic performance degradation in guest to host traffic when this is applied but mergeable buffers are not in use by userspace (existing qemu-kvm userspace).

Actually, I wouldn't expect it to work at all; the qemu-kvm patch (particularly the feature bit setting bug fix) is required. Without it, I think the existing code will tell the guest to use mergeable buffers while turning it off in vhost. That was completely non-functional for me -- what version of qemu-kvm are you using? What I did to test w/o mergeable buffers is turn off the bit in VHOST_FEATURES. I'll recheck these, but qemu-kvm definitely must be updated; the original doesn't correctly handle feature bits.

+-DLS
Re: [PATCHv7] add mergeable buffers support to vhost_net
Michael S. Tsirkin m...@redhat.com wrote on 05/03/2010 08:56:14 AM:

On Mon, May 03, 2010 at 08:39:08AM -0700, David Stevens wrote:

Michael S. Tsirkin m...@redhat.com wrote on 05/03/2010 03:34:11 AM:

On Wed, Apr 28, 2010 at 01:57:12PM -0700, David L Stevens wrote:

This patch adds mergeable receive buffer support to vhost_net.

Signed-off-by: David L Stevens dlstev...@us.ibm.com

I've been doing some more testing before sending out a pull request, and I see a drastic performance degradation in guest to host traffic when this is applied but mergeable buffers are not in use by userspace (existing qemu-kvm userspace).

Actually, I wouldn't expect it to work at all; the qemu-kvm patch (particularly the feature bit setting bug fix) is required.

Which bugfix is that?

Actually, I see you put that upstream already-- commit dc14a397812b91dd0d48b03d1b8f66a251542369 in Avi's tree is the one I was talking about. I'll look further.

+-DLS
Re: [PATCHv7] add mergeable buffers support to vhost_net
Michael S. Tsirkin m...@redhat.com wrote on 04/29/2010 06:45:15 AM:

On Wed, Apr 28, 2010 at 01:57:12PM -0700, David L Stevens wrote:

This patch adds mergeable receive buffer support to vhost_net.

Signed-off-by: David L Stevens dlstev...@us.ibm.com

I have applied this, thanks very much! I have also applied some tweaks on top, please take a look. Thanks, MST

Looks fine to me.

Acked-by: David L Stevens dlstev...@us.ibm.com

commit 2809e94f5f26d89dc5232aaec753ffda95c4d95e
Author: Michael S. Tsirkin m...@redhat.com
Date:   Thu Apr 29 16:18:08 2010 +0300

    vhost-net: minor tweaks in mergeable buffer code

    Applies the following tweaks on top of mergeable buffers patch:

    1. vhost_get_desc_n assumes that all descriptors are 'in' only.
       It's also unlikely to be useful for any vhost frontend besides
       vhost_net, so just move it to net.c, and rename get_rx_bufs
       for brevity.

    2. for rx, we called iov_length within vhost_get_desc_n (now
       get_rx_bufs) already, so we don't need an extra call to
       iov_length to avoid overflow anymore. Accordingly,
       copy_iovec_hdr can return void now.

    3. for rx, do some further code tweaks:
       do not assign len = err as we check that err == len.
       handle data length in a way similar to how we handle header
       length: datalen -> sock_len, len -> vhost_len.
       add sock_hlen as a local variable, for symmetry with vhost_hlen.

    4. add some likely/unlikely annotations

    Signed-off-by: Michael S. Tsirkin m...@redhat.com

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index d61d945..85519b4 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -74,9 +74,9 @@ static int move_iovec_hdr(struct iovec *from, struct iovec *to,
 	}
 	return seg;
 }
-/* Copy iovec entries for len bytes from iovec. Return segments used. */
-static int copy_iovec_hdr(const struct iovec *from, struct iovec *to,
-			  size_t len, int iovcount)
+/* Copy iovec entries for len bytes from iovec. */
+static void copy_iovec_hdr(const struct iovec *from, struct iovec *to,
+			   size_t len, int iovcount)
 {
 	int seg = 0;
 	size_t size;
@@ -89,7 +89,6 @@ static int copy_iovec_hdr(const struct iovec *from, struct iovec *to,
 		++to;
 		++seg;
 	}
-	return seg;
 }

 /* Caller must have TX VQ lock */
@@ -204,7 +203,7 @@ static void handle_tx(struct vhost_net *net)
 	unuse_mm(net->dev.mm);
 }

-static int vhost_head_len(struct vhost_virtqueue *vq, struct sock *sk)
+static int peek_head_len(struct vhost_virtqueue *vq, struct sock *sk)
 {
 	struct sk_buff *head;
 	int len = 0;
@@ -212,17 +211,70 @@ static int vhost_head_len(struct vhost_virtqueue *vq, struct sock *sk)
 	lock_sock(sk);
 	head = skb_peek(&sk->sk_receive_queue);
 	if (head)
-		len = head->len + vq->sock_hlen;
+		len = head->len;
 	release_sock(sk);
 	return len;
 }

+/* This is a multi-buffer version of vhost_get_desc, that works if
+ *	vq has read descriptors only.
+ * @vq		- the relevant virtqueue
+ * @datalen	- data length we'll be reading
+ * @iovcount	- returned count of io vectors we fill
+ * @log		- vhost log
+ * @log_num	- log offset
+ *	returns number of buffer heads allocated, negative on error
+ */
+static int get_rx_bufs(struct vhost_virtqueue *vq,
+		       struct vring_used_elem *heads,
+		       int datalen,
+		       unsigned *iovcount,
+		       struct vhost_log *log,
+		       unsigned *log_num)
+{
+	unsigned int out, in;
+	int seg = 0;
+	int headcount = 0;
+	unsigned d;
+	int r;
+
+	while (datalen > 0) {
+		if (unlikely(headcount >= VHOST_NET_MAX_SG)) {
+			r = -ENOBUFS;
+			goto err;
+		}
+		d = vhost_get_desc(vq->dev, vq, vq->iov + seg,
+				   ARRAY_SIZE(vq->iov) - seg, &out,
+				   &in, log, log_num);
+		if (d == vq->num) {
+			r = 0;
+			goto err;
+		}
+		if (unlikely(out || in <= 0)) {
+			vq_err(vq, "unexpected descriptor format for RX: "
+			       "out %d, in %d\n", out, in);
+			r = -EINVAL;
+			goto err;
+		}
+		heads[headcount].id = d;
+		heads[headcount].len = iov_length(vq->iov + seg, in);
+		datalen -= heads[headcount].len;
+		++headcount;
+		seg += in;
+	}
+	*iovcount = seg;
+	return headcount;
+err:
+	vhost_discard_desc(vq, headcount);
+	return r;
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_rx(struct vhost_net *net)
 {
 	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
-	unsigned in, log, s;
+	unsigned uninitialized_var(in), log;
 	struct vhost_log *vq_log;
 	struct msghdr msg = {
 		.msg_name = NULL,
@@ -238,9 +290,10 @@ static void handle_rx(struct vhost_net *net)
Re: [PATCH v4] Add mergeable RX bufs support to vhost
Michael S. Tsirkin m...@redhat.com wrote on 04/22/2010 05:02:25 AM:

On Mon, Apr 19, 2010 at 03:12:19PM -0700, David L Stevens wrote:

This patch adds the mergeable RX buffers feature to vhost.

Signed-off-by: David L Stevens dlstev...@us.ibm.com

Looks pretty clean to me. Could you send a checkpatch-clean version please?

The original passes checkpatch already, but I guess I must still be getting whitespace mangling if it didn't for you. (sigh) Here it is as an attachment:

We should also check performance implications. Do you have any data?

I'm getting on the order of 10-20% improvement in throughput over stock vhost guest to host, but I do see a lot of variability in the results, even with no KVM and just over loopback. I don't know where that's coming from, but I'll do some runs and post.

Thanks for all the reviews!

+-DLS

MRXBv4.patch
Re: [PATCH v3] Add Mergeable receive buffer support to vhost_net
Thanks! There's some whitespace damage, are you sending with your new sendmail setup? It seems to have worked for qemu patches ...

Yes, I saw some line wraps in what I received, but I checked the original draft to be sure and they weren't there. Possibly from the relay; sigh.

@@ -167,8 +166,15 @@ static void handle_tx(struct vhost_net *
 		/* TODO: Check specific error and bomb out unless ENOBUFS? */
 		err = sock->ops->sendmsg(NULL, sock, &msg, len);
 		if (unlikely(err < 0)) {
-			vhost_discard_vq_desc(vq);
-			tx_poll_start(net, sock);
+			if (err == -EAGAIN) {
+				vhost_discard_desc(vq, 1);
+				tx_poll_start(net, sock);
+			} else {
+				vq_err(vq, "sendmsg: errno %d\n", -err);
+				/* drop packet; do not discard/resend */
+				vhost_add_used_and_signal(&net->dev, vq, head,
+							  0);

vhost does not currently have a consistent error handling strategy: if we drop packets, need to think which other errors should cause packet drops. I prefer to just call vq_err for now, and have us look at handling segfaults etc in a consistent way separately.

I had to add this to avoid an infinite loop when I wrote a bad packet on the socket. I agree error handling needs a better look, but retrying a bad packet continuously while dumping in the log is what it was doing when I hit an error before this code. Isn't this better, at least until a second look?

+			}
+

I wonder whether it makes sense to check skb_queue_empty(&sk->sk_receive_queue) outside the lock, to reduce the cost of this call on an empty queue (we know that it happens at least once each time we exit the loop on rx)?

I was looking at alternatives to adding the lock in the first place, but I found I couldn't measure a difference in the cost with and without the lock.

+	int retries = 0;
 	size_t len, total_len = 0;
-	int err;
+	int err, headcount, datalen;
 	size_t hdr_size;
 	struct socket *sock = rcu_dereference(vq->private_data);
 	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))

@@ -222,31 +242,25 @@ static void handle_rx(struct vhost_net *
 	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 		vq->log : NULL;
-	for (;;) {
-		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
-					 ARRAY_SIZE(vq->iov),
-					 &out, &in,
-					 vq_log, &log);
+	while ((datalen = vhost_head_len(sock->sk))) {
+		headcount = vhost_get_desc_n(vq, vq->heads, datalen, &in,
+					     vq_log, &log);

This looks like a bug, I think we need to pass datalen + header size to vhost_get_desc_n. Not sure how we know the header size that backend will use though. Maybe just look at our features.

Yes; we have hdr_size, so I can add it here. It'll be 0 for the cases where the backend and guest both have vnet header (either the regular or larger mergeable buffers one), but should be added in for future raw socket support.

 		/* OK, now we need to know about added descriptors. */
-		if (head == vq->num) {
-			if (unlikely(vhost_enable_notify(vq))) {
+		if (!headcount) {
+			if (retries == 0 && unlikely(vhost_enable_notify(vq))) {
 				/* They have slipped one in as we were
 				 * doing that: check again. */
 				vhost_disable_notify(vq);
+				retries++;
 				continue;
 			}

Hmm. The reason we have the code at all, as the comment says, is because guest could have added more buffers between the time we read last index and the time we enabled notification. So if we just break like this the race still exists. We could remember the last head value we observed, and have vhost_enable_notify check against this value?

This is to prevent a spin loop in the case where we have some buffers available, but not enough for the current packet (ie, this is the replacement code for the rxmaxheadcount business).

If they actually added something new, retrying once should see it, but what vhost_enable_notify() returns non-zero on is not new buffers but rather not empty. In the case mentioned, we aren't empty, so vhost_enable_notify() returns nonzero every time, but the guest hasn't given us enough buffers to proceed, so we continuously retry; this code breaks the spin loop until we've really got new buffers from the guest.

Need to think about it. Another concern here is that on retries vhost_get_desc_n is doing extra work, rescanning the same descriptor again and again. Not sure how common this is, might be worthwhile to add a TODO to consider this at least.

I had a printk in there to test the code and with the retries counter, it happens when we fill the ring (once, because of the retries checks), and then proceeds as desired when the guest gives us more buffers. Without the check, it spews until we
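The retry-once behavior described above can be isolated into a small helper for illustration (hypothetical names; the real code inlines this logic in handle_rx's `!headcount` path):

```c
#include <assert.h>

enum rx_action { RX_RETRY, RX_BREAK };

/*
 * Hypothetical model of the headcount==0 path: on the first failure
 * to get enough buffers we re-enable notification and scan once more
 * (the guest may have slipped buffers in as we did so); if we still
 * cannot proceed, break out and wait for a kick instead of spinning
 * on a non-empty ring that is too small for the current packet.
 */
static enum rx_action on_no_headcount(int *retries, int ring_nonempty)
{
	if (*retries == 0 && ring_nonempty) {
		++*retries;
		return RX_RETRY;
	}
	return RX_BREAK;
}
```

The first call with a non-empty ring retries; the second breaks, which is exactly the anti-spin property being argued for.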
Re: [PATCH v2] Add Mergeable RX buffer feature to vhost_net
Michael S. Tsirkin m...@redhat.com wrote on 03/31/2010 05:02:28 AM:

attached patch seems to be whitespace damaged as well. Does the origin pass checkpatch.pl for you?

Yes, but I probably broke it in the transfer -- will be more careful with the next revision.

+	head.iov_base = (void *)vhost_get_vq_desc(&net->dev, vq,
+		vq->iov, ARRAY_SIZE(vq->iov), &out, &in, NULL, NULL);

I find this casting confusing. Is it really expensive to add an array of heads so that we do not need to cast?

It needs the heads and the lengths, which looks a lot like an iovec. I was trying to resist adding a new struct XXX { unsigned head; unsigned len; } just for this, but I could make these parallel arrays, one with head index and the other with length.

+	if (vq->rxmaxheadcount < headcount)
+		vq->rxmaxheadcount = headcount;

This seems the only place where we set the rxmaxheadcount value. Maybe it can be moved out of vhost.c to net.c? If vhost library needs this it can get it as function parameter.

I can move it to vhost_get_heads(), sure.

+	/* Skip header. TODO: support TSO. */

You seem to have removed the code that skips the header. Won't this break non-GSO backends such as raw?

From our prior discussion, what I had in mind was that we'd remove all special-header processing by using the ioctl you added for TUN and later, an equivalent ioctl for raw (when supported in qemu-kvm). Qemu would arrange headers with tap (and later raw) to get what the guest expects, and vhost then just passes all data as-is between the socket and the ring. That not only removes the different-header-len code for mergeable buffers, but also eliminates making a copy of the header for every packet as before. Vhost has no need to do anything with the header, or even know its length. It also doesn't have to care what the type of the backend is-- raw or tap.

I think that simplifies everything, but it does mean that raw socket support requires a header ioctl for the different vnethdr sizes if the guest wants a vnethdr with and without mergeable buffers. Actually, I thought this was your idea and the point of the TUNSETVNETHDRSZ ioctl, but I guess you had something else in mind?

-	/* TODO: Check specific error and bomb out unless EAGAIN? */

Do you think it's not a good idea?

EAGAIN is not possible after the change, because we don't even enter the loop unless we have an skb on the read queue; the other cases bomb out, so I figured the comment for future work is now done. :-)

 	if (err < 0) {
-		vhost_discard_vq_desc(vq);
+		vhost_discard_vq_desc(vq, headcount);
 		break;
 	}

I think we should detect and discard truncated messages, since len might not be reliable if userspace pulls a packet from under us. Also, if new packet is shorter than the old one, there's no truncation but headcount is wrong. So simplest fix IMO would be to compare err with expected len. If there's a difference, we hit the race and so we would discard the packet.

Sure.

 	/* TODO: Should check and handle checksum. */
+	if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF)) {
+		struct virtio_net_hdr_mrg_rxbuf *vhdr =
+			(struct virtio_net_hdr_mrg_rxbuf *)
+			vq->iov[0].iov_base;
+		/* add num_bufs */
+		if (put_user(headcount, &vhdr->num_buffers)) {
+			vq_err(vq, "Failed to write num_buffers");
+			vhost_discard_vq_desc(vq, headcount);

Let's do memcpy_to_iovecend etc so that we do not assume layout. This is also why we need move_iovec: sendmsg might modify the iovec. It would also be nice not to corrupt memory and get a reasonable error if buffer size that we get is smaller than expected header size.

I wanted to avoid the header copy here, with the reasonable restriction that the guest gives you a buffer at least big enough for the vnet_hdr.

A length check alone (on iov[0].iov_len) could enforce that without copying headers back and forth to support silly cases like 8-byte buffers or num_buffers spanning multiple iovecs, and I think paying the price of the copy on every packet to support guests that broken isn't worth it. So, my preference here would be to add:

	if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF)) {
		struct virtio_net_mrg_rxbuf *vhdr...

		if (vq->iov[0].iov_len < sizeof(*vhdr)) {
			vq_err(vq, "tiny buffers (len %d < %d) are not supported",
			       vq->iov[0].iov_len, sizeof(*vhdr));
			break;
		}
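The memcpy_toiovecend approach suggested above (which the merged code ends up using for num_buffers) can be illustrated with a minimal userspace equivalent. `copy_to_iovec_at` is a hypothetical helper, not the kernel function, but it shows why the field's position no longer matters:

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Minimal sketch of memcpy_toiovecend semantics: write len bytes at
 * a byte offset into an iovec array, without assuming the target
 * field (e.g. num_buffers) lies entirely within the first iovec
 * entry.  Caller must ensure offset + len fits within the array.
 */
static void copy_to_iovec_at(const struct iovec *iov, const void *src,
			     size_t offset, size_t len)
{
	const char *p = src;

	for (; len; ++iov) {
		size_t n;

		if (offset >= iov->iov_len) {
			offset -= iov->iov_len;	/* skip whole segments */
			continue;
		}
		n = iov->iov_len - offset;
		if (n > len)
			n = len;
		memcpy((char *)iov->iov_base + offset, p, n);
		p += n;
		len -= n;
		offset = 0;
	}
}
```

Unlike the put_user-on-iov[0] version, this handles the "silly" case where the two-byte field straddles an iovec boundary, at the cost of the copy loop.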
Re: [RFC][ PATCH 1/3] vhost-net: support multiple buffer heads in receiver
Michael S. Tsirkin m...@redhat.com wrote on 03/07/2010 11:45:33 PM:

+static int skb_head_len(struct sk_buff_head *skq)
+{
+	struct sk_buff *head;
+
+	head = skb_peek(skq);
+	if (head)
+		return head->len;
+	return 0;
+}
+

This is done without locking, which I think can crash if skb is consumed after we peek it but before we read the length.

This thread is the only legitimate consumer, right? But qemu has the file descriptor and I guess we shouldn't trust that it won't give it to someone else; it'd break vhost, but a crash would be inappropriate, of course. I'd like to avoid the lock, but I have another idea here, so will investigate.

I looked at this some more and actually, I'm not sure I see a crash here. First, without qemu, or something it calls, being broken as root, nothing else should ever read from the socket, in which case the length will be exactly right for the next packet we read. No problem. But if by some error this skb is freed, we'll have a valid memory address that isn't the length field of the next packet we'll read. If the length is negative or more than available in the vring, we'll fail the buffer allocation, exit the loop, and get the new head length of the receive queue the next time around -- no problem. If the length field is 0, we'll exit the loop even though we have data to read, but will get that packet the next time we get in here, again, with the right length. No problem. If the length field is big enough to allocate buffer space for it, but smaller than the new head we have to read, the recvmsg will fail with EMSGSIZE, drop the packet, exit the loop and be back in business with the next packet. No problem. Otherwise, the packet will fit and be delivered.

I don't much like the notion of using skb->head when it's garbage, but that can only happen if qemu has broken, and I don't see a crash unless the skb is not only freed but no longer a valid memory address for reading at all, and all within the race window.

Since the code handles other failure cases (not enough ring buffers or packet not fitting in the allocated buffers), the actual length value only matters in the sense that it prevents us from using buffers unnecessarily-- something that isn't all that relevant if it's hosed enough to have unauthorized readers on the socket. Is this case worth the performance penalty we'll no doubt pay for either locking the socket or always allocating for a max-sized packet? I'll experiment with a couple solutions here, but unless I've missed something, we might be better off just leaving it as-is.

+-DLS
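For reference, the version eventually merged does take the lock (lock_sock()/skb_peek()/release_sock() in peek_head_len). A userspace analogue of that locked peek, with hypothetical pkt/pkt_queue stand-ins for sk_buff and the receive queue:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for sk_buff / sk_receive_queue. */
struct pkt {
	struct pkt *next;
	int len;
};

struct pkt_queue {
	pthread_mutex_t lock;
	struct pkt *head;
};

/*
 * Locked peek, analogous to peek_head_len() in the merged code:
 * holding the queue lock across both the peek and the length read
 * means the head cannot be freed between the two operations, which
 * is exactly the race being debated above.
 */
static int peek_head_len(struct pkt_queue *q)
{
	int len = 0;

	pthread_mutex_lock(&q->lock);
	if (q->head)
		len = q->head->len;
	pthread_mutex_unlock(&q->lock);
	return len;
}
```

Returning 0 for an empty queue matches the loop-exit convention used in the thread (a zero return means no skb).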
Re: [RFC][ PATCH 1/3] vhost-net: support multiple buffer heads in receiver
Michael S. Tsirkin m...@redhat.com wrote on 03/07/2010 07:31:30 AM:

On Tue, Mar 02, 2010 at 05:20:15PM -0700, David Stevens wrote:

This patch generalizes buffer handling functions to

-					NULL, NULL);
+	head.iov_base = (void *)vhost_get_vq_desc(&net->dev, vq,
+		vq->iov, ARRAY_SIZE(vq->iov), &out, &in, NULL, NULL);

Should type for head be changed so that we do not need to cast? I also prefer aligning descendants on the opening (.

Yes, that's probably better; the indentation with the cast would still wrap a lot, but I'll see what I can do here.

 		err = sock->ops->sendmsg(NULL, sock, &msg, len);
 		if (unlikely(err < 0)) {
-			vhost_discard_vq_desc(vq);
+			vhost_discard(vq, 1);

Isn't the original name a bit more descriptive?

During development, I had both and I generally like shorter names, but I can change it back.

+static int skb_head_len(struct sk_buff_head *skq)
+{
+	struct sk_buff *head;
+
+	head = skb_peek(skq);
+	if (head)
+		return head->len;
+	return 0;
+}
+

This is done without locking, which I think can crash if skb is consumed after we peek it but before we read the length.

This thread is the only legitimate consumer, right? But qemu has the file descriptor and I guess we shouldn't trust that it won't give it to someone else; it'd break vhost, but a crash would be inappropriate, of course. I'd like to avoid the lock, but I have another idea here, so will investigate.

 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_rx(struct vhost_net *net)
 {
 	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
-	unsigned head, out, in, log, s;
+	unsigned in, log, s;
 	struct vhost_log *vq_log;
 	struct msghdr msg = {
 		.msg_name = NULL,
@@ -204,10 +213,11 @@
 	};
 	size_t len, total_len = 0;
-	int err;
+	int err, headcount, datalen;
 	size_t hdr_size;
 	struct socket *sock = rcu_dereference(vq->private_data);
-	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
+
+	if (!sock || !skb_head_len(&sock->sk->sk_receive_queue))
 		return;

Isn't this equivalent? Do you expect zero len skbs in socket? If yes, maybe we should discard these, not stop processing ...

A zero return means no skb's. They are equivalent; I changed this call just to make it identical to the loop check, but I don't have a strong attachment to this.

 	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 		vq->log : NULL;
-	for (;;) {
-		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
-					 ARRAY_SIZE(vq->iov),
-					 &out, &in,
-					 vq_log, &log);
+	while ((datalen = skb_head_len(&sock->sk->sk_receive_queue))) {

This peeks in the queue to figure out how much data we have. It's a neat trick, but it does introduce new failure modes where an skb is consumed after we did the peek:
- new skb could be shorter, we should return the unconsumed part
- new skb could be longer, this will drop a packet. maybe this last is not an issue as the race should be rare in practice

As before, this loop is the only legitimate consumer of skb's; if anyone else is reading the socket, it's broken. But since the file descriptor is available to qemu, it's probably trusting qemu too much. I don't believe there is a race at all on a working system; the head can't change until we read it after this test, but I'll see if I can come up with something better here. Closing the fd for qemu so vhost has exclusive access might do it, but otherwise we could just get a max-sized packet worth of buffers and then return them after the read.

That'd force us to wait with small packets when the ring is nearly full, where we don't have to now, but I expect locking to read the head length will hurt performance in all cases. Will try these ideas out.

+		headcount = vhost_get_heads(vq, datalen, &in, vq_log, &log);
 		/* OK, now we need to know about added descriptors. */
-		if (head == vq->num) {
+		if (!headcount) {
 			if (unlikely(vhost_enable_notify(vq))) {
 				/* They have slipped one in as we were
 				 * doing that: check again. */
@@ -235,13 +242,6 @@
 			 * they refilled. */
 			break;
 		}
-		/* We don't need to be notified again. */

you find this comment unhelpful
Re: [RFC][ PATCH 2/3] vhost-net: handle vnet_hdr processing for MRG_RX_BUF
Michael S. Tsirkin m...@redhat.com wrote on 03/07/2010 08:12:29 AM:

On Tue, Mar 02, 2010 at 05:20:26PM -0700, David Stevens wrote:

This patch adds vnet_hdr processing for mergeable buffer support to vhost-net.

Signed-off-by: David L Stevens dlstev...@us.ibm.com

diff -ruN net-next-p1/drivers/vhost/net.c net-next-p2/drivers/vhost/net.c

Could you please add -p to diff flags so that it's easier to figure out which function is changed?

Sure.

@@ -148,25 +146,45 @@
 			       "out %d, int %d\n", out, in);
 			break;
 		}
+		if (vq->guest_hlen > vq->sock_hlen) {
+			if (msg.msg_iov[0].iov_len == vq->guest_hlen)
+				msg.msg_iov[0].iov_len = vq->sock_hlen;
+			else if (out == ARRAY_SIZE(vq->iov))
+				vq_err(vq, "handle_tx iov overflow!");
+			else {
+				int i;
+
+				/* give header its own iov */
+				for (i = out; i >= 0; --i)
+					msg.msg_iov[i+1] = msg.msg_iov[i];
+				msg.msg_iov[0].iov_len = vq->sock_hlen;
+				msg.msg_iov[1].iov_base += vq->guest_hlen;
+				msg.msg_iov[1].iov_len -= vq->guest_hlen;
+				out++;

We seem to spend a fair bit of code here and elsewhere trying to cover the diff between header size in guest and host. In hindsight, it was not a good idea to put new padding between data and the header: virtio-net should have added it *before*. But it is what it is. Wouldn't it be easier to just add an ioctl to tap so that vnet header size is made bigger by 4 bytes?

I'd be ok with that, but if we support raw sockets, we have some of the issues (easier, since it's 0). num_buffers is only 16 bits, so just add 2 bytes, but yeah-- pushing it to tap would be easier for vhost.

 		/* TODO: Check specific error and bomb out unless ENOBUFS? */
 		err = sock->ops->sendmsg(NULL, sock, &msg, len);
 		if (unlikely(err < 0)) {
-			vhost_discard(vq, 1);
-			tx_poll_start(net, sock);
+			if (err == -EAGAIN) {

The comment mentions ENOBUFS. Are you sure it's EAGAIN?

The comment's from the original -- the code changes should do the TODO, so probably should remove the comment.

+				tx_poll_start(net, sock);

Don't we need to call discard to move the last avail header back?

Yes, I think you're right. We might consider dropping the packet in the EAGAIN case here instead, though. It's not clear retrying the same packet with a delay here is better than getting new ones when the socket buffer is full (esp. with queues on both sides already). Maybe we should fold EAGAIN into the other error cases? Otherwise, you're right, it looks like it should have a discard here.

+			} else {
+				vq_err(vq, "sendmsg: errno %d\n", -err);
+				/* drop packet; do not discard/resend */
+				vhost_add_used_and_signal(&net->dev, vq, head, 1);
+			}
 			break;
-		}
-		if (err != len)
+		} else if (err != len)

Previous if ends with break so no need for else here.

Yes.

@@ -232,25 +243,18 @@
 		headcount = vhost_get_heads(vq, datalen, &in, vq_log, &log);
 		/* OK, now we need to know about added descriptors. */
 		if (!headcount) {
-			if (unlikely(vhost_enable_notify(vq))) {
-				/* They have slipped one in as we were
-				 * doing that: check again. */
-				vhost_disable_notify(vq);
-				continue;
-			}
-			/* Nothing new? Wait for eventfd to tell us
-			 * they refilled. */
+			vhost_enable_notify(vq);

You don't think this race can happen?

The notify case has to allow for multiple heads, not just the one, so if he added just one head, it doesn't mean we're ok to continue-- the packet may require multiple. Instead, the notifier now checks for enough to handle a max-sized packet (whether or not NONOTIFY is set). I'll think about this some more, but I concluded it wasn't needed at the time. :-)

@@ -519,6 +524,20 @@
 	vhost_net_disable_vq(n, vq);
 	rcu_assign_pointer(vq->private_data, sock);
+
+	if (sock && sock->sk) {
+		if (!vhost_sock_is_raw(sock) ||

I dislike this backend-specific code, it ties us
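The guest_hlen/sock_hlen iovec surgery quoted above can be modeled in userspace. `split_hdr_iov` is a hypothetical helper mirroring the "give header its own iov" branch; it assumes, as the patch does, that the guest header fits in the first segment:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/*
 * Sketch of the TX header adjustment: the guest header (guest_hlen)
 * is longer than what the socket expects (sock_hlen).  If iov[0] is
 * exactly the header, just shorten it; otherwise shift the data
 * segments down one slot, keep a sock_hlen-sized header in iov[0],
 * and start iov[1] just past the guest-only header bytes.
 * Returns -1 on iovec overflow.
 */
static int split_hdr_iov(struct iovec *iov, unsigned *out, unsigned max,
			 size_t guest_hlen, size_t sock_hlen)
{
	unsigned i;

	if (iov[0].iov_len == guest_hlen) {
		iov[0].iov_len = sock_hlen;	/* header already has its own iov */
		return 0;
	}
	if (*out + 1 > max)
		return -1;			/* iov overflow */
	for (i = *out; i > 0; --i)		/* shift, duplicating iov[0] into iov[1] */
		iov[i] = iov[i - 1];
	iov[0].iov_len = sock_hlen;
	iov[1].iov_base = (char *)iov[1].iov_base + guest_hlen;
	iov[1].iov_len -= guest_hlen;
	++*out;
	return 0;
}
```

For a single 100-byte segment with a 12-byte guest header and 10-byte socket header, this yields a 10-byte header iov followed by an 88-byte data iov starting at offset 12, which is the layout the sendmsg call wants.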
Re: [RFC][PATCH 3/3] vhost-net: Add mergeable RX buffer support to vhost-net
Michael S. Tsirkin m...@redhat.com wrote on 03/07/2010 08:26:33 AM: On Tue, Mar 02, 2010 at 05:20:34PM -0700, David Stevens wrote: This patch glues them all together and makes sure we notify whenever we don't have enough buffers to receive a max-sized packet, and adds the feature bit. Signed-off-by: David L Stevens dlstev...@us.ibm.com Maybe split this up? I can. I was looking mostly at size (and this is the smallest of the bunch). But the feature requires all of them together, of course. This last one is just everything left over from the other two. @@ -110,6 +90,7 @@ size_t len, total_len = 0; int err, wmem; struct socket *sock = rcu_dereference(vq-private_data); + I tend not to add empty lines if line below it is already short. This leaves no blank line between the declarations and the start of code. It's habit for me-- not sure of kernel coding standards address that or not, but I don't think I've seen it anywhere else. if (!sock) return; @@ -166,11 +147,11 @@ /* Skip header. TODO: support TSO. */ msg.msg_iovlen = out; head.iov_len = len = iov_length(vq-iov, out); + I tend not to add empty lines if line below it is a comment. I added this to separate the logical skip header block from the next, unrelated piece. Not important to me, though. /* Sanity check */ if (!len) { vq_err(vq, Unexpected header len for TX: - %zd expected %zd\n, - len, vq-guest_hlen); + %zd expected %zd\n, len, vq-guest_hlen); break; } /* TODO: Check specific error and bomb out unless ENOBUFS? */ /* TODO: Should check and handle checksum. */ + if (vhost_has_feature(net-dev, VIRTIO_NET_F_MRG_RXBUF)) { + struct virtio_net_hdr_mrg_rxbuf *vhdr = + (struct virtio_net_hdr_mrg_rxbuf *) + vq-iov[0].iov_base; + /* add num_bufs */ + vq-iov[0].iov_len = vq-guest_hlen; + vhdr-num_buffers = headcount; I don't understand this. iov_base is a userspace pointer, isn't it. How can you assign values to it like that? 
Rusty also commented earlier that it's not a good idea to assume a specific layout, such as the first chunk being large enough to include virtio_net_hdr_mrg_rxbuf. I think we need to use memcpy to/from iovec etc. I guess you mean put_user() or copy_to_user(); yes, I suppose it could be paged since we read it. The code doesn't assume that it'll fit so much as it arranges for it to fit. We allocate guest_hlen bytes in the buffer, but set the iovec to the (smaller) sock_hlen; we do the read, then this code adds back the 2 bytes in the middle that we didn't read into (where num_buffers goes). But the allocator does require that guest_hlen fit in a single buffer (and reports an error if it doesn't). The alternative is significantly more complicated, and it only fails if the guest doesn't give us at least the buffer size the guest header requires (a truly lame guest). I'm not sure it's worth a lot of complexity in vhost to support a guest giving us 12-byte buffers; those guests don't exist now, and maybe they never should? /* This actually signals the guest, using eventfd. */ void vhost_signal(struct vhost_dev *dev, struct vhost_virtqueue *vq) { __u16 flags = 0; + I tend not to add empty lines if a line above it is already short. Again, separating declarations from code-- never seen different in any other kernel code. if (get_user(flags, vq-avail-flags)) { vq_err(vq, Failed to get flags); return; @@ -1125,7 +1140,7 @@ /* If they don't want an interrupt, don't signal, unless empty. */ if ((flags VRING_AVAIL_F_NO_INTERRUPT) - (vq-avail_idx != vq-last_avail_idx || + (vhost_available(vq) vq-maxheadcount || I don't understand this change. It seems to make code not match the comments. It redefines empty. Without mergeable buffers, we can empty the ring down to nothing before we require notification. With mergeable buffers, if the packet requires, say, 3 buffers, and we have only 2 left, we are empty and require notification and new buffers to read anything.
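The scheme described in the reply above — reserve guest_hlen bytes but read only sock_hlen, then fill in num_buffers afterwards — can be sketched without storing through a cast guest pointer. In this user-space model, plain memcpy stands in for the user-copy primitive (copy_to_user / memcpy_toiovecend), and the struct mirrors the virtio_net_hdr_mrg_rxbuf field order:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Minimal stand-in for the guest-visible header layout; field order
 * matches virtio_net_hdr_mrg_rxbuf (num_buffers after the base hdr). */
struct vnet_hdr_mrg {
    uint8_t  flags;
    uint8_t  gso_type;
    uint16_t hdr_len;
    uint16_t gso_size;
    uint16_t csum_start;
    uint16_t csum_offset;
    uint16_t num_buffers;   /* the 2 bytes the socket read skips */
};

/* Hedged sketch: write the 16-bit head count at the right offset via a
 * bounded copy, rather than dereferencing the guest pointer directly. */
static void write_num_buffers(void *guest_hdr, uint16_t headcount)
{
    memcpy((char *)guest_hdr + offsetof(struct vnet_hdr_mrg, num_buffers),
           &headcount, sizeof(headcount));
}
```

In the kernel this copy must be a proper user-space access (the page may be swapped out, as the reply concedes), which is exactly why later versions of the patch switched to memcpy_toiovecend().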
In both cases, we notify when we can't read another packet without more buffers. I can think about changing the comment to reflect this more clearly. !vhost_has_feature(dev, VIRTIO_F_NOTIFY_ON_EMPTY))) return; diff -ruN net-next-p2/drivers/vhost/vhost.h net-next-p3/drivers/vhost/vhost.h
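The redefined "empty" test being debated above can be sketched as follows. Buffer size, header size, and both function names are assumptions for illustration; the second function mirrors the quoted condition (suppress the interrupt only when the guest set NO_INTERRUPT and the ring is either not "empty" in the multi-head sense or NOTIFY_ON_EMPTY was not negotiated):

```c
#include <assert.h>
#include <stddef.h>

/* With mergeable buffers a packet may need several heads, so the ring
 * counts as empty once fewer descriptors remain than a worst-case
 * packet (64K of GSO data plus the guest header) would consume. */
static int heads_needed(size_t max_packet, size_t guest_hlen, size_t buf_len)
{
    size_t total = max_packet + guest_hlen;
    return (int)((total + buf_len - 1) / buf_len);   /* ceiling divide */
}

/* Signal unless the guest asked for no interrupts and we are either
 * not empty or NOTIFY_ON_EMPTY is off. */
static int should_signal(int no_interrupt, int avail, int maxheadcount,
                         int notify_on_empty)
{
    if (!no_interrupt)
        return 1;
    return (avail <= maxheadcount) && notify_on_empty;
}
```

With 4K buffers and a 12-byte guest header, a max-sized packet needs 17 heads, so 16 remaining descriptors already count as empty.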
[RFC][PATCH 3/3] vhost-net: Add mergeable RX buffer support to vhost-net
This patch glues them all together and makes sure we notify whenever we don't have enough buffers to receive a max-sized packet, and adds the feature bit. Signed-off-by: David L Stevens dlstev...@us.ibm.com diff -ruN net-next-p2/drivers/vhost/net.c net-next-p3/drivers/vhost/net.c --- net-next-p2/drivers/vhost/net.c 2010-03-02 13:01:34.0 -0800 +++ net-next-p3/drivers/vhost/net.c 2010-03-02 15:25:15.0 -0800 @@ -54,26 +54,6 @@ enum vhost_net_poll_state tx_poll_state; }; -/* Pop first len bytes from iovec. Return number of segments used. */ -static int move_iovec_hdr(struct iovec *from, struct iovec *to, - size_t len, int iov_count) -{ - int seg = 0; - size_t size; - while (len seg iov_count) { - size = min(from-iov_len, len); - to-iov_base = from-iov_base; - to-iov_len = size; - from-iov_len -= size; - from-iov_base += size; - len -= size; - ++from; - ++to; - ++seg; - } - return seg; -} - /* Caller must have TX VQ lock */ static void tx_poll_stop(struct vhost_net *net) { @@ -97,7 +77,7 @@ static void handle_tx(struct vhost_net *net) { struct vhost_virtqueue *vq = net-dev.vqs[VHOST_NET_VQ_TX]; - unsigned out, in, s; + unsigned out, in; struct iovec head; struct msghdr msg = { .msg_name = NULL, @@ -110,6 +90,7 @@ size_t len, total_len = 0; int err, wmem; struct socket *sock = rcu_dereference(vq-private_data); + if (!sock) return; @@ -166,11 +147,11 @@ /* Skip header. TODO: support TSO. */ msg.msg_iovlen = out; head.iov_len = len = iov_length(vq-iov, out); + /* Sanity check */ if (!len) { vq_err(vq, Unexpected header len for TX: - %zd expected %zd\n, - len, vq-guest_hlen); + %zd expected %zd\n, len, vq-guest_hlen); break; } /* TODO: Check specific error and bomb out unless ENOBUFS? 
*/ @@ -214,7 +195,7 @@ static void handle_rx(struct vhost_net *net) { struct vhost_virtqueue *vq = net-dev.vqs[VHOST_NET_VQ_RX]; - unsigned in, log, s; + unsigned in, log; struct vhost_log *vq_log; struct msghdr msg = { .msg_name = NULL, @@ -245,30 +226,36 @@ if (!headcount) { vhost_enable_notify(vq); break; - } + } else if (vq-maxheadcount headcount) + vq-maxheadcount = headcount; /* Skip header. TODO: support TSO/mergeable rx buffers. */ msg.msg_iovlen = in; len = iov_length(vq-iov, in); - /* Sanity check */ if (!len) { vq_err(vq, Unexpected header len for RX: - %zd expected %zd\n, - len, vq-guest_hlen); + %zd expected %zd\n, len, vq-guest_hlen); break; } err = sock-ops-recvmsg(NULL, sock, msg, len, MSG_DONTWAIT | MSG_TRUNC); - /* TODO: Check specific error and bomb out unless EAGAIN? */ if (err 0) { - vhost_discard(vq, 1); + vhost_discard(vq, headcount); break; } /* TODO: Should check and handle checksum. */ + if (vhost_has_feature(net-dev, VIRTIO_NET_F_MRG_RXBUF)) { + struct virtio_net_hdr_mrg_rxbuf *vhdr = + (struct virtio_net_hdr_mrg_rxbuf *) + vq-iov[0].iov_base; + /* add num_bufs */ + vq-iov[0].iov_len = vq-guest_hlen; + vhdr-num_buffers = headcount; + } if (err len) { pr_err(Discarded truncated rx packet: len %d %zd\n, err, len); - vhost_discard(vq, 1); + vhost_discard(vq, headcount); continue; } len = err; @@ -573,8 +560,6 @@ static int vhost_net_set_features(struct vhost_net *n, u64 features) { - size_t hdr_size = features (1 VHOST_NET_F_VIRTIO_NET_HDR) ? - sizeof(struct virtio_net_hdr) : 0; int i; mutex_lock(n-dev.mutex); if ((features (1 VHOST_F_LOG_ALL)) diff -ruN net-next-p2/drivers/vhost/vhost.c net-next-p3/drivers/vhost/vhost.c --- net-next-p2/drivers/vhost/vhost.c
[RFC][PATCH 2/3] vhost-net: handle vnet_hdr processing for MRG_RX_BUF
This patch adds vnet_hdr processing for mergeable buffer support to vhost-net. Signed-off-by: David L Stevens dlstev...@us.ibm.com diff -ruN net-next-p1/drivers/vhost/net.c net-next-p2/drivers/vhost/net.c --- net-next-p1/drivers/vhost/net.c 2010-03-01 11:44:22.0 -0800 +++ net-next-p2/drivers/vhost/net.c 2010-03-02 13:01:34.0 -0800 @@ -109,7 +109,6 @@ }; size_t len, total_len = 0; int err, wmem; - size_t hdr_size; struct socket *sock = rcu_dereference(vq-private_data); if (!sock) return; @@ -124,7 +123,6 @@ if (wmem sock-sk-sk_sndbuf * 2) tx_poll_stop(net); - hdr_size = vq-hdr_size; for (;;) { head.iov_base = (void *)vhost_get_vq_desc(net-dev, vq, @@ -148,25 +146,45 @@ out %d, int %d\n, out, in); break; } + if (vq-guest_hlen vq-sock_hlen) { + if (msg.msg_iov[0].iov_len == vq-guest_hlen) + msg.msg_iov[0].iov_len = vq-sock_hlen; + else if (out == ARRAY_SIZE(vq-iov)) + vq_err(vq, handle_tx iov overflow!); + else { + int i; + + /* give header its own iov */ + for (i=out; i0; ++i) + msg.msg_iov[i+1] = msg.msg_iov[i]; + msg.msg_iov[0].iov_len = vq-sock_hlen; + msg.msg_iov[1].iov_base += vq-guest_hlen; + msg.msg_iov[1].iov_len -= vq-guest_hlen; + out++; + } + } /* Skip header. TODO: support TSO. */ - s = move_iovec_hdr(vq-iov, vq-hdr, hdr_size, out); msg.msg_iovlen = out; head.iov_len = len = iov_length(vq-iov, out); /* Sanity check */ if (!len) { vq_err(vq, Unexpected header len for TX: %zd expected %zd\n, - iov_length(vq-hdr, s), hdr_size); + len, vq-guest_hlen); break; } /* TODO: Check specific error and bomb out unless ENOBUFS? 
*/ err = sock-ops-sendmsg(NULL, sock, msg, len); if (unlikely(err 0)) { - vhost_discard(vq, 1); - tx_poll_start(net, sock); + if (err == -EAGAIN) { + tx_poll_start(net, sock); + } else { + vq_err(vq, sendmsg: errno %d\n, -err); + /* drop packet; do not discard/resend */ + vhost_add_used_and_signal(net-dev,vq,head,1); + } break; - } - if (err != len) + } else if (err != len) pr_err(Truncated TX packet: len %d != %zd\n, err, len); vhost_add_used_and_signal(net-dev, vq, head, 1); @@ -207,14 +225,8 @@ .msg_flags = MSG_DONTWAIT, }; - struct virtio_net_hdr hdr = { - .flags = 0, - .gso_type = VIRTIO_NET_HDR_GSO_NONE - }; - size_t len, total_len = 0; int err, headcount, datalen; - size_t hdr_size; struct socket *sock = rcu_dereference(vq-private_data); if (!sock || !skb_head_len(sock-sk-sk_receive_queue)) @@ -223,7 +235,6 @@ use_mm(net-dev.mm); mutex_lock(vq-mutex); vhost_disable_notify(vq); - hdr_size = vq-hdr_size; vq_log = unlikely(vhost_has_feature(net-dev, VHOST_F_LOG_ALL)) ? vq-log : NULL; @@ -232,25 +243,18 @@ headcount = vhost_get_heads(vq, datalen, in, vq_log, log); /* OK, now we need to know about added descriptors. */ if (!headcount) { - if (unlikely(vhost_enable_notify(vq))) { - /* They have slipped one in as we were -* doing that: check again. */ - vhost_disable_notify(vq); - continue; - } - /* Nothing new? Wait for eventfd to tell us -* they refilled. */ + vhost_enable_notify(vq); break; } /* Skip header. TODO: support TSO/mergeable rx buffers. */ - s = move_iovec_hdr(vq-iov, vq-hdr, hdr_size, in); msg.msg_iovlen = in; len = iov_length(vq-iov, in); + /* Sanity check */
[RFC][PATCH 0/3] vhost-net: Add mergeable RX buffer support to vhost-net
These patches add support for mergeable receive buffers to vhost-net, allowing it to use multiple virtio buffer heads for a single receive packet. +-DLS Signed-off-by: David L Stevens dlstev...@us.ibm.com ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
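To make the cover letter's claim concrete — one received packet may span several guest buffer heads — here is an illustrative count of how many heads a packet of a given length would consume, assuming the upcoming buffer sizes are known up front (the real code walks the descriptor ring instead; all names here are hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* Hedged sketch of "multiple buffer heads per packet": distribute the
 * packet data plus the guest header across the next available buffers
 * and count how many heads that consumes.  Returns -1 if the queued
 * buffers cannot hold the whole packet (i.e. the ring is "empty" for
 * this packet even though some descriptors remain). */
static int count_heads(size_t datalen, size_t hlen,
                       const size_t *buf_sizes, int nbufs)
{
    size_t need = datalen + hlen;
    int used = 0;

    while (need > 0 && used < nbufs) {
        size_t take = buf_sizes[used] < need ? buf_sizes[used] : need;
        need -= take;
        used++;
    }
    return need ? -1 : used;
}
```

A 9000-byte jumbo frame with a 12-byte header spread over 4K buffers takes three heads; the same packet offered a single 1500-byte buffer cannot be received at all.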
[RFC][PATCH 1/3] vhost-net: support multiple buffer heads in receiver
This patch generalizes buffer handling functions to support multiple buffer heads. In-line for viewing, attached for applying. Signed-off-by: David L Stevens dlstev...@us.ibm.com diff -ruN net-next-p0/drivers/vhost/net.c net-next-p1/drivers/vhost/net.c --- net-next-p0/drivers/vhost/net.c 2010-02-24 12:59:24.0 -0800 +++ net-next-p1/drivers/vhost/net.c 2010-03-01 11:44:22.0 -0800 @@ -97,7 +97,8 @@ static void handle_tx(struct vhost_net *net) { struct vhost_virtqueue *vq = net-dev.vqs[VHOST_NET_VQ_TX]; - unsigned head, out, in, s; + unsigned out, in, s; + struct iovec head; struct msghdr msg = { .msg_name = NULL, .msg_namelen = 0, @@ -126,12 +127,10 @@ hdr_size = vq-hdr_size; for (;;) { - head = vhost_get_vq_desc(net-dev, vq, vq-iov, -ARRAY_SIZE(vq-iov), -out, in, -NULL, NULL); + head.iov_base = (void *)vhost_get_vq_desc(net-dev, vq, + vq-iov, ARRAY_SIZE(vq-iov), out, in, NULL, NULL); /* Nothing new? Wait for eventfd to tell us they refilled. */ - if (head == vq-num) { + if (head.iov_base == (void *)vq-num) { wmem = atomic_read(sock-sk-sk_wmem_alloc); if (wmem = sock-sk-sk_sndbuf * 3 / 4) { tx_poll_start(net, sock); @@ -152,7 +151,7 @@ /* Skip header. TODO: support TSO. */ s = move_iovec_hdr(vq-iov, vq-hdr, hdr_size, out); msg.msg_iovlen = out; - len = iov_length(vq-iov, out); + head.iov_len = len = iov_length(vq-iov, out); /* Sanity check */ if (!len) { vq_err(vq, Unexpected header len for TX: @@ -163,14 +162,14 @@ /* TODO: Check specific error and bomb out unless ENOBUFS? 
*/ err = sock-ops-sendmsg(NULL, sock, msg, len); if (unlikely(err 0)) { - vhost_discard_vq_desc(vq); + vhost_discard(vq, 1); tx_poll_start(net, sock); break; } if (err != len) pr_err(Truncated TX packet: len %d != %zd\n, err, len); - vhost_add_used_and_signal(net-dev, vq, head, 0); + vhost_add_used_and_signal(net-dev, vq, head, 1); total_len += len; if (unlikely(total_len = VHOST_NET_WEIGHT)) { vhost_poll_queue(vq-poll); @@ -182,12 +181,22 @@ unuse_mm(net-dev.mm); } +static int skb_head_len(struct sk_buff_head *skq) +{ + struct sk_buff *head; + + head = skb_peek(skq); + if (head) + return head-len; + return 0; +} + /* Expects to be always run from workqueue - which acts as * read-size critical section for our kind of RCU. */ static void handle_rx(struct vhost_net *net) { struct vhost_virtqueue *vq = net-dev.vqs[VHOST_NET_VQ_RX]; - unsigned head, out, in, log, s; + unsigned in, log, s; struct vhost_log *vq_log; struct msghdr msg = { .msg_name = NULL, @@ -204,10 +213,11 @@ }; size_t len, total_len = 0; - int err; + int err, headcount, datalen; size_t hdr_size; struct socket *sock = rcu_dereference(vq-private_data); - if (!sock || skb_queue_empty(sock-sk-sk_receive_queue)) + + if (!sock || !skb_head_len(sock-sk-sk_receive_queue)) return; use_mm(net-dev.mm); @@ -218,13 +228,10 @@ vq_log = unlikely(vhost_has_feature(net-dev, VHOST_F_LOG_ALL)) ? vq-log : NULL; - for (;;) { - head = vhost_get_vq_desc(net-dev, vq, vq-iov, -ARRAY_SIZE(vq-iov), -out, in, -vq_log, log); + while ((datalen = skb_head_len(sock-sk-sk_receive_queue))) { + headcount = vhost_get_heads(vq, datalen, in, vq_log, log); /* OK, now we need to know about added descriptors. */ - if (head == vq-num) { + if (!headcount) { if (unlikely(vhost_enable_notify(vq))) { /* They have slipped one in as we were * doing that: check again. */ @@ -235,13 +242,6 @@ * they refilled. */ break; } - /* We don't need to be notified again. */ - if (out) { - vq_err(vq, Unexpected descriptor format for RX: - out %d, int
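skb_head_len() in the patch above peeks at the socket receive queue without taking the queue lock, so the head could in principle be dequeued concurrently (a point raised in later review of this series). Here is a user-space model of the same helper with the peek done under a lock; the queue and packet types are stand-ins, not kernel types:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Toy model of sk_buff_head: a singly-linked packet list with a lock. */
struct pkt { struct pkt *next; size_t len; };
struct pkt_queue { pthread_mutex_t lock; struct pkt *head; };

/* Return the length of the first queued packet, or 0 if the queue is
 * empty.  Unlike the quoted skb_head_len(), the peek here is done with
 * the queue lock held, so the head cannot vanish mid-read. */
static size_t queue_head_len(struct pkt_queue *q)
{
    size_t len = 0;

    pthread_mutex_lock(&q->lock);
    if (q->head)
        len = q->head->len;
    pthread_mutex_unlock(&q->lock);
    return len;
}
```

Even a locked peek only tells you the length at that instant; the vhost loop tolerates this because a mismatch between the peeked length and the recvmsg() result is detected and the descriptors are discarded.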
Re: [RFC][PATCH 0/3] vhost-net: Add mergeable RX buffer support to vhost-net
Michael S. Tsirkin m...@redhat.com wrote on 03/02/2010 11:54:32 PM: On Tue, Mar 02, 2010 at 04:20:03PM -0800, David Stevens wrote: These patches add support for mergeable receive buffers to vhost-net, allowing it to use multiple virtio buffer heads for a single receive packet. +-DLS Signed-off-by: David L Stevens dlstev...@us.ibm.com Do you have performance numbers (both with and without mergeable buffers in guest)? Michael, Nothing formal. I did some TCP single-stream throughput tests and was seeing 20-25% improvement on a laptop (i.e., low-end hardware). That actually surprised me; I'd have thought it'd be about the same, except maybe in a test with mixed packet sizes. Comparisons with the net-next kernel these patches are for showed only ~10% improvement. But I also see a lot of variability, both among different configurations and across runs of the same configuration. So I don't feel those numbers are very solid, and I haven't yet done any tests on bigger hardware. Two notes: I have a modified version of qemu to get the VHOST_FEATURES flags, including the mergeable RX bufs flag, passed to the guest; I'll be working with your current qemu git trees next, if any changes are needed to support it there. Second, I've found a missing initialization in the patches I sent on the list, so I'll send an updated patch 2 with the fix, and qemu patches when they are ready (plus any code-review comments incorporated). +-DLS
Re: [RFC][PATCH 0/3] vhost-net: Add mergeable RX buffer support to vhost-net
Interesting. Since the feature in question is billed first of all as a performance optimization... By whom? Although I see some improved performance, I think its real benefit is improving memory utilization on the guest. Instead of using 75K for an ARP packet, mergeable RX buffers use only 4K. :-) Since the patches affect code paths even when mergeable RX buffers are disabled, I guess the most important point would be to verify whether there's an increase in latency and/or CPU utilization, or a bandwidth cost, when the feature bit is *disabled*. Actually, when the feature bit is disabled, it'll only get a single head and doesn't use the special vnet_hdr, so the code path essentially reduces to the original. But the answer is no; I saw no regressions when using it without the feature bit. The only substantive difference in that case is that the new code avoids copying the vnet header as the original does, so it should actually be faster, but I don't think that's measurable above the variability I already see. 2 notes: I have a modified version of qemu to get the VHOST_FEATURES flags, including the mergeable RX bufs flag, passed to the guest; I'll be working with your current qemu git trees next, if any changes are needed to support it there. This feature also seems to conflict with the zero-copy rx patches from Xin Xiaohui (subject: Provide a zero-copy method on KVM virtio-net); these are not in mergeable shape yet, so this is not a blocker, but I wonder what your thoughts on the subject are: how will we do feature negotiation if some backends don't support some features? The qemu code I have basically sends the set-features and get-features all the way to vhost (i.e., it's the guest negotiating with vhost), except, of course, for the magic qemu-only bits. I think that's the right model. I'll definitely take a look at the patch you mention and maybe comment further.
+-DLS