Re: [PATCH V3 0/8] macvtap/vhost TX zero copy support

2011-04-22 Thread Shirley Ma
On Thu, 2011-04-21 at 10:29 -0400, Jon Mason wrote:
 Why are only these 3 drivers getting support?  As far as I can tell,
 the only requirement is HIGHDMA.  If this is the case, is there really
 a need for an additional flag to support this?  If you can key off of
 HIGHDMA, all devices that support this would get the benefit.

Agreed. So far I have only verified the three 10Gb NICs we have in our
lab. We can enable this for all devices that support HIGHDMA.

Thanks
Shirley

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V3 0/8] macvtap/vhost TX zero copy support

2011-04-22 Thread Shirley Ma
On Wed, 2011-04-20 at 12:36 -0700, Shirley Ma wrote:
 I am collecting more test results against 2.6.39-rc3 kernel and will
 provide the test matrix later.

Single TCP_STREAM 120-second test results over an ixgbe 10Gb NIC:

Message    BW(Gb/s)   qemu-kvm CPU   vhost-net CPU   PerfTop irq/s
4K         7408.57    92.1%          22.6%           1229
4K(Orig)   4913.17    118.1%         84.1%           2086
8K         9129.90    89.3%          23.3%           1141
8K(Orig)   7094.55    115.9%         84.7%           2157
16K        9178.81    89.1%          23.3%           1139
16K(Orig)  8927.10    118.7%         83.4%           2262
64K        9171.43    88.4%          24.9%           1253
64K(Orig)  9085.85    115.9%         82.4%           2229

For message sizes of 2K or less, there is a known KVM guest TX overrun
issue. With this zero-copy patch the issue becomes more severe: guest
io_exits have roughly tripled, so performance suffers. Once the TX
overrun problem has been addressed, I will retest small-message
performance.

Thanks
Shirley



Re: [PATCH V3 0/8] macvtap/vhost TX zero copy support

2011-04-21 Thread Jon Mason
On Wed, Apr 20, 2011 at 3:36 PM, Shirley Ma mashi...@us.ibm.com wrote:
 This patchset adds support for TX zero-copy between the guest and the
 host kernel through vhost. It significantly reduces CPU utilization on
 the local host on which the guest is located (vhost-thread CPU usage
 dropped by 30-50% in single-stream tests). The patchset is based on a
 previous submission and community comments on when and how guest kernel
 buffers should be released. This is the simplest approach I could come
 up with after comparing several other solutions.

 This patchset includes:

 1/8: Add a new sock zero-copy flag, SOCK_ZEROCOPY;

 2/8: Add a new device feature flag, NETIF_F_ZEROCOPY, indicating that
 the lower-level device supports zero-copy;

 3/8: Add a new struct skb_ubuf_info to skb_shared_info, providing a
 callback to release userspace buffers once the lower device's DMA for
 that skb has completed;

 4/8: Add a vhost zero-copy callback, invoked when the skb's last refcnt
 is gone; add vhost_zerocopy_add_used_and_signal to notify the guest to
 release TX skb buffers;

 5/8: Enable macvtap zero-copy to the lower device when the packet being
 sent is larger than 128 bytes;

 6/8: Add the zero-copy feature flag to the Chelsio 10Gb NIC driver;

 7/8: Add the zero-copy feature flag to the Intel 10Gb NIC driver;

 8/8: Add the zero-copy feature flag to the Emulex 10Gb NIC driver.

Why are only these 3 drivers getting support?  As far as I can tell,
the only requirement is HIGHDMA.  If this is the case, is there really
a need for an additional flag to support this?  If you can key off of
HIGHDMA, all devices that support this would get the benefit.



 The patchset is built against the most recent linux 2.6.git tree. It
 has passed netperf/netserver multiple-stream stress tests on the above
 NICs.

 The single stream test results from 2.6.37 kernel on Chelsio:

 64K message size: copy_from_user dropped from 40% to 5%; vhost-thread
 CPU utilization dropped from 76% to 28%.

 I am collecting more test results against 2.6.39-rc3 kernel and will
 provide the test matrix later.

 Thanks
 Shirley




[PATCH V3 0/8] macvtap/vhost TX zero copy support

2011-04-20 Thread Shirley Ma
This patchset adds support for TX zero-copy between the guest and the
host kernel through vhost. It significantly reduces CPU utilization on
the local host on which the guest is located (vhost-thread CPU usage
dropped by 30-50% in single-stream tests). The patchset is based on a
previous submission and community comments on when and how guest kernel
buffers should be released. This is the simplest approach I could come
up with after comparing several other solutions.

This patchset includes:

1/8: Add a new sock zero-copy flag, SOCK_ZEROCOPY;

2/8: Add a new device feature flag, NETIF_F_ZEROCOPY, indicating that
the lower-level device supports zero-copy;

3/8: Add a new struct skb_ubuf_info to skb_shared_info, providing a
callback to release userspace buffers once the lower device's DMA for
that skb has completed;

4/8: Add a vhost zero-copy callback, invoked when the skb's last refcnt
is gone; add vhost_zerocopy_add_used_and_signal to notify the guest to
release TX skb buffers;

5/8: Enable macvtap zero-copy to the lower device when the packet being
sent is larger than 128 bytes;

6/8: Add the zero-copy feature flag to the Chelsio 10Gb NIC driver;

7/8: Add the zero-copy feature flag to the Intel 10Gb NIC driver;

8/8: Add the zero-copy feature flag to the Emulex 10Gb NIC driver.

The patchset is built against the most recent linux 2.6.git tree. It
has passed netperf/netserver multiple-stream stress tests on the above
NICs.

The single stream test results from 2.6.37 kernel on Chelsio:

64K message size: copy_from_user dropped from 40% to 5%; vhost-thread
CPU utilization dropped from 76% to 28%.

I am collecting more test results against 2.6.39-rc3 kernel and will
provide the test matrix later.

Thanks
Shirley




Re: [PATCH V3 0/8] macvtap/vhost TX zero copy support

2011-04-20 Thread Shirley Ma
macvtap enables zero-copy only when the buffer size is greater than
GOODCOPY_LEN (128).

Signed-off-by: Shirley MA x...@us.ibm.com
---

 drivers/net/macvtap.c |  124 -
 1 files changed, 112 insertions(+), 12 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 6696e56..b4e6656 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -60,6 +60,7 @@ static struct proto macvtap_proto = {
  */
 static dev_t macvtap_major;
 #define MACVTAP_NUM_DEVS 65536
+#define GOODCOPY_LEN  (L1_CACHE_BYTES < 128 ? 128 : L1_CACHE_BYTES)
 static struct class *macvtap_class;
 static struct cdev macvtap_cdev;
 
@@ -340,6 +341,7 @@ static int macvtap_open(struct inode *inode, struct file *file)
 {
 	struct net *net = current->nsproxy->net_ns;
struct net_device *dev = dev_get_by_index(net, iminor(inode));
+   struct macvlan_dev *vlan = netdev_priv(dev);
struct macvtap_queue *q;
int err;
 
@@ -369,6 +371,16 @@ static int macvtap_open(struct inode *inode, struct file *file)
 	q->flags = IFF_VNET_HDR | IFF_NO_PI | IFF_TAP;
 	q->vnet_hdr_sz = sizeof(struct virtio_net_hdr);
 
+   /*
+* so far only VM uses macvtap, enable zero copy between guest
+* kernel and host kernel when lower device supports high memory
+* DMA
+*/
+	if (vlan) {
+		if (vlan->lowerdev->features & NETIF_F_ZEROCOPY)
+			sock_set_flag(&q->sk, SOCK_ZEROCOPY);
+	}
+
 	err = macvtap_set_queue(dev, file, q);
 	if (err)
 		sock_put(&q->sk);
@@ -433,6 +445,80 @@ static inline struct sk_buff *macvtap_alloc_skb(struct sock *sk, size_t prepad,
return skb;
 }
 
+/* set skb frags from iovec, this can move to core network code for reuse */
+static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
+				  int offset, size_t count)
+{
+	int len = iov_length(from, count) - offset;
+	int copy = skb_headlen(skb);
+   int size, offset1 = 0;
+   int i = 0;
+   skb_frag_t *f;
+
+   /* Skip over from offset */
+	while (offset >= from->iov_len) {
+		offset -= from->iov_len;
+   ++from;
+   --count;
+   }
+
+   /* copy up to skb headlen */
+	while (copy > 0) {
+		size = min_t(unsigned int, copy, from->iov_len - offset);
+		if (copy_from_user(skb->data + offset1, from->iov_base + offset,
+				   size))
+			return -EFAULT;
+		if (copy > size) {
+   ++from;
+   --count;
+   }
+   copy -= size;
+   offset1 += size;
+   offset = 0;
+   }
+
+   if (len == offset1)
+   return 0;
+
+   while (count--) {
+   struct page *page[MAX_SKB_FRAGS];
+   int num_pages;
+   unsigned long base;
+
+		len = from->iov_len - offset1;
+   if (!len) {
+   offset1 = 0;
+   ++from;
+   continue;
+   }
+		base = (unsigned long)from->iov_base + offset1;
+		size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+		num_pages = get_user_pages_fast(base, size, 0, &page[i]);
+		if ((num_pages != size) ||
+		    (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
+   /* put_page is in skb free */
+   return -EFAULT;
+		skb->data_len += len;
+		skb->len += len;
+		skb->truesize += len;
+   while (len) {
+			f = &skb_shinfo(skb)->frags[i];
+			f->page = page[i];
+			f->page_offset = base & ~PAGE_MASK;
+			f->size = min_t(int, len, PAGE_SIZE - f->page_offset);
+			skb_shinfo(skb)->nr_frags++;
+			/* increase sk_wmem_alloc */
+			atomic_add(f->size, &skb->sk->sk_wmem_alloc);
+			base += f->size;
+			len -= f->size;
+   i++;
+   }
+   offset1 = 0;
+   ++from;
+   }
+   return 0;
+}
+
 /*
  * macvtap_skb_from_vnet_hdr and macvtap_skb_to_vnet_hdr should
  * be shared with the tun/tap driver.
@@ -515,17 +601,19 @@ static int macvtap_skb_to_vnet_hdr(const struct sk_buff *skb,
 
 
 /* Get packet from user space buffer */
-static ssize_t macvtap_get_user(struct macvtap_queue *q,
-   const struct iovec *iv, size_t count,
-   int noblock)
+static ssize_t macvtap_get_user(struct macvtap_queue *q, struct msghdr *m,
+   const struct iovec *iv, unsigned long total_len,
+