Re: [PATCH 2/3] partial checksum and GSO support for tun/tap.

2008-03-04 Thread Max Krasnyanskiy
Rusty Russell wrote:
 On Tuesday 04 March 2008 16:08:00 Max Krasnyansky wrote:
 The problem with this approach is that for what I'm doing, the packets
 aren't nicely arranged somewhere; they're in random process memory.
 That's fine. RX/TX descriptors would not contain the data itself. They'd
 contain pointers to actual packets (ie just like the NIC takes physical
 memory address and DMAs data in/out).
 The allows for sending/receiving packets without syscalls and fits nicely
 with the async schemes like GSO.
 
 Yes, yes it does.  That would be a very nice extension (it's orthogonal to 
 this patch though, so should we get Dave to take these for 2.6.25?).
It's orthogonal in general I agree. The only concern is whether we should keep 
extending existing driver API (ie adding more flags like TUN_NO_PI, etc) or go 
straight to the new/better ring based API. I do not have a strong preference 
one way or the other. So I guess I'm saying I'd be ok with Dave taking them 
for 2.6.25, but that API may be obsolete when the ring based thing comes out.

 And as it happens, virtio already has such a structure: virtio_ring.  See 
 linux/virtio_ring.h.
I'll take a look.

 The structure is for virtio, I'm just borrowing it for tap because it's
 already there.  We could rename it and move it out to its own header, but
 if so we should do that before 2.6.25 is released.
 If we do the whole enchilada with the RX/TX rings then we probably do not
 even need it. I'm thinking that RX/TX descriptor would include everything
 you need for the GSO and stuff.
 I meant do not need it for the TUN/TAP driver that is. Is it used anywhere
 else ?
 
 Just for the linux virtio drivers.  Reusing it for tun/tap was an 
 afterthought.  It just meant I could pass the same structure straight thru, 
 though, which is nice.
 
 The userspace-kernel problem is very similar to the guest-host problem, so 
 it doesn't surprise me if we end up with very similar (identical?) 
 interfaces.
 
 Take a look at virtio_ring.h, virtio_ring.c and Documentation/lguest/lguest.c 
 to see how we use it...
Thanx for the pointers.

Max

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/3] partial checksum and GSO support for tun/tap.

2008-03-03 Thread Rusty Russell
On Friday 08 February 2008 16:39:03 Max Krasnyansky wrote:
 Rusty Russell wrote:
  (Changes since last time: we how have explicit IFF_RECV_CSUM and
  IFF_RECV_GSO bits, and some renaming of virtio_net hdr)
 
  We use the virtio_net_hdr: it is an ABI already and designed to
  encapsulate such metadata as GSO and partial checksums.
 
  IFF_VIRTIO_HDR means you will write and read a 'struct virtio_net_hdr'
  at the start of each packet.  You can always write packets with
  partial checksum and gso to the tap device using this header.
 
  IFF_RECV_CSUM means you can handle reading packets with partial
  checksums.  If IFF_RECV_GSO is also set, it means you can handle
  reading (all types of) GSO packets.
 
  Note that there is no easy way to detect if these flags are supported:
  see next patch.

 Again sorry for delay in replying. Here are my thoughts on this.

 I like the approach in general. Certainly the part that creates skbs out of
 the user-space pages looks good. And it's fits nicely into existing TUN
 driver model. However I actually wanted to change the model :). In
 particular I'm talking about syscall per packet
 After messing around with things like libe1000.sf.net I'd like to make
 TUN/TAP driver look more like modern nic's to the user-space. In other
 words I'm thinking about introducing RX and TX rings that the user-space
 can then mmap() and write/read packets descriptors to/from. That will saves
 the number of system calls that the user-space app needs to do. That by
 itself saves a lot of overhead, combined with the GSO it's be lightning
 fast.

The problem with this approach is that for what I'm doing, the packets aren't 
nicely arranged somewhere; they're in random process memory.

I thought about further abusing writev and readv to do multiple packets at 
once.  

 btw We had a long discussion with Eugeniy Polakov on mapping user-pages vs
 mmap()ing large kernel buffer and doing normal memcpy() (ie instead of
 copy_to/fromuser()) in the kernel. On small packets overhead of
 get_user_pages() eats up all the benefits. So we should think of some
 scheme that nicely combines the two. Kind of like copy break that latest
 net drivers do these days.

Yes, the threshold for copy should probably be set around 128 bytes.

 Also btw why call it VIRTIO ? For example I'm actually interested in
 speeding up tunning and general network apps. We have wireless basestation
 apps here that need to handle packets in user-space. Those kind things have
 nothing to with virtualization.

The structure is for virtio, I'm just borrowing it for tap because it's 
already there.  We could rename it and move it out to its own header, but if 
so we should do that before 2.6.25 is released.

Thanks!
Rusty.



 Max


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/3] partial checksum and GSO support for tun/tap.

2008-03-03 Thread Max Krasnyansky
Rusty Russell wrote:
 On Friday 08 February 2008 16:39:03 Max Krasnyansky wrote:
 Rusty Russell wrote:
 (Changes since last time: we how have explicit IFF_RECV_CSUM and
 IFF_RECV_GSO bits, and some renaming of virtio_net hdr)

 We use the virtio_net_hdr: it is an ABI already and designed to
 encapsulate such metadata as GSO and partial checksums.

 IFF_VIRTIO_HDR means you will write and read a 'struct virtio_net_hdr'
 at the start of each packet.  You can always write packets with
 partial checksum and gso to the tap device using this header.

 IFF_RECV_CSUM means you can handle reading packets with partial
 checksums.  If IFF_RECV_GSO is also set, it means you can handle
 reading (all types of) GSO packets.

 Note that there is no easy way to detect if these flags are supported:
 see next patch.
 Again sorry for delay in replying. Here are my thoughts on this.

 I like the approach in general. Certainly the part that creates skbs out of
 the user-space pages looks good. And it's fits nicely into existing TUN
 driver model. However I actually wanted to change the model :). In
 particular I'm talking about syscall per packet
 After messing around with things like libe1000.sf.net I'd like to make
 TUN/TAP driver look more like modern nic's to the user-space. In other
 words I'm thinking about introducing RX and TX rings that the user-space
 can then mmap() and write/read packets descriptors to/from. That will saves
 the number of system calls that the user-space app needs to do. That by
 itself saves a lot of overhead, combined with the GSO it's be lightning
 fast.
 
 The problem with this approach is that for what I'm doing, the packets aren't 
 nicely arranged somewhere; they're in random process memory.
That's fine. RX/TX descriptors would not contain the data itself. They'd
contain pointers to actual packets (ie just like the NIC takes physical memory
address and DMAs data in/out).
The allows for sending/receiving packets without syscalls and fits nicely with
the async schemes like GSO.

btw The code that I sent you does indeed expect packets to be in a mmap()ed
buffer but I agree that it only works for certain cases. In general it's not
flexible. I was thinking of introducing some flags in the descriptor that tell
the kernel how to handle the packet. ie Whether it needs to be just copied
into a fresh SKB or remapped with get_user_pages().

 I thought about further abusing writev and readv to do multiple packets at 
 once.  
I actually was going to abuse them from day one. At that time Alex Kuznetsov
told me that I'm crazy and I gave up on it :)

 Also btw why call it VIRTIO ? For example I'm actually interested in
 speeding up tunning and general network apps. We have wireless basestation
 apps here that need to handle packets in user-space. Those kind things have
 nothing to with virtualization.
 
 The structure is for virtio, I'm just borrowing it for tap because it's 
 already there.  We could rename it and move it out to its own header, but if 
 so we should do that before 2.6.25 is released.
If we do the whole enchilada with the RX/TX rings then we probably do not even
need it. I'm thinking that RX/TX descriptor would include everything you need
for the GSO and stuff.
I meant do not need it for the TUN/TAP driver that is. Is it used anywhere else 
?

Max



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/3] partial checksum and GSO support for tun/tap.

2008-03-03 Thread Rusty Russell
On Tuesday 04 March 2008 16:08:00 Max Krasnyansky wrote:
  The problem with this approach is that for what I'm doing, the packets
  aren't nicely arranged somewhere; they're in random process memory.

 That's fine. RX/TX descriptors would not contain the data itself. They'd
 contain pointers to actual packets (ie just like the NIC takes physical
 memory address and DMAs data in/out).
 The allows for sending/receiving packets without syscalls and fits nicely
 with the async schemes like GSO.

Yes, yes it does.  That would be a very nice extension (it's orthogonal to 
this patch though, so should we get Dave to take these for 2.6.25?).

And as it happens, virtio already has such a structure: virtio_ring.  See 
linux/virtio_ring.h.

  The structure is for virtio, I'm just borrowing it for tap because it's
  already there.  We could rename it and move it out to its own header, but
  if so we should do that before 2.6.25 is released.

 If we do the whole enchilada with the RX/TX rings then we probably do not
 even need it. I'm thinking that RX/TX descriptor would include everything
 you need for the GSO and stuff.
 I meant do not need it for the TUN/TAP driver that is. Is it used anywhere
 else ?

Just for the linux virtio drivers.  Reusing it for tun/tap was an 
afterthought.  It just meant I could pass the same structure straight thru, 
though, which is nice.

The userspace-kernel problem is very similar to the guest-host problem, so 
it doesn't surprise me if we end up with very similar (identical?) 
interfaces.

Take a look at virtio_ring.h, virtio_ring.c and Documentation/lguest/lguest.c 
to see how we use it...

Cheers,
Rusty.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/3] partial checksum and GSO support for tun/tap.

2008-02-08 Thread Max Krasnyansky
Rusty Russell wrote:
 (Changes since last time: we how have explicit IFF_RECV_CSUM and 
 IFF_RECV_GSO bits, and some renaming of virtio_net hdr)
 
 We use the virtio_net_hdr: it is an ABI already and designed to
 encapsulate such metadata as GSO and partial checksums.
 
 IFF_VIRTIO_HDR means you will write and read a 'struct virtio_net_hdr'
 at the start of each packet.  You can always write packets with
 partial checksum and gso to the tap device using this header.
 
 IFF_RECV_CSUM means you can handle reading packets with partial
 checksums.  If IFF_RECV_GSO is also set, it means you can handle
 reading (all types of) GSO packets.

 Note that there is no easy way to detect if these flags are supported:
 see next patch.

Again sorry for delay in replying. Here are my thoughts on this.

I like the approach in general. Certainly the part that creates skbs out of the 
user-space
pages looks good. And it's fits nicely into existing TUN driver model.
However I actually wanted to change the model :). In particular I'm talking 
about 
syscall per packet 
After messing around with things like libe1000.sf.net I'd like to make TUN/TAP 
driver look 
more like modern nic's to the user-space. In other words I'm thinking about 
introducing RX and
TX rings that the user-space can then mmap() and write/read packets descriptors 
to/from.
That will saves the number of system calls that the user-space app needs to do. 
That by 
itself saves a lot of overhead, combined with the GSO it's be lightning fast.

I'm going to send you a version that I cooked up awhile ago in a private email. 
Do not want
to spam netdev :). It's not quite the RX/TX ring model but I'll give you an 
idea.
I did some profiling and PPS (packets per second) numbers that user-space can 
handle literally 
sky rocketed.

btw We had a long discussion with Eugeniy Polakov on mapping user-pages vs 
mmap()ing large
kernel buffer and doing normal memcpy() (ie instead of copy_to/fromuser()) in 
the kernel.
On small packets overhead of get_user_pages() eats up all the benefits. So we 
should think
of some scheme that nicely combines the two. Kind of like copy break that 
latest net 
drivers do these days.

Also btw why call it VIRTIO ? For example I'm actually interested in speeding 
up tunning
and general network apps. We have wireless basestation apps here that need to 
handle packets
in user-space. Those kind things have nothing to with virtualization.

Max
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH 2/3] partial checksum and GSO support for tun/tap.

2008-01-23 Thread Rusty Russell
(Changes since last time: we how have explicit IFF_RECV_CSUM and 
IFF_RECV_GSO bits, and some renaming of virtio_net hdr)

We use the virtio_net_hdr: it is an ABI already and designed to
encapsulate such metadata as GSO and partial checksums.

IFF_VIRTIO_HDR means you will write and read a 'struct virtio_net_hdr'
at the start of each packet.  You can always write packets with
partial checksum and gso to the tap device using this header.

IFF_RECV_CSUM means you can handle reading packets with partial
checksums.  If IFF_RECV_GSO is also set, it means you can handle
reading (all types of) GSO packets.

Note that there is no easy way to detect if these flags are supported:
see next patch.

Signed-off-by: Rusty Russell [EMAIL PROTECTED]
---
 drivers/net/tun.c  |  259 +++--
 include/linux/if_tun.h |6 +
 2 files changed, 238 insertions(+), 27 deletions(-)

diff -r cb85fb035378 drivers/net/tun.c
--- a/drivers/net/tun.c Wed Jan 23 20:06:56 2008 +1100
+++ b/drivers/net/tun.c Wed Jan 23 20:12:51 2008 +1100
@@ -62,6 +62,7 @@
 #include linux/if_ether.h
 #include linux/if_tun.h
 #include linux/crc32.h
+#include linux/virtio_net.h
 #include net/net_namespace.h
 
 #include asm/system.h
@@ -238,35 +239,188 @@ static unsigned int tun_chr_poll(struct 
return mask;
 }
 
+static struct sk_buff *copy_user_skb(size_t align, struct iovec *iv, size_t 
len)
+{
+   struct sk_buff *skb;
+
+   if (!(skb = alloc_skb(len + align, GFP_KERNEL)))
+   return ERR_PTR(-ENOMEM);
+
+   if (align)
+   skb_reserve(skb, align);
+
+   if (memcpy_fromiovec(skb_put(skb, len), iv, len)) {
+   kfree_skb(skb);
+   return ERR_PTR(-EFAULT);
+   }
+   return skb;
+}
+
+/* This will fail if they give us a crazy iovec, but that's their own fault. */
+static int get_user_skb_frags(const struct iovec *iv, size_t count,
+ struct skb_frag_struct *f)
+{
+   unsigned int i, j, num_pg = 0;
+   int err;
+   struct page *pages[MAX_SKB_FRAGS];
+
+   down_read(current-mm-mmap_sem);
+   for (i = 0; i  count; i++) {
+   int n, npages;
+   unsigned long base, len;
+   base = (unsigned long)iv[i].iov_base;
+   len = (unsigned long)iv[i].iov_len;
+
+   if (len == 0)
+   continue;
+
+   /* How many pages will this take? */
+   npages = 1 + (base + len - 1)/PAGE_SIZE - base/PAGE_SIZE;
+   if (unlikely(num_pg + npages  MAX_SKB_FRAGS)) {
+   err = -ENOSPC;
+   goto fail;
+   }
+   n = get_user_pages(current, current-mm, base, npages,
+  0, 0, pages, NULL);
+   if (unlikely(n  0)) {
+   err = n;
+   goto fail;
+   }
+
+   /* Transfer pages to the frag array */
+   for (j = 0; j  n; j++) {
+   f[num_pg].page = pages[j];
+   if (j == 0) {
+   f[num_pg].page_offset = offset_in_page(base);
+   f[num_pg].size = min(len, PAGE_SIZE -
+f[num_pg].page_offset);
+   } else {
+   f[num_pg].page_offset = 0;
+   f[num_pg].size = min(len, PAGE_SIZE);
+   }
+   len -= f[num_pg].size;
+   base += f[num_pg].size;
+   num_pg++;
+   }
+
+   if (unlikely(n != npages)) {
+   err = -EFAULT;
+   goto fail;
+   }
+   }
+   up_read(current-mm-mmap_sem);
+   return num_pg;
+
+fail:
+   for (i = 0; i  num_pg; i++)
+   put_page(f[i].page);
+   up_read(current-mm-mmap_sem);
+   return err;
+}
+
+
+static struct sk_buff *map_user_skb(const struct virtio_net_hdr *gso,
+   size_t align, struct iovec *iv,
+   size_t count, size_t len)
+{
+   struct sk_buff *skb;
+   struct skb_shared_info *sinfo;
+   int err;
+
+   if (!(skb = alloc_skb(gso-hdr_len + align, GFP_KERNEL)))
+   return ERR_PTR(-ENOMEM);
+
+   if (align)
+   skb_reserve(skb, align);
+
+   sinfo = skb_shinfo(skb);
+   sinfo-gso_size = gso-gso_size;
+   sinfo-gso_type = SKB_GSO_DODGY;
+   switch (gso-gso_type  ~VIRTIO_NET_HDR_GSO_ECN) {
+   case VIRTIO_NET_HDR_GSO_TCPV4:
+   sinfo-gso_type |= SKB_GSO_TCPV4;
+   break;
+   case VIRTIO_NET_HDR_GSO_TCPV6:
+   sinfo-gso_type |= SKB_GSO_TCPV6;
+   break;
+   case VIRTIO_NET_HDR_GSO_UDP:
+   sinfo-gso_type |= SKB_GSO_UDP;
+