Re: [PATCH 2/3] partial checksum and GSO support for tun/tap.
Rusty Russell wrote: On Tuesday 04 March 2008 16:08:00 Max Krasnyansky wrote: The problem with this approach is that for what I'm doing, the packets aren't nicely arranged somewhere; they're in random process memory. That's fine. RX/TX descriptors would not contain the data itself. They'd contain pointers to actual packets (ie just like the NIC takes physical memory address and DMAs data in/out). The allows for sending/receiving packets without syscalls and fits nicely with the async schemes like GSO. Yes, yes it does. That would be a very nice extension (it's orthogonal to this patch though, so should we get Dave to take these for 2.6.25?). It's orthogonal in general I agree. The only concern is whether we should keep extending existing driver API (ie adding more flags like TUN_NO_PI, etc) or go straight to the new/better ring based API. I do not have a strong preference one way or the other. So I guess I'm saying I'd be ok with Dave taking them for 2.6.25, but that API may be obsolete when the ring based thing comes out. And as it happens, virtio already has such a structure: virtio_ring. See linux/virtio_ring.h. I'll take a look. The structure is for virtio, I'm just borrowing it for tap because it's already there. We could rename it and move it out to its own header, but if so we should do that before 2.6.25 is released. If we do the whole enchilada with the RX/TX rings then we probably do not even need it. I'm thinking that RX/TX descriptor would include everything you need for the GSO and stuff. I meant do not need it for the TUN/TAP driver that is. Is it used anywhere else ? Just for the linux virtio drivers. Reusing it for tun/tap was an afterthought. It just meant I could pass the same structure straight thru, though, which is nice. The userspace-kernel problem is very similar to the guest-host problem, so it doesn't surprise me if we end up with very similar (identical?) interfaces. Take a look at virtio_ring.h, virtio_ring.c and Documentation/lguest/lguest.c to see how we use it... Thanx for the pointers. Max ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/3] partial checksum and GSO support for tun/tap.
On Friday 08 February 2008 16:39:03 Max Krasnyansky wrote: Rusty Russell wrote: (Changes since last time: we how have explicit IFF_RECV_CSUM and IFF_RECV_GSO bits, and some renaming of virtio_net hdr) We use the virtio_net_hdr: it is an ABI already and designed to encapsulate such metadata as GSO and partial checksums. IFF_VIRTIO_HDR means you will write and read a 'struct virtio_net_hdr' at the start of each packet. You can always write packets with partial checksum and gso to the tap device using this header. IFF_RECV_CSUM means you can handle reading packets with partial checksums. If IFF_RECV_GSO is also set, it means you can handle reading (all types of) GSO packets. Note that there is no easy way to detect if these flags are supported: see next patch. Again sorry for delay in replying. Here are my thoughts on this. I like the approach in general. Certainly the part that creates skbs out of the user-space pages looks good. And it's fits nicely into existing TUN driver model. However I actually wanted to change the model :). In particular I'm talking about syscall per packet After messing around with things like libe1000.sf.net I'd like to make TUN/TAP driver look more like modern nic's to the user-space. In other words I'm thinking about introducing RX and TX rings that the user-space can then mmap() and write/read packets descriptors to/from. That will saves the number of system calls that the user-space app needs to do. That by itself saves a lot of overhead, combined with the GSO it's be lightning fast. The problem with this approach is that for what I'm doing, the packets aren't nicely arranged somewhere; they're in random process memory. I thought about further abusing writev and readv to do multiple packets at once. btw We had a long discussion with Eugeniy Polakov on mapping user-pages vs mmap()ing large kernel buffer and doing normal memcpy() (ie instead of copy_to/fromuser()) in the kernel. On small packets overhead of get_user_pages() eats up all the benefits. So we should think of some scheme that nicely combines the two. Kind of like copy break that latest net drivers do these days. Yes, the threshold for copy should probably be set around 128 bytes. Also btw why call it VIRTIO ? For example I'm actually interested in speeding up tunning and general network apps. We have wireless basestation apps here that need to handle packets in user-space. Those kind things have nothing to with virtualization. The structure is for virtio, I'm just borrowing it for tap because it's already there. We could rename it and move it out to its own header, but if so we should do that before 2.6.25 is released. Thanks! Rusty. Max ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/3] partial checksum and GSO support for tun/tap.
Rusty Russell wrote: On Friday 08 February 2008 16:39:03 Max Krasnyansky wrote: Rusty Russell wrote: (Changes since last time: we how have explicit IFF_RECV_CSUM and IFF_RECV_GSO bits, and some renaming of virtio_net hdr) We use the virtio_net_hdr: it is an ABI already and designed to encapsulate such metadata as GSO and partial checksums. IFF_VIRTIO_HDR means you will write and read a 'struct virtio_net_hdr' at the start of each packet. You can always write packets with partial checksum and gso to the tap device using this header. IFF_RECV_CSUM means you can handle reading packets with partial checksums. If IFF_RECV_GSO is also set, it means you can handle reading (all types of) GSO packets. Note that there is no easy way to detect if these flags are supported: see next patch. Again sorry for delay in replying. Here are my thoughts on this. I like the approach in general. Certainly the part that creates skbs out of the user-space pages looks good. And it's fits nicely into existing TUN driver model. However I actually wanted to change the model :). In particular I'm talking about syscall per packet After messing around with things like libe1000.sf.net I'd like to make TUN/TAP driver look more like modern nic's to the user-space. In other words I'm thinking about introducing RX and TX rings that the user-space can then mmap() and write/read packets descriptors to/from. That will saves the number of system calls that the user-space app needs to do. That by itself saves a lot of overhead, combined with the GSO it's be lightning fast. The problem with this approach is that for what I'm doing, the packets aren't nicely arranged somewhere; they're in random process memory. That's fine. RX/TX descriptors would not contain the data itself. They'd contain pointers to actual packets (ie just like the NIC takes physical memory address and DMAs data in/out). The allows for sending/receiving packets without syscalls and fits nicely with the async schemes like GSO. btw The code that I sent you does indeed expect packets to be in a mmap()ed buffer but I agree that it only works for certain cases. In general it's not flexible. I was thinking of introducing some flags in the descriptor that tell the kernel how to handle the packet. ie Whether it needs to be just copied into a fresh SKB or remapped with get_user_pages(). I thought about further abusing writev and readv to do multiple packets at once. I actually was going to abuse them from day one. At that time Alex Kuznetsov told me that I'm crazy and I gave up on it :) Also btw why call it VIRTIO ? For example I'm actually interested in speeding up tunning and general network apps. We have wireless basestation apps here that need to handle packets in user-space. Those kind things have nothing to with virtualization. The structure is for virtio, I'm just borrowing it for tap because it's already there. We could rename it and move it out to its own header, but if so we should do that before 2.6.25 is released. If we do the whole enchilada with the RX/TX rings then we probably do not even need it. I'm thinking that RX/TX descriptor would include everything you need for the GSO and stuff. I meant do not need it for the TUN/TAP driver that is. Is it used anywhere else ? Max ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/3] partial checksum and GSO support for tun/tap.
On Tuesday 04 March 2008 16:08:00 Max Krasnyansky wrote: The problem with this approach is that for what I'm doing, the packets aren't nicely arranged somewhere; they're in random process memory. That's fine. RX/TX descriptors would not contain the data itself. They'd contain pointers to actual packets (ie just like the NIC takes physical memory address and DMAs data in/out). The allows for sending/receiving packets without syscalls and fits nicely with the async schemes like GSO. Yes, yes it does. That would be a very nice extension (it's orthogonal to this patch though, so should we get Dave to take these for 2.6.25?). And as it happens, virtio already has such a structure: virtio_ring. See linux/virtio_ring.h. The structure is for virtio, I'm just borrowing it for tap because it's already there. We could rename it and move it out to its own header, but if so we should do that before 2.6.25 is released. If we do the whole enchilada with the RX/TX rings then we probably do not even need it. I'm thinking that RX/TX descriptor would include everything you need for the GSO and stuff. I meant do not need it for the TUN/TAP driver that is. Is it used anywhere else ? Just for the linux virtio drivers. Reusing it for tun/tap was an afterthought. It just meant I could pass the same structure straight thru, though, which is nice. The userspace-kernel problem is very similar to the guest-host problem, so it doesn't surprise me if we end up with very similar (identical?) interfaces. Take a look at virtio_ring.h, virtio_ring.c and Documentation/lguest/lguest.c to see how we use it... Cheers, Rusty. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/3] partial checksum and GSO support for tun/tap.
Rusty Russell wrote: (Changes since last time: we how have explicit IFF_RECV_CSUM and IFF_RECV_GSO bits, and some renaming of virtio_net hdr) We use the virtio_net_hdr: it is an ABI already and designed to encapsulate such metadata as GSO and partial checksums. IFF_VIRTIO_HDR means you will write and read a 'struct virtio_net_hdr' at the start of each packet. You can always write packets with partial checksum and gso to the tap device using this header. IFF_RECV_CSUM means you can handle reading packets with partial checksums. If IFF_RECV_GSO is also set, it means you can handle reading (all types of) GSO packets. Note that there is no easy way to detect if these flags are supported: see next patch. Again sorry for delay in replying. Here are my thoughts on this. I like the approach in general. Certainly the part that creates skbs out of the user-space pages looks good. And it's fits nicely into existing TUN driver model. However I actually wanted to change the model :). In particular I'm talking about syscall per packet After messing around with things like libe1000.sf.net I'd like to make TUN/TAP driver look more like modern nic's to the user-space. In other words I'm thinking about introducing RX and TX rings that the user-space can then mmap() and write/read packets descriptors to/from. That will saves the number of system calls that the user-space app needs to do. That by itself saves a lot of overhead, combined with the GSO it's be lightning fast. I'm going to send you a version that I cooked up awhile ago in a private email. Do not want to spam netdev :). It's not quite the RX/TX ring model but I'll give you an idea. I did some profiling and PPS (packets per second) numbers that user-space can handle literally sky rocketed. btw We had a long discussion with Eugeniy Polakov on mapping user-pages vs mmap()ing large kernel buffer and doing normal memcpy() (ie instead of copy_to/fromuser()) in the kernel. On small packets overhead of get_user_pages() eats up all the benefits. So we should think of some scheme that nicely combines the two. Kind of like copy break that latest net drivers do these days. Also btw why call it VIRTIO ? For example I'm actually interested in speeding up tunning and general network apps. We have wireless basestation apps here that need to handle packets in user-space. Those kind things have nothing to with virtualization. Max ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 2/3] partial checksum and GSO support for tun/tap.
(Changes since last time: we how have explicit IFF_RECV_CSUM and IFF_RECV_GSO bits, and some renaming of virtio_net hdr) We use the virtio_net_hdr: it is an ABI already and designed to encapsulate such metadata as GSO and partial checksums. IFF_VIRTIO_HDR means you will write and read a 'struct virtio_net_hdr' at the start of each packet. You can always write packets with partial checksum and gso to the tap device using this header. IFF_RECV_CSUM means you can handle reading packets with partial checksums. If IFF_RECV_GSO is also set, it means you can handle reading (all types of) GSO packets. Note that there is no easy way to detect if these flags are supported: see next patch. Signed-off-by: Rusty Russell [EMAIL PROTECTED] --- drivers/net/tun.c | 259 +++-- include/linux/if_tun.h |6 + 2 files changed, 238 insertions(+), 27 deletions(-) diff -r cb85fb035378 drivers/net/tun.c --- a/drivers/net/tun.c Wed Jan 23 20:06:56 2008 +1100 +++ b/drivers/net/tun.c Wed Jan 23 20:12:51 2008 +1100 @@ -62,6 +62,7 @@ #include linux/if_ether.h #include linux/if_tun.h #include linux/crc32.h +#include linux/virtio_net.h #include net/net_namespace.h #include asm/system.h @@ -238,35 +239,188 @@ static unsigned int tun_chr_poll(struct return mask; } +static struct sk_buff *copy_user_skb(size_t align, struct iovec *iv, size_t len) +{ + struct sk_buff *skb; + + if (!(skb = alloc_skb(len + align, GFP_KERNEL))) + return ERR_PTR(-ENOMEM); + + if (align) + skb_reserve(skb, align); + + if (memcpy_fromiovec(skb_put(skb, len), iv, len)) { + kfree_skb(skb); + return ERR_PTR(-EFAULT); + } + return skb; +} + +/* This will fail if they give us a crazy iovec, but that's their own fault. */ +static int get_user_skb_frags(const struct iovec *iv, size_t count, + struct skb_frag_struct *f) +{ + unsigned int i, j, num_pg = 0; + int err; + struct page *pages[MAX_SKB_FRAGS]; + + down_read(current-mm-mmap_sem); + for (i = 0; i count; i++) { + int n, npages; + unsigned long base, len; + base = (unsigned long)iv[i].iov_base; + len = (unsigned long)iv[i].iov_len; + + if (len == 0) + continue; + + /* How many pages will this take? */ + npages = 1 + (base + len - 1)/PAGE_SIZE - base/PAGE_SIZE; + if (unlikely(num_pg + npages MAX_SKB_FRAGS)) { + err = -ENOSPC; + goto fail; + } + n = get_user_pages(current, current-mm, base, npages, + 0, 0, pages, NULL); + if (unlikely(n 0)) { + err = n; + goto fail; + } + + /* Transfer pages to the frag array */ + for (j = 0; j n; j++) { + f[num_pg].page = pages[j]; + if (j == 0) { + f[num_pg].page_offset = offset_in_page(base); + f[num_pg].size = min(len, PAGE_SIZE - +f[num_pg].page_offset); + } else { + f[num_pg].page_offset = 0; + f[num_pg].size = min(len, PAGE_SIZE); + } + len -= f[num_pg].size; + base += f[num_pg].size; + num_pg++; + } + + if (unlikely(n != npages)) { + err = -EFAULT; + goto fail; + } + } + up_read(current-mm-mmap_sem); + return num_pg; + +fail: + for (i = 0; i num_pg; i++) + put_page(f[i].page); + up_read(current-mm-mmap_sem); + return err; +} + + +static struct sk_buff *map_user_skb(const struct virtio_net_hdr *gso, + size_t align, struct iovec *iv, + size_t count, size_t len) +{ + struct sk_buff *skb; + struct skb_shared_info *sinfo; + int err; + + if (!(skb = alloc_skb(gso-hdr_len + align, GFP_KERNEL))) + return ERR_PTR(-ENOMEM); + + if (align) + skb_reserve(skb, align); + + sinfo = skb_shinfo(skb); + sinfo-gso_size = gso-gso_size; + sinfo-gso_type = SKB_GSO_DODGY; + switch (gso-gso_type ~VIRTIO_NET_HDR_GSO_ECN) { + case VIRTIO_NET_HDR_GSO_TCPV4: + sinfo-gso_type |= SKB_GSO_TCPV4; + break; + case VIRTIO_NET_HDR_GSO_TCPV6: + sinfo-gso_type |= SKB_GSO_TCPV6; + break; + case VIRTIO_NET_HDR_GSO_UDP: + sinfo-gso_type |= SKB_GSO_UDP; +