Re: [PATCH 1/1] tun: TUNGETIFF interface to query name and flags
Mark McLoughlin wrote: On Thu, 2008-08-14 at 11:58 +1000, Rusty Russell wrote: On Thursday 14 August 2008 00:30:16 Mark McLoughlin wrote: A very simple approach is attached; I did consider doing a TUNGETFLAGS that would return tun-flags, but I think it's nicer to have a companion to TUNGETIFF since it also allows one to query the interface name from the file descriptor. This seems really sensible to me. If Max acks it, How about it Max? Sorry for the delay Mark. I replied to your other email. ACK. Max ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH] stopmachine: add stopmachine_timeout
Heiko Carstens wrote: On Mon, Jul 14, 2008 at 11:56:18AM -0700, Jeremy Fitzhardinge wrote: Rusty Russell wrote: On Monday 14 July 2008 21:51:25 Christian Borntraeger wrote: Am Montag, 14. Juli 2008 schrieb Hidetoshi Seto: + /* Wait all others come to life */ + while (cpus_weight(prepared_cpus) != num_online_cpus() - 1) { + if (time_is_before_jiffies(limit)) + goto timeout; + cpu_relax(); + } + Hmm. I think this could become interesting on virtual machines. The hypervisor might be to busy to schedule a specific cpu at certain load scenarios. This would cause a failure even if the cpu is not really locked up. We had similar problems with the soft lockup daemon on s390. 5 seconds is a fairly long time. If all else fails we could have a config option to simply disable this code. Hmm.. probably a stupid question: but what could happen that a real cpu (not virtual) becomes unresponsive so that it won't schedule a MAX_RT_PRIO-1 prioritized task for 5 seconds? I have a workload where MAX_PRIO RT thread runs and never yeilds. That's what my cpu isolation patches/tree addresses. Stopmachine is the only (that I know of) thing that really brakes in that case. btw In case you're wondering yes we've discussed workqueue threads starvation and stuff in the other threads. So yet it can happen. It would be good to not-use wall-clock time, but really used cpu time instead. Unfortunately I have no idea, if that is possible in a generic way. Heiko, any ideas? Ah, cpu time comes up again. Perhaps we should actually dig that up again; Zach and Jeremy CC'd. Hm, yeah. But in this case, it's tricky. CPU time is an inherently per-cpu quantity. If cpu A is waiting for cpu B, and wants to do the timeout in cpu-seconds, then it has to be in *B*s cpu-seconds (and if A is waiting on B,C,D,E,F... it needs to measure separate timeouts with separate timebases for each other CPU). It also means that if B is unresponsive but also not consuming any time (blocked in IO, administratively paused, etc), then the timeout will never trigger. So I think monotonic wallclock time actually makes the most sense here. This is asking for trouble... a config option to disable this would be nice. But as I don't know which problem this patch originally addresses it might be that this is needed anyway. So lets see why we need it first. How about this. We'll make this a sysctl, as Rusty already did, and set the default to 0 which means never timeout. That way crazy people like me who care about this scenario can enable this feature. btw Rusty, I just had this why didn't I think of that moments. This is actually another way of handling my workload. I mean it certainly does not fix the root case of the problems and we still need other things that we talked about (non-blocking module delete, lock-free module insertion, etc) but at least in the mean time it avoids wedging the machines for good. btw I'd like that timeout in milliseconds. I think 5 seconds is way to long :). Max ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH] stopmachine: add stopmachine_timeout
Hidetoshi Seto wrote: Heiko Carstens wrote: Hmm.. probably a stupid question: but what could happen that a real cpu (not virtual) becomes unresponsive so that it won't schedule a MAX_RT_PRIO-1 prioritized task for 5 seconds? The original problem (once I heard and easily reproduced) was there was an another MAX_RT_PRIO-1 task and the task was spinning in itself by a bug. (Now this would not be a problem since RLIMIT_RTTIME will work for it, but I cannot deny that there are some situations which cannot set the limit.) Yep. As I described in the prev email in my case it's a legitimate thing. Some of the CPU cores are running wireless basestation schedulers and the deadlines are way too tight for them to sleep (it's cpu as a dedicated engine kind of thing, they are properly isolated and stuff). In this case actually RT limit is the first thing that I disable :). I'd rather have stop_machine fail and tell the user that something is wrong. In which case they can simply stop the basestation app that is running when convinient. ie It give control back to the user rather than wedging the box or killing the app. Max ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH] tun: Fix/rewrite packet filtering logic
Please see the following thread to get some context on this http://marc.info/?l=linux-netdevm=121564433018903w=2 Basically the issue is that current multi-cast filtering stuff in the TUN/TAP driver is seriously broken. Original patch went in without proper review and ACK. It was broken and confusing to start with and subsequent patches broke it completely. To give you an idea of what's broken here are some of the issues: - Very confusing comments throughout the code that imply that the character device is a network interface in its own right, and that packets are passed between the two nics. Which is completely wrong. - Wrong set of ioctls is used for setting up filters. They look like shortcuts for manipulating state of the tun/tap network interface but in reality manipulate the state of the TX filter. - ioctls that were originally used for setting address of the the TX filter got fixed and now set the address of the network interface itself. Which made filter totaly useless. - Filtering is done too late. Instead of filtering early on, to avoid unnecessary wakeups, filtering is done in the read() call. The list goes on and on :) So the patch cleans all that up. It introduces simple and clean interface for setting up TX filters (TUNSETTXFILTER + tun_filter spec) and does filtering before enqueuing the packets. TX filtering is useful in the scenarios where TAP is part of a bridge, in which case it gets all broadcast, multicast and potentially other packets when the bridge is learning. So for example Ethernet tunnelling app may want to setup TX filters to avoid tunnelling multicast traffic. QEMU and other hypervisors can push RX filtering that is currently done in the guest into the host context therefore saving wakeups and unnecessary data transfer. This is on top of the latest and greatest :). Assuming virt folks are ok with the API this should go into 2.6.27. Signed-off-by: Max Krasnyansky [EMAIL PROTECTED] --- drivers/net/tun.c | 316 +++- include/linux/if_tun.h | 24 +++- 2 files changed, 174 insertions(+), 166 deletions(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 2693f88..901551c 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -18,15 +18,11 @@ /* * Changes: * - * Brian Braunstein [EMAIL PROTECTED] 2007/03/23 - *Fixed hw address handling. Now net_device.dev_addr is kept consistent - *with tun.dev_addr when the address is set by this module. - * * Mike Kershaw [EMAIL PROTECTED] 2005/08/14 *Add TUNSETLINK ioctl to set the link encapsulation * * Mark Smith [EMAIL PROTECTED] - * Use random_ether_addr() for tap MAC address. + *Use random_ether_addr() for tap MAC address. * * Harald Roelle [EMAIL PROTECTED] 2004/04/20 *Fixes in packet dropping, queue length setting and queue wakeup. @@ -83,9 +79,16 @@ static int debug; #define DBG1( a... ) #endif +#define FLT_EXACT_COUNT 8 +struct tap_filter { + unsigned intcount;/* Number of addrs. Zero means disabled */ + u32 mask[2]; /* Mask of the hashed addrs */ + unsigned char addr[FLT_EXACT_COUNT][ETH_ALEN]; +}; + struct tun_struct { struct list_headlist; - unsigned long flags; + unsigned intflags; int attached; uid_t owner; gid_t group; @@ -94,19 +97,119 @@ struct tun_struct { struct sk_buff_head readq; struct net_device *dev; + struct fasync_struct*fasync; - struct fasync_struct*fasync; - - unsigned long if_flags; - u8 dev_addr[ETH_ALEN]; - u32 chr_filter[2]; - u32 net_filter[2]; + struct tap_filter txflt; #ifdef TUN_DEBUG int debug; #endif }; +/* TAP filterting */ +static void addr_hash_set(u32 *mask, const u8 *addr) +{ + int n = ether_crc(ETH_ALEN, addr) 26; + mask[n 5] |= (1 (n 31)); +} + +static unsigned int addr_hash_test(const u32 *mask, const u8 *addr) +{ + int n = ether_crc(ETH_ALEN, addr) 26; + return mask[n 5] (1 (n 31)); +} + +static int update_filter(struct tap_filter *filter, void __user *arg) +{ + struct { u8 u[ETH_ALEN]; } *addr; + struct tun_filter uf; + int err, alen, n, nexact; + + if (copy_from_user(uf, arg, sizeof(uf))) + return -EFAULT; + + if (!uf.count) { + /* Disabled */ + filter-count = 0; + return 0; + } + + alen = ETH_ALEN * uf.count; + addr = kmalloc(alen, GFP_KERNEL); + if (!addr) + return -ENOMEM; + + if (copy_from_user(addr, arg + sizeof(uf), alen)) { + err = -EFAULT; + goto done; + } + + /* The filter is updated without holding any locks. Which is +* perfectly safe. We disable it first and in the worst
Re: Multicast and receive filtering in TUN/TAP
Rusty Russell wrote: On Friday 11 July 2008 12:20:07 Max Krasnyansky wrote: I haven't looked at the virtio stuff much, I was assuming that the host side of it is still the TUN driver. Is it not ? Yes, the host side is still tun/tap. The problem is that qemu doesnt know which multicast addresses are used inside the guest. Ah, now I see what you meant by virtio_net does not do multicast. I guess it should trivial to add. Rusty will clarify it I guess. Yes, it could certainly be added; that's what feature bits are for :) Sounds good. I'll send the patch that lets you guys setup tx filters on the TAP devices. Hypervisors will then need to translate rx filters set by the guest OS into TAP tx filters. I'm thinking of doing it just like E1000 for example. 14 exact filters and the rest is hashed. Max ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: Multicast and receive filtering in TUN/TAP
Christian Borntraeger wrote: Am Donnerstag, 10. Juli 2008 schrieb Max Krasnyansky: [...] The second question is do you guys think that QEMU/KVM/LGUEST/etc would benefit if receive filtering was done by the host OS. Here is a specific example of what I'm talking about. We can do what qemu/hw/e1000.c:receive_filter() does in the _host_ context (that function currently runs in the guest context). By looking at libvirt, typical QEMU based setup is that you have a single bridge and all the TAPs from different VMs are hooked up to that bridge. What that means is that if one VM is getting MC traffic or when the bridge sees MACADDR that is not in its tables the packets get delivered to all the VMs. ie We have to wake all of the up only to so that they could drop that packet. Instead, we could setup filters in the host's side of the TAP device. Does that sound like something useful for QEMU/KVM ? If yes we can talk about the API. If not then I'll just nuke it. Max, I know that on s390 the shared OSA network card have multicast filter capabilities. So I guess it is worthwile for a virtualization environments with lots of guests. I also think, that this kind of filtering should be straightforward to implement with the qemu e1000 code. Qemu already knows the multicast addresses. Sure. It's straightforward to do inside QEMU, and it's already doing it. The question is should we do it in the host context instead and avoid some wakeups. Thing is, we are heading towards virtio. Even for Windows ? Unfortunately, virtio_net currently does not offer a method to register multicast addresses. I haven't looked at the virtio stuff much, I was assuming that the host side of it is still the TUN driver. Is it not ? Max ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: Multicast and receive filtering in TUN/TAP
Brian Braunstein wrote: Sorry that I was confused here and it seems I am still confused. I was thinking that for any one instance of a TAP interface, there should be only 1 MAC address, since there is only 1 network interface, since the character device is not a network interface but rather the interface for the application to send and receive on that virtual network interface. Exactly. Your understanding is perfectly correct. See my previous reply. It should clear up all the confusion. For the MC stuff, I have to admit I haven't looked into it much, but it seems like the basic operation of setting the MAC address of the network interface should be supported, and it seems like an ioctl called SIOCSIFHWADDR should Set the InterFace HardWare ADDRess. Sorry if I was wrong about this. It might be good to add a comment to SIOCSIFHWADDR that says This does not actually set the network interface hardware address, this is for multicast filtering or whatever it actually is suppose to do. Or perhaps create a new ioctl that has something about multicast filtering in the name, and leave SIOCSIFHWADDR doing what it is doing now. Yep. That's what I'm going to do (ie a different ioctl). Again see my prev email. We're totally on the same page :). Max brian On Thu, Jul 10, 2008 at 2:38 PM, Shaun Jackman [EMAIL PROTECTED] mailto:[EMAIL PROTECTED] wrote: Hi Max, The original patch implemented receive multicast filtering by emulating the implementation used by many physical Ethernet interfaces: hashing the multicast address. TUN emulates two network cards (and communication via the virtual link between them), the guest and the host, or the character device and the network device, so there are two receive filters: chr_filter and net_filter. I implemented the filtering at the character device using chr_filter in tun_chr_readv, and left filtering at the network device for someone else to implement. I'm not sure what you mean by TX filtering. Multicast filtering is implemented uniquely at the receiver. There are, however, two receivers: the character device and the network device. I believe Brian's patch was mistaken. Two entirely distinct Ethernet addresses are required: one for the character device and one for the network device, or put another way, one for the virtual Ethernet interface at the guest and one for the virtual Ethernet interface at the host. For the same reason, there are two distinct multicast filters. Looking over the original patch, I believe I see a bug in tun_net_mclist: memset(tun-chr_filter, 0, sizeof tun-chr_filter); should be memset(tun-net_filter, 0, sizeof tun-net_filter); Cheers, Shaun On Wed, Jul 9, 2008 at 3:58 PM, Max Krasnyansky [EMAIL PROTECTED] mailto:[EMAIL PROTECTED] wrote: Yesterday while fixing xoff stuckiness issue in the TUN/TAP driver I got a chance to look into the multicast filtering code in there. And immediately realized how terribly broken confusing it is. The patch was originally done by Shaun (CC'ed) and went in without any proper ACK from me, Dave or Jeff. Here is the original ref http://marc.info/?l=linux-netdevm=110490502102308w=2 http://marc.info/?l=linux-netdevm=110490502102308w=2 I'm not going to dive into too much details on what's wrong with the current code. The main issues are that it mixes RX and TX filtering which are orthogonal, and it reuses ioctl names and stuff for manipulating TX filter state as if it was a normal RX multicast state. Later on Brian's patch added insult to the injury http://git.kernel.org/?p=linux/kernel/git/\ http://git.kernel.org/?p=linux/kernel/git/%5C torvalds/linux-2.6.git;\ a=commit;h=36226a8ded46b89a94f9de5976f554bb5e02d84c Brian missed the point of the original patch (not his fault, as I said the original patch was not the best) that the separate address introduced by the MC patch was used for filtering _TX_ packets. It had nothing to do with the HW addr of the local network interface. The problem is that MC stuff is now even more broken and ioctls that were used originally now mean something different. So my first thinking was to just rip the MC stuff out because it's broken and probably nobody uses it (given that we got no complains after Brian's patch broke it completely). But then I realized that if done properly it might be very useful for virtualization. --- So the first question is are there any users out there that ever used the original patch. Shaun, any insight ? How did you intend to use
Multicast and receive filtering in TUN/TAP
Yesterday while fixing xoff stuckiness issue in the TUN/TAP driver I got a chance to look into the multicast filtering code in there. And immediately realized how terribly broken confusing it is. The patch was originally done by Shaun (CC'ed) and went in without any proper ACK from me, Dave or Jeff. Here is the original ref http://marc.info/?l=linux-netdevm=110490502102308w=2 I'm not going to dive into too much details on what's wrong with the current code. The main issues are that it mixes RX and TX filtering which are orthogonal, and it reuses ioctl names and stuff for manipulating TX filter state as if it was a normal RX multicast state. Later on Brian's patch added insult to the injury http://git.kernel.org/?p=linux/kernel/git/\ torvalds/linux-2.6.git;\ a=commit;h=36226a8ded46b89a94f9de5976f554bb5e02d84c Brian missed the point of the original patch (not his fault, as I said the original patch was not the best) that the separate address introduced by the MC patch was used for filtering _TX_ packets. It had nothing to do with the HW addr of the local network interface. The problem is that MC stuff is now even more broken and ioctls that were used originally now mean something different. So my first thinking was to just rip the MC stuff out because it's broken and probably nobody uses it (given that we got no complains after Brian's patch broke it completely). But then I realized that if done properly it might be very useful for virtualization. --- So the first question is are there any users out there that ever used the original patch. Shaun, any insight ? How did you intend to use it ? --- The second question is do you guys think that QEMU/KVM/LGUEST/etc would benefit if receive filtering was done by the host OS. Here is a specific example of what I'm talking about. We can do what qemu/hw/e1000.c:receive_filter() does in the _host_ context (that function currently runs in the guest context). By looking at libvirt, typical QEMU based setup is that you have a single bridge and all the TAPs from different VMs are hooked up to that bridge. What that means is that if one VM is getting MC traffic or when the bridge sees MACADDR that is not in its tables the packets get delivered to all the VMs. ie We have to wake all of the up only to so that they could drop that packet. Instead, we could setup filters in the host's side of the TAP device. Does that sound like something useful for QEMU/KVM ? If yes we can talk about the API. If not then I'll just nuke it. Thanx Max ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/4] tun: TUNSETFEATURES to set gso features.
Rusty Russell wrote: ethtool is useful for setting (some) device fields, but it's root-only. Finer feature control is available through a tun-specific ioctl. (Includes Mark McLoughlin [EMAIL PROTECTED]'s fix to hold rtnl sem). Signed-off-by: Rusty Russell [EMAIL PROTECTED] Agree. And it's often way more convenient to muck with the TUN settings directly on the fd instead of using external tool. Acked-by: Max Krasnyansky [EMAIL PROTECTED] ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/4] tun: Interface to query tun/tap features.
Rusty Russell wrote: The problem with introducing checksum offload and gso to tun is they need to set dev-features to enable GSO and/or checksumming, which is supposed to be done before register_netdevice(), ie. as part of TUNSETIFF. Unfortunately, TUNSETIFF has always just ignored flags it doesn't understand, so there's no good way of detecting whether the kernel supports new IFF_ flags. This patch implements a TUNGETFEATURES ioctl which returns all the valid IFF flags. It could be extended later to include other features. snip Looks good. Dave, do you want me to put all outstanding TUN patches into a git tree so that you can pull them in one shot ? Otherwise if you're ok with applying them one by one please apply this one. Acked-by: Max Krasnyansky [EMAIL PROTECTED] ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 3/4] tun: Allow GSO using virtio_net_hdr
Rusty Russell wrote: Add a IFF_VNET_HDR flag. This uses the same ABI as virtio_net (ie. prepending struct virtio_net_hdr to packets) to indicate GSO and checksum information. Signed-off-by: Rusty Russell [EMAIL PROTECTED] --- drivers/net/tun.c | 90 - include/linux/if_tun.h |2 + 2 files changed, 91 insertions(+), 1 deletion(-) diff -r d94590c1550a drivers/net/tun.c --- a/drivers/net/tun.c Thu Jun 26 00:21:11 2008 +1000 +++ b/drivers/net/tun.c Thu Jun 26 00:21:59 2008 +1000 @@ -63,6 +63,7 @@ #include linux/if_tun.h #include linux/crc32.h #include linux/nsproxy.h +#include linux/virtio_net.h #include net/net_namespace.h #include net/netns/generic.h @@ -283,12 +284,24 @@ static __inline__ ssize_t tun_get_user(s struct tun_pi pi = { 0, __constant_htons(ETH_P_IP) }; struct sk_buff *skb; size_t len = count, align = 0; + struct virtio_net_hdr gso = { 0 }; if (!(tun-flags TUN_NO_PI)) { if ((len -= sizeof(pi)) count) return -EINVAL; if(memcpy_fromiovec((void *)pi, iv, sizeof(pi))) + return -EFAULT; + } + + if (tun-flags TUN_VNET_HDR) { + if ((len -= sizeof(gso)) count) + return -EINVAL; + + if (gso.hdr_len len) + return -EINVAL; + + if (memcpy_fromiovec((void *)gso, iv, sizeof(gso))) return -EFAULT; } Unless I'm missing something the 'if (gso.hdr_len len)' must be after memcpy_fromiovec(). @@ -322,8 +335,45 @@ static __inline__ ssize_t tun_get_user(s break; }; - if (tun-flags TUN_NOCHECKSUM) + if (gso.flags VIRTIO_NET_HDR_F_NEEDS_CSUM) { + if (!skb_partial_csum_set(skb, gso.csum_start, + gso.csum_offset)) { + tun-dev-stats.rx_dropped++; + kfree_skb(skb); + return -EINVAL; + } + } else if (tun-flags TUN_NOCHECKSUM) skb-ip_summed = CHECKSUM_UNNECESSARY; + + if (gso.gso_type != VIRTIO_NET_HDR_GSO_NONE) { + pr_debug(GSO!\n); + switch (gso.gso_type ~VIRTIO_NET_HDR_GSO_ECN) { + case VIRTIO_NET_HDR_GSO_TCPV4: + skb_shinfo(skb)-gso_type = SKB_GSO_TCPV4; + break; + case VIRTIO_NET_HDR_GSO_TCPV6: + skb_shinfo(skb)-gso_type = SKB_GSO_TCPV6; + break; + default: + tun-dev-stats.rx_dropped++; + kfree_skb(skb); + return -EINVAL; + } We should use stats.rx_frame_errors instead of stats.rx_dropped to indicated that we dropped it because something was wrong with the framing (headers, etc). Applies to both of the cases above. + + if (gso.gso_type VIRTIO_NET_HDR_GSO_ECN) + skb_shinfo(skb)-gso_type |= SKB_GSO_TCP_ECN; + + skb_shinfo(skb)-gso_size = gso.gso_size; + if (skb_shinfo(skb)-gso_size == 0) { + tun-dev-stats.rx_dropped++; + kfree_skb(skb); + return -EINVAL; + } Same here. Everything else looks good. Max ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 4/4] lguest: Use GSO/IFF_VNET_HDR extensions on tun/tap
Rusty Russell wrote: On Thursday 26 June 2008 05:07:18 Anthony Liguori wrote: Rusty Russell wrote: @@ -1563,6 +1561,16 @@ static void setup_tun_net(char *arg) /* Tell Guest what MAC address to use. */ add_feature(dev, VIRTIO_NET_F_MAC); add_feature(dev, VIRTIO_F_NOTIFY_ON_EMPTY); + /* Expect Guest to handle everything except UFO */ + add_feature(dev, VIRTIO_NET_F_CSUM); You're setting this feature twice. Hmm, not in the version here? + add_feature(dev, VIRTIO_NET_F_GUEST_CSUM); You set this feature, but I never see the virtio-net driver acknowledge the feature. Curiously, my implementation with KVM is struggling because UDP packet checksums are not correct so the DHCP client is ignoring them. If I disable CSUM offload, things it works fine (using the virtio-net header). The problem is only host=guest, guest=host is fine. OK, found this: wrong args to skb_partial_csum_set. It was found by Mark McLoughlin before, I just lost the fix when I extracted this into a separate patch. I chose to move the call to skb_partial_csum_set(), rather than use his fix (which assumed a tap not tun device). Here's two fixes on top of previous patch: diff -u b/drivers/net/tun.c b/drivers/net/tun.c --- b/drivers/net/tun.c Thu Jun 26 00:21:59 2008 +1000 +++ b/drivers/net/tun.c Thu Jun 26 14:35:03 2008 +1000 @@ -298,11 +298,11 @@ if ((len -= sizeof(gso)) count) return -EINVAL; - if (gso.hdr_len len) - return -EINVAL; - if (memcpy_fromiovec((void *)gso, iv, sizeof(gso))) return -EFAULT; + + if (gso.hdr_len len) + return -EINVAL; } Yep, looks better now. if ((tun-flags TUN_TYPE_MASK) == TUN_TAP_DEV) { @@ -324,6 +324,16 @@ return -EFAULT; } + if (gso.flags VIRTIO_NET_HDR_F_NEEDS_CSUM) { + if (!skb_partial_csum_set(skb, gso.csum_start, + gso.csum_offset)) { + tun-dev-stats.rx_dropped++; + kfree_skb(skb); + return -EINVAL; + } + } else if (tun-flags TUN_NOCHECKSUM) + skb-ip_summed = CHECKSUM_UNNECESSARY; + switch (tun-flags TUN_TYPE_MASK) { case TUN_TUN_DEV: skb_reset_mac_header(skb); @@ -335,16 +345,6 @@ break; }; - if (gso.flags VIRTIO_NET_HDR_F_NEEDS_CSUM) { - if (!skb_partial_csum_set(skb, gso.csum_start, - gso.csum_offset)) { - tun-dev-stats.rx_dropped++; - kfree_skb(skb); - return -EINVAL; - } - } else if (tun-flags TUN_NOCHECKSUM) - skb-ip_summed = CHECKSUM_UNNECESSARY; - if (gso.gso_type != VIRTIO_NET_HDR_GSO_NONE) { pr_debug(GSO!\n); switch (gso.gso_type ~VIRTIO_NET_HDR_GSO_ECN) { Do you want to resent the GSO patch with all the latest fixes ? ie other things (stat counters) I pointed out in the prev email. I'll ack it. Thanx Max ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH RFC 3/5] tun: vringfd receive support.
Rusty Russell wrote: This patch modifies tun to allow a vringfd to specify the receive buffer. Because we can't copy to userspace in bh context, we queue like normal then use the pull hook to actually do the copy. More thought needs to be put into the possible races with ring registration and a simultaneous close, for example (see FIXME). We use struct virtio_net_hdr prepended to packets in the ring to allow userspace to receive GSO packets in future (at the moment, the tun driver doesn't tell the stack it can handle them, so these cases are never taken). In general the code looks good. The only thing I could not convince myself in is whether having generic ring buffer makes sense or not. At least the TUN driver would be more efficient if it had its own simple ring implementation. Less indirection, fewer callbacks, fewer if()s, etc. TUN already has the file descriptor and having two additional fds for rx and tx ring is a waste (think of a VPN server that has to have a bunch of TUN fds). Also as I mentioned before Jamal and I wanted to expose some of the SKB fields through TUN device. With the rx/tx rings the natural way of doing that would be the ring descriptor itself. It can of course be done the same way we copy proto info (PI) and GSO stuff before the packet but that means more copy_to_user() calls and yet more checks. So. What am I missing ? Why do we need generic ring for the TUN ? I looked at the lguest code a bit and it seems that we need a bunch of network specific code anyway. The cool thing is that you can now mmap the rings into the guest directly but the same thing can be done with TUN specific rings. Max ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/3] partial checksum and GSO support for tun/tap.
Rusty Russell wrote: On Friday 08 February 2008 16:39:03 Max Krasnyansky wrote: Rusty Russell wrote: (Changes since last time: we how have explicit IFF_RECV_CSUM and IFF_RECV_GSO bits, and some renaming of virtio_net hdr) We use the virtio_net_hdr: it is an ABI already and designed to encapsulate such metadata as GSO and partial checksums. IFF_VIRTIO_HDR means you will write and read a 'struct virtio_net_hdr' at the start of each packet. You can always write packets with partial checksum and gso to the tap device using this header. IFF_RECV_CSUM means you can handle reading packets with partial checksums. If IFF_RECV_GSO is also set, it means you can handle reading (all types of) GSO packets. Note that there is no easy way to detect if these flags are supported: see next patch. Again sorry for delay in replying. Here are my thoughts on this. I like the approach in general. Certainly the part that creates skbs out of the user-space pages looks good. And it's fits nicely into existing TUN driver model. However I actually wanted to change the model :). In particular I'm talking about syscall per packet After messing around with things like libe1000.sf.net I'd like to make TUN/TAP driver look more like modern nic's to the user-space. In other words I'm thinking about introducing RX and TX rings that the user-space can then mmap() and write/read packets descriptors to/from. That will saves the number of system calls that the user-space app needs to do. That by itself saves a lot of overhead, combined with the GSO it's be lightning fast. The problem with this approach is that for what I'm doing, the packets aren't nicely arranged somewhere; they're in random process memory. That's fine. RX/TX descriptors would not contain the data itself. They'd contain pointers to actual packets (ie just like the NIC takes physical memory address and DMAs data in/out). The allows for sending/receiving packets without syscalls and fits nicely with the async schemes like GSO. btw The code that I sent you does indeed expect packets to be in a mmap()ed buffer but I agree that it only works for certain cases. In general it's not flexible. I was thinking of introducing some flags in the descriptor that tell the kernel how to handle the packet. ie Whether it needs to be just copied into a fresh SKB or remapped with get_user_pages(). I thought about further abusing writev and readv to do multiple packets at once. I actually was going to abuse them from day one. At that time Alex Kuznetsov told me that I'm crazy and I gave up on it :) Also btw why call it VIRTIO ? For example I'm actually interested in speeding up tunning and general network apps. We have wireless basestation apps here that need to handle packets in user-space. Those kind things have nothing to with virtualization. The structure is for virtio, I'm just borrowing it for tap because it's already there. We could rename it and move it out to its own header, but if so we should do that before 2.6.25 is released. If we do the whole enchilada with the RX/TX rings then we probably do not even need it. I'm thinking that RX/TX descriptor would include everything you need for the GSO and stuff. I meant do not need it for the TUN/TAP driver that is. Is it used anywhere else ? Max ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/3] partial checksum and GSO support for tun/tap.
Rusty Russell wrote: (Changes since last time: we how have explicit IFF_RECV_CSUM and IFF_RECV_GSO bits, and some renaming of virtio_net hdr) We use the virtio_net_hdr: it is an ABI already and designed to encapsulate such metadata as GSO and partial checksums. IFF_VIRTIO_HDR means you will write and read a 'struct virtio_net_hdr' at the start of each packet. You can always write packets with partial checksum and gso to the tap device using this header. IFF_RECV_CSUM means you can handle reading packets with partial checksums. If IFF_RECV_GSO is also set, it means you can handle reading (all types of) GSO packets. Note that there is no easy way to detect if these flags are supported: see next patch. Again sorry for delay in replying. Here are my thoughts on this. I like the approach in general. Certainly the part that creates skbs out of the user-space pages looks good. And it's fits nicely into existing TUN driver model. However I actually wanted to change the model :). In particular I'm talking about syscall per packet After messing around with things like libe1000.sf.net I'd like to make TUN/TAP driver look more like modern nic's to the user-space. In other words I'm thinking about introducing RX and TX rings that the user-space can then mmap() and write/read packets descriptors to/from. That will saves the number of system calls that the user-space app needs to do. That by itself saves a lot of overhead, combined with the GSO it's be lightning fast. I'm going to send you a version that I cooked up awhile ago in a private email. Do not want to spam netdev :). It's not quite the RX/TX ring model but I'll give you an idea. I did some profiling and PPS (packets per second) numbers that user-space can handle literally sky rocketed. btw We had a long discussion with Eugeniy Polakov on mapping user-pages vs mmap()ing large kernel buffer and doing normal memcpy() (ie instead of copy_to/fromuser()) in the kernel. On small packets overhead of get_user_pages() eats up all the benefits. So we should think of some scheme that nicely combines the two. Kind of like copy break that latest net drivers do these days. Also btw why call it VIRTIO ? For example I'm actually interested in speeding up tunning and general network apps. We have wireless basestation apps here that need to handle packets in user-space. Those kind things have nothing to with virtualization. Max ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization