Re: [RFC] net: napi fix
Joonwoo Park wrote: 2007/12/13, Kok, Auke [EMAIL PROTECTED]: David Miller wrote: From: Andrew Gallatin [EMAIL PROTECTED] Date: Wed, 12 Dec 2007 12:29:23 -0500 Is the netif_running() check even required? No, it is not. When a device is brought down, one of the first things that happens is that we wait for all pending NAPI polls to complete, then block any new polls from starting. I think this was previously (pre-2.6.24) not the case, which is why e1000 et al has this check as well and that's exactly what is causing most of the net_rx_action oopses in the first place. Without the netif_running() check previously the drivers were just unusable with NAPI and prone to many races with down (i.e. touching some ethtool ioctl which wants to do a reset while routing small packets at high numbers). that's why we added the netif_running() check in the first place :) There might be more drivers lurking that need this change... Auke Also in my case, without netif_running() check, I cannot do ifconfig down. It stucked if packet generator was sending packets. If the netif_running() check is indeed required to make a device break out of napi polling and respond to an ifconfig down, then I think the netif_running() check should be moved up into net_rx_action() to avoid potential for driver complexity and bugs like the ones you found. Drew -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] net: napi fix
Stephen Hemminger wrote: On Thu, 13 Dec 2007 06:19:38 -0800 (PST) David Miller [EMAIL PROTECTED] wrote: From: Andrew Gallatin [EMAIL PROTECTED] Date: Thu, 13 Dec 2007 09:13:54 -0500 If the netif_running() check is indeed required to make a device break out of napi polling and respond to an ifconfig down, then I think the netif_running() check should be moved up into net_rx_action() to avoid potential for driver complexity and bugs like the ones you found. That, or something like it, definitely sounds reasonable and much better than putting the check into every driver :-) -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html It is not possible to do netif_running() check in generic code as currently written because of the case of devices where a single NAPI object is being used to handle two devices. The association between napi and netdevice is M to N. There are cases like niu that have multiple NAPI's and one netdevice; and devices like sky2 that can have one NAPI and 2 netdevice's. Ah, now I see. I forgot that not every device has a 1:1::napi:netdev relationship. Could we make an optional *dev_state field in the napi structure. It would be initialized to __LINK_STATE_START. Devices which have a 1:1 NAPI:netdevice relationship would set it to netdev-state. The generic code would then do a test_bit(__LINK_STATE_START, napi-dev_state), and 1:1 drivers could remove this check. M:N drivers would pay for a useless (to them) test_bit, and would have to provide their own netif_running check to get termination under heavy load. Just an idea, perhaps there is a better way which is less hacky. Or perhaps we should just leave things as is. Drew -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] net: napi fix
David Miller wrote: From: Andrew Gallatin [EMAIL PROTECTED] Date: Wed, 12 Dec 2007 12:29:23 -0500 Is the netif_running() check even required? No, it is not. When a device is brought down, one of the first things that happens is that we wait for all pending NAPI polls to complete, then block any new polls from starting. Great, thanks. I will submit a patch to remove the bogus check. This should fix myri10ge properly. Thank you, Drew -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] net: napi fix
[I apologize for loosing threading, I'm replying from the archives] The problem is that the driver is doing a NAPI completion and re-enabling chip interrupts with work_done == weight, and that is illegal. The only time at least myri10ge will do this is due to the !netif_running(netdev) check. Eg, from myri10ge's poll: work_done = myri10ge_clean_rx_done(mgp, budget); if (work_done budget || !netif_running(netdev)) { netif_rx_complete(netdev, napi); put_be32(htonl(3), mgp-irq_claim); } Is the netif_running() check even required? Is this just a bad way to solve a race with running NAPI at down() time that would be better solved by putting a napi_synchronize() in the driver's down() routine? I'd rather fix this right than add another check to a questionable code path. Thanks, Drew -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH][NETDEV]: remove netif_running() check from myri10ge_poll()
Remove the bogus netif_running() check from myri10ge_poll(). This eliminates any chance that myri10ge_poll() can trigger an oops by calling netif_rx_complete() and returning with work_done == budget. Signed-off-by: Andrew Gallatin [EMAIL PROTECTED] diff --git a/drivers/net/myri10ge/myri10ge.c b/drivers/net/myri10ge/myri10ge.c index 8def865..c90958f 100644 --- a/drivers/net/myri10ge/myri10ge.c +++ b/drivers/net/myri10ge/myri10ge.c @@ -1239,7 +1239,7 @@ static int myri10ge_poll(struct napi_str /* process as many rx events as NAPI will allow */ work_done = myri10ge_clean_rx_done(mgp, budget); - if (work_done budget || !netif_running(netdev)) { + if (work_done budget) { netif_rx_complete(netdev, napi); put_be32(htonl(3), mgp-irq_claim); }
[PATCH (resubmit)]: fix lro_gen_skb() alignment
Add a field to the lro_mgr struct so that drivers can specify how much padding is required to align layer 3 headers when a packet is copied into a freshly allocated skb by inet_lro.c:lro_gen_skb(). Without padding, skbs generated by LRO will cause alignment warnings on architectures which require strict alignment (seen on sparc64). Myri10GE is updated to use this field. Signed off by: Andrew Gallatin [EMAIL PROTECTED] diff --git a/drivers/net/myri10ge/myri10ge.c b/drivers/net/myri10ge/myri10ge.c index 0f306dd..8def865 100644 --- a/drivers/net/myri10ge/myri10ge.c +++ b/drivers/net/myri10ge/myri10ge.c @@ -1979,6 +1979,7 @@ static int myri10ge_open(struct net_devi lro_mgr-lro_arr = mgp-rx_done.lro_desc; lro_mgr-get_frag_header = myri10ge_get_frag_header; lro_mgr-max_aggr = myri10ge_lro_max_pkts; + lro_mgr-frag_align_pad = 2; if (lro_mgr-max_aggr MAX_SKB_FRAGS) lro_mgr-max_aggr = MAX_SKB_FRAGS; diff --git a/include/linux/inet_lro.h b/include/linux/inet_lro.h index 1246d46..80335b7 100644 --- a/include/linux/inet_lro.h +++ b/include/linux/inet_lro.h @@ -91,6 +91,9 @@ #define LRO_F_EXTRACT_VLAN_ID 2 /* Set int max_desc; /* Max number of LRO descriptors */ int max_aggr; /* Max number of LRO packets to be aggregated */ + int frag_align_pad; /* Padding required to properly align layer 3 +* headers in generated skb when using frags */ + struct net_lro_desc *lro_arr; /* Array of LRO descriptors */ /* diff --git a/net/ipv4/inet_lro.c b/net/ipv4/inet_lro.c index ac3b1d3..9a96c27 100644 --- a/net/ipv4/inet_lro.c +++ b/net/ipv4/inet_lro.c @@ -401,10 +401,11 @@ static struct sk_buff *lro_gen_skb(struc int data_len = len; int hdr_len = min(len, hlen); - skb = netdev_alloc_skb(lro_mgr-dev, hlen); + skb = netdev_alloc_skb(lro_mgr-dev, hlen + lro_mgr-frag_align_pad); if (!skb) return NULL; + skb_reserve(skb, lro_mgr-frag_align_pad); skb-len = len; skb-data_len = len - hdr_len; skb-truesize += true_size;
Re: [PATCH]: fix lro_gen_skb() alignment
Herbert Xu wrote: On Fri, Nov 30, 2007 at 02:35:43PM -0500, Andrew Gallatin wrote: Isn't the value of 2 ethernet-specific (to round the 14-byte header up to 16)? Given that the rest of the lro code is fairly careful to calculate mac_hdr_len etc it seems as if it would be cleaner to make this independent of the specific L2 being used. (And I plan on using the LRO module for IP-over-InfiniBand so this is not completely theoretical) Good point! We really should rename NET_IP_ALIGN so that both Ethernet and DMA occur in it somehow :) Good point. I tend to think all the world is ethernet. Perhaps the better way would be to simply add an alignment pad field to lro_mgr? When the driver initializes it, it specifies any padding needed. Ethernet drivers would specify 2. Just pass in the mac_hdr_len, and calculate the padding as That was my first thought as well, but it turns out that when lro_gen_skb() is called via the out1 label, mac_hdr_len may not be known. It seemed simplest and cleanest to just make it a field in lro_mgr. Drew -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH]: fix lro_gen_skb() alignment
Roland Dreier wrote: - skb = netdev_alloc_skb(lro_mgr-dev, hlen); + skb = netdev_alloc_skb(lro_mgr-dev, hlen + NET_IP_ALIGN); NET_IP_ALIGN should only be used if you're DMAing into the skb head. Otherwise you should say 2. It would be nice to have another macro for that I suppose. It is certainly simple enough to say 2. Thank you for pointing this out. I have attached a patch to do that.. Signed off by: Andrew Gallatin [EMAIL PROTECTED] Isn't the value of 2 ethernet-specific (to round the 14-byte header up to 16)? Given that the rest of the lro code is fairly careful to calculate mac_hdr_len etc it seems as if it would be cleaner to make this independent of the specific L2 being used. (And I plan on using the LRO module for IP-over-InfiniBand so this is not completely theoretical) Good point. I tend to think all the world is ethernet. Perhaps the better way would be to simply add an alignment pad field to lro_mgr? When the driver initializes it, it specifies any padding needed. Ethernet drivers would specify 2. Is this acceptable? Drew - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH]: fix lro_gen_skb() alignment
Herbert Xu wrote: Andrew Gallatin [EMAIL PROTECTED] wrote: diff --git a/net/ipv4/inet_lro.c b/net/ipv4/inet_lro.c index ac3b1d3..91e9371 100644 --- a/net/ipv4/inet_lro.c +++ b/net/ipv4/inet_lro.c @@ -401,10 +401,11 @@ static struct sk_buff *lro_gen_skb(struc int data_len = len; int hdr_len = min(len, hlen); - skb = netdev_alloc_skb(lro_mgr-dev, hlen); + skb = netdev_alloc_skb(lro_mgr-dev, hlen + NET_IP_ALIGN); NET_IP_ALIGN should only be used if you're DMAing into the skb head. Otherwise you should say 2. It would be nice to have another macro for that I suppose. It is certainly simple enough to say 2. Thank you for pointing this out. I have attached a patch to do that.. Signed off by: Andrew Gallatin [EMAIL PROTECTED] Drew diff --git a/net/ipv4/inet_lro.c b/net/ipv4/inet_lro.c index ac3b1d3..35b816e 100644 --- a/net/ipv4/inet_lro.c +++ b/net/ipv4/inet_lro.c @@ -401,10 +401,11 @@ static struct sk_buff *lro_gen_skb(struc int data_len = len; int hdr_len = min(len, hlen); - skb = netdev_alloc_skb(lro_mgr-dev, hlen); + skb = netdev_alloc_skb(lro_mgr-dev, hlen + 2); if (!skb) return NULL; + skb_reserve(skb, 2); skb-len = len; skb-data_len = len - hdr_len; skb-truesize += true_size;
[PATCH]: fix lro_gen_skb() alignment
The inet_lro.c:lro_gen_skb() function fails to include NET_IP_ALIGN padding at the front of the sk_buffs it creates, leading to alignment warnings on architectures which require strict alignment (seen on sparc64). The attached patch adds NET_IP_ALIGN padding. Signed off by: Andrew Gallatin [EMAIL PROTECTED] Drew diff --git a/net/ipv4/inet_lro.c b/net/ipv4/inet_lro.c index ac3b1d3..91e9371 100644 --- a/net/ipv4/inet_lro.c +++ b/net/ipv4/inet_lro.c @@ -401,10 +401,11 @@ static struct sk_buff *lro_gen_skb(struc int data_len = len; int hdr_len = min(len, hlen); - skb = netdev_alloc_skb(lro_mgr-dev, hlen); + skb = netdev_alloc_skb(lro_mgr-dev, hlen + NET_IP_ALIGN); if (!skb) return NULL; + skb_reserve(skb, NET_IP_ALIGN); skb-len = len; skb-data_len = len - hdr_len; skb-truesize += true_size;
Re: [PATCH] LRO ack aggregation
David Miller wrote: From: Herbert Xu [EMAIL PROTECTED] Date: Tue, 20 Nov 2007 14:09:18 +0800 David Miller [EMAIL PROTECTED] wrote: Fundamentally, I really don't like this change, it batches to the point where it begins to erode the natural ACK clocking of TCP, and I therefore am very likely to revert it before merging to Linus. Perhaps make it a tunable that defaults to off? That's one idea. I'd certainly prefer the option to have a tunable to having our customers see performance regressions when they switch to the kernel's LRO. But if it's there the risk it to have it end up being turned on always by distribution vendors, making our off-default pointless and the internet gets crapped up with misbehaving Linux TCP stacks anyways. If vendors are going to pick this up, there is the risk of them just applying this patch (which currently has no tunable to disable it), leaving their users stuck with it enabled. At least with a tunable, it would be easy for them to turn it off. And the comments surrounding it could make it clear why it should default to off. FWIW, we've seen TCP perform well in a WAN setting using our NICs and our LRO which does this ack aggregation. For example, the last 2 Supercomputing Bandwidth Challenges (making the most of a 10Gb/s WAN connection) were won by teams using our NICs, with drivers that did this sort of ack aggregation. Drew - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] LRO ack aggregation
David Miller wrote: From: Andrew Gallatin [EMAIL PROTECTED] Date: Tue, 20 Nov 2007 06:47:57 -0500 David Miller wrote: From: Herbert Xu [EMAIL PROTECTED] Date: Tue, 20 Nov 2007 14:09:18 +0800 David Miller [EMAIL PROTECTED] wrote: Fundamentally, I really don't like this change, it batches to the point where it begins to erode the natural ACK clocking of TCP, and I therefore am very likely to revert it before merging to Linus. Perhaps make it a tunable that defaults to off? That's one idea. I'd certainly prefer the option to have a tunable to having our customers see performance regressions when they switch to the kernel's LRO. Please qualify this because by itself it's an inaccurate statement. It would cause a performance regression in situations where the is nearly no packet loss, no packet reordering, and the receiver has strong enough cpu power. Yes, a regression of nearly 1Gb/s in some cases as I mentioned when I submitted the patch. Show me something over real backbones, talking to hundres or thousands of clients scattered all over the world. That's what people will be using these high end NICs for front facing services, and that's where loss happens and stretch ACKs hurt performance. I can't. I think most 10GbE on endstations is used either in the sever room, or on dedicated links. My experience with 10GbE users is limited to my interactions with people using our NICs who contact our support. Of those, I can recall only a tiny handful who were using 10GbE on a normal internet facing connection (and the ones I dealt with were actually running a different OS). The vast majority were in a well controlled, lossless environment. It is quite ironic. The very fact that I cannot provide you with examples of internet facing people using LRO (w/ack aggr) in more normal applications tends to support my point that most 10GbE users seem to be in lossless environments. ACK stretching is bad bad bad for everything outside of some well controlled test network bubble. I just want those in the bubble to continue have the best performance possible in their situation. If it is a tunable the defaults to off, that is great. Hmm.. rather than a global tunable, what if it was a network driver managed tunable which toggled a flag in the lro_mgr features? Would that be better? Drew - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH]: Fix myri10ge NAPI oops warnings
When testing the myri10ge driver with 2.6.24-rc1, I found that the machine crashed under heavy load: Unable to handle kernel paging request at 00100108 RIP: [803cc8dd] net_rx_action+0x11b/0x184 The address corresponds to the list_move_tail() in netif_rx_complete(): if (unlikely(work == weight)) list_move_tail(n-poll_list, list); Eventually, I traced the crashes to calling netif_rx_complete() with work_done == budget. From looking at other drivers, it appears that one should only call netif_rx_complete() when work_done budget. To fix it, I changed the test in myri10ge_poll() so that it refers to to work_done rather than looking at the rx ring status. If work_done is budget, then that implies we have no more packets to process. Any races will be resolved by the NIC when the write to irq_claim is made. In myri10ge_clean_rx_done(), if we ever exceeded our budget, it would report a work_done one larger than was acutally done. This is because the increment was done in the conditional, so work_done would be incremented regardless of whether or not the test passed or failed. This would lead to the WARN_ON_ONCE(work weight); warning in net_rx_action triggering. I've moved the increment of work_done inside the loop. Note that this would only be a problem when we had exceeded our budget. Signed off by: Andrew Gallatin [EMAIL PROTECTED] Andrew Gallatin Myricom Inc diff --git a/drivers/net/myri10ge/myri10ge.c b/drivers/net/myri10ge/myri10ge.c index 366e62a..0f306dd 100644 --- a/drivers/net/myri10ge/myri10ge.c +++ b/drivers/net/myri10ge/myri10ge.c @@ -1151,7 +1151,7 @@ static inline int myri10ge_clean_rx_done u16 length; __wsum checksum; - while (rx_done-entry[idx].length != 0 work_done++ budget) { + while (rx_done-entry[idx].length != 0 work_done budget) { length = ntohs(rx_done-entry[idx].length); rx_done-entry[idx].length = 0; checksum = csum_unfold(rx_done-entry[idx].checksum); @@ -1167,6 +1167,7 @@ static inline int myri10ge_clean_rx_done rx_bytes += rx_ok * (unsigned long)length; cnt++; idx = cnt (myri10ge_max_intr_slots - 1); + work_done++; } rx_done-idx = idx; rx_done-cnt = cnt; @@ -1233,13 +1234,12 @@ static int myri10ge_poll(struct napi_str struct myri10ge_priv *mgp = container_of(napi, struct myri10ge_priv, napi); struct net_device *netdev = mgp-dev; - struct myri10ge_rx_done *rx_done = mgp-rx_done; int work_done; /* process as many rx events as NAPI will allow */ work_done = myri10ge_clean_rx_done(mgp, budget); - if (rx_done-entry[rx_done-idx].length == 0 || !netif_running(netdev)) { + if (work_done budget || !netif_running(netdev)) { netif_rx_complete(netdev, napi); put_be32(htonl(3), mgp-irq_claim); }
Re: [PATCH]: Fix myri10ge NAPI oops warnings
Stephen Hemminger wrote: On Wed, 31 Oct 2007 17:40:06 -0400 Andrew Gallatin [EMAIL PROTECTED] wrote: When testing the myri10ge driver with 2.6.24-rc1, I found that the machine crashed under heavy load: Unable to handle kernel paging request at 00100108 RIP: [803cc8dd] net_rx_action+0x11b/0x184 The address corresponds to the list_move_tail() in netif_rx_complete(): if (unlikely(work == weight)) list_move_tail(n-poll_list, list); Eventually, I traced the crashes to calling netif_rx_complete() with work_done == budget. From looking at other drivers, it appears that one should only call netif_rx_complete() when work_done budget. To fix it, I changed the test in myri10ge_poll() so that it refers to to work_done rather than looking at the rx ring status. If work_done is budget, then that implies we have no more packets to process. Any races will be resolved by the NIC when the write to irq_claim is made. In myri10ge_clean_rx_done(), if we ever exceeded our budget, it would report a work_done one larger than was acutally done. This is because the increment was done in the conditional, so work_done would be incremented regardless of whether or not the test passed or failed. This would lead to the WARN_ON_ONCE(work weight); warning in net_rx_action triggering. I've moved the increment of work_done inside the loop. Note that this would only be a problem when we had exceeded our budget. Signed off by: Andrew Gallatin [EMAIL PROTECTED] Andrew Gallatin Myricom Inc Yes, this looks right. How could the check in netif_rx_complete be changed to catch this better? I'm sorry, but I don't really know. I'm afraid that I am rather clueless about the new NAPI, and found this by stumbling around in the dark. Naively, it seems like some global option to override all napi weights (eg, to 1) would be helpful so that if a driver had this problem, it could be easily reproduced by the first packet received. At least I found setting our weight to one made the bug rather obvious to me. Drew - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] LRO ack aggregation
Hi, We recently did some performance comparisons between the new inet_lro LRO support in the kernel, and our Myri10GE in-driver LRO. For receive, we found they were nearly identical. However, for transmit, we found that Myri10GE's LRO shows much lower CPU utilization. We traced the CPU utilization difference to our driver LRO aggregating TCP acks, and the inet_lro module not doing so. I've attached a patch which adds support to inet_lro for aggregating pure acks. Aggregating pure acks (segments with TCP_PAYLOAD_LENGTH == 0) entails freeing the skb (or putting the page in the frags case). The patch also handles trimming (typical for 54-byte pure ack frames which have been padded to the ethernet minimum 60 byte frame size). In the frags case, I tried to keep things simple by only doing the trim when the entire frame fits in the first frag. To be safe, I ensure that the padding is all 0 (or, more exactly, was some pattern whose checksum is -0) so that it doesn't impact hardware checksums. This patch also fixes a small bug in the skb LRO path dealing with vlans that I found when doing my own testing. Specifically, in the desc-active case, the existing code always fails the lro_tcp_ip_check() for NICs without LRO_F_EXTRACT_VLAN_ID, because it fails to subtract the vlan_hdr_len from the skb-len. Jan-Bernd Themann ([EMAIL PROTECTED]) has tested the patch using the eHEA driver (skb codepath), and I have tested it using Myri10GE (both frags and skb codepath). Using a pair of identical low-end 2.0GHz Athlon 64 x2 3800+ with Myri10GE 10GbE NICs, I ran 10 iterations of netperf TCP_SENDFILE tests, taking the median run for comparison purposes. The receiver was running Myri10GE + patched inet_lro: TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to rome-my (192.168.1.16) port 0 AF_INET : cpu bind Recv SendSend Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size SizeSize Time Throughput localremote local remote bytes bytes bytessecs.10^6bits/s % S % S us/KB us/KB Myri10GE driver-specific LRO: 87380 65536 6553660.02 9442.65 16.2469.310.282 1.203 Myri10GE + unpatched inet_lro: 87380 65536 6553660.02 9442.88 20.1069.110.349 1.199 Myri10GE + patched inet_lro: 87380 65536 6553660.02 9443.30 16.9568.970.294 1.197 The important bit here is the sender's CPU utilization, and service demand (cost per byte). As you can see, without aggregating ack's, the overhead on the sender is roughly 20% higher, even when sending to a receiver which uses LRO. The differences are even more dramatic when sending to a receiver which does not use LRO (and hence sends more frequent acks). Below is the same benchmark, run between a pair of 4-way 3.0GHz Xeon 5160 machines (Dell 2950) with Myri10GE NICs. The receiver is running Solaris 10U4, which does not do LRO, and is acking at approximately 8:1 (or ~100K acks/sec): Myri10GE driver-specific LRO: 196712 65536 6553660.01 9280.09 7.14 45.370.252 1.602 Myri10GE + unpatched inet_lro: 196712 65536 6553660.01 8530.80 10.5144.600.404 1.713 Myri10GE + patched inet_lro: 196712 65536 6553660.00 9249.65 7.21 45.900.255 1.626 Signed off by: Andrew Gallatin [EMAIL PROTECTED] Andrew Gallatin Myricom Inc. diff --git a/net/ipv4/inet_lro.c b/net/ipv4/inet_lro.c index ac3b1d3..eba145b 100644 --- a/net/ipv4/inet_lro.c +++ b/net/ipv4/inet_lro.c @@ -58,9 +58,6 @@ static int lro_tcp_ip_check(struct iphdr if (ntohs(iph-tot_len) != len) return -1; - if (TCP_PAYLOAD_LENGTH(iph, tcph) == 0) - return -1; - if (iph-ihl != IPH_LEN_WO_OPTIONS) return -1; @@ -223,6 +220,11 @@ static void lro_add_packet(struct net_lr lro_add_common(lro_desc, iph, tcph, tcp_data_len); + if (tcp_data_len == 0) { + dev_kfree_skb_any(skb); + return; + } + skb_pull(skb, (skb-len - tcp_data_len)); parent-truesize += skb-truesize; @@ -244,6 +246,11 @@ static void lro_add_frags(struct net_lro lro_add_common(lro_desc, iph, tcph, tcp_data_len); + if (tcp_data_len == 0) { + put_page(skb_frags[0].page); + return; + } + skb-truesize += truesize; skb_frags[0].page_offset += hlen; @@ -338,6 +345,8 @@ static int __lro_proc_skb(struct net_lro struct tcphdr *tcph; u64 flags; int vlan_hdr_len = 0; + int pkt_len; + int trim; if (!lro_mgr-get_skb_header || lro_mgr-get_skb_header(skb, (void *)iph, (void *)tcph, @@ -355,6 +364,17 @@ static int __lro_proc_skb(struct net_lro !test_bit(LRO_F_EXTRACT_VLAN_ID, lro_mgr-features)) vlan_hdr_len = VLAN_HLEN; + /* strip padding
Re: myri10ge net-2.6.24 build fix
David Miller wrote: I had to add the following patch to fix the build after the LRO changes, I have no idea how you could have compile tested that patch let alone done any real testing on it :-/ Whoops. I'm very sorry about that. Future patches will be submitted by our Linux guy, who knows the correct procedures. :) FWIW, the patch I sent came from a script which filters our internal driver (which I always work with), and that line was erroneously filtered by the script we use to remove all the stuff you guys frown on (like our LRO, compat shims for older kernels, optional support to receive into skbs rather than pages, etc). My testing was done with our un-filtered upstream driver. Since our driver specific LRO is now gone (hurray!), I removed the LRO filtering, re-ran the filter script, and arrived at the same patch you committed. Thanks for fixing my mistake! Drew - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] lro: myri10ge example how to use LRO
To follow up on Jan-Bernd Themann's LRO patch earlier today, this patch shows how the generic LRO interface can be used for page based drivers. Again, many thanks to Jan-Bernd Themann for leading this effort. Drew Singed off by: Andrew Gallatin [EMAIL PROTECTED] diff -urNp a/drivers/net/myri10ge/myri10ge.c b/drivers/net/myri10ge/myri10ge.c --- a/drivers/net/myri10ge/myri10ge.c 2007-07-24 15:57:12.0 -0400 +++ b/drivers/net/myri10ge/myri10ge.c 2007-08-03 13:07:48.0 -0400 @@ -48,6 +48,7 @@ #include linux/etherdevice.h #include linux/if_ether.h #include linux/if_vlan.h +#include linux/inet_lro.h #include linux/ip.h #include linux/inet.h #include linux/in.h @@ -62,6 +63,8 @@ #include linux/io.h #include linux/log2.h #include net/checksum.h +#include net/ip.h +#include net/tcp.h #include asm/byteorder.h #include asm/io.h #include asm/processor.h @@ -89,6 +92,7 @@ MODULE_LICENSE(Dual BSD/GPL); #define MYRI10GE_EEPROM_STRINGS_SIZE 256 #define MYRI10GE_MAX_SEND_DESC_TSO ((65536 / 2048) * 2) +#define MYRI10GE_MAX_LRO_DESCRIPTORS 8 #define MYRI10GE_NO_CONFIRM_DATA htonl(0x) #define MYRI10GE_NO_RESPONSE_RESULT 0x @@ -151,6 +155,8 @@ struct myri10ge_rx_done { dma_addr_t bus; int cnt; int idx; + struct net_lro_mgr lro_mgr; + struct net_lro_desc lro_desc[MYRI10GE_MAX_LRO_DESCRIPTORS]; }; struct myri10ge_priv { @@ -276,6 +282,14 @@ static int myri10ge_debug = -1;/* defau module_param(myri10ge_debug, int, 0); MODULE_PARM_DESC(myri10ge_debug, Debug level (0=none,...,16=all)); +static int myri10ge_lro = 1; +module_param(myri10ge_lro, int, S_IRUGO); +MODULE_PARM_DESC(myri10ge_lro, Enable large receive offload\n); + +static int myri10ge_lro_max_pkts = MYRI10GE_LRO_MAX_PKTS; +module_param(myri10ge_lro_max_pkts, int, S_IRUGO); +MODULE_PARM_DESC(myri10ge_lro, Number of LRO packets to be aggregated\n); + static int myri10ge_fill_thresh = 256; module_param(myri10ge_fill_thresh, int, S_IRUGO | S_IWUSR); MODULE_PARM_DESC(myri10ge_fill_thresh, Number of empty rx slots allowed\n); @@ -1019,6 +1033,15 @@ myri10ge_rx_done(struct myri10ge_priv *m remainder -= MYRI10GE_ALLOC_SIZE; } + if (mgp-csum_flag myri10ge_lro) { + rx_frags[0].page_offset += MXGEFW_PAD; + rx_frags[0].size -= MXGEFW_PAD; + len -= MXGEFW_PAD; + lro_receive_frags(mgp-rx_done.lro_mgr, rx_frags, + len, len, (void *)(unsigned long)csum, csum); + return 1; + } + hlen = MYRI10GE_HLEN len ? len : MYRI10GE_HLEN; /* allocate an skb to attach the page(s) to. */ @@ -1137,6 +1160,9 @@ static inline void myri10ge_clean_rx_don mgp-stats.rx_packets += rx_packets; mgp-stats.rx_bytes += rx_bytes; + if (myri10ge_lro) + lro_flush_all(rx_done-lro_mgr); + /* restock receive rings if needed */ if (mgp-rx_small.fill_cnt - mgp-rx_small.cnt myri10ge_fill_thresh) myri10ge_alloc_rx_pages(mgp, mgp-rx_small, @@ -1378,7 +1404,8 @@ static const char myri10ge_gstrings_stat dropped_pause, dropped_bad_phy, dropped_bad_crc32, dropped_unicast_filtered, dropped_multicast_filtered, dropped_runt, dropped_overrun, dropped_no_small_buffer, - dropped_no_big_buffer + dropped_no_big_buffer, LRO aggregated, LRO flushed, + LRO avg aggr, LRO no_desc }; #define MYRI10GE_NET_STATS_LEN 21 @@ -1444,6 +1471,14 @@ myri10ge_get_ethtool_stats(struct net_de data[i++] = (unsigned int)ntohl(mgp-fw_stats-dropped_overrun); data[i++] = (unsigned int)ntohl(mgp-fw_stats-dropped_no_small_buffer); data[i++] = (unsigned int)ntohl(mgp-fw_stats-dropped_no_big_buffer); + data[i++] = mgp-rx_done.lro_mgr.stats.aggregated; + data[i++] = mgp-rx_done.lro_mgr.stats.flushed; + if (mgp-rx_done.lro_mgr.stats.flushed) + data[i++] = mgp-rx_done.lro_mgr.stats.aggregated / + mgp-rx_done.lro_mgr.stats.flushed; + else + data[i++] = 0; + data[i++] = mgp-rx_done.lro_mgr.stats.no_desc; } static void myri10ge_set_msglevel(struct net_device *netdev, u32 value) @@ -1717,10 +1752,69 @@ static void myri10ge_free_irq(struct myr pci_disable_msi(pdev); } +static int +myri10ge_get_frag_header(struct skb_frag_struct *frag, void **mac_hdr, +void **ip_hdr, void **tcpudp_hdr, +u64 * hdr_flags, void *priv) +{ + struct ethhdr *eh; + struct vlan_ethhdr *veh; + struct iphdr *iph; + u8 *va = page_address(frag-page) + frag-page_offset; + unsigned long ll_hlen; + __wsum csum = (__wsum) (unsigned long)priv; + + /* find the mac header, aborting if not IPv4 */ + + eh = (struct ethhdr *)va; + *mac_hdr = eh; + ll_hlen = ETH_HLEN; + if (eh-h_proto != htons
Re: [PATCH] lro: myri10ge example how to use LRO
Kok, Auke wrote: Andrew Gallatin wrote: To follow up on Jan-Bernd Themann's LRO patch earlier today, this patch shows how the generic LRO interface can be used for page based drivers. Again, many thanks to Jan-Bernd Themann for leading this effort. Drew Singed off by: Andrew Gallatin [EMAIL PROTECTED] please take a look at my lro patch for ethtool and see if it works for you, instead of adding another generic module parameter that doesn't need to be there. That looks very nice, and will indeed work for me. Thanks, Drew - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4][RFC] lro: Generic Large Receive Offload for TCP traffic
Jan-Bernd Themann wrote: On Monday 30 July 2007 22:32, Andrew Gallatin wrote: Second, you still need to set skb-ip_summed = CHECKSUM_UNNECESSARY when modified packets are flushed, else the stack will see bad checksums for packets from CHECKSUM_COMPLETE drivers using the skb interface. Fixed in the attached patch. I thought about it... As we do update the TCP checksum for aggregated packets we could add a second ip_summed field in the net_lro_mgr struct used for aggregated packets to support HW that does not have any checksum helper functionality. These drivers could set this ip_summed field to CHECKSUM_NONE, and thus leave the checksum check to the stack. I'm not sure if these old devices benefit a lot from LRO. So what do you think? This might be handy, and it would also fix the problem with CHECKSUM_PARTIAL drivers using the skb interface by allowing them to set it to CHECKSUM_UNNECESSARY. Fourth, I did some traffic sniffing to try to figure out what's going on above, and saw tcpdump complain about bad checksums. Have you tried running tcpdump -s 65535 -vvv? Have you also seen bad checksums? I seem to see this for both page- and skb-based versions of the driver. Hmmm, can't confirm that. For our skb-based version I see correct checksums for aggregated packets and for the page-based version as well. I used: (tcpdump -i ethX -s 0 -w dump.bin) in combination with ethereal. Don't see problems as well with your tcpdump command. I'm still trying to get a handle on this. It happens both with page based and skb based receive for me.. I would not be surprised if I was doing something wrong in myri10ge. But I don't see it without LRO, or with my LRO. I'll let you know when I figure it out.. In the meantime, in case you have any insight, I've left a capture of a small netcat transfer of a 64KB file full of zeros at http://www.myri.com/staff/gallatin/lro/netcat_dump.bz2 Drew - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4][RFC] lro: Generic Large Receive Offload for TCP traffic
Here is a quick reply before something more official can be written up: Linas Vepstas wrote: -- what is LRO? Large Receive Offload -- Basic principles of operation? LRO is analogous to a receive side version of TSO. The NIC (or driver) merges several consecutive segments from the same connection, fixing up checksums, etc. Rather than up to 45 separate 1500 byte frames (meaning up to 45 trips through the network stack), the driver merges them into one 65212 byte frame. It currently works only with TCP over IPv4. LRO was, AFAIK, first though of by Neterion. They had a paper about it at OLS2005. http://www.linuxinsight.com/files/ols2005/grossman-reprint.pdf -- Can I use it in my driver? Yes, it can be used in any driver. -- Does my hardware have to have some special feature before I can use it? No. -- What sort of performance improvement does it provide? Throughput? Latency? CPU usage? How does it affect DMA allocation? Does it improve only a certain type of traffic (large/small packets, etc.) The benefit is directly proportional to the packet rate. See my reply to the previous RFC for performance information. The executive summary is that for the myri10ge 10GbE driver on low end hardware with 1500b frames, I've seen it increase throughput by a factor of nearly 2.5x, while at the same time reducing CPU utilization by 17%. The affect for jumbo frames is less dramatic, but still impressive (1.10x, 14% CPU reduction) You can achieve better speedups if your driver receives into high-order pages. -- Example code? What's the API? How should my driver use it? The 3/4 in this patch showed an example of converting a driver to use LRO for skb based receive buffers. I'm working on a patch for myri10ge that shows how you would use it in a driver which receives into pages. Cheers, Drew - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4][RFC] lro: Generic Large Receive Offload for TCP traffic
Jan-Bernd Themann wrote: Hi, this patch set contains the latest generic LRO code, a Kconfig / Makefile and an eHEA patch demonstrating how the aggregate SKB interface has to to be used. Drew, could you provide a patch for the myri10ge driver to show how the receive in page interface works? Hi, Thank you for the many fixes, and especially for the counters! I was working on testing the myri10ge patch, and I ran into a few problems. I've attached a patch to inet_lro.c to fix some of them, and a patch to myri10ge.c to show how to use the page based interface. Both patches are signed off by Andrew Gallatin [EMAIL PROTECTED] First, the LRO_MAX_PG_HLEN is still a problem. Minimally sized 60 byte frames still cause problems in lro_gen_skb due to skb-len going negative. Fixed in the attached patch. It may be simpler to just drop LRO_MAX_PG_HLEN to ETH_ZLEN, but I'm not sure if that is enough. Are there smart NICs which might chop off padding themselves? Second, you still need to set skb-ip_summed = CHECKSUM_UNNECESSARY when modified packets are flushed, else the stack will see bad checksums for packets from CHECKSUM_COMPLETE drivers using the skb interface. Fixed in the attached patch. Third, for some reason I have yet to figure out, this version of the patch takes a while to ramp up, but only when the receiver MTU is 9000 and the sender MTU is 1500. If the receiver MTU is also 1500, even a 10 second netperf easily shows line rate. I don't see this with our LRO, and I'm so far at a loss to explain it. Here is the first 3 seconds of output from a simple program which diffs the interface counters once / sec. Ipkts IBytesOpkts Obytes rx MTU=9000: 0000 7299092 14 1102 105 1589707 420 750324 113598708417804 1068240 814427 123304247818529 740 rx MTU=1500 0000 44134466818016813132 788182 815938 123532865618716 1122960 817698 123799477218628 1117680 . Once it has ramped up, the bandwidth is fine, and there doesn't seem to be much difference between setting the receiver MTU to 1500 or 9000. But it takes a very long netperf run to overcome the ramp up period. Fourth, I did some traffic sniffing to try to figure out what's going on above, and saw tcpdump complain about bad checksums. Have you tried running tcpdump -s 65535 -vvv? Have you also seen bad checksums? I seem to see this for both page- and skb-based versions of the driver. Last, the pointer to the TCP header in __lro_proc_segment() in the unaccelerated vlan hdr case needs to also be offset by vlan_hdr_len from skb-data. I've fixed this in the attached patch. Thanks, Drew --- linux-2.6.23-rc1.orig/net/ipv4/inet_lro.c 2007-07-30 13:10:49.0 -0400 +++ linux-2.6.23-rc1/net/ipv4/inet_lro.c2007-07-30 15:17:51.0 -0400 @@ -126,6 +126,7 @@ static void lro_update_tcp_ip_header(str htons(lro_desc-ip_tot_len) - IP_HDR_LEN(iph), IPPROTO_TCP, lro_desc-data_csum); + lro_desc-parent-ip_summed = CHECKSUM_UNNECESSARY; } static __wsum lro_tcp_data_csum(struct iphdr *iph, struct tcphdr *tcph, int len) @@ -396,6 +397,9 @@ static struct sk_buff *lro_gen_skb(struc if (!skb) return NULL; + if (hlen len) + hlen = len; + skb-len = len; skb-data_len = len - hlen; skb-truesize += true_size; diff -urp a/drivers/net/myri10ge/myri10ge.c b/drivers/net/myri10ge/myri10ge.c --- a/drivers/net/myri10ge/myri10ge.c 2007-07-24 15:57:12.0 -0400 +++ b/drivers/net/myri10ge/myri10ge.c 2007-07-30 16:12:06.0 -0400 @@ -48,6 +48,7 @@ #include linux/etherdevice.h #include linux/if_ether.h #include linux/if_vlan.h +#include linux/inet_lro.h #include linux/ip.h #include linux/inet.h #include linux/in.h @@ -62,6 +63,8 @@ #include linux/io.h #include linux/log2.h #include net/checksum.h +#include net/ip.h +#include net/tcp.h #include asm/byteorder.h #include asm/io.h #include asm/processor.h @@ -89,6 +92,7 @@ MODULE_LICENSE(Dual BSD/GPL); #define MYRI10GE_EEPROM_STRINGS_SIZE 256 #define MYRI10GE_MAX_SEND_DESC_TSO ((65536 / 2048) * 2) +#define MYRI10GE_MAX_LRO_DESCRIPTORS 8 #define MYRI10GE_NO_CONFIRM_DATA htonl(0x) #define MYRI10GE_NO_RESPONSE_RESULT 0x @@ -151,6 +155,8 @@ struct myri10ge_rx_done { dma_addr_t bus; int cnt; int idx; + struct net_lro_mgr lro_mgr; + struct net_lro_desc lro_desc[MYRI10GE_MAX_LRO_DESCRIPTORS]; }; struct myri10ge_priv { @@ -276,6 +282,14 @@ static int myri10ge_debug = -1;/* defau module_param(myri10ge_debug, int, 0
Re: [RFC 0/1] lro: Generic Large Receive Offload for TCP traffic
Jan-Bernd Themann wrote: On Wednesday 25 July 2007 19:17, Andrew Gallatin wrote: 3) Padded frames. I may be missing something, but I don't see where you either strip padding from frames or reject padded frames. (see the pskb_trim_rcsum() in net/ipv4/ip_input.c:ip_rcv() I think I missed something :-) Will fix that. In lro_tcp_ip_check we check for the SKB aggregation interface the skb-len against ip-tot_len. This catches padded frames as eth_type_trans(skb, dev) reduces the length of the SKB. However, the possible VLAN header is not taken into account. And for the receive in pages interface a wrong length is passed as argument as well. I'd suggest to reject padded frames for aggregation as we do now (when bugs are fixed) and thus keep the code simple. I guess that padded frames don't occur too often in high load situations. If we detect a real performance issue we can still change that later. The one case where performance may be at issue is in aggregating Acks on connections w/o TCP timestamps where a frame size of 54 bytes is padded out to 60. I did some very quick measurements using netperf -tTCP_SENDFILE on the same athlons mentioned earlier using our 1.3.1 driver. I see a roughly 8% CPU increase (~17% - ~25%) when I disable LRO (and hence Ack aggregation) on the sender. This works out to an increase in service demand from ~0.3 to ~0.44 according to netperf. With LRO enabled, our counters show we're aggregating acks at a roughly 3:1 ratio. This is probably an optimization that can be done later, but it is helpful. This reminds me.. what would you think about adding some sort of counters, ideally per-interface, to expose how well LRO is working? At the simplest level, you could add them to the lro mgr struct and let drivers export them via ethtool. I think a central approach might be more appropriate. At any rate, I'd prefer the final version to at least have counters to indicate how many packets were aggregated, how many packets were flushed, and how many times we failed to aggregate something due to insufficient net_lro_desc descriptors. Thanks again for taking the lead on this, Drew - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 0/1] lro: Generic Large Receive Offload for TCP traffic
Hi, I've ported myri10ge to use the new LRO interface. I have attached a preliminary patch to myri10ge. I'm very pleased to note that the performance is on-par with my own LRO used by our out-of-tree driver. (except when using mixed MTUS, see performance data below). As I expected, actually porting our driver to use the LRO interface gave me a far better understanding of the interface, and allowed for better feedback. I have attached a patch to the LRO code which addresses some of the issues I mention below. Please find below a performance summary, as well as my comments on the LRO code, and patches to myri10ge and inet_lro. Both patches are Signed-off-by: Andrew J. Gallatin [EMAIL PROTECTED] Cheers, Drew === Performance: === Here is a performance summary taken on my very low-end 2.0GHz AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ running 2.6.23-rc1 and receiving a netperf TCP_SENDFILE test from an identical sender (which was running 2.6.22 and our 1.3.1 out of tree driver). The netserver process was bound to a different core than the interrupt handler. The data reported is from the median of 5 60 second netperf tests. The following settings were in /etc/sysctl.conf on both machines: net.core.rmem_max = 16777216 net.core.wmem_max = 16777216 net.ipv4.tcp_rmem = 4096 87380 16777216 net.ipv4.tcp_wmem = 4096 65536 16777216 net.core.netdev_max_backlog = 2500 net.ipv4.tcp_timestamps = 0 RX Performance for Sender MTU=1500, Receiver MTU=1500 expressed as x Gb/s, y %CPU receiver utilization: rxbuf(1) 1.3.1(2) inet_lro no LRO ---- --- 4K pg8.9 78% 8.8 77% 3.7 89% 8K pg9.2 77% 9.1 77% 3.7 89% 16Kpg9.4 73% 9.4 73% 3.8 89% 32Kpg9.4 72% 9.4 72% 3.9 89% skb N/A N/A 5.5 90% 4.1 92% RX Performance for Sender MTU=1500, Receiver MTU=9000 expressed as x Gb/s, y %CPU receiver utilization: rxbuf(1) 1.3.1(2) inet_lro no LRO ---- --- 4K pg8.9 78% 7.3 79% 3.7 89% 8K pg9.2 77% 7.6 79% 3.7 89% 16Kpg9.4 73% 8.0 78% 3.8 89% 32Kpg9.4 72% 8.2 79% 3.9 89% skb N/A N/A 4.9 92% 4.1 92% RX Performance for Sender MTU=9000, Receiver MTU=9000 expressed as x Gb/s, y %CPU receiver utilization: rxbuf(1) 1.3.1(2) inet_lro no LRO ---- --- 4K pg9.9 63% 9.6 66% 8.3 71% 8K pg9.9 60% 9.9 63% 8.4 72% 16Kpg9.9 55% 9.9 55% 8.7 70% 32Kpg9.9 53% 9.9 53% 8.9 67% skb N/A N/A 9.9 68% 8.7 72% (1) xK pg means the driver was configured to adjust the receive page size using MYRI10GE_ALLOC_ORDER. skb means an internal variant of our driver which receives into skbs rather than pages was used. (2) 1.3.1 is our latest out of tree driver which uses the myri10ge specific frags-based LRO code previously submitted and rejected. === Code review / comments: === 1) Checksum information for CHECKSUM_COMPLETE drivers. Our NIC passes partial checksums to our driver. In the current code, it seems impossible for page based CHECKSUM_COMPLETE drivers to behave correctly in the case of rejected frames. Eg, there is no way to pass the partial checksum to the LRO module so that it gets set in the skb header and passed up the stack. Further, there seems to be no (easy) way to use CHECKSUM_COMPLETE on an aggregated packet at LRO flush time. By the time a packet is aggregated, the partial checksum from the first segment is out of date. I think it would make sense to require that a CHECKSUM_COMPLETE style driver verify the checksum in its get_frag_header / get_skb_header callback. This allows the LRO code to unconditionally set CHECKSUM_UNNECESSARY. The attached a patch modifies the code to do this. 2) Non-accelerated VLAN tags Our firmware currently does not do vlan tag insertion and removal. This causes a problem in __lro_proc_segment() where the tcp and ip headers are setup to point into the newly created skb. A frame containing an unstripped vlan tag will cause the headers to be garbage. The attached patch modifies __lro_proc_segment() to offset those pointers by VLAN_HLEN when required. 3) Padded frames. I may be missing something, but I don't see where you either strip padding from frames or reject padded frames. (see the pskb_trim_rcsum() in net/ipv4/ip_input.c:ip_rcv() I did not add such a feature as I was confused about the intended use of len/true_size. Also, trimming is a pain when dealing with pure frags (without a containing skb). We have code in our out-of-kernel driver to deal with it which you are welcome to use. 4) LRO_MIN_PG_HLEN (== 80) This confuses me. Can you please explain what you're trying to do? Because of this, I kept getting crashes in the skb_pull() done by eth_type_trans() because I was passing segments which were 60 bytes and the skb-data_len of the skb constructed by lro_gen_skb()
Re: [RFC 0/1] lro: Generic Large Receive Offload for TCP traffic
On 7/20/07, Jan-Bernd Themann [EMAIL PROTECTED] wrote: Hi, Thanks a lot for your comments so far. This generic LRO patch differs from the last one in several points. A new interface for a receive in pages mode has been added and tested with an eHEA prototype. Seems to work well. Does this extended interface seem to be sufficient? Thank you for this! At least for me, I find it is best to try to use an interface rather than simply reading a diff. So I will port Myri10GE to use the new interface so that I can give better feedback, I'll try my best to do this by early next week. Thank you again, Drew - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 0/3] lro: Generic Large Receive Offload for TCP traffic
On 7/11/07, Jan-Bernd Themann [EMAIL PROTECTED] wrote: Generic Large Receive Offload proposal I'm very glad that somebody is stepping up to take responsibility for this! I'm the primary author of the Myricom Myri10GE driver, and its LRO mechanism (which has been rejected several times when posted here). I don't subscribe to the LKML or netdev lists (the b/w is way too much for me). Thankfully, my colleague Brice (who maintains our driver in the kernel) forwarded me this message. I looked at your patch, and I believe that we can improve the performance further by using a slightly different approach, or at least making that approach optional. When I first did our LRO implementation, it aggregated packets essentially the same way you are doing it -- by appending packets to the frag_list. I did quite extensive profiling, and the most expensive operations seemed to be the allocation and freeing of memory. A colleague of mine (Loic, CC'ed) came up with the brilliant idea of receiving into pages. When I implemented his suggestion, it turned out to be much, much more efficient to receive into pages, and to accumulate the pages to the frags array. The benchmarks I did on very low end machines in my lab (2GHz amd64 x2 3800+) showed that the receiver was saturated at roughly 4.2Gb/s without LRO, 5.7Gb/s with frag_list based LRO, 8.6Gb/s with frags array based LRO, and somewhat idle at line rate with frags array based LRO and 32KB order=3 pages. (This is 1500b frames, BTW). The savings comes from being able to do fewer allocations. For example, 2 1500b packets fit in a single page. So, for a full frag array, we have 16 1/2 4KB pages and a single skb holding them. This works out to be 9 allocations for roughly 23KB of payload, rather than 16. Using order 3 (32KB) pages, it gets even better, and we have just 2 allocations per full skb frag list. So... It would be wonderful if your patch could also deal with data residing in pages, rather than in skbs. I can understand how you might not want to modify your driver to do this, which is why I'm asking about making it optional. However, your driver would probably benefit from receiving into pages, and I encourage you to investigate that. I'm picturing an additional path into lro, such as: int lro_maybe_receive_frags(struct net_lro_mgr *lro_mgr, struct skb_frag_struct *frags, int len, void *priv); This would return 0 if the packet was accepted, and an error if it was not. It would then call a modified __lro_proc_skb() (perhaps better named __lro_proc_segment()) which would have the length as an argument (so as to avoid the need to pass the skb to lro_tcp_ip_check()) in addition to the *frags. The only real additional work would be in having an alternate path inside lro_init_desc() which allocated an skb to hang the pages from if the passed skb was null, and in having an alternate lro_add_packet() path which added the frag(s), rather than chaining an skb. Also, your patch does not seem to maintain TCP checksums. To be fair, we did not maintain checksums in our earlier LRO work either, and it was much simpler that way! :) However, not maintaining the TCP checksum can lead to user complaints (invalid csum reported by traffic sniffers), as well as problems when various filtering software is used which attempts to re-write the TCP header. I would very much appreciate it if you could look at my LRO implementation in our Myri10GE driver which has been posted (and rejected) several times in the past by Brice. The source code is also available at: http://www.myri.com/ftp/pub/Myri10GE/myri10ge-linux.1.3.0.tgz. It uses a page based approach, and it maintains correct TCP checksums. This driver is Dual BSD/GPL licensed, and you are free to take whatever you like. Thank you again for stepping up with a generic implementation! Drew - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: vlan net_device features flag is 0?
On 4/5/06, Ben Greear [EMAIL PROTECTED] wrote: Andrew Gallatin wrote: I'm working on a driver for a 10GbE nic. I've just gotten to the point where I am verifying that 802.1q vlans work without hardware vlan offload. It seems like the netdev features flags (NETIF_F_SG|NETIF_F_IP_CSUM|NETIF_F_TSO) are not being inherited by the vlan device. This leads to very high CPU utilization, especially when running applications which use sendfile, since it forces data copies. I have verified this is the problem by printing the vlan device's features from the end of register_vlan_device(). If copy real_dev's features to new_dev, then the performance problems disappear. This seems logical to me. We might need some dynamic logic to deal with vlans on top of bridge devices (I believe bridges adjust *their* capabilities dynamically as devices are added to and removed from the bridge.) So you are sayiing that you'd be willing to copy real_dev's features flags when creating a vlan? Great! Unfortunately, I know nothing about bridges. How would this dynamic logic work? We would also need to catch ethtool updates to real_dev's features flags, so that ethtook -K would work as expected on vlan devices. Or perhaps the dynamic logic to deal with vlans on top of bridges would be able to deal with it. This is especially confusing, since it appears ethtool's -K/-k commands fall through to the real device. So it appears like the ethX.VLAN dev has scatter-gather and checksum offload enabled, when actually it does not. Am I hitting a weird corner case in having a device which does checksum/sg/tso and not vlan offload? Will this magically go away if I implement some vlan hardware assist feature? I doubt that will matter, but it's possible. DaveM and others know more about the hardware assist path that I do... Yes, I think you are correct, and it should not matter. i was just grasping at straws. Thank you, Drew - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html