Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
From: Tom Herbert Date: Wed, 6 Apr 2016 14:42:26 -0300 > On Wed, Apr 6, 2016 at 12:43 PM, David Miller wrote: >> From: Tom Herbert >> Date: Wed, 6 Apr 2016 10:53:52 -0300 >> >>> Packets that are forwarded really should not be GRO'ed in the first >>> place because of the loss of information and added latency. >> >> First of all GRO is supposed to be lossless, so please stop saying this >> would be a reason to turn it off on a router. >> >> Second of all, the biggest piece of overhead is the routing lookup, >> therefore GRO batching helps enormously with routing workloads, and >> therefore is appropriate to be enabled on routers. >> >> Yes, I agree that for locally terminated stuff it helps more, but don't >> turn this into a "GRO on routers, meh..." type argument. It simply is >> not true at all. >> > GRO on routers will help in a limited case where there is little load > and the traffic is nicely groomed high tput TCP connections. But for > routers with significant load, handling large quantities other > protocols like UDP, GRO is not necessarily helpful and presents a > nondeterministic performance improvement. For instance, I cannot > provision a router with any assumptions that GRO will be effective for > any % of packets to save any % of CPU, we need to provision based > purely on ability to forward by pps assuming no benefit from GRO. Just because you cannot predict how effective a facility will be, that doesn't mean you shouldn't use it at all.
Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On Wed, Apr 6, 2016 at 12:43 PM, David Miller wrote: > From: Tom Herbert > Date: Wed, 6 Apr 2016 10:53:52 -0300 > >> Packets that are forwarded really should not be GRO'ed in the first >> place because of the loss of information and added latency. > > First of all GRO is supposed to be lossless, so please stop saying this > would be a reason to turn it off on a router. > > Second of all, the biggest piece of overhead is the routing lookup, > therefore GRO batching helps enormously with routing workloads, and > therefore is appropriate to be enabled on routers. > > Yes, I agree that for locally terminated stuff it helps more, but don't > turn this into a "GRO on routers, meh..." type argument. It simply is > not true at all. > GRO on routers will help in a limited case where there is little load and the traffic is nicely groomed high tput TCP connections. But for routers with significant load, handling large quantities of other protocols like UDP, GRO is not necessarily helpful and presents a nondeterministic performance improvement. For instance, I cannot provision a router with any assumptions that GRO will be effective for any % of packets to save any % of CPU; we need to provision based purely on the ability to forward by pps, assuming no benefit from GRO. Technically, this provisioning argument applies to end hosts also. We have seen real cases on video servers where servers were mis-provisioned with the assumption that GSO is always effective. So when we're getting good GSO benefits we might be using something like 80% CPU, but if some connectivity event occurs forcing all the cwnds of the active connections to drop, we find we need 110% of CPU to recover. This discussion is relevant because there is a big push now to replace dedicated HW with commodity HW running Linux; this has already happened significantly in load balancers, but I expect it to extend even to some cases of basic switching.
Besides, I seriously doubt you'll find any commercial router in the world that does GRO. Yes, GRO needs to work correctly in all cases, but for system performance in routing we are bound to per-packet actions like route lookup. As I said, optimizing GRO for the forwarding case as Edward is suggesting seems to have diminishing benefits. Tom
Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On Wed, 2016-04-06 at 16:55 +0100, Edward Cree wrote: > On 06/04/16 16:39, Eric Dumazet wrote: > > Look at the mess of some helpers in net/core/skbuff.c, and imagine the > > super mess it would be if using a concept of 'super packet with various > > headers on each segment'. > Maybe I'm still not explaining this very well, but there is _no_ concept of > 'super packet [anything]' in this idea. There is just 'list of skbs that > were all received in the same NAPI poll, and have not yet been determined to > be different'. > > Any layer that doesn't want to deal with this stuff will always have the > option of "while (skb = skb_dequeue(list)) my_normal_receive_function(skb);" > and in fact I'd make that happen by default for anything that hadn't > registered a function to take a list. > > netfilter is already complex, it would become a nightmare. > A netfilter hook could, for instance, run on each packet in the list, then > partition the list into sub-lists of packets that all had the same verdict > (letting go of any that were DROP or STOLEN). That doesn't seem like it > should be nightmarish. Okay, I will let you try this alone ;)
Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On 06/04/16 16:39, Eric Dumazet wrote: > Look at the mess of some helpers in net/core/skbuff.c, and imagine the > super mess it would be if using a concept of 'super packet with various > headers on each segment'. Maybe I'm still not explaining this very well, but there is _no_ concept of 'super packet [anything]' in this idea. There is just 'list of skbs that were all received in the same NAPI poll, and have not yet been determined to be different'. Any layer that doesn't want to deal with this stuff will always have the option of "while (skb = skb_dequeue(list)) my_normal_receive_function(skb);" and in fact I'd make that happen by default for anything that hadn't registered a function to take a list. > netfilter is already complex, it would become a nightmare. A netfilter hook could, for instance, run on each packet in the list, then partition the list into sub-lists of packets that all had the same verdict (letting go of any that were DROP or STOLEN). That doesn't seem like it should be nightmarish. -Ed
Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
From: Tom Herbert Date: Wed, 6 Apr 2016 10:53:52 -0300 > Packets that are forwarded really should not be GRO'ed in the first > place because of the loss of information and added latency. First of all GRO is supposed to be lossless, so please stop saying this would be a reason to turn it off on a router. Second of all, the biggest piece of overhead is the routing lookup, therefore GRO batching helps enormously with routing workloads, and therefore is appropriate to be enabled on routers. Yes, I agree that for locally terminated stuff it helps more, but don't turn this into a "GRO on routers, meh..." type argument. It simply is not true at all.
Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On Wed, 2016-04-06 at 15:26 +0100, Edward Cree wrote: > On 06/04/16 14:53, Tom Herbert wrote: > > But again, this scheme is optimizing for forwarding case and doesn't > > help (and probably hurts) the use case of locally terminated > > connections > Not really. I think this has a chance to outperform GRO for locally > terminated connections, for a number of reasons: > * Doesn't look at higher-layer or inner headers until later in packet > processing, for instance we (maybe) process every L3 header in a NAPI poll > before looking at a single L4. This could delay touching the second > cacheline of some packets. > * Doesn't have to re-write headers to produce a coherent superframe. > Instead, each segment carries its original headers around with it. Also > means we can skip _checking_ some headers to see if we're 'allowed' to > coalesce (e.g. TCP TS differences, and the current fun with IP IDs). > * Can be used for protocols like UDP where the original packet boundaries > are important (so you can't coalesce into a superframe). > Really the last of those was the original reason for this idea, helping with > forwarding is just another nice bonus that we (might) get from it. > And it's all speculative and I don't know for sure what the performance > would be like, but I won't know until I try it! Look at the mess of some helpers in net/core/skbuff.c, and imagine the super mess it would be if using a concept of 'super packet with various headers on each segment'. netfilter is already complex, it would become a nightmare. GRO on the other hand presents one virtual set of headers, then the payload in one or multiple frags.
Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On 06/04/16 14:53, Tom Herbert wrote: > But again, this scheme is optimizing for forwarding case and doesn't > help (and probably hurts) the use case of locally terminated > connections Not really. I think this has a chance to outperform GRO for locally terminated connections, for a number of reasons: * Doesn't look at higher-layer or inner headers until later in packet processing, for instance we (maybe) process every L3 header in a NAPI poll before looking at a single L4. This could delay touching the second cacheline of some packets. * Doesn't have to re-write headers to produce a coherent superframe. Instead, each segment carries its original headers around with it. Also means we can skip _checking_ some headers to see if we're 'allowed' to coalesce (e.g. TCP TS differences, and the current fun with IP IDs). * Can be used for protocols like UDP where the original packet boundaries are important (so you can't coalesce into a superframe). Really the last of those was the original reason for this idea, helping with forwarding is just another nice bonus that we (might) get from it. And it's all speculative and I don't know for sure what the performance would be like, but I won't know until I try it! > which I would claim is more important. No argument there :-) > Packets that are > forwarded really should not be GRO'ed in the first place because of > the loss of information and added latency. The difficultly is that we > don't currently make the forwarding decision before GRO, if we can > change this to decide whether packets are to be forwarded earlier then > we can avoid doing GRO for those. That certainly would be nice, XDP is exciting and I look forward to it. -Ed
Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On Wed, Apr 6, 2016 at 8:21 AM, Edward Cree wrote: > On 06/04/16 00:45, David Miller wrote: >> From: Edward Cree >> Date: Tue, 5 Apr 2016 16:07:49 +0100 >> >>> On the gripping hand, I feel like GRO+TSO is the wrong model for >>> speeding up forwarding/routing workloads. Instead we should be >>> looking into having lists of SKBs traverse the stack together, >>> splitting the list whenever e.g. the destination changes. >> "Destination" is a very complicated beast. It's not just a >> destination IP address. >> >> It's not even just a full saddr/daddr/TOS triplet. >> >> Packets can be forwarded around based upon any key whatsoever in the >> headers. Netfilter can mangle them based upon arbitrary bits in the >> packet, as can the packet scheduler classifier actions. >> >> It's therefore not profitable to try this at all, it's completely >> pointless unless all the keys match up exactly. > Possibly I wasn't completely clear (or maybe I was and I'm just > wrong...), but I meant that _each layer_ in the stack would split the > list whenever it wants to treat two packets differently. Whether > that's a protocol receive handler, or a netfilter or tc operation. > > Obviously if you want to decide at the _beginning_ whether "all the > keys match", then you do essentially need GRO's flow-matching logic. > But even then, I find myself wondering if having GRO coalesce the > segments into a superpacket is really better than having it just make > lists of segments, and have that list traverse the stack as a single > entity. That way lossless resegmentation remains easy. But I suppose > that could make life difficult for things like BPF, if they want to > act upon the superframe (because we haven't built it). If instead > they act on each of the segments, we might get different actions for > each segment and that might also be awkward; so you'd still need this > concept of 'any layer in the stack can decide to split lists up'. 
> But again, this scheme is optimizing for the forwarding case and doesn't help (and probably hurts) the use case of locally terminated connections, which I would claim is more important. Packets that are forwarded really should not be GRO'ed in the first place because of the loss of information and added latency. The difficulty is that we don't currently make the forwarding decision before GRO; if we can change this to decide earlier whether packets are to be forwarded, then we can avoid doing GRO for those. Tom > -Ed
Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On 06/04/16 00:45, David Miller wrote: > From: Edward Cree > Date: Tue, 5 Apr 2016 16:07:49 +0100 > >> On the gripping hand, I feel like GRO+TSO is the wrong model for >> speeding up forwarding/routing workloads. Instead we should be >> looking into having lists of SKBs traverse the stack together, >> splitting the list whenever e.g. the destination changes. > "Destination" is a very complicated beast. It's not just a > destination IP address. > > It's not even just a full saddr/daddr/TOS triplet. > > Packets can be forwarded around based upon any key whatsoever in the > headers. Netfilter can mangle them based upon arbitrary bits in the > packet, as can the packet scheduler classifier actions. > > It's therefore not profitable to try this at all, it's completely > pointless unless all the keys match up exactly. Possibly I wasn't completely clear (or maybe I was and I'm just wrong...), but I meant that _each layer_ in the stack would split the list whenever it wants to treat two packets differently. Whether that's a protocol receive handler, or a netfilter or tc operation. Obviously if you want to decide at the _beginning_ whether "all the keys match", then you do essentially need GRO's flow-matching logic. But even then, I find myself wondering if having GRO coalesce the segments into a superpacket is really better than having it just make lists of segments, and have that list traverse the stack as a single entity. That way lossless resegmentation remains easy. But I suppose that could make life difficult for things like BPF, if they want to act upon the superframe (because we haven't built it). If instead they act on each of the segments, we might get different actions for each segment and that might also be awkward; so you'd still need this concept of 'any layer in the stack can decide to split lists up'. -Ed
Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On Tue, Apr 05, 2016 at 12:36:40PM -0300, Tom Herbert wrote: > On Tue, Apr 5, 2016 at 12:07 PM, Edward Cree wrote: > > On 05/04/16 05:32, Herbert Xu wrote: > >> On Mon, Apr 04, 2016 at 09:26:55PM -0700, Alexander Duyck wrote: > >>> The question I would have is what are you really losing with increment > >>> from 0 versus fixed 0? From what I see it is essentially just garbage > >>> in/garbage out. > >> GRO is meant to be lossless, that is, you should not be able to > >> detect its presence from the outside. If you lose information then > >> you're breaking this rule and people will soon start asking for it > >> to be disabled in various situations. > >> > >> I'm not against doing this per se but it should not be part of the > >> default configuration. > > I'm certainly in favour of this being configurable - indeed IMHO it should > > also be possible to configure GRO with the 'looser' semantics of LRO, so > > that people who want that can get it without all the horrible "don't confuse > > Slow Start" hacks, and so that LRO can go away (AIUI the only reasons it > > exists are (a) improved performance from the 'loose' semantics and (b) old > > kernels without GRO. We may not be able to kill (b) but we can certainly > > address (a)). > > > > But I don't agree that the default has to be totally lossless; anyone who is > > caring about the ID fields in atomic datagrams is breaking the RFCs, and can > > be assumed to Know What They're Doing sufficiently to configure this. > > > > On the gripping hand, I feel like GRO+TSO is the wrong model for speeding up > > forwarding/routing workloads. Instead we should be looking into having > > lists > > of SKBs traverse the stack together, splitting the list whenever e.g. the > > destination changes. That seems like it ought to be much more efficient > > than > > rewriting headers twice, once to coalesce a superframe and once to segment > > it > > again - and it also means this worry about GRO being lossless can go away. 
> > But until someone tries implementing skb batches, we won't know for sure if > > it works (and I don't have time right now ;) > > > Ed, > > I thought about that some. It seems like we would want to do both GRO > and retain all the individual packets in the skb so that we could use > those for forwarding instead of GSO as I think you're saying. This Retaining the individual packets would also help to make GRO feasible for SCTP. SCTP needs to know where each packet ended because of AUTH chunks and we cannot rely on something like gso_size as each original packet had its own size. I could do it for the tx side (see my SCTP/GSO RFC patches) using skb_gro_receive() with a specially crafted header skb, but I'm not seeing a way to do it on the rx side as I cannot guarantee incoming skbs will follow that pattern. Marcelo > would work great in the plain forwarding case, but one problem > is what to do if the host modifies the super packet (for instance when > forwarding over a tunnel we might add encapsulation header). This > should work in GSO (although we need to address the limitations around > 1 encap level), not sure this is easy if we need to add a header to > each packet in a batch. > > Tom > > > > > -Ed >
Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
From: Edward Cree Date: Tue, 5 Apr 2016 16:07:49 +0100 > On the gripping hand, I feel like GRO+TSO is the wrong model for > speeding up forwarding/routing workloads. Instead we should be > looking into having lists of SKBs traverse the stack together, > splitting the list whenever e.g. the destination changes. "Destination" is a very complicated beast. It's not just a destination IP address. It's not even just a full saddr/daddr/TOS triplet. Packets can be forwarded around based upon any key whatsoever in the headers. Netfilter can mangle them based upon arbitrary bits in the packet, as can the packet scheduler classifier actions. It's therefore not profitable to try this at all, it's completely pointless unless all the keys match up exactly. This is why GRO _is_ the proper model to speed this stuff and do bulk processing, because it still presents a full "packet" to all of these layers to mangle, rewrite, route, and do whatever else however they like.
Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On Tue, Apr 5, 2016 at 2:06 PM, Edward Cree wrote: > On 05/04/16 16:36, Tom Herbert wrote: >> I thought about that some. It seems like we would want to do both GRO >> and retain all the individual packets in the skb so that we could use >> those for forwarding instead of GSO as I think you're saying. > I didn't quite mean that, I meant just pass around the skb list, don't > do GRO at all. The receiving TCP stack ends up just getting called > several times, in quick succession, without I$ loss from network stack > traversal in between. > >> This >> would would work great in the plain forwarding case, but one problem >> is what to do if the host modifies the super packet (for instance when >> forwarding over a tunnel we might add encapsulation header). This >> should work in GSO (although we need to address the limitations around >> 1 encap level), not sure this is easy if we need to add a header to >> each packet in a batch. > This is indeed a problem with what I was proposing; perhaps the answer > is that as you process these SKB lists you also update something like a > GRO CB, then if you do decide to transform the packets you can coalesce > them at that point. But doing 'the rest' of GRO might cost as much as > just transforming all the packets, in which case you only win if you want > to transform them multiple times. > And if we assume that we're going to be touching all the headers anyway, > it probably doesn't cost us much to transform all the packets in the list > since our code and data are both hot in cache. Well, the code is cold > for the first packet in the list, but equally it's cold for the > superpacket in the current approach. 
> > If this is as performant as GRO in the normal (non-forwarding) receive > case (and that's a *big* if, which can only be resolved by trying it), it > might make sense to just not have GRO, while TSO only gets used for > locally-generated sends, and for the case where you're forwarding between > networks with different MTUs (e.g. from a 64k VM netdevice to the wire). > > What do you think? Am I crazy? (More than my usual?) > It's not clear to me how important optimizing the GRO to GSO forwarding case really is. We definitely need GRO for locally received packets, but how much benefit do we really get when applying GRO to forwarding, and does that really outweigh the latency we're adding in the forwarding path to do GRO? Also, once we implement high performance forwarding in XDP there really is no need to consider GRO; we'll do some sort of I/O batching for sure, but that would be batching packets to the same TX queue, not the same destination. Tom > -Ed
Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On 05/04/16 16:36, Tom Herbert wrote: > I thought about that some. It seems like we would want to do both GRO > and retain all the individual packets in the skb so that we could use > those for forwarding instead of GSO as I think you're saying. I didn't quite mean that, I meant just pass around the skb list, don't do GRO at all. The receiving TCP stack ends up just getting called several times, in quick succession, without I$ loss from network stack traversal in between. > This > would would work great in the plain forwarding case, but one problem > is what to do if the host modifies the super packet (for instance when > forwarding over a tunnel we might add encapsulation header). This > should work in GSO (although we need to address the limitations around > 1 encap level), not sure this is easy if we need to add a header to > each packet in a batch. This is indeed a problem with what I was proposing; perhaps the answer is that as you process these SKB lists you also update something like a GRO CB, then if you do decide to transform the packets you can coalesce them at that point. But doing 'the rest' of GRO might cost as much as just transforming all the packets, in which case you only win if you want to transform them multiple times. And if we assume that we're going to be touching all the headers anyway, it probably doesn't cost us much to transform all the packets in the list since our code and data are both hot in cache. Well, the code is cold for the first packet in the list, but equally it's cold for the superpacket in the current approach. If this is as performant as GRO in the normal (non-forwarding) receive case (and that's a *big* if, which can only be resolved by trying it), it might make sense to just not have GRO, while TSO only gets used for locally-generated sends, and for the case where you're forwarding between networks with different MTUs (e.g. from a 64k VM netdevice to the wire). What do you think? Am I crazy? (More than my usual?) -Ed
Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On Tue, Apr 5, 2016 at 9:30 AM, Eric Dumazet wrote: > On Tue, 2016-04-05 at 08:52 -0700, Alexander Duyck wrote: > >> >> I disagree I think it will have to be part of the default >> configuration. The problem is the IP ID is quickly becoming >> meaningless. When you consider that a 40Gb/s link can wrap the IP ID >> value nearly 50 times a second using a 1500 MTU the IP ID field should >> just be ignored anyway because you cannot guarantee that it will be >> unique without limiting the Tx window size. That was the whole point >> of RFC6864. Basically the IP ID field is so small that as we push >> into the higher speeds you cannot guarantee that the field will have >> any meaning so for any case where you don't need to use it you >> shouldn't because it will likely not provide enough useful data. > > Just because a few flows reach 40Gbit , we should remind that vast > majority of the Internet runs with < 50Mbits flows. > > I prefer the argument of IPv6 not having ID ;) Okay, maybe I'll try to use that argument more often then.. :-) > We should do our best to keep interoperability, this is the selling > point. > > And quite frankly your last patch makes perfect sense to me : Yes. It was a compromise, though I might still have to go through and refine it more. It might make sense for the IP header associated with the TCP flow, but for outer headers it actually is worse because we end up blocking several different possibilities. What I might need to do is capture the state of the DF bit as we work a flow up through the stack, and once it is in the list of GRO SKBs, use that DF bit as a flag to indicate whether we support an incrementing or fixed pattern for the values. That way tunnels can optionally ignore the IP ID if the DF bit is set, since their values may not be as clean as those of TCP. > The aggregation is done only if the TCP headers of consecutive packets > matches. So who cares of IPv4 ID really ? > This is a very minor detail.
The possible gains outperform the > theoretical 'problem' > > GRO already reorder flows, it never had a guarantee of being 'invisible' as Herbert claims. I can see what he is trying to get at. I just think it is a bit too strict on the interpretation of what values have to be maintained. My plan going forward is to add a sysctl that will probably allow us some wiggle room in regards to IP ID for GRO and GSO so that when it is disabled we will not perform GSO partial nor allow for repeating IP IDs in GRO on devices that cannot get the IP ID right. - Alex
Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On Tue, 2016-04-05 at 08:52 -0700, Alexander Duyck wrote: > > I disagree I think it will have to be part of the default > configuration. The problem is the IP ID is quickly becoming > meaningless. When you consider that a 40Gb/s link can wrap the IP ID > value nearly 50 times a second using a 1500 MTU the IP ID field should > just be ignored anyway because you cannot guarantee that it will be > unique without limiting the Tx window size. That was the whole point > of RFC6864. Basically the IP ID field is so small that as we push > into the higher speeds you cannot guarantee that the field will have > any meaning so for any case where you don't need to use it you > shouldn't because it will likely not provide enough useful data. Just because a few flows reach 40Gbit, we should remember that the vast majority of the Internet runs with < 50Mbit flows. I prefer the argument of IPv6 not having ID ;) We should do our best to keep interoperability; this is the selling point. And quite frankly your last patch makes perfect sense to me: The aggregation is done only if the TCP headers of consecutive packets match. So who cares about the IPv4 ID, really? This is a very minor detail. The possible gains outweigh the theoretical 'problem'. GRO already reorders flows; it never had a guarantee of being 'invisible' as Herbert claims.
Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On Mon, Apr 4, 2016 at 9:32 PM, Herbert Xu wrote: > On Mon, Apr 04, 2016 at 09:26:55PM -0700, Alexander Duyck wrote: >> >> The problem is right now we are mangling the IP ID for outer headers >> on tunnels. We end up totally ignoring the delta between the values >> so if you have two flows that get interleaved over the same tunnel GRO >> will currently mash the IP IDs for the two tunnels so that they end up >> overlapping. > > Then it should be fixed. I never reviewed those patches or I would > have objected at the time. The problem is what you are proposing is much larger than I am comfortable proposing for a net patch, so I will probably go back to targeting net-next. >> The reason why I keep referencing RFC 6864 is because it specifies >> that the IP ID field must not be read if the DF bit is set, and that >> if we are manipulating headers we can handle the IP ID as though we >> are the transmitting station. What this means is that if DF is not >> set we have to have unique values per packet, otherwise we can ignore >> the values if DF is set. > > As I said GRO itself should not be visible. The fact that it is > for tunnels is a bug. Right. But the fact that we are even trying to get reliable data out of IP ID is also a bug. The fact is the IP ID isn't going to be linear in the case of tunnels. There is a good likelihood that the IP ID will jump around if we are doing encapsulation using a third-party device such as a switch. That was one of the reasons why my first implementation just completely ignored the IP ID in the case that the DF bit is set. Honestly I am leaning more toward taking the approach of going back to that implementation and adding a sysctl that would let you disable it. Maybe something like net.ipv4.gro.rfc6864 since that is the RFC that spells out that we should treat IP ID as a floating input/output if DF is set. Basically we need to be able to do that if the GSO partial code is going to work.
>> The question I would have is what are you really losing with increment >> from 0 versus fixed 0? From what I see it is essentially just garbage >> in/garbage out. > > GRO is meant to be lossless, that is, you should not be able to > detect its presence from the outside. If you lose information then > you're breaking this rule and people will soon start asking for it > to be disabled in various situations. Right. But in this case we technically aren't losing information. That is the thing I have been trying to point out with RFC 6864. It calls out that you MUST not read IP ID if the DF bit is set. > I'm not against doing this per se but it should not be part of the > default configuration. I disagree; I think it will have to be part of the default configuration. The problem is the IP ID is quickly becoming meaningless. When you consider that a 40Gb/s link can wrap the IP ID value nearly 50 times a second using a 1500 MTU, the IP ID field should just be ignored anyway because you cannot guarantee that it will be unique without limiting the Tx window size. That was the whole point of RFC6864. Basically the IP ID field is so small that as we push into the higher speeds you cannot guarantee that the field will have any meaning, so for any case where you don't need to use it you shouldn't, because it will likely not provide enough useful data. - Alex
Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On Tue, Apr 5, 2016 at 12:07 PM, Edward Cree wrote:
> On 05/04/16 05:32, Herbert Xu wrote:
>> On Mon, Apr 04, 2016 at 09:26:55PM -0700, Alexander Duyck wrote:
>>> The question I would have is what are you really losing with increment
>>> from 0 versus fixed 0? From what I see it is essentially just garbage
>>> in/garbage out.
>>
>> GRO is meant to be lossless, that is, you should not be able to
>> detect its presence from the outside. If you lose information then
>> you're breaking this rule and people will soon start asking for it
>> to be disabled in various situations.
>>
>> I'm not against doing this per se but it should not be part of the
>> default configuration.
>
> I'm certainly in favour of this being configurable - indeed IMHO it should
> also be possible to configure GRO with the 'looser' semantics of LRO, so
> that people who want that can get it without all the horrible "don't confuse
> Slow Start" hacks, and so that LRO can go away (AIUI the only reasons it
> exists are (a) improved performance from the 'loose' semantics and (b) old
> kernels without GRO. We may not be able to kill (b) but we can certainly
> address (a)).
>
> But I don't agree that the default has to be totally lossless; anyone who is
> caring about the ID fields in atomic datagrams is breaking the RFCs, and can
> be assumed to Know What They're Doing sufficiently to configure this.
>
> On the gripping hand, I feel like GRO+TSO is the wrong model for speeding up
> forwarding/routing workloads. Instead we should be looking into having lists
> of SKBs traverse the stack together, splitting the list whenever e.g. the
> destination changes. That seems like it ought to be much more efficient than
> rewriting headers twice, once to coalesce a superframe and once to segment it
> again - and it also means this worry about GRO being lossless can go away.
>
> But until someone tries implementing skb batches, we won't know for sure if
> it works (and I don't have time right now ;)
>
Ed, I thought about that some. It seems like we would want to do both:
GRO the packets, but also retain all the individual packets in the skb
so that we could use those for forwarding instead of GSO, as I think
you're saying. This would work great in the plain forwarding case, but
one problem is what to do if the host modifies the super packet (for
instance, when forwarding over a tunnel we might add an encapsulation
header). This should work in GSO (although we need to address the
limitations around one encap level); I'm not sure it is easy if we need
to add a header to each packet in a batch.

Tom

> -Ed
Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On 05/04/16 05:32, Herbert Xu wrote:
> On Mon, Apr 04, 2016 at 09:26:55PM -0700, Alexander Duyck wrote:
>> The question I would have is what are you really losing with increment
>> from 0 versus fixed 0? From what I see it is essentially just garbage
>> in/garbage out.
>
> GRO is meant to be lossless, that is, you should not be able to
> detect its presence from the outside. If you lose information then
> you're breaking this rule and people will soon start asking for it
> to be disabled in various situations.
>
> I'm not against doing this per se but it should not be part of the
> default configuration.

I'm certainly in favour of this being configurable - indeed IMHO it should
also be possible to configure GRO with the 'looser' semantics of LRO, so
that people who want that can get it without all the horrible "don't confuse
Slow Start" hacks, and so that LRO can go away (AIUI the only reasons it
exists are (a) improved performance from the 'loose' semantics and (b) old
kernels without GRO. We may not be able to kill (b) but we can certainly
address (a)).

But I don't agree that the default has to be totally lossless; anyone who is
caring about the ID fields in atomic datagrams is breaking the RFCs, and can
be assumed to Know What They're Doing sufficiently to configure this.

On the gripping hand, I feel like GRO+TSO is the wrong model for speeding up
forwarding/routing workloads. Instead we should be looking into having lists
of SKBs traverse the stack together, splitting the list whenever e.g. the
destination changes. That seems like it ought to be much more efficient than
rewriting headers twice, once to coalesce a superframe and once to segment it
again - and it also means this worry about GRO being lossless can go away.

But until someone tries implementing skb batches, we won't know for sure if
it works (and I don't have time right now ;)

-Ed
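Ed's "split the list whenever the destination changes" idea can be sketched in miniature. This is a hypothetical illustration of the batching concept, not kernel code; the packet representation and the `dest_of` key function are made up for the example:

```python
def split_batches(pkts, dest_of):
    """Group a received list of packets into runs sharing a destination.

    Models the skb-batching idea: packets traverse the stack as a list,
    and the list is split at each point where the forwarding decision
    (here, the value returned by dest_of) would differ.
    """
    batches = []
    current = []
    for pkt in pkts:
        # Start a new batch whenever the destination changes
        if current and dest_of(pkt) != dest_of(current[-1]):
            batches.append(current)
            current = []
        current.append(pkt)
    if current:
        batches.append(current)
    return batches
```

Each resulting batch can then share one route lookup and one set of header rewrites, which is where the hoped-for savings over GRO+TSO would come from.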
Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On Mon, Apr 04, 2016 at 09:26:55PM -0700, Alexander Duyck wrote:
>
> The problem is right now we are mangling the IP ID for outer headers
> on tunnels. We end up totally ignoring the delta between the values
> so if you have two flows that get interleaved over the same tunnel GRO
> will currently mash the IP IDs for the two tunnels so that they end up
> overlapping.

Then it should be fixed. I never reviewed those patches or I would
have objected at the time.

> The reason why I keep referencing RFC 6864 is because it specifies
> that the IP ID field must not be read if the DF bit is set, and that
> if we are manipulating headers we can handle the IP ID as though we
> are the transmitting station. What this means is that if DF is not
> set we have to have unique values per packet, otherwise we can ignore
> the values if DF is set.

As I said GRO itself should not be visible. The fact that it is for
tunnels is a bug.

> The question I would have is what are you really losing with increment
> from 0 versus fixed 0? From what I see it is essentially just garbage
> in/garbage out.

GRO is meant to be lossless, that is, you should not be able to
detect its presence from the outside. If you lose information then
you're breaking this rule and people will soon start asking for it
to be disabled in various situations.

I'm not against doing this per se but it should not be part of the
default configuration.

Cheers,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On Mon, Apr 4, 2016 at 8:44 PM, Herbert Xu wrote:
> On Mon, Apr 04, 2016 at 09:31:21AM -0700, Alexander Duyck wrote:
>> RFC 6864 states that the IPv4 ID field MUST NOT be used for purposes other
>> than fragmentation and reassembly. Currently we are looking at this field
>> as a way of identifying what frames can be aggregated and which cannot for
>> GRO. While this is valid for frames that do not have DF set, it is invalid
>> to do so if the bit is set.
>
> This justification is bogus. GRO is a completely local optimisation
> that should have zero visibility to third parties. So it makes no
> sense to talk about RFC compliance of GRO. The Linux network stack
> as a whole is subject to RFC compliance, but not GRO per se.

The problem is right now we are mangling the IP ID for outer headers
on tunnels. We end up totally ignoring the delta between the values,
so if you have two flows that get interleaved over the same tunnel GRO
will currently mash the IP IDs for the two tunnels so that they end up
overlapping.

The reason why I keep referencing RFC 6864 is because it specifies
that the IP ID field must not be read if the DF bit is set, and that
if we are manipulating headers we can handle the IP ID as though we
are the transmitting station. What this means is that if DF is not
set we have to have unique values per packet; otherwise, if DF is set,
we can ignore the values.

>> In the case of the non-incrementing IP ID we will end up losing the data
>> that the IP ID is fixed. However as per RFC 6864 we should be able to
>> write any value into the IP ID when the DF bit is set so this should cause
>> minimal harm.
>
> No we should not do that, at least not by default. GRO was designed
> to be completely lossless, that is its main advantage over the various
> forms of LRO which preceded it.

Well the problem is it isn't right now. Instead, in the case of tunnels
it allows you to generate overlapping sequences of IP IDs.

> If you lose that people will start asking it to be disabled for
> routers/bridges and we'll be back in the same old mess that we
> had with LRO.

The question I would have is what are you really losing with increment
from 0 versus fixed 0? From what I see it is essentially just garbage
in/garbage out.

> So please do this properly and preserve the information in the packet.
> As I said earlier all it takes is one single bit, like we do with ECN.
> If you put it in the feature bit you'll also allow us to distinguish
> between TSO drivers that produce fixed IDs vs. incrementing IDs.

Actually it will take at least two: one for the outer headers and one
for the inner headers. I'll also have to add tracking for each in the
GRO code so that we don't merge frames that go from a fixed ID to an
incrementing one or vice versa, since I suspect that case is probably
floating around out there too; my GSO partial code was going to end up
doing some of that.

- Alex
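The ID check Alex is describing (accept an ID that incremented by exactly the number of merged segments, or a fixed ID with DF set, and accumulate any mismatch) can be modeled outside the kernel. This is a hypothetical simplification of the v2 patch's arithmetic, not the actual GRO code path:

```python
def flush_id_contribution(id_new, id_prev, count, df_set):
    """Return 0 if the candidate packet's IP ID is consistent with the
    held GRO packet under one of the two accepted schemes:
      - incrementing: ID grew by exactly `count` (segments already merged)
      - fixed with DF set: the ID is meaningless per RFC 6864, so ignore it
    Any nonzero result would accumulate into flush_id (via |=) and
    prevent the packets from being merged.
    """
    delta = (id_new - id_prev) & 0xFFFF   # 16-bit modular difference
    if delta == 0 and df_set:
        return 0                          # fixed ID, DF set: don't read the ID
    return delta ^ count                  # zero only for increment-by-count
```

The XOR trick mirrors the patch: `delta ^ count` is zero exactly when the ID advanced by `count`, and the modular subtraction handles IDs that wrap through 65535.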
Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On Mon, Apr 04, 2016 at 09:31:21AM -0700, Alexander Duyck wrote:
> RFC 6864 states that the IPv4 ID field MUST NOT be used for purposes other
> than fragmentation and reassembly. Currently we are looking at this field
> as a way of identifying what frames can be aggregated and which cannot for
> GRO. While this is valid for frames that do not have DF set, it is invalid
> to do so if the bit is set.

This justification is bogus. GRO is a completely local optimisation
that should have zero visibility to third parties. So it makes no
sense to talk about RFC compliance of GRO. The Linux network stack
as a whole is subject to RFC compliance, but not GRO per se.

> In the case of the non-incrementing IP ID we will end up losing the data
> that the IP ID is fixed. However as per RFC 6864 we should be able to
> write any value into the IP ID when the DF bit is set so this should cause
> minimal harm.

No we should not do that, at least not by default. GRO was designed
to be completely lossless; that is its main advantage over the various
forms of LRO which preceded it.

If you lose that people will start asking it to be disabled for
routers/bridges and we'll be back in the same old mess that we
had with LRO.

So please do this properly and preserve the information in the packet.
As I said earlier all it takes is one single bit, like we do with ECN.
If you put it in the feature bit you'll also allow us to distinguish
between TSO drivers that produce fixed IDs vs. incrementing IDs.

Cheers,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
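Herbert's "one single bit, like we do with ECN" suggestion amounts to recording, in GRO/GSO metadata, whether the original segments carried fixed or incrementing IP IDs so segmentation can reproduce them exactly. A hypothetical sketch (the flag name and helper are made up for illustration; they are not kernel APIs):

```python
FIXED_ID = 1 << 0  # made-up metadata flag: original segments had a fixed IP ID

def segment_ids(first_id, n_segs, flags):
    """Reproduce the per-segment IP IDs when re-segmenting a GRO superframe.

    With FIXED_ID set, every segment gets the original (arbitrary) ID;
    otherwise IDs increment from the first segment's ID, wrapping at 16 bits.
    """
    if flags & FIXED_ID:
        return [first_id] * n_segs
    return [(first_id + i) & 0xFFFF for i in range(n_segs)]
```

Carrying this single bit through is what would make GRO+GSO lossless with respect to the ID field in both schemes, which is the property Herbert is insisting on.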
[net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
RFC 6864 states that the IPv4 ID field MUST NOT be used for purposes other
than fragmentation and reassembly. Currently we are looking at this field
as a way of identifying what frames can be aggregated and which cannot for
GRO. While this is valid for frames that do not have DF set, it is invalid
to do so if the bit is set.

In addition we were generating IPv4 ID collisions when 2 or more flows
were interleaved over the same tunnel. To prevent that we store the
result of all IP ID checks via a "|=" instead of overwriting previous
values.

With this patch we support two different approaches for the IP ID field.
The first is a non-incrementing IP ID with DF bit set. In such a case we
simply won't write to the flush_id field in the GRO context block. The
other option is the legacy option in which the IP ID must increment by 1
for every packet we aggregate.

In the case of the non-incrementing IP ID we will end up losing the data
that the IP ID is fixed. However as per RFC 6864 we should be able to
write any value into the IP ID when the DF bit is set, so this should
cause minimal harm.

Signed-off-by: Alexander Duyck
---
v2: Updated patch so that we now only support one of two options. Either
    the IP ID is fixed with DF bit set, or the IP ID is incrementing.
    That allows us to support the fixed ID case as occurs with IPv6 to
    IPv4 header translation and what is likely already out there for
    some devices with tunnel headers.
 net/core/dev.c         |  1 +
 net/ipv4/af_inet.c     | 25 ++++++++++++++++++-------
 net/ipv6/ip6_offload.c |  3 ---
 3 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 77a71cd68535..3429632398a4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4352,6 +4352,7 @@ static void gro_list_prepare(struct napi_struct *napi, struct sk_buff *skb)
 		unsigned long diffs;

 		NAPI_GRO_CB(p)->flush = 0;
+		NAPI_GRO_CB(p)->flush_id = 0;

 		if (hash != skb_get_hash_raw(p)) {
 			NAPI_GRO_CB(p)->same_flow = 0;

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 9e481992dbae..33f6335448a2 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1324,6 +1324,7 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head,

 	for (p = *head; p; p = p->next) {
 		struct iphdr *iph2;
+		u16 flush_id;

 		if (!NAPI_GRO_CB(p)->same_flow)
 			continue;
@@ -1347,14 +1348,24 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head,
 			(iph->tos ^ iph2->tos) |
 			((iph->frag_off ^ iph2->frag_off) & htons(IP_DF));

-		/* Save the IP ID check to be included later when we get to
-		 * the transport layer so only the inner most IP ID is checked.
-		 * This is because some GSO/TSO implementations do not
-		 * correctly increment the IP ID for the outer hdrs.
-		 */
-		NAPI_GRO_CB(p)->flush_id =
-			((u16)(ntohs(iph2->id) + NAPI_GRO_CB(p)->count) ^ id);
 		NAPI_GRO_CB(p)->flush |= flush;
+
+		/* We must save the offset as it is possible to have multiple
+		 * flows using the same protocol and address pairs so we
+		 * need to wait until we can validate this is part of the
+		 * same flow with a 5-tuple or better to avoid unnecessary
+		 * collisions between flows. We can support one of two
+		 * possible scenarios, either a fixed value with DF bit set
+		 * or an incrementing value with DF either set or unset.
+		 * In the case of a fixed value we will end up losing the
+		 * data that the IP ID was a fixed value, however per RFC
+		 * 6864 in such a case the actual value of the IP ID is
+		 * meant to be ignored anyway.
+		 */
+		flush_id = (u16)(id - ntohs(iph2->id));
+		if (flush_id || !(iph2->frag_off & htons(IP_DF)))
+			NAPI_GRO_CB(p)->flush_id |= flush_id ^
+						    NAPI_GRO_CB(p)->count;
 	}

 	NAPI_GRO_CB(skb)->flush |= flush;

diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 82e9f3076028..9aa53f64dffd 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -238,9 +238,6 @@ static struct sk_buff **ipv6_gro_receive(struct sk_buff **head,
 		/* flush if Traffic Class fields are different */
 		NAPI_GRO_CB(p)->flush |= !!(first_word & htonl(0x0FF0));
 		NAPI_GRO_CB(p)->flush |= flush;
-
-		/* Clear flush_id, there's really no concept of ID in IPv6. */
-		NAPI_GRO_CB(p)->flush_id = 0;
 	}

 	NAPI_GRO_CB(skb)->flush |= flush;