Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864

2016-04-06 Thread David Miller
From: Tom Herbert 
Date: Wed, 6 Apr 2016 14:42:26 -0300

> On Wed, Apr 6, 2016 at 12:43 PM, David Miller  wrote:
>> From: Tom Herbert 
>> Date: Wed, 6 Apr 2016 10:53:52 -0300
>>
>>> Packets that are forwarded really should not be GRO'ed in the first
>>> place because of the loss of information and added latency.
>>
>> First of all GRO is supposed to be lossless, so please stop saying this
>> would be a reason to turn it off on a router.
>>
>> Second of all, the biggest piece of overhead is the routing lookup,
>> therefore GRO batching helps enormously with routing workloads, and
>> therefore is appropriate to be enabled on routers.
>>
>> Yes, I agree that for locally terminated stuff it helps more, but don't
>> turn this into a "GRO on routers, meh..." type argument.  It simply is
>> not true at all.
>>
> GRO on routers will help in a limited case where there is little load
> and the traffic is nicely groomed high-tput TCP connections. But for
> routers with significant load, handling large quantities of other
> protocols like UDP, GRO is not necessarily helpful and presents a
> nondeterministic performance improvement. For instance, I cannot
> provision a router with any assumptions that GRO will be effective for
> any % of packets to save any % of CPU; we need to provision based
> purely on ability to forward by pps assuming no benefit from GRO.

Just because you cannot predict how effective a facility will be,
that doesn't mean you shouldn't use it at all.



Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864

2016-04-06 Thread Tom Herbert
On Wed, Apr 6, 2016 at 12:43 PM, David Miller  wrote:
> From: Tom Herbert 
> Date: Wed, 6 Apr 2016 10:53:52 -0300
>
>> Packets that are forwarded really should not be GRO'ed in the first
>> place because of the loss of information and added latency.
>
> First of all GRO is supposed to be lossless, so please stop saying this
> would be a reason to turn it off on a router.
>
> Second of all, the biggest piece of overhead is the routing lookup,
> therefore GRO batching helps enormously with routing workloads, and
> therefore is appropriate to be enabled on routers.
>
> Yes, I agree that for locally terminated stuff it helps more, but don't
> turn this into a "GRO on routers, meh..." type argument.  It simply is
> not true at all.
>
GRO on routers will help in a limited case where there is little load
and the traffic is nicely groomed high-tput TCP connections. But for
routers with significant load, handling large quantities of other
protocols like UDP, GRO is not necessarily helpful and presents a
nondeterministic performance improvement. For instance, I cannot
provision a router with any assumptions that GRO will be effective for
any % of packets to save any % of CPU; we need to provision based
purely on ability to forward by pps assuming no benefit from GRO.
Technically, this provisioning argument applies to end hosts also. We
have seen real cases on video servers where servers were
mis-provisioned with the assumption that GSO is always effective. So
when we're getting good GSO benefits we might be using something like
80% CPU, but if some connectivity event occurs forcing all the cwnds
of the active connections to drop, we find we need 110% of CPU to
recover.

This discussion is relevant because there is a big push now to replace
dedicated HW with commodity HW running Linux; this has already happened
significantly in load balancers, but I expect it to extend to even
some cases of basic switching. Besides, I seriously doubt you'll find
any commercial router in the world that does GRO.

Yes, GRO needs to work correctly in all cases, but for system
performance in routing we are bound by per-packet actions like route
lookup. As I said, optimizing GRO for the forwarding case as Edward is
suggesting seems to have diminishing benefits.

Tom


Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864

2016-04-06 Thread Eric Dumazet
On Wed, 2016-04-06 at 16:55 +0100, Edward Cree wrote:
> On 06/04/16 16:39, Eric Dumazet wrote:
> > Look at the mess of some helpers in net/core/skbuff.c, and imagine the
> > super mess it would be if using a concept of 'super packet with various
> > headers on each segment'.
> Maybe I'm still not explaining this very well, but there is _no_ concept of
> 'super packet [anything]' in this idea.  There is just 'list of skbs that
> were all received in the same NAPI poll, and have not yet been determined to
> be different'.
> 
> Any layer that doesn't want to deal with this stuff will always have the
> option of "while (skb = skb_dequeue(list)) my_normal_receive_function(skb);"
> and in fact I'd make that happen by default for anything that hadn't
> registered a function to take a list.
> > netfilter is already complex, it would become a nightmare.
> A netfilter hook could, for instance, run on each packet in the list, then
> partition the list into sub-lists of packets that all had the same verdict
> (letting go of any that were DROP or STOLEN).  That doesn't seem like it
> should be nightmarish.

Okay, I will let you try this alone ;)


Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864

2016-04-06 Thread Edward Cree
On 06/04/16 16:39, Eric Dumazet wrote:
> Look at the mess of some helpers in net/core/skbuff.c, and imagine the
> super mess it would be if using a concept of 'super packet with various
> headers on each segment'.
Maybe I'm still not explaining this very well, but there is _no_ concept of
'super packet [anything]' in this idea.  There is just 'list of skbs that
were all received in the same NAPI poll, and have not yet been determined to
be different'.

Any layer that doesn't want to deal with this stuff will always have the
option of "while (skb = skb_dequeue(list)) my_normal_receive_function(skb);"
and in fact I'd make that happen by default for anything that hadn't
registered a function to take a list.
> netfilter is already complex, it would become a nightmare.
A netfilter hook could, for instance, run on each packet in the list, then
partition the list into sub-lists of packets that all had the same verdict
(letting go of any that were DROP or STOLEN).  That doesn't seem like it
should be nightmarish.

-Ed


Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864

2016-04-06 Thread David Miller
From: Tom Herbert 
Date: Wed, 6 Apr 2016 10:53:52 -0300

> Packets that are forwarded really should not be GRO'ed in the first
> place because of the loss of information and added latency.

First of all GRO is supposed to be lossless, so please stop saying this
would be a reason to turn it off on a router.

Second of all, the biggest piece of overhead is the routing lookup,
therefore GRO batching helps enormously with routing workloads, and
therefore is appropriate to be enabled on routers.

Yes, I agree that for locally terminated stuff it helps more, but don't
turn this into a "GRO on routers, meh..." type argument.  It simply is
not true at all.



Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864

2016-04-06 Thread Eric Dumazet
On Wed, 2016-04-06 at 15:26 +0100, Edward Cree wrote:
> On 06/04/16 14:53, Tom Herbert wrote:
> > But again, this scheme is optimizing for forwarding case and doesn't
> > help (and probably hurts) the use case of locally terminated
> > connections
> Not really.  I think this has a chance to outperform GRO for locally
> terminated connections, for a number of reasons:
> * Doesn't look at higher-layer or inner headers until later in packet
>   processing, for instance we (maybe) process every L3 header in a NAPI poll
>   before looking at a single L4.  This could delay touching the second
>   cacheline of some packets.
> * Doesn't have to re-write headers to produce a coherent superframe.
>   Instead, each segment carries its original headers around with it.  Also
>   means we can skip _checking_ some headers to see if we're 'allowed' to
>   coalesce (e.g. TCP TS differences, and the current fun with IP IDs).
> * Can be used for protocols like UDP where the original packet boundaries
>   are important (so you can't coalesce into a superframe).
> Really the last of those was the original reason for this idea, helping with
> forwarding is just another nice bonus that we (might) get from it.
> And it's all speculative and I don't know for sure what the performance
> would be like, but I won't know until I try it!

Look at the mess of some helpers in net/core/skbuff.c, and imagine the
super mess it would be if using a concept of 'super packet with various
headers on each segment'. netfilter is already complex, it would become
a nightmare.

GRO on the other hand presents one virtual set of headers, then the
payload in one or multiple frags.



Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864

2016-04-06 Thread Edward Cree
On 06/04/16 14:53, Tom Herbert wrote:
> But again, this scheme is optimizing for forwarding case and doesn't
> help (and probably hurts) the use case of locally terminated
> connections
Not really.  I think this has a chance to outperform GRO for locally
terminated connections, for a number of reasons:
* Doesn't look at higher-layer or inner headers until later in packet
  processing, for instance we (maybe) process every L3 header in a NAPI poll
  before looking at a single L4.  This could delay touching the second
  cacheline of some packets.
* Doesn't have to re-write headers to produce a coherent superframe.
  Instead, each segment carries its original headers around with it.  Also
  means we can skip _checking_ some headers to see if we're 'allowed' to
  coalesce (e.g. TCP TS differences, and the current fun with IP IDs).
* Can be used for protocols like UDP where the original packet boundaries
  are important (so you can't coalesce into a superframe).
Really the last of those was the original reason for this idea, helping with
forwarding is just another nice bonus that we (might) get from it.
And it's all speculative and I don't know for sure what the performance
would be like, but I won't know until I try it!
> which I would claim is more important.
No argument there :-)
> Packets that are
> forwarded really should not be GRO'ed in the first place because of
> the loss of information and added latency. The difficulty is that we
> don't currently make the forwarding decision before GRO; if we can
> change this to decide whether packets are to be forwarded earlier then
> we can avoid doing GRO for those.
That certainly would be nice, XDP is exciting and I look forward to it.

-Ed


Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864

2016-04-06 Thread Tom Herbert
On Wed, Apr 6, 2016 at 8:21 AM, Edward Cree  wrote:
> On 06/04/16 00:45, David Miller wrote:
>> From: Edward Cree 
>> Date: Tue, 5 Apr 2016 16:07:49 +0100
>>
>>> On the gripping hand, I feel like GRO+TSO is the wrong model for
>>> speeding up forwarding/routing workloads.  Instead we should be
>>> looking into having lists of SKBs traverse the stack together,
>>> splitting the list whenever e.g. the destination changes.
>> "Destination" is a very complicated beast.  It's not just a
>> destination IP address.
>>
>> It's not even just a full saddr/daddr/TOS triplet.
>>
>> Packets can be forwarded around based upon any key whatsoever in the
>> headers.  Netfilter can mangle them based upon arbitrary bits in the
>> packet, as can the packet scheduler classifier actions.
>>
>> It's therefore not profitable to try this at all, it's completely
>> pointless unless all the keys match up exactly.
> Possibly I wasn't completely clear (or maybe I was and I'm just
> wrong...), but I meant that _each layer_ in the stack would split the
> list whenever it wants to treat two packets differently.  Whether
> that's a protocol receive handler, or a netfilter or tc operation.
>
> Obviously if you want to decide at the _beginning_ whether "all the
> keys match", then you do essentially need GRO's flow-matching logic.
> But even then, I find myself wondering if having GRO coalesce the
> segments into a superpacket is really better than having it just make
> lists of segments, and have that list traverse the stack as a single
> entity.  That way lossless resegmentation remains easy.  But I suppose
> that could make life difficult for things like BPF, if they want to
> act upon the superframe (because we haven't built it).  If instead
> they act on each of the segments, we might get different actions for
> each segment and that might also be awkward; so you'd still need this
> concept of 'any layer in the stack can decide to split lists up'.
>
But again, this scheme is optimizing for forwarding case and doesn't
help (and probably hurts) the use case of locally terminated
connections which I would claim is more important. Packets that are
forwarded really should not be GRO'ed in the first place because of
the loss of information and added latency. The difficulty is that we
don't currently make the forwarding decision before GRO; if we can
change this to decide whether packets are to be forwarded earlier then
we can avoid doing GRO for those.

Tom

> -Ed


Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864

2016-04-06 Thread Edward Cree
On 06/04/16 00:45, David Miller wrote:
> From: Edward Cree 
> Date: Tue, 5 Apr 2016 16:07:49 +0100
>
>> On the gripping hand, I feel like GRO+TSO is the wrong model for
>> speeding up forwarding/routing workloads.  Instead we should be
>> looking into having lists of SKBs traverse the stack together,
>> splitting the list whenever e.g. the destination changes.
> "Destination" is a very complicated beast.  It's not just a
> destination IP address.
>
> It's not even just a full saddr/daddr/TOS triplet.
>
> Packets can be forwarded around based upon any key whatsoever in the
> headers.  Netfilter can mangle them based upon arbitrary bits in the
> packet, as can the packet scheduler classifier actions.
>
> It's therefore not profitable to try this at all, it's completely
> pointless unless all the keys match up exactly.
Possibly I wasn't completely clear (or maybe I was and I'm just
wrong...), but I meant that _each layer_ in the stack would split the
list whenever it wants to treat two packets differently.  Whether
that's a protocol receive handler, or a netfilter or tc operation.

Obviously if you want to decide at the _beginning_ whether "all the
keys match", then you do essentially need GRO's flow-matching logic.
But even then, I find myself wondering if having GRO coalesce the
segments into a superpacket is really better than having it just make
lists of segments, and have that list traverse the stack as a single
entity.  That way lossless resegmentation remains easy.  But I suppose
that could make life difficult for things like BPF, if they want to
act upon the superframe (because we haven't built it).  If instead
they act on each of the segments, we might get different actions for
each segment and that might also be awkward; so you'd still need this
concept of 'any layer in the stack can decide to split lists up'.

-Ed


Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864

2016-04-05 Thread Marcelo Ricardo Leitner
On Tue, Apr 05, 2016 at 12:36:40PM -0300, Tom Herbert wrote:
> On Tue, Apr 5, 2016 at 12:07 PM, Edward Cree  wrote:
> > On 05/04/16 05:32, Herbert Xu wrote:
> >> On Mon, Apr 04, 2016 at 09:26:55PM -0700, Alexander Duyck wrote:
> >>> The question I would have is what are you really losing with increment
> >>> from 0 versus fixed 0?  From what I see it is essentially just garbage
> >>> in/garbage out.
> >> GRO is meant to be lossless, that is, you should not be able to
> >> detect its presence from the outside.  If you lose information then
> >> you're breaking this rule and people will soon start asking for it
> >> to be disabled in various situations.
> >>
> >> I'm not against doing this per se but it should not be part of the
> >> default configuration.
> > I'm certainly in favour of this being configurable - indeed IMHO it should
> > also be possible to configure GRO with the 'looser' semantics of LRO, so
> > that people who want that can get it without all the horrible "don't confuse
> > Slow Start" hacks, and so that LRO can go away (AIUI the only reasons it
> > exists are (a) improved performance from the 'loose' semantics and (b) old
> > kernels without GRO.  We may not be able to kill (b) but we can certainly
> > address (a)).
> >
> > But I don't agree that the default has to be totally lossless; anyone who is
> > caring about the ID fields in atomic datagrams is breaking the RFCs, and can
> > be assumed to Know What They're Doing sufficiently to configure this.
> >
> > On the gripping hand, I feel like GRO+TSO is the wrong model for speeding up
> > forwarding/routing workloads.  Instead we should be looking into having 
> > lists
> > of SKBs traverse the stack together, splitting the list whenever e.g. the
> > destination changes.  That seems like it ought to be much more efficient 
> > than
> > rewriting headers twice, once to coalesce a superframe and once to segment 
> > it
> > again - and it also means this worry about GRO being lossless can go away.
> > But until someone tries implementing skb batches, we won't know for sure if
> > it works (and I don't have time right now ;)
> >
> Ed,
> 
> I thought about that some. It seems like we would want to do both GRO
> and retain all the individual packets in the skb so that we could use
> those for forwarding instead of GSO as I think you're saying. This

Retaining the individual packets would also help to make GRO feasible
for SCTP.  SCTP needs to know where each packet ended because of AUTH
chunks and we cannot rely on something like gso_size as each original
packet had its own size.

I could do it for the tx side (see my SCTP/GSO RFC patches) using
skb_gro_receive() with a specially crafted header skb, but I'm not
seeing a way to do it on the rx side as I cannot guarantee incoming skbs
will follow that pattern.

  Marcelo

> would work great in the plain forwarding case, but one problem
> is what to do if the host modifies the super packet (for instance when
> forwarding over a tunnel we might add encapsulation header). This
> should work in GSO (although we need to address the limitations around
> 1 encap level), not sure this is easy if we need to add a header to
> each packet in a batch.
> 
> Tom
> 
> 
> 
> > -Ed
> 


Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864

2016-04-05 Thread David Miller
From: Edward Cree 
Date: Tue, 5 Apr 2016 16:07:49 +0100

> On the gripping hand, I feel like GRO+TSO is the wrong model for
> speeding up forwarding/routing workloads.  Instead we should be
> looking into having lists of SKBs traverse the stack together,
> splitting the list whenever e.g. the destination changes.

"Destination" is a very complicated beast.  It's not just a
destination IP address.

It's not even just a full saddr/daddr/TOS triplet.

Packets can be forwarded around based upon any key whatsoever in the
headers.  Netfilter can mangle them based upon arbitrary bits in the
packet, as can the packet scheduler classifier actions.

It's therefore not profitable to try this at all, it's completely
pointless unless all the keys match up exactly.

This is why GRO _is_ the proper model to speed this stuff and do
bulk processing, because it still presents a full "packet" to all
of these layers to mangle, rewrite, route, and do whatever else
however they like.


Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864

2016-04-05 Thread Tom Herbert
On Tue, Apr 5, 2016 at 2:06 PM, Edward Cree  wrote:
> On 05/04/16 16:36, Tom Herbert wrote:
>> I thought about that some. It seems like we would want to do both GRO
>> and retain all the individual packets in the skb so that we could use
>> those for forwarding instead of GSO as I think you're saying.
> I didn't quite mean that, I meant just pass around the skb list, don't
> do GRO at all.  The receiving TCP stack ends up just getting called
> several times, in quick succession, without I$ loss from network stack
> traversal in between.
>
>> This
>> would work great in the plain forwarding case, but one problem
>> is what to do if the host modifies the super packet (for instance when
>> forwarding over a tunnel we might add encapsulation header). This
>> should work in GSO (although we need to address the limitations around
>> 1 encap level), not sure this is easy if we need to add a header to
>> each packet in a batch.
> This is indeed a problem with what I was proposing; perhaps the answer
> is that as you process these SKB lists you also update something like a
> GRO CB, then if you do decide to transform the packets you can coalesce
> them at that point.  But doing 'the rest' of GRO might cost as much as
> just transforming all the packets, in which case you only win if you want
> to transform them multiple times.
> And if we assume that we're going to be touching all the headers anyway,
> it probably doesn't cost us much to transform all the packets in the list
> since our code and data are both hot in cache.  Well, the code is cold
> for the first packet in the list, but equally it's cold for the
> superpacket in the current approach.
>
> If this is as performant as GRO in the normal (non-forwarding) receive
> case (and that's a *big* if, which can only be resolved by trying it), it
> might make sense to just not have GRO, while TSO only gets used for
> locally-generated sends, and for the case where you're forwarding between
> networks with different MTUs (e.g. from a 64k VM netdevice to the wire).
>
> What do you think?  Am I crazy?  (More than my usual?)
>
It's not clear to me how important optimizing the GRO to GSO
forwarding case really is. We definitely need GRO for locally received
packets, but how much benefit do we really get when applying GRO to
forwarding and does that really outweigh the latency we're adding in
the forwarding path to to GRO? Also, once we implement high
performance forwarding in XDP there is really is no need to consider
GRO, we'll do some sort of I/O batching for sure but that would be
batching packets to same TX queue not same destination.

Tom

> -Ed


Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864

2016-04-05 Thread Edward Cree
On 05/04/16 16:36, Tom Herbert wrote:
> I thought about that some. It seems like we would want to do both GRO
> and retain all the individual packets in the skb so that we could use
> those for forwarding instead of GSO as I think you're saying.
I didn't quite mean that, I meant just pass around the skb list, don't
do GRO at all.  The receiving TCP stack ends up just getting called
several times, in quick succession, without I$ loss from network stack
traversal in between.

> This
> would work great in the plain forwarding case, but one problem
> is what to do if the host modifies the super packet (for instance when
> forwarding over a tunnel we might add encapsulation header). This
> should work in GSO (although we need to address the limitations around
> 1 encap level), not sure this is easy if we need to add a header to
> each packet in a batch.
This is indeed a problem with what I was proposing; perhaps the answer
is that as you process these SKB lists you also update something like a
GRO CB, then if you do decide to transform the packets you can coalesce
them at that point.  But doing 'the rest' of GRO might cost as much as
just transforming all the packets, in which case you only win if you want
to transform them multiple times.
And if we assume that we're going to be touching all the headers anyway,
it probably doesn't cost us much to transform all the packets in the list
since our code and data are both hot in cache.  Well, the code is cold
for the first packet in the list, but equally it's cold for the
superpacket in the current approach.

If this is as performant as GRO in the normal (non-forwarding) receive
case (and that's a *big* if, which can only be resolved by trying it), it
might make sense to just not have GRO, while TSO only gets used for
locally-generated sends, and for the case where you're forwarding between
networks with different MTUs (e.g. from a 64k VM netdevice to the wire).

What do you think?  Am I crazy?  (More than my usual?)

-Ed


Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864

2016-04-05 Thread Alexander Duyck
On Tue, Apr 5, 2016 at 9:30 AM, Eric Dumazet  wrote:
> On Tue, 2016-04-05 at 08:52 -0700, Alexander Duyck wrote:
>
>>
>> I disagree I think it will have to be part of the default
>> configuration.  The problem is the IP ID is quickly becoming
>> meaningless.  When you consider that a 40Gb/s link can wrap the IP ID
>> value nearly 50 times a second using a 1500 MTU, the IP ID field should
>> just be ignored anyway because you cannot guarantee that it will be
>> unique without limiting the Tx window size.  That was the whole point
>> of RFC6864.  Basically the IP ID field is so small that as we push
>> into the higher speeds you cannot guarantee that the field will have
>> any meaning so for any case where you don't need to use it you
>> shouldn't because it will likely not provide enough useful data.
>
> Just because a few flows reach 40Gbit, we should remember that the vast
> majority of the Internet runs on < 50Mbit flows.
>
> I prefer the argument of IPv6 not having ID ;)

Okay, maybe I'll try to use that argument more often then.. :-)

> We should do our best to keep interoperability, this is the selling
> point.
>
> And quite frankly your last patch makes perfect sense to me :

Yes.  It was a compromise, though I might still have to go through and
refine it more.  It might make sense for the IP header associated with
the TCP flow, but for outer headers it actually is worse because we
end up blocking several different possibilities.  What I might need to
do is capture the state of the DF bit as we work a flow up through the
stack and once it is in the list of GRO SKBs use that DF bit as a flag
to indicate whether we support an incrementing or fixed pattern for the
values.  That way tunnels can optionally ignore the IP ID if the DF
bit is set since their values may not be as clean as that of TCP.

> The aggregation is done only if the TCP headers of consecutive packets
> matches. So who cares of IPv4 ID really ?
> This is a very minor detail. The possible gains outperform the
> theoretical 'problem'
>
> GRO already reorders flows; it never had a guarantee of being 'invisible'
> as Herbert claims.

I can see what he is trying to get at.  I just think it is a bit too
strict on the interpretation of what values have to be maintained.  My
plan going forward is to add a sysctl that will probably allow us some
wiggle room in regards to IP ID for GRO and GSO so that when it is
disabled we will not perform GSO partial nor allow for repeating IP ID
in GRO on devices that cannot get the IP ID right.

- Alex


Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864

2016-04-05 Thread Eric Dumazet
On Tue, 2016-04-05 at 08:52 -0700, Alexander Duyck wrote:

> 
> I disagree I think it will have to be part of the default
> configuration.  The problem is the IP ID is quickly becoming
> meaningless.  When you consider that a 40Gb/s link can wrap the IP ID
> value nearly 50 times a second using a 1500 MTU, the IP ID field should
> just be ignored anyway because you cannot guarantee that it will be
> unique without limiting the Tx window size.  That was the whole point
> of RFC6864.  Basically the IP ID field is so small that as we push
> into the higher speeds you cannot guarantee that the field will have
> any meaning so for any case where you don't need to use it you
> shouldn't because it will likely not provide enough useful data.

Just because a few flows reach 40Gbit, we should remember that the vast
majority of the Internet runs on < 50Mbit flows.

I prefer the argument of IPv6 not having ID ;)

We should do our best to keep interoperability, this is the selling
point. 

And quite frankly your last patch makes perfect sense to me :

The aggregation is done only if the TCP headers of consecutive packets
matches. So who cares of IPv4 ID really ?
This is a very minor detail. The possible gains outperform the
theoretical 'problem'

GRO already reorders flows; it never had a guarantee of being 'invisible'
as Herbert claims.


Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864

2016-04-05 Thread Alexander Duyck
On Mon, Apr 4, 2016 at 9:32 PM, Herbert Xu  wrote:
> On Mon, Apr 04, 2016 at 09:26:55PM -0700, Alexander Duyck wrote:
>>
>> The problem is right now we are mangling the IP ID for outer headers
>> on tunnels.  We end up totally ignoring the delta between the values
>> so if you have two flows that get interleaved over the same tunnel GRO
>> will currently mash the IP IDs for the two tunnels so that they end up
>> overlapping.
>
> Then it should be fixed.  I never reviewed those patches or I would
> have objected at the time.

The problem is what you are proposing is much larger than I am
comfortable proposing for a net patch so I will probably go back to
targeting net-next.

>> The reason why I keep referencing RFC 6864 is because it specifies
>> that the IP ID field must not be read if the DF bit is set, and that
>> if we are manipulating headers we can handle the IP ID as though we
>> are the transmitting station.  What this means is that if DF is not
>> set we have to have unique values per packet, otherwise we can ignore
>> the values if DF is set.
>
> As I said GRO itself should not be visible.  The fact that it is
> for tunnels is a bug.

Right.  But the fact that we are even trying to get reliable data out
of IP ID is also a bug.  The fact is the IP ID isn't going to be
linear in the case of tunnels.  There is a good likelihood that the IP
ID will jump around if we are doing encapsulation using a third party
device such as a switch.  That was one of the reasons why my first
implementation just completely ignored the IP ID in the case that the
DF bit is set.  Honestly I am leaning more toward taking the approach
of going back to that implementation and adding a sysctl that would
let you disable it.  Maybe something like net.ipv4.gro.rfc6864 since
that is the RFC that spells out that we should treat IP ID as a
floating input/output if DF is set.  Basically we need to be able to
do that if the GSO partial code is going to work.

>> The question I would have is what are you really losing with increment
>> from 0 versus fixed 0?  From what I see it is essentially just garbage
>> in/garbage out.
>
> GRO is meant to be lossless, that is, you should not be able to
> detect its presence from the outside.  If you lose information then
> you're breaking this rule and people will soon start asking for it
> to be disabled in various situations.

Right.  But in this case we technically aren't losing information.
That is the thing I have been trying to point out with RFC 6864.  It
calls out that you MUST not read IP ID if the DF bit is set.

> I'm not against doing this per se but it should not be part of the
> default configuration.

I disagree I think it will have to be part of the default
configuration.  The problem is the IP ID is quickly becoming
meaningless.  When you consider that a 40Gb/s link can wrap the IP ID
value nearly 50 times a second using a 1500 MTU, the IP ID field should
just be ignored anyway because you cannot guarantee that it will be
unique without limiting the Tx window size.  That was the whole point
of RFC6864.  Basically the IP ID field is so small that as we push
into the higher speeds you cannot guarantee that the field will have
any meaning so for any case where you don't need to use it you
shouldn't because it will likely not provide enough useful data.

- Alex


Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864

2016-04-05 Thread Tom Herbert
On Tue, Apr 5, 2016 at 12:07 PM, Edward Cree  wrote:
> On 05/04/16 05:32, Herbert Xu wrote:
>> On Mon, Apr 04, 2016 at 09:26:55PM -0700, Alexander Duyck wrote:
>>> The question I would have is what are you really losing with increment
>>> from 0 versus fixed 0?  From what I see it is essentially just garbage
>>> in/garbage out.
>> GRO is meant to be lossless, that is, you should not be able to
>> detect its presence from the outside.  If you lose information then
>> you're breaking this rule and people will soon start asking for it
>> to be disabled in various situations.
>>
>> I'm not against doing this per se but it should not be part of the
>> default configuration.
> I'm certainly in favour of this being configurable - indeed IMHO it should
> also be possible to configure GRO with the 'looser' semantics of LRO, so
> that people who want that can get it without all the horrible "don't confuse
> Slow Start" hacks, and so that LRO can go away (AIUI the only reasons it
> exists are (a) improved performance from the 'loose' semantics and (b) old
> kernels without GRO.  We may not be able to kill (b) but we can certainly
> address (a)).
>
> But I don't agree that the default has to be totally lossless; anyone who
> cares about the ID fields in atomic datagrams is breaking the RFCs, and can
> be assumed to Know What They're Doing sufficiently to configure this.
>
> On the gripping hand, I feel like GRO+TSO is the wrong model for speeding up
> forwarding/routing workloads.  Instead we should be looking into having lists
> of SKBs traverse the stack together, splitting the list whenever e.g. the
> destination changes.  That seems like it ought to be much more efficient than
> rewriting headers twice, once to coalesce a superframe and once to segment it
> again - and it also means this worry about GRO being lossless can go away.
> But until someone tries implementing skb batches, we won't know for sure if
> it works (and I don't have time right now ;)
>
Ed,

I've thought about that some.  It seems like we would want to do both:
GRO the packets and also retain all the individual packets in the skb
so that we could use those for forwarding instead of GSO, as I think
you're saying.  This would work great in the plain forwarding case, but
one problem is what to do if the host modifies the super packet (for
instance, when forwarding over a tunnel we might add an encapsulation
header).  This should work with GSO (although we need to address the
limitation of one encapsulation level); I'm not sure it is easy if we
need to add a header to each packet in a batch.

Tom



> -Ed


Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864

2016-04-05 Thread Edward Cree
On 05/04/16 05:32, Herbert Xu wrote:
> On Mon, Apr 04, 2016 at 09:26:55PM -0700, Alexander Duyck wrote:
>> The question I would have is what are you really losing with increment
>> from 0 versus fixed 0?  From what I see it is essentially just garbage
>> in/garbage out.
> GRO is meant to be lossless, that is, you should not be able to
> detect its presence from the outside.  If you lose information then
> you're breaking this rule and people will soon start asking for it
> to be disabled in various situations.
>
> I'm not against doing this per se but it should not be part of the
> default configuration.
I'm certainly in favour of this being configurable - indeed IMHO it should
also be possible to configure GRO with the 'looser' semantics of LRO, so
that people who want that can get it without all the horrible "don't confuse
Slow Start" hacks, and so that LRO can go away (AIUI the only reasons it
exists are (a) improved performance from the 'loose' semantics and (b) old
kernels without GRO.  We may not be able to kill (b) but we can certainly
address (a)).

But I don't agree that the default has to be totally lossless; anyone who
cares about the ID fields in atomic datagrams is breaking the RFCs, and can
be assumed to Know What They're Doing sufficiently to configure this.

On the gripping hand, I feel like GRO+TSO is the wrong model for speeding up
forwarding/routing workloads.  Instead we should be looking into having lists
of SKBs traverse the stack together, splitting the list whenever e.g. the
destination changes.  That seems like it ought to be much more efficient than
rewriting headers twice, once to coalesce a superframe and once to segment it
again - and it also means this worry about GRO being lossless can go away.
But until someone tries implementing skb batches, we won't know for sure if
it works (and I don't have time right now ;)

-Ed


Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864

2016-04-04 Thread Herbert Xu
On Mon, Apr 04, 2016 at 09:26:55PM -0700, Alexander Duyck wrote:
> 
> The problem is that right now we are mangling the IP ID for outer
> headers on tunnels.  We end up totally ignoring the delta between the
> values, so if you have two flows that get interleaved over the same
> tunnel, GRO will currently mash the IP IDs for the two flows so that
> they end up overlapping.

Then it should be fixed.  I never reviewed those patches or I would
have objected at the time.

> The reason why I keep referencing RFC 6864 is because it specifies
> that the IP ID field must not be read if the DF bit is set, and that
> if we are manipulating headers we can handle the IP ID as though we
> are the transmitting station.  What this means is that if DF is not
> set we have to have unique values per packet, whereas if DF is set we
> can ignore the values.

As I said GRO itself should not be visible.  The fact that it is
for tunnels is a bug.
 
> The question I would have is what are you really losing with increment
> from 0 versus fixed 0?  From what I see it is essentially just garbage
> in/garbage out.

GRO is meant to be lossless, that is, you should not be able to
detect its presence from the outside.  If you lose information then
you're breaking this rule and people will soon start asking for it
to be disabled in various situations.

I'm not against doing this per se but it should not be part of the
default configuration.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864

2016-04-04 Thread Alexander Duyck
On Mon, Apr 4, 2016 at 8:44 PM, Herbert Xu  wrote:
> On Mon, Apr 04, 2016 at 09:31:21AM -0700, Alexander Duyck wrote:
>> RFC 6864 states that the IPv4 ID field MUST NOT be used for purposes other
>> than fragmentation and reassembly.  Currently we are looking at this field
>> as a way of identifying what frames can be aggregated and which cannot for
>> GRO.  While this is valid for frames that do not have DF set, it is invalid
>> to do so if the bit is set.
>
> This justification is bogus.  GRO is a completely local optimisation
> that should have zero visibility to third parties.  So it makes no
> sense to talk about RFC compliance of GRO.  The Linux network stack
> as a whole is subject to RFC compliance, but not GRO per se.

The problem is that right now we are mangling the IP ID for outer
headers on tunnels.  We end up totally ignoring the delta between the
values, so if you have two flows that get interleaved over the same
tunnel, GRO will currently mash the IP IDs for the two flows so that
they end up overlapping.

The reason why I keep referencing RFC 6864 is because it specifies
that the IP ID field must not be read if the DF bit is set, and that
if we are manipulating headers we can handle the IP ID as though we
are the transmitting station.  What this means is that if DF is not
set we have to have unique values per packet, whereas if DF is set we
can ignore the values.

>> In the case of the non-incrementing IP ID we will end up losing the data
>> that the IP ID is fixed.  However as per RFC 6864 we should be able to
>> write any value into the IP ID when the DF bit is set so this should cause
>> minimal harm.
>
> No, we should not do that, at least not by default.  GRO was designed
> to be completely lossless; that is its main advantage over the various
> forms of LRO which preceded it.

Well, the problem is that it isn't lossless right now.  Instead, in the
case of tunnels it allows you to generate overlapping sequences of IP IDs.

> If you lose that people will start asking it to be disabled for
> routers/bridges and we'll be back in the same old mess that we
> had with LRO.

The question I would have is what are you really losing with increment
from 0 versus fixed 0?  From what I see it is essentially just garbage
in/garbage out.

> So please do this properly and preserve the information in the packet.
> As I said earlier all it takes is one single bit, like we do with ECN.
> If you put it in the feature bit you'll also allow us to distinguish
> between TSO drivers that produce fixed IDs vs. incrementing IDs.

Actually it will take at least two: one for the outer headers and one
for the inner headers.  I'll also have to add tracking for each to the
GRO code so that we don't merge frames that go from a fixed ID to an
incrementing one or vice versa, since I suspect that pattern is
probably floating around out there too; my GSO partial code was going
to end up doing some of that.

- Alex


Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864

2016-04-04 Thread Herbert Xu
On Mon, Apr 04, 2016 at 09:31:21AM -0700, Alexander Duyck wrote:
> RFC 6864 states that the IPv4 ID field MUST NOT be used for purposes other
> than fragmentation and reassembly.  Currently we are looking at this field
> as a way of identifying what frames can be aggregated and which cannot for
> GRO.  While this is valid for frames that do not have DF set, it is invalid
> to do so if the bit is set.

This justification is bogus.  GRO is a completely local optimisation
that should have zero visibility to third parties.  So it makes no
sense to talk about RFC compliance of GRO.  The Linux network stack
as a whole is subject to RFC compliance, but not GRO per se.
 
> In the case of the non-incrementing IP ID we will end up losing the data
> that the IP ID is fixed.  However as per RFC 6864 we should be able to
> write any value into the IP ID when the DF bit is set so this should cause
> minimal harm.

No, we should not do that, at least not by default.  GRO was designed
to be completely lossless; that is its main advantage over the various
forms of LRO which preceded it.

If you lose that people will start asking it to be disabled for
routers/bridges and we'll be back in the same old mess that we
had with LRO.

So please do this properly and preserve the information in the packet.
As I said earlier all it takes is one single bit, like we do with ECN.
If you put it in the feature bit you'll also allow us to distinguish
between TSO drivers that produce fixed IDs vs. incrementing IDs.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864

2016-04-04 Thread subashab

On 2016-04-04 10:31, Alexander Duyck wrote:
RFC 6864 states that the IPv4 ID field MUST NOT be used for purposes other
than fragmentation and reassembly.  Currently we are looking at this field
as a way of identifying what frames can be aggregated and which cannot for
GRO.  While this is valid for frames that do not have DF set, it is invalid
to do so if the bit is set.

In addition we were generating IPv4 ID collisions when 2 or more flows were
interleaved over the same tunnel.  To prevent that we store the result of
all IP ID checks via a "|=" instead of overwriting previous values.

With this patch we support two different approaches for the IP ID field.
The first is a non-incrementing IP ID with DF bit set.  In such a case we
simply won't write to the flush_id field in the GRO context block.  The
other option is the legacy option in which the IP ID must increment by 1
for every packet we aggregate.

In the case of the non-incrementing IP ID we will end up losing the data
that the IP ID is fixed.  However as per RFC 6864 we should be able to
write any value into the IP ID when the DF bit is set so this should cause
minimal harm.

Signed-off-by: Alexander Duyck 
---

v2: Updated patch so that we now only support one of two options.  Either
the IP ID is fixed with DF bit set, or the IP ID is incrementing.  That
allows us to support the fixed ID case as occurs with IPv6 to IPv4
header translation and what is likely already out there for some
devices with tunnel headers.

 net/core/dev.c |1 +
 net/ipv4/af_inet.c |   25 ++---
 net/ipv6/ip6_offload.c |3 ---
 3 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 77a71cd68535..3429632398a4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4352,6 +4352,7 @@ static void gro_list_prepare(struct napi_struct *napi, struct sk_buff *skb)
unsigned long diffs;

NAPI_GRO_CB(p)->flush = 0;
+   NAPI_GRO_CB(p)->flush_id = 0;

if (hash != skb_get_hash_raw(p)) {
NAPI_GRO_CB(p)->same_flow = 0;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 9e481992dbae..33f6335448a2 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1324,6 +1324,7 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head,

for (p = *head; p; p = p->next) {
struct iphdr *iph2;
+   u16 flush_id;

if (!NAPI_GRO_CB(p)->same_flow)
continue;
@@ -1347,14 +1348,24 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head,
(iph->tos ^ iph2->tos) |
((iph->frag_off ^ iph2->frag_off) & htons(IP_DF));

-   /* Save the IP ID check to be included later when we get to
-* the transport layer so only the inner most IP ID is checked.
-* This is because some GSO/TSO implementations do not
-* correctly increment the IP ID for the outer hdrs.
-*/
-   NAPI_GRO_CB(p)->flush_id =
-   ((u16)(ntohs(iph2->id) + NAPI_GRO_CB(p)->count) ^ id);
NAPI_GRO_CB(p)->flush |= flush;
+
+   /* We must save the offset as it is possible to have multiple
+* flows using the same protocol and address pairs so we
+* need to wait until we can validate this is part of the
+* same flow with a 5-tuple or better to avoid unnecessary
+* collisions between flows.  We can support one of two
+* possible scenarios, either a fixed value with DF bit set
+* or an incrementing value with DF either set or unset.
+* In the case of a fixed value we will end up losing the
+* data that the IP ID was a fixed value, however per RFC
+* 6864 in such a case the actual value of the IP ID is
+* meant to be ignored anyway.
+*/
+   flush_id = (u16)(id - ntohs(iph2->id));
+   if (flush_id || !(iph2->frag_off & htons(IP_DF)))
+   NAPI_GRO_CB(p)->flush_id |= flush_id ^
+   NAPI_GRO_CB(p)->count;
}

NAPI_GRO_CB(skb)->flush |= flush;
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 82e9f3076028..9aa53f64dffd 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -238,9 +238,6 @@ static struct sk_buff **ipv6_gro_receive(struct sk_buff **head,
/* flush if Traffic Class fields are different */
NAPI_GRO_CB(p)->flush |= !!(first_word & htonl(0x0FF00000));
NAPI_GRO_CB(p)->flush |= flush;
-
-   /* Clear flush_id, there's really no concept of ID in IPv6. */
-   NAPI_GRO_CB(p)->flush_id = 0;
}


[net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864

2016-04-04 Thread Alexander Duyck
RFC 6864 states that the IPv4 ID field MUST NOT be used for purposes other
than fragmentation and reassembly.  Currently we are looking at this field
as a way of identifying what frames can be aggregated and which cannot for
GRO.  While this is valid for frames that do not have DF set, it is invalid
to do so if the bit is set.

In addition we were generating IPv4 ID collisions when 2 or more flows were
interleaved over the same tunnel.  To prevent that we store the result of
all IP ID checks via a "|=" instead of overwriting previous values.

With this patch we support two different approaches for the IP ID field.
The first is a non-incrementing IP ID with DF bit set.  In such a case we
simply won't write to the flush_id field in the GRO context block.  The
other option is the legacy option in which the IP ID must increment by 1
for every packet we aggregate.

In the case of the non-incrementing IP ID we will end up losing the data
that the IP ID is fixed.  However as per RFC 6864 we should be able to
write any value into the IP ID when the DF bit is set so this should cause
minimal harm.

Signed-off-by: Alexander Duyck 
---

v2: Updated patch so that we now only support one of two options.  Either
the IP ID is fixed with DF bit set, or the IP ID is incrementing.  That
allows us to support the fixed ID case as occurs with IPv6 to IPv4
header translation and what is likely already out there for some
devices with tunnel headers.

 net/core/dev.c |1 +
 net/ipv4/af_inet.c |   25 ++---
 net/ipv6/ip6_offload.c |3 ---
 3 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 77a71cd68535..3429632398a4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4352,6 +4352,7 @@ static void gro_list_prepare(struct napi_struct *napi, struct sk_buff *skb)
unsigned long diffs;
 
NAPI_GRO_CB(p)->flush = 0;
+   NAPI_GRO_CB(p)->flush_id = 0;
 
if (hash != skb_get_hash_raw(p)) {
NAPI_GRO_CB(p)->same_flow = 0;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 9e481992dbae..33f6335448a2 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1324,6 +1324,7 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head,
 
for (p = *head; p; p = p->next) {
struct iphdr *iph2;
+   u16 flush_id;
 
if (!NAPI_GRO_CB(p)->same_flow)
continue;
@@ -1347,14 +1348,24 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head,
(iph->tos ^ iph2->tos) |
((iph->frag_off ^ iph2->frag_off) & htons(IP_DF));
 
-   /* Save the IP ID check to be included later when we get to
-* the transport layer so only the inner most IP ID is checked.
-* This is because some GSO/TSO implementations do not
-* correctly increment the IP ID for the outer hdrs.
-*/
-   NAPI_GRO_CB(p)->flush_id =
-   ((u16)(ntohs(iph2->id) + NAPI_GRO_CB(p)->count) ^ id);
NAPI_GRO_CB(p)->flush |= flush;
+
+   /* We must save the offset as it is possible to have multiple
+* flows using the same protocol and address pairs so we
+* need to wait until we can validate this is part of the
+* same flow with a 5-tuple or better to avoid unnecessary
+* collisions between flows.  We can support one of two
+* possible scenarios, either a fixed value with DF bit set
+* or an incrementing value with DF either set or unset.
+* In the case of a fixed value we will end up losing the
+* data that the IP ID was a fixed value, however per RFC
+* 6864 in such a case the actual value of the IP ID is
+* meant to be ignored anyway.
+*/
+   flush_id = (u16)(id - ntohs(iph2->id));
+   if (flush_id || !(iph2->frag_off & htons(IP_DF)))
+   NAPI_GRO_CB(p)->flush_id |= flush_id ^
+   NAPI_GRO_CB(p)->count;
}
 
NAPI_GRO_CB(skb)->flush |= flush;
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 82e9f3076028..9aa53f64dffd 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -238,9 +238,6 @@ static struct sk_buff **ipv6_gro_receive(struct sk_buff **head,
/* flush if Traffic Class fields are different */
NAPI_GRO_CB(p)->flush |= !!(first_word & htonl(0x0FF00000));
NAPI_GRO_CB(p)->flush |= flush;
-
-   /* Clear flush_id, there's really no concept of ID in IPv6. */
-   NAPI_GRO_CB(p)->flush_id = 0;
}
 
NAPI_GRO_CB(skb)->flush |= flush;