Re: [iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-13 Thread Jesper Dangaard Brouer
On Tue, 13 Sep 2016 08:58:30 -0700
Eric Dumazet  wrote:

> We also care about icache pressure, and GRO/TSO already provides
> bundling where it is applicable, without adding insane complexity in
> the stacks.

Sorry, I cannot resist. The GRO code is really bad regarding icache
pressure/usage, due to how everything is function pointers calling
function pointers, even when the common case is calling the function
defined right next to it in the same C file (which would normally get
inlined).  I can easily get 10% more performance for UDP use-cases by
simply disabling the GRO code, and I measure a significant drop in
icache misses.
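
To make the point concrete, here is a minimal, self-contained sketch of
the indirect-call pattern I mean (not the real GRO code; the names are
borrowed only for illustration):

struct sk_buff;				/* opaque stand-in for the kernel skb */

struct packet_offload {
	struct sk_buff *(*gro_receive)(struct sk_buff *head,
				       struct sk_buff *skb);
};

static struct sk_buff *inet_gro_receive(struct sk_buff *head,
					struct sk_buff *skb)
{
	return head;			/* placeholder body */
}

static const struct packet_offload ip_offload = {
	.gro_receive = inet_gro_receive,
};

static struct sk_buff *dev_gro_receive(struct sk_buff *head,
				       struct sk_buff *skb)
{
	/* Indirect call even for the common IPv4 case; a direct call to
	 * inet_gro_receive() right above could have been inlined. */
	return ip_offload.gro_receive(head, skb);
}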

Edward's solution should lower icache pressure.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-13 Thread Eric Dumazet
On Tue, 2016-09-13 at 16:20 +0100, Edward Cree wrote:
> On 12/09/16 11:15, Jesper Dangaard Brouer wrote:
> > I'm reacting so loudly because this is a mental model switch that
> > needs to be applied to the full driver RX path, also for normal stack
> > delivery of SKBs. As both Edward Cree[1] and I[2] have demonstrated,
> > there is a 10%-25% perf gain here.
> >
> > [1] http://lists.openwall.net/netdev/2016/04/19/89
> > [2] http://lists.openwall.net/netdev/2016/01/15/51
> BTW, I'd also still rather like to see that happen; I never really
> understood the objections people had to those patches when I posted them.  I
> still believe that dealing in skb-lists instead of skbs, and thus
> 'automatically' bulking similar packets, is better than trying to categorise
> packets into flows early on based on some set of keys.  The problem with the
> latter approach is that there are now two definitions of "similar":
> 1) the set of fields used to index the flow
> 2) what will actually cause the stack's behaviour to differ if not using the
> cached values.
> Quite apart from the possibility of bugs if one changes but not the other,
> this forces (1) to be conservative, only considering things "similar" if the
> entire stack will.  Whereas with bundling, the stack can keep packets
> together until they reach a layer at which they are no longer "similar"
> enough.  Thus, for instance, packets with the same IP 3-tuple but different
> port numbers can be grouped together for IP layer processing, then split
> apart for L4.

To be fair, you never showed us the numbers for DDoS traffic, and you did
not show us how a typical TCP + netfilter-modules kind of traffic mix would
be handled.

Show us real numbers, not synthetic ones, say when receiving traffic on
100,000 or more TCP sockets.

We also care about icache pressure, and GRO/TSO already provides
bundling where it is applicable, without adding insane complexity in the
stacks.

Just look at how complex the software fallbacks for GSO/checksumming
are, how many bugs we had to fix... And this is only at the edge of our
stack.


Re: [iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-13 Thread Edward Cree
On 12/09/16 11:15, Jesper Dangaard Brouer wrote:
> I'm reacting so loudly because this is a mental model switch that
> needs to be applied to the full driver RX path, also for normal stack
> delivery of SKBs. As both Edward Cree[1] and I[2] have demonstrated,
> there is a 10%-25% perf gain here.
>
> [1] http://lists.openwall.net/netdev/2016/04/19/89
> [2] http://lists.openwall.net/netdev/2016/01/15/51
BTW, I'd also still rather like to see that happen; I never really
understood the objections people had to those patches when I posted them.  I
still believe that dealing in skb-lists instead of skbs, and thus
'automatically' bulking similar packets, is better than trying to categorise
packets into flows early on based on some set of keys.  The problem with the
latter approach is that there are now two definitions of "similar":
1) the set of fields used to index the flow
2) what will actually cause the stack's behaviour to differ if not using the
cached values.
Quite apart from the possibility of bugs if one changes but not the other,
this forces (1) to be conservative, only considering things "similar" if the
entire stack will.  Whereas with bundling, the stack can keep packets
together until they reach a layer at which they are no longer "similar"
enough.  Thus, for instance, packets with the same IP 3-tuple but different
port numbers can be grouped together for IP layer processing, then split
apart for L4.
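
As a rough illustration of that idea (a minimal sketch with made-up
names, not the code from my patches):

#include <stddef.h>

struct pkt {
	struct pkt *next;
	unsigned int saddr, daddr;	/* fields the IP layer keys on */
	unsigned short dport;		/* differs per L4 flow */
};

static void l4_deliver(struct pkt *p)
{
	(void)p;			/* per-socket work, one packet at a time */
}

static void ip_rcv_list(struct pkt *list)
{
	struct pkt *p, *next;

	/* Work that is identical for the whole bundle (route lookup,
	 * header sanity checks, ...) is done once per list here, not
	 * once per packet. */

	/* Split only where L4 stops considering the packets "similar",
	 * e.g. same IP 3-tuple but different ports. */
	for (p = list; p; p = next) {
		next = p->next;
		p->next = NULL;
		l4_deliver(p);
	}
}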

-Ed


Re: [iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-12 Thread Tom Herbert
On Mon, Sep 12, 2016 at 3:15 AM, Jesper Dangaard Brouer
 wrote:
> On Fri, 9 Sep 2016 18:03:09 +0300
> Saeed Mahameed  wrote:
>
>> On Fri, Sep 9, 2016 at 6:22 AM, Alexei Starovoitov via iovisor-dev
>>  wrote:
>> > On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote:
>> >>
>> >> I'm sorry but I have a problem with this patch!
>> >> Looking at this patch, I want to bring up a fundamental architectural
>> >> concern with the development direction of XDP transmit.
>> >>
>> >>
>> >> What you are trying to implement, with delaying the doorbell, is
>> >> basically TX bulking for TX_XDP.
>> >>
>> >>  Why not implement a TX bulking interface directly instead?!?
>> >>
>> >> Yes, the tailptr/doorbell is the most costly operation, but why not
>> >> also take advantage of the benefits of bulking for other parts of the
>> >> code? (The benefit is smaller, but every cycle counts in this area.)
>> >>
>> >> This whole XDP exercise is about avoiding a per-packet transaction
>> >> cost, read: "bulking" or "bundling" of packets wherever possible.
>> >>
>> >>  Let's do bundling/bulking from the start!
>>
>> Jesper, what we did here is also bulking: instead of bulking in a
>> temporary list in the driver, we list the packets in the HW, and once
>> done we transmit all at once via the xdp_doorbell indication.
>>
>> I agree with you that we can take advantage and improve the icache by
>> bulking first in software, then queueing all at once in the HW and
>> ringing one doorbell.
>>
>> But I also agree with Alexei that this will introduce extra
>> pointer/list handling in the driver, and we need to do the comparison
>> between both approaches before we decide which is better.
>
> I welcome implementing both approaches and benchmarking them against
> each other; I'll gladly dedicate time for this!
>
Yes, please implement this so we can have something clear to evaluate
and compare. There is far too much spewing of "expert opinions"
happening here :-(

> I'm reacting so loudly because this is a mental model switch that
> needs to be applied to the full driver RX path, also for normal stack
> delivery of SKBs. As both Edward Cree[1] and I[2] have demonstrated,
> there is a 10%-25% perf gain here.
>
> The key point is to stop seeing the lowest driver RX level as something
> that works on a per-packet basis.  It might be easier to view this as a
> kind of vector processing: having a vector of packets to which we apply
> some action/operation.
>
> This is about using the CPU more efficiently, getting it to do more
> instructions per cycle.  The next level of optimization (for >= Sandy
> Bridge CPUs) is to make these vector-processing stages small enough to fit
> into the CPU's decoded-I-cache section.
>
>
> It might also be important to mention that, for netstack delivery, I
> don't imagine bulking 64 packets.  Instead, I imagine doing 8-16
> packets.  Why? Because the NIC HW runs independently and has the
> opportunity to deliver more frames into the RX ring queue while the
> stack "slowly" processes packets.  You can view this as "bulking" from
> the RX ring queue, with a "look-back" before exiting the NAPI poll loop.
>
>
>> this must be marked as future work and not be required from the start.
>
> We both know that statement is BS, and the other approach will never be
> implemented once this patch is accepted upstream.
>
>
>> > mlx4 already does bulking and this proposed mlx5 set of patches
>> > does bulking as well.
>
> I'm reacting exactly because mlx4 is also doing "bulking" in the wrong
> way IMHO.  And now mlx5 is building on the same principle. That is why
> I'm yelling STOP.
>
>
>> >> The reason behind the xmit_more API is that we could not change the
>> >> API of all the drivers.  And we found that calling an explicit NDO
>> >> flush came at a cost (only approx 7 ns IIRC), but it is still a cost
>> >> that would hit the common single-packet use-case.
>> >>
>> >> It should be really easy to build a bundle of packets that need the
>> >> XDP_TX action, especially given you only have a single destination
>> >> "port".  And then you XDP_TX-send this bundle before
>> >> mlx5_cqwq_update_db_record.
>
>
> [1] http://lists.openwall.net/netdev/2016/04/19/89
> [2] http://lists.openwall.net/netdev/2016/01/15/51
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer


Re: [iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-12 Thread Jesper Dangaard Brouer
On Fri, 9 Sep 2016 18:03:09 +0300
Saeed Mahameed  wrote:

> On Fri, Sep 9, 2016 at 6:22 AM, Alexei Starovoitov via iovisor-dev
>  wrote:
> > On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote:  
> >>
> >> I'm sorry but I have a problem with this patch!  
> >> Looking at this patch, I want to bring up a fundamental architectural
> >> concern with the development direction of XDP transmit.
> >>
> >>
> >> What you are trying to implement, with delaying the doorbell, is
> >> basically TX bulking for TX_XDP.
> >>
> >>  Why not implement a TX bulking interface directly instead?!?
> >>
> >> Yes, the tailptr/doorbell is the most costly operation, but why not
> >> also take advantage of the benefits of bulking for other parts of the
> >> code? (The benefit is smaller, but every cycle counts in this area.)
> >>
> >> This whole XDP exercise is about avoiding a per-packet transaction
> >> cost, read: "bulking" or "bundling" of packets wherever possible.
> >>
> >>  Let's do bundling/bulking from the start!
> 
> Jesper, what we did here is also bulking: instead of bulking in a
> temporary list in the driver, we list the packets in the HW, and once
> done we transmit all at once via the xdp_doorbell indication.
>
> I agree with you that we can take advantage and improve the icache by
> bulking first in software, then queueing all at once in the HW and
> ringing one doorbell.
>
> But I also agree with Alexei that this will introduce extra
> pointer/list handling in the driver, and we need to do the comparison
> between both approaches before we decide which is better.

I welcome implementing both approaches and benchmarking them against
each other; I'll gladly dedicate time for this!

I'm reacting so loudly because this is a mental model switch that
needs to be applied to the full driver RX path, also for normal stack
delivery of SKBs. As both Edward Cree[1] and I[2] have demonstrated,
there is a 10%-25% perf gain here.

The key point is to stop seeing the lowest driver RX level as something
that works on a per-packet basis.  It might be easier to view this as a
kind of vector processing: having a vector of packets to which we apply
some action/operation.

This is about using the CPU more efficiently, getting it to do more
instructions per cycle.  The next level of optimization (for >= Sandy
Bridge CPUs) is to make these vector-processing stages small enough to fit
into the CPU's decoded-I-cache section.


It might also be important to mention that, for netstack delivery, I
don't imagine bulking 64 packets.  Instead, I imagine doing 8-16
packets.  Why? Because the NIC HW runs independently and has the
opportunity to deliver more frames into the RX ring queue while the
stack "slowly" processes packets.  You can view this as "bulking" from
the RX ring queue, with a "look-back" before exiting the NAPI poll loop.
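
In pseudo-C, the shape I have in mind is roughly this (a sketch only;
rx_ring_next(), build_skb_from_desc() and netif_receive_skb_bundle() are
made-up stand-ins, not existing APIs):

#define BUNDLE_MAX 16			/* 8-16, not the full 64 budget */

struct rx_desc;
struct sk_buff;

/* stand-in stubs for driver/stack internals */
static struct rx_desc *rx_ring_next(void) { return 0; }
static struct sk_buff *build_skb_from_desc(struct rx_desc *d)
{ (void)d; return 0; }
static void netif_receive_skb_bundle(struct sk_buff **skbs, int n)
{ (void)skbs; (void)n; }

static int napi_poll_sketch(int budget)
{
	int done = 0;

	while (done < budget) {
		struct sk_buff *bundle[BUNDLE_MAX];
		int n = 0;

		/* Look back at the ring: it may have refilled while the
		 * stack processed the previous bundle. */
		while (n < BUNDLE_MAX && done + n < budget) {
			struct rx_desc *d = rx_ring_next();

			if (!d)
				break;
			bundle[n++] = build_skb_from_desc(d);
		}
		if (!n)
			break;		/* ring really is empty: exit NAPI */

		netif_receive_skb_bundle(bundle, n);
		done += n;
	}
	return done;
}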


> this must be marked as future work and not be required from the start.

We both know that statement is BS, and the other approach will never be
implemented once this patch is accepted upstream.


> > mlx4 already does bulking and this proposed mlx5 set of patches
> > does bulking as well.

I'm reacting exactly because mlx4 is also doing "bulking" in the wrong
way IMHO.  And now mlx5 is building on the same principle. That is why
I'm yelling STOP.


> >> The reason behind the xmit_more API is that we could not change the
> >> API of all the drivers.  And we found that calling an explicit NDO
> >> flush came at a cost (only approx 7 ns IIRC), but it is still a cost
> >> that would hit the common single-packet use-case.
> >>
> >> It should be really easy to build a bundle of packets that need the
> >> XDP_TX action, especially given you only have a single destination
> >> "port".  And then you XDP_TX-send this bundle before
> >> mlx5_cqwq_update_db_record.
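
For reference, the xmit_more contract on the regular TX path looks
roughly like this on the driver side (a sketch with stand-in types and
hypothetical helpers, not any particular driver):

struct net_device;

struct sk_buff {
	unsigned int xmit_more : 1;	/* stack: "another packet follows" */
};

/* hypothetical driver helpers */
static void post_descriptor(struct net_device *dev, struct sk_buff *skb)
{ (void)dev; (void)skb; }
static int ring_is_full(struct net_device *dev)
{ (void)dev; return 0; }
static void write_tail_doorbell(struct net_device *dev)
{ (void)dev; }

static int sketch_ndo_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	post_descriptor(dev, skb);	/* always queue the descriptor */

	/* Defer the expensive tailptr/doorbell MMIO write until the last
	 * packet of a burst (or until the ring fills up). */
	if (!skb->xmit_more || ring_is_full(dev))
		write_tail_doorbell(dev);

	return 0;			/* NETDEV_TX_OK */
}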


[1] http://lists.openwall.net/netdev/2016/04/19/89
[2] http://lists.openwall.net/netdev/2016/01/15/51

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-09 Thread Saeed Mahameed
On Fri, Sep 9, 2016 at 6:22 AM, Alexei Starovoitov via iovisor-dev
 wrote:
> On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote:
>>
>> I'm sorry but I have a problem with this patch!
>
> is it because the variable is called 'xdp_doorbell'?
> Frankly I see nothing scary in this patch.
> It extends existing code by adding a flag that says whether to ring the doorbell.
> The end of rx napi is used as an obvious heuristic to flush the pipe.
> Looks pretty generic to me.
> The same code can be used for non-XDP as well, once we figure out a
> good algorithm for xmit_more in the stack.
>
>> Looking at this patch, I want to bring up a fundamental architectural
>> concern with the development direction of XDP transmit.
>>
>>
>> What you are trying to implement, with delaying the doorbell, is
>> basically TX bulking for TX_XDP.
>>
>>  Why not implement a TX bulking interface directly instead?!?
>>
>> Yes, the tailptr/doorbell is the most costly operation, but why not
>> also take advantage of the benefits of bulking for other parts of the
>> code? (The benefit is smaller, but every cycle counts in this area.)
>>
>> This whole XDP exercise is about avoiding a per-packet transaction
>> cost, read: "bulking" or "bundling" of packets wherever possible.
>>
>>  Let's do bundling/bulking from the start!

Jesper, what we did here is also bulking: instead of bulking in a
temporary list in the driver, we list the packets in the HW, and once
done we transmit all at once via the xdp_doorbell indication.
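
In other words, something like this simplified sketch (not the mlx5e
code; hw_post_tx_desc() and hw_ring_doorbell() are hypothetical
stand-ins): descriptors are posted to the SQ as we go, and the doorbell
is written only once, when the RX NAPI loop completes.

struct xdp_sq {
	unsigned int pc;		/* producer counter */
	int doorbell;			/* descriptors pending a doorbell */
};

static void hw_post_tx_desc(struct xdp_sq *sq, void *frame)
{ (void)sq; (void)frame; }		/* write the descriptor into the SQ */

static void hw_ring_doorbell(struct xdp_sq *sq)
{ (void)sq; }				/* the expensive MMIO/tailptr write */

static void xdp_tx_frame(struct xdp_sq *sq, void *frame)
{
	hw_post_tx_desc(sq, frame);
	sq->pc++;
	sq->doorbell = 1;		/* defer the doorbell */
}

/* called once when the RX NAPI budget is consumed or the RX ring is empty */
static void rx_napi_complete(struct xdp_sq *sq)
{
	if (sq->doorbell) {
		hw_ring_doorbell(sq);	/* a single doorbell per NAPI run */
		sq->doorbell = 0;
	}
}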

I agree with you that we can take advantage and improve the icache by
bulking first in software, then queueing all at once in the HW and
ringing one doorbell.

But I also agree with Alexei that this will introduce extra
pointer/list handling in the driver, and we need to do the comparison
between both approaches before we decide which is better.

This must be marked as future work and not be required from the start.

>
> mlx4 already does bulking and this proposed mlx5 set of patches
> does bulking as well.
> See nothing wrong about it. RX side processes the packets and
> when it's done it tells TX to xmit whatever it collected.
>
>> The reason behind the xmit_more API is that we could not change the
>> API of all the drivers.  And we found that calling an explicit NDO
>> flush came at a cost (only approx 7 ns IIRC), but it is still a cost
>> that would hit the common single-packet use-case.
>>
>> It should be really easy to build a bundle of packets that need the
>> XDP_TX action, especially given you only have a single destination
>> "port".  And then you XDP_TX-send this bundle before
>> mlx5_cqwq_update_db_record.
>
> Not sure what you are proposing here?
> Sounds like you want to extend it to multi-port in the future?
> Sure. The proposed code is easily extendable.
>
> Or do you want to see something like a linked list of packets,
> or an array of packets that the RX side is preparing and then
> sends as a whole array/list to the TX port?
> I don't think that would be efficient, since it would mean
> unnecessary copying of pointers.
>
>> In the future, XDP needs to support XDP_FWD forwarding of packets/pages
>> out of other interfaces.  I also want bulk transmit from day-1 here.  It
>> is slightly more tricky to sort packets for multiple outgoing
>> interfaces efficiently in the poll loop.
>
> I don't think so. Multi-port is a natural extension to this set of patches.
> With multi-port, the end of RX will tell the multiple ports (that were
> used to TX) to ring the bell. Pretty trivial, and it doesn't involve any
> extra arrays or linked lists.
>
>> But the mSwitch[1] article actually already solved this destination
>> sorting.  Please read section 3.3 "Switch Fabric Algorithm" of [1] to
>> understand the next steps toward a smarter data structure, once there
>> are more TX "ports".  And perhaps align your single XDP_TX destination
>> data structure with this future development.
>>
>> [1] http://info.iet.unipi.it/~luigi/papers/20150617-mswitch-paper.pdf
>
> I don't see how this particular paper applies to the existing kernel code.
> It's great to take ideas from research papers, but real code is different.
>
>> --Jesper
>> (top post)
>
> since when is it ok to top post?
>
>> On Wed,  7 Sep 2016 15:42:32 +0300 Saeed Mahameed  
>> wrote:
>>
>> > Previously we rang the XDP SQ doorbell on every forwarded XDP packet.
>> >
>> > Here we introduce an xmit_more-like mechanism that will queue up more
>> > than one packet into the SQ (up to the RX napi budget) w/o notifying the
>> > hardware.
>> >
>> > Once the RX napi budget is consumed and we exit the napi RX loop, we
>> > flush (doorbell) all XDP-looped packets, in case there are any.
>> >
>> > XDP forward packet rate:
>> >
>> > Comparing XDP with and w/o xmit more (bulk transmit):
>> >
>> > Streams   XDP TX     XDP TX (xmit more)
>> > -----------------------------------------
>> > 1         4.90Mpps   7.50Mpps
>> > 2         9.50Mpps   14.8Mpps
>> > 4         16.5Mpps