Re: [iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Tue, 13 Sep 2016 08:58:30 -0700 Eric Dumazetwrote: > We also care about icache pressure, and GRO/TSO already provides > bundling where it is applicable, without adding insane complexity in > the stacks. Sorry, I cannot resist. The GRO code is really bad regarding icache pressure/usage, due to how everything is function pointers calling function pointers, even if the general case is calling the function defined just next to it in the same C-file (which usually cause inlining). I can easily get 10% more performance for UDP use-cases by simply disabling the GRO code, and I measure a significant drop in icache-misses. Edward's solution should lower icache pressure. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
Re: [iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Tue, 2016-09-13 at 16:20 +0100, Edward Cree wrote: > On 12/09/16 11:15, Jesper Dangaard Brouer wrote: > > I'm reacting so loudly, because this is a mental model switch, that > > need to be applied to the full drivers RX path. Also for normal stack > > delivery of SKBs. As both Edward Cree[1] and I[2] have demonstrated, > > there is between 10%-25% perf gain here. > > > > [1] http://lists.openwall.net/netdev/2016/04/19/89 > > [2] http://lists.openwall.net/netdev/2016/01/15/51 > BTW, I'd also still rather like to see that happen, I never really > understood the objections people had to those patches when I posted them. I > still believe that dealing in skb-lists instead of skbs, and thus > 'automatically' bulking similar packets, is better than trying to categorise > packets into flows early on based on some set of keys. The problem with the > latter approach is that there are now two definitions of "similar": > 1) the set of fields used to index the flow > 2) what will actually cause the stack's behaviour to differ if not using the > cached values. > Quite apart from the possibility of bugs if one changes but not the other, > this forces (1) to be conservative, only considering things "similar" if the > entire stack will. Whereas with bundling, the stack can keep packets > together until they reach a layer at which they are no longer "similar" > enough. Thus, for instance, packets with the same IP 3-tuple but different > port numbers can be grouped together for IP layer processing, then split > apart for L4. To be fair you never showed us the numbers for DDOS traffic, and you did not show us how typical TCP + netfilter modules kind of traffic would be handled. Show us real numbers, not synthetic ones, say when receiving traffic on 100,000 or more TCP sockets. We also care about icache pressure, and GRO/TSO already provides bundling where it is applicable, without adding insane complexity in the stacks. Just look at how complex the software fallbacks for GSO/checksumming are, how many bugs we had to fix... And this is only at the edge of our stack.
Re: [iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On 12/09/16 11:15, Jesper Dangaard Brouer wrote: > I'm reacting so loudly, because this is a mental model switch, that > need to be applied to the full drivers RX path. Also for normal stack > delivery of SKBs. As both Edward Cree[1] and I[2] have demonstrated, > there is between 10%-25% perf gain here. > > [1] http://lists.openwall.net/netdev/2016/04/19/89 > [2] http://lists.openwall.net/netdev/2016/01/15/51 BTW, I'd also still rather like to see that happen, I never really understood the objections people had to those patches when I posted them. I still believe that dealing in skb-lists instead of skbs, and thus 'automatically' bulking similar packets, is better than trying to categorise packets into flows early on based on some set of keys. The problem with the latter approach is that there are now two definitions of "similar": 1) the set of fields used to index the flow 2) what will actually cause the stack's behaviour to differ if not using the cached values. Quite apart from the possibility of bugs if one changes but not the other, this forces (1) to be conservative, only considering things "similar" if the entire stack will. Whereas with bundling, the stack can keep packets together until they reach a layer at which they are no longer "similar" enough. Thus, for instance, packets with the same IP 3-tuple but different port numbers can be grouped together for IP layer processing, then split apart for L4. -Ed
Re: [iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Mon, Sep 12, 2016 at 3:15 AM, Jesper Dangaard Brouerwrote: > On Fri, 9 Sep 2016 18:03:09 +0300 > Saeed Mahameed wrote: > >> On Fri, Sep 9, 2016 at 6:22 AM, Alexei Starovoitov via iovisor-dev >> wrote: >> > On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote: >> >> >> >> I'm sorry but I have a problem with this patch! >> >> Looking at this patch, I want to bring up a fundamental architectural >> >> concern with the development direction of XDP transmit. >> >> >> >> >> >> What you are trying to implement, with delaying the doorbell, is >> >> basically TX bulking for TX_XDP. >> >> >> >> Why not implement a TX bulking interface directly instead?!? >> >> >> >> Yes, the tailptr/doorbell is the most costly operation, but why not >> >> also take advantage of the benefits of bulking for other parts of the >> >> code? (benefit is smaller, by every cycles counts in this area) >> >> >> >> This hole XDP exercise is about avoiding having a transaction cost per >> >> packet, that reads "bulking" or "bundling" of packets, where possible. >> >> >> >> Lets do bundling/bulking from the start! >> >> Jesper, what we did here is also bulking, instead of bulking in a >> temporary list in the driver we list the packets in the HW and once >> done we transmit all at once via the xdp_doorbell indication. >> >> I agree with you that we can take advantage and improve the icache by >> bulking first in software and then queue all at once in the hw then >> ring one doorbell. >> >> but I also agree with Alexei that this will introduce an extra >> pointer/list handling in the driver and we need to do the comparison >> between both approaches before we decide which is better. > > I welcome implementing both approaches and benchmarking them against > each-other, I'll gladly dedicate time for this! > Yes, please implement this so we can have something clear to evaluate and compare. There is far to much spewing of "expert opinions" happening here :-( > I'm reacting so loudly, because this is a mental model switch, that > need to be applied to the full drivers RX path. Also for normal stack > delivery of SKBs. As both Edward Cree[1] and I[2] have demonstrated, > there is between 10%-25% perf gain here. > > The key point is stop seeing the lowest driver RX as something that > works on a per packet basis. It might be easier to view this as a kind > of vector processing. This is about having a vector of packets, where > we apply some action/operation. > > This is about using the CPU more efficiently, getting it to do more > instructions per cycle. The next level of optimization (for >= Sandy > Bridge CPUs) is to make these vector processing stages small enough to fit > into the CPUs decoded-I-cache section. > > > It might also be important to mention, that for netstack delivery, I > don't imagine bulking 64 packets. Instead, I imagine doing 8-16 > packets. Why, because the NIC-HW runs independently and have the > opportunity to deliver more frames in the RX ring queue, while the > stack "slow" processed packets. You can view this as "bulking" from > the RX ring queue, with a "look-back" before exiting the NAPI poll loop. > > >> this must be marked as future work and not have this from the start. > > We both know that statement is BS, and the other approach will never be > implemented once this patch is accepted upstream. > > >> > mlx4 already does bulking and this proposed mlx5 set of patches >> > does bulking as well. > > I'm reacting exactly because mlx4 is also doing "bulking" in the wrong > way IMHO. And now mlx5 is building on the same principle. That is why > I'm yelling STOP. > > >> >> The reason behind the xmit_more API is that we could not change the >> >> API of all the drivers. And we found that calling an explicit NDO >> >> flush came at a cost (only approx 7 ns IIRC), but it still a cost that >> >> would hit the common single packet use-case. >> >> >> >> It should be really easy to build a bundle of packets that need XDP_TX >> >> action, especially given you only have a single destination "port". >> >> And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record. > > > [1] http://lists.openwall.net/netdev/2016/04/19/89 > [2] http://lists.openwall.net/netdev/2016/01/15/51 > > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > Author of http://www.iptv-analyzer.org > LinkedIn: http://www.linkedin.com/in/brouer
Re: [iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Fri, 9 Sep 2016 18:03:09 +0300 Saeed Mahameedwrote: > On Fri, Sep 9, 2016 at 6:22 AM, Alexei Starovoitov via iovisor-dev > wrote: > > On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote: > >> > >> I'm sorry but I have a problem with this patch! > >> Looking at this patch, I want to bring up a fundamental architectural > >> concern with the development direction of XDP transmit. > >> > >> > >> What you are trying to implement, with delaying the doorbell, is > >> basically TX bulking for TX_XDP. > >> > >> Why not implement a TX bulking interface directly instead?!? > >> > >> Yes, the tailptr/doorbell is the most costly operation, but why not > >> also take advantage of the benefits of bulking for other parts of the > >> code? (benefit is smaller, by every cycles counts in this area) > >> > >> This hole XDP exercise is about avoiding having a transaction cost per > >> packet, that reads "bulking" or "bundling" of packets, where possible. > >> > >> Lets do bundling/bulking from the start! > > Jesper, what we did here is also bulking, instead of bulking in a > temporary list in the driver we list the packets in the HW and once > done we transmit all at once via the xdp_doorbell indication. > > I agree with you that we can take advantage and improve the icache by > bulking first in software and then queue all at once in the hw then > ring one doorbell. > > but I also agree with Alexei that this will introduce an extra > pointer/list handling in the driver and we need to do the comparison > between both approaches before we decide which is better. I welcome implementing both approaches and benchmarking them against each-other, I'll gladly dedicate time for this! I'm reacting so loudly, because this is a mental model switch, that need to be applied to the full drivers RX path. Also for normal stack delivery of SKBs. As both Edward Cree[1] and I[2] have demonstrated, there is between 10%-25% perf gain here. The key point is stop seeing the lowest driver RX as something that works on a per packet basis. It might be easier to view this as a kind of vector processing. This is about having a vector of packets, where we apply some action/operation. This is about using the CPU more efficiently, getting it to do more instructions per cycle. The next level of optimization (for >= Sandy Bridge CPUs) is to make these vector processing stages small enough to fit into the CPUs decoded-I-cache section. It might also be important to mention, that for netstack delivery, I don't imagine bulking 64 packets. Instead, I imagine doing 8-16 packets. Why, because the NIC-HW runs independently and have the opportunity to deliver more frames in the RX ring queue, while the stack "slow" processed packets. You can view this as "bulking" from the RX ring queue, with a "look-back" before exiting the NAPI poll loop. > this must be marked as future work and not have this from the start. We both know that statement is BS, and the other approach will never be implemented once this patch is accepted upstream. > > mlx4 already does bulking and this proposed mlx5 set of patches > > does bulking as well. I'm reacting exactly because mlx4 is also doing "bulking" in the wrong way IMHO. And now mlx5 is building on the same principle. That is why I'm yelling STOP. > >> The reason behind the xmit_more API is that we could not change the > >> API of all the drivers. And we found that calling an explicit NDO > >> flush came at a cost (only approx 7 ns IIRC), but it still a cost that > >> would hit the common single packet use-case. > >> > >> It should be really easy to build a bundle of packets that need XDP_TX > >> action, especially given you only have a single destination "port". > >> And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record. [1] http://lists.openwall.net/netdev/2016/04/19/89 [2] http://lists.openwall.net/netdev/2016/01/15/51 -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
Re: [iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Fri, Sep 9, 2016 at 6:22 AM, Alexei Starovoitov via iovisor-devwrote: > On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote: >> >> I'm sorry but I have a problem with this patch! > > is it because the variable is called 'xdp_doorbell'? > Frankly I see nothing scary in this patch. > It extends existing code by adding a flag to ring doorbell or not. > The end of rx napi is used as an obvious heuristic to flush the pipe. > Looks pretty generic to me. > The same code can be used for non-xdp as well once we figure out > good algorithm for xmit_more in the stack. > >> Looking at this patch, I want to bring up a fundamental architectural >> concern with the development direction of XDP transmit. >> >> >> What you are trying to implement, with delaying the doorbell, is >> basically TX bulking for TX_XDP. >> >> Why not implement a TX bulking interface directly instead?!? >> >> Yes, the tailptr/doorbell is the most costly operation, but why not >> also take advantage of the benefits of bulking for other parts of the >> code? (benefit is smaller, by every cycles counts in this area) >> >> This hole XDP exercise is about avoiding having a transaction cost per >> packet, that reads "bulking" or "bundling" of packets, where possible. >> >> Lets do bundling/bulking from the start! Jesper, what we did here is also bulking, instead of bulkin in a temporary list in the driver we list the packets in the HW and once done we transmit all at once via the xdp_doorbell indication. I agree with you that we can take advantage and improve the icahce by bulkin first in software and then queue all at once in the hw then ring one doorbell. but I also agree with Alexei that this will introduce an extra pointer/list handling in the diver and we need to do the comparison between both approaches before we decide which is better. this must be marked as future work and not have this from the start. > > mlx4 already does bulking and this proposed mlx5 set of patches > does bulking as well. > See nothing wrong about it. RX side processes the packets and > when it's done it tells TX to xmit whatever it collected. > >> The reason behind the xmit_more API is that we could not change the >> API of all the drivers. And we found that calling an explicit NDO >> flush came at a cost (only approx 7 ns IIRC), but it still a cost that >> would hit the common single packet use-case. >> >> It should be really easy to build a bundle of packets that need XDP_TX >> action, especially given you only have a single destination "port". >> And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record. > > not sure what are you proposing here? > Sounds like you want to extend it to multi port in the future? > Sure. The proposed code is easily extendable. > > Or you want to see something like a link list of packets > or an array of packets that RX side is preparing and then > send the whole array/list to TX port? > I don't think that would be efficient, since it would mean > unnecessary copy of pointers. > >> In the future, XDP need to support XDP_FWD forwarding of packets/pages >> out other interfaces. I also want bulk transmit from day-1 here. It >> is slightly more tricky to sort packets for multiple outgoing >> interfaces efficiently in the pool loop. > > I don't think so. Multi port is natural extension to this set of patches. > With multi port the end of RX will tell multiple ports (that were > used to tx) to ring the bell. Pretty trivial and doesn't involve any > extra arrays or link lists. > >> But the mSwitch[1] article actually already solved this destination >> sorting. Please read[1] section 3.3 "Switch Fabric Algorithm" for >> understanding the next steps, for a smarter data structure, when >> starting to have more TX "ports". And perhaps align your single >> XDP_TX destination data structure to this future development. >> >> [1] http://info.iet.unipi.it/~luigi/papers/20150617-mswitch-paper.pdf > > I don't see how this particular paper applies to the existing kernel code. > It's great to take ideas from research papers, but real code is different. > >> --Jesper >> (top post) > > since when it's ok to top post? > >> On Wed, 7 Sep 2016 15:42:32 +0300 Saeed Mahameed >> wrote: >> >> > Previously we rang XDP SQ doorbell on every forwarded XDP packet. >> > >> > Here we introduce a xmit more like mechanism that will queue up more >> > than one packet into SQ (up to RX napi budget) w/o notifying the hardware. >> > >> > Once RX napi budget is consumed and we exit napi RX loop, we will >> > flush (doorbell) all XDP looped packets in case there are such. >> > >> > XDP forward packet rate: >> > >> > Comparing XDP with and w/o xmit more (bulk transmit): >> > >> > Streams XDP TX XDP TX (xmit more) >> > --- >> > 1 4.90Mpps 7.50Mpps >> > 2 9.50Mpps 14.8Mpps >> > 4 16.5Mpps