Re: [iovisor-dev] [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support

2016-09-08 Thread Or Gerlitz
On Thu, Sep 8, 2016 at 12:53 AM, Saeed Mahameed
 wrote:
> On Wed, Sep 7, 2016 at 11:55 PM, Or Gerlitz via iovisor-dev
>  wrote:
>> On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed  wrote:
>>> From: Rana Shahout 
>>>
>>> Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx5e driver.
>>>
>>> When XDP is on, we make sure to change the channels' RQ type to
>>> MLX5_WQ_TYPE_LINKED_LIST rather than the "striding RQ" type, to
>>> ensure "page per packet".
>>>
>>> On XDP set, we fail if HW LRO is on and ask the user to turn it
>>> off.  Since on ConnectX4-LX HW LRO is on by default, this will be
>>> annoying, but we prefer not to force LRO off from the XDP set function.
>>>
>>> Full channels reset (close/open) is required only when setting XDP
>>> on/off.
>>>
>>> When XDP set is called just to exchange programs, we update each
>>> RQ's xdp program on the fly. To synchronize with the current RX
>>> data-path activity of that RQ, we temporarily disable the RQ, make
>>> sure the RX path is not running, quickly update it, and re-enable it:
>>> - rq.state = disabled
>>> - napi_synchronize
>>> - xchg(rq->xdp_prg)
>>> - rq.state = enabled
>>> - napi_schedule // Just in case we've missed an IRQ
>>>
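For illustration, a minimal kernel-style C sketch of the exchange sequence above. This is not the actual mlx5e code from the patch; the struct, field, and state-bit names (my_rq, xdp_prog, MY_RQ_STATE_ENABLED, my_rq_swap_xdp_prog) are hypothetical stand-ins, while napi_synchronize(), napi_schedule(), xchg(), and the bit operations are real kernel APIs.

#include <linux/netdevice.h>
#include <linux/bpf.h>
#include <linux/bitops.h>

/* Hypothetical stand-in for the driver's RX queue structure. */
struct my_rq {
	unsigned long state;           /* holds the MY_RQ_STATE_ENABLED bit */
	struct bpf_prog *xdp_prog;     /* program run in the RX path */
	struct napi_struct *napi;      /* NAPI context polling this RQ */
};

#define MY_RQ_STATE_ENABLED 0

static void my_rq_swap_xdp_prog(struct my_rq *rq, struct bpf_prog *prog)
{
	struct bpf_prog *old;

	/* rq.state = disabled: the RX poll path checks this bit and stops */
	clear_bit(MY_RQ_STATE_ENABLED, &rq->state);

	/* napi_synchronize: wait for any in-flight poll on this RQ to finish */
	napi_synchronize(rq->napi);

	/* xchg(rq->xdp_prg): swap programs atomically, drop the old reference */
	old = xchg(&rq->xdp_prog, prog);
	if (old)
		bpf_prog_put(old);

	/* rq.state = enabled */
	set_bit(MY_RQ_STATE_ENABLED, &rq->state);

	/* napi_schedule: just in case we've missed an IRQ while disabled */
	napi_schedule(rq->napi);
}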
>>> Packet rate performance testing was done with pktgen sending 64B
>>> packets on the TX side, and TC drop action on the RX side compared
>>> to XDP fast drop.
>>>
>>> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
>>>
>>> Comparison is done between:
>>> 1. Baseline, before this patch, with TC drop action
>>> 2. This patch with TC drop action
>>> 3. This patch with XDP RX fast drop
>>>
>>> Streams    Baseline (TC drop)    TC drop     XDP fast drop
>>> ---------------------------------------------------------
>>> 1          5.51 Mpps             5.14 Mpps   13.5 Mpps
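For reference, the "XDP fast drop" column corresponds to attaching a program that simply returns XDP_DROP for every packet. A minimal sketch, written against today's upstream XDP C API (BPF_PROG_TYPE_XDP / struct xdp_md) rather than the RFC-era BPF_PROG_TYPE_PHYS_DEV naming; build and attach commands below are indicative only.

#include <linux/bpf.h>

/* Minimal XDP program: drop every packet as early as possible in the driver. */
__attribute__((section("xdp"), used))
int xdp_drop_all(struct xdp_md *ctx)
{
	return XDP_DROP;
}

__attribute__((section("license"), used))
char _license[] = "GPL";

Built with something like "clang -O2 -target bpf -c xdp_drop.c -o xdp_drop.o" and attached with "ip link set dev <ifname> xdp obj xdp_drop.o sec xdp", assuming a kernel and iproute2 with XDP support.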
>>
>> This (13.5 Mpps) is less than 50% of the result we presented @ the
>> XDP summit, which was obtained by Rana. Please see if/how much this
>> grows if you use more sender threads, but have all of them xmit the
>> same stream/flows, so we're on one ring. That (XDP with a single RX
>> ring getting packets from N remote TX rings) would be your canonical
>> baseline for any further numbers.
>>
>
> I used N TX senders sending 48Mpps to a single RX core.
> The single RX core could handle only 13.5Mpps.
>
> The implementation here is different from the one we presented at the
> summit: before, it was with striding RQ; now it is a regular linked-list
> RQ. (A striding RQ ring can handle 32K 64B packets, while a regular RQ
> ring handles only 1K.)

> In striding RQ we register only 16 HW descriptors for every 32K
> packets, i.e. for every 32K packets we access the HW only 16 times.
> On the other hand, a regular RQ accesses the HW (registers
> descriptors) once per packet, i.e. we write to the HW 1K times for 1K
> packets. I think this explains the difference.

> The catch here is that we can't use striding RQ for XDP, bummer!

Yep, sounds like a bum bum bum (we went from >30 Mpps to 13.5 Mpps).

We used striding RQ for XDP in the previous implementation, and I don't
see a really deep reason not to do so here as well, now that striding RQ
no longer uses compound pages.  I guess there are more details I need to
catch up with here, but the bottom-line result is not good and we need
to re-think.

> As I said, we will have the full and final performance results in V1.
> This is just an RFC with only quick-and-dirty testing.

Yep, understood. But in parallel, you need to reconsider how to get
along without that drop in the numbers.

Or.


Re: [iovisor-dev] [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support

2016-09-07 Thread Saeed Mahameed
On Wed, Sep 7, 2016 at 11:55 PM, Or Gerlitz via iovisor-dev
 wrote:
> On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed  wrote:
>> From: Rana Shahout 
>>
>> Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx5e driver.
>>
>> When XDP is on, we make sure to change the channels' RQ type to
>> MLX5_WQ_TYPE_LINKED_LIST rather than the "striding RQ" type, to
>> ensure "page per packet".
>>
>> On XDP set, we fail if HW LRO is on and ask the user to turn it
>> off.  Since on ConnectX4-LX HW LRO is on by default, this will be
>> annoying, but we prefer not to force LRO off from the XDP set function.
>>
>> Full channels reset (close/open) is required only when setting XDP
>> on/off.
>>
>> When XDP set is called just to exchange programs, we update each
>> RQ's xdp program on the fly. To synchronize with the current RX
>> data-path activity of that RQ, we temporarily disable the RQ, make
>> sure the RX path is not running, quickly update it, and re-enable it:
>> - rq.state = disabled
>> - napi_synchronize
>> - xchg(rq->xdp_prg)
>> - rq.state = enabled
>> - napi_schedule // Just in case we've missed an IRQ
>>
>> Packet rate performance testing was done with pktgen sending 64B
>> packets on the TX side, and TC drop action on the RX side compared
>> to XDP fast drop.
>>
>> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
>>
>> Comparison is done between:
>> 1. Baseline, before this patch, with TC drop action
>> 2. This patch with TC drop action
>> 3. This patch with XDP RX fast drop
>>
>> Streams    Baseline (TC drop)    TC drop     XDP fast drop
>> ---------------------------------------------------------
>> 1          5.51 Mpps             5.14 Mpps   13.5 Mpps
>
> This (13.5 Mpps) is less than 50% of the result we presented @ the
> XDP summit, which was obtained by Rana. Please see if/how much this
> grows if you use more sender threads, but have all of them xmit the
> same stream/flows, so we're on one ring. That (XDP with a single RX
> ring getting packets from N remote TX rings) would be your canonical
> baseline for any further numbers.
>

I used N TX senders sending 48Mpps to a single RX core.
The single RX core could handle only 13.5Mpps.

The implementation here is different from the one we presented at the
summit: before, it was with striding RQ; now it is a regular linked-list
RQ. (A striding RQ ring can handle 32K 64B packets, while a regular RQ
ring handles only 1K.)

In striding RQ we register only 16 HW descriptors for every 32K
packets, i.e. for every 32K packets we access the HW only 16 times.
On the other hand, a regular RQ accesses the HW (registers
descriptors) once per packet, i.e. we write to the HW 1K times for 1K
packets. I think this explains the difference.
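To put numbers on that, a back-of-the-envelope sketch (standalone userspace C, not driver code) using the ring sizes quoted above:

#include <stdio.h>

int main(void)
{
	const unsigned int striding_pkts = 32 * 1024; /* 64B packets per striding RQ ring */
	const unsigned int striding_wqes = 16;        /* HW descriptors posted for them */
	const unsigned int linked_pkts   = 1024;      /* packets per linked-list RQ ring */

	/* ~2048 packets per HW descriptor write with striding RQ ... */
	printf("striding RQ:    %u packets per HW descriptor write\n",
	       striding_pkts / striding_wqes);
	/* ... versus one HW descriptor write per packet with the linked-list RQ. */
	printf("linked-list RQ: 1 packet per HW descriptor write (%u writes per ring)\n",
	       linked_pkts);
	return 0;
}

That is roughly 2048 packets per HW descriptor write with striding RQ versus one write per packet with the linked-list RQ.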

The catch here is that we can't use striding RQ for XDP, bummer!

As I said, we will have the full and final performance results in V1.
This is just an RFC with only quick-and-dirty testing.


> ___
> iovisor-dev mailing list
> iovisor-...@lists.iovisor.org
> https://lists.iovisor.org/mailman/listinfo/iovisor-dev