Re: [PATCH net-next 00/12] Mellanox mlx5e XDP performance optimization

2017-03-26 Thread Saeed Mahameed
On Sat, Mar 25, 2017 at 6:54 PM, Tom Herbert  wrote:
> On Fri, Mar 24, 2017 at 2:52 PM, Saeed Mahameed  wrote:
>> Hi Dave,
>>
>> This series provides some performance optimizations for the mlx5e
>> driver, especially for XDP TX flows.
>>
>> 1st patch is a simple change of rmb to dma_rmb in CQE fetch routine
>> which shows a huge gain for both RX and TX packet rates.
>>
>> 2nd patch removes write combining logic from the driver TX handler
>> and simplifies the TX logic while improving TX CPU utilization.
>>
>> All other patches combined provide some refactoring to the driver TX
>> flows to allow some significant XDP TX improvements.
>>
>> More details, along with per-patch performance numbers relative to the
>> preceding patch, can be found in each patch's commit message.
>>
>> Overall performance improvements
>>   System: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
>>
>> Test case   Baseline  Now  improvement
>> ---
>> TX packets (24 threads) 45Mpps54Mpps  20%
>> TC stack Drop (1 core)  3.45Mpps  3.6Mpps 5%
>> XDP Drop  (1 core)  14Mpps16.9Mpps20%
>> XDP TX(1 core)  10.4Mpps  13.7Mpps31%
>>
> Awesome, and good timing. I'll be presenting XDP at the next IETF and
> would like to include these numbers in the presentation if you don't
> mind...
>

Not at all, please go ahead.

But as you see, the system I tested on is not that powerful; we can
get even better results on a more modern system.
If you want, I can provide those numbers by mid-week.


Re: [PATCH net-next 00/12] Mellanox mlx5e XDP performance optimization

2017-03-25 Thread Tom Herbert
On Fri, Mar 24, 2017 at 2:52 PM, Saeed Mahameed  wrote:
> Hi Dave,
>
> This series provides some performance optimizations for the mlx5e
> driver, especially for XDP TX flows.
>
> 1st patch is a simple change of rmb to dma_rmb in CQE fetch routine
> which shows a huge gain for both RX and TX packet rates.
>
> 2nd patch removes write combining logic from the driver TX handler
> and simplifies the TX logic while improving TX CPU utilization.
>
> All other patches combined provide some refactoring to the driver TX
> flows to allow some significant XDP TX improvements.
>
> More details, along with per-patch performance numbers relative to the
> preceding patch, can be found in each patch's commit message.
>
> Overall performance improvements
>   System: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
>
> Test case   Baseline  Now  improvement
> ---
> TX packets (24 threads) 45Mpps54Mpps  20%
> TC stack Drop (1 core)  3.45Mpps  3.6Mpps 5%
> XDP Drop  (1 core)  14Mpps16.9Mpps20%
> XDP TX(1 core)  10.4Mpps  13.7Mpps31%
>
Awesome, and good timing. I'll be presenting XDP at the next IETF and
would like to include these numbers in the presentation if you don't
mind...

Tom

> Thanks,
> Saeed.
>
> Saeed Mahameed (12):
>   net/mlx5e: Use dma_rmb rather than rmb in CQE fetch routine
>   net/mlx5e: Xmit, no write combining
>   net/mlx5e: Single bfreg (UAR) for all mlx5e SQs and netdevs
>   net/mlx5e: Move XDP completion functions to rx file
>   net/mlx5e: Move mlx5e_rq struct declaration
>   net/mlx5e: Move XDP SQ instance into RQ
>   net/mlx5e: Poll XDP TX CQ before RX CQ
>   net/mlx5e: Optimize XDP frame xmit
>   net/mlx5e: Generalize tx helper functions for different SQ types
>   net/mlx5e: Proper names for SQ/RQ/CQ functions
>   net/mlx5e: Generalize SQ create/modify/destroy functions
>   net/mlx5e: Different SQ types
>
>  drivers/net/ethernet/mellanox/mlx5/core/en.h   | 319 +-
>  .../net/ethernet/mellanox/mlx5/core/en_common.c|   9 +
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 644 +
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c| 124 +++-
>  drivers/net/ethernet/mellanox/mlx5/core/en_tx.c| 147 +
>  drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  |  70 +--
>  include/linux/mlx5/driver.h|   1 +
>  7 files changed, 716 insertions(+), 598 deletions(-)
>
> --
> 2.11.0
>


Re: [PATCH net-next 00/12] Mellanox mlx5e XDP performance optimization

2017-03-25 Thread Saeed Mahameed
On Sat, Mar 25, 2017 at 2:26 AM, Alexei Starovoitov  wrote:
> On 3/24/17 2:52 PM, Saeed Mahameed wrote:
>>
>> Hi Dave,
>>
>> This series provides some performance optimizations for the mlx5e
>> driver, especially for XDP TX flows.
>>
>> 1st patch is a simple change of rmb to dma_rmb in CQE fetch routine
>> which shows a huge gain for both RX and TX packet rates.
>>
>> 2nd patch removes write combining logic from the driver TX handler
>> and simplifies the TX logic while improving TX CPU utilization.
>>
>> All other patches combined provide some refactoring to the driver TX
>> flows to allow some significant XDP TX improvements.
>>
>> More details, along with per-patch performance numbers relative to the
>> preceding patch, can be found in each patch's commit message.
>>
>> Overall performance improvements
>>   System: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
>>
>> Test case   Baseline  Now  improvement
>> ---
>> TX packets (24 threads) 45Mpps54Mpps  20%
>> TC stack Drop (1 core)  3.45Mpps  3.6Mpps 5%
>> XDP Drop  (1 core)  14Mpps16.9Mpps20%
>> XDP TX(1 core)  10.4Mpps  13.7Mpps31%
>
>
> Excellent work!
> All patches look great, so for the series:
> Acked-by: Alexei Starovoitov 
>

Thanks, Alexei!

> in patch 12 I noticed that inline_mode is being evaluated.
> I think for xdp queues it's guaranteed to be fixed.
> Can we optimize that path a little bit more as well?

Yes, you are right, we do evaluate it in mlx5e_alloc_xdpsq:
+	if (sq->min_inline_mode != MLX5_INLINE_MODE_NONE) {
+		inline_hdr_sz = MLX5E_XDP_MIN_INLINE;
+		ds_cnt++;
+	}

and check it again in mlx5e_xmit_xdp_frame:

+	/* copy the inline part if required */
+	if (sq->min_inline_mode != MLX5_INLINE_MODE_NONE) {

sq->min_inline_mode is fixed at run time, but it differs across
HW versions.
This condition is needed so we do not copy inline headers and waste
CPU cycles when it is not required, i.e. on ConnectX-5 and later.
In fact, this is a 5% XDP_TX optimization you get when you run on
ConnectX-5 [1].

On ConnectX-4 and ConnectX-4 Lx the driver is still required to copy
the L2 headers into the TX descriptor so the HW can make the loopback
decision correctly (needed in case you want the XDP program to switch
packets between different PFs/VFs running on the same box/NIC).

So I don't see any way to do this without breaking XDP loopback
functionality or removing the ConnectX-5 optimization.

For my taste, this condition is good as is.
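
To make it concrete, that branch in mlx5e_xmit_xdp_frame boils down to
roughly the following (a simplified sketch, not the exact driver code;
the local variable names are illustrative):

	/* sq->min_inline_mode is fixed per device, so this branch is
	 * perfectly predicted: taken on ConnectX-4/4 Lx, where the HW
	 * needs the packet start inside the WQE to make the loopback
	 * decision, and skipped on ConnectX-5 and later.
	 */
	if (sq->min_inline_mode != MLX5_INLINE_MODE_NONE) {
		/* inline_start, data, data_len: illustrative names */
		memcpy(inline_start, data, MLX5E_XDP_MIN_INLINE);
		inline_hdr_sz = MLX5E_XDP_MIN_INLINE;
		/* don't send the inlined bytes again from the data segment */
		data     += MLX5E_XDP_MIN_INLINE;
		data_len -= MLX5E_XDP_MIN_INLINE;
	}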

[1] https://www.spinics.net/lists/netdev/msg419215.html

> Thanks!


Re: [PATCH net-next 00/12] Mellanox mlx5e XDP performance optimization

2017-03-24 Thread David Miller
From: Saeed Mahameed 
Date: Sat, 25 Mar 2017 00:52:02 +0300

> This series provides some performance optimizations for the mlx5e
> driver, especially for XDP TX flows.

This looks really great, series applied, thanks!


Re: [PATCH net-next 00/12] Mellanox mlx5e XDP performance optimization

2017-03-24 Thread Alexei Starovoitov

On 3/24/17 2:52 PM, Saeed Mahameed wrote:

Hi Dave,

This series provides some performance optimizations for the mlx5e
driver, especially for XDP TX flows.

1st patch is a simple change of rmb to dma_rmb in CQE fetch routine
which shows a huge gain for both RX and TX packet rates.

2nd patch removes write combining logic from the driver TX handler
and simplifies the TX logic while improving TX CPU utilization.

All other patches combined provide some refactoring to the driver TX
flows to allow some significant XDP TX improvements.

More details, along with per-patch performance numbers relative to the
preceding patch, can be found in each patch's commit message.

Overall performance improvements
  System: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz

Test case   Baseline  Now  improvement
---
TX packets (24 threads) 45Mpps54Mpps  20%
TC stack Drop (1 core)  3.45Mpps  3.6Mpps 5%
XDP Drop  (1 core)  14Mpps16.9Mpps20%
XDP TX(1 core)  10.4Mpps  13.7Mpps31%


Excellent work!
All patches look great, so for the series:
Acked-by: Alexei Starovoitov 

in patch 12 I noticed that inline_mode is being evaluated.
I think for xdp queues it's guaranteed to be fixed.
Can we optimize that path a little bit more as well?
Thanks!


[PATCH net-next 00/12] Mellanox mlx5e XDP performance optimization

2017-03-24 Thread Saeed Mahameed
Hi Dave,

This series provides some performance optimizations for the mlx5e
driver, especially for XDP TX flows.

1st patch is a simple change of rmb to dma_rmb in CQE fetch routine
which shows a huge gain for both RX and TX packet rates.
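
For illustration, the pattern is roughly the following (helper names are
illustrative, not the actual mlx5e functions). dma_rmb() is sufficient
here because the CQE ownership bit and the rest of the CQE both live in
coherent DMA memory, while the heavier rmb() additionally orders MMIO
reads, which this path does not need:

	cqe = get_next_cqe(cq);            /* illustrative helper */
	if (!cqe_owned_by_sw(cqe, cq))     /* check ownership/validity first */
		return NULL;
	/* order the ownership check before reading the rest of the CQE */
	dma_rmb();
	return cqe;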

2nd patch removes write combining logic from the driver TX handler
and simplifies the TX logic while improving TX CPU utilization.

All other patches combined provide some refactoring to the driver TX
flows to allow some significant XDP TX improvements.

More details, along with per-patch performance numbers relative to the
preceding patch, can be found in each patch's commit message.

Overall performance improvements
  System: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz

Test case   Baseline  Now  improvement
---
TX packets (24 threads) 45Mpps54Mpps  20%
TC stack Drop (1 core)  3.45Mpps  3.6Mpps 5%
XDP Drop  (1 core)  14Mpps16.9Mpps20%
XDP TX(1 core)  10.4Mpps  13.7Mpps31%

Thanks,
Saeed.

Saeed Mahameed (12):
  net/mlx5e: Use dma_rmb rather than rmb in CQE fetch routine
  net/mlx5e: Xmit, no write combining
  net/mlx5e: Single bfreg (UAR) for all mlx5e SQs and netdevs
  net/mlx5e: Move XDP completion functions to rx file
  net/mlx5e: Move mlx5e_rq struct declaration
  net/mlx5e: Move XDP SQ instance into RQ
  net/mlx5e: Poll XDP TX CQ before RX CQ
  net/mlx5e: Optimize XDP frame xmit
  net/mlx5e: Generalize tx helper functions for different SQ types
  net/mlx5e: Proper names for SQ/RQ/CQ functions
  net/mlx5e: Generalize SQ create/modify/destroy functions
  net/mlx5e: Different SQ types

 drivers/net/ethernet/mellanox/mlx5/core/en.h   | 319 +-
 .../net/ethernet/mellanox/mlx5/core/en_common.c|   9 +
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 644 +
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c| 124 +++-
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c| 147 +
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  |  70 +--
 include/linux/mlx5/driver.h|   1 +
 7 files changed, 716 insertions(+), 598 deletions(-)

-- 
2.11.0