Re: [PATCH net-next 00/12] Mellanox mlx5e XDP performance optimization
On Sat, Mar 25, 2017 at 6:54 PM, Tom Herbert wrote:
> On Fri, Mar 24, 2017 at 2:52 PM, Saeed Mahameed wrote:
>> Hi Dave,
>>
>> This series provides some performance optimizations for the mlx5e
>> driver, especially for XDP TX flows.
>>
>> The 1st patch is a simple change of rmb to dma_rmb in the CQE fetch
>> routine, which shows a huge gain for both RX and TX packet rates.
>>
>> The 2nd patch removes the write-combining logic from the driver TX
>> handler and simplifies the TX logic while improving TX CPU
>> utilization.
>>
>> All other patches combined provide some refactoring to the driver TX
>> flows to allow some significant XDP TX improvements.
>>
>> More details and performance numbers per patch can be found in each
>> patch's commit message, compared to the preceding patch.
>>
>> Overall performance improvements
>> System: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
>>
>> Test case                Baseline   Now        Improvement
>> ---------------------------------------------------------
>> TX packets (24 threads)  45 Mpps    54 Mpps    20%
>> TC stack Drop (1 core)   3.45 Mpps  3.6 Mpps   5%
>> XDP Drop (1 core)        14 Mpps    16.9 Mpps  20%
>> XDP TX (1 core)          10.4 Mpps  13.7 Mpps  31%
>>
> Awesome, and good timing. I'll be presenting XDP at IETF next and
> would like to include these numbers in the presentation if you don't
> mind...

Not at all, please go ahead. But as you can see, the system I tested on
is not that powerful; we can get even better results with a modern
system. If you want, I can provide you those numbers by mid-week.
Re: [PATCH net-next 00/12] Mellanox mlx5e XDP performance optimization
On Fri, Mar 24, 2017 at 2:52 PM, Saeed Mahameed wrote:
> Hi Dave,
>
> This series provides some performance optimizations for the mlx5e
> driver, especially for XDP TX flows.
>
> The 1st patch is a simple change of rmb to dma_rmb in the CQE fetch
> routine, which shows a huge gain for both RX and TX packet rates.
>
> The 2nd patch removes the write-combining logic from the driver TX
> handler and simplifies the TX logic while improving TX CPU
> utilization.
>
> All other patches combined provide some refactoring to the driver TX
> flows to allow some significant XDP TX improvements.
>
> More details and performance numbers per patch can be found in each
> patch's commit message, compared to the preceding patch.
>
> Overall performance improvements
> System: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
>
> Test case                Baseline   Now        Improvement
> ---------------------------------------------------------
> TX packets (24 threads)  45 Mpps    54 Mpps    20%
> TC stack Drop (1 core)   3.45 Mpps  3.6 Mpps   5%
> XDP Drop (1 core)        14 Mpps    16.9 Mpps  20%
> XDP TX (1 core)          10.4 Mpps  13.7 Mpps  31%

Awesome, and good timing. I'll be presenting XDP at IETF next and would
like to include these numbers in the presentation if you don't mind...

Tom

> Thanks,
> Saeed.
> Saeed Mahameed (12):
>   net/mlx5e: Use dma_rmb rather than rmb in CQE fetch routine
>   net/mlx5e: Xmit, no write combining
>   net/mlx5e: Single bfreg (UAR) for all mlx5e SQs and netdevs
>   net/mlx5e: Move XDP completion functions to rx file
>   net/mlx5e: Move mlx5e_rq struct declaration
>   net/mlx5e: Move XDP SQ instance into RQ
>   net/mlx5e: Poll XDP TX CQ before RX CQ
>   net/mlx5e: Optimize XDP frame xmit
>   net/mlx5e: Generalize tx helper functions for different SQ types
>   net/mlx5e: Proper names for SQ/RQ/CQ functions
>   net/mlx5e: Generalize SQ create/modify/destroy functions
>   net/mlx5e: Different SQ types
>
>  drivers/net/ethernet/mellanox/mlx5/core/en.h       | 319 +-
>  .../net/ethernet/mellanox/mlx5/core/en_common.c    |   9 +
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 644 +
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    | 124 +++-
>  drivers/net/ethernet/mellanox/mlx5/core/en_tx.c    | 147 +
>  drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  |  70 +-
>  include/linux/mlx5/driver.h                        |   1 +
>  7 files changed, 716 insertions(+), 598 deletions(-)
>
> --
> 2.11.0
Re: [PATCH net-next 00/12] Mellanox mlx5e XDP performance optimization
On Sat, Mar 25, 2017 at 2:26 AM, Alexei Starovoitov wrote:
> On 3/24/17 2:52 PM, Saeed Mahameed wrote:
>>
>> Hi Dave,
>>
>> This series provides some performance optimizations for the mlx5e
>> driver, especially for XDP TX flows.
>>
>> The 1st patch is a simple change of rmb to dma_rmb in the CQE fetch
>> routine, which shows a huge gain for both RX and TX packet rates.
>>
>> The 2nd patch removes the write-combining logic from the driver TX
>> handler and simplifies the TX logic while improving TX CPU
>> utilization.
>>
>> All other patches combined provide some refactoring to the driver TX
>> flows to allow some significant XDP TX improvements.
>>
>> More details and performance numbers per patch can be found in each
>> patch's commit message, compared to the preceding patch.
>>
>> Overall performance improvements
>> System: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
>>
>> Test case                Baseline   Now        Improvement
>> ---------------------------------------------------------
>> TX packets (24 threads)  45 Mpps    54 Mpps    20%
>> TC stack Drop (1 core)   3.45 Mpps  3.6 Mpps   5%
>> XDP Drop (1 core)        14 Mpps    16.9 Mpps  20%
>> XDP TX (1 core)          10.4 Mpps  13.7 Mpps  31%
>
> Excellent work!
> All patches look great, so for the series:
> Acked-by: Alexei Starovoitov

Thanks Alexei!

> in patch 12 I noticed that inline_mode is being evaluated.
> I think for xdp queues it's guaranteed to be fixed.
> Can we optimize that path little bit more as well?

Yes, you are right, we do evaluate it in mlx5e_alloc_xdpsq:

+	if (sq->min_inline_mode != MLX5_INLINE_MODE_NONE) {
+		inline_hdr_sz = MLX5E_XDP_MIN_INLINE;
+		ds_cnt++;
+	}

and check it again in mlx5e_xmit_xdp_frame:

+	/* copy the inline part if required */
+	if (sq->min_inline_mode != MLX5_INLINE_MODE_NONE) {

sq->min_inline_mode is fixed at run-time, but it differs across HW
versions. This condition is needed so we would not copy inline headers
and waste CPU cycles when it is not required, i.e. on ConnectX-5 and
later. Actually, this is a 5% XDP_TX optimization you get when you run
over ConnectX-5 [1].
On ConnectX-4 and ConnectX-4 Lx, the driver is still required to copy
the L2 headers into the TX descriptor so the HW can make the loopback
decision correctly (needed in case you want the XDP program to switch
packets between different PFs/VFs running on the same box/NIC). So I
don't see any way to do this without breaking XDP loopback
functionality or removing the ConnectX-5 optimization. For my taste,
this condition is good as is.

[1] https://www.spinics.net/lists/netdev/msg419215.html

> Thanks!
Re: [PATCH net-next 00/12] Mellanox mlx5e XDP performance optimization
From: Saeed Mahameed
Date: Sat, 25 Mar 2017 00:52:02 +0300

> This series provides some performance optimizations for the mlx5e
> driver, especially for XDP TX flows.

This looks really great, series applied, thanks!
Re: [PATCH net-next 00/12] Mellanox mlx5e XDP performance optimization
On 3/24/17 2:52 PM, Saeed Mahameed wrote:
> Hi Dave,
>
> This series provides some performance optimizations for the mlx5e
> driver, especially for XDP TX flows.
>
> The 1st patch is a simple change of rmb to dma_rmb in the CQE fetch
> routine, which shows a huge gain for both RX and TX packet rates.
>
> The 2nd patch removes the write-combining logic from the driver TX
> handler and simplifies the TX logic while improving TX CPU
> utilization.
>
> All other patches combined provide some refactoring to the driver TX
> flows to allow some significant XDP TX improvements.
>
> More details and performance numbers per patch can be found in each
> patch's commit message, compared to the preceding patch.
>
> Overall performance improvements
> System: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
>
> Test case                Baseline   Now        Improvement
> ---------------------------------------------------------
> TX packets (24 threads)  45 Mpps    54 Mpps    20%
> TC stack Drop (1 core)   3.45 Mpps  3.6 Mpps   5%
> XDP Drop (1 core)        14 Mpps    16.9 Mpps  20%
> XDP TX (1 core)          10.4 Mpps  13.7 Mpps  31%

Excellent work!
All patches look great, so for the series:
Acked-by: Alexei Starovoitov

In patch 12 I noticed that inline_mode is being evaluated. I think for
xdp queues it's guaranteed to be fixed. Can we optimize that path a
little bit more as well?

Thanks!
[PATCH net-next 00/12] Mellanox mlx5e XDP performance optimization
Hi Dave,

This series provides some performance optimizations for the mlx5e
driver, especially for XDP TX flows.

The 1st patch is a simple change of rmb to dma_rmb in the CQE fetch
routine, which shows a huge gain for both RX and TX packet rates.

The 2nd patch removes the write-combining logic from the driver TX
handler and simplifies the TX logic while improving TX CPU utilization.

All other patches combined provide some refactoring to the driver TX
flows to allow some significant XDP TX improvements.

More details and performance numbers per patch can be found in each
patch's commit message, compared to the preceding patch.

Overall performance improvements
System: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz

Test case                Baseline   Now        Improvement
---------------------------------------------------------
TX packets (24 threads)  45 Mpps    54 Mpps    20%
TC stack Drop (1 core)   3.45 Mpps  3.6 Mpps   5%
XDP Drop (1 core)        14 Mpps    16.9 Mpps  20%
XDP TX (1 core)          10.4 Mpps  13.7 Mpps  31%

Thanks,
Saeed.

Saeed Mahameed (12):
  net/mlx5e: Use dma_rmb rather than rmb in CQE fetch routine
  net/mlx5e: Xmit, no write combining
  net/mlx5e: Single bfreg (UAR) for all mlx5e SQs and netdevs
  net/mlx5e: Move XDP completion functions to rx file
  net/mlx5e: Move mlx5e_rq struct declaration
  net/mlx5e: Move XDP SQ instance into RQ
  net/mlx5e: Poll XDP TX CQ before RX CQ
  net/mlx5e: Optimize XDP frame xmit
  net/mlx5e: Generalize tx helper functions for different SQ types
  net/mlx5e: Proper names for SQ/RQ/CQ functions
  net/mlx5e: Generalize SQ create/modify/destroy functions
  net/mlx5e: Different SQ types

 drivers/net/ethernet/mellanox/mlx5/core/en.h       | 319 +-
 .../net/ethernet/mellanox/mlx5/core/en_common.c    |   9 +
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 644 +
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    | 124 +++-
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c    | 147 +
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  |  70 +-
 include/linux/mlx5/driver.h                        |   1 +
 7 files changed, 716 insertions(+), 598 deletions(-)

--
2.11.0