Re: [PATCH net-next 0/6] page_pool: recycle buffers

2021-03-24 Thread Alexander Lobakin
From: Ilias Apalodimas 
Date: Wed, 24 Mar 2021 09:50:38 +0200

> Hi Alexander,

Hi!

> On Tue, Mar 23, 2021 at 08:03:46PM +, Alexander Lobakin wrote:
> > From: Ilias Apalodimas 
> > Date: Tue, 23 Mar 2021 19:01:52 +0200
> >
> > > On Tue, Mar 23, 2021 at 04:55:31PM +, Alexander Lobakin wrote:
> > > > > > > > >
> > >
> > > [...]
> > >
> > > > > > > >
> > > > > > > > Thanks for the testing!
> > > > > > > > Any chance you can get a perf measurement on this?
> > > > > > >
> > > > > > > I guess you mean perf-report (--stdio) output, right?
> > > > > > >
> > > > > >
> > > > > > Yea,
> > > > > > As hinted below, I am just trying to figure out if on Alexander's
> > > > > > platform the cost of syncing is bigger than free-allocate.
> > > > > > I remember one armv7 where that was the case.
> > > > > >
> > > > > > > > Is DMA syncing taking a substantial amount of your cpu usage?
> > > > > > >
> > > > > > > (+1 this is an important question)
> > > >
> > > > Sure, I'll drop perf tools to my test env and share the results,
> > > > maybe tomorrow or in a few days.
> >
> > Oh we-e-e-ell...
> > Looks like I've been fooled by I-cache misses or smth like that.
> > That happens sometimes, not only on my machines, and not only on
> > MIPS if I'm not mistaken.
> > Sorry for confusing you guys.
> >
> > I got drastically different numbers after I enabled CONFIG_KALLSYMS +
> > CONFIG_PERF_EVENTS for perf tools.
> > The only difference in code is that I rebased onto Mel's
> > mm-bulk-rebase-v6r4.
> >
> > (lunar is my WIP NIC driver)
> >
> > 1. 5.12-rc3 baseline:
> >
> > TCP: 566 Mbps
> > UDP: 615 Mbps
> >
> > perf top:
> >  4.44%  [lunar]  [k] lunar_rx_poll_page_pool
> >  3.56%  [kernel] [k] r4k_wait_irqoff
> >  2.89%  [kernel] [k] free_unref_page
> >  2.57%  [kernel] [k] dma_map_page_attrs
> >  2.32%  [kernel] [k] get_page_from_freelist
> >  2.28%  [lunar]  [k] lunar_start_xmit
> >  1.82%  [kernel] [k] __copy_user
> >  1.75%  [kernel] [k] dev_gro_receive
> >  1.52%  [kernel] [k] cpuidle_enter_state_coupled
> >  1.46%  [kernel] [k] tcp_gro_receive
> >  1.35%  [kernel] [k] __rmemcpy
> >  1.33%  [nf_conntrack]   [k] nf_conntrack_tcp_packet
> >  1.30%  [kernel] [k] __dev_queue_xmit
> >  1.22%  [kernel] [k] pfifo_fast_dequeue
> >  1.17%  [kernel] [k] skb_release_data
> >  1.17%  [kernel] [k] skb_segment
> >
> > free_unref_page() and get_page_from_freelist() consume a lot.
> >
> > 2. 5.12-rc3 + Page Pool recycling by Matteo:
> > TCP: 589 Mbps
> > UDP: 633 Mbps
> >
> > perf top:
> >  4.27%  [lunar]  [k] lunar_rx_poll_page_pool
> >  2.68%  [lunar]  [k] lunar_start_xmit
> >  2.41%  [kernel] [k] dma_map_page_attrs
> >  1.92%  [kernel] [k] r4k_wait_irqoff
> >  1.89%  [kernel] [k] __copy_user
> >  1.62%  [kernel] [k] dev_gro_receive
> >  1.51%  [kernel] [k] cpuidle_enter_state_coupled
> >  1.44%  [kernel] [k] tcp_gro_receive
> >  1.40%  [kernel] [k] __rmemcpy
> >  1.38%  [nf_conntrack]   [k] nf_conntrack_tcp_packet
> >  1.37%  [kernel] [k] free_unref_page
> >  1.35%  [kernel] [k] __dev_queue_xmit
> >  1.30%  [kernel] [k] skb_segment
> >  1.28%  [kernel] [k] get_page_from_freelist
> >  1.27%  [kernel] [k] r4k_dma_cache_inv
> >
> > +20 Mbps increase on both TCP and UDP. free_unref_page() and
> > get_page_from_freelist() dropped down the list significantly.
> >
> > 3. 5.12-rc3 + Page Pool recycling + PP bulk allocator (Mel & Jesper):
> > TCP: 596 Mbps
> > UDP: 641 Mbps
> >
> > perf top:
> >  4.38%  [lunar]  [k] lunar_rx_poll_page_pool
> >  3.34%  [kernel] [k] r4k_wait_irqoff
> >  3.14%  [kernel] [k] dma_map_page_attrs
> >  2.49%  [lunar]  [k] lunar_start_xmit
> >  1.85%  [kernel] [k] dev_gro_receive
> >  1.76%  [kernel] [k] free_unref_page
> >  1.76%  [kernel] [k] __copy_user
> >  1.65%  [kernel] [k] inet_gro_receive
> >  1.57%  [kernel] [k] tcp_gro_receive
> >  1.48%  [kernel] [k] cpuidle_enter_state_coupled
> >  1.43%  [nf_conntrack]   [k] nf_conntrack_tcp_packet
> >  1.42%  [kernel] [k] __rmemcpy
> >  1.25%  [kernel] [k] skb_segment
> >  1.21%  [kernel] [k] r4k_dma_cache_inv
> >
> > +10 Mbps on top of recycling.
> > get_page_from_freelist() is gone.
> > NAPI polling, CPU idle cycle (r4k_wait_irqoff) and DMA mapping
> > routine became the top consumers.
>
> Again, thanks for the extensive testing.
> I assume you don't use page pool to map the buffers, right?
> Because if the mapping is preserved, the only thing you have to do is sync it
> after the packet reception.

Re: [PATCH net-next 0/6] page_pool: recycle buffers

2021-03-24 Thread Ilias Apalodimas
Hi Alexander,

On Tue, Mar 23, 2021 at 08:03:46PM +, Alexander Lobakin wrote:
> From: Ilias Apalodimas 
> Date: Tue, 23 Mar 2021 19:01:52 +0200
> 
> > On Tue, Mar 23, 2021 at 04:55:31PM +, Alexander Lobakin wrote:
> > > > > > > >
> >
> > [...]
> >
> > > > > > >
> > > > > > > Thanks for the testing!
> > > > > > > Any chance you can get a perf measurement on this?
> > > > > >
> > > > > > I guess you mean perf-report (--stdio) output, right?
> > > > > >
> > > > >
> > > > > Yea,
> > > > > As hinted below, I am just trying to figure out if on Alexander's
> > > > > platform the cost of syncing is bigger than free-allocate.
> > > > > I remember one armv7 where that was the case.
> > > > >
> > > > > > > Is DMA syncing taking a substantial amount of your cpu usage?
> > > > > >
> > > > > > (+1 this is an important question)
> > >
> > > Sure, I'll drop perf tools to my test env and share the results,
> > > maybe tomorrow or in a few days.
> 
> Oh we-e-e-ell...
> Looks like I've been fooled by I-cache misses or smth like that.
> That happens sometimes, not only on my machines, and not only on
> MIPS if I'm not mistaken.
> Sorry for confusing you guys.
> 
> I got drastically different numbers after I enabled CONFIG_KALLSYMS +
> CONFIG_PERF_EVENTS for perf tools.
> The only difference in code is that I rebased onto Mel's
> mm-bulk-rebase-v6r4.
> 
> (lunar is my WIP NIC driver)
> 
> 1. 5.12-rc3 baseline:
> 
> TCP: 566 Mbps
> UDP: 615 Mbps
> 
> perf top:
>  4.44%  [lunar]  [k] lunar_rx_poll_page_pool
>  3.56%  [kernel] [k] r4k_wait_irqoff
>  2.89%  [kernel] [k] free_unref_page
>  2.57%  [kernel] [k] dma_map_page_attrs
>  2.32%  [kernel] [k] get_page_from_freelist
>  2.28%  [lunar]  [k] lunar_start_xmit
>  1.82%  [kernel] [k] __copy_user
>  1.75%  [kernel] [k] dev_gro_receive
>  1.52%  [kernel] [k] cpuidle_enter_state_coupled
>  1.46%  [kernel] [k] tcp_gro_receive
>  1.35%  [kernel] [k] __rmemcpy
>  1.33%  [nf_conntrack]   [k] nf_conntrack_tcp_packet
>  1.30%  [kernel] [k] __dev_queue_xmit
>  1.22%  [kernel] [k] pfifo_fast_dequeue
>  1.17%  [kernel] [k] skb_release_data
>  1.17%  [kernel] [k] skb_segment
> 
> free_unref_page() and get_page_from_freelist() consume a lot.
> 
> 2. 5.12-rc3 + Page Pool recycling by Matteo:
> TCP: 589 Mbps
> UDP: 633 Mbps
> 
> perf top:
>  4.27%  [lunar]  [k] lunar_rx_poll_page_pool
>  2.68%  [lunar]  [k] lunar_start_xmit
>  2.41%  [kernel] [k] dma_map_page_attrs
>  1.92%  [kernel] [k] r4k_wait_irqoff
>  1.89%  [kernel] [k] __copy_user
>  1.62%  [kernel] [k] dev_gro_receive
>  1.51%  [kernel] [k] cpuidle_enter_state_coupled
>  1.44%  [kernel] [k] tcp_gro_receive
>  1.40%  [kernel] [k] __rmemcpy
>  1.38%  [nf_conntrack]   [k] nf_conntrack_tcp_packet
>  1.37%  [kernel] [k] free_unref_page
>  1.35%  [kernel] [k] __dev_queue_xmit
>  1.30%  [kernel] [k] skb_segment
>  1.28%  [kernel] [k] get_page_from_freelist
>  1.27%  [kernel] [k] r4k_dma_cache_inv
> 
> +20 Mbps increase on both TCP and UDP. free_unref_page() and
> get_page_from_freelist() dropped down the list significantly.
> 
> 3. 5.12-rc3 + Page Pool recycling + PP bulk allocator (Mel & Jesper):
> TCP: 596 Mbps
> UDP: 641 Mbps
> 
> perf top:
>  4.38%  [lunar]  [k] lunar_rx_poll_page_pool
>  3.34%  [kernel] [k] r4k_wait_irqoff
>  3.14%  [kernel] [k] dma_map_page_attrs
>  2.49%  [lunar]  [k] lunar_start_xmit
>  1.85%  [kernel] [k] dev_gro_receive
>  1.76%  [kernel] [k] free_unref_page
>  1.76%  [kernel] [k] __copy_user
>  1.65%  [kernel] [k] inet_gro_receive
>  1.57%  [kernel] [k] tcp_gro_receive
>  1.48%  [kernel] [k] cpuidle_enter_state_coupled
>  1.43%  [nf_conntrack]   [k] nf_conntrack_tcp_packet
>  1.42%  [kernel] [k] __rmemcpy
>  1.25%  [kernel] [k] skb_segment
>  1.21%  [kernel] [k] r4k_dma_cache_inv
> 
> +10 Mbps on top of recycling.
> get_page_from_freelist() is gone.
> NAPI polling, CPU idle cycle (r4k_wait_irqoff) and DMA mapping
> routine became the top consumers.

Again, thanks for the extensive testing. 
I assume you don't use page pool to map the buffers, right?
Because if the mapping is preserved, the only thing you have to do is sync it
after the packet reception.

> 
> 4-5. __always_inline for rmqueue_bulk() and __rmqueue_pcplist(),
> removing 'noinline' from net/core/page_pool.c etc.
> 
> ...makes absolutely no sense anymore.
> 

Re: [PATCH net-next 0/6] page_pool: recycle buffers

2021-03-23 Thread Alexander Lobakin
From: Ilias Apalodimas 
Date: Tue, 23 Mar 2021 19:01:52 +0200

> On Tue, Mar 23, 2021 at 04:55:31PM +, Alexander Lobakin wrote:
> > > > > > >
>
> [...]
>
> > > > > >
> > > > > > Thanks for the testing!
> > > > > > Any chance you can get a perf measurement on this?
> > > > >
> > > > > I guess you mean perf-report (--stdio) output, right?
> > > > >
> > > >
> > > > Yea,
> > > > As hinted below, I am just trying to figure out if on Alexander's
> > > > platform the cost of syncing is bigger than free-allocate.
> > > > I remember one armv7 where that was the case.
> > > >
> > > > > > Is DMA syncing taking a substantial amount of your cpu usage?
> > > > >
> > > > > (+1 this is an important question)
> >
> > Sure, I'll drop perf tools to my test env and share the results,
> > maybe tomorrow or in a few days.

Oh we-e-e-ell...
Looks like I've been fooled by I-cache misses or smth like that.
That happens sometimes, not only on my machines, and not only on
MIPS if I'm not mistaken.
Sorry for confusing you guys.

I got drastically different numbers after I enabled CONFIG_KALLSYMS +
CONFIG_PERF_EVENTS for perf tools.
The only difference in code is that I rebased onto Mel's
mm-bulk-rebase-v6r4.

(lunar is my WIP NIC driver)

1. 5.12-rc3 baseline:

TCP: 566 Mbps
UDP: 615 Mbps

perf top:
 4.44%  [lunar]  [k] lunar_rx_poll_page_pool
 3.56%  [kernel] [k] r4k_wait_irqoff
 2.89%  [kernel] [k] free_unref_page
 2.57%  [kernel] [k] dma_map_page_attrs
 2.32%  [kernel] [k] get_page_from_freelist
 2.28%  [lunar]  [k] lunar_start_xmit
 1.82%  [kernel] [k] __copy_user
 1.75%  [kernel] [k] dev_gro_receive
 1.52%  [kernel] [k] cpuidle_enter_state_coupled
 1.46%  [kernel] [k] tcp_gro_receive
 1.35%  [kernel] [k] __rmemcpy
 1.33%  [nf_conntrack]   [k] nf_conntrack_tcp_packet
 1.30%  [kernel] [k] __dev_queue_xmit
 1.22%  [kernel] [k] pfifo_fast_dequeue
 1.17%  [kernel] [k] skb_release_data
 1.17%  [kernel] [k] skb_segment

free_unref_page() and get_page_from_freelist() consume a lot.

2. 5.12-rc3 + Page Pool recycling by Matteo:
TCP: 589 Mbps
UDP: 633 Mbps

perf top:
 4.27%  [lunar]  [k] lunar_rx_poll_page_pool
 2.68%  [lunar]  [k] lunar_start_xmit
 2.41%  [kernel] [k] dma_map_page_attrs
 1.92%  [kernel] [k] r4k_wait_irqoff
 1.89%  [kernel] [k] __copy_user
 1.62%  [kernel] [k] dev_gro_receive
 1.51%  [kernel] [k] cpuidle_enter_state_coupled
 1.44%  [kernel] [k] tcp_gro_receive
 1.40%  [kernel] [k] __rmemcpy
 1.38%  [nf_conntrack]   [k] nf_conntrack_tcp_packet
 1.37%  [kernel] [k] free_unref_page
 1.35%  [kernel] [k] __dev_queue_xmit
 1.30%  [kernel] [k] skb_segment
 1.28%  [kernel] [k] get_page_from_freelist
 1.27%  [kernel] [k] r4k_dma_cache_inv

+20 Mbps increase on both TCP and UDP. free_unref_page() and
get_page_from_freelist() dropped down the list significantly.

3. 5.12-rc3 + Page Pool recycling + PP bulk allocator (Mel & Jesper):
TCP: 596 Mbps
UDP: 641 Mbps

perf top:
 4.38%  [lunar]  [k] lunar_rx_poll_page_pool
 3.34%  [kernel] [k] r4k_wait_irqoff
 3.14%  [kernel] [k] dma_map_page_attrs
 2.49%  [lunar]  [k] lunar_start_xmit
 1.85%  [kernel] [k] dev_gro_receive
 1.76%  [kernel] [k] free_unref_page
 1.76%  [kernel] [k] __copy_user
 1.65%  [kernel] [k] inet_gro_receive
 1.57%  [kernel] [k] tcp_gro_receive
 1.48%  [kernel] [k] cpuidle_enter_state_coupled
 1.43%  [nf_conntrack]   [k] nf_conntrack_tcp_packet
 1.42%  [kernel] [k] __rmemcpy
 1.25%  [kernel] [k] skb_segment
 1.21%  [kernel] [k] r4k_dma_cache_inv

+10 Mbps on top of recycling.
get_page_from_freelist() is gone.
NAPI polling, CPU idle cycle (r4k_wait_irqoff) and DMA mapping
routine became the top consumers.

4-5. __always_inline for rmqueue_bulk() and __rmqueue_pcplist(),
removing 'noinline' from net/core/page_pool.c etc.

...makes absolutely no sense anymore.
I see Mel took Jesper's patch to make __rmqueue_pcplist() inline into
mm-bulk-rebase-v6r5, not sure if it's really needed now.

So I'm really glad we sorted things out and I can see the real
performance improvements from both recycling and bulk allocations.

> > From what I know for sure about MIPS and my platform,
> > post-Rx synching (dma_sync_single_for_cpu()) is a no-op, and
> > pre-Rx (dma_sync_single_for_device() etc.) is a bit expensive.
> > I always have sane page_pool->pp.max_len value (smth about 1668
> > for MTU of 1500) to minimize the overhead.

Re: [PATCH net-next 0/6] page_pool: recycle buffers

2021-03-23 Thread Ilias Apalodimas
On Tue, Mar 23, 2021 at 04:55:31PM +, Alexander Lobakin wrote:
> > > > > >

[...]

> > > > >
> > > > > Thanks for the testing!
> > > > > Any chance you can get a perf measurement on this?
> > > >
> > > > I guess you mean perf-report (--stdio) output, right?
> > > >
> > >
> > > Yea,
> > > As hinted below, I am just trying to figure out if on Alexander's
> > > platform the cost of syncing is bigger than free-allocate.
> > > I remember one armv7 where that was the case.
> > >
> > > > > Is DMA syncing taking a substantial amount of your cpu usage?
> > > >
> > > > (+1 this is an important question)
> 
> Sure, I'll drop perf tools to my test env and share the results,
> maybe tomorrow or in a few days.
> From what I know for sure about MIPS and my platform,
> post-Rx synching (dma_sync_single_for_cpu()) is a no-op, and
> pre-Rx (dma_sync_single_for_device() etc.) is a bit expensive.
> I always have sane page_pool->pp.max_len value (smth about 1668
> for MTU of 1500) to minimize the overhead.
> 
> By the way, IIRC, all machines shipped with mvpp2 have hardware
> cache coherency units and don't suffer from sync routines at all.
> That may be the reason why mvpp2 wins the most from this series.

Yep, exactly. It's also the reason why you explicitly have to opt in to the
recycling (by marking the skb for it), instead of hiding the feature in the
page pool internals.

Cheers
/Ilias

> 
> > > > > >
> > > > > > [0] 
> > > > > > https://lore.kernel.org/netdev/20210323153550.130385-1-aloba...@pm.me
> > > > > >
> > > >
> >
> > That would be the same as for mvneta:
> >
> > Overhead  Shared Object Symbol
> >   24.10%  [kernel]  [k] __pi___inval_dcache_area
> >   23.02%  [mvneta]  [k] mvneta_rx_swbm
> >7.19%  [kernel]  [k] kmem_cache_alloc
> >
> > Anyway, I tried to use the recycling *and* napi_build_skb on mvpp2,
> > and I get lower packet rate than recycling alone.
> > I don't know why, we should investigate it.
> 
> mvpp2 driver doesn't use napi_consume_skb() on its Tx completion path.
> As a result, NAPI percpu caches get refilled only through
> kmem_cache_alloc_bulk(), and most of skbuff_head recycling
> doesn't work.
> 
> > Regards,
> > --
> > per aspera ad upstream
> 
> Oh, I love that one!
> 
> Al
> 


Re: [PATCH net-next 0/6] page_pool: recycle buffers

2021-03-23 Thread Alexander Lobakin
From: Matteo Croce 
Date: Tue, 23 Mar 2021 17:28:32 +0100

> On Tue, Mar 23, 2021 at 5:10 PM Ilias Apalodimas
>  wrote:
> >
> > On Tue, Mar 23, 2021 at 05:04:47PM +0100, Jesper Dangaard Brouer wrote:
> > > On Tue, 23 Mar 2021 17:47:46 +0200
> > > Ilias Apalodimas  wrote:
> > >
> > > > On Tue, Mar 23, 2021 at 03:41:23PM +, Alexander Lobakin wrote:
> > > > > From: Matteo Croce 
> > > > > Date: Mon, 22 Mar 2021 18:02:55 +0100
> > > > >
> > > > > > From: Matteo Croce 
> > > > > >
> > > > > > This series enables recycling of the buffers allocated with the
> > > > > > page_pool API.
> > > > > > The first two patches are just prerequisites to save space in a
> > > > > > struct and avoid recycling pages allocated with other APIs.
> > > > > > Patch 2 was based on a previous idea from Jonathan Lemon.
> > > > > >
> > > > > > The third one is the real recycling, 4 fixes the compilation of
> > > > > > __skb_frag_unref users, and 5, 6 enable the recycling on two drivers.
> > > > > >
> > > > > > In the last two patches I reported the improvement I have with the
> > > > > > series.
> > > > > >
> > > > > > The recycling as is can't be used with drivers like mlx5 which do
> > > > > > page split, but this is documented in a comment.
> > > > > > In the future, a refcount can be used so as to support mlx5 with no
> > > > > > changes.
> > > > > >
> > > > > > Ilias Apalodimas (2):
> > > > > >   page_pool: DMA handling and allow to recycles frames via SKB
> > > > > >   net: change users of __skb_frag_unref() and add an extra argument
> > > > > >
> > > > > > Jesper Dangaard Brouer (1):
> > > > > >   xdp: reduce size of struct xdp_mem_info
> > > > > >
> > > > > > Matteo Croce (3):
> > > > > >   mm: add a signature in struct page
> > > > > >   mvpp2: recycle buffers
> > > > > >   mvneta: recycle buffers
> > > > > >
> > > > > >  .../chelsio/inline_crypto/ch_ktls/chcr_ktls.c |  2 +-
> > > > > >  drivers/net/ethernet/marvell/mvneta.c |  4 +-
> > > > > >  .../net/ethernet/marvell/mvpp2/mvpp2_main.c   | 17 +++
> > > > > >  drivers/net/ethernet/marvell/sky2.c   |  2 +-
> > > > > >  drivers/net/ethernet/mellanox/mlx4/en_rx.c|  2 +-
> > > > > >  include/linux/mm_types.h  |  1 +
> > > > > >  include/linux/skbuff.h| 33 +++--
> > > > > >  include/net/page_pool.h   | 15 ++
> > > > > >  include/net/xdp.h |  5 +-
> > > > > >  net/core/page_pool.c  | 47 +++
> > > > > >  net/core/skbuff.c | 20 +++-
> > > > > >  net/core/xdp.c| 14 --
> > > > > >  net/tls/tls_device.c  |  2 +-
> > > > > >  13 files changed, 138 insertions(+), 26 deletions(-)
> > > > >
> > > > > Just for the reference, I've performed some tests on 1G SoC NIC with
> > > > > this patchset on, here's direct link: [0]
> > > > >
> > > >
> > > > Thanks for the testing!
> > > > Any chance you can get a perf measurement on this?
> > >
> > > I guess you mean perf-report (--stdio) output, right?
> > >
> >
> > Yea,
> > As hinted below, I am just trying to figure out if on Alexander's
> > platform the cost of syncing is bigger than free-allocate.
> > I remember one armv7 where that was the case.
> >
> > > > Is DMA syncing taking a substantial amount of your cpu usage?
> > >
> > > (+1 this is an important question)

Sure, I'll drop perf tools to my test env and share the results,
maybe tomorrow or in a few days.
From what I know for sure about MIPS and my platform,
post-Rx synching (dma_sync_single_for_cpu()) is a no-op, and
pre-Rx (dma_sync_single_for_device() etc.) is a bit expensive.
I always have sane page_pool->pp.max_len value (smth about 1668
for MTU of 1500) to minimize the overhead.

By the way, IIRC, all machines shipped with mvpp2 have hardware
cache coherency units and don't suffer from sync routines at all.
That may be the reason why mvpp2 wins the most from this series.

> > > > >
> > > > > [0] 
> > > > > https://lore.kernel.org/netdev/20210323153550.130385-1-aloba...@pm.me
> > > > >
> > >
>
> That would be the same as for mvneta:
>
> Overhead  Shared Object Symbol
>   24.10%  [kernel]  [k] __pi___inval_dcache_area
>   23.02%  [mvneta]  [k] mvneta_rx_swbm
>7.19%  [kernel]  [k] kmem_cache_alloc
>
> Anyway, I tried to use the recycling *and* napi_build_skb on mvpp2,
> and I get lower packet rate than recycling alone.
> I don't know why, we should investigate it.

mvpp2 driver doesn't use napi_consume_skb() on its Tx completion path.
As a result, NAPI percpu caches get refilled only through
kmem_cache_alloc_bulk(), and most of skbuff_head recycling
doesn't work.

> Regards,
> --
> per aspera ad upstream

Oh, I love that one!

Al



Re: [PATCH net-next 0/6] page_pool: recycle buffers

2021-03-23 Thread Matteo Croce
On Tue, Mar 23, 2021 at 5:10 PM Ilias Apalodimas
 wrote:
>
> On Tue, Mar 23, 2021 at 05:04:47PM +0100, Jesper Dangaard Brouer wrote:
> > On Tue, 23 Mar 2021 17:47:46 +0200
> > Ilias Apalodimas  wrote:
> >
> > > On Tue, Mar 23, 2021 at 03:41:23PM +, Alexander Lobakin wrote:
> > > > From: Matteo Croce 
> > > > Date: Mon, 22 Mar 2021 18:02:55 +0100
> > > >
> > > > > From: Matteo Croce 
> > > > >
> > > > > This series enables recycling of the buffers allocated with the
> > > > > page_pool API.
> > > > > The first two patches are just prerequisites to save space in a
> > > > > struct and avoid recycling pages allocated with other APIs.
> > > > > Patch 2 was based on a previous idea from Jonathan Lemon.
> > > > >
> > > > > The third one is the real recycling, 4 fixes the compilation of
> > > > > __skb_frag_unref users, and 5, 6 enable the recycling on two drivers.
> > > > >
> > > > > In the last two patches I reported the improvement I have with the
> > > > > series.
> > > > >
> > > > > The recycling as is can't be used with drivers like mlx5 which do
> > > > > page split, but this is documented in a comment.
> > > > > In the future, a refcount can be used so as to support mlx5 with no
> > > > > changes.
> > > > >
> > > > > Ilias Apalodimas (2):
> > > > >   page_pool: DMA handling and allow to recycles frames via SKB
> > > > >   net: change users of __skb_frag_unref() and add an extra argument
> > > > >
> > > > > Jesper Dangaard Brouer (1):
> > > > >   xdp: reduce size of struct xdp_mem_info
> > > > >
> > > > > Matteo Croce (3):
> > > > >   mm: add a signature in struct page
> > > > >   mvpp2: recycle buffers
> > > > >   mvneta: recycle buffers
> > > > >
> > > > >  .../chelsio/inline_crypto/ch_ktls/chcr_ktls.c |  2 +-
> > > > >  drivers/net/ethernet/marvell/mvneta.c |  4 +-
> > > > >  .../net/ethernet/marvell/mvpp2/mvpp2_main.c   | 17 +++
> > > > >  drivers/net/ethernet/marvell/sky2.c   |  2 +-
> > > > >  drivers/net/ethernet/mellanox/mlx4/en_rx.c|  2 +-
> > > > >  include/linux/mm_types.h  |  1 +
> > > > >  include/linux/skbuff.h| 33 +++--
> > > > >  include/net/page_pool.h   | 15 ++
> > > > >  include/net/xdp.h |  5 +-
> > > > >  net/core/page_pool.c  | 47 +++
> > > > >  net/core/skbuff.c | 20 +++-
> > > > >  net/core/xdp.c| 14 --
> > > > >  net/tls/tls_device.c  |  2 +-
> > > > >  13 files changed, 138 insertions(+), 26 deletions(-)
> > > >
> > > > Just for the reference, I've performed some tests on 1G SoC NIC with
> > > > this patchset on, here's direct link: [0]
> > > >
> > >
> > > Thanks for the testing!
> > > Any chance you can get a perf measurement on this?
> >
> > I guess you mean perf-report (--stdio) output, right?
> >
>
> Yea,
> As hinted below, I am just trying to figure out if on Alexander's platform the
> cost of syncing is bigger than free-allocate. I remember one armv7 where that
> was the case.
>
> > > Is DMA syncing taking a substantial amount of your cpu usage?
> >
> > (+1 this is an important question)
> >
> > > >
> > > > [0] 
> > > > https://lore.kernel.org/netdev/20210323153550.130385-1-aloba...@pm.me
> > > >
> >

That would be the same as for mvneta:

Overhead  Shared Object Symbol
  24.10%  [kernel]  [k] __pi___inval_dcache_area
  23.02%  [mvneta]  [k] mvneta_rx_swbm
   7.19%  [kernel]  [k] kmem_cache_alloc

Anyway, I tried to use the recycling *and* napi_build_skb on mvpp2,
and I get lower packet rate than recycling alone.
I don't know why, we should investigate it.

Regards,
-- 
per aspera ad upstream


Re: [PATCH net-next 0/6] page_pool: recycle buffers

2021-03-23 Thread Ilias Apalodimas
On Tue, Mar 23, 2021 at 05:04:47PM +0100, Jesper Dangaard Brouer wrote:
> On Tue, 23 Mar 2021 17:47:46 +0200
> Ilias Apalodimas  wrote:
> 
> > On Tue, Mar 23, 2021 at 03:41:23PM +, Alexander Lobakin wrote:
> > > From: Matteo Croce 
> > > Date: Mon, 22 Mar 2021 18:02:55 +0100
> > >   
> > > > From: Matteo Croce 
> > > >
> > > > This series enables recycling of the buffers allocated with the
> > > > page_pool API.
> > > > The first two patches are just prerequisites to save space in a
> > > > struct and avoid recycling pages allocated with other APIs.
> > > > Patch 2 was based on a previous idea from Jonathan Lemon.
> > > >
> > > > The third one is the real recycling, 4 fixes the compilation of
> > > > __skb_frag_unref users, and 5, 6 enable the recycling on two drivers.
> > > >
> > > > In the last two patches I reported the improvement I have with the
> > > > series.
> > > >
> > > > The recycling as is can't be used with drivers like mlx5 which do
> > > > page split, but this is documented in a comment.
> > > > In the future, a refcount can be used so as to support mlx5 with no
> > > > changes.
> > > >
> > > > Ilias Apalodimas (2):
> > > >   page_pool: DMA handling and allow to recycles frames via SKB
> > > >   net: change users of __skb_frag_unref() and add an extra argument
> > > >
> > > > Jesper Dangaard Brouer (1):
> > > >   xdp: reduce size of struct xdp_mem_info
> > > >
> > > > Matteo Croce (3):
> > > >   mm: add a signature in struct page
> > > >   mvpp2: recycle buffers
> > > >   mvneta: recycle buffers
> > > >
> > > >  .../chelsio/inline_crypto/ch_ktls/chcr_ktls.c |  2 +-
> > > >  drivers/net/ethernet/marvell/mvneta.c |  4 +-
> > > >  .../net/ethernet/marvell/mvpp2/mvpp2_main.c   | 17 +++
> > > >  drivers/net/ethernet/marvell/sky2.c   |  2 +-
> > > >  drivers/net/ethernet/mellanox/mlx4/en_rx.c|  2 +-
> > > >  include/linux/mm_types.h  |  1 +
> > > >  include/linux/skbuff.h| 33 +++--
> > > >  include/net/page_pool.h   | 15 ++
> > > >  include/net/xdp.h |  5 +-
> > > >  net/core/page_pool.c  | 47 +++
> > > >  net/core/skbuff.c | 20 +++-
> > > >  net/core/xdp.c| 14 --
> > > >  net/tls/tls_device.c  |  2 +-
> > > >  13 files changed, 138 insertions(+), 26 deletions(-)  
> > > 
> > > Just for the reference, I've performed some tests on 1G SoC NIC with
> > > this patchset on, here's direct link: [0]
> > >   
> > 
> > Thanks for the testing!
> > Any chance you can get a perf measurement on this?
> 
> I guess you mean perf-report (--stdio) output, right?
> 

Yea, 
As hinted below, I am just trying to figure out if on Alexander's platform the
cost of syncing is bigger than free-allocate. I remember one armv7 where that
was the case. 

> > Is DMA syncing taking a substantial amount of your cpu usage?
> 
> (+1 this is an important question)
>  
> > > 
> > > [0] https://lore.kernel.org/netdev/20210323153550.130385-1-aloba...@pm.me
> > > 
> 
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
> 


Re: [PATCH net-next 0/6] page_pool: recycle buffers

2021-03-23 Thread Jesper Dangaard Brouer
On Tue, 23 Mar 2021 17:47:46 +0200
Ilias Apalodimas  wrote:

> On Tue, Mar 23, 2021 at 03:41:23PM +, Alexander Lobakin wrote:
> > From: Matteo Croce 
> > Date: Mon, 22 Mar 2021 18:02:55 +0100
> >   
> > > From: Matteo Croce 
> > >
> > > This series enables recycling of the buffers allocated with the
> > > page_pool API.
> > > The first two patches are just prerequisites to save space in a struct
> > > and avoid recycling pages allocated with other APIs.
> > > Patch 2 was based on a previous idea from Jonathan Lemon.
> > >
> > > The third one is the real recycling, 4 fixes the compilation of
> > > __skb_frag_unref users, and 5, 6 enable the recycling on two drivers.
> > >
> > > In the last two patches I reported the improvement I have with the
> > > series.
> > >
> > > The recycling as is can't be used with drivers like mlx5 which do
> > > page split, but this is documented in a comment.
> > > In the future, a refcount can be used so as to support mlx5 with no
> > > changes.
> > >
> > > Ilias Apalodimas (2):
> > >   page_pool: DMA handling and allow to recycles frames via SKB
> > >   net: change users of __skb_frag_unref() and add an extra argument
> > >
> > > Jesper Dangaard Brouer (1):
> > >   xdp: reduce size of struct xdp_mem_info
> > >
> > > Matteo Croce (3):
> > >   mm: add a signature in struct page
> > >   mvpp2: recycle buffers
> > >   mvneta: recycle buffers
> > >
> > >  .../chelsio/inline_crypto/ch_ktls/chcr_ktls.c |  2 +-
> > >  drivers/net/ethernet/marvell/mvneta.c |  4 +-
> > >  .../net/ethernet/marvell/mvpp2/mvpp2_main.c   | 17 +++
> > >  drivers/net/ethernet/marvell/sky2.c   |  2 +-
> > >  drivers/net/ethernet/mellanox/mlx4/en_rx.c|  2 +-
> > >  include/linux/mm_types.h  |  1 +
> > >  include/linux/skbuff.h| 33 +++--
> > >  include/net/page_pool.h   | 15 ++
> > >  include/net/xdp.h |  5 +-
> > >  net/core/page_pool.c  | 47 +++
> > >  net/core/skbuff.c | 20 +++-
> > >  net/core/xdp.c| 14 --
> > >  net/tls/tls_device.c  |  2 +-
> > >  13 files changed, 138 insertions(+), 26 deletions(-)  
> > 
> > Just for the reference, I've performed some tests on 1G SoC NIC with
> > this patchset on, here's direct link: [0]
> >   
> 
> Thanks for the testing!
> Any chance you can get a perf measurement on this?

I guess you mean perf-report (--stdio) output, right?

> Is DMA syncing taking a substantial amount of your cpu usage?

(+1 this is an important question)
 
> > 
> > [0] https://lore.kernel.org/netdev/20210323153550.130385-1-aloba...@pm.me
> > 

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



Re: [PATCH net-next 0/6] page_pool: recycle buffers

2021-03-23 Thread Ilias Apalodimas
On Tue, Mar 23, 2021 at 03:41:23PM +, Alexander Lobakin wrote:
> From: Matteo Croce 
> Date: Mon, 22 Mar 2021 18:02:55 +0100
> 
> > From: Matteo Croce 
> >
> > This series enables recycling of the buffers allocated with the
> > page_pool API.
> > The first two patches are just prerequisites to save space in a struct
> > and avoid recycling pages allocated with other APIs.
> > Patch 2 was based on a previous idea from Jonathan Lemon.
> >
> > The third one is the real recycling, 4 fixes the compilation of
> > __skb_frag_unref users, and 5, 6 enable the recycling on two drivers.
> >
> > In the last two patches I reported the improvement I have with the
> > series.
> >
> > The recycling as is can't be used with drivers like mlx5 which do
> > page split, but this is documented in a comment.
> > In the future, a refcount can be used so as to support mlx5 with no
> > changes.
> >
> > Ilias Apalodimas (2):
> >   page_pool: DMA handling and allow to recycles frames via SKB
> >   net: change users of __skb_frag_unref() and add an extra argument
> >
> > Jesper Dangaard Brouer (1):
> >   xdp: reduce size of struct xdp_mem_info
> >
> > Matteo Croce (3):
> >   mm: add a signature in struct page
> >   mvpp2: recycle buffers
> >   mvneta: recycle buffers
> >
> >  .../chelsio/inline_crypto/ch_ktls/chcr_ktls.c |  2 +-
> >  drivers/net/ethernet/marvell/mvneta.c |  4 +-
> >  .../net/ethernet/marvell/mvpp2/mvpp2_main.c   | 17 +++
> >  drivers/net/ethernet/marvell/sky2.c   |  2 +-
> >  drivers/net/ethernet/mellanox/mlx4/en_rx.c|  2 +-
> >  include/linux/mm_types.h  |  1 +
> >  include/linux/skbuff.h| 33 +++--
> >  include/net/page_pool.h   | 15 ++
> >  include/net/xdp.h |  5 +-
> >  net/core/page_pool.c  | 47 +++
> >  net/core/skbuff.c | 20 +++-
> >  net/core/xdp.c| 14 --
> >  net/tls/tls_device.c  |  2 +-
> >  13 files changed, 138 insertions(+), 26 deletions(-)
> 
> Just for the reference, I've performed some tests on 1G SoC NIC with
> this patchset on, here's direct link: [0]
> 

Thanks for the testing!
Any chance you can get a perf measurement on this?
Is DMA syncing taking a substantial amount of your cpu usage?

Thanks
/Ilias

> > --
> > 2.30.2
> 
> [0] https://lore.kernel.org/netdev/20210323153550.130385-1-aloba...@pm.me
> 
> Thanks,
> Al
> 


Re: [PATCH net-next 0/6] page_pool: recycle buffers

2021-03-23 Thread Alexander Lobakin
From: Matteo Croce 
Date: Mon, 22 Mar 2021 18:02:55 +0100

> From: Matteo Croce 
>
> This series enables recycling of the buffers allocated with the page_pool API.
> > The first two patches are just prerequisites: they save space in a struct
> > and avoid recycling pages allocated with other APIs.
> Patch 2 was based on a previous idea from Jonathan Lemon.
>
> > The third one implements the actual recycling, patch 4 fixes the
> > compilation of __skb_frag_unref users, and patches 5 and 6 enable
> > recycling in two drivers.
>
> > In the last two patches I report the improvements measured with the series.
>
> > As is, the recycling can't be used with drivers like mlx5 which do page
> > splitting, but this is documented in a comment.
> > In the future, a refcount can be added to support mlx5 with no changes.
>
> Ilias Apalodimas (2):
>   page_pool: DMA handling and allow to recycles frames via SKB
>   net: change users of __skb_frag_unref() and add an extra argument
>
> Jesper Dangaard Brouer (1):
>   xdp: reduce size of struct xdp_mem_info
>
> Matteo Croce (3):
>   mm: add a signature in struct page
>   mvpp2: recycle buffers
>   mvneta: recycle buffers
>
>  .../chelsio/inline_crypto/ch_ktls/chcr_ktls.c |  2 +-
>  drivers/net/ethernet/marvell/mvneta.c |  4 +-
>  .../net/ethernet/marvell/mvpp2/mvpp2_main.c   | 17 +++
>  drivers/net/ethernet/marvell/sky2.c   |  2 +-
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c|  2 +-
>  include/linux/mm_types.h  |  1 +
>  include/linux/skbuff.h| 33 +++--
>  include/net/page_pool.h   | 15 ++
>  include/net/xdp.h |  5 +-
>  net/core/page_pool.c  | 47 +++
>  net/core/skbuff.c | 20 +++-
>  net/core/xdp.c| 14 --
>  net/tls/tls_device.c  |  2 +-
>  13 files changed, 138 insertions(+), 26 deletions(-)

Just for the reference, I've performed some tests on 1G SoC NIC with
this patchset on, here's direct link: [0]

> --
> 2.30.2

[0] https://lore.kernel.org/netdev/20210323153550.130385-1-aloba...@pm.me

Thanks,
Al



Re: [PATCH net-next 0/6] page_pool: recycle buffers

2021-03-23 Thread Ilias Apalodimas
Hi David, 

On Tue, Mar 23, 2021 at 08:57:57AM -0600, David Ahern wrote:
> On 3/22/21 11:02 AM, Matteo Croce wrote:
> > From: Matteo Croce 
> > 
> > This series enables recycling of the buffers allocated with the page_pool 
> > API.
> > The first two patches are just prerequisites: they save space in a struct
> > and avoid recycling pages allocated with other APIs.
> > Patch 2 was based on a previous idea from Jonathan Lemon.
> > 
> > The third one implements the actual recycling, patch 4 fixes the
> > compilation of __skb_frag_unref users, and patches 5 and 6 enable
> > recycling in two drivers.
> 
> patch 4 should be folded into 3; each patch should build without errors.
> 

Yes 

> > 
> > In the last two patches I report the improvements measured with the series.
> > 
> > As is, the recycling can't be used with drivers like mlx5 which do page
> > splitting, but this is documented in a comment.
> > In the future, a refcount can be added to support mlx5 with no changes.
> 
> Is the end goal of the page_pool changes to remove driver private caches?
> 
> 

Yes. The patchset doesn't currently support that, because all the >10Gbit
interfaces split the page and we don't account for that. We should be able
to extend it and do the accounting, though. I don't have any (Intel/mlx)
hardware available, but I'll be happy to talk to anyone who does and figure
out a way to support those cards properly.


Cheers
/Ilias


Re: [PATCH net-next 0/6] page_pool: recycle buffers

2021-03-23 Thread David Ahern
On 3/22/21 11:02 AM, Matteo Croce wrote:
> From: Matteo Croce 
> 
> This series enables recycling of the buffers allocated with the page_pool API.
> The first two patches are just prerequisites: they save space in a struct
> and avoid recycling pages allocated with other APIs.
> Patch 2 was based on a previous idea from Jonathan Lemon.
> 
> The third one implements the actual recycling, patch 4 fixes the
> compilation of __skb_frag_unref users, and patches 5 and 6 enable
> recycling in two drivers.

patch 4 should be folded into 3; each patch should build without errors.

> 
> In the last two patches I report the improvements measured with the series.
> 
> As is, the recycling can't be used with drivers like mlx5 which do page
> splitting, but this is documented in a comment.
> In the future, a refcount can be added to support mlx5 with no changes.

Is the end goal of the page_pool changes to remove driver private caches?




[PATCH net-next 0/6] page_pool: recycle buffers

2021-03-22 Thread Matteo Croce
From: Matteo Croce 

This series enables recycling of the buffers allocated with the page_pool API.
The first two patches are just prerequisites: they save space in a struct
and avoid recycling pages allocated with other APIs.
Patch 2 was based on a previous idea from Jonathan Lemon.

The third one implements the actual recycling, patch 4 fixes the compilation
of __skb_frag_unref users, and patches 5 and 6 enable recycling in two drivers.

In the last two patches I report the improvements measured with the series.

As is, the recycling can't be used with drivers like mlx5 which do page
splitting, but this is documented in a comment.
In the future, a refcount can be added to support mlx5 with no changes.

Ilias Apalodimas (2):
  page_pool: DMA handling and allow to recycles frames via SKB
  net: change users of __skb_frag_unref() and add an extra argument

Jesper Dangaard Brouer (1):
  xdp: reduce size of struct xdp_mem_info

Matteo Croce (3):
  mm: add a signature in struct page
  mvpp2: recycle buffers
  mvneta: recycle buffers

 .../chelsio/inline_crypto/ch_ktls/chcr_ktls.c |  2 +-
 drivers/net/ethernet/marvell/mvneta.c |  4 +-
 .../net/ethernet/marvell/mvpp2/mvpp2_main.c   | 17 +++
 drivers/net/ethernet/marvell/sky2.c   |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c|  2 +-
 include/linux/mm_types.h  |  1 +
 include/linux/skbuff.h| 33 +++--
 include/net/page_pool.h   | 15 ++
 include/net/xdp.h |  5 +-
 net/core/page_pool.c  | 47 +++
 net/core/skbuff.c | 20 +++-
 net/core/xdp.c| 14 --
 net/tls/tls_device.c  |  2 +-
 13 files changed, 138 insertions(+), 26 deletions(-)

-- 
2.30.2