Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-03-13 Thread Saeed Mahameed
On Sun, Mar 12, 2017 at 6:49 PM, Eric Dumazet  wrote:
> On Sun, 2017-03-12 at 17:49 +0200, Saeed Mahameed wrote:
>> On Sun, Mar 12, 2017 at 5:29 PM, Eric Dumazet  wrote:
>> > On Sun, 2017-03-12 at 07:57 -0700, Eric Dumazet wrote:
>> >
>> >> Problem is XDP TX :
>> >>
>> >> I do not see any guarantee mlx4_en_recycle_tx_desc() runs while the NAPI
>> >> RX is owned by current cpu.
>> >>
>> >> Since TX completion is using a different NAPI, I really do not believe
>> >> we can avoid an atomic operation, like a spinlock, to protect the list
>> >> of pages ( ring->page_cache )
>> >
>> > A quick fix for net-next would be :
>> >
>>
>> Hi Eric, Good catch.
>>
>> I don't think we need to complicate this with an expensive spinlock;
>> we can simply fix it by not enabling interrupts on the XDP TX CQ (not
>> arming this CQ at all) and handling XDP TX CQ completions from the RX
>> NAPI context, serially (atomically) before handling the RX completions
>> themselves. This way locking is not required, since all page cache
>> handling is done from the same context (RX NAPI).
>>
>> This is how we do it in mlx5, and this is the best approach
>> (performance-wise), since we delay handling XDP TX CQ completions
>> until we really need the space they hold (on new RX packets).
>
> SGTM, can you provide the patch for mlx4 ?
>

Of course, we will send it soon.

> Thanks !
>
>


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-03-12 Thread Eric Dumazet
On Sun, 2017-03-12 at 17:49 +0200, Saeed Mahameed wrote:
> On Sun, Mar 12, 2017 at 5:29 PM, Eric Dumazet  wrote:
> > On Sun, 2017-03-12 at 07:57 -0700, Eric Dumazet wrote:
> >
> >> Problem is XDP TX :
> >>
> >> I do not see any guarantee mlx4_en_recycle_tx_desc() runs while the NAPI
> >> RX is owned by current cpu.
> >>
> >> Since TX completion is using a different NAPI, I really do not believe
> >> we can avoid an atomic operation, like a spinlock, to protect the list
> >> of pages ( ring->page_cache )
> >
> > A quick fix for net-next would be :
> >
> 
> Hi Eric, Good catch.
> 
> I don't think we need to complicate this with an expensive spinlock;
> we can simply fix it by not enabling interrupts on the XDP TX CQ (not
> arming this CQ at all) and handling XDP TX CQ completions from the RX
> NAPI context, serially (atomically) before handling the RX completions
> themselves. This way locking is not required, since all page cache
> handling is done from the same context (RX NAPI).
> 
> This is how we do it in mlx5, and this is the best approach
> (performance-wise), since we delay handling XDP TX CQ completions
> until we really need the space they hold (on new RX packets).

SGTM, can you provide the patch for mlx4 ?

Thanks !




Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-03-12 Thread Saeed Mahameed
On Sun, Mar 12, 2017 at 5:29 PM, Eric Dumazet  wrote:
> On Sun, 2017-03-12 at 07:57 -0700, Eric Dumazet wrote:
>
>> Problem is XDP TX :
>>
>> I do not see any guarantee mlx4_en_recycle_tx_desc() runs while the NAPI
>> RX is owned by current cpu.
>>
>> Since TX completion is using a different NAPI, I really do not believe
>> we can avoid an atomic operation, like a spinlock, to protect the list
>> of pages ( ring->page_cache )
>
> A quick fix for net-next would be :
>

Hi Eric, Good catch.

I don't think we need to complicate this with an expensive spinlock;
we can simply fix it by not enabling interrupts on the XDP TX CQ (not
arming this CQ at all) and handling XDP TX CQ completions from the RX
NAPI context, serially (atomically) before handling the RX completions
themselves. This way locking is not required, since all page cache
handling is done from the same context (RX NAPI).

This is how we do it in mlx5, and this is the best approach
(performance-wise), since we delay handling XDP TX CQ completions
until we really need the space they hold (on new RX packets).
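
For illustration, a rough sketch of that polling order; all of the names
below are hypothetical stand-ins, not the actual mlx4 functions or structures:

#include <linux/netdevice.h>

/* Hypothetical pieces, declared only so the sketch is self-contained */
struct xdp_tx_cq;
struct rx_queue {
	struct napi_struct napi;
	struct xdp_tx_cq *xdp_tx_cq;
};
void poll_xdp_tx_cq(struct xdp_tx_cq *cq);
int poll_rx_cq(struct rx_queue *rxq, int budget);
void arm_rx_cq(struct rx_queue *rxq);

/* The XDP TX CQ is never armed; its completions are reaped at the start
 * of the RX NAPI poll, so the RX page cache is only ever touched from
 * this one context and needs no lock.
 */
static int rx_napi_poll(struct napi_struct *napi, int budget)
{
	struct rx_queue *rxq = container_of(napi, struct rx_queue, napi);
	int done;

	/* 1) Free XDP TX descriptors and recycle their pages into the RX
	 *    page cache, on the same CPU that is about to consume them. */
	poll_xdp_tx_cq(rxq->xdp_tx_cq);

	/* 2) Handle RX; replenishing descriptors can now pop pages from
	 *    the cache without atomic protection. */
	done = poll_rx_cq(rxq, budget);

	if (done < budget) {
		napi_complete_done(napi, done);
		arm_rx_cq(rxq);		/* re-arm the RX CQ only */
	}
	return done;
}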

> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index aa074e57ce06fb2842fa1faabd156c3cd2fe10f5..e0b2ea8cefd6beef093c41bade199e3ec4f0291c 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -137,13 +137,17 @@ static int mlx4_en_prepare_rx_desc(struct mlx4_en_priv *priv,
>  	struct mlx4_en_rx_desc *rx_desc = ring->buf + (index * ring->stride);
>  	struct mlx4_en_rx_alloc *frags = ring->rx_info +
>  					(index << priv->log_rx_info);
> +
>  	if (ring->page_cache.index > 0) {
> +		spin_lock(&ring->page_cache.lock);
> +
>  		/* XDP uses a single page per frame */
>  		if (!frags->page) {
>  			ring->page_cache.index--;
>  			frags->page = ring->page_cache.buf[ring->page_cache.index].page;
>  			frags->dma  = ring->page_cache.buf[ring->page_cache.index].dma;
>  		}
> +		spin_unlock(&ring->page_cache.lock);
>  		frags->page_offset = XDP_PACKET_HEADROOM;
>  		rx_desc->data[0].addr = cpu_to_be64(frags->dma +
>  						    XDP_PACKET_HEADROOM);
> @@ -277,6 +281,7 @@ int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
>  		}
>  	}
>  
> +	spin_lock_init(&ring->page_cache.lock);
>  	ring->prod = 0;
>  	ring->cons = 0;
>  	ring->size = size;
> @@ -419,10 +424,13 @@ bool mlx4_en_rx_recycle(struct mlx4_en_rx_ring *ring,
>  
>  	if (cache->index >= MLX4_EN_CACHE_SIZE)
>  		return false;
> -
> -	cache->buf[cache->index].page = frame->page;
> -	cache->buf[cache->index].dma = frame->dma;
> -	cache->index++;
> +	spin_lock(&cache->lock);
> +	if (cache->index < MLX4_EN_CACHE_SIZE) {
> +		cache->buf[cache->index].page = frame->page;
> +		cache->buf[cache->index].dma = frame->dma;
> +		cache->index++;
> +	}
> +	spin_unlock(&cache->lock);
>  	return true;
>  }
>  
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index 39f401aa30474e61c0b0029463b23a829ec35fa3..090a08020d13d8e11cc163ac9fc6ac6affccc463 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -258,7 +258,8 @@ struct mlx4_en_rx_alloc {
>  #define MLX4_EN_CACHE_SIZE	(2 * NAPI_POLL_WEIGHT)
>  
>  struct mlx4_en_page_cache {
> -	u32 index;
> +	u32		index;
> +	spinlock_t	lock;
>  	struct {
>  		struct page	*page;
>  		dma_addr_t	dma;
>
>


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-03-12 Thread Eric Dumazet
On Sun, 2017-03-12 at 07:57 -0700, Eric Dumazet wrote:

> Problem is XDP TX :
> 
> I do not see any guarantee mlx4_en_recycle_tx_desc() runs while the NAPI
> RX is owned by current cpu.
> 
> Since TX completion is using a different NAPI, I really do not believe
> we can avoid an atomic operation, like a spinlock, to protect the list
> of pages ( ring->page_cache )

A quick fix for net-next would be :

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index aa074e57ce06fb2842fa1faabd156c3cd2fe10f5..e0b2ea8cefd6beef093c41bade199e3ec4f0291c 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -137,13 +137,17 @@ static int mlx4_en_prepare_rx_desc(struct mlx4_en_priv *priv,
 	struct mlx4_en_rx_desc *rx_desc = ring->buf + (index * ring->stride);
 	struct mlx4_en_rx_alloc *frags = ring->rx_info +
 					(index << priv->log_rx_info);
+
 	if (ring->page_cache.index > 0) {
+		spin_lock(&ring->page_cache.lock);
+
 		/* XDP uses a single page per frame */
 		if (!frags->page) {
 			ring->page_cache.index--;
 			frags->page = ring->page_cache.buf[ring->page_cache.index].page;
 			frags->dma  = ring->page_cache.buf[ring->page_cache.index].dma;
 		}
+		spin_unlock(&ring->page_cache.lock);
 		frags->page_offset = XDP_PACKET_HEADROOM;
 		rx_desc->data[0].addr = cpu_to_be64(frags->dma +
 						    XDP_PACKET_HEADROOM);
@@ -277,6 +281,7 @@ int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
 		}
 	}
 
+	spin_lock_init(&ring->page_cache.lock);
 	ring->prod = 0;
 	ring->cons = 0;
 	ring->size = size;
@@ -419,10 +424,13 @@ bool mlx4_en_rx_recycle(struct mlx4_en_rx_ring *ring,
 
 	if (cache->index >= MLX4_EN_CACHE_SIZE)
 		return false;
-
-	cache->buf[cache->index].page = frame->page;
-	cache->buf[cache->index].dma = frame->dma;
-	cache->index++;
+	spin_lock(&cache->lock);
+	if (cache->index < MLX4_EN_CACHE_SIZE) {
+		cache->buf[cache->index].page = frame->page;
+		cache->buf[cache->index].dma = frame->dma;
+		cache->index++;
+	}
+	spin_unlock(&cache->lock);
 	return true;
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 39f401aa30474e61c0b0029463b23a829ec35fa3..090a08020d13d8e11cc163ac9fc6ac6affccc463 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -258,7 +258,8 @@ struct mlx4_en_rx_alloc {
 #define MLX4_EN_CACHE_SIZE	(2 * NAPI_POLL_WEIGHT)
 
 struct mlx4_en_page_cache {
-	u32 index;
+	u32		index;
+	spinlock_t	lock;
 	struct {
 		struct page	*page;
 		dma_addr_t	dma;




Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-03-12 Thread Eric Dumazet
On Wed, 2017-02-22 at 18:06 -0800, Eric Dumazet wrote:
> On Wed, 2017-02-22 at 17:08 -0800, Alexander Duyck wrote:
> 
> > 
> > Right but you were talking about using both halves one after the
> > other.  If that occurs you have nothing left that you can reuse.  That
> > was what I was getting at.  If you use up both halves you end up
> > having to unmap the page.
> > 
> 
> You must have misunderstood me.
> 
> Once we use both halves of a page, we _keep_ the page, we do not unmap
> it.
> 
> We save the page pointer in a ring buffer of pages.
> Call it the 'quarantine'
> 
> When we _need_ to replenish the RX desc, we take a look at the oldest
> entry in the quarantine ring.
> 
> If page count is 1 (or pagecnt_bias if needed) -> we immediately reuse
> this saved page.
> 
> If not, _then_ we unmap and release the page.
> 
> Note that we would have received 4096 frames before looking at the page
> count, so there is high chance both halves were consumed.
> 
> To recap on x86 :
> 
> 2048 active pages would be visible by the device, because 4096 RX desc
> would contain dma addresses pointing to the 4096 halves.
> 
> And 2048 pages would be in the reserve.
> 
> 
> > The whole idea behind using only half the page per descriptor is to
> > allow us to loop through the ring before we end up reusing it again.
> > That buys us enough time that usually the stack has consumed the frame
> > before we need it again.
> 
> 
> The same will happen really.
> 
> Best maybe is for me to send the patch ;)

Excellent results so far, performance on PowerPC is back, and x86 gets a
gain as well.

Problem is XDP TX :

I do not see any guarantee mlx4_en_recycle_tx_desc() runs while the NAPI
RX is owned by current cpu.

Since TX completion is using a different NAPI, I really do not believe
we can avoid an atomic operation, like a spinlock, to protect the list
of pages ( ring->page_cache )






Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-24 Thread Jesper Dangaard Brouer
On Wed, 22 Feb 2017 18:06:58 -0800
Eric Dumazet  wrote:

> On Wed, 2017-02-22 at 17:08 -0800, Alexander Duyck wrote:
> 
> > 
> > Right but you were talking about using both halves one after the
> > other.  If that occurs you have nothing left that you can reuse.  That
> > was what I was getting at.  If you use up both halves you end up
> > having to unmap the page.
> >   
> 
> You must have misunderstood me.

FYI, I also misunderstood you (Eric) to start with ;-)
 
> Once we use both halves of a page, we _keep_ the page, we do not unmap
> it.
> 
> We save the page pointer in a ring buffer of pages.
> Call it the 'quarantine'
> 
> When we _need_ to replenish the RX desc, we take a look at the oldest
> entry in the quarantine ring.
> 
> If page count is 1 (or pagecnt_bias if needed) -> we immediately reuse
> this saved page.
> 
> If not, _then_ we unmap and release the page.
> 
> Note that we would have received 4096 frames before looking at the page
> count, so there is high chance both halves were consumed.
> 
> To recap on x86 :
> 
> 2048 active pages would be visible by the device, because 4096 RX desc
> would contain dma addresses pointing to the 4096 halves.
> 
> And 2048 pages would be in the reserve.

I do like it, and it should work.  I like it because it solves my
concern, regarding being able to adjust the amount of
outstanding-frames independently of the RX ring size.


Do notice: driver developers have to use Alex'es new DMA API in-order
to get writable-pages, else this will violate the DMA API.  And XDP
requires writable pages.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-23 Thread Tariq Toukan



On 23/02/2017 4:18 AM, Alexander Duyck wrote:

On Wed, Feb 22, 2017 at 6:06 PM, Eric Dumazet  wrote:

On Wed, 2017-02-22 at 17:08 -0800, Alexander Duyck wrote:



Right but you were talking about using both halves one after the
other.  If that occurs you have nothing left that you can reuse.  That
was what I was getting at.  If you use up both halves you end up
having to unmap the page.



You must have misunderstood me.

Once we use both halves of a page, we _keep_ the page, we do not unmap
it.

We save the page pointer in a ring buffer of pages.
Call it the 'quarantine'

When we _need_ to replenish the RX desc, we take a look at the oldest
entry in the quarantine ring.

If page count is 1 (or pagecnt_bias if needed) -> we immediately reuse
this saved page.

If not, _then_ we unmap and release the page.


Okay, that was what I was referring to when I mentioned a "hybrid
between the mlx5 and the Intel approach".  Makes sense.



Indeed, in mlx5 Striding RQ (mpwqe) we do something similar.
Our NIC (ConnectX-4 LX and newer) can write multiple _consecutive_ 
packets into the same RX buffer (page).

AFAIU, this is what Eric suggests to do in SW in mlx4.

Here are the main characteristics of our page-cache in mlx5:
1) FIFO (for higher chances of an available page).
2) If the page-cache head is busy, it is not freed. This has its pros 
and cons. We might reconsider.
3) Pages in the cache have no page-to-WQE assignment (WQE = Work Queue 
Element, i.e. RX descriptor). They are shared among all WQEs of an RQ and 
might be used by different WQEs in different rounds.
4) The cache size is smaller than suggested; we would happily increase it 
to reflect a whole ring.


Still, performance tests over mlx5 show that under high load we quickly end 
up allocating pages, as the stack does not release its references in time.

Increasing the cache size helps of course.
As there's no _fixed_ fair size that guarantees the availability of 
pages every ring cycle, reflecting the ring size can help, and would give 
users the opportunity to tune their performance by setting their 
ring size according to how powerful their CPUs are and what traffic 
type/load they're running.
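
As a rough illustration of points 1-3 above, a stripped-down FIFO cache
could look like the sketch below (the structure and names are made up,
not the real mlx5 code):

#include <linux/mm.h>		/* struct page, page_ref_count */
#include <linux/kernel.h>	/* ARRAY_SIZE */

/* Hypothetical FIFO page cache: pages are shared by all WQEs of the RQ,
 * pushed at the tail and popped from the head; a busy head page is left
 * in place (not freed), and the caller falls back to the allocator.
 */
struct rq_page_cache {
	struct page	*pages[256];	/* size is a tunable (e.g. ~ring size) */
	u32		head;
	u32		tail;		/* head == tail means empty */
};

static struct page *rq_cache_get(struct rq_page_cache *c)
{
	struct page *page;

	if (c->head == c->tail)
		return NULL;				/* empty */
	page = c->pages[c->head % ARRAY_SIZE(c->pages)];
	if (page_ref_count(page) != 1)
		return NULL;				/* head busy: stays put */
	c->head++;
	return page;					/* reuse, no allocation */
}

static bool rq_cache_put(struct rq_page_cache *c, struct page *page)
{
	if (c->tail - c->head >= ARRAY_SIZE(c->pages))
		return false;				/* full: caller releases */
	c->pages[c->tail % ARRAY_SIZE(c->pages)] = page;
	c->tail++;
	return true;
}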



Note that we would have received 4096 frames before looking at the page
count, so there is high chance both halves were consumed.

To recap on x86 :

2048 active pages would be visible by the device, because 4096 RX desc
would contain dma addresses pointing to the 4096 halves.

And 2048 pages would be in the reserve.


The buffer info layout for something like that would probably be
pretty interesting.  Basically you would be doubling up the ring so
that you handle 2 Rx descriptors per a single buffer info since you
would automatically know that it would be an even/odd setup in terms
of the buffer offsets.

If you get a chance to do something like that I would love to know the
result.  Otherwise if I get a chance I can try messing with i40e or
ixgbe some time and see what kind of impact it has.


The whole idea behind using only half the page per descriptor is to
allow us to loop through the ring before we end up reusing it again.
That buys us enough time that usually the stack has consumed the frame
before we need it again.



The same will happen really.

Best maybe is for me to send the patch ;)


I think I have the idea now.  However patches are always welcome..  :-)



Same here :-)


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-22 Thread Alexander Duyck
On Wed, Feb 22, 2017 at 6:06 PM, Eric Dumazet  wrote:
> On Wed, 2017-02-22 at 17:08 -0800, Alexander Duyck wrote:
>
>>
>> Right but you were talking about using both halves one after the
>> other.  If that occurs you have nothing left that you can reuse.  That
>> was what I was getting at.  If you use up both halves you end up
>> having to unmap the page.
>>
>
> You must have misunderstood me.
>
> Once we use both halves of a page, we _keep_ the page, we do not unmap
> it.
>
> We save the page pointer in a ring buffer of pages.
> Call it the 'quarantine'
>
> When we _need_ to replenish the RX desc, we take a look at the oldest
> entry in the quarantine ring.
>
> If page count is 1 (or pagecnt_bias if needed) -> we immediately reuse
> this saved page.
>
> If not, _then_ we unmap and release the page.

Okay, that was what I was referring to when I mentioned a "hybrid
between the mlx5 and the Intel approach".  Makes sense.

> Note that we would have received 4096 frames before looking at the page
> count, so there is high chance both halves were consumed.
>
> To recap on x86 :
>
> 2048 active pages would be visible by the device, because 4096 RX desc
> would contain dma addresses pointing to the 4096 halves.
>
> And 2048 pages would be in the reserve.

The buffer info layout for something like that would probably be
pretty interesting.  Basically you would be doubling up the ring so
that you handle 2 Rx descriptors per a single buffer info since you
would automatically know that it would be an even/odd setup in terms
of the buffer offsets.

If you get a chance to do something like that I would love to know the
result.  Otherwise if I get a chance I can try messing with i40e or
ixgbe some time and see what kind of impact it has.
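
A tiny sketch of the even/odd layout being described here, with hypothetical
types rather than i40e/ixgbe code:

#include <linux/mm.h>		/* PAGE_SIZE */
#include <linux/types.h>	/* dma_addr_t, u32 */

/* Hypothetical layout: one buffer_info entry backs two consecutive RX
 * descriptors; even descriptors use the first half of the page, odd
 * descriptors the second half.
 */
struct rx_buffer_info {
	struct page	*page;
	dma_addr_t	dma;		/* DMA mapping of the whole page */
};

static dma_addr_t rx_desc_dma(const struct rx_buffer_info *tbl, u32 desc_idx)
{
	const struct rx_buffer_info *bi = &tbl[desc_idx >> 1];	/* 2 descs share one entry */
	unsigned int offset = (desc_idx & 1) ? PAGE_SIZE / 2 : 0;

	return bi->dma + offset;
}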

>> The whole idea behind using only half the page per descriptor is to
>> allow us to loop through the ring before we end up reusing it again.
>> That buys us enough time that usually the stack has consumed the frame
>> before we need it again.
>
>
> The same will happen really.
>
> Best maybe is for me to send the patch ;)

I think I have the idea now.  However patches are always welcome..  :-)


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-22 Thread Eric Dumazet
On Wed, 2017-02-22 at 17:08 -0800, Alexander Duyck wrote:

> 
> Right but you were talking about using both halves one after the
> other.  If that occurs you have nothing left that you can reuse.  That
> was what I was getting at.  If you use up both halves you end up
> having to unmap the page.
> 

You must have misunderstood me.

Once we use both halves of a page, we _keep_ the page, we do not unmap
it.

We save the page pointer in a ring buffer of pages.
Call it the 'quarantine'

When we _need_ to replenish the RX desc, we take a look at the oldest
entry in the quarantine ring.

If page count is 1 (or pagecnt_bias if needed) -> we immediately reuse
this saved page.

If not, _then_ we unmap and release the page.

Note that we would have received 4096 frames before looking at the page
count, so there is high chance both halves were consumed.

To recap on x86 :

2048 active pages would be visible by the device, because 4096 RX desc
would contain dma addresses pointing to the 4096 halves.

And 2048 pages would be in the reserve.


> The whole idea behind using only half the page per descriptor is to
> allow us to loop through the ring before we end up reusing it again.
> That buys us enough time that usually the stack has consumed the frame
> before we need it again.


The same will happen really.

Best maybe is for me to send the patch ;)
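
A minimal sketch of that replenish-time check, using made-up structures
(this is not the patch that was eventually sent):

#include <linux/dma-mapping.h>	/* dma_unmap_page, DMA_FROM_DEVICE */
#include <linux/mm.h>		/* put_page, page_ref_count, PAGE_SIZE */

/* Hypothetical quarantine ring: the oldest entry is examined only when a
 * new RX buffer is needed.  If nobody else holds a reference, the page is
 * reused as-is (still DMA-mapped); otherwise it is unmapped and released
 * and the caller allocates a fresh page.
 */
struct rx_quarantine {
	u32	head;
	u32	size;			/* power of two */
	struct {
		struct page	*page;
		dma_addr_t	dma;
	} slot[];
};

static struct page *quarantine_get_page(struct rx_quarantine *q,
					struct device *dev)
{
	u32 idx = q->head & (q->size - 1);
	struct page *page = q->slot[idx].page;

	if (!page)
		return NULL;			/* nothing quarantined yet */

	q->slot[idx].page = NULL;
	q->head++;

	if (page_ref_count(page) == 1)
		return page;			/* both halves consumed and freed: reuse */

	/* still referenced by the stack: unmap and release it */
	dma_unmap_page(dev, q->slot[idx].dma, PAGE_SIZE, DMA_FROM_DEVICE);
	put_page(page);
	return NULL;
}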




Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-22 Thread Alexander Duyck
On Wed, Feb 22, 2017 at 10:21 AM, Eric Dumazet  wrote:
> On Wed, 2017-02-22 at 09:23 -0800, Alexander Duyck wrote:
>> On Wed, Feb 22, 2017 at 8:22 AM, Eric Dumazet  wrote:
>> > On Mon, 2017-02-13 at 11:58 -0800, Eric Dumazet wrote:
>> >> Use of order-3 pages is problematic in some cases.
>> >>
>> >> This patch might add three kinds of regression :
>> >>
>> >> 1) a CPU performance regression, but we will add later page
>> >> recycling and performance should be back.
>> >>
>> >> 2) TCP receiver could grow its receive window slightly slower,
>> >>because skb->len/skb->truesize ratio will decrease.
>> >>This is mostly ok, we prefer being conservative to not risk OOM,
>> >>and eventually tune TCP better in the future.
>> >>This is consistent with other drivers using 2048 per ethernet frame.
>> >>
>> >> 3) Because we allocate one page per RX slot, we consume more
>> >>memory for the ring buffers. XDP already had this constraint anyway.
>> >>
>> >> Signed-off-by: Eric Dumazet 
>> >> ---
>> >
>> > Note that we also could use a different strategy.
>> >
>> > Assume RX rings of 4096 entries/slots.
>> >
>> > With this patch, mlx4 gets the strategy used by Alexander in Intel
>> > drivers :
>> >
>> > Each RX slot has an allocated page, and uses half of it, flipping to the
>> > other half every time the slot is used.
>> >
>> > So a ring buffer of 4096 slots allocates 4096 pages.
>> >
>> > When we receive a packet train for the same flow, GRO builds an skb with
>> > ~45 page frags, all from different pages.
>> >
>> > The put_page() done from skb_release_data() touches ~45 different struct
>> > page cache lines, and shows a high cost. (compared to the order-3 used
>> > today by mlx4, this adds extra cache line misses and stalls for the
>> > consumer)
>> >
>> > If we instead try to use the two halves of one page on consecutive RX
>> > slots, we might instead cook skb with the same number of MSS (45), but
>> > half the number of cache lines for put_page(), so we should speed up the
>> > consumer.
>>
>> So there is a problem that is being overlooked here.  That is the cost
>> of the DMA map/unmap calls.  The problem is many PowerPC systems have
>> an IOMMU that you have to work around, and that IOMMU comes at a heavy
>> cost for every map/unmap call.  So unless you are saying you want to
>> set up a hybrid between the mlx5 and this approach where we have a page
>> cache that these all fall back into, you will take a heavy cost for
>> having to map and unmap pages.
>>
>> The whole reason why I implemented the Intel page reuse approach the
>> way I did is to try and mitigate the IOMMU issue, it wasn't so much to
>> resolve allocator/freeing expense.  Basically the allocator scales,
>> the IOMMU does not.  So any solution would require making certain that
>> we can leave the pages pinned in the DMA to avoid having to take the
>> global locks involved in accessing the IOMMU.
>
>
> I do not see any difference for the fact that we keep pages mapped the
> same way.
>
> mlx4_en_complete_rx_desc() will still use the :
>
> dma_sync_single_range_for_cpu(priv->ddev, dma, frags->page_offset,
>   frag_size, priv->dma_dir);
>
> for every single MSS we receive.
>
> This won't change.

Right but you were talking about using both halves one after the
other.  If that occurs you have nothing left that you can reuse.  That
was what I was getting at.  If you use up both halves you end up
having to unmap the page.

The whole idea behind using only half the page per descriptor is to
allow us to loop through the ring before we end up reusing it again.
That buys us enough time that usually the stack has consumed the frame
before we need it again.

- Alex


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-22 Thread Eric Dumazet
On Wed, 2017-02-22 at 09:23 -0800, Alexander Duyck wrote:
> On Wed, Feb 22, 2017 at 8:22 AM, Eric Dumazet  wrote:
> > On Mon, 2017-02-13 at 11:58 -0800, Eric Dumazet wrote:
> >> Use of order-3 pages is problematic in some cases.
> >>
> >> This patch might add three kinds of regression :
> >>
> >> 1) a CPU performance regression, but we will add later page
> >> recycling and performance should be back.
> >>
> >> 2) TCP receiver could grow its receive window slightly slower,
> >>because skb->len/skb->truesize ratio will decrease.
> >>This is mostly ok, we prefer being conservative to not risk OOM,
> >>and eventually tune TCP better in the future.
> >>This is consistent with other drivers using 2048 per ethernet frame.
> >>
> >> 3) Because we allocate one page per RX slot, we consume more
> >>memory for the ring buffers. XDP already had this constraint anyway.
> >>
> >> Signed-off-by: Eric Dumazet 
> >> ---
> >
> > Note that we also could use a different strategy.
> >
> > Assume RX rings of 4096 entries/slots.
> >
> > With this patch, mlx4 gets the strategy used by Alexander in Intel
> > drivers :
> >
> > Each RX slot has an allocated page, and uses half of it, flipping to the
> > other half every time the slot is used.
> >
> > So a ring buffer of 4096 slots allocates 4096 pages.
> >
> > When we receive a packet train for the same flow, GRO builds an skb with
> > ~45 page frags, all from different pages.
> >
> > The put_page() done from skb_release_data() touches ~45 different struct
> > page cache lines, and shows a high cost. (compared to the order-3 used
> > today by mlx4, this adds extra cache line misses and stalls for the
> > consumer)
> >
> > If we instead try to use the two halves of one page on consecutive RX
> > slots, we might instead cook skb with the same number of MSS (45), but
> > half the number of cache lines for put_page(), so we should speed up the
> > consumer.
> 
> So there is a problem that is being overlooked here.  That is the cost
> of the DMA map/unmap calls.  The problem is many PowerPC systems have
> an IOMMU that you have to work around, and that IOMMU comes at a heavy
> cost for every map/unmap call.  So unless you are saying you want to
> set up a hybrid between the mlx5 and this approach where we have a page
> cache that these all fall back into, you will take a heavy cost for
> having to map and unmap pages.
> 
> The whole reason why I implemented the Intel page reuse approach the
> way I did is to try and mitigate the IOMMU issue, it wasn't so much to
> resolve allocator/freeing expense.  Basically the allocator scales,
> the IOMMU does not.  So any solution would require making certain that
> we can leave the pages pinned in the DMA to avoid having to take the
> global locks involved in accessing the IOMMU.


I do not see any difference for the fact that we keep pages mapped the
same way.

mlx4_en_complete_rx_desc() will still use the :

dma_sync_single_range_for_cpu(priv->ddev, dma, frags->page_offset,
  frag_size, priv->dma_dir);

for every single MSS we receive.

This won't change.




RE: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-22 Thread David Laight
From: Alexander Duyck
> Sent: 22 February 2017 17:24
...
> So there is a problem that is being overlooked here.  That is the cost
> of the DMA map/unmap calls.  The problem is many PowerPC systems have
> an IOMMU that you have to work around, and that IOMMU comes at a heavy
> cost for every map/unmap call.  So unless you are saying you want to
> set up a hybrid between the mlx5 and this approach where we have a page
> cache that these all fall back into, you will take a heavy cost for
> having to map and unmap pages.
..

I can't help feeling that you need to look at how to get the IOMMU
code to reuse pages, rather than the ethernet driver.
Maybe something like:

1) The driver requests a mapped receive buffer from the iommu.
This might give it memory that is already mapped but not in use.

2) When the receive completes the driver tells the iommu the mapping
is no longer needed. The iommu is not (yet) changed.

3) When the skb is freed the iommu is told that the buffer can be freed.

4) At (1), if the driver is using too many IOMMU resources, then the mappings
for completed receives can be removed to free up IOMMU space.

Probably not as simple as it looks :-)

David
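
Purely as an illustration of the interface sketched above; nothing like this
exists in the DMA API today, and every name below is hypothetical:

#include <linux/types.h>

struct device;

/* Hypothetical per-device pool of DMA-mapped receive buffers managed at
 * the IOMMU level, so mappings are reused instead of being torn down on
 * every receive.
 */
struct iommu_rx_pool_ops {
	/* (1) hand out a mapped buffer, preferring one already mapped but idle */
	void *(*get_mapped_buf)(struct device *dev, size_t len, dma_addr_t *dma);

	/* (2) receive completed: the mapping may stay in place for reuse */
	void (*buf_rx_done)(struct device *dev, dma_addr_t dma);

	/* (3) the skb has been freed: the buffer (and eventually its mapping) can go */
	void (*buf_free)(struct device *dev, dma_addr_t dma);

	/* (4) under IOMMU-space pressure, reclaim mappings of completed buffers */
	void (*shrink)(struct device *dev, unsigned long nr_to_reclaim);
};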



Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-22 Thread Alexander Duyck
On Wed, Feb 22, 2017 at 8:22 AM, Eric Dumazet  wrote:
> On Mon, 2017-02-13 at 11:58 -0800, Eric Dumazet wrote:
>> Use of order-3 pages is problematic in some cases.
>>
>> This patch might add three kinds of regression :
>>
>> 1) a CPU performance regression, but we will add later page
>> recycling and performance should be back.
>>
>> 2) TCP receiver could grow its receive window slightly slower,
>>because skb->len/skb->truesize ratio will decrease.
>>This is mostly ok, we prefer being conservative to not risk OOM,
>>and eventually tune TCP better in the future.
>>This is consistent with other drivers using 2048 per ethernet frame.
>>
>> 3) Because we allocate one page per RX slot, we consume more
>>memory for the ring buffers. XDP already had this constraint anyway.
>>
>> Signed-off-by: Eric Dumazet 
>> ---
>
> Note that we also could use a different strategy.
>
> Assume RX rings of 4096 entries/slots.
>
> With this patch, mlx4 gets the strategy used by Alexander in Intel
> drivers :
>
> Each RX slot has an allocated page, and uses half of it, flipping to the
> other half every time the slot is used.
>
> So a ring buffer of 4096 slots allocates 4096 pages.
>
> When we receive a packet train for the same flow, GRO builds an skb with
> ~45 page frags, all from different pages.
>
> The put_page() done from skb_release_data() touches ~45 different struct
> page cache lines, and shows a high cost. (compared to the order-3 used
> today by mlx4, this adds extra cache line misses and stalls for the
> consumer)
>
> If we instead try to use the two halves of one page on consecutive RX
> slots, we might instead cook skb with the same number of MSS (45), but
> half the number of cache lines for put_page(), so we should speed up the
> consumer.

So there is a problem that is being overlooked here.  That is the cost
of the DMA map/unmap calls.  The problem is many PowerPC systems have
an IOMMU that you have to work around, and that IOMMU comes at a heavy
cost for every map/unmap call.  So unless you are saying you want to
set up a hybrid between the mlx5 and this approach where we have a page
cache that these all fall back into, you will take a heavy cost for
having to map and unmap pages.

The whole reason why I implemented the Intel page reuse approach the
way I did is to try and mitigate the IOMMU issue, it wasn't so much to
resolve allocator/freeing expense.  Basically the allocator scales,
the IOMMU does not.  So any solution would require making certain that
we can leave the pages pinned in the DMA to avoid having to take the
global locks involved in accessing the IOMMU.

> This means the number of active pages would be minimal, especially on
> PowerPC. Pages that have been used by X=2 received frags would be put in
> a quarantine (size to be determined).
> On PowerPC, X would be PAGE_SIZE/frag_size
>
>
> This strategy would consume less memory on PowerPC :
> 65535/1536 = 42, so a 4096 RX ring would need 98 active pages instead of
> 4096.
>
> The quarantine would be sized to increase chances of reusing an old
> page, without consuming too much memory.
>
> Probably roundup_pow_of_two(rx_ring_size / (PAGE_SIZE/frag_size))
>
> x86 would still use 4096 pages, but PowerPC would use 98+128 pages
> instead of 4096 (14 MBytes instead of 256 MBytes).

So any solution will need to work with an IOMMU enabled on the
platform.  I assume you have some x86 test systems you could run with
an IOMMU enabled.  My advice would be to try running in that
environment and see where the overhead lies.

- Alex


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-22 Thread Eric Dumazet
On Mon, 2017-02-13 at 11:58 -0800, Eric Dumazet wrote:
> Use of order-3 pages is problematic in some cases.
> 
> This patch might add three kinds of regression :
> 
> 1) a CPU performance regression, but we will add later page
> recycling and performance should be back.
> 
> 2) TCP receiver could grow its receive window slightly slower,
>because skb->len/skb->truesize ratio will decrease.
>This is mostly ok, we prefer being conservative to not risk OOM,
>and eventually tune TCP better in the future.
>This is consistent with other drivers using 2048 per ethernet frame.
> 
> 3) Because we allocate one page per RX slot, we consume more
>memory for the ring buffers. XDP already had this constraint anyway.
> 
> Signed-off-by: Eric Dumazet 
> ---

Note that we also could use a different strategy.

Assume RX rings of 4096 entries/slots.

With this patch, mlx4 gets the strategy used by Alexander in Intel
drivers : 

Each RX slot has an allocated page, and uses half of it, flipping to the
other half every time the slot is used.

So a ring buffer of 4096 slots allocates 4096 pages.

When we receive a packet train for the same flow, GRO builds an skb with
~45 page frags, all from different pages.

The put_page() done from skb_release_data() touches ~45 different struct
page cache lines, and shows a high cost. (compared to the order-3 used
today by mlx4, this adds extra cache line misses and stalls for the
consumer)

If we instead try to use the two halves of one page on consecutive RX
slots, we might instead cook skb with the same number of MSS (45), but
half the number of cache lines for put_page(), so we should speed up the
consumer.

This means the number of active pages would be minimal, especially on
PowerPC. Pages that have been used by X=2 received frags would be put in
a quarantine (size to be determined).
On PowerPC, X would be PAGE_SIZE/frag_size


This strategy would consume less memory on PowerPC : 
65535/1536 = 42, so a 4096 RX ring would need 98 active pages instead of
4096.

The quarantine would be sized to increase chances of reusing an old
page, without consuming too much memory.

Probably roundup_pow_of_two(rx_ring_size / (PAGE_SIZE/frag_size))

x86 would still use 4096 pages, but PowerPC would use 98+128 pages
instead of 4096 (14 MBytes instead of 256 MBytes).
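
The arithmetic behind those figures, as a small sketch (the constants come
from the example above; this is not driver code):

#include <linux/kernel.h>	/* DIV_ROUND_UP, pr_info */
#include <linux/log2.h>		/* roundup_pow_of_two */

/* frag_size = 1536, rx_ring_size = 4096, as in the example above */
static void rx_page_budget(unsigned int rx_ring_size, unsigned int page_size,
			   unsigned int frag_size)
{
	unsigned int frags_per_page = page_size / frag_size;
	unsigned int active = DIV_ROUND_UP(rx_ring_size, frags_per_page);
	unsigned int reserve = roundup_pow_of_two(rx_ring_size / frags_per_page);

	/* x86   (4K pages):  frags_per_page = 2,  active = 2048, reserve = 2048
	 * ppc64 (64K pages): frags_per_page = 42, active = 98,   reserve = 128
	 *                    -> 98 + 128 pages instead of 4096
	 */
	pr_info("active pages=%u, quarantine=%u\n", active, reserve);
}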





Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-16 Thread Saeed Mahameed
On Thu, Feb 16, 2017 at 9:03 PM, David Miller  wrote:
> From: Tom Herbert 
> Date: Thu, 16 Feb 2017 09:05:26 -0800
>
>> On Thu, Feb 16, 2017 at 5:08 AM, Tariq Toukan  wrote:
>>>
>>> On 15/02/2017 6:57 PM, Eric Dumazet wrote:

 On Wed, Feb 15, 2017 at 8:42 AM, Tariq Toukan  wrote:
>
> Isn't it the same principle in page_frag_alloc() ?
> It is called from __netdev_alloc_skb()/__napi_alloc_skb().
>
> Why is it ok to have order-3 pages (PAGE_FRAG_CACHE_MAX_ORDER) there?

 This is not ok.

 This is a very well known problem, we already mentioned that here in the
 past,
 but at least core networking stack uses  order-0 pages on PowerPC.
>>>
>>> You're right, we should have done this as well in mlx4 on PPC.

 mlx4 driver suffers from this problem 100% more than other drivers ;)

 One problem at a time Tariq. Right now, only mlx4 has this big problem
 compared to other NIC.
>>>
>>> We _do_ agree that the series improves the driver's quality, stability,
>>> and performance in a fragmented system.
>>>
>>> But due to the late rc we're in, and the fact that we know what benchmarks
>>> our customers are going to run, we cannot Ack the series and get it
>>> as is inside kernel 4.11.
>>>
>> You're admitting that Eric's patches improve driver quality,
>> stability, and performance but you're not allowing this in the kernel
>> because "we know what benchmarks our customers are going to run".
>> Sorry, but that is a weak explanation.
>
> I have to agree with Tom and Eric.
>
> If your customers have gotten into the habit of using metrics which
> actually do not represent real life performance, that is a completely
> inappropriate reason to not include Eric's changes as-is.

Guys, it is not about customers, benchmarks, and great performance numbers.
We agree with you that those patches are good and needed for the long term,
but as we are already at rc8, and although this change is correct, I think
it is a little bit too late to have such a huge change in the core RX engine
of the driver. (The code has already been like this for more kernel releases
than one could count; it will hurt no one to keep it like this for two more
weeks.)

All we ask is a little bit more time - one or two weeks - to test them and
evaluate the impact. As Eric stated, we don't care whether they make it into
4.11 or 4.12; the idea is to have them in ASAP. So why not wait for 4.11
(just two more weeks)? Tariq already said that he will accept the series as
is, and by then we will be smarter and have a clear plan of the gaps and how
to close them.

For the PowerPC page order issue, Eric already has a simpler suggestion,
which I support, and which can easily be sent to net and -stable.


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-16 Thread Saeed Mahameed
On Thu, Feb 16, 2017 at 7:11 PM, Eric Dumazet  wrote:
>> You're admitting that Eric's patches improve driver quality,
>> stability, and performance but you're not allowing this in the kernel
>> because "we know what benchmarks our customers are going to run".
>
> Note that I do not particularly care if these patches go in 4.11 or 4.12 
> really.
>
> I already backported them into our 4.3 based kernel.
>
> I guess that we could at least propose the trivial patch for stable releases,
> since PowerPC arches really need it.
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index cec59bc264c9ac197048fd7c98bcd5cf25de0efd..0f6d2f3b7d54f51de359d4ccde21f4585e6b7852 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -102,7 +102,8 @@
>  /* Use the maximum between 16384 and a single page */
>  #define MLX4_EN_ALLOC_SIZE	PAGE_ALIGN(16384)
>
> -#define MLX4_EN_ALLOC_PREFER_ORDER	PAGE_ALLOC_COSTLY_ORDER
> +#define MLX4_EN_ALLOC_PREFER_ORDER	min_t(int, get_order(32768),	\
> +					      PAGE_ALLOC_COSTLY_ORDER)
>
>  /* Receive fragment sizes; we use at most 3 fragments (for 9600 byte MTU
>   * and 4K allocations) */

+1


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-16 Thread David Miller
From: Tom Herbert 
Date: Thu, 16 Feb 2017 09:05:26 -0800

> On Thu, Feb 16, 2017 at 5:08 AM, Tariq Toukan  wrote:
>>
>> On 15/02/2017 6:57 PM, Eric Dumazet wrote:
>>>
>>> On Wed, Feb 15, 2017 at 8:42 AM, Tariq Toukan  wrote:

 Isn't it the same principle in page_frag_alloc() ?
 It is called from __netdev_alloc_skb()/__napi_alloc_skb().

 Why is it ok to have order-3 pages (PAGE_FRAG_CACHE_MAX_ORDER) there?
>>>
>>> This is not ok.
>>>
>>> This is a very well known problem, we already mentioned that here in the
>>> past,
>>> but at least core networking stack uses  order-0 pages on PowerPC.
>>
>> You're right, we should have done this as well in mlx4 on PPC.
>>>
>>> mlx4 driver suffers from this problem 100% more than other drivers ;)
>>>
>>> One problem at a time Tariq. Right now, only mlx4 has this big problem
>>> compared to other NIC.
>>
>> We _do_ agree that the series improves the driver's quality, stability,
>> and performance in a fragmented system.
>>
>> But due to the late rc we're in, and the fact that we know what benchmarks
>> our customers are going to run, we cannot Ack the series and get it
>> as is inside kernel 4.11.
>>
> You're admitting that Eric's patches improve driver quality,
> stability, and performance but you're not allowing this in the kernel
> because "we know what benchmarks our customers are going to run".
> Sorry, but that is a weak explanation.

I have to agree with Tom and Eric.

If your customers have gotten into the habit of using metrics which
actually do not represent real life performance, that is a completely
inappropriate reason to not include Eric's changes as-is.


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-16 Thread Eric Dumazet
> You're admitting that Eric's patches improve driver quality,
> stability, and performance but you're not allowing this in the kernel
> because "we know what benchmarks our customers are going to run".

Note that I do not particularly care if these patches go in 4.11 or 4.12 really.

I already backported them into our 4.3 based kernel.

I guess that we could at least propose the trivial patch for stable releases,
since PowerPC arches really need it.

diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index cec59bc264c9ac197048fd7c98bcd5cf25de0efd..0f6d2f3b7d54f51de359d4ccde21f4585e6b7852 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -102,7 +102,8 @@
 /* Use the maximum between 16384 and a single page */
 #define MLX4_EN_ALLOC_SIZE	PAGE_ALIGN(16384)

-#define MLX4_EN_ALLOC_PREFER_ORDER	PAGE_ALLOC_COSTLY_ORDER
+#define MLX4_EN_ALLOC_PREFER_ORDER	min_t(int, get_order(32768),	\
+					      PAGE_ALLOC_COSTLY_ORDER)

 /* Receive fragment sizes; we use at most 3 fragments (for 9600 byte MTU
  * and 4K allocations) */


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-16 Thread Tom Herbert
On Thu, Feb 16, 2017 at 5:08 AM, Tariq Toukan  wrote:
>
> On 15/02/2017 6:57 PM, Eric Dumazet wrote:
>>
>> On Wed, Feb 15, 2017 at 8:42 AM, Tariq Toukan  wrote:
>>>
>>> Isn't it the same principle in page_frag_alloc() ?
>>> It is called from __netdev_alloc_skb()/__napi_alloc_skb().
>>>
>>> Why is it ok to have order-3 pages (PAGE_FRAG_CACHE_MAX_ORDER) there?
>>
>> This is not ok.
>>
>> This is a very well known problem, we already mentioned that here in the
>> past,
>> but at least core networking stack uses  order-0 pages on PowerPC.
>
> You're right, we should have done this as well in mlx4 on PPC.
>>
>> mlx4 driver suffers from this problem 100% more than other drivers ;)
>>
>> One problem at a time Tariq. Right now, only mlx4 has this big problem
>> compared to other NIC.
>
> We _do_ agree that the series improves the driver's quality, stability,
> and performance in a fragmented system.
>
> But due to the late rc we're in, and the fact that we know what benchmarks
> our customers are going to run, we cannot Ack the series and get it
> as is inside kernel 4.11.
>
You're admitting that Eric's patches improve driver quality,
stability, and performance but you're not allowing this in the kernel
because "we know what benchmarks our customers are going to run".
Sorry, but that is a weak explanation.

> We are interested to get your series merged along another perf improvement
> we are preparing for next rc1. This way we will earn the desired stability
> without breaking existing benchmarks.
> I think this is the right thing to do at this point of time.
>
>
> The idea behind the perf improvement, suggested by Jesper, is to split
> the mlx4_en_process_rx_cq() loop (called from napi_poll) into two.
> The first loop extracts completed CQEs and starts prefetching the data
> and RX descriptors. The second loop processes the real packets.
>
>
>>
>> Then, if we _still_ hit major issues, we might also need to force
>> napi_get_frags()
>> to allocate skb->head using kmalloc() instead of a page frag.
>>
>> That is a very simple fix.
>>
>> Remember that we have skb->truesize that is an approximation, it will
>> never be completely accurate,
>> but we need to make it better.
>>
>> mlx4 driver pretends to have a frag truesize of 1536 bytes, but this
>> is obviously wrong when host is under memory pressure
>> (2 frags per page -> truesize should be 2048)
>>
>>
>>> By using netdev/napi_alloc_skb, you'll get that the SKB's linear data is
>>> a
>>> frag of a huge page,
>>> and it is not going to be freed before the other non-linear frags.
>>> Cannot this cause the same threats (memory pinning and so...)?
>>>
>>> Currently, mlx4 doesn't use this generic API, while most other drivers
>>> do.
>>>
>>> Similar claims are true for TX:
>>>
>>> https://github.com/torvalds/linux/commit/5640f7685831e088fe6c2e1f863a6805962f8e81
>>
>> We do not have such problem on TX. GFP_KERNEL allocations do not have
>> the same issues.
>>
>> Tasks are usually not malicious in our DC, and most serious
>> applications use memcg or such memory control.
>
>


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-16 Thread Eric Dumazet
On Thu, 2017-02-16 at 15:08 +0200, Tariq Toukan wrote:
> On 15/02/2017 6:57 PM, Eric Dumazet wrote:
> > On Wed, Feb 15, 2017 at 8:42 AM, Tariq Toukan  wrote:
> >> Isn't it the same principle in page_frag_alloc() ?
> >> It is called from __netdev_alloc_skb()/__napi_alloc_skb().
> >>
> >> Why is it ok to have order-3 pages (PAGE_FRAG_CACHE_MAX_ORDER) there?
> > This is not ok.
> >
> > This is a very well known problem, we already mentioned that here in the 
> > past,
> > but at least core networking stack uses  order-0 pages on PowerPC.
> You're right, we should have done this as well in mlx4 on PPC.
> > mlx4 driver suffers from this problem 100% more than other drivers ;)
> >
> > One problem at a time Tariq. Right now, only mlx4 has this big problem
> > compared to other NIC.
> We _do_ agree that the series improves the driver's quality, stability,
> and performance in a fragmented system.
> 
> But due to the late rc we're in, and the fact that we know what benchmarks
> our customers are going to run, we cannot Ack the series and get it
> as is inside kernel 4.11.
> 
> We are interested to get your series merged along another perf improvement
> we are preparing for next rc1. This way we will earn the desired stability
> without breaking existing benchmarks.
> I think this is the right thing to do at this point of time.
> 
> 
> The idea behind the perf improvement, suggested by Jesper, is to split
> the mlx4_en_process_rx_cq() loop (called from napi_poll) into two.
> The first loop extracts completed CQEs and starts prefetching the data
> and RX descriptors. The second loop processes the real packets.

Make sure to resubmit my patches before anything new.

We need to backport them to stable versions, without XDP, without
anything fancy.

And submit what is needed for 4.11, since current mlx4 driver in
net-next is broken, in case you missed it.

Thanks.




Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-16 Thread Tariq Toukan


On 15/02/2017 6:57 PM, Eric Dumazet wrote:

On Wed, Feb 15, 2017 at 8:42 AM, Tariq Toukan  wrote:

Isn't it the same principle in page_frag_alloc() ?
It is called from __netdev_alloc_skb()/__napi_alloc_skb().

Why is it ok to have order-3 pages (PAGE_FRAG_CACHE_MAX_ORDER) there?

This is not ok.

This is a very well known problem, we already mentioned that here in the past,
but at least core networking stack uses  order-0 pages on PowerPC.

You're right, we should have done this as well in mlx4 on PPC.

mlx4 driver suffers from this problem 100% more than other drivers ;)

One problem at a time Tariq. Right now, only mlx4 has this big problem
compared to other NIC.

We _do_ agree that the series improves the driver's quality, stability,
and performance in a fragmented system.

But due to the late rc we're in, and the fact that we know what benchmarks
our customers are going to run, we cannot Ack the series and get it
as is inside kernel 4.11.

We are interested to get your series merged along another perf improvement
we are preparing for next rc1. This way we will earn the desired stability
without breaking existing benchmarks.
I think this is the right thing to do at this point of time.


The idea behind the perf improvement, suggested by Jesper, is to split
the mlx4_en_process_rx_cq() loop (called from napi_poll) into two.
The first loop extracts completed CQEs and starts prefetching the data
and RX descriptors. The second loop processes the real packets.
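
A schematic of that two-pass idea, with hypothetical helpers and structures
standing in for the real mlx4 code:

#include <linux/netdevice.h>	/* NAPI_POLL_WEIGHT */
#include <linux/prefetch.h>

/* Hypothetical pieces, declared only so the sketch is self-contained */
struct rx_cqe;
struct rx_cq_sketch {
	u32		cons;
	struct rx_cqe	*pending[NAPI_POLL_WEIGHT];	/* assumes budget <= NAPI_POLL_WEIGHT */
};
bool cqe_completed(struct rx_cq_sketch *cq, u32 index);
struct rx_cqe *cqe_at(struct rx_cq_sketch *cq, u32 index);
void *cqe_data(struct rx_cqe *cqe);
void *cqe_rx_desc(struct rx_cqe *cqe);
void process_rx_packet(struct rx_cq_sketch *cq, struct rx_cqe *cqe);

/* Pass 1 walks the completed CQEs and issues prefetches; pass 2 does the
 * real per-packet work on cache lines that are now warm.
 */
static int rx_cq_poll_two_pass(struct rx_cq_sketch *cq, int budget)
{
	int n = 0, i;

	while (n < budget && cqe_completed(cq, cq->cons + n)) {
		struct rx_cqe *cqe = cqe_at(cq, cq->cons + n);

		prefetch(cqe_data(cqe));	/* packet headers */
		prefetch(cqe_rx_desc(cqe));	/* RX descriptor / frag info */
		cq->pending[n++] = cqe;
	}

	for (i = 0; i < n; i++)
		process_rx_packet(cq, cq->pending[i]);

	cq->cons += n;
	return n;
}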



Then, if we _still_ hit major issues, we might also need to force
napi_get_frags()
to allocate skb->head using kmalloc() instead of a page frag.

That is a very simple fix.

Remember that we have skb->truesize that is an approximation, it will
never be completely accurate,
but we need to make it better.

mlx4 driver pretends to have a frag truesize of 1536 bytes, but this
is obviously wrong when host is under memory pressure
(2 frags per page -> truesize should be 2048)



By using netdev/napi_alloc_skb, you'll get that the SKB's linear data is a
frag of a huge page,
and it is not going to be freed before the other non-linear frags.
Cannot this cause the same threats (memory pinning and so...)?

Currently, mlx4 doesn't use this generic API, while most other drivers do.

Similar claims are true for TX:
https://github.com/torvalds/linux/commit/5640f7685831e088fe6c2e1f863a6805962f8e81

We do not have such problem on TX. GFP_KERNEL allocations do not have
the same issues.

Tasks are usually not malicious in our DC, and most serious
applications use memcg or such memory control.




Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-15 Thread Eric Dumazet
On Wed, Feb 15, 2017 at 8:42 AM, Tariq Toukan  wrote:
>

>
> Isn't it the same principle in page_frag_alloc() ?
> It is called from __netdev_alloc_skb()/__napi_alloc_skb().
>
> Why is it ok to have order-3 pages (PAGE_FRAG_CACHE_MAX_ORDER) there?

This is not ok.

This is a very well known problem, we already mentioned that here in the past,
but at least core networking stack uses  order-0 pages on PowerPC.

mlx4 driver suffers from this problem 100% more than other drivers ;)

One problem at a time Tariq. Right now, only mlx4 has this big problem
compared to other NIC.

Then, if we _still_ hit major issues, we might also need to force
napi_get_frags()
to allocate skb->head using kmalloc() instead of a page frag.

That is a very simple fix.

Remember that we have skb->truesize that is an approximation, it will
never be completely accurate,
but we need to make it better.

mlx4 driver pretends to have a frag truesize of 1536 bytes, but this
is obviously wrong when host is under memory pressure
(2 frags per page -> truesize should be 2048)


> By using netdev/napi_alloc_skb, you'll get that the SKB's linear data is a
> frag of a huge page,
> and it is not going to be freed before the other non-linear frags.
> Cannot this cause the same threats (memory pinning and so...)?
>
> Currently, mlx4 doesn't use this generic API, while most other drivers do.
>
> Similar claims are true for TX:
> https://github.com/torvalds/linux/commit/5640f7685831e088fe6c2e1f863a6805962f8e81

We do not have such problem on TX. GFP_KERNEL allocations do not have
the same issues.

Tasks are usually not malicious in our DC, and most serious
applications use memcg or such memory control.


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-15 Thread Tariq Toukan



On 14/02/2017 7:29 PM, Tom Herbert wrote:

On Tue, Feb 14, 2017 at 7:51 AM, Eric Dumazet  wrote:

On Tue, 2017-02-14 at 16:56 +0200, Tariq Toukan wrote:


As the previous series caused hangs, we must run functional regression
tests over this series as well.
Run has already started, and results will be available tomorrow morning.

In general, I really like this series. The re-factorization looks more
elegant and more correct, functionally.

However, performance-wise, we fear that the numbers will be drastically
lower with this transition to order-0 pages, because of the (increasingly
critical) page allocator and DMA operation bottlenecks, especially on
systems with costly DMA operations, such as ARM, iommu=on, etc.


So, again, performance after this patch series is higher,
once you have sensible RX queue parameters for the expected workload.

Only in pathological cases, you might have some regression.

The old scheme was _maybe_ better _when_ memory is not fragmented.

When you run hosts for months, memory _is_ fragmented.

You never see that on benchmarks, unless you force memory being
fragmented.




We already have this exact issue in mlx5, where we moved to order-0
allocations with a fixed size cache, but that was not enough.
Customers of mlx5 have already complained about the performance
degradation, and currently this is hurting our business.
We get a clear nack from our performance regression team regarding doing
the same in mlx4.
So, the question is, can we live with this degradation until those
bottleneck challenges are addressed?

Again, there is no degradation.

We have been using order-0 pages for years at Google.

Only when we made the mistake of rebasing onto the upstream driver and its
order-3 pages did we get horrible regressions, causing production outages.

I was silly to believe that the mm layer had gotten better.


Following our perf experts' feedback, I cannot just simply Ack. We need
to have a clear plan to close the perf gap or reduce the impact.

Your perf experts need to talk to me, or any experts at Google and
Facebook, really.


I agree with this 100%! To be blunt, power users like this are testing
your drivers far beyond what Mellanox is doing and understand how
performance gains in benchmarks translate to possible gains in real
production way more than your perf experts can. Listen to Eric!

Tom



Anything _relying_ on order-3 pages being available to impress
friends/customers is a lie.


Isn't it the same principle in page_frag_alloc() ?
It is called from __netdev_alloc_skb()/__napi_alloc_skb().

Why is it ok to have order-3 pages (PAGE_FRAG_CACHE_MAX_ORDER) there?
By using netdev/napi_alloc_skb, you'll get that the SKB's linear data is 
a frag of a huge page,

and it is not going to be freed before the other non-linear frags.
Cannot this cause the same threats (memory pinning and so...)?

Currently, mlx4 doesn't use this generic API, while most other drivers do.

Similar claims are true for TX:
https://github.com/torvalds/linux/commit/5640f7685831e088fe6c2e1f863a6805962f8e81







Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-14 Thread Eric Dumazet
>
> This obviously does not work for the case I'm talking about
> (transmitting out another device with XDP).
>

XDP_TX does not handle this yet.

When XDP_TX was added, it was very clear that the transmit _had_ to be
done on the same port.

Since all this discussion happened in this thread ( mlx4: use order-0
pages for RX )
I was kind of assuming all the comments were relevant to current code or patch,
not future devs ;)


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-14 Thread Jesper Dangaard Brouer
On Tue, 14 Feb 2017 11:02:01 -0800
Eric Dumazet  wrote:

> On Tue, Feb 14, 2017 at 10:46 AM, Jesper Dangaard Brouer
>  wrote:
> >  
> 
> >
> > With this Intel driver page count based recycle approach, the recycle
> > size is tied to the size of the RX ring.  As Eric and Tariq discovered.
> > And for other performance reasons (memory footprint of walking RX ring
> > data-structures), we don't want to increase the RX ring sizes.  Thus, it
> > creates two opposing performance needs.  That is why I think a more
> > explicit approach with a pool is more attractive.
> >
> > How is this approach doing to work for XDP?
> > (XDP doesn't "share" the page, and in-general we don't want the extra
> > atomic.)
> >
> > We absolutely need recycling with XDP, when transmitting out another
> > device, and the other device's DMA-TX completion needs some way of
> > returning this page.
> > What is basically needed is a standardized callback to allow the remote
> > driver to return the page to the originating driver.  As we don't have
> > a NDP for XDP-forward/transmit yet, we could pass this callback as a
> > parameter along with the packet-page to send?
> >
> >  
> 
> 
> mlx4 already has a cache for XDP.
> I believe I did not change this part, it still should work.
> 
> commit d576acf0a22890cf3f8f7a9b035f1558077f6770
> Author: Brenden Blanco 
> Date:   Tue Jul 19 12:16:52 2016 -0700
> 
> net/mlx4_en: add page recycle to prepare rx ring for tx support

This obviously does not work for the case I'm talking about
(transmitting out another device with XDP).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-14 Thread David Miller
From: Jesper Dangaard Brouer 
Date: Tue, 14 Feb 2017 20:38:22 +0100

> On Tue, 14 Feb 2017 12:04:26 -0500 (EST)
> David Miller  wrote:
> 
>> One path I see around all of this is full integration.  Meaning that
>> we can free pages into the page allocator which are still DMA mapped.
>> And future allocations from that device are prioritized to take still
>> DMA mapped objects.
> 
> I like this idea.  Are you saying that this should be done per DMA
> engine or per device?
> 
> If this is per device, it is almost the page_pool idea.  

Per-device is simplest, at least at first.

Maybe later down the road we can try to pool by "mapping entity" be
that a parent IOMMU or something else like a hypervisor managed
page table.


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-14 Thread Jesper Dangaard Brouer
On Tue, 14 Feb 2017 11:06:25 -0800
Alexander Duyck  wrote:

> On Tue, Feb 14, 2017 at 10:46 AM, Jesper Dangaard Brouer
>  wrote:
> > On Tue, 14 Feb 2017 09:29:54 -0800
> > Alexander Duyck  wrote:
> >  
> >> On Tue, Feb 14, 2017 at 6:56 AM, Tariq Toukan  
> >> wrote:  
> >> >
> >> >
> >> > On 14/02/2017 3:45 PM, Eric Dumazet wrote:  
> >> >>
> >> >> On Tue, Feb 14, 2017 at 4:12 AM, Jesper Dangaard Brouer
> >> >>  wrote:
> >> >>  
> >> >>> It is important to understand that there are two cases for the cost of
> >> >>> an atomic op, which depend on the cache-coherency state of the
> >> >>> cacheline.
> >> >>>
> >> >>> Measured on Skylake CPU i7-6700K CPU @ 4.00GHz
> >> >>>
> >> >>> (1) Local CPU atomic op :  27 cycles(tsc)  6.776 ns
> >> >>> (2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns
> >> >>>  
> >> >> Okay, it seems you guys really want a patch that I said was not giving
> >> >> good results
> >> >>
> >> >> Let me publish the numbers I get , adding or not the last (and not
> >> >> official) patch.
> >> >>
> >> >> If I _force_ the user space process to run on the other node,
> >> >> then the results are not the ones Alex or you are expecting.
> >> >>
> >> >> I have with this patch about 2.7 Mpps of this silly single TCP flow,
> >> >> and 3.5 Mpps without it.
> >> >>
> >> >> lpaa24:~# sar -n DEV 1 10 | grep eth0 | grep Ave
> >> >> Average: eth0 2699243.20  16663.70 1354783.36   1079.95
> >> >> 0.00  0.00  4.50
> >> >>
> >> >> Profile of the cpu on NUMA node 1 ( netserver consuming data ) :
> >> >>
> >> >>  54.73%  [kernel]  [k] copy_user_enhanced_fast_string
> >> >>  31.07%  [kernel]  [k] skb_release_data
> >> >>   4.24%  [kernel]  [k] skb_copy_datagram_iter
> >> >>   1.35%  [kernel]  [k] copy_page_to_iter
> >> >>   0.98%  [kernel]  [k] _raw_spin_lock
> >> >>   0.90%  [kernel]  [k] skb_release_head_state
> >> >>   0.60%  [kernel]  [k] tcp_transmit_skb
> >> >>   0.51%  [kernel]  [k] mlx4_en_xmit
> >> >>   0.33%  [kernel]  [k] ___cache_free
> >> >>   0.28%  [kernel]  [k] tcp_rcv_established
> >> >>
> >> >> Profile of cpu handling mlx4 softirqs (NUMA node 0)
> >> >>
> >> >>
> >> >>  48.00%  [kernel]  [k] mlx4_en_process_rx_cq
> >> >>  12.92%  [kernel]  [k] napi_gro_frags
> >> >>   7.28%  [kernel]  [k] inet_gro_receive
> >> >>   7.17%  [kernel]  [k] tcp_gro_receive
> >> >>   5.10%  [kernel]  [k] dev_gro_receive
> >> >>   4.87%  [kernel]  [k] skb_gro_receive
> >> >>   2.45%  [kernel]  [k] mlx4_en_prepare_rx_desc
> >> >>   2.04%  [kernel]  [k] __build_skb
> >> >>   1.02%  [kernel]  [k] napi_reuse_skb.isra.95
> >> >>   1.01%  [kernel]  [k] tcp4_gro_receive
> >> >>   0.65%  [kernel]  [k] kmem_cache_alloc
> >> >>   0.45%  [kernel]  [k] _raw_spin_lock
> >> >>
> >> >> Without the latest  patch (the exact patch series v3 I submitted),
> >> >> thus with this atomic_inc() in mlx4_en_process_rx_cq  instead of only
> >> >> reads.
> >> >>
> >> >> lpaa24:~# sar -n DEV 1 10|grep eth0|grep Ave
> >> >> Average: eth0 3566768.50  25638.60 1790345.69   1663.51
> >> >> 0.00  0.00  4.50
> >> >>
> >> >> Profiles of the two cpus :
> >> >>
> >> >>  74.85%  [kernel]  [k] copy_user_enhanced_fast_string
> >> >>   6.42%  [kernel]  [k] skb_release_data
> >> >>   5.65%  [kernel]  [k] skb_copy_datagram_iter
> >> >>   1.83%  [kernel]  [k] copy_page_to_iter
> >> >>   1.59%  [kernel]  [k] _raw_spin_lock
> >> >>   1.48%  [kernel]  [k] skb_release_head_state
> >> >>   0.72%  [kernel]  [k] tcp_transmit_skb
> >> >>   0.68%  [kernel]  [k] mlx4_en_xmit
> >> >>   0.43%  [kernel]  [k] page_frag_free
> >> >>   0.38%  [kernel]  [k] ___cache_free
> >> >>   0.37%  [kernel]  [k] tcp_established_options
> >> >>   0.37%  [kernel]  [k] __ip_local_out
> >> >>
> >> >>
> >> >> 37.98%  [kernel]  [k] mlx4_en_process_rx_cq
> >> >>  26.47%  [kernel]  [k] napi_gro_frags
> >> >>   7.02%  [kernel]  [k] inet_gro_receive
> >> >>   5.89%  [kernel]  [k] tcp_gro_receive
> >> >>   5.17%  [kernel]  [k] dev_gro_receive
> >> >>   4.80%  [kernel]  [k] skb_gro_receive
> >> >>   2.61%  [kernel]  [k] __build_skb
> >> >>   2.45%  [kernel]  [k] mlx4_en_prepare_rx_desc
> >> >>   1.59%  [kernel]  [k] napi_reuse_skb.isra.95
> >> >>   0.95%  [kernel]  [k] tcp4_gro_receive
> >> >>   0.51%  [kernel]  [k] kmem_cache_alloc
> >> >>   0.42%  [kernel]  [k] __inet_lookup_established
> >> >>   0.34%  [kernel]  [k] swiotlb_sync_single_for_cpu
> >> >>
> >> >>
> >> >> So probably 

Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-14 Thread Jesper Dangaard Brouer
On Tue, 14 Feb 2017 12:04:26 -0500 (EST)
David Miller  wrote:

> From: Tariq Toukan 
> Date: Tue, 14 Feb 2017 16:56:49 +0200
> 
> > Internally, I already implemented "dynamic page-cache" and
> > "page-reuse" mechanisms in the driver, and together they totally
> > bridge the performance gap.  

It sounds like you basically implemented a page_pool scheme...

> I worry about a dynamically growing page cache inside of drivers
> because it is invisible to the rest of the kernel.

Exactly, that is why I wanted a separate standardized thing, I call the
page_pool, which is part of the MM-tree and interacts with the page
allocator.  E.g. it must implement/support a way the page allocator can
reclaim pages from it (admit I didn't implement this in RFC patches).


> It responds only to local needs.

Generally true, but a side effect of recycling these pages, result in
less fragmentation of the page allocator/buddy system.


> The price of the real page allocator comes partly because it can
> respond to global needs.
> 
> If a driver consumes some unreasonable percentage of system memory, it
> is keeping that memory from being used from other parts of the system
> even if it would be better for networking to be slightly slower with
> less cache because that other thing that needs memory is more
> important.

(That is why I want to have OOM protection at device level, with the
recycle feedback from page pool we have this knowledge, and further I
wanted to allow blocking a specific RX queue[1])
[1] 
https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html#userspace-delivery-and-oom


> I think this is one of the primary reasons that the MM guys severely
> chastise us when we build special purpose local caches into networking
> facilities.
> 
> And the more I think about it the more I think they are right.

+1
 
> One path I see around all of this is full integration.  Meaning that
> we can free pages into the page allocator which are still DMA mapped.
> And future allocations from that device are prioritized to take still
> DMA mapped objects.

I like this idea.  Are you saying that this should be done per DMA
engine or per device?

If this is per device, it is almost the page_pool idea.  

 
> Yes, we still need to make the page allocator faster, but this kind of
> work helps everyone not just 100GB ethernet NICs.

True.  And Mel already have some generic improvements to the page
allocator queued for the next merge.  And I have the responsibility to
get the bulking API into shape.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-14 Thread Alexander Duyck
On Tue, Feb 14, 2017 at 10:46 AM, Jesper Dangaard Brouer
 wrote:
> On Tue, 14 Feb 2017 09:29:54 -0800
> Alexander Duyck  wrote:
>
>> On Tue, Feb 14, 2017 at 6:56 AM, Tariq Toukan  
>> wrote:
>> >
>> >
>> > On 14/02/2017 3:45 PM, Eric Dumazet wrote:
>> >>
>> >> On Tue, Feb 14, 2017 at 4:12 AM, Jesper Dangaard Brouer
>> >>  wrote:
>> >>
>> >>> It is important to understand that there are two cases for the cost of
>> >>> an atomic op, which depend on the cache-coherency state of the
>> >>> cacheline.
>> >>>
>> >>> Measured on Skylake CPU i7-6700K CPU @ 4.00GHz
>> >>>
>> >>> (1) Local CPU atomic op :  27 cycles(tsc)  6.776 ns
>> >>> (2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns
>> >>>
>> >> Okay, it seems you guys really want a patch that I said was not giving
>> >> good results
>> >>
>> >> Let me publish the numbers I get , adding or not the last (and not
>> >> official) patch.
>> >>
>> >> If I _force_ the user space process to run on the other node,
>> >> then the results are not the ones Alex or you are expecting.
>> >>
>> >> I have with this patch about 2.7 Mpps of this silly single TCP flow,
>> >> and 3.5 Mpps without it.
>> >>
>> >> lpaa24:~# sar -n DEV 1 10 | grep eth0 | grep Ave
>> >> Average: eth0 2699243.20  16663.70 1354783.36   1079.95
>> >> 0.00  0.00  4.50
>> >>
>> >> Profile of the cpu on NUMA node 1 ( netserver consuming data ) :
>> >>
>> >>  54.73%  [kernel]  [k] copy_user_enhanced_fast_string
>> >>  31.07%  [kernel]  [k] skb_release_data
>> >>   4.24%  [kernel]  [k] skb_copy_datagram_iter
>> >>   1.35%  [kernel]  [k] copy_page_to_iter
>> >>   0.98%  [kernel]  [k] _raw_spin_lock
>> >>   0.90%  [kernel]  [k] skb_release_head_state
>> >>   0.60%  [kernel]  [k] tcp_transmit_skb
>> >>   0.51%  [kernel]  [k] mlx4_en_xmit
>> >>   0.33%  [kernel]  [k] ___cache_free
>> >>   0.28%  [kernel]  [k] tcp_rcv_established
>> >>
>> >> Profile of cpu handling mlx4 softirqs (NUMA node 0)
>> >>
>> >>
>> >>  48.00%  [kernel]  [k] mlx4_en_process_rx_cq
>> >>  12.92%  [kernel]  [k] napi_gro_frags
>> >>   7.28%  [kernel]  [k] inet_gro_receive
>> >>   7.17%  [kernel]  [k] tcp_gro_receive
>> >>   5.10%  [kernel]  [k] dev_gro_receive
>> >>   4.87%  [kernel]  [k] skb_gro_receive
>> >>   2.45%  [kernel]  [k] mlx4_en_prepare_rx_desc
>> >>   2.04%  [kernel]  [k] __build_skb
>> >>   1.02%  [kernel]  [k] napi_reuse_skb.isra.95
>> >>   1.01%  [kernel]  [k] tcp4_gro_receive
>> >>   0.65%  [kernel]  [k] kmem_cache_alloc
>> >>   0.45%  [kernel]  [k] _raw_spin_lock
>> >>
>> >> Without the latest  patch (the exact patch series v3 I submitted),
>> >> thus with this atomic_inc() in mlx4_en_process_rx_cq  instead of only
>> >> reads.
>> >>
>> >> lpaa24:~# sar -n DEV 1 10|grep eth0|grep Ave
>> >> Average: eth0 3566768.50  25638.60 1790345.69   1663.51
>> >> 0.00  0.00  4.50
>> >>
>> >> Profiles of the two cpus :
>> >>
>> >>  74.85%  [kernel]  [k] copy_user_enhanced_fast_string
>> >>   6.42%  [kernel]  [k] skb_release_data
>> >>   5.65%  [kernel]  [k] skb_copy_datagram_iter
>> >>   1.83%  [kernel]  [k] copy_page_to_iter
>> >>   1.59%  [kernel]  [k] _raw_spin_lock
>> >>   1.48%  [kernel]  [k] skb_release_head_state
>> >>   0.72%  [kernel]  [k] tcp_transmit_skb
>> >>   0.68%  [kernel]  [k] mlx4_en_xmit
>> >>   0.43%  [kernel]  [k] page_frag_free
>> >>   0.38%  [kernel]  [k] ___cache_free
>> >>   0.37%  [kernel]  [k] tcp_established_options
>> >>   0.37%  [kernel]  [k] __ip_local_out
>> >>
>> >>
>> >> 37.98%  [kernel]  [k] mlx4_en_process_rx_cq
>> >>  26.47%  [kernel]  [k] napi_gro_frags
>> >>   7.02%  [kernel]  [k] inet_gro_receive
>> >>   5.89%  [kernel]  [k] tcp_gro_receive
>> >>   5.17%  [kernel]  [k] dev_gro_receive
>> >>   4.80%  [kernel]  [k] skb_gro_receive
>> >>   2.61%  [kernel]  [k] __build_skb
>> >>   2.45%  [kernel]  [k] mlx4_en_prepare_rx_desc
>> >>   1.59%  [kernel]  [k] napi_reuse_skb.isra.95
>> >>   0.95%  [kernel]  [k] tcp4_gro_receive
>> >>   0.51%  [kernel]  [k] kmem_cache_alloc
>> >>   0.42%  [kernel]  [k] __inet_lookup_established
>> >>   0.34%  [kernel]  [k] swiotlb_sync_single_for_cpu
>> >>
>> >>
>> >> So probably this will need further analysis, outside of the scope of
>> >> this patch series.
>> >>
>> >> Could we now please Ack this v3 and merge it ?
>> >>
>> >> Thanks.
>> >
>> > Thanks Eric.
>> >
>> > As the previous series caused hangs, we must run functional regression 
>> > tests
>> > over this series as well.
>> > 

Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-14 Thread Eric Dumazet
On Tue, Feb 14, 2017 at 10:46 AM, Jesper Dangaard Brouer
 wrote:
>

>
> With this Intel driver page count based recycle approach, the recycle
> size is tied to the size of the RX ring.  As Eric and Tariq discovered.
> And for other performance reasons (memory footprint of walking RX ring
> data-structures), don't want to increase the RX ring sizes.  Thus, it
> create two opposite performance needs.  That is why I think a more
> explicit approach with a pool is more attractive.
>
> How is this approach going to work for XDP?
> (XDP doesn't "share" the page, and in-general we don't want the extra
> atomic.)
>
> We absolutely need recycling with XDP, when transmitting out another
> device, and the other devices DMA-TX completion need some way of
> returning this page.
> What is basically needed is a standardized callback to allow the remote
> driver to return the page to the originating driver.  As we don't have
> an NDO for XDP-forward/transmit yet, we could pass this callback as a
> parameter along with the packet-page to send?
>
>


mlx4 already has a cache for XDP.
I believe I did not change this part, it still should work.

commit d576acf0a22890cf3f8f7a9b035f1558077f6770
Author: Brenden Blanco 
Date:   Tue Jul 19 12:16:52 2016 -0700

net/mlx4_en: add page recycle to prepare rx ring for tx support

I have not checked if recent Tom work added core infra for this cache.
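
For reference, a rough sketch of what that recycle cache amounts to
(simplified, not the exact mlx4 code; see struct mlx4_en_page_cache and
MLX4_EN_CACHE_SIZE in mlx4_en.h).  The XDP TX completion path pushes the
page back, the RX refill path pops it, so XDP_TX traffic stays off the
page allocator:

struct xdp_page_cache {
	int index;			/* number of cached pages */
	struct page *buf[128];		/* small fixed-size stack */
};

/* XDP TX completion: try to keep the page */
static bool cache_put(struct xdp_page_cache *cache, struct page *page)
{
	if (cache->index >= ARRAY_SIZE(cache->buf))
		return false;		/* full: caller unmaps and frees */
	cache->buf[cache->index++] = page;
	return true;
}

/* RX descriptor refill: reuse a cached page if we have one */
static struct page *cache_get(struct xdp_page_cache *cache)
{
	if (!cache->index)
		return NULL;		/* empty: fall back to alloc_page() */
	return cache->buf[--cache->index];
}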


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-14 Thread Jesper Dangaard Brouer
On Tue, 14 Feb 2017 09:29:54 -0800
Alexander Duyck  wrote:

> On Tue, Feb 14, 2017 at 6:56 AM, Tariq Toukan  wrote:
> >
> >
> > On 14/02/2017 3:45 PM, Eric Dumazet wrote:  
> >>
> >> On Tue, Feb 14, 2017 at 4:12 AM, Jesper Dangaard Brouer
> >>  wrote:
> >>  
> >>> It is important to understand that there are two cases for the cost of
> >>> an atomic op, which depend on the cache-coherency state of the
> >>> cacheline.
> >>>
> >>> Measured on Skylake CPU i7-6700K CPU @ 4.00GHz
> >>>
> >>> (1) Local CPU atomic op :  27 cycles(tsc)  6.776 ns
> >>> (2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns
> >>>  
> >> Okay, it seems you guys really want a patch that I said was not giving
> >> good results
> >>
> >> Let me publish the numbers I get , adding or not the last (and not
> >> official) patch.
> >>
> >> If I _force_ the user space process to run on the other node,
> >> then the results are not the ones Alex or you are expecting.
> >>
> >> I have with this patch about 2.7 Mpps of this silly single TCP flow,
> >> and 3.5 Mpps without it.
> >>
> >> lpaa24:~# sar -n DEV 1 10 | grep eth0 | grep Ave
> >> Average: eth0 2699243.20  16663.70 1354783.36   1079.95
> >> 0.00  0.00  4.50
> >>
> >> Profile of the cpu on NUMA node 1 ( netserver consuming data ) :
> >>
> >>  54.73%  [kernel]  [k] copy_user_enhanced_fast_string
> >>  31.07%  [kernel]  [k] skb_release_data
> >>   4.24%  [kernel]  [k] skb_copy_datagram_iter
> >>   1.35%  [kernel]  [k] copy_page_to_iter
> >>   0.98%  [kernel]  [k] _raw_spin_lock
> >>   0.90%  [kernel]  [k] skb_release_head_state
> >>   0.60%  [kernel]  [k] tcp_transmit_skb
> >>   0.51%  [kernel]  [k] mlx4_en_xmit
> >>   0.33%  [kernel]  [k] ___cache_free
> >>   0.28%  [kernel]  [k] tcp_rcv_established
> >>
> >> Profile of cpu handling mlx4 softirqs (NUMA node 0)
> >>
> >>
> >>  48.00%  [kernel]  [k] mlx4_en_process_rx_cq
> >>  12.92%  [kernel]  [k] napi_gro_frags
> >>   7.28%  [kernel]  [k] inet_gro_receive
> >>   7.17%  [kernel]  [k] tcp_gro_receive
> >>   5.10%  [kernel]  [k] dev_gro_receive
> >>   4.87%  [kernel]  [k] skb_gro_receive
> >>   2.45%  [kernel]  [k] mlx4_en_prepare_rx_desc
> >>   2.04%  [kernel]  [k] __build_skb
> >>   1.02%  [kernel]  [k] napi_reuse_skb.isra.95
> >>   1.01%  [kernel]  [k] tcp4_gro_receive
> >>   0.65%  [kernel]  [k] kmem_cache_alloc
> >>   0.45%  [kernel]  [k] _raw_spin_lock
> >>
> >> Without the latest  patch (the exact patch series v3 I submitted),
> >> thus with this atomic_inc() in mlx4_en_process_rx_cq  instead of only
> >> reads.
> >>
> >> lpaa24:~# sar -n DEV 1 10|grep eth0|grep Ave
> >> Average: eth0 3566768.50  25638.60 1790345.69   1663.51
> >> 0.00  0.00  4.50
> >>
> >> Profiles of the two cpus :
> >>
> >>  74.85%  [kernel]  [k] copy_user_enhanced_fast_string
> >>   6.42%  [kernel]  [k] skb_release_data
> >>   5.65%  [kernel]  [k] skb_copy_datagram_iter
> >>   1.83%  [kernel]  [k] copy_page_to_iter
> >>   1.59%  [kernel]  [k] _raw_spin_lock
> >>   1.48%  [kernel]  [k] skb_release_head_state
> >>   0.72%  [kernel]  [k] tcp_transmit_skb
> >>   0.68%  [kernel]  [k] mlx4_en_xmit
> >>   0.43%  [kernel]  [k] page_frag_free
> >>   0.38%  [kernel]  [k] ___cache_free
> >>   0.37%  [kernel]  [k] tcp_established_options
> >>   0.37%  [kernel]  [k] __ip_local_out
> >>
> >>
> >> 37.98%  [kernel]  [k] mlx4_en_process_rx_cq
> >>  26.47%  [kernel]  [k] napi_gro_frags
> >>   7.02%  [kernel]  [k] inet_gro_receive
> >>   5.89%  [kernel]  [k] tcp_gro_receive
> >>   5.17%  [kernel]  [k] dev_gro_receive
> >>   4.80%  [kernel]  [k] skb_gro_receive
> >>   2.61%  [kernel]  [k] __build_skb
> >>   2.45%  [kernel]  [k] mlx4_en_prepare_rx_desc
> >>   1.59%  [kernel]  [k] napi_reuse_skb.isra.95
> >>   0.95%  [kernel]  [k] tcp4_gro_receive
> >>   0.51%  [kernel]  [k] kmem_cache_alloc
> >>   0.42%  [kernel]  [k] __inet_lookup_established
> >>   0.34%  [kernel]  [k] swiotlb_sync_single_for_cpu
> >>
> >>
> >> So probably this will need further analysis, outside of the scope of
> >> this patch series.
> >>
> >> Could we now please Ack this v3 and merge it ?
> >>
> >> Thanks.  
> >
> > Thanks Eric.
> >
> > As the previous series caused hangs, we must run functional regression tests
> > over this series as well.
> > Run has already started, and results will be available tomorrow morning.
> >
> > In general, I really like this series. The re-factorization looks more
> > elegant and more correct, functionally.
> >
> > 

Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-14 Thread Alexander Duyck
On Tue, Feb 14, 2017 at 6:56 AM, Tariq Toukan  wrote:
>
>
> On 14/02/2017 3:45 PM, Eric Dumazet wrote:
>>
>> On Tue, Feb 14, 2017 at 4:12 AM, Jesper Dangaard Brouer
>>  wrote:
>>
>>> It is important to understand that there are two cases for the cost of
>>> an atomic op, which depend on the cache-coherency state of the
>>> cacheline.
>>>
>>> Measured on Skylake CPU i7-6700K CPU @ 4.00GHz
>>>
>>> (1) Local CPU atomic op :  27 cycles(tsc)  6.776 ns
>>> (2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns
>>>
>> Okay, it seems you guys really want a patch that I said was not giving
>> good results
>>
>> Let me publish the numbers I get , adding or not the last (and not
>> official) patch.
>>
>> If I _force_ the user space process to run on the other node,
>> then the results are not the ones Alex or you are expecting.
>>
>> I have with this patch about 2.7 Mpps of this silly single TCP flow,
>> and 3.5 Mpps without it.
>>
>> lpaa24:~# sar -n DEV 1 10 | grep eth0 | grep Ave
>> Average: eth0 2699243.20  16663.70 1354783.36   1079.95
>> 0.00  0.00  4.50
>>
>> Profile of the cpu on NUMA node 1 ( netserver consuming data ) :
>>
>>  54.73%  [kernel]  [k] copy_user_enhanced_fast_string
>>  31.07%  [kernel]  [k] skb_release_data
>>   4.24%  [kernel]  [k] skb_copy_datagram_iter
>>   1.35%  [kernel]  [k] copy_page_to_iter
>>   0.98%  [kernel]  [k] _raw_spin_lock
>>   0.90%  [kernel]  [k] skb_release_head_state
>>   0.60%  [kernel]  [k] tcp_transmit_skb
>>   0.51%  [kernel]  [k] mlx4_en_xmit
>>   0.33%  [kernel]  [k] ___cache_free
>>   0.28%  [kernel]  [k] tcp_rcv_established
>>
>> Profile of cpu handling mlx4 softirqs (NUMA node 0)
>>
>>
>>  48.00%  [kernel]  [k] mlx4_en_process_rx_cq
>>  12.92%  [kernel]  [k] napi_gro_frags
>>   7.28%  [kernel]  [k] inet_gro_receive
>>   7.17%  [kernel]  [k] tcp_gro_receive
>>   5.10%  [kernel]  [k] dev_gro_receive
>>   4.87%  [kernel]  [k] skb_gro_receive
>>   2.45%  [kernel]  [k] mlx4_en_prepare_rx_desc
>>   2.04%  [kernel]  [k] __build_skb
>>   1.02%  [kernel]  [k] napi_reuse_skb.isra.95
>>   1.01%  [kernel]  [k] tcp4_gro_receive
>>   0.65%  [kernel]  [k] kmem_cache_alloc
>>   0.45%  [kernel]  [k] _raw_spin_lock
>>
>> Without the latest  patch (the exact patch series v3 I submitted),
>> thus with this atomic_inc() in mlx4_en_process_rx_cq  instead of only
>> reads.
>>
>> lpaa24:~# sar -n DEV 1 10|grep eth0|grep Ave
>> Average: eth0 3566768.50  25638.60 1790345.69   1663.51
>> 0.00  0.00  4.50
>>
>> Profiles of the two cpus :
>>
>>  74.85%  [kernel]  [k] copy_user_enhanced_fast_string
>>   6.42%  [kernel]  [k] skb_release_data
>>   5.65%  [kernel]  [k] skb_copy_datagram_iter
>>   1.83%  [kernel]  [k] copy_page_to_iter
>>   1.59%  [kernel]  [k] _raw_spin_lock
>>   1.48%  [kernel]  [k] skb_release_head_state
>>   0.72%  [kernel]  [k] tcp_transmit_skb
>>   0.68%  [kernel]  [k] mlx4_en_xmit
>>   0.43%  [kernel]  [k] page_frag_free
>>   0.38%  [kernel]  [k] ___cache_free
>>   0.37%  [kernel]  [k] tcp_established_options
>>   0.37%  [kernel]  [k] __ip_local_out
>>
>>
>> 37.98%  [kernel]  [k] mlx4_en_process_rx_cq
>>  26.47%  [kernel]  [k] napi_gro_frags
>>   7.02%  [kernel]  [k] inet_gro_receive
>>   5.89%  [kernel]  [k] tcp_gro_receive
>>   5.17%  [kernel]  [k] dev_gro_receive
>>   4.80%  [kernel]  [k] skb_gro_receive
>>   2.61%  [kernel]  [k] __build_skb
>>   2.45%  [kernel]  [k] mlx4_en_prepare_rx_desc
>>   1.59%  [kernel]  [k] napi_reuse_skb.isra.95
>>   0.95%  [kernel]  [k] tcp4_gro_receive
>>   0.51%  [kernel]  [k] kmem_cache_alloc
>>   0.42%  [kernel]  [k] __inet_lookup_established
>>   0.34%  [kernel]  [k] swiotlb_sync_single_for_cpu
>>
>>
>> So probably this will need further analysis, outside of the scope of
>> this patch series.
>>
>> Could we now please Ack this v3 and merge it ?
>>
>> Thanks.
>
> Thanks Eric.
>
> As the previous series caused hangs, we must run functional regression tests
> over this series as well.
> Run has already started, and results will be available tomorrow morning.
>
> In general, I really like this series. The re-factorization looks more
> elegant and more correct, functionally.
>
> However, performance wise: we fear that the numbers will be drastically
> lower with this transition to order-0 pages,
> because of the (becoming critical) page allocator and dma operations
> bottlenecks, especially on systems with costly
> dma operations, such as ARM, iommu=on, etc...

So to give you an idea I originally came up 

Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-14 Thread Tom Herbert
On Tue, Feb 14, 2017 at 7:51 AM, Eric Dumazet  wrote:
> On Tue, 2017-02-14 at 16:56 +0200, Tariq Toukan wrote:
>
>> As the previous series caused hangs, we must run functional regression
>> tests over this series as well.
>> Run has already started, and results will be available tomorrow morning.
>>
>> In general, I really like this series. The re-factorization looks more
>> elegant and more correct, functionally.
>>
>> However, performance wise: we fear that the numbers will be drastically
>> lower with this transition to order-0 pages,
>> because of the (becoming critical) page allocator and dma operations
>> bottlenecks, especially on systems with costly
>> dma operations, such as ARM, iommu=on, etc...
>>
>
> So, again, performance after this patch series is higher,
> once you have sensible RX queues parameters, for the expected workload.
>
> Only in pathological cases, you might have some regression.
>
> The old scheme was _maybe_ better _when_ memory is not fragmented.
>
> When you run hosts for months, memory _is_ fragmented.
>
> You never see that on benchmarks, unless you force memory being
> fragmented.
>
>
>
>> We already have this exact issue in mlx5, where we moved to order-0
>> allocations with a fixed size cache, but that was not enough.
>> Customers of mlx5 have already complained about the performance
>> degradation, and currently this is hurting our business.
>> We get a clear nack from our performance regression team regarding doing
>> the same in mlx4.
>> So, the question is, can we live with this degradation until those
>> bottleneck challenges are addressed?
>
> Again, there is no degradation.
>
> We have been using order-0 pages for years at Google.
>
> Only when we made the mistake to rebase from the upstream driver and
> order-3 pages we got horrible regressions, causing production outages.
>
> I was silly to believe that mm layer got better.
>
>> Following our perf experts feedback, I cannot just simply Ack. We need
>> to have a clear plan to close the perf gap or reduce the impact.
>
> Your perf experts need to talk to me, or any experts at Google and
> Facebook, really.
>

I agree with this 100%! To be blunt, power users like this are testing
your drivers far beyond what Mellanox is doing and understand how
performance gains in benchmarks translate to possible gains in real
production way more than your perf experts can. Listen to Eric!

Tom


> Anything _relying_ on order-3 pages being available to impress
> friends/customers is a lie.
>
>


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-14 Thread David Miller
From: David Laight 
Date: Tue, 14 Feb 2017 17:17:22 +

> From: David Miller
>> Sent: 14 February 2017 17:04
> ...
>> One path I see around all of this is full integration.  Meaning that
>> we can free pages into the page allocator which are still DMA mapped.
>> And future allocations from that device are prioritized to take still
>> DMA mapped objects.
> ...
> 
> For systems with 'expensive' iommu has anyone tried separating the
> allocation of iommu resource (eg page table slots) from their
> assignment to physical pages?
> 
> Provided the page sizes all match, setting up a receive buffer might
> be as simple as writing the physical address into the iommu slot
> that matches the ring entry.
> 
> Or am I thinking about hardware that is much simpler than real life?

You still will eat an expensive MMIO or hypervisor call to setup the
mapping.

IOMMU is expensive because of two operations, the slot allocation
(which takes locks) and the modification of the IOMMU PTE to setup
or teardown the mapping.

This is why attempts to preallocate slots (which people have looked
into) never really takes off.  You really have to eliminate the
entire operation to get worthwhile gains.



RE: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-14 Thread David Laight
From: David Miller
> Sent: 14 February 2017 17:04
...
> One path I see around all of this is full integration.  Meaning that
> we can free pages into the page allocator which are still DMA mapped.
> And future allocations from that device are prioritized to take still
> DMA mapped objects.
...

For systems with 'expensive' iommu has anyone tried separating the
allocation of iommu resource (eg page table slots) from their
assignment to physical pages?

Provided the page sizes all match, setting up a receive buffer might
be as simple as writing the physical address into the iommu slot
that matches the ring entry.

Or am I thinking about hardware that is much simpler than real life?

David



Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-14 Thread David Miller
From: Tariq Toukan 
Date: Tue, 14 Feb 2017 16:56:49 +0200

> Internally, I already implemented "dynamic page-cache" and
> "page-reuse" mechanisms in the driver, and together they totally
> bridge the performance gap.

I worry about a dynamically growing page cache inside of drivers
because it is invisible to the rest of the kernel.

It responds only to local needs.

The price of the real page allocator comes partly because it can
respond to global needs.

If a driver consumes some unreasonable percentage of system memory, it
is keeping that memory from being used from other parts of the system
even if it would be better for networking to be slightly slower with
less cache because that other thing that needs memory is more
important.

I think this is one of the primary reasons that the MM guys severely
chastise us when we build special purpose local caches into networking
facilities.

And the more I think about it the more I think they are right.

One path I see around all of this is full integration.  Meaning that
we can free pages into the page allocator which are still DMA mapped.
And future allocations from that device are prioritized to take still
DMA mapped objects.

Yes, we still need to make the page allocator faster, but this kind of
work helps everyone not just 100GB ethernet NICs.
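
A hypothetical sketch of the per-device flavour of this idea (no such
kernel API exists today; struct mapped_page_pool and the mapped_pool_*
helpers below are invented, and the usual headers such as linux/ptr_ring.h
and linux/dma-mapping.h are assumed).  The point is that map/unmap leaves
the fast path and only happens when the pool falls back to the real page
allocator:

struct mapped_page_pool {
	struct device	*dev;
	struct ptr_ring	ring;	/* pages kept with a live DMA mapping */
};

static struct page *mapped_pool_get(struct mapped_page_pool *pool,
				    dma_addr_t *dma, gfp_t gfp)
{
	struct page *page = ptr_ring_consume(&pool->ring);

	if (page) {			/* fast path: mapping still valid */
		*dma = (dma_addr_t)page_private(page);
		return page;
	}
	page = alloc_page(gfp);		/* slow path: allocate and map */
	if (!page)
		return NULL;
	*dma = dma_map_page(pool->dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
	if (dma_mapping_error(pool->dev, *dma)) {
		__free_page(page);
		return NULL;
	}
	set_page_private(page, (unsigned long)*dma);
	return page;
}

static void mapped_pool_put(struct mapped_page_pool *pool, struct page *page)
{
	if (!ptr_ring_produce(&pool->ring, page))
		return;			/* kept mapped for the next refill */

	/* pool full: unmap and give the page back for global needs */
	dma_unmap_page(pool->dev, (dma_addr_t)page_private(page),
		       PAGE_SIZE, DMA_FROM_DEVICE);
	__free_page(page);
}

Making the page allocator itself aware of "still mapped for device X", as
suggested above, would let the rest of the system reclaim these pages too,
which a private pool like this cannot offer.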


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-14 Thread Eric Dumazet
On Tue, 2017-02-14 at 16:56 +0200, Tariq Toukan wrote:

> As the previous series caused hangs, we must run functional regression 
> tests over this series as well.
> Run has already started, and results will be available tomorrow morning.
> 
> In general, I really like this series. The re-factorization looks more 
> elegant and more correct, functionally.
> 
> However, performance wise: we fear that the numbers will be drastically 
> lower with this transition to order-0 pages,
> because of the (becoming critical) page allocator and dma operations 
> bottlenecks, especially on systems with costly
> dma operations, such as ARM, iommu=on, etc...
> 

So, again, performance after this patch series is higher,
once you have sensible RX queue parameters for the expected workload.

Only in pathological cases, you might have some regression.

The old scheme was _maybe_ better _when_ memory is not fragmented.

When you run hosts for months, memory _is_ fragmented.

You never see that on benchmarks, unless you force memory being
fragmented.



> We already have this exact issue in mlx5, where we moved to order-0 
> allocations with a fixed size cache, but that was not enough.
> Customers of mlx5 have already complained about the performance 
> degradation, and currently this is hurting our business.
> We get a clear nack from our performance regression team regarding doing 
> the same in mlx4.
> So, the question is, can we live with this degradation until those 
> bottleneck challenges are addressed?

Again, there is no degradation.

We have been using order-0 pages for years at Google.

Only when we made the mistake to rebase from the upstream driver and
order-3 pages we got horrible regressions, causing production outages.

I was silly to believe that mm layer got better.

> Following our perf experts feedback, I cannot just simply Ack. We need 
> to have a clear plan to close the perf gap or reduce the impact.

Your perf experts need to talk to me, or any experts at Google and
Facebook, really.

Anything _relying_ on order-3 pages being available to impress
friends/customers is a lie.




Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-14 Thread Eric Dumazet
> Anything _relying_ on order-3 pages being available to impress
> friends/customers is a lie.
>

BTW, you do understand that on PowerPC right now, an Ethernet frame
holds 65536*8 = half a MByte, right? (PAGE_SIZE is 64 KB there, so an
order-3 allocation is 8 x 64 KB = 512 KB.)

So any PowerPC host using a mlx4 NIC can easily be brought down by
using a few TCP flows and sending out-of-order packets.

That is in itself quite scary.


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-14 Thread Tariq Toukan



On 14/02/2017 3:45 PM, Eric Dumazet wrote:

On Tue, Feb 14, 2017 at 4:12 AM, Jesper Dangaard Brouer
 wrote:


It is important to understand that there are two cases for the cost of
an atomic op, which depend on the cache-coherency state of the
cacheline.

Measured on Skylake CPU i7-6700K CPU @ 4.00GHz

(1) Local CPU atomic op :  27 cycles(tsc)  6.776 ns
(2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns


Okay, it seems you guys really want a patch that I said was not giving
good results

Let me publish the numbers I get , adding or not the last (and not
official) patch.

If I _force_ the user space process to run on the other node,
then the results are not the ones Alex or you are expecting.

I have with this patch about 2.7 Mpps of this silly single TCP flow,
and 3.5 Mpps without it.

lpaa24:~# sar -n DEV 1 10 | grep eth0 | grep Ave
Average: eth0 2699243.20  16663.70 1354783.36   1079.95
0.00  0.00  4.50

Profile of the cpu on NUMA node 1 ( netserver consuming data ) :

 54.73%  [kernel]  [k] copy_user_enhanced_fast_string
 31.07%  [kernel]  [k] skb_release_data
  4.24%  [kernel]  [k] skb_copy_datagram_iter
  1.35%  [kernel]  [k] copy_page_to_iter
  0.98%  [kernel]  [k] _raw_spin_lock
  0.90%  [kernel]  [k] skb_release_head_state
  0.60%  [kernel]  [k] tcp_transmit_skb
  0.51%  [kernel]  [k] mlx4_en_xmit
  0.33%  [kernel]  [k] ___cache_free
  0.28%  [kernel]  [k] tcp_rcv_established

Profile of cpu handling mlx4 softirqs (NUMA node 0)


 48.00%  [kernel]  [k] mlx4_en_process_rx_cq
 12.92%  [kernel]  [k] napi_gro_frags
  7.28%  [kernel]  [k] inet_gro_receive
  7.17%  [kernel]  [k] tcp_gro_receive
  5.10%  [kernel]  [k] dev_gro_receive
  4.87%  [kernel]  [k] skb_gro_receive
  2.45%  [kernel]  [k] mlx4_en_prepare_rx_desc
  2.04%  [kernel]  [k] __build_skb
  1.02%  [kernel]  [k] napi_reuse_skb.isra.95
  1.01%  [kernel]  [k] tcp4_gro_receive
  0.65%  [kernel]  [k] kmem_cache_alloc
  0.45%  [kernel]  [k] _raw_spin_lock

Without the latest  patch (the exact patch series v3 I submitted),
thus with this atomic_inc() in mlx4_en_process_rx_cq  instead of only reads.

lpaa24:~# sar -n DEV 1 10|grep eth0|grep Ave
Average: eth0 3566768.50  25638.60 1790345.69   1663.51
0.00  0.00  4.50

Profiles of the two cpus :

 74.85%  [kernel]  [k] copy_user_enhanced_fast_string
  6.42%  [kernel]  [k] skb_release_data
  5.65%  [kernel]  [k] skb_copy_datagram_iter
  1.83%  [kernel]  [k] copy_page_to_iter
  1.59%  [kernel]  [k] _raw_spin_lock
  1.48%  [kernel]  [k] skb_release_head_state
  0.72%  [kernel]  [k] tcp_transmit_skb
  0.68%  [kernel]  [k] mlx4_en_xmit
  0.43%  [kernel]  [k] page_frag_free
  0.38%  [kernel]  [k] ___cache_free
  0.37%  [kernel]  [k] tcp_established_options
  0.37%  [kernel]  [k] __ip_local_out


37.98%  [kernel]  [k] mlx4_en_process_rx_cq
 26.47%  [kernel]  [k] napi_gro_frags
  7.02%  [kernel]  [k] inet_gro_receive
  5.89%  [kernel]  [k] tcp_gro_receive
  5.17%  [kernel]  [k] dev_gro_receive
  4.80%  [kernel]  [k] skb_gro_receive
  2.61%  [kernel]  [k] __build_skb
  2.45%  [kernel]  [k] mlx4_en_prepare_rx_desc
  1.59%  [kernel]  [k] napi_reuse_skb.isra.95
  0.95%  [kernel]  [k] tcp4_gro_receive
  0.51%  [kernel]  [k] kmem_cache_alloc
  0.42%  [kernel]  [k] __inet_lookup_established
  0.34%  [kernel]  [k] swiotlb_sync_single_for_cpu


So probably this will need further analysis, outside of the scope of
this patch series.

Could we now please Ack this v3 and merge it ?

Thanks.

Thanks Eric.

As the previous series caused hangs, we must run functional regression 
tests over this series as well.

Run has already started, and results will be available tomorrow morning.

In general, I really like this series. The re-factorization looks more 
elegant and more correct, functionally.


However, performance-wise we fear that the numbers will be drastically
lower with this transition to order-0 pages,
because of the page allocator and DMA-mapping bottlenecks (which are
becoming critical), especially on systems with costly
DMA operations, such as ARM, iommu=on, etc...

We already have this exact issue in mlx5, where we moved to order-0 
allocations with a fixed size cache, but that was not enough.
Customers of mlx5 have already complained about the performance 
degradation, and currently this is hurting our business.
We get a clear nack from our performance regression team regarding doing 
the same in mlx4.
So, the question is, can we live with this degradation until those 
bottleneck challenges 

Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-14 Thread Eric Dumazet
On Tue, Feb 14, 2017 at 5:45 AM, Eric Dumazet  wrote:
>
> Could we now please Ack this v3 and merge it ?
>

BTW I found the limitation on sender side.

After doing :

lpaa23:~# ethtool -c eth0
Coalesce parameters for eth0:
Adaptive RX: on  TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 40
pkt-rate-high: 45

rx-usecs: 16
rx-frames: 44
rx-usecs-irq: 0
rx-frames-irq: 0

tx-usecs: 16
tx-frames: 16
tx-usecs-irq: 0
tx-frames-irq: 256

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 128
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

lpaa23:~# ethtool -C eth0 tx-usecs 8 tx-frames 8

-> 36 Gbit on a single TCP flow, this is the highest number I ever got.


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-14 Thread Eric Dumazet
On Tue, Feb 14, 2017 at 4:12 AM, Jesper Dangaard Brouer
 wrote:

> It is important to understand that there are two cases for the cost of
> an atomic op, which depend on the cache-coherency state of the
> cacheline.
>
> Measured on Skylake CPU i7-6700K CPU @ 4.00GHz
>
> (1) Local CPU atomic op :  27 cycles(tsc)  6.776 ns
> (2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns
>

Okay, it seems you guys really want a patch that I said was not giving
good results.

Let me publish the numbers I get, with and without the last (and not
official) patch.

If I _force_ the user space process to run on the other node,
then the results are not the ones Alex or you are expecting.

I have with this patch about 2.7 Mpps of this silly single TCP flow,
and 3.5 Mpps without it.

lpaa24:~# sar -n DEV 1 10 | grep eth0 | grep Ave
Average: eth0 2699243.20  16663.70 1354783.36   1079.95
0.00  0.00  4.50

Profile of the cpu on NUMA node 1 ( netserver consuming data ) :

54.73%  [kernel]  [k] copy_user_enhanced_fast_string
31.07%  [kernel]  [k] skb_release_data
 4.24%  [kernel]  [k] skb_copy_datagram_iter
 1.35%  [kernel]  [k] copy_page_to_iter
 0.98%  [kernel]  [k] _raw_spin_lock
 0.90%  [kernel]  [k] skb_release_head_state
 0.60%  [kernel]  [k] tcp_transmit_skb
 0.51%  [kernel]  [k] mlx4_en_xmit
 0.33%  [kernel]  [k] ___cache_free
 0.28%  [kernel]  [k] tcp_rcv_established

Profile of cpu handling mlx4 softirqs (NUMA node 0)


48.00%  [kernel]  [k] mlx4_en_process_rx_cq
12.92%  [kernel]  [k] napi_gro_frags
 7.28%  [kernel]  [k] inet_gro_receive
 7.17%  [kernel]  [k] tcp_gro_receive
 5.10%  [kernel]  [k] dev_gro_receive
 4.87%  [kernel]  [k] skb_gro_receive
 2.45%  [kernel]  [k] mlx4_en_prepare_rx_desc
 2.04%  [kernel]  [k] __build_skb
 1.02%  [kernel]  [k] napi_reuse_skb.isra.95
 1.01%  [kernel]  [k] tcp4_gro_receive
 0.65%  [kernel]  [k] kmem_cache_alloc
 0.45%  [kernel]  [k] _raw_spin_lock

Without the latest  patch (the exact patch series v3 I submitted),
thus with this atomic_inc() in mlx4_en_process_rx_cq  instead of only reads.

lpaa24:~# sar -n DEV 1 10|grep eth0|grep Ave
Average: eth0 3566768.50  25638.60 1790345.69   1663.51
0.00  0.00  4.50

Profiles of the two cpus :

74.85%  [kernel]  [k] copy_user_enhanced_fast_string
 6.42%  [kernel]  [k] skb_release_data
 5.65%  [kernel]  [k] skb_copy_datagram_iter
 1.83%  [kernel]  [k] copy_page_to_iter
 1.59%  [kernel]  [k] _raw_spin_lock
 1.48%  [kernel]  [k] skb_release_head_state
 0.72%  [kernel]  [k] tcp_transmit_skb
 0.68%  [kernel]  [k] mlx4_en_xmit
 0.43%  [kernel]  [k] page_frag_free
 0.38%  [kernel]  [k] ___cache_free
 0.37%  [kernel]  [k] tcp_established_options
 0.37%  [kernel]  [k] __ip_local_out


   37.98%  [kernel]  [k] mlx4_en_process_rx_cq
26.47%  [kernel]  [k] napi_gro_frags
 7.02%  [kernel]  [k] inet_gro_receive
 5.89%  [kernel]  [k] tcp_gro_receive
 5.17%  [kernel]  [k] dev_gro_receive
 4.80%  [kernel]  [k] skb_gro_receive
 2.61%  [kernel]  [k] __build_skb
 2.45%  [kernel]  [k] mlx4_en_prepare_rx_desc
 1.59%  [kernel]  [k] napi_reuse_skb.isra.95
 0.95%  [kernel]  [k] tcp4_gro_receive
 0.51%  [kernel]  [k] kmem_cache_alloc
 0.42%  [kernel]  [k] __inet_lookup_established
 0.34%  [kernel]  [k] swiotlb_sync_single_for_cpu


So probably this will need further analysis, outside of the scope of
this patch series.

Could we now please Ack this v3 and merge it ?

Thanks.



> Notice the huge difference. And in case 2, it is enough that the remote
> CPU reads the cacheline and brings it into "Shared" (MESI) state, and
> the local CPU then does the atomic op.
>
> One key ideas behind the page_pool, is that remote CPUs read/detect
> refcnt==1 (Shared-state), and store the page in a small per-CPU array.
> When array is full, it gets bulk returned to the shared-ptr-ring pool.
> When "local" CPU need new pages, from the shared-ptr-ring it prefetchw
> during it's bulk refill, to latency-hide the MESI transitions needed.
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-14 Thread Jesper Dangaard Brouer

On Mon, 13 Feb 2017 15:16:35 -0800
Alexander Duyck  wrote:

[...]
> ... As I'm sure Jesper will probably point out the atomic op for
> get_page/page_ref_inc can be pretty expensive if I recall correctly.

It is important to understand that there are two cases for the cost of
an atomic op, which depend on the cache-coherency state of the
cacheline.

Measured on Skylake CPU i7-6700K CPU @ 4.00GHz

(1) Local CPU atomic op :  27 cycles(tsc)  6.776 ns
(2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns

Notice the huge difference. And in case 2, it is enough that the remote
CPU reads the cacheline and brings it into "Shared" (MESI) state, and
the local CPU then does the atomic op.

One key idea behind the page_pool is that remote CPUs read/detect
refcnt==1 (Shared state) and store the page in a small per-CPU array.
When the array is full, it gets bulk-returned to the shared-ptr-ring pool.
When the "local" CPU needs new pages from the shared-ptr-ring, it prefetchw's
them during its bulk refill, to latency-hide the MESI transitions needed.
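
Very rough sketch of that return path (not the proposed page_pool API;
pp_cache, pp_percpu_cache and the sizes are invented, and the usual
headers such as linux/ptr_ring.h are assumed).  The remote CPU only reads
the refcount, batches pages locally, and takes the producer lock once per
bulk return:

#define PP_CACHE_SIZE	64

struct pp_cache {
	unsigned int	count;
	struct page	*pages[PP_CACHE_SIZE];
};
static DEFINE_PER_CPU(struct pp_cache, pp_percpu_cache);

static void pp_return_page(struct ptr_ring *shared, struct page *page)
{
	struct pp_cache *c = this_cpu_ptr(&pp_percpu_cache);
	unsigned int i;

	if (page_ref_count(page) != 1) {	/* still shared: normal free */
		put_page(page);
		return;
	}

	c->pages[c->count++] = page;		/* stash locally, no atomic */
	if (c->count < PP_CACHE_SIZE)
		return;

	spin_lock(&shared->producer_lock);	/* one lock per bulk return */
	for (i = 0; i < c->count; i++)
		if (__ptr_ring_produce(shared, c->pages[i]))
			put_page(c->pages[i]);	/* ring full: really free */
	spin_unlock(&shared->producer_lock);
	c->count = 0;
}

The consumer side would then bulk-refill from the same ptr_ring and
prefetchw the pages it pulls, as described above.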

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-13 Thread Alexander Duyck
On Mon, Feb 13, 2017 at 4:57 PM, Eric Dumazet  wrote:
>
>> Alex, be assured that I implemented the full thing, of course.
>
> Patch was :
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
> b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 
> aa074e57ce06fb2842fa1faabd156c3cd2fe10f5..0ae1b544668d26c24044dbdefdd9b12253596ff9
>  100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -68,6 +68,7 @@ static int mlx4_alloc_page(struct mlx4_en_priv *priv,
> frag->page = page;
> frag->dma = dma;
> frag->page_offset = priv->rx_headroom;
> +   frag->pagecnt_bias = 1;
> return 0;
>  }
>
> @@ -97,7 +98,7 @@ static void mlx4_en_free_frag(const struct mlx4_en_priv 
> *priv,
> if (frag->page) {
> dma_unmap_page(priv->ddev, frag->dma,
>PAGE_SIZE, priv->dma_dir);
> -   __free_page(frag->page);
> +   __page_frag_cache_drain(frag->page, frag->pagecnt_bias);
> }
> /* We need to clear all fields, otherwise a change of 
> priv->log_rx_info
>  * could lead to see garbage later in frag->page.
> @@ -470,6 +471,7 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv 
> *priv,
>  {
> const struct mlx4_en_frag_info *frag_info = priv->frag_info;
> unsigned int truesize = 0;
> +   unsigned int pagecnt_bias;
> int nr, frag_size;
> struct page *page;
> dma_addr_t dma;
> @@ -491,9 +493,10 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv 
> *priv,
>  frag_size);
>
> truesize += frag_info->frag_stride;
> +   pagecnt_bias = frags->pagecnt_bias--;
> if (frag_info->frag_stride == PAGE_SIZE / 2) {
> frags->page_offset ^= PAGE_SIZE / 2;
> -   release = page_count(page) != 1 ||
> +   release = page_count(page) != pagecnt_bias ||
>   page_is_pfmemalloc(page) ||
>   page_to_nid(page) != numa_mem_id();
> } else {
> @@ -504,9 +507,13 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv 
> *priv,
> }
> if (release) {
> dma_unmap_page(priv->ddev, dma, PAGE_SIZE, 
> priv->dma_dir);
> +   __page_frag_cache_drain(page, --pagecnt_bias);
> frags->page = NULL;
> } else {
> -   page_ref_inc(page);
> +   if (pagecnt_bias == 1) {
> +   page_ref_add(page, USHRT_MAX);
> +   frags->pagecnt_bias = USHRT_MAX;
> +   }
> }
>
> nr++;

You might want to examine the code while running perf.  What you
should see is the page_ref_inc here go from eating a significant
amount of time prior to the patch to something negligable after the
patch.  If the page_ref_inc isn't adding much pressure then maybe that
is why it didn't provide any significant gain on mlx4. I suppose it's
a possibility that the mlx4 code is different enough that maybe their
code is just running in a different environment, for example there
might not be any MMIO pressure to put any serious pressure on the
atomic op so it is processed more quickly.

Also back when I was hammering on this it was back when I was mostly
focused on routing and doing micro-benchmarks.  Odds are it is
probably one of those things that won't show up unless you are really
looking for it so no need to worry about addressing it now.

- Alex


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-13 Thread Eric Dumazet

> Alex, be assured that I implemented the full thing, of course.

Patch was :

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index aa074e57ce06fb2842fa1faabd156c3cd2fe10f5..0ae1b544668d26c24044dbdefdd9b12253596ff9 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -68,6 +68,7 @@ static int mlx4_alloc_page(struct mlx4_en_priv *priv,
frag->page = page;
frag->dma = dma;
frag->page_offset = priv->rx_headroom;
+   frag->pagecnt_bias = 1;
return 0;
 }
 
@@ -97,7 +98,7 @@ static void mlx4_en_free_frag(const struct mlx4_en_priv *priv,
if (frag->page) {
dma_unmap_page(priv->ddev, frag->dma,
   PAGE_SIZE, priv->dma_dir);
-   __free_page(frag->page);
+   __page_frag_cache_drain(frag->page, frag->pagecnt_bias);
}
/* We need to clear all fields, otherwise a change of priv->log_rx_info
 * could lead to see garbage later in frag->page.
@@ -470,6 +471,7 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
 {
const struct mlx4_en_frag_info *frag_info = priv->frag_info;
unsigned int truesize = 0;
+   unsigned int pagecnt_bias;
int nr, frag_size;
struct page *page;
dma_addr_t dma;
@@ -491,9 +493,10 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
 frag_size);
 
truesize += frag_info->frag_stride;
+   pagecnt_bias = frags->pagecnt_bias--;
if (frag_info->frag_stride == PAGE_SIZE / 2) {
frags->page_offset ^= PAGE_SIZE / 2;
-   release = page_count(page) != 1 ||
+   release = page_count(page) != pagecnt_bias ||
  page_is_pfmemalloc(page) ||
  page_to_nid(page) != numa_mem_id();
} else {
@@ -504,9 +507,13 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
}
if (release) {
                dma_unmap_page(priv->ddev, dma, PAGE_SIZE, priv->dma_dir);
+   __page_frag_cache_drain(page, --pagecnt_bias);
frags->page = NULL;
} else {
-   page_ref_inc(page);
+   if (pagecnt_bias == 1) {
+   page_ref_add(page, USHRT_MAX);
+   frags->pagecnt_bias = USHRT_MAX;
+   }
}
 
nr++;
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 52f157cea7765ad6907c59aead357a158848b38d..0de5933fd56bda3db15b4aa33d2366eff98f4154 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -252,7 +252,12 @@ struct mlx4_en_tx_desc {
 struct mlx4_en_rx_alloc {
struct page *page;
dma_addr_t  dma;
+#if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
u32 page_offset;
+#else
+   u16 page_offset;
+#endif
+   u16 pagecnt_bias;
 };
 
 #define MLX4_EN_CACHE_SIZE (2 * NAPI_POLL_WEIGHT)





Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-13 Thread Eric Dumazet
On Mon, 2017-02-13 at 16:46 -0800, Eric Dumazet wrote:

> Alex, be assured that I implemented the full thing, of course.
> 
> ( The pagecnt_bias field, .. refilled every 16K rounds )

Correction, USHRT_MAX is  ~64K, not 16K





Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-13 Thread Eric Dumazet
On Mon, 2017-02-13 at 16:34 -0800, Alexander Duyck wrote:
> On Mon, Feb 13, 2017 at 4:22 PM, Eric Dumazet  wrote:
> > On Mon, Feb 13, 2017 at 3:47 PM, Alexander Duyck
> >  wrote:
> >
> >> Actually it depends on the use case.  In the case of pktgen packets
> >> they are usually dropped pretty early in the receive path.  Think
> >> something more along the lines of a TCP syn flood versus something
> >> that would be loading up a socket.
> >
> > So I gave another spin at it, reducing the MTU on the sender to 500
> > instead of 1500 to triple the load (in term of packets per second)
> > since the sender seems to hit some kind of limit around 30Gbit.
> >
> > Number of packets we process on one RX queue , and one TCP flow.
> >
> > Current patch series : 6.3 Mpps
> >
> > Which is not too bad ;)
> >
> > This number does not change putting your __page_frag_cache_drain() trick.
> 
> Well the __page_frag_cache_drain by itself isn't going to add much of
> anything.  You really aren't going to be using it except for in the
> slow path.  I was talking about the bulk page count update by itself.
> All __page_frag_cache_drain gets you is it cleans up the
> page_frag_sub(n-1); put_page(1); code calls.
> 
> The approach I took for the Intel drivers isn't too different than
> what we did for the skbuff page frag cache.  Basically I update the
> page count once every 65535 frames setting it to 64K, and holding no
> more than 65535 references in the pagecnt_bias.  Also in my code I
> don't update the count until after the first recycle call since the
> highest likelihood of us discarding the frame is after the first
> allocation anyway so we wait until the first recycle to start
> repopulating the count.

Alex, be assured that I implemented the full thing, of course.

( The pagecnt_bias field, .. refilled every 16K rounds )

And I repeat : I got _no_ convincing results to justify this patch being
sent in this series.

This series is not about adding 1% or 2% performance changes, but adding
robustness to RX allocations.


 



Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-13 Thread Alexander Duyck
On Mon, Feb 13, 2017 at 4:22 PM, Eric Dumazet  wrote:
> On Mon, Feb 13, 2017 at 3:47 PM, Alexander Duyck
>  wrote:
>
>> Actually it depends on the use case.  In the case of pktgen packets
>> they are usually dropped pretty early in the receive path.  Think
>> something more along the lines of a TCP syn flood versus something
>> that would be loading up a socket.
>
> So I gave another spin at it, reducing the MTU on the sender to 500
> instead of 1500 to triple the load (in term of packets per second)
> since the sender seems to hit some kind of limit around 30Gbit.
>
> Number of packets we process on one RX queue , and one TCP flow.
>
> Current patch series : 6.3 Mpps
>
> Which is not too bad ;)
>
> This number does not change putting your __page_frag_cache_drain() trick.

Well the __page_frag_cache_drain by itself isn't going to add much of
anything.  You really aren't going to be using it except for in the
slow path.  I was talking about the bulk page count update by itself.
All __page_frag_cache_drain gets you is it cleans up the
page_frag_sub(n-1); put_page(1); code calls.

The approach I took for the Intel drivers isn't too different than
what we did for the skbuff page frag cache.  Basically I update the
page count once every 65535 frames setting it to 64K, and holding no
more than 65535 references in the pagecnt_bias.  Also in my code I
don't update the count until after the first recycle call since the
highest likelihood of us discarding the frame is after the first
allocation anyway so we wait until the first recycle to start
repopulating the count.
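
For reference, a condensed sketch of that scheme (modelled on igb after
commit bd4171a5d4c2, not mlx4 code; rx_page_can_reuse() is a made-up
helper).  The driver pre-charges ~64K references once and spends them
through pagecnt_bias, so the atomic page_ref_add() runs roughly once per
65535 recycles instead of once per frame:

static bool rx_page_can_reuse(struct page *page, u16 *pagecnt_bias)
{
	/* the stack still holds real references beyond our bias: give it up */
	if (unlikely(page_ref_count(page) - *pagecnt_bias > 1))
		return false;

	/* refill the bias long before it runs out: one atomic per ~64K uses */
	if (unlikely(*pagecnt_bias == 1)) {
		page_ref_add(page, USHRT_MAX - 1);
		*pagecnt_bias = USHRT_MAX;
	}
	return true;
}

/* and wherever a buffer built on this page is handed to the stack:
 *	rx_buffer->pagecnt_bias--;	(plain decrement, no atomic)
 */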

- Alex


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-13 Thread Eric Dumazet
On Mon, Feb 13, 2017 at 3:47 PM, Alexander Duyck
 wrote:

> Actually it depends on the use case.  In the case of pktgen packets
> they are usually dropped pretty early in the receive path.  Think
> something more along the lines of a TCP syn flood versus something
> that would be loading up a socket.

So I gave another spin at it, reducing the MTU on the sender to 500
instead of 1500 to triple the load (in term of packets per second)
since the sender seems to hit some kind of limit around 30Gbit.

Number of packets we process on one RX queue , and one TCP flow.

Current patch series : 6.3 Mpps

Which is not too bad ;)

This number does not change putting your __page_frag_cache_drain() trick.


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-13 Thread Alexander Duyck
On Mon, Feb 13, 2017 at 3:29 PM, Eric Dumazet  wrote:
> On Mon, Feb 13, 2017 at 3:26 PM, Alexander Duyck
>  wrote:
>
>>
>> Odds are for a single TCP flow you won't notice.  This tends to be
>> more of a small-packet type performance issue.  If you hammer on the
>> Rx using pktgen you would be more likely to see it.
>>
>> Anyway patch looks fine from a functional perspective so I am okay
>> with it.  The page count bulking is something that can be addressed
>> later.
>
> Well, if we hammer enough, pages are not recycled because we cycle
> through RX ring buffer too fast.
>
> This might be different for PowerPC though, because of its huge PAGE_SIZE.

Actually it depends on the use case.  In the case of pktgen packets
they are usually dropped pretty early in the receive path.  Think
something more along the lines of a TCP syn flood versus something
that would be loading up a socket.

- Alex


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-13 Thread Eric Dumazet
On Mon, Feb 13, 2017 at 3:26 PM, Alexander Duyck
 wrote:

>
> Odds are for a single TCP flow you won't notice.  This tends to be
> more of a small-packet type performance issue.  If you hammer on the
> Rx using pktgen you would be more likely to see it.
>
> Anyway patch looks fine from a functional perspective so I am okay
> with it.  The page count bulking is something that can be addressed
> later.

Well, if we hammer enough, pages are not recycled because we cycle
through RX ring buffer too fast.

This might be different for PowerPC though, because of its huge PAGE_SIZE.


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-13 Thread Alexander Duyck
On Mon, Feb 13, 2017 at 3:22 PM, Eric Dumazet  wrote:
> On Mon, Feb 13, 2017 at 3:16 PM, Alexander Duyck
>  wrote:
>
>> Any plans to add the bulk page count updates back at some point?  I
>> just got around to adding it for igb in commit bd4171a5d4c2 ("igb:
>> update code to better handle incrementing page count").  I should have
>> patches for ixgbe, i40e, and i40evf here pretty soon.  That was one of
>> the reasons why I implemented __page_frag_cache_drain in the first
>> place.  As I'm sure Jesper will probably point out the atomic op for
>> get_page/page_ref_inc can be pretty expensive if I recall correctly.
>
> As a matter of fact I did this last week, but could not show any
> improvement on one TCP flow,
> so decided to not include this in this patch series.
>
> Maybe the issue was more a sender issue, I might revisit this later.

Odds are for a single TCP flow you won't notice.  This tends to be
more of a small-packet type performance issue.  If you hammer on the
Rx using pktgen you would be more likely to see it.

Anyway patch looks fine from a functional perspective so I am okay
with it.  The page count bulking is something that can be addressed
later.

Thanks.

- Alex


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-13 Thread Eric Dumazet
On Mon, Feb 13, 2017 at 3:16 PM, Alexander Duyck
 wrote:

> Any plans to add the bulk page count updates back at some point?  I
> just got around to adding it for igb in commit bd4171a5d4c2 ("igb:
> update code to better handle incrementing page count").  I should have
> patches for ixgbe, i40e, and i40evf here pretty soon.  That was one of
> the reasons why I implemented __page_frag_cache_drain in the first
> place.  As I'm sure Jesper will probably point out the atomic op for
> get_page/page_ref_inc can be pretty expensive if I recall correctly.

As a matter of fact I did this last week, but could not show any
improvement on one TCP flow,
so decided to not include this in this patch series.

Maybe the issue was more a sender issue, I might revisit this later.


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-13 Thread Alexander Duyck
On Mon, Feb 13, 2017 at 1:09 PM, Eric Dumazet  wrote:
> On Mon, Feb 13, 2017 at 12:51 PM, Alexander Duyck
>  wrote:
>> On Mon, Feb 13, 2017 at 11:58 AM, Eric Dumazet  wrote:
>
>>> +  PAGE_SIZE, priv->dma_dir);
>>> page = page_alloc->page;
>>> /* Revert changes done by mlx4_alloc_pages */
>>> -   page_ref_sub(page, page_alloc->page_size /
>>> +   page_ref_sub(page, PAGE_SIZE /
>>>priv->frag_info[i].frag_stride - 1);
>>> put_page(page);
>>> page_alloc->page = NULL;
>>
>> You can probably simplify this by using __page_frag_cache_drain since
>> that way you can just do the sub and test for the whole set instead
>> of doing N - 1 and 1.
>
> Well, we remove this anyway in following patch ;)

Any plans to add the bulk page count updates back at some point?  I
just got around to adding it for igb in commit bd4171a5d4c2 ("igb:
update code to better handle incrementing page count").  I should have
patches for ixgbe, i40e, and i40evf here pretty soon.  That was one of
the reasons why I implemented __page_frag_cache_drain in the first
place.  As I'm sure Jesper will probably point out the atomic op for
get_page/page_ref_inc can be pretty expensive if I recall correctly.
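
In rough terms the two schemes compare like this (a sketch assuming N
equal frags per page; the helper names are made up, while get_page(),
page_ref_add() and __page_frag_cache_drain() are the existing kernel
APIs):

	/* Per-frag refcounting: one atomic increment every time a frag is
	 * handed out.
	 */
	static void frag_ref_per_use(struct page *page)
	{
		get_page(page);
	}

	/* Bulk refcounting: one atomic add covers all N frags up front ... */
	static void frag_ref_bulk(struct page *page, unsigned int nr_frags)
	{
		page_ref_add(page, nr_frags - 1);
	}

	/* ... and one sub-and-test drops whatever was not consumed, freeing
	 * the page if that was the last reference.
	 */
	static void frag_ref_drain(struct page *page, unsigned int unused)
	{
		__page_frag_cache_drain(page, unused);
	}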

- Alex


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-13 Thread Eric Dumazet
On Mon, Feb 13, 2017 at 12:51 PM, Alexander Duyck  wrote:
> On Mon, Feb 13, 2017 at 11:58 AM, Eric Dumazet  wrote:

>> +  PAGE_SIZE, priv->dma_dir);
>> page = page_alloc->page;
>> /* Revert changes done by mlx4_alloc_pages */
>> -   page_ref_sub(page, page_alloc->page_size /
>> +   page_ref_sub(page, PAGE_SIZE /
>>priv->frag_info[i].frag_stride - 1);
>> put_page(page);
>> page_alloc->page = NULL;
>
> You can probably simplify this by using __page_frag_cache_drain since
> that way you can just do the sub and test for the whole set instead
> of doing N - 1 and 1.

Well, we remove this anyway in following patch ;)
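
For completeness, the simplification suggested above would look roughly
like this (untested sketch; the wrapper name is hypothetical, and as
noted the whole revert path is removed by a later patch in the series):

	/* Drop all PAGE_SIZE / frag_stride references taken in
	 * mlx4_alloc_pages() with a single sub-and-test, instead of
	 * page_ref_sub(N - 1) followed by put_page().
	 */
	static void mlx4_en_drop_page_refs(struct mlx4_en_priv *priv,
					   struct page *page, int i)
	{
		__page_frag_cache_drain(page,
					PAGE_SIZE / priv->frag_info[i].frag_stride);
	}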


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-13 Thread Alexander Duyck
On Mon, Feb 13, 2017 at 11:58 AM, Eric Dumazet  wrote:
> Use of order-3 pages is problematic in some cases.
>
> This patch might add three kinds of regression:
>
> 1) a CPU performance regression, but we will add page recycling
> later and performance should come back.
>
> 2) TCP receiver could grow its receive window slightly slower,
>because skb->len/skb->truesize ratio will decrease.
>This is mostly ok, we prefer being conservative to not risk OOM,
>and eventually tune TCP better in the future.
>This is consistent with other drivers using 2048 bytes per Ethernet frame.
>
> 3) Because we allocate one page per RX slot, we consume more
>memory for the ring buffers. XDP already had this constraint anyway.
>
> Signed-off-by: Eric Dumazet 
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c   | 72 +---
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  4 --
>  2 files changed, 33 insertions(+), 43 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 0c61c1200f2aec4c74d7403d9ec3ed06821c1bac..069ea09185fb0669d5c1f9b1b88f517b2d69c5ed 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -53,38 +53,26 @@
>  static int mlx4_alloc_pages(struct mlx4_en_priv *priv,
> struct mlx4_en_rx_alloc *page_alloc,
> const struct mlx4_en_frag_info *frag_info,
> -   gfp_t _gfp)
> +   gfp_t gfp)
>  {
> -   int order;
> struct page *page;
> dma_addr_t dma;
>
> -   for (order = priv->rx_page_order; ;) {
> -   gfp_t gfp = _gfp;
> -
> -   if (order)
> -   gfp |= __GFP_COMP | __GFP_NOWARN | __GFP_NOMEMALLOC;
> -   page = alloc_pages(gfp, order);
> -   if (likely(page))
> -   break;
> -   if (--order < 0 ||
> -   ((PAGE_SIZE << order) < frag_info->frag_size))
> -   return -ENOMEM;
> -   }
> -   dma = dma_map_page(priv->ddev, page, 0, PAGE_SIZE << order,
> -  priv->dma_dir);
> +   page = alloc_page(gfp);
> +   if (unlikely(!page))
> +   return -ENOMEM;
> +   dma = dma_map_page(priv->ddev, page, 0, PAGE_SIZE, priv->dma_dir);
> if (unlikely(dma_mapping_error(priv->ddev, dma))) {
> put_page(page);
> return -ENOMEM;
> }
> -   page_alloc->page_size = PAGE_SIZE << order;
> page_alloc->page = page;
> page_alloc->dma = dma;
> page_alloc->page_offset = 0;
> /* Not doing get_page() for each frag is a big win
>  * on asymetric workloads. Note we can not use atomic_set().
>  */
> -   page_ref_add(page, page_alloc->page_size / frag_info->frag_stride - 1);
> +   page_ref_add(page, PAGE_SIZE / frag_info->frag_stride - 1);
> return 0;
>  }
>
> @@ -105,7 +93,7 @@ static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
> page_alloc[i].page_offset += frag_info->frag_stride;
>
> if (page_alloc[i].page_offset + frag_info->frag_stride <=
> -   ring_alloc[i].page_size)
> +   PAGE_SIZE)
> continue;
>
> if (unlikely(mlx4_alloc_pages(priv, &page_alloc[i],
> @@ -127,11 +115,10 @@ static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
> while (i--) {
> if (page_alloc[i].page != ring_alloc[i].page) {
> dma_unmap_page(priv->ddev, page_alloc[i].dma,
> -   page_alloc[i].page_size,
> -   priv->dma_dir);
> +  PAGE_SIZE, priv->dma_dir);
> page = page_alloc[i].page;
> /* Revert changes done by mlx4_alloc_pages */
> -   page_ref_sub(page, page_alloc[i].page_size /
> +   page_ref_sub(page, PAGE_SIZE /
>priv->frag_info[i].frag_stride - 1);
> put_page(page);
> }
> @@ -147,8 +134,8 @@ static void mlx4_en_free_frag(struct mlx4_en_priv *priv,
> u32 next_frag_end = frags[i].page_offset + 2 * frag_info->frag_stride;
>
>
> -   if (next_frag_end > frags[i].page_size)
> -   dma_unmap_page(priv->ddev, frags[i].dma, frags[i].page_size,
> +   if (next_frag_end > PAGE_SIZE)
> +   dma_unmap_page(priv->ddev, frags[i].dma, PAGE_SIZE,
>priv->dma_dir);
>
> if (frags[i].page)
> @@ -168,9 +155,8 @@ static int mlx4_en_init_allocator(struct mlx4_en_priv *priv,
>  frag_info, GFP_KERNEL | __GFP_COLD))
>

[PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-13 Thread Eric Dumazet
Use of order-3 pages is problematic in some cases.

This patch might add three kinds of regression:

1) a CPU performance regression, but we will add page recycling
later and performance should come back.

2) TCP receiver could grow its receive window slightly slower,
   because skb->len/skb->truesize ratio will decrease.
   This is mostly ok, we prefer being conservative to not risk OOM,
   and eventually tune TCP better in the future.
   This is consistent with other drivers using 2048 bytes per Ethernet frame.

3) Because we allocate one page per RX slot, we consume more
   memory for the ring buffers. XDP already had this constraint anyway.

Signed-off-by: Eric Dumazet 
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c   | 72 +---
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  4 --
 2 files changed, 33 insertions(+), 43 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 0c61c1200f2aec4c74d7403d9ec3ed06821c1bac..069ea09185fb0669d5c1f9b1b88f517b2d69c5ed 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -53,38 +53,26 @@
 static int mlx4_alloc_pages(struct mlx4_en_priv *priv,
struct mlx4_en_rx_alloc *page_alloc,
const struct mlx4_en_frag_info *frag_info,
-   gfp_t _gfp)
+   gfp_t gfp)
 {
-   int order;
struct page *page;
dma_addr_t dma;
 
-   for (order = priv->rx_page_order; ;) {
-   gfp_t gfp = _gfp;
-
-   if (order)
-   gfp |= __GFP_COMP | __GFP_NOWARN | __GFP_NOMEMALLOC;
-   page = alloc_pages(gfp, order);
-   if (likely(page))
-   break;
-   if (--order < 0 ||
-   ((PAGE_SIZE << order) < frag_info->frag_size))
-   return -ENOMEM;
-   }
-   dma = dma_map_page(priv->ddev, page, 0, PAGE_SIZE << order,
-  priv->dma_dir);
+   page = alloc_page(gfp);
+   if (unlikely(!page))
+   return -ENOMEM;
+   dma = dma_map_page(priv->ddev, page, 0, PAGE_SIZE, priv->dma_dir);
if (unlikely(dma_mapping_error(priv->ddev, dma))) {
put_page(page);
return -ENOMEM;
}
-   page_alloc->page_size = PAGE_SIZE << order;
page_alloc->page = page;
page_alloc->dma = dma;
page_alloc->page_offset = 0;
/* Not doing get_page() for each frag is a big win
 * on asymetric workloads. Note we can not use atomic_set().
 */
-   page_ref_add(page, page_alloc->page_size / frag_info->frag_stride - 1);
+   page_ref_add(page, PAGE_SIZE / frag_info->frag_stride - 1);
return 0;
 }
 
@@ -105,7 +93,7 @@ static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
page_alloc[i].page_offset += frag_info->frag_stride;
 
if (page_alloc[i].page_offset + frag_info->frag_stride <=
-   ring_alloc[i].page_size)
+   PAGE_SIZE)
continue;
 
if (unlikely(mlx4_alloc_pages(priv, &page_alloc[i],
@@ -127,11 +115,10 @@ static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
while (i--) {
if (page_alloc[i].page != ring_alloc[i].page) {
dma_unmap_page(priv->ddev, page_alloc[i].dma,
-   page_alloc[i].page_size,
-   priv->dma_dir);
+  PAGE_SIZE, priv->dma_dir);
page = page_alloc[i].page;
/* Revert changes done by mlx4_alloc_pages */
-   page_ref_sub(page, page_alloc[i].page_size /
+   page_ref_sub(page, PAGE_SIZE /
   priv->frag_info[i].frag_stride - 1);
put_page(page);
}
@@ -147,8 +134,8 @@ static void mlx4_en_free_frag(struct mlx4_en_priv *priv,
u32 next_frag_end = frags[i].page_offset + 2 * frag_info->frag_stride;
 
 
-   if (next_frag_end > frags[i].page_size)
-   dma_unmap_page(priv->ddev, frags[i].dma, frags[i].page_size,
+   if (next_frag_end > PAGE_SIZE)
+   dma_unmap_page(priv->ddev, frags[i].dma, PAGE_SIZE,
   priv->dma_dir);
 
if (frags[i].page)
@@ -168,9 +155,8 @@ static int mlx4_en_init_allocator(struct mlx4_en_priv *priv,
 frag_info, GFP_KERNEL | __GFP_COLD))
goto out;
 
-   en_dbg(DRV, priv, "  frag %d allocator: - size:%d frags:%d\n",
-  i, ring->page_alloc[i].page_size,
-  page_ref_count(ring->page_alloc[i].page));
+   en_dbg(DRV, priv, "  frag %d allocator: -