Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path

2017-03-15 Thread Tariq Toukan



On 14/03/2017 4:21 PM, Eric Dumazet wrote:

On Tue, 2017-03-14 at 06:34 -0700, Eric Dumazet wrote:


So I will leave this to Mellanox for XDP tests and upstreaming this,
and will stop arguing with you, this is going nowhere.


Tariq, I will send a v2, including these changes (plus the missing
include of yesterday)

One is to make sure high order allocations remove __GFP_DIRECT_RECLAIM

The other is changing mlx4_en_recover_from_oom() to increase
rx_alloc_order by one instead of plainly resetting it to rx_pref_alloc_order

Please test XDP and tell me if you find any issues?
Thanks!


I can do the XDP check tomorrow, and complete the patch review.
I will keep you updated.

(I am replying to this on v2 as well)

Regards,
Tariq


Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path

2017-03-14 Thread Eric Dumazet
On Tue, 2017-03-14 at 06:34 -0700, Eric Dumazet wrote:

> So I will leave this to Mellanox for XDP tests and upstreaming this,
> and will stop arguing with you, this is going nowhere.

Tariq, I will send a v2, including these changes (plus the missing
include of yesterday)

One is to make sure high order allocations remove __GFP_DIRECT_RECLAIM

The other is changing mlx4_en_recover_from_oom() to increase
rx_alloc_order by one instead of plainly resetting it to rx_pref_alloc_order

Please test XDP and tell me if you find any issues?
Thanks!

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index a71554649c25383bb765fa8220bc9cd490247aee..cc41f2f145541b469b52e7014659d5fdbb7dac68 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -60,8 +60,10 @@ static struct page *mlx4_alloc_page(struct mlx4_en_priv *priv,
if (unlikely(!ring->pre_allocated_count)) {
unsigned int order = READ_ONCE(ring->rx_alloc_order);
 
-   page = __alloc_pages_node(node, gfp | __GFP_NOMEMALLOC |
-   __GFP_NOWARN | __GFP_NORETRY,
+   page = __alloc_pages_node(node, (gfp & ~__GFP_DIRECT_RECLAIM) |
+   __GFP_NOMEMALLOC |
+   __GFP_NOWARN |
+   __GFP_NORETRY,
  order);
if (page) {
split_page(page, order);
@@ -412,12 +414,13 @@ int mlx4_en_activate_rx_rings(struct mlx4_en_priv *priv)
 }
 
 /* Under memory pressure, each ring->rx_alloc_order might be lowered
- * to very small values. Periodically reset it to initial value for
+ * to very small values. Periodically increase it back to the initial value for
  * optimal allocations, in case stress is over.
  */
 void mlx4_en_recover_from_oom(struct mlx4_en_priv *priv)
 {
struct mlx4_en_rx_ring *ring;
+   unsigned int order;
int ring_ind;
 
if (!priv->port_up)
@@ -425,7 +428,9 @@ void mlx4_en_recover_from_oom(struct mlx4_en_priv *priv)
 
for (ring_ind = 0; ring_ind < priv->rx_ring_num; ring_ind++) {
ring = priv->rx_ring[ring_ind];
-   WRITE_ONCE(ring->rx_alloc_order, ring->rx_pref_alloc_order);
+   order = min_t(unsigned int, ring->rx_alloc_order + 1,
+ ring->rx_pref_alloc_order);
+   WRITE_ONCE(ring->rx_alloc_order, order);
}
 }
 




Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path

2017-03-14 Thread Eric Dumazet
On Mon, Mar 13, 2017 at 9:57 PM, Alexei Starovoitov wrote:
> On Mon, Mar 13, 2017 at 06:02:11PM -0700, Eric Dumazet wrote:
>> On Mon, 2017-03-13 at 16:40 -0700, Alexei Starovoitov wrote:
>>
>> > that's not how it works. It's the submitter's job to prove
>> > that additional code doesn't cause regressions, especially
>> > when there are legitimate concerns.
>>
>> This test was moved out of the mlx4_en_prepare_rx_desc() section into
>> the XDP_TX code path.
>>
>>
>> if (ring->page_cache.index > 0) {
>> 	/* XDP uses a single page per frame */
>> 	if (!frags->page) {
>> 		ring->page_cache.index--;
>> 		frags->page = ring->page_cache.buf[ring->page_cache.index].page;
>> 		frags->dma  = ring->page_cache.buf[ring->page_cache.index].dma;
>> 	}
>> 	frags->page_offset = XDP_PACKET_HEADROOM;
>> 	rx_desc->data[0].addr = cpu_to_be64(frags->dma + XDP_PACKET_HEADROOM);
>> 	return 0;
>> }
>>
>> Can you check again your claim, because I see no additional cost
>> for XDP_TX.
>
> Let's look at what it was:
> - xdp_tx xmits the page regardless of whether the driver can replenish
> - at the end of the napi loop mlx4_en_refill_rx_buffers() will replenish
> rx in bulk, either from page_cache or by allocating one page at a time
>
> after the changes:
> - xdp_tx will check page_cache; if it's empty it will try to do an
> order-10 (ten!!) alloc, which will fail, then try to alloc a single page,
> xmit the packet, and place the just-allocated page into the rx ring.
> On the next packet in the same napi loop, it will try to allocate
> order 9 (since the cache is still empty), fail, try a single
> page, succeed... the next packet will try order 8 and so on.
> And those spiky order-10 allocations will be happening 4 times a second
> due to the new logic in mlx4_en_recover_from_oom().
> We may get lucky and an order-2 alloc will succeed, but before that
> we will waste tons of cycles.
> If an attacker somehow makes the existing page recycling logic ineffective,
> xdp performance will be limited by the order-0 page allocator.
> Not great, but still acceptable.
> After this patch it will just tank due to this crazy scheme.
> Yet you're not talking about this new behavior in the commit log.
> You didn't test XDP at all and you're still claiming that everything is fine?!
> NACK


Ouch. You NACK a patch based on fears from your side, just because I
said I would not spend time on XDP at this very moment.

We hardly ever allocate a page in our workloads, and a failed attempt
to get an order-10 page with GFP_ATOMIC has exactly the same cost as a
failed attempt at order-0 or order-1 or order-X;
the buddy page allocator gives that for free.
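
A minimal sketch of why such a failure is cheap (illustrative only, loosely
based on the per-order free-list checks in mm/page_alloc.c; the helper name
is made up):

static bool zone_has_free_block(struct zone *zone, unsigned int order)
{
	unsigned int o;

	/* With GFP_ATOMIC (no __GFP_DIRECT_RECLAIM), a failed order-10
	 * request is detected by walking at most MAX_ORDER - 10 empty
	 * free lists: no reclaim runs and no page is ever touched.
	 */
	for (o = order; o < MAX_ORDER; o++) {
		if (zone->free_area[o].nr_free)
			return true;	/* a block of order >= o exists */
	}
	return false;
}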

So I will leave this to Mellanox for XDP tests and upstreaming this,
and will stop arguing with you, this is going nowhere.

I suggested some changes, but you blocked everything just because I
publicly said that I would not use XDP,
a feature which added a serious mess to this driver.


Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path

2017-03-13 Thread Alexei Starovoitov
On Mon, Mar 13, 2017 at 06:02:11PM -0700, Eric Dumazet wrote:
> On Mon, 2017-03-13 at 16:40 -0700, Alexei Starovoitov wrote:
> 
> > that's not how it works. It's the submitter's job to prove
> > that additional code doesn't cause regressions, especially
> > when there are legitimate concerns.
> 
> This test was moved out of the mlx4_en_prepare_rx_desc() section into
> the XDP_TX code path.
> 
> 
> if (ring->page_cache.index > 0) {
> 	/* XDP uses a single page per frame */
> 	if (!frags->page) {
> 		ring->page_cache.index--;
> 		frags->page = ring->page_cache.buf[ring->page_cache.index].page;
> 		frags->dma  = ring->page_cache.buf[ring->page_cache.index].dma;
> 	}
> 	frags->page_offset = XDP_PACKET_HEADROOM;
> 	rx_desc->data[0].addr = cpu_to_be64(frags->dma + XDP_PACKET_HEADROOM);
> 	return 0;
> }
> 
> Can you check again your claim, because I see no additional cost
> for XDP_TX.

Let's look at what it was:
- xdp_tx xmits the page regardless of whether the driver can replenish
- at the end of the napi loop mlx4_en_refill_rx_buffers() will replenish
rx in bulk, either from page_cache or by allocating one page at a time

after the changes:
- xdp_tx will check page_cache; if it's empty it will try to do an
order-10 (ten!!) alloc, which will fail, then try to alloc a single page,
xmit the packet, and place the just-allocated page into the rx ring.
On the next packet in the same napi loop, it will try to allocate
order 9 (since the cache is still empty), fail, try a single
page, succeed... the next packet will try order 8 and so on.
And those spiky order-10 allocations will be happening 4 times a second
due to the new logic in mlx4_en_recover_from_oom().
We may get lucky and an order-2 alloc will succeed, but before that
we will waste tons of cycles.
If an attacker somehow makes the existing page recycling logic ineffective,
xdp performance will be limited by the order-0 page allocator.
Not great, but still acceptable.
After this patch it will just tank due to this crazy scheme.
Yet you're not talking about this new behavior in the commit log.
You didn't test XDP at all and you're still claiming that everything is fine?!
NACK



Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path

2017-03-13 Thread Eric Dumazet
On Mon, 2017-03-13 at 16:40 -0700, Alexei Starovoitov wrote:

> that's not how it works. It's the submitter's job to prove
> that additional code doesn't cause regressions, especially
> when there are legitimate concerns.

This test was moved out of the mlx4_en_prepare_rx_desc() section into
the XDP_TX code path.


if (ring->page_cache.index > 0) {
	/* XDP uses a single page per frame */
	if (!frags->page) {
		ring->page_cache.index--;
		frags->page = ring->page_cache.buf[ring->page_cache.index].page;
		frags->dma  = ring->page_cache.buf[ring->page_cache.index].dma;
	}
	frags->page_offset = XDP_PACKET_HEADROOM;
	rx_desc->data[0].addr = cpu_to_be64(frags->dma + XDP_PACKET_HEADROOM);
	return 0;
}

Can you check again your claim, because I see no additional cost
for XDP_TX.

In fact I removed, from the other paths (which I believe are equally
important), a test that had no use, so everybody should be happy.





Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path

2017-03-13 Thread Alexei Starovoitov
On Mon, Mar 13, 2017 at 04:44:19PM -0700, Eric Dumazet wrote:
> On Mon, Mar 13, 2017 at 4:40 PM, Alexei Starovoitov
>  wrote:
> > On Mon, Mar 13, 2017 at 04:28:04PM -0700, Eric Dumazet wrote:
> >> On Mon, Mar 13, 2017 at 4:21 PM, Alexei Starovoitov
> >>  wrote:
> >> >
> >> > is it once in the beginning only? If so then why that
> >> > 'if (!ring->page_cache.index)' check is done for every packet?
> >>
> >>
> >>
> >> You did not really read the patch, otherwise you would not ask these 
> >> questions.
> >
> > please explain. I see
> > +  if (!ring->page_cache.index) {
> > +  npage = mlx4_alloc_page(priv, ring,
> > which is done for every packet that goes via XDP_TX.
> >
> 
> Well, we do for all packets, even on hosts not running XDP:
> 
> if (xdp_prog) { ...
> 
> ...
> 
> Then :
> 
> if (doorbell_pending)
>  mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);
> 
> And nobody complained about a few additional instructions.

it's not the additional 'if' I'm concerned about it's the allocation
after 'if', since you didn't explain clearly when it's executed.

> >> Test it, and if you find a regression, shout loudly.
> >
> that's not how it works. It's the submitter's job to prove
> that additional code doesn't cause regressions, especially
> when there are legitimate concerns.
> 
> I have no easy way to test XDP. I have never used it and am not
> planning to use it any time soon.
> 
> Does it mean I can no longer participate in Linux dev?

when you're touching the most performance-sensitive piece of XDP in
the driver then yes, you have to show that XDP doesn't regress,
especially since it's trivial to test:
two mlx4 servers, pktgen and xdp2 are enough.



Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path

2017-03-13 Thread Eric Dumazet
On Mon, Mar 13, 2017 at 4:40 PM, Alexei Starovoitov wrote:
> On Mon, Mar 13, 2017 at 04:28:04PM -0700, Eric Dumazet wrote:
>> On Mon, Mar 13, 2017 at 4:21 PM, Alexei Starovoitov
>>  wrote:
>> >
>> > is it once in the beginning only? If so then why that
>> > 'if (!ring->page_cache.index)' check is done for every packet?
>>
>>
>>
>> You did not really read the patch, otherwise you would not ask these 
>> questions.
>
> please explain. I see
> +  if (!ring->page_cache.index) {
> +  npage = mlx4_alloc_page(priv, ring,
> which is done for every packet that goes via XDP_TX.
>

Well, we do for all packets, even on hosts not running XDP:

if (xdp_prog) { ...

...

Then :

if (doorbell_pending)
 mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);

And nobody complained about a few additional instructions.

Should I have, very loudly?


>> Test it, and if you find a regression, shout loudly.
>
> that's not how it works. It's the submitter's job to prove
> that additional code doesn't cause regressions, especially
> when there are legitimate concerns.

I have no easy way to test XDP. I have never used it and am not
planning to use it any time soon.

Does it mean I can no longer participate in Linux dev?

Nice to hear, Alexei.


Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path

2017-03-13 Thread Alexei Starovoitov
On Mon, Mar 13, 2017 at 04:28:04PM -0700, Eric Dumazet wrote:
> On Mon, Mar 13, 2017 at 4:21 PM, Alexei Starovoitov
>  wrote:
> >
> > is it once in the beginning only? If so then why that
> > 'if (!ring->page_cache.index)' check is done for every packet?
> 
> 
> 
> You did not really read the patch, otherwise you would not ask these 
> questions.

please explain. I see
+  if (!ring->page_cache.index) {
+  npage = mlx4_alloc_page(priv, ring,
which is done for every packet that goes via XDP_TX.

> Test it, and if you find a regression, shout loudly.

that's not how it works. It's the submitter's job to prove
that additional code doesn't cause regressions, especially
when there are legitimate concerns.



Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path

2017-03-13 Thread Eric Dumazet
On Mon, Mar 13, 2017 at 4:21 PM, Alexei Starovoitov wrote:
>
> is it once in the beginning only? If so then why that
> 'if (!ring->page_cache.index)' check is done for every packet?



You did not really read the patch, otherwise you would not ask these questions.

Test it, and if you find a regression, shout loudly.


Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path

2017-03-13 Thread Alexei Starovoitov
On Mon, Mar 13, 2017 at 02:09:23PM -0700, Eric Dumazet wrote:
> On Mon, Mar 13, 2017 at 1:23 PM, Alexei Starovoitov
>  wrote:
> > On Mon, Mar 13, 2017 at 11:58:05AM -0700, Eric Dumazet wrote:
> >> On Mon, Mar 13, 2017 at 11:31 AM, Alexei Starovoitov
> >>  wrote:
> >> > On Mon, Mar 13, 2017 at 10:50:28AM -0700, Eric Dumazet wrote:
> >> >> On Mon, Mar 13, 2017 at 10:34 AM, Alexei Starovoitov
> >> >>  wrote:
> >> >> > On Sun, Mar 12, 2017 at 05:58:47PM -0700, Eric Dumazet wrote:
> >> >> >> @@ -767,10 +814,30 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
> >> >> >>   case XDP_PASS:
> >> >> >>   break;
> >> >> >>   case XDP_TX:
> >> >> >> + /* Make sure we have one page ready to replace this one */
> >> >> >> + npage = NULL;
> >> >> >> + if (!ring->page_cache.index) {
> >> >> >> + npage = mlx4_alloc_page(priv, ring,
> >> >> >> + , numa_mem_id(),
> >> >> >> + GFP_ATOMIC | __GFP_MEMALLOC);
> >> >> >
> >> >> > did you test this with xdp2 test ?
> >> >> > under what conditions it allocates ?
> >> >> > It looks dangerous from security point of view to do allocations here.
> >> >> > Can it be exploited by an attacker?
> >> >> > we use xdp for ddos and lb and this is fast path.
> >> >> > If 1 out of 100s XDP_TX packets hit this allocation we will have 
> >> >> > serious
> >> >> > perf regression.
>> >> > In general I don't think it's a good idea to penalize x86 in favor of 
> >> >> > powerpc.
> >> >> > Can you #ifdef this new code somehow? so we won't have these concerns 
> >> >> > on x86?
> >> >>
> >> >> Normal paths would never hit this point really. I wanted to be extra
> >> >> safe, because who knows, some guys could be tempted to set
> >> >> ethtool -G ethX  rx 512 tx 8192
> >> >>
> >> >> Before this patch, if you were able to push enough frames in TX ring,
> >> >> you would also eventually be forced to allocate memory, or drop 
> >> >> frames...
> >> >
> >> > hmm. not following.
> >> > Into xdp tx queues packets don't come from stack. It can only be via 
> >> > xdp_tx.
> >> > So this rx page belongs to driver, not shared with anyone and it only 
> >> > needs to
> >> > be put onto tx ring, so I don't understand why driver needs to allocate
> >> > anything here. To refill the rx ring? but why here?
> >>
>> Because RX ring can be shared, by packets going to the upper stacks (say
>> TCP)
> >>
> >> So there is no guarantee that the pages in the quarantine pool have
> >> their page count to 1.
> >>
> >> The normal TX_XDP path will recycle pages in ring->cache_page .
> >>
> >> This is exactly where I pick up a replacement.
> >>
> >> Pages in ring->cache_page have the guarantee to have no other users
> >> than ourself (mlx4 driver)
> >>
> >> You might have not noticed that current mlx4 driver has a lazy refill
> >> of RX ring buffers, that eventually
> >> removes all the pages from RX ring, and we have to recover with this
> >> lazy mlx4_en_recover_from_oom() thing
> >> that will attempt to restart the allocations.
> >>
> >> After my patch, we have the guarantee that the RX ring buffer is
> >> always fully populated.
> >>
> >> When we receive a frame (XDP or not), we drop it if we can not
> >> replenish the RX slot,
> >> in case the oldest page in quarantine is not a recycling candidate and
> >> we can not allocate a new page.
> >
> > Got it. Could you please add above explanation to the commit log,
> > since it's a huge difference vs other drivers.
> > I don't think any other driver in the tree follows this strategy.
> 
> sk_buff allocation can fail before considering adding a frag on the
> skb anyway...
> 
> Look, all this is completely irrelevant for XDP_TX, since there is no
> allocation at all,
> once ~128 pages have been put into page_cache.

is it once in the beginning only? If so then why that
'if (!ring->page_cache.index)' check is done for every packet?
If it is every 128 packets, then it will cause a performance drop.
Since I don't completely understand how your new recycling
logic works, I'm asking you to test this patch with samples/bpf/xdp2
to make sure perf is still good. It doesn't sound like
you tested with xdp, and I don't understand the patch well enough to see
the impact, which makes me worried.

> Do you want to prefill this cache when XDP is loaded ?
> 
> > I think that's the right call, but it shouldn't be hidden in details.
> 
> ring->page_cache contains the pages used by XDP_TX; they are recycled
> through mlx4_en_rx_recycle().
> 
> There is a nice comment there. I did not change this part.
> 
> These pages _used_ 

Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path

2017-03-13 Thread Eric Dumazet
On Mon, Mar 13, 2017 at 1:23 PM, Alexei Starovoitov wrote:
> On Mon, Mar 13, 2017 at 11:58:05AM -0700, Eric Dumazet wrote:
>> On Mon, Mar 13, 2017 at 11:31 AM, Alexei Starovoitov
>>  wrote:
>> > On Mon, Mar 13, 2017 at 10:50:28AM -0700, Eric Dumazet wrote:
>> >> On Mon, Mar 13, 2017 at 10:34 AM, Alexei Starovoitov
>> >>  wrote:
>> >> > On Sun, Mar 12, 2017 at 05:58:47PM -0700, Eric Dumazet wrote:
>> >> >> @@ -767,10 +814,30 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>> >> >>   case XDP_PASS:
>> >> >>   break;
>> >> >>   case XDP_TX:
>> >> >> + /* Make sure we have one page ready to replace this one */
>> >> >> + npage = NULL;
>> >> >> + if (!ring->page_cache.index) {
>> >> >> + npage = mlx4_alloc_page(priv, ring,
>> >> >> + , numa_mem_id(),
>> >> >> + GFP_ATOMIC | __GFP_MEMALLOC);
>> >> >
>> >> > did you test this with xdp2 test ?
>> >> > under what conditions it allocates ?
>> >> > It looks dangerous from security point of view to do allocations here.
>> >> > Can it be exploited by an attacker?
>> >> > we use xdp for ddos and lb and this is fast path.
>> >> > If 1 out of 100s XDP_TX packets hit this allocation we will have serious
>> >> > perf regression.
>> >> > In general I don't think it's a good idea to penalize x86 in favor of 
>> >> > powerpc.
>> >> > Can you #ifdef this new code somehow? so we won't have these concerns 
>> >> > on x86?
>> >>
>> >> Normal paths would never hit this point really. I wanted to be extra
>> >> safe, because who knows, some guys could be tempted to set
>> >> ethtool -G ethX  rx 512 tx 8192
>> >>
>> >> Before this patch, if you were able to push enough frames in TX ring,
>> >> you would also eventually be forced to allocate memory, or drop frames...
>> >
>> > hmm. not following.
>> > Into xdp tx queues packets don't come from stack. It can only be via 
>> > xdp_tx.
>> > So this rx page belongs to driver, not shared with anyone and it only 
>> > needs to
>> > be put onto tx ring, so I don't understand why driver needs to allocate
>> > anything here. To refill the rx ring? but why here?
>>
>> Because RX ring can be shared, by packets going to the upper stacks (say TCP)
>>
>> So there is no guarantee that the pages in the quarantine pool have
>> their page count to 1.
>>
>> The normal TX_XDP path will recycle pages in ring->cache_page .
>>
>> This is exactly where I pick up a replacement.
>>
>> Pages in ring->cache_page have the guarantee to have no other users
>> than ourself (mlx4 driver)
>>
>> You might have not noticed that current mlx4 driver has a lazy refill
>> of RX ring buffers, that eventually
>> removes all the pages from RX ring, and we have to recover with this
>> lazy mlx4_en_recover_from_oom() thing
>> that will attempt to restart the allocations.
>>
>> After my patch, we have the guarantee that the RX ring buffer is
>> always fully populated.
>>
>> When we receive a frame (XDP or not), we drop it if we can not
>> replenish the RX slot,
>> in case the oldest page in quarantine is not a recycling candidate and
>> we can not allocate a new page.
>
> Got it. Could you please add above explanation to the commit log,
> since it's a huge difference vs other drivers.
> I don't think any other driver in the tree follows this strategy.

sk_buff allocation can fail before considering adding a frag on the
skb anyway...

Look, all this is completely irrelevant for XDP_TX, since there is no
allocation at all,
once ~128 pages have been put into page_cache.

Do you want to prefill this cache when XDP is loaded ?

> I think that's the right call, but it shouldn't be hidden in details.

ring->page_cache contains the pages used by XDP_TX; they are recycled
through mlx4_en_rx_recycle().

There is a nice comment there. I did not change this part.
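
(For context, the producer side Eric refers to looked roughly like this
in the driver at the time; a paraphrased sketch, details may differ:)

bool mlx4_en_rx_recycle(struct mlx4_en_rx_ring *ring,
			struct mlx4_en_rx_alloc *frame)
{
	struct mlx4_en_page_cache *cache = &ring->page_cache;

	/* XDP_TX pages come back here; refuse once the cache is full */
	if (cache->index >= MLX4_EN_CACHE_SIZE)
		return false;

	cache->buf[cache->index].page = frame->page;
	cache->buf[cache->index].dma = frame->dma;
	cache->index++;
	return true;	/* page stays DMA-mapped, ready for RX reuse */
}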

These pages _used_ to be directly taken from mlx4_en_prepare_rx_desc(),
because there was no page recycling for the normal path.

All my patch does is add a generic implementation
of page recycling, leaving the XDP_TX pages in their own pool,
because we do not even have to check their page count, which would add
a cache line miss.

So I do have 2 different pools, simply to keep the XDP_TX path ultra
fast, as before.

That does not look like a detail to me.


> If that's the right approach (probably is) we should do it
> in the other drivers too.
> Though I don't see you dropping the "to the stack" packet in this patch.
> The idea is to drop the packet (whether xdp or not) if the rx
> buffer cannot be replenished _regardless_ of whether the driver
> implements recycling, right?



>
> Theory vs 

Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path

2017-03-13 Thread Alexei Starovoitov
On Mon, Mar 13, 2017 at 11:58:05AM -0700, Eric Dumazet wrote:
> On Mon, Mar 13, 2017 at 11:31 AM, Alexei Starovoitov
>  wrote:
> > On Mon, Mar 13, 2017 at 10:50:28AM -0700, Eric Dumazet wrote:
> >> On Mon, Mar 13, 2017 at 10:34 AM, Alexei Starovoitov
> >>  wrote:
> >> > On Sun, Mar 12, 2017 at 05:58:47PM -0700, Eric Dumazet wrote:
> >> >> @@ -767,10 +814,30 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
> >> >>   case XDP_PASS:
> >> >>   break;
> >> >>   case XDP_TX:
> >> >> + /* Make sure we have one page ready to replace this one */
> >> >> + npage = NULL;
> >> >> + if (!ring->page_cache.index) {
> >> >> + npage = mlx4_alloc_page(priv, ring,
> >> >> + , numa_mem_id(),
> >> >> + GFP_ATOMIC | __GFP_MEMALLOC);
> >> >
> >> > did you test this with xdp2 test ?
> >> > under what conditions it allocates ?
> >> > It looks dangerous from security point of view to do allocations here.
> >> > Can it be exploited by an attacker?
> >> > we use xdp for ddos and lb and this is fast path.
> >> > If 1 out of 100s XDP_TX packets hit this allocation we will have serious
> >> > perf regression.
> >> > In general I don't think it's a good idea to penalize x86 in favor of 
> >> > powerpc.
> >> > Can you #ifdef this new code somehow? so we won't have these concerns on 
> >> > x86?
> >>
> >> Normal paths would never hit this point really. I wanted to be extra
> >> safe, because who knows, some guys could be tempted to set
> >> ethtool -G ethX  rx 512 tx 8192
> >>
> >> Before this patch, if you were able to push enough frames in TX ring,
> >> you would also eventually be forced to allocate memory, or drop frames...
> >
> > hmm. not following.
> > Into xdp tx queues packets don't come from stack. It can only be via xdp_tx.
> > So this rx page belongs to driver, not shared with anyone and it only needs 
> > to
> > be put onto tx ring, so I don't understand why driver needs to allocate
> > anything here. To refill the rx ring? but why here?
> 
> Because RX ring can be shared, by packets going to the upper stacks (say TCP)
> 
> So there is no guarantee that the pages in the quarantine pool have
> their page count to 1.
> 
> The normal TX_XDP path will recycle pages in ring->cache_page .
> 
> This is exactly where I pick up a replacement.
> 
> Pages in ring->cache_page have the guarantee to have no other users
> than ourself (mlx4 driver)
> 
> You might have not noticed that current mlx4 driver has a lazy refill
> of RX ring buffers, that eventually
> removes all the pages from RX ring, and we have to recover with this
> lazy mlx4_en_recover_from_oom() thing
> that will attempt to restart the allocations.
> 
> After my patch, we have the guarantee that the RX ring buffer is
> always fully populated.
> 
> When we receive a frame (XDP or not), we drop it if we can not
> replenish the RX slot,
> in case the oldest page in quarantine is not a recycling candidate and
> we can not allocate a new page.

Got it. Could you please add above explanation to the commit log,
since it's a huge difference vs other drivers.
I don't think any other driver in the tree follows this strategy.
I think that's the right call, but it shouldn't be hidden in details.
If that's the right approach (probably is) we should do it
in the other drivers too.
Though I don't see you dropping the "to the stack" packet in this patch.
The idea is to drop the packet (whether xdp or not) if the rx
buffer cannot be replenished _regardless_ of whether the driver
implements recycling, right?

> > rx 512 tx 8192 is meaningless from xdp pov, since most of the tx entries
> > will be unused.
> 
> Yes, but as an author of a kernel patch, I do not want to be
> responsible for a crash
> _if_ someone tries something dumb like that.
> 
> > why are you saying it will cause this if (!ring->page_cache.index) to 
> > trigger?
> >
> >> This patch does not penalize x86, quite the contrary.
> >> It brings a (small) improvement on x86, and a huge improvement on powerpc.
> >
> > for normal tcp stack. sure. I'm worried about xdp fast path that needs to 
> > be tested.
> 
> Note that TX_XDP in mlx4 is completely broken as I mentioned earlier.

Theory vs practice. We don't see this issue running xdp.
My understanding is that, since rx and tx napi are scheduled on the same
cpu, there is a guarantee that during normal invocation they won't race,
and the only problem is due to mlx4_en_recover_from_oom(), which can
run rx napi on some random cpu.



Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path

2017-03-13 Thread Eric Dumazet
On Mon, Mar 13, 2017 at 11:31 AM, Alexei Starovoitov wrote:
> On Mon, Mar 13, 2017 at 10:50:28AM -0700, Eric Dumazet wrote:
>> On Mon, Mar 13, 2017 at 10:34 AM, Alexei Starovoitov
>>  wrote:
>> > On Sun, Mar 12, 2017 at 05:58:47PM -0700, Eric Dumazet wrote:
>> >> @@ -767,10 +814,30 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>> >>   case XDP_PASS:
>> >>   break;
>> >>   case XDP_TX:
>> >> + /* Make sure we have one page ready to replace this one */
>> >> + npage = NULL;
>> >> + if (!ring->page_cache.index) {
>> >> + npage = mlx4_alloc_page(priv, ring,
>> >> + , numa_mem_id(),
>> >> + GFP_ATOMIC | __GFP_MEMALLOC);
>> >
>> > did you test this with xdp2 test ?
>> > under what conditions it allocates ?
>> > It looks dangerous from security point of view to do allocations here.
>> > Can it be exploited by an attacker?
>> > we use xdp for ddos and lb and this is fast path.
>> > If 1 out of 100s XDP_TX packets hit this allocation we will have serious
>> > perf regression.
>> > In general I don't think it's a good idea to penalize x86 in favor of 
>> > powerpc.
>> > Can you #ifdef this new code somehow? so we won't have these concerns on 
>> > x86?
>>
>> Normal paths would never hit this point really. I wanted to be extra
>> safe, because who knows, some guys could be tempted to set
>> ethtool -G ethX  rx 512 tx 8192
>>
>> Before this patch, if you were able to push enough frames in TX ring,
>> you would also eventually be forced to allocate memory, or drop frames...
>
> hmm. not following.
> Into xdp tx queues packets don't come from stack. It can only be via xdp_tx.
> So this rx page belongs to driver, not shared with anyone and it only needs to
> be put onto tx ring, so I don't understand why driver needs to allocate
> anything here. To refill the rx ring? but why here?

Because the RX ring can be shared, by packets going to the upper stacks (say TCP).

So there is no guarantee that the pages in the quarantine pool have
their page count back to 1.

The normal TX_XDP path will recycle pages in ring->page_cache.

This is exactly where I pick up a replacement.

Pages in ring->page_cache have the guarantee to have no other users
than ourselves (the mlx4 driver).

You might not have noticed that the current mlx4 driver has a lazy refill
of RX ring buffers that eventually removes all the pages from the RX ring,
and we have to recover with this lazy mlx4_en_recover_from_oom() thing
that will attempt to restart the allocations.

After my patch, we have the guarantee that the RX ring buffer is
always fully populated.

When we receive a frame (XDP or not), we drop it if we can not
replenish the RX slot, in case the oldest page in quarantine is not
a recycling candidate and we can not allocate a new page.
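
A hypothetical sketch of that recycle decision (the identifiers are
illustrative, not the exact ones from the patch):

static struct page *mlx4_try_recycle_oldest(struct mlx4_en_rx_ring *ring)
{
	/* look only at the oldest entry of the quarantine pool */
	struct page *page = ring->pool[ring->pool_tail];

	if (page_count(page) == 1) {	/* all old skbs/frags were consumed */
		ring->pool_tail = (ring->pool_tail + 1) % ring->pool_size;
		return page;		/* safe to hand back to the NIC */
	}
	return NULL;	/* caller must allocate a new page, or drop the frame */
}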



> rx 512 tx 8192 is meaningless from xdp pov, since most of the tx entries
> will be unused.

Yes, but as an author of a kernel patch, I do not want to be
responsible for a crash
_if_ someone tries something dumb like that.

> why are you saying it will cause this if (!ring->page_cache.index) to trigger?
>
>> This patch does not penalize x86, quite the contrary.
>> It brings a (small) improvement on x86, and a huge improvement on powerpc.
>
> for normal tcp stack. sure. I'm worried about xdp fast path that needs to be 
> tested.

Note that TX_XDP in mlx4 is completely broken as I mentioned earlier.


Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path

2017-03-13 Thread Alexei Starovoitov
On Mon, Mar 13, 2017 at 10:50:28AM -0700, Eric Dumazet wrote:
> On Mon, Mar 13, 2017 at 10:34 AM, Alexei Starovoitov
>  wrote:
> > On Sun, Mar 12, 2017 at 05:58:47PM -0700, Eric Dumazet wrote:
> >> @@ -767,10 +814,30 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
> >>   case XDP_PASS:
> >>   break;
> >>   case XDP_TX:
> >> + /* Make sure we have one page ready to replace this one */
> >> + npage = NULL;
> >> + if (!ring->page_cache.index) {
> >> + npage = mlx4_alloc_page(priv, ring,
> >> + , numa_mem_id(),
> >> + GFP_ATOMIC | __GFP_MEMALLOC);
> >
> > did you test this with xdp2 test ?
> > under what conditions it allocates ?
> > It looks dangerous from security point of view to do allocations here.
> > Can it be exploited by an attacker?
> > we use xdp for ddos and lb and this is fast path.
> > If 1 out of 100s XDP_TX packets hit this allocation we will have serious
> > perf regression.
> > In general I don't think it's a good idea to penalize x86 in favor of 
> > powerpc.
> > Can you #ifdef this new code somehow? so we won't have these concerns on 
> > x86?
> 
> Normal paths would never hit this point really. I wanted to be extra
> safe, because who knows, some guys could be tempted to set
> ethtool -G ethX  rx 512 tx 8192
> 
> Before this patch, if you were able to push enough frames in TX ring,
> you would also eventually be forced to allocate memory, or drop frames...

hmm. not following.
Into xdp tx queues packets don't come from stack. It can only be via xdp_tx.
So this rx page belongs to driver, not shared with anyone and it only needs to
be put onto tx ring, so I don't understand why driver needs to allocate
anything here. To refill the rx ring? but why here?
rx 512 tx 8192 is meaningless from xdp pov, since most of the tx entries
will be unused.
why are you saying it will cause this if (!ring->page_cache.index) to trigger?

> This patch does not penalize x86, quite the contrary.
> It brings a (small) improvement on x86, and a huge improvement on powerpc.

for normal tcp stack. sure. I'm worried about xdp fast path that needs to be 
tested.



Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path

2017-03-13 Thread Eric Dumazet
On Mon, Mar 13, 2017 at 10:34 AM, Alexei Starovoitov wrote:
> On Sun, Mar 12, 2017 at 05:58:47PM -0700, Eric Dumazet wrote:
>> @@ -767,10 +814,30 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>>   case XDP_PASS:
>>   break;
>>   case XDP_TX:
>> + /* Make sure we have one page ready to replace this one */
>> + npage = NULL;
>> + if (!ring->page_cache.index) {
>> + npage = mlx4_alloc_page(priv, ring,
>> + , numa_mem_id(),
>> + GFP_ATOMIC | __GFP_MEMALLOC);
>
> did you test this with xdp2 test ?
> under what conditions it allocates ?
> It looks dangerous from security point of view to do allocations here.
> Can it be exploited by an attacker?
> we use xdp for ddos and lb and this is fast path.
> If 1 out of 100s XDP_TX packets hit this allocation we will have serious
> perf regression.
> In general I don't think it's a good idea to penalize x86 in favor of powerpc.
> Can you #ifdef this new code somehow? so we won't have these concerns on x86?

Normal paths would never hit this point really. I wanted to be extra
safe, because who knows, some guys could be tempted to set
ethtool -G ethX  rx 512 tx 8192

Before this patch, if you were able to push enough frames in TX ring,
you would also eventually be forced to allocate memory, or drop frames...

This patch does not penalize x86, quite the contrary.
It brings a (small) improvement on x86, and a huge improvement on powerpc.

Thanks.


Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path

2017-03-13 Thread Alexei Starovoitov
On Sun, Mar 12, 2017 at 05:58:47PM -0700, Eric Dumazet wrote:
> @@ -767,10 +814,30 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>   case XDP_PASS:
>   break;
>   case XDP_TX:
> + /* Make sure we have one page ready to replace this one */
> + npage = NULL;
> + if (!ring->page_cache.index) {
> + npage = mlx4_alloc_page(priv, ring,
> + , numa_mem_id(),
> + GFP_ATOMIC | __GFP_MEMALLOC);

did you test this with xdp2 test ?
under what conditions it allocates ?
It looks dangerous from security point of view to do allocations here.
Can it be exploited by an attacker?
we use xdp for ddos and lb and this is fast path.
If 1 out of 100s XDP_TX packets hit this allocation we will have serious
perf regression.
In general I don't think it's a good idea to penalize x86 in favor of powerpc.
Can you #ifdef this new code somehow? so we won't have these concerns on x86?



Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path

2017-03-13 Thread Eric Dumazet
On Mon, 2017-03-13 at 06:07 -0700, Eric Dumazet wrote:

> I removed the include 
> 
> It seems we need .

Yes, I will add this to v2 :

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index de455c8a2dec389cfeca6b6d474a6184d6acf618..a71554649c25383bb765fa8220bc9cd490247aee 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 
+#include 
 #if IS_ENABLED(CONFIG_IPV6)
 #include 
 #endif




Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path

2017-03-13 Thread Eric Dumazet
On Mon, 2017-03-13 at 20:50 +0800, kbuild test robot wrote:
> Hi Eric,
> 
> [auto build test WARNING on net-next/master]
> 
> url:
> https://github.com/0day-ci/linux/commits/Eric-Dumazet/mlx4-Better-use-of-order-0-pages-in-RX-path/20170313-191100
> config: x86_64-randconfig-s5-03131942 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=x86_64 
> 
> All warnings (new ones prefixed by >>):
> 
> drivers/net/ethernet/mellanox/mlx4/en_rx.c:622:12: warning: 'struct iphdr' declared inside parameter list will not be visible outside of this definition or declaration
> struct iphdr *iph)
>^
>In file included from include/linux/swab.h:4:0,
> from include/uapi/linux/byteorder/little_endian.h:12,
> from include/linux/byteorder/little_endian.h:4,
> from arch/x86/include/uapi/asm/byteorder.h:4,
> from include/asm-generic/bitops/le.h:5,
> from arch/x86/include/asm/bitops.h:517,
> from include/linux/bitops.h:36,
> from include/linux/kernel.h:10,
> from include/linux/list.h:8,
> from include/linux/timer.h:4,
> from include/linux/workqueue.h:8,
> from include/linux/bpf.h:12,
> from drivers/net/ethernet/mellanox/mlx4/en_rx.c:34:
> drivers/net/ethernet/mellanox/mlx4/en_rx.c: In function 'get_fixed_ipv4_csum':
> drivers/net/ethernet/mellanox/mlx4/en_rx.c:627:36: error: dereferencing pointer to incomplete type 'struct iphdr'
>  length_for_csum = (be16_to_cpu(iph->tot_len) - (iph->ihl << 2));
>^
>include/uapi/linux/swab.h:100:54: note: in definition of macro '__swab16'
> #define __swab16(x) (__u16)__builtin_bswap16((__u16)(x))
>  ^
> >> include/linux/byteorder/generic.h:96:21: note: in expansion of macro '__be16_to_cpu'
> #define be16_to_cpu __be16_to_cpu
> ^
> >> drivers/net/ethernet/mellanox/mlx4/en_rx.c:627:21: note: in expansion of macro 'be16_to_cpu'
>  length_for_csum = (be16_to_cpu(iph->tot_len) - (iph->ihl << 2));
> ^~~
> 
> vim +/be16_to_cpu +627 drivers/net/ethernet/mellanox/mlx4/en_rx.c

I removed the include 

It seems we need .








Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path

2017-03-13 Thread Eric Dumazet
On Mon, 2017-03-13 at 14:01 +0200, Tariq Toukan wrote:

> I think MM-list people won't be happy with this.
> We were doing a similar thing with order-5 pages in mlx5 Striding RQ:
> Allocate and split high-order pages, to have:
> - Physically contiguous memory,
> - Less page allocations,
> - Yet, keep a fine grained refcounts/truesize.
> In case no high-order page available, fallback to using order-0 pages.
> 
> However, we changed this behavior, as it was fragmenting the memory, and 
> depleting the high-order pages available quickly.


Sure, I was not happy with this scheme either.

I was the first to complain and suggest split_page() one year ago.

mlx5 was using __GFP_MEMALLOC for its MLX5_MPWRQ_WQE_PAGE_ORDER
allocations, and failure had no fallback.

mlx5e_alloc_rx_mpwqe() was simply giving up immediately.

Very different behavior there, since :

1) we normally recycle 99% [1] of the pages, and rx_alloc_order quickly
decreases under memory pressure.


2) My high order allocations use __GFP_NOMEMALLOC to cancel the
__GFP_MEMALLOC (see the sketch after this list).

3) Also note that I chose to periodically reset rx_alloc_order from
mlx4_en_recover_from_oom() to the initial value.
   We could later change this to a slow recovery if really needed, but
   my tests were fine with this.


[1] This driver might need to change the default RX ring sizes.
1024 slots is a bit short for a 40Gbit NIC these days.

(We tune this to 4096)
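
A minimal sketch of the flag combination in point 2 (the helper name is
made up for illustration; note __GFP_NOMEMALLOC takes precedence over
__GFP_MEMALLOC when both are set):

static inline gfp_t mlx4_high_order_gfp(gfp_t gfp)
{
	return gfp | __GFP_NOMEMALLOC |	/* never dip into emergency reserves */
		     __GFP_NOWARN |	/* no allocation-failure warning */
		     __GFP_NORETRY;	/* fail fast under memory pressure */
}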

Thanks!




Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path

2017-03-13 Thread kbuild test robot
Hi Eric,

[auto build test WARNING on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Eric-Dumazet/mlx4-Better-use-of-order-0-pages-in-RX-path/20170313-191100
config: x86_64-randconfig-s5-03131942 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   drivers/net/ethernet/mellanox/mlx4/en_rx.c:622:12: warning: 'struct iphdr' declared inside parameter list will not be visible outside of this definition or declaration
struct iphdr *iph)
   ^
   In file included from include/linux/swab.h:4:0,
from include/uapi/linux/byteorder/little_endian.h:12,
from include/linux/byteorder/little_endian.h:4,
from arch/x86/include/uapi/asm/byteorder.h:4,
from include/asm-generic/bitops/le.h:5,
from arch/x86/include/asm/bitops.h:517,
from include/linux/bitops.h:36,
from include/linux/kernel.h:10,
from include/linux/list.h:8,
from include/linux/timer.h:4,
from include/linux/workqueue.h:8,
from include/linux/bpf.h:12,
from drivers/net/ethernet/mellanox/mlx4/en_rx.c:34:
   drivers/net/ethernet/mellanox/mlx4/en_rx.c: In function 'get_fixed_ipv4_csum':
   drivers/net/ethernet/mellanox/mlx4/en_rx.c:627:36: error: dereferencing pointer to incomplete type 'struct iphdr'
 length_for_csum = (be16_to_cpu(iph->tot_len) - (iph->ihl << 2));
   ^
   include/uapi/linux/swab.h:100:54: note: in definition of macro '__swab16'
#define __swab16(x) (__u16)__builtin_bswap16((__u16)(x))
 ^
>> include/linux/byteorder/generic.h:96:21: note: in expansion of macro '__be16_to_cpu'
#define be16_to_cpu __be16_to_cpu
^
>> drivers/net/ethernet/mellanox/mlx4/en_rx.c:627:21: note: in expansion of macro 'be16_to_cpu'
 length_for_csum = (be16_to_cpu(iph->tot_len) - (iph->ihl << 2));
^~~

vim +/be16_to_cpu +627 drivers/net/ethernet/mellanox/mlx4/en_rx.c

f8c6455b Shani Michaeli 2014-11-09  611  static inline __wsum get_fixed_vlan_csum(__wsum hw_checksum,
f8c6455b Shani Michaeli 2014-11-09  612  					  struct vlan_hdr *vlanh)
f8c6455b Shani Michaeli 2014-11-09  613  {
f8c6455b Shani Michaeli 2014-11-09  614  	return csum_add(hw_checksum, *(__wsum *)vlanh);
f8c6455b Shani Michaeli 2014-11-09  615  }
f8c6455b Shani Michaeli 2014-11-09  616  
f8c6455b Shani Michaeli 2014-11-09  617  /* Although the stack expects checksum which doesn't include the pseudo
f8c6455b Shani Michaeli 2014-11-09  618   * header, the HW adds it. To address that, we are subtracting the pseudo
f8c6455b Shani Michaeli 2014-11-09  619   * header checksum from the checksum value provided by the HW.
f8c6455b Shani Michaeli 2014-11-09  620   */
f8c6455b Shani Michaeli 2014-11-09  621  static void get_fixed_ipv4_csum(__wsum hw_checksum, struct sk_buff *skb,
f8c6455b Shani Michaeli 2014-11-09  622  				struct iphdr *iph)
f8c6455b Shani Michaeli 2014-11-09  623  {
f8c6455b Shani Michaeli 2014-11-09  624  	__u16 length_for_csum = 0;
f8c6455b Shani Michaeli 2014-11-09  625  	__wsum csum_pseudo_header = 0;
f8c6455b Shani Michaeli 2014-11-09  626  
f8c6455b Shani Michaeli 2014-11-09 @627  	length_for_csum = (be16_to_cpu(iph->tot_len) - (iph->ihl << 2));
f8c6455b Shani Michaeli 2014-11-09  628  	csum_pseudo_header = csum_tcpudp_nofold(iph->saddr, iph->daddr,
f8c6455b Shani Michaeli 2014-11-09  629  						length_for_csum, iph->protocol, 0);
f8c6455b Shani Michaeli 2014-11-09  630  	skb->csum = csum_sub(hw_checksum, csum_pseudo_header);
f8c6455b Shani Michaeli 2014-11-09  631  }
f8c6455b Shani Michaeli 2014-11-09  632  
f8c6455b Shani Michaeli 2014-11-09  633  #if IS_ENABLED(CONFIG_IPV6)
f8c6455b Shani Michaeli 2014-11-09  634  /* In IPv6 packets, besides subtracting the pseudo header checksum,
f8c6455b Shani Michaeli 2014-11-09  635   * we also compute/add the IP header checksum which

:: The code at line 627 was first introduced by commit
:: f8c6455bb04b944edb69e9b074e28efee2c56bdd net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE

:: TO: Shani Michaeli 
:: CC: David S. Miller 

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path

2017-03-13 Thread kbuild test robot
Hi Eric,

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Eric-Dumazet/mlx4-Better-use-of-order-0-pages-in-RX-path/20170313-191100
config: x86_64-acpi-redef (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All error/warnings (new ones prefixed by >>):

>> drivers/net/ethernet/mellanox/mlx4/en_rx.c:622:12: warning: 'struct iphdr' declared inside parameter list will not be visible outside of this definition or declaration
struct iphdr *iph)
   ^
   In file included from include/linux/swab.h:4:0,
from include/uapi/linux/byteorder/little_endian.h:12,
from include/linux/byteorder/little_endian.h:4,
from arch/x86/include/uapi/asm/byteorder.h:4,
from include/asm-generic/bitops/le.h:5,
from arch/x86/include/asm/bitops.h:517,
from include/linux/bitops.h:36,
from include/linux/kernel.h:10,
from include/linux/list.h:8,
from include/linux/timer.h:4,
from include/linux/workqueue.h:8,
from include/linux/bpf.h:12,
from drivers/net/ethernet/mellanox/mlx4/en_rx.c:34:
   drivers/net/ethernet/mellanox/mlx4/en_rx.c: In function 'get_fixed_ipv4_csum':
>> drivers/net/ethernet/mellanox/mlx4/en_rx.c:627:36: error: dereferencing pointer to incomplete type 'struct iphdr'
 length_for_csum = (be16_to_cpu(iph->tot_len) - (iph->ihl << 2));
   ^
   include/uapi/linux/swab.h:100:54: note: in definition of macro '__swab16'
#define __swab16(x) (__u16)__builtin_bswap16((__u16)(x))
 ^
   include/linux/byteorder/generic.h:96:21: note: in expansion of macro '__be16_to_cpu'
#define be16_to_cpu __be16_to_cpu
^
   drivers/net/ethernet/mellanox/mlx4/en_rx.c:627:21: note: in expansion of macro 'be16_to_cpu'
 length_for_csum = (be16_to_cpu(iph->tot_len) - (iph->ihl << 2));
^~~

vim +627 drivers/net/ethernet/mellanox/mlx4/en_rx.c

f8c6455b Shani Michaeli 2014-11-09  616  
f8c6455b Shani Michaeli 2014-11-09  617  /* Although the stack expects checksum which doesn't include the pseudo
f8c6455b Shani Michaeli 2014-11-09  618   * header, the HW adds it. To address that, we are subtracting the pseudo
f8c6455b Shani Michaeli 2014-11-09  619   * header checksum from the checksum value provided by the HW.
f8c6455b Shani Michaeli 2014-11-09  620   */
f8c6455b Shani Michaeli 2014-11-09  621  static void get_fixed_ipv4_csum(__wsum hw_checksum, struct sk_buff *skb,
f8c6455b Shani Michaeli 2014-11-09 @622  				struct iphdr *iph)
f8c6455b Shani Michaeli 2014-11-09  623  {
f8c6455b Shani Michaeli 2014-11-09  624  	__u16 length_for_csum = 0;
f8c6455b Shani Michaeli 2014-11-09  625  	__wsum csum_pseudo_header = 0;
f8c6455b Shani Michaeli 2014-11-09  626  
f8c6455b Shani Michaeli 2014-11-09 @627  	length_for_csum = (be16_to_cpu(iph->tot_len) - (iph->ihl << 2));
f8c6455b Shani Michaeli 2014-11-09  628  	csum_pseudo_header = csum_tcpudp_nofold(iph->saddr, iph->daddr,
f8c6455b Shani Michaeli 2014-11-09  629  						length_for_csum, iph->protocol, 0);
f8c6455b Shani Michaeli 2014-11-09  630  	skb->csum = csum_sub(hw_checksum, csum_pseudo_header);

:: The code at line 627 was first introduced by commit
:: f8c6455bb04b944edb69e9b074e28efee2c56bdd net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE

:: TO: Shani Michaeli 
:: CC: David S. Miller 

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path

2017-03-13 Thread Tariq Toukan

Hi Eric, thanks for your patch.

On 13/03/2017 2:58 AM, Eric Dumazet wrote:

When adding order-0 pages allocations and page recycling in receive path,
I added issues on PowerPC, or more generally on arches with large pages.

A GRO packet, aggregating 45 segments, ended up using 45 page frags
on 45 different pages. Before my changes we were very likely packing
up to 42 Ethernet frames per 64KB page.

1) At skb freeing time, all put_page() on the skb frags now touch 45
   different 'struct page' and this adds more cache line misses.
   Too bad that standard Ethernet MTU is so small :/

2) Using one order-0 page per ring slot consumes ~42 times more memory
   on PowerPC.

3) Allocating order-0 pages is very likely to use pages from very
   different locations, increasing TLB pressure on hosts with more
   than 256 GB of memory after days of uptime.

This patch uses a refined strategy, addressing these points.

We still use order-0 pages, but the page recycling technique is modified
so that we have better chances to lower number of pages containing the
frags for a given GRO skb (factor of 2 on x86, and 21 on PowerPC)

Page allocations are split in two halves :
- One currently visible by the NIC for DMA operations.
- The other contains pages that were already added to old skbs, put in
  a quarantine.

When we receive a frame, we look at the oldest entry in the pool and
check if the page count is back to one, meaning old skbs/frags were
consumed and the page can be recycled.

Page allocations are attempted using high order ones, trying
to lower TLB pressure.


I think MM-list people won't be happy with this.
We were doing a similar thing with order-5 pages in mlx5 Striding RQ:
Allocate and split high-order pages, to have:
- Physically contiguous memory,
- Less page allocations,
- Yet, keep a fine grained refcounts/truesize.
In case no high-order page available, fallback to using order-0 pages.

However, we changed this behavior, as it was fragmenting the memory, and 
depleting the high-order pages available quickly.




On x86, memory allocations stay the same. (One page per RX slot for MTU=1500)
But on PowerPC, this patch considerably reduces the allocated memory.

Performance gain on PowerPC is about 50% for a single TCP flow.


Nice!



On x86, I could not measure the difference, my test machine being
limited by the sender (33 Gbit per TCP flow).
22 less cache line misses per 64 KB GRO packet is probably in the order
of 2 % or so.

Signed-off-by: Eric Dumazet 
Cc: Tariq Toukan 
Cc: Saeed Mahameed 
Cc: Alexander Duyck 
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c   | 462 +++
 drivers/net/ethernet/mellanox/mlx4/en_tx.c   |  15 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  54 +++-
 3 files changed, 310 insertions(+), 221 deletions(-)



Thanks,
Tariq


[PATCH net-next] mlx4: Better use of order-0 pages in RX path

2017-03-12 Thread Eric Dumazet
When adding order-0 pages allocations and page recycling in receive path,
I added issues on PowerPC, or more generally on arches with large pages.

A GRO packet, aggregating 45 segments, ended up using 45 page frags
on 45 different pages. Before my changes we were very likely packing
up to 42 Ethernet frames per 64KB page.

1) At skb freeing time, all put_page() on the skb frags now touch 45
   different 'struct page' and this adds more cache line misses.
   Too bad that standard Ethernet MTU is so small :/

2) Using one order-0 page per ring slot consumes ~42 times more memory
   on PowerPC.

3) Allocating order-0 pages is very likely to use pages from very
   different locations, increasing TLB pressure on hosts with more
   than 256 GB of memory after days of uptime.

This patch uses a refined strategy, addressing these points.

We still use order-0 pages, but the page recycling technique is modified
so that we have better chances to lower number of pages containing the
frags for a given GRO skb (factor of 2 on x86, and 21 on PowerPC)

Page allocations are split in two halves :
- One currently visible by the NIC for DMA operations.
- The other contains pages that were already added to old skbs, put in
  a quarantine.

When we receive a frame, we look at the oldest entry in the pool and
check if the page count is back to one, meaning old skbs/frags were
consumed and the page can be recycled.

Page allocations are attempted using high order ones, trying
to lower TLB pressure.

On x86, memory allocations stay the same. (One page per RX slot for MTU=1500)
But on PowerPC, this patch considerably reduces the allocated memory.

Performance gain on PowerPC is about 50% for a single TCP flow.

On x86, I could not measure the difference, my test machine being
limited by the sender (33 Gbit per TCP flow).
22 less cache line misses per 64 KB GRO packet is probably in the order
of 2 % or so.

Signed-off-by: Eric Dumazet 
Cc: Tariq Toukan 
Cc: Saeed Mahameed 
Cc: Alexander Duyck 
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c   | 462 +++
 drivers/net/ethernet/mellanox/mlx4/en_tx.c   |  15 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  54 +++-
 3 files changed, 310 insertions(+), 221 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index aa074e57ce06fb2842fa1faabd156c3cd2fe10f5..de455c8a2dec389cfeca6b6d474a6184d6acf618 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -31,7 +31,6 @@
  *
  */
 
-#include 
 #include 
 #include 
 #include 
@@ -50,59 +49,41 @@
 
 #include "mlx4_en.h"
 
-static int mlx4_alloc_page(struct mlx4_en_priv *priv,
-  struct mlx4_en_rx_alloc *frag,
-  gfp_t gfp)
+static struct page *mlx4_alloc_page(struct mlx4_en_priv *priv,
+   struct mlx4_en_rx_ring *ring,
+   dma_addr_t *dma,
+   unsigned int node, gfp_t gfp)
 {
struct page *page;
-   dma_addr_t dma;
-
-   page = alloc_page(gfp);
-   if (unlikely(!page))
-   return -ENOMEM;
-   dma = dma_map_page(priv->ddev, page, 0, PAGE_SIZE, priv->dma_dir);
-   if (unlikely(dma_mapping_error(priv->ddev, dma))) {
-   __free_page(page);
-   return -ENOMEM;
-   }
-   frag->page = page;
-   frag->dma = dma;
-   frag->page_offset = priv->rx_headroom;
-   return 0;
-}
 
-static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
-  struct mlx4_en_rx_ring *ring,
-  struct mlx4_en_rx_desc *rx_desc,
-  struct mlx4_en_rx_alloc *frags,
-  gfp_t gfp)
-{
-   int i;
+   if (unlikely(!ring->pre_allocated_count)) {
+   unsigned int order = READ_ONCE(ring->rx_alloc_order);
 
-   for (i = 0; i < priv->num_frags; i++, frags++) {
-   if (!frags->page) {
-   if (mlx4_alloc_page(priv, frags, gfp))
-   return -ENOMEM;
-   ring->rx_alloc_pages++;
+   page = __alloc_pages_node(node, gfp | __GFP_NOMEMALLOC |
+   __GFP_NOWARN | __GFP_NORETRY,
+ order);
+   if (page) {
+   split_page(page, order);
+   ring->pre_allocated_count = 1U << order;
+   } else {
+   if (order > 1)
+   ring->rx_alloc_order--;
+   page = __alloc_pages_node(node, gfp, 0);
+   if (unlikely(!page))
+   return NULL;
+   ring->pre_allocated_count = 1U;