Re: [PATCH net-next 2/2] udp: implement and use per cpu rx skbs cache

2018-04-23 Thread Jesper Dangaard Brouer
On Sun, 22 Apr 2018 13:22:58 +0200
Paolo Abeni  wrote:

> On Fri, 2018-04-20 at 15:48 +0200, Jesper Dangaard Brouer wrote:
> > On Thu, 19 Apr 2018 06:47:10 -0700 Eric Dumazet  
> > wrote:  
> > > On 04/19/2018 12:40 AM, Paolo Abeni wrote:  
> > > > On Wed, 2018-04-18 at 12:21 -0700, Eric Dumazet wrote:
> > > > > On 04/18/2018 10:15 AM, Paolo Abeni wrote:  
> > 
> > [...]  
> > > > 
> > > > Any suggestions for better results are more than welcome!
> > > 
> > > Yes, remote skb freeing. I mentioned this idea to Jesper and Tariq in
> > > Seoul (netdev conference). Not tied to UDP, but a generic solution.  
> > 
> > Yes, I remember.  I think... was it the idea, where you basically
> > wanted to queue back SKBs to the CPU that allocated them, right?
> > 
> > Freeing an SKB on the same CPU that allocated it, have multiple
> > advantages. (1) the SLUB allocator can use a non-atomic
> > "cpu-local" (double)cmpxchg. (2) the 4 cache-lines memset cleared of
> > the SKB stay local.  (3) the atomic SKB refcnt/users stay local.  
> 
> By the time the skb is returned to the ingress cpu, isn't that skb most
> probably out of the cache?

This is a too simplistic view.  You have to look at the cache
coherence state[1] of the individual cache lines (SKB consist of 4
cache-lines). And newer Intel CPUs [2] can "Forward(F)" cache-lines
between caches.  The SKB cache-line that have atomic refcnt/users
important to analyze (Read For Ownership (RFO) case).  Analyzing the
other cache-lines is actually more complicated due to techniques like
"Store Buffer" and "Invalidate Queues".

[1] https://en.wikipedia.org/wiki/MESI_protocol
[2] https://en.wikipedia.org/wiki/MESIF_protocol

There is also a lot of detail in point (1) about how the SLUB
alloactor works internally, and how it avoids bouncing the struct-page
cache-line.  Some of the performance benefit from you current patch
also comes from this...


> > We just have to avoid that queue back SKB's mechanism, doesn't cost
> > more than the operations we expect to save.  Bulk transfer is an
> > obvious approach.  For storing SKBs until they are returned, we already
> > have a fast mechanism see napi_consume_skb calling _kfree_skb_defer,
> > which SLUB/SLAB-bulk free to amortize cost (1).
> > 
> > I guess, the missing information is that we don't know what CPU the SKB
> > were created on...
> > 
> > Where to store this CPU info?
> > 
> > (a) In struct sk_buff, in a cache-line that is already read on remote
> > CPU in UDP code?
> > 
> > (b) In struct page, as SLUB alloc hand-out objects/SKBs on a per page
> > basis, we could have SLUB store a hint about the CPU it was allocated
> > on, and bet on returning to that CPU ? (might be bad to read the
> > struct-page cache-line)  
> 
> Bulking would be doable only for connected sockets, elsewhere would be
> difficult to assemble a burst long enough to amortize the handshake
> with the remote CPU (spinlock + ipi needed ?!?)

We obviously need some level of bulking.

I would likely try to avoid any explicit IPI calls, but instead use a
queue like the ptr_ring queue, because it have good separation between
cache-lines used by consumer and producer (but it might be overkill
for this use-case).

 
> Would be good enough for unconnected sockets sending a whole skb burst
> back to one of the (several) ingress CPU? e.g. peeking the CPU
> associated with the first skb inside the burst, we would somewhat
> balance the load between the ingress CPUs.

See, Willem de Bruijn suggestions...

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH net-next 2/2] udp: implement and use per cpu rx skbs cache

2018-04-23 Thread Tariq Toukan



On 20/04/2018 4:48 PM, Jesper Dangaard Brouer wrote:


On Thu, 19 Apr 2018 06:47:10 -0700 Eric Dumazet  wrote:

On 04/19/2018 12:40 AM, Paolo Abeni wrote:

On Wed, 2018-04-18 at 12:21 -0700, Eric Dumazet wrote:

On 04/18/2018 10:15 AM, Paolo Abeni wrote:

[...]


Any suggestions for better results are more than welcome!


Yes, remote skb freeing. I mentioned this idea to Jesper and Tariq in
Seoul (netdev conference). Not tied to UDP, but a generic solution.


Yes, I remember.  I think... was it the idea, where you basically
wanted to queue back SKBs to the CPU that allocated them, right?

Freeing an SKB on the same CPU that allocated it, have multiple
advantages. (1) the SLUB allocator can use a non-atomic
"cpu-local" (double)cmpxchg. (2) the 4 cache-lines memset cleared of
the SKB stay local.  (3) the atomic SKB refcnt/users stay local.

We just have to avoid that queue back SKB's mechanism, doesn't cost
more than the operations we expect to save.  Bulk transfer is an
obvious approach.  For storing SKBs until they are returned, we already
have a fast mechanism see napi_consume_skb calling _kfree_skb_defer,
which SLUB/SLAB-bulk free to amortize cost (1).

I guess, the missing information is that we don't know what CPU the SKB
were created on...

Where to store this CPU info?

(a) In struct sk_buff, in a cache-line that is already read on remote
CPU in UDP code?

(b) In struct page, as SLUB alloc hand-out objects/SKBs on a per page
basis, we could have SLUB store a hint about the CPU it was allocated
on, and bet on returning to that CPU ? (might be bad to read the
struct-page cache-line)



I'm in favor of (a).

Pages of an SKB originates on the same cpu (guaranteed by NAPI).
So a single field in SKB is good for all of its fragments, no need to 
read this info from every single page. This also keeps the change local 
to the networking subsystem.


Best if we find a hole in struct SKB (for u16?), or union it with a 
mutually exclusive field.


Regards,
Tariq


Re: [PATCH net-next 2/2] udp: implement and use per cpu rx skbs cache

2018-04-22 Thread Paolo Abeni
On Fri, 2018-04-20 at 15:48 +0200, Jesper Dangaard Brouer wrote:
> On Thu, 19 Apr 2018 06:47:10 -0700 Eric Dumazet  
> wrote:
> > On 04/19/2018 12:40 AM, Paolo Abeni wrote:
> > > On Wed, 2018-04-18 at 12:21 -0700, Eric Dumazet wrote:  
> > > > On 04/18/2018 10:15 AM, Paolo Abeni wrote:
> 
> [...]
> > > 
> > > Any suggestions for better results are more than welcome!  
> > 
> > Yes, remote skb freeing. I mentioned this idea to Jesper and Tariq in
> > Seoul (netdev conference). Not tied to UDP, but a generic solution.
> 
> Yes, I remember.  I think... was it the idea, where you basically
> wanted to queue back SKBs to the CPU that allocated them, right?
> 
> Freeing an SKB on the same CPU that allocated it, have multiple
> advantages. (1) the SLUB allocator can use a non-atomic
> "cpu-local" (double)cmpxchg. (2) the 4 cache-lines memset cleared of
> the SKB stay local.  (3) the atomic SKB refcnt/users stay local.

By the time the skb is returned to the ingress cpu, isn't that skb most
probably out of the cache?

> We just have to avoid that queue back SKB's mechanism, doesn't cost
> more than the operations we expect to save.  Bulk transfer is an
> obvious approach.  For storing SKBs until they are returned, we already
> have a fast mechanism see napi_consume_skb calling _kfree_skb_defer,
> which SLUB/SLAB-bulk free to amortize cost (1).
> 
> I guess, the missing information is that we don't know what CPU the SKB
> were created on...
> 
> Where to store this CPU info?
> 
> (a) In struct sk_buff, in a cache-line that is already read on remote
> CPU in UDP code?
> 
> (b) In struct page, as SLUB alloc hand-out objects/SKBs on a per page
> basis, we could have SLUB store a hint about the CPU it was allocated
> on, and bet on returning to that CPU ? (might be bad to read the
> struct-page cache-line)

Bulking would be doable only for connected sockets, elsewhere would be
difficult to assemble a burst long enough to amortize the handshake
with the remote CPU (spinlock + ipi needed ?!?)

Would be good enough for unconnected sockets sending a whole skb burst
back to one of the (several) ingress CPU? e.g. peeking the CPU
associated with the first skb inside the burst, we would somewhat
balance the load between the ingress CPUs.

Cheers,

Paolo


Re: [PATCH net-next 2/2] udp: implement and use per cpu rx skbs cache

2018-04-21 Thread Eric Dumazet


On 04/21/2018 08:54 AM, Willem de Bruijn wrote:
> On Fri, Apr 20, 2018 at 9:48 AM, Jesper Dangaard Brouer
>  wrote:
>>
>> On Thu, 19 Apr 2018 06:47:10 -0700 Eric Dumazet  
>> wrote:
>>> On 04/19/2018 12:40 AM, Paolo Abeni wrote:
 On Wed, 2018-04-18 at 12:21 -0700, Eric Dumazet wrote:
> On 04/18/2018 10:15 AM, Paolo Abeni wrote:
>> [...]

 Any suggestions for better results are more than welcome!
>>>
>>> Yes, remote skb freeing. I mentioned this idea to Jesper and Tariq in
>>> Seoul (netdev conference). Not tied to UDP, but a generic solution.
>>
>> Yes, I remember.  I think... was it the idea, where you basically
>> wanted to queue back SKBs to the CPU that allocated them, right?
>>
>> Freeing an SKB on the same CPU that allocated it, have multiple
>> advantages. (1) the SLUB allocator can use a non-atomic
>> "cpu-local" (double)cmpxchg. (2) the 4 cache-lines memset cleared of
>> the SKB stay local.  (3) the atomic SKB refcnt/users stay local.
>>
>> We just have to avoid that queue back SKB's mechanism, doesn't cost
>> more than the operations we expect to save.  Bulk transfer is an
>> obvious approach.  For storing SKBs until they are returned, we already
>> have a fast mechanism see napi_consume_skb calling _kfree_skb_defer,
>> which SLUB/SLAB-bulk free to amortize cost (1).
>>
>> I guess, the missing information is that we don't know what CPU the SKB
>> were created on...
> 
> For connected sockets, sk->sk_incoming_cpu has this data. It
> records BH cpu on enqueue to udp socket, so one caveat is that
> it may be wrong with rps/rfs.
> 
> Another option is to associate not with source cpu but napi struct
> and have the device driver free in the context of its napi processing.
> This has the additional benefit that skb->napi_id is already stored
> per skb, so this also works for unconnected sockets.
> 
> Third, the skb->napi_id field is unused after setting sk->sk_napi_id
> on sk enqueue, so the BH cpu could be stored here after that,
> essentially extending sk_incoming_cpu to unconnected sockets.

We use at Google something named TXCS, which is what I mentioned to Jesper and 
Tariq.

(In our case, we wanted to not perform skb destructor/freeing on the cpu 
handling the TX queue,
but on cpus that originally cooked the skb (running TCP stack))

To accommodate generic needs (both RX and TX), I do not believe we can union 
any existing fields,
without a lot of pain/bugs.




Re: [PATCH net-next 2/2] udp: implement and use per cpu rx skbs cache

2018-04-21 Thread Willem de Bruijn
On Fri, Apr 20, 2018 at 9:48 AM, Jesper Dangaard Brouer
 wrote:
>
> On Thu, 19 Apr 2018 06:47:10 -0700 Eric Dumazet  
> wrote:
>> On 04/19/2018 12:40 AM, Paolo Abeni wrote:
>> > On Wed, 2018-04-18 at 12:21 -0700, Eric Dumazet wrote:
>> >> On 04/18/2018 10:15 AM, Paolo Abeni wrote:
> [...]
>> >
>> > Any suggestions for better results are more than welcome!
>>
>> Yes, remote skb freeing. I mentioned this idea to Jesper and Tariq in
>> Seoul (netdev conference). Not tied to UDP, but a generic solution.
>
> Yes, I remember.  I think... was it the idea, where you basically
> wanted to queue back SKBs to the CPU that allocated them, right?
>
> Freeing an SKB on the same CPU that allocated it, have multiple
> advantages. (1) the SLUB allocator can use a non-atomic
> "cpu-local" (double)cmpxchg. (2) the 4 cache-lines memset cleared of
> the SKB stay local.  (3) the atomic SKB refcnt/users stay local.
>
> We just have to avoid that queue back SKB's mechanism, doesn't cost
> more than the operations we expect to save.  Bulk transfer is an
> obvious approach.  For storing SKBs until they are returned, we already
> have a fast mechanism see napi_consume_skb calling _kfree_skb_defer,
> which SLUB/SLAB-bulk free to amortize cost (1).
>
> I guess, the missing information is that we don't know what CPU the SKB
> were created on...

For connected sockets, sk->sk_incoming_cpu has this data. It
records BH cpu on enqueue to udp socket, so one caveat is that
it may be wrong with rps/rfs.

Another option is to associate not with source cpu but napi struct
and have the device driver free in the context of its napi processing.
This has the additional benefit that skb->napi_id is already stored
per skb, so this also works for unconnected sockets.

Third, the skb->napi_id field is unused after setting sk->sk_napi_id
on sk enqueue, so the BH cpu could be stored here after that,
essentially extending sk_incoming_cpu to unconnected sockets.


Re: [PATCH net-next 2/2] udp: implement and use per cpu rx skbs cache

2018-04-20 Thread Jesper Dangaard Brouer

On Thu, 19 Apr 2018 06:47:10 -0700 Eric Dumazet  wrote:
> On 04/19/2018 12:40 AM, Paolo Abeni wrote:
> > On Wed, 2018-04-18 at 12:21 -0700, Eric Dumazet wrote:  
> >> On 04/18/2018 10:15 AM, Paolo Abeni wrote:
[...]
> > 
> > Any suggestions for better results are more than welcome!  
> 
> Yes, remote skb freeing. I mentioned this idea to Jesper and Tariq in
> Seoul (netdev conference). Not tied to UDP, but a generic solution.

Yes, I remember.  I think... was it the idea, where you basically
wanted to queue back SKBs to the CPU that allocated them, right?

Freeing an SKB on the same CPU that allocated it, have multiple
advantages. (1) the SLUB allocator can use a non-atomic
"cpu-local" (double)cmpxchg. (2) the 4 cache-lines memset cleared of
the SKB stay local.  (3) the atomic SKB refcnt/users stay local.

We just have to avoid that queue back SKB's mechanism, doesn't cost
more than the operations we expect to save.  Bulk transfer is an
obvious approach.  For storing SKBs until they are returned, we already
have a fast mechanism see napi_consume_skb calling _kfree_skb_defer,
which SLUB/SLAB-bulk free to amortize cost (1).

I guess, the missing information is that we don't know what CPU the SKB
were created on...

Where to store this CPU info?

(a) In struct sk_buff, in a cache-line that is already read on remote
CPU in UDP code?

(b) In struct page, as SLUB alloc hand-out objects/SKBs on a per page
basis, we could have SLUB store a hint about the CPU it was allocated
on, and bet on returning to that CPU ? (might be bad to read the
struct-page cache-line)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH net-next 2/2] udp: implement and use per cpu rx skbs cache

2018-04-19 Thread Eric Dumazet


On 04/19/2018 12:40 AM, Paolo Abeni wrote:
> Hi,
> 
> On Wed, 2018-04-18 at 12:21 -0700, Eric Dumazet wrote:
>>
>> On 04/18/2018 10:15 AM, Paolo Abeni wrote:
>> is not appealing to me :/
>>>
>>> Thank you for the feedback.
>>> Sorry for not being clear about it, but knotd is using SO_REUSEPORT and
>>> the above tests are leveraging it.
>>>
>>> That 5% is on top of that 300%.
>>
>> Then there is something wrong.
>>
>> Adding copies should not increase performance.
> 
> The skb and data are copied into the UDP skb cache only if the socket
> is under memory pressure, and that happens if and only if the receiver
> is slower than the BH/IP receive path.

Which is going to happen under attack.

Bimodal behavior is dangerous for system stability..

> 
> The copy slows down the RX path - which was dropping packets - and
> makes the udp_recvmsg() considerably faster, as consuming skb becomes
> almost a no-op.
> 
> AFAICS, this is similar to the strategy you used in:
> 
> ommit c8c8b127091b758f5768f906bcdeeb88bc9951ca
> Author: Eric Dumazet 
> Date:   Wed Dec 7 09:19:33 2016 -0800
> 
> udp: under rx pressure, try to condense skbs
> 
> with the difference that with the UDP skb cache there is an hard limit
> to the amount of memory the BH is allowed to copy.
>

Very different strategy really.

We do not copy 500 bytes per skb :/

and the total amount of memory is tunable (socket rcvbuf)
instead of hard coded in the kernel :/

 
>> If it does, there is certainly another way, reaching 10% instead of 5%
> 
> I benchmarked vs a DNS server to test and verify that we get measurable
> benefits in real life scenario. The measured performance gain for the
> RX path with reasonable configurations is ~20%.

Then we probably can make +40% without copies.



> 
> Any suggestions for better results are more than welcome!


Yes, remote skb freeing. I mentioned this idea to Jesper and Tariq in Seoul 
(netdev conference)

Not tied to UDP, but a generic solution.

You are adding more and more code only that only helps in some benchmarks 
really.

UDP stack is becoming a very complex beast, while heavy duty UDP servers have 
alternatives,
and can not cope with arbitrary flood anyway.




Re: [PATCH net-next 2/2] udp: implement and use per cpu rx skbs cache

2018-04-19 Thread Paolo Abeni
Hi,

On Wed, 2018-04-18 at 12:21 -0700, Eric Dumazet wrote:
> 
> On 04/18/2018 10:15 AM, Paolo Abeni wrote:
> is not appealing to me :/
> > 
> > Thank you for the feedback.
> > Sorry for not being clear about it, but knotd is using SO_REUSEPORT and
> > the above tests are leveraging it.
> > 
> > That 5% is on top of that 300%.
> 
> Then there is something wrong.
> 
> Adding copies should not increase performance.

The skb and data are copied into the UDP skb cache only if the socket
is under memory pressure, and that happens if and only if the receiver
is slower than the BH/IP receive path.

The copy slows down the RX path - which was dropping packets - and
makes the udp_recvmsg() considerably faster, as consuming skb becomes
almost a no-op.

AFAICS, this is similar to the strategy you used in:

ommit c8c8b127091b758f5768f906bcdeeb88bc9951ca
Author: Eric Dumazet 
Date:   Wed Dec 7 09:19:33 2016 -0800

udp: under rx pressure, try to condense skbs

with the difference that with the UDP skb cache there is an hard limit
to the amount of memory the BH is allowed to copy.

> If it does, there is certainly another way, reaching 10% instead of 5%

I benchmarked vs a DNS server to test and verify that we get measurable
benefits in real life scenario. The measured performance gain for the
RX path with reasonable configurations is ~20%.

Any suggestions for better results are more than welcome!

Cheers,

Paolo


Re: [PATCH net-next 2/2] udp: implement and use per cpu rx skbs cache

2018-04-18 Thread Eric Dumazet


On 04/18/2018 10:15 AM, Paolo Abeni wrote:
is not appealing to me :/
> 
> Thank you for the feedback.
> Sorry for not being clear about it, but knotd is using SO_REUSEPORT and
> the above tests are leveraging it.
> 
> That 5% is on top of that 300%.

Then there is something wrong.

Adding copies should not increase performance.

If it does, there is certainly another way, reaching 10% instead of 5%




Re: [PATCH net-next 2/2] udp: implement and use per cpu rx skbs cache

2018-04-18 Thread Paolo Abeni
Hi,

On Wed, 2018-04-18 at 09:56 -0700, Eric Dumazet wrote:
> 
> On 04/18/2018 03:22 AM, Paolo Abeni wrote:
> > This changeset extends the idea behind commit c8c8b127091b ("udp:
> > under rx pressure, try to condense skbs"), trading more BH cpu
> > time and memory bandwidth to decrease the load on the user space
> > receiver.
> > 
> > At boot time we allocate a limited amount of skbs with small
> > data buffer, storing them in per cpu arrays. Such skbs are never
> > freed.
> > 
> > At run time, under rx pressure, the BH tries to copy the current
> > skb contents into the cache - if the current cache skb is available,
> > and the ingress skb is small enough and without any head states.
> > 
> > When using the cache skb, the ingress skb is dropped by the BH
> > - while still hot on cache - and the cache skb is inserted into
> > the rx queue, after increasing its usage count. Also, the cache
> > array index is moved to the next entry.
> > 
> > The receive side is unmodified: in udp_rcvmsg() the usage skb
> > usage count is decreased and the skb is _not_ freed - since the
> > cache keeps usage > 0. Since skb->usage is hot in the cache of the
> > receiver at consume time - the receiver has just read skb->data,
> > which lies in the same cacheline - the whole skb_consume_udp() becomes
> > really cheap.
> > 
> > UDP receive performances under flood improve as follow:
> > 
> > NR RX queuesKppsKppsDelta (%)
> > Before  After
> > 
> > 1   225223052
> > 2   2151256919
> > 4   2033239617
> > 8   1969232918
> > 
> > Overall performances of knotd DNS server under real traffic flood
> > improves as follow:
> > 
> > KppsKppsDelta (%)
> > Before  After
> > 
> > 377739815
> 
> 
> It might be time for knotd DNS server to finally use SO_REUSEPORT instead of
> adding this bloat to the kernel ?
> 
> Sorry, 5% improvement while you easily can get 300% improvement with no 
> kernel change
> is not appealing to me :/

Thank you for the feedback.
Sorry for not being clear about it, but knotd is using SO_REUSEPORT and
the above tests are leveraging it.

That 5% is on top of that 300%.

Cheers,

Paolo




Re: [PATCH net-next 2/2] udp: implement and use per cpu rx skbs cache

2018-04-18 Thread Eric Dumazet


On 04/18/2018 03:22 AM, Paolo Abeni wrote:
> This changeset extends the idea behind commit c8c8b127091b ("udp:
> under rx pressure, try to condense skbs"), trading more BH cpu
> time and memory bandwidth to decrease the load on the user space
> receiver.
> 
> At boot time we allocate a limited amount of skbs with small
> data buffer, storing them in per cpu arrays. Such skbs are never
> freed.
> 
> At run time, under rx pressure, the BH tries to copy the current
> skb contents into the cache - if the current cache skb is available,
> and the ingress skb is small enough and without any head states.
> 
> When using the cache skb, the ingress skb is dropped by the BH
> - while still hot on cache - and the cache skb is inserted into
> the rx queue, after increasing its usage count. Also, the cache
> array index is moved to the next entry.
> 
> The receive side is unmodified: in udp_rcvmsg() the usage skb
> usage count is decreased and the skb is _not_ freed - since the
> cache keeps usage > 0. Since skb->usage is hot in the cache of the
> receiver at consume time - the receiver has just read skb->data,
> which lies in the same cacheline - the whole skb_consume_udp() becomes
> really cheap.
> 
> UDP receive performances under flood improve as follow:
> 
> NR RX queues  KppsKppsDelta (%)
>   Before  After
> 
> 1 225223052
> 2 2151256919
> 4 2033239617
> 8 1969232918
> 
> Overall performances of knotd DNS server under real traffic flood
> improves as follow:
> 
>   KppsKppsDelta (%)
>   Before  After
> 
>   377739815


It might be time for knotd DNS server to finally use SO_REUSEPORT instead of
adding this bloat to the kernel ?

Sorry, 5% improvement while you easily can get 300% improvement with no kernel 
change
is not appealing to me :/