Re: [RFC PATCH 00/11] udp: full early demux for unconnected sockets

2017-09-25 Thread Paolo Abeni
On Fri, 2017-09-22 at 14:58 -0700, Eric Dumazet wrote:
> On Fri, 2017-09-22 at 23:06 +0200, Paolo Abeni wrote:
> > This series refactors the UDP early demux code so that:
> > 
> > * full socket lookup is performed for unicast packets
> > * a sk is grabbed even for unconnected socket match
> > * a dst cache is used even in such scenario
> > 
> > To perform these tasks a couple of facilities are added:
> > 
> > * noref socket references, scoped inside the current RCU section, to be
> >   explicitly cleared before leaving such section
> > * a dst cache inside the inet and inet6 local addresses tables, caching the
> >   related local dst entry
> > 
> > The measured performance gain under small-packet UDP flood is as follows:
> > 
> > ingress NIC  vanilla  patched  delta
> > rx queues    (kpps)   (kpps)   (%)
> > [ipv4]
> > 1            2177     2414     10
> > 2            2527     2892     14
> > 3            3050     3733     22
> 
> 
> This is a clear sign your program is not using the latest SO_REUSEPORT +
> [ec]BPF filter [1]:
> 
> return socket[RX_QUEUE# | or CPU#];
> 
> If udp_sink uses SO_REUSEPORT with no extra hint, socket selection is
> based on a lazy hash, meaning that you do not have proper siloing.
> 
> return socket[hash(skb)];
> 
> Multiple CPUs can then:
>  - compete on grabbing same socket refcount
>  - compete on grabbing the receive queue lock
>  - compete for releasing lock and socket refcount
>  - skb freeing done on different cpus than where allocated.
> 
> You are adding complexity to the kernel because you are using a
> sub-optimal user space program, favoring false sharing.
> 
> First solve the false sharing issue.
> 
> Performance with 2 rx queues should be almost twice the performance with
> 1 rx queue.
> 
> Then we can see if the gains you claim are still applicable.

Here are the performance results using a BPF filter to distribute the
ingress packets to the reuseport socket with the same id as the ingress
CPU - we have a 1:1 mapping between each ingress receive queue and the
destination socket:

ingress NIC  vanilla  patched  delta
rx queues    (kpps)   (kpps)   (%)
[ipv4]
2            3020     3663     21
3            4352     5179     19
4            5318     6194     16
5            6258     7583     21
6            7376     8558     16

[ipv6]
2            2446     3949     61
3            3099     5092     64
4            3698     6611     78
5            4382     7852     79
6            5116     8851     73

Some notes:

- figures obtained with:

ethtool -L em2 combined $n
MASK=1
for I in `seq 0 $((n - 1))`; do
    [ $I -eq 0 ] && USE_BPF="--use_bpf" || USE_BPF=""
    udp_sink --reuseport $USE_BPF --recvfrom --count 1000 --port 9 &
    taskset -p $((MASK << ($I + $n) )) $!
done

- in the IPv6 routing code we currently have a relevant bottleneck in
ip6_pol_route(): I see a lot of contention on a dst refcount, so
without early demux performance does not scale well there.

- For maximum performance the BH and the user space sink need to run on
different CPUs: yes, we get some more cacheline misses and a little
contention on the receive queue spin lock, but also far fewer icache
misses and more CPU cycles available, so the overall tput is a lot
higher than when binding the sink to the same CPU where the BH is running.

> PS: Wei Wan is about to release the IPV6 changes so that the big
> differences you showed are going to disappear soon.

Interesting, looking forward to that!

Cheers,

Paolo


Re: [RFC PATCH 00/11] udp: full early demux for unconnected sockets

2017-09-22 Thread Eric Dumazet
On Fri, 2017-09-22 at 23:06 +0200, Paolo Abeni wrote:
> This series refactors the UDP early demux code so that:
> 
> * full socket lookup is performed for unicast packets
> * a sk is grabbed even for unconnected socket match
> * a dst cache is used even in such scenario
> 
> To perform these tasks a couple of facilities are added:
> 
> * noref socket references, scoped inside the current RCU section, to be
>   explicitly cleared before leaving such section
> * a dst cache inside the inet and inet6 local addresses tables, caching the
>   related local dst entry
> 
> The measured performance gain under small-packet UDP flood is as follows:
> 
> ingress NIC  vanilla  patched  delta
> rx queues    (kpps)   (kpps)   (%)
> [ipv4]
> 1            2177     2414     10
> 2            2527     2892     14
> 3            3050     3733     22


This is a clear sign your program is not using the latest SO_REUSEPORT +
[ec]BPF filter [1]:

return socket[RX_QUEUE# | or CPU#];

If udp_sink uses SO_REUSEPORT with no extra hint, socket selection is
based on a lazy hash, meaning that you do not have proper siloing.

return socket[hash(skb)];

Multiple CPUs can then:
 - compete on grabbing same socket refcount
 - compete on grabbing the receive queue lock
 - compete for releasing lock and socket refcount
 - skb freeing done on different cpus than where allocated.

You are adding complexity to the kernel because you are using a
sub-optimal user space program, favoring false sharing.

First solve the false sharing issue.

Performance with 2 rx queues should be almost twice the performance with
1 rx queue.

Then we can see if the gains you claim are still applicable.

Thanks

PS: Wei Wan is about to release the IPV6 changes so that the big
differences you showed are going to disappear soon.

Refs [1]

tools/testing/selftests/net/reuseport_bpf.c

6a5ef90c58daada158ba16ba330558efc3471491 Merge branch 'faster-soreuseport'
3ca8e4029969d40ab90e3f1ecd83ab1cadd60fbb soreuseport: BPF selection functional test
538950a1b7527a0a52ccd9337e3fcd304f027f13 soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF
e32ea7e747271a0abcd37e265005e97cc81d9df5 soreuseport: fast reuseport UDP socket selection
ef456144da8ef507c8cf504284b6042e9201a05c soreuseport: define reuseport groups




[RFC PATCH 00/11] udp: full early demux for unconnected sockets

2017-09-22 Thread Paolo Abeni
This series refactors the UDP early demux code so that:

* full socket lookup is performed for unicast packets
* a sk is grabbed even for unconnected socket match
* a dst cache is used even in such scenario

To perform these tasks a couple of facilities are added:

* noref socket references, scoped inside the current RCU section, to be
  explicitly cleared before leaving such section
* a dst cache inside the inet and inet6 local addresses tables, caching the
  related local dst entry

The measured performance gain under small-packet UDP flood is as follows:

ingress NIC  vanilla  patched  delta
rx queues    (kpps)   (kpps)   (%)
[ipv4]
1            2177     2414     10
2            2527     2892     14
3            3050     3733     22
4            3918     4643     18
5            5074     5699     12
6            5654     6869     21

[ipv6]
1            2002     2821     40
2            2087     3148     50
3            2583     4008     55
4            3072     4963     61
5            3719     5992     61
6            4314     6910     60

The number of user space processes in use is equal to the number of
NIC rx queues; when multiple user space processes are running, the
SO_REUSEPORT option is used, as described below:

ethtool -L em2 combined $n
MASK=1
for I in `seq 0 $((n - 1))`; do
    udp_sink --reuse-port --recvfrom --count 10 --port 9 $1 &
    taskset -p $((MASK << ($I + $n) )) $!
done

Paolo Abeni (11):
  net: add support for noref skb->sk
  net: allow early demux to fetch noref socket
  udp: do not touch socket refcount in early demux
  net: add simple socket-like dst cache helpers
  udp: perform full socket lookup in early demux
  ip/route: factor out helper for local route creation
  ipv6/addrconf: add a helper for inet6 address lookup
  net: implement local route cache inside ifaddr
  route: add ipv4/6 helpers to do partial route lookup vs local dst
  IP: early demux can return an error code
  udp: dst lookup in early demux for unconnected sockets

 include/linux/inetdevice.h   |   4 ++
 include/linux/skbuff.h   |  31 +++
 include/linux/udp.h  |   2 +
 include/net/addrconf.h   |   3 ++
 include/net/dst.h|  20 +++
 include/net/if_inet6.h   |   4 ++
 include/net/ip6_route.h  |   1 +
 include/net/protocol.h   |   4 +-
 include/net/route.h  |   4 ++
 include/net/tcp.h|   2 +-
 include/net/udp.h|   2 +-
 net/core/dst.c   |  12 +
 net/core/sock.c  |   7 +++
 net/ipv4/devinet.c   |  29 ++-
 net/ipv4/ip_input.c  |  33 
 net/ipv4/netfilter/nf_dup_ipv4.c |   3 ++
 net/ipv4/route.c |  73 +++---
 net/ipv4/tcp_ipv4.c  |   9 ++--
 net/ipv4/udp.c   |  95 +++---
 net/ipv6/addrconf.c  | 109 +++
 net/ipv6/ip6_input.c |   4 ++
 net/ipv6/netfilter/nf_dup_ipv6.c |   3 ++
 net/ipv6/route.c |  13 +
 net/ipv6/udp.c   |  72 ++
 net/netfilter/nf_queue.c |   3 ++
 25 files changed, 383 insertions(+), 159 deletions(-)

-- 
2.13.5