Re: [RFC PATCH 00/11] udp: full early demux for unconnected sockets
On Fri, 2017-09-22 at 14:58 -0700, Eric Dumazet wrote:
> On Fri, 2017-09-22 at 23:06 +0200, Paolo Abeni wrote:
> > This series refactors the UDP early demux code so that:
> >
> > * a full socket lookup is performed for unicast packets
> > * a sk is grabbed even on an unconnected socket match
> > * a dst cache is used even in such a scenario
> >
> > To perform these tasks a couple of facilities are added:
> >
> > * noref socket references, scoped inside the current RCU section, to be
> >   explicitly cleared before leaving such section
> > * a dst cache inside the inet and inet6 local addresses tables, caching the
> >   related local dst entry
> >
> > The measured performance gain under small packet UDP flood is as follows:
> >
> > ingress NIC        vanilla        patched        delta
> > rx queues          (kpps)         (kpps)         (%)
> > [ipv4]
> >  1                 2177           2414           10
> >  2                 2527           2892           14
> >  3                 3050           3733           22
>
> This is a clear sign your program is not using latest SO_REUSEPORT +
> [ec]BPF filter [1]
>
>   return socket[RX_QUEUE# | or CPU#];
>
> If udp_sink uses SO_REUSEPORT with no extra hint, socket selection is
> based on a lazy hash, meaning that you do not have proper siloing.
>
>   return socket[hash(skb)];
>
> Multiple cpus can then:
> - compete on grabbing the same socket refcount
> - compete on grabbing the receive queue lock
> - compete for releasing the lock and socket refcount
> - skb freeing done on different cpus than where allocated.
>
> You are adding complexity to the kernel because you are using a
> sub-optimal user space program, favoring false sharing.
>
> First solve the false sharing issue.
>
> Performance with 2 rx queues should be almost twice the performance
> with 1 rx queue.
>
> Then we can see if the gains you claim are still applicable.
Here are the performance results using a BPF filter to distribute the
ingress packets to the reuseport socket whose id matches the ingress
CPU - so we have a 1:1 mapping between the ingress receive queue and
the destination socket:

ingress NIC        vanilla        patched        delta
rx queues          (kpps)         (kpps)         (%)
[ipv4]
 2                 3020           3663           21
 3                 4352           5179           19
 4                 5318           6194           16
 5                 6258           7583           21
 6                 7376           8558           16
[ipv6]
 2                 2446           3949           61
 3                 3099           5092           64
 4                 3698           6611           78
 5                 4382           7852           79
 6                 5116           8851           73

Some notes:

- figures obtained with:

	ethtool -L em2 combined $n
	MASK=1
	for I in `seq 0 $((n - 1))`; do
		[ $I -eq 0 ] && USE_BPF="--use_bpf" || USE_BPF=""
		udp_sink --reuseport $USE_BPF --recvfrom --count 1000 --port 9 &
		taskset -p $((MASK << ($I + $n) )) $!
	done

- in the IPv6 routing code we currently have a relevant bottleneck in
  ip6_pol_route(); I see a lot of contention on a dst refcount, so
  without early demux the performance does not scale well there.

- for maximum performance, BH and the user space sink need to run on
  different CPUs.

- yes, we have some more cacheline misses and a little contention on
  the receive queue spin lock, but a lot fewer icache misses and more
  CPU cycles available; the overall tput is a lot higher than when
  binding on the same CPU where the BH is running.

> PS: Wei Wan is about to release the IPV6 changes so that the big
> differences you showed are going to disappear soon.

Interesting, looking forward to that!

Cheers,

Paolo
Re: [RFC PATCH 00/11] udp: full early demux for unconnected sockets
On Fri, 2017-09-22 at 23:06 +0200, Paolo Abeni wrote:
> This series refactors the UDP early demux code so that:
>
> * a full socket lookup is performed for unicast packets
> * a sk is grabbed even on an unconnected socket match
> * a dst cache is used even in such a scenario
>
> To perform these tasks a couple of facilities are added:
>
> * noref socket references, scoped inside the current RCU section, to be
>   explicitly cleared before leaving such section
> * a dst cache inside the inet and inet6 local addresses tables, caching the
>   related local dst entry
>
> The measured performance gain under small packet UDP flood is as follows:
>
> ingress NIC        vanilla        patched        delta
> rx queues          (kpps)         (kpps)         (%)
> [ipv4]
>  1                 2177           2414           10
>  2                 2527           2892           14
>  3                 3050           3733           22

This is a clear sign your program is not using latest SO_REUSEPORT +
[ec]BPF filter [1]

  return socket[RX_QUEUE# | or CPU#];

If udp_sink uses SO_REUSEPORT with no extra hint, socket selection is
based on a lazy hash, meaning that you do not have proper siloing.

  return socket[hash(skb)];

Multiple cpus can then:
- compete on grabbing the same socket refcount
- compete on grabbing the receive queue lock
- compete for releasing the lock and socket refcount
- skb freeing done on different cpus than where allocated.

You are adding complexity to the kernel because you are using a
sub-optimal user space program, favoring false sharing.

First solve the false sharing issue.

Performance with 2 rx queues should be almost twice the performance
with 1 rx queue.

Then we can see if the gains you claim are still applicable.

Thanks

PS: Wei Wan is about to release the IPV6 changes so that the big
differences you showed are going to disappear soon.
Refs:

[1] tools/testing/selftests/net/reuseport_bpf.c

6a5ef90c58daada158ba16ba330558efc3471491 Merge branch 'faster-soreuseport'
3ca8e4029969d40ab90e3f1ecd83ab1cadd60fbb soreuseport: BPF selection functional test
538950a1b7527a0a52ccd9337e3fcd304f027f13 soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF
e32ea7e747271a0abcd37e265005e97cc81d9df5 soreuseport: fast reuseport UDP socket selection
ef456144da8ef507c8cf504284b6042e9201a05c soreuseport: define reuseport groups
[RFC PATCH 00/11] udp: full early demux for unconnected sockets
This series refactors the UDP early demux code so that:

* a full socket lookup is performed for unicast packets
* a sk is grabbed even on an unconnected socket match
* a dst cache is used even in such a scenario

To perform these tasks a couple of facilities are added:

* noref socket references, scoped inside the current RCU section, to be
  explicitly cleared before leaving such section
* a dst cache inside the inet and inet6 local addresses tables, caching the
  related local dst entry

The measured performance gain under small packet UDP flood is as follows:

ingress NIC        vanilla        patched        delta
rx queues          (kpps)         (kpps)         (%)
[ipv4]
 1                 2177           2414           10
 2                 2527           2892           14
 3                 3050           3733           22
 4                 3918           4643           18
 5                 5074           5699           12
 6                 5654           6869           21
[ipv6]
 1                 2002           2821           40
 2                 2087           3148           50
 3                 2583           4008           55
 4                 3072           4963           61
 5                 3719           5992           61
 6                 4314           6910           60

The number of user space processes in use is equal to the number of NIC
rx queues; when multiple user space processes are spawned, the
SO_REUSEPORT option is used, as described below:

ethtool -L em2 combined $n
MASK=1
for I in `seq 0 $((n - 1))`; do
	udp_sink --reuse-port --recvfrom --count 10 --port 9 $1 &
	taskset -p $((MASK << ($I + $n) )) $!
done

Paolo Abeni (11):
  net: add support for noref skb->sk
  net: allow early demux to fetch noref socket
  udp: do not touch socket refcount in early demux
  net: add simple socket-like dst cache helpers
  udp: perform full socket lookup in early demux
  ip/route: factor out helper for local route creation
  ipv6/addrconf: add an helper for inet6 address lookup
  net: implement local route cache inside ifaddr
  route: add ipv4/6 helpers to do partial route lookup vs local dst
  IP: early demux can return an error code
  udp: dst lookup in early demux for unconnected sockets

 include/linux/inetdevice.h       |   4 ++
 include/linux/skbuff.h           |  31 +++
 include/linux/udp.h              |   2 +
 include/net/addrconf.h           |   3 ++
 include/net/dst.h                |  20 +++
 include/net/if_inet6.h           |   4 ++
 include/net/ip6_route.h          |   1 +
 include/net/protocol.h           |   4 +-
 include/net/route.h              |   4 ++
 include/net/tcp.h                |   2 +-
 include/net/udp.h                |   2 +-
 net/core/dst.c                   |  12 +
 net/core/sock.c                  |   7 +++
 net/ipv4/devinet.c               |  29 ++-
 net/ipv4/ip_input.c              |  33
 net/ipv4/netfilter/nf_dup_ipv4.c |   3 ++
 net/ipv4/route.c                 |  73 +++---
 net/ipv4/tcp_ipv4.c              |   9 ++--
 net/ipv4/udp.c                   |  95 +++---
 net/ipv6/addrconf.c              | 109 +++
 net/ipv6/ip6_input.c             |   4 ++
 net/ipv6/netfilter/nf_dup_ipv6.c |   3 ++
 net/ipv6/route.c                 |  13 +
 net/ipv6/udp.c                   |  72 ++
 net/netfilter/nf_queue.c         |   3 ++
 25 files changed, 383 insertions(+), 159 deletions(-)

--
2.13.5
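For readers unfamiliar with the noref idea from the cover letter: the
sketch below illustrates the intended lifetime rules only. All helper
names here (lookup_unconnected_socket, skb_set_noref_sk,
cached_local_dst) are hypothetical placeholders, not the series' actual
API; skb_dst_set_noref() is the existing kernel helper the dst side is
modeled on.

```c
/* Conceptual sketch only -- helper names are hypothetical.
 * Inside a single RCU read-side section, early demux may stash a
 * socket pointer in skb->sk without taking a refcount, provided the
 * noref reference is cleared (or upgraded to a real refcount) before
 * rcu_read_unlock(). */
rcu_read_lock();
sk = lookup_unconnected_socket(skb);	/* no refcount taken */
if (sk) {
	skb_set_noref_sk(skb, sk);	/* hypothetical: mark sk as noref */
	dst = cached_local_dst(sk);	/* hypothetical: dst cached in ifaddr */
	if (dst)
		skb_dst_set_noref(skb, dst);
}
/* ... deliver, or clear the noref sk, before leaving the section ... */
rcu_read_unlock();
```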