On Tue, Mar 22, 2022 at 04:42:45PM +0100, Claudio Jeker wrote:
> No but you push this layer into a specific direction and by that make it
> harder to fix the PCB tables in a different way.  I just see people
> changing the NET_ASSERT_WLOCKED() without realizing the actual reason for
> the exclusive netlock use.
Of course MP for PCB could be implemented differently: SRP, SMR, whatever.
But a mutex is easy and we should start with that.  As we currently run
only on one CPU, it does not matter.  There is much more to be done, like
ref counting and protecting the PCB fields, but I want to go in small
steps.

This NET_ASSERT_WLOCKED() beside SIMPLEQ_INIT() makes it quite obvious
where the next unlocking problem is.  Look at tdb_walk() in
netinet/ip_ipsp.c, there is another one.  When they are only left in the
slow path, a lot is gained.  And if not, we have to fix them step by step.

> Looking at the pcb hash problem, I have to wonder if this reinserting of
> PCBs is actually resulting in a measurable performance difference. The
> hash table should be large enough to keep the number of PCB per bucket low.

The reinsertion is done for PCB notify, UDP multicast, raw IPv4 and IPv6.
I don't have benchmarks for these cases and I doubt that others will feel
much difference there.

There is one thing that might make things slower.  The in_pcbref(inp)
and in_pcbunref(inp) pair is not strictly necessary, as we hold the
exclusive net lock.  But I put it there so we will not forget it when
unlocking.  Maybe it costs a bit of performance, but who cares about
multicast and rip.

I have measured the PCB mutex diff.  But as usual, benchmarks have to be
explained.
http://bluhm.genua.de/perform/results/2022-03-10T17%3A19%3A00Z/perform.html

The left column is the baseline, the middle column is my mistake, where I
only applied the IPv4 part.  The right column is the full PCB diff, but
without the UDP multicast and raw IP queuing, which I missed before.

In the TCP performance graph (IPv4 only), the right column looks slow.
But the code difference between middle and right only affects IPv6.  How
can that be?  The answer is in this row:

    kernel name list    +53 -52    +40054 -40054

Mapping the kernel object files to different pages affects throughput a
lot, more than most diffs.  So I sort and align them and compare the
nm /bsd output.
When you click on it you see how 40000 symbol addresses move around.  So
the more or less correct numbers are in the middle column, but only for
IPv4.

Look at the kstack output of one TCP benchmark:
http://bluhm.genua.de/perform/results/2022-03-10T17%3A19%3A00Z/patch-sys-pcbtable-mtx.0/btrace/iperf3_-c10.3.45.35_-w1m_-t10_-R-btrace-kstack.0.svg
Search for mtx in the top right field.  Then you see mutex contention.
It is not in PCB lookup, as the pf state links to the socket.

When receiving short UDP packets you can see the affected code:
http://bluhm.genua.de/perform/results/2022-03-10T17%3A19%3A00Z/patch-sys-pcbtable-mtx.0/btrace/udpbench_-l36_-t10_-r_ot15_recv_10.3.45.34-btrace-kstack.0.svg
We spend 4.3% in PCB lookup, and in that part you find 10% in the mutex.
Compare it to the original code, where this mutex is not there:
http://bluhm.genua.de/perform/results/2022-03-10T17%3A19%3A00Z/2022-03-10T00%3A00%3A00Z/btrace/udpbench_-l36_-t10_-r_ot15_recv_10.3.45.34-btrace-kstack.0.svg
But the UDP throughput numbers do not change.

> One comment below.
> At least the one bit that can fail can be moved outside of this loop:

This is a very good idea and can be done upfront.  ok?
bluhm

Index: netinet6/raw_ip6.c
===================================================================
RCS file: /data/mirror/openbsd/cvs/src/sys/netinet6/raw_ip6.c,v
retrieving revision 1.145
diff -u -p -r1.145 raw_ip6.c
--- netinet6/raw_ip6.c	21 Mar 2022 09:12:34 -0000	1.145
+++ netinet6/raw_ip6.c	22 Mar 2022 16:22:43 -0000
@@ -125,10 +125,19 @@ rip6_input(struct mbuf **mp, int *offp,
 	struct in6_addr *key;
 	struct sockaddr_in6 rip6src;
 	struct mbuf *opts = NULL;
+	uint8_t type;

 	KASSERT(af == AF_INET6);

-	if (proto != IPPROTO_ICMPV6)
+	if (proto == IPPROTO_ICMPV6) {
+		struct icmp6_hdr *icmp6;
+
+		IP6_EXTHDR_GET(icmp6, struct icmp6_hdr *, m, *offp,
+		    sizeof(*icmp6));
+		if (icmp6 == NULL)
+			return IPPROTO_DONE;
+		type = icmp6->icmp6_type;
+	} else
 		rip6stat_inc(rip6s_ipackets);

 	bzero(&rip6src, sizeof(rip6src));
@@ -177,16 +186,7 @@ rip6_input(struct mbuf **mp, int *offp,
 		    !IN6_ARE_ADDR_EQUAL(&in6p->inp_faddr6, &ip6->ip6_src))
 			continue;
 		if (proto == IPPROTO_ICMPV6 && in6p->inp_icmp6filt) {
-			struct icmp6_hdr *icmp6;
-
-			IP6_EXTHDR_GET(icmp6, struct icmp6_hdr *, m, *offp,
-			    sizeof(*icmp6));
-			if (icmp6 == NULL) {
-				mtx_leave(&rawin6pcbtable.inpt_mtx);
-				return IPPROTO_DONE;
-			}
-			if (ICMP6_FILTER_WILLBLOCK(icmp6->icmp6_type,
-			    in6p->inp_icmp6filt))
+			if (ICMP6_FILTER_WILLBLOCK(type, in6p->inp_icmp6filt))
 				continue;
 		}
 		if (proto != IPPROTO_ICMPV6 && in6p->inp_cksum6 != -1) {
