Re: rip sbappendaddr() with inpcb table mutex

Claudio Jeker Tue, 22 Mar 2022 11:05:44 -0700

On Tue, Mar 22, 2022 at 06:35:47PM +0100, Alexander Bluhm wrote:
> On Tue, Mar 22, 2022 at 04:42:45PM +0100, Claudio Jeker wrote:
> > No but you push this layer into a specifc direction and by that make it
> > harder to fix the PCB tables in a different way. I just see people
> > changing the NET_ASSERT_WLOCKED() without realizing the actual reason for
> > the exclusive netlock use.
> 
> Of course MP for PCB could be implemented differently.  SRP, SMR,
> whatever.  But mutex is easy and we should start with that.  As we
> currently run only on one CPU, it does not matter.


Mutexes are easy but they also tend to cause deadlocks and lock order
issues when used in more complex situation. It makes sense to use them for
PCBs but maybe something else is needed for the PCB lookup tables. The
locking dance done in FreeBSD in that area of code is rather complicated.
 
> There is much more to be done like ref counting and protecting the
> PCB fields.  But I want to go in small steps.  This NET_ASSERT_WLOCKED()
> beside SIMPLEQ_INIT() makes it quite obvious where the next unlocking
> problem is.  Look in netinet/ip_ipsp.c tdb_walk(), there is another
> one.  When they will be only left in the slow path a lot is gained.
> And if not, we have to fix them step by step.

Agreed. There is a lot of work to be done and not enough hands to actually
work on them.
 
> > Looking at the pcb hash problem, I have to wonder if this reinserting of
> > PCBs is actually resulting in a measurable performance difference. The
> > hash table should be large enough to keep the number of PCB per bucket low. 
> 
> The reinsertion is done for PCB notify, UDP multicast, Raw IPv4 and
> IPv6.  I don't have benchmarks for these cases and I doubt that
> others will feel much difference there.
> 
> There is one thing that might make things slower.  The in_pcbref(inp)
> in_pcbunref(inp) is not strictly necessary, we have exclusive net
> lock.  But I put it there so we will not forget it when unlocking.
> Maybe it costs a bit of performance, but who cares about multicast
> and rip.

Oh, I was talking about the mail you referenced where in_pcbhashlookup()
does this:
                        if (inp != LIST_FIRST(head)) {
                                LIST_REMOVE(inp, inp_hash);
                                LIST_INSERT_HEAD(head, inp, inp_hash);
                        }

I highly doubt that this improves throughput in most cases (even with e.g.
100k active connections).
 
> I have measured the PCB mutex diff.  But as usual benchmarks have
> to be explained.
> 
> http://bluhm.genua.de/perform/results/2022-03-10T17%3A19%3A00Z/perform.html
> 
> Left column is baseline, middle column is as mistake, where I only
> applied the IPv4 part.  Right columns is the full PCB diff, but
> without UDP Multicast and Raw IP queuing, which I missed before.
> 
> In TCP Perfomance graph (IPv4 only), the right column looks slow.
> But the code difference between middle and right only affects IPv6.
> How can that be?  The answer is in this row:
> 
> kernel name list              +53 -52         +40054 -40054
> 
> Mapping the kernel object files to different pages affects throughput
> a lot, more than most diffs.  So I sort and align them and compare
> the nm /bsd output.  When you click on it you see how 40000 symbol
> addresses move around.
> 
> So the more or less correct numbers are in the middle column but
> only for IPv4.
> 
> Look at the kstack output of one TCP benchmark.
> 
> http://bluhm.genua.de/perform/results/2022-03-10T17%3A19%3A00Z/patch-sys-pcbtable-mtx.0/btrace/iperf3_-c10.3.45.35_-w1m_-t10_-R-btrace-kstack.0.svg
> 
> Search for mtx in the right top field.  Then you see mutex contension.
> They are not in PCB lockup as the pf state links to the socket.
> 
> When receiving short UDP packets you can see the affected code:
> 
> http://bluhm.genua.de/perform/results/2022-03-10T17%3A19%3A00Z/patch-sys-pcbtable-mtx.0/btrace/udpbench_-l36_-t10_-r_ot15_recv_10.3.45.34-btrace-kstack.0.svg
> 
> We are 4.3% in PCB lookup.  And in that part you find 10% in mutex.
> 
> Compare it to the orignal code, this mutex is not there:
> 
> http://bluhm.genua.de/perform/results/2022-03-10T17%3A19%3A00Z/2022-03-10T00%3A00%3A00Z/btrace/udpbench_-l36_-t10_-r_ot15_recv_10.3.45.34-btrace-kstack.0.svg
> 
> But UDP thoughput numbers do not change.
 
I do not expect this diff to make a big difference for performance and
your analysis shows this. To get forward we need to accept that some
benchmarks will suffer before they get better again. A lot of code needs
to be changed and not all of it can happen in one go.

> > One comment below.
> > At least the one bit that can fail can be moved outside of this loop:
> 
> This is a very good idea and can be done upfront.
> 
> ok?

OK.
 
> bluhm
> 
> Index: netinet6/raw_ip6.c
> ===================================================================
> RCS file: /data/mirror/openbsd/cvs/src/sys/netinet6/raw_ip6.c,v
> retrieving revision 1.145
> diff -u -p -r1.145 raw_ip6.c
> --- netinet6/raw_ip6.c        21 Mar 2022 09:12:34 -0000      1.145
> +++ netinet6/raw_ip6.c        22 Mar 2022 16:22:43 -0000
> @@ -125,10 +125,19 @@ rip6_input(struct mbuf **mp, int *offp, 
>       struct in6_addr *key;
>       struct sockaddr_in6 rip6src;
>       struct mbuf *opts = NULL;
> +     uint8_t type;
>  
>       KASSERT(af == AF_INET6);
>  
> -     if (proto != IPPROTO_ICMPV6)
> +     if (proto == IPPROTO_ICMPV6) {
> +             struct icmp6_hdr *icmp6;
> +
> +             IP6_EXTHDR_GET(icmp6, struct icmp6_hdr *, m, *offp,
> +                 sizeof(*icmp6));
> +             if (icmp6 == NULL)
> +                     return IPPROTO_DONE;
> +             type = icmp6->icmp6_type;
> +     } else
>               rip6stat_inc(rip6s_ipackets);
>  
>       bzero(&rip6src, sizeof(rip6src));
> @@ -177,16 +186,7 @@ rip6_input(struct mbuf **mp, int *offp, 
>                   !IN6_ARE_ADDR_EQUAL(&in6p->inp_faddr6, &ip6->ip6_src))
>                       continue;
>               if (proto == IPPROTO_ICMPV6 && in6p->inp_icmp6filt) {
> -                     struct icmp6_hdr *icmp6;
> -
> -                     IP6_EXTHDR_GET(icmp6, struct icmp6_hdr *, m, *offp,
> -                         sizeof(*icmp6));
> -                     if (icmp6 == NULL) {
> -                             mtx_leave(&rawin6pcbtable.inpt_mtx);
> -                             return IPPROTO_DONE;
> -                     }
> -                     if (ICMP6_FILTER_WILLBLOCK(icmp6->icmp6_type,
> -                         in6p->inp_icmp6filt))
> +                     if (ICMP6_FILTER_WILLBLOCK(type, in6p->inp_icmp6filt))
>                               continue;
>               }
>               if (proto != IPPROTO_ICMPV6 && in6p->inp_cksum6 != -1) {
> 

-- 
:wq Claudio

Re: rip sbappendaddr() with inpcb table mutex

Reply via email to