On Tue, Mar 22, 2022 at 04:42:45PM +0100, Claudio Jeker wrote:
> No but you push this layer into a specific direction and by that make it
> harder to fix the PCB tables in a different way.  I just see people
> changing the NET_ASSERT_WLOCKED() without realizing the actual reason for
> the exclusive netlock use.
Of course MP for PCB could be implemented differently: SRP, SMR, whatever.
But a mutex is easy and we should start with that.  As we currently run
only on one CPU, it does not matter.  There is much more to be done, like
ref counting and protecting the PCB fields, but I want to go in small
steps.

This NET_ASSERT_WLOCKED() beside SIMPLEQ_INIT() makes it quite obvious
where the next unlocking problem is.  Look at tdb_walk() in
netinet/ip_ipsp.c, there is another one.  When they are only left in the
slow path, a lot is gained.  And if not, we have to fix them step by step.

> Looking at the pcb hash problem, I have to wonder if this reinserting of
> PCBs is actually resulting in a measurable performance difference. The
> hash table should be large enough to keep the number of PCB per bucket low.

The reinsertion is done for PCB notify, UDP multicast, raw IPv4 and IPv6.
I don't have benchmarks for these cases and I doubt that others will feel
much difference there.

There is one thing that might make things slower.  The in_pcbref(inp)
and in_pcbunref(inp) pair is not strictly necessary, as we hold the
exclusive net lock.  But I put it there so we will not forget it when
unlocking.  Maybe it costs a bit of performance, but who cares about
multicast and rip.

I have measured the PCB mutex diff.  But as usual, benchmarks have to be
explained.
http://bluhm.genua.de/perform/results/2022-03-10T17%3A19%3A00Z/perform.html

The left column is the baseline, the middle column is my mistake, where I
only applied the IPv4 part.  The right column is the full PCB diff, but
without the UDP multicast and raw IP queuing, which I missed before.

In the TCP performance graph (IPv4 only), the right column looks slow.
But the code difference between middle and right only affects IPv6.  How
can that be?  The answer is in this row:

    kernel name list    +53 -52    +40054 -40054

Mapping the kernel object files to different pages affects throughput a
lot, more than most diffs.  So I sort and align them and compare the
nm /bsd output.
When you click on it you see how 40000 symbol addresses move around.  So
the more or less correct numbers are in the middle column, but only for
IPv4.

Look at the kstack output of one TCP benchmark:
http://bluhm.genua.de/perform/results/2022-03-10T17%3A19%3A00Z/patch-sys-pcbtable-mtx.0/btrace/iperf3_-c10.3.45.35_-w1m_-t10_-R-btrace-kstack.0.svg
Search for mtx in the top right field.  Then you see mutex contention.
It is not in PCB lookup, as the pf state links to the socket.

When receiving short UDP packets you can see the affected code:
http://bluhm.genua.de/perform/results/2022-03-10T17%3A19%3A00Z/patch-sys-pcbtable-mtx.0/btrace/udpbench_-l36_-t10_-r_ot15_recv_10.3.45.34-btrace-kstack.0.svg
We spend 4.3% in PCB lookup, and in that part you find 10% in the mutex.
Compare it to the original code, where this mutex is not there:
http://bluhm.genua.de/perform/results/2022-03-10T17%3A19%3A00Z/2022-03-10T00%3A00%3A00Z/btrace/udpbench_-l36_-t10_-r_ot15_recv_10.3.45.34-btrace-kstack.0.svg
But the UDP throughput numbers do not change.

> One comment below.
> At least the one bit that can fail can be moved outside of this loop:

This is a very good idea and can be done upfront.  ok?
bluhm

Index: netinet6/raw_ip6.c
===================================================================
RCS file: /data/mirror/openbsd/cvs/src/sys/netinet6/raw_ip6.c,v
retrieving revision 1.145
diff -u -p -r1.145 raw_ip6.c
--- netinet6/raw_ip6.c	21 Mar 2022 09:12:34 -0000	1.145
+++ netinet6/raw_ip6.c	22 Mar 2022 16:22:43 -0000
@@ -125,10 +125,19 @@ rip6_input(struct mbuf **mp, int *offp,
 	struct in6_addr *key;
 	struct sockaddr_in6 rip6src;
 	struct mbuf *opts = NULL;
+	uint8_t type;

 	KASSERT(af == AF_INET6);

-	if (proto != IPPROTO_ICMPV6)
+	if (proto == IPPROTO_ICMPV6) {
+		struct icmp6_hdr *icmp6;
+
+		IP6_EXTHDR_GET(icmp6, struct icmp6_hdr *, m, *offp,
+		    sizeof(*icmp6));
+		if (icmp6 == NULL)
+			return IPPROTO_DONE;
+		type = icmp6->icmp6_type;
+	} else
 		rip6stat_inc(rip6s_ipackets);

 	bzero(&rip6src, sizeof(rip6src));
@@ -177,16 +186,7 @@ rip6_input(struct mbuf **mp, int *offp,
 		    !IN6_ARE_ADDR_EQUAL(&in6p->inp_faddr6, &ip6->ip6_src))
 			continue;
 		if (proto == IPPROTO_ICMPV6 && in6p->inp_icmp6filt) {
-			struct icmp6_hdr *icmp6;
-
-			IP6_EXTHDR_GET(icmp6, struct icmp6_hdr *, m, *offp,
-			    sizeof(*icmp6));
-			if (icmp6 == NULL) {
-				mtx_leave(&rawin6pcbtable.inpt_mtx);
-				return IPPROTO_DONE;
-			}
-			if (ICMP6_FILTER_WILLBLOCK(icmp6->icmp6_type,
-			    in6p->inp_icmp6filt))
+			if (ICMP6_FILTER_WILLBLOCK(type, in6p->inp_icmp6filt))
 				continue;
 		}
 		if (proto != IPPROTO_ICMPV6 && in6p->inp_cksum6 != -1) {
