On 2021/11/13 22:41, Stuart Henderson wrote: > On 2021/11/13 21:49, Stuart Henderson wrote: > > On 2021/11/13 18:04, Alexander Bluhm wrote: > > > Hi, > > > > > > To make IPsec MP safe we need refcounting for the tdb. The diff > > > below is part of something bigger we have at genua. Although it > > > does not cover timeouts and the tdb reaper yet, I want to get this > > > in as a frist step. > > > > > > It passes regress but there are setups that are not covered. Bridge > > > and pfsync with IPsec and TCP signature need special care. > > > > > > When testing, please check for tdb leaks. The vmstat -m tdb in use > > > culumn must be 0. It it is not, try ipsecctl -F. If ipsecctl -sa > > > shows nothing, but vmstat -m shows a used tdb, then it is a bug. > > > > > > Name Size Requests Fail InUse Pgreq Pgrel Npage Hiwat Minpg > > > Maxpg Idle > > > tdb 904 56 0 0 7 7 0 7 0 > > > 8 0 > > > > Tested with a box running isakmpd at home. Config is: > > > > isakmpd_flags=-Kv -D0=29 -D1=49 -D2=10 -D3=30 -D5=20 -D6=30 -D8=30 -D9=30 > > -D10=20 > > > > ike passive esp \ > > from xxx.xx.xxx.1 (xxx.xx.xxx.0/27) to 0.0.0.0/0 \ > > local xxx.xx.xxx.1 \ > > main auth hmac-sha2-256 enc aes group ecp256 \ > > quick enc aes-128-gcm group ecp256 \ > > srcid xxx > > > > isakmpd starts at boot, 3 devices connect in and tunnels come up, > > FLOWS and SAD show in ipsecctl -sa. > > > > isakmpd crashes shortly after starting (as described in previous mails; > > this is not new, it happens very often on the first start at boot), > > FLOWS and SAD remain showing. At this point they still show InUse: > > > > Name Size Requests Fail InUse Pgreq Pgrel Npage Hiwat Minpg Maxpg > > Idle > > tdb 1088 9 0 6 1 0 1 1 0 8 > > 0 > > > > After ipsecctl -F they are freed: > > > > tdb 1088 9 0 0 1 0 1 1 0 8 > > 1 > > > > Restart isakmpd, ipsecctl -f /etc/ipsec.conf, devices connect back in: > > > > tdb 1088 18 0 6 2 1 1 1 0 8 > > 0 > > > > (And at this point isakmpd is usually stable). > > > > I will leave a shell loop running vmstat -m every few minutes to check > > if it grows. > > > > (Bad timing for testing TCPMD5 because over the last couple of days I have > > just removed bgpd+ospfd from the machines that would have been good > > choices to run test diffs..I might be able to get some simple setup > > back up and running though) > > Ah, that was interesting: > > Sat Nov 13 21:44:17 GMT 2021 > tdb 1088 18 0 6 2 1 1 1 0 8 > 0 > > Sat Nov 13 21:54:44 GMT 2021 > tdb 1088 18 0 6 2 1 1 1 0 8 > 0 > > Sat Nov 13 22:04:44 GMT 2021 > tdb 1088 27 0 12 2 1 1 1 0 8 > 0 > > and then shortly after, boom. Unfortunately I didn't get > much from ddb before it hanged after I tried changing cpu. > > [playback] > Xsyscall() at Xsyscall+0x128 > end of kernel > end trace frame: 0x7f7ffffb2d20, count: 248 > End of stack trace. > splassert: tdb_free: want 2 have 0 > Starting stack trace... > tdb_free(ffff800000f40b88) at tdb_free+0x47 > pfkeyv2_send(fffffd80603603b8,ffff800000f44c00,170) at pfkeyv2_send+0x656 > pfkeyv2_output(fffffd8064cb5e00,fffffd80603603b8,0,0) at pfkeyv2_output+0x8a > pfkeyv2_usrreq(fffffd80603603b8,9,fffffd8064cb5e00,0,0,ffff8000ffff9508) at > pfkeyv2_usrreq+0x1b0 > sosend(fffffd80603603b8,0,ffff80001fd8f6f0,0,0,0) at sosend+0x3a9 > > dofilewritev(ffff8000ffff9508,7,ffff80001fd8f6f0,0,ffff80001fd8f7f0) at > dofilewritev+0x14d > sys_writev(ffff8000ffff9508,ffff80001fd8f790,ffff80001fd8f7f0) at > sys_writev+0xd2 > syscall(ffff80001fd8f860) at syscall+0x3a9 > > Xsyscall() at Xsyscall+0x128 > end of kernel > end trace frame: 0x7f7ffffb2d20, count: 248 > End of stack trace. > splassert: tdb_free: want 2 have 0 > Starting stack trace... > tdb_free(ffff800000f40308) at tdb_free+0x47 > pfkeyv2_send(fffffd80603603b8,ffff800000eff300,50) at pfkeyv2_send+0x656 > pfkeyv2_output(fffffd8062506e00,fffffd80603603b8,0,0) at pfkeyv2_output+0x8a > pfkeyv2_usrreq(fffffd80603603b8,9,fffffd8062506e00,0,0,ffff8000ffff9508) at > pfkeyv2_usrreq+0x1b0 > sosend(fffffd80603603b8,0,ffff80001fd8f6f0,0,0,0) at sosend+0x3a9 > > dofilewritev(ffff8000ffff9508,7,ffff80001fd8f6f0,0,ffff80001fd8f7f0) at > dofilewritev+0x14d > sys_writev(ffff8000ffff9508,ffff80001fd8f790,ffff80001fd8f7f0) at > sys_writev+0xd2 > syscall(ffff80001fd8f860) at syscall+0x3a9 > > Xsyscall() at Xsyscall+0x128 > end of kernel > end trace frame: 0x7f7ffffc3170, count: 248 > End of stack trace. > splassert: ipsp_ids_free: want 2 have 0 > Starting stack trace... > ipsp_ids_free(ffff800000f30880) at ipsp_ids_free+0x4e > tdb_free(ffff800000f40308) at tdb_free+0x13c > pfkeyv2_send(fffffd80603603b8,ffff800000eff300,50) at pfkeyv2_send+0x656 > pfkeyv2_output(fffffd8062506e00,fffffd80603603b8,0,0) at pfkeyv2_output+0x8a > pfkeyv2_usrreq(fffffd80603603b8,9,fffffd8062506e00,0,0,ffff8000ffff9508) at > pfkeyv2_usrreq+0x1b0 > sosend(fffffd80603603b8,0,ffff80001fd8f6f0,0,0,0) at sosend+0x3a9 > > dopfainlice:wr iketernv(eflf ffd8i0a0gn0offsftfic9 5a08ss,7er,ftfiofnf > 8"0r00ef1cfdnt8 f6!=f 0,~00",f ffaiflfe80d:0 01fiflde8 f7"f/s0ys)/ akter > n/kern_ > sdyofncilh.ecw"ri,t elivn+e0 x8124d4 > > Stopped at db_enter+0x10: popq %rbp > TID PID UID PRFLAGS PFLAGS CPU COMMAND > 185954 5669 68 0x10 0 1 isakmpd > 67015 91815 0 0x14000 0x200 3 softnet > *318814 6181 0 0x14000 0x40000200 0 softclock > db_enter() at db_enter+0x10 > panic(ffffffff81e4a37b) at panic+0xbf > __assert(ffffffff81eb3b74,ffffffff81e1e19f,338,ffffffff81e4f078) at > __assert+0x > 25 > > refcnt_rele(ffff800000f40330) at refcnt_rele+0x6f > tdb_delete(ffff800000f40308) at tdb_delete+0x84 > tdb_timeout(ffff800000f40308) at tdb_timeout+0x39 > softclock_thread(ffff8000fffff500) at softclock_thread+0x16e > end trace frame: 0x0, count: 8 > https://www.openbsd.org/ddb.html describes the minimum info required in bug > reports. Insufficient info makes it difficult to find and fix bugs. > ddb{0}> [-- sthen@localhost attached -- Sat Nov 13 22:23:17 2021] > tr > db_enter() at db_enter+0x10 > panic(ffffffff81e4a37b) at panic+0xbf > __assert(ffffffff81eb3b74,ffffffff81e1e19f,338,ffffffff81e4f078) at > __assert+0x > 25 > > refcnt_rele(ffff800000f40330) at refcnt_rele+0x6f > tdb_delete(ffff800000f40308) at tdb_delete+0x84 > tdb_timeout(ffff800000f40308) at tdb_timeout+0x39 > softclock_thread(ffff8000fffff500) at softclock_thread+0x16e > end trace frame: 0x0, count: -7 > ddb{0}> sh reg > rdi 0 > rsi 0x14 > rbp 0xffff80001fbdc6e0 > rbx 0 > rdx 0xde00000003ff0239 > rcx 0x202 > rax 0x62 > r8 0x101010101010101 > r9 0 > r10 0 > r11 0x254de177953953f9 > r12 0xffffffff82196a00 cpu_info_full_primary+0x2a00 > r13 0xffffffff82275110 timeout_proc > r14 0 > r15 0xffffffff81e4a37b cmd0646_9_tim_udma+0x2e00b > rip 0xffffffff81dad9f0 db_enter+0x10 > cs 0x8 > rflags 0x206 > rsp 0xffff80001fbdc6e0 > ss 0x10 > db_enter+0x10: popq %rbp > ddb{0}> ps > PID TID PPID UID S FLAGS WAIT COMMAND > 20274 32066 93874 1000 3 0x100083 nanoslp sleep > 93874 445500 10192 1000 3 0x10008b sigsusp ksh > 10192 370220 36870 1000 3 0x98 kqread sshd > 36870 222408 98260 0 3 0x92 kqread sshd > 5669 185954 75457 68 7 0x10 isakmpd > 75457 276857 1 0 3 0x80 netio isakmpd > 21598 237571 1 0 3 0x100083 ttyin getty > 4506 488964 1 0 3 0x100098 kqread cron > 39797 52192 15626 846 3 0x100098 kqread dhcpcd > 59318 229873 15626 846 3 0x100098 kqread dhcpcd > 7830 517562 15626 0 3 0x88 kqread dhcpcd > 15626 323417 1 846 3 0x100098 kqread dhcpcd > 19237 354049 75741 720 3 0x90 kqread lldpd > 75741 66119 1 0 3 0x80 netio lldpd > 61311 177515 1 99 3 0x100090 kqread sndiod > 11783 282500 1 110 3 0x100090 kqread sndiod > 82651 136045 37762 0 3 0x100090 kqread tftp-proxy > 37762 379684 1 108 3 0x90 kqread tftp-proxy > 1745 324688 1 109 3 0x90 kqread ftp-proxy > 39681 116848 88999 95 3 0x100092 kqread smtpd > 25384 256665 88999 103 3 0x100092 kqread smtpd > 92345 176419 88999 95 3 0x100092 kqread smtpd > 55442 232201 88999 95 3 0x100092 kqread smtpd > 77443 215878 88999 95 3 0x100092 kqread smtpd > 69162 261319 88999 95 3 0x100092 kqread smtpd > 88999 111572 1 0 3 0x100080 kqread smtpd > 58046 516005 58609 94 3 0x100092 kqread rad > 90680 269647 58609 94 3 0x100092 kqread rad > 58609 188614 1 0 3 0x100080 kqread rad > 3419 227743 1 77 3 0x100090 kqread dhcpd > 61767 205183 1 0 3 0x100080 kqread snmpd > 51987 257332 1 91 3 0x92 kqread snmpd > 98260 56979 1 0 3 0x88 kqread sshd > 13528 58314 1 0 3 0x100080 kqread ntpd > 18801 320473 51204 83 3 0x100092 kqread ntpd > 51204 405642 1 83 3 0x100092 kqread ntpd > 53483 209302 30989 53 3 0x90 kqread unbound > 30989 499872 1 53 3 0x90 kqread unbound > 16457 234403 32406 74 3 0x100092 bpf pflogd > 32406 88142 1 0 3 0x80 netio pflogd > 80238 516045 79768 73 3 0x100090 kqread syslogd > 79768 67472 1 0 3 0x100082 netio syslogd > 43550 409362 1 0 3 0x100080 kqread resolvd > 442 379277 99464 77 3 0x100092 kqread dhcpleased > 47451 452762 99464 77 3 0x100092 kqread dhcpleased > 99464 433528 1 0 3 0x80 kqread dhcpleased > 21886 318205 0 0 3 0x14200 bored wg_crypt > 97050 229388 0 0 3 0x14200 bored wg_crypt > 4440 5494 0 0 3 0x14200 bored wg_crypt > 72609 120590 0 0 3 0x14200 bored wg_crypt > 20025 29295 0 0 3 0x14200 bored wg_handshake > 43757 478463 0 0 3 0x14200 bored wg_handshake > 55892 346923 51675 115 3 0x100092 kqread slaacd > 76843 163407 51675 115 3 0x100092 kqread slaacd > 51675 404409 1 0 3 0x100080 kqread slaacd > 28362 202426 0 0 3 0x14200 bored smr > 82207 376534 0 0 3 0x14200 pgzero zerothread > 1524 51196 0 0 3 0x14200 aiodoned aiodoned > 22770 341109 0 0 3 0x14200 syncer update > 33241 337755 0 0 3 0x14200 cleaner cleaner > 27893 102676 0 0 3 0x14200 reaper reaper > 79368 439400 0 0 3 0x14200 pgdaemon pagedaemon > 98877 12806 0 0 3 0x14200 bored sensors > 71274 388091 0 0 3 0x14200 mmctsk sdmmc0 > 33311 377928 0 0 3 0x14200 usbtsk usbtask > 83240 116348 0 0 3 0x14200 usbatsk usbatsk > 45461 344150 0 0 3 0x40014200 acpi0 acpi0 > 94708 135825 0 0 3 0x40014200 idle3 > 46611 496404 0 0 7 0x40014200 idle2 > 26280 26936 0 0 3 0x40014200 idle1 > 91815 67015 0 0 7 0x14200 softnet > 90110 476121 0 0 3 0x14200 bored systqmp > 12475 134606 0 0 3 0x14200 bored systq > * 6181 318814 0 0 7 0x40014200 softclock > 38662 5703 0 0 3 0x40014200 idle0 > 1 335266 0 0 3 0x82 wait init > 0 0 -1 0 3 0x10200 scheduler swapper > ddb{0}> ps /o > TID PID UID PRFLAGS PFLAGS CPU COMMAND > 185954 5669 68 0x10 0 1 isakmpd > 67015 91815 0 0x14000 0x200 3 softnet > *318814 6181 0 0x14000 0x40000200 0 softclock > ddb{0}> tr /t 0t185954 > ffff800000f40308(ffffffffffffffff,ffffffffffffffff,ffff800000f402e8,0,0,0) at > 0 > xffff800000f40308 > > end trace frame: 0x0, count: -1 > ddb{0}> mach ddbcpu 1 > > <ddb hanged here> > > However, looking at the earlier console log, there were splasserts > starting from before it went multiuser:
I am now running bsd.gdb with MP_LOCKDEBUG and WITNESS. Here is an alternative version of the splassert traces from during boot with line numbers. starting early daemons: syslogd pflogd unbound ntpd isakmpd. starting RPC daemons:. savecore: no core dump checking quotas: done. clearing /tmp kern.securelevel: 0 -> 1 creating runtime link editor directory cache. preserving editor files. starting network daemons: sshd snmpd dhcpd rad smtpd ftpproxy ftpproxy6(failed) tftpproxy sndiod. starting package daemons: lldpd dhcpcdsplassert: tdb_free: want 2 have 0 Starting stack trace... tdb_free(ffff800000f2db80) at tdb_free+0x47 [/sys/netinet/ip_ipsp.c:909] pfkeyv2_send(fffffd805e21e3f8,ffff800000f27e00,170) at pfkeyv2_send+0x656 [/sys/net/pfkeyv2.c:2141] pfkeyv2_output(fffffd806731ea00,fffffd805e21e3f8,0,0) at pfkeyv2_output+0x8a [/sys/net/pfkeyv2.c:442] pfkeyv2_usrreq(fffffd805e21e3f8,9,fffffd806731ea00,0,0,ffff80001f6aefc8) at pfkeyv2_usrreq+0x1b0 [/sys/net/pfkeyv2.c:394] sosend(fffffd805e21e3f8,0,ffff80001f6fdd90,0,0,0) at sosend+0x3a9 [/sys/kern/uipc_socket.c:597] dofilewritev(ffff80001f6aefc8,7,ffff80001f6fdd90,0,ffff80001f6fde90) at dofilewritev+0x14d [/sys/kern/sys_generic.c:382] sys_writev(ffff80001f6aefc8,ffff80001f6fde30,ffff80001f6fde90) at sys_writev+0xd2 [/sys/kern/sys_generic.c:328] syscall(ffff80001f6fdf00) at syscall+0x3a9 [/sys/sys/syscall_mi.h:102] Xsyscall() at Xsyscall+0x128 end of kernel end trace frame: 0x7f7ffffb31b0, count: 248 End of stack trace. splassert: tdb_free: want 2 have 0 Starting stack trace... tdb_free(ffff800000f2db80) at tdb_free+0x47 [/sys/netinet/ip_ipsp.c:909] pfkeyv2_send(fffffd805e21e3f8,ffff800000f27e00,190) at pfkeyv2_send+0x656 [/sys/net/pfkeyv2.c:2141] pfkeyv2_output(fffffd8061acef00,fffffd805e21e3f8,0,0) at pfkeyv2_output+0x8a [/sys/net/pfkeyv2.c:442] pfkeyv2_usrreq(fffffd805e21e3f8,9,fffffd8061acef00,0,0,ffff80001f6aefc8) at pfkeyv2_usrreq+0x1b0 [/sys/net/pfkeyv2.c:394] sosend(fffffd805e21e3f8,0,ffff80001f6fdd90,0,0,0) at sosend+0x3a9 [/sys/kern/uipc_socket.c:597] dofilewritev(ffff80001f6aefc8,7,ffff80001f6fdd90,0,ffff80001f6fde90) at dofilewritev+0x14d [/sys/kern/sys_generic.c:382] sys_writev(ffff80001f6aefc8,ffff80001f6fde30,ffff80001f6fde90) at sys_writev+0xd2 [/sys/kern/sys_generic.c:328] syscall(ffff80001f6fdf00) at syscall+0x3a9 [/sys/sys/syscall_mi.h:102] Xsyscall() at Xsyscall+0x128 end of kernel end trace frame: 0x7f7ffffb31b0, count: 248 End of stack trace. splassert: tdb_free: want 2 have 0 Starting stack trace... tdb_free(ffff800000f2db80) at tdb_free+0x47 [/sys/netinet/ip_ipsp.c:909] pfkeyv2_send(fffffd805e21e3f8,ffff800000f27200,170) at pfkeyv2_send+0x656 [/sys/net/pfkeyv2.c:2141] pfkeyv2_output(fffffd8060d5f800,fffffd805e21e3f8,0,0) at pfkeyv2_output+0x8a [/sys/net/pfkeyv2.c:442] pfkeyv2_usrreq(fffffd805e21e3f8,9,fffffd8060d5f800,0,0,ffff80001f6aefc8) at pfkeyv2_usrreq+0x1b0 [/sys/net/pfkeyv2.c:394] sosend(fffffd805e21e3f8,0,ffff80001f6fdd90,0,0,0) at sosend+0x3a9 [/sys/kern/uipc_socket.c:597] dofilewritev(ffff80001f6aefc8,7,ffff80001f6fdd90,0,ffff80001f6fde90) at dofilewritev+0x14d [/sys/kern/sys_generic.c:382] sys_writev(ffff80001f6aefc8,ffff80001f6fde30,ffff80001f6fde90) at sys_writev+0xd2 [/sys/kern/sys_generic.c:328] syscall(ffff80001f6fdf00) at syscall+0x3a9 [/sys/sys/syscall_mi.h:102] Xsyscall() at Xsyscall+0x128 end of kernel end trace frame: 0x7f7ffffb31b0, count: 248 End of stack trace. . starting local daemons: cron. Sat Nov 13 22:57:42 GMT 2021 OpenBSD/amd64 (jodrell.spacehopper.org) (tty00) login: reorder_kernel: failed -- see /usr/share/relink/kernel/GENERIC.MP/relink.log [-- sthen@localhost attached -- Sat Nov 13 22:55:14 2021] isakmpd crashed; I flushed pfkey, restarted isakmpd and reloaded ipsec.conf, then saw one more "splassert: tdb_free: want 2 have 0" after every new SA Now waiting for another kernel crash.
