On 25.12.2021. 18:52, Alexander Bluhm wrote: > On Sat, Dec 25, 2021 at 09:24:07AM +0100, Hrvoje Popovski wrote: >> On 24.12.2021. 0:55, Alexander Bluhm wrote: >>> I think we can remove the ipsec_in_use workaround now. The IPsec >>> path is protected with the kernel lock. >>> >>> There are some issues left: >>> - npppd l2pt ipsecflowinfo is not MP safe >>> - the acquire SA feature is not MP safe >>> - Hrvoje has seen a panic with sasync >>> >>> If you use one of these, the diff below should trigger crashes faster. >>> If you use only regular IPsec or forwarding, I hope it is stable. >> Hi, >> >> after hitting sasyncd setup with this diff for some time i've run >> ipsecctl -sa just to see what's going on and box panic > According to ddb output ipsecctl is running in userland and the > panic is triggered by a kernel timeout. Could it be coincidence? > When I run with 4 softnet threads, I see userland starvation. Maybe > both timeout and ipsecctl wait for the exclusive netlock and are > executed shortly after each other. > >> r620-1# ipsecctl -sa >> uvm_fault(0xffffffff82200c18, 0x417, 0, 2) -> e >> kernel: page fault trap, code=0 >> Stopped at pfsync_delete_tdb+0x84: movq %rcx,0x8(%rsi) >> TID PID UID PRFLAGS PFLAGS CPU COMMAND >> 290490 40316 0 0x3 0 3 ipsecctl >> 10869 22801 68 0x10 0 5 sasyncd >> 504041 13202 68 0x10 0x80 1 isakmpd >> 476980 6400 0 0x100000 0 2 ntpd >> 224100 72648 0 0x14000 0x200 4 reaper >> * 75659 10211 0 0x14000 0x40000200 0K softclock >> pfsync_delete_tdb(ffff8000012e8008) at pfsync_delete_tdb+0x84 >> tdb_free(ffff8000012e8008) at tdb_free+0x67 >> tdb_timeout(ffff8000012e8008) at tdb_timeout+0x7e >> softclock_thread(ffff8000fffff260) at softclock_thread+0x13b >> end trace frame: 0x0, count: 11 >> https://www.openbsd.org/ddb.html describes the minimum info required in >> bug reports. Insufficient info makes it difficult to find and fix bugs. >> ddb{0}> > /home/bluhm/openbsd/cvs/src/sys/net/if_pfsync.c:2519 > 5f30: 49 8b 86 10 04 00 00 mov 0x410(%r14),%rax > 5f37: 49 8b 8e 18 04 00 00 mov 0x418(%r14),%rcx > 5f3e: 49 8d 94 24 e8 09 00 lea 0x9e8(%r12),%rdx > 5f45: 00 > 5f46: 48 8d b0 10 04 00 00 lea 0x410(%rax),%rsi > 5f4d: 48 85 c0 test %rax,%rax > * 5f50: 48 0f 44 f2 cmove %rdx,%rsi > 5f54: 48 89 4e 08 mov %rcx,0x8(%rsi) > 5f58: 49 8b 86 10 04 00 00 mov 0x410(%r14),%rax > 5f5f: 49 8b 8e 18 04 00 00 mov 0x418(%r14),%rcx > 5f66: 48 89 01 mov %rax,(%rcx) > 5f69: 49 c7 86 18 04 00 00 movq $0xffffffffffffffff,0x418(%r14) > 5f70: ff ff ff ff > 5f74: 49 c7 86 10 04 00 00 movq $0xffffffffffffffff,0x410(%r14) > 5f7b: ff ff ff ff > /home/bluhm/openbsd/cvs/src/sys/net/if_pfsync.c:2520 > > 2508 void > 2509 pfsync_delete_tdb(struct tdb *t) > 2510 { > 2511 struct pfsync_softc *sc = pfsyncif; > 2512 size_t nlen; > 2513 > 2514 if (sc == NULL || !ISSET(t->tdb_flags, TDBF_PFSYNC)) > 2515 return; > 2516 > 2517 mtx_enter(&sc->sc_tdb_mtx); > 2518 > * 2519 TAILQ_REMOVE(&sc->sc_tdb_q, t, tdb_sync_entry); > 2520 CLR(t->tdb_flags, TDBF_PFSYNC); > 2521 > 2522 nlen = sizeof(struct pfsync_tdb); > 2523 if (TAILQ_EMPTY(&sc->sc_tdb_q)) > 2524 nlen += sizeof(struct pfsync_subheader); > 2525 atomic_sub_long(&sc->sc_len, nlen); > 2526 > 2527 mtx_leave(&sc->sc_tdb_mtx); > 2528 } > > Most sc_tdb_q are protected by sc_tdb_mtx. But pfsync_drop_snapshot() > and pfsync_sendout() do not have it. > >> rsi 0x40f >> rcx 0xffffffffffffffff >> rax 0xffffffffffffffff > tqe_next and tqe_prev are -1. Looks like a double remove. Can > pfsync_delete_tdb() be called twice? The tdb_refcnt should enforce > that tdb_free() is only called once per TDB. > >> flags: d1040<DELETED,TUNNELING,USEDTUNNEL,PFSYNC,PFSYNC_RPL> > Flags look reasonable. But a problem could be that they are not > protected in if_pfsync. Then set or cleared bits may be lost. > > Fix both missing mutexes . Also put the tdb_sync_entry > modification and TDBF_PFSYNC flag in the same sc_tdb_mtx. > > I still have no pf sync setup. Hrvoje, can you try this?
Hi, i think that this panic is coincidence, but i can't prove it because it's not so easy to reproduce. I'm hitting sasync setup with this diff, for 2 days and i can panic boxes... If i try to reproduce panic without this diff, would it give you some more information or hints regarding ipsec and mp forwarding?