Hello David, I'm still not sure if your change is ultimate fix, or just significantly minimizes risk of the bug. If I understand things right, the problem we are trying to solve:
DIOCGETSTATES we have in current, grabs NET_LOCK() and pf_state_lock as a reader. it then walks through the whole state list and copies out (copyout(9f)) data for each state into ioctl(2) buffer provided by calling process. we may trip assert down in copyout(9f): > panic: acquiring blockable sleep lock with spinlock or critical section > held (rwlock) vmmaplk > Stopped at db_enter+0x10: popq %rbp > TID PID UID PRFLAGS PFLAGS CPU COMMAND > *512895 28841 0 0x3 0 3K pfctl > db_enter() at db_enter+0x10 > panic(ffffffff81e19411) at panic+0x12a > witness_checkorder(fffffd83b09b4d18,1,0) at witness_checkorder+0xbce > rw_enter_read(fffffd83b09b4d08) at rw_enter_read+0x38 > uvmfault_lookup(ffff8000238e3418,0) at uvmfault_lookup+0x8a > uvm_fault_check(ffff8000238e3418,ffff8000238e3450,ffff8000238e3478) at > uvm_fault_check+0x32 > uvm_fault(fffffd83b09b4d00,e36553c000,0,2) at uvm_fault+0xfc > kpageflttrap(ffff8000238e3590,e36553c000) at kpageflttrap+0x131 > kerntrap(ffff8000238e3590) at kerntrap+0x91 > alltraps_kern_meltdown() at alltraps_kern_meltdown+0x7b > copyout() at copyout+0x53 I'm just afraid that although your change significantly reduces the risk we will die with similar call stack as the one above, the new code is not bullet proof. We still do copyout() while holding pf_state_list.pfs_rwl as a reader (in pf_states_get() from your diff). I agree packets do not grab pf_state_list.pfs_rwl at all. Your fix solves this problem in this respect. let's take a look at this part of pf_purge_expired_states() from your diff: 1543 NET_LOCK(); 1544 rw_enter_write(&pf_state_list.pfs_rwl); 1545 PF_LOCK(); 1546 PF_STATE_ENTER_WRITE(); 1547 SLIST_FOREACH(st, &gcl, gc_list) { 1548 if (st->timeout != PFTM_UNLINKED) 1549 pf_remove_state(st); 1550 1551 pf_free_state(st); 1552 } 1553 PF_STATE_EXIT_WRITE(); 1554 PF_UNLOCK(); 1555 rw_exit_write(&pf_state_list.pfs_rwl); at line 1543 we grab NET_LOCK(), at line 1544 we are trying to grab new lock (pf_state_list.pfs_rwl) exclusively. with your change we might be running into situation, where we do copyout() as a reader on pf_state_list.pfs_rwl. Then we grab NET_LOCK() and attempt to acquire pf_state_list.pfs_rwl exclusively, which is still occupied by guy, who might be doing uvm_fault() in copyout(9f). I'm just worried we may be trading one bug for another bug. may be my concern is just a false alarm here. I don't know. anyway there are few more nits in your diff. </snip> > Index: pf.c > =================================================================== > RCS file: /cvs/src/sys/net/pf.c,v > retrieving revision 1.1118 > diff -u -p -r1.1118 pf.c > --- pf.c 1 Jun 2021 09:57:11 -0000 1.1118 > +++ pf.c 3 Jun 2021 06:24:48 -0000 > @@ -1247,7 +1278,8 @@ pf_purge_expired_rules(void) > void > pf_purge_timeout(void *unused) > { > - task_add(net_tq(0), &pf_purge_task); > + /* XXX move to systqmp to avoid KERNEL_LOCK */ > + task_add(systq, &pf_purge_task); > } I would just clean up the comment. looks like we should be able to get pf's ioctl operations out of KERNEL_LOCK completely. I'll take a further look at it, while be working in pf_ioctl.c </snip> > @@ -1280,11 +1311,10 @@ pf_purge(void *xnloops) > * Fragments don't require PF_LOCK(), they use their own lock. > */ > if ((*nloops) >= pf_default_rule.timeout[PFTM_INTERVAL]) { > - pf_purge_expired_fragments(); > + pfgpurge_expired_fragment(s); > *nloops = 0; > } > NET_UNLOCK(); > - KERNEL_UNLOCK(); > > timeout_add_sec(&pf_purge_to, 1); > } I guess 'pfgpurge_expired_fragment(s);' is unintentional change, right? </snip> > Index: pfvar_priv.h > =================================================================== > RCS file: /cvs/src/sys/net/pfvar_priv.h,v > retrieving revision 1.6 > diff -u -p -r1.6 pfvar_priv.h > --- pfvar_priv.h 9 Feb 2021 14:06:19 -0000 1.6 > +++ pfvar_priv.h 3 Jun 2021 06:24:48 -0000 > @@ -38,6 +38,115 @@ > #ifdef _KERNEL > > #include <sys/rwlock.h> > +#include <sys/mutex.h> > + > +/* > + > +states are linked into a global list to support the following > +functionality: > + > + - garbage collection > + - pfsync bulk send operations > + - bulk state fetches via the DIOCGETSTATES ioctl > + - bulk state clearing via the DIOCCLRSTATES ioctl > + > +states are inserted into the global pf_state_list once it has also > +been successfully added to the various trees that make up the state > +table. states are only removed from the pf_state_list by the garbage > +collection process. > + the multiline comment does not match style, I think. I kind of got stuck with pf.c/pf_ioctl.c I have not seen if_pfsync.c yet. will do it later today/early tomorrow. thanks and regards sashan