Re: [External] : Re: parallel forwarding vs. bridges
Hello,

the new diff still looks good to me. I could only catch typos in the
comments.

> I've been running versions of this diff in production at work, and have
> hit a few panics and asserts. All the issues we've hit should be
> addressed in this diff.
>
> The first issue was that pfsync could be in the process of sending a
> deferred packet while it's being removed from the state tree. Part of
> that removal process is stripping the state keys from the state, which
> pfsync uses to determine the address family so it knows which ip output
> routine to use. The quick and dirty fix to this is to have pfsync check
> the timeout state to see if the state is unlinked or not. This currently
> relies on pfsync undefer and pf being serialised by the NET_LOCK.
>
> The second is that the timeout member on a state can change while the
> purge task is looking at it. We hit this assert in pf_state_expires:
>
> 	KASSERT(state->timeout != PFTM_UNLINKED);
>
> pf_state_expires was called from the purge code like this:
>
> 	if ((cur->timeout == PFTM_UNLINKED) ||
> 	    (pf_state_expires(cur) <= getuptime()))
> 		SLIST_INSERT_HEAD(&gcl, cur, gc_list);

I see. I've missed those in my earlier review. The pf(4) we have in
tree uses the PF_STATE_ lock to keep cur->timeout consistent.
pf_purge_expired_states() executes the check under PF_STATE_ENTER_READ().

> With my new locking scheme here, the state purge code is called without
> any of the locks that would serialise access to the state->timeout
> variable. I think I found a solution to this without having to
> reintroduce extra locking, which should allow us to keep the purge scan
> running concurrently with pf actually handling packets.

I think it should work, because we make sure pf_remove_state() gets
called only once.

I've found two typos in a comment, see below. Otherwise looks good.

OK sashan

> +	/*
> +	 * pf_state_expires is used by the state purge task to
> +	 * decide if a state is a candidate for cleanup, and by the
> +	 * pfsync state export code to populate an expiry time.
> +	 *
> +	 * this function may be called by the state purge task while
> +	 * the state is being modified. avoid inconsistent reads of
> +	 * state->timeout by having the caller do the read (and any
> +	 * chacks it needs to do on the same variable) and then pass

s/chacks/checks

> +	 * their view of the timeout in here for this function to use.
> +	 * the only consequnce of using a stale timeout value is

s/consequnce/consequence

OK sashan
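For readers following the thread, the shape of the fix under review is
roughly this (a simplified sketch; the real function also applies
adaptive timeout scaling):

	/*
	 * sketch: the caller snapshots state->timeout once and passes
	 * the snapshot in, so pf_state_expires() never re-reads a
	 * field the packet path may be changing concurrently.
	 */
	int32_t
	pf_state_expires(const struct pf_state *state, uint8_t stimeout)
	{
		u_int32_t timeout;

		timeout = state->rule.ptr->timeout[stimeout];
		if (!timeout)
			timeout = pf_default_rule.timeout[stimeout];

		return (state->expire + timeout);
	}

	/* purge scan side: do the read and all checks on one snapshot */
	uint8_t stimeout = cur->timeout;

	if (stimeout == PFTM_UNLINKED ||
	    pf_state_expires(cur, stimeout) <= getuptime()) {
		/* cur is a candidate for garbage collection */
	}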
Re: [External] : Re: parallel forwarding vs. bridges
On Wed, Jun 16, 2021 at 02:59:19AM +0200, Alexandr Nedvedicky wrote: > Hello, > > > > > > as above, copyout with a sleeping lock is fine. > > > > the whole point of my change is to give us the ability to lock in the > > forwarding path separately to locking in the ioctl path. half of that is > > so we can copyout safely. the other half is to avoid letting the ioctl > > path block packet processing if we can avoid it as an alternative to > > having the network stack having to yield the cpu. > > I see. my confusion came from fact I've forgot pflock got turned to mutex, > when we saw the crash. > > > > > > let's take a look at this part of pf_purge_expired_states() > > > from your diff: > > > > > > 1543 NET_LOCK(); > > > 1544 rw_enter_write(_state_list.pfs_rwl); > > > 1545 PF_LOCK(); > > > 1546 PF_STATE_ENTER_WRITE(); > > > 1547 SLIST_FOREACH(st, , gc_list) { > > > 1548 if (st->timeout != PFTM_UNLINKED) > > > 1549 pf_remove_state(st); > > > 1550 > > > 1551 pf_free_state(st); > > > 1552 } > > > 1553 PF_STATE_EXIT_WRITE(); > > > 1554 PF_UNLOCK(); > > > 1555 rw_exit_write(_state_list.pfs_rwl); > > > > > > at line 1543 we grab NET_LOCK(), at line 1544 we are trying > > > to grab new lock (pf_state_list.pfs_rwl) exclusively. > > > > > > with your change we might be running into situation, where we do > > > copyout() as a > > > reader on pf_state_list.pfs_rwl. Then we grab NET_LOCK() and attempt to > > > acquire > > > pf_state_list.pfs_rwl exclusively, which is still occupied by guy, who > > > might be > > > doing uvm_fault() in copyout(9f). > > > > > > I'm just worried we may be trading one bug for another bug. may be my > > > concern > > > is just a false alarm here. I don't know. > > > > no, these things are all worth discussing. > > > > it's definitely possible there's bugs in here, but im pretty confident > > it's not the copyout one. > > > > it seems to work. I'm running your diff with bluhm's parallel diff > and do occasional pfctl -Fs/pfctl -ss under a load. so far so good. > > > > > > I guess 'pfgpurge_expired_fragment(s);' is unintentional change, > > > right? > > > > yeah, i dont know how i did that. vi is hard? > > sure it is... > thanks for fixing the nits. > > > > > Index: if_pfsync.c > > === > > RCS file: /cvs/src/sys/net/if_pfsync.c,v > > retrieving revision 1.292 > > diff -u -p -r1.292 if_pfsync.c > > --- if_pfsync.c 15 Jun 2021 10:10:22 - 1.292 > > +++ if_pfsync.c 15 Jun 2021 11:21:20 - > > @@ -2545,22 +2545,34 @@ pfsync_bulk_start(void) > > { > > struct pfsync_softc *sc = pfsyncif; > > have not spot anything suspicious in if_pfsync.c > > the new diff reads fine to me. > > OK sashan I've been running versions of this diff in production at work, and have hit a few panics and asserts. All the issues we've hit should be addressed in this diff. The first issue was that pfsync could be in the processing of sending a deferred packet while it's being removed from the state tree. Part of that removal process is stripping the state keys from the state, which pfsync uses to determine the address family so it knows which ip output routine to use. The quick and dirty fix to this is to have pfsync check if timeout state to see if the state is unlinked or not. This currently relies on pfsync undefer and pf being serialised by the NET_LOCK. The second is that the timeout member on a state can change while the purge task is looking at it. 
We hit this assert in pf_state_expires:

	KASSERT(state->timeout != PFTM_UNLINKED);

pf_state_expires was called from the purge code like this:

	if ((cur->timeout == PFTM_UNLINKED) ||
	    (pf_state_expires(cur) <= getuptime()))
		SLIST_INSERT_HEAD(&gcl, cur, gc_list);

With my new locking scheme here, the state purge code is called without
any of the locks that would serialise access to the state->timeout
variable. I think I found a solution to this without having to
reintroduce extra locking, which should allow us to keep the purge scan
running concurrently with pf actually handling packets.

Index: pf.c
===================================================================
RCS file: /cvs/src/sys/net/pf.c,v
retrieving revision 1.1118
diff -u -p -r1.1118 pf.c
--- pf.c	1 Jun 2021 09:57:11 -0000	1.1118
+++ pf.c	22 Jun 2021 01:34:36 -0000
@@ -259,6 +259,7 @@ void			 pf_state_key_link_inpcb(struct p
 void			 pf_state_key_unlink_inpcb(struct pf_state_key *);
 void			 pf_inpcb_unlink_state_key(struct inpcb *);
 void			 pf_pktenqueue_delayed(void *);
+int32_t			 pf_state_expires(const struct pf_state *, uint8_t);
 
 #if NPFLOG > 0
 void
Re: [External] : Re: parallel forwarding vs. bridges
Hello, re-sending the same diff updated to current if_pfsync.c. diff avoids a recursion to PF_LOCK(), which is indicated by panic stack here: > > production firewalls panicked with the pf lock trying to lock against > > itself a couple of nights ago: > > > > db_enter() at db_enter+0x5 > > panic(81d47212) at panic+0x12a > > rw_enter(82060e70,1) at rw_enter+0x261 > > pf_test(18,2,8158f000,800024d03db8) at pf_test+0x118c > > ip6_output(fd80049a3a00,0,0,0,800024d03eb0,0) at > > ip6_output+0xd33 > > nd6_ns_output(8158f000,0,800024d04228,fd8279b3b420,0) at > > nd6_ns > > _output+0x3e2 > > nd6_resolve(8158f000,fd816c6b2ae0,fd80659d8300,800024d04220 > > ,800024d040d8) at nd6_resolve+0x29d > > ether_resolve(8158f000,fd80659d8300,800024d04220,fd816c6b2a > > e0,800024d040d8) at ether_resolve+0x127 > > ether_output(8158f000,fd80659d8300,800024d04220,fd816c6b2ae > > 0) at ether_output+0x2a > > ip6_output(fd80659d8300,0,0,0,0,0) at ip6_output+0x1180 > > pfsync_undefer_notify(fd841dbec7b8) at pfsync_undefer_notify+0xac > > pfsync_undefer(fd841dbec7b8,0) at pfsync_undefer+0x8d > > pfsync_defer(fd82303cb310,fd8065a2) at pfsync_defer+0xfe > > pf_test_rule(800024d04600,800024d045e8,800024d045e0,800024d045f > > 0,800024d045d8,800024d045fe) at pf_test_rule+0x693 > > pf_test(2,3,815a9000,800024d04798) at pf_test+0x10f1 > > ip_output(fd8065a2,0,800024d04850,1,0,0) at ip_output+0x829 > > ip_forward(fd8065a2,81541800,fd817a7722d0,0) at > > ip_forward+ > > 0x27a > > ip_input_if(800024d04a80,800024d04a7c,4,0,81541800) at > > ip_input > > _if+0x5fd > > ipv4_input(81541800,fd8065a2) at ipv4_input+0x37 > > carp_input(8158f000,fd8065a2,5e000158) at > > carp_input+0x1ac > > ether_input(8158f000,fd8065a2) at ether_input+0x1c0 > > vlan_input(8152f000,fd8065a2) at vlan_input+0x19a > > ether_input(8152f000,fd8065a2) at ether_input+0x76 > > if_input_process(801df048,800024d04ca8) at > > if_input_process+0x5a > > ifiq_process(801dbe00) at ifiq_process+0x6f > > taskq_thread(8002b080) at taskq_thread+0x7b > > end trace frame: 0x0, count: -26 > > > the recursion to PF_LOCK() is avoided by postponing the call to pfsync_undefer() to very last phase of pf_test(), where we run without holding PF_LOCK(). I believe we will be able to revert diff below once scope of PF_LOCK() in pf_test() will be reduced in future. However future is not here yet. To reduce PF_LOCK() scope in pf_test() we must give some love to pf's tables first. comments? OKs? thanks and regards sashan 8<---8<---8<--8< diff --git a/sys/net/if_pfsync.c b/sys/net/if_pfsync.c index f1d292d67ff..9e7dc5064b9 100644 --- a/sys/net/if_pfsync.c +++ b/sys/net/if_pfsync.c @@ -1931,7 +1931,7 @@ pfsync_insert_state(struct pf_state *st) } int -pfsync_defer(struct pf_state *st, struct mbuf *m) +pfsync_defer(struct pf_state *st, struct mbuf *m, struct pfsync_deferral **ppd) { struct pfsync_softc *sc = pfsyncif; struct pfsync_deferral *pd; @@ -1944,21 +1944,30 @@ pfsync_defer(struct pf_state *st, struct mbuf *m) m->m_flags & (M_BCAST|M_MCAST)) return (0); + pd = pool_get(>sc_pool, M_NOWAIT); + if (pd == NULL) + return (0); + + /* +* deferral queue grows faster, than timeout can consume, +* we have to ask packet (caller) to help timer and dispatch +* one deferral for us. +* +* We wish to call pfsync_undefer() here. Unfortunately we can't, +* because pfsync_undefer() will be calling to ip_output(), +* which in turn will call to pf_test(), which would then attempt +* to grab PF_LOCK() we currently hold. 
+	 */
 	if (sc->sc_deferred >= 128) {
 		mtx_enter(&sc->sc_deferrals_mtx);
-		pd = TAILQ_FIRST(&sc->sc_deferrals);
-		if (pd != NULL) {
-			TAILQ_REMOVE(&sc->sc_deferrals, pd, pd_entry);
+		*ppd = TAILQ_FIRST(&sc->sc_deferrals);
+		if (*ppd != NULL) {
+			TAILQ_REMOVE(&sc->sc_deferrals, *ppd, pd_entry);
 			sc->sc_deferred--;
 		}
 		mtx_leave(&sc->sc_deferrals_mtx);
-		if (pd != NULL)
-			pfsync_undefer(pd, 0);
-	}
-
-	pd = pool_get(&sc->sc_pool, M_NOWAIT);
-	if (pd == NULL)
-		return (0);
+	} else
+		*ppd = NULL;
 
 	m->m_pkthdr.pf.flags |= PF_TAG_GENERATED;
 	SET(st->state_flags, PFSTATE_ACK);
diff --git a/sys/net/if_pfsync.h b/sys/net/if_pfsync.h
index 2520c87a36c..037219537e8 100644
--- a/sys/net/if_pfsync.h
+++
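To make the new calling convention concrete: the deferral popped off a
full queue is now handed back to the caller instead of being dispatched
under PF_LOCK(). A sketch of the intended caller side (the idea, not
the full diff):

	struct pfsync_deferral *pd = NULL;

	PF_LOCK();
	/* ... pf_test_rule() may call pfsync_defer(st, m, &pd) ... */
	PF_UNLOCK();

	/*
	 * only now is it safe to dispatch the old deferral:
	 * pfsync_undefer() calls ip_output(), which can re-enter
	 * pf_test() and take PF_LOCK() without recursing.
	 */
	if (pd != NULL)
		pfsync_undefer(pd, 0);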
Re: [External] : Re: parallel forwarding vs. bridges
Hello,

> as above, copyout with a sleeping lock is fine.
>
> the whole point of my change is to give us the ability to lock in the
> forwarding path separately to locking in the ioctl path. half of that is
> so we can copyout safely. the other half is to avoid letting the ioctl
> path block packet processing if we can avoid it as an alternative to
> having the network stack having to yield the cpu.

I see. My confusion came from the fact I'd forgotten the pf lock got
turned into a mutex when we saw the crash.

> > let's take a look at this part of pf_purge_expired_states()
> > from your diff:
> >
> > 1543		NET_LOCK();
> > 1544		rw_enter_write(&pf_state_list.pfs_rwl);
> > 1545		PF_LOCK();
> > 1546		PF_STATE_ENTER_WRITE();
> > 1547		SLIST_FOREACH(st, &gcl, gc_list) {
> > 1548			if (st->timeout != PFTM_UNLINKED)
> > 1549				pf_remove_state(st);
> > 1550
> > 1551			pf_free_state(st);
> > 1552		}
> > 1553		PF_STATE_EXIT_WRITE();
> > 1554		PF_UNLOCK();
> > 1555		rw_exit_write(&pf_state_list.pfs_rwl);
> >
> > at line 1543 we grab NET_LOCK(), at line 1544 we are trying
> > to grab new lock (pf_state_list.pfs_rwl) exclusively.
> >
> > with your change we might be running into situation, where we do
> > copyout() as a reader on pf_state_list.pfs_rwl. Then we grab NET_LOCK()
> > and attempt to acquire pf_state_list.pfs_rwl exclusively, which is
> > still occupied by guy, who might be doing uvm_fault() in copyout(9f).
> >
> > I'm just worried we may be trading one bug for another bug. may be my
> > concern is just a false alarm here. I don't know.
>
> no, these things are all worth discussing.
>
> it's definitely possible there's bugs in here, but im pretty confident
> it's not the copyout one.

It seems to work. I'm running your diff with bluhm's parallel diff and
do occasional pfctl -Fs/pfctl -ss under load. So far so good.

> > I guess 'pfgpurge_expired_fragment(s);' is unintentional change,
> > right?
>
> yeah, i dont know how i did that. vi is hard?

Sure it is... thanks for fixing the nits.

> Index: if_pfsync.c
> ===================================================================
> RCS file: /cvs/src/sys/net/if_pfsync.c,v
> retrieving revision 1.292
> diff -u -p -r1.292 if_pfsync.c
> --- if_pfsync.c	15 Jun 2021 10:10:22 -0000	1.292
> +++ if_pfsync.c	15 Jun 2021 11:21:20 -0000
> @@ -2545,22 +2545,34 @@ pfsync_bulk_start(void)
>  {
>  	struct pfsync_softc *sc = pfsyncif;

I have not spotted anything suspicious in if_pfsync.c.

The new diff reads fine to me.

OK sashan
Re: [External] : Re: parallel forwarding vs. bridges
On Tue, Jun 08, 2021 at 06:54:25PM +0200, Alexandr Nedvedicky wrote: > Hello David, > > I'm still not sure if your change is ultimate fix, or just significantly > minimizes risk of the bug. If I understand things right, the problem we are > trying to solve: > DIOCGETSTATES we have in current, grabs NET_LOCK() and pf_state_lock > as a reader. > > it then walks through the whole state list and copies out (copyout(9f)) > data for each state into ioctl(2) buffer provided by calling process. yep. it's not an ultimate fix. it's main goal is to allow the ioctl side of things to read the states safely without blocking the packet processing side of things. > we may trip assert down in copyout(9f): > > panic: acquiring blockable sleep lock with spinlock or critical section > > held (rwlock) vmmaplk this is about taking an rwlock (vmmaplk) while holding a mutex (pflock). you can copyout while holding sleeping locks (eg, NET_LOCK or the rwlock version of the pflocks). the pfs_rwl is a sleeping lock, so it's safe to copyout while holding it. that's why this diff works. > > Stopped at db_enter+0x10: popq%rbp > > TIDPIDUID PRFLAGS PFLAGS CPU COMMAND > > *512895 28841 0 0x3 03K pfctl > > db_enter() at db_enter+0x10 > > panic(81e19411) at panic+0x12a > > witness_checkorder(fd83b09b4d18,1,0) at witness_checkorder+0xbce > > rw_enter_read(fd83b09b4d08) at rw_enter_read+0x38 > > uvmfault_lookup(8000238e3418,0) at uvmfault_lookup+0x8a > > uvm_fault_check(8000238e3418,8000238e3450,8000238e3478) at > > uvm_fault_check+0x32 > > uvm_fault(fd83b09b4d00,e36553c000,0,2) at uvm_fault+0xfc > > kpageflttrap(8000238e3590,e36553c000) at kpageflttrap+0x131 > > kerntrap(8000238e3590) at kerntrap+0x91 > > alltraps_kern_meltdown() at alltraps_kern_meltdown+0x7b > > copyout() at copyout+0x53 > > > I'm just afraid that although your change significantly reduces the risk we > will die with similar call stack as the one above, the new code is not bullet > proof. We still do copyout() while holding pf_state_list.pfs_rwl as a reader > (in pf_states_get() from your diff). I agree packets do not grab > pf_state_list.pfs_rwl at all. Your fix solves this problem in this respect. as above, copyout with a sleeping lock is fine. the whole point of my change is to give us the ability to lock in the forwarding path separately to locking in the ioctl path. half of that is so we can copyout safely. the other half is to avoid letting the ioctl path block packet processing if we can avoid it as an alternative to having the network stack having to yield the cpu. > let's take a look at this part of pf_purge_expired_states() > from your diff: > > 1543 NET_LOCK(); > 1544 rw_enter_write(_state_list.pfs_rwl); > 1545 PF_LOCK(); > 1546 PF_STATE_ENTER_WRITE(); > 1547 SLIST_FOREACH(st, , gc_list) { > 1548 if (st->timeout != PFTM_UNLINKED) > 1549 pf_remove_state(st); > 1550 > 1551 pf_free_state(st); > 1552 } > 1553 PF_STATE_EXIT_WRITE(); > 1554 PF_UNLOCK(); > 1555 rw_exit_write(_state_list.pfs_rwl); > > at line 1543 we grab NET_LOCK(), at line 1544 we are trying > to grab new lock (pf_state_list.pfs_rwl) exclusively. > > with your change we might be running into situation, where we do copyout() as > a > reader on pf_state_list.pfs_rwl. Then we grab NET_LOCK() and attempt to > acquire > pf_state_list.pfs_rwl exclusively, which is still occupied by guy, who might > be > doing uvm_fault() in copyout(9f). > > I'm just worried we may be trading one bug for another bug. may be my concern > is just a false alarm here. I don't know. 
no, these things are all worth discussing. it's definitely possible there's bugs in here, but im pretty confident it's not the copyout one. > anyway there are few more nits in your diff. yep. > > > > Index: pf.c > > === > > RCS file: /cvs/src/sys/net/pf.c,v > > retrieving revision 1.1118 > > diff -u -p -r1.1118 pf.c > > --- pf.c1 Jun 2021 09:57:11 - 1.1118 > > +++ pf.c3 Jun 2021 06:24:48 - > > @@ -1247,7 +1278,8 @@ pf_purge_expired_rules(void) > > void > > pf_purge_timeout(void *unused) > > { > > - task_add(net_tq(0), _purge_task); > > + /* XXX move to systqmp to avoid KERNEL_LOCK */ > > + task_add(systq, _purge_task); > > } > > I would just clean up the comment. looks like we should be > able to get pf's ioctl operations out of KERNEL_LOCK completely. > I'll take a further look at it, while be working in pf_ioctl.c this isnt the ioctl side though, this is the periodic gc task. if the pf locking is right then we could move the task now. > > > @@ -1280,11 +1311,10 @@ pf_purge(void *xnloops) > > * Fragments don't require PF_LOCK(), they use their own lock. > >
Re: [External] : Re: parallel forwarding vs. bridges
Hello,

> > it seems to me we may leak `pd` we took at line 1944. The leak
> > happens in case timeout_del() returns zero.
> >
> > diff below ignores the return value from timeout_del() and makes
> > sure we always call pfsync_undefer()
> >
> > I have not seen such panic in my test environment. Would you be able
> > to give my diff a try?
>
> i hit a different fault in this code recently, so i've been looking
> at the code again. my fix for the second fault will conflict with this
> one. i'll send it out to tech@ separately though.

I've just sent my OK to your diff.

> your diff looks like it moves the handling of undefer until after pf
> drops its locks. i was just going to make pfsync schedule a task or
> softint immediately if the queue was getting long, and then stop doing
> deferrals if it was way too long.

That's also an alternative I was considering; I think I still have it
somewhere in history. Both approaches are fine, I think. Scheduling a
task looks less ugly. However, as I was typing the code I started to
realize it might be better to let the packet dispatch the deferral,
because this shares resources in a more natural way. From my point of
view the story goes like this:

    the timeout does not dispatch deferrals fast enough, the queue
    grows, and a packet needs to insert a new deferral into the queue.
    The new deferral does not fit, so the options are:

	remove one deferral from the queue to make space for the new
	one. The packet then must dispatch the old deferral. This
	conserves resources, because the packet (network subsystem)
	pays the cost (memory & cpu) for deferral processing right
	away.

	schedule yet another task to dispatch the deferral. My concern
	here is that it allows the network subsystem to allocate more
	and more resources: if a deferral does not fit into the regular
	timeout queue, we just let pfsync create an 'express task' to
	dispatch the old deferral.

I see both those changes as an 'interim fix'. We may be able to restore
the original code we currently have in pfsync_defer() as soon as we
shrink the scope of PF_LOCK. Once PF_LOCK is in better shape, we will
be able to make pfsync_defer() nice and simple again.

thanks and regards
sashan
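For comparison, the task-based variant dlg describes could look roughly
like this (a hypothetical sketch; sc_deferral_task and the watermark
names are made up here, nothing like this is in a diff yet):

	#define PFSYNC_DEFER_HIWAT	128	/* drain from a task */
	#define PFSYNC_DEFER_MAXWAT	256	/* stop deferring */

	if (sc->sc_deferred >= PFSYNC_DEFER_MAXWAT)
		return (0);	/* too far behind, skip the deferral */

	if (sc->sc_deferred >= PFSYNC_DEFER_HIWAT) {
		/*
		 * unlike the packet, the task runs without PF_LOCK
		 * held and can safely call pfsync_undefer() ->
		 * ip_output() -> pf_test().
		 */
		task_add(net_tq(0), &sc->sc_deferral_task);
	}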
Re: [External] : Re: parallel forwarding vs. bridges
On Sat, Jun 05, 2021 at 01:09:01PM +0200, Alexandr Nedvedicky wrote: > Hello David, > > > > > the scope of the pf locks likely needs reduction anyway. one of my > > I agree. pf_lock covers too much in PF currently. it protects, > all rules, tables and fragment caches. > > > production firewalls panicked with the pf lock trying to lock against > > itself a couple of nights ago: > > > > db_enter() at db_enter+0x5 > > panic(81d47212) at panic+0x12a > > rw_enter(82060e70,1) at rw_enter+0x261 > > pf_test(18,2,8158f000,800024d03db8) at pf_test+0x118c > > ip6_output(fd80049a3a00,0,0,0,800024d03eb0,0) at > > ip6_output+0xd33 > > nd6_ns_output(8158f000,0,800024d04228,fd8279b3b420,0) at > > nd6_ns > > _output+0x3e2 > > nd6_resolve(8158f000,fd816c6b2ae0,fd80659d8300,800024d04220 > > ,800024d040d8) at nd6_resolve+0x29d > > ether_resolve(8158f000,fd80659d8300,800024d04220,fd816c6b2a > > e0,800024d040d8) at ether_resolve+0x127 > > ether_output(8158f000,fd80659d8300,800024d04220,fd816c6b2ae > > 0) at ether_output+0x2a > > ip6_output(fd80659d8300,0,0,0,0,0) at ip6_output+0x1180 > > pfsync_undefer_notify(fd841dbec7b8) at pfsync_undefer_notify+0xac > > pfsync_undefer(fd841dbec7b8,0) at pfsync_undefer+0x8d > > pfsync_defer(fd82303cb310,fd8065a2) at pfsync_defer+0xfe > > pf_test_rule(800024d04600,800024d045e8,800024d045e0,800024d045f > > 0,800024d045d8,800024d045fe) at pf_test_rule+0x693 > > pf_test(2,3,815a9000,800024d04798) at pf_test+0x10f1 > > ip_output(fd8065a2,0,800024d04850,1,0,0) at ip_output+0x829 > > ip_forward(fd8065a2,81541800,fd817a7722d0,0) at > > ip_forward+ > > 0x27a > > ip_input_if(800024d04a80,800024d04a7c,4,0,81541800) at > > ip_input > > _if+0x5fd > > ipv4_input(81541800,fd8065a2) at ipv4_input+0x37 > > carp_input(8158f000,fd8065a2,5e000158) at > > carp_input+0x1ac > > ether_input(8158f000,fd8065a2) at ether_input+0x1c0 > > vlan_input(8152f000,fd8065a2) at vlan_input+0x19a > > ether_input(8152f000,fd8065a2) at ether_input+0x76 > > if_input_process(801df048,800024d04ca8) at > > if_input_process+0x5a > > ifiq_process(801dbe00) at ifiq_process+0x6f > > taskq_thread(8002b080) at taskq_thread+0x7b > > end trace frame: 0x0, count: -26 > > > > diff below postpones call to pfsync_undefer() to moment, when PF_LOCK() is > released. I believe this should fix the panic you see. the change does not > look nice. and can be later reverted, when pf_lock scope will be reduced. > > I took a closer look at pfsync_defer() here: > > 1941 > 1942 if (sc->sc_deferred >= 128) { > 1943 mtx_enter(>sc_deferrals_mtx); > 1944 pd = TAILQ_FIRST(>sc_deferrals); > 1945 TAILQ_REMOVE(>sc_deferrals, pd, pd_entry); > 1946 sc->sc_deferred--; > 1947 mtx_leave(>sc_deferrals_mtx); > 1948 if (timeout_del(>pd_tmo)) > 1949 pfsync_undefer(pd, 0); > 1950 } > 1951 > 1952 pd = pool_get(>sc_pool, M_NOWAIT); > 1953 if (pd == NULL) > 1954 return (0); > > it seems to me we may leak `pd` we took at line 1944. The leak > happens in case timeout_del() return zero. > > diff below ignores the return value from timeout_del() and makes > sure we always call pfsync_undefefer() > > I have not seen such panic in my test environment. Would you be able > to give my diff a try? i hit a different fault in this code recently, so i've been looking at the code again. my fix for the second fault will conflict with this one. i'll send it out to tech@ separately though. your diff looks like it moves the handling of undefer until after pf drops its locks. 
i was just going to make pfsync schedule a task or softint immediately if the queue was getting long, and then stop doing deferrals if it was way too long. > 8<---8<---8<--8< > diff --git a/sys/net/if_pfsync.c b/sys/net/if_pfsync.c > index b90aa934de4..a5e5e1ae0f3 100644 > --- a/sys/net/if_pfsync.c > +++ b/sys/net/if_pfsync.c > @@ -272,7 +272,6 @@ void pfsync_syncdev_state(void *); > void pfsync_ifdetach(void *); > > void pfsync_deferred(struct pf_state *, int); > -void pfsync_undefer(struct pfsync_deferral *, int); > void pfsync_defer_tmo(void *); > > void pfsync_cancel_full_update(struct pfsync_softc *); > @@ -1927,7 +1926,7 @@ pfsync_insert_state(struct pf_state *st) > } > > int > -pfsync_defer(struct pf_state *st, struct mbuf *m) > +pfsync_defer(struct pf_state *st, struct mbuf *m, struct pfsync_deferral > **ppd) > { > struct pfsync_softc *sc = pfsyncif; > struct
Re: [External] : Re: parallel forwarding vs. bridges
Hello David, I'm still not sure if your change is ultimate fix, or just significantly minimizes risk of the bug. If I understand things right, the problem we are trying to solve: DIOCGETSTATES we have in current, grabs NET_LOCK() and pf_state_lock as a reader. it then walks through the whole state list and copies out (copyout(9f)) data for each state into ioctl(2) buffer provided by calling process. we may trip assert down in copyout(9f): > panic: acquiring blockable sleep lock with spinlock or critical section > held (rwlock) vmmaplk > Stopped at db_enter+0x10: popq%rbp > TIDPIDUID PRFLAGS PFLAGS CPU COMMAND > *512895 28841 0 0x3 03K pfctl > db_enter() at db_enter+0x10 > panic(81e19411) at panic+0x12a > witness_checkorder(fd83b09b4d18,1,0) at witness_checkorder+0xbce > rw_enter_read(fd83b09b4d08) at rw_enter_read+0x38 > uvmfault_lookup(8000238e3418,0) at uvmfault_lookup+0x8a > uvm_fault_check(8000238e3418,8000238e3450,8000238e3478) at > uvm_fault_check+0x32 > uvm_fault(fd83b09b4d00,e36553c000,0,2) at uvm_fault+0xfc > kpageflttrap(8000238e3590,e36553c000) at kpageflttrap+0x131 > kerntrap(8000238e3590) at kerntrap+0x91 > alltraps_kern_meltdown() at alltraps_kern_meltdown+0x7b > copyout() at copyout+0x53 I'm just afraid that although your change significantly reduces the risk we will die with similar call stack as the one above, the new code is not bullet proof. We still do copyout() while holding pf_state_list.pfs_rwl as a reader (in pf_states_get() from your diff). I agree packets do not grab pf_state_list.pfs_rwl at all. Your fix solves this problem in this respect. let's take a look at this part of pf_purge_expired_states() from your diff: 1543 NET_LOCK(); 1544 rw_enter_write(_state_list.pfs_rwl); 1545 PF_LOCK(); 1546 PF_STATE_ENTER_WRITE(); 1547 SLIST_FOREACH(st, , gc_list) { 1548 if (st->timeout != PFTM_UNLINKED) 1549 pf_remove_state(st); 1550 1551 pf_free_state(st); 1552 } 1553 PF_STATE_EXIT_WRITE(); 1554 PF_UNLOCK(); 1555 rw_exit_write(_state_list.pfs_rwl); at line 1543 we grab NET_LOCK(), at line 1544 we are trying to grab new lock (pf_state_list.pfs_rwl) exclusively. with your change we might be running into situation, where we do copyout() as a reader on pf_state_list.pfs_rwl. Then we grab NET_LOCK() and attempt to acquire pf_state_list.pfs_rwl exclusively, which is still occupied by guy, who might be doing uvm_fault() in copyout(9f). I'm just worried we may be trading one bug for another bug. may be my concern is just a false alarm here. I don't know. anyway there are few more nits in your diff. > Index: pf.c > === > RCS file: /cvs/src/sys/net/pf.c,v > retrieving revision 1.1118 > diff -u -p -r1.1118 pf.c > --- pf.c 1 Jun 2021 09:57:11 - 1.1118 > +++ pf.c 3 Jun 2021 06:24:48 - > @@ -1247,7 +1278,8 @@ pf_purge_expired_rules(void) > void > pf_purge_timeout(void *unused) > { > - task_add(net_tq(0), _purge_task); > + /* XXX move to systqmp to avoid KERNEL_LOCK */ > + task_add(systq, _purge_task); > } I would just clean up the comment. looks like we should be able to get pf's ioctl operations out of KERNEL_LOCK completely. I'll take a further look at it, while be working in pf_ioctl.c > @@ -1280,11 +1311,10 @@ pf_purge(void *xnloops) >* Fragments don't require PF_LOCK(), they use their own lock. 
> 	 */
> 	if ((*nloops) >= pf_default_rule.timeout[PFTM_INTERVAL]) {
> -		pf_purge_expired_fragments();
> +		pfgpurge_expired_fragment(s);
> 		*nloops = 0;
> 	}
> 	NET_UNLOCK();
> -	KERNEL_UNLOCK();
>
> 	timeout_add_sec(&pf_purge_to, 1);
> }

I guess 'pfgpurge_expired_fragment(s);' is unintentional change, right?

> Index: pfvar_priv.h
> ===================================================================
> RCS file: /cvs/src/sys/net/pfvar_priv.h,v
> retrieving revision 1.6
> diff -u -p -r1.6 pfvar_priv.h
> --- pfvar_priv.h	9 Feb 2021 14:06:19 -0000	1.6
> +++ pfvar_priv.h	3 Jun 2021 06:24:48 -0000
> @@ -38,6 +38,115 @@
>  #ifdef _KERNEL
>
>  #include <sys/rwlock.h>
> +#include <sys/mutex.h>
> +
> +/*
> +
> +states are linked into a global list to support the following
> +functionality:
> +
> + - garbage collection
> + - pfsync bulk send operations
> + - bulk state fetches via the DIOCGETSTATES ioctl
> + - bulk state clearing via the DIOCCLRSTATES ioctl
> +
> +states are inserted into the global pf_state_list once it has also
> +been successfully added to the various trees that make up the state
> +table. states are only removed from the pf_state_list by the garbage
> +collection process.
> +

the multiline comment does not match style, I think. I kind of
Re: [External] : Re: parallel forwarding vs. bridges
On 7.6.2021. 9:25, Alexandr Nedvedicky wrote: > Hello, > > On Sun, Jun 06, 2021 at 09:54:50PM +0200, Hrvoje Popovski wrote: > >> >> this one? > yes, exactly this one. Ok, great .. this and similar panics are products of accidental stp loops. I have two same boxes with the same network setup connected to the same switch. At one point i've tested bluhm@ and sashan@ diffs and forget that i have tpmr(4) or veb(4) configured on both boxes, which is loop ... So now, when i understood what i've done i can easily reproduce panic with yours and dlg@ diff ... i just start to send traffic over one box and other box panic :) all those panics are without pf and with tpmr or veb here's few panics .. i can send details if you find some panics interesting .. login: papnapiniacc:n: i kc e: r n e l di a g n o s ti ca s s e r t i o n " r v e b e = =N UL L "f a i l e d :f i l e" p /okseyoslr/_ndenole t_ /piu ft : _ e t h e rb r i dg e . c " , li n e22 6e b t_ r e p la c ee b 0 x f f f f 8 0e0bd0eipa0lg0n6:8 o2 s te i6 c8n e b e 0 x f f f ff d 8 3 ab 7 7 b d 9 8 r v e b e 0 x f f f ff d 8 3 a b 7 7 b e 1 0 Stopped at db_enter+0x10: popq%rbp TIDPIDUID PRFLAGS PFLAGS CPU COMMAND 91272 9 0 0x14000 0x2001 softnet *410720 68055 0 0x14000 0x2000 softnet 444532 44174 0 0x14000 0x2002 softnet 149723 5636 0 0x14000 0x2004 softnet db_enter() at db_enter+0x10 panic(81e85119) at panic+0xbf etherbridge_map(80682e68,8016ad00,90e2ba33b4a0) at etherbridge_map+0x37d veb_port_input(80082048,fd80bcb8f600,90e2ba33b4a1,8016ad00) at veb_port_input+0x2fa ether_input(80082048,fd80bcb8f600) at ether_input+0xf5 if_input_process(80082048,800023871988) at if_input_process+0x6f ifiq_process(80080f00) at ifiq_process+0x69 taskq_thread(80030200) at taskq_thread+0x9f end trace frame: 0x0, count: 7 https://www.openbsd.org/ddb.html describes the minimum info required in bug reports. Insufficient info makes it difficult to find and fix bugs. ddb{0}> show panic cpu0: kernel diagnostic assertion "rvebe == NULL" failed: file "/sys/net/if_etherbridge.c", line 226 ebt_replace eb 0x80682e68 nebe 0xfd83ab77bd98 rvebe 0xfd83ab77be10 cpu2: pool_do_put: ebepl: double pool_put: 0xfd83ab77b8e8 *cpu1: kernel diagnostic assertion "smr->smr_func == NULL" failed: file "/sys/kern/kern_smr.c", line 247 ddb{0}> r620-1# papnainicc: : k er n eld ia g n o st i c a s s e r ti o n " r e f c n t ! =~0 "f a i le d : f il e" /s y s / k e r n/kkeernrenl_ s yn c h . c ",l i ne 8 2 6 Stopped at db_enter+0x10: popq%rbp TIDPIDUID PRFLAGS PFLAGS CPU COMMAND 73487 28509 0 0x14000 0x2001 softnet 99803 26490 0 0x14000 0x2005 softnet *372376 10060 0 0x14000 0x2004 softnet 128194 68537 0 0x14000 0x2003 softnet 105333 38464 0 0x14000 0x42000 softclock db_enter() at db_enter+0x10 panic(81e07adc) at panic+0xbf __assert(81e7558b,81de2e62,33a,81e13185) at __assert+0x2b refcnt_rele(fd839484c768) at refcnt_rele+0x6f etherbridge_map(80682e68,812c6f00,90e2ba33b4a0) at etherbridge_map+0x1ca veb_port_input(80087048,fd80a6545200,90e2ba33b4a1,812c6f00) at veb_port_input+0x2fa ether_input(80087048,fd80a6545200) at ether_input+0xf5 if_input_process(80087048,80002386bba8) at if_input_process+0x6f ifiq_process(80086d00) at ifiq_process+0x69 taskq_thread(80030100) at taskq_thread+0x9f end trace frame: 0x0, count: 5 https://www.openbsd.org/ddb.html describes the minimum info required in bug reports. Insufficient info makes it difficult to find and fix bugs. 
ddb{4}> show panic cpu4: kernel diagnostic assertion "refcnt != ~0" failed: file "/sys/kern/kern_synch.c", line 826 *cpu1: kernel diagnostic assertion "smr->smr_func == NULL" failed: file "/sys/kern/kern_smr.c", line 247 ddb{4}> r620-1# papppnpppannnanniiniiciicc::c:cc : :: poo l_d o _ge t: mc l 2k 2:p a ge emp ty Stopped at db_enter+0x10: popq%rbp TIDPIDUID PRFLAGS PFLAGS CPU COMMAND * 13147 52870 0 0x14000 0x2002 softnet db_enter() at db_enter+0x10 panic(81e0a795) at panic+0xbf pool_do_get(821f6d98,2,800023864bb4) at pool_do_get+0x309 pool_get(821f6d98,2) at pool_get+0x9d m_clget(0,2,802) at
Re: [External] : Re: parallel forwarding vs. bridges
Hello,

On Sun, Jun 06, 2021 at 09:54:50PM +0200, Hrvoje Popovski wrote:
>
> this one?

yes, exactly this one.

> ddb{5}> show panic
> cpu5: kernel diagnostic assertion "rvebe == NULL" failed: file
> "/sys/net/if_etherbridge.c", line 226 ebt_replace eb 0x80682e68
> nebe 0xfd8391e91630 rvebe 0xfd8391e914c8
> *cpu2: kernel diagnostic assertion "smr->smr_func == NULL" failed: file
> "/sys/kern/kern_smr.c", line 247
> ddb{5}>

Now I see things are actually more colorful, thanks to the work to make
panic handling mp-friendly. In the earlier panic I could see
'smr->smr_func == NULL' was the only assert on fire. With mp panic
handling I see there are actually more asserts on fire. This is neat.

It looks like packets are trying to ride through a port which is just
being destroyed.

thanks and regards
sashan
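For context on the assert itself: smr_call() refuses to queue an
smr_entry that is already pending; smr_func is NULL only while the
entry is idle. A simplified sketch of how a double release trips it
(these are not the actual veb structures, just the shape of the bug):

	struct ebe {
		struct smr_entry	ebe_smr;	/* smr_init()ed */
		struct refcnt		ebe_refs;
		/* ... */
	};

	void
	ebe_rele(struct ebe *ebe)
	{
		/* refcnt_rele() returns nonzero on the last reference */
		if (refcnt_rele(&ebe->ebe_refs))
			smr_call(&ebe->ebe_smr, ebe_free, ebe);
	}

If a use-after-free lets two paths each drop the "last" reference,
smr_call() runs twice on the same entry: the second call finds
smr_func != NULL and the kernel asserts. The 'double pool_put: ebepl'
on the other cpu in Hrvoje's report points the same way.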
Re: [External] : Re: parallel forwarding vs. bridges
On 5.6.2021. 18:43, Alexandr Nedvedicky wrote: > According to early tests it works well. Currently there is a one > mysterious panic, which Hrvoje may be able to comment on > how to trigger it. The stack looks as follows: > > kernel diagnostic assertion "smr->smr_func == NULL" failed: file > "/sys/kern/kern_smr.c", line 247 > panic() at panic+0x12a > __assert() at > __assert+0x2b > smr_call_impl() at smr_call_impl+0xd4 > veb_port_input() at veb_port_input+0x2fa > ether_input() at ether_input+0x10b > if_input_process() at if_input_process+0x6f > ifiq_process() at ifiq_process+0x69 > taskq_thread() at taskq_thread+0x81 this one? r620-1# ppaanniicc: : k e r ne ld ia g n o s t icas s e r t io n" rv e b e = = N U LL "f ai l e d : f i l e " / s y s/ n e tk/eirnfe_el t he r b r i d g e. c " , l i ne 2 2 6eb t _ r ep l a ce e b 0 x f f f f 80 0 0 00 6 8 2 e6 8n eb e 0 x f fdifaffgdno8s39t1iec 9 16 3 0r v eb e 0 x f ff f f d 83 9 1 e 9 1 4c 8 Stopped at db_enter+0x10: popq%rbp TIDPIDUID PRFLAGS PFLAGS CPU COMMAND 218522 97075 0 0x14000 0x2001 softnet *210501 76472 0 0x14000 0x2005 softnet 192964556 0 0x14000 0x2002 softnet 312196 98971 0 0x14000 0x2003 softnet db_enter() at db_enter+0x10 panic(81e84b7f) at panic+0xbf etherbridge_map(80682e68,8016a000,90e2ba33b4a0) at etherbridge_map+0x37d veb_port_input(80082048,fd80a6ec8800,90e2ba33b4a1,8016a000) at veb_port_input+0x2fa ether_input(80082048,fd80a6ec8800) at ether_input+0xf5 if_input_process(80082048,800023871bf8) at if_input_process+0x6f ifiq_process(80086300) at ifiq_process+0x69 taskq_thread(80030200) at taskq_thread+0x9f end trace frame: 0x0, count: 7 https://www.openbsd.org/ddb.html describes the minimum info required in bug reports. Insufficient info makes it difficult to find and fix bugs. ddb{5}> show panic cpu5: kernel diagnostic assertion "rvebe == NULL" failed: file "/sys/net/if_etherbridge.c", line 226 ebt_replace eb 0x80682e68 nebe 0xfd8391e91630 rvebe 0xfd8391e914c8 *cpu2: kernel diagnostic assertion "smr->smr_func == NULL" failed: file "/sys/kern/kern_smr.c", line 247 ddb{5}>
Re: [External] : Re: parallel forwarding vs. bridges
Hello David, > > my best argument for mutexes in pf is that currently we use smr critical > sections around things like the bridge and aggr port selection, which > are very obviously read-only workloads. pf may be a largely read-only > workload, but where it is at the moment means it's not going to get run > concurrently anyway. > I'm glad you mention bridges and smr. with bluhm's parallel forwarding diff [1] Hrvoje noticed panics with various bridges: db_enter() at db_enter+0x10 panic() at panic+0x12a __assert() at __assert+0x2b assertwaitok() at assertwaitok+0xdc mi_switch() at mi_switch+0x40 sleep_finish() at sleep_finish+0x11c rw_enter() at rw_enter+0x229 pf_test() at pf_test+0x1055 tpmr_pf() at tpmr_pf+0x7e tpmr_input() at tpmr_input+0x124 ether_input() at ether_input+0xf5 if_input_process() at if_input_process+0x6f ifiq_process() at ifiq_process+0x69 the things start to go downhill right here in ether_input(): 426 427 smr_read_enter(); 428 eb = SMR_PTR_GET(>ac_brport); 429 if (eb != NULL) { 430 m = (*eb->eb_input)(ifp, m, dst, eb->eb_port); 431 if (m == NULL) { 432 smr_read_leave(); 433 return; 434 } 435 } 436 smr_read_leave(); 437 we panic, because PF is going to sleep on one of its locks. this is the case of 1 of 40 packets. The one, which needs to create state. ether_input() can not expect that ->eb_input(), won't block. May be someday one day in future, but we are not there yet. we don't see panic in current tree, because of exclusive net_lock. The ether_input() runs exclusively as a writer on netlock, thus all further locks down the stack are grabbed without blocking. My personal though on this: non-blocking ->eb_input() is pretty hard requirement. I don't mind if packet processing gives up a CPU to wait for another packet to get its job done. This gives other process chance to use cpu, be it another packet or any process in system. lack of sleeping points may lead to less responsive system, when network will get more busy. diff below combines smr_read... with reference counting using traditional atomic ops. The idea is to adjust code found in ether_input(): 426 427 smr_read_enter(); 428 eb = SMR_PTR_GET(>ac_brport); 429 if (eb != NULL) 430 eb->eb_port_take(eb->eb_port); 431 smr_read_leave(); 432 if (eb != NULL) { 433 m = (*eb->eb_input)(ifp, m, dst, eb->eb_port); 434 eb->eb_port_rele(eb->eb_port); 435 if (m == NULL) { 436 return; 437 } 438 } 439 According to early tests it works well. Currently there is a one mysterious panic, which Hrvoje may be able to comment on how to trigger it. The stack looks as follows: kernel diagnostic assertion "smr->smr_func == NULL" failed: file "/sys/kern/kern_smr.c", line 247 panic() at panic+0x12a __assert() at __assert+0x2b smr_call_impl() at smr_call_impl+0xd4 veb_port_input() at veb_port_input+0x2fa ether_input() at ether_input+0x10b if_input_process() at if_input_process+0x6f ifiq_process() at ifiq_process+0x69 taskq_thread() at taskq_thread+0x81 It's not obvious from stack above, but we die in etherbridge_map(), which is being called from here in veb_port_input(): #if 0 && defined(IPSEC) if (ISSET(ifp->if_flags, IFF_LINK2) && (m = veb_ipsec_in(ifp0, m)) == NULL) return (NULL); #endif eh = mtod(m, struct ether_header *); if (ISSET(p->p_bif_flags, IFBIF_LEARNING)) etherbridge_map(>sc_eb, p, src); CLR(m->m_flags, M_BCAST|M_MCAST); SET(m->m_flags, M_PROTO1); the panic itself must be triggered by 'ebe_rele(oebe);' found at the end of etherbridge_map(). I've spent few days with it, and have no idea how it could happen. 
It looks like the result of a memory corruption/use-after-free bug.

Anyway, I would like to know if it would make sense to go with the diff
below. This will allow us to keep hunting for other bugs sitting in the
parallel forwarding path.

thanks and regards
sashan

[1] parallel forwarding diff:
    https://marc.info/?l=openbsd-tech&m=161903387904923&w=2

8<---8<---8<--8<
diff --git a/sys/net/if_bridge.c b/sys/net/if_bridge.c
index 703be01648f..c41f3c93668 100644
--- a/sys/net/if_bridge.c
+++ b/sys/net/if_bridge.c
@@ -138,6
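The port take/rele pair in the diff is essentially reference counting
layered under the smr section. With refcnt(9) it boils down to
something like this (a sketch; the p_refs member name is an assumption,
not necessarily what the diff uses):

	void
	veb_eb_port_take(void *port)
	{
		struct veb_port *p = port;

		refcnt_take(&p->p_refs);
	}

	void
	veb_eb_port_rele(void *port)
	{
		struct veb_port *p = port;

		refcnt_rele_wake(&p->p_refs);
	}

Detach then waits for the count to drain, e.g. with
refcnt_finalize(&p->p_refs, "vebdtor"), before freeing the port. That
is what lets ether_input() leave the smr critical section before
calling eb_input(): the refcount, not the smr section, keeps the port
alive across the potentially sleeping call.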
Re: [External] : Re: parallel forwarding vs. bridges
Hello David, > the scope of the pf locks likely needs reduction anyway. one of my I agree. pf_lock covers too much in PF currently. it protects, all rules, tables and fragment caches. > production firewalls panicked with the pf lock trying to lock against > itself a couple of nights ago: > > db_enter() at db_enter+0x5 > panic(81d47212) at panic+0x12a > rw_enter(82060e70,1) at rw_enter+0x261 > pf_test(18,2,8158f000,800024d03db8) at pf_test+0x118c > ip6_output(fd80049a3a00,0,0,0,800024d03eb0,0) at > ip6_output+0xd33 > nd6_ns_output(8158f000,0,800024d04228,fd8279b3b420,0) at > nd6_ns > _output+0x3e2 > nd6_resolve(8158f000,fd816c6b2ae0,fd80659d8300,800024d04220 > ,800024d040d8) at nd6_resolve+0x29d > ether_resolve(8158f000,fd80659d8300,800024d04220,fd816c6b2a > e0,800024d040d8) at ether_resolve+0x127 > ether_output(8158f000,fd80659d8300,800024d04220,fd816c6b2ae > 0) at ether_output+0x2a > ip6_output(fd80659d8300,0,0,0,0,0) at ip6_output+0x1180 > pfsync_undefer_notify(fd841dbec7b8) at pfsync_undefer_notify+0xac > pfsync_undefer(fd841dbec7b8,0) at pfsync_undefer+0x8d > pfsync_defer(fd82303cb310,fd8065a2) at pfsync_defer+0xfe > pf_test_rule(800024d04600,800024d045e8,800024d045e0,800024d045f > 0,800024d045d8,800024d045fe) at pf_test_rule+0x693 > pf_test(2,3,815a9000,800024d04798) at pf_test+0x10f1 > ip_output(fd8065a2,0,800024d04850,1,0,0) at ip_output+0x829 > ip_forward(fd8065a2,81541800,fd817a7722d0,0) at > ip_forward+ > 0x27a > ip_input_if(800024d04a80,800024d04a7c,4,0,81541800) at > ip_input > _if+0x5fd > ipv4_input(81541800,fd8065a2) at ipv4_input+0x37 > carp_input(8158f000,fd8065a2,5e000158) at > carp_input+0x1ac > ether_input(8158f000,fd8065a2) at ether_input+0x1c0 > vlan_input(8152f000,fd8065a2) at vlan_input+0x19a > ether_input(8152f000,fd8065a2) at ether_input+0x76 > if_input_process(801df048,800024d04ca8) at > if_input_process+0x5a > ifiq_process(801dbe00) at ifiq_process+0x6f > taskq_thread(8002b080) at taskq_thread+0x7b > end trace frame: 0x0, count: -26 > diff below postpones call to pfsync_undefer() to moment, when PF_LOCK() is released. I believe this should fix the panic you see. the change does not look nice. and can be later reverted, when pf_lock scope will be reduced. I took a closer look at pfsync_defer() here: 1941 1942 if (sc->sc_deferred >= 128) { 1943 mtx_enter(>sc_deferrals_mtx); 1944 pd = TAILQ_FIRST(>sc_deferrals); 1945 TAILQ_REMOVE(>sc_deferrals, pd, pd_entry); 1946 sc->sc_deferred--; 1947 mtx_leave(>sc_deferrals_mtx); 1948 if (timeout_del(>pd_tmo)) 1949 pfsync_undefer(pd, 0); 1950 } 1951 1952 pd = pool_get(>sc_pool, M_NOWAIT); 1953 if (pd == NULL) 1954 return (0); it seems to me we may leak `pd` we took at line 1944. The leak happens in case timeout_del() return zero. diff below ignores the return value from timeout_del() and makes sure we always call pfsync_undefefer() I have not seen such panic in my test environment. Would you be able to give my diff a try? 
thanks a lot regards sashan 8<---8<---8<--8< diff --git a/sys/net/if_pfsync.c b/sys/net/if_pfsync.c index b90aa934de4..a5e5e1ae0f3 100644 --- a/sys/net/if_pfsync.c +++ b/sys/net/if_pfsync.c @@ -272,7 +272,6 @@ voidpfsync_syncdev_state(void *); void pfsync_ifdetach(void *); void pfsync_deferred(struct pf_state *, int); -void pfsync_undefer(struct pfsync_deferral *, int); void pfsync_defer_tmo(void *); void pfsync_cancel_full_update(struct pfsync_softc *); @@ -1927,7 +1926,7 @@ pfsync_insert_state(struct pf_state *st) } int -pfsync_defer(struct pf_state *st, struct mbuf *m) +pfsync_defer(struct pf_state *st, struct mbuf *m, struct pfsync_deferral **ppd) { struct pfsync_softc *sc = pfsyncif; struct pfsync_deferral *pd; @@ -1939,19 +1938,29 @@ pfsync_defer(struct pf_state *st, struct mbuf *m) m->m_flags & (M_BCAST|M_MCAST)) return (0); + pd = pool_get(>sc_pool, M_NOWAIT); + if (pd == NULL) + return (0); + + /* +* deferral queue grows faster, than timeout can consume, +* we have to ask packet (caller) to help timer and dispatch +* one deferral for us. +* +* We wish to call pfsync_undefer() here. Unfortunately we can't, +* because pfsync_undefer() will be calling to ip_output(), +* which in turn will call to pf_test(), which would then attempt +
Re: [External] : Re: parallel forwarding vs. bridges
On Thu, Jun 03, 2021 at 01:09:48AM +0200, Alexandr Nedvedicky wrote: > > pf_purge_expired_states(u_int32_t maxcheck) > > { > > > - cur = pf_state_ref(next); > > + do { > > + if ((cur->timeout == PFTM_UNLINKED) || > > + (pf_state_expires(cur) <= getuptime())) { > > + SLIST_INSERT_HEAD(, pf_state_ref(cur), gc_list); > > I wonder: is the extra reference you are chasing for coming from here? > I suspect pf_state_ref(cur) is being called two times, as macro expands. I think you're right :D :D This is an updated diff with a fix for this and the explanation. It also reduces the scope of the NET_LOCK so scanning the state list shouldn't block the network stack. Index: pf.c === RCS file: /cvs/src/sys/net/pf.c,v retrieving revision 1.1118 diff -u -p -r1.1118 pf.c --- pf.c1 Jun 2021 09:57:11 - 1.1118 +++ pf.c3 Jun 2021 06:24:48 - @@ -308,7 +308,7 @@ static __inline void pf_set_protostate(s struct pf_src_tree tree_src_tracking; struct pf_state_tree_id tree_id; -struct pf_state_queue state_list; +struct pf_state_list pf_state_list = PF_STATE_LIST_INITIALIZER(pf_state_list); RB_GENERATE(pf_src_tree, pf_src_node, entry, pf_src_compare); RB_GENERATE(pf_state_tree, pf_state_key, entry, pf_state_compare_key); @@ -440,6 +440,37 @@ pf_check_threshold(struct pf_threshold * return (threshold->count > threshold->limit); } +void +pf_state_list_insert(struct pf_state_list *pfs, struct pf_state *st) +{ + /* +* we can always put states on the end of the list. +* +* things reading the list should take a read lock, then +* the mutex, get the head and tail pointers, release the +* mutex, and then they can iterate between the head and tail. +*/ + + pf_state_ref(st); /* get a ref for the list */ + + mtx_enter(>pfs_mtx); + TAILQ_INSERT_TAIL(>pfs_list, st, entry_list); + mtx_leave(>pfs_mtx); +} + +void +pf_state_list_remove(struct pf_state_list *pfs, struct pf_state *st) +{ + /* states can only be removed when the write lock is held */ + rw_assert_wrlock(>pfs_rwl); + + mtx_enter(>pfs_mtx); + TAILQ_REMOVE(>pfs_list, st, entry_list); + mtx_leave(>pfs_mtx); + + pf_state_unref(st); /* list no longer references the state */ +} + int pf_src_connlimit(struct pf_state **state) { @@ -986,7 +1017,7 @@ pf_state_insert(struct pfi_kif *kif, str PF_STATE_EXIT_WRITE(); return (-1); } - TAILQ_INSERT_TAIL(_list, s, entry_list); + pf_state_list_insert(_state_list, s); pf_status.fcounters[FCNT_STATE_INSERT]++; pf_status.states++; pfi_kif_ref(kif, PFI_KIF_REF_STATE); @@ -1247,7 +1278,8 @@ pf_purge_expired_rules(void) void pf_purge_timeout(void *unused) { - task_add(net_tq(0), _purge_task); + /* XXX move to systqmp to avoid KERNEL_LOCK */ + task_add(systq, _purge_task); } void @@ -1255,9 +1287,6 @@ pf_purge(void *xnloops) { int *nloops = xnloops; - KERNEL_LOCK(); - NET_LOCK(); - /* * process a fraction of the state table every second * Note: @@ -1268,6 +1297,8 @@ pf_purge(void *xnloops) pf_purge_expired_states(1 + (pf_status.states / pf_default_rule.timeout[PFTM_INTERVAL])); + NET_LOCK(); + PF_LOCK(); /* purge other expired types every PFTM_INTERVAL seconds */ if (++(*nloops) >= pf_default_rule.timeout[PFTM_INTERVAL]) { @@ -1280,11 +1311,10 @@ pf_purge(void *xnloops) * Fragments don't require PF_LOCK(), they use their own lock. 
*/ if ((*nloops) >= pf_default_rule.timeout[PFTM_INTERVAL]) { - pf_purge_expired_fragments(); + pfgpurge_expired_fragment(s); *nloops = 0; } NET_UNLOCK(); - KERNEL_UNLOCK(); timeout_add_sec(_purge_to, 1); } @@ -1447,7 +1477,7 @@ pf_free_state(struct pf_state *cur) } pf_normalize_tcp_cleanup(cur); pfi_kif_unref(cur->kif, PFI_KIF_REF_STATE); - TAILQ_REMOVE(_list, cur, entry_list); + pf_state_list_remove(_state_list, cur); if (cur->tag) pf_tag_unref(cur->tag); pf_state_unref(cur); @@ -1458,53 +1488,77 @@ pf_free_state(struct pf_state *cur) void pf_purge_expired_states(u_int32_t maxcheck) { + /* +* this task/thread/context/whatever is the only thing that +* removes states from the pf_state_list, so the cur reference +* it holds between calls is guaranteed to still be in the +* list. +*/ static struct pf_state *cur = NULL; - struct pf_state *next; - SLIST_HEAD(pf_state_gcl, pf_state) gcl; + + struct pf_state *head, *tail; +
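The extra reference sashan spotted comes from SLIST_INSERT_HEAD() being
a macro that expands its element argument more than once. From
<sys/queue.h> (simplified):

	#define SLIST_INSERT_HEAD(head, elm, field) do {	\
		(elm)->field.sle_next = (head)->slh_first;	\
		(head)->slh_first = (elm);			\
	} while (0)

so this took two references per state:

	SLIST_INSERT_HEAD(&gcl, pf_state_ref(cur), gc_list);

and the fix is to take the reference outside the macro:

	pf_state_ref(cur);
	SLIST_INSERT_HEAD(&gcl, cur, gc_list);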
Re: [External] : Re: parallel forwarding vs. bridges
On Thu, Jun 03, 2021 at 01:09:48AM +0200, Alexandr Nedvedicky wrote: > Hello, > > looks like my earlier mail got eaten by SPAM filter, let's try again... > > > > > > moving pf locks to mutexes makes sense to me, but like you say, > > this will need testing and experimentation. one of the issues > > I'm not entirely convinced that trading rwlock for mutexes > is a good thing. IMO majority of packets are being processed > as a readers on state lock. If packet does not match state, > then it needs to be pushed through rules, where we run to pf_lock > and state lock. Both those locks are grabbed exclusively on > this path. However I think we can relax pf_lock to reader > in a future. there's obviously plusses and minuses for trading rwlocks for mutexes. here's the states info from one of my firewalls at work: State Table Total Rate current entries 789418 half-open tcp 19101626 searches 2835612773525 221420.0/s inserts 37030235256 2891.5/s removals 37029445838 2891.5/s the ratio of searches to state table updates is like 40 to 1, so yes, letting readers through concurrently would be useful. > I'd like to better understand why mutexes (spin locks) are better in > network subsystem? I'm concerned that removing all sleeping points from > network stack will create life harder for scheduler. If such box will > be hammered by network, the scheduler won't allow other components to > run, because the whole network subsystem got turned to one giant > non-interruptible block. I admit my concern may come from lack > of understanding of overall concept. as i said, there's plusses and minuses for mutexes as pflocks. the minus is that you lose the ability to run pf concurrently. however, i'm pretty sure that there's a lot of mutable state inside pf that gets updated on every packet that almost certainly benefits from being run exclusively. im mostly thinking about the packet and byte counters on rules and states. my best argument for mutexes in pf is that currently we use smr critical sections around things like the bridge and aggr port selection, which are very obviously read-only workloads. pf may be a largely read-only workload, but where it is at the moment means it's not going to get run concurrently anyway. i think the current scope of the locks in pf needs to be looked at anyway. we might be able to reduce the contention and allow concurrency with mutexes by doing something more complicated than just swapping the lock type around. eg, could the ruleset be wired up with SMR pointers for traversal? could we split the state trees into bucks and hash packets into them for lookup? stoeplitz is probably cheap enough to make that possible... the scope of the pf locks likely needs reduction anyway. 
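to make the bucket idea above concrete, a rough sketch (entirely
hypothetical, nothing like this is in any diff; pf_pkt_hash() stands in
for a symmetric stoeplitz-style hash, so both directions of a
connection land in the same bucket):

	#define PF_STATE_NBUCKETS	64	/* power of two */

	struct pf_state_bucket {
		struct mutex		sb_mtx;	/* per-bucket lock */
		struct pf_state_tree_id	sb_tree;
	} pf_state_buckets[PF_STATE_NBUCKETS];

	/* lookups then contend only within one bucket */
	b = &pf_state_buckets[pf_pkt_hash(pd) & (PF_STATE_NBUCKETS - 1)];
	mtx_enter(&b->sb_mtx);
	s = RB_FIND(pf_state_tree_id, &b->sb_tree, &key);
	mtx_leave(&b->sb_mtx);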
one of my production firewalls panicked with the pf lock trying to lock against itself a couple of nights ago: db_enter() at db_enter+0x5 panic(81d47212) at panic+0x12a rw_enter(82060e70,1) at rw_enter+0x261 pf_test(18,2,8158f000,800024d03db8) at pf_test+0x118c ip6_output(fd80049a3a00,0,0,0,800024d03eb0,0) at ip6_output+0xd33 nd6_ns_output(8158f000,0,800024d04228,fd8279b3b420,0) at nd6_ns _output+0x3e2 nd6_resolve(8158f000,fd816c6b2ae0,fd80659d8300,800024d04220 ,800024d040d8) at nd6_resolve+0x29d ether_resolve(8158f000,fd80659d8300,800024d04220,fd816c6b2a e0,800024d040d8) at ether_resolve+0x127 ether_output(8158f000,fd80659d8300,800024d04220,fd816c6b2ae 0) at ether_output+0x2a ip6_output(fd80659d8300,0,0,0,0,0) at ip6_output+0x1180 pfsync_undefer_notify(fd841dbec7b8) at pfsync_undefer_notify+0xac pfsync_undefer(fd841dbec7b8,0) at pfsync_undefer+0x8d pfsync_defer(fd82303cb310,fd8065a2) at pfsync_defer+0xfe pf_test_rule(800024d04600,800024d045e8,800024d045e0,800024d045f 0,800024d045d8,800024d045fe) at pf_test_rule+0x693 pf_test(2,3,815a9000,800024d04798) at pf_test+0x10f1 ip_output(fd8065a2,0,800024d04850,1,0,0) at ip_output+0x829 ip_forward(fd8065a2,81541800,fd817a7722d0,0) at ip_forward+ 0x27a ip_input_if(800024d04a80,800024d04a7c,4,0,81541800) at ip_input _if+0x5fd ipv4_input(81541800,fd8065a2) at ipv4_input+0x37 carp_input(8158f000,fd8065a2,5e000158) at carp_input+0x1ac ether_input(8158f000,fd8065a2) at ether_input+0x1c0 vlan_input(8152f000,fd8065a2) at vlan_input+0x19a ether_input(8152f000,fd8065a2) at ether_input+0x76 if_input_process(801df048,800024d04ca8) at if_input_process+0x5a ifiq_process(801dbe00) at ifiq_process+0x6f
Re: [External] : Re: parallel forwarding vs. bridges
On Wed, May 19, 2021 at 01:48:26AM +0200, Alexandr Nedvedicky wrote: > Hello, > > just for the record... > > > > > > in current tree the ether_input() is protected by NET_LOCK(), which is > > > grabbed > > > by caller as a writer. bluhm's diff changes NET_LOCK() readlock, so > > > ether_input() can run concurrently. Switching NET_LOCK() to r-lock has > > > implications on smr read section above. The ting is the call to > > > eb->eb_input() > > > can sleep now. This is something what needs to be avoided within smr > > > section. > > > > Is the new sleeping point introduced by the fact the PF_LOCK() is a > > rwlock? Did you consider using a mutex, at least for the time being, > > in order to not run in such issues? > > below is a diff, which trades both pf(4) rw-locks for mutexes. > > diff compiles, it still needs testing/experimenting. hi. i'm trying to get my head back into this space, so i'm trying to have a look at these diffs properly. moving pf locks to mutexes makes sense to me, but like you say, this will need testing and experimentation. one of the issues identified in another part of this thread (and on icb?) is that the ioctl path does some stuff which can sleep, but you can't sleep while holding a mutex which would now be protecting the data structures you're trying to look at. a more specific example is if you're trying to dump the state table. like you say, the state table is currently protected by the pf_state_lock. iterating over the state table and giving up this lock to copyout the entries is... not fun. the diff below is a work in progress attempt to address this. it works by locking the pf state list separately so it can support traversal without (long) holds of locks needed in the forwarding path. the very short explanation is that the TAILQ holding the states is locked separately to the links between the states in the list. a much longer explanataion is in the diff in the pfvar_priv.h chunk. inserting states into the TAILQ while processing packets uses a mutex. iterating over the list in the pf purge processing, ioctl paths, and pfsync can largely rely on an rwlock. the rwlock would also allow concurrent access to the list of states, ie, you can dump the list of states while the gc thread is looking for states to collect, also while pfsync is sending out a bulk update. it also converts the ioctl handling for flushing states to using the list instead of one of the RB trees holding states. i'm sure everyone wants an omgoptimised state flushing mechanism. this is a work in progress because i've screwed up the pf_state reference counting somehow. states end up with an extra reference, and i can't see where that's coming from. anyway. apart from the refcounting thing it seems to be working well, and would take us a step closer to using mutexes for pf locks. even if we throw this away, id still like to move the pf purge processing from out of nettq0 into systq, just to avoid having a place where a nettq thread has to spin waiting for the kernel lock. 
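the read side of the list protocol described above distills to
something like this (a sketch of the intent; the diff below has the
real code):

	struct pf_state *head, *tail, *st;

	rw_enter_read(&pf_state_list.pfs_rwl);

	/* the mutex is held only long enough to snapshot the ends */
	mtx_enter(&pf_state_list.pfs_mtx);
	head = TAILQ_FIRST(&pf_state_list.pfs_list);
	tail = TAILQ_LAST(&pf_state_list.pfs_list, pf_state_queue);
	mtx_leave(&pf_state_list.pfs_mtx);

	st = head;
	while (st != NULL) {
		/* copyout() is fine here: pfs_rwl is a sleeping lock */
		if (st == tail)
			break;
		st = TAILQ_NEXT(st, entry_list);
	}

	rw_exit_read(&pf_state_list.pfs_rwl);

concurrent inserts at the tail don't bother the reader because it stops
at the tail it snapshotted, and removals are excluded by holding
pfs_rwl shared.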
Index: pf.c === RCS file: /cvs/src/sys/net/pf.c,v retrieving revision 1.1118 diff -u -p -r1.1118 pf.c --- pf.c1 Jun 2021 09:57:11 - 1.1118 +++ pf.c2 Jun 2021 07:52:21 - @@ -308,7 +308,7 @@ static __inline void pf_set_protostate(s struct pf_src_tree tree_src_tracking; struct pf_state_tree_id tree_id; -struct pf_state_queue state_list; +struct pf_state_list pf_state_list = PF_STATE_LIST_INITIALIZER(pf_state_list); RB_GENERATE(pf_src_tree, pf_src_node, entry, pf_src_compare); RB_GENERATE(pf_state_tree, pf_state_key, entry, pf_state_compare_key); @@ -440,6 +440,37 @@ pf_check_threshold(struct pf_threshold * return (threshold->count > threshold->limit); } +void +pf_state_list_insert(struct pf_state_list *pfs, struct pf_state *st) +{ + /* +* we can always put states on the end of the list. +* +* things reading the list should take a read lock, then +* the mutex, get the head and tail pointers, release the +* mutex, and then they can iterate between the head and tail. +*/ + + pf_state_ref(st); /* get a ref for the list */ + + mtx_enter(>pfs_mtx); + TAILQ_INSERT_TAIL(>pfs_list, st, entry_list); + mtx_leave(>pfs_mtx); +} + +void +pf_state_list_remove(struct pf_state_list *pfs, struct pf_state *st) +{ + /* states can only be removed when the write lock is held */ + rw_assert_wrlock(>pfs_rwl); + + mtx_enter(>pfs_mtx); + TAILQ_REMOVE(>pfs_list, st, entry_list); + mtx_leave(>pfs_mtx); + + pf_state_unref(st); /* list no longer references the state */ +} + int pf_src_connlimit(struct pf_state **state) { @@ -986,7 +1017,7 @@ pf_state_insert(struct pfi_kif *kif, str PF_STATE_EXIT_WRITE(); return (-1); } - TAILQ_INSERT_TAIL(_list, s, entry_list); +
Re: [External] : Re: parallel forwarding vs. bridges
Hello,

just for the record...

> > in current tree the ether_input() is protected by NET_LOCK(), which is
> > grabbed by the caller as a writer. bluhm's diff changes NET_LOCK() to a
> > readlock, so ether_input() can run concurrently. Switching NET_LOCK() to
> > an r-lock has implications on the smr read section above. The thing is
> > the call to eb->eb_input() can sleep now. This is something that needs
> > to be avoided within an smr section.
>
> Is the new sleeping point introduced by the fact the PF_LOCK() is a
> rwlock? Did you consider using a mutex, at least for the time being,
> in order to not run into such issues?

below is a diff, which trades both pf(4) rw-locks for mutexes.
the diff compiles, it still needs testing/experimenting.

regards
sashan

8<---8<---8<--8<

diff --git a/sys/net/pf_ioctl.c b/sys/net/pf_ioctl.c
index ae7bb008351..30c78da7d16 100644
--- a/sys/net/pf_ioctl.c
+++ b/sys/net/pf_ioctl.c
@@ -146,8 +146,8 @@ TAILQ_HEAD(pf_tags, pf_tagname)	pf_tags = TAILQ_HEAD_INITIALIZER(pf_tags),
  * grab the lock as writer. Whenever packet creates state it grabs pf_lock
  * first then it locks pf_state_lock as the writer.
  */
-struct rwlock		 pf_lock = RWLOCK_INITIALIZER("pf_lock");
-struct rwlock		 pf_state_lock = RWLOCK_INITIALIZER("pf_state_lock");
+struct mutex		 pf_lock = MUTEX_INITIALIZER(IPL_NET);
+struct mutex		 pf_state_lock = MUTEX_INITIALIZER(IPL_NET);
 
 #if (PF_QNAME_SIZE != PF_TAG_NAME_SIZE)
 #error PF_QNAME_SIZE must be equal to PF_TAG_NAME_SIZE
diff --git a/sys/net/pfvar_priv.h b/sys/net/pfvar_priv.h
index b1a122af409..c5ae392b468 100644
--- a/sys/net/pfvar_priv.h
+++ b/sys/net/pfvar_priv.h
@@ -39,7 +39,7 @@
 
 #include <sys/rwlock.h>
 
-extern struct rwlock	pf_lock;
+extern struct mutex	pf_lock;
 
 struct pf_pdesc {
 	struct {
@@ -104,52 +104,41 @@ extern struct timeout	pf_purge_to;
 struct pf_state		*pf_state_ref(struct pf_state *);
 void			 pf_state_unref(struct pf_state *);
 
-extern struct rwlock	pf_lock;
-extern struct rwlock	pf_state_lock;
+extern struct mutex	pf_lock;
+extern struct mutex	pf_state_lock;
 
 #define PF_LOCK()	do {			\
 		NET_ASSERT_LOCKED();		\
-		rw_enter_write(&pf_lock);	\
+		mtx_enter(&pf_lock);		\
 	} while (0)
 
 #define PF_UNLOCK()	do {			\
 		PF_ASSERT_LOCKED();		\
-		rw_exit_write(&pf_lock);	\
+		mtx_leave(&pf_lock);		\
 	} while (0)
 
-#define PF_ASSERT_LOCKED()	do {				\
-		if (rw_status(&pf_lock) != RW_WRITE)		\
-			splassert_fail(RW_WRITE,		\
-			    rw_status(&pf_lock), __func__);	\
-	} while (0)
+#define PF_ASSERT_LOCKED()	(void)(0)
 
-#define PF_ASSERT_UNLOCKED()	do {				\
-		if (rw_status(&pf_lock) == RW_WRITE)		\
-			splassert_fail(0, rw_status(&pf_lock), __func__);\
-	} while (0)
+#define PF_ASSERT_UNLOCKED()	(void)(0)
 
 #define PF_STATE_ENTER_READ()	do {		\
-		rw_enter_read(&pf_state_lock);	\
+		mtx_enter(&pf_state_lock);	\
 	} while (0)
 
 #define PF_STATE_EXIT_READ()	do {		\
-		rw_exit_read(&pf_state_lock);	\
+		mtx_leave(&pf_state_lock);	\
 	} while (0)
 
 #define PF_STATE_ENTER_WRITE()	do {		\
-		rw_enter_write(&pf_state_lock);	\
+		mtx_enter(&pf_state_lock);	\
 	} while (0)
 
 #define PF_STATE_EXIT_WRITE()	do {		\
 		PF_ASSERT_STATE_LOCKED();	\
-		rw_exit_write(&pf_state_lock);	\
+		mtx_leave(&pf_state_lock);	\
 	} while (0)
 
-#define PF_ASSERT_STATE_LOCKED()	do {				\
-		if (rw_status(&pf_state_lock) != RW_WRITE)		\
-			splassert_fail(RW_WRITE,			\
-			    rw_status(&pf_state_lock), __func__);	\
-	} while (0)
+#define PF_ASSERT_STATE_LOCKED()	(void)(0)
 
 extern void			 pf_purge_timeout(void *);
 extern void			 pf_purge(void *);
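one note on the diff above: defining PF_ASSERT_LOCKED() and friends as
(void)(0) silently drops the diagnostics. a possible alternative, sketched
here and not part of the diff, is to keep the assertions by building them on
the mutex assertion macros from <sys/mutex.h>:

	/* a sketch: preserve the lock assertions across the conversion. */
	#define PF_ASSERT_LOCKED()		MUTEX_ASSERT_LOCKED(&pf_lock)
	#define PF_ASSERT_UNLOCKED()		MUTEX_ASSERT_UNLOCKED(&pf_lock)
	#define PF_ASSERT_STATE_LOCKED()	MUTEX_ASSERT_LOCKED(&pf_state_lock)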
Re: [External] : Re: parallel forwarding vs. bridges
Hello,

On Mon, May 17, 2021 at 08:19:37PM +0200, Alexander Bluhm wrote:
> On Mon, May 17, 2021 at 04:24:15PM +0200, Alexandr Nedvedicky wrote:
> > in current tree the ether_input() is protected by NET_LOCK(), which is
> > grabbed by the caller as a writer. bluhm's diff changes NET_LOCK() to a
> > readlock, so ether_input() can run concurrently. Switching NET_LOCK() to
> > an r-lock has implications on the smr read section above. The thing is
> > the call to eb->eb_input() can sleep now. This is something that needs
> > to be avoided within an smr section.
>
> Do you think the sleeping occurs in PF_LOCK()?  Could it be that
> this rwlock did never sleep before, but was acquired instantly?

yes, the sleep occurs on PF_LOCK(), right here in pf_test():

7052                    a = s->anchor.ptr;
7053    #if NPFLOG > 0
7054                    pd.pflog |= s->log;
7055    #endif  /* NPFLOG > 0 */
7056            } else if (s == NULL) {
7057                    PF_LOCK();
7058                    have_pf_lock = 1;
7059                    action = pf_test_rule(&pd, &r, &s, &a, &ruleset,
7060                        &reason);
7061                    s = pf_state_ref(s);
(gdb)

with an exclusive NET_LOCK() the PF_LOCK() never blocks/sleeps, because the
exclusive NET_LOCK() ensures there is only a single packet running in
pf_test(). changing the exclusive NET_LOCK() to a reader lock allows more
packets to run in pf_test(), therefore PF_LOCK() can block.

> Having sleeps in the packet forwarding path is a bad idea.  And
> refcounting per packet looks like a workaround.  It does not make
> things faster.

I'm afraid sleeps (exclusive locks) cannot be avoided with the current data
structures we have in PF. for example, if a packet creates state, it must
grab the state lock exclusively. we have to make sure no sleep occurs in
code which is protected by a lock. such a risk comes from a blocking malloc
called by a caller which owns a lock (either as reader or writer).

> How about implementing PF_LOCK() with a mutex?

As I've answered to mpi@ already, there are two locks we need to change to
mutexes: PF_LOCK() and the state lock.

> I made such a diff a few months ago.  It is not well tested yet.
> There were some problems when allocating while holding the pf mutex in
> the ioctl path.  mpi@ did not like it, as it removes the pf lock in pf
> ioctl and relies on net lock for reconfigure.

I've read through your diff. If I understand it right, two things are being
changed:

	PF_LOCK() is thrown away completely and we use NET_LOCK() instead

	state_lock is changed to a mutex

IMO removing PF_LOCK() is a step backward. I think every subsystem
(including PF) should provide its own lock and should not piggyback on
NET_LOCK().

> It is a design question.  Do we want to keep net lock for reconfigure?

No, we don't want NET_LOCK() for reconfigure. configuration changes should
be synchronized using PF_LOCK().

> If we process packets with shared netlock and reconfigure pf with
> exclusive netlock this looks reasonable.
>
> Do we want to go this way?  Then I will polish and test this diff.

no, I think we don't want to go that way. when it comes to pf(4),
configuration is more colorful. the ioctl, which updates the firewall
configuration, grabs the lock exclusively (be it the current pf_lock or
net_lock). For certain operations (queue management and tables) we also
allocate memory _without_ NOSLEEP. We may end up in a situation where we
are sleeping in malloc while holding a lock which we share with packets.

Therefore I think we actually need an extra rw-lock just for ioctl. this new
lock will serialize all callers coming from userland. we don't mind sleeping
on malloc with the ioctl rw-lock held exclusively.
as soon as the memory for the desired operation has been acquired, it will
be time to grab PF_LOCK() as a writer and update the firewall config. this
way we should avoid blocking packet processing due to memory allocations.

I think Vitaliy implemented something similar for another subsystem, I can't
remember which one (?pipex?).

thanks and
regards
sashan
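a minimal sketch of the ordering sashan describes above. the names
pf_ioctl_rwl, pf_example_item, and pf_commit_example() are hypothetical;
only the sequence matters: sleepable allocation under an ioctl-only rwlock,
then a short PF_LOCK() critical section to publish the change.

	struct rwlock pf_ioctl_rwl = RWLOCK_INITIALIZER("pfioctl");

	int
	pf_commit_example(void)
	{
		struct pf_example_item *it;

		rw_enter_write(&pf_ioctl_rwl);	/* serialise userland callers */

		/* sleeping allocation is fine here; the packet path
		 * never takes pf_ioctl_rwl */
		it = malloc(sizeof(*it), M_TEMP, M_WAITOK | M_ZERO);

		PF_LOCK();			/* short, non-sleeping section */
		/* ... link 'it' into the firewall config here ... */
		PF_UNLOCK();

		rw_exit_write(&pf_ioctl_rwl);
		return (0);
	}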
Re: [External] : Re: parallel forwarding vs. bridges
On Mon, May 17, 2021 at 08:27:12PM +0200, Martin Pieuchot wrote:
> On 17/05/21(Mon) 19:52, Alexandr Nedvedicky wrote:
> > [...]
> >     I don't mind to trade pf_lock and pf_state_lock for mutexes, however
> >     I see such a step as kind of a 'NO-OP'. We do have a sufficient
> >     measure currently, which is: keep NET_LOCK() as is. Maybe I'm not
> >     seeing your idea/plan behind changing pf's rw-locks to mutexes. If
> >     you feel there is a benefit to go that way, then let's do it, but
> >     I'd like to understand where we will be going/what to expect.
>
> I've no idea or plan.  I'm just pointing out that using rwlocks, for the
> moment, adds extra work.  If it's easy to use mutexes then you might want
> to start with that.  The whole network processing path assumes it runs
> without sleeping.  I've no idea what can happen when this assumption
> will be broken.

I see. I keep forgetting about locally bound traffic, which is far more
complex than the forwarding scenario.

> I'm well aware that using a single big pf lock is not the best for
> performance, but maybe it's easier to do baby steps.
>
> That said, I don't want to stop or discourage anyone.  If you're
> confident enough that rwlocks are the way to go, then please, go
> ahead.

I think we can keep mutexes as a kind of fallback plan if we decide to
commit bluhm's diff.

thanks and
regards
sashan
Re: [External] : Re: parallel forwarding vs. bridges
On 17/05/21(Mon) 19:52, Alexandr Nedvedicky wrote:
> [...]
>     I don't mind to trade pf_lock and pf_state_lock for mutexes, however
>     I see such a step as kind of a 'NO-OP'. We do have a sufficient
>     measure currently, which is: keep NET_LOCK() as is. Maybe I'm not
>     seeing your idea/plan behind changing pf's rw-locks to mutexes. If
>     you feel there is a benefit to go that way, then let's do it, but
>     I'd like to understand where we will be going/what to expect.

I've no idea or plan.  I'm just pointing out that using rwlocks, for the
moment, adds extra work.  If it's easy to use mutexes then you might want
to start with that.  The whole network processing path assumes it runs
without sleeping.  I've no idea what can happen when this assumption
will be broken.

I'm well aware that using a single big pf lock is not the best for
performance, but maybe it's easier to do baby steps.

That said, I don't want to stop or discourage anyone.  If you're
confident enough that rwlocks are the way to go, then please, go
ahead.

Cheers,
Martin
Re: parallel forwarding vs. bridges
On Mon, May 17, 2021 at 04:24:15PM +0200, Alexandr Nedvedicky wrote:
> in current tree the ether_input() is protected by NET_LOCK(), which is
> grabbed by the caller as a writer. bluhm's diff changes NET_LOCK() to a
> readlock, so ether_input() can run concurrently. Switching NET_LOCK() to
> an r-lock has implications on the smr read section above. The thing is
> the call to eb->eb_input() can sleep now. This is something that needs
> to be avoided within an smr section.

Do you think the sleeping occurs in PF_LOCK()?  Could it be that
this rwlock did never sleep before, but was acquired instantly?

Having sleeps in the packet forwarding path is a bad idea.  And
refcounting per packet looks like a workaround.  It does not make
things faster.

How about implementing PF_LOCK() with a mutex?

I made such a diff a few months ago.  It is not well tested yet.
There were some problems when allocating while holding the pf mutex in
the ioctl path.  mpi@ did not like it, as it removes the pf lock in pf
ioctl and relies on net lock for reconfigure.

It is a design question.  Do we want to keep net lock for reconfigure?
If we process packets with shared netlock and reconfigure pf with
exclusive netlock this looks reasonable.

Do we want to go this way?  Then I will polish and test this diff.

bluhm

Index: net/pf.c
===================================================================
RCS file: /data/mirror/openbsd/cvs/src/sys/net/pf.c,v
retrieving revision 1.1116
diff -u -p -r1.1116 pf.c
--- net/pf.c	27 Apr 2021 09:38:29 -0000	1.1116
+++ net/pf.c	17 May 2021 17:38:20 -0000
@@ -6819,6 +6819,8 @@ pf_test(sa_family_t af, int fwdir, struc
 	if (!pf_status.running)
 		return (PF_PASS);
 
+	NET_ASSERT_LOCKED();
+
 #if NCARP > 0
 	if (ifp->if_type == IFT_CARP &&
 	    (ifp0 = if_get(ifp->if_carpdevidx)) != NULL) {
Index: net/pf_ioctl.c
===================================================================
RCS file: /data/mirror/openbsd/cvs/src/sys/net/pf_ioctl.c,v
retrieving revision 1.363
diff -u -p -r1.363 pf_ioctl.c
--- net/pf_ioctl.c	9 Feb 2021 23:37:54 -0000	1.363
+++ net/pf_ioctl.c	17 May 2021 17:38:20 -0000
@@ -146,8 +146,8 @@ TAILQ_HEAD(pf_tags, pf_tagname)	pf_tags = TAILQ_HEAD_INITIALIZER(pf_tags),
  * grab the lock as writer. Whenever packet creates state it grabs pf_lock
  * first then it locks pf_state_lock as the writer.
  */
-struct rwlock		 pf_lock = RWLOCK_INITIALIZER("pf_lock");
-struct rwlock		 pf_state_lock = RWLOCK_INITIALIZER("pf_state_lock");
+struct mutex		 pf_lock = MUTEX_INITIALIZER(IPL_SOFTNET);
+struct mutex		 pf_state_lock = MUTEX_INITIALIZER(IPL_SOFTNET);
 
 #if (PF_QNAME_SIZE != PF_TAG_NAME_SIZE)
 #error PF_QNAME_SIZE must be equal to PF_TAG_NAME_SIZE
@@ -1009,7 +1009,6 @@ pfioctl(dev_t dev, u_long cmd, caddr_t a
 
 	case DIOCSTART:
 		NET_LOCK();
-		PF_LOCK();
 		if (pf_status.running)
 			error = EEXIST;
 		else {
@@ -1023,13 +1022,11 @@ pfioctl(dev_t dev, u_long cmd, caddr_t a
 			pf_create_queues();
 			DPFPRINTF(LOG_NOTICE, "pf: started");
 		}
-		PF_UNLOCK();
 		NET_UNLOCK();
 		break;
 
 	case DIOCSTOP:
 		NET_LOCK();
-		PF_LOCK();
 		if (!pf_status.running)
 			error = ENOENT;
 		else {
@@ -1038,7 +1035,6 @@ pfioctl(dev_t dev, u_long cmd, caddr_t a
 			pf_remove_queues();
 			DPFPRINTF(LOG_NOTICE, "pf: stopped");
 		}
-		PF_UNLOCK();
 		NET_UNLOCK();
 		break;
 
@@ -1048,7 +1044,6 @@ pfioctl(dev_t dev, u_long cmd, caddr_t a
 		u_int32_t	 nr = 0;
 
 		NET_LOCK();
-		PF_LOCK();
 		pq->ticket = pf_main_ruleset.rules.active.ticket;
 
 		/* save state to not run over them all each time? */
@@ -1058,7 +1053,6 @@ pfioctl(dev_t dev, u_long cmd, caddr_t a
 			nr++;
 		}
 		pq->nr = nr;
-		PF_UNLOCK();
 		NET_UNLOCK();
 		break;
 	}
@@ -1069,10 +1063,8 @@ pfioctl(dev_t dev, u_long cmd, caddr_t a
 		u_int32_t	 nr = 0;
 
 		NET_LOCK();
-		PF_LOCK();
 		if (pq->ticket != pf_main_ruleset.rules.active.ticket) {
 			error = EBUSY;
-			PF_UNLOCK();
 			NET_UNLOCK();
 			break;
 		}
@@ -1083,12 +1075,10 @@ pfioctl(dev_t dev, u_long cmd, caddr_t a
 			qs = TAILQ_NEXT(qs, entries);
 		if (qs == NULL) {
 			error = EBUSY;
-			PF_UNLOCK();
 			NET_UNLOCK();
 			break;
 		}
 		memcpy(&pq->queue, qs,
Re: [External] : Re: parallel forwarding vs. bridges
Hello,

On Mon, May 17, 2021 at 06:59:36PM +0200, Martin Pieuchot wrote:
> On 17/05/21(Mon) 16:24, Alexandr Nedvedicky wrote:
> > Hrvoje,
> >
> > managed to trigger diagnostic panics with diff [1] sent by bluhm@
> > some time ago. The panic Hrvoje sees comes from ether_input() here:
> >
> >     414
> >     415         /*
> >     416          * Third phase: bridge processing.
> >     417          *
> >     418          * Give the packet to a bridge interface, ie, bridge(4),
> >     419          * switch(4), or tpmr(4), if it is configured. A bridge
> >     420          * may take the packet and forward it to another port, or it
> >     421          * may return it here to ether_input() to support local
> >     422          * delivery to this port.
> >     423          */
> >     424
> >     425         ac = (struct arpcom *)ifp;
> >     426
> >     427         smr_read_enter();
> >     428         eb = SMR_PTR_GET(&ac->ac_brport);
> >     429         if (eb != NULL) {
> >     430                 m = (*eb->eb_input)(ifp, m, dst, eb->eb_port);
> >     431                 if (m == NULL) {
> >     432                         smr_read_leave();
> >     433                         return;
> >     434                 }
> >     435         }
> >     436         smr_read_leave();
> >     437
> >
> > in current tree the ether_input() is protected by NET_LOCK(), which is
> > grabbed by the caller as a writer. bluhm's diff changes NET_LOCK() to a
> > readlock, so ether_input() can run concurrently. Switching NET_LOCK()
> > to an r-lock has implications on the smr read section above. The thing
> > is the call to eb->eb_input() can sleep now. This is something that
> > needs to be avoided within an smr section.
>
> Is the new sleeping point introduced by the fact the PF_LOCK() is a
> rwlock? Did you consider using a mutex, at least for the time being,
> in order to not run into such issues?

there are two locks in pf_test():

	PF_LOCK()
	PF_STATE_LOCK()

we would have to convert both locks into mutexes. trading PF_STATE_LOCK()
for a mutex is a step backward IMO.

bluhm's parallel diff enables PF to run in parallel for packets. if a packet
matches an existing state, it just grabs the state lock as a reader. This
brings a huge benefit, because the majority of packets just do the state
lookup as readers and are done with PF. Only some packets need to go the
expensive way through rules and state creation, which involves PF_LOCK() and
a write lock on the state rwlock.

trading both locks for mutexes will do the trick too. However I'm not sure I
see a benefit there (when comparing to the current NET_LOCK() we have in
tree). I think all the safety belts we have kicked in. We learned an smr
read section is not an option for ether_input() when it comes to bridges.
perhaps an alternative solution would be to introduce an rw-lock, which
would protect 'ac->ac_brport'. Such an r-lock would cover the call to
->eb_input(). However I'm not sure about it.

I don't mind to trade pf_lock and pf_state_lock for mutexes, however I see
such a step as kind of a 'NO-OP'. We do have a sufficient measure currently,
which is: keep NET_LOCK() as is. Maybe I'm not seeing your idea/plan behind
changing pf's rw-locks to mutexes. If you feel there is a benefit to go that
way, then let's do it, but I'd like to understand where we will be
going/what to expect.

thanks and
regards
sashan
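a sketch of the fast path / slow path split sashan describes above, using
the PF_STATE_ENTER_READ()/PF_LOCK() macros from the tree. this is not the
real pf_test() code: pf_example_lookup() and pf_example_create() are
placeholders standing in for the state table lookup and pf_test_rule().

	void
	pf_example_test(struct pf_pdesc *pd)
	{
		struct pf_state *s;

		/* common case: match an existing state holding only
		 * the read side of the state lock */
		PF_STATE_ENTER_READ();
		s = pf_example_lookup(pd);
		PF_STATE_EXIT_READ();

		if (s == NULL) {
			/*
			 * expensive case: rule evaluation may create a
			 * state. pf_lock is taken first; the state lock
			 * is then taken as a writer inside, matching the
			 * documented lock ordering.
			 */
			PF_LOCK();
			s = pf_example_create(pd);
			PF_UNLOCK();
		}
		/* ... */
	}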
Re: parallel forwarding vs. bridges
On 17.5.2021. 16:24, Alexandr Nedvedicky wrote:
> Hrvoje,
>
> managed to trigger diagnostic panics with diff [1] sent by bluhm@
> some time ago. The panic Hrvoje sees comes from ether_input() here:
>
> [1] https://marc.info/?l=openbsd-tech&m=161903387904923&w=2
>
> [2] https://marc.info/?l=openbsd-tech&m=162099270106067&w=2

for those interested, I sent emails to a couple of developers with the
following panics

c/p

Hi guys,

i'm testing bluhm@'s parallel diff with pf and tpmr/veb/bridge. if i'm
sending traffic over tpmr or veb, the same second i enable pf, the box
panics.. see attachment ..

with bridge i can't panic the box, but it seems that performance is even
worse than without the parallel diff ... i will play with it more, maybe
it's just some glitch ...

pf and tpmr

r620-1# panic: kernel diagnostic assertion "curcpu()->ci_schedstate.spc_smrdepth == 0" failed: file "/sys/kern/subr_xxx.c", line 163
Stopped at      db_enter+0x10:  popq    %rbp
    TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
  11538  95266      0     0x14000      0x200    2  softnet
 239565   4172      0     0x14000      0x200    1  softnet
db_enter() at db_enter+0x10
panic(81e0198a) at panic+0x12a
__assert(81e6f4ee,81e93713,a3,81ea877c) at __assert+0x2b
assertwaitok() at assertwaitok+0xdc
mi_switch() at mi_switch+0x40
sleep_finish(800023877740,1) at sleep_finish+0x11c
rw_enter(8212e900,1) at rw_enter+0x229
pf_test(2,1,80082048,800023877978) at pf_test+0x1055
tpmr_pf(80082048,1,fd80cccb8c00) at tpmr_pf+0x7e
tpmr_input(80082048,fd80cccb8c00,90e2ba1d4f89,80680f00) at tpmr_input+0x124
ether_input(80082048,fd80cccb8c00) at ether_input+0xf5
if_input_process(80082048,800023877af8) at if_input_process+0x6f
ifiq_process(80086000) at ifiq_process+0x69
taskq_thread(80030300) at taskq_thread+0x9f
end trace frame: 0x0, count: 1

https://www.openbsd.org/ddb.html describes the minimum info required in bug
reports.  Insufficient info makes it difficult to find and fix bugs.
ddb{0}>
ddb{0}> show panic
kernel diagnostic assertion "curcpu()->ci_schedstate.spc_smrdepth == 0" failed: file "/sys/kern/subr_xxx.c", line 163
ddb{0}> show locks
shared rwlock netlock r = 0 (0x8212eab0)
#0  witness_lock+0x339
#1  if_input_process+0x43
#2  ifiq_process+0x69
#3  taskq_thread+0x9f
#4  proc_trampoline+0x1c
shared rwlock softnet r = 0 (0x80030370)
#0  witness_lock+0x339
#1  taskq_thread+0x92
#2  proc_trampoline+0x1c
exclusive sched_lock &sched_lock r = 0 (0x8230a100)
#0  witness_lock+0x339
#1  sleep_setup+0xa5
#2  rw_enter+0x208
#3  pf_test+0x1055
#4  tpmr_pf+0x7e
#5  tpmr_input+0x124
#6  ether_input+0xf5
#7  if_input_process+0x6f
#8  ifiq_process+0x69
#9  taskq_thread+0x9f
#10 proc_trampoline+0x1c
ddb{0}> show reg
rdi               0x820f1c80    kprintf_mutex
rsi                      0x5
rbp           0x8000238775a0
rbx           0x8000238775b0
rdx               0x8000
rcx                    0x206
rax                      0x1
r8            0x800023877560
r9                         0
r10                        0
r11   0x80ef1bc6ac87b49d
r12                     0x38
r13           0x800023877650
r14                    0x100
r15               0x81e0198a    cmd0646_9_tim_udma+0x24a93
rip               0x81106070    db_enter+0x10
cs                       0x8
rflags                 0x282
rsp           0x8000238775a0
ss                      0x10
db_enter+0x10:  popq    %rbp
ddb{0}> show mbuf
mbuf 0x81106070
m_type: -13108  m_flags: c3cc
m_next: 0x1d3b4c241c334c5d      m_nextpkt: 0x117400fdffac
m_data: 0x      m_len: 3435973836
m_dat: 0x81106090       m_pktdat: 0x811060e0
ddb{0}> ps
   PID     TID   PPID    UID  S       FLAGS  WAIT    COMMAND
 80810  268029      1      0  3    0x100083  ttyin   ksh
 90433  180245      1      0  3    0x100098  poll    cron
 92176    8747  42701     95  3    0x100092  kqread  smtpd
 83989   36449  42701    103  3    0x100092  kqread  smtpd
 82331  325465  42701     95  3    0x100092  kqread  smtpd
 28903  422168  42701     95  3    0x100092  kqread  smtpd
  5781   10814  42701     95  3    0x100092  kqread  smtpd
 70395   59130  42701     95  3    0x100092  kqread  smtpd
 42701  329509      1      0  3    0x100080  kqread  smtpd
 47357  342015      1      0  3        0x80  select  sshd
 91828  134960      1      0  3    0x100080  poll    ntpd
 42264  230567  72970     83  3    0x100092  poll    ntpd
 72970  150962      1     83  3    0x100092  poll    ntpd
 23464  496327  92870     73  3    0x100090  kqread  syslogd
 92870  181949      1      0  3    0x100082  netio   syslogd
 15150   95273
Re: parallel forwarding vs. bridges
On 17/05/21(Mon) 16:24, Alexandr Nedvedicky wrote:
> Hrvoje,
>
> managed to trigger diagnostic panics with diff [1] sent by bluhm@
> some time ago. The panic Hrvoje sees comes from ether_input() here:
>
>     414
>     415         /*
>     416          * Third phase: bridge processing.
>     417          *
>     418          * Give the packet to a bridge interface, ie, bridge(4),
>     419          * switch(4), or tpmr(4), if it is configured. A bridge
>     420          * may take the packet and forward it to another port, or it
>     421          * may return it here to ether_input() to support local
>     422          * delivery to this port.
>     423          */
>     424
>     425         ac = (struct arpcom *)ifp;
>     426
>     427         smr_read_enter();
>     428         eb = SMR_PTR_GET(&ac->ac_brport);
>     429         if (eb != NULL) {
>     430                 m = (*eb->eb_input)(ifp, m, dst, eb->eb_port);
>     431                 if (m == NULL) {
>     432                         smr_read_leave();
>     433                         return;
>     434                 }
>     435         }
>     436         smr_read_leave();
>     437
>
> in current tree the ether_input() is protected by NET_LOCK(), which is
> grabbed by the caller as a writer. bluhm's diff changes NET_LOCK() to a
> readlock, so ether_input() can run concurrently. Switching NET_LOCK() to
> an r-lock has implications on the smr read section above. The thing is
> the call to eb->eb_input() can sleep now. This is something that needs
> to be avoided within an smr section.

Is the new sleeping point introduced by the fact the PF_LOCK() is a
rwlock?  Did you consider using a mutex, at least for the time being,
in order to not run into such issues?
parallel forwarding vs. bridges
Hrvoje,

managed to trigger diagnostic panics with diff [1] sent by bluhm@
some time ago. The panic Hrvoje sees comes from ether_input() here:

    414
    415         /*
    416          * Third phase: bridge processing.
    417          *
    418          * Give the packet to a bridge interface, ie, bridge(4),
    419          * switch(4), or tpmr(4), if it is configured. A bridge
    420          * may take the packet and forward it to another port, or it
    421          * may return it here to ether_input() to support local
    422          * delivery to this port.
    423          */
    424
    425         ac = (struct arpcom *)ifp;
    426
    427         smr_read_enter();
    428         eb = SMR_PTR_GET(&ac->ac_brport);
    429         if (eb != NULL) {
    430                 m = (*eb->eb_input)(ifp, m, dst, eb->eb_port);
    431                 if (m == NULL) {
    432                         smr_read_leave();
    433                         return;
    434                 }
    435         }
    436         smr_read_leave();
    437

in the current tree ether_input() is protected by NET_LOCK(), which is
grabbed by the caller as a writer. bluhm's diff changes NET_LOCK() to a
readlock, so ether_input() can run concurrently. Switching NET_LOCK() to an
r-lock has implications on the smr read section above. The thing is the call
to eb->eb_input() can sleep now. This is something that needs to be avoided
within an smr section.

the diff below introduces reference counting for the bridge port using
atomic ops. the section above is changed to:

    415         /*
    416          * Third phase: bridge processing.
    417          *
    418          * Give the packet to a bridge interface, ie, bridge(4),
    419          * switch(4), or tpmr(4), if it is configured. A bridge
    420          * may take the packet and forward it to another port, or it
    421          * may return it here to ether_input() to support local
    422          * delivery to this port.
    423          */
    424
    425         ac = (struct arpcom *)ifp;
    426
    427         smr_read_enter();
    428         eb = SMR_PTR_GET(&ac->ac_brport);
    429         if (eb != NULL)
    430                 eb->eb_port_take(eb->eb_port);
    431         smr_read_leave();
    432         if (eb != NULL) {
    433                 m = (*eb->eb_input)(ifp, m, dst, eb->eb_port);
    434                 eb->eb_port_rele(eb->eb_port);
    435                 if (m == NULL) {
    436                         return;
    437                 }
    438         }
    439

each bridge module must provide its own implementation of
eb_port_take()/eb_port_rele(). Not sure if it is the way to go. According to
testing done by Hrvoje, the diagnostic panic disappears. So we can say the
diff solves the problem, but there might be better ways to solve it.

there are other bugs to kill as pointed out here [2].
thanks and
regards
sashan

[1] https://marc.info/?l=openbsd-tech&m=161903387904923&w=2
[2] https://marc.info/?l=openbsd-tech&m=162099270106067&w=2

8<---8<---8<--8<

diff --git a/sys/net/if_bridge.c b/sys/net/if_bridge.c
index 7dbcf4a2ab0..2afcbb296a0 100644
--- a/sys/net/if_bridge.c
+++ b/sys/net/if_bridge.c
@@ -138,6 +138,8 @@ int	bridge_ipsec(struct ifnet *, struct ether_header *, int, struct llc *,
 #endif
 int	bridge_clone_create(struct if_clone *, int);
 int	bridge_clone_destroy(struct ifnet *);
+void	bridge_take(void *);
+void	bridge_rele(void *);
 
 #define	ETHERADDR_IS_IP_MCAST(a) \
 	/* struct etheraddr *a;	*/				\
@@ -152,6 +154,8 @@ struct if_clone bridge_cloner =
 
 const struct ether_brport bridge_brport = {
 	bridge_input,
+	bridge_take,
+	bridge_rele,
 	NULL,
 };
@@ -2049,3 +2053,15 @@ bridge_send_icmp_err(struct ifnet *ifp,
 dropit:
 	m_freem(n);
 }
+
+void
+bridge_take(void *unused)
+{
+	return;
+}
+
+void
+bridge_rele(void *unused)
+{
+	return;
+}
diff --git a/sys/net/if_ethersubr.c b/sys/net/if_ethersubr.c
index c9493b7f857..4b975e2d4a1 100644
--- a/sys/net/if_ethersubr.c
+++ b/sys/net/if_ethersubr.c
@@ -426,14 +426,16 @@ ether_input(struct ifnet *ifp, struct mbuf *m)
 
 	smr_read_enter();
 	eb = SMR_PTR_GET(&ac->ac_brport);
+	if (eb != NULL)
+		eb->eb_port_take(eb->eb_port);
+	smr_read_leave();
 	if (eb != NULL) {
 		m = (*eb->eb_input)(ifp, m, dst, eb->eb_port);
+		eb->eb_port_rele(eb->eb_port);
 		if (m == NULL) {
-			smr_read_leave();
 			return;
 		}
 	}
-	smr_read_leave();
 
 	/*
 	 * Fourth phase: drop service delimited packets.
diff --git a/sys/net/if_switch.c b/sys/net/if_switch.c
index 22aecdc6b75..ca24597d1f7 100644
--- a/sys/net/if_switch.c
+++ b/sys/net/if_switch.c
@@ -110,6 +110,8 @@ struct mbuf
 void	 switch_flow_classifier_dump(struct switch_softc *,
 	    struct switch_flow_classify *);
 void	 switchattach(int);
+void	 switch_take(void *);
+void	 switch_rele(void *);
 
 struct
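one observation on the take/rele hooks: bridge(4)'s implementations above
are no-ops, which works while its port cannot go away under the caller. for
a port that can be destroyed, the pair could be built on the kernel's
refcnt(9) API instead of rolling atomic ops by hand. a sketch under that
assumption; struct example_port and its member names are invented for
illustration, not taken from the diff:

	#include <sys/refcnt.h>

	/* hypothetical port structure; only the refcnt member matters */
	struct example_port {
		struct refcnt	p_refs;		/* refcnt_init()ed at attach */
		/* ... */
	};

	void
	example_port_take(void *port)
	{
		struct example_port *p = port;

		/* pin the port across the eb_input() call */
		refcnt_take(&p->p_refs);
	}

	void
	example_port_rele(void *port)
	{
		struct example_port *p = port;

		/* drop the ref; wakes a destroying thread that is
		 * waiting for the last ref in refcnt_finalize() */
		refcnt_rele_wake(&p->p_refs);
	}

the detach path would then clear ac_brport, wait for smr readers, and call
refcnt_finalize() before freeing the port.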